Traditional methods for identifying sensitive data in data streams are computationally expensive, do not reflect the effect of time on the value of data, and have low mining accuracy. To address these problems, we first adopt a sliding window mechanism to divide the data stream by time and delay the dataset according to the characteristics of the data stream in the sliding window, which saves time and space. Threshold sensitivity analysis is then used to find the optimal threshold. Finally, a
With the rapid development of network technology, Internet platforms such as search engines, social networks, and e-commerce sites generate large amounts of data while providing convenience to users. We are now entering the era of big data, in which data is growing explosively. People are paying more and more attention to the protection of personal information, and data has become among the most valuable assets today. This has driven research on mining and protecting sensitive information, that is, obtaining important user information by mining very large amounts of data. However, mining sensitive data can also lead to privacy leakage. Many researchers have therefore begun to focus on sensitive data mining and protection.
Baidu Encyclopedia defines sensitive information as information whose improper use, or whose release or modification without the consent of the parties concerned, would harm the national interest, the implementation of government plans, or the privacy rights of individuals. It includes personal privacy information, business management information, financial information, personnel information, and IT operation and maintenance information. Data streams have strong temporal characteristics, and sensitive information in them is also at risk of being tampered with or eavesdropped on. Expired stream data, however, tends to be less valuable. Identifying sensitive data based on text content is a typical application of data mining. The method proposed in [
This paper builds on the advantages of the algorithms above and proposes a sliding-window-based threshold self-learning algorithm, which aims to minimize mining time while still mining accurate information.
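The goal stated above, the shortest mining time subject to accurate results, can be illustrated with a minimal sketch. It is a plain search over candidate thresholds, used here as a stand-in and not the paper's self-learning function. The names `pick_threshold`, `mine`, `accuracy`, `min_accuracy`, and `clock` are all hypothetical and do not come from the paper.

```python
import time


def pick_threshold(candidates, mine, accuracy, min_accuracy,
                   clock=time.perf_counter):
    """Return (threshold, elapsed) for the fastest acceptable candidate.

    Assumptions (not from the paper): `mine(threshold)` runs the mining
    step, `accuracy(result)` scores its output in [0, 1], and a candidate
    is acceptable when its score is at least `min_accuracy`.
    Returns None when no candidate is acceptable.
    """
    best = None
    for threshold in candidates:
        start = clock()
        result = mine(threshold)
        elapsed = clock() - start
        # Discard thresholds that do not meet the accuracy requirement.
        if accuracy(result) < min_accuracy:
            continue
        # Among acceptable thresholds, keep the one with the least time.
        if best is None or elapsed < best[1]:
            best = (threshold, elapsed)
    return best
```

The `clock` parameter exists only so that the timing source can be replaced, for example with a deterministic counter when checking the selection logic.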
The protection of sensitive data is to prevent data from being leaked while ensuring the usefulness of the data [
The sensitive data recognition method based on text content mentioned in [
The algorithm mainly uses an enumeration tree as the data structure for storing data. First, the enumeration tree is initialized from the initial sliding window dataset and the absolute support degree. The algorithm then uses the arrival and departure times of data to mine sensitive data and prune the enumeration tree. Finally, it sets upper and lower bounds on data changes to improve mining efficiency.
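The arrival and departure steps described above can be sketched in simplified form. The sketch below assumes single-item counting with a flat counter in place of the enumeration tree, and a window measured in number of transactions; the class `SlidingWindowCounter` and its parameters are hypothetical names introduced for illustration only.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Track item counts over the most recent `window_size` transactions.

    Simplification (an assumption, not the paper's structure): a flat
    Counter over single items stands in for the enumeration tree.
    """

    def __init__(self, window_size, min_support):
        self.window_size = window_size
        self.min_support = min_support  # absolute support threshold
        self.window = deque()
        self.counts = Counter()

    def add(self, transaction):
        # Arrival: count each distinct item of the new transaction once.
        items = set(transaction)
        self.window.append(items)
        self.counts.update(items)
        # Departure: remove the oldest transaction once the window is full.
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            for item in expired:
                self.counts[item] -= 1
                if self.counts[item] == 0:
                    del self.counts[item]  # prune items no longer present

    def frequent_items(self):
        # Items whose count in the current window meets the threshold.
        return {item for item, count in self.counts.items()
                if count >= self.min_support}
```

Only the counts touched by the arriving and expiring transactions change on each step, which is the property that makes window-based updates cheaper than recounting the whole window.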
For the sliding window
This paper improves the algorithm
With the continuous advancement of mining technology, it is becoming easier to obtain sensitive data, and personal privacy is seriously threatened. How to effectively protect the sensitive data that has been mined has therefore become another important research area. Current methods for protecting sensitive data include the privacy protection method based on
The dataset studied in this paper mainly comes from online commentary. The dataset
Let the phrase after lexical analysis be
Data preprocessing flow chart.
The threshold cannot be dynamically changed in the algorithm
Use
This paper optimizes the
Assuming that table
For a large dataset
The
The experiment was run on a PC with a 1.90 GHz Core i3 processor, 44 GB of memory, and the Windows 8.1 operating system. The lexical analysis was implemented in Python, and the
The data mining experiments use two sets of tourist reviews from the Tongcheng tourism website as experimental datasets. Table
Dataset characteristics.
Dataset | Minimum data item length | Maximum data item length | Total data item length |
---|---|---|---|
The underwater world | 1 | 6 | 14900 |
Nanjing Presidential Office | 1 | 4 | 4857 |
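The three characteristics reported in the table could be computed along the following lines. This is a sketch under the assumption that each data item is a list of tokens produced by lexical analysis and that its length is the number of tokens; the function name `dataset_characteristics` and the dictionary keys are hypothetical.

```python
def dataset_characteristics(items):
    """Return the minimum, maximum, and total length of the data items.

    Assumption (not stated in the paper): each data item is a sequence
    of tokens and its length is the number of tokens it contains.
    """
    if not items:
        raise ValueError("items must not be empty")
    lengths = [len(item) for item in items]
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "total_length": sum(lengths),
    }
```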
The experiment records the total data length, the longest data item length, the shortest data item length, and the running time. By varying the threshold repeatedly, the relationship between sensitive data mining time and threshold is determined, as shown in Figure
Threshold determination time comparison.
Sensitive data recognition rate on Dataset1
Sensitive data recognition rate on Dataset2
Algorithm comparison on Dataset1
Algorithm comparison on Dataset2
Figures
Figures
The experiment in Figure
Mining time of experiment on two datasets.
Mining time on Dataset1
Mining time on Dataset2
The dataset selected for the sensitive data protection experiment has a size of 28,900. It includes the name of the visitor, the content of the comment, and the time of the comment. The change of the value, the number of anonymous groups obtained, and the time consumed are used to demonstrate the feasibility of the algorithm.
The experimental comparison shows that the
Anonymous group number comparison
Time cost comparison
This paper first uses the NLP lexical analysis package THULAC to preprocess the dataset. Then, based on the temporal characteristics of the data stream, a sliding-window-based sensitive data mining algorithm is proposed, which treats time as the most important attribute and uses an enumeration tree to store the calculation results. By defining upper and lower bounds for the data item type, the enumeration tree and the data collection information are updated only when the relative support degree reaches the upper or lower bound of a type change, which saves calculation time. Finally, the threshold self-learning function is used to find the threshold that minimizes the time spent while ensuring the accuracy of the mined data. This method can determine the optimal threshold for a given dataset, thereby improving experimental efficiency. In the protection of sensitive data, using
The data used to support the findings of this study are available from the corresponding author upon request or from the following link:
The authors declare that they have no conflicts of interest.
This research was supported by the Natural Science Foundation of China (No. 61772034, No. 61871412) and the Natural Science Foundation of Anhui Province (No. 1808085MF172).