Differentially Private Autocorrelation Time-Series Data Publishing Based on Sliding Window



Introduction
Time-series data are a set of sequential, large, and continuous data sequences. In general, time-series data can be regarded as a dynamic dataset that grows infinitely over time. Using the correlation between data values to analyze and mine time-series data can bring considerable benefits to government, enterprises, and social public services. For example, during the COVID-19 outbreak, monitoring and analyzing patients' physical condition can help treat the disease effectively and control the spread of the epidemic. Navigation software needs to count the total amount of traffic on each road in a specific time range to calculate the best route to the destination. The above examples illustrate the importance of publishing time-series data for knowledge discovery and acquisition. However, if the curator publishes the data directly without adopting appropriate privacy protection technology, personal sensitive information will be leaked and citizens' privacy will be violated.
Traditional data publishing mainly uses anonymization techniques, such as the k-anonymity model [1] and its derivative models [2,3], for privacy protection. However, these methods depend strongly on assumptions about the attacker's background knowledge and cannot provide an effective and rigorous way to prove their level of privacy protection. Some studies [4,5] combine blockchain and artificial intelligence (AI) to protect data privacy, but this approach is inefficient, and any vulnerability exposes it to significant attacks. Differential privacy [6] is a strict and provable privacy protection technology that can protect users' sensitive information from leakage [7]. By adding random noise, it limits the impact of any single record on the released statistical results and blurs the presence of that record in the dataset, so that users' privacy is fundamentally protected.
This model is widely used for the release of various data in many application scenarios [8]. The privacy leakage problem in time-series data publication can also be addressed with the differential privacy model. Dwork et al. [9] achieved event-level differential privacy in the scenario of continuous statistics data publication. To reduce the noise added to the original time-series data, Chan et al. [10] proposed a binary tree-based divide-and-conquer method to decompose and store time-series data.
Motivations and Contributions. Traditional differential privacy is widely used for data publication, but Kifer et al. [11] pointed out that it still risks leaking personal privacy when correlated time-series data are published. Current differentially private publication methods for correlated time-series data mainly either establish correlation models, such as covariance matrices [12] and Markov models [13,14], or apply data transformations, e.g., the Fourier transform [15] and the discrete wavelet transform (DWT). The abovementioned methods of differential privacy focus on the publication of independent and identically distributed (IID) data, which leads to the following problems. Insufficient privacy protection: when independent identically distributed noise is added to correlated data, the attacker can filter out the noise through filtering attacks and other methods, so that the user's privacy is disclosed. Low data utility: since IID noise added to correlated data lowers the privacy protection level, more noise needs to be added to maintain the same level of differential privacy protection, resulting in a sharp decrease in the utility of the published data.
These issues indicate that the current methods of differential privacy are not suitable for processing correlated time-series data. Although Wang et al. [16] proposed the CTS-DP method to resist filtering attacks by adding noise consistent with the correlation of the original data, it ignores the periodicity of time-series data and fails to provide adequate privacy protection. It also does not apply to the publication of dynamic data. Compared to the existing work, our main contributions are summarized as follows.
First, because time-series data exhibit periodic changes and have strong autocorrelation, even if a single record in the dataset is deleted, an attacker can infer information about the missing record from other correlated records. We propose periodic sensitivity to replace the global sensitivity in traditional differential privacy to avoid this situation and provide a stronger degree of privacy protection under the same privacy budget. Second, based on the periodic sensitivity, we propose a sliding window mechanism to process infinitely growing and correlated time-series data. Third, we theoretically prove that our proposed correlated time-series data publication algorithm based on sliding window (SW-ATS) satisfies differential privacy. Compared with the state-of-the-art method, the experimental results show that SW-ATS achieves lower error and provides stronger privacy protection.

Related Work
In the early research on differential privacy data publication, most studies assume that the data are independent. At present, the research on differential privacy for correlated data is still relatively limited. The main research obstacle is that correlated records can provide additional information to attackers, which traditional mechanisms can hardly model. In this case, meeting the definition of differential privacy is a complex task. Kifer et al. [11] were the first to point out that differential privacy provides weaker privacy guarantees on correlated datasets if the correlation between data is not considered. For example, suppose that a record r has an impact on a group of records. Even if the record r is deleted from the dataset, the relevant information of r can be derived from this group of records. In this case, traditional differential privacy cannot provide enough privacy protection. Chen et al. [17] treated social networks as correlated datasets and solved the problem of insufficient privacy protection by multiplying the global sensitivity by the number of correlated records. However, this method introduces too much noise, making the utility of the datasets decline sharply.
In the research on correlated time-series data, Cao et al. [18] used intra-coupling and inter-coupling behavior functions to model related information and used these functions in the association framework to express the degree of association between behaviors. They proposed a hidden Markov detection model to detect abnormal transaction behavior based on grouping. They defined a time interval and assumed that behaviors falling within the same interval are related behaviors. Song et al. [19] proposed a hybrid coupling framework, which uses some special attributes to identify the relationship between records. Zhang et al. [20] proposed a correlated network traffic classification algorithm, using IP addresses to identify correlated network traffic records. Zhou et al. [21] mapped correlated records to an undirected graph and proposed a multi-instance learning algorithm.
Wang et al. [16] proposed the concept of sequence indistinguishability and proved that if the correlations of the original time series and of the time series after adding noise are consistent, then the added noise meets differential privacy. Their differentially private time-series data publication algorithm, CTS-DP, adds correlated noise so that the noise has the same correlation as the original data. Zhu et al. [12] defined correlated sensitivity. They considered the correlation between records and proposed an effective correlated differential privacy solution, CIM (correlated iteration mechanism). CIM uses the covariance matrix to describe the correlation between sequences and uses the covariance matrix as the weight to calculate the sensitivity function. Experimental results show that this solution is superior to traditional differential privacy in terms of the mean squared error in response to large batches of queries. This also shows that correlated differential privacy can successfully protect privacy while maintaining the practicality of the data.
Some scholars convert the correlated time-series data to another independent domain for processing while retaining the main characteristics of the original sequence. Rastogi et al. [15] proposed the Fourier perturbation algorithm (FPA) to solve this problem. In FPA, the discrete Fourier transform (DFT) is used to convert the correlated data into an independent Fourier domain, and the DFT coefficients of the original sequence are then approximately reconstructed. To overcome the shortcomings of FPA when applied to short-term and nonstationary sequences, the discrete wavelet transform (DWT) was proposed in [22,23]. DWT extends the range of FPA and retains more features of the sequence. Although there are difficulties in ensuring differential privacy, the literature [24-26] uses principal component analysis (PCA) to project the features of the dataset into another dimension, and the published perturbed data can be applied to some common statistical learning applications. Table 1 provides a summary of recent studies on differentially private publication of correlated time-series data.
Summary. Currently, some differentially private methods for correlated time-series data add independent noise to the correlated data, which makes them easy to attack. The other methods add correlated noise but ignore the periodic changes of time-series data, resulting in insufficient privacy protection. Moreover, the current methods can only be applied to the publication of static data.
This article attempts to solve the following problems: How can correlated time-series data be published dynamically? How can the loss of privacy strength caused by the periodic changes of correlated time-series data be addressed?

Preliminary Knowledge
3.1. Differential Privacy. Dwork et al. [11] proposed the differential privacy model for the first time, which is a strong privacy protection framework. By limiting the influence of the change of a single record in the dataset on the query results, the attacker cannot accurately obtain the sensitive information in the record even if he knows all the record information except a certain record.
Definition 1 (ε-differential privacy [26]). Consider two neighboring datasets, $D$ and $D'$. If, for every output set $O \subseteq \mathrm{range}(A)$, the random algorithm $A$ satisfies

$$\Pr[A(D) \in O] \le e^{\varepsilon} \cdot \Pr[A(D') \in O],$$

then the algorithm $A$ satisfies ε-differential privacy.
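Definition 2 (Global sensitivity [28]). Suppose there is a query function $f: D \to \mathbb{R}^d$, which takes a dataset $D$ as input and outputs a $d$-dimensional real vector. For any neighboring datasets $D$ and $D'$, the global sensitivity of the function $f$ is defined as

$$GS_f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1 .$$

Definition 3 (Laplace mechanism [28]). Given a dataset $D$ and a function $f: D \to \mathbb{R}^d$ with sensitivity $GS_f$, the random algorithm

$$A(D) = f(D) + \mathrm{Lap}(GS_f / \varepsilon)^d,$$

which adds independent Laplace noise of scale $GS_f/\varepsilon$ to each of the $d$ coordinates, provides ε-differential privacy protection.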
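To make the Laplace mechanism concrete, the following is a minimal Python sketch, not the authors' MATLAB implementation. The function names `laplace_noise` and `laplace_mechanism` are illustrative, and the sampler uses the standard fact that the difference of two independent exponential variables is Laplace distributed.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as a difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(answers, sensitivity, epsilon, rng=None):
    """Add Laplace noise of scale sensitivity / epsilon to each query answer."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [a + laplace_noise(scale, rng) for a in answers]
```

A smaller privacy budget gives a larger noise scale: with sensitivity 1 and ε = 0.5, each coordinate receives Laplace noise of scale 2 (variance 8).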
Theorem 1 (parallel combinatorial properties [29]). Consider a sequence of random algorithms $A_1, A_2, \ldots, A_n$ whose random processes are mutually independent, where $A_i$ satisfies $\varepsilon_i$-differential privacy. When the algorithms are applied to disjoint subsets of the dataset, the privacy protection budget of the combined algorithm is $\max_i \varepsilon_i$.

3.2. Problem Definition. Time-series data are a set of sequential, large, and continuous data sequences. In general, time-series data can be regarded as a dynamic dataset that grows infinitely over time. For example, Table 2 shows the blood glucose time-series data collected from different users within one month. Consider the following scenarios: user A wants to query the average value of blood glucose data within the range $T_1$-$T_2$; user B wants to query the number of people whose blood pressure is greater than 140 mmHg at time $T_3$. The goal of this article is to use differential privacy technology to publish correlated time-series data so that users can obtain meaningful query results while personal privacy in the database is not leaked. The curator aggregates the time-series data of all users and divides it into subdatasets $D_1, \ldots, D_n$ according to the data attributes.
Each subdataset $D_i$ is further divided into pieces of disjoint time-series data $S_{i,1}, \ldots, S_{i,m}$ according to the user dimension. The curator finally publishes all data on the premise of satisfying differential privacy and responds to user queries, as shown in Figure 1.
For any piece of time-series data X, it can be treated as a short-term stationary sequence, and its autocorrelation can be expressed using an autocorrelation function.
Definition 4 (Autocorrelation function [30]). The correlation of time-series data can be expressed by the autocorrelation function. For the original time-series data $X$, the autocorrelation function $R_{XX}(\tau)$ is expressed in terms of $N_0$ and $\delta(\tau)$, where $N_0$ represents the power spectral density of $X$ and $\delta(\tau)$ represents the impulse function.
Definition 5 (Sequence indistinguishability [16]). If the original time-series data $X$ and the noise sequence $Z$ to be released have the same normalized autocorrelation functions, that is,

$$\frac{R_{XX}(\tau)}{R_{XX}(0)} = \frac{R_{ZZ}(\tau)}{R_{ZZ}(0)},$$

then the noise sequence and the original sequence are indistinguishable to the attacker, and the attacker cannot simply use knowledge about the correlation of the original sequence to launch the attack.
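As an illustration of how the condition in Definition 5 might be checked on finite samples, the sketch below compares the normalized sample autocorrelations of two sequences. It assumes the biased, uncentered estimator and normalization by the lag-0 value; the function names are hypothetical and are not taken from [16].

```python
def autocorrelation(x, max_lag):
    """Biased sample autocorrelation R(tau) = (1/N) * sum_t x[t] * x[t + tau]."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def normalized_autocorrelation(x, max_lag):
    """Autocorrelation divided by its lag-0 value, so the first entry is 1."""
    r = autocorrelation(x, max_lag)
    if r[0] == 0:
        raise ValueError("sequence has zero energy")
    return [v / r[0] for v in r]


def max_autocorr_gap(x, z, max_lag):
    """Largest absolute gap between the two normalized autocorrelations."""
    rx = normalized_autocorrelation(x, max_lag)
    rz = normalized_autocorrelation(z, max_lag)
    return max(abs(a - b) for a, b in zip(rx, rz))
```

A gap close to zero over the lags of interest suggests that a filter tuned to the correlation of the original sequence cannot easily separate the noise from the data; a rescaled copy of a sequence, for instance, has a gap of zero.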

Correlated Time-Series Data Publishing Algorithm Based on Sliding Window
In real life, time-series data form a dynamic dataset that grows infinitely over time. Therefore, on the basis of the CTS-DP algorithm, this paper applies the sliding window mechanism to time-series data of any length to realize the continuous publication of time-series data under the premise of satisfying differential privacy. To solve the problem of insufficient privacy protection in the CTS-DP algorithm, we propose periodic sensitivity instead of global sensitivity to achieve stronger privacy protection.

Sliding Window Model.
Define time-series data $X = \{D_1, D_2, \ldots, D_t\}$, where $D_t$ represents the data value at time $t$. The sliding window model is used to model the time-series data $X$; each sliding window is denoted $w_i$, and the sliding window size is $w$. The data contained in each sliding window is $X_{W_i} = \{D_i, D_{i+1}, \ldots, D_{i+w-1}\}$, and the algorithm processes these data to produce the data to be published. The sliding window in time-series data refers to specifying an interval on the time-series data that contains the latest data. The purpose is to bound the infinite data stream and obtain data characteristics. As new data arrive, the data in the sliding window are processed once the amount of data reaches the set sliding window size. The window then slides forward and waits for the next set of data. Figure 2 shows the process of publishing time-series data using the sliding window model.
Differential privacy protection for time-series data is divided into two levels: the event level and the user level [9]. The former protects every event in the time-series data sequence, while the latter protects all user behaviors. This paper targets event-level privacy protection, protecting each event in the time-series data sequence.
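The window handling described above can be sketched as follows. This is a minimal illustration that assumes non-overlapping windows, in line with Algorithm 1, which divides X into ⌊|X|/L⌋ subsequences; the class and function names are hypothetical.

```python
class SlidingWindowBuffer:
    """Collects streaming values and releases them one full window at a time."""

    def __init__(self, window_size):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self.pending = []  # values waiting for the current window to fill

    def push(self, value):
        """Add one value; return the completed window, or None if not yet full."""
        self.pending.append(value)
        if len(self.pending) < self.window_size:
            return None
        window, self.pending = self.pending, []
        return window


def split_into_windows(series, window_size):
    """Split existing (static) data into floor(len / window_size) full windows."""
    buffer = SlidingWindowBuffer(window_size)
    windows = []
    for value in series:
        window = buffer.push(value)
        if window is not None:
            windows.append(window)
    return windows
```

Values that do not yet fill a window stay in `pending` until enough new data arrive, which mirrors the "wait for the next set of data" step.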

The Sampling Period of Time-Series Data.
Time-series data usually change periodically, and this periodicity can be used to determine the sampling period of the data. For example, the blood glucose of healthy people remains in a constant range before each of the three daily meals and before bedtime. Usually, the sampling frequency of health data within a day is taken as a period. Taking the blood glucose data as an example, the data are sampled four times a day, so the sampling period of blood glucose data is T = 4. For data that yield only a single statistical value per day, such as the number of steps, the sampling frequency of the data within a week or month can be used as the period, that is, T = 7 or T = 30.

Periodic Sensitivity.
Since time-series data have strong periodic changes, continuing to use the global sensitivity increases the risk of privacy leakage.
For example, suppose someone's blood pressure surged recently due to staying up late. If users query the blood pressure value of one day in this period, they have a higher probability of inferring the other nearby blood pressure samples. Therefore, to ensure that the data are not leaked, it is necessary to delete all the sampling data before and after this blood pressure value. If the global sensitivity is still used to generate Laplacian noise in this situation, it cannot adequately protect the data from leakage. Based on this, this paper proposes periodic sensitivity to replace global sensitivity to provide stronger privacy protection.

Table 1: Summary of recent studies on differentially private publication of correlated time-series data.

| Algorithm | Advantage | Limitation |
| --- | --- | --- |
| Pufferfish [27] | The algorithm takes into account the correlation between data | Does not satisfy differential privacy |
| PCA [24-26], DFT [15], and DWT [22,23] | Under the premise of keeping the main characteristics of the sequence unchanged, the correlated time series is transformed into another independent domain for processing | Independent noise is added and the sequence correlation is destroyed to some extent |
| CIM [12] | Literature [12] proposed correlated sensitivity to reduce noise and utilized a correlation coefficient matrix to describe the correlation of a series | It is only applicable to the publication of histogram statistics |
| CTS-DP [16] | Correlated noise is added to the original time-series data | Dynamic data cannot be processed and privacy protection is inadequate |
Definition 6 (periodic sensitivity). According to the attribute $N$ of the time-series data, determine the sampling period $T$ of this attribute. The periodic sensitivity is then defined over the following quantities: $X$ represents a piece of time-series data of attribute $N$, $Q$ represents the query function, $-T_i$ means removing all data in the $i$-th sampling period, and $|T|$ represents the number of sampled data points in a period.
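The exact formula of Definition 6 is not reproduced here, so the following Python sketch rests on an explicit assumption: it reads periodic sensitivity as the largest change in the query answer when all samples of one complete period are removed. How $|T|$ enters the original formula is not modeled, and the function names are hypothetical.

```python
def remove_period(series, period_len, i):
    """Return the series without the samples of the i-th period (0-based)."""
    start = i * period_len
    return series[:start] + series[start + period_len:]


def leave_one_period_out_sensitivity(series, period_len, query):
    """ASSUMED reading of periodic sensitivity, not the paper's formula:
    max over complete periods i of |Q(X) - Q(X with period i removed)|."""
    n_periods = len(series) // period_len
    if n_periods < 2:
        raise ValueError("need at least two complete periods")
    full = query(series)
    return max(abs(full - query(remove_period(series, period_len, i)))
               for i in range(n_periods))
```

For three periods of four samples where the middle period holds the value 5 and the others hold 1, a sum query changes by at most 20 under this reading, far more than the change of 5 caused by removing a single record.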

Algorithm Design.
The SW-ATS algorithm can iteratively process and publish the existing data (static data) in the database, and the recently arrived data (dynamic data) can be processed and published once the data volume reaches the sliding window size. Alternatively, the size of the sliding window can be adjusted to the size of the newly added data before publishing. The procedure of the SW-ATS algorithm is shown in Algorithm 1.
Algorithm 1 shows the basic framework of SW-ATS. SW-ATS divides the original time series $X$ into $n$ subsequences according to the sliding window length $L$ (line 1) and iteratively processes the subsequence in each sliding window (lines 2-9). First, it calculates the autocorrelation function of the subsequence $Sub_i$ (line 3) and the periodic sensitivity (line 4). It then generates four groups of white Gaussian noise (line 5) with the same length as the subsequence and a power spectral density determined by $\lambda = \Delta f/\varepsilon$, the ratio of sensitivity to privacy budget (line 4). The four groups of Gaussian white noise are convolved with the impulse response to obtain four groups of Gaussian noise sequences with autocorrelation function $R_{G'}(\tau) = \sqrt{R_{Sub_i Sub_i}(\tau)/8}$ (line 6). Finally, Laplacian noise is obtained as the sum of the squares of two of the Gaussian noise groups minus the sum of the squares of the other two; sampling this sequence at intervals of 1 yields Laplacian noise of length $L$ (line 7). By splicing all Laplace noise of length $L$ and adding it to the original time series, the final noisy sequence $X$ is obtained and ready for publishing (lines 8 and 9).
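The step that turns four Gaussian sequences into Laplacian noise can be sketched in Python as below. This is a simplified illustration under stated assumptions: it deliberately omits the convolution with the impulse response, so the resulting noise is white rather than correlated, and it chooses the Gaussian variance as scale/2 so that the output is Laplace distributed with the requested scale. The function name is hypothetical.

```python
import random


def laplace_from_four_gaussians(length, scale, rng=None):
    """Build Laplace(0, scale) noise as G1^2 + G2^2 - G3^2 - G4^2.

    Each G_k is N(0, scale / 2). G1^2 + G2^2 is then exponential with mean
    `scale`, and the difference of two such exponentials is Laplace(0, scale).
    """
    rng = rng or random.Random()
    sigma = (scale / 2.0) ** 0.5
    noise = []
    for _ in range(length):
        g1, g2, g3, g4 = (rng.gauss(0.0, sigma) for _ in range(4))
        noise.append(g1 * g1 + g2 * g2 - g3 * g3 - g4 * g4)
    return noise
```

In the full algorithm, `scale` would play the role of λ = Δf/ε, and each Gaussian sequence would first be filtered so that the noise inherits the autocorrelation of the window.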
For newly added data, when the amount of data reaches the size of the sliding window, the sequence $X_t$ is obtained, and steps 3-7 in Algorithm 1 are executed directly to obtain the noise sequence $Sub_t'$. Then compute $X_t = X_t + Sub_t'$ and publish $X_t$.
Proof. Literature [16] has proved that if the original time-series data and the noise sequence added to the time-series data meet Definition 5, then the published noise sequence meets ε-differential privacy.
Therefore, according to Theorem 1, the algorithm SW-ATS satisfies ε-differential privacy.
Proof. Literature [16] has proved that if the autocorrelation function $R_{G'}$ satisfies $R_{G'}(\tau) = \sqrt{R_{Sub_i Sub_i}(\tau)/8}$, then the autocorrelation function of the noise sequence calculated by the algorithm is consistent with that of the subsequence $Sub_i$.
In the complexity analysis, $L$ denotes the length of the sliding window. When the length of the sliding window is the same as that of the original sequence, the time complexity of the algorithm is $O(n^2)$. As new data keep arriving, only the latest data need to be processed, so for the recently arrived data, the time complexity is $O(L^2)$.

Utility Analysis.
This paper uses the differential privacy utility definition proposed by Blum et al. [31] to perform utility analysis.
Definition 7 ((α, β)-accuracy [31]). For a query set Q, if for each query Q t ∈ Q and the original dataset X, the privacy protection mechanism M can satisfy equation (9) with a probability 1 − β, then M satisfies (α, β)-accuracy.
For any query Q t ∈ Q, it is known that β > 0 holds, and the generalized Laplace mechanism satisfies

Experimental Evaluation
The experiments use MATLAB to implement the correlated time-series differential privacy publishing algorithm based on sliding window. The experimental environment is an Intel(R) Core(TM) i5 2.7 GHz processor, 4 GB memory, and the Windows 7 operating system. We used two real-world datasets in our evaluations to illustrate the effectiveness of our approach in real-world applications.
Diabetes (http://archive.ics.uci.edu/ml/datasets/Diabetes). The Diabetes dataset is a representative standard classification dataset in the UCI machine learning repository. The records were obtained from two sources: an automatic electronic recording device and paper records. The automatic device had an internal clock to timestamp events, whereas the paper records only provided "logical time" slots (breakfast, lunch, dinner, and bedtime). For paper records, fixed times were assigned to breakfast (08:00), lunch (12:00), dinner (18:00), and bedtime (22:00).
Steps. The data are collected by teachers and students through smart bracelets and mobile phones. Table 3 shows some of the fields in the dataset, including start date, end date, and value. For example, a record can state that the number of steps someone took during the period from 2019-05-14 10:37:07 to 2019-05-14 11:49:32 is 956. Moreover, the start and end dates of each sampling are not fixed, indicating that the smart bracelets and mobile phones collect and count the number of steps in multiple periods within a day. After cleaning, the step data collected in each period of the day are merged to obtain daily step data.
Metrics. To verify the effectiveness of the algorithm proposed in this paper, the SW-ATS and CTS-DP algorithms are compared. Data utility is measured by the mean absolute error (MAE), defined as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert x_i - x_i' \rvert ,$$

where $x_i$ and $x_i'$ denote the original and published values at position $i$, and $N$ represents the length of the time series. A lower MAE means better data utility.
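For reference, the MAE metric can be computed with a few lines of Python. This is a minimal sketch with a hypothetical function name; the experiments in the paper were run in MATLAB.

```python
def mean_absolute_error(original, published):
    """Average absolute difference between original and published values."""
    if len(original) != len(published) or not original:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(original, published)) / len(original)
```

For example, comparing `[1, 2, 3]` with `[2, 2, 1]` gives absolute errors 1, 0, and 2, so the MAE is 1.0.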

Experimental Results.
Currently, CTS-DP is the state-of-the-art method for publishing correlated time-series data. Therefore, we choose the CTS-DP algorithm as the baseline for comparison.

Impact of Sliding Window Size on Data Utility.
Figure 3 shows the experimental results of the two algorithms under different sliding window sizes when the privacy budget ε is 1 and 0.5, respectively. In the Diabetes dataset, a piece of time series was randomly selected for processing. Each algorithm was run 1000 times, and the results were averaged over the 1000 runs. The result of SW-ATS is clearly better than that of CTS-DP, and the average error is reduced by 37.5%. As the size of the sliding window continues to increase, the MAE of SW-ATS also increases. In the Steps dataset, the dataset was first divided into 7 intervals according to the number of steps (an interval less than or equal to 3000 steps and an interval greater than 21000), and then the number of people in each interval was counted every day to form 7 statistical time series. The experimental results also show that the results of SW-ATS are better than those of CTS-DP, and the average error is reduced by 24.9%. As the size of the sliding window increases, the error of SW-ATS keeps increasing, but it is always smaller than that of CTS-DP.
Figure 4 shows the comparison of the results of the two algorithms under different privacy budgets when the sliding window sizes are 5T = 35 and 10T = 70, respectively. As the privacy budget increases, the MAE of both algorithms decreases, and the algorithm SW-ATS proposed in this paper is always better than CTS-DP. The average error of SW-ATS in the Diabetes dataset is 25.1% less than that of CTS-DP, and the decrease in the average error in the Steps dataset is 12.5%.

ALGORITHM 1: SW-ATS.
Input: original time series X
Output: time series X to be published after adding noise
(1) Read the original time series X and divide X into n subsequences Sub_1, Sub_2, ..., Sub_n using the sliding window length L, where n = ⌊|X|/L⌋.
(2) for i = 1 to n:
(3)   Calculate the autocorrelation function R_{Sub_i Sub_i}(τ) of the subsequence Sub_i.
(4)   According to the query function q, calculate the periodic sensitivity PS_q of the time-series data X, where PS_q is computed by equation (5).
(5)   Generate four IID Gaussian white noise series G_1, G_2, G_3, G_4, which have the same length as |Sub_i|.
(6)   Convolve the four Gaussian white noise series with the impulse response to obtain four Gaussian noise series with autocorrelation function R_{G'}(τ) = sqrt(R_{Sub_i Sub_i}(τ)/8).
(7)   Compute the Laplace noise Sub_i' of length L as the sum of the squares of two of the Gaussian noise series minus the sum of the squares of the other two.
(8)   Append Sub_i' at the end of Z.
(9) end for
(10) X = X + Z
(11) Return X

Privacy Protection Strength Calculation.
In this paper, we use the filtering-based attack method proposed by Xiong et al. [32] to calculate the privacy protection strength. The privacy protection strength after the attack, ε′, is computed from R, P, and ε, where R is a vector representing the autocorrelation function of the noise sequence, P is the cross-correlation function of the original sequence and the noisy sequence, and ε represents the privacy budget. The smaller the ε′, the higher the privacy protection strength. Figure 5 shows the comparison of the privacy protection strength of the two algorithms. On both datasets, as the privacy budget continues to increase, the privacy protection strength of the two algorithms shows a downward trend. However, the privacy protection strength of the SW-ATS algorithm is always higher than that of CTS-DP. This indicates that the periodic sensitivity proposed in this paper is effective and that SW-ATS protects users' privacy from leakage to a greater extent.

Experimental Conclusions.
Each time CTS-DP releases data, it needs to process all the time-series data involved in the query. When new data arrive, CTS-DP needs to recalculate all the time-series data to be released and performs many unnecessary calculations. With the continuous growth of the data flow, the calculation cost of the CTS-DP algorithm becomes larger and larger and may cause the system to crash in extreme cases. The SW-ATS algorithm proposed in this paper introduces a sliding window mechanism on the basis of CTS-DP, which can both process the latest data and respond to queries with different starting times and lengths. This avoids many unnecessary calculations and greatly saves system resources.
The experimental results show that, under sliding windows of different sizes, the error of SW-ATS is about 31% lower than that of CTS-DP, and under different privacy budgets, the error is about 19% lower.
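As a quick check, and under the assumption (not stated explicitly in the text) that these summary figures are plain averages over the two datasets, they match the per-dataset reductions reported for Figure 3 (37.5% and 24.9%) and Figure 4 (25.1% and 12.5%):

```latex
\[
\frac{37.5\% + 24.9\%}{2} = 31.2\% \approx 31\%, \qquad
\frac{25.1\% + 12.5\%}{2} = 18.8\% \approx 19\%.
\]
```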

Conclusions and Future Works
In this paper, we proposed a sliding window-based differential privacy publishing algorithm for autocorrelated time series, which is applied to the publishing of time-series data. We proved that SW-ATS satisfies ε-differential privacy. The experimental results show that the algorithm is significantly better than the comparison algorithm in the publishing of time-series data and can be applied to the publishing of dynamic data.
Although SW-ATS is effective, some aspects remain to be improved in the future. First, the periodic sensitivity depends on the sampling period of the time-series data. SW-ATS provides better protection only when the time-series data have an obvious sampling period. If the time-series data are sampled randomly, the privacy protection strength may not meet expectations. In addition, to calculate the periodic sensitivity, the length of the sliding window must be greater than three times the length of the sampling period. Second, the SW-ATS algorithm currently considers only the autocorrelation of a single attribute and can only process the time-series data of a single attribute at a time. The data of each attribute not only are autocorrelated but also are correlated with other attributes. Our next research direction is to consider the correlation between multiple attributes and publish multidimensional correlated time-series data.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Security and Communication Networks