Differential Privacy Principal Component Analysis for Support Vector Machines

In the big data era, massive, high-dimensional data is produced constantly, increasing the difficulty of analyzing and protecting data. In this paper, in order to realize dimensionality reduction and privacy protection of data, principal component analysis (PCA) and differential privacy (DP) are combined to handle such data. Moreover, a support vector machine (SVM) is used to measure the availability of the processed data. Specifically, we introduce differential privacy mechanisms at different stages of the PCA-SVM algorithm and obtain the algorithms DPPCA-SVM and PCA-DPSVM. Both algorithms satisfy (ε, 0)-DP while achieving fast classification. In addition, we evaluate the performance of the two algorithms in terms of noise expectation and classification accuracy, from the perspectives of theoretical proof and experimental verification. To verify the performance of DPPCA-SVM, we also compare it with other algorithms. Results show that DPPCA-SVM provides excellent utility on different data sets despite guaranteeing stricter privacy.


Introduction
Data mining is a hot spot in the field of artificial intelligence and database research. In the past ten years, it has been widely applied in daily life, transforming many industries. However, while providing convenience for people, data mining also brings a series of hidden dangers. For example, in many modern information systems, the amount of data is very large, increasing the difficulty of data processing. Principal component analysis (PCA) [1] is a standard data analysis method that can reduce the dimension of data, make the data easier to process, and cut down the computational overhead of subsequent algorithms. More specifically, it obtains low-dimensional data by projecting the original high-dimensional data into the principal component space composed of the eigenvectors of the data covariance matrix. What is more, this low-dimensional data can represent most of the information in the original data, revealing its essential structure.
In addition, the privacy leakage of personal data is another drawback accompanying the development of data mining.
There is no doubt that principal component analysis is an efficient method to reduce the dimensionality of data, but it also brings some safety risks when dealing with private data. Financial systems and healthcare systems often contain private or sensitive information. If data processing algorithms like PCA are performed directly on the original data, their output is likely to leak private information. As a result, how to protect sensitive information while processing data is one of the urgent problems in the field of data mining. Differential privacy (DP) [2] is an effective privacy protection method which can protect individual information while preserving the basic statistics of the original data, through adding a proper amount of noise to the query results or analysis results [3]. Under the differential privacy protection model, the calculation results on a data set are not sensitive to changes in specific records, and the risk of privacy leakage when adding or deleting a record is controlled within a very small range [4]. In addition, the notion of differential privacy has two types: (ε, 0)-DP and (ε, δ)-DP [5]. (ε, 0)-DP is usually called pure differential privacy, while (ε, δ)-DP with δ > 0 is called approximate differential privacy. (ε, δ)-DP is a weaker version of (ε, 0)-DP, as the former allows strict differential privacy to be violated on some low-probability events. The methods that combine differential privacy and principal component analysis are mainly divided into input perturbation and output perturbation. Input perturbation adds noise to the covariance matrix in the principal component analysis algorithm, while output perturbation adds noise to the output of the desired algorithm.
There has been considerable research on differentially private principal component analysis. Blum et al. [6] first proposed the early input perturbation framework SuLQ. SuLQ (Sublinear Queries) guarantees (ε, δ)-DP by perturbing the covariance matrix A: it adds a matrix N of i.i.d. Gaussian noise and applies the PCA algorithm to the matrix A + N. One drawback of this method is that A + N is not symmetric with probability 1, so its largest eigenvalue may not be real. Therefore, Chaudhuri et al. [7] modified SuLQ by adding a symmetric noise matrix, so that the eigenvalues are all real. Their algorithm MOD-SuLQ also satisfies (ε, δ)-DP. Besides, Chaudhuri et al. proposed a new method, PPCA, which randomly samples a k-dimensional subspace from a distribution that ensures differential privacy and is biased towards high utility. Kapralov and Talwar [8] argued that the algorithm of Chaudhuri et al. lacks a convergence time guarantee, and they designed an algorithm using the exponential mechanism, which is, however, complicated to implement for high-dimensional data. Dwork et al. [9] provided algorithms for (ε, δ)-DP, adding Gaussian noise to the original sample covariance matrix. Inspired by Dwork, Hafiz et al. [10,11] and Wu-xuan Jiang et al. [12] designed algorithms for (ε, 0)-DP. Both added Wishart noise and selected parameters with a better range of utility.
Obviously, previous algorithms were mostly based on the idea of input perturbation, and few have investigated differentially private principal component analysis based on output perturbation. At the same time, the privacy protection level and performance of algorithms built on these different perturbation methods are rarely compared and analyzed. What is more, most current differentially private principal component analysis methods can only guarantee (ε, δ)-DP and lack a method to measure data availability.
In this paper, a support vector machine (SVM) [13] is added to measure the availability of the processed data by comparing classification accuracy. On this basis, we combine principal component analysis, differential privacy, and support vector machines to propose two new algorithms: DPPCA-SVM based on input perturbation and PCA-DPSVM based on output perturbation. Both algorithms guarantee (ε, 0)-DP. In addition, for the purpose of analyzing the performance of algorithms based on different perturbations, we evaluate the classification accuracy of SVM, PCA-DPSVM, and DPPCA-SVM. Meanwhile, DPPCA-SVM is compared with other recent algorithms, such as MOD-SuLQ-SVM [7] and AG-SVM [9]. Through these experiments, we find that DPPCA-SVM and PCA-DPSVM provide better privacy protection while accomplishing the task of data processing. What is more, results show that DPPCA-SVM provides excellent utility on different data sets despite guaranteeing stricter privacy.
There is no doubt that DPPCA-SVM and PCA-DPSVM are valuable. On the one hand, they efficiently cope with the two major difficulties in the current data processing field: dimensionality reduction and privacy protection. On the other hand, they can also provide ideas for researchers in choosing the location of the perturbation and the magnitude of the noise needed to achieve differential privacy. Our main contributions are as follows: (1) We propose DPPCA-SVM and PCA-DPSVM by applying the Laplace mechanism to the PCA-SVM algorithm, and we prove that both satisfy (ε, 0)-DP. (2) We contrast the performance of the two algorithms in terms of noise expectation via theoretical analysis; less noise means less error and better classification accuracy. Through theoretical verification, we show that DPPCA-SVM performs better than PCA-DPSVM. (3) We conduct experiments to verify the performance of DPPCA-SVM and PCA-DPSVM in terms of classification accuracy on three real data sets. Then we compare our DPPCA-SVM with the recent algorithms AG-SVM and MOD-SuLQ-SVM, and the experimental results show that our algorithm provides a stronger privacy guarantee and excellent data utility. At last, we show that using principal component analysis before SVM can significantly reduce computational cost. The rest of the paper is organized as follows: Section 2 introduces principal component analysis, support vector machines, and differential privacy. Section 3 first describes the two algorithms we propose and then analyzes their privacy and utility. Section 4 shows the performance of the algorithms on three real data sets compared with other algorithms. Section 5 concludes the paper.

Principal Components Analysis (PCA). Given a data set X = [x_1, x_2, ..., x_n]^T ∈ R^{n×d} that contains information about d attributes of n samples (generally d ≪ n), we assume that the norm of each sample satisfies ‖x_i‖_2 ≤ 1. Define the d × d symmetric covariance matrix of the original data as

A = (1/n) X^T X. (1)

The principal components are obtained by computing the eigenvalues and corresponding eigenvectors of the covariance matrix A:

A v_i = λ_i v_i, 1 ≤ i ≤ d, (2)

where λ_i (1 ≤ i ≤ d) is one of the eigenvalues of the covariance matrix A, indicating the proportion of information that the corresponding component contains. A larger λ_i means that the component is more important. We assume that the λ_i are sorted in descending order, that is, λ_1 ≥ λ_2 ≥ · · · ≥ λ_d ≥ 0, and v_i is the corresponding eigenvector. In order to reduce the data dimension, a target dimension k is needed: we select the first k eigenvectors, which correspond to the top k eigenvalues. To select k, a threshold α (0 ≤ α ≤ 1) is introduced to denote the accumulative contribution rate of the principal components. Given a parameter α, the target dimension k can be decided by

k = min{ m : (Σ_{i=1}^m λ_i) / (Σ_{i=1}^d λ_i) ≥ α }. (3)

We project the original data X onto V_k = [v_1, v_2, ..., v_k] to get the low-dimensional data

Y = X V_k, (4)

where Y ∈ R^{n×k}. We substitute Y = [y_1, y_2, ..., y_n]^T for X = [x_1, x_2, ..., x_n]^T, so the data dimension is reduced and the computational cost of subsequent algorithms is lowered.
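As an illustration, the steps above (covariance matrix, eigendecomposition, choosing k by the accumulative contribution rate α, and projection) can be sketched in a few lines of NumPy. The function name `pca_reduce` and the toy data are our own illustration, not part of the paper:

```python
import numpy as np

def pca_reduce(X, alpha=0.85):
    """Project X (n x d) onto the top-k principal components, where k is
    the smallest number of components whose accumulative contribution
    rate (sum of top-k eigenvalues / sum of all eigenvalues) reaches alpha."""
    X = X - X.mean(axis=0)                    # center the data
    A = (X.T @ X) / X.shape[0]                # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(A)      # ascending order for symmetric A
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending
    ratio = np.cumsum(eigvals) / np.sum(eigvals)        # contribution rates
    k = int(np.searchsorted(ratio, alpha) + 1)          # smallest m with ratio >= alpha
    Vk = eigvecs[:, :k]                       # d x k projection matrix
    return X @ Vk, k                          # n x k low-dimensional data

# Toy data: 3-D points lying almost on a 2-D plane, so k should be at most 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3))
Y, k = pca_reduce(X, alpha=0.85)
```

Because the third eigenvalue of this toy data is tiny, the cumulative contribution rate of the first two components already exceeds α = 0.85.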

Support Vector Machine (SVM). For a data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, x_i is the i-th record and the label y_i ∈ {−1, 1}. The classification decision function is defined as

f(x) = sgn(w · x + b),

where sgn(·) represents the sign function, w denotes the vector orthogonal to the optimal hyperplane, and b is a constant.
To obtain the estimates of w and b, the following minimization problem should be solved:

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^n ξ_i, s.t. y_i(w · x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n.

The objective function is obviously convex. The Lagrangian dual function can be introduced to convert the constrained objective function into an unconstrained Lagrangian objective function [14]. The problem is transformed as follows:

max_α Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j), s.t. Σ_{i=1}^n α_i y_i = 0, 0 ≤ α_i ≤ C,

where α_i is a Lagrangian multiplier, 0 ≤ α_i ≤ C, and C is a constant denoting the penalty factor. Let

w = Σ_{i=1}^n α_i y_i x_i.

So, the classification decision function can be simplified as follows:

f(x) = sgn( Σ_{i=1}^n α_i y_i (x_i · x) + b ).
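The relation w = Σ α_i y_i x_i can be checked numerically. The following sketch uses scikit-learn (an illustration of ours, not the paper's implementation), whose `dual_coef_` attribute stores α_i y_i for the support vectors, so the matrix product with `support_vectors_` reconstructs w exactly:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data: two well-separated Gaussian clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, size=(50, 2)),
               rng.normal(2, 0.5, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, so
# w = sum_i alpha_i y_i x_i is exactly:
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_

# Decision function f(x) = sgn(w . x + b)
pred = np.sign(X @ w.ravel() + b)
```

On this separable toy data the reconstructed w matches `clf.coef_` and the decision function classifies every training point correctly.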

Differential Privacy
Definition 1 (differential privacy) (see [15]). A randomized mechanism M satisfies ε-differential privacy if, for any two neighbouring data sets D and D′ (differing in at most one sample) and for all outputs O (O ⊆ range(M)),

Pr[M(D) ∈ O] ≤ e^ε · Pr[M(D′) ∈ O],

where ε > 0 is the privacy budget controlling the strength of the privacy guarantee. A lower ε ensures a stronger privacy guarantee.
Definition 2 (sensitivity) (see [16]). For a function f: D ⟶ R^d and any two neighbouring data sets D and D′, the sensitivity of the function f is defined as

Δf = max_{D,D′} ‖f(D) − f(D′)‖_1.

The sensitivity describes the largest change in the output due to a single sample replacement; Δf is related only to the function f. The Laplace mechanism adds independent noise to data; we use lap(b) to denote noise sampled from the Laplace distribution with scale b.
Definition 3 (Laplace mechanism) (see [17]). Given a data set D and a function f: D ⟶ R^d with sensitivity Δf, the mechanism M provides ε-differential privacy if it satisfies

M(D) = f(D) + (lap_1(Δf/ε), ..., lap_d(Δf/ε)),

where lap(·) is a random variable whose probability density function is

p(x | b) = (1/(2b)) exp(−|x| / b).
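A minimal sketch of the Laplace mechanism (the helper name `laplace_mechanism` and the mean-query example are our own illustration): noise with scale Δf/ε is added to each coordinate of the query answer:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release value + Lap(sensitivity / epsilon) noise, which satisfies
    (epsilon, 0)-differential privacy for a query with the given l1 sensitivity."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))

# Example: privately release the mean of n values in [0, 1].
# Replacing one record moves the mean by at most 1/n, so sensitivity = 1/n.
rng = np.random.default_rng(0)
data = rng.uniform(size=1000)
eps = 0.5
noisy_mean = laplace_mechanism(data.mean(), sensitivity=1.0 / len(data), epsilon=eps, rng=rng)
```

With n = 1000 and ε = 0.5 the noise scale is only 0.002, so the released mean stays very close to the true mean.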

Proposed Algorithms and Analysis
In this section, we propose two algorithms: DPPCA-SVM and PCA-DPSVM. Both algorithms achieve fast classification while reducing the risk of sample leakage from the data set. Through theoretical analysis, we prove that they satisfy (ε, 0)-differential privacy. Meanwhile, the utility of the two proposed algorithms is investigated in this section. Table 1 shows the notation used in this paper.

Algorithm Description.
The algorithm DPPCA-SVM takes low-dimensional data with privacy protection as the input of the support vector machine. It simplifies the data and reduces computational complexity at the same time.
Meanwhile, it provides adequate protection for private data. In DPPCA-SVM, we compute the covariance matrix of the original data X and then add a symmetric noise matrix to it, with each element of the noise matrix sampled from the Laplace distribution. Afterwards, we follow standard PCA to calculate the first k eigenvectors, which make up the principal component space. Then the original high-dimensional data is projected onto the principal component space to obtain low-dimensional data. This low-dimensional data is now under privacy protection and can be applied directly to the support vector machine to train the classification function. Algorithm DPPCA-SVM is described in Algorithm 1.
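A simplified sketch of this input-perturbation pipeline (not the paper's exact Algorithm 1): the covariance matrix is perturbed by a symmetric matrix of Laplace entries whose scale assumes the 2d/n sensitivity bound of Lemma 1, standard PCA is run on the noisy matrix, and an SVM is trained on the projected data. The function name, data generation, and parameter choices are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def dppca(X, k, epsilon, rng):
    """Input-perturbation sketch: perturb the covariance matrix with a
    symmetric matrix of Laplace noise, then run standard PCA on it.
    The noise scale 2d/(n*epsilon) assumes the Lemma 1 sensitivity
    bound 2d/n for f(X) = (1/n) X^T X with ||x_i||_2 <= 1."""
    n, d = X.shape
    A = (X.T @ X) / n
    upper = rng.laplace(scale=2.0 * d / (n * epsilon), size=(d, d))
    N = np.triu(upper) + np.triu(upper, 1).T        # symmetric noise matrix
    eigvals, eigvecs = np.linalg.eigh(A + N)
    Vk = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors
    return X @ Vk                                   # protected low-dimensional data

# Illustrative data: two informative high-variance attributes, eight weak ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10)) * np.concatenate([[5.0, 4.0], np.full(8, 0.5)])
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x_i||_2 <= 1
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

Y = dppca(X, k=3, epsilon=1.0, rng=rng)
clf = SVC(kernel="linear").fit(Y, y)  # train the SVM on the protected data
```

Because the class label depends only on the two high-variance attributes, which survive the noisy projection, the SVM trained on the protected data still classifies well.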
Algorithm DPPCA-SVM introduces differential privacy into principal component analysis to protect the training data. Alternatively, we can add noise to the real classification hyperplane (w, b) computed by the support vector machine, so that attackers are unable to recover the real classification hyperplane and training data by probing with testing data.
The idea of algorithm PCA-DPSVM is to take the low-dimensional data computed by principal component analysis as the input of the support vector machine and then add noise to the real classification hyperplane. Since the parameter b does not contain information about the training data and will not leak privacy, the perturbation of the classification hyperplane is concentrated on the parameter w. The algorithm PCA-DPSVM is described in Algorithm 2.
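A minimal sketch of this output-perturbation idea (not the paper's exact Algorithm 2): a linear SVM is trained on the PCA-reduced data, and Laplace noise is added to the normal vector w before release. The function name, toy data, and the l1 sensitivity bound 2C√k used for the noise scale are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def pca_dpsvm(Y, y, epsilon, C=1.0, rng=None):
    """Output-perturbation sketch: train a linear SVM on the PCA-reduced
    data Y (rows assumed normalized to ||y_i||_2 <= 1), then add Laplace
    noise to the hyperplane normal vector w before releasing it.
    The noise scale assumes the l1 sensitivity bound 2*C*sqrt(k) for w;
    the intercept b is released unperturbed, since it is treated as
    carrying no private information."""
    rng = rng if rng is not None else np.random.default_rng()
    n, k = Y.shape
    clf = SVC(kernel="linear", C=C).fit(Y, y)
    w = clf.coef_.ravel()
    b = float(clf.intercept_[0])
    w_priv = w + rng.laplace(scale=2.0 * C * np.sqrt(k) / epsilon, size=k)
    return w_priv, b

rng = np.random.default_rng(2)
Y = rng.normal(size=(300, 3))
Y /= np.maximum(1.0, np.linalg.norm(Y, axis=1, keepdims=True))  # ||y_i||_2 <= 1
y = np.where(Y[:, 0] > 0, 1, -1)

w_priv, b = pca_dpsvm(Y, y, epsilon=5.0, rng=rng)
pred = np.sign(Y @ w_priv + b)  # classify with the released private hyperplane
```

Only the noisy pair (w_priv, b) is released, so an attacker probing the classifier sees a perturbed hyperplane rather than the one fitted to the training data.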

Privacy Analysis.
Before proving that algorithm DPPCA-SVM satisfies (ε, 0)-differential privacy, we first analyze its sensitivity. Suppose that there are two neighbouring data sets X = [x_1, ..., x_n]^T and X′ = [x_1, ..., x_n′]^T that differ only in the last record, and assume each normalized data vector satisfies ‖x_i‖_2 ≤ 1.

Lemma 1.
In algorithm DPPCA-SVM, for any input data, denote f(X) = (1/n) X^T X; then the sensitivity of the function f(X) is at most 2d/n.
Proof. Suppose that A_1 and A_2 are the covariance matrices of X and X′, respectively. According to Definition 2, the sensitivity of the function f(X) is max ‖A_1 − A_2‖_1. Then we have

Δf = max ‖A_1 − A_2‖_1 = (1/n) max ‖x_n x_n^T − x_n′ x_n′^T‖_1 ≤ (1/n)(‖x_n‖_1² + ‖x_n′‖_1²) ≤ 2d/n,

where ‖ · ‖_1 denotes the entrywise l_1 norm; for a matrix C ∈ R^{m×n}, ‖C‖_1 = Σ_{i=1}^m Σ_{j=1}^n |c_ij|. □

Theorem 1. Algorithm DPPCA-SVM satisfies (ε, 0)-differential privacy.

Proof. For A′ derived from algorithm DPPCA-SVM on X and X′, we obtain A′ = A_1 + N_1 and A′ = A_2 + N_2, where N_1 and N_2 are the corresponding noise matrices.
p(N_1) and p(N_2) are the density functions of the output at the neighbouring data sets X and X′, respectively. Since every entry of the noise matrix is sampled from lap(Δf/ε), by Lemma 1 we have

p(N_1)/p(N_2) = Π_{i,j} exp( (ε/Δf)( |a′_ij − (a_2)_ij| − |a′_ij − (a_1)_ij| ) ) ≤ exp( (ε/Δf) ‖A_1 − A_2‖_1 ) ≤ exp(ε),

and we can obtain Pr[M(X) = A′] ≤ e^ε · Pr[M(X′) = A′]. Therefore, algorithm DPPCA-SVM satisfies (ε, 0)-differential privacy. □
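The Lemma 1 bound can also be checked empirically: over random pairs of neighbouring data sets whose rows are normalized to ‖x_i‖_2 ≤ 1, the entrywise l1 distance between the covariance matrices should never exceed 2d/n. A sketch with illustrative parameters:

```python
import numpy as np

# Empirical check of the Lemma 1 bound: for neighbouring data sets that
# differ in one row (all rows normalized to ||x_i||_2 <= 1), the entrywise
# l1 distance between the covariance matrices f(X) = (1/n) X^T X should
# never exceed 2d/n.
rng = np.random.default_rng(0)
n, d = 100, 8
worst = 0.0
for _ in range(200):
    X = rng.normal(size=(n, d))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
    X2 = X.copy()
    x_new = rng.normal(size=d)
    X2[0] = x_new / max(1.0, np.linalg.norm(x_new))   # replace one record
    gap = np.abs(X.T @ X - X2.T @ X2).sum() / n       # entrywise l1 distance
    worst = max(worst, gap)
bound = 2.0 * d / n
```

The observed worst case stays below the theoretical bound, consistent with the proof above.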
Before proving that algorithm PCA-DPSVM satisfies (ε, 0)-differential privacy, we also analyze its sensitivity. Suppose that there are two neighbouring classification data sets D_1 = {(x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n)} and D_2 = {(x_1, y_1), ..., (x_i, y_i), ..., (x_n′, y_n′)}, where x_i ∈ R^k and x_n ≠ x_n′, and we assume the normalized data vectors satisfy ‖x_i‖_2 ≤ 1.

Lemma 2. In algorithm PCA-DPSVM, for any input data, denote f(x, y) = Σ_{i=1}^n α_i x_i y_i; then the sensitivity of the function f(x, y) is at most 2C√k.
Proof. From the dual solution of the SVM, the classification hyperplane normal vector is w = Σ_{i=1}^n α_i x_i y_i. Suppose that w_1 and w_2 are the classification hyperplane normal vectors of D_1 and D_2, respectively. According to Definition 2, the sensitivity of the function f(x, y) is max ‖w_1 − w_2‖_1. Since D_1 and D_2 differ only in the n-th record, we have

‖w_1 − w_2‖_1 = ‖α_n x_n y_n − α_n′ x_n′ y_n′‖_1.

According to ‖A + B‖_1 ≤ ‖A‖_1 + ‖B‖_1, we have

‖w_1 − w_2‖_1 ≤ ‖α_n x_n y_n‖_1 + ‖α_n′ x_n′ y_n′‖_1.

Security and Communication Networks
For normalized ‖x_i‖_2 ≤ 1 (so that ‖x_i‖_1 ≤ √k) and 0 ≤ α_i ≤ C, we have

‖w_1 − w_2‖_1 ≤ C‖x_n‖_1 + C‖x_n′‖_1 ≤ 2C√k. □

Theorem 2. Algorithm PCA-DPSVM satisfies (ε, 0)-differential privacy.

Proof. For w′ derived from algorithm PCA-DPSVM on D_1 and D_2, we obtain w′ = w_1 + N_1 and w′ = w_2 + N_2, where N_1 and N_2 are the corresponding noise vectors.
p(N_1) and p(N_2) are the density functions of the output at the neighbouring data sets D_1 and D_2, respectively. Since every coordinate of the noise vector is sampled from lap(Δf/ε), by Lemma 2 we have

p(N_1)/p(N_2) = Π_j exp( (ε/Δf)( |w′_j − (w_2)_j| − |w′_j − (w_1)_j| ) ) ≤ exp( (ε/Δf) ‖w_1 − w_2‖_1 ) ≤ exp(ε),

and we can obtain Pr[M(D_1) = w′] ≤ e^ε · Pr[M(D_2) = w′]. Therefore, algorithm PCA-DPSVM satisfies (ε, 0)-differential privacy. □

3.3. Utility Analysis. In Section 3.2, we proved that algorithms DPPCA-SVM and PCA-DPSVM both satisfy (ε, 0)-differential privacy; next we evaluate the performance of the two algorithms. It is obvious that adding noise protects data privacy, but at the same time it has a negative effect on data utility, and the noise magnitude directly determines the size of this effect. So we evaluate the performance of the two algorithms in terms of noise magnitude.
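To make the comparison concrete, note that the expected magnitude of Laplace noise equals its scale, E|lap(b)| = b. A sketch of the ratio θ of expected per-coordinate noise magnitudes, assuming the scales implied by Lemmas 1 and 2 (sensitivities 2d/n and 2C√k, each divided by ε):

```latex
b_{\mathrm{DPPCA}} = \frac{2d}{n\varepsilon}, \qquad
b_{\mathrm{PCADP}} = \frac{2C\sqrt{k}}{\varepsilon}, \qquad
\theta \;=\; \frac{\mathbb{E}\,|\mathrm{lap}(b_{\mathrm{DPPCA}})|}
                  {\mathbb{E}\,|\mathrm{lap}(b_{\mathrm{PCADP}})|}
       \;=\; \frac{2d/(n\varepsilon)}{2C\sqrt{k}/\varepsilon}
       \;=\; \frac{d}{nC\sqrt{k}}.
```

Under this reading, θ < 1 whenever n > d/(C√k), which holds for any realistic sample size.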
Let θ denote the ratio of the expected noise magnitude added by DPPCA-SVM to that added by PCA-DPSVM. We observe that θ < 1, which means that, for a given privacy parameter ε, algorithm DPPCA-SVM adds less noise than PCA-DPSVM. □

3.4. Algorithm Comparison. In Section 3.3, we proved that DPPCA-SVM, based on input perturbation, has better data availability than PCA-DPSVM, based on output perturbation. In this section, two other differentially private principal component analysis algorithms based on input perturbation are introduced, and we compare DPPCA-SVM with them theoretically.

The algorithm AG was proposed by Dwork et al. It provides (ε, δ)-DP privacy protection by adding a symmetric Gaussian noise matrix in the PCA algorithm. The goal of the algorithm is to output a subspace that protects privacy while preserving the variance of the data matrix as much as possible. The upper-triangular elements of the noise matrix are independent and identically distributed values drawn from N(0, Γ²), where Γ = Δf √(2 ln(1.25/δ)) / ε.

The algorithm MOD-SuLQ proposed by Chaudhuri et al. is an improvement of SuLQ, the first algorithm to provide differentially private principal component analysis, proposed by Blum et al. in 2005. Blum et al. add Gaussian noise to the covariance matrix A to ensure differential privacy and publish the first k eigenvectors of the perturbed covariance matrix. But this method has a fatal problem: the perturbed covariance A + N is an asymmetric matrix, so its largest eigenvalue may not be real and the corresponding eigenvector may be complex. Therefore, instead of adding an asymmetric Gaussian matrix, Chaudhuri et al. add a symmetric matrix with i.i.d. Gaussian entries. As a result, the perturbed covariance matrix is symmetric but not necessarily positive semidefinite, so some eigenvalues may be negative; however, the eigenvectors are all real, which does not affect subsequent calculations.
Obviously, both of the above algorithms, like DPPCA-SVM, implement privacy protection via input perturbation. However, the mechanism and scale of the noise they add are quite different. Therefore, it is necessary to compare these three algorithms theoretically and experimentally. For input perturbation methods, the magnitude of the noise is usually used to reflect the availability of the data, so the noise scales and privacy levels of these algorithms are presented in Table 2. We find that the noise scale added by AG is the smallest among the three algorithms, followed by DPPCA-SVM, and that added by MOD-SuLQ is the largest. In addition, both AG and MOD-SuLQ only meet (ε, δ)-DP, while DPPCA-SVM satisfies (ε, 0)-DP. Therefore, we can infer that DPPCA-SVM achieves stricter privacy protection while adding noise of a suitable scale. Furthermore, in order to compare the utility of these algorithms intuitively in experiments, we apply a support vector machine on top of AG and MOD-SuLQ, resulting in AG-SVM and MOD-SuLQ-SVM.

Experimental Results and Analysis
In this section, we first give experimental results verifying that algorithm DPPCA-SVM outperforms PCA-DPSVM in data utility. Next, in order to verify the effectiveness of the algorithm DPPCA-SVM, we compare it with the existing algorithms SVM, AG-SVM, and MOD-SuLQ-SVM in terms of classification accuracy.
Three UCI data sets are used in our experiments: Sensorless [18], Covtype [19], and Musk [20]. The data set information is shown in Table 3; the target dimension k is the value at which the principal component contribution rate α exceeds 85%.

Experiments for Classification Accuracy of Algorithms DPPCA-SVM and PCA-DPSVM.
We compare algorithms DPPCA-SVM, PCA-DPSVM, and SVM in terms of classification accuracy; higher classification accuracy indicates higher data utility. We set the privacy budget ε ∈ [0.005, 1.5]. Figure 1 shows that, as ε increases, the classification accuracy of the three algorithms on all data sets continuously improves, with the ordering SVM > DPPCA-SVM > PCA-DPSVM.
Algorithm SVM does not involve privacy protection, so its classification accuracy is not affected by the privacy budget. Algorithm DPPCA-SVM adds noise to protect the private information in the data, making its classification accuracy slightly lower than that of SVM. However, as the privacy budget increases, the added noise decreases; accordingly, the classification accuracy of DPPCA-SVM gradually increases and approaches that of SVM. Figure 1 also visually compares algorithms DPPCA-SVM and PCA-DPSVM. When the privacy budget is large, the accuracy of PCA-DPSVM is also very high and close to that of DPPCA-SVM; but when the privacy budget is small, the privacy protection is strong, and the classification accuracy of PCA-DPSVM is unsatisfactory and worse than that of DPPCA-SVM.
According to the above experiment, the following conclusions can be drawn. On the one hand, when noise is added to meet the requirements of differential privacy, the availability of the results is inevitably affected. Although the impact decreases as the privacy budget increases, there is still a slight gap in data availability between the noise-added algorithms and the original algorithm. On the other hand, we found that perturbing the covariance matrix in principal component analysis (DPPCA-SVM) is better than perturbing the normal vector of the classification hyperplane in the support vector machine (PCA-DPSVM). In other words, to achieve the same level of differential privacy protection, input perturbation is more effective than output perturbation and yields higher data availability.

Experiments for Classification Accuracy of Algorithm
DPPCA-SVM and Other Algorithms. From Section 4.1, we know that algorithm DPPCA-SVM outperforms PCA-DPSVM in data utility. That is to say, from the perspective of classification accuracy and privacy protection, input perturbation, which protects the low-dimensional data, is better than output perturbation acting on the classification hyperplane. Furthermore, in order to verify the effectiveness of algorithm DPPCA-SVM, we compare it with MOD-SuLQ-SVM [7] and AG-SVM [9] in terms of classification accuracy. In our experiment, we take the privacy budget ε ∈ [0.000005, 1]. The other privacy parameter δ is set to 1/n². The classification accuracy curves of the four classification methods under different privacy budgets are shown in Figure 2. The higher the classification accuracy, the better the usability of the algorithm.
In Figure 2, we observe that the classification accuracy ordering is generally SVM > AG-SVM > DPPCA-SVM > MOD-SuLQ-SVM. Since no noise is added, SVM unsurprisingly has the highest classification accuracy. The other algorithms achieve differential privacy by adding different noise, making their classification accuracy slightly lower than that of SVM. Among them, AG-SVM and MOD-SuLQ-SVM both provide (ε, δ)-differential privacy, while DPPCA-SVM provides (ε, 0)-differential privacy. It is well known that (ε, 0)-differential privacy usually provides a stronger privacy guarantee and weaker data utility than (ε, δ)-differential privacy. However, the accuracy of our DPPCA-SVM is only slightly lower than that of AG-SVM and generally higher than that of MOD-SuLQ-SVM. Moreover, when the privacy budget is large enough (ε > 0.5), the classification accuracy of DPPCA-SVM is higher than that of AG-SVM. Therefore, compared with other algorithms, DPPCA-SVM not only provides superior privacy protection but also retains relatively high data availability.
In the next experiment, we evaluate the classification accuracy of algorithms SVM, AG-SVM, DPPCA-SVM, and MOD-SuLQ-SVM when the privacy budget is fixed and the number of samples changes. The privacy parameters ε and δ are set to 0.005 and 1/n², respectively. The experimental results are shown in Figure 3. Above all, as the sample size increases, the classification accuracy is generally on the rise, though it sometimes decreases slightly due to unrepresentative samples. Figure 3 also intuitively shows the comparison among the algorithms. For the same privacy budget, DPPCA-SVM achieves high classification accuracy when the number of samples is large, close to that of AG-SVM. However, when the sample size is relatively small, due to its stronger privacy protection, its classification performance is slightly worse than that of AG-SVM. DPPCA-SVM is superior to MOD-SuLQ-SVM in terms of both privacy protection and classification accuracy. In summary, we compared our proposed DPPCA-SVM with AG-SVM and MOD-SuLQ-SVM.
These algorithms all realize differential privacy through input perturbation. However, the noise they add is quite different, which leads to differences in the usability of the results. As a result, in order to measure the performance of these algorithms, we designed the above two experiments, in which the privacy budget and the sample size are treated as independent variables; that is, we fix one of these two parameters and adjust the other to observe changes in the data availability of the algorithms. The results show that when the number of samples is fixed and the privacy budget is small, there is still a certain gap between the classification accuracies of DPPCA-SVM and AG-SVM; if the privacy budget is large, their classification accuracies are close to each other. DPPCA-SVM is always better than MOD-SuLQ-SVM overall. In addition, when the privacy budget is fixed, the classification accuracies of these algorithms rise as the sample size increases, and the accuracy of DPPCA-SVM remains between those of AG-SVM and MOD-SuLQ-SVM. What is more, when the number of samples is large enough, the classification accuracy of DPPCA-SVM approaches or even surpasses that of AG-SVM. Moreover, DPPCA-SVM provides (ε, 0)-differential privacy, while both AG-SVM and MOD-SuLQ-SVM only satisfy (ε, δ)-differential privacy. Therefore, DPPCA-SVM achieves higher data availability while providing a higher level of privacy protection.

Experiments for Running Time of Algorithm DPPCA-SVM
and Other Algorithms. In the above experiments, we assessed the data availability of these algorithms by comparing their classification accuracies. There is no doubt that classification accuracy is a valuable measure of algorithm performance; the SVM algorithm always maintains extremely high classification accuracy because it performs no dimensionality reduction or noise addition. But, in practical applications, the execution efficiency of the algorithm also needs to be considered. Therefore, we compared the running times of SVM, DPPCA-SVM, AG-SVM, and MOD-SuLQ-SVM to highlight the necessity of our proposed DPPCA-SVM. As shown in Figure 4, on the three data sets, SVM's running time is much longer than those of the other three algorithms, especially when the data set is large. Therefore, using principal component analysis before classification can significantly reduce computational cost.
All in all, considering the three aspects of privacy protection level, data availability, and execution efficiency, DPPCA-SVM performs relatively well, which means that it has broad scope for practical application.

Conclusions
Nowadays, privacy protection algorithms are widely applied in the field of data processing and have achieved remarkable results. However, few researchers consider differential privacy, principal component analysis, and support vector machines at the same time and combine them. In this paper, for fast classification and data privacy protection, we propose two algorithms that satisfy (ε, 0)-DP, namely, DPPCA-SVM based on input perturbation and PCA-DPSVM based on output perturbation.
In this paper, we designed three experiments to measure the performance of our proposed DPPCA-SVM and PCA-DPSVM. In the first experiment, we mainly demonstrated that DPPCA-SVM, based on input perturbation, outperforms PCA-DPSVM, based on output perturbation, when providing the same privacy protection. In the second experiment, DPPCA-SVM was compared with two other algorithms that are both based on input perturbation, namely, AG-SVM and MOD-SuLQ-SVM. For the data sets Sensorless, Covtype, and Musk, when the privacy budget ε > 0.005, the classification accuracy of DPPCA-SVM is always greater than 0.90, which is slightly lower than that of AG-SVM but higher than that of MOD-SuLQ-SVM. However, DPPCA-SVM provides (ε, 0)-DP privacy protection, while AG-SVM only provides (ε, δ)-DP privacy protection. In the last experiment, we verified that DPPCA-SVM can effectively reduce the data dimension and greatly shorten the processing time of the data. Therefore, considering the strength of privacy protection, the classification accuracy, and the speed of data processing, there is no doubt that DPPCA-SVM has broad prospects in the field of data processing and machine learning. For example, when user data has complex attributes and an urgent need for privacy protection, the idea behind DPPCA-SVM has high reference value. In addition, the DPPCA-SVM algorithm also has some limitations. For instance, in this model, the classification accuracy of the SVM algorithm is used to measure the availability of data, which may not be an optimal method. In the future, we can design a scoring function to evaluate the performance of the algorithm in multiple aspects.
Data Availability

The data set Sensorless used to support the findings of this study is available at http://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis/. The data set Covtype is available at https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/. The data set Musk is available at http://archive.ics.uci.edu/ml/machine-learning-databases/musk/.