SSGD: A safe and efficient method of gradient descent

With the vigorous development of artificial intelligence technology, various engineering applications have been deployed one after another. The gradient descent method plays an important role in solving various optimization problems, owing to its simple structure, good stability, and easy implementation. In multi-node machine learning systems, gradients usually need to be shared. Data reconstruction attacks can reconstruct training data simply from the gradient information. In this paper, to prevent gradient leakage while preserving model accuracy, we propose the super stochastic gradient descent (SSGD) approach, which updates parameters by concealing the modulus length of each gradient vector and converting it into a unit vector. Furthermore, we analyze the security of the super stochastic gradient descent approach. Experimental results show that our approach is superior to prevalent gradient descent approaches in terms of accuracy and robustness.


I. INTRODUCTION
Gradient descent (GD) is a technique for minimizing an objective function, parameterized by the parameters of a model, by updating the parameters in the direction opposite to the gradient of the objective function with respect to those parameters [1]. It has been widely applied to various optimization problems because of its simplicity and impressive generalization ability [2]. However, it is inherently prone to leaking private information. Mathematically, the gradient is the derivative of the loss function with respect to the parameters, and it is computed explicitly from the given training data and its true label. Therefore, an attacker may extract sensitive information about the original training data from captured gradients. Recent research has shown that an attacker who captures the gradient of a training sample can infer, with high accuracy, its properties [3], its label [4], its class representation [5], [6], or the data input itself [4], [7]-[9]. In practical deep learning systems, the gradient of multiple samples is widely used to improve efficiency and performance; it can be viewed as the per-coordinate average of the single-sample gradients. Is the multi-sample gradient safer for the privacy of training data? Unfortunately, Pan et al. [9] gave a theoretical analysis indicating that the multi-sample gradient still leaks samples and labels under certain circumstances.
Since the work of Zhu et al. [7], a branch of research [4], [7]-[9] has explored a brute-force but general method for successful data reconstruction attacks, and some meaningful empirical results have been reported on CIFAR-10 and ImageNet. These works are based on the same learning-based framework. First, a batch of unknown training samples is treated as variables, and then the optimal training samples are searched for by minimizing the distance between the ground-truth gradient and the gradient calculated from the variables. The main difference between them is the choice of distance function: the L2 distance is used in [4], [7], and the cosine distance is used in [8]. Although Zhao et al. [4] used the properties of neural networks to recover the label of a single sample prior to the learning-based attack, this technique is only suitable for the single-sample gradient. In the multi-sample case, it is the same as [7]. Pan et al. [9] gave a theoretical explanation of information leakage for a single sample in a fully connected neural network with the ReLU activation function. Furthermore, they showed that samples and labels are leaked for multiple samples under certain circumstances by utilizing the internal information between neurons, and extended the model to ResNet-18 [10], VGG-11 [11], DenseNet-121 [12], AlexNet [13], ShuffleNet V2 x0.5 [14], InceptionV3 [15], GoogLeNet [16], and MobileNet-V2 [17].
To solve the problem of gradient security, Bonawitz et al. [18] designed a secure aggregation protocol, Phong et al. [19] encrypted the gradient before sending it, and Abadi et al. [20] used differential privacy to protect the gradient. However, these three methods have their limitations. Secure aggregation [18] requires integer-valued gradients, so it is not compatible with most CNNs. Homomorphic encryption [19] applies only to parameter-server settings, and differential privacy [20] protects the gradient at the cost of reduced algorithm performance. Therefore, this paper proposes a new gradient descent method, super stochastic gradient descent (SSGD), for achieving neuron-level security while maintaining the accuracy of the model. Moreover, SSGD has stronger robustness.
Phong et al. [19] analyzed the leakage of the input data in a single-layer perceptron with a single sample and a single neuron using the sigmoid activation function. Pan et al. [9] analyzed the leakage of sample data from the gradients of a multi-layer fully connected neural network using the ReLU activation function, and indicated that multiple samples also reveal private information when there are two neurons in the last layer that are activated only by the same single sample. Essentially, the leakage is caused by attacking the single-sample gradient. SSGD converts the neuron gradient into a unit vector, which gives the aggregated neuron gradients super-randomness, so the attacker cannot know the true gradient. Theoretically, SSGD invalidates these attacks, including the attack that searches for the optimal training sample [4], [7], [8] by minimizing the distance between the ground-truth gradient and the gradient calculated from the variables, and the attack that solves the equation system [9] to obtain the training data.
Our contributions are summarized as follows.
• We propose a gradient descent algorithm, called super stochastic gradient descent. The main idea is to update the parameters by using the unit gradient vector. In neural networks, neuron parameters are updated by using the unit gradient vector of each neuron.
• We show theoretically that SSGD can realize neuron-level security and defend against data reconstruction attacks.
• Experimental results show that our approach has better accuracy and robustness than prevalent gradient descent approaches.
The rest of this paper is organized as follows. In Section II, we review the basic gradient descent methods and the data leakage caused by gradients. In Section III, we describe the super stochastic gradient descent and analyze the safety of our approach. The experimental results are shown in Section IV. Finally, we conclude this paper and outline future work in Section V. Some safety experiments are shown in the appendix.

II. PRELIMINARIES
In this section, we review some basic gradient descent algorithms [1], including batch gradient descent (BGD), stochastic gradient descent (SGD) and mini-batch gradient descent (MBGD). The difference among them is how much data is used to calculate the gradient of the objective function. Then we describe the information leakage caused by gradients [19].

A. Basic gradient descent algorithms
The BGD is the ordinary form of gradient descent. It takes the entire training set into account to calculate the gradient of the cost function $J(\theta)$ with respect to the parameters $\theta$ and then updates the parameters by

$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta) \tag{1}$$

where $\eta$ is the learning rate. The BGD uses the entire training set in each iteration. Therefore, each update proceeds in the right direction, and with a suitable learning rate BGD is guaranteed to converge to an extreme point (the global minimum for convex objectives, a local minimum otherwise) [1].
In contrast, the SGD considers a training sample $x^{(i)}$ and label $y^{(i)}$ randomly selected from the training set in each iteration and updates the parameters by

$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)}) \tag{2}$$

The BGD and SGD are two extremes: one uses all training samples and the other uses one sample for each gradient step. Naturally, their advantages and disadvantages are pronounced. In terms of training speed, each SGD iteration is very fast, whereas BGD becomes unsatisfactory when the training set is large. In terms of accuracy, the SGD determines the direction of the gradient from only one sample, so the resulting solution may not be optimal. In terms of convergence rate, because the SGD considers one sample in each iteration and the gradient direction changes greatly, it cannot quickly converge to the local optimal solution.
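To make the contrast concrete, the following is a minimal sketch of the update rules described above, not the implementation used in this paper. It assumes a toy one-dimensional linear model with a squared-error loss; the names `grad_squared_error`, `gd_step`, and `train` are hypothetical. The only difference between the variants is how many samples are drawn per step: the whole set gives BGD, one sample gives SGD, and anything in between gives a mini-batch update.

```python
import random


def grad_squared_error(theta, batch):
    """Average gradient of 0.5 * (w*x + b - y)^2 over a batch of (x, y) pairs."""
    w, b = theta
    gw = sum((w * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum((w * x + b - y) for x, y in batch) / len(batch)
    return gw, gb


def gd_step(theta, batch, lr):
    """One plain gradient step: theta <- theta - lr * gradient."""
    gw, gb = grad_squared_error(theta, batch)
    return theta[0] - lr * gw, theta[1] - lr * gb


def train(data, lr, steps, batch_size, seed=0):
    """batch_size == len(data) is BGD, batch_size == 1 is SGD."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    for _ in range(steps):
        if batch_size >= len(data):
            batch = data  # use every training sample
        else:
            batch = rng.sample(data, batch_size)  # random subset
        theta = gd_step(theta, batch, lr)
    return theta


# Toy data (an assumption for illustration): y = 2x + 1 without noise.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
theta_bgd = train(data, lr=0.5, steps=500, batch_size=len(data))
theta_sgd = train(data, lr=0.1, steps=5000, batch_size=1)
```

On this noise-free toy problem both runs approach w = 2 and b = 1; on real data the single-sample updates fluctuate much more, which is the trade-off discussed above.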
The MBGD is a compromise between BGD and SGD, which performs an update with a randomly sampled mini-batch of n training samples by

$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)}) \tag{3}$$

MBGD reduces the variance of the parameter updates, so it converges more stably. Moreover, computing the gradient of a mini-batch is very efficient thanks to the highly optimized matrix operations available in advanced deep learning libraries.

B. Analysis of gradient information leakage
Phong et al. [19] illustrated how gradients leak data information based on the single neuron shown in Fig. 1. Assume that the d-dimensional vector $x \in \mathbb{R}^d$ represents the data input with a label value $y \in \mathbb{R}$. $w \in \mathbb{R}^d$ is the weight parameter and $b \in \mathbb{R}$ is the bias, represented uniformly by $\theta = (w, b) \in \mathbb{R}^{d+1}$. $g \in \mathbb{R}^{d+1}$ is the gradient vector of the parameter $\theta$. f is an activation function and the loss function

Therefore, we obtain $\sigma_k = \sigma \cdot x_k$. By solving the system of equations, we can easily get x and y. Also, we know that g is determined by (x, y), so g and (x, y) are bijective. When w and b are known, leaking the gradient g is equivalent to leaking (x, y). Similarly, protecting g is equivalent to protecting (x, y). According to the single-sample analysis in [9] of multi-layer neural networks with the ReLU activation function, the data leakage problem exists there as well. Although the leakage in a multi-layer neural network is not as simple and intuitive, we can still recover x and y by analyzing the internal relationships of the neural network, and we find that (x, y) and g are still bijective.
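The relation $\sigma_k = \sigma \cdot x_k$ can be checked numerically. The sketch below is an illustration under stated assumptions, not the construction of [19]: it assumes a sigmoid activation and a squared-error loss $(f(w \cdot x + b) - y)^2$, and the function names are hypothetical. Dividing each weight-gradient component by the bias-gradient component returns the input, and the label then follows from the known parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_gradient(w, b, x, y):
    """Gradient of (sigmoid(w.x + b) - y)^2 with respect to (w, b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(z)
    sigma = 2.0 * (a - y) * a * (1.0 - a)  # derivative with respect to b
    grad_w = [sigma * xi for xi in x]      # sigma_k = sigma * x_k
    return grad_w, sigma


def reconstruct(w, b, grad_w, grad_b):
    """Recover (x, y) from the gradient; requires grad_b != 0."""
    x = [gk / grad_b for gk in grad_w]
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(z)
    # Invert sigma = 2 * (a - y) * a * (1 - a) for y (squared loss assumed).
    y = a - grad_b / (2.0 * a * (1.0 - a))
    return x, y


w, b = [0.1, -0.3, 0.2], 0.05
secret_x, secret_y = [0.5, -1.2, 3.0], 1.0
grad_w, grad_b = neuron_gradient(w, b, secret_x, secret_y)
recovered_x, recovered_y = reconstruct(w, b, grad_w, grad_b)
```

The recovery of x uses only the ratio of gradient components, so it does not depend on the particular loss; only the recovery of y relies on the assumed squared-error form.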

III. SUPER STOCHASTIC GRADIENT DESCENT
In this section, we propose our super stochastic gradient descent approach for preventing gradient leakage while keeping the accuracy, and then analyze in detail the safety of our approach.

A. Our approach
It has been confirmed that the gradient leaks private information [7], [19]. To solve the security problem caused by exchanging gradients in stochastic gradient descent or mini-batch gradient descent, we propose the super stochastic gradient descent approach, which protects the gradient information without losing accuracy by hiding part of the gradient information. The gradient is the vector of first-order partial derivatives of the objective function, so it has both magnitude and direction. We compute the gradient of the objective function to find the direction of fastest descent, and this direction has little to do with the modulus length of the gradient vector. We therefore hide the modulus length of the gradient vector and convert the gradient vector into a unit vector.
The super-randomness, caused by the aggregation of multiple unit gradient vectors, may lead to poor results. To guarantee that this kind of randomness is friendly, we utilize the following approaches to reduce the uncertainty caused by super-randomness.
For a single training sample $x^{(i)}$ with label $y^{(i)}$, we use the unit gradient vector to update the parameter $\theta$:

$$\theta = \theta - \eta \cdot \frac{\nabla_\theta J(\theta; x^{(i)}; y^{(i)})}{\|\nabla_\theta J(\theta; x^{(i)}; y^{(i)})\|} \tag{6}$$

For multiple samples, the parameter is updated to

$$\theta = \theta - \eta \cdot \frac{1}{m} \sum \frac{\nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})}{\|\nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})\|} \tag{7}$$

where $x^{(i:i+n)}$ represents n samples, $y^{(i:i+n)}$ denotes their labels, and the sum runs over m groups of n samples. The gradient $\nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$ of n samples is considered a basic gradient, and m is the number of basic gradients. Averaging the unit gradient vectors of the m basic gradients further enhances the stability of the algorithm. The algorithm achieves higher performance while having strong randomness. It is secure to share this unit basic gradient in a distributed environment.

The neuron is the smallest information carrier in the neural network structure. In a neural network, we therefore convert the parameter gradient vector of each neuron into a unit vector, so the parameters of a single layer are updated to

$$\theta_k = \theta_k - \eta \cdot \frac{1}{m} \sum \frac{\nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})_k}{\|\nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})_k\|}, \quad k = 1, \dots, d \tag{8}$$

where $\theta_k$ represents the k-th column or k-th row of the parameter matrix in the fully connected layer or convolutional layer (the convolution kernel is regarded as a neuron). In the fully connected layer, $\nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})_k$ is the k-th column of the gradient matrix, and in the convolutional layer it is the k-th row of the gradient matrix of the convolution kernel. d is the number of rows or columns of the gradient matrix. Therefore, each row or column of the gradient matrix is a unit vector. We then obtain an average gradient matrix from m such gradient matrices and use it to update the parameters.
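The update described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: gradients are plain Python lists, the basic gradients are supplied by the caller, and the helper names `unit`, `ssgd_direction`, `ssgd_step`, and `normalize_columns` are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)


def ssgd_direction(basic_gradients):
    """Average of the unit vectors of the m basic gradients."""
    m = len(basic_gradients)
    units = [unit(g) for g in basic_gradients]
    return [sum(u[j] for u in units) / m for j in range(len(units[0]))]


def ssgd_step(theta, basic_gradients, lr):
    """theta <- theta - lr * (average unit gradient)."""
    direction = ssgd_direction(basic_gradients)
    return [t - lr * d for t, d in zip(theta, direction)]


def normalize_columns(grad_matrix):
    """Neuron-level variant: rescale every column of a gradient matrix to unit length."""
    rows, cols = len(grad_matrix), len(grad_matrix[0])
    norms = [math.sqrt(sum(grad_matrix[r][c] ** 2 for r in range(rows)))
             for c in range(cols)]
    return [[grad_matrix[r][c] / norms[c] if norms[c] > 0 else 0.0
             for c in range(cols)] for r in range(rows)]


# Two basic gradients with very different magnitudes contribute equally.
new_theta = ssgd_step([1.0, 1.0], [[3.0, 4.0], [0.0, -2.0]], lr=0.1)
unit_columns = normalize_columns([[3.0, 0.0], [4.0, 5.0]])
```

Because every basic gradient is rescaled before averaging, the length of the averaged direction never exceeds 1, so the step size is bounded by the learning rate regardless of the raw gradient magnitudes.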

B. The safety of SSGD
By analyzing the multi-layer neural network with the ReLU activation function on a training sample, the following relationship is obtained in [9]:

where X is the input data, $g_c$ represents the c-th dimension of the loss vector g, $D_i$ is the activation pattern of the i-th layer of the neural network, and $G_i$ and $W_i$ denote the gradients and parameters of the i-th layer of the neural network, respectively. The data reconstruction attack amounts to solving the above equations. Zhu et al. [7] and the related works [4], [8] used brute-force optimization to approximate the training sample x. Pan et al. [9] inferred $g_c$ and $D_i$, and then solved the equations to get x and y. In theory, this scheme is more accurate than previous work. The left side of the equation is the i-th layer gradient matrix:

The gradient matrix of our SSGD is:

Each column is a unit vector, and $\mu^i_k$ is the modulus length of the k-th column vector of the i-th layer gradient matrix, i.e., the modulus length of the gradient of the k-th neuron in the i-th layer of the neural network. Essentially, the gradient matrix of a layer is multiplied on the right by a diagonal matrix $U_i$ whose diagonal entries are the reciprocals of the modulus lengths of the neuron gradient vectors. With our SSGD, (9) is represented as

where $U_i$ is not uniquely determined when the loss function is neither convex nor concave. According to [21], the loss functions of multi-layer neural networks are neither convex nor concave. Because $U_i$ changes dynamically, the equations cannot be solved even if $g_c$ and $D_i$ are known. Our method hides the correlation between the gradient and the sample, eliminates the information shared between neurons, and achieves neuron-level security. SSGD invalidates the data reconstruction attack for single-sample gradient updates. It has no effect on a single-layer perceptron. In the case of multiple samples, (4) and (5) become

and $x^j_k$ cannot be obtained by solving these equations.

How safe is the SSGD method? In a distributed multi-layer neural network, the data will not be leaked by exchanging unit gradient vectors. For multiple samples, our SSGD can prevent the collusion attack: assuming that n − 1 of n neuron gradients are known, the remaining neuron gradient cannot be inferred. That is, given the unit vector $\bar{G}$ and the partial sum $G^{(n-1)} = \sum_{i=1}^{n-1} G_i$, with the relations $G = G^{(n-1)} + G_n$ and $\bar{G} = G / \|G\|$, $G_n$ cannot be obtained. Since the samples are independent and the gradients have a bijective relationship with the samples, the gradients are also independent. In two-dimensional space, the problem is described as follows. As shown in Fig. 2, given the unit vector $\overrightarrow{o\beta}$ of a vector $\overrightarrow{o\gamma}$ and the vector $\overrightarrow{o\alpha}$, and assuming that $\overrightarrow{o\gamma} = \overrightarrow{o\alpha} + \overrightarrow{\alpha\gamma}$, we ask whether $\overrightarrow{\alpha\gamma}$ can be obtained. From Fig. 2, any point $\gamma$ on the ray $o\gamma$ may be a solution of the problem, so $\overrightarrow{\alpha\gamma}$ cannot be determined.

Since training a model requires many rounds of iterations, is it safe to use multiple rounds of iterations? We previously showed that the gradient g and the training data (x, y) are bijective for given parameters $\theta$, so $g = f(\theta \mid (x, y))$, where f is a functional relationship. We use $\theta_i$ and $\theta_{i+1}$ to denote the training parameters of the i-th and (i+1)-th rounds, respectively. Then we have $\theta_{i+1} = \theta_i - \eta \cdot g_i$. The i-th gradient is $g_i = f(\theta_i \mid (x, y))$ and the (i+1)-th gradient is $g_{i+1} = f(\theta_{i+1} \mid (x, y))$, so $\theta_{i+1} = \theta_i - \eta \cdot f(\theta_i \mid (x, y))$. Furthermore, we obtain $g_{i+1} = f(\theta_i - \eta \cdot f(\theta_i \mid (x, y)) \mid (x, y))$. Comparing $g_{i+1}$ with $g_i$, we see that there is no additional information in $g_{i+1}$. So the iteration operation does not cause information leakage.
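The geometric argument behind the collusion case can be illustrated numerically. The sketch below is only an illustration of that ambiguity under assumed toy values, not a security proof; the names `candidate_missing_gradient`, `known_sum`, and `shared_unit` are hypothetical. For any assumed length of the hidden total gradient, a different missing gradient is consistent with the same shared unit vector.

```python
import math


def unit(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def candidate_missing_gradient(shared_unit, known_sum, scale):
    """Missing gradient implied by assuming the hidden total has length `scale`."""
    return [scale * u - k for u, k in zip(shared_unit, known_sum)]


known_sum = [1.0, 0.5]           # sum of the n-1 gradients known to the colluders
shared_unit = unit([3.0, 4.0])   # the only quantity that is exchanged

candidates = [candidate_missing_gradient(shared_unit, known_sum, s)
              for s in (2.0, 5.0, 10.0)]

# Every candidate reproduces exactly the same shared unit vector.
reproduced = [unit([k + c for k, c in zip(known_sum, cand)])
              for cand in candidates]
```

All three candidates are distinct, yet each yields the same unit vector once added to the known sum, which mirrors the statement that any point on the ray in Fig. 2 is a valid solution.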

IV. EXPERIMENTS
Since LeNet-5 [22] is a classic convolutional neural network model, we use it to test the performance of our approach by substituting SSGD for the basic gradient descent methods. LeNet-5 contains two convolutional layers, two pooling layers, and three fully connected layers. The activation function is ReLU. The input dimension is 784 and the output dimension is 10. The MNIST data set is used for testing the accuracy and robustness of our SSGD approach. It contains 60000 training images and 10000 test images; every image is a 28×28 grayscale image, and each pixel is stored as one byte. We report both training and test accuracies. We use the 60000 training images to train the model. The training (test) accuracy is the average of ten experimental results, where each experiment measures the accuracy on 1000 samples randomly selected from the training (test) set.
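The evaluation protocol described above can be sketched as follows. This is a hedged illustration rather than the experimental code: `predict` stands for any trained classifier, sampling without replacement within each repetition is an assumption (the text does not specify it), and the function name `sampled_accuracy` is hypothetical.

```python
import random


def sampled_accuracy(predict, samples, labels,
                     sample_size=1000, repeats=10, seed=0):
    """Mean accuracy over `repeats` random subsets of `sample_size` items."""
    rng = random.Random(seed)
    size = min(sample_size, len(samples))
    accuracies = []
    for _ in range(repeats):
        # Draw indices without replacement (an assumption of this sketch).
        chosen = rng.sample(range(len(samples)), size)
        correct = sum(1 for i in chosen if predict(samples[i]) == labels[i])
        accuracies.append(correct / size)
    return sum(accuracies) / len(accuracies)


# Toy stand-in for a model and a data set, for illustration only.
toy_samples = list(range(5000))
toy_labels = [s % 2 for s in toy_samples]
toy_accuracy = sampled_accuracy(lambda s: s % 2, toy_samples, toy_labels)
```

Averaging over several random subsets gives a cheaper estimate than evaluating the full set at every checkpoint, at the cost of some sampling noise.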

A. Accuracy
We compare SSGD with SGD, SGDm [23] and Adam [24], which are widely used gradient descent algorithms. The batch size N is set to 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 and 8192, and the number of iterations is 10,000. Because SGD and SGDm adapt poorly to large batches, they do not use a fixed learning rate. The learning rate of Adam is set to 0.005. The momentum of SGDm is set to 0.999, and $\beta_1$ and $\beta_2$ in Adam are set to 0.9 and 0.999, respectively. For SSGD, the learning rate in this experiment is set to 0.1, which is a relatively coarse choice. n is set to 1, 4, 8, 16, 32, 64 and 128, and m is set to 4, 8, 16, 32 and 64. When m = 1, SSGD uses a single mini-batch gradient per update, as in MBGD. The number of iterations is also 10,000.
The comparative experimental results of SGD, SGDm, Adam and SSGD are shown in Table I and Table II. In Table I, the number in brackets is the learning rate. Table III and Table IV report the results of our SSGD approach. From Table III, we can see that the larger n is, the better the training accuracy is, while m does not have much influence on the training accuracy. The test accuracy, however, does not keep improving as n increases; it is best when n is 32 or 64. Small batches may leave the model insufficiently trained. When n is greater than 8, the value of m has a peculiar impact on the test accuracy: rather than being high in the middle and low at both ends, it is high at both ends and low in the middle. We do not yet know the cause. From Table II, the performance of our algorithm is better than that of SGD and SGDm. When the batch size is larger than 32, our SSGD is equivalent to Adam in test accuracy. However, for very large batches (greater than 4096), the Adam algorithm overfits and its test accuracy is significantly reduced, so our algorithm achieves better test accuracy than Adam.
The convergence speeds of the training and test accuracies are shown in Fig. 3 and Fig. 4, respectively.
The value on the vertical axis is the average accuracy over every 10 iterations. SSGDm is SSGD with momentum. We choose the intermediate value 256 as the batch size in the convergence experiment, where n = 16 and m = 16 for SSGD and SSGDm. The learning rates of SGD and SGDm are 0.0001 and 0.0005, respectively. The momentum of SGDm is set to 0.999. The learning rate of SSGDm is $10/1.0002^t$, where t is the number of iterations, and its momentum is 0.99. The other parameters are consistent with the above experiment. From Fig. 3 and Fig. 4, we can see that our algorithm converges faster and more stably than SGD, SGDm and Adam.
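For reference, the decaying learning rate $10/1.0002^t$ quoted above for SSGDm can be written as a small helper. This is only a sketch of the schedule itself; the function name `ssgdm_learning_rate` and its parameter names are hypothetical, and the momentum update is not shown because its exact form is not given in the text.

```python
def ssgdm_learning_rate(t, base=10.0, decay=1.0002):
    """Learning rate at iteration t: base / decay**t."""
    return base / decay ** t


# Sample the schedule at a few points of a 10,000-iteration run.
schedule = {t: ssgdm_learning_rate(t) for t in (0, 2500, 5000, 10000)}
```

The rate starts at 10 and falls to roughly 1.35 after 10,000 iterations, a slow decay that matches the slowly shrinking length of the averaged unit gradient.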

B. Robustness
To check the robustness of our algorithm, we add random noise to the gradient. We note that differential privacy is also a way to protect gradient information, by adding random noise that follows a certain distribution. To compare the performance of our algorithm with that of models that protect the gradient with differential privacy, we set the privacy budget to $\epsilon = 4$. Unlike the gradient clipping used in [20], we strictly define the sensitivity as the maximum value minus the minimum value in the gradient matrix.
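The noise injection described above can be sketched as follows. This is a minimal illustration, not the experimental code: it assumes the standard Laplace mechanism, in which the noise scale equals the sensitivity divided by the privacy budget, and the function name `add_laplace_noise` is hypothetical. A Laplace sample is drawn as the difference of two independent exponential samples.

```python
import random


def add_laplace_noise(grad_matrix, epsilon=4.0, seed=None):
    """Add Laplace noise with scale (max - min) / epsilon to every entry."""
    rng = random.Random(seed)
    flat = [v for row in grad_matrix for v in row]
    sensitivity = max(flat) - min(flat)  # definition used in this section
    if sensitivity == 0:
        # All entries are equal, so the scale is zero and no noise is added.
        return [list(row) for row in grad_matrix]
    scale = sensitivity / epsilon
    rate = 1.0 / scale
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return [[v + rng.expovariate(rate) - rng.expovariate(rate) for v in row]
            for row in grad_matrix]


gradient = [[0.0, 1.0] * 50 for _ in range(100)]  # toy gradient matrix
noisy = add_laplace_noise(gradient, epsilon=4.0, seed=0)
```

A smaller privacy budget gives a larger noise scale, which is why the later experiments with budgets below 4 are a harder robustness test.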
We use SGDm and Adam as the compared algorithms. We have also tested the SGD algorithm: when $\epsilon = 5$ and the batch size is large, gradient explosion occurs and SGD cannot converge. The SGDm and Adam algorithms have better robustness. We adjust hyperparameters to get more
For each iteration, after the n vectors are added, Laplace noise with $\epsilon = 4$ that strictly satisfies differential privacy is added. The sensitivity is set to the maximum value minus the minimum value of the gradient matrix of the same batch. Then we use the SGDm and Adam algorithms to update their parameters, respectively. The number of iterations is 10,000. By comparing Table III, Table IV, Table V, Table VI, Table VII and Table VIII, we can see that the training and test accuracies of SSGD are clearly better than those of SGDm and Adam with differential privacy.
Now, we use SSGDm to verify the robustness of our algorithm. Our algorithm uses the average of multiple unit gradient vectors as the update. Therefore, the modulus length of the update vector decreases very slowly, and a dynamic learning rate needs to be set. The learning rate of SSGDm is set to $10/1.0002^t$ with momentum 0.99, where t is the number of iterations. We use the same batch settings and the same noise-adding method with $\epsilon = 4$ as for SGDm and Adam. Then we use SSGDm to update the parameters. The number of iterations is 10,000. From Table V to Table X, we can see that SSGDm is more robust than the SGDm and Adam algorithms in both training and test accuracy when the same amount of noise is added. The experimental results also show that all three algorithms follow the familiar rule that the larger the batch size is, the better the accuracy is. Compared with Table IV, although Laplace noise with $\epsilon = 4$ is added in SSGDm, its performance is comparable to that of SSGD without noise.
Where is the limit of the robustness of our algorithm? We increase the amount of noise by setting $\epsilon$ to 0.2, 0.5, 1, 2 and 4. The experimental environment and the parameter settings are the same as in the robustness experiment above. The batch is set to n = 16 and m = 16. From Table XI and Table XII, it is clear that our SSGDm has a marked advantage in robustness, and the advantage grows as the amount of noise increases. Compared with Adam and SGDm, the average test accuracies are increased by 3.87% and 19.25%, respectively.

V. CONCLUSIONS
In this paper, we propose a new gradient descent approach, called super stochastic gradient descent. SSGD enhances the randomness of gradients to protect against gradient-based attacks. At the same time, we use multi-sample aggregation to enhance stability and reduce the uncertainty brought about by super-randomness. Our approach achieves neuron-level security and can defend against data reconstruction attacks. Experimental results demonstrate that SSGD has good accuracy and strong robustness because its stability and randomness are enhanced. In the future, we will extend our idea to other machine learning models.

VI. APPENDIX
The experimental code comes from: https://github.com/PatrickZH/Improved-Deep-Leakage-from-Gradients. The following experimental diagrams show the results of DLG [7] and iDLG [4] attacking the existing gradient descent methods and the SSGD algorithm on the MNIST and CIFAR-100 data sets. From the experimental results, we can see that our algorithm can defend against DLG [7] and iDLG [4].