D-(DP)²SGD: Decentralized Parallel SGD with Differential Privacy in Dynamic Networks

Decentralized machine learning has been playing an essential role in improving training efficiency and has been applied in many real-world scenarios, such as edge computing and IoT. In practice, however, networks are dynamic, and there is a risk of information leakage during the communication process. To address this problem, we propose a decentralized parallel stochastic gradient descent algorithm with differential privacy for dynamic networks, named D-(DP)²SGD. With rigorous analysis, we show that D-(DP)²SGD converges at a rate of O(1/√(nK)) while satisfying ε-DP, which is almost the same convergence rate as previous works without privacy concerns. To the best of our knowledge, ours is the first decentralized parallel SGD algorithm that can be implemented in dynamic networks while taking privacy preservation into consideration.


Introduction
Decentralized machine learning, as a modeling mechanism that allocates training tasks and computing resources to balance training speed and accuracy, has demonstrated strong potential in various areas, especially for training large models on large datasets [1][2][3], such as ImageNet [4]. Typically, assuming there are n workers and each worker holds its own local data, the decentralized machine learning problem aims to solve an empirical risk minimization problem of the form

min_x f(x) = (1/n) ∑_{i=1}^{n} f_i(x),

where f_i(x) is the local loss function at node i. The objective f(x) can thus be rephrased as a linear combination of the local loss functions f_i(x). This formulation covers many popular decentralized learning models, including deep learning [5], linear regression [6], and logistic regression [7].

In recent years, decentralized machine learning has attracted much attention for deriving convergent solutions while reducing communication costs [8,9]. Previous works mainly study decentralized collaborative learning under a static network assumption. For example, decentralized parallel stochastic gradient descent (D-PSGD) is one of the fundamental methods for solving large-scale machine learning tasks in static networks [1]. In D-PSGD, all nodes compute stochastic gradients using their local datasets and exchange the results with their neighbors iteratively. In practice, however, dynamicity is an important feature of networks, especially large-scale networks such as IoT [10] and V2V networks [11,12], since nodes can move around and join or leave the network at any time. On the other hand, in large-scale networks, it is hard or even impossible to ensure that every node is reliable [13,14]. Consequently, the collaborative learning process unavoidably faces the risk of information leakage.
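As a minimal illustration (our own sketch, not code from the paper), the empirical risk minimization objective above is simply the average of per-node losses; the least-squares local loss and the data shapes below are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n workers, each holding a private local dataset
# (A_i, b_i) with a least-squares local loss f_i(x).
n, m, d = 4, 20, 3
local_data = [(rng.normal(size=(m, d)), rng.normal(size=m)) for _ in range(n)]

def local_loss(x, A, b):
    """Local empirical risk f_i(x) at one node."""
    residual = A @ x - b
    return 0.5 * float(np.mean(residual ** 2))

def global_loss(x):
    """Global objective f(x) = (1/n) * sum_i f_i(x)."""
    return float(np.mean([local_loss(x, A, b) for A, b in local_data]))
```

Decentralized algorithms such as D-PSGD minimize this global objective without any node ever sharing its raw local dataset, which is exactly why the communication step becomes the privacy-critical part.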
Hence, when designing decentralized machine learning algorithms, it has become necessary to consider both the impact of the dynamicity of the network topology and the demand for privacy preservation. However, to the best of our knowledge, no existing work takes both factors into consideration simultaneously. In this work, we focus on this missing piece in decentralized learning.
Specifically, based on differential privacy, we present a new dynamic decentralized stochastic gradient descent algorithm (D-(DP)²SGD), which offers strong protection for the local datasets of decentralized nodes. With rigorous analysis, we show that our proposed D-(DP)²SGD algorithm satisfies ε-DP and achieves a convergence rate of O(1/√(nK)) when K is large enough. Empirically, we conduct extensive experiments on the CIFAR-10 dataset for image classification tasks to evaluate the performance of our proposed algorithm.
The remainder of this paper is organized as follows. We survey related work in Section 2. We then introduce our model, problem, and some useful preliminaries in Section 3. Our algorithm, main results, and analysis are presented in Sections 4, 5, and 6, respectively. Experimental results are illustrated in Section 7. Finally, we conclude the paper in Section 8.

Related Work
In this section, we review closely related work.

Decentralized Parallel Stochastic Gradient Algorithms.
Most existing work on decentralized parallel stochastic gradient methods focuses on static networks, in both synchronous and asynchronous settings [1, 15-18]. Under the synchronous setting, Lian et al. [1] illustrated the advantage of decentralized algorithms over centralized ones and showed that the proposed D-PSGD converges at a rate of O(1/√(nK)) when K is large enough, where K is the number of iterations and n is the total number of nodes in the network. Qureshi et al. [17] proposed an algorithm called S-ADDOPT, which converges at a rate of O(1/K).
Feyzmahdavian et al. [19] and Agarwal et al. [20] considered decentralized SGD in the asynchronous setting, allowing workers to use stale weights to compute gradients. Asynchronous algorithms avoid idling any worker, which reduces the communication overhead, and they are robust because they can still work well when some of the computing workers are down. Lian et al. [16] proposed an asynchronous decentralized parallel SGD algorithm for convex optimization and showed that AD-PSGD converges at O(1/√K). Later, Lian et al. [21] proposed an asynchronous decentralized parallel stochastic gradient descent algorithm for nonconvex optimization and showed that its ergodic convergence rate is O(1/√K); they proved that linear speedup is achievable when the number of workers is bounded by √K.

Differentially Private Decentralized Learning.
Most existing work on differentially private decentralized learning focuses on static networks [22-25]. Our work combines decentralized learning and dynamic networks in a DP setting. In contrast, Lu et al. [24] proposed an asynchronous federated learning scheme with differential privacy for resource sharing in vehicular networks. Cheng et al. [26] proposed a new learning algorithm, LEASGD (Leader-Follower Elastic Averaging Stochastic Gradient Descent), driven by a novel leader-follower topology and a differential privacy model, and provided a theoretical analysis of the convergence rate and of the trade-off between performance and privacy in the private setting. Based on the research in [16], Xu et al. [2] designed an asynchronous decentralized parallel stochastic gradient descent algorithm with differential privacy (A(DP)²SGD) and showed that it converges at O(1/√K). Across all of these reviewed papers, the study of decentralized parallel SGD with differential privacy in dynamic networks remains an open problem.

System Model and Problem Description
We consider a network consisting of n computational nodes (each could be a machine or a GPU). At each iteration k, the network topology is denoted by a graph G_k = (V, E_k), where V = {1, 2, ⋯, n} is the set of n computational nodes and E_k ⊆ V × V is the set of communication edges at iteration k. If there exists an edge between node i and node j at iteration k, then (i, j) ∈ E_k. Two nodes are neighbors if they are connected directly by an edge, i.e., they can communicate with each other. The set of neighbors of node i at iteration k is denoted by N_k(i) = {j | (i, j) ∈ E_k}, and we define C_k(i) = N_k(i) ∪ {i}. We assume that the set of nodes remains unchanged, while the connections between nodes can change after every iteration. The network G_k is assumed to be strongly connected, i.e., for all nodes i, j ∈ V, there exists a path from i to j at each iteration k ≥ 0. Some frequently used notations are summarized in Table 1.
In a decentralized network, the data is stored at the nodes, and each node i is associated with a local loss function

f_i(x) = E_{ξ∼D_i} F_i(x; ξ),

where D_i is the distribution of the local data at node i and ξ is a data sample drawn from D_i. In this work, we consider the following optimization problem:

min_x f(x) = E_{i∼I} f_i(x) = (1/n) ∑_{i=1}^{n} f_i(x),

where I is the uniform distribution over the nodes. Similarly, we say an algorithm gives a δ-approximation solution if

(1/K) ∑_{k=0}^{K−1} E‖∇f(x̄_k)‖² ≤ δ,

where x̄_k is the average of the local variables over all nodes at iteration k and K is the maximum number of iterations. We next review the definition of differential privacy, which was originally proposed by Dwork [27].
Definition 1 (see [27]) (Differential Privacy). Given ε ≥ 0, a randomized mechanism M with domain D preserves ε-differential privacy if, for all S ⊆ Range(M) and for any pair of adjacent datasets D and D′ (two datasets D = {x_1, x_2, ⋯, x_n} and D′ = {x′_1, x′_2, ⋯, x′_n} are adjacent if there exists exactly one i ∈ [n] such that x_i ≠ x′_i),

Pr[M(D) ∈ S] ≤ e^ε Pr[M(D′) ∈ S],

where Range(M) is the output range of the mechanism M.
Informally, differential privacy means that the output distributions of the randomized algorithm should be nearly identical for any two adjacent input datasets. The constant ε measures the privacy level of the randomized mechanism M; i.e., a larger ε implies a weaker privacy guarantee. Therefore, an appropriate constant ε should be chosen to balance the accuracy and the privacy level of the mechanism M.
Then, we introduce the definition of sensitivity, which plays a key role in the design of differential privacy mechanisms.
Definition 2 (see [28]) (Sensitivity). The sensitivity of a function f : D ⟶ ℝ^d is defined as

Δ = max_{adjacent D, D′} ‖f(D) − f(D′)‖_1.

The sensitivity captures the magnitude by which a single individual's data can change the output of the mechanism M in the worst case. Next, we introduce the Laplace mechanism.
Definition 3 (see [27]) (The Laplace mechanism). Given any function f : D ⟶ ℝ^d, the Laplace mechanism is defined as

M(D) = f(D) + (η_1, ⋯, η_d),

where the η_i are i.i.d. random variables drawn from Lap(ς). The variance of the Laplace distribution Lap(ς) is 2ς², and choosing ς = Δ/ε guarantees ε-differential privacy.
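The Laplace mechanism of Definition 3 can be sketched in a few lines; the function name and the vector-valued query below are our own illustrative choices, not the paper's code:

```python
import numpy as np

def laplace_mechanism(query_output, sensitivity, epsilon, rng):
    """Release `query_output` perturbed with i.i.d. Laplace noise per
    coordinate, using scale ς = Δ/ε as in Definition 3
    (Δ = `sensitivity`, ε = `epsilon`)."""
    scale = sensitivity / epsilon  # ς = Δ/ε
    noise = rng.laplace(loc=0.0, scale=scale, size=np.shape(query_output))
    return query_output + noise
```

Each coordinate of the added noise has variance 2ς², so a smaller ε (stronger privacy) forces a larger noise scale, which is the accuracy-privacy trade-off discussed above.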
Throughout the paper, we adopt the following commonly used assumptions:

(1) Lipschitzian gradient: all functions f_i(·) have L-Lipschitzian gradients, i.e., ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖ for all x, y.

(2) Unbiased estimation: E_{ξ∼D_i} ∇F_i(x; ξ) = ∇f_i(x) for all i and x.

(3) Bounded variance: E_{ξ∼D_i} ‖∇F_i(x; ξ) − ∇f_i(x)‖² ≤ σ² for all i and x.

(4) Bounded stochastic gradient: ‖∇F_i(x; ξ)‖ ≤ G_i for all i and x.

At each iteration k, the nodes exchange information through a mixing matrix W_k = [w_{k,ij}] ∈ ℝ^{n×n}, where w_{k,ij} describes how much node j can affect node i at iteration k. Each iteration of the algorithm consists of the following steps:

(i) Sampling: each node i samples a training data point ξ_{k,i} from its local distribution D_i.

(ii) Gradient computation: each node computes the local stochastic gradient ∇F_i(x_{k,i}; ξ_{k,i}) using its current local variable x_{k,i}.

(iii) Noise perturbation: each node draws Laplace noise η_{k,i} ∼ Lap(ς) and forms the perturbed variable x̃_{k,i} = x_{k,i} + η_{k,i}.

(iv) Communication: each node sends x̃_{k,i} and its degree d_{k,i} to its neighbors and receives x̃_{k,j} and d_{k,j} from them.

(v) Weight determination: each node determines the entries w_{k,ij} of W_k according to Equation (11).

(vi) Weighted average: each node computes the weighted average of the perturbed variables obtained from its neighbors using the matrix W_k: x_{k+1/2,i} = ∑_{j∈C_k(i)} w_{k,ij} x̃_{k,j}.

(vii) Gradient update: each node updates its local variable using the weighted average and the local stochastic gradient: x_{k+1,i} = x_{k+1/2,i} − γ ∇F_i(x_{k,i}; ξ_{k,i}).

From a global view, we define the concatenation of all local variables, perturbed variables, Laplace noises, random samples, and stochastic gradients as the matrices X_k ∈ ℝ^{d×n}, X̃_k ∈ ℝ^{d×n}, and η_k ∈ ℝ^{d×n}, the vector ξ_k ∈ ℝ^n, and the matrix ∂F(X_k; ξ_k) = [∇F_1(x_{k,1}; ξ_{k,1}), ⋯, ∇F_n(x_{k,n}; ξ_{k,n})], respectively. Then, the kth iteration of Algorithm 1 can be described as the update

X_{k+1} = X̃_k W_k − γ ∂F(X_k; ξ_k), where X̃_k = X_k + η_k.
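The global matrix update can be simulated directly. Everything below (the ring topology, the fixed mixing weights, and the quadratic per-node loss) is an illustrative assumption, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4                       # dimension d, number of nodes n
gamma, varsigma = 0.1, 0.01       # step size γ and Laplace scale ς

X = rng.normal(size=(d, n))       # column i is the local variable x_{k,i}

# Illustrative symmetric doubly stochastic mixing matrix W_k (ring topology).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25

def stoch_grad(X):
    """Stochastic gradient of the toy loss f_i(x) = ||x||^2 / 2 per column."""
    return X + 0.01 * rng.normal(size=X.shape)

eta = rng.laplace(scale=varsigma, size=X.shape)  # Laplace noise η_k
X_tilde = X + eta                                # perturbed variables X̃_k
X_next = X_tilde @ W - gamma * stoch_grad(X)     # X_{k+1} = X̃_k W_k − γ ∂F(X_k; ξ_k)
```

Note that only the perturbed matrix X̃_k is communicated; the gradient is computed locally from the unperturbed X_k, matching steps (ii)-(vii).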

Main Results
In this section, we present our main results, which guarantee the privacy and the convergence rate of our proposed algorithm.
Under the above assumptions, we obtain the convergence rate of Algorithm 1. The theorem characterizes the convergence of the average x̄_k of all local optimization variables x_{k,i}. Taking a closer look, if the total number of iterations K is sufficiently large, Algorithm 1 achieves a convergence rate of O(1/√(nK)), matching the rate of D-PSGD without privacy protection [1].

Result Analysis
In this section, we give the analysis of the privacy preservation and the convergence rate of D-(DP)²SGD.

Privacy Analysis
Proof (Proof of Theorem 4). From the definition of sensitivity, we obtain the sensitivity bound, where x^l_{k,i} and x′^l_{k,i} denote the lth components of x_{k,i} and x′_{k,i}, respectively. Consider an output vector y_{k,i}. From the property of the Laplace distribution, we can bound the ratio of the output densities under the two adjacent datasets, where the first inequality comes from the triangle inequality and the last inequality follows from (19). Therefore, since ε = Δ/ς, we conclude that Algorithm 1 preserves ε-differential privacy.

Algorithm 1: D-(DP)²SGD.
Initialization: initial point x_{0,i} = x_0 = 0, step size γ, noise scale ς, and number of iterations K.
for k = 0, 1, ⋯, K − 1, in parallel for all nodes i ∈ V do
  Sample a training data point ξ_{k,i};
  Compute the stochastic gradient ∇F_i(x_{k,i}; ξ_{k,i}) using the current local variable x_{k,i} and the data ξ_{k,i};
  Randomly generate the Laplace noise η_{k,i} ∼ Lap(ς) and add it to the variable x_{k,i} to get the perturbed variable x̃_{k,i} = x_{k,i} + η_{k,i};
  Send the perturbed variable x̃_{k,i} and its degree d_{k,i} to its neighbors;
  Receive x̃_{k,j} and d_{k,j} from its neighbors, j ∈ N_k(i);
  Determine W_k according to Equation (11);
  Compute the neighborhood weighted average of the perturbed variables: x_{k+1/2,i} = ∑_{j∈C_k(i)} w_{k,ij} x̃_{k,j};
  Update the local variable: x_{k+1,i} = x_{k+1/2,i} − γ ∇F_i(x_{k,i}; ξ_{k,i}).
end for

Lemma 7. The matrix W_k defined by Equation (11) is a symmetric doubly stochastic matrix.

Proof. From the definition in Equation (11), we obtain that (1) w_{k,ij} ∈ [0, 1] for all i, j; (2) w_{k,ij} = w_{k,ji} for all i, j; and (3) ∑_j w_{k,ij} = 1 for all i. Therefore, W_k is a symmetric doubly stochastic matrix.
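Equation (11) is not reproduced in this excerpt. As a hypothetical stand-in that is consistent with the degree exchange in Algorithm 1 and with the properties required by Lemma 7, the sketch below uses the Metropolis-Hastings rule, a common degree-based construction:

```python
import numpy as np

def metropolis_weights(adj):
    """Build a mixing matrix from a 0/1 adjacency matrix using the
    Metropolis-Hastings rule: w_ij = 1 / (1 + max(d_i, d_j)) for each
    edge, w_ij = 0 for non-neighbors, and w_ii = 1 - sum_{j != i} w_ij.
    (Illustrative stand-in for the paper's Equation (11).)"""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W
```

Because each weight depends only on the two endpoint degrees d_{k,i} and d_{k,j}, every node can compute its own row of W_k locally after the degree exchange, and the resulting matrix satisfies properties (1)-(3) of Lemma 7 even as the topology changes between iterations.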

Lemma 8.
Define Φ(k, s) = W_k W_{k−1} ⋯ W_s for k ≥ s, and Φ(s − 1, s) = I, where I is the identity matrix. Assume that there exists ρ ∈ [0, 1) such that, for every k, all eigenvalues of E(W_k^T W_k) other than the eigenvalue 1 associated with the eigenvector 1_n have magnitude at most ρ. Then, for all s ≥ 0 and all i ∈ V,

E‖(1_n/n) − Φ(s − 1, 0) e_i‖² ≤ ((n − 1)/n) ρ^s.

Proof. Let y_s = (1_n/n) − Φ(s − 1, 0) e_i. We prove this lemma by induction. For s = 0, ‖y_0‖² = ‖(1_n/n) − e_i‖² = (n − 1)²/n² + (n − 1)/n² = (n² − 2n + 1 + n − 1)/n² = (n − 1)/n. Assume the claim holds for s = K, i.e., E‖y_K‖² ≤ ((n − 1)/n) ρ^K. For s = K + 1, note that y_{K+1} = W_K y_K, so E‖y_{K+1}‖² = E(y_K^T W_K^T W_K y_K). According to Lemma 7, E(W_K^T W_K) is symmetric and doubly stochastic; hence 1_n is an eigenvector of E(W_K^T W_K) with eigenvalue 1. By the spectral theorem for Hermitian matrices, we can construct a basis of ℝ^n composed of eigenvectors of E(W_K^T W_K), starting from 1_n. By the assumption of the lemma, the magnitudes of all other eigenvalues are at most ρ. Since y_K is orthogonal to 1_n, we find E‖y_{K+1}‖² ≤ ρ E‖y_K‖² ≤ ((n − 1)/n) ρ^{K+1}. By induction, we complete the proof.
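For a time-invariant mixing matrix, the geometric decay in Lemma 8 can be checked numerically. The ring matrix below and the simplification W_k = W for all k are assumptions made for this check:

```python
import numpy as np

n = 5
# Fixed symmetric doubly stochastic mixing matrix W (ring with self-weight).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = W[i, (i - 1) % n] = 0.25

# rho: largest eigenvalue magnitude of W^T W after excluding the
# eigenvalue 1 associated with the all-ones vector 1_n.
eigs = np.sort(np.abs(np.linalg.eigvalsh(W.T @ W)))
rho = eigs[-2]

avg = np.full(n, 1.0 / n)          # 1_n / n
e0 = np.zeros(n); e0[0] = 1.0      # e_i with i = 0

decay_holds = True
y = avg - e0                       # y_0, with ||y_0||^2 = (n - 1)/n
for s in range(1, 30):
    y = W @ y                      # y_s = W y_{s-1} (constant W here)
    decay_holds &= float(y @ y) <= (n - 1) / n * rho ** s + 1e-12
```

The check works because y_0 is orthogonal to 1_n and each multiplication by W keeps it in that subspace, where W contracts norms by at least √ρ per step.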
In Lemma 9, we give the bound of the sensitivity of our proposed algorithm.

Lemma 9.
Under assumption (4), the sensitivity of the algorithm at each iteration can be bounded as

Δ ≤ 2γ√d G,

where G = max_{i∈V} G_i and d is the dimensionality of the vectors.
Proof. Let D_k and D′_k be any two adjacent datasets at iteration k, and let x_{k,i} and x′_{k,i} be the executions of M(D_k) and M(D′_k), respectively. From our proposed algorithm, we can bound ‖x_{k,i} − x′_{k,i}‖_1, where the first inequality comes from the norm inequality ‖·‖_1 ≤ √d ‖·‖_2 and the last inequality follows from the triangle inequality. From assumption (4), the stochastic gradients are bounded by G. Since the pair of adjacent datasets D_k, D′_k can be chosen arbitrarily, we obtain the stated bound on Δ. This proves the lemma.
From Lemma 9, we see that the learning rate γ, the dimensionality d of the vectors, the gradient bound G, and the privacy level ε all affect the magnitude of the added random noise. Based on Lemma 9, we next bound the noise.

Lemma 10.
The added Laplace noise satisfies the following inequalities.

Proof. According to the property of the Laplace distribution, each coordinate of η_{s,i} has variance 2ς², with ς = Δ/ε; this gives a bound on each η_{s,i}. Summing over all nodes, we obtain the bound on η_s.
Lemma 11 (see [1]). Under assumption (1), the following inequality holds. The proof of this lemma can be found in the full version of [1]. We define Dis_{k,i} as the squared distance of the local optimization variable on node i from the averaged local optimization variable over all nodes at iteration k, i.e., Dis_{k,i} = E‖(∑_{i′=1}^{n} x_{k,i′})/n − x_{k,i}‖². In the following, we present the bound on Dis_{k,i}.

Lemma 12.
Under the definition of Dis_{k,i}, we can bound Dis_{k,i} as follows.

Proof. According to the update rule of X_k, we split Dis_{k,i} into two terms, A_1 and A_2, where the seventh equality comes from x_{0,i} = x_0 = 0 for all i. First, to bound A_1, we split it into two terms, A_3 and A_4, and bound A_3. Moreover, A_4 can be bounded as follows:

Then, plugging A_3 and A_4 into A_1, where the last inequality comes from the fact that ∑_{s=0}^{∞} ρ^s = 1/(1 − ρ). Moreover, we split A_2 into two terms, A_5 and A_6. We give an upper bound on A_5, where the second to last inequality comes from Lemma 8 and assumption (3).

For A_6, we give the following upper bound. To bound A_6, we first bound A_7 and A_8. For A_7, the last inequality comes from Lemmas 8 and 11. Then, we bound A_8, where A_10 can be bounded in terms of ζ and ρ, and we give a bound on A_9. Plugging A_9 and A_10 into A_8, and then A_7 and A_8 into A_6, yields the upper bound for A_6.

Then, plugging A_5 and A_6 into A_2 yields the upper bound of A_2.
This completes the proof of the lemma.
Based on the lemmas above, we now prove Theorem 5.

Proof (Proof of Theorem 5). We split the second term according to 2⟨a, b⟩ ≤ ‖a‖² + ‖b‖², and we split the last term of (53) into two terms. According to (54) and (55), (53) can be expanded accordingly. We can bound the second to last term using σ, where the last step holds because of assumption (3). Then, it follows from (56) that

E f(X_{k+1} 1_n / n) ≤ E f(X_k 1_n / n) − γ E⟨∇f(X_k 1_n / n), ∂f(X_k) 1_n / n⟩ + ⋯,

where the last step comes from 2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖². We then bound the term X, where the first inequality comes from ‖∑_{i=1}^{n} a_i‖² ≤ n ∑_{i=1}^{n} ‖a_i‖². According to Equation (37) in Lemma 12, we have the bound on Dis_{k,i}. Then, we bound its average M_k = (1/n) ∑_{i=1}^{n} Dis_{k,i} over all nodes as follows: