Approximate Dual Averaging Method for Multiagent Saddle-Point Problems with Stochastic Subgradients

This paper considers the problem of solving a saddle-point problem over a network of multiple interacting agents. The global objective function is a combination of local convex-concave functions, each of which is available to only one agent. Our main focus is on the case where the projection steps are calculated approximately and the subgradients are corrupted by stochastic noise. We propose an approximate version of the standard dual averaging method and show that the standard convergence rate is preserved, provided that the projection errors decrease at an appropriate rate and the noise is zero-mean with bounded variance.


Introduction
The problem of solving optimization problems over a multiagent network has attracted a lot of attention in recent years (see, e.g., [1][2][3][4][5][6][7][8][9][10][11][12][13]). The objective function of such problems is, in general, a sum of local objective functions, each of which is known to one specific agent only. Moreover, the estimates of all agents are restricted to lie in some convex set. Due to the lack of a central coordinator, the methods developed to solve this problem have to be executed by individual agents through local interactions.
In this paper, we consider the multiagent saddle-point problem where the global objective function is given as a sum of local convex-concave functions, subject to some global constraint. We utilize the average consensus algorithm (see, e.g., [14][15][16][17][18][19][20][21]) as a mechanism to design a distributed method for solving this problem. The method is based on the standard dual averaging method (see, e.g., [1,22]), and it can also be viewed as an approximate version of the distributed dual averaging method in [2]. Unlike the distributed dual averaging methods in [1][2][3][4], which require the projection steps to be computed very accurately, the proposed method only requires them to be computed approximately. Moreover, the proposed method also covers the case where the subgradients are corrupted by stochastic noise.
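As an illustration of the averaging mechanism mentioned above, the following minimal sketch runs a plain average consensus iteration. The three-agent weight matrix `P`, the initial values, and the function names are hypothetical and chosen only for illustration; they are not taken from the paper. The matrix is symmetric and doubly stochastic, so repeated local averaging preserves the network-wide sum and drives every agent toward the network-wide mean.

```python
def consensus_step(P, values):
    """One round: agent i replaces its value by sum_j P[i][j] * values[j]."""
    n = len(values)
    return [sum(P[i][j] * values[j] for j in range(n)) for i in range(n)]


def run_consensus(P, values, rounds):
    """Apply the averaging round repeatedly and return the final values."""
    for _ in range(rounds):
        values = consensus_step(P, values)
    return values


# Hypothetical path graph 0 - 1 - 2 with self-loops at the end nodes.
# Every row and every column sums to 1 (doubly stochastic).
P = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]

initial = [3.0, 0.0, 6.0]          # assumed local values; their mean is 3.0
final = run_consensus(P, initial, rounds=60)
```

Each agent only uses the values of its neighbors (the nonzero entries in its row of `P`), which is what makes the scheme distributed.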
Literature Review. In [9], the authors develop a general framework for solving convex optimization problems over a network of multiple agents. Based on the average consensus algorithms, they propose a subgradient-based method; the method is fully distributed, in the sense that each agent only needs to communicate with its neighbors. Different from the work [9], the authors in [1] propose a distributed method that is based on dual averaging of subgradients; in particular, the authors characterize the explicit convergence rate of the proposed method. The authors in [3] further study the effects of communication delays on the distributed dual averaging method. The work [4] utilizes the push-sum algorithm as a mechanism to design a distributed dual averaging method; the implementation of the method removes the need for doubly stochastic communication matrices. In [2], the authors solve the saddle-point problem over a multiagent network; the objective function is given as a sum of multiple convex-concave functions. Based on the dual averaging method, the authors propose a distributed method and characterize its convergence rate.
The contribution of this paper is twofold. First, we propose an approximate dual averaging method whose implementation does not require the projection steps to be calculated accurately. We show how the projection errors affect the error bound of the method and conclude that the standard convergence rate is preserved when the errors decrease at an appropriate rate. Second, we consider the case where the subgradients are corrupted by stochastic noise that is zero-mean and has bounded variance, and we highlight the dependence of the error bound on the variance.
In contrast with the work [22], we solve the saddle-point problem over a multiagent network; in particular, we show that the standard convergence rate $O(1/\sqrt{T})$ (where $T$ is the iteration counter) is preserved, even when the projection steps are computed approximately and the subgradients are corrupted by stochastic noise. In contrast with the work [2], we propose an approximate version of the distributed dual averaging method and show that if the projection errors decrease at an appropriate rate, the standard convergence rate is preserved.
The remainder of this paper is organized as follows. Section 2 gives a formal statement of the multiagent saddle-point problem and the underlying network model. Section 3 presents the method and its main convergence results. Finally, we conclude with Section 4.
Notation and Terminology. We use $\mathbb{R}^n$ to denote the $n$-dimensional vector space. We denote the standard inner product on $\mathbb{R}^n$ by $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$, for all $x, y \in \mathbb{R}^n$. Let $\mathcal{M}$ be a closed convex set in $\mathbb{R}^n$. We say $h(x)$ is a proximal function of the set $\mathcal{M}$ if it is continuous and strongly convex on $\mathcal{M}$ with respect to some norm $\|\cdot\|$; that is, for all $0 \le \alpha \le 1$, $h(\alpha x_1 + (1-\alpha) x_2) \le \alpha h(x_1) + (1-\alpha) h(x_2) - (\sigma/2)\,\alpha(1-\alpha)\,\|x_1 - x_2\|^2$, for all $x_1, x_2 \in \mathcal{M}$, where $\sigma$ is some positive scalar. We define the proximal center of the set $\mathcal{M}$ by $x_0 = \arg\min_{x \in \mathcal{M}} h(x)$. For $\mathbb{R}^n \times \mathbb{R}^m$, we introduce the following norm: [...], where the parameter appearing in the norm lies in $(0, 1)$, $\|\cdot\|_2$ denotes the Euclidean norm, and $\sigma_w$ and $\sigma_z$ are parameters that will be specified in the sequel. This implies the following dual norm: [...]. The supergradient of a concave function can be defined accordingly.
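To make the strong convexity condition in the definition of a proximal function concrete, the sketch below checks it numerically for one standard choice. The choice $h(x) = \tfrac{1}{2}\|x\|_2^2$ with the Euclidean norm and $\sigma = 1$ is an assumption made here for illustration, and the function names are hypothetical. For this $h$ the inequality holds with equality, so the computed gap should vanish up to rounding.

```python
import random


def h(x):
    """Assumed proximal function: half the squared Euclidean norm."""
    return 0.5 * sum(v * v for v in x)


def strong_convexity_gap(x1, x2, a, sigma=1.0):
    """Right-hand side minus left-hand side of
    h(a*x1 + (1-a)*x2) <= a*h(x1) + (1-a)*h(x2) - (sigma/2)*a*(1-a)*||x1-x2||^2.
    A nonnegative value means the inequality holds at this triple."""
    mix = [a * u + (1 - a) * v for u, v in zip(x1, x2)]
    dist2 = sum((u - v) ** 2 for u, v in zip(x1, x2))
    rhs = a * h(x1) + (1 - a) * h(x2) - 0.5 * sigma * a * (1 - a) * dist2
    return rhs - h(mix)


random.seed(0)
gaps = []
for _ in range(200):
    x1 = [random.uniform(-1, 1) for _ in range(3)]
    x2 = [random.uniform(-1, 1) for _ in range(3)]
    gaps.append(strong_convexity_gap(x1, x2, random.random()))
```

With $\sigma > 1$ the same check would report negative gaps, which is one way to see that $\sigma = 1$ is the largest admissible parameter for this particular $h$.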

Problem Setup
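We consider the saddle-point problem (1), in which the global function $\mathcal{L}(w, z)$, given as a sum of the local functions $\mathcal{L}_i(w, z)$, is minimized over $w \in \mathcal{W}$ and maximized over $z \in \mathcal{Z}$. Here $\mathcal{W}$ and $\mathcal{Z}$ are convex and compact sets in $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively, and each $\mathcal{L}_i$ is a convex-concave function defined over $\mathcal{W} \times \mathcal{Z}$ known only by agent $i$. We refer to a vector pair $(w^*, z^*) \in \mathcal{W} \times \mathcal{Z}$ as a saddle point of $\mathcal{L}$ if $\mathcal{L}(w^*, z) \le \mathcal{L}(w^*, z^*) \le \mathcal{L}(w, z^*)$ for all $w \in \mathcal{W}$ and $z \in \mathcal{Z}$. Note that such a vector pair $(w^*, z^*)$ is a solution to problem (1).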
We now make some assumptions on problem (1). For the set $\mathcal{W}$, we assume that there exists a proximal function $h_w(w)$ with proximal center and convexity parameter denoted by $w_0$ and $\sigma_w$, respectively. Without loss of generality, we assume that $h_w(w_0) = 0$. For the set $\mathcal{Z}$ we introduce similar assumptions and notation; in particular, $h_z(z_0) = 0$. Therefore, for $x = (w, z) \in \mathcal{X} := \mathcal{W} \times \mathcal{Z}$, it is natural to introduce a proximal function $h(x)$ of the set $\mathcal{X}$, given by [...]. It is easy to see that the proximal center of $\mathcal{X}$ is $x_0 = (w_0, z_0)$ and $h(x_0) = 0$. Furthermore, we denote $D := \max_{x \in \mathcal{X}} h(x)$.

The Method and Assumptions.
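We now propose the method, which is based on the method in [2]. Specifically, each agent $i \in \mathcal{V}$ updates its estimates by setting ($t = 0, 1, \ldots$): [...] where [...] (the two components of $G^i(x^i_t)$ denote a subgradient of $\mathcal{L}_i$ with respect to $w$ and a supergradient of $\mathcal{L}_i$ with respect to $z$ at the point $x^i_t$, respectively), $\xi^i_t \in \mathbb{R}^n \times \mathbb{R}^m$ is the stochastic noise vector in evaluating $G^i_t$, $\{\beta_{t+1}\}$ is a positive and nondecreasing sequence, $\pi_{\beta}(s) := \arg\min_{x \in \mathcal{X}} \{-\langle s, x \rangle + \beta h(x)\}$, and the approximate projection $\tilde{\pi}^{\varepsilon}_{\beta}(s)$ satisfies the following two properties: [...] where $\varepsilon$ is a positive scalar that represents the error in computing the next iterate by a projection defined by the proximal function $h$ and parameter $\beta$. Note that $\tilde{\pi}^{\varepsilon}_{\beta}(s)$ is not uniquely defined for each $s$.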
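The exact projection step, read here as $\pi_{\beta}(s) = \arg\min_{x \in \mathcal{X}} \{-\langle s, x \rangle + \beta h(x)\}$, has a closed form in simple cases. The sketch below is a minimal illustration under assumptions that are not taken from the paper: $\mathcal{X}$ is a box, $h(x) = \tfrac{1}{2}\|x - x_0\|_2^2$, and all function names and numbers are hypothetical. Under these assumptions the objective separates across coordinates, so the minimizer is the unconstrained solution $x_0 + s/\beta$ clipped to the box.

```python
def objective(x, s, beta, x0):
    """Value of -<s, x> + beta * h(x) with the assumed h(x) = 0.5 * ||x - x0||^2."""
    inner = sum(si * xi for si, xi in zip(s, x))
    prox = 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, x0))
    return -inner + beta * prox


def prox_projection(s, beta, x0, lo, hi):
    """Minimizer of the objective over the box [lo, hi] (coordinate-wise bounds).
    Each coordinate minimizes a 1-D quadratic, so clipping is exact."""
    return [min(max(ci + si / beta, l), u)
            for si, ci, l, u in zip(s, x0, lo, hi)]


# Hypothetical data: a 2-D box [-1, 1]^2 with proximal center at the origin.
s = [2.0, -0.3]
x_star = prox_projection(s, beta=1.0, x0=[0.0, 0.0], lo=[-1.0, -1.0], hi=[1.0, 1.0])
```

Note that a larger $\beta$ pulls the result toward the proximal center $x_0$, and that $s = 0$ returns $x_0$ itself. For constraint sets without such a closed form, the minimization would typically be solved only to some tolerance, which is the situation the approximate projection is meant to model.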
In the paper, we make the following assumptions.
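Assumption 3 (bounded subgradients). We assume that the following inequalities hold for all $i \in \mathcal{V}$ and $x \in \mathcal{X}$: [...] where the two bounds, associated with $\mathcal{W}$ and $\mathcal{Z}$, respectively, are positive scalars.
Assumption 4 (stochastic subgradient). We assume that the stochastic noise vector $\xi^i_t$ satisfies the following properties, for all $i \in \mathcal{V}$ and $t \ge 0$: [...] where $\Phi$ is some positive constant. In particular, the noise is zero-mean, $\mathbb{E}[\xi^i_t] = 0$, and its magnitude is bounded in terms of $\Phi$.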

Convergence Results
We show convergence of the method (4) and (5) via the local average pair $(\hat{w}^v_T, \hat{z}^v_T)$ defined at each agent $v \in \mathcal{V}$, where $T \ge 1$ is the iteration counter.
With the assumptions made in Section 3.1, we have the following main convergence result.
Theorem 5. Under Assumptions 1, 2, 3, and 4, consider a sequence generated according to the method (4) and (5), with step and projection error sizes: [...] where the two constants appearing in these choices are positive scalars. Let $(w^*, z^*) \in \mathcal{W} \times \mathcal{Z}$ be a saddle point of $\mathcal{L}(w, z)$; then, for each agent $v \in \mathcal{V}$ and all $T \ge 1$, we have [...] where [...].
Proof. See the Appendix.
Remark 6. Theorem 5 gives the main convergence result for the method (4) and (5): the function value $\mathcal{L}(\hat{w}^v_T, \hat{z}^v_T)$ converges to $\mathcal{L}(w^*, z^*)$ at rate $O(1/\sqrt{T})$ in expectation, for each $v \in \mathcal{V}$. It is easy to see that the error bound is an increasing function of the noise magnitude $\Phi$. It is worth noting that, in method (4) and (5), we have considered the case where the subgradients are corrupted by stochastic noise that is zero-mean and has bounded variance, and, moreover, the projection steps are calculated only approximately. In fact, the proposed method converges when the projection error $\varepsilon_t$ decreases as $O(1/t^q)$, where $q > 0$. However, when $0 < q < 1/2$, the $O(1/\sqrt{T})$ convergence rate cannot be achieved.
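The role of the decay exponent in Remark 6 can be checked numerically. The sketch below assumes, for illustration only, a projection error schedule of the form $\varepsilon_t = c/(t+1)^q$ (the indexing, the constant `c`, and the function names are hypothetical) and looks at the running average of the errors multiplied by $\sqrt{T}$. For $q = 1/2$ this scaled average stays below $2c$, whereas for $q < 1/2$ it grows with $T$, so the averaged error decays more slowly than $1/\sqrt{T}$.

```python
import math


def average_error(T, q, c=1.0):
    """Running average (1/T) * sum_{t=0}^{T-1} c / (t + 1)**q of the assumed schedule."""
    return sum(c / (t + 1) ** q for t in range(T)) / T


def scaled_average(T, q, c=1.0):
    """Average error multiplied by sqrt(T); bounded in T iff the average is O(1/sqrt(T))."""
    return average_error(T, q, c) * math.sqrt(T)


horizons = (10, 100, 10000)
at_half = [scaled_average(T, 0.5) for T in horizons]      # stays below 2
at_quarter = [scaled_average(T, 0.25) for T in horizons]  # keeps growing with T
```

The bound for $q = 1/2$ follows from comparing the sum with an integral: $\sum_{k=1}^{T} k^{-1/2} \le 2\sqrt{T} - 1$.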
Remark 7. As compared to the work [2], we show that the standard $O(1/\sqrt{T})$ convergence rate for the dual averaging method is preserved, under the assumption that the projection steps are only computed approximately and the subgradients are corrupted by stochastic noise as well. As compared to [23], the proposed method solves the saddle-point problem in a distributed setting, and the expected convergence rate is also established.

Conclusion
We have studied the problem of solving saddle-point problems over a multiagent network. The objective function is given as a sum of local convex-concave functions, subject to some global constraint. Based on the average consensus algorithm and the dual averaging method, we have proposed an approximate dual averaging method for the setting in which the projection steps are computed approximately and the subgradients are corrupted by stochastic noise. Finally, we have presented the main convergence results of the proposed method.

Appendix Proof of Theorem 5
We provide three lemmas which will be used for the proof of Theorem 5.
Lemma A.2. Under Assumptions 1, 2, 3, and 4, consider a sequence generated according to the method (4) and (5); then, for all $t \ge 0$, [...] where [...].
Proof. We can compute the general evolution of the iterates as follows, by referring to (4): [...] In a similar way, for [...], we have [...] Hence, by noting that the initial values are zero for all $i$, it follows that [...] (A.5). Note that $x^i_t \in \mathcal{X}$ (cf. (6)), for all $i \in \mathcal{V}$ and $t \ge 0$. Hence, we can use the definition of the dual norm and Assumption 3 to bound $\|G^i_t\|_*$ as follows: [...] This, along with Lemma A.2, leads to the following estimate: [...] where we have used the bound on the noise in terms of $\Phi$, according to Assumption 4. Hence, the desired result follows by bounding the resulting double sum of geometrically decaying terms [...].
Proof of Theorem 5. First, we introduce the following gap sequence, for all $T \ge 1$: [...] where we have used Assumption 4; that is, $\mathbb{E}[\xi^i_t] = 0$. Breaking $\Lambda_T$ into two parts, we have [...] For the first term on the right-hand side of (A.10), we can follow an argument similar to that of the proof of Theorem 1 in [2] to provide the following bound: [...] where we have used (A.4). For the second term, we proceed as follows. By recalling the definition of [...], we have [...] where the last equality follows from the fact that the weight matrix is doubly stochastic (cf. Assumption 2). Then, we investigate the sequence of values of $\Psi$ [...]; that is, [...] It turns out that, for the gradient term $\nabla \Psi$ [...], we have [...] where the first equality follows from Lemma A. [...] where we have used the equality identifying the approximate projection at agent $v$ with that agent's iterate, which holds for $t \ge 0$ (for $t = 0$, it is easy to verify that the projection is evaluated at $0$ and returns the proximal center $x_0$). Substituting (A.16) into (A.13) and then taking the expectation, we obtain [...] where we have used the fact that the initial value of $\Psi$ is $0$.
Following an argument similar to that of the proof of Theorem 1 in [2], we can arrive at [...] In a similar way, we have $(1/T) \sum_{t=0}^{T-1} \varepsilon_t \le 2 c_{\varepsilon}/\sqrt{T}$, where $c_{\varepsilon}$ denotes the positive constant in the projection error size; therefore, the desired result follows by substituting this and (A.26) into (A.25). The proof is complete.