A Distributed Conjugate Gradient Online Learning Method over Networks

In a distributed online optimization problem with a convex constraint set over an undirected multiagent network, the local objective functions are convex and vary over time. Most existing methods for solving this problem are based on gradient descent. However, the convergence speed of these methods decreases as the number of iterations increases. To accelerate convergence, we present a distributed online conjugate gradient algorithm in which, unlike in a gradient method, the search directions are a set of vectors that are conjugate to each other and the step sizes are obtained through an exact line search. We analyze the convergence of the algorithm theoretically and obtain a regret bound of O(√T), where T is the number of iterations. Finally, numerical experiments conducted on a sensor network demonstrate the performance of the proposed algorithm.


Introduction
Distributed optimization has received considerable interest in science and engineering and can be applied in numerous fields such as distributed tracking and localization [1], multiagent coordination [2], distributed estimation using sensor networks [3][4][5], and machine learning [6]. Such problems can be modeled as minimizing or maximizing the sum of local convex functions, where each local function uses only local computation and communication in a distributed manner. With the increase in network size and data volume, more effective distributed algorithms have become a hot research topic. In recent years, many scholars have proposed various distributed optimization algorithms to solve such problems [7][8][9][10][11][12][13][14][15].
Most of the existing algorithms assume that the cost function at each agent is fixed. However, in practical problems, the environment of an agent is uncertain and the cost function of each agent changes over time, requiring us to solve such problems in an online setting. More precisely, in distributed online optimization, the cost function of each agent changes at every step: at iteration t, the cost function of each agent is unknown before a decision is made; only after a decision is chosen from the constraint set is the cost function revealed, and each agent simultaneously incurs a loss. This loss reflects the gap in cumulative cost between the current decision points and the best fixed decision in hindsight, which we call regret. Regret is an important criterion for evaluating a distributed online algorithm: a well-performing distributed online optimization algorithm should drive the average regret to zero over time.
Because an online distributed optimization algorithm is more consistent with practical problems, many scholars have conducted numerous studies, and some effective algorithms have been proposed [16][17][18][19][20][21][22][23][24]. Yan et al. [20] introduced a distributed autonomous online learning algorithm, namely, a subgradient descent method using a projection, and obtained regret bounds for strongly convex and for general convex local objective functions, respectively. The authors in [22] introduced an online distributed push-sum algorithm in which the search direction is a negative subgradient at each iteration, achieving O((log T)²) regret when the local functions are strongly convex. For a time-varying directed network, Zhu et al. [25,26] proposed a distributed online optimization algorithm in which a negative subgradient is randomly selected as the search direction at each iteration. The authors in [27] presented a distributed online algorithm based on a primal-dual dynamic mirror descent for a problem with time-varying coupling inequality constraints and obtained a dynamic regret bound. The authors in [28] proposed a distributed online conditional gradient algorithm for a constrained distributed online optimization problem in the Internet of Things. The existing distributed online optimization algorithms based on the gradient method are simple to compute and require little storage; however, to ensure convergence, the iterative step length usually needs to decrease as the number of iterations increases, which leads to a zigzag path in the later stage of the algorithm. That is, the algorithm carries out multiple iterations in the same or nearly the same direction, which greatly increases its computation time.
The conjugate gradient algorithm also has the advantages of simple computation and guaranteed convergence under certain conditions [29][30][31], but it differs from the gradient method in that its search directions form a group of conjugate or approximately conjugate vectors, so that during the later stage of the algorithm there are no repeated iterations in the same or nearly the same direction.
Thus, the convergence of the conjugate gradient method is generally faster than that of the gradient descent method. In particular, for a quadratic objective function, the conjugate gradient method has the quadratic termination property. Based on these advantages, the conjugate gradient method has been used to solve numerous centralized offline optimization problems [32][33][34][35]. According to the existing literature, however, the conjugate gradient method has not been applied to distributed online optimization problems. To fill this gap, we present a distributed online conjugate gradient algorithm.
There are two main contributions provided by the present study. First, a new algorithm for the distributed online constrained convex optimization problem, namely, a distributed online conjugate gradient algorithm, is proposed. In our algorithm, a set of conjugate directions replaces the gradient directions used in a traditional gradient descent method, and the step size is obtained through an exact line search, thus effectively avoiding the slow convergence of a traditional gradient descent algorithm in its later stage. Second, we provide a careful convergence analysis of the proposed algorithm and obtain a square-root regret bound. The remainder of this paper is organized as follows: in Section 2, we first briefly introduce the distributed online optimization model, followed by some necessary mathematical preliminaries and assumptions used in this study. We provide a detailed statement of our algorithm in Section 3 and an analysis of its convergence in Section 4. The simulation results are presented in Section 5. Finally, we provide some concluding remarks in Section 6. In addition, detailed proofs of some of the lemmas can be found in the Appendix.

Preliminaries
In this section, we provide a brief background on distributed online optimization and the conjugate gradient method. At the same time, some constructs used in this study and some assumptions relevant to our analysis are given.

Distributed Online Optimization. Consider a network of multiple agents in which each agent i is associated with a convex function f_ti(x): R^n → R. All agents aim to cooperatively solve the following general consensus problem:

min_x Σ_{t=1}^T Σ_{i=1}^n f_ti(x) subject to x ∈ X. (1)

During each round t ∈ {1, ..., T}, the ith agent is required to generate a decision point x_i(t) from a convex compact set X. Then, the adversary replies to each agent's decision with a cost function f_ti(x): X → R, and each agent simultaneously incurs the loss f_ti(x_i(t)). The communication between agents is specified by a graph G = (V, E), where V = {1, ..., n} is the vertex set and E ⊂ V × V is the edge set. Each agent i can only communicate with its immediate neighbors N(i) = {j ∈ V | (i, j) ∈ E}. The goal of the agents is to seek a sequence of decision points {x_i(t)}, i ∈ V, such that the cumulative regret of each agent i with respect to any fixed decision x* ∈ X in hindsight,

R_T = Σ_{t=1}^T Σ_{j=1}^n [f_tj(x_i(t)) − f_tj(x*)],

is sublinear in T, that is, lim_{T→∞} R_T/T = 0.
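To make the regret criterion concrete, the following sketch computes the static regret of a simple online learner against the best fixed decision in hindsight. The quadratic loss family, horizon, number of agents, and decision rule are all illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 200, 5  # rounds and agents (illustrative values)

# Hypothetical time-varying quadratic losses f_ti(x) = (x - c_ti)^2 on X = [-1, 1].
c = rng.uniform(-1.0, 1.0, size=(T, n))

def loss(t, i, x):
    return (x - c[t, i]) ** 2

# A simple online learner: play the running average of the per-round targets
# seen in earlier rounds (the decision at round t uses only rounds before t).
x_hist = np.cumsum(c.mean(axis=1)) / np.arange(1, T + 1)
decisions = np.concatenate(([0.0], x_hist[:-1]))

# Best fixed decision in hindsight: the minimiser of sum_t sum_i (x - c_ti)^2
# is the overall mean of the targets.
x_star = c.mean()

regret = sum(loss(t, i, decisions[t]) - loss(t, i, x_star)
             for t in range(T) for i in range(n))
print(regret / T)  # average regret; small for a sensible learner
```

A sublinear regret bound, as obtained in this paper, guarantees precisely that this averaged quantity vanishes as T grows.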

Conjugate Gradient Method.
Consider the unconstrained optimization problem min_{x∈R^n} f(x), where f(x) is twice continuously differentiable. The iterative form of the conjugate gradient (CG) method is usually designed as

x(k + 1) = x(k) + α_k d_k,

where x(k) is the point at the kth iteration, α_k > 0 is the step length, and the search direction d_k is defined as

d_k = −g_k + β_k d_{k−1}, with d_1 = −g_1,

in which g_k is the gradient of the objective function at the current iterate x(k) and β_k ∈ R is a scalar; different definitions of β_k yield different conjugate gradient methods [27]. Well-known conjugate gradient methods include the Polak-Ribiere-Polyak (PRP) method and the Fletcher-Reeves (FR) method. In this study, we define the parameter β_k using the PRP method, the specific form of which is

β_k^{PRP} = ⟨g_k, g_k − g_{k−1}⟩ / ‖g_{k−1}‖².

Gilbert and Nocedal [36] proved that if the parameter β_k is appropriately bounded in magnitude, the CG method converges globally. Therefore, the CG method satisfies the sufficient descent condition under this hypothesis.
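As an illustration of the update above, the following sketch runs the PRP conjugate gradient method with an exact line search and the nonnegativity restart β_k ← max(0, β_k^PRP) on a small quadratic; the matrix A and vector b are arbitrary illustrative choices.

```python
import numpy as np

# Minimise f(x) = 0.5 x^T A x - b^T x with PRP conjugate gradient and exact line search.
# A (symmetric positive definite) and b are illustrative choices.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

def grad(x):
    return A @ x - b

x = np.zeros(2)
g = grad(x)
d = -g
for k in range(50):
    if np.linalg.norm(g) < 1e-10:
        break
    alpha = -(g @ d) / (d @ A @ d)       # exact line search for a quadratic
    x = x + alpha * d
    g_new = grad(x)
    beta = max(0.0, g_new @ (g_new - g) / (g @ g))  # PRP, restarted when beta <= 0
    d = -g_new + beta * d
    g = g_new

print(x)                        # approximates the solution of A x = b
print(np.linalg.solve(A, b))
```

On a quadratic, this scheme reduces to the linear CG method and exhibits the quadratic termination property mentioned above: it converges in at most n iterations.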
To analyze the convergence of our algorithm, we provide the following bound on the conjugate gradient directions.
Lemma 1 (see [37]). Let f(x) be a twice continuously differentiable convex function, and let ∇²f(x) be the Hessian matrix of the function. Suppose that for any x ∈ R^n there exist two positive numbers m and M such that

m‖y‖² ≤ yᵀ∇²f(x)y ≤ M‖y‖² for all y ∈ R^n.

Then, taking an initial point x(1) ∈ C and generating x_k, d_k, and β_k by the PRP method with g_k = ∇f(x_k), the search directions satisfy

‖d_k‖ ≤ ((M + m)/m)‖g_k‖.

Some Constructs and Assumptions. The following assumptions are made throughout this paper: (i) Each cost function f_ti(x) is convex, twice continuously differentiable, and L-Lipschitz on the convex set X. (ii) The set X is compact and convex, and 0 ∈ X, where 0 denotes the vector with all entries equal to zero. (iii) The Euclidean diameter of X is bounded by R.
As the Lipschitz condition in (i) implies, for any x ∈ X and any gradient g_i, we have ‖g_i‖_* ≤ L, where ‖·‖_* := sup_{‖u‖≤1}⟨·, u⟩ denotes the dual norm. The next definition is used throughout this paper.
Definition 1 (see [38]). Let f(x) be a differentiable function on an open set C ⊆ R^n, and let X be a convex subset of C. Then, f is convex on X if, for all (x₀, x) ∈ X × X,

f(x) ≥ f(x₀) + ⟨∇f(x₀), x − x₀⟩.

Now, we give an important inequality from [39] that is often used in optimization problems.
Let f(x) be a first-order continuously differentiable function on R^n whose gradient satisfies the Lipschitz condition with constant L. Then, for all x, y ∈ R^n,

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖²,

where ‖·‖ denotes the Euclidean norm.
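This descent inequality can be checked numerically; the sketch below samples random point pairs for f(x) = ½‖x‖², whose gradient is 1-Lipschitz, and verifies the quadratic upper bound (the function choice is an illustrative assumption).

```python
import numpy as np

# Numerically check f(y) <= f(x) + <grad f(x), y - x> + (L/2) ||y - x||^2
# for f(x) = 0.5 ||x||^2, whose gradient is x and is 1-Lipschitz (L = 1).
rng = np.random.default_rng(1)
L = 1.0

f = lambda v: 0.5 * (v @ v)
gradf = lambda v: v

for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    bound = f(x) + gradf(x) @ (y - x) + (L / 2) * np.sum((y - x) ** 2)
    assert f(y) <= bound + 1e-12
print("descent lemma holds on all samples")
```

For this particular f the bound holds with equality, which is why the quadratic model is tight for quadratic objectives.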

Distributed Online Conjugate Gradient Algorithm
For the distributed online optimization problem (1), each local cost function f_ti(x) satisfies the assumptions in Section 2. The network topology among agents is specified by an undirected graph G = (V, E), and each agent i can only communicate with its immediate neighbors. The adjacency matrix P of the undirected graph is doubly stochastic. To solve problem (1), we present a distributed online conjugate gradient algorithm. After giving a decision x_i(t) ∈ X based on the current information, we obtain the cost function f_ti(x) and compute the gradient g_i(t) = ∇f_ti(x_i(t)). We then calculate the value of β_i(t) using the gradients at the current iterate x_i(t) and the previous iterate, from which the new search direction d_i(t), a Gram-Schmidt conjugation of the current gradient with the previous search direction d_i(t − 1), can be constructed. If β_i(t) ≤ 0, we set d_i(t) = −g_i(t), which is equivalent to restarting the distributed online conjugate gradient algorithm in the direction of steepest descent. The iteration step length α_i(t) is obtained through an exact line search, and the next iterate x_i(t + 1) is obtained from the conjugate direction d_i(t) and step α_i(t).
The specific algorithm is summarized in Algorithm 1.
Here, we define the projection function used in this algorithm as

Π_X(z, α) = argmin_{x∈X} { ⟨z, x⟩ + (1/α)φ(x) },

where φ(x) is a strongly convex proximal function on X.
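As a concrete instance of this projection, the sketch below assumes the common choices φ(x) = ½‖x‖² and X a Euclidean ball of radius R (both illustrative assumptions; the paper only requires φ strongly convex and X compact convex), for which the minimizer has a closed form.

```python
import numpy as np

# Pi_X(z, alpha) = argmin_{x in X} { <z, x> + (1/alpha) phi(x) },
# sketched for phi(x) = 0.5 ||x||^2 and X = { x : ||x|| <= R }.
def project(z, alpha, R):
    x = -alpha * z                      # unconstrained minimiser of <z,x> + ||x||^2 / (2 alpha)
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x   # pull back onto the ball if needed

z = np.array([3.0, -4.0])
print(project(z, 0.1, 1.0))   # inside the ball: simply -alpha * z
print(project(z, 1.0, 1.0))   # outside: rescaled to norm R
```

Note how the map contracts with α: scaling α down moves both projections proportionally closer together, which is the α-Lipschitz behavior exploited later in the regret analysis.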

Regret Bound Analysis
To analyze the regret bound for D-OCG, we provide some preliminary remarks and a few definitions. From Algorithm 1, the evolution of each local variable z_i(t + 1) can be written out explicitly. Now, we define the network averages z(t) = (1/n)Σ_{i=1}^n z_i(t) and d(t) = (1/n)Σ_{i=1}^n d_i(t), together with y(t) = Π_X(z(t), α(t)), and from the evolution of z_i(t + 1) we can obtain the evolution of z(t + 1). The main results of our paper can now be stated.

Theorem 1. Let the sequences {x_i(t)} and {z_i(t)} be generated by Algorithm 1. Then, for all i ∈ V and any x* ∈ X, the cumulative regret due to the actions of agent i is bounded by a quantity of order O(√T), where λ = max_{1≤i≤n, 1≤t≤T}{λ_i(t)}, b and D are two nonnegative constants, M and m are as defined in Lemma 1, n is the number of agents, and σ₂(P) is the second largest eigenvalue of the adjacency matrix P.
From Theorem 1, we obtain a regret bound for the proposed algorithm under local convexity that is sublinear in T; i.e., the average regret of the D-OCG algorithm approaches zero as T increases, where T is the number of iterations. Evidently, the value of the regret bound depends on the upper bound L of the gradients of the local objective functions and on the diameter R of the constraint set X. By Lemma 1, the regret bound also depends on the Hessian matrices of the local objective functions. Moreover, it depends on the scale and topology of the network.
To prove Theorem 1, we now present the following lemmas.

Lemma 2.
For any i ∈ V and x* ∈ X, we can obtain the following inequality. Proof. Based on assumption (i), the function f_ti(x) is L-Lipschitz continuous on the convex set X; together with convexity, this bounds the instantaneous regret by the linearized term ⟨g_i(t), x_i(t) − x*⟩, giving equations (19) and (20). Combining equations (19) and (20), the proof of Lemma 2 is completed. Now, we prove that the last term of inequality (20) has a particular bound. □

Lemma 3. For any i ∈ V and x* ∈ X, a corresponding bound holds. Proof. Based on assumption (i), we know that ‖g_i(t)‖ ≤ L. Summing the averages of ⟨g_i(t), x_i(t) − x*⟩ over t = 1, ..., T, the desired bound is obtained, and the proof of Lemma 3 is completed. Now, we turn our attention to the remaining term; according to the definition of the conjugate gradient, we give the bound of equation (25) in Lemma 4.

Lemma 4. For any i ∈ V and β_i(t) ≤ b (where b is a nonnegative constant), the following bound holds:
Proof. Based on the definitions of d_i(t) and d(t), the left-hand side of the above inequality can be split into two terms, as in equation (27). We first prove that the first term in equation (27) is bounded. For any function f(x) with domain dom f, the conjugate (Fenchel) inequality holds; therefore, for the function under consideration, we obtain the corresponding bound for any x* ∈ X. Based on the definition of the conjugate function [40] and the updates for z(t), we then have equation (33). Because α(t) is a nonincreasing sequence, the definition of the conjugate function φ*_α(z) yields a further bound, and according to inequality (11), equation (36) holds; a detailed proof of equation (36) is provided in Appendix A. Summing the resulting inequality from t = 1 to T gives equation (38), and through equations (33) and (38) the first term in (27) is bounded as in equation (39). We then analyze the bound on the second term in inequality (27). Because β_i(t) = max{0, β_i(t)^PRP} and β_i(t) ≤ b, we analyze the following two situations, in each of which the conclusion clearly holds. □
Because the set X is closed and φ(x) is strongly convex (for the definition of strong convexity, see [40]), the set described above is compact. On the other hand, we know that ⟨z, x⟩ is differentiable in z and the supremum is attained at a unique point, and thus the gradient ∇φ*_α(z) exists. Then, through a Taylor expansion, we derive two further equations, from which equation (53) and the subsequent bounds follow. Summing both sides of the resulting inequality from t = 1 to T, and combining equations (46)-(57), we obtain the desired bound. Through equations (27), (39), and (58), we finalize the proof of Lemma 4. □

Lemma 5 (α-Lipschitz continuity of the projection). For any pair z, z̃ ∈ R^n and any α > 0,

‖Π_X(z, α) − Π_X(z̃, α)‖ ≤ α‖z − z̃‖_*.
A detailed proof of this lemma is given in Appendix B. Now, we focus on a key quantity in the regret analysis, namely ‖x_i(t) − y(t)‖, in Lemma 6.

Lemma 6.
For all i ∈ V and t ∈ {0, ..., T}, the following inequality is true. Proof. Because x_i(t) and y(t) are the projections of z_i(t) and z(t) onto the set X, Lemma 5 gives

‖x_i(t) − y(t)‖ ≤ α(t)‖z_i(t) − z(t)‖_*.

Now, considering the evolution of the sequence {z_i(t)} in Algorithm 1, and because p_ij is an element of a doubly stochastic matrix, so that Σ_{i=1}^n p_ij = 1, we can expand z_i(t) in terms of powers of P. Based on Algorithm 1 and the definition of z(t), we can determine that z(1) = 0. In addition, using the definition of the ℓ₁ norm of a vector (see [41]), we obtain the bound in equation (46). To make this bound more specific, we introduce a useful property of stochastic matrices [12], which bounds ‖P^{t−r−1}e_i − (1/n)1‖₁ in terms of σ₂(P), where P^{t−r−1} denotes the (t − r − 1)th power of the matrix P, e_i is the ith standard basis vector of R^n, 1 denotes the vector with all entries equal to 1, and σ₂(P) is the second largest eigenvalue of the stochastic matrix P, with σ₂(P) ≤ 1. Through this property, we obtain the required inequality. Combining equations (63), (68), and (70) yields the claim, and thus we complete the proof of Lemma 6. □

Proof of Theorem 1. Combining Lemmas 2-6 yields the overall regret bound. By equation (71) and using α(t) = λ/√t, φ(x*) ≤ R², φ*_{α(T)}(z(T)) ≤ D, and ‖d(t − 1)‖²_* ≤ ((M + m)²/m²)L², we obtain the conclusion of Theorem 1. □
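The geometric decay governed by σ₂(P) can be observed numerically; the sketch below uses a lazy random walk on a 4-cycle as an assumed doubly stochastic P and tracks how far the rows of P^k are from the uniform vector.

```python
import numpy as np

# Illustrate the mixing of a doubly stochastic matrix P: rows of P^k approach
# the uniform vector 1/n at a rate governed by sigma_2(P)^k.
# P below is a lazy random walk on a 4-cycle (an assumed example topology).
n = 4
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

sigma2 = np.sort(np.abs(np.linalg.eigvals(P)))[-2]   # second-largest eigenvalue modulus

e0 = np.zeros(n); e0[0] = 1.0
for k in [1, 5, 20]:
    dist = np.linalg.norm(np.linalg.matrix_power(P, k).T @ e0 - 1.0 / n, 1)
    print(k, dist, sigma2 ** k)   # observed l1 distance vs. sigma_2(P)^k
```

Better-connected topologies have smaller σ₂(P), so the disagreement term ‖z_i(t) − z(t)‖ shrinks faster, which is consistent with the topology dependence of the regret bound in Theorem 1.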

Simulation Experiments
To verify the performance of D-OCG, we consider a distributed sensor network problem [18] with n sensors that aims to estimate a random vector x ∈ X = {x ∈ R^d | ‖x‖₂² ≤ x²_max}. In this network, at each time t ∈ {1, 2, ..., T}, each sensor i receives an observation vector v_ti ∈ R^m, which is time-varying owing to the observation noise. Assume that each sensor i has a linear model ϕ_i(x) = A_i x, where A_i ∈ R^{m×d} is the observation matrix of sensor i and ‖A_i‖₁ ≤ ϕ_max. The local cost function of sensor i is defined as f_ti(x) = (1/2)‖v_ti − A_i x‖₂², where v_ti = A_i x + η_ti and η_ti is white noise. The mathematical model of this problem is

min_x Σ_{t=1}^T Σ_{i=1}^n (1/2)‖v_ti − A_i x‖₂² subject to x ∈ X. (73)

In the offline case, the cost function of each sensor is fixed, and because all information about the cost functions is known in advance, the centralized optimal estimate for this problem can be obtained by solving the corresponding least-squares problem directly. In a practical problem, the characteristics of the white noise may be unknown, or some sensors might not work properly for a particular reason, and we therefore need to estimate the vector x using a distributed online algorithm. Here, we set d = 1 and A_i = 1/2, and sensor i observes v_ti = a_ti x + b_ti, where a_ti ∼ U(0, 1) and b_ti ∼ U(−1/4, 1/4) (here, x ∼ U(a, b) indicates a random variable uniformly distributed on (a, b)). The cost function for sensor i at each time t is then given by the corresponding quadratic loss. We verified the performance of the proposed algorithm with respect to the following three aspects: (1) First, we determined how the number of nodes in the network affects the performance of D-OCG. We can see from Figure 1 that the average regret decreases slowly as the number of nodes increases, and the algorithm converges on networks of different scales.
When n = 1, the problem reduces to a centralized optimization problem, and our distributed algorithm achieves the same effect as the centralized one. (2) We then examined how the network topology influences the performance of D-OCG. We implemented the algorithm on three types of graphs with nine nodes. In a complete graph, each node is connected to all remaining nodes, that is, all nodes can exchange information with each other. In a cycle graph, each node is connected only to its two directly adjacent nodes. The connectivity of a Watts-Strogatz graph lies between that of the complete graph and the cycle graph. From Figure 2, it can be seen that better connectivity leads to slightly faster convergence.
(3) We next compared our algorithm with the classic algorithm D-OGD of [20]. The parameters of the two algorithms are set according to their theoretical analyses. The network topology among the n = 9 nodes is complete, and the step size is α(t) = 1/√t. As shown in Figure 3, the convergence speeds of the two algorithms are initially close, but as the number of iterations increases, D-OCG converges faster than D-OGD, which fully reflects the excellent performance of the proposed algorithm.
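For reference, the sensor-estimation setup above can be reproduced in a few lines. The sketch below runs a plain distributed online gradient baseline (in the spirit of D-OGD, not the authors' D-OCG implementation) on the d = 1 model; the ring topology, step size 1/√t, horizon, and true parameter x_true are illustrative assumptions.

```python
import numpy as np

# Distributed online estimation of a scalar x_true from noisy observations
# v_ti = a_ti * x_true + b_ti, with losses f_ti(x) = 0.5 * (v_ti - a_ti * x)^2.
rng = np.random.default_rng(42)
n, T, x_true, x_max = 9, 2000, 0.8, 2.0

# Doubly stochastic weights for a ring: each node mixes with its two neighbours.
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

x = np.zeros(n)            # each agent's current estimate
avg_est = np.zeros(n)      # running average of the iterates
for t in range(1, T + 1):
    a = rng.uniform(0.0, 1.0, n)      # a_ti ~ U(0, 1)
    b = rng.uniform(-0.25, 0.25, n)   # b_ti ~ U(-1/4, 1/4), observation noise
    v = a * x_true + b
    g = -(v - a * x) * a              # gradient of 0.5 * (v_ti - a_ti * x)^2
    x = P @ x - (1.0 / np.sqrt(t)) * g
    x = np.clip(x, -x_max, x_max)     # projection onto X = [-x_max, x_max]
    avg_est += (x - avg_est) / t

print(avg_est)  # all agents' averaged estimates settle near x_true
```

Replacing the raw gradient step with the conjugate direction and line search of Algorithm 1 is precisely the modification that this paper shows accelerates the later iterations.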

Conclusion
We proposed a distributed online conjugate gradient algorithm to solve the distributed optimization problem with a convex constraint over a network. In this algorithm, conjugate directions replace the gradient or subgradient used in a traditional gradient descent method. Because the search directions are mutually conjugate throughout the iteration process, we avoid the slow convergence that gradient descent exhibits in its later stage. We also presented a detailed convergence analysis of the proposed algorithm and obtained a regret bound for the optimization problem. The regret bound is sublinear in the number of iterations.
We applied the proposed algorithm (D-OCG) to a distributed sensor estimation problem. The numerical results show that our algorithm is feasible and effective, and under the same assumptions, D-OCG has a better convergence rate than the traditional D-OGD gradient method.