A Study on Distributed Optimization over Large-Scale Networked Systems

Distributed optimization is an important concept with applications in control theory and many related fields, as it is highly fault-tolerant and extremely scalable compared with centralized optimization. Centralized solution methods are not suitable for many application domains that consist of a large number of networked systems. In general, these large-scale networked systems cooperatively find an optimal solution to a common global objective during the optimization process. Thus, it gives us an opportunity to analyze the distributed optimization techniques that are demanded in most distributed settings. This paper presents an analysis that provides an overview of decomposition methods as well as currently existing distributed methods and techniques that are employed in large-scale networked systems. A detailed analysis of gradient-like methods, subgradient methods, and methods of multipliers, including the alternating direction method of multipliers, is presented. These methods are analyzed empirically using numerical examples. Moreover, an example highlighting the fact that the gradient method fails to solve distributed problems in some circumstances is discussed under numerical results. A numerical implementation is used to demonstrate that the alternating direction method of multipliers can solve this particular problem, revealing its robustness compared with the gradient method. Finally, we conclude the paper with possible future research directions.


Introduction
Optimization is a mathematical discipline which determines the best possible solution corresponding to the optimum performance of a quantitatively well-defined system. The theory of optimization has been established as a desirable tool that is used in a wide range of disciplines, such as automatic control systems, estimation and signal processing, communications and networks, electronic circuit design, data analysis and modeling, statistics, and finance [1][2][3]. In the recent study [4], novelty search, a tool that is used in evolutionary and swarm robotics, was developed for use in global optimization. Formally, a mathematical optimization problem can be posed as follows:

minimize f_0(x) subject to x ∈ C, (1)

where f_0 is a real-valued objective function of the decision variables x ∈ R^n and C ⊆ R^n is the constraint set. However, in reality, it may be difficult or impossible to find analytic solutions to certain optimization problems. As a result, iterative methods that provide approximate solutions have been introduced by researchers. Algorithms that are used to solve optimization problems have been extensively analyzed mainly under centralized and decentralized architectures [5,6]. Centralized solution methods are not suitable for many communication networking problems, such as large-scale and data-intensive problems that demand distributed solutions. Consequently, the application of distributed optimization techniques, where subsystems coordinate to find a solution to the original problem, is of utmost importance. Intranets, the Internet, telecommunication networks, aircraft control systems, sensor networks, and electronic banking are some important examples of distributed systems. These systems consist of a large number of smaller subsystems, which work together to reach an optimal status of the process.
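As a minimal numerical illustration of problem (1) and of the iterative methods mentioned above, the following sketch solves a one-dimensional instance with a hypothetical objective f_0(x) = (x − 3)^2 and constraint set C = [0, 1] (both are illustrative choices, not taken from the paper). The unconstrained minimizer lies outside C, so the constrained optimum is the boundary point x* = 1.

```python
def f0(x):
    # Hypothetical objective for problem (1).
    return (x - 3.0) ** 2

def project_interval(x, lo, hi):
    # Projection onto C = [lo, hi].
    return max(lo, min(hi, x))

# A simple projected-gradient iteration, as one example of an iterative method.
x, alpha = 0.0, 0.1
for _ in range(200):
    grad = 2.0 * (x - 3.0)
    x = project_interval(x - alpha * grad, 0.0, 1.0)

print(round(x, 6))  # converges to the constrained minimizer 1.0
```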
This optimal status of the process in large-scale networked systems needs to be achieved without incurring errors or exceeding preset time limits for expected outcomes. Therefore, the study of well-established theoretical concepts together with empirical implementations of distributed optimization is critical. This gives us an opportunity to analyze currently existing distributed techniques and methods. In general, we may have many subsystems in a distributed optimization setting. We consider the following optimization problem with five subsystems as an example to provide a deeper explanation of distributed optimization:

minimize f_1(x_1, y) + f_2(x_2, y, r) + f_3(x_3, y) + f_4(x_4, y, z) + f_5(x_5, r, z)
subject to x_i ∈ X_i, i = 1, . . . , 5, r ∈ R, z ∈ Z, (2)

where X_1, X_2, X_3, X_4, X_5, R, and Z are subsets of R^n. In this problem, we can observe that there are three complicating variables y, r, and z. The variable y is shared by subsystems 1, 2, 3, and 4; the variable r is shared by subsystems 2 and 5; and the variable z is shared by subsystems 4 and 5. Figure 1 shows the associated decomposition structure of (2), and the related distributed problem, in which each subsystem keeps a local copy of every complicating variable it uses, can be stated as follows:

minimize f_1(x_1, y_1) + f_2(x_2, y_2, r_1) + f_3(x_3, y_3) + f_4(x_4, y_4, z_1) + f_5(x_5, r_2, z_2)
subject to y_1 = y_2 = y_3 = y_4, r_1 = r_2, z_1 = z_2. (3)

Here, we can observe that problem (3) is minimized by multiple users cooperatively. Hence, a distributed method is required to find a solution.
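The role of the consistency constraints in (3) can be seen on a toy instance. Suppose, purely as an illustrative assumption, that each of the four subsystems sharing y holds a local copy y_i with a quadratic cost f_i(y_i) = (y_i − a_i)^2. Under the constraint y_1 = y_2 = y_3 = y_4 = y, minimizing the sum reduces to minimizing Σ_i (y − a_i)^2, whose minimizer is the average of the a_i; without the constraints, each subsystem would simply pick its own y_i = a_i and the copies would disagree.

```python
# Hypothetical local data for the four subsystems sharing y.
a = [1.0, 2.0, 3.0, 6.0]

# Consensus solution of the coupled problem: minimizer of sum_i (y - a_i)^2.
y_star = sum(a) / len(a)

# Solving each local problem independently gives y_i = a_i; the copies
# disagree, which is exactly why the constraints y_1 = ... = y_4 are needed.
local_minimizers = list(a)

print(y_star)            # the consensus value
print(local_minimizers)  # the disagreeing local solutions
```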
Many networked systems cannot communicate exact information between subsystems, due to unavoidable errors that occur as a result of limited communication bandwidths and, sometimes, measurement errors [7,8]. Therein lies the importance of analyzing quantized distributed methods in real-life situations [9][10][11][12][13]. Although many quantized distributed methods have been analyzed, deeper investigation of quantization methods is still required.
We present the outline of our paper as follows. In Section 2, we discuss the preliminaries related to distributed optimization and primal and dual decomposition. Section 3 provides a general literature review of currently existing well-known distributed optimization methods. Next, in Sections 4, 5, and 6, we discuss the gradient method, the subgradient method, and the alternating direction method of multipliers (ADMM), respectively. In those sections, we discuss the theoretical concepts of the relevant methods as well as previous studies performed on them. In Section 7, we continue our discussion with distributed optimization under noise, to emphasize the importance of accounting for errors in distributed optimization methods. In Section 8, we provide our numerical results to discuss the convergence of the aforementioned distributed methods. Finally, in Section 9, we conclude our paper with possible future research directions.

Preliminaries
In this section, we discuss the concept of distributed optimization, and we introduce primal decomposition and dual decomposition, which play an important role in distributed optimization. Our introduction to primal and dual decomposition is mainly inspired by the lecture notes on decomposition methods by Boyd et al. [14].
Throughout the paper, we will use the following notation.
Notation. We let R, R^n, and R^n_+ represent the set of real numbers, n-dimensional Euclidean space, and the nonnegative orthant in n-dimensional Euclidean space, respectively. For x ∈ R^n, ‖x‖ denotes the Euclidean norm and [x]_X denotes the projection of x onto the set X ⊆ R^n. The set of n × m matrices is denoted by R^{n×m}. The transpose of a matrix A is given by A^T. ∇f represents the gradient of a scalar-valued function f.

Distributed Optimization.
Distributed optimization is an optimization process that is used in networked systems with a large number of users. This process enables the system to solve a global problem cooperatively even if there is no central controller available in the system. Compared with centralized techniques, distributed optimization has many considerable advantages. In distributed algorithms, nodes or users in the network share information only with the necessary parties. This fact improves cyber security and reduces communication cost. Furthermore, distributed techniques have the ability to handle problems even when the problem size is very large. These techniques also have the potential to increase the solution speed [15].
Distributed optimization algorithms solve large-scale and data-intensive problems in a wide range of application areas, such as communications [16][17][18][19], the electricity grid [20,21], large-scale multiagent systems [22,23], smart grids, wireless sensor networks [24], and statistical learning. Zhang and Sahraei-Ardakani have developed a fully distributed DC optimal power flow method that incorporates flexible transmission and discussed the effect of communication limitations on its convergence properties [25,26]. In [27], the authors have presented a study on finite-time consensus opinion dynamics together with an application to distributed optimization over digraphs.
Many distributed optimization algorithms are built on decomposition methods. Decomposition is an approach to solving a global problem by breaking it up into smaller subproblems and solving each of them separately. These subproblems are solved either in parallel or sequentially [6,14,[28][29][30]. Decomposition in optimization appears in early work on large-scale linear programs from the 1960s [31]. The simplest decomposition structure is available in block-separable problems. For example, a block-separable problem can be given as follows:

minimize f(x) = f_1(x_1) + f_2(x_2),

where the variable x = (x_1, x_2) splits into independent subvectors. In this form, we can minimize f_1(x_1) and f_2(x_2) separately in parallel and obtain the optimal value and optimal solution. However, this method is trivial and not particularly interesting, as many real-life problems appear in a more complex form than this [14]. The problem becomes more complicated, and more interesting, when the subvectors x_1 and x_2 are coupled. This situation can be handled by primal decomposition and dual decomposition, which are the most well-known decomposition methods currently available.
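The block-separable case above can be sketched in a few lines. The two quadratics below are hypothetical stand-ins for f_1 and f_2; each block is minimized on its own (here sequentially, but in a real system the two solves could run in parallel), and the optimal value of the full problem is just the sum of the blockwise optima.

```python
def argmin_quadratic(b):
    # Minimizer of a*(t - b)^2 for any a > 0 is t = b.
    return b

# Hypothetical separable objective f(x) = f1(x1) + f2(x2).
f1 = lambda t: 2.0 * (t - 1.0) ** 2
f2 = lambda t: 0.5 * (t + 3.0) ** 2

x1 = argmin_quadratic(1.0)    # minimize f1 alone
x2 = argmin_quadratic(-3.0)   # minimize f2 alone

optimal_value = f1(x1) + f2(x2)   # optimal value of the separable problem
print(x1, x2, optimal_value)
```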

Primal Decomposition.
Primal decomposition deals with complicating variables. Here, we consider a constrained minimization problem with m users as follows:

minimize Σ_{i=1}^m f_i(x_i, y) subject to x_i ∈ C_i, i = 1, . . . , m, y ∈ Y, (5)

where x = (x_1, x_2, . . . , x_m, y), C_i ⊆ R^n, Y ⊆ R^n, and the functions f_i represent real-valued objective functions of the individual users. The variable y is called the complicating variable, since it couples the system. When y is fixed, problem (5) decomposes into m smaller subproblems.
The subproblems are as follows:

minimize f_i(x_i, y) subject to x_i ∈ C_i, (6)

with optimal values φ_i(y). Then, the original problem (5) is equivalent to the problem

minimize Σ_{i=1}^m φ_i(y) subject to y ∈ Y, (7)

which is called the master problem in primal decomposition [14]. The original problem (5) can then be solved by solving the master problem (7), using a distributed algorithm under some well-defined assumptions on the individual primal objective functions f_i.
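A minimal primal-decomposition sketch, in the spirit of [14], with hypothetical quadratic subproblems f_i(x_i, y) = (x_i − c_i)^2 + (y − a_i)^2: for fixed y, subproblem i is solved in closed form by x_i = c_i with optimal value φ_i(y) = (y − a_i)^2, and the master problem minimizes Σ_i φ_i(y) over y by a gradient step on the subproblem sensitivities.

```python
# Hypothetical problem data (three users) and an untuned master step size.
c = [1.0, -2.0, 0.5]
a = [1.0, 2.0, 6.0]
alpha = 0.1

y = 0.0
for _ in range(300):
    # Solve the subproblems for fixed y (here: closed form, x_i = c_i).
    x = list(c)
    # Master gradient: sum of the sensitivities d(phi_i)/dy = 2*(y - a_i).
    grad = sum(2.0 * (y - ai) for ai in a)
    y -= alpha * grad

print(round(y, 6))  # approaches the master optimum, mean(a)
```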

Dual Decomposition.
Here, we consider the same problem (5) discussed under primal decomposition, with only two users. Then, the objective function is f(x) = f_1(x_1, y) + f_2(x_2, y). The problem can be rearranged by introducing new variables y_1 and y_2 as follows [14]:

minimize f_1(x_1, y_1) + f_2(x_2, y_2) subject to x_1 ∈ C_1, x_2 ∈ C_2, y_1 = y_2. (8)

According to this new arrangement, the objective function f is separable, and we can apply decomposition to the dual problem. The Lagrangian of (8) is given by

L(x_1, y_1, x_2, y_2, λ) = f_1(x_1, y_1) + f_2(x_2, y_2) + λ^T(y_1 − y_2). (9)

The related dual function is given by

g(λ) = g_1(λ) + g_2(λ), (10)

which is accompanied by the subproblems

g_1(λ) = inf over x_1 ∈ C_1, y_1 of ( f_1(x_1, y_1) + λ^T y_1 ),
g_2(λ) = inf over x_2 ∈ C_2, y_2 of ( f_2(x_2, y_2) − λ^T y_2 ). (11)

Then, the dual problem of (8) is given by

maximize g(λ). (12)

This is called the master problem in dual decomposition. It can be solved by using an iterative method such as the subgradient method, which will be discussed in Section 5. Although we are able to solve the dual problem and find dual optimal measures, we still cannot guarantee that we can find primal optimal measures without introducing some acceptable conditions on the primal objective function. For example, if f_1 and f_2 are strictly convex, then the primal variables x_1, x_2, y_1, and y_2 found by solving the two subproblems g_1 and g_2 are guaranteed to converge to the optimal solution of the primal problem (8) [14].
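The dual-decomposition loop for (8) can be sketched on a deliberately small instance: take hypothetical scalar objectives f_1(y_1) = (y_1 − a_1)^2 and f_2(y_2) = (y_2 − a_2)^2 coupled by y_1 = y_2, with the x_i blocks dropped to keep the sketch minimal. Each subproblem minimizes its Lagrangian term in closed form, and the master performs (sub)gradient ascent on the dual, whose gradient is the constraint residual y_1 − y_2.

```python
# Hypothetical data and an assumed (untuned) dual step size.
a1, a2 = 1.0, 5.0
alpha = 0.5
lam = 0.0                       # dual variable for the constraint y1 = y2

for _ in range(200):
    y1 = a1 - lam / 2.0         # argmin of f1(y1) + lam * y1
    y2 = a2 + lam / 2.0         # argmin of f2(y2) - lam * y2
    lam += alpha * (y1 - y2)    # gradient ascent on the dual function

print(round(y1, 6), round(y2, 6))  # both approach the consensus value
```

Since f_1 and f_2 are strictly convex, the primal iterates recovered from the subproblems converge to the primal optimum, consistent with the condition stated above.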

A General Literature Review on Distributed Methods for Solving Optimization Problems
In this section, we provide a general overview of currently existing distributed optimization methods. A detailed analysis will be given in later sections with more technical details. Most of the existing studies on distributed optimization problems analyze and discuss solution methods for the case in which the optimization problem is convex. Convex optimization problems can be solved very reliably and efficiently using interior-point methods, and most of the theory related to convex optimization has already been developed. Therefore, recognizing or formulating a problem as a convex optimization problem gives us a great advantage. In the texts [5,6], the authors have provided readers with a very good background for developing a working knowledge of convex optimization in order to recognize, formulate, and solve convex optimization problems. For example, if we consider a nonconvex constrained optimization problem, the associated dual problem is always concave (equivalently, its negative is always convex). Hence, in some situations, the original problem can be approached through the dual problem, which provides an easier environment to work with because of the convexity. We have observed that the currently available state-of-the-art distributed methods for solving optimization problems are gradient-based algorithms, subgradient-based algorithms, and their variants, such as ADMM [30,[32][33][34][35][36][37][38]. The gradient method is generally applied to unconstrained optimization problems. In 1970, Ramsay studied gradient methods for optimizing nonlinear functions of several variables that cause difficulties when second-derivative approaches are used [39]. In the recent study [40], Nedić et al. have focused on solving a distributed convex optimization problem using "push-pull gradient methods."
They have given the method this name because the agents in the network push gradient information to their neighbors, while decision variable information is pulled by the neighbors throughout the method. In [41], Calamai and Moré have studied the convergence properties of the projected gradient method for linearly constrained problems, which are useful in large-scale problems. The projected gradient method is a variant of the gradient method that is used in constrained optimization. The subgradient method can be considered a generalization of the gradient method and is useful for optimizing nondifferentiable functions. In [9-12, 22, 42], subgradient methods are used to solve large-scale distributed problems that deal with the sum of a large number of convex local objective functions. References [24,[43][44][45] are some studies that have focused on the effects of constraints, and they have presented projected subgradient algorithms to solve constrained optimization problems. In [44], Amini and Yousefian have studied a very important class of bilevel convex optimization problems that are often used for large-scale data processing in machine learning and neural networks. The authors in [45] have studied the binary iterative hard thresholding algorithm, a state-of-the-art recovery algorithm in one-bit compressive sensing that makes use of the projected subgradient method.
ADMM is also a well-suited method for distributed convex optimization over the large-scale networked systems arising in statistics and machine learning. ADMM was first proposed by Gabay, Mercier, Glowinski, and Marrocco [46] in the mid-1970s. In the recent study [47], Xiao et al. have presented a distributed and scalable algorithm for managing residential demand response programs using ADMM. They have shown through their simulation studies that the proposed method can reduce customers' electricity bills and peak load. The authors in [48] have presented a distributed ADMM for solving the direct current dynamic optimal power flow problem with carbon emission trading. In [49], Hajinezhad and Shi proposed an ADMM-related algorithm to study a class of nonconvex nonsmooth optimization problems with bilinear constraints, which are widely used in machine learning and signal processing application domains. The study [50] has presented a modified distributed ADMM to handle nonconvex optimization problems with discrete control variables.

The Gradient Method
Let us consider an unconstrained minimization problem as follows:

minimize f(x), (13)

where f: R^n ⟶ R is differentiable and x ∈ R^n. Then, the gradient method for solving optimization problems of form (13) can be expressed by the following iterative process, which starts from some initial point x_0:

x_{k+1} = x_k − α_k ∇f(x_k), (14)

where α_k ≥ 0 is the step size. The convergence of method (14) can be discussed under various considerations, using the theorems presented in [51].
Theorem 1 (see [51]). Suppose that α_k = α (a constant step size) in (14). Let f(x) be differentiable on R^n, let ∇f be Lipschitz continuous with constant L, and let f(x) be strongly convex with constant l. Then, for 0 < α < 2/L, method (14) converges to the unique global minimum point x* at the rate of a geometric progression, that is, ‖x_k − x*‖ ≤ c q^k for some c > 0 and q ∈ (0, 1). Next, the following theorem shows the convergence of (14) for an even smaller class of functions.
Theorem 2 (see [51]). Let f(x) be strongly convex and twice differentiable. Suppose that

l I ⪯ ∇²f(x) ⪯ L I for all x ∈ R^n.

Then, for 0 < α < 2/L,

‖x_{k+1} − x*‖ ≤ q ‖x_k − x*‖, where q = max{|1 − αl|, |1 − αL|} < 1.

Moreover, when α = 2/(L + l), q is minimal and equal to q* = (L − l)/(L + l). The proofs of Theorems 1 and 2 are given in [51], and convergence to a local minimum point of f(x) is also discussed in the same text, under Theorem 4 of Section 1.4. We discuss the convergence of the gradient method using a numerical example in the numerical results section (Section 8). In Section 8.1, our focus is the convergence results obtained with the use of primal decomposition.
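The rate in Theorem 2 can be observed numerically on a hypothetical quadratic f(x) = (l/2)x_1² + (L/2)x_2², which is strongly convex with constant l and has an L-Lipschitz gradient. With the rate-optimal step α = 2/(L + l), the error should contract by exactly q* = (L − l)/(L + l) per iteration on this example.

```python
l, L = 1.0, 9.0
alpha = 2.0 / (L + l)           # the rate-optimal constant step size
q_star = (L - l) / (L + l)      # predicted contraction factor

x = [1.0, 1.0]
errors = []
for _ in range(20):
    grad = [l * x[0], L * x[1]]                        # gradient of the quadratic
    x = [x[0] - alpha * grad[0], x[1] - alpha * grad[1]]
    errors.append(max(abs(x[0]), abs(x[1])))           # distance to x* = 0

# Observed per-step contraction matches q* on this example.
ratio = errors[1] / errors[0]
print(round(ratio, 6), round(q_star, 6))
```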
ere are many early studies done on gradient methods [39,41,52,53]. Authors in [53] had combined gradient methods with back propagation methods for neural networks to discuss the optimization of weights of multilayer neural networks. In the study [52], authors have proposed two new step sizes for the classical-steepest descent method, where α k in method (14) is used as α k � argmin α f(x k − α∇f (x k )). e most interesting fact regarding these new step sizes is that they require less computational effort than the classical-steepest descent method. However, these studies have not given enough attention and emphasis on distributed optimization techniques, which have become crucial to be analyzed in many application domains.
Some recent work that relies on gradient methods can be found in [8,40,54,55]. In these studies, the gradient method has been applied with the use of distributed techniques. In [8], the authors have investigated fundamental properties of distributed optimization based on gradient methods, where gradient information is communicated using a limited number of bits. It is a well-known fact that message exchange between subsystems is a common phenomenon in distributed optimization settings. However, perfect message exchange between subsystems is not possible, due to the limited communication bandwidths between them. Therefore, quantized information tends to be exchanged between users in networked systems, which has led to new findings on quantized distributed techniques. The study [8] is a very good initiative in this regard. This piece of work has studied a general class of quantized gradient methods in which the gradient direction is approximated by a finite quantization set, in order to solve a constrained convex optimization problem. The authors have considered optimization problems of the following form:

minimize f(x) subject to x ∈ X, (18)

where f is convex and differentiable with L-Lipschitz continuous gradient, x ∈ R^n, X is a closed and convex set, and the optimal solution set X* is nonempty and bounded. To solve problem (18), they have used the projected gradient method

x_{k+1} = [x_k − α_k d_k]_X, (19)

where d_k is quantized gradient information coded using a limited number of bits. The authors have proposed two types of quantization schemes, namely, binary quantization and proper quantization. (a) Binary Quantization. In this scheme, each component of the gradient direction is coded with a single bit, so the quantization set is finite. A convergence proof of method (19) was given under this binary quantization when X = R^n and X = R^n_+. These convergence results are very important, as they can deal with a dual problem of form (18) associated with equality- and inequality-constrained primal problems.
(b) Proper Quantization. When the binary quantization discussed above is used to solve TCP problems, the related quantized gradients are transmitted using n bits. There are many applications where the dual problem is maintained by an individual coordinator [18,19]. Therefore, it is worth analyzing whether it is possible to use fewer than n bits when an individual coordinator handles the problem. This fact motivated the authors of [8] to discuss proper quantization, and they have established their results using two definitions tailored to this setting.
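The flavor of the quantized iteration (19) can be sketched with a sign-based binary quantizer, which is an illustrative assumption here (the exact quantization sets of [8] are more refined): each gradient component is coded with a single bit, and a diminishing step size absorbs the quantization error. The toy problem below minimizes ‖x − c‖² over X = R^n_+ for hypothetical data c.

```python
def sign(v):
    return 1.0 if v >= 0 else -1.0

c = [2.0, -1.0]                 # hypothetical data; the constrained optimum is [2, 0]
x = [0.0, 0.0]

for k in range(1, 2001):
    grad = [2.0 * (x[i] - c[i]) for i in range(2)]
    d = [sign(g) for g in grad]            # binary quantization: 1 bit per component
    alpha = 1.0 / k                        # diminishing, nonsummable step sizes
    x = [max(0.0, x[i] - alpha * d[i]) for i in range(2)]  # projection onto R^n_+

print(round(x[0], 2), x[1])  # first coordinate near 2, second clipped at 0
```

Despite each direction carrying only one bit per coordinate, the iterates settle near the constrained optimum, which is the qualitative behavior the convergence results above formalize.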
The authors in [54] have introduced two measures of the communication complexity of dual decomposition, which help to identify the communication overhead required by limited communication networks. The first measure determines the smallest number of bits needed to find a solution to within a given accuracy, while the second quantifies the best possible solution accuracy when a fixed number of bits is communicated. Furthermore, in the same work, the authors have studied a quantization scheme (introduced as the primal-feasible quantization scheme) that guarantees primal feasibility at each iteration of their method.

The Subgradient Method
The subgradient method is used to minimize nondifferentiable convex problems. Nondifferentiable or nonsmooth functions are one important class of problems arising in many applications of mathematical programming, such as game theory, multicriteria models, nonlinear programming problems, optimal control problems in continuous or discrete time, and integer and mixed-integer programming problems [56]. Subgradient methods are first-order methods. Their performance depends highly on problem scaling and conditioning, whereas Newton's method and interior-point methods do not [57].
Before entering into the topic of subgradient methods, we would like to discuss subgradients, which can be introduced as a generalization of gradients. When a function is nondifferentiable, the gradient of the function at nondifferentiable points cannot be found uniquely. Therefore, a well-defined way to express the slope of the function at those nondifferentiable points is required, mainly in optimization theory. Thus, a good understanding of subgradients is essential in the field of optimization theory. Reference [56] gives a very good exposition of the concept of subgradients, and it provides many important theoretical aspects related to them. Polyak's text [51] and the text [6] of Bertsekas are two other good references that discuss subgradients and subgradient methods. Next, we define a subgradient of a convex function.
A vector g ∈ R^n is a subgradient of a convex function f at x if f(y) ≥ f(x) + g^T(y − x) for all y. The set of all subgradients of f at x is called the subdifferential of f at x and is denoted by ∂f(x). If f is differentiable at x, then its subgradient at x is unique, and it is the gradient of f at x.

The Basic Subgradient Method.
We consider the same form of the unconstrained optimization problem (13) considered in Section 4. e objective function f(x) is still convex but not necessarily differentiable.
Then, the subgradient method used to solve this problem can be given by the following iterative sequence starting at some initial point x_0:

x_{k+1} = x_k − α_k g_k, (21)

where x_k is the kth iterate, g_k is any subgradient of f at x_k, and α_k > 0 is the step size of the kth iteration. The subgradient method (21) can be considered an extension of the gradient method (14). The difference is that, in each iteration, we use a subgradient g_k of the function f(x) at x_k instead of ∇f(x_k) in (14). Moreover, step size selection in the subgradient method is quite different from that in the gradient method. In [57], Boyd has given five basic step size rules, namely, constant step size, constant step length, square summable but not summable, nonsummable diminishing, and nonsummable diminishing step length. From these five rules, we present three common ones as follows: (1) Constant step size: α_k = α, a positive constant independent of k. (2) Square summable but not summable: the step sizes satisfy Σ_{k=1}^∞ α_k² < ∞ and Σ_{k=1}^∞ α_k = ∞; for example, α_k = 1/k. (3) Nonsummable diminishing: the step sizes satisfy lim_{k→∞} α_k = 0 and Σ_{k=1}^∞ α_k = ∞. The above choices for the step size α_k do not depend on data computed during the subgradient algorithm. This differs from the step size rules found in standard descent methods, which use the current point and search direction. Good discussions of descent methods can be found in Chapter 9 of [5] and Chapter 8 of [58]. Many other choices for the step size α_k exist in addition to those mentioned above. In [51], Polyak has shown that the subgradient method (21) cannot converge rapidly under the diminishing nonsummable step size rule. Therefore, the author has described another variant of the subgradient method, introducing a different step size rule that depends on f*, the optimal value of f(x). We introduce this step size in Theorem 4.
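Iteration (21) can be sketched on the nondifferentiable function f(x) = |x − 2| (a hypothetical example), with the square-summable-but-not-summable rule α_k = 1/k. Because the subgradient method is not a descent method, we track the best value found so far, as is done in the convergence analysis below.

```python
def f(x):
    return abs(x - 2.0)

def subgradient(x):
    # Any subgradient of |x - 2|: the sign of (x - 2); 0 is valid at the kink.
    if x > 2.0:
        return 1.0
    if x < 2.0:
        return -1.0
    return 0.0

x = -3.0
f_best = f(x)
for k in range(1, 5001):
    x = x - (1.0 / k) * subgradient(x)   # iteration (21) with alpha_k = 1/k
    f_best = min(f_best, f(x))           # best objective value seen so far

print(round(f_best, 3))  # approaches the optimal value 0
```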
Next, we discuss the convergence of the subgradient method (21) under Boyd's step size rules mentioned above. We use the following assumptions:

Assumption 1. The optimal set X*, the set of minimizers of problem (13), is nonempty.
Assumption 2. The subgradient norms are bounded: ‖g_k‖ ≤ G for all k.
Assumption 3. A number K such that ‖x_0 − x*‖ ≤ K is known, where x* ∈ X* and x_0 is the initial point of the algorithm.

Theorem 3 (see [57]). Let Assumptions 1, 2, and 3 hold, and let f_best^k = min_{i=1,...,k} f(x_i). Then, in method (21), the following inequality holds:

f_best^k − f* ≤ (K² + G² Σ_{i=1}^k α_i²) / (2 Σ_{i=1}^k α_i).
The proof of Theorem 3 can be found in Section 3.2 of [57]. Using this theorem, one can show that the subgradient method converges to within some range of the optimal value f* for the constant step size and constant step length rules. For the other step size variants, square summable but not summable, nonsummable diminishing, and nonsummable diminishing step lengths, the subgradient method converges exactly to the optimal value without incurring any error. We discuss the convergence of the basic subgradient method empirically in the numerical results section with the three step size rules presented above. In Section 8.2, we use a constrained optimization problem, and we dedicate our attention to discussing convergence using dual decomposition. Next, we state the following theorem, which gives the convergence of the subgradient method using Polyak's step length.
Theorem 4 (see [51]). Let the set of minimizers X* of problem (13) (with nondifferentiable f) be nonempty, and let the step size be chosen as α_k = (f(x_k) − f*)/‖g_k‖². Then, in method (21), f(x_k) → f* and x_k converges to a point in X*. The proof of the above theorem is given by Polyak in his book [51]. Now, we discuss and analyze some studies done on subgradient methods. In [22], the authors have considered a subgradient method for optimizing a sum of convex objective functions corresponding to multiple agents. This work analyzes large-scale networked systems, where it is essential to design decentralized resource allocation methods, since centralized solution methods are not suitable.
This paper has considered a scenario where agents cooperatively minimize a common additive cost. The corresponding optimization problem can be posed as follows:

minimize Σ_{i=1}^m f_i(x), (25)

where the function f_i: R^n ⟶ R represents the cost function of agent i, which is convex and not necessarily differentiable, and x ∈ R^n is the decision vector. To analyze this problem, the authors have proposed the following subgradient method:

x_i(k + 1) = Σ_{j=1}^m w_j^i(k) x_j(k) − α_i(k) d_i(k), (26)

where w_j^i represents the weight that agent i assigns to the information x_j received from a neighboring agent j, and the scalar α_i(k) > 0 represents the step size used by agent i. The vector d_i(k) is a subgradient of agent i's objective function f_i(x) at x = x_i(k). Next, to analyze the convergence of method (26), they have used a different representation of the method, in which each iterate x_i(k + 1) can be estimated using the weights w_j^i(s) and the estimates x_j(s), where i, j = 1, . . . , m and s ≤ k. In this study, the authors have considered an unconstrained optimization problem, but in general, this problem can be viewed in a more advanced setting, in the presence of constraints. This fact motivates readers to extend this seminal work by Nedić and Ozdaglar along a different path of research, which will lead to a different line of convergence analysis. Furthermore, their model assumes that agents can exchange exact information, which is not possible in practice due to limited communication bandwidths. Therefore, the information is usually quantized before being sent, and quantization is considered to reduce the communication cost in networked control systems [59][60][61].
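A simplified variant of iteration (26) can be sketched on a hypothetical three-agent problem f(x) = Σ_i |x − a_i| over a complete communication graph with uniform weights w_j^i = 1/3. For brevity, this sketch evaluates each local subgradient at the mixed (averaged) value rather than at x_i(k), a common simplification; the minimizer of the sum of absolute values is the median of the a_i.

```python
a = [0.0, 1.0, 5.0]              # hypothetical local data; minimizer is median(a) = 1
x = [10.0, -4.0, 3.0]            # initial estimates of the three agents

def subgrad(x_i, a_i):
    # A subgradient of |x - a_i| at x_i.
    return 1.0 if x_i > a_i else (-1.0 if x_i < a_i else 0.0)

for k in range(1, 5001):
    avg = sum(x) / 3.0                      # consensus (mixing) step, uniform weights
    alpha = 1.0 / k                         # diminishing step size
    x = [avg - alpha * subgrad(avg, a[i]) for i in range(3)]

print([round(v, 2) for v in x])  # all agents settle near the optimum
```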
In [11], the authors have considered the distributed subgradient method discussed in [22], and they have presented improved convergence results. Furthermore, they have shown that upper bounds on the difference between the estimated objective function value and the exact optimal value of the problem have a polynomial dependence on the number of agents m, using results from their prior work [62]. We can view these bounds as improved versions of the error bounds obtained in the studies [22,42], which involve an exponential dependence on m. Moreover, the authors have studied the subgradient method when the communicated information is quantized, to address the issue that perfect message exchange between agents cannot be performed. Some other works in the same line of research are [9, 10, 12].

Projected Subgradient Method.
The projected subgradient method is an extension of the basic subgradient method used for constrained optimization problems. Consider the optimization problem of the form

minimize f(x) subject to x ∈ X, (27)

where f and X are convex. Then, the projected subgradient method can be given by

x_{k+1} = [x_k − α_k g_k]_X, (28)

where g_k is any subgradient of f at x_k. Convergence of method (28) can be attained under the same step size rules described for the basic subgradient method [57]. The authors in [43] have presented distributed algorithms to solve a constrained consensus problem and a constrained optimization problem. They have used a distributed projected subgradient method to solve the constrained optimization problem, which consists of minimizing a sum of convex local objective functions. They have shown that their method converges to the optimal solution under the square summable but not summable step size rule. In [24], Madan and Lall have proposed two distributed projected subgradient methods to find an optimal routing flow that maximizes the network lifetime, in partially and fully decentralized manners. In their solution, the subgradient methods have been applied to the dual problem. We noticed that most studies on distributed optimization have used the original primal objective function in the optimization process and have not shown much interest in duality theory, which provides many advantages in solving constrained optimization problems. Under these circumstances, Madan and Lall's work [24] adds immense value to the study of distributed optimization.
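Iteration (28) can be sketched on a hypothetical instance: minimize the nondifferentiable function f(x) = |x_1 − 3| + |x_2 + 2| over the box X = [0, 1]². The unconstrained minimizer (3, −2) is infeasible, so the method should settle at the corner x* = (1, 0), where f(x*) = 4.

```python
def sgn(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def project_box(x):
    # Projection [.]_X onto X = [0, 1]^2, computed componentwise.
    return [min(1.0, max(0.0, v)) for v in x]

x = [0.5, 0.5]
for k in range(1, 2001):
    g = [sgn(x[0] - 3.0), sgn(x[1] + 2.0)]   # a subgradient of f at x
    alpha = 1.0 / k                           # diminishing step size
    x = project_box([x[0] - alpha * g[0], x[1] - alpha * g[1]])

print([round(v, 3) for v in x])  # settles at the feasible corner [1, 0]
```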

Alternating Direction Method of Multipliers
ADMM is a simple but powerful method used in distributed optimization [32]. ADMM is a variant of the augmented Lagrangian method of multipliers that exploits the decomposability of dual ascent. In [32], the augmented Lagrangian and the method of multipliers are discussed under the following equality-constrained optimization problem:

minimize f(x) subject to Ax = b, (29)

where x ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, and f is convex. The augmented Lagrangian for problem (29) is given by

L_p(x, λ) = f(x) + λ^T(Ax − b) + (p/2)‖Ax − b‖², (30)

where p > 0 is known as the penalty parameter. The corresponding dual function is given by g_p(λ) = inf_x L_p(x, λ). The authors have used the gradient method to minimize the negative of g_p(λ), with the penalty parameter p as the step size. The method of multipliers can be viewed as a more robust version of the dual ascent method, and it yields convergence under more general conditions than dual ascent. However, the concern raised in [32] is that even when f is separable, L_p is not separable, because of the quadratic penalty term. When L_p is not separable, the minimization cannot be carried out in parallel, and hence the method of multipliers cannot be used in dual decomposition. Therefore, an alternative way of viewing problem (29) is needed, and consequently ADMM has been introduced to address this issue. ADMM is well suited for distributed optimization settings that consist of large-scale problems. In [32], the authors have considered the following variation of problem (29), to view it in a separable form, which has then led to the introduction of ADMM:

minimize f(x) + g(y) subject to Ax + By = c, (31)

where x ∈ R^n, y ∈ R^m, A ∈ R^{q×n}, B ∈ R^{q×m}, and c ∈ R^q. Moreover, f and g are convex functions. The distributed algorithm for ADMM is given in Algorithm 1.
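The alternating structure of ADMM for (31) can be sketched in its simplest consensus form, taking f(x) = (x − a)², g(y) = (y − b)², A = 1, B = −1, and c = 0 (i.e., the constraint x = y); the data a, b and penalty p are illustrative choices. With scaled dual variable u, both quadratic subproblems have closed-form minimizers, and the dual update steps along the constraint residual.

```python
a, b, p = 1.0, 5.0, 2.0
x = y = u = 0.0                  # u is the scaled dual variable

for _ in range(100):
    x = (2.0 * a + p * (y - u)) / (2.0 + p)   # x-update: argmin f(x) + (p/2)(x - y + u)^2
    y = (2.0 * b + p * (x + u)) / (2.0 + p)   # y-update: argmin g(y) + (p/2)(x - y + u)^2
    u = u + (x - y)                            # dual update on the residual x - y

print(round(x, 6), round(y, 6))  # both approach the consensus value (a + b) / 2
```

Note how each update touches only one block of variables, which is exactly the separability that the plain method of multipliers loses through the quadratic penalty.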
There are many early studies on the method of multipliers and ADMM [63,64], and some recent studies on ADMM can be found in [65][66][67][68][69]. In [65], Erseghe proposed a fully distributed algorithm for optimal power flow using ADMM. In that paper, the author introduced a variation of ADMM Algorithm 1 under assumptions such as g(y) = 0, with y contained in a linear space with an associated orthogonal projector, and with certain assumptions on the initial choices. In the study [66], the authors presented a decomposed solution approach with ADMM to solve a cost minimization problem whose objective consists of energy and battery degradation costs. This work used a modified version of ADMM, which helps to reduce the computation cost and ensures the stability of the solution. Most researchers who have worked on ADMM, including those mentioned above, have paid little attention to the noise that can be embedded in their models due to different types of errors occurring in practice, for example, due to limited communication bandwidths. This fact motivates further work in this direction with ADMM.

Distributed Optimization with Noise
The distributed methods for solving optimization problems can be applied in their pure form only if errors and inaccuracies are fully avoided, which is hardly possible in the real world. For example, errors or noise can occur due to inexact computation or measurement of subgradients and function values, sparsification [70], and quantization [8,71]. The noise can be deterministic or random according to the behaviour of the application domain. Most real world problems consist of large-scale networked systems that cooperatively solve a common objective function. In such situations, subsystems have to exchange their private information with neighboring subsystems during the optimization process. However, the subsystems may not be able to communicate exact information, for reasons such as security measures and communication overheads. Therefore, it is very important to analyze distributed methods with noise imposed on the system.

Distributed Methods with Noise for Optimizing Smooth Functions. In distributed methods for optimizing differentiable (smooth) functions, we always deal with the computation of a gradient, and instead of the exact value of the gradient ∇f(x_k), we may have it computed with error:

y_k = ∇f(x_k) + r_k, (32)

where r_k is the noise. In chapter 4 of [51], Polyak discusses the four most important classes of noise:

(1) Absolute deterministic noise: r_k is deterministic and satisfies the boundedness condition ‖r_k‖ ≤ ε
(2) Relative deterministic noise: r_k is deterministic and satisfies the condition ‖r_k‖ ≤ ε‖∇f(x_k)‖
(3) Absolute random noise: r_k is random, independent, and centered with bounded variance, that is, E[r_k] = 0 and E[‖r_k‖²] ≤ σ²
(4) Relative random noise: r_k is random, centered, and satisfies E[‖r_k‖²] ≤ τ²‖∇f(x_k)‖²

In the above classes of noise, ε, σ, and τ are positive constants. In the same text [51], the convergence of the gradient method (14) is discussed when the gradient is computed with error as given in (32). The convergence properties of the gradient method are analyzed under all four types of errors mentioned above, under the assumption that the objective function is strongly convex with a gradient satisfying a Lipschitz condition.
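The behaviour under absolute random noise (class 3) can be seen on a small strongly convex quadratic. The objective, noise level, and step size below are illustrative assumptions; with a constant step size the iterates do not converge exactly but settle in a noise-dependent neighborhood of the minimizer.

```python
import numpy as np

# Gradient method with absolute random noise: the gradient is observed as
# grad f(x_k) + r_k with E[r_k] = 0 and bounded variance (class 3 above).
rng = np.random.default_rng(0)
Q = np.diag([1.0, 4.0])          # strongly convex: f(x) = 1/2 x^T Q x
grad = lambda x: Q @ x           # exact gradient, Lipschitz with L = 4
alpha, sigma = 0.1, 0.01         # constant step size and noise level

x = np.array([5.0, -5.0])
for k in range(500):
    r = sigma * rng.standard_normal(2)   # centered noise, variance sigma^2
    x = x - alpha * (grad(x) + r)

# the iterates settle in a small neighborhood of the minimizer x* = 0
# whose size scales with sigma, rather than converging exactly
```

Shrinking the step size over iterations trades this residual error against slower transient convergence, which is the trade-off analyzed in [51].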
Most of the related literature on solving optimization problems with gradient like methods in the presence of noise either relies on boundedness assumptions on the objective function and the decision variable or shows only lim inf_{k→∞} ‖∇f(x_k)‖ = 0 [72][73][74][75]. The authors of [55] discussed convergence results for the following method, removing various boundedness conditions such as boundedness from below of f, boundedness of ∇f(x_k), or boundedness of x_k:

x_{k+1} = x_k + α_k(s_k + w_k), (33)

where s_k represents a descent direction of a function f: R^n → R and w_k is a deterministic or stochastic error. They first focus on the above method with deterministic error, with w_k satisfying the condition

‖w_k‖ ≤ α_k(q + p‖∇f(x_k)‖),

where p and q are some positive scalars.
Then, the convergence of method (33) was obtained using the following theorem.

Theorem 5 (see [55]). Suppose that s_k in method (33) is a descent direction satisfying, for some positive scalars c_1 and c_2 and for all k,

c_1‖∇f(x_k)‖² ≤ −∇f(x_k)^T s_k, ‖s_k‖ ≤ c_2(1 + ‖∇f(x_k)‖).

Then, for α_k > 0 chosen according to the square summable but not summable step size rule, method (33) is guaranteed to converge to the optimal solution.
Next, the authors obtained convergence results for minimizing a sum of a large number of functions using incremental gradient methods, and they also considered stochastic gradient methods. In the recent study [68], the authors analyzed the convergence of distributed ADMM for consensus optimization in the presence of random error. They presented lower and upper bounds on the mean squared steady state error of the algorithm when the individual objective functions are strongly convex and the gradients are Lipschitz continuous. Furthermore, they showed that the steady state error of their noisy ADMM algorithm is bounded when the random error is bounded and the individual objectives are proper, closed, and convex.

Distributed Methods with Noise for Optimizing Nonsmooth Functions.
In chapter 5 of [51], Polyak introduces the well-known subgradient method for optimizing nondifferentiable (nonsmooth) problems with noise:

x_{k+1} = x_k − α_k(g_k + r_k), (36)

where g_k is a subgradient of f at x_k and r_k is the noise imposed on the subgradient. The convergence results of the noisy subgradient method (36) are discussed by the same author under the same classes of noise discussed in the previous subsection. In the early study [76], Polyak studied minimization methods for a nonlinear function with nonlinear constraints when the values of the objective function, the constraints, and the gradients are computed with errors. In [77], the authors studied the effect of noise on subgradient methods for convex constrained optimization problems of form (27). They discussed the convergence properties of the following projected subgradient method when the noise is deterministic and bounded:

x_{k+1} = P_X(x_k − α_k g̃_k), (37)

where g̃_k is an approximate subgradient of the form g̃_k = g_k + r_k, r_k is the noise, and g_k is an e_k-subgradient of f at x_k for some e_k ≥ 0. Convergence properties of method (37) are analyzed under three step size rules, namely, the constant step size rule, the diminishing step size rule, and the dynamic step size rule given by

α_k = (f̃(x_k) − f_k^lev)/‖g̃_k‖²,

where f̃(x_k) is the error-involved function value and f_k^lev is a target level approximating the optimal value f^*. First, the convergence of method (37) is obtained when the constraint set is compact. Second, the authors analyze their method for a convex objective function that has a sharp set of minima. The important results observed by the authors are as follows: (a) in the first scenario, the method converges to the optimal value within some tolerance, and (b) in the second scenario, the method converges exactly to the optimal value without any error.
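The first scenario above, convergence up to a noise-dependent tolerance over a compact set, can be reproduced on a scalar toy problem. The objective, the alternating-sign noise sequence, and the 1/k step size are illustrative assumptions, not the test problem of [77].

```python
import numpy as np

# Projected subgradient method (37) with bounded deterministic noise on the
# subgradient, over the compact constraint set X = [-1, 1].
f = lambda x: np.abs(x - 0.5)              # nonsmooth, minimized at x* = 0.5
subgrad = lambda x: np.sign(x - 0.5)
project = lambda x: np.clip(x, -1.0, 1.0)
eps = 0.05                                 # bound on the deterministic noise

x, f_best = -1.0, np.inf
for k in range(1, 5001):
    r = eps * (-1.0) ** k                  # deterministic noise, |r| <= eps
    x = project(x - (subgrad(x) + r) / k)  # diminishing step size 1/k
    f_best = min(f_best, f(x))

# the method reaches the optimal value f* = 0 only up to a tolerance that
# depends on the noise bound eps; f_best is small but need not be zero
```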
It is very important to pay attention to stochastic optimization processes, since many practical problems cannot be viewed as deterministic. Some studies in this particular area can be found in [76,78]. The authors of [78] studied stochastic quasigradient methods, which allow solving optimization problems without calculating exact values of the objectives and constraints. In [76], a general convex problem with noise was solved under the following assumptions:

(i) The objective function and the inequality constraint functions are convex and continuous
(ii) The feasible set is a convex, closed, and bounded set
(iii) The Slater condition holds
(iv) All noises have mean zero with bounded variance and are independent at different points

Given initial y and λ. Set k = 1.
while (stopping criterion not met)
  (1) x minimization step: x^{k+1} = argmin_x L_p(x, y^k, λ^k)
  (2) y minimization step: y^{k+1} = argmin_y L_p(x^{k+1}, y, λ^k)
  (3) dual variable update: λ^{k+1} = λ^k + p(Ax^{k+1} + By^{k+1} − c)
  k = k + 1
ALGORITHM 1: Alternating direction method of multipliers.

Numerical Results
In this section, we discuss the convergence of the gradient method, subgradient method, and ADMM empirically by using some numerical examples.

Example 1 (Gradient Method: Primal Decomposition).
Here, we consider an unconstrained minimization problem with two users as follows:

minimize_{x_1, x_2, y} f_1(x_1, y) + f_2(x_2, y), where f_i(x_i, y) = [x_i^T y^T] A_i [x_i^T y^T]^T for i = 1, 2. (39)

Here, x_1, x_2 ∈ R^{n_1}, y ∈ R^{n_2}, and A_1 ∈ R^{(n_1+n_2)×(n_1+n_2)} and A_2 ∈ R^{(n_1+n_2)×(n_1+n_2)} are positive definite matrices. We use primal decomposition and analyze the convergence of the gradient method (14) for this problem with the use of Theorem 1. The subproblems related to (39) can be given as

S_i(y) = minimize_{x_i} [x_i^T y^T] A_i [x_i^T y^T]^T, i = 1, 2.

Then, the master problem corresponding to (39) is given by

minimize_y S(y) = S_1(y) + S_2(y). (40)

Partition each A_i into blocks A_i1 ∈ R^{n_1×n_1}, A_i2 ∈ R^{n_1×n_2}, A_i3 ∈ R^{n_2×n_1}, and A_i4 ∈ R^{n_2×n_2} for i = 1, 2. Analytically, by solving the subproblems, we can show that S(y) is quadratic, as S_1(y) and S_2(y) are quadratic. Moreover, S_1(y) and S_2(y) are strongly convex since A_14 and A_24 are positive definite. Hence, S(y) is also strongly convex and ∇S(y) is Lipschitz continuous. Therefore, we can apply Theorem 1 to solve problem (40) using the gradient method (14). We use Algorithm 2 to solve (40). In this algorithm, at each iteration, the gradient update is given by ∇S(y^k) = ∇S_1(y^k) + ∇S_2(y^k), where ∇S_1(y^k) = A_12^T x_1^k + A_13 x_1^k + 2A_14 y^k and ∇S_2(y^k) = A_22^T x_2^k + A_23 x_2^k + 2A_24 y^k.

First, we illustrate our results with scalar valued primal variables x_1, x_2, and y (the n_1 = n_2 = 1 case) for different values of the constant step size α_k = α. Figure 2 shows the convergence of y^k with α = 0.001, α = 0.01, α = 0.1, and α = 0.5. Next, we show the convergence results for different dimensions of the complicating variable y with x_1 ∈ R^10 and x_2 ∈ R^10. Figure 3 shows the convergence of the residuals ‖y^k − y^*‖ with step size α = 0.001 for y ∈ R, y ∈ R^2, y ∈ R^3, y ∈ R^5, and y ∈ R^10, where y^* represents the optimal value of y. We present Figure 4, which plots the log values of ‖y^k − y^*‖, to analyze the convergence of the residuals as they approach zero. For this same set of dimensions of y with the same step size, the convergence of the iterates S(y^k) is shown in Figure 5. Moreover, Figure 6 shows that the primal variable iterates x_1^k and x_2^k converge exactly to their optimal solutions using α_k = 0.001 and y ∈ R.
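The primal decomposition scheme of this example can be sketched as follows. Each user solves its subproblem in closed form for the current y, and the master performs a gradient step on S(y). The dimensions, random symmetric positive definite matrices, and step size below are illustrative assumptions rather than the paper's exact experiment.

```python
import numpy as np

# Primal decomposition for a problem of the form (39): two users share the
# complicating variable y; the master runs the gradient method (14) on S(y).
rng = np.random.default_rng(1)
n1, n2 = 2, 2

def random_spd(n):
    # random symmetric positive definite matrix
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

A = [random_spd(n1 + n2) for _ in range(2)]
blocks = [(Ai[:n1, :n1], Ai[:n1, n1:], Ai[n1:, :n1], Ai[n1:, n1:]) for Ai in A]

y = np.ones(n2)
alpha = 0.005                              # small constant step size
for k in range(2000):
    grad = np.zeros(n2)
    for (A1, A2, A3, A4) in blocks:
        x = -np.linalg.solve(A1, A2 @ y)        # subproblem solution x_i*(y)
        grad += A2.T @ x + A3 @ x + 2 * A4 @ y  # gradient of S_i at y
    y = y - alpha * grad                   # master gradient update

# the overall objective is a positive definite quadratic in (x_1, x_2, y),
# so the master problem's minimizer is y* = 0
```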

Example 2 (Subgradient Method: Dual Decomposition).
Here, we focus on a problem which is not quadratic. We consider the problem in the following form with two users:

minimize_{x_1∈X_1, x_2∈X_2, y_1,y_2∈Y} f_1(x_1, y_1) + f_2(x_2, y_2) subject to y_1 = y_2, (41)

with problem data a_1, a_2, b_1, b_2 ∈ R^{n_1+n_2}. Here, we intend to solve this problem in a fully distributed manner using dual decomposition. We implement our results for n_1 = n_2 = 1 (scalar valued variables). The dual function corresponding to the primal problem (41) is given by g(λ) = g_1(λ) + g_2(λ), and we use the corresponding subproblems in dual decomposition as follows:

g_1(λ) = minimize_{x_1∈X_1, y_1∈Y} f_1(x_1, y_1) + λy_1, g_2(λ) = minimize_{x_2∈X_2, y_2∈Y} f_2(x_2, y_2) − λy_2.

Then, the dual problem corresponding to the primal problem (41) is given by

maximize_λ g(λ) = g_1(λ) + g_2(λ).

We know that g(λ) is always concave (see chapter 5 of [5]). We have obtained the graph of g(λ) as given in Figure 7, which confirms the concavity of g(λ). Moreover, this figure shows that g(λ) is nondifferentiable, as it has a sharp point around λ = 5. Hence, −g(λ) is convex and nondifferentiable, and therefore we use the subgradient method (21) to minimize −g(λ) using Algorithm 3.
We analyze the convergence results of the subgradient method using Theorem 3 discussed under Section 5. Therefore, we have to check whether Assumptions 1-3 used in Theorem 3 hold for the particular problem considered here. Figure 7 shows that there exists an optimal solution λ^* of the dual function g(λ); hence, Assumption 1 holds. At each iteration of Algorithm 3, the quantity y_2^k − y_1^k used in the dual variable update represents a subgradient s_k of −g(λ) at λ^k (equivalently, y_1^k − y_2^k represents a subgradient of g(λ) at λ^k). We can observe that ‖s_k‖ ≤ 4, since y_1, y_2 ∈ Y = [−2, 2]; hence, Assumption 2 holds. Moreover, we use the initialization λ^0 = 1, and we found that λ^* ≈ 5.14 using the CVX solver in Matlab. Therefore, it turns out that ‖λ^0 − λ^*‖ ≈ 4.14, from which Assumption 3 follows. Hence, we can use Theorem 3 to analyze the convergence of the subgradient method.
We have obtained the convergence results with the constant, square summable but not summable, and nonsummable diminishing step size rules. Figure 8 shows the convergence of the log values of ‖λ^k − λ^*‖ for different constant step sizes; this figure shows that larger step sizes give faster convergence. Next, we show the convergence with step sizes α_k = 0.1, α_k = 1/k, and α_k = 0.1/√k in the same figure (Figure 9) so as to identify the effect of the different step size rules. Here, we considered convergence up to a 10^−5 tolerance. We can observe slower convergence using α_k = 1/k and α_k = 0.1/√k than with the constant step size rule.
In our Algorithm 3, both users solve their subproblems separately and find optimal primal variables locally at each iteration. Next, they exchange their information y_1^k and y_2^k with each other and update the dual variable individually. In general, their iterates y_1^k and y_2^k are not feasible; therefore, at each iteration, they agree on a feasible solution y^k = (y_1^k + y_2^k)/2. Next, by using these primal variable iterates and the updated dual variable λ^k, user 1 and user 2 can compute g_1(λ^k) and g_2(λ^k), respectively. Then, g(λ^k) = g_1(λ^k) + g_2(λ^k) can be calculated. This is always a lower bound on f^*, the optimal value of the primal problem [5]. Moreover, at each iteration, users can compute two upper bounds on f^* as follows [14]:

w(y^k) = f_1(x_1^k, y^k) + f_2(x_2^k, y^k), b(y^k) = b_1(y^k) + b_2(y^k),

where b_1(y^k) = minimize_{x_1∈X_1} f_1(x_1, y^k) and b_2(y^k) = minimize_{x_2∈X_2} f_2(x_2, y^k). In [14], w and b are defined as the worst bound and the better bound. The worst bound represents the primal objective function value evaluated at each iteration at the feasible points (x_1^k, y^k) and (x_2^k, y^k). The better bound can be obtained by replacing y_1^k and y_2^k with y^k and then solving the subproblems of the related primal decomposition structure of (41). Figure 10 shows the convergence of g(λ^k), the better bound, and the worst bound using the constant step size rule α_k = 0.1 and scalar valued primal variables. Here, we can observe that, for this particular problem, the lower bound g(λ^k) and the two upper bounds converge exactly to f^*.

Given initial y^0. Set k = 0.
while (stopping criterion not met)
  (1) x_1 and x_2 minimization steps: x_i^k = argmin_{x_i} f_i(x_i, y^k), i = 1, 2
  (2) Master gradient update: y^{k+1} = y^k − α_k ∇S(y^k)
  k = k + 1
ALGORITHM 2: Gradient method: primal decomposition.

Figure 3: Convergence of the residuals ‖y^k − y^*‖ when problem (39) is solved using the gradient method followed by primal decomposition. The figure shows that ‖y^k − y^*‖ → 0 as the iteration number increases, that is, the convergence of y^k to y^* is guaranteed even for high dimensional complicating variables.

Figure 4: Log values of ‖y^k − y^*‖ when problem (39) is solved using the gradient method followed by primal decomposition. This shows that high accuracy in the convergence of y^k can be achieved within 1000 iterations even for high dimensional y.
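The dual decomposition loop of this example can be sketched with two users coupled only through y_1 = y_2. Since the paper's exact objective functions are not reproduced here, the quadratics f_1(y) = (y − 2)² and f_2(y) = (y + 1)² over Y = [−2, 2] are illustrative assumptions; each user's Lagrangian subproblem then has a closed form.

```python
import numpy as np

# Dual decomposition with the subgradient method (Algorithm 3 style):
# two users coupled by the constraint y_1 = y_2, dual variable lam.
clip = lambda y: min(2.0, max(-2.0, y))   # Y = [-2, 2]
lam = 1.0                                 # initial dual variable
alpha = 0.1                               # constant step size

for k in range(2000):
    # each user minimizes its own Lagrangian term locally
    y1 = clip(2.0 - lam / 2.0)            # argmin_y (y-2)^2 + lam*y over Y
    y2 = clip(-1.0 + lam / 2.0)           # argmin_y (y+1)^2 - lam*y over Y
    lam = lam + alpha * (y1 - y2)         # subgradient ascent on the dual g

y = (y1 + y2) / 2.0                       # feasible primal point, as in the text
# consensus optimum: minimize (y-2)^2 + (y+1)^2 over [-2, 2] gives y* = 0.5,
# attained together with the dual optimum lam* = 3
```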

Example 3 (ADMM).
Here, we first discuss the robustness of ADMM compared with the gradient method. Let us consider the following linear programme:

minimize a^T x subject to Ax = b, Bx + Cy = 0, (45)

where x ∈ R^n and y ∈ R^n are the decision variables of the problem, a ∈ R^n, A ∈ R^{m_1×n}, B ∈ R^{m_2×n}, and C ∈ R^{m_2×n} with m_1, m_2 < n, and b ∈ R^{m_1} is a constant vector. Suppose that the set of solutions of (45) is nonempty. The dual function g(λ) for problem (45), where λ = [λ_1^T λ_2^T]^T with λ_1 ∈ R^{m_1} and λ_2 ∈ R^{m_2}, is given by

g(λ) = inf_{x,y} a^T x + λ_1^T(Ax − b) + λ_2^T(Bx + Cy).

Then, analytically we can obtain

g(λ) = −λ_1^T b if a + A^T λ_1 + B^T λ_2 = 0 and C^T λ_2 = 0, and g(λ) = −∞ otherwise.

Next, the dual problem is given by

maximize_λ g(λ). (48)

Here, we can easily observe that the optimal value of the dual problem (48) is −λ_1^T b, which is attained when a + A^T λ_1 + B^T λ_2 = 0 and C^T λ_2 = 0. Usually, we use the following subproblems when we use the gradient method to solve (48):

Subproblem 1: minimize_x (a + A^T λ_1^k + B^T λ_2^k)^T x, Subproblem 2: minimize_y λ_2^{kT} Cy − λ_1^{kT} b.

Algorithm 4 represents the corresponding gradient algorithm.

Given initial λ^0. Set k = 0.
while (stopping criterion not met)
  (1) Primal variable minimization steps:
    Step 1: (x_1^k, y_1^k) = argmin_{x_1∈X_1, y_1∈Y} f_1(x_1, y_1) + λ^k y_1
    Step 2: (x_2^k, y_2^k) = argmin_{x_2∈X_2, y_2∈Y} f_2(x_2, y_2) − λ^k y_2
  (2) Dual variable update: λ^{k+1} = λ^k + α_k(y_1^k − y_2^k)
  k = k + 1
ALGORITHM 3: Subgradient method: dual decomposition.
We can observe that the x and y minimization steps in Algorithm 4 cannot proceed for an arbitrarily chosen λ, since (a + A^T λ_1^k + B^T λ_2^k)^T x and λ_2^{kT} Cy − λ_1^{kT} b are unbounded below. Hence, the gradient method fails to solve (48), and therefore the linear programme (45) also cannot be solved this way. However, the interesting fact is that ADMM solves this problem without any issue, revealing its robustness compared with the gradient method.

Figure 10: Convergence of the dual function iterates g(λ^k), the better bound, and the worst bound. The figure clearly shows that g(λ^k) is a lower bound on f^* and that the better bound gives a closer approximation of f^* than the worst bound. All three bounds converge to the optimal value f^* of the original problem (41) for sufficiently large k.
To solve (48) using ADMM, we consider the augmented Lagrangian

L_p(x, y, λ) = a^T x + λ_1^T(Ax − b) + λ_2^T(Bx + Cy) + (p/2)(‖Ax − b‖² + ‖Bx + Cy‖²),

where p represents the penalty parameter. Then, the corresponding dual function is given by g(λ) = inf_{x,y∈R^n} L_p(x, y, λ). Next, we maximize g(λ) by using Algorithm 5, in which α represents a suitably chosen step size. Here, we discuss the convergence of the iterates of Algorithm 5 for a particular problem instance, choosing p = α = 0.1. Figure 11 shows that our method guarantees the convergence of the dual variable λ exactly to its optimal value λ^*. The convergence of the dual function iterates g(λ^k) and the objective function iterates f(x^k) = a^T x^k is given in Figure 12, which shows that both the dual function and the objective function converge exactly to the optimal value f^* = −6 of our primal problem (45).
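The key point, that the augmented Lagrangian's minimization steps are bounded below where the plain Lagrangian's are not, can be demonstrated on a small instance of (45). The problem data below (n = 2, one constraint of each type) are an illustrative assumption, not the paper's instance with f^* = −6; the dual step size is taken equal to p, the standard ADMM choice.

```python
import numpy as np

# ADMM sketch for the linear programme (45): alternate minimization of the
# augmented Lagrangian over x and y, followed by dual ascent with step p.
a = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]]); b = np.array([3.0])          # Ax = b
B = np.array([[1.0, -1.0]]); C = np.array([[1.0, 0.0]])  # Bx + Cy = 0
p = 0.1

x = np.zeros(2); y = np.zeros(2)
lam1 = np.zeros(1); lam2 = np.zeros(1)
H = p * (A.T @ A + B.T @ B)   # Hessian of the augmented Lagrangian in x

for k in range(100):
    # x-step: minimize L_p over x (strongly convex since H is invertible here)
    rhs = -a - A.T @ lam1 - B.T @ lam2 + p * A.T @ b - p * B.T @ (C @ y)
    x = np.linalg.solve(H, rhs)
    # y-step: minimize L_p over y; minimum-norm solution via the pseudoinverse
    y = np.linalg.pinv(C) @ (-B @ x - lam2 / p)
    # dual ascent with step size p
    lam1 = lam1 + p * (A @ x - b)
    lam2 = lam2 + p * (B @ x + C @ y)

# for this instance the optimal value is a^T x = 3, with both constraints
# satisfied exactly at convergence
```

Running the plain gradient method of Algorithm 4 on the same data stalls immediately, since the linear x- and y-subproblems are unbounded below for a generic λ; the quadratic penalty is exactly what makes the steps above well posed.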

Conclusions
Centralized methods are unsuitable for, or fail to deploy in, many optimization settings involving large-scale networked systems, which demand distributed techniques. Therefore, this paper has attempted to discuss the most important methods that currently exist to solve distributed optimization problems. A detailed analysis of gradient like methods, subgradient methods, and ADMM has been presented with numerical results. Gaps in previous studies that need to be addressed to enhance distributed optimization over networked systems have been discussed under each section. Here, we summarize the areas in which future research can be conducted in distributed optimization.
(i) Many studies solve distributed problems using primal measures. Therefore, more theoretical studies related to duality need to be established, in order to exploit its advantages when optimizing more general (nonconvex) distributed problems.
(ii) Methods of finding primal solutions from the dual under more relaxed assumptions are critical, as dual measures do not converge to primal measures in general.
(iii) Distributed methods under limited communication between networked systems need to be analyzed in depth, and proper quantization schemes that guarantee the convergence of the corresponding distributed methods should be identified.
(iv) Inexact message exchange between subsystems due to limited communication bandwidths is common in distributed optimization. Consequently, analysis of error-based distributed methods is essential across many distributed application domains.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.