A Distributed Algorithm for Large-Scale Linearly Coupled Resource Allocation Problems with Selfish Agents

A decentralized randomized coordinate descent method is proposed to solve a large-scale linearly constrained, separable resource optimization problem with selfish agents. This method has a cheap computational cost and can guarantee, in each iteration, an improvement of the selected objective functions without jeopardizing the others. The convergence rate is obtained using an alternative gap benchmark of the objective value. Numerical simulations suggest that the algorithm converges to a random point on the Pareto front.


Introduction and Motivation
Distributed and parallel optimization techniques have become a powerful tool for solving large-scale resource optimization problems [1][2][3][4]. Different from the consensus-based distributed optimization model ([2,5]), where all agents collectively share the same decision variables, the resource optimization problem usually has a separable structure, i.e., each agent has its own decision variable and own objective function, but all the agents are weakly coupled by equality or inequality constraints [6]. The overall target is to minimize the summation of the agents' objective functions. One typical example of this kind of model is the optimal coordination problem of distributed energy resources [4,5], in which each generator decides its power generation by minimizing its cost function while meeting the overall demand. Recently, Li et al. [6] extended this method to a model with a global inequality constraint.
This work is motivated by the observation that resource allocation problems in the digital age are often subject to new limitations unseen in traditional resource allocation problems. For example, in the area of content distribution or video streaming, many researchers have explored a scheme called parallel access [7] or multiple content distribution servers [8,9] in a bid to optimize servers' workload and enhance users' quality of experience (QoE). Similar problems can also arise in the so-called multihoming problem when users start to compete for resources [10,11].
This scheme first divides the original content/video into fragments; replicas of these fragments are stored on a number of servers (often geographically diverse); the users are then allowed to access/download fragments from multiple servers concurrently. The scheme is characterized by the following features: (i) Large scale: the number of users could easily hit hundreds of thousands or even millions. (ii) Bandwidth limit: each server has a total bandwidth limit, so that each user is only given a quota of the total bandwidth.
However, there are some practical concerns about this scheme. Specifically, users' utilization of the bandwidth quotas of different servers may vary due to physical distance or the access quality of different Internet service providers. In this sense, there is a need to reallocate the bandwidth quotas of the servers among users to improve user QoE. Unfortunately, there are three major difficulties underlying this optimization problem: (1) The information necessary for the quota optimization (gradients of objective functions, etc.) is scattered among users and may be only partially available at a given time (e.g., users are inactive or unresponsive).
(2) Even if all the information is readily available, considering the large scale of the problem, an overall one-shot optimization could be computationally prohibitive and may lack timeliness, which is especially crucial for the enhancement of video viewing experience.
(3) Considering (2), a distributed algorithm that adaptively and progressively optimizes the quota allocation among users should be preferred; the crux, however, is that the users can in no way accept a deteriorating QoE during the optimizing process, i.e., the users (agents) are selfish: one would not be too happy to learn that his own QoE is compromised for the QoE of someone else or for the "greater good" of the overall system. The goal of this paper is to develop an efficient distributed method for solving the resource allocation problem with selfish agents. Since the complete first- and second-order information necessary for a one-shot optimization is usually unavailable in many real-world large-scale applications, we develop a randomized coordinate descent (RCD) algorithm that partially updates a pair of objective functions in each iteration (the RCD algorithm coincides with the distributed network optimization model, where only a subset of nodes can communicate with each other and be optimized in one iteration). Our algorithm can be loosely viewed as a multiobjective (MO) extension of the single-objective randomized coordinate descent algorithm [12]. Different from the single-objective optimization problem, the MO problem aims to optimize multiple objectives simultaneously in the sense of Pareto optimality (see [13,14]).
Our RCD method has the following prominent features. First, it optimizes the selected objective functions in each iteration without affecting the other objective functions, and the optimization at each iteration is required to be such that none of the selected objective functions deteriorates. As a result, the value of each objective function is non-deteriorating over the whole optimizing process, and the solution of each iteration can be applied in real time to progressively improve the objective function of each agent. Second, compared to centralized MO methods such as [15,16], our algorithm takes full advantage of the separability of the problem structure; the update has a very cheap cost per iteration and can be easily implemented in a parallel setting. Perhaps more significantly, this work also focuses on the convergence analysis of the multiobjective RCD algorithm. Although the convergence results of single-objective algorithms are well established ([17,18]), similar discussions for their MO counterparts are unexpectedly scarce. It was not until very recently that a few works succeeded in obtaining convergence rates for certain unconstrained MO algorithms ([19][20][21]). In fact, as will be evident in our analysis, the sequence generated by our RCD algorithm converges to a random point on the Pareto front; as a result, all the tools used in single-objective algorithm analysis are generally rendered useless. To conquer this difficulty and complete the missing piece, we develop a framework of convergence analysis for the RCD method, which generalizes the existing analysis for scalar optimization problems [12]. Under a mild condition, we show that the RCD algorithm has a sublinear convergence rate. If the agents' cost functions are all strongly convex, then the RCD algorithm has a linear convergence rate. The paper is organized as follows. In Section 2, we present the constrained MO optimization problem.
In Section 3, we present the RCD algorithm. An analysis of the convergence rate of this RCD algorithm for different cases is given in Section 4. Some numerical examples are given in Section 5. We conclude the paper and discuss future research in Section 6. Throughout this paper, the following notations are used. Vectors are denoted by bold letters. We use the following notation for the order of vectors: given u = [u_1, ..., u_n]^T and v = [v_1, ..., v_n]^T, u ⪯ v means that u_i ≤ v_i for all i = 1, ..., n, and u ≺ v means that u ⪯ v and u ≠ v.
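The componentwise order just defined is easy to make concrete. The following helper is illustrative only (it is not part of the paper; the authors' code is in MATLAB, while the sketch here is Python) and checks weak and strict dominance between objective vectors:

```python
import numpy as np

def weakly_dominates(u, v):
    """u ⪯ v: every component of u is no larger than that of v."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return bool(np.all(u <= v))

def dominates(u, v):
    """u ≺ v: u ⪯ v and u ≠ v (strictly smaller in at least one component)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return bool(np.all(u <= v) and np.any(u < v))
```

A point is Pareto optimal exactly when no feasible objective vector dominates its own.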

Problem Formulation
Consider the following standard multiobjective (MO) optimization problem with linear equality constraints:

(P_mo): min F(X) := [f_1(x_1), f_2(x_2), ..., f_M(x_M)]^T, s.t. X ∈ C := {X = [x_1^T, ..., x_M^T]^T ∈ R^{kM} : Σ_{m=1}^M x_m = 0},

where each x_m ∈ R^k and each f_m is a continuous function satisfying certain conditions, for m ∈ M := {1, 2, ..., M} (the notation 0 denotes the vector with all elements being zero). Clearly, the function F(·): R^{kM} → R^M is separable, i.e., x_m only affects f_m(x_m). However, these decision variables are weakly coupled by the global equality constraint. Note that although the right-hand side of the equality constraint is 0, there is no difficulty in generalizing the results to the non-zero case using an affine transformation. In the context of the above bandwidth allocation problem, x_m can be viewed as the bandwidth quotas of the k servers allocated to agent m; f_m is a function that represents the negative of the utility (QoE) brought by the quotas x_m; and the equality constraint represents the total bandwidth limitation of each server 1, ..., k. Model (P_mo) arises in various network optimization problems (see, e.g., [22][23][24][25][26][27]).
Problem (P_mo) aims to optimize the multiple objectives simultaneously in the sense of Pareto optimality (see [13,14]). Indeed, a feasible point X* ∈ C is a Pareto optimal solution of problem (P_mo) if there exists no other feasible solution X′ such that F(X′) ⪯ F(X*) and F(X′) ≠ F(X*). The traditional way to solve (P_mo) is to aggregate the objective functions f_m into a weighted-sum social welfare, say Σ_m ω_m f_m(x_m) with ω_m > 0 for all m = 1, ..., M, and then solve the following single-objective optimization problem using classical methods (e.g., the gradient descent method, Newton's method):

min_{X ∈ C} Σ_{m=1}^M ω_m f_m(x_m). (2)

But this scheme suffers from two major drawbacks for the reasons stated previously: first, the large scale of the underlying problem (a very large M) may prevent this scheme from responding in a timely manner; second, this scheme does not guarantee a progressively non-deteriorating allocation evolution during the optimizing process, so it is possible that the individual welfare of some agents gets sacrificed for the "greater good," i.e., the maximization of the social welfare Σ_m ω_m f_m(x_m), and hence they could become discontent and discard the service for good. Therefore, to address these issues, we present the following randomized block coordinate descent method.
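The second drawback can be seen in a tiny two-agent instance (a hypothetical numerical sketch, not from the paper): with f_1(x) = (x − 1)² and f_2(y) = (y − 1)² coupled by x + y = 0, the weighted-sum optimum with ω = (0.9, 0.1) improves agent 1 but leaves agent 2 strictly worse off than at the starting allocation x = 0:

```python
# Two agents on the line, coupled by x + y = 0 (so y = -x).
def f1(x):
    return (x - 1.0) ** 2   # agent 1 prefers x = 1

def f2(y):
    return (y - 1.0) ** 2   # agent 2 prefers y = 1, which conflicts via y = -x

w1, w2 = 0.9, 0.1           # weights of the social-welfare scalarization
x0 = 0.0                    # starting allocation: f1 = f2 = 1
# Minimizing w1*f1(x) + w2*f2(-x) in closed form:
# d/dx [w1 (x-1)^2 + w2 (x+1)^2] = 0  =>  x* = (w1 - w2) / (w1 + w2) = 0.8
x_star = (w1 - w2) / (w1 + w2)

print(f1(x_star), f2(-x_star))  # agent 1 improves; agent 2 deteriorates
```

Agent 2's cost rises from 1 to (0.8 + 1)² = 3.24, which a selfish agent would not accept.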

The RCD Algorithm.
Throughout this paper, we assume that the following condition holds true.

Assumption 1. Each gradient ∇f_m is Lipschitz continuous with coefficient L > 0, i.e.,

‖∇f_m(x) − ∇f_m(y)‖ ≤ L‖x − y‖, for all x, y ∈ R^k and m ∈ M,

where ‖·‖ is the Euclidean norm (note that the Lipschitz coefficient can be agent-specific, e.g., L_1, ..., L_M; in this case, we can let L = max{L_1, ..., L_M}; on the other hand, one can also modify our algorithm accordingly with respect to these agent-specific Lipschitz coefficients and obtain an improved convergence rate). Note that the Lipschitz property implies the following inequality:

f_m(x + d) ≤ f_m(x) + ∇f_m(x)^T d + (L/2)‖d‖², (4)

for any d ∈ R^k and m ∈ M. Given the current solution X, the classical gradient-based method needs to compute the whole Jacobian to generate a feasible step that simultaneously decreases all objectives. Obviously, this step is expensive when the problem size is large. To overcome this difficulty, we propose the following RCD method. In each iteration, we randomly select a pair of objectives (i, j) with probability p_{i,j} and update their objective values, for i, j ∈ M. To ensure convergence of the algorithm, we need the following condition on the sampling probabilities.

Assumption 2. For any pair (i, j), there exist i_1, i_2, ..., i_k ∈ M such that the sampling probabilities p_{i,i_1}, p_{i_1,i_2}, ..., p_{i_k,j} are all strictly positive.

Graphically, one can think of a strictly positive p_{i,j} > 0 as an "edge" between nodes i and j; if p_{i,j} = 0, there is no edge between nodes i and j. Assumption 2 means that the communication network among the nodes is connected, i.e., any two nodes in the network are either directly connected by an edge or indirectly connected by at least one path formed by several intermediate edges. If the network is complete, i.e., each node is directly connected to every other node and p_{i,j} > 0 for all edges, the above condition is automatically satisfied. Some other options are available, for example, the cyclic network (see Figure 1). In a cyclic network, the nodes are connected in a circular fashion: Node 1 → Node 2 → Node 3 → Node 4 → ... → Node 1.
Another option is the so-called central coordinator network, in which one node is chosen as a central coordinator (see Figure 2); the rest of the nodes are connected to this central coordinator but do not share direct connections among themselves. The implication of a connected network is that any local change can eventually ripple through the whole network instead of being contained.
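Assumption 2 is easy to check programmatically. The sketch below is an illustrative helper (not part of the paper's algorithm): it treats every strictly positive p_{i,j} as an undirected edge and runs a breadth-first search:

```python
from collections import deque

def satisfies_assumption_2(M, positive_pairs):
    """Return True iff the graph on nodes 0..M-1, with an edge {i, j}
    for every pair (i, j) having p_{i,j} > 0, is connected."""
    adj = {m: set() for m in range(M)}
    for i, j in positive_pairs:
        adj[i].add(j)
        adj[j].add(i)
    seen, queue = {0}, deque([0])   # BFS from node 0
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == M
```

Complete, cyclic, and central-coordinator networks all pass this check, while two disjoint groups of nodes fail it.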
Once a pair (i, j) is chosen, the following convex optimization problem P_{i,j}(X) is solved:
It is not hard to see that, due to inequality (4), the solution of problem P_{i,j}(X) provides a descent direction that improves the objective values for both i and j.
The key difference between P_{i,j} and its single-objective RCD counterpart is that a single-objective RCD algorithm would aggregate (5) and (6) and attempt to minimize this aggregate cost. Unlike our multiobjective method, this single-objective scheme does not necessarily generate a non-deteriorating solution for both agents. We use d*_{i,j} to denote the optimal solution of problem P_{i,j}(X), and x⁺_i and x⁺_j to denote the updated solution points for the i-th and j-th objective functions, respectively.
The algorithm updates the solution points as follows:
and keeps all the others unchanged, x⁺_m = x_m for m ≠ i, j and m ∈ M. One prominent feature of the method is that problem P_{i,j}(X) admits an explicit solution.
Lemma 1. Given X and a pair (i, j), the optimal solution of P_{i,j}(X) is
where λ*_{i,j} = max{0, min{1, ·}}. Furthermore, X* is Pareto optimal only if the optimal solution of P_{i,j}(X*) is d*_{i,j} = 0 for all pairs (i, j), and the reverse direction holds true if all f_m are convex.
Proof. Checking the Karush-Kuhn-Tucker (KKT) conditions of problem P_{i,j}(X) yields
where λ_{i,j} ∈ [0, 1] is the Lagrange multiplier. Substituting d*_{i,j} back into the constraints gives t*_{i,j} as
where
It can be easily verified that g_i(k; a_1, a_2) is strictly decreasing for k ∈ [0, 1] and g_j(k; a_1, a_2) is strictly increasing for k ∈ [0, 1]; after checking the boundary values, we obtain λ*_{i,j} as in (10). Clearly, when each f_m is convex and d*_{i,j} = 0, Pareto optimality follows, i.e., there exists λ*_{i,j} ∈ [0, 1] for each pair (i, j) satisfying the KKT conditions of P_mo. The other direction is straightforward. The algorithm is summarized in the following pseudocode.
Since P_mo is assumed to be a large-scale problem, the computational cost of the methods of [15,16] could become prohibitive, because an overall optimization that involves all objectives is required to compute an update direction. Also, the methods of [15,16] would require all the objective function information to be transmitted to a central coordinating agent, where the optimization is conducted; this could create excessive communication overhead at the central coordinating agent. In contrast, our method is able to take advantage of the separability of the problem structure; the update has an analytical form, can be parallelized by optimizing multiple pairs concurrently, and the pairwise optimization can be easily implemented in a peer-to-peer manner. The cost per iteration is O(c_f + c), where c_f is the maximum cost of computing the gradient of each f_i and c is the cost of updating x_i. Compared with the single-objective randomized coordinate descent method, the main difference is that the optimization procedure guarantees that t*_{i,j} ≤ 0 for any pair; as a result, the update is always non-deteriorating for each of the agents and hence can be applied in real time.
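Since the closed-form expressions of Lemma 1 are garbled in this copy, the following Python sketch shows one plausible form of the pairwise update. It is a reconstruction, not the authors' exact formula, under two assumptions: the subproblem minimizes the worse of the two quadratic surrogate bounds (4) subject to d_i + d_j = 0 (which preserves the coupling constraint Σ_m x_m = 0), and λ*_{i,j} is the multiplier equalizing the two surrogates, clipped to [0, 1]:

```python
import numpy as np

def rcd_pair_step(X, i, j, grad, L):
    """One pairwise RCD update. X is an (M, k) array of agent blocks
    summing to zero along axis 0; grad(m, x) returns the gradient of
    f_m at x; L is a common Lipschitz constant of the gradients."""
    g_i, g_j = grad(i, X[i]), grad(j, X[j])
    u = g_i + g_j
    uu = float(u @ u)
    # Multiplier equalizing the two surrogate decreases, clipped to [0, 1];
    # when g_i = -g_j, the surrogates coincide for every d, so take 1/2.
    lam = 0.5 if uu < 1e-14 else float(np.clip((u @ g_j) / uu, 0.0, 1.0))
    d = -(lam * g_i - (1.0 - lam) * g_j) / L   # agent i's step
    X_new = X.copy()
    X_new[i] += d
    X_new[j] -= d                              # d_j = -d_i keeps sum_m x_m = 0
    return X_new
```

With this step, the descent-lemma bound (4) gives f_i(x⁺_i) ≤ f_i(x_i) + t*_{i,j} ≤ f_i(x_i), and similarly for agent j, so neither selected agent deteriorates while all other blocks stay untouched.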

□
Remark 1. When all the information is available, one may also consider following the idea of [15], that is, simultaneously optimizing all agents, with the update directions given by solving the following problem:
However, the cost per iteration of this problem is O(Mc_f + Mc). Also note that P_all generally does not admit an analytical solution; therefore, an additional quadratic programming algorithm is needed to find its solution.

Convergence Rate Analysis: Non-convex Case.
ALGORITHM 1: Multiobjective adaptation.
Initialization: set X = X_0;
while True do
    randomly select a matching pair (i, j) according to the network structure;
    solve d*_{i,j} from problem P_{i,j}(X);
    update x_i and x_j using d*_{i,j} and keep all other blocks unchanged;
end

We investigate the convergence rate of the RCD algorithm in this section. We first introduce the potential function h, defined as the sum of the agents' objectives, h(X) := Σ_{m=1}^M f_m(x_m). For the optimal updating vector d*_{i,j} and the associated Lagrange multiplier λ*_{i,j}, the Lipschitz continuity of ∇f_m implies

where the second inequality follows from the observation that t*_{i,j} ≤ 0. Therefore, under the preceding updating rule, we can estimate the expected decrease of the potential function as
Now, we introduce the following criterion as a metric, D(·): R^{kM} → R, defined as
Using D(·), the preceding inequality becomes
One can view D(·) as an improvement potential indicator. Specifically, we have the following lemma.

Lemma 2. A feasible point X* ∈ C is a Pareto optimal solution of (P_mo) only if D(∇h(X*)) = 0, and the reverse direction holds true if each f_m is convex.

Proof. We first prove the direction that D(∇h(X*)) = 0 implies that X* is a Pareto optimal solution, by contraposition. Assume that X with Σ_{m=1}^M x_m = 0 is not Pareto optimal; then there exist i and j ≠ i such that ∇f_i(x_i) ≠ t · ∇f_j(x_j) for all t ∈ R_+, which implies D(∇h(X)) > 0. The other direction of this lemma is straightforward. We omit the details.
We now turn to the convergence rate of the RCD algorithm. The convergence rate of methods for a single-objective problem such as min_{x∈C} f(x) is usually measured with respect to the gap from its optimal value f*. However, different from the single-objective optimization problem, the sequence generated by an MO algorithm may converge to different points on the Pareto optimal frontier, and there is no a priori way to determine which point the sequence converges to; this property will be further substantiated by the numerical simulations. Therefore, one needs to construct an alternative benchmark that measures the convergence rate (the construction of such alternative gap functions for multiobjective optimization problems has also been discussed recently by Dutta et al. and Tanabe et al. [28][29][30]). We define the following two measures (functions):
where h†(X) provides a lower bound of the potential function h(X), and v†(X) measures the gap between the current value of the potential function and this lower bound. These two measures have the following properties.

Lemma 3.
For any X_1, . Furthermore, the solution X* is a Pareto optimal point only if v†(X*) = 0, and the reverse direction holds true if each f_m is convex.
Proof. The first half follows directly from the observation that F(
The rest of the proof is straightforward, and we omit the details. Although h† and v† are only tangential to the following convergence result for non-convex objective functions, they are pivotal for obtaining convergence rates for convex and strongly convex objective functions. We then have the following preliminary convergence rate result for generally non-convex objective functions. □

Theorem 1. If the level set L := {X ∈ C: F(X) ⪯ F(X_0)} is bounded, then the following holds:

Proof. Taking the expectation of both sides of (17) with respect to X_0 and summing up to n, we arrive at the following:
where the second inequality follows from Jensen's inequality. Since L is bounded, we know that h(X_0) − h†(X_0) is also bounded, and hence (18) follows. Note that in the non-convex case, the stationary point may only be a local minimum.

Convergence Rate: Convex Case.
When each f_m is convex, an accelerated convergence rate can be obtained. But first we need some preliminary results. In the same spirit as the dual norm, for a non-empty compact set B ⊂ R^{kM}, we define the dual function of D(·) on B as
where
Then, the following lemma gives the counterpart of the Cauchy-Schwarz inequality with respect to the feasible set C of problem P_mo.

Lemma 4. Given d ∈ C and a non-empty compact set
, ∀m ∈ M}, the following inequality holds true:
Furthermore, there exists a number Γ_B such that

Proof. Since A_B(d) is a compact set, D*_B(·) is well defined. Depending on the value of D(Y), we have the following two cases: (1) If D(Y) = 0, from (17), we know that Y must be of the form
Suppose there exists a d ∈ C such that Y^T d > 0, which implies g^T d_m > 0 for some m. However, this leads to Σ_{m=1}^M g^T d_m = g^T(Σ_{m=1}^M d_m) > 0, which contradicts the fact that d ∈ C. Therefore,
where the second equality follows from the fact that λ_{i,j}(kY) = λ_{i,j}(Y) for all k ∈ R, and hence D(kY) = |k| D(Y).
As for the last part of the lemma, we can compute the lower bound value Γ_B as
where

Theorem 2. Given an initial solution X_0, if the level set L := {X ∈ C: F(X) ⪯ F(X_0)} is bounded, then the solution X_n generated by the n-th iteration of the RCD algorithm satisfies

where R(X_0) := max_{Y∈L} ‖X_0 − Y‖ and Γ_B is defined with respect to D*_B in Lemma 4 with B := {∇h(X): X ∈ L}. Furthermore, Γ_B is bounded by

Proof. As the level set L is bounded, problem P_mo always admits a solution. On the other hand, the Lipschitz property in Assumption 1 implies
which further implies the boundedness of the set B. Therefore, Γ_B satisfies the following inequality:
Then, we have
where the second inequality is from the convexity of h(·) and the third inequality is from (24). Inequality (28), together with (17), gives
Since F(X_{n+1}) ⪯ F(X_n), applying Lemma 3 to the preceding inequality gives
Taking the expectation of both sides of the above inequality and applying Jensen's inequality yields
where

For the strongly convex case, we assume there exists a σ > 0 such that each f_m is σ-strongly convex, i.e.,
Summing up the f_m leads to the strong convexity of the Lyapunov function h, i.e.,

Theorem 3. Given an initial feasible solution X_0, if the level set L := {X ∈ C: F(X) ⪯ F(X_0)} is bounded, then the RCD algorithm satisfies
where Γ_B is defined in Lemma 4 with respect to the set

Proof. Applying Lemma 4 to (34), we have

Then, one can easily verify that C_R(X) ⊆ C_L(X) using the convexity of f_m. Maximizing the left-hand side of (36) with respect to C_L(X) and the right-hand side with respect to C_R(X) yields
where the second inequality follows from Lemma 4. Taking the expectation of both sides and applying the resulting inequality iteratively leads to (35).

Convergence Rate in Probability.
Again with the aid of the gap benchmark v†, in this section we establish the convergence rate in probability for the RCD algorithm. First, we introduce the following lemma [31].
Then, given a confidence level ρ ∈ (0, 1): if property (1) holds, we can choose ϵ < c and
If property (2) holds, we choose
Then, we have

Proof (see [31], Theorem 1). The proof is done by applying the Markov inequality. We then have the following theorem, which quantifies the confidence of reducing the improvement of the potential function to no more than ϵ.
where Γ_B is the corresponding bounding coefficient with respect to B; if each f_m is σ-strongly convex, then we may choose
In either case, we have

Proof. This result follows directly from the observation that the gap sequence {v†(X_n)} generated by the RCD algorithm satisfies the properties mentioned in Lemma 5, according to (32) and (39).
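The Markov-inequality argument behind Lemma 5 and the theorem above can be sketched as follows. Since the displayed conditions are garbled in this copy, this is a hedged reconstruction of the standard argument; the constants C and θ below are stand-ins for the sublinear bound (32) and the linear bound (39), respectively:

```latex
% Markov's inequality converts a bound on the expected gap into a
% high-probability bound: for any \epsilon > 0,
\[
  \Pr\bigl(v^{\dagger}(X_n) \ge \epsilon\bigr)
  \;\le\; \frac{\mathbb{E}\bigl[v^{\dagger}(X_n)\bigr]}{\epsilon}.
\]
% Sublinear decay of the expected gap:
\[
  \mathbb{E}\bigl[v^{\dagger}(X_n)\bigr] \le \frac{C}{n}
  \;\Longrightarrow\;
  n \ge \frac{C}{\rho\,\epsilon}
  \;\Longrightarrow\;
  \Pr\bigl(v^{\dagger}(X_n) \ge \epsilon\bigr) \le \rho.
\]
% Linear decay (strongly convex case):
\[
  \mathbb{E}\bigl[v^{\dagger}(X_n)\bigr] \le \theta^{\,n}\, v^{\dagger}(X_0),
  \;\theta \in (0,1)
  \;\Longrightarrow\;
  n \ge \frac{\log\bigl(v^{\dagger}(X_0)/(\rho\,\epsilon)\bigr)}{\log(1/\theta)}
  \;\Longrightarrow\;
  \Pr\bigl(v^{\dagger}(X_n) \ge \epsilon\bigr) \le \rho.
\]
```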

Numerical Simulation
In this section, we use simulations to demonstrate our theoretical results. Consider the following example with 10 quadratic objective functions, each objective function taking the form of (46),

Figure 5: In a central coordinator network, the limiting potential function value is much more stable, and notably the central coordinator is able to achieve a significantly lower limiting value than in a complete network.
for m = 1, ..., 10. The coefficients [π_{m1}, π_{m2}, π_{m3}] are randomly generated in each trial of the test. The 10-th objective (agent) is selected as a central coordinator, i.e., at each iteration, one of the first 9 objectives is randomly picked with equal probability to communicate with the 10-th objective. For a given initial point X_0, we run the algorithm 3 times and record the corresponding potential function value trajectories. The results are summarized in Figure 3.
As is clearly seen, although each of the three runs converges, their limiting potential function values differ, indicating that they converge to different points on the Pareto front.
To further highlight this point, we run two extra sets of simulations. With the same setting as the preceding simulation (10 quadratic objectives, randomly generated π_m), we again fix an initial point X_0. For the first set of simulations, the communication network is complete, i.e., at each iteration, any pair (i, j) can be chosen with equal probability. For the second test, the communication network is the same as in the preceding simulation, i.e., the 10-th objective is a central coordinator, and one of the remaining objectives is randomly selected with equal probability to communicate with the 10-th objective at each iteration. For each test, we run the RCD algorithm 10 times, and the resulting value trajectories of the 10-th objective function are recorded in the following figures.
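Since the paper's coefficients π and the exact objective (46) are not recoverable from this copy, the following Python sketch reproduces only the qualitative experiment: 10 random quadratic objectives, a central-coordinator sampling scheme, and the monotone decay of the potential h(X) = Σ_m f_m(x_m). The pairwise step uses a reconstructed closed form (equalizing multiplier clipped to [0, 1], with d_j = −d_i); all coefficients are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
M, k, L = 10, 2, 2.0
c = rng.normal(size=(M, k))      # hypothetical stand-ins for the pi coefficients

def f(m, x):                     # f_m(x) = ||x - c_m||^2, gradient-Lipschitz with L = 2
    return float((x - c[m]) @ (x - c[m]))

def grad(m, x):
    return 2.0 * (x - c[m])

X = np.zeros((M, k))             # feasible start: sum_m x_m = 0
history = [sum(f(m, X[m]) for m in range(M))]

for _ in range(2000):
    i, j = int(rng.integers(0, M - 1)), M - 1   # agent M-1 is the central coordinator
    g_i, g_j = grad(i, X[i]), grad(j, X[j])
    u = g_i + g_j
    uu = float(u @ u)
    lam = 0.5 if uu < 1e-14 else float(np.clip((u @ g_j) / uu, 0.0, 1.0))
    d = -(lam * g_i - (1.0 - lam) * g_j) / L
    X[i] += d
    X[j] -= d                                    # keeps the coupling constraint intact
    history.append(sum(f(m, X[m]) for m in range(M)))
```

Across runs (different seeds), the trajectories decrease monotonically but flatten at different limiting values, consistent with convergence to different points on the Pareto front.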
As is clearly indicated in Figure 4, the limiting levels of the 10 runs differ dramatically from each other. As for the central coordinator network (see Figure 5), the limiting level of the coordinator is relatively more stable, at around 0.38. Interestingly, compared to the complete network with random pairing, the central coordinator manages to achieve a significantly better objective value: the lowest level of the 10-th objective in the complete network barely touches 0.47.
This is probably because the central coordinator is involved in every single iteration; as a result, its corresponding objective is more heavily optimized.

Discussion and Concluding Remarks
We propose a randomized coordinate descent algorithm to solve the large-scale, linearly coupled resource allocation problem with selfish agents. This method has a low computational cost in each iteration and guarantees convergence to a Pareto optimal solution under mild conditions; we also derive the convergence rate of the algorithm. As the sampling probabilities are fixed exogenously in the current framework, one potential extension is to tune these sampling probabilities with respect to the problem's parameters, which may further enhance the efficiency of the algorithm. Specifically, it is known in the literature that the selfish/greedy behavior of individual agents generally leads to efficiency loss from a systemic perspective [27,32]; hence, the design of a sampling probability that narrows this efficiency gap would be an interesting direction for future research.

Data Availability
No data were used to support this study. The MATLAB code used in this article is available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.