Online Supervised Learning with Distributed Features over Multiagent System

An et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Most current online distributed machine learning algorithms have been studied in a data-parallel architecture among agents in networks. We study online distributed machine learning from a different perspective, where the features of the same samples are observed by multiple agents that wish to collaborate but do not exchange their raw data with each other. We propose a distributed feature online gradient descent algorithm and prove that the local solution converges to the global minimizer at a sublinear rate O(√(2T)). Our algorithm does not require the exchange of the primal data or even of the model parameters between agents. First, we design an auxiliary variable, which carries the information of the global features, and estimate it at each agent by a dynamic consensus method. Then, the local parameters are updated by an online gradient descent method based on the local data stream. Simulations illustrate the performance of the proposed algorithm.


Introduction
With the development of multiagent systems, observed data are being generated anywhere and anytime, using different devices and technologies [1][2][3]. There is a lot of interest in extracting knowledge from this massive amount of data and using it to choose a suitable business strategy [4][5][6], to generate control commands [7][8][9], or to make decisions [10][11][12][13]. Many applications are required to process incoming data in an online way; e.g., a bank monitors the transactions of its clients to detect fraud [2], a wireless sensor network makes inferences [14], and a sensor network tracks an uncooperative target [15]. The study of online learning is becoming an important research topic in itself [16][17][18]. The success of online machine learning often depends on the entire data stream. In some applications, the observed data may be generated on and held by multiple agents [1,13]. Collecting the data at a central site for training incurs extra management and privacy concerns [1]. As a result, some distributed machine learning algorithms have been proposed to train a model by letting each agent perform local model updates and exchange some information with its neighbors [19][20][21][22]. Most of the existing algorithms fall into the data-parallel setting [1], where each agent has its local data stream with the entire feature set. However, in network applications, multiple agents are used to monitor an environment, where the agents are distributed over space and collect different measurements. For example, the observations may be generated by different observation models [8,9]. It is therefore important to develop algorithms that can deal with data streams whose features are distributed over networks.
In batch learning settings, some algorithms have been proposed for distributed features, such as variance-reduced dynamic diffusion (VRD²) [12], feature distributed machine learning (FDML) [1], and ADMM (alternating direction method of multipliers) sharing [23]. VRD² and FDML obtain the optimal solution in the primal domain, and the local model is trained in a distributed manner based on the local features.
The ADMM sharing algorithm formulates distributed feature learning as a distributed primal-dual problem and then obtains the optimal solution by the ADMM algorithm. The algorithms in [1,12,23] effectively handle batch distributed feature learning in a distributed form. However, these algorithms need access to the entire dataset and cannot be applied in online settings. As observations arrive continuously and very fast in networks, it is important to study online feature-distributed machine learning.
In this paper, we consider the situation where the features are split across agents in online settings, either due to privacy considerations or because they are already physically collected in a distributed manner by means of a networked architecture. We propose a distributed feature online gradient algorithm. Online supervised learning over networks is formulated in a "cost of sum" form. The procedure of the proposed algorithm operates on two time scales: one scale updates the parameters by gradient descent, and a second, faster scale runs the consensus step multiple times to track an auxiliary term. The main contributions of this paper are summarized as follows.
(1) We propose a distributed feature online gradient (DFOG) descent algorithm. By exchanging some information between neighbors, the local solution can approximate the global solution. Compared with VRD² [12], FDML [1], and the sharing ADMM algorithm [23], DFOG is applicable to online supervised learning with distributed features over networks. (2) We first formulate the centralized cost in a "cost of sum" form. By a dynamic consensus algorithm, each node can track the sum term, which carries the entire feature information of the sample at each round. Then, with the help of the online gradient descent algorithm, each node locally updates its parameters based on its own data stream.
(3) We prove that the proposed algorithm achieves an O(√(2T)) regret bound. That is, the local solution can approach the global solution, which is the best decision trained on the entire dataset. The only transmitted message is some parameter information; the proposed algorithm does not require knowledge of the total time horizon and does not exchange raw data between neighbors. The rest of this paper is organized as follows: the problem formulation is discussed in Section 2. In Section 3, we focus on our online optimization algorithm with distributed features over a multiagent system, followed by the theoretical results in Section 4. In Section 5, simulations illustrate the effectiveness of our algorithm. Finally, we conclude the paper in Section 6.
Notation and terminology: let x be the feature space and y be the corresponding label. We denote the (i, j)th element of a matrix A by a_{i,j}. For T ∈ N₊, the set {1, 2, . . . , T} is denoted by [T]. For a convex function f, its gradient at a point ω is denoted by ∇_ω f(ω). We denote by N the number of agents in the network. Let R^d be the d-dimensional vector space, and let ‖ω‖₂ denote the Euclidean norm of a vector ω ∈ R^d.

Problem Formulation
We consider a multiagent system with N agents. The communication between agents is described by a connected graph G = (V, E) [24], consisting of a set of nodes V = {1, 2, . . . , N}, a set of edges E, and an adjacency matrix A [19]. For each agent i ∈ V, we denote by E_i = {j | (j, i) ∈ E} the set of neighbors of agent i (including agent i itself).

Assumption 1.
The graph G = (V, E) and the weighted adjacency matrix A satisfy the following [25]. In this work, we focus on binary online supervised learning with distributed features. The features are distributed over a collection of N agents, as illustrated in Figure 1.
At each time t = 1, 2, . . . , T, the network receives a labeled sample (x_t, y_t). Over the whole horizon T, we consider an empirical risk as follows: where the parameters are denoted by ω ∈ R^{d×1}, d is the dimension of the features, and y_t ∈ {−1, +1} is the scalar label corresponding to x_t at time t. Moreover, the cost f(ω) is convex and differentiable. In most problems of interest, the cost function depends on the inner product ω^T x, such as the linear SVM cost f = max(0, 1 − y_t(ω^T x_t)) and the logistic regression cost f = log(1 + exp(−y_t(ω^T x_t))). The factor r(ω) represents the regularization term.

Since the features of x_t are distributed across agents, we take ω and x_t to be column vectors and partition them into N subvectors, denoted by ω_i and x_{t,i}, respectively. Each subfeature vector x_{t,i} and subvector ω_i is located at agent i. Then, cost function (1) can be rewritten as stated above, where the regularization term is assumed to satisfy an additive form. This property holds for many popular regularization choices, such as l₂, l₁, and the KL-divergence.

Problems of this type have been studied before in the literature by using distributed optimization methods [20,21]. One common way is to formulate problem (3) as a constrained problem. Over the whole horizon T, problem (5) is a classical "cost of sum" form [20]. An effective way is to design the Lagrangian function by introducing the dual variable c [23]. Problem (6) can be solved by a number of distributed primal-dual methods, such as the alternating direction method of multipliers (ADMM) [4,22,26] and primal-dual methods [27][28][29]. These techniques have good convergence properties but suffer from high computational costs and two-time-scale communications. The other way is studied in the primal domain [12].
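To make the "cost of sum" structure concrete, the following minimal Python sketch (the function names and the l₂ regularizer are illustrative, not taken from the paper's code) evaluates the logistic regression cost when ω and x_t are partitioned into per-agent blocks:

```python
import numpy as np

def logistic_loss(z, y):
    """Logistic regression cost evaluated on the inner product z = w^T x."""
    return np.log1p(np.exp(-y * z))

def cost_of_sum(w_blocks, x_blocks, y, mu=0.1):
    """Global cost when w and x_t are split into N per-agent blocks.

    The loss depends only on the SUM of the local inner products w_i^T x_i,
    and the l2 regularizer is additive across blocks.
    """
    z = sum(w_i @ x_i for w_i, x_i in zip(w_blocks, x_blocks))
    reg = sum(0.5 * mu * (w_i @ w_i) for w_i in w_blocks)
    return logistic_loss(z, y) + reg
```

The value is identical to computing the loss with the full vectors, which is exactly why only the scalar sum of local inner products needs to be shared between agents.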
The algorithm in [12] requires a two-time-scale operation: a faster time scale for the consensus iterations and a slower time scale for the data sampling and the gradient computation. First, a consensus strategy is used to obtain the sum term, where n_k denotes the index of a sample selected uniformly at random from {1, 2, . . . , T}. After sufficiently many iterations, it is well known that z_{n_k,i} → (1/N) Σ_{i=1}^N z_{n_k,i}. Then, a stochastic-gradient step is used to update the parameters ω, where the gradient is the gradient vector of the cost evaluated at some random data point (x_{n_k}, y_{n_k}).
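The consensus averaging just described can be sketched as follows (the doubly stochastic ring-topology weight matrix used in testing is illustrative): each agent starts from its scaled local inner product and repeatedly mixes with its neighbors.

```python
import numpy as np

def consensus_estimate(local_products, A, M):
    """Estimate sum_i w_i^T x_i at every agent via M consensus iterations.

    local_products[i] holds agent i's inner product w_i^T x_i, and A is a
    doubly stochastic mixing matrix matching the communication graph.
    """
    N = len(local_products)
    z = N * np.asarray(local_products, dtype=float)  # z_i^0 = N * w_i^T x_i
    for _ in range(M):
        z = A @ z  # each agent averages over its neighborhood
    return z  # for large M, every entry approaches sum_i w_i^T x_i
```

Because A is doubly stochastic, each iterate preserves the average, so every local estimate converges to N times the average, i.e., to the desired sum.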
In online settings, since the data (x_t, y_t) are observed one by one, we cannot access the total dataset {(x_t, y_t)}_{t=1}^T. Hence, the algorithms in [1,12,23] cannot be applied to data streams with distributed features over networks. For each time t = 1, 2, . . . , T, the multiagent system is endowed with a sequence of cost functions {L_t}_{t=1}^T, and the goal is to minimize the sum of the cost functions. Specifically, we want to minimize the difference between the total cost the multiagent system has incurred and that of the best fixed decision in hindsight, which is called the regret; its definition is given as follows: where ω* is the best decision for problem (1). Moreover, we consider the time-varying cost function L_t as stated above. Generally speaking, the cost Q(Σ_{i=1}^N ω_i^T x_{t,i}; y_t) satisfies Assumption 2.
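As a toy illustration of the regret in (8) (the 1-D logistic stream and the grid of candidate decisions are purely illustrative), one can compare the cumulative online loss with that of the best fixed decision chosen in hindsight:

```python
import numpy as np

def logistic(w, x, y):
    """1-D logistic loss for parameter w on sample (x, y)."""
    return np.log1p(np.exp(-y * w * x))

def regret(online_ws, data, candidates):
    """Cumulative online loss minus the loss of the best fixed w in hindsight."""
    online_cost = sum(logistic(w, x, y) for w, (x, y) in zip(online_ws, data))
    best_cost = min(sum(logistic(w, x, y) for x, y in data) for w in candidates)
    return online_cost - best_cost
```

An algorithm is "no regret" when this quantity grows sublinearly in T, so the time-averaged regret vanishes.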
Regret is the standard measure of the performance of an online optimization algorithm [19]. An algorithm attains good performance if the regret is sublinear as a function of the total time T. Remark 1. In the multiagent system, since the entries of the feature vector x_t are distributed over N agents, each agent observes only its own data stream. We face the following two challenges in solving problem (8): (1) Distributed challenge: each agent only receives its local data stream (x_{t,i}, y_t) and does not have access to the entire features (x_t, y_t). Under the condition that no raw data are exchanged between neighbors, each agent needs to obtain some information on the entire features. (2) Online challenge: at any time t₁, we only have observations for t ≤ t₁ and do not know L_t for t₁ ≤ t ≤ T. It is difficult to store all the observations due to the high-dimensional and high-velocity data stream. We need to update the parameters based on the current sample and the previous parameters and pursue a solution approximating the global solution ω*, which is the best decision based on all the data {(x_t, y_t)}_{t=1}^T given as a prior in offline settings.

Distributed Feature Online Gradient Descent Algorithm
In this section, we first analyse a dynamic average consensus method for approximating the sum of the ω_i^T x_{t,i} at each agent i and then propose an online convex optimization scheme to update the parameters ω.
The detailed framework is summarised in Figure 2. Now, we consider the problem of minimizing (5) by means of online convex optimization. Let z_t = Σ_{i=1}^N ω_{t,i}^T x_{t,i} denote the inner product available at time t ∈ [T]. The cost function L_t can be described as above. If each agent i can obtain the auxiliary variable z_t at any time t, the parameters ω_{t,i} can be obtained by minimizing the local cost L_{t,i}, which is defined as above. However, the computation of z_t requires access to all the subfeatures x_{t,i} and subvectors ω_{t,i} over the N agents. We denote the average of the local inner products as z̄_t. Motivated by the works in [30][31][32][33][34], z_t can be approximated by a diffusion-based algorithm. Since the desired variable z_t is proportional to the average value z̄_t, z_t = N z̄_t, a consensus strategy can be used to approximate z_t. Specifically, for a total number of iterations M, each agent repeats the consensus step M times, starting from z_{t,i}^0 = N ω_{t,i}^T x_{t,i}. After each agent obtains the estimate of z_t, denoted z_{t,i}, problem (12) becomes a differentiable dynamic problem. For online convex optimization problems, online gradient descent and its variants achieve optimal dynamic regret in many applications [35]. Recalling that ω_t and x_t are partitioned into N blocks, the gradient step can be performed in parallel over the N agents, where the step size μ_t should satisfy μ_t > 0, Σ_{t=1}^∞ μ_t = ∞, and Σ_{t=1}^∞ μ_t² < ∞. The full algorithm is summarized in Algorithm 1.
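A runnable sketch of one round of the resulting DFOG update for l₂-regularized logistic regression is given below; the mixing matrix, step-size schedule, and loss are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def dfog_step(w_blocks, x_blocks, y, A, M, step, reg=0.1):
    """One DFOG round: consensus estimate of z_t, then local gradient steps.

    w_blocks[i] and x_blocks[i] are the parameter and feature blocks held by
    agent i; A is a doubly stochastic mixing matrix; M is the number of
    consensus iterations; step is the step size mu_t.
    """
    N = len(w_blocks)
    # Consensus phase: each agent tracks z_t = sum_i w_{t,i}^T x_{t,i},
    # starting from z_{t,i}^0 = N * w_{t,i}^T x_{t,i}.
    z = N * np.array([w @ x for w, x in zip(w_blocks, x_blocks)])
    for _ in range(M):
        z = A @ z
    # Gradient phase: each agent updates its own block in parallel.
    new_blocks = []
    for i in range(N):
        dz = -y / (1.0 + np.exp(y * z[i]))  # derivative of logistic loss in z
        grad_i = dz * x_blocks[i] + reg * w_blocks[i]
        new_blocks.append(w_blocks[i] - step * grad_i)
    return new_blocks
```

In a full run, μ_t would decay over rounds (e.g., μ_t = c/t) so that Σ_t μ_t = ∞ and Σ_t μ_t² < ∞, as required above.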

Remark 2.
Compared with FDML [1], VRD² [12], and the ADMM sharing algorithm [23], DFOG is applicable to data streams with distributed features over a multiagent system. At each round, the agents observe the same sample through different features. Each agent can obtain an auxiliary term, which carries the information on the entire features. Then, each agent locally runs a gradient descent step to update its local parameters. The procedure of Algorithm 1 is designed to update the parameters ω_{t,i} locally.

Convergence Analysis.
In this section, we analyse the convergence of the proposed algorithm. We first show that the distance between z_{t,i} and z_t is upper bounded by the difference between P^M and 1/N, as stated in Lemma 1 and proved in [25].

Lemma 1. Let Assumption 1 hold; then, for all agents i, j, we have
where N is total number of agents and M is the number of consensus steps in (14).
Then, we show that the regret of online gradient descent (OGD) is upper bounded by the cumulative difference between the losses of ω_t and ω_{t+1}, as presented in Lemma 2 and proved in [18].

Lemma 2. Let {ω_{t,i}}_{t=1}^T denote the sequence of parameters produced by OGD. Then, for any u, we have the stated bound. Because the features are distributed across agents, Reg_i^T mainly measures the difference between the local parameters ω_i and the corresponding block ω_i* of the global solution. Based on the above lemmas, we derive a regret bound on ω_i for DFOG with the regularization term r(ω_i) = (μ/2)‖ω_i‖₂².

Remark 3.
This theorem indicates that the convergence rate of DFOG depends on the network topology through B and on the number of consensus steps M. The larger M is, or the smaller B is, the faster the convergence. The theorem shows that the proposed algorithm converges to the global solution at a sublinear rate. As the number of data samples increases, the difference between ω_{t,i} and ω_i* becomes smaller.

Time Complexity.
There are two primary operations associated with learning in DFOG: (1) estimating the inner product z_t for each sample at time t and (2) updating the parameters in the gradient descent step. At any time t, computing each estimate z_{t,i} requires O(M) arithmetic operations.
There is one gradient descent step to update the parameters, which requires O(1) arithmetic operations. As a result, the time complexity of DFOG at each time t is O(M).

Space Complexity.
At any time t, DFOG only needs to store the parameters z_t and ω_t, which are updated and time-varying. Hence, the space complexity of DFOG is O(1).

Communication Complexity.
We denote the average degree of the communication graph by k. At each consensus step, each node needs to exchange z_t (float type, 4 bytes) with its neighbors. Since the network topology is an undirected graph, this requires 8kM bytes at any time t. Hence, the communication traffic of DFOG is 8kMT bytes over the whole horizon T.
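The 8kMT-byte figure above can be sanity-checked with a tiny helper (the parameter names are illustrative):

```python
def dfog_traffic_bytes(k, M, T, float_bytes=4):
    """Total DFOG traffic over T rounds: per consensus iteration a node both
    sends and receives one float over each of its k incident (undirected)
    edges, i.e. 2 * k * float_bytes bytes per node per iteration, repeated
    for M iterations in each of the T rounds."""
    return 2 * k * float_bytes * M * T
```

Note that the traffic is independent of the feature dimension d, since only the scalar estimate z_t is exchanged, never the raw features or parameter blocks.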

Simulation
In this section, we test our algorithm by minimizing norm-regularized logistic regression on two public datasets, a9a and bank, from UCI. Here, a multiagent system with 6 agents is considered, and the network is generated by the random geometric graph model. The a9a dataset consists of 32,561 samples.
We adopt the dynamic consensus method to obtain z_t in (14) and use the online gradient descent algorithm to update the parameters ω_i locally via (15). In our simulations, the global model is the one trained in a centralized manner, as if all the features were collected centrally, by the stochastic-gradient descent (SGD) algorithm. Next, we compare our algorithm against the SGD algorithm proposed in [38] and track the loss for different datasets and parameter settings. Figures 3 and 4 present the evolution of the cost during the training procedure for the a9a and bank datasets, respectively. In addition, to make a fair comparison, we analyse the convergence curves based on the number of gradient evaluations. Table 2 shows the testing error for the different datasets and the parameter error for DFOG and SGD. The results show that DFOG converges to the centralized SGD solution while keeping the local feature sets at the corresponding agents.
That is, DFOG can deal with the online supervised learning problem caused by distributed features over networks.
We next show how the performance depends on M. Note that when M is larger, more communication is needed in the consensus step (14). Figure 5 shows the evolution of the cost for different values of M. It can be seen that the larger we set M, the faster DFOG approaches the centralized SGD algorithm.

Conclusions
In this paper, we considered an online supervised learning problem where the features are split across agents in online settings. We proposed an online supervised learning algorithm with distributed features over a multiagent system. We first formulated the centralized cost in a "cost of sum" form. By a dynamic consensus algorithm, each agent can effectively estimate the sum term, which is calculated based on the entire features at each round. Then, with the help of the online gradient descent algorithm, each agent locally updates its parameters. The proposed algorithm does not require knowledge of the total time horizon and does not communicate raw data between neighbors. We proved that the local solution converges to the centralized minimizer, which is the best decision trained on the entire dataset, and that the proposed algorithm achieves an O(√(2T)) regret bound.
Distributed machine learning algorithms are worthy of further study due to their promising future, including distributed online boosting, distributed decision trees [39], the use of big-data-aided learning [40], and distributed learning over time-varying communication topologies in networks.