Online Learning for DNN Training: A Stochastic Block Adaptive Gradient Algorithm

Adaptive algorithms are widely used for training deep neural networks (DNNs) because of their fast convergence. However, the training cost becomes prohibitively expensive due to the computation of the full gradient when training complicated DNNs. To reduce this computational cost, we present a stochastic block adaptive gradient online training algorithm, called SBAG, which applies stochastic block coordinate descent and an adaptive learning rate at each iteration. We also prove that SBAG achieves a regret bound of O(√T), where T is the time horizon. In addition, we use SBAG to train ResNet-34 and DenseNet-121 on CIFAR-10. The results demonstrate that SBAG has better training speed and generalization ability than other existing training methods.


Introduction
Benefitting from vast data samples and complex training models, deep learning has attracted great interest in recent years and has been applied to resource allocation [1][2][3][4], signal estimation [5,6], computer vision [7][8][9], and so on. However, the computational cost of the training process is very high, since large amounts of training data and many iterative updates are needed to obtain good model parameters. Speeding up the training process and improving model performance are therefore key.
Therefore, besides proposing new training architectures [10], designing effective training algorithms is also important.
This study focuses on the design of efficient training algorithms for deep neural networks (DNNs). In fact, many practical problems can be modeled as optimization problems [11][12][13], which can be solved by gradient-based methods. The stochastic gradient descent (SGD) method is an effective optimization algorithm [14]. Moreover, it is simple to implement and is frequently used to train DNNs.
Despite the simplicity of stochastic gradient descent, its convergence rate is often slow. The same learning rate is not suitable for all parameter updates across the training process, especially with sparse training data. For this reason, a number of training methods have been proposed to address this issue, for instance, AdaGrad [15], RMSProp [16], AdaDelta [17], and Adam [18]. These methods are referred to as Adam-type algorithms since they employ adaptive learning rates. Among them, Adam is the most widely used in deep learning training tasks, such as the optimization of convolutional and recurrent neural networks [19,20]. Despite its popularity, Adam suffers from convergence issues; for this reason, AMSGrad [21] was proposed, which introduces a nonincreasing learning rate. Besides, the learning rates of Adam can be either too large or too small, which results in poor generalization performance. To avoid such extreme learning rates, a variant of Adam, Padam [22], was proposed, which employs a partial adaptive parameter p. SWATS [23] switches from Adam to SGD during training. AdaBound [24] clips the learning rate to a dynamic bound at each iteration.
In deep learning, gradient-based methods are used to optimize the model parameters, which requires calculating the gradients of all coordinates of the decision vector at each iteration; huge amounts of data and complex models make this computation expensive. Randomized block coordinate descent is an efficient method for high-dimensional optimization problems and has been successfully applied to large-scale problems arising in machine learning [25]. It divides the set of variables into blocks and, at each iteration, performs a gradient update step on a randomly selected block of coordinates while holding the remaining ones fixed. In this way, the computational expense of each iteration can be effectively reduced.
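As a concrete illustration of the idea, the following sketch (a hypothetical NumPy example on a toy quadratic objective, not the paper's algorithm) updates only a randomly chosen block of coordinates per iteration while the rest stay fixed:

```python
import numpy as np

def block_coordinate_step(x, grad_fn, block_size, lr, rng):
    """One randomized block coordinate descent step: pick a random
    block of coordinates, update only those, leave the rest fixed."""
    n = x.size
    idx = rng.choice(n, size=block_size, replace=False)  # random block
    g = grad_fn(x)                  # in practice only g[idx] need be computed
    x = x.copy()
    x[idx] -= lr * g[idx]           # update the selected block only
    return x

# toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x itself
rng = np.random.default_rng(0)
x = np.ones(10)
for _ in range(200):
    x = block_coordinate_step(x, lambda v: v, block_size=3, lr=0.5, rng=rng)
print(np.linalg.norm(x) < 1e-2)
```

Each coordinate is touched only about 30% of the time, yet the iterate still converges to the minimizer; per-step cost scales with the block size rather than the full dimension.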
In this study, we propose a stochastic block adaptive gradient online learning (SBAG) algorithm to rapidly train DNNs, which incorporates an adaptive learning rate and a stochastic block coordinate approach to improve generalization ability and reduce computation cost. Our key contributions are as follows:
(i) We present the SBAG algorithm, based on the stochastic block coordinate descent method and the AdaBound optimization algorithm, to solve high-dimensional optimization problems.
(ii) We provide a theoretical convergence analysis for SBAG. We show that SBAG converges in the convex setting under common assumptions and that its regret is bounded by O(√T).
(iii) We demonstrate the performance of SBAG on a public dataset. The simulation results show that the algorithm takes less time to achieve the best accuracy on the training and test sets, outperforming other methods.
The rest of this study is arranged as follows. In the next two sections, we review the extant literature and introduce related background. In Section 4, we present SBAG in detail. In Sections 5 and 6, we describe our convergence analysis and performance evaluation. Finally, we present the conclusion in Section 7.

Related Work
SGD is one of the most popular algorithms used in DNN training because of its ease of implementation. However, it uses the same learning rate for all parameter updates at each iteration across the training process, and parameters are updated to the same extent no matter how different their feature frequencies are, which results in a slow convergence rate and poor performance. Hence, several variants of SGD were proposed to improve its convergence rate, either by making the learning rate adaptive or by using historical gradient information for the descent direction. Ghadimi et al. [26] used the heavy-ball method to combine first-order historical gradients with current gradients for updates. Sutskever et al. [27] presented Nesterov's accelerated gradient (NAG) method. Duchi et al. [15] proposed AdaGrad, the first method with an adaptive learning rate; however, AdaGrad performs worse in the case of dense gradients because all historical gradients are used in the updates, and this limitation is more severe for the high-dimensional data common in deep learning. Hinton [16] proposed RMSProp, which uses an exponential moving average to fix the sharply decreasing learning rate of AdaGrad. Zeiler [17] proposed AdaDelta, which prevents learning rate decay and gradient disappearance over time. Further research combined adaptive learning rates with historical gradient information, as in Adam [18] and AMSGrad [21]. Adam has a good convergence rate in many scenarios; however, it was found that Adam may not converge in the later stages of training on account of an oscillating learning rate. Reddi et al. [21] presented AMSGrad, but its experimental results were not much better than Adam's. In general, Adam-type algorithms converge well but often do not generalize out of sample as well as SGD. To address this issue, Keskar and Socher [23] proposed the SWATS algorithm.
SWATS uses Adam in the early part of training and switches to SGD in the later stage. It thus enjoys the quick convergence of Adam and the good generalization of SGD, but the switching time is difficult to determine in practice. Huang et al. [28] presented NosAdam, which increases the effect of past gradients on parameter updates to avoid being trapped in local minima or diverging; nevertheless, it depends heavily on the initial conditions. Padam [22] introduced a parameter p that controls the level of adaptivity of the update process. Luo et al. [24] proposed the AdaBound algorithm, which provides a dynamic bound for the learning rate; AdaBound was evaluated on a public dataset and shown to converge as fast as Adam while performing as well as SGD. However, all the aforementioned methods need to calculate every coordinate of the gradient of the decision vector at each iteration, and the computation cost is aggravated by high-dimensional data and complex model structures.
The randomized block coordinate descent method is a powerful and effective approach for high-dimensional optimization problems. It employs a randomized strategy to pick a block of variables to update per iteration. General gradient descent algorithms must calculate all coordinates of the gradient vector each time, which clearly incurs significant computing cost for high-dimensional data. In contrast, the randomized block coordinate method calculates only one block of gradient coordinates, which is taken as the descent direction. In particular, the method selects a coordinate with probability p and updates the corresponding decision variable along its descent direction, while the other coordinates of the decision vector remain unchanged from the previous iteration. Although the randomized block coordinate method can save significant computing cost, especially in high-dimensional optimization problems, it uses a fixed learning rate that scales the entries of the gradient equally; an adaptive learning rate has not been applied in this method.
Compared with existing work, in this study we combine the randomized block coordinate descent method with an adaptive learning rate. At each iteration, a subset of gradient coordinates is picked randomly, and the corresponding decision variables are updated. In this way, the gradient is calculated only on the chosen block of coordinates instead of the full gradient. Moreover, extreme learning rates are restricted to a suitable range. Our method not only enjoys good generalization performance but also saves computation cost.

Preliminaries
In this section, we first introduce the optimization problem in detail. Then, we present background on the randomized block coordinate method.

The Online Optimization Problem.
In this work, the analysis of the sequential optimization problem is based on the online learning framework, which can be seen as a repeated game between a learner (the algorithm) and an opponent.
In this online convex setting, the learner selects a decision point x_t ∈ X produced by the algorithm at each time step t, t = 1, . . . , T, where X is a convex and compact subset of R^n. At the same time, the opponent responds to the learner's decision with a loss function f_t, which is convex and unknown in advance, and the algorithm suffers a loss f_t(x_t).
Repeating this process yields a sequence of loss functions f_t : X → R that vary with time t. In general, the online learner's prediction problem is to minimize the cumulative loss over x_t ∈ X (problem (1)). For online learning tasks, the goal is to minimize the regret R_T of the online learner's predictions against the optimal decision in hindsight, defined as the difference between the total loss Σ_{t=1}^T f_t(x_t) incurred over T rounds and its minimum value Σ_{t=1}^T f_t(x*) at the fixed optimal decision point x*. In particular, we define the regret as

R_T = Σ_{t=1}^T f_t(x_t) − min_{x ∈ X} Σ_{t=1}^T f_t(x). (2)

If the regret of an online optimization algorithm is a sublinear function of T, i.e., lim_{T→∞} R_T/T = 0, then, on average, the online learner performs as well as the fixed optimal decision in hindsight. In other words, the proposed algorithm converges when its R_T is sublinearly bounded. Throughout this study, the diameter of the convex compact set X is assumed to be bounded, and ‖∇f_t(x_t)‖ is bounded for all t = 1, 2, . . . , T. Hereafter, ‖ · ‖ denotes the ℓ2 norm.
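To make the regret notion concrete, the following toy sketch (an illustrative assumption of ours, not the paper's algorithm) runs one-dimensional online gradient descent against quadratic losses f_t(x) = (x − z_t)² with an O(1/√t) step size, then measures the average regret R_T/T against the best fixed decision in hindsight:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
z = rng.uniform(-1.0, 1.0, size=T)       # round-t loss: f_t(x) = (x - z_t)^2
x, cum_learner = 0.0, 0.0
for t in range(1, T + 1):
    cum_learner += (x - z[t - 1]) ** 2   # loss suffered at round t
    grad = 2.0 * (x - z[t - 1])
    x -= grad / (2.0 * np.sqrt(t))       # step size O(1/sqrt(t))

x_star = z.mean()                        # best fixed decision in hindsight
cum_best = np.sum((x_star - z) ** 2)
regret = cum_learner - cum_best
print(round(regret / T, 3))
```

With a sublinear regret bound such as O(√T), the printed average regret shrinks as T grows, matching the convergence criterion lim R_T/T = 0 above.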

Relevant Definitions. We now describe the relevant definitions used in the following sections.

SBAG Algorithm and Assumptions
This section presents the proposed algorithm, followed by the common assumptions used in its convergence analysis.

Algorithm Design.
In this study, we consider high-dimensional online learning problems and aim to solve the optimization problem (1) by incorporating the stochastic block coordinate method and an adaptive learning rate. Because the dimensionality n of the decision variable x is high, the computing cost of the gradients is prohibitive; in addition, tuning the learning rate is challenging. For these reasons, a stochastic block coordinate adaptive optimization algorithm, dubbed SBAG, is proposed for solving the online problem (1). In our algorithm, the objective functions at different times satisfy the conditions given in Assumption 1.
SBAG is described in Algorithm 1. At each round t, an n × n diagonal matrix M_t is generated whose diagonal entries are random variables w_{t,i} with P(w_{t,i} = 0) = 1 − p_t and P(w_{t,i} = 1) = p_t, for t = 0, 1, . . . , T and i = 1, . . . , n. In particular, the gradient d_t is computed as

d_t = M_t ∇f_t(x_t), (6)

where M_t ≜ diag(w_t) = diag(w_{t,1}, w_{t,2}, . . . , w_{t,n}) and the elements of w_t are 0 or 1. When w_{t,i} = 1, the ith coordinate of the decision vector is selected for the gradient computation at time t. From (6), one can observe that the computation cost is greatly reduced at each iteration. In addition, let H_t denote the σ-algebra generated by all variables before time t. Using d_t, the first- and second-moment terms m_t and v_t are obtained as follows, respectively:

m_t = β_{1t} m_{t−1} + (1 − β_{1t}) d_t, (7)
v_t = β_2 v_{t−1} + (1 − β_2) d_t ⊙ d_t. (8)
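A minimal sketch of this step, assuming Adam-style moment recursions and using NumPy (the function name and toy values are illustrative), might look as follows:

```python
import numpy as np

def sbag_moments(grad, m, v, p, beta1, beta2, rng):
    """Sketch of the SBAG gradient step: a random 0/1 mask w_t
    (each coordinate kept with probability p) gives the block
    gradient d_t = M_t * grad, which feeds Adam-style moments."""
    w = (rng.random(grad.size) < p).astype(float)   # diagonal of M_t
    d = w * grad                                    # block gradient d_t
    m = beta1 * m + (1.0 - beta1) * d               # first moment m_t
    v = beta2 * v + (1.0 - beta2) * d * d           # second moment v_t
    return d, m, v

rng = np.random.default_rng(2)
g = np.ones(8)                        # stand-in for the full gradient
m = np.zeros(8)
v = np.zeros(8)
d, m, v = sbag_moments(g, m, v, p=0.5, beta1=0.9, beta2=0.999, rng=rng)
print(np.all((d == 0.0) | (d == g)))  # unselected coordinates are zeroed
```

Only the coordinates with w_{t,i} = 1 contribute to d_t, so in a real implementation the remaining partial derivatives need not be computed at all.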
Furthermore, SBAG introduces a bound on the learning rate: each element of the learning rate α/√V_t is clipped to lie in an interval at time t, with lower and upper bounds μ_low(t) and μ_upp(t), respectively. That is, the output of equation (9) is constrained to [μ_low(t), μ_upp(t)]; this technique was also used in [23,24]. Moreover, let η_t denote the clipped learning rate (equation (10)). Then, SBAG updates x_{t+1} as

x_{t+1} = Π_X(x_t − η_t ∘ m_t), (11)

where ∘ is the coordinate-wise product operator and Π_X is the projection onto X. Furthermore, the projection step of equation (11) is equivalent to equation (12).
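The clipping and projection can be sketched as below (a hypothetical NumPy illustration; the bound functions mu_low and mu_upp here are toy choices that converge to a constant rate, in the spirit of AdaBound, and the box [lo, hi] stands in for the feasible set X):

```python
import numpy as np

def sbag_update(x, m, v, alpha, t, mu_low, mu_upp, lo, hi):
    """Sketch of the SBAG parameter update: clip the per-coordinate
    rate alpha/sqrt(v) into [mu_low(t), mu_upp(t)], take a
    coordinate-wise step with the first moment m, then project
    back onto the (box-shaped) feasible set."""
    rate = alpha / (np.sqrt(v) + 1e-12)          # raw adaptive rate
    rate = np.clip(rate, mu_low(t), mu_upp(t))   # bounded learning rate
    x = x - rate * m                             # coordinate-wise product
    return np.clip(x, lo, hi)                    # projection onto X

# dynamic bounds that tighten toward a constant rate of 0.1 over time
mu_low = lambda t: 0.1 - 0.1 / (t + 1)
mu_upp = lambda t: 0.1 + 0.1 / (t + 1)
x = sbag_update(np.ones(4), m=np.full(4, 0.5), v=np.full(4, 4.0),
                alpha=1.0, t=10, mu_low=mu_low, mu_upp=mu_upp,
                lo=-1.0, hi=1.0)
print(x)
```

Here the raw rate 1/√4 = 0.5 exceeds μ_upp(10), so the clipped rate is used instead; as t grows, both bounds approach 0.1 and the behavior smoothly transitions toward an SGD-like constant step.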

Assumptions.
Before presenting the convergence analysis of SBAG, we introduce the following common assumptions.
Assumption 2. In this study, X is a bounded feasible set; i.e., its diameter is bounded. Assumptions 1-3 are standard in the literature; see, for example, [18,21,24]. The convergence of SBAG is analyzed under these assumptions in the following.

Convergence Analysis
Now, we analyze the convergence of SBAG. We consider the regret, equation (2), in the online optimization problem (a typical scenario). The proposed algorithm generates the gradient d_t with probability p_t at time t; therefore, d_t is a random variable. Moreover, x_t is calculated from d_t and x_{t−1} at time t. Since the variables are random, expectations must be taken, and we therefore define the regret of SBAG as in equation (13). From the convexity of f_t, the bound in equation (14) follows. Moreover, by the definition of the matrix M_t, we know that M_t is a sparse matrix; applying equation (14) leads to equation (15).

(Algorithm 1. Input: x_1 ∈ X. Parameters: β_{1t} ∈ [0, 1) with β_{11} = β_1 and β_{1t} = β_1 λ^t, where λ ∈ (0, 1) and t = 1, 2, . . . , T; β_2 ∈ [0, 1); p_t, the coordinate selection probability at time t. Initially set m_1 = 0 and v_1 = 0.)

Taking the conditional expectation (conditioned on H_t) on both sides of equation (15) yields equation (16). By equation (1.1f) of Section 4 in [30], and taking the unconditional expectation of equation (16), we obtain equation (17). From equations (13) and (17), equation (18) holds. To bound R_T, we consider the two terms on the right-hand side of equation (18).
Thus, we first propose the following lemmata to estimate these terms.

Lemma 1. Suppose Assumptions 1 to 3 are satisfied and the sequences x_t, m_t, and v_t are generated by SBAG with t ∈ {1, 2, . . . , T}. Moreover, X is a convex and compact set, the losses satisfy problem (1), and p_t ∈ [p_min, p_max]. Then the relation in equation (19) holds.

Proof. From equations (9) and (10), we obtain equations (20) and (21). From equations (20) and (21), and by the properties of expectation, equation (22) can be verified. Plugging equation (7) into equation (22) yields equation (23). By the Cauchy-Schwarz inequality, we further bound the term (a) of equation (23), obtaining equation (24). The second inequality of equation (24) follows from the fact that β_{1k} ≤ β_1 for all k ∈ {1, . . . , T}, and the third inequality of equation (24) follows similarly. Moreover, summing over t = 1, . . . , T and applying equation (25), equation (26) follows. This completes the proof of Lemma 1. Next, we introduce Lemma 2 to estimate the second term.

Applying Young's inequality and the Cauchy-Schwarz inequality to equation (31) leads to equation (32). Summing equation (32) over t ∈ {1, 2, . . . , T} and taking the expectation of the resulting relation imply equation (33). By Lemma 1 and equation (33), equation (34) follows. Since the clipped learning rate μ_t is nonincreasing, we further obtain μ_t^{−1} ≥ μ_{t−1}^{−1}. Then, from equation (34), the relation in equation (35) can be proved.
□
Theorem 1. Suppose that Assumptions 1 to 3 are satisfied and that the sequences x_t, m_t, and v_t are generated by SBAG with t ∈ {1, 2, . . . , T}. Moreover, X is a convex and compact set, the losses satisfy problem (1), and p_t ∈ [p_min, p_max]. Then the regret is bounded as in equation (36).

Proof. Applying Lemmata 1, 2, and 3 to equation (18) yields the bound, which completes the proof of Theorem 1. From Theorem 1, we obtain lim_{T→∞} R_T/T = 0, which shows that SBAG is convergent. In addition, the regret bound R_T is O(√T); i.e., to reach a given accuracy ε, on the order of O(1/ε²) iterations are required.

Performance Evaluation
In this section, we perform experiments on a public dataset to evaluate the performance of the algorithm objectively. We consider multi-class classification tasks using DNNs for the experiments.
6.1. Setup. To assess the SBAG algorithm, we study its performance on classification tasks. We use the CIFAR-10 [32] dataset for our experiments, which is widely used for classification problems; it consists of 10 classes, 50000 training samples, and 10000 test samples.
For the experiments, we use convolutional neural networks, which perform well on image classification and object recognition, to solve classification tasks on the CIFAR-10 image dataset; specifically, we implement ResNet-34 [33] and DenseNet-121 [34].

Parameters.
To study the performance of the proposed algorithm, we compare SBAG with SGD [14], AdaGrad [15], and AdaBound [24]. The hyper-parameters of these algorithms are initialized as follows.
For SGD, the learning rate is selected from the set {100, 10, 1, 0.1, 0.01}. AdaGrad uses an initial learning rate from the set {5e−2, 1e−2, 5e−3, 1e−3, 5e−4}, and the initial accumulator value of AdaGrad is set to 0. The hyper-parameters of AdaBound are set the same as those of Adam, and we directly use the initialized hyper-parameter values of AdaBound in our algorithm. In addition, we set the probability of choosing a coordinate from the set {0.10%, 0.50%, 1.00%, 5.00%, 10.00%, 50.00%}.
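For illustration, the candidate settings above can be enumerated as a simple grid (a hypothetical sketch of the tuning loop; in the paper each algorithm's hyper-parameters are tuned separately):

```python
from itertools import product

# grids taken from the settings listed above
sgd_lrs = [100, 10, 1, 0.1, 0.01]
block_probs = [0.001, 0.005, 0.01, 0.05, 0.10, 0.50]  # coordinate-selection p_t

# every (learning rate, selection probability) combination to try
configs = [{"lr": lr, "p": p} for lr, p in product(sgd_lrs, block_probs)]
print(len(configs))
```

Each configuration would then be trained for a fixed number of epochs and the best-performing setting kept.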
In addition, we define the dynamic bound functions following [24] for our simulation experiments. The running times for ResNet-34 are shown in Figure 1(a): when completing the same 200 epochs, our method takes the least time, and AdaBound takes the most. The main reason is that our algorithm calculates only a few blocks of coordinates in the gradient descent step at each iteration t, while the compared algorithms calculate full gradients. Moreover, AdaBound combines first- and second-order momentum, while SGD and AdaGrad only use first-order gradients; thus, SGD and AdaGrad take less time than AdaBound. The same results can be seen for DenseNet-121 in Figure 1(b).
We present another group of experiments measuring average loss against running time, executed for ResNet-34 and DenseNet-121 on CIFAR-10; the findings are shown in Figure 2. At about 150 epochs, SGD has the largest average loss and decreases sharply after that point, while the average loss of SBAG is smaller than the others and reaches its minimum in the shortest running time. The fast descent rate of SBAG is due to the randomized block method, which chooses one block of coordinates of the decision vector for the gradient computation; in other words, SBAG processes more samples than the compared algorithms in the same running time. Therefore, the convergence of SBAG is verified by the findings in Figure 2.
In Figures 3 and 4, the training and test accuracy of the four algorithms are plotted against running time. As we can see, at about 150 epochs, AdaBound achieves the highest accuracy, and AdaGrad and our algorithm reach accuracies of 92.36% and 93.99%, respectively. As the running time increases, AdaBound and SBAG reach accuracies of 99.96% and 99.93%, respectively. Similar results can be seen on DenseNet-121. In short, SBAG works well on both the training and test sets and shows good generalization ability on both ResNet-34 and DenseNet-121.
From the experiments above, we observe that SBAG performs very well on both ResNet-34 and DenseNet-121. It incurs less computation cost per iteration, which is consistent with the theory.

Conclusion
In this study, we proposed a randomized block adaptive gradient online learning algorithm, SBAG, designed to reduce the gradient computation cost for high-dimensional decision vectors. The convergence analysis showed that the regret bound of SBAG is O(√T) when the loss functions are convex, and the evaluations on CIFAR-10 demonstrated significant computation cost savings without adversely affecting the performance of the optimizer. Over the same 200 epochs, the proposed algorithm has the least running time and a slightly lower average loss in the end. The training accuracy for ResNet-34 and DenseNet-121 is 99.93% and 99.72%, respectively, slightly below the 99.96% of AdaBound, but our method reaches higher test accuracy than AdaBound, SGD, and AdaGrad; i.e., SBAG is the fastest of the four methods, and its curves are smoother than those of SGD.
Data Availability
The data that support the findings of this study are CIFAR-10, which is available from [32].