Fast Training Logistic Regression via Adaptive Sampling

Logistic regression has been widely used in artificial intelligence and machine learning owing to its deep theoretical basis and good practical performance. Training it amounts to solving a large-scale optimization problem characterized by a likelihood function, for which gradient descent is the most commonly used approach. However, when the data size is large, gradient descent is very time-consuming because it computes the gradient from all the training data in every iteration. Although random sampling can ease this difficulty, the appropriate sample size is hard to determine in advance and the resulting estimate may not be robust. To overcome this deficiency, we propose a novel algorithm for fast training of logistic regression via adaptive sampling. The proposed method decomposes the gradient estimation problem into several subproblems, one per dimension, and solves each subproblem independently by adaptive sampling. Each element of the gradient estimate is obtained by repeatedly sampling a fixed number of training examples until its stopping criterion is satisfied. The final estimate combines the results of all the subproblems. We prove that the obtained gradient estimate is robust and that it keeps the objective function value decreasing during the iterative calculation. Compared with representative algorithms based on random sampling, the experimental results show that the proposed algorithm achieves comparable classification performance with much less training time.


Introduction
Supervised learning trains a learner on a labelled training set so that it correctly determines the outputs for unseen instances [1]. As one of the best-known classification algorithms in supervised learning, logistic regression is a generalized linear regression model whose output is discrete [2]. Owing to its good performance, logistic regression has been widely used in applications such as process tomography [3], customer churn prediction [4], spatial prediction [5], major chronic diseases and clinical risk prediction [6,7], and so on [8][9][10].
Training a logistic regression model amounts to solving an unconstrained convex optimization problem, for which gradient descent is one of the most important methods [11]. Because gradient descent (GD) computes the gradient from all the training instances, it is very time-consuming when the data size is large [12].
To speed up GD, many improved algorithms have been developed. According to the amount of data used to obtain the gradient estimate, these algorithms can be divided into two groups: stochastic gradient descent and batch gradient descent [13]. Stochastic gradient descent (SGD) uses only one randomly selected training example to compute the gradient, which can be very efficient for large datasets [14]. SGD is therefore much faster per iteration and well suited to online learning. However, the gradient estimated by SGD is often not a descent direction, so SGD needs a vast number of iterations. Furthermore, SGD is difficult to parallelize [15].
Unlike SGD, batch gradient descent (BGD) obtains the gradient estimate from a certain number of randomly chosen training examples. In this way, BGD can largely reduce the error and instability of the estimate, and it also obtains an effective solution [16][17][18]. Because the sampled examples play an important role in estimating the gradient, BGD needs an appropriate sample size to be chosen before sampling. However, it is difficult to predetermine an appropriate sample size for different datasets. Furthermore, samples of the same size can vary in quality, because some examples are more representative of the original data than others [19].
This paper presents an improved adaptive sampling (AS) algorithm for accelerating the logistic regression training process. The method first gives a rule for estimating the gradient from a subset of examples, such that the obtained gradient estimate guarantees that the objective function value keeps decreasing during the iterative calculation of GD. Then, the problem of obtaining a vector that meets the rule is decomposed into several subproblems, where each subproblem determines one component of the vector that satisfies its stopping rule. Finally, examples are drawn successively from the training set into the sample, and sampling terminates as soon as every component of the gradient estimated on the sample satisfies its own rule. To speed up this process, components that already satisfy their stopping rules are not re-estimated in subsequent iterations; they become the corresponding components of the final gradient estimate.
The main contributions of this paper are as follows: (1) We give a rule for judging whether a vector is a descent direction of the current objective function, which is critical for the execution efficiency of the gradient descent method. (2) We provide an adaptive sampling method that removes the need to predetermine the sample size; it determines the sample size adaptively according to the characteristics of the dataset and avoids the influence of subjective human choices. (3) We apply a divide-and-conquer strategy to obtain the gradient estimate efficiently from the sampled examples: the gradient vector estimation problem is divided into several one-dimensional estimation subproblems, each of which can be solved independently. (4) We prove, using probably approximately correct theory, that the obtained gradient estimate is robust and that it is a descent direction of the current objective function at each iteration. (5) We design an efficient mechanism for solving the multivariate estimation problem on large-scale data.
The rest of the paper is organized as follows: Section 2 reviews related methods according to their characteristics. Section 3 proposes a sampling on-demand algorithm for logistic regression and proves its effectiveness. Section 4 reports the experimental results and compares them with existing methods. Section 5 concludes the paper and outlines future research.

Related Work
Improvements to GD have been studied extensively. According to the amount of data used to obtain the gradient estimate, existing GD algorithms can be divided into two groups: SGD algorithms and BGD algorithms. The original SGD (OSGD) algorithm computes the gradient from only one sample of the training set. OSGD does not consider the effect of different dimensions on its convergence, so it can converge slowly when the surface of the objective function curves much more steeply in some dimensions than in others. Qian [20] proposed the Momentum algorithm to accelerate OSGD, in which the current update vector adds a fraction of the update vector from the previous iteration. To handle sparse data efficiently, Duchi et al. [21] proposed the Adagrad method for online learning tasks, which sets different learning rates for different components of the vector. There are also many other algorithms that determine adaptive learning rates for different components [22][23][24]. Although these algorithms can deal with large-scale data, they must perform a vast number of iterations before the objective function improves appreciably, and they are difficult to parallelize.
Unlike the SGD algorithm, the BGD algorithm computes the gradient from randomly sampled examples of the training set at each iteration. The BGD algorithm can therefore reduce the variance of the gradient estimate and achieve more stable convergence. To converge reliably to the optimal solution, the gradient estimate of the BGD algorithm needs to enforce descent in the objective function at every iteration, so the sample size must be determined carefully. Byrd et al. [25] proposed a dynamic sample gradient (DSG) algorithm, which dynamically determines the sample size before sampling. For convex optimization problems, the DSG algorithm can reach the optimal solution. However, the sample size determined by the DSG algorithm can grow as the iterations proceed, so its total running time also increases. Furthermore, because samples of the same size can vary in quality, a gradient estimate computed from a fixed-size sample may fail to enforce descent in the objective function. Choosing a proper learning rate is another important issue for the performance of the BGD algorithm. A smaller learning rate slows convergence, whereas a larger one makes the iterates fluctuate noticeably around the optimal solution. Robbins and Sutton [26] proposed a schedule for selecting the learning rate during training, in which a predefined schedule gradually reduces the learning rate. Liang et al. [27] proposed a sampling on-demand method to speed up logistic regression, but they gave no theoretical proof of the robustness of its result and did not compare its classification performance with other state-of-the-art approaches. Many similar algorithms also yield highly accurate solutions of the optimization problem [28].

Preliminary.
The logistic regression classifier is built from the posterior probabilities of the two labels $\{0, 1\}$, which are expressed through a linear function of $x \in \mathbb{R}^m$ and sum to one. The model has the form

$$P(y = 1 \mid x) = \frac{\exp(\omega^{\top} z)}{1 + \exp(\omega^{\top} z)}, \qquad P(y = 0 \mid x) = \frac{1}{1 + \exp(\omega^{\top} z)}, \tag{1}$$

where $\omega = (\omega_1, \ldots, \omega_{m+1})^{\top}$ is the weight vector, $z = (x^{\top}, 1)^{\top}$ is the instance augmented with a constant term, $\omega^{\top} z$ is the inner product between $\omega$ and $z$, $y$ is the label of $x$, and $m$ is the number of dimensions. Let $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be a training set, where each training instance $x_i$ ($i = 1, 2, \ldots, N$) is an $m$-dimensional vector $(x_{i1}, x_{i2}, \ldots, x_{im})^{\top}$ with label $y_i \in \{0, 1\}$. The optimal weight vector is obtained by solving

$$\min_{\omega} J(\omega) = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[ y_i \log p_i(\omega) + (1 - y_i) \log\bigl(1 - p_i(\omega)\bigr) \Bigr], \tag{2}$$

where $p_i(\omega) = P(y = 1 \mid x_i)$ is evaluated with the weight vector $\omega$. Let $\omega^{*}$ be the optimal solution of the optimization problem (2). The logistic regression classifier is the indicator function $I_{(f(z),\,0.5)}$ of equation (3), which compares $f(z) = 1/(1 + \exp(\omega^{*\top} z))$ with the threshold 0.5. With this notation, the predicted label $\hat{y}$ for a given test instance $x_0 = (x_{01}, x_{02}, \ldots, x_{0m})$ is obtained by evaluating this function at $z_0 = (x_0^{\top}, 1)^{\top}$.

The GD algorithm is an iterative optimization algorithm that updates the current solution from the solution of the previous step and the gradient of the objective function at that solution. Owing to its simple implementation and effectiveness, GD is widely adopted to solve the unconstrained optimization problem (2). Let $\omega_k$ be the solution at the $k$-th iteration. The gradient $\nabla J(\omega_k)$ is computed as

$$\nabla J(\omega_k) = \frac{1}{N} \sum_{i=1}^{N} \bigl( p_i(\omega_k) - y_i \bigr) z_i, \tag{4}$$

where $z_i = (x_i^{\top}, 1)^{\top}$. Problem (2) is a convex optimization problem, so the obtained solution is the global optimum. Computing the gradient requires all the data, which is very time-consuming for large-scale data. To overcome this deficiency, we propose a novel algorithm for fast training of logistic regression via adaptive sampling (LLR-AS).
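The full-data gradient descent baseline described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sigmoid`, `augment`, `full_gradient`, `train_gd`), the toy data, and the default step size are hypothetical, and the objective is assumed to be the averaged negative log-likelihood with the bias stored as the last weight.

```python
import math
import random


def sigmoid(u):
    """Numerically stable logistic function 1 / (1 + exp(-u))."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def augment(x):
    """Append a constant 1 so the bias is the last weight (z has m + 1 entries)."""
    return list(x) + [1.0]


def full_gradient(w, Z, y):
    """Average of (P(y=1|x_i) - y_i) * z_i over all N training instances."""
    grad = [0.0] * len(w)
    for z_i, y_i in zip(Z, y):
        err = sigmoid(sum(wj * zj for wj, zj in zip(w, z_i))) - y_i
        for j, zj in enumerate(z_i):
            grad[j] += err * zj
    return [g / len(Z) for g in grad]


def train_gd(Z, y, step=0.5, tol=1e-3, max_iter=5000):
    """Plain GD; stops when successive iterates are closer than tol."""
    w = [0.0] * len(Z[0])
    for _ in range(max_iter):
        g = full_gradient(w, Z, y)  # touches every instance: the costly step
        w_new = [wj - step * gj for wj, gj in zip(w, g)]
        if math.dist(w, w_new) < tol:
            return w_new
        w = w_new
    return w


# Toy data (assumption): one feature, label 1 when the feature is positive.
random.seed(0)
X = [[random.uniform(-2.0, 2.0)] for _ in range(100)]
y = [1 if x[0] > 0 else 0 for x in X]
Z = [augment(x) for x in X]
w = train_gd(Z, y, max_iter=500)
predictions = [1 if sigmoid(sum(a * b for a, b in zip(w, z))) >= 0.5 else 0 for z in Z]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
```

Every call to `full_gradient` costs time proportional to the number of instances, which is the cost that sampling-based estimates aim to avoid.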

LLR-AS Algorithm.
The LLR-AS algorithm is itself a type of minibatch gradient descent algorithm. In every iteration it randomly samples a subset of $T$ and computes a gradient estimate $\nabla \hat{J}(\omega)$ that replaces the gradient $\nabla J(\omega)$ computed from all the data. The convergence efficiency of GD is closely related to the quality of this estimate: if the estimate does not keep the objective function value decreasing in each iteration, $J(\omega)$ does not shrink steadily and the total execution time grows. Because the negative gradient is the direction in which the current objective function value decreases fastest, the estimate $\nabla \hat{J}(\omega)$ should point in a direction similar to $\nabla J(\omega)$. The following rule yields an estimate $\nabla \hat{J}(\omega)$ of high quality:

$$\nabla \hat{J}(\omega)^{\top} \nabla J(\omega) \ge (1 - \varepsilon)\, \lVert \nabla J(\omega) \rVert_2^2, \tag{5}$$

where $\lVert X \rVert_2$ is the Euclidean norm of the vector $X$ and the relaxation parameter satisfies $0 \le \varepsilon < 1/2$. According to [29], the vector $-\nabla \hat{J}(\omega)$ keeps the value of the objective function $J(\omega)$ decreasing during the iterative calculation of GD if it satisfies inequality (5). Moreover, the parameter $\varepsilon$ controls the descent speed of $J(\omega)$: the larger the value of $\varepsilon$, the smaller the directional derivative and the lower the descent speed. Because the optimization problem (2) is convex, the LLR-AS algorithm can reach the globally optimal solution. With this rule, our algorithm is outlined in Algorithm 1.
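The rule can be checked numerically once both vectors are known. The sketch below assumes that inequality (5) has the inner-product form implied by its componentwise version later in the text, namely that the inner product of the estimate and the gradient is at least (1 − ε) times the squared norm of the gradient. The function name `satisfies_descent_rule` and the example vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def satisfies_descent_rule(grad_est, grad_true, eps):
    """True if grad_est . grad_true >= (1 - eps) * ||grad_true||_2^2 (assumed form of (5))."""
    return dot(grad_est, grad_true) >= (1.0 - eps) * dot(grad_true, grad_true)


grad_true = [1.0, -2.0, 0.5]
close_est = [0.9, -1.8, 0.6]   # small perturbation of the true gradient
bad_est = [-1.0, 2.0, -0.5]    # points in the opposite direction

ok_close = satisfies_descent_rule(close_est, grad_true, eps=0.25)
ok_bad = satisfies_descent_rule(bad_est, grad_true, eps=0.25)
```

In practice the true gradient is exactly the quantity one wants to avoid computing, so this check is only useful for verification on small data; the adaptive sampling procedure replaces it with a stopping rule that needs only the sample.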

AS Algorithm.
To obtain a vector $\nabla \hat{J}(\omega)$ that meets inequality (5), we propose a novel adaptive sampling algorithm. According to inequality (5), the simplest way of obtaining $\nabla \hat{J}(\omega)$ is to sample a subset $S$ from the set $T$ and compute the estimate

$$\nabla \hat{J}(\omega; S) = \frac{1}{|S|} \sum_{(x_i, y_i) \in S} \nabla J(\omega; x_i, y_i), \tag{6}$$

where $\nabla J(\omega; x_i, y_i)$ is the gradient contribution of a single instance and $|S|$ is the size of the subset $S$. However, the sample size is difficult to determine for different tasks, although some theoretical and empirical results exist in the literature, such as PAC learning theory [30] and learning curves [31]. Moreover, the theoretical results are usually worst-case whereas learning curves are average-case, so the two are not necessarily consistent with each other [19]. To tackle this difficulty, we propose an adaptive sampling algorithm. Our method obtains the subset $S$ by continually sampling examples from $T$ until the gradient estimate on the sampled subset $S$ satisfies the stopping rule. It decides the sample size from the information in the sampled examples and thus removes the need to predetermine the sample size. The key issue therefore becomes how to design the stopping rule of the adaptive sampling so that inequality (5) is satisfied. From a statistical point of view, estimating a gradient $\nabla \hat{J}(\omega)$ that satisfies inequality (5) is an $(m + 1)$-dimensional vector estimation problem. However, existing sampling procedures mainly focus on one-dimensional estimation problems and cannot be applied directly to the multidimensional problem. Although the components of the gradient $\nabla J(\omega)$ are closely related, each component of $\nabla J(\omega)$ can still be treated as a one-dimensional estimation subproblem. Therefore, the multivariate estimation problem (5) can be solved by solving these one-dimensional subproblems. By the formula for the vector inner product, inequality (5) is equivalent to

$$\sum_{j=1}^{m+1} \Bigl\{ \nabla \hat{J}_j(\omega; S)\, \nabla J_j(\omega) - (1 - \varepsilon) \bigl( \nabla J_j(\omega) \bigr)^2 \Bigr\} \ge 0.$$
In other words, inequality (5) must hold if each component $\nabla \hat{J}_j(\omega; S)$ of $\nabla \hat{J}(\omega; S)$ simultaneously satisfies its own inequality $(1 - \varepsilon)(\nabla J_j(\omega))^2 \le \nabla \hat{J}_j(\omega; S) \times \nabla J_j(\omega)$, where $j = 1, 2, \ldots, m + 1$. The problem of finding a gradient estimate that satisfies inequality (5) can thus be approximately divided into $m + 1$ subproblems, where the $j$-th subproblem seeks an estimate satisfying the $j$-th inequality. Each component $\nabla J_j(\omega)$ of $\nabla J(\omega)$ can be regarded as an expectation over the population $T$, and $\nabla \hat{J}_j(\omega; S)$ is the estimate of $\nabla J_j(\omega)$ on the subset $S$ sampled from the population $T$. According to the central limit theorem, the difference between $\nabla \hat{J}_j(\omega; S)$ and $\nabla J_j(\omega)$ becomes steadily smaller as the sampled subset $S$ grows.
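The step from the componentwise inequalities to inequality (5) is a direct summation over the components. A short derivation, assuming inequality (5) has the inner-product form that its componentwise version suggests:

```latex
\begin{align*}
(1-\varepsilon)\,\bigl(\nabla J_j(\omega)\bigr)^2
  &\le \nabla \hat{J}_j(\omega; S)\,\nabla J_j(\omega),
  \qquad j = 1,\dots,m+1 \\
\Longrightarrow\quad
(1-\varepsilon)\sum_{j=1}^{m+1}\bigl(\nabla J_j(\omega)\bigr)^2
  &\le \sum_{j=1}^{m+1}\nabla \hat{J}_j(\omega; S)\,\nabla J_j(\omega) \\
\Longleftrightarrow\quad
(1-\varepsilon)\,\lVert \nabla J(\omega)\rVert_2^2
  &\le \nabla \hat{J}(\omega; S)^{\top}\,\nabla J(\omega).
\end{align*}
```

The first implication is one-directional: the sum can be nonnegative even when some individual terms are negative, so the componentwise conditions are sufficient for inequality (5) but not necessary. This is why the decomposition is described as approximate.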
There exists a critical size of the sampled subset $S$ beyond which the component inequality $(1 - \varepsilon)(\nabla J_j(\omega))^2 \le \nabla \hat{J}_j(\omega; S) \times \nabla J_j(\omega)$ holds with high probability. Therefore, both an early stopping rule and consecutive sampling are adopted to obtain the sampled subset $S$ and the estimate over it. Given the vector $\omega$ and the objective function $J(\omega)$, each element $\nabla J_j(\omega)$ is a fixed value, and an upper bound on its absolute value can be estimated by the function $\alpha_j(t, s) = d_j \sqrt{\ln(6(m + 1)/\delta) / (2 t^{3} s^{3})}$ ($j = 1, 2, \ldots, m + 1$), where $d_j = 2(|\min_{1 \le i \le N} z_{i,j}| + |\max_{1 \le i \le N} z_{i,j}|)$, $t$ is the size of the subset drawn per sampling round, $s$ is the total number of sampling rounds, and $0 < \delta < 1/2$. In the next section, we show that such a function achieves this goal and that the stopping condition can finally be transformed into $|\nabla \hat{J}_j(\omega; S)| \ge (1 + 1/\varepsilon)\, \alpha_j(t, s)$ for each component $\nabla \hat{J}_j(\omega; S)$. However, this requires recomputing $\nabla \hat{J}(\omega; S)$ on the sampled subset $S$ and testing whether every component $\nabla \hat{J}_j(\omega; S)$ satisfies its own stopping rule in each round of sampling, and this computational burden can cost considerable execution time. Two improvements are made to solve this problem (Algorithm 2).
Let $S^{t-1}$ be the cumulative sampled subset obtained in the first $t - 1$ sampling iterations, and let $S^{t} = S^{t-1} \cup \tilde{S}_t$, where $\tilde{S}_t$ is the subset sampled at the $t$-th iteration. We have

$$\nabla \hat{J}(\omega; S^{t}) = \frac{|S^{t-1}|\, \nabla \hat{J}(\omega; S^{t-1}) + |\tilde{S}_t|\, \nabla \hat{J}(\omega; \tilde{S}_t)}{|S^{t-1}| + |\tilde{S}_t|}.$$

Reusing the result on the set $S^{t-1}$ therefore greatly reduces the computation of $\nabla \hat{J}(\omega; S^{t})$ at each iteration. In addition, we obtain the components of $\nabla \hat{J}(\omega; S)$ asynchronously: once one or more components satisfy their own stopping rules, they are taken directly as the corresponding components of the final result and are no longer considered in subsequent iterations. Algorithm 2 describes the multivariate adaptive sampling method.
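The two improvements, the incremental update of the running estimate and the freezing of components that have met their stopping rule, can be sketched as follows. This is an illustrative sketch under several assumptions, not Algorithm 2 itself: the `radius` function is a hypothetical Hoeffding-style stand-in for the bound function of the text, examples are drawn with replacement, `max_rounds` is an added safety cap for components whose true gradient is close to zero, `eps` must be strictly positive, and all names and the toy data are hypothetical.

```python
import math
import random


def sigmoid(u):
    """Numerically stable logistic function 1 / (1 + exp(-u))."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def radius(d_j, n_seen, dim, delta):
    """ASSUMPTION: Hoeffding-style radius, a stand-in for the paper's alpha_j."""
    return d_j * math.sqrt(math.log(6.0 * dim / delta) / (2.0 * n_seen))


def adaptive_gradient(w, Z, y, batch=50, eps=0.5, delta=0.1, max_rounds=200, rng=random):
    """Estimate each gradient component separately, stopping it as early as possible."""
    dim = len(w)
    # d_j = 2 * (|min_i z_ij| + |max_i z_ij|), as defined in the text.
    d = [2.0 * (abs(min(z[j] for z in Z)) + abs(max(z[j] for z in Z)))
         for j in range(dim)]
    running = [0.0] * dim      # running mean per component; frozen once it stops
    active = set(range(dim))   # components still being estimated
    n_seen = 0
    for _ in range(max_rounds):
        idx = [rng.randrange(len(Z)) for _ in range(batch)]
        errs = [sigmoid(sum(a * b for a, b in zip(w, Z[i]))) - y[i] for i in idx]
        for j in active:
            batch_mean = sum(e * Z[i][j] for e, i in zip(errs, idx)) / batch
            # Incremental update: reuse the mean over the previously seen examples.
            running[j] = (n_seen * running[j] + batch * batch_mean) / (n_seen + batch)
        n_seen += batch
        # Keep only the components that have NOT yet met their stopping rule.
        active = {j for j in active
                  if abs(running[j]) < (1.0 + 1.0 / eps) * radius(d[j], n_seen, dim, delta)}
        if not active:
            break
    return running, n_seen


# Toy data (assumption): one feature plus a constant term, evaluated at w = 0.
rng = random.Random(0)
X = [rng.uniform(-2.0, 2.0) for _ in range(500)]
Z = [[x, 1.0] for x in X]
y = [1 if x > 0 else 0 for x in X]
grad_est, n_used = adaptive_gradient([0.0, 0.0], Z, y, rng=rng)
```

On this toy data the bias component has a true gradient near zero, so it cannot meet the stopping rule and the loop runs to the cap, while the feature component freezes earlier. A real implementation would precompute the `d` values once rather than on every call.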

The Effectiveness of the AS Algorithm.
In this subsection, we study the effectiveness of the AS algorithm. To derive our main result, we need the following lemmas and theorems.
Algorithm 1 (LLR-AS). Input: the training set $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, the step size $\lambda > 0$. Output: the optimal vector $\omega^{*}$.
The stopping step $t_j$ of the $j$-th component is a random variable that depends on the sample drawn from $T$. Let $t_j^{1}$ be the smallest integer satisfying condition (12). Since $\alpha_j^{t}$ is strictly decreasing in $t$ and $\nabla J_j(\omega)$ is fixed for any given $\omega$, $t_j^{1}$ is uniquely determined for $j = 1, 2, \ldots, m + 1$.
Proof. From the AS algorithm we obtain the estimate $\nabla \hat{J}_j(\omega; S^{t_j})$ at the stopping step. Thus, $\nabla \hat{J}_j(\omega; S^{t_j})\, \nabla J_j(\omega) \ge (1 - \varepsilon)(\nabla J_j(\omega))^2$ always holds as long as $\nabla \hat{J}_j(\omega; S^{t_j})$ and $\nabla J_j(\omega)$ have the same sign. On the other hand, if $\nabla \hat{J}_j(\omega; S^{t_j})$ and $\nabla J_j(\omega)$ have different signs, then they differ substantially, because $|\nabla \hat{J}_j(\omega; S^{t_j})| \ge \alpha_j^{t_j} (1 + 1/\varepsilon)$. Next, we show that this difference is large enough, which helps to prove that this situation occurs only with small probability.
Proof. When $t_j < t_j^{1}$, it always holds that $|\nabla \hat{J}_j(\omega; S^{t_j})| < (1 + 1/\varepsilon)\, \alpha_j^{t_j^{1}}$ and $\alpha$ … Inequality (15) then follows from the triangle inequality, and combining inequality (15) with Lemma 2 gives the claimed bound. Combining Lemmas 3 and 4, we easily obtain the following theorem.

Theorem 2.
For any $0 < \varepsilon < 1/2$ and $0 < \delta < 1/2$, the final estimate $\nabla \hat{J}(\omega) = (\nabla \hat{J}_1(\omega; S^{t_1}), \ldots, \nabla \hat{J}_{m+1}(\omega; S^{t_{m+1}}))$ generated by the AS algorithm satisfies inequality (5) with probability at least $1 - \delta$.
Proof. The result follows by combining Theorem 1 and Lemma 1.
By Theorem 2, the estimate obtained with the AS algorithm keeps the value of the current objective function decreasing in each iteration of GD. It therefore guarantees that the LLR-AS algorithm reaches the optimal solution of the convex optimization problem (2), and this conclusion is verified in the experiments. Besides the optimal solution, we also pay attention to the number of sampled examples (NSE). Because the AS algorithm is an iterative sampling algorithm that draws a fixed number of examples each time, NSE can be estimated from the total number of iterations. Lemmas 3 and 4 have already shown that the AS algorithm terminates within $t^{1}$ steps, where $t^{1} = \max_{1 \le j \le m+1} t_j^{1}$. Since $t^{1}$ can be taken as the minimum number of iterations satisfying condition (12), we may assume that $\alpha_j^{t^{1}} \approx \varepsilon |\nabla J_j(\omega)| / (1 + 2\varepsilon)$, where $j = 1, 2, \ldots, m + 1$. NSE can then be estimated by formula (19). The parameters $m$ and $\delta$ have little effect on formula (19) because they appear only inside the logarithm. Hence, the AS algorithm is unlikely to sample too many examples, which makes it useful for sampling from large-scale data.

Experimental Setup.
Two representative gradient descent methods were selected for this study: ordinary gradient descent with all the data and the dynamic sample gradient algorithm [25]. Both are used to train logistic regression, and they are named LR-GD and LR-DSG. Seven benchmark datasets are selected for a fair comparison between our proposal and the others [33,34]; their information is shown in Table 1. All of the selected datasets have more than 50000 instances.
Owing to their simplicity and successful application, we select classification accuracy (Acc) and training time as the performance measures. LLR-AS and LR-DSG are both accelerated versions of LR-GD, so we compare their relative speeds (RS) and the differences between their solutions. RS is the ratio of the training time of each of the other algorithms to that of LLR-AS. To estimate these performance measures, we used 10-fold cross-validation. To compare our method with the others under a performance measure, we adopt the Wilcoxon signed-rank test (WSRT) [35]. WSRT is selected because it does not require a strict assumption about the data distribution and performs stably; it is empirically considered to be stronger than other tests [36]. The null hypothesis of WSRT is that there is no significant difference between our algorithm and each of the others under a performance measure, while the alternative is that there is a significant difference.
In the following experiments, we fix an initial sample $S_0$ of size 1% of the total training set and $\theta = 0.5$ for the DSG algorithm according to [25], and $\varepsilon = 0.5$, $\delta = 0.1$ for LLR-AS. Each of the three iterative algorithms stops when the Euclidean distance between the current and the previous solution is smaller than 0.001 or when the maximum of 5000 iterations is reached. A significance level of 0.05 is used. All the experiments are executed in Python 3.8 on the same computer with an Intel Xeon E5-2650 CPU and 32 GB of RAM.

Experimental Results and Analysis.
In this section, we adopt classification ability and training efficiency as the two measurements for evaluating the performance of these algorithms, and we give a detailed comparative analysis together with the reasons for the experimental results.

Classification Performance Analysis.
Under the theoretical assumptions of logistic regression, its classification performance depends on the solution obtained by gradient descent. The LR-GD algorithm is trained by gradient descent with all the training data, so its weight vector is the optimal solution and its classification performance is the reference. In the following, we compare the predictive ability of these algorithms in terms of the difference between the obtained solution vectors and the classification accuracy.
(1) Analysis of the difference between solution vectors. The Pearson correlation coefficient $\rho \in [-1, 1]$ is chosen to evaluate the difference between two solution vectors because of its effectiveness. The larger the value of $\rho$, the smaller the difference between the two vectors. The correlation coefficients between the solution vector obtained by LLR-AS and those obtained by LR-GD and LR-DSG are computed on each dataset, and the statistics on the different datasets are listed in Table 2. Table 2 shows that the correlation coefficient $\rho$ between the weight vectors of LLR-AS and LR-GD is close to 1 on all datasets except the Cifa dataset, and the same holds between LLR-AS and LR-DSG. The means of the correlation coefficients over all the datasets are 0.986 and 0.989, and their medians are 0.996 and 0.986. The descriptive statistics allow a more detailed comparison: (1) the difference between the solution vector of the LLR-AS algorithm and that of each of the other two algorithms is negligible, so the LLR-AS algorithm obtains nearly the same solution vector as the LR-GD algorithm; (2) the standard deviation of the correlation coefficients on each dataset is tiny, which validates that the proposed algorithm obtains a robust solution.
Owing to the properties of the vector estimate obtained at each iteration, the LLR-AS algorithm performs multiple iterations that continually reduce the objective function value. Meanwhile, the original optimization problem has a unique optimal solution because it is convex. The LLR-AS algorithm is therefore able to guarantee convergence and to obtain the same optimal solution as the LR-GD algorithm. Furthermore, Theorem 2 is also supported by this experimental result, and the gradient estimate is stable across different datasets.
(2) Analysis of the difference in classification. The classification accuracy Acc is adopted to compare the performance of the proposed algorithm with that of the two state-of-the-art approaches, and WSRT is performed to test whether there is a significant difference among them. Table 3 lists the descriptive statistics of the classification accuracy of each algorithm obtained by 10-fold cross-validation, and the results are also plotted in Figure 1.
The per-dataset results plotted in Figure 1 show that LLR-AS achieves better classification accuracy than the LR-GD and LR-DSG algorithms on the Ijcnn1 dataset, and that it never has the worst classification accuracy on the remaining datasets. To assess the overall classification performance, the mean and median of the results of each algorithm across all datasets are given in the last two rows of Table 3. The mean values of classification accuracy are 0.750, 0.731, and 0.747, and the median values are 0.722, 0.711, and 0.713. The difference in classification accuracy among these algorithms is therefore negligible. Finally, the p values obtained with WSRT between LLR-AS and each of the other algorithms are 0.1563 and 0.688, both larger than the given significance level of 0.05. It follows that (1) the classification accuracy of the LLR-AS algorithm does not differ significantly from that of the LR-GD and LR-DSG algorithms on the selected datasets, and (2) the LLR-AS algorithm has stable classification performance, because the standard deviation of its classification accuracy on every dataset is relatively small. The reason for the similar classification results is that the classification performance of logistic regression depends on the solution vector, and the solution vector of the LLR-AS algorithm does not differ significantly from those of the LR-GD and LR-DSG algorithms. Moreover, the solution vector obtained by the algorithm is robust according to Theorem 2, and the small standard deviation of the correlation coefficient on different datasets also supports this.
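The comparison of two weight vectors by Pearson correlation can be written directly from its definition. The sketch below uses only the standard library; the function name `pearson` and the example vectors are hypothetical and are not taken from the experiments.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


w_ref = [0.8, -1.2, 0.3, 2.0]
w_scaled = [2.0 * a + 1.0 for a in w_ref]   # positive affine transform of w_ref
w_flipped = [-a for a in w_ref]             # sign-reversed copy of w_ref

rho_scaled = pearson(w_ref, w_scaled)
rho_flipped = pearson(w_ref, w_flipped)
```

One design caveat follows from the example: the coefficient is invariant to positive scaling and shifting, so a value close to 1 shows that two weight vectors have the same pattern across components, not that they are numerically identical. A distance such as the Euclidean norm of the difference would be needed to confirm the latter.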

Training Efficiency Analysis.
Besides classification performance, training speed is another important measurement of training performance. The relative speed RS measures how much faster LLR-AS trains than each of the other algorithms. Table 4 lists RS on the different datasets.

Table 1: Benchmark datasets.
Dataset        Instances   Features   Classes
Cifa           60000       3072       2
Cod-rna        59535       8          2
Covtype        581012      54         2
Ijcnn1         141691      22         2
Mnist          350000      85         2
Skin-nonskin   245057      3          2
Susy           5000000     18         2

Table 4 shows that the value of RS between the LLR-AS algorithm and the LR-GD algorithm is larger than 20 on all the datasets, and that it is larger than 109 on the Cifa, Cod-rna, and Covtype datasets. The LLR-AS algorithm therefore largely reduces the training time of the LR-GD algorithm. Moreover, the value of RS between the LLR-AS algorithm and the LR-DSG algorithm is larger than one on all seven datasets, and its average over all the datasets is 1.843. The LLR-AS algorithm thus indeed needs less training time than the LR-DSG algorithm.
Three reasons explain why the proposed mechanism achieves good training efficiency. (1) The obtained gradient estimate is a descent direction of the current objective function at each iteration, so the total number of iterations, which is positively correlated with the training time, decreases. (2) The divide-and-conquer approach is adopted to compute each component of the gradient vector at each iteration, and it can be executed in a parallel environment. (3) It is proved that the algorithm is unlikely to sample too many examples to estimate the gradient, so the time spent estimating the gradient at each iteration is short. The proposed algorithm therefore performs better than the other representative algorithms owing to these three merits.

Summary
This paper has proposed a novel algorithm for fast training of logistic regression via adaptive sampling in order to handle massive datasets effectively. The proposed algorithm removes the need to fix the sample size before sampling, and it also offers the idea of dividing the multivariate estimation problem into several easy-to-solve subproblems. Experimental results on real datasets demonstrate that LLR-AS obtains similar classification performance with less execution time than other representative algorithms. The proposed algorithm solves the binary classification problem, but it extends to multiclass problems through the one-vs-all scheme, which [37] has shown to be as accurate as any other approach. This scheme trains several classifiers, each of which distinguishes the instances of one class from those of all the remaining classes. For an unlabeled instance, the final output is the class whose classifier gives the largest result.
Although the proposed algorithm performs well on large-scale data, it has two limitations when dealing with various kinds of real datasets. The gradient estimate needs all the features of the data at each iteration, so training efficiency may suffer on high-dimensional data [38][39][40]. Furthermore, the algorithm does not consider the label distribution of the data, so its performance on imbalanced data may decrease. In the future, we will study how to combine sampling and feature selection to scale up machine learning algorithms and design an effective mechanism to deal with the class imbalance problem.
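The one-vs-all scheme described above reduces to relabelling the data once per class and taking the class with the largest score at prediction time. A minimal sketch, assuming each binary classifier is represented by a weight vector whose last entry is the bias; the function names and the hand-set weights are hypothetical and stand in for classifiers trained by any binary method.

```python
import math


def sigmoid(u):
    """Numerically stable logistic function 1 / (1 + exp(-u))."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def one_vs_all_labels(labels, positive_class):
    """Binary targets for one classifier: 1 for its own class, 0 for all others."""
    return [1 if label == positive_class else 0 for label in labels]


def predict_one_vs_all(weights_by_class, z):
    """Return the class whose binary classifier gives the largest score on z."""
    scores = {c: sigmoid(sum(wj * zj for wj, zj in zip(w, z)))
              for c, w in weights_by_class.items()}
    return max(scores, key=scores.get)


# Hand-set weights (assumption), one vector per class; the last entry is the bias.
weights_by_class = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [-1.0, -1.0, 0.0],
}
targets_for_a = one_vs_all_labels(["a", "b", "a", "c"], "a")
pred_1 = predict_one_vs_all(weights_by_class, [0.2, 1.5, 1.0])
pred_2 = predict_one_vs_all(weights_by_class, [-2.0, -1.0, 1.0])
```

With K classes this requires K independent training runs, so any speed-up of the binary trainer carries over to each of them.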

Data Availability
This publication was supported by LIBSVM datasets, which are openly available at the location cited in [33].

Conflicts of Interest
The authors declare that they have no conflicts of interest.