Logistic regression is widely used in artificial intelligence and machine learning due to its deep theoretical basis and good practical performance. Its training process solves a large-scale optimization problem characterized by a likelihood function, for which gradient descent is the most commonly used approach. However, when the data size is large, gradient descent is very time-consuming because it computes the gradient from all the training data in every iteration. Although this difficulty can be alleviated by random sampling, the appropriate sample size is difficult to predetermine, and the resulting estimate may not be robust. To overcome this deficiency, we propose a novel algorithm for fast training of logistic regression via adaptive sampling. The proposed method decomposes the gradient estimation problem into several subproblems according to its dimension; each subproblem is then solved independently by adaptive sampling. Each component of the gradient estimate is obtained by successively drawing fixed-size batches of training examples until it satisfies its stopping criterion. The final estimate combines the results of all the subproblems. We prove that the obtained gradient estimate is robust and that it keeps the objective function value decreasing during the iterative computation. Compared with representative algorithms using random sampling, experimental results show that the proposed algorithm obtains comparable classification performance with much less training time.

Supervised learning trains a learner on a labelled training set so that it can correctly determine the outputs for unseen instances [

To speed up GD, many improved algorithms have been developed. According to the volume of data used to estimate the gradient, these algorithms can be divided into two groups: stochastic gradient descent and batch gradient descent [

Different from SGD, batch gradient descent (BGD) estimates the gradient from a randomly chosen subset of training examples. In this way, BGD largely reduces the error and instability of the estimate, and it also obtains an effective solution [
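To make the contrast concrete, the following sketch compares three gradient estimators for logistic regression: the full-data gradient, the single-example SGD estimate, and a fixed-size random batch as in BGD. The function names and the batch size are illustrative choices, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def grad_full(w, X, y):
    # Exact gradient of the average negative log-likelihood over all data.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def grad_sgd(w, X, y, rng):
    # SGD: an unbiased but high-variance estimate from one random example.
    i = rng.integers(len(y))
    return X[i] * (sigmoid(X[i] @ w) - y[i])

def grad_bgd(w, X, y, rng, batch=32):
    # BGD: averaging over a random batch shrinks the estimator's variance.
    idx = rng.choice(len(y), size=batch, replace=False)
    return X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / batch
```

Averaged over many draws, the batch estimate lies much closer to the full gradient than the single-example estimate, which is the variance-reduction effect the text describes.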

This paper presents an improved adaptive sampling (AS) algorithm for accelerating the logistic regression training process. The method first gives a rule for estimating the gradient from a subset of examples, such that the obtained estimate guarantees the objective function value keeps decreasing during the iterative GD computation. Then, the problem of finding a vector that meets the rule is decomposed into several subproblems, each of which determines one component of the vector satisfying its own stopping rule. Finally, examples are drawn successively from the training set into the sample, and sampling terminates as soon as each component of the estimated gradient over the obtained sample satisfies its own rule. To further speed up this process, components that already satisfy their stopping rules are not re-estimated in subsequent iterations; they become the corresponding components of the final gradient estimate. The main contributions of this paper are as follows:

Giving a rule to judge whether a vector is a descent direction of the current objective function, which is critical for the execution efficiency of the gradient descent method.

Providing an adaptive sampling method that overcomes the difficulty of predetermining the sample size before sampling; the method adaptively determines the sample size according to the characteristics of the dataset and avoids the influence of subjective human factors.

Applying a divide-and-conquer strategy to efficiently obtain the gradient estimate on the sampled examples; the gradient vector estimation problem is divided into several one-dimensional estimation subproblems, each of which can be solved independently.

Proving, using probably approximately correct (PAC) theory, that the obtained gradient estimate is robust and is a descent direction of the current objective function at each iteration.

Designing an efficient mechanism to solve the multivariate estimation problem for large-scale data.
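The descent-direction rule in the first contribution reduces, at its core, to an inner-product condition. The sketch below states only the deterministic criterion that the paper's sampling rule certifies probabilistically; the function name is our own.

```python
import numpy as np

def is_descent_estimate(grad_est, true_grad):
    # For a minimization problem, stepping along -grad_est decreases the
    # objective to first order exactly when grad_est . grad f(w) > 0.
    # The paper's rule guarantees this condition with high probability
    # using only sampled examples; here we state the exact test.
    return float(np.dot(grad_est, true_grad)) > 0.0
```

For example, with f(w) = ||w||^2 / 2 the true gradient is w itself, so a mildly perturbed estimate of w passes the test while its negation fails.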

The rest of the paper is organized as follows: Section

Much related work on improving GD has been developed. According to the amount of data used to estimate the gradient, existing GD algorithms can be divided into two groups: SGD algorithms and BGD algorithms.

The original SGD (OSGD) algorithm computes the gradient with only one example from the training set. The OSGD algorithm does not consider the effect of different dimensions on its convergence, so its rate of convergence can be slow when the surface of the objective function curves steeply along some dimensions. Qian [

Different from the SGD algorithm, the BGD algorithm computes the gradient with some randomly sampled examples from the training set at each iteration. Thus, the BGD algorithm reduces the variance of the gradient estimate and achieves more stable convergence. To ensure convergence to the optimal solution, the gradient estimate used by the BGD algorithm must enforce descent in the objective function at every iteration. Therefore, the sample size must be carefully determined. Byrd et al. [

The logistic regression classifier is generated using the posterior probabilities of two labels ({0, 1}) modeled by a linear function in

Let

The GD algorithm is an iterative optimization algorithm that updates the current solution at each step using the previous solution and the gradient of the current objective function. Owing to its simple implementation and effectiveness, GD is widely adopted to solve the unconstrained optimization problem (
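A minimal full-data GD trainer for logistic regression, of the kind used by the LR-GD baseline, can be sketched as follows. The learning rate and iteration count are illustrative choices, not the paper's settings.

```python
import numpy as np

def sigmoid(z):
    # Clipping guards against overflow in exp for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_lr_gd(X, y, lr=0.5, iters=500):
    # Full-data gradient descent on the average negative log-likelihood:
    #   w <- w - lr * (1/n) * X^T (sigmoid(Xw) - y)
    # Every iteration touches all n examples, which is the cost the
    # adaptive-sampling method aims to avoid.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w
```

On a small separable toy problem this recovers a weight vector that classifies the training data accurately, at the price of one full pass over the data per iteration.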

In fact, the LLR-AS algorithm is also a type of minibatch gradient descent algorithm. It randomly samples a subset from

Initialize:

Obtain the vector

In order to get the vector

From a statistical point of view, the estimation of gradient

Each component

Initialize:

Compute

Sample a random subset

Compute

Let
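The per-component sampling loop outlined in the steps above can be sketched as follows, assuming the per-example gradients are available as a matrix. The standard-error threshold used here is an illustrative stand-in for the paper's PAC-style stopping rule; `chunk`, `tol`, and `max_draws` are assumed parameters.

```python
import numpy as np

def adaptive_gradient(per_example_grads, rng, chunk=64, tol=0.05, max_draws=50):
    # Estimate each gradient component independently: repeatedly draw
    # fixed-size chunks of examples and freeze a component as soon as
    # its running standard error drops below tol.  Frozen components
    # are excluded from later draws, as in the paper's scheme.
    n, d = per_example_grads.shape
    sums, sqsums, counts = np.zeros(d), np.zeros(d), np.zeros(d)
    active = np.ones(d, dtype=bool)
    for _ in range(max_draws):
        idx = rng.integers(n, size=chunk)            # draw with replacement
        g = per_example_grads[idx]
        sums[active] += g[:, active].sum(axis=0)
        sqsums[active] += (g[:, active] ** 2).sum(axis=0)
        counts[active] += chunk
        mean = sums / counts
        stderr = np.sqrt(np.maximum(sqsums / counts - mean ** 2, 0.0) / counts)
        active &= stderr > tol                       # freeze converged components
        if not active.any():
            break
    return sums / counts
```

Low-variance components stop after the first few chunks, while noisy components keep sampling, so the total number of examples adapts to the data rather than being fixed in advance.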

In this subsection, we study the effectiveness of the AS algorithm. To derive our main result, we need the following lemmas and theorems.

Let

The lemma is proved by mathematical induction.

Consider

Assume that the inequality holds for

Consider

The lemma follows immediately from mathematical induction.

Let

According to

Therefore, for any

For the convenience of the following developments, we make the following remarks. Let

If

We get from the AS algorithm the estimate

Thus, it always holds that

In the following, we analyze this difference in two cases. If

When

It follows from the triangle inequality that

For any

For any

Combining Theorem

In Theorem

Next, we will discuss the above formula. The effect of the parameter

Two representative gradient descent methods were selected in this study: a standard gradient descent using all the data and a dynamic sample gradient algorithm [

Summary of datasets.

Dataset | Size | Features | Class |
---|---|---|---|
Cifa | 60000 | 3072 | 2 |
Cod-rna | 59535 | 8 | 2 |
Covtype | 581012 | 54 | 2 |
Ijcnn1 | 141691 | 22 | 2 |
Mnist | 350000 | 85 | 2 |
Skin-nonskin | 245057 | 3 | 2 |
Susy | 5000000 | 18 | 2 |

Owing to its simplicity and successful application, we select the classification accuracy (

In the following experiments, we fix an initial sample

In this section, we adopt classification ability and training efficiency as two important measures to evaluate the performance of these algorithms, give a detailed comparative analysis, and explain the experimental results.

Under the theoretical hypothesis of logistic regression, classification performance depends on the solution obtained by gradient descent. The LR-GD algorithm is trained by gradient descent with all the training data; thus its weight vector is the optimal solution, and its classification performance is optimal as well. In the following, we compare the predictive ability of these algorithms in terms of the difference in the obtained solution vectors and the classification accuracy.

The correlation coefficient

Dataset | Std | Mean | Std | Mean |
---|---|---|---|---|
Cifa | 0.006 | 0.935 | 0.003 | 0.986 |
Cod-rna | 0.003 | 0.998 | 0.003 | 0.991 |
Covtype | 0.007 | 0.979 | 0.048 | 0.984 |
Ijcnn1 | 0.001 | 0.996 | 0.004 | 1.000 |
Mnist | 0.000 | 0.995 | 0.003 | 0.985 |
Skin-nonskin | 0.000 | 1.000 | 0.004 | 0.979 |
Susy | 0.001 | 0.998 | 0.009 | 0.998 |
Average | | 0.986 | | 0.989 |
Median | | 0.996 | | 0.986 |

Table

Owing to the properties of the vector estimate obtained at each iteration, the LLR-AS algorithm performs multiple iterations to continually decrease the objective function value. Meanwhile, the original optimization problem has a unique optimal solution because it is convex. So, the LLR-AS algorithm is guaranteed to converge and obtains the same optimal solution as the LR-GD algorithm. Furthermore, Theorem

Dataset | LR-GD (Std) | LR-GD (Mean) | LR-DSG (Std) | LR-DSG (Mean) | LLR-AS (Std) | LLR-AS (Mean) |
---|---|---|---|---|---|---|
Cifa | 0.003 | 0.904 | 0.003 | 0.901 | 0.003 | 0.904 |
Cod-rna | 0.142 | 0.782 | 0.005 | 0.888 | 0.060 | 0.851 |
Covtype | 0.103 | 0.665 | 0.117 | 0.682 | 0.145 | 0.655 |
Ijcnn1 | 0.004 | 0.794 | 0.004 | 0.794 | 0.011 | 0.841 |
Mnist | 0.002 | 0.711 | 0.003 | 0.713 | 0.005 | 0.722 |
Skin-nonskin | 0.002 | 0.583 | 0.002 | 0.573 | 0.005 | 0.589 |
Susy | 0.002 | 0.680 | 0.002 | 0.681 | 0.005 | 0.689 |
Average | | 0.731 | | 0.747 | | 0.750 |
Median | | 0.711 | | 0.711 | | 0.722 |

The result on each dataset plotted in Figure

The reason for the similar classification results is that the classification performance of logistic regression depends on the solution vector, and the LLR-AS algorithm obtains a solution vector not significantly different from those of the LR-GD and LR-DSG algorithms. Moreover, the obtained solution vector is robust according to Theorem

Besides classification performance, training speed is another important measure of algorithm performance. The relative speed

The relative speed on seven datasets.

Dataset | LLR-AS vs LR-GD (Std) | LLR-AS vs LR-GD (Mean) | LLR-AS vs LR-DSG (Std) | LLR-AS vs LR-DSG (Mean) |
---|---|---|---|---|
Cifa | 3.049 | 109.339 | 0.003 | 1.461 |
Cod-rna | 3.456 | 111.260 | 0.203 | 3.067 |
Covtype | 9.659 | 270.898 | 0.348 | 3.945 |
Ijcnn1 | 4.641 | 64.132 | 0.004 | 1.070 |
Mnist | 5.014 | 83.377 | 0.003 | 1.082 |
Skin-nonskin | 1.036 | 20.282 | 0.004 | 1.243 |
Susy | 4.887 | 91.942 | 0.009 | 1.031 |
Average | | 107.319 | | 1.843 |

We find that the value of

There are three reasons why the proposed mechanism achieves good training efficiency. (1) The obtained gradient estimate is a descent direction of the current objective function at each iteration; thus, the total number of iterations, which directly determines the training time, is reduced. (2) A divide-and-conquer approach computes each component of the gradient vector at each iteration, and it can be executed in a parallel environment. (3) It is proved that the probability of sampling too many examples when estimating the gradient is low, so the gradient estimation time at each iteration remains short. Therefore, the proposed algorithm outperforms the other representative algorithms owing to these three merits.
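Reason (2) above, the parallelizable divide-and-conquer step, can be sketched as follows. The thread pool illustrates the structure only (in CPython a process pool would be needed for a CPU-bound speedup), and the standard-error threshold is again an illustrative stand-in for the paper's stopping rule.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def estimate_component(col, seed, chunk=64, tol=0.05, max_draws=50):
    # One subproblem: adaptively sample a single gradient component
    # until its running standard error falls below tol.
    rng = np.random.default_rng(seed)
    samples = rng.choice(col, size=chunk)
    for _ in range(max_draws - 1):
        if samples.std(ddof=1) / np.sqrt(len(samples)) < tol:
            break
        samples = np.concatenate([samples, rng.choice(col, size=chunk)])
    return samples.mean()

def estimate_gradient_parallel(per_example_grads, seed=0):
    # Divide and conquer: the d one-dimensional subproblems are
    # independent, so they can be dispatched concurrently.
    d = per_example_grads.shape[1]
    with ThreadPoolExecutor() as pool:
        return np.array(list(pool.map(
            lambda j: estimate_component(per_example_grads[:, j], seed + j),
            range(d))))
```

Because each subproblem stops on its own criterion, easy (low-variance) components finish early and the workers stay busy only with the hard ones.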

In this paper, a novel algorithm for fast training of logistic regression via adaptive sampling has been proposed to effectively handle massive datasets. The proposed algorithm resolves the difficulty that the sample size must otherwise be fixed before sampling, and it also offers a way of dividing the multivariate estimation problem into several easy-to-solve subproblems. Experimental results on real datasets demonstrate that LLR-AS obtains similar classification performance with much less execution time in comparison with other representative algorithms. Moreover, the proposed algorithm can deal with the multiclassification problem using the one-vs-all scheme, and paper [
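The one-vs-all scheme mentioned above can be sketched as follows. The binary trainer shown is a plain gradient-descent stand-in; in practice the LLR-AS trainer would be plugged in through the same `train_binary` parameter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_binary_gd(X, y, lr=0.5, iters=300):
    # Stand-in binary logistic trainer (full-data GD); any binary
    # trainer returning a weight vector can replace it.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def train_one_vs_all(X, y, train_binary=train_binary_gd):
    # Fit one binary model per class: "this class" vs "all others".
    classes = np.unique(y)
    W = np.array([train_binary(X, (y == c).astype(float)) for c in classes])
    return classes, W

def predict_one_vs_all(classes, W, X):
    # Predict the class whose binary model assigns the highest score.
    return classes[np.argmax(X @ W.T, axis=1)]
```

This reduces a k-class problem to k independent binary problems, each trainable by the proposed binary algorithm.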

Though the proposed algorithm performs well on large-scale data, two limitations remain when dealing with various kinds of real datasets. The gradient estimation requires all the features of the data at each iteration, which may pose a great challenge to training efficiency on high-dimensional data [

This publication was supported by the LIBSVM datasets, which are openly available at the location cited in [

The authors declare that they have no conflicts of interest.

This work was supported by Shandong Provincial Natural Science Foundation, China (no. ZR2020MF146), Major Scientific and Technological Innovation Project of Shandong Province (no. 2019JZZY010716), and Key R&D Plan of Shandong Province (no. 2019GGX101061).