This paper presents a technique of evidence maximization for automatic tuning of the regularization parameters of elastic nets, which allows tuning many parameters simultaneously. The technique was applied to handwritten digit recognition. Experiments showed its ability to train either highly accurate models or highly sparse models with reasonable accuracy.
1. Introduction
One of the important aspects of machine learning is choosing an appropriate subset of the (possibly huge) set of all virtually available features, so that the trained model depends only on this subset. A good choice (feature selection, [1]) can both speed up the training and improve the quality of its result. It depends not only on the particular problem, but also on the data available for training.
Feature selection can either precede the learning itself (e.g., entropy-based or correlation analysis) or be a built-in part of the learning process (e.g., learning with l1-regularization, such as LASSO regression and l1-SVM) [2]. This paper deals with the latter case only.
It is known that learning with $l_1$-regularization can produce rather sparse models which depend on few features, while learning with $l_2$-regularization usually produces more accurate models. In [3] a mixed regularization called the "elastic net" was proposed. Let $F(x,w)$ be a model parameterized by $w$, predicting the response $y$ from a feature vector $x$, and let $J(F(x,w),y)$ be the cost of the prediction $F(x,w)$ provided the true response is $y$. Then training such a model with elastic net regularization on a set of samples $\{(x_i,y_i),\ i=1,\dots,N\}$ by loss minimization (a.k.a. ERM, empirical risk minimization) or, briefly, "training of an elastic net" is the minimization problem

$$\sum_{i=1}^{N} J(F(x_i,w),y_i) + \lambda|w| + \frac{\mu}{2}\|w\|^2 \longrightarrow \min_{w}, \tag{1}$$

where $\|\cdot\|$ and $|\cdot|$ stand for the $l_2$- and $l_1$-norms, respectively, and $\lambda$ and $\mu$ are nonnegative regularization parameters. It is shown experimentally in [3] that by varying the parameters $\lambda$ and $\mu$ one can balance between the sparsity of the model and the accuracy of its prediction.
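As a minimal illustration of problem (1), the objective can be evaluated as follows (a numpy sketch; the function name is ours, and the per-sample loss values $J(F(x_i,w),y_i)$ are assumed to be precomputed):

```python
import numpy as np

def elastic_net_objective(losses, w, lam, mu):
    """Value of objective (1): sum of per-sample losses J(F(x_i, w), y_i)
    plus the l1 penalty lam*|w| and the l2 penalty (mu/2)*||w||^2."""
    return losses.sum() + lam * np.abs(w).sum() + 0.5 * mu * np.square(w).sum()
```

Setting `lam = mu = 0` recovers plain empirical risk minimization.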
In this paper elastic nets are used to regularize multiclass logistic regression. A method of tuning more general regularization parameters than λ and μ above is described. This method is tested on a handwritten digit recognition problem.
The rest of this paper is organized as follows. Section 2 presents the mathematical model and the elastic net in details. Section 3 describes the learning algorithm and the evidence maximization technique for tuning regularization parameters of the elastic net; this technique is the main subject of this paper. Section 4 describes experiments with elastic nets for digit recognition. Section 5 exposes the results of experiments. Section 6 summarizes the main results of experiments and discusses further possible applications of the proposed technique.
2. Mathematical Model
Consider multinomial classification in both its deterministic and probabilistic variants: given a feature vector $x\in\mathbb{R}^d$, either predict the correct label $y$, one of $q$ classes to which the vector $x$ belongs, or estimate the conditional probability $p(y\mid x)$ of each class label. Probabilistic classification is considered primary, and in deterministic classification a most probable class label $y\in\arg\max_y p(y\mid x)$ is predicted.
Let $\vec x=(1,x)\in\mathbb{R}^{d+1}$ stand for the augmented feature vector. To estimate $p(y\mid x)$, the multinomial linear logistic regression model

$$p(y\mid x,\vec w)=\frac{e^{\vec w_y \vec x}}{\sum_{l=1}^{q} e^{\vec w_l \vec x}} \tag{2}$$

will be trained. The model parameter matrix $\vec w$ consists of $q$ rows $\vec w_l=(w_{0l},w_{1l},\dots,w_{dl})$, each $(d+1)$-dimensional. To train the model means to choose some "good" parameter $\vec w$.
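Model (2) is the usual softmax over class scores; a sketch of how it produces class probabilities (the function name and the max-subtraction trick for numerical stability are our additions, not part of the paper):

```python
import numpy as np

def predict_proba(W, x):
    """Class probabilities of multinomial logistic regression (2).

    W : (q, d+1) parameter matrix, column 0 holding the biases w_0l.
    x : (d,) feature vector, augmented with a leading 1.
    """
    x_aug = np.concatenate(([1.0], x))   # augmented vector (1, x)
    scores = W @ x_aug                   # score of each class l
    scores -= scores.max()               # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()
```

With $\vec w = 0$ the model predicts the uniform distribution $1/q$ over classes.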
To do this we use a training dataset $T=\{(x_1,y_1),\dots,(x_N,y_N)\}$ of $N$ pairs $(x_i,y_i)$ which are supposed to be i.i.d. random. $T$ can also be written in a transposed way as $T=\{X,Y\}$, where $X=\{x_1,\dots,x_N\}$ and $Y=\{y_1,\dots,y_N\}$. Training tries to maximize the posterior of $\vec w$ given some prior $p_0(\vec w)$ and the training set $T$. Since

$$p(\vec w\mid T)=\frac{p_0(\vec w)\,p(T\mid \vec w)}{p(T)}=\frac{p_0(\vec w)\,p(Y\mid X,\vec w)}{p(Y\mid X)} \tag{3}$$

and the denominator does not depend on $\vec w$, maximization of the posterior probability is equivalent to maximization of the numerator or of its logarithm:

$$\ln\bigl(p_0(\vec w)\,p(Y\mid X,\vec w)\bigr)=\ln p_0(\vec w)+\sum_{i=1}^{N}\ln p(y_i\mid x_i,\vec w)\longrightarrow\max_{\vec w}. \tag{4}$$

The second summand in (4) is the logarithm of the model likelihood $L(\vec w;T)$, while the first one depends on the choice of the prior.
Let the $(q\times d)$-matrix $w$ stand for $\vec w$ without the bias column $w_0$. The prior is usually taken independent of the bias, so $p_0(\vec w)=p_0(w)$. In the simplest cases, when spherical Gaussian or Laplacian distributions are taken as priors, training (4) turns into an optimization problem with $l_2$- or $l_1$-regularization, respectively.
Similarly, elastic nets are obtained from the prior

$$p_0(\vec w)=\frac{1}{Z(\lambda,\mu)}\,e^{-\lambda|w|-\mu\|w\|^2/2}, \tag{5}$$

where

$$Z(\lambda,\mu)=\int e^{-\lambda|w|-\mu\|w\|^2/2}\,dw=\Bigl(\int_{-\infty}^{\infty}e^{-\lambda|t|-\mu t^2/2}\,dt\Bigr)^{qd}=\Bigl(\frac{2e^{\lambda^2/(2\mu)}}{\sqrt\mu}\int_{\lambda/\sqrt\mu}^{\infty}e^{-\tau^2/2}\,d\tau\Bigr)^{qd}=\Bigl(\frac{2e^{\lambda^2/(2\mu)}}{\sqrt\mu}\sqrt{2\pi}\,\Phi\Bigl(-\frac{\lambda}{\sqrt\mu}\Bigr)\Bigr)^{qd} \tag{6}$$

(remember that the space of $w$ is $qd$-dimensional) and $\Phi(\cdot)$ denotes the cumulative distribution function of the standard one-dimensional Gaussian distribution:

$$\Phi(t)=\int_{-\infty}^{t}\frac{1}{\sqrt{2\pi}}e^{-\tau^2/2}\,d\tau. \tag{7}$$

To simplify calculations, instead of the function $\Phi(\cdot)$ we use

$$\Psi(t)=e^{t^2/2}\int_{-\infty}^{t}e^{-\tau^2/2}\,d\tau=\sqrt{2\pi}\,e^{t^2/2}\,\Phi(t). \tag{8}$$

For instance, the normalization factor $Z(\lambda,\mu)$ becomes

$$Z(\lambda,\mu)=\Bigl(\frac{2}{\sqrt\mu}\,\Psi\Bigl(-\frac{\lambda}{\sqrt\mu}\Bigr)\Bigr)^{qd}. \tag{9}$$
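The function $\Psi$ of (8) and the normalization factor (9) are straightforward to evaluate numerically. The following sketch (standard library only, with $\Phi$ expressed through `math.erf`; all function names are ours) lets formula (9) be checked against direct one-dimensional quadrature of (6):

```python
import math

def Psi(t):
    """Psi(t) = sqrt(2*pi) * exp(t^2/2) * Phi(t), formula (8),
    with the standard normal CDF Phi written via the error function."""
    Phi = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return math.sqrt(2.0 * math.pi) * math.exp(t * t / 2.0) * Phi

def one_dim_factor(lam, mu):
    """The base of the power in (9): (2/sqrt(mu)) * Psi(-lam/sqrt(mu)),
    i.e., the one-dimensional integral of exp(-lam*|t| - mu*t^2/2)."""
    return 2.0 / math.sqrt(mu) * Psi(-lam / math.sqrt(mu))

def log_Z(lam, mu, q, d):
    """Logarithm of the qd-dimensional normalization factor Z (9)."""
    return q * d * math.log(one_dim_factor(lam, mu))
```

For $\lambda=0$ the factor reduces to the Gaussian integral $\sqrt{2\pi/\mu}$, since $\Psi(0)=\sqrt{2\pi}/2$.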
Plugging (5) into (4) turns training of the elastic net into the optimization problem

$$-\sum_{i=1}^{N}\ln p(y_i\mid x_i,\vec w)+\lambda|w|+\frac{\mu}{2}\|w\|^2\longrightarrow\min_{\vec w}. \tag{10}$$
Both prior (5) and the regularization summands in (10) are isotropic with respect to all $d$ features. However, the features themselves might be unequal by their nature. To respect such an inequality we partition all features into $K$ groups of features of the same nature. For example, all pixel values of an image have the same nature and will belong to the same group of features, while computed features, such as the aspect ratio, fall into other groups.
Let us fix a partition of the set of indices

$$\{1,\dots,d\}=\bigsqcup_{k=1}^{K}D_k \tag{11}$$

into subsets $D_k$ of cardinalities $d_k=\#D_k$ and define separate regularization parameters $\lambda_k$ and $\mu_k$ for each group. Then training of the generic elastic net (10) turns into

$$-\sum_{i=1}^{N}\ln p(y_i\mid x_i,\vec w)+\sum_{k=1}^{K}\Bigl(\lambda_k\sum_{j\in D_k}|w_j|+\frac{\mu_k}{2}\sum_{j\in D_k}\|w_j\|^2\Bigr)\longrightarrow\min_{\vec w}, \tag{12}$$

where $w_j$ denotes the $j$-th column of $w$, and training of the elastic net for linear logistic regression (2) turns into

$$-\sum_{i=1}^{N}\Bigl(\vec w_{y_i}\vec x_i-\ln\sum_{l=1}^{q}e^{\vec w_l\vec x_i}\Bigr)+\sum_{k=1}^{K}\Bigl(\lambda_k\sum_{j\in D_k}|w_j|+\frac{\mu_k}{2}\sum_{j\in D_k}\|w_j\|^2\Bigr)\longrightarrow\min_{\vec w}. \tag{13}$$
It is easy to see that optimization problem (13) is convex for any training set $T$ and nonnegative $\lambda_k$ and $\mu_k$. The choice of the values of the $2K$ regularization parameters $\lambda_k$ and $\mu_k$, which is the subject of this paper, will be discussed in Section 3.2.
3. Learning Algorithm
3.1. Optimization Algorithm
Standard gradient methods are not applicable to minimization problems (10) and (13) because they contain the nonsmooth terms $|w|$ and $|w_j|$. So the algorithm proposed by Nesterov in [4] for minimization of sums of smooth and simple nonsmooth convex functions is used. Among all known methods of nonsmooth optimization, Nesterov's algorithm provides the best convergence rate at a moderate number of steps (less than the number of variables, which equals $q(d+1)$ in (10) and (13)) [5].
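Composite methods of this kind need only the proximal operator of the "simple" nonsmooth part. For the elementwise elastic net term $\lambda|w|+(\mu/2)w^2$ this operator has a closed form; the sketch below is our illustration of that building block, not code from the paper:

```python
import numpy as np

def prox_elastic_net(v, step, lam, mu):
    """argmin_w  lam*|w| + (mu/2)*w^2 + (1/(2*step))*(w - v)^2, elementwise:
    soft-thresholding by step*lam, then shrinkage by 1/(1 + step*mu)."""
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0) / (1.0 + step * mu)
```

The soft-thresholding step is what zeroes out coordinates and produces the sparsity discussed in Section 4.3.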
Nesterov's algorithm can exploit strong convexity ($\mu$-convexity) of the target function and converges faster the larger the $\mu$ that can be guaranteed in advance. The target function in (13) is not strongly convex in the bias column $w_0$, but it would be strongly convex if $l_2$-regularization were applied to all parameters $\vec w$ including $w_0$.
Consider the following modification of problem (13).
Estimate the bias column $\hat w_0$:

$$\hat w_{0l}=\ln\frac{n_l}{N}\quad\text{for }l=1,\dots,q, \tag{14}$$

where $n_l$ is the number of training samples of class $l$. The estimate $\hat w_0$ is the solution of the minimization problem

$$-\sum_{i=1}^{N}\ln p(y_i\mid w_0)=-\sum_{i=1}^{N}\ln\frac{e^{w_{0y_i}}}{\sum_{l=1}^{q}e^{w_{0l}}}\longrightarrow\min_{w_0}, \tag{15}$$

which is nothing but maximum likelihood training of the featureless logistic regression model.
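The bias estimate (14) is one line of code per class; a sketch (the function name is ours, and every class is assumed to occur at least once in the labels):

```python
import math
from collections import Counter

def bias_estimate(labels, q):
    """w0_l = ln(n_l / N): the maximum-likelihood solution of (15)
    for the featureless logistic regression model."""
    counts = Counter(labels)
    N = len(labels)
    return [math.log(counts[l] / N) for l in range(q)]
```

Exponentiating the estimated biases recovers the class frequencies, as expected for the featureless model.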
Choose some $\mu_0>0$ and instead of (13) solve

$$-\sum_{i=1}^{N}\Bigl(\vec w_{y_i}\vec x_i-\ln\sum_{l=1}^{q}e^{\vec w_l\vec x_i}\Bigr)+\frac{\mu_0}{2}\|w_0-\hat w_0\|^2+\sum_{k=1}^{K}\Bigl(\lambda_k\sum_{j\in D_k}|w_j|+\frac{\mu_k}{2}\sum_{j\in D_k}\|w_j\|^2\Bigr)\longrightarrow\min_{\vec w}. \tag{16}$$
The target function in (16) is strongly convex with parameter $\mu=\min_{k=0,1,\dots,K}\mu_k$.
3.2. Evidence Maximization
To train elastic nets (10), (13), or (16) successfully, some reasonable values of the regularization parameters $\lambda$ and $\mu$ (hyperparameters) are required. In machine learning problems with one or at most two hyperparameters (e.g., in SVM [1]), their values can be found by grid search. However, there are $2K+1$ hyperparameters in generalized elastic net (16), and we are interested in the case $K>1$. In this case, a reasonable way to optimize them is evidence maximization. The use of evidence maximization for estimation of hyperparameters of ridge regression and other Gaussian-based models is well known [6]. For non-Gaussian elastic nets the evidence of hyperparameters can be neither computed nor maximized exactly and will be approximated rather roughly.
Let the prior $p_0(\vec w)$ depend on two hyperparameters $\lambda$ and $\mu$ as in (5). Then posterior (3) with $\lambda$ and $\mu$ indicated explicitly is

$$p(\vec w\mid T,\lambda,\mu)=\frac{p(T\mid\vec w)\,p_0(w\mid\lambda,\mu)}{p(T\mid\lambda,\mu)}=\frac{p(Y\mid X,\vec w)\,p_0(w\mid\lambda,\mu)}{p(Y\mid X,\lambda,\mu)}=\frac{L(\vec w;T)\,p_0(w\mid\lambda,\mu)}{\int L(\vec w;T)\,p_0(w\mid\lambda,\mu)\,d\vec w}=\frac{L(\vec w;T)\,p_0(w\mid\lambda,\mu)}{E(\lambda,\mu;T)}. \tag{17}$$

The denominator is ignored in maximization of posterior (3) because it does not depend on $\vec w$. However, it does depend on $\lambda$ and $\mu$. This denominator $E(\lambda,\mu;T)$ is called the evidence of the parameters $\lambda$ and $\mu$ with respect to the training set $T$. Despite its special name, it is a usual likelihood: not the likelihood of a single model like $L(\vec w;T)$ in (4), but the likelihood of the whole probability space of models defined by the hyperparameters $\lambda$ and $\mu$.
For prior (5) the evidence of the pair $(\lambda,\mu)$ is

$$E(\lambda,\mu;T)=\int L(\vec w;T)\,p_0(w\mid\lambda,\mu)\,d\vec w=\frac{1}{Z(\lambda,\mu)}\int e^{\ln L(\vec w;T)-\lambda|w|-\mu\|w\|^2/2}\,d\vec w \tag{18}$$

and evidence maximization is equivalent to the minimization

$$-\ln E(\lambda,\mu;T)=qd\,\ln\Bigl(\frac{2}{\sqrt\mu}\,\Psi\Bigl(-\frac{\lambda}{\sqrt\mu}\Bigr)\Bigr)-\ln\int e^{\ln L(\vec w;T)-\lambda|w|-\mu\|w\|^2/2}\,d\vec w\longrightarrow\min_{\lambda,\mu}. \tag{19}$$
The normalization factor Z(λ,μ) is rewritten using formula (9) here.
The gradient of (19) is

$$\nabla_\lambda\bigl(-\ln E(\lambda,\mu;T)\bigr)=-\frac{qd}{\lambda}\Bigl(\frac{\lambda/\sqrt\mu}{\Psi(-\lambda/\sqrt\mu)}-\frac{\lambda^2}{\mu}\Bigr)+E_{\lambda,\mu}\bigl[|w|\bigr],$$
$$\nabla_\mu\bigl(-\ln E(\lambda,\mu;T)\bigr)=-\frac{qd}{2\mu}\Bigl(1-\frac{\lambda/\sqrt\mu}{\Psi(-\lambda/\sqrt\mu)}+\frac{\lambda^2}{\mu}\Bigr)+\frac12 E_{\lambda,\mu}\bigl[\|w\|^2\bigr], \tag{20}$$

where $E_{\lambda,\mu}[f]$ stands for the expectation of $f(w)$ with respect to the posterior distribution of $w$, proportional to $L(\vec w;T)\,p_0(w\mid\lambda,\mu)$:

$$E_{\lambda,\mu}[f]=\frac{\int f(w)\,e^{\ln L(\vec w;T)-\lambda|w|-\mu\|w\|^2/2}\,d\vec w}{\int e^{\ln L(\vec w;T)-\lambda|w|-\mu\|w\|^2/2}\,d\vec w}. \tag{21}$$
To minimize (19), instead of traditional gradient steps the transformation

$$\lambda\longleftarrow\frac{qd\,\bigl((\lambda/\sqrt\mu)/\Psi(-\lambda/\sqrt\mu)-\lambda^2/\mu\bigr)}{E_{\lambda,\mu}\bigl[|w|\bigr]},\qquad \mu\longleftarrow\frac{qd\,\bigl(1-(\lambda/\sqrt\mu)/\Psi(-\lambda/\sqrt\mu)+\lambda^2/\mu\bigr)}{E_{\lambda,\mu}\bigl[\|w\|^2\bigr]} \tag{22}$$

is used iteratively.
Formulas (20) imply that each point of maximum of the evidence is a fixed point of transformation (22). No convergence of transformation (22) is guaranteed, but in the experiments several iterations of this transformation allowed training a more accurate model.
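One step of transformation (22) can be sketched as follows (a minimal illustration: the posterior expectations are taken as inputs here, whereas in the paper they come from the Laplace approximation described below). Note that with $\lambda=0$ the update reduces to the classical Gaussian evidence rule $\mu\leftarrow qd/E[\|w\|^2]$:

```python
import math

def Psi(t):
    # Psi(t) = sqrt(2*pi) * exp(t^2/2) * Phi(t), formula (8)
    return (math.sqrt(2.0 * math.pi) * math.exp(t * t / 2.0)
            * 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))

def update_lambda_mu(lam, mu, qd, e_abs_w, e_sq_w):
    """One iteration of transformation (22); e_abs_w and e_sq_w stand for
    the posterior expectations E[|w|] and E[||w||^2]."""
    r = lam / math.sqrt(mu)
    bracket = r / Psi(-r)
    lam_new = qd * (bracket - lam * lam / mu) / e_abs_w
    mu_new = qd * (1.0 - bracket + lam * lam / mu) / e_sq_w
    return lam_new, mu_new
```

Both updated values stay positive for positive inputs, since $t\,\Psi(-t)<1$ for all $t\ge 0$.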
For the modified elastic net (16), transformation (22) turns into

$$\lambda_k\longleftarrow\frac{qd_k\,\bigl((\lambda_k/\sqrt{\mu_k})/\Psi(-\lambda_k/\sqrt{\mu_k})-\lambda_k^2/\mu_k\bigr)}{\sum_{j\in D_k}\sum_{l=1}^{q}E_{\lambda,\mu}\bigl[|w_{jl}|\bigr]},\qquad \mu_k\longleftarrow\frac{qd_k\,\bigl(1-(\lambda_k/\sqrt{\mu_k})/\Psi(-\lambda_k/\sqrt{\mu_k})+\lambda_k^2/\mu_k\bigr)}{\sum_{j\in D_k}\sum_{l=1}^{q}E_{\lambda,\mu}\bigl[w_{jl}^2\bigr]} \tag{23}$$

for $k=1,\dots,K$ and

$$\mu_0\longleftarrow\frac{q}{\sum_{l=1}^{q}E_{\lambda,\mu}\bigl[(w_{0l}-\hat w_{0l})^2\bigr]}. \tag{24}$$
The expectations $E_{\lambda,\mu}[|w_{jl}|]$, $E_{\lambda,\mu}[w_{jl}^2]$, and $E_{\lambda,\mu}[(w_{0l}-\hat w_{0l})^2]$ cannot be computed exactly because the posterior $p(\vec w\mid\lambda,\mu)$ is rather complicated and high-dimensional. They are estimated using a diagonal Laplace approximation [7] of the posterior $p(\vec w\mid\lambda,\mu)$ at the model $w^*=w^*(\lambda,\mu)$ trained by (16), instead of the posterior itself.
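Under a diagonal Laplace approximation each coordinate is Gaussian, $w_{jl}\sim N(m,\sigma^2)$, and the required expectations then have closed forms (standard folded-normal formulas; this sketch is our addition, not code from the paper):

```python
import math

def gaussian_moments(m, sigma):
    """E[|w|] and E[w^2] for w ~ N(m, sigma^2):
    E[|w|] = sigma*sqrt(2/pi)*exp(-m^2/(2 sigma^2)) + m*erf(m/(sigma*sqrt(2))),
    E[w^2] = m^2 + sigma^2."""
    e_abs = (sigma * math.sqrt(2.0 / math.pi) * math.exp(-m * m / (2.0 * sigma * sigma))
             + m * math.erf(m / (sigma * math.sqrt(2.0))))
    return e_abs, m * m + sigma * sigma
```

For $m=0$ this gives $E[|w|]=\sigma\sqrt{2/\pi}$, and for $|m|\gg\sigma$ it approaches $|m|$, as expected.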
3.3. Stopping Criterion
To stop either training (16) with fixed regularization parameters $(\lambda,\mu)$ or the iterations of transformations (23) and (24) of $(\lambda,\mu)$, the following validation technique is used. The available dataset $T$ is partitioned into a training set $T_{train}$ of $N_{train}$ samples and a validation set $T_{val}$ of $N_{val}$ samples. The first one is used to train elastic nets (16), while the second one is used to decide whether further training has become senseless and should be stopped. Namely, training of the elastic net is stopped if the likelihood $L(\vec w;T_{val})$ has not increased during the last several (about 30) optimization steps, and tuning of the regularization parameters $(\lambda,\mu)$ is stopped if the likelihood $L(w^*(\lambda,\mu);T_{val})$ of the trained model has not increased during the last several (about 5) iterations.
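The stopping rule can be sketched as follows (a simplification of the criterion just described; `patience` plays the role of the "about 30" and "about 5" delays):

```python
def should_stop(val_likelihoods, patience):
    """Return True when the validation likelihood has not increased
    during the last `patience` recorded values."""
    if len(val_likelihoods) <= patience:
        return False
    best_recent = max(val_likelihoods[-patience:])
    best_before = max(val_likelihoods[:-patience])
    return best_recent <= best_before
```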
This criterion is a kind of the well-known early stopping method [8]. On one hand, such early stopping speeds up the training significantly. On the other hand, it is a regularization technique [9] by itself and can hide the effect of tuning the regularization parameters via evidence maximization, which is the subject of the study here. To find a balance, the delays between the validation likelihood ceasing to increase and stopping were chosen empirically.
4. Experiments
The method described in Sections 2 and 3 was applied to recognition of handwritten digits from the MNIST database (see [10]). This database contains grayscale raster images of $28\times 28=784$ pixels each, belonging to one of $q=10$ classes. Traditionally it is partitioned into $N=60000$ samples for training and $M=10000$ for testing. 15% of the training samples were set aside for validation, so $N_{train}=51000$ and $N_{val}=9000$.
Both to make linear logistic regression more powerful and to test the proposed method of estimating numerous regularization parameters, more features were added to the model. Besides the 784 primary features (the pixel intensities), several groups of secondary features were generated. Then all the features, both secondary and primary, were normalized to zero mean and unit variance.
The following groups of secondary features were used in experiments.
Horizontal and vertical components of the gradient of the pixel intensity (784+784=1568 features).
Amplitudes and phases of the discrete Fourier transform [11] of the pixel intensity (784+784=1568 features).
Projection histograms [11], that is, the number of nonzero pixels and the positions of the first and the last one within each row and each column of the image (28+28+28×2+28×2=168 features).
The corner metric matrix of the image, which for each pixel contains the estimated "likelihood" of being a corner point. The corner metric matrix is calculated by the MATLAB function cornermetric [12] (784 features).
The local standard deviation matrix, which for each pixel of the image contains the standard deviation of the intensity over the 9-by-9 neighborhood of the pixel. The local standard deviation is calculated by the MATLAB function stdfilt [12] (784 features).
This amounts to d=5656 primary and secondary features in total.
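Two of the feature groups above can be sketched in a few lines each (plain-numpy illustrations; the border padding in `local_std` is an assumption of this sketch and may differ from MATLAB's stdfilt):

```python
import numpy as np

def projection_histograms(img):
    """Projection-histogram features: for each row and each column of a
    binary image, the number of nonzero pixels and the positions of the
    first and last nonzero pixel (-1 for empty lines). For a 28x28 image
    this yields 3*28 + 3*28 = 168 features, as in Section 4."""
    feats = []
    for lines in (img, img.T):          # rows first, then columns
        for line in lines:
            nz = np.flatnonzero(line)
            if nz.size:
                feats += [nz.size, nz[0], nz[-1]]
            else:
                feats += [0, -1, -1]
    return np.array(feats)

def local_std(img, size=9):
    """Local standard deviation over a size-by-size neighborhood of each
    pixel (sample std, normalized by n-1, like MATLAB's stdfilt; the
    symmetric border padding is our choice)."""
    r = size // 2
    padded = np.pad(img.astype(float), r, mode='symmetric')
    out = np.empty(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + size, j:j + size].std(ddof=1)
    return out
```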
Remember that the proposed learning technique consists of two levels: the inner level is training of elastic net (16) with fixed regularization parameters $(\lambda,\mu)$ using Nesterov's optimization algorithm, and the outer level, inspired by the maximum evidence principle, is the iterative transformations (23) and (24) of $\lambda$ and $\mu$. Several different partitions (11) of the features into groups were tried.
Each row in Tables 1, 2, and 3 represents a single experiment: elastic net (16) trained with certain $\lambda$ and $\mu$. Each experiment was repeated 20 times. The intervals of the measured values shown in the tables are intervals of two standard deviations around the mean.
Table 1: Elastic nets trained with several fixed regularization parameters λ and μ.

| λ | μ | Sparseness (%) | Mean log likelihood | Error (%) |
|---|---|---|---|---|
| 0 | 0 | 2.46 ± 0.00 | 0.0638 ± 0.0007 | 2.06 ± 0.05 |
| 3 | 0 | 10.32 ± 1.56 | 0.0583 ± 0.0011 | 1.85 ± 0.05 |
| 10 | 0 | 16.84 ± 2.49 | 0.0609 ± 0.0007 | 1.81 ± 0.06 |
| 30 | 0 | 45.87 ± 4.03 | 0.0823 ± 0.0005 | 2.18 ± 0.05 |
| 100 | 0 | 63.26 ± 2.90 | 0.1419 ± 0.0003 | 3.41 ± 0.05 |
| 300 | 0 | 75.77 ± 4.37 | 0.2503 ± 0.0004 | 5.19 ± 0.05 |
| 1 | 1 | 3.75 ± 0.16 | 0.0621 ± 0.0007 | 2.00 ± 0.04 |
| 1 | 10 | 3.81 ± 0.19 | 0.0621 ± 0.0007 | 2.00 ± 0.04 |
| 1 | 30 | 6.71 ± 2.54 | 0.0607 ± 0.0013 | 1.95 ± 0.07 |
| 10 | 1 | 16.73 ± 2.40 | 0.0609 ± 0.0007 | 1.81 ± 0.06 |
| 10 | 10 | 16.56 ± 2.46 | 0.0613 ± 0.0007 | 1.81 ± 0.06 |
| 10 | 30 | 16.25 ± 2.53 | 0.0621 ± 0.0006 | 1.82 ± 0.05 |
| 10 | 100 | 16.20 ± 3.09 | 0.0649 ± 0.0005 | 1.86 ± 0.05 |
| 30 | 100 | 38.42 ± 2.42 | 0.0862 ± 0.0005 | 2.20 ± 0.05 |
| 100 | 30 | 61.51 ± 2.45 | 0.1428 ± 0.0003 | 3.41 ± 0.05 |
| 100 | 100 | 59.18 ± 2.09 | 0.1445 ± 0.0003 | 3.41 ± 0.05 |
| 0 | 1 | 2.46 ± 0.00 | 0.0638 ± 0.0007 | 2.06 ± 0.05 |
| 0 | 10 | 2.46 ± 0.00 | 0.0638 ± 0.0007 | 2.06 ± 0.04 |
| 0 | 100 | 2.46 ± 0.00 | 0.0638 ± 0.0007 | 2.05 ± 0.04 |
| 0 | 300 | 2.46 ± 0.00 | 0.0659 ± 0.0006 | 2.05 ± 0.06 |
Table 2: Elastic nets trained with the evidence maximization technique.

| K | Sparseness (%) | Mean log likelihood | Error (%) |
|---|---|---|---|
| 1 | 12.53 ± 3.18 | 0.0580 ± 0.0010 | 1.83 ± 0.06 |
| 8 | 9.99 ± 1.40 | 0.0557 ± 0.0008 | 1.70 ± 0.05 |
| 13 | 9.54 ± 1.22 | 0.0560 ± 0.0010 | 1.69 ± 0.04 |
| 40 | 10.17 ± 1.40 | 0.0555 ± 0.0007 | 1.71 ± 0.05 |
| 136 | 8.74 ± 1.38 | 0.0581 ± 0.0006 | 1.81 ± 0.04 |
| 385 | 8.05 ± 1.28 | 0.0587 ± 0.0004 | 1.82 ± 0.04 |
| 1456 | 8.35 ± 1.53 | 0.0581 ± 0.0005 | 1.81 ± 0.04 |
| 5656 | 10.82 ± 2.68 | 0.0582 ± 0.0010 | 1.80 ± 0.06 |
Table 3: Sparse elastic nets trained with the evidence maximization technique.

| K | Sparseness (%) | Mean log likelihood | Error (%) |
|---|---|---|---|
| 1 | 56.72 ± 6.94 | 0.0916 ± 0.0027 | 2.32 ± 0.08 |
| 8 | 75.04 ± 0.86 | 0.0898 ± 0.0026 | 2.70 ± 0.08 |
| 13 | 75.56 ± 1.43 | 0.0960 ± 0.0029 | 2.97 ± 0.08 |
| 40 | 84.58 ± 1.21 | 0.1054 ± 0.0029 | 3.13 ± 0.14 |
| 136 | 85.41 ± 0.64 | 0.0816 ± 0.0011 | 2.32 ± 0.06 |
| 385 | 87.55 ± 1.89 | 0.0804 ± 0.0032 | 2.47 ± 0.09 |
| 1456 | 85.72 ± 0.57 | 0.0745 ± 0.0008 | 2.28 ± 0.05 |
| 5656 (=d) | 88.62 ± 0.50 | 0.0739 ± 0.0009 | 2.28 ± 0.04 |
Tables 1, 2, and 3 contain the following three columns of properties of trained models.
Sparseness. It is the share of features unused in the model, that is, $\#\{j\ge 1\mid \vec w_j=0\}/d$, where $\vec w_j$ is the $j$-th column of $\vec w$.
Mean Log Likelihood. It is the mean over the $M$-element test set of minus the logarithm of the predicted probability of the true class label of the sample: $(1/M)\sum_{i=1}^{M}\bigl(-\ln p(y_i\mid x_i,\vec w)\bigr)$.
Error. It is the misclassification rate measured on the same $M$-element test set, provided the most probable class is predicted: $(1/M)\,\#\{i\mid p(y_i\mid x_i,\vec w)<\max_{1\le l\le q}p(l\mid x_i,\vec w)\}$.
Sparseness of the trained model appears due to l1-regularization in the elastic net and increases with λ.
4.1. Constant Regularization Parameters
First, several control experiments with fixed scalar values of regularization parameters λ and μ were performed. Their results are shown in Table 1.
The minimal average test error, 1.81%, was achieved with the parameters λ=10 and μ=1.
4.2. Tuning Regularization Parameters by Evidence Maximization
Next, experiments with automatic tuning of the regularization parameters λ and μ were performed. Since all features had been normalized, learning was started from $\lambda_k^0=1$ and $\mu_k^0=1$ for all $k=1,\dots,K$. The results are shown in Table 2. Each row represents the elastic net obtained by the described two-level learning process for a certain partition (11) of the features.
Several different partition schemes were tested.
K=1, trivial partition: all features belong to the same group.
K=8, rough partition: primary features, horizontal and vertical components of the gradients, amplitudes and phases of the Fourier transform, and three other types of secondary features each form a separate group.
K=13,40,136,385,1456: the whole image (28×28 pixels) is split into k×k equal squares and, roughly speaking, the groups are formed by the features of a certain type calculated over a certain square. The exceptions are the projection histograms, calculated not for squares but for rows or columns of squares (k groups for each of the 6 histograms), and the amplitudes and phases of the Fourier transform, both partitioned into k×k equal squares in the frequency space. So the total number of groups in the partition equals $7k^2+6k$, which for k=1,2,4,7,14 gives K=13,40,136,385,1456.
K=5656=d, fine partition: each feature forms a separate group.
These experiments show that the evidence maximization technique allows one to obtain more accurate elastic nets than guessed scalar regularization parameters do. Indeed, compare the last column of Table 1 with the rows K=8, 13, and 40 of Table 2. These rows represent elastic nets trained with certain values of 17-, 27-, and 81-dimensional regularization parameters, which could hardly be guessed.
4.3. Sparse Elastic Net
Last, we performed a series of experiments trying to train very sparse but reasonably accurate models. Sparseness of a model trained with an elastic net depends mostly on its parameter(s) λ or $\lambda_k$. In the described technique these parameters are tuned in order to get elastic nets with higher evidence. However, experiments show that the iterations of transformations (23) and (24) with the stopping criterion of Section 3.3 tend to stop before they reach any (local!) maximum of the evidence, and where they stop depends on the initial parameters $\lambda_k^0$ and $\mu_k^0$.
The experiments of Section 4.2 (Table 2) started from $\lambda_k^0=1$ and $\mu_k^0=1$ for all $k$; then sparseness was low but the trained models made more accurate predictions. If $\lambda_k>\lambda_k^{max}=\max_{j\in D_k,\,l=1,\dots,q}\bigl|\partial\ln L(0;T_{train})/\partial w_{jl}\bigr|$, optimization problem (16) has the unique solution $w=0$: the sparsest model, but not an accurate one. Starting the iterations from $\lambda_k^0=\lambda_k^{max}$ allows one to get a sparse elastic net with reasonable accuracy.
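The threshold $\lambda_k^{max}$ can be computed directly from the gradient of the log likelihood at $w=0$. A sketch for the ungrouped case (the function name is ours; the bias is assumed fixed at the featureless estimate (14), so that every sample is predicted with $p(l\mid x_i)=n_l/N$):

```python
import numpy as np

def lambda_max(X, y, q):
    """max over (j, l) of |d ln L(0; T) / dw_jl|: above this value of the
    l1 weight the solution is w = 0. With w = 0 and the bias at its
    estimate (14), the model predicts p(l|x) = n_l/N for every sample."""
    N = X.shape[0]
    freq = np.bincount(y, minlength=q) / N      # class frequencies n_l / N
    onehot = np.eye(q)[y]                       # indicator [y_i = l]
    grad = X.T @ (onehot - freq)                # gradient of ln L at w = 0
    return np.abs(grad).max()
```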
Table 3 shows the results of training elastic nets with starting parameters $\lambda_k^0=\lambda_k^{max}$, $k=1,\dots,K$. These results are discussed in the following section.
5. Results and Discussion
5.1. Accuracy of the Trained Model
The best model trained with the evidence maximization technique shown in Table 2 has 1.69% average test error, which is significantly less than the 1.81% obtained by guessing scalar regularization parameters (Table 1). In our experiments each learning run with evidence maximization took only 5–10 reestimations of the regularization parameters, so the numbers of elastic nets trained to fill in Tables 1 and 2 are comparable (moreover, not all guesses are shown in Table 1).
The evidence maximization technique requires one to guess only an appropriate partitioning of the features instead of particularly good values of the regularization parameters. Still, this technique is not fully automated: neither of the two obvious extreme partitions (the roughest and the finest one) leads to the best model. The 1.83% in the first row of Table 2, compared to the 1.81% achieved in Table 1, shows that evidence maximization does not necessarily lead to the best accuracy. But it can be used when the regularization parameters are multidimensional and naive attempts to guess good values for them are unfeasible.
The obtained accuracy is much lower than the best state-of-the-art results obtained by convolutional neural networks, deep learning, and augmentation of the training dataset. But an elastic net with precisely tuned regularization parameters can achieve higher accuracy than other traditional models of the same complexity, e.g., 1- or 2-layer neural networks or SVM with a Gaussian kernel (see [10]).
5.2. Sparseness of the Trained Model
In some practical classification problems high sparseness of the model takes priority over high accuracy. The proposed method allows one to train models with various tradeoffs between sparsity and accuracy.
The last elastic net shown in Table 3 provides a test error of 2.28% with sparseness 88.62%, so only 644 of the 5656 features are used. Compared to the most accurate elastic net from Table 2, the error increased by 0.59 percentage points, while the number of used features decreased more than sevenfold, from 5116 to 644. This result was achieved by tuning individual regularization parameters for each feature, starting from the largest reasonable $\lambda_k^0$.
6. Conclusion
This paper describes a method of machine learning based on a technique for adjusting the regularization parameters of elastic nets inspired by the evidence maximization principle. The method is able to cope with multidimensional regularization parameters using only rough initial guesses of their values and simple ideas about the nature of the features used in the models to be learned.
This method was tested on the MNIST database of handwritten digits and allowed training a more accurate elastic net than could be trained with a traditional grid search over one or two scalar regularization parameters. It also allowed training very sparse models with reasonable accuracy.
Still, the primary goal of the proposed method lies beyond the scope of this paper: to develop a mechanism of feature selection based on training elastic nets with a controlled tradeoff between sparseness and accuracy. In the future, the proposed method will be applied to other machine learning problems, including problems with a very large number of features.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work was partially supported by the Russian Foundation for Basic Research Grants no. 15-29-06081 “ofi-m” and no. 16-07-00616 “A.”
References
[1] G. James, D. Witten, T. Hastie, and R. Tibshirani.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection."
[3] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net."
[4] Y. Nesterov.
[5] P. Richtarik and M. Schmidt, "Modern convex optimization methods for large-scale empirical risk minimization," Proceedings of the International Conference on Machine Learning, July 2015.
[6] C. M. Bishop.
[7] A. I. Prilepko and D. Ph. Kalinichenko.
[8] Y. Yao, L. Rosasco, and A. Caponnetto, "On early stopping in gradient descent learning."
[9] D. F. Morgado, A. Antunes, and A. M. Mota, "Regularization versus early stopping: a case study with a real system," Proceedings of the 2nd IFAC Conference on Control Systems Design, Bratislava, Slovakia, 2003.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition."
[11] Ø. D. Trier, A. K. Jain, and T. Taxt, "Feature extraction methods for character recognition—a survey."
[12] The MathWorks Inc., MATLAB Image Processing Toolbox documentation, http://www.mathworks.com/help/images/.