Evidence Maximization Technique for Training of Elastic Nets

This paper presents an evidence maximization technique for automatic tuning of the regularization parameters of elastic nets, which allows many parameters to be tuned simultaneously. The technique was applied to handwritten digit recognition. Experiments showed its ability to train either models with high recognition accuracy or highly sparse models with reasonable accuracy.


Introduction
One of the important aspects of machine learning is to choose an appropriate subset of the (possibly huge) set of all virtually available features, so that the trained model depends only on this subset of features. A good choice (feature selection, [1]) can both speed up the training and improve the quality of its result. It depends not only on the particular problem, but also on the data available for training.
Feature selection can either precede the learning itself (e.g., entropy-based or correlation analysis) or be a built-in part of the learning process (e.g., learning with ℓ1-regularization, such as LASSO regression and ℓ1-SVM) [2]. This paper deals with the latter case only.
It is known that learning with ℓ1-regularization can produce rather sparse models which depend on rather few features, while learning with ℓ2-regularization usually produces more accurate models. In [3] a mixed regularization called the "elastic net" was proposed. Let f(x, w) be a model parameterized by w, predicting response y from feature vector x, and let L(f(x, w), y) be the cost of prediction f(x, w) provided the true response is y. Then training such a model with elastic net regularization on the set of samples {(x_i, y_i), i = 1, ..., N} by loss minimization (a.k.a. ERM, empirical risk minimization) or, briefly, "training of an elastic net" is the minimization problem

Σ_{i=1}^{N} L(f(x_i, w), y_i) + α|w| + (β/2)‖w‖² → min_w,

where ‖·‖ and |·| stand for the ℓ2- and ℓ1-norms, respectively, and α and β are nonnegative regularization parameters. It is shown experimentally in [3] that by varying the parameters α and β one can balance between the sparsity of the model and the accuracy of its prediction.
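As a concrete instance of this objective, the following sketch evaluates the elastic-net-regularized empirical risk for a linear model; the squared loss and all names are illustrative choices, not taken from the paper.

```python
import numpy as np

def elastic_net_objective(w, X, y, alpha, beta):
    """Elastic-net-regularized empirical risk for a linear model f(x, w) = X @ w,
    with squared loss as an illustrative choice of the cost function L."""
    residual = X @ w - y
    data_term = 0.5 * np.mean(residual ** 2)
    l1_term = alpha * np.sum(np.abs(w))       # alpha * |w|   (l1-norm)
    l2_term = 0.5 * beta * np.sum(w ** 2)     # (beta/2) * ||w||^2
    return data_term + l1_term + l2_term
```

Setting alpha = beta = 0 recovers plain ERM; increasing alpha favors sparser w, increasing beta shrinks w smoothly.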
In this paper elastic nets are used to regularize multiclass logistic regression. A method of tuning more general regularization parameters than α and β above is described. This method is tested on a handwritten digit recognition problem.
The rest of this paper is organized as follows. Section 2 presents the mathematical model and the elastic net in detail. Section 3 describes the learning algorithm and the evidence maximization technique for tuning the regularization parameters of the elastic net; this technique is the main subject of this paper. Section 4 describes experiments with elastic nets for digit recognition. Section 5 exposes the results of the experiments. Section 6 summarizes the main results and discusses further possible applications of the proposed technique.

Mathematical Model
Consider multinomial classification in both its deterministic and probabilistic variants: given a feature vector x ∈ R^d, either predict the correct label c of one of C classes to which the vector x belongs, or estimate the conditional probability P(c | x) of each class label. Probabilistic classification is considered primary, and in deterministic classification a class label (usually the most probable one) c ∈ argmax_c P(c | x) will be predicted.
Let x⃗ = (1, x) ∈ R^(d+1) stand for the augmented feature vector. To estimate P(c | x) a multinomial linear logistic regression model will be trained. The model parameter matrix W⃗ consists of C (d+1)-dimensional rows w⃗_c = (w_c0, w_c1, ..., w_cd). To train the model means to choose some "good" parameter W⃗. To do this we use a training dataset T = {(x_1, c_1), ..., (x_N, c_N)} of N pairs (x_i, c_i) which are supposed to be i.i.d. random. T can also be written in a transposed way as T = {X, Y}, where X = {x_1, ..., x_N} and Y = {c_1, ..., c_N}. Training tries to maximize the posterior of W⃗ given some prior P_0(W⃗) and the training set T:

P(W⃗ | T) = P(T | W⃗) P_0(W⃗) / P(T).    (3)

Since the denominator does not depend on W⃗, maximization of the posterior probability is equivalent to maximization of the numerator or of its logarithm:

ln P_0(W⃗) + L(W⃗; T) → max_W⃗,    (4)

where L(W⃗; T) = ln P(T | W⃗). The second summand in (4) is the log likelihood of the model L(W⃗; T), while the first one depends on the choice of the prior. Let the (C × d)-matrix W stand for W⃗ without the bias column w_0. The prior is usually taken independent of the bias, so P_0(W⃗) = P_0(W). In the simplest cases, when spherical Gaussian or Laplacian distributions are taken as priors, training (4) turns into an optimization problem with ℓ2- or ℓ1-regularization, respectively.
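A minimal sketch of the multinomial linear logistic model and its log likelihood L(W⃗; T), assuming the rows of X already carry the leading 1 of the augmented feature vector (all names are illustrative):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(W, X, c):
    """Log likelihood sum_i ln P(c_i | x_i, W) of the multinomial linear
    logistic model; rows of X are augmented feature vectors (1, x)."""
    P = softmax_rows(X @ W.T)              # N x C matrix of class probabilities
    return float(np.sum(np.log(P[np.arange(len(c)), c])))
```

With W = 0 every class gets probability 1/C, so the log likelihood equals N ln(1/C), a convenient sanity check.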
Similarly, elastic nets are obtained from the prior

P_0(W) = Z(α, β)^(−Cd) exp(−α|W| − (β/2)‖W‖²),    (5)

where (remember that the space of W is Cd-dimensional)

Z(α, β) = ∫ exp(−α|w| − (β/2)w²) dw

and Φ(·) denotes the cumulative distribution function of the standard one-dimensional Gaussian distribution:

Φ(t) = (1/√(2π)) ∫_{−∞}^{t} exp(−τ²/2) dτ.

To simplify calculations, instead of the function Φ(·) we use

Ψ(t) = exp(t²/2) Φ(t).

For instance, the normalization factor Z(α, β) becomes

Z(α, β) = 2 √(2π/β) Ψ(−α/√β).    (9)

Plugging (5) into (4) turns training of the elastic net into the optimization problem

−L(W⃗; T) + α|W| + (β/2)‖W‖² → min_W⃗.    (10)

Both the prior (5) and the regularization summands in (10) are isotropic with respect to all d features. However, the features themselves might be unequal by their nature. To respect such an inequality we partition all features into G groups of features of the same nature. For example, all pixel values of the image have the same nature and will belong to the same group of features, while computed features, such as the aspect ratio, fall into other groups.
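The closed form (9) can be checked numerically. The sketch below assumes the reconstructed formulas Ψ(t) = exp(t²/2)Φ(t) and Z(α, β) = 2√(2π/β)Ψ(−α/√β) and compares Z with direct numerical integration of exp(−α|w| − (β/2)w²):

```python
import math
import numpy as np

def Phi(t):
    """Standard Gaussian CDF via the complementary error function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def Psi(t):
    """Psi(t) = exp(t^2/2) * Phi(t), used to simplify the normalization factor."""
    return math.exp(0.5 * t * t) * Phi(t)

def Z_closed_form(alpha, beta):
    # Z = integral of exp(-alpha*|w| - beta*w^2/2) = 2*sqrt(2*pi/beta)*Psi(-alpha/sqrt(beta))
    return 2.0 * math.sqrt(2.0 * math.pi / beta) * Psi(-alpha / math.sqrt(beta))

def Z_numeric(alpha, beta, lim=50.0, n=2_000_001):
    """Trapezoidal-rule approximation of the same integral."""
    w = np.linspace(-lim, lim, n)
    f = np.exp(-alpha * np.abs(w) - 0.5 * beta * w * w)
    h = w[1] - w[0]
    return h * (f.sum() - 0.5 * (f[0] + f[-1]))
```

The two agree to high precision, confirming that the completed-square derivation behind (9) is consistent.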
Let us fix a partition of the set of feature indices {1, ..., d} into G groups I_1, ..., I_G (11); then training of the elastic net for linear logistic regression (2) turns into

−L(W⃗; T) + Σ_{g=1}^{G} (α_g |W_g| + (β_g/2) ‖W_g‖²) → min_W⃗,    (13)

where W_g is the submatrix of W formed by the columns with indices in I_g. It is easy to see that optimization problem (13) is convex for any training set T and nonnegative α_g and β_g. The choice of values of the 2G regularization parameters α_g and β_g, which is the subject of this paper, will be discussed later in Section 3.2.
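The group-wise penalty of (13) can be sketched as follows; the helper and its arguments are hypothetical, with `groups` listing the index sets I_g:

```python
import numpy as np

def grouped_penalty(W, groups, alphas, betas):
    """Per-group elastic-net penalty sum_g (alpha_g*|W_g| + beta_g/2*||W_g||^2),
    where W_g consists of the columns of W whose feature indices lie in group g."""
    total = 0.0
    for idx, a, b in zip(groups, alphas, betas):
        Wg = W[:, idx]
        total += a * np.abs(Wg).sum() + 0.5 * b * np.square(Wg).sum()
    return total
```

With a single group and α_1 = α, β_1 = β this reduces to the isotropic penalty of (10).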

Learning Technique
3.1. Nonsmooth Convex Optimization. Standard gradient methods are not applicable to minimization problems (10) and (13) because they contain the nonsmooth terms |W| and |W_g|. So the algorithm proposed by Nesterov in [4] for minimization of sums of smooth and simple nonsmooth convex functions is used. Nesterov's algorithm provides the best convergence rate at a moderate number of steps (less than the number of variables, which is equal to C(d + 1) in (10) and (13)) among all known methods of nonsmooth optimization [5].
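The ℓ1 term is "simple" in the sense that its proximal operator has a closed form (soft thresholding). A minimal proximal-gradient sketch follows; for brevity it uses plain ISTA rather than Nesterov's accelerated method, and all names are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*|.|: shrink each coordinate toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(grad_smooth, w0, alpha, step, n_steps):
    """Proximal gradient descent on f_smooth(w) + alpha*|w|."""
    w = w0.copy()
    for _ in range(n_steps):
        w = soft_threshold(w - step * grad_smooth(w), step * alpha)
    return w

# Toy problem: minimize 0.5*||w - b||^2 + alpha*|w|,
# whose exact solution is soft_threshold(b, alpha).
b = np.array([3.0, -0.5, 1.0])
w_star = ista(lambda w: w - b, np.zeros(3), alpha=1.0, step=0.5, n_steps=200)
```

Note how the minimizer zeroes out coordinates whose magnitude is below the threshold, which is exactly the source of sparsity in ℓ1-regularized training.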
Nesterov's algorithm can exploit strong convexity of the target function, and it converges faster the larger the strong convexity parameter that can be guaranteed in advance. The target function in (13) is not strongly convex in the bias column w_0, but it would be strongly convex if ℓ2-regularization were applied to all parameters W⃗ including w_0. Consider the following modification of problem (13).
(1) Estimate the bias column ŵ_0 as

ŵ_c0 = ln (N_c / N), c = 1, ..., C,

where N_c is the number of training samples of class c. The estimate ŵ_0 is the solution of the problem of minimizing −L(W⃗; T) over W⃗ with all non-bias parameters fixed at zero, which is nothing but maximum likelihood training of the featureless logistic regression model.
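For the featureless model this maximum likelihood problem has the closed-form solution ŵ_c0 = ln(N_c/N) (up to an additive constant): the softmax of the bias then matches the empirical class frequencies. A sketch, assuming integer class labels and every class occurring at least once:

```python
import numpy as np

def init_bias(labels, n_classes):
    """ML bias of the featureless logistic model: w0_c = ln(N_c / N),
    so that softmax(w0) equals the empirical class frequencies."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return np.log(counts / counts.sum())
```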

3.2. Evidence Maximization.
To train elastic nets (10), (13), or (16) successfully, some reasonable values of the regularization parameters α and β (hyperparameters) are required. In machine learning problems with one or at most two hyperparameters (e.g., in SVM [1]) their values can be found by grid search. However, there are 2G + 1 hyperparameters in the generalized elastic net (16), and we are interested in the case G > 1. In this case a reasonable way to optimize them is evidence maximization. The use of evidence maximization for estimation of the hyperparameters of ridge regression and other Gaussian-based models is well known [6]. For non-Gaussian elastic nets the evidence of hyperparameters can be neither computed nor maximized exactly and will be approximated rather roughly.
Let the prior P_0(W⃗) depend on two hyperparameters α and β as in (5). Then posterior (3) with α and β indicated explicitly is

P(W⃗ | T, α, β) = P(T | W⃗) P_0(W⃗ | α, β) / P(T | α, β).    (17)

The denominator is ignored in maximization of posterior (3) because it does not depend on W⃗. However, it does depend on α and β. This denominator E(α, β; T) = P(T | α, β) is called the evidence of the parameters α and β with respect to the training set T. Despite its special name, it is a usual likelihood: not the likelihood of a single model like L(W⃗; T) in (4), but the likelihood of the whole probability space of models defined by the hyperparameters α and β.
For the prior (5), the evidence of the pair (α, β) is

E(α, β; T) = ∫ P(T | W⃗) P_0(W⃗ | α, β) dW⃗,    (18)

and evidence maximization is equivalent to the minimization

−ln E(α, β; T) → min_{α, β}.    (19)

The normalization factor Z(α, β) is rewritten using formula (9) here.
The gradient of (19) can be expressed through posterior expectations, where E_{α,β}[g] stands for the expectation of g(W) with respect to the posterior distribution of W, proportional to P(T | W⃗) P_0(W | α, β). Setting the gradient to zero leads to the reestimation transformations (23) and (24) for α and β; their iteration is stopped when the validation likelihood of the trained model stops increasing. This criterion is a kind of well-known early stopping method [8]. On one hand, such early stopping speeds up the training significantly. On the other hand, it is a regularization technique [9] by itself and can hide the effect of tuning the regularization parameters via evidence maximization, which is the subject of the study here. To find a balance, the delays between non-increase of the validation likelihood and stopping were chosen empirically.
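The stopping rule described above can be sketched as a patience-based loop; the callback names and the return value are hypothetical:

```python
def train_with_early_stopping(step_fn, val_loglik_fn, max_steps, patience=30):
    """Run optimization steps; stop when the validation log likelihood
    has not improved during the last `patience` steps."""
    best = -float("inf")
    since_best = 0
    for _ in range(max_steps):
        step_fn()                      # one optimization step
        ll = val_loglik_fn()           # likelihood on the validation set
        if ll > best:
            best, since_best = ll, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                  # no improvement for `patience` steps
    return best
```

The same loop with a smaller patience (about 5 in the paper) serves for the outer iterations over the regularization parameters.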

Experiments
The method described in Sections 2 and 3 was applied to recognition of handwritten digits from the MNIST database (see [10]). This database contains grayscale raster images of 28 × 28 = 784 pixels each, which belong to one of C = 10 classes. Traditionally it is partitioned into 60000 samples for training and 10000 for testing. 15% of the training samples were left out for validation, so N_train = 51000 and N_val = 9000.
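The 85/15 split can be sketched as follows; the helper name and the seed are arbitrary:

```python
import numpy as np

def split_train_val(n_total, val_fraction=0.15, seed=0):
    """Hold out a random validation subset, as done here with MNIST's
    60000 training images (-> 51000 train / 9000 validation)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    n_val = int(val_fraction * n_total)
    return perm[n_val:], perm[:n_val]   # train indices, validation indices

train_idx, val_idx = split_train_val(60000)
```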
Both to make linear logistic regression more powerful and to test the proposed method of estimation of numerous regularization parameters, more features were added to the model. Besides the 784 primary features (the pixel intensities), several groups of secondary features were generated. Then all the features, both secondary and primary, were normalized to zero mean and unit variance.
The following groups of secondary features were used in experiments.
(3) Projection histograms [11], that is, the number of nonzero pixels and the positions of the first and the last one within each row and each column of the image (28 + 28 + 28 × 2 + 28 × 2 = 168 features).
(4) The corner metric matrix of the image, which for each pixel contains an estimated "likelihood" of that pixel being a corner point. The corner metric matrix is calculated by the MATLAB function cornermetric [12] (784 features).
(5) The local standard deviation matrix, which for each pixel contains the standard deviation of the intensity over the 9-by-9 neighborhood of the pixel. The local standard deviation is calculated by the MATLAB function stdfilt [12] (784 features).
This amounts to d = 5656 primary and secondary features in total.
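The projection histogram features of item (3) can be sketched for one 28 × 28 image as follows, assuming the convention that an empty row or column yields position −1 (the paper does not specify this convention):

```python
import numpy as np

def projection_histogram_features(img):
    """For each row and then each column of a binary image: the number of
    nonzero pixels, plus the positions of the first and the last nonzero
    pixel (-1 when the line is empty). A 28x28 image gives 168 features."""
    feats = []
    for lines in (img, img.T):              # first rows, then columns
        nz = lines != 0
        any_nz = nz.any(axis=1)
        counts = nz.sum(axis=1)
        first = np.where(any_nz, nz.argmax(axis=1), -1)
        last = np.where(any_nz,
                        lines.shape[1] - 1 - nz[:, ::-1].argmax(axis=1), -1)
        feats.extend([counts, first, last])
    return np.concatenate(feats).astype(float)
```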
Remember that the proposed learning technique consists of two levels: the inner level is training of the elastic net (16) with fixed regularization parameters (α, β) using Nesterov's optimization algorithm, and the outer level, inspired by the maximum evidence principle, is iterative application of transformations (23) and (24) to α and β. Several different partitions (11) of features into groups were tried.
Each row in Tables 1, 2, and 3 represents a single experiment: an elastic net (16) trained with certain α and β. Each experiment was repeated 20 times. The estimated intervals of the measured values, shown in the tables, are intervals of two standard deviations around the mean. Error is the misclassification rate measured on the 10000-element test set, provided the most probable class is predicted. Sparseness of the trained model appears due to the ℓ1-regularization in the elastic net and increases with α.

4.2. Tuning Regularization Parameters by Evidence Maximization.
Next, experiments with automatic tuning of the regularization parameters α and β were performed. Since all features had been normalized, the learning was started from α_g^0 = 1 and β_g^0 = 1 for all g = 1, ..., G. The results are shown in Table 2. Each row represents the elastic net obtained by the described two-level learning process for a certain partition (11) of the features.
Several different partition schemes were tested. These experiments show that the evidence maximization technique allows one to obtain more accurate elastic nets than elastic nets with guessed scalar regularization parameters. Indeed, compare the last column of Table 1 with the rows G = 8, 13, and 40 of Table 2. These rows represent elastic nets trained with certain values of 17-, 27-, and 81-dimensional regularization parameters, which could hardly be guessed.

Sparse Elastic Net.
Last, we performed a series of experiments trying to train very sparse but reasonably accurate models. Sparseness of a model trained with an elastic net depends mostly on its parameter(s) α or α_g. In the described technique these parameters are tuned in order to get elastic nets with higher evidence. However, experiments show that iterations of the transformations (23) and (24) with the stopping criterion of Section 3.3 tend to stop before they reach any (local!) maximum of the evidence, and where they stop depends on the initial parameters α_g^0 and β_g^0. Table 3 shows the results of training elastic nets with starting parameters α_g^0 = α_g^max, g = 1, ..., G. These results are discussed in the following section.

Accuracy of the Trained Model.
The best model trained with the evidence maximization technique, shown in Table 2, has 1.69% average test error, which is significantly less than the 1.81% obtained by guessing scalar regularization parameters (Table 1). In our experiments each learning with evidence maximization took only 5-10 reestimations of the regularization parameters, so the numbers of elastic nets trained to fill in Tables 1 and 2 are comparable (moreover, not all guesses are shown in Table 1).
The evidence maximization technique allows one to guess only an appropriate partitioning of the features instead of particularly good values of the regularization parameters. Still, this technique is not fully automated: neither of the two obvious extreme partitions (the roughest and the finest ones) leads to the best model. The 1.83% error in the first row of Table 2, compared to the 1.81% achieved in Table 1, shows that evidence maximization does not necessarily lead to the best accuracy. But it can be used when the regularization parameters are multidimensional and naive attempts to guess a good value for them are unfeasible.
The obtained accuracy is much lower than the best state-of-the-art results obtained by convolutional neural networks, deep learning, and augmentation of the training dataset. But the elastic net with precisely tuned regularization parameters can achieve higher accuracy than other traditional models of the same complexity (e.g., 1- or 2-layer neural networks or SVM with Gaussian kernel) (see [10]).

Sparseness of the Trained Model.
In some practical classification problems high sparseness of the model takes priority over high accuracy. The proposed method allows one to train models with various tradeoffs between sparsity and accuracy.
The last elastic net shown in Table 3 provides test error 2.28% and sparseness 88.62%, so only 644 of the 5656 features are used. Compared to the most accurate elastic net from Table 2, the error increased by 0.59%, while the number of used features decreased more than sevenfold, from 5116 to 644. This result was achieved by tuning individual regularization parameters for each feature, starting from the biggest reasonable α_g^0.
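The sparseness and reduction figures follow (up to rounding) from the feature counts:

```python
d = 5656                      # total number of features
used_sparse = 644             # features used by the sparse net (Table 3)
used_accurate = 5116          # features used by the most accurate net (Table 2)

sparseness = 100.0 * (1 - used_sparse / d)   # about 88.6% of weights are zero
reduction = used_accurate / used_sparse      # about a 7.9-fold reduction
```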

Conclusion
This paper describes a machine learning method based on a technique for adjusting the regularization parameters of elastic nets, inspired by the evidence maximization principle. The method is able to cope with multidimensional regularization parameters using only rough ideas about their initial values and about the nature of the features used in the models to be learned. The method was tested on the MNIST database of handwritten digits and allowed training a more accurate elastic net than could be trained with a traditional grid search over one or two scalar regularization parameters. It also allowed training very sparse models with reasonable accuracy.
Still, the primary goal of the proposed learning method lies beyond the scope of this paper: it is to develop a mechanism of feature selection based on training elastic nets with a controlled tradeoff between their sparseness and accuracy. In the future the proposed method is going to be applied to other machine learning problems, including problems with a very large number of features.
[Displaced formulas: the elastic net objective Σ_i L(f(w, x_i), c_i) + α|w| + (β/2)‖w‖² → min_w, and the reestimation transformations (23) and (24), which rescale each α_g and β_g by factors expressed through Ψ(−α_g/√β_g) and sums over i ∈ I_g and the classes.]
3.3. Stopping Criterion. The available dataset T is partitioned into a training set T_train of N_train samples and a validation set T_val of N_val samples. The first one is used to train elastic nets (16), while the second one is used to decide whether further training has become senseless and should be stopped. Namely, training of the elastic net is stopped if the likelihood L(W⃗; T_val) has not increased during the last several (about 30) optimization steps, and tuning of the regularization parameters (α, β) is stopped if the likelihood L(W*(α, β); T_val) of the trained model has not increased during the last several (about 5) iterations.

Table 1 :
Elastic nets trained with several fixed regularization parameters  and .

Table 2 :
Elastic nets trained with the evidence maximization technique.

Table 3 :
Sparse elastic nets trained with the evidence maximization technique.

Experiments of Section 4.2 (Table 2) started from α_g^0 = 1 and β_g^0 = 1 for all g; then sparseness was low, but the trained models made more accurate predictions. If α_g > α_g^max = max_{i ∈ I_g, c = 1, ..., C} |∂ ln L(0; T_train)/∂w_ci|, optimization problem (16) has the unique solution W = 0, the most sparse one, but not accurate. Starting the iterations from α_g^0 = α_g^max allows one to get a sparse elastic net with reasonable accuracy.