Coordinate Descent Based Hierarchical Interactive Lasso Penalized Logistic Regression and Its Application to Classification Problems

We present a hierarchical interactive lasso penalized logistic regression solved with the coordinate descent algorithm, based on hierarchy theory and variable interactions. We define the interaction model using geometric algebra and hierarchical constraint conditions and then use the coordinate descent algorithm to solve for the coefficients of the hierarchical interactive lasso model. We report experiments on UCI datasets, the Madelon dataset from NIPS2003, and a dataset of daily activities of the elderly. The experimental results show that variable interactions and hierarchy contribute significantly to classification. The hierarchical interactive lasso combines the advantages of the lasso and the interactive lasso.


Introduction
Sparse linear models such as the lasso have been remarkably successful in the regression analysis of high-dimensional data [1]. The lasso is a least squares regression with an L1 penalty. It can also be extended to the generalized linear model [2], for example, logistic regression with an L1 penalty for classification [3]. In the lasso model, the response variable is assumed to be a linear weighted sum of the predictor variables, and the optimization problem for the weighting coefficients can be solved by the coordinate descent algorithm [4]. If, in the analysis of high-dimensional data, the response variable cannot be explained by a linear weighted sum of the predictor variables, a higher-order model such as a quadratic model needs to be used. In most cases, this suggests the presence of variable interactions [5]. Such interactions are considered important; for example, the interaction between single nucleotide polymorphisms (SNPs) plays an important role in the diagnosis of cancer and other diseases [6]. While the linear model has advantages such as good interpretability and simple computation, variable interaction models are a focus of current research [7].
Three types of methods are used for hierarchical interaction models. The first is the multistep method, which removes or adds the best predictor variables or interaction variables in each iteration. Once the predictor variables corresponding to the interaction variables are in the model, the interaction variables must be in the model as well [8]. Alternatively, variable selection can be carried out before interaction selection [9]. A modified LARS algorithm is usually used to solve the interaction model in such methods [10]. The second type is the Bayesian model method, which improves the random search variable selection method for the hierarchical interaction model [11]. The third type is based on optimization. The sparse interaction model is formulated as a nonconvex optimization problem [12] and further expressed as a convex optimization problem, such as the all-pair lasso [13] or the interaction group lasso [14].
In the literature on sparse structures [15], composite absolute penalties (CAP) can also produce group and interaction sparseness, but the interaction coefficient is penalized twice [16]. To obtain hierarchical sparseness in the nonlinear interaction problem, [17] introduced the VANISH method. The logic regression method considers high-order interactions among binary variables [18]. A simple recursive approach is used in [19] to select interaction variables from high-dimensional data, and [20] proposed a genetic algorithm that uses selection to choose interaction variables in high-dimensional data.
A hierarchical interactive lasso method for regression is presented in [13], together with a method for estimating the model coefficients using the KKT conditions and the Lagrange multiplier method. Based on [13] and our past work, we propose the concept of geometric algebra interaction and a coordinate descent algorithm for the hierarchical interactive lasso penalized logistic regression. Our experimental data include four datasets from the UCI machine learning database, the Madelon dataset from NIPS2003, and one daily life activity recognition dataset. The experimental results show the advantages of the hierarchical interactive lasso method over the lasso and interactive lasso methods. The innovations are as follows: (1) we use geometric algebra to explain variable interaction; (2) we derive an improved coordinate descent algorithm to solve the hierarchical interactive lasso penalized logistic regression; (3) we use the hierarchical interactive lasso for the classification problem.

The Variable Interaction Theory of Geometric Algebra
Definition 1. If the function $f(x, y)$ cannot be represented as a sum of independent functions, $f_1(x) + f_2(y)$, then $x$ and $y$ are said to interact in the function $f$.
An informal reading of Definition 1 is that if a response variable cannot be represented as a linear weighted sum of the predictor variables, it is probably because there are interactions between the variables.
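Definition 1 can be illustrated numerically with a minimal sketch. For any additive function $f_1(x) + f_2(y)$, the mixed difference over a rectangle vanishes, so a nonzero value proves interaction (a zero value on a single rectangle does not prove additivity). The function names and the example coefficients below are hypothetical, chosen only for illustration.

```python
def mixed_difference(f, x0, x1, y0, y1):
    """Return f(x1, y1) - f(x1, y0) - f(x0, y1) + f(x0, y0).

    This is identically zero whenever f(x, y) = f1(x) + f2(y).
    """
    return f(x1, y1) - f(x1, y0) - f(x0, y1) + f(x0, y0)


def additive(x, y):
    # Hypothetical example with no interaction: f1(x) + f2(y)
    return 2.0 * x + 3.0 * y


def with_product(x, y):
    # Same main effects plus a product term, which is an interaction
    return 2.0 * x + 3.0 * y + 0.5 * x * y


print(mixed_difference(additive, 0.0, 1.0, 0.0, 2.0))      # 0.0
print(mixed_difference(with_product, 0.0, 1.0, 0.0, 2.0))  # 1.0
```

For the product term, the mixed difference equals the coefficient times the area of the rectangle, here 0.5 × 1 × 2.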
Interactions between variables can be explained by geometric algebra. Figure 1 is a diagram showing all subspaces in geometric algebra. The 1-vectors, namely, the order-1 main variables, represent a $p$-dimensional subspace of the original data. That is, the $p$-dimensional basis of the original data is projected onto the 1-vectors. The 2-vectors describe the interaction between two variables. The simplest 2-vector coefficient is the product of two 1-vectors. The area feature we proposed is treated as one such interaction in [13], and the orthocenter feature we proposed is treated as one in [20]. Higher-order interactions are represented by $k$-vectors. In this paper, we only study area interactions between 1-vectors.
This method can also be extended to nonlinear complex function interactions or to higher orders.
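As a small illustration of order-2 interaction variables, the sketch below builds the simplest form mentioned above, the pairwise product of two main variables, for one sample. It is not the area feature studied in this paper; the helper names are hypothetical.

```python
from itertools import combinations


def pairwise_products(row):
    """Return the products x_j * x_k for all pairs j < k of one sample."""
    return [row[j] * row[k] for j, k in combinations(range(len(row)), 2)]


def n_pairs(p):
    """Number of distinct order-2 interactions among p main variables."""
    return p * (p - 1) // 2


print(pairwise_products([1.0, 2.0, 3.0]))  # [2.0, 3.0, 6.0]
print(n_pairs(20))                         # 190
```

The count grows quadratically with the number of main variables, which is why sparsity in the interaction coefficients matters for high-dimensional data.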

The Binary Logistic Regression Based on Interaction and Hierarchy
The outcome variable in the binary logistic model is denoted by $Y$; the input variables are the predictors $X$. $X_1, \ldots, X_j, \ldots, X_p$ are the order-1 main variables, and the pairwise products $X_j X_k$ are the interaction variables between order-1 main variables. The binary logistic model has the form (1), where $\Theta_{jj} = 0$, the main variable coefficients are $\beta \in \mathbb{R}^{p+1}$, the interaction variable coefficients are $\Theta \in \mathbb{R}^{p \times p}$, $X_0$ is 1, and $\varepsilon$ follows $N(0, \sigma^2)$. Assume that the training samples are $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_i, y_i), \ldots, (\mathbf{x}_N, y_N)$ with $\mathbf{x}_i \in \mathbb{R}^p$. Our goal is to select a feature subset from the order-1 main variables (dimension $p$) and the order-2 interaction variables (dimension $p(p-1)/2$). We then estimate the coefficient values of the nonzero model parameters. The probabilities of the two classes follow from the model. Maximum likelihood estimation is used to estimate the unknown model parameters, that is, the parameters are chosen to maximize the likelihood function of the $N$ independent observations, which is defined in (4). The log-likelihood function of (4) is (5).
Mathematical Problems in Engineering
We take the second-order Taylor expansion of (5) at the current estimate $(\tilde{\beta}, \tilde{\Theta})$ and obtain the subproblem (6). The proof that (5) implies (6) is presented in Appendix A.
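As one concrete reading of the step from (5) to (6), the following sketch uses the standard logistic log-likelihood $\sum_i [y_i \eta_i - \log(1 + e^{\eta_i})]$ and its usual second-order approximation $-\tfrac{1}{2}\sum_i w_i (z_i - \eta_i)^2 + C$, with weights $w_i = \tilde{p}_i(1 - \tilde{p}_i)$ and working responses $z_i = \tilde{\eta}_i + (y_i - \tilde{p}_i)/w_i$. The function names, the factor 1/2 on the interaction sum, and the example numbers are assumptions made for illustration; the paper's own equations may use different notation.

```python
import math


def linear_predictor(x, beta0, beta, Theta):
    """eta = beta0 + sum_j beta_j x_j + (1/2) sum_{j != k} Theta_jk x_j x_k.

    The factor 1/2 (each pair appears twice in a symmetric Theta) is an assumption.
    """
    p = len(x)
    main = sum(beta[j] * x[j] for j in range(p))
    inter = 0.5 * sum(Theta[j][k] * x[j] * x[k]
                      for j in range(p) for k in range(p) if j != k)
    return beta0 + main + inter


def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def log_likelihood(etas, ys):
    """Logistic log-likelihood: sum_i [y_i * eta_i - log(1 + exp(eta_i))]."""
    return sum(y * eta - math.log1p(math.exp(eta)) for eta, y in zip(etas, ys))


def working_quantities(etas_tilde, ys):
    """Weights w_i and working responses z_i at the current estimate."""
    ws, zs = [], []
    for eta, y in zip(etas_tilde, ys):
        p = sigmoid(eta)
        w = p * (1.0 - p)
        ws.append(w)
        zs.append(eta + (y - p) / w)
    return ws, zs


def quadratic_approx(etas, etas_tilde, ys):
    """Second-order Taylor expansion of the log-likelihood around etas_tilde."""
    ws, zs = working_quantities(etas_tilde, ys)
    # The constant makes the approximation exact at the expansion point.
    const = log_likelihood(etas_tilde, ys) + 0.5 * sum(
        w * (z - e) ** 2 for w, z, e in zip(ws, zs, etas_tilde))
    return const - 0.5 * sum(w * (z - e) ** 2 for w, z, e in zip(ws, zs, etas))
```

Because the approximation is a weighted least squares problem in the linear predictor, each outer iteration reduces to a penalized weighted regression, which is the setting in which coordinate descent is applied.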
In order to obtain a sparse solution for the main variable coefficients and the interaction coefficients, a penalty function is added, which also enhances the stability of the interactive model; this gives (7). We focus on those interactions whose main variables have large coefficients. Such restrictions are known as "hierarchy." Their mathematical expression is $\Theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ or $\beta_k \neq 0$. We therefore add constraints enforcing the hierarchy to (7), which gives (8). The new constraint guarantees the hierarchy, but (8) is not convex, so it cannot be solved as a convex problem. Instead of $\beta$ we therefore use $\beta^{+}$ and $\beta^{-}$, and the corresponding convex relaxation of (8) is (9), where $\beta = \beta^{+} - \beta^{-}$, $\beta^{\pm} = \max\{\pm\beta, 0\}$, and $\beta^{+}, \beta^{-} \in \mathbb{R}^{p+1}$.
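The hierarchy condition and the split of $\beta$ into nonnegative parts can be checked with a short sketch. It implements only the two statements given in the text, $\Theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ or $\beta_k \neq 0$ and $\beta^{\pm} = \max\{\pm\beta, 0\}$; the function names are hypothetical, and `beta` here holds only the main variable coefficients (no intercept).

```python
def split_signed(beta):
    """Return (beta_plus, beta_minus) with beta = beta_plus - beta_minus."""
    plus = [max(b, 0.0) for b in beta]
    minus = [max(-b, 0.0) for b in beta]
    return plus, minus


def satisfies_hierarchy(beta, Theta, tol=0.0):
    """Check that every nonzero Theta[j][k] has a nonzero beta[j] or beta[k]."""
    p = len(beta)
    for j in range(p):
        for k in range(p):
            if j != k and abs(Theta[j][k]) > tol:
                if abs(beta[j]) <= tol and abs(beta[k]) <= tol:
                    return False
    return True


beta = [1.5, 0.0, -2.0]
Theta_ok = [[0.0, 0.3, 0.0], [0.3, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(satisfies_hierarchy(beta, Theta_ok))             # True: beta[0] is nonzero
print(satisfies_hierarchy([1.5, 0.0, 0.0],
                          [[0.0, 0.0, 0.0],
                           [0.0, 0.0, 0.3],
                           [0.0, 0.3, 0.0]]))          # False: both parents are zero
```

Working with $\beta^{+}$ and $\beta^{-}$ replaces the absolute value of each coefficient by the sum of two nonnegative variables, which is what allows the constraint to be relaxed into a convex one.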

Coordinate Descent Algorithm and KKT Conditions
The basic idea of the coordinate descent algorithm is to convert a multivariate problem into multiple single-variable subproblems. Only one coordinate is optimized at a time, and the solution is updated in cycles. We solve (9) using the coordinate descent algorithm.
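The cyclic single-coordinate update can be sketched for the plain lasso, $\min_b \tfrac{1}{2N}\lVert y - Xb\rVert^2 + \lambda\lVert b\rVert_1$, where each coordinate has a closed-form soft-thresholding solution. This is a generic illustration of coordinate descent, not the hierarchical update derived in this paper; the function names and stopping rule are assumptions.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_sweeps=100, tol=1e-10):
    """Minimize (1/2N)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    resid = y.astype(float).copy()        # residual y - X b, with b = 0 initially
    col_sq = (X ** 2).sum(axis=0) / n     # (1/N) * x_j^T x_j for each column
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # (1/N) x_j^T (partial residual), i.e. the residual with b_j removed
            rho = X[:, j] @ resid / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                resid -= X[:, j] * delta  # keep the residual in sync with b
                b[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:              # stop when a full sweep changes nothing
            break
    return b
```

Keeping the residual up to date makes each coordinate update cost one pass over a single column, which is the main reason the method scales to many variables.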
The Lagrange function corresponding to (9) is (10), where $\alpha$ and a pair of multipliers indexed by $\pm$ are the dual variables corresponding to the hierarchical constraint and the nonnegativity constraints, respectively. Formula (10) can be decomposed into $p$ subproblems of the form (12). Because (12) is a convex problem, its solution can be obtained from a set of optimality conditions known as the KKT (Karush-Kuhn-Tucker) conditions. This is the key advantage of our approach.
In summary, the overall idea of the coordinate descent algorithm is that minimizing (9) is equivalent to minimizing (10). Formula (10) can be decomposed into $p$ independent subproblems of the form (12), each of which can be solved as in (13). In the final coefficient update formula, $\hat{\beta}_j^{+(m)}$ is the estimated value of the $j$th main variable coefficient after $m$ iterations and $\hat{\Theta}_{jk}^{(m)}$ is the estimate of the interaction coefficient between the $j$th variable and the $k$th variable after $m$ iterations.

The Experimental Results and Analysis of Four UCI Datasets.
There are four datasets from the UCI machine learning database: the breast-cancer-Wisconsin, Ionosphere, Liver disorders, and Sonar datasets, as shown in Table 1.
We run the 10-fold cross-validation (10-CV) experiments 20 times using the penalty parameter $\lambda$, where $\lambda_1 = 2\lambda_2 = \lambda$. We also carry out an experiment using the hierarchical interactive lasso logistic regression method. The results include the number of nonzero variable coefficients, the average 10-CV error rate, the standard deviation (SD), the CPU time, and the estimated value of lambda ($\lambda$). The results are shown in Table 2.
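The repeated cross-validation protocol can be sketched as an index generator: each repeat reshuffles the samples and partitions them into 10 folds, giving 200 train/test splits over 20 repeats. This is a minimal, unstratified sketch using only the standard library; the function name, the seed, and the sample count in the usage line are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=20, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                      # new random partition per repeat
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train_idx, test_idx


splits = list(repeated_kfold(103))
print(len(splits))  # 200
```

The error rate of one repeat is the average over its 10 folds; the mean and standard deviation reported for a method are then taken over the 20 repeats.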
The 10-CV results of the proposed method on the four datasets are presented in Figures 2 to 5. In the figures, the horizontal axis represents the logarithm of $\lambda$ and the vertical axis is the 10-CV error rate. The horizontal axis at the top of each figure gives the number of nonzero variable coefficients corresponding to each value of $\lambda$.
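The curve plotted in such figures can be summarized as follows: for each value of $\lambda$, average the error rates over the repeats, compute their standard deviation, and pick the $\lambda$ with the lowest mean error. The sketch below assumes the error rates are already available as a dictionary; the function name and the numbers in the usage example are made up for illustration and are not results from this paper.

```python
import math
from statistics import mean, stdev


def summarize_cv(errors_by_lambda):
    """Return rows (log(lambda), lambda, mean error, SD) and the best lambda.

    errors_by_lambda maps each lambda to a list of CV error rates,
    one per repeat (at least two repeats are needed for the SD).
    """
    rows = []
    for lam in sorted(errors_by_lambda):
        errs = errors_by_lambda[lam]
        rows.append((math.log(lam), lam, mean(errs), stdev(errs)))
    best = min(rows, key=lambda row: row[2])  # lowest mean CV error
    return rows, best[1]


# Illustrative numbers only
errors = {0.01: [0.20, 0.22], 0.1: [0.10, 0.14], 1.0: [0.30, 0.30]}
rows, best_lambda = summarize_cv(errors)
print(best_lambda)  # 0.1
```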
The results for the breast-cancer-Wisconsin dataset are shown in Figure 2. The minimum error rate is 0.03 and the number of selected variables is more than 11. The results for the Ionosphere dataset are shown in Figure 3. When the number of selected variables is 101, the lowest error rate is 0.28 with a small standard deviation. The number of selected variables is larger than the original dimension, so the interactions carry classification information. The results for the Liver disorders dataset are presented in Figure 4. When the number of selected variables is 25, the lowest error rate reaches 0.26, with a standard deviation of 0.02. Finally, the results for the Sonar dataset are presented in Figure 5. When more than 80 variables are selected, the minimum error rate is 0.14.
We next compare our method with that of [13]. The classification results and training time of our method are better than those reported in [13]. The experimental results of the lasso, the all-pair lasso, and conventional pattern recognition methods, each with 10-fold cross-validation repeated 20 times on the four UCI datasets, are listed in Tables 3, 4, and 5, respectively. The conventional pattern recognition methods include the support vector machine (SVM), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), $k$-nearest neighbor ($k$-NN), and decision tree (DT) methods. The lasso considers the main variables without the interaction variables. The all-pair lasso considers the main variables and the interaction variables but without the hierarchy. The experimental results show that our model gives better and more stable classification results. This highlights the advantage of variable interactions and hierarchy.

The Experimental Results for High-Dimensional Small Sample Data.
The Madelon dataset from NIPS2003 was used for this experiment (Table 6). The results show that our method is slightly better than the lasso and the all-pair lasso. This implies that the interactions may also be important in the Madelon dataset.

Activity Recognition (AR) Using Inertial Sensors of Smartphones.
Anguita et al. collected smartphone sensor data [10]. They used the support vector machine (SVM) method to solve the classification problem of daily life activity recognition. Such results are highly relevant to disability and elderly care. The datasets can be downloaded following [10]. Thirty volunteers aged 19-48 years participated in the study. Each person performed six activities while wearing the smartphone on the waist. To obtain the class labels of the data, the experiments were video recorded. The smartphone used in the experiments had a built-in accelerometer and gyroscope for measuring 3D linear acceleration and angular velocity. The sampling frequency was 50 Hz, which is sufficient for capturing human movements.
We use these datasets to evaluate our method, taking the upstairs and downstairs movements as the two activity classes. The training sets have 986 samples and 1,073 samples, respectively. The test sets have 420 samples and 471 samples, respectively. The variable dimension is 561, which includes time-domain and frequency-domain features of the sensor signals.
The experimental results of the three lasso methods and some pattern recognition methods are shown in Table 7. The results show that our method is better than the pattern recognition methods, since it takes variable selection and interaction into account. Our method achieves the best classification results with less training and testing time.

The Numerical Simulation Results and Discussion
Now, suppose that the number of samples and their dimension are $N = 200$ and $p = 20$. We take interactions into consideration and run three kinds of simulation based on formula (1): the true model is hierarchical, the true model includes only interaction variables, or the true model includes only main variables.
The SNR of the main variables is 1.5, and the SNR of the interaction variables is 1. The results of 100 repetitions of the experiment are shown in Figure 6.
When the true model is hierarchical, our method is the best and the lasso is the worst, as shown in Figure 6(a). When the true model includes only interaction variables, the interactive lasso is the best, our method takes second place, and the lasso is still the worst, as shown in Figure 6(b). The reason is that when our method fits the model, the interaction variables are treated as main variables. When the true model includes only main variables, the lasso is the best, our method again takes second place, and the all-pair lasso is the worst, as shown in Figure 6(c).
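The three kinds of true model compared here can be sketched as a data generator. The sketch below is a hypothetical setup, not the simulation design used for Figure 6: the number of active variables, the coefficient values, the factor 1/2 on the symmetric interaction term, and the logistic sampling of the labels are all assumptions, and the SNR settings are not modeled.

```python
import numpy as np


def simulate(scenario, n=200, p=20, n_main=5, seed=0):
    """Generate (X, y, beta, Theta) for one of three hypothetical true models."""
    if scenario not in ("hierarchical", "interaction_only", "main_only"):
        raise ValueError(f"unknown scenario: {scenario}")
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    Theta = np.zeros((p, p))                  # symmetric, zero diagonal
    if scenario in ("hierarchical", "main_only"):
        beta[:n_main] = 1.0                   # first n_main variables are active
    if scenario in ("hierarchical", "interaction_only"):
        for j in range(n_main - 1):           # pairs (0,1), (1,2), ...
            Theta[j, j + 1] = Theta[j + 1, j] = 1.0
    # eta_i = x_i^T beta + (1/2) x_i^T Theta x_i
    eta = X @ beta + 0.5 * np.sum((X @ Theta) * X, axis=1)
    prob = 1.0 / (1.0 + np.exp(-eta))
    y = (rng.random(n) < prob).astype(int)    # Bernoulli class labels
    return X, y, beta, Theta
```

In the hierarchical scenario every nonzero interaction involves variables whose main coefficients are also nonzero, whereas the interaction-only scenario violates the hierarchy by construction.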
We believe that many actual classification problems could be hierarchical and interactive.They contain both main variables and interaction variables.Our method fits in this kind of situation.

Conclusion
Taking the interaction between variables into consideration, we derive the hierarchical interactive lasso penalized logistic regression using the coordinate descent algorithm. We provide the model definition, the constraint condition, and the convex relaxation condition for the model. We obtain a solution for the coefficients of the proposed model based on convex optimization and the coordinate descent algorithm. We further provide experimental results based on four UCI datasets, the NIPS2003 feature selection challenge dataset, and a real daily life activity recognition dataset. The results show that interactions are widespread in classification models and that variable interactions contribute to the response. The classification performance of our method is superior to that of the lasso, the all-pair lasso, and some pattern recognition methods. Variable interaction and hierarchy turn out to be two important factors. Our further research is planned along the following lines: other convex optimization methods, including the generalized gradient descent method and the alternating direction method of multipliers; the hierarchical interactive lasso penalized multiclass logistic regression method; the elastic net method; and the hierarchical group lasso method. Applying multisensor interactions to the daily life activities of the elderly is a new way of using our method.

Appendix A
For notational convenience, we have written $p_i$ instead of $p(\mathbf{x}_i)$. The log-likelihood function of (4) is given in (A.1). First, we give the first- and second-order partial derivatives and the mixed partial derivatives of (4) with respect to $\beta$ and $\Theta$. Then (A.1) is expanded in a Taylor series about the expansion point $(\tilde{\beta}, \tilde{\Theta})$.

Figure 1: The diagram of the subspaces in geometric algebra.

Figure 4: The results of Liver disorders.

Figure 6: The error rate of the three lasso methods in the simulation experiment (the Bayes error is shown by the purple dotted line in the graph).

Table 2: The experimental results of our method.

Table 3: The experimental results of the lasso penalized logistic regression model.

Table 4: The experimental results of the all-pair lasso penalized logistic regression model.

For the Madelon dataset, the interactive dimension is 124750. More information about the datasets is available at http://www.nipsfsc.ecs.soton.ac.uk/, where the datasets can be downloaded and the challenge results, balanced error rates, and areas under the curve can be viewed. The model is trained on the training set, and the model parameters are selected using the validation set. The predictions of the final model on the test set are uploaded online to obtain the classification score of the final model. Our results are shown in Table 6.

Table 5: The experimental results of the traditional pattern recognition methods.

Table 6: The experimental results for the Madelon dataset.

Table 7: The experimental results of the three lasso methods and five traditional pattern recognition methods.