A Two-Stage Regularization Method for Variable Selection and Forecasting in High-Order Interaction Model



Introduction
Sparse representation has attracted a great deal of attention from researchers in different scientific fields, including financial engineering and risk management, computational biology, and the machine learning community [1]. For instance, in spectrum data analysis, scientists explore how to select the few exact frequencies on which the beam energy is concentrated so as to avoid power leakage problems [2]. The main goal of demanding sparsity is to discover a small number of relevant variables from a large pool of candidates, which is essential for establishing an interpretable forecasting model. A model with high precision, low complexity, and low sensitivity to its parameters can not only avoid the expensive cost of excessive supply but also reduce time cost, service interruption, and resource waste [3]. Furthermore, researchers have noticed that a linear additive model using main effects only cannot provide accurate forecasting results. This motivates them to turn their attention to high-order interaction models, which include interaction terms between variables. An interaction model has an advantage over a linear model, which ignores possible relationships between variables. In microarray data analysis, for example, biologists are more interested in gene-gene and gene-environment interactions than in any single gene, because interactions of single nucleotide polymorphisms (SNPs) are more important in cancer diagnosis [4]. New studies have also revealed the interactive effect of multiple environmental factors. For example, a child raised in a poor-quality environment may be more sensitive to a poor environment as an adult, ultimately leading to higher psychological distress scores; this depicts a three-way interaction between genes and environments. Furthermore, interactions among multiple genes are also considered essentially important in molecular analysis [5]. This motivates us to consider a high-order interaction model which
includes both two-way and three-way interaction terms in this paper:

y = β₀ + Σⱼ βⱼ xⱼ + Σ_{j<k} φⱼₖ (xⱼ ⊙ xₖ) + Σ_{j<k<l} ψⱼₖₗ (xⱼ ⊙ xₖ ⊙ xₗ) + ε,

where β₀ is the intercept term, which vanishes after centering, and xⱼ ⊙ xₖ ⊙ xₗ denotes the Hadamard (elementwise) product of xⱼ, xₖ, and xₗ, that is, the three-way interaction term. The noise term ε follows a Gaussian distribution. βⱼ, φⱼₖ, and ψⱼₖₗ are the corresponding coefficients for the main effects, two-way, and three-way interaction terms, respectively. Obviously, this high-order interaction model is more complicated than a two-way interaction model.
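To make the size of this design concrete, here is a minimal sketch (not code from the paper) of how the full high-order design matrix can be assembled from the main effects by elementwise products; the function name and the restriction to distinct indices j < k < l are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def build_interaction_design(X):
    """Augment the main-effect matrix X (n x p) with all two-way and
    three-way interaction columns formed as elementwise (Hadamard)
    products of distinct columns; column counts grow as O(p^2), O(p^3)."""
    n, p = X.shape
    cols = [X[:, j] for j in range(p)]
    for j, k in combinations(range(p), 2):        # two-way terms
        cols.append(X[:, j] * X[:, k])
    for j, k, l in combinations(range(p), 3):     # three-way terms
        cols.append(X[:, j] * X[:, k] * X[:, l])
    return np.column_stack(cols)

X = np.random.randn(6, 4)
Z = build_interaction_design(X)
# p = 4 -> 4 main + 6 two-way + 4 three-way = 14 columns
```

Even at p = 4 the column count more than triples, which previews the combinatorial blow-up discussed next.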
Although high-order interactions play an important role in depicting the nonlinear relationship among variables, sparse representation of the high-order interaction model is quite challenging since the number of interaction terms is huge (of order O(p²) or even O(p³) when there are p main effects in the model). For instance, even with 10² variables, a high-order interaction model has to consider on the order of 10⁴ or 10⁶ terms, which makes the variable selection procedure far more difficult. Furthermore, computational cost is another issue because of the huge number of variables in the high-order interaction model. Even worse, severe multicollinearity exists among the variables owing to the high dimensionality. There is a large body of literature on sparse representation methods. Subset selection methods based on L0-type penalties have been studied for decades. These methods, including the Akaike information criterion (AIC) [6], the Bayesian information criterion (BIC) [7], and Mallows' Cₚ [8], are NP-hard since all possible combinations of variables are taken into account: if there are m variables, 2^m subset models have to be checked. To improve computational efficiency, sparse representation approaches based on convex optimization have been proposed. For instance, Tibshirani (1996) proposed the least absolute shrinkage and selection operator (LASSO), which applies an L1-type penalty to pursue sparsity; in the wavelet literature it is known as the basis pursuit method [9]. It is more computationally efficient than subset selection methods. Efron et al. (2004) studied least-angle regression (LAR), which solves the same problem as LASSO but in a forward selection manner, and showed that the solution path of LAR is very similar to that of LASSO. Yuan and Lin (2006) investigated the group LASSO method, which selects grouped variables based on an "all in and all out" principle. To overcome the shortcomings of LASSO and further improve the selection accuracy, Zou (2006) studied the adaptive LASSO, using weights for each
variable to adjust the degree of sparsity. Friedman (2012) summarized and provided a review of sparse regression and classification methods [10].
However, convex optimization has drawbacks: it is difficult to discover all the relevant variables when the model coherence is sufficiently high. To overcome these difficulties, nonconvex penalties, including the smoothly clipped absolute deviation (SCAD) [11], the minimax concave penalty (MCP) [12], and the hard ridge (HR) penalty [13, 14], have been proposed to boost the accuracy. Ye et al. (2017) studied a sparse least squares support vector machine with an Lp norm function applied to select the relevant support vectors [15]. To facilitate the parameter tuning work and increase the forecasting accuracy, Jiang (2018) studied square root nonconvex optimization (SRNO) and showed its superior performance via simulation and real data studies [2].
Sparse representation in the two-way interaction model has been studied for years. Bach (2008) explored hierarchical multiple kernel learning and studied its selection property [16]. Zhao et al. (2009) proposed the composite absolute penalty (CAP), which considers a mixture of Lp and Lq norm functions of coefficient groups, of the form ‖φⱼₖ‖_p + ‖(βⱼ, βₖ, φⱼₖ)‖_q [17]. Choi et al. (2010) studied the SHIM method, which reparameterizes the interaction coefficients as φⱼₖ = γⱼₖ βⱼ βₖ, and designed a nonconvex optimization to perform sparse representation; however, the number of features considered there is only 10, which is quite small, so there is no guarantee that their method works in the high dimensional setup [18]. Bickel et al. (2010) used a two-step algorithm to pursue sparsity: LASSO was applied in the first step and a backward procedure in the second [19]. Wu et al. (2010) proposed a stagewise procedure that screens the main effects in the first step and then cleans the selected variables based on a t-test statistic [20]. Radchenko and James (2010) investigated VANISH, which constructs the main effects and interactions from two small sets of orthogonal functions [21]. Bien et al. (2013) studied the hierarchical LASSO (HL), which considers an L1 optimization problem under the nonconvex constraint ‖φⱼ‖₁ ≤ |βⱼ| [22]. She et al. (2014) studied the GRESH method, which enforces sparsity on main effects and two-way interactions simultaneously [23]. Hao et al.
(2014) screened the main effects using forward regression in the first step and fitted the quadratic model with the selected main effects and their associated interactions [24]. Simon and Hastie (2015) proposed "Glinternet" to discover two-way interaction terms between categorical and continuous variables [25]. Yan and Bien (2017) advocated the VARXL method to implement sparse representation in high dimensional time series data [26]. To the best of our knowledge, most of the variable selection methods mentioned above consider two-way interaction terms only; few studies address the high-order interaction model with both two-way and three-way interaction terms. To tackle this issue, this paper explores sparse representation in a high-order interaction model. The main contributions of this paper are as follows: (i) A two-stage square root hard ridge method (TSSRHR) for sparse representation in the high-order interaction model is proposed. (ii) In computation, a fast and simple-to-implement algorithm is designed. (iii) In theory, prediction and estimation error bounds for the TSSRHR estimate are established. (iv) Real data examples demonstrate the efficiency and superior performance of the proposed method over existing competitors. The rest of this paper is organized as follows. Section 2 presents the general framework of TSSRHR. Section 3 designs a fast and simple-to-implement algorithm with theoretical guarantees of its convergence and optimality. The theory of TSSRHR is given in Section 4. In Section 5, real data analyses exhibit the superior performance of the proposed method over other approaches. The conclusion is given in Section 6.

Two-Stage Square Root Hard Ridge Method
We investigate the square root hard ridge (SRHR) method, which combines the benefits of the square root loss function and the nonconvex hard ridge penalty, for high-order interaction selection. We first describe the two-stage SRHR algorithm. At stage 1, only main effects are considered: SRHR selects among them, with all high-order interactions (both two-way and three-way) kept out. The selected variables are collected in the set A. At stage 2, the set A selected in stage 1 is expanded by adding all two-way and three-way interactions whose components lie within A, and SRHR is applied again to carry out variable selection.
(i) Stage 1. Define G = [p] = {1, . . ., p}, the index set of all main effects. Select the important main effects by the SRHR method, that is, by considering the optimization problem

min_β ‖y − Xβ‖₂ + λ₁²‖β‖₀ / (2(1 + η₁)) + (η₁/2)‖β‖₂²,

where ‖β‖₀ = Σⱼ 1(βⱼ ≠ 0), 1(·) denotes the indicator function, and λ₁ and η₁ are two regularization parameters to be determined. The indices of the selected variables are collected in A, and the corresponding solution path is denoted by β̂⁽¹⁾(λ₁, η₁).

(ii) Stage 2. Expand A to the candidate set A ∪ (A ⊗ A) ∪ (A ⊗ A ⊗ A), where ⊗ represents the Cartesian product, so that all two-way and three-way interactions within A are included. Apply SRHR again with regularization parameters λ₂ and η₂, which need to be selected carefully. The solution path is denoted by β̂⁽²⁾(λ₂, η₂), and the selected variables are collected in B.
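The stage-2 candidate expansion can be sketched in a few lines; this is an illustrative enumeration (the function name is hypothetical), showing why restricting interactions to components of A shrinks the problem from O(p³) to |A| + C(|A|,2) + C(|A|,3) terms.

```python
from itertools import combinations

def stage2_candidates(A):
    """Given the index set A of main effects kept by stage 1, enumerate
    the stage-2 candidates: A itself plus all two-way and three-way
    interactions whose components all lie in A."""
    A = sorted(A)
    two_way = list(combinations(A, 2))
    three_way = list(combinations(A, 3))
    return A, two_way, three_way

mains, pairs, triples = stage2_candidates({2, 5, 7, 9})
# |A| = 4 -> 4 mains, 6 pairs, 4 triples: only 14 candidate terms
# in stage 2, however large the original p was.
```

This is the sense in which the two-stage structure "solves" the computational issue raised in the Introduction.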
A penalized sparse representation method considers the sum of two components: the loss function and the penalty function. The loss function is often the negative log-likelihood of the error; in the Gaussian setup it is the squared error loss, which involves the noise level σ, a quantity that is difficult to estimate. Specifically, in the Gaussian linear regression setting y = Xβ + ε, the negative log-likelihood of y is −log p(y | X, β) = ‖y − Xβ‖₂² / (2σ²) + h(σ) + c, where h(σ) is a function of σ and c is a constant. SRHR uses the square root error loss ‖y − Xβ‖₂, which facilitates the parameter tuning work by avoiding estimation of the noise level σ or the use of other forms of estimates [27]. The hard ridge penalty has two parts. The first part, an L0 norm term, penalizes the number of nonzero elements through the regularization parameter λ. The second part is an L2-type ridge penalty; it compensates the shrinkage induced by the L0 term and is able to handle the high collinearity of the high-order interaction model. Therefore, SRHR combines the advantages of the square root loss function and the hard ridge penalty.
The choices of the regularization parameters (λ₁, λ₂, η₁, and η₂) play an important role in balancing a model's in-sample fit and out-of-sample forecasting ability. We grid λ₁ starting from the largest value λ₁,max = max|Xᵀy|, at which all coefficients are shrunk to zero. The smallest value λ₁,min is then set to a small proportion of λ₁,max, namely λ₁,min = 0.05 λ₁,max, and the grid values between λ₁,max and λ₁,min are generated exponentially. The same strategy is applied to λ₂. The optimal values of η₁ and η₂ are chosen from the grid {1e−5, 1e−4, 1e−3, 1e−2, 1e−1, 1e+1}.
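The grid construction above can be written out directly; this is a sketch (helper names are illustrative, not from the paper), with the exponential spacing obtained by interpolating on the log scale.

```python
import numpy as np

def lambda_grid(X, y, n_grid=20, ratio=0.05):
    """Exponentially spaced grid from lambda_max (the value at which
    all coefficients are shrunk to zero) down to ratio * lambda_max."""
    lam_max = np.max(np.abs(X.T @ y))
    lam_min = ratio * lam_max
    return np.exp(np.linspace(np.log(lam_max), np.log(lam_min), n_grid))

# coarse log-scale grid for the ridge parameters eta_1, eta_2
eta_grid = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e+1]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)
grid = lambda_grid(X, y)
```

Each (λ, η) pair on this grid defines one SRHR fit; the criteria below pick the winner.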
To choose the optimal regularization parameters, the extended Bayesian information criterion (EBIC) [28] and the high dimensional Bayesian information criterion (HDBIC) [29] are applied in this paper. In particular, EBIC takes the form n log(‖y − X_A β̂_A‖₂² / n) + |A| log n + 2γ|A| log p, and HDBIC takes the form n log(‖y − X_B β̂_B‖₂² / n) + |B| (log n)(log p), where |A| and |B| represent the cardinalities of A and B. In the first stage, λ₁ and η₁ are selected by EBIC since a large number of main effects is considered. It is worth mentioning that, unlike [24], we do not apply plain BIC in the second stage. Although the number of variables is reduced dramatically, we still use HDBIC, a high dimensional version of BIC, because both two-way and three-way interaction terms are considered, which keeps the dimension of the model high.
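A sketch of the two criteria follows; the exact forms are assumptions based on the standard definitions in the cited references [28, 29] (the paper's displayed equations were not recoverable), so treat the formulas as representative rather than definitive.

```python
import numpy as np

def ebic(rss, n, k, p, gamma=0.5):
    """Extended BIC (Chen-Chen-style form assumed): classic BIC plus an
    extra 2*gamma*k*log(p) term to account for a large candidate pool."""
    return n * np.log(rss / n) + k * np.log(n) + 2.0 * gamma * k * np.log(p)

def hdbic(rss, n, k, p):
    """High dimensional BIC (Ing-Lai-style form assumed): the model-size
    penalty log(n) is inflated by a log(p) factor."""
    return n * np.log(rss / n) + k * np.log(n) * np.log(p)
```

For a fixed residual sum of squares, both criteria penalize extra terms more heavily as p grows, which is exactly what the high-dimensional second stage requires.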

Algorithm
In this section, for notational simplicity, two design matrices are defined: X̃ = [xⱼ ⊙ xₖ]_{1≤j<k≤p} and X̄ = [xⱼ ⊙ xₖ ⊙ xₗ]_{1≤j<k<l≤p}. Furthermore, let X_all = [X, X̃, X̄]. Clearly, X̃ represents the design matrix for the two-way interaction terms, X̄ denotes the design matrix for the three-way interaction terms, and X_all is the overall design matrix for the main effects, two-way, and three-way interaction terms. For the coefficients to be estimated, let Φ and Ψ be the coefficient vectors for the two-way and three-way interaction terms, respectively, and let Ω = (β, Φ, Ψ) collect all coefficients, so that (1) can be written as y = X_all Ω + ε. To design an algorithm for TSSRHR, we consider the following optimization problem in the first stage:

min_β ‖y − Xβ‖₂ + λ₁²‖β‖₀ / (2(1 + η₁)) + (η₁/2)‖β‖₂², (9)

and in the second stage

min_Ω ‖y − X_B Ω‖₂ + λ₂²‖Ω‖₀ / (2(1 + η₂)) + (η₂/2)‖Ω‖₂², (10)

where X_B is the design matrix restricted to the stage-2 candidate set. To solve (9) and (10) accurately and efficiently, a thresholding rule is defined first.
From the definition, it is easy to see that Θ(t; λ) is an odd, monotone, unbounded shrinkage rule in t for any λ. A vector version of Θ is defined componentwise when either t or λ is replaced by a vector. Now define the HR thresholding rule, which is closely associated with the HR penalty, as

Θ^HR(t; λ, η) = Θ(t; λ) / (1 + η),

where λ > 0 and η > 0 are two regularization parameters. Using the HR thresholding function, (9) and (10) can be solved by the following iterations. The first stage:

β^(t+1) = Θ^HR(β^(t) + Xᵀ(y − Xβ^(t)) / k₁²; λ₁, η₁),

and the second stage:

Ω^(t+1) = Θ^HR(Ω^(t) + X_Bᵀ(y − X_B Ω^(t)) / k₂²; λ₂, η₂),

where Θ^HR is applied componentwise when its argument is a vector.
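A one-stage sketch of these iterations is given below. It is an illustration, not the paper's implementation: for clarity the gradient step uses the squared loss (the paper works with the square root loss), and the scaling of λ and η inside the threshold is an assumption.

```python
import numpy as np

def theta_hr(t, lam, eta):
    """Hard-ridge thresholding rule (form assumed from the HR penalty):
    zero out entries with |t| <= lam, shrink survivors by 1/(1+eta)."""
    return np.where(np.abs(t) > lam, t / (1.0 + eta), 0.0)

def hr_iterate(X, y, lam, eta, iters=500, tol=1e-4):
    """Iterative HR thresholding for one stage (squared-loss sketch).
    Dividing the gradient step by k^2 with k >= ||X||_2 (spectral norm)
    matches the convergent step size stated in the text."""
    k2 = np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        new = theta_hr(beta + X.T @ (y - X @ beta) / k2,
                       lam / k2, eta / k2)
        if np.linalg.norm(new - beta) < tol:
            return new
        beta = new
    return beta

# tiny orthogonal example: only coordinates above the threshold survive
Xd = np.eye(4)
yd = np.array([3.0, 0.1, -2.0, 0.05])
bd = hr_iterate(Xd, yd, lam=1.0, eta=0.0)
```

Note that no matrix inversion appears anywhere, which is the first advantage claimed for the algorithm.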
Algorithm 1 exhibits the details of the TSSRHR algorithm, with the initializations ω^(−1) = 0, ω^(0) = 1, β^(0) = β^(−1), and Ω^(0) = Ω^(−1). Since the nonconvex optimization is not easy to solve with the ADMM algorithm [30], the HR thresholding function is applied instead. The basic structure of the algorithm follows the coordinate descent algorithm. The error tolerances tol₁, tol₂ and the maximum iteration numbers m₁ and m₂ are determined by trial and error; their default values are 1e−4, 1e−4, 500, and 500, respectively. The TSSRHR algorithm has several advantages. First, it involves no matrix inversion. Second, Θ^HR(t; λ, η) is applied to enforce elementwise sparsity. The step sizes k₁ and k₂ guarantee the convergence of the algorithm: the theoretical results show that k₁ ≥ ‖X‖₂ and k₂ ≥ ‖X_B‖₂ provide satisfactory results, and the step sizes do not rely on a line search, which would take considerable computation time. The TSSRHR algorithm is simple to implement and computationally efficient. To further speed up convergence, a fast iterative shrinkage thresholding algorithm (FISTA) [31] is used to obtain the TSSRHR estimate. The following theorem provides the theoretical guarantee of convergence of Algorithm 1.
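The FISTA acceleration mentioned here amounts to wrapping Nesterov momentum (the ω-sequence in the initializations) around the same thresholding update. The sketch below is an assumption about how that wrapping looks, again in the squared-loss setting for clarity; function names are illustrative.

```python
import numpy as np

def hr_fista(X, y, lam, eta, iters=500, tol=1e-4):
    """FISTA-style acceleration of HR thresholding (sketch): the
    momentum weight t follows the classic FISTA recursion."""
    def theta_hr(t, l, e):
        # hard-ridge threshold: kill small entries, shrink the rest
        return np.where(np.abs(t) > l, t / (1.0 + e), 0.0)

    k2 = np.linalg.norm(X, 2) ** 2
    beta = beta_old = np.zeros(X.shape[1])
    t = 1.0
    for _ in range(iters):
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = beta + ((t - 1.0) / t_new) * (beta - beta_old)  # momentum step
        new = theta_hr(z + X.T @ (y - X @ z) / k2, lam / k2, eta / k2)
        if np.linalg.norm(new - beta) < tol:
            return new
        beta_old, beta, t = beta, new, t_new
    return beta

Xd = np.eye(4)
yd = np.array([3.0, 0.1, -2.0, 0.05])
bd = hr_fista(Xd, yd, lam=1.0, eta=0.0)
```

The momentum step costs one extra vector operation per iteration, so acceleration comes essentially for free.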

Prediction and Estimation
Notice that P(Ω) = λ²‖Ω‖₀ / (2(1 + η)) + (η/2)‖Ω‖₂².

Algorithm 1: The TSSRHR algorithm.
Inputs: the dataset Z = (X, y) = {(x₁, y₁), · · · , (xₙ, yₙ)}, x ∈ ℝᵖ, X ∈ ℝⁿˣᵖ, y ∈ ℝⁿ; m₁, m₂: maximum numbers of iterations for the SRHR algorithm in the first and second stages; tol₁, tol₂: error tolerances for the first and second stages; k₁, k₂: scale parameters for the first and second stages.
Output: the forecasting test error and the selected sparsity pattern.
(1) Randomly divide the original data Z into a training set (X_tr, y_tr) and a test set (X_te, y_te).
(2) First stage (SRHR on main effects): for each (λ₁, η₁) on the grid, initialize β ← 0, scale by k₁, and iterate the HR thresholding update until the change falls below tol₁ or m₁ iterations are reached; obtain the solution path and select A by the EBIC criterion.
(3) Second stage (SRHR on A and its interactions): for each (λ₂, η₂), iterate analogously until tol₂ or m₂ iterations; obtain the solution path B and update the sparsity pattern G using the HDBIC criterion.
(4) Calculate the test error using the test dataset (X_te, y_te).
Assumption 2 (weak decomposability assumption). A norm Ω on ℝᵖ is called weakly decomposable for an index set J ⊂ [p] if there exists a norm Ω^{Jᶜ} on ℝ^{p−|J|} such that Ω(β) ≥ Ω(β_J) + Ω^{Jᶜ}(β_{Jᶜ}) for all β. Furthermore, a set J is called an allowed set if Ω is a weakly decomposable norm for this set.
We can show that the hard ridge penalty is weakly decomposable as follows: define P(Ω) := λ²‖Ω‖₀ / (2(1 + η)) + (η/2)‖Ω‖₂². The claim then follows from the weak decomposability of the L0 penalty and of the L2 penalty.
This remark provides the prediction error bound and the estimation error bound for the TSSRHR estimate. The proof follows directly from Theorem 1 and Corollaries 1 and 2 in [37]; the details are therefore omitted here.

Real World Data
In this section, twinsine signal data covering multiple scenarios and microarray gene expression data on inbred mice are used as real data examples to test the performance of sparse representation methods. It is worth mentioning that both examples become ultra-high dimensional once three-way interaction terms are considered. Comparisons are made between the proposed two-stage square root hard ridge method (TSSRHR) and other two-stage sparse representation methods, including two-stage LASSO (TSLASSO), two-stage SCAD (TSSCAD), and two-stage square root lasso (TSSRL), in terms of forecasting accuracy, sparsity, and computational efficiency. The unknown parameters in TSLASSO and TSSCAD are determined by 10-fold cross validation. TSSRL selects its parameter by the theoretical choice λ = 1.1 Φ⁻¹(1 − 0.05/(2D)), where D is the number of variables and Φ is the cumulative distribution function of the standard Gaussian distribution. We remind the reader that TSSRHR selects its regularization parameters by EBIC and HDBIC (cf. Section 2). The forecasting model is trained on 75% of the samples, and the remaining 25% are used to calculate test errors. The entire procedure is repeated 150 times, and the medians of the evaluation criteria, including the number of nonzero elements (NZ), mean square error (MSE), root mean square error (RMSE), and computational cost (CC), are reported.
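The evaluation protocol just described can be summarized in a short harness; this is a sketch for concreteness (the function signature and the generic `fit` interface are illustrative, not from the paper).

```python
import time
import numpy as np

def evaluate(fit, X, y, repeats=150, train_frac=0.75, seed=0):
    """Random 75/25 split repeated `repeats` times; reports the medians
    of sparsity (NZ), MSE, RMSE, and computational cost (CC, seconds).
    `fit(Xtr, ytr) -> coef` stands for any of the compared estimators."""
    rng = np.random.default_rng(seed)
    n = len(y)
    nz, mse, rmse, cc = [], [], [], []
    for _ in range(repeats):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        t0 = time.perf_counter()
        coef = fit(X[tr], y[tr])
        cc.append(time.perf_counter() - t0)          # CC for this split
        err = y[te] - X[te] @ coef
        nz.append(int(np.count_nonzero(coef)))
        mse.append(float(np.mean(err ** 2)))
        rmse.append(float(np.sqrt(mse[-1])))
    return {"NZ": float(np.median(nz)), "MSE": float(np.median(mse)),
            "RMSE": float(np.median(rmse)), "CC": float(np.median(cc))}

# example: ordinary least squares as a stand-in estimator
rng = np.random.default_rng(1)
Xd = rng.normal(size=(40, 3))
yd = Xd @ np.array([1.0, 0.0, -2.0])
res = evaluate(lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0],
               Xd, yd, repeats=5)
```

Reporting medians rather than means keeps a few pathological splits from dominating the comparison.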

Twinsine Signal Data.
Spectral estimation, which studies the distribution of signal power over frequencies, is widely applied in speech coding and in radar and solar signal processing. Nevertheless, it becomes critically difficult and challenging when a large number of closely spaced frequencies must be resolved; this is called the super resolution spectral estimation problem. To avoid the power leakage and noisy observations caused by super resolution, a method is needed to choose the few important frequencies that play a dominant role. The twinsine signal is generated from a target function with two sinusoidal components, where a₁ = 2, a₂ = 3, φ₁ = π/3, f₁ = 0.25 Hz, and ε(t) follows white Gaussian noise with variance σ². The frequency resolution is chosen as 0.002 Hz to detect and distinguish the two sinusoidal components. The frequency atoms are given by fᵢ = f_max · i/D for i = 0, 1, · · · , D. In our experiments, we consider 8 scenarios with varying (D, n, σ²), together with two large-sample scenarios: (ix) Scenario 9: (D, n, σ²) = (500, 300, 1); (x) Scenario 10: (D, n, σ²) = (1000, 300, 1). The results on forecasting, variable selection, and computation are summarized in Table 1 and Figures 1, 2, 3, and 4.
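A generator for this experiment can be sketched as follows; the exact functional form of the twinsine target is an assumption reconstructed from the parameters quoted above (two sinusoids separated by the 0.002 Hz resolution), and the helper names are illustrative.

```python
import numpy as np

def twinsine(n, sigma2, f1=0.25, df=0.002, a1=2.0, a2=3.0,
             phi=np.pi / 3, seed=0):
    """Twinsine sketch: two sinusoids only df = 0.002 Hz apart, plus
    white Gaussian noise of variance sigma2 (form assumed)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)                    # unit-spaced sample times
    y = (a1 * np.cos(2 * np.pi * f1 * t)
         + a2 * np.cos(2 * np.pi * (f1 + df) * t + phi)
         + rng.normal(0.0, np.sqrt(sigma2), n))
    return t, y

def frequency_atoms(D, f_max=0.5):
    """Dictionary of candidate frequencies f_i = f_max * i / D."""
    return f_max * np.arange(D + 1) / D

t, y = twinsine(300, 1.0)               # e.g. Scenario 9 sample size
atoms = frequency_atoms(500)
```

Regressing y on cosine/sine columns at these atoms turns frequency recovery into exactly the sparse selection problem the paper studies.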
It is not difficult to observe that TSSRHR obtains the most accurate results in all scenarios, including the most challenging one with D = 5000 and σ² = 5. TSSRL outperforms TSLASSO and TSSCAD by delivering lower MSE, and its forecasting results are comparable with TSSRHR in scenario 6. From the perspective of model sparsity, TSSRL achieves the most interpretable model, followed by TSSRHR. TSLASSO and TSSCAD tend to select a large number of variables in all scenarios except scenarios 5 and 6; for instance, TSLASSO selects 31458 variables and TSSCAD chooses 29113 variables in scenario 8. The large number of nuisance variables included by TSLASSO and TSSCAD affects their forecasting results negatively. In terms of computational efficiency, TSSCAD spends more computation time in scenarios 1-2 and 5-6, where the dimensions are moderate. As the dimensions increase, all forecasting approaches incur higher computation cost; for instance, TSSRL took 1215.93 seconds and 1529.38 seconds in scenarios 4 and 8, where D = 5000. TSLASSO runs fastest thanks to its convex penalty. However, when D is large, TSSRL, although it also uses a convex penalty, is not computationally efficient because it is based on the COORD algorithm, which runs much slower than coordinate descent. In scenarios 9 and 10, where the sample sizes are large, TSSRHR still performs better than the other methods; as in the other scenarios, TSSRL gives the sparsest model with the fewest variables, and the training speed of TSSRHR remains acceptable. Therefore, TSSRHR outperforms the other forecasting methods in terms of forecasting accuracy, model sparsity, and computation speed.

Microarray Gene Expression Data. The inbred mouse microarray gene expression dataset, with 31 female mice and 29 male mice, is considered to demonstrate the effectiveness of the proposed TSSRHR method. Each array measures the expression values of 22,690 genes; the response is a continuous phenotypic variable, stearoyl-coenzyme A desaturase (scd1), measured by probe set ID 1415965_at. The remaining 22,689 probe set IDs are regarded as candidate variables. We seek a model that describes the relationship between the response variable and the candidate genes. It is worth mentioning that a gene filtering preprocessing procedure is applied to the inbred mouse data to reduce the expensive computational cost. The performances of the forecasting models on the inbred mouse microarray gene expression data are shown in Table 2. TSSRHR delivers the best forecasting performance, giving the lowest MSE and RMSE, followed by TSSRL, TSLASSO, and TSSCAD. TSLASSO selects a model with 6311 three-way interaction terms, while TSSCAD selects only 3 variables, fewer than TSSRL and TSSRHR; this can be verified in Figure 5, which shows the selection patterns of the compared methods. The computational cost of TSSRL is almost 4 times that of TSSRHR, while TSLASSO and TSSCAD spend less computational cost than TSSRHR. Overall, TSSRHR provides the lowest forecasting error with sufficient model sparsity and acceptable computational time.

Table 1: The statistical performance of TSSRHR, TSLASSO, TSSCAD, and TSSRL, measured by the numbers of nonzero main effects, two-way interactions, and three-way interactions, MSE (×100), RMSE (×100), and computational cost (CC, in seconds unless otherwise specified).


Conclusion
The two-way interaction model has been widely used in numerous scientific fields. However, researchers have found that three-way interaction terms can play an even more important role than two-way terms when establishing a forecasting model. To this end, this paper studies a high-order interaction model containing both two-way and three-way interaction terms. To establish an interpretable model and reduce the computation time, a sparse representation method based on a two-stage square root hard ridge (TSSRHR) procedure is proposed and studied. By the nature of the two-stage method, the number of candidate interactions decreases dramatically, which resolves the computational issue. To implement SRHR in each stage, a simple and efficient algorithm is designed. Furthermore, prediction and estimation error bounds for TSSRHR are exhibited under two regularity assumptions. For real data analysis, twinsine signal data and microarray gene expression data demonstrate the forecasting and selection performance of TSSRHR in comparison with other stagewise sparse representation methods. A hierarchical variable selection problem on the high-order interaction model will be pursued in future research.

Figure 5: Sparsity pattern of compared forecasting methods in the inbred mouse gene expression data.