In data mining, the analysis of high-dimensional data is a critical but thorny research topic. The LASSO (least absolute shrinkage and selection operator) algorithm avoids the limitations of traditional methods, which generally employ stepwise regression with information criteria to choose the optimal model. The improved-LARS (Least Angle Regression) algorithm solves the LASSO effectively. This paper presents an improved-LARS algorithm that is constructed on the basis of multidimensional weight and is intended to solve the LASSO problem. Specifically, in order to distinguish the impact of each variable in the regression, we separately introduce part of principal component analysis (Part_PCA), Independent Weight evaluation, and CRITIC into our proposal. We show that these methods change the regression track by weighting every individual, which optimizes both the approach direction and the selection of the approach variable. As a consequence, our proposed algorithm can yield better results in the promising direction. Furthermore, we illustrate the excellent properties of the LARS algorithm based on multidimensional weight on the Pima Indians Diabetes data set. The experimental results show an attractive performance improvement from the proposed method, compared with the improved-LARS, when both are subjected to the same threshold value.
Data mining has shown its charm in the era of big data; how to mine useful information from mass data with mathematical statistical models has gained much attention in academia [
Zou (2006) introduced the adaptive-LASSO by using the different tuning parameters for different regression coefficients. He suggests minimizing the following objective function [
Keerthi and Shevade (2007) proposed a fast tracking algorithm for LASSO/LARS [
Charbonnier et al. (2010) suggest that
Since the LASSO method minimizes the sum of squared residual errors, even though the least absolute deviation (LAD) estimator is an alternative to the OLS estimate, Jung (2011) proposed a robust-LASSO-estimator that is not sensitive to outliers, heavy-tailed errors, or leverage points [
Bergersen et al. (2011) found that a large value of
Arslan (2012) found that, compared with the LAD-LASSO method, the weighted LAD-LASSO (WLAD-LASSO) method will resist the heavy-tailed errors and outliers in explanatory variables [
The LASSO problem is a convex minimization problem, and the forward-backward splitting operator method is important for solving it. Salzo and Villa (2012) proposed an accelerated version to improve the method’s convergence ability [
Zhou et al. (2013) proposed an alternative selection procedure based on the kernelized LARS-LASSO method [
Zhao et al. (2015) added two tuning parameters
Salama et al. (2016) proposed a new LASSO algorithm, the minimum variance distortionless response (MVDR) LARS-LASSO [
In light of superior performance achieved in [ In the solving process of LASSO, each attribute in the evaluation population has a different relative importance to the overall evaluation. This relative importance has two aspects: not all attributes influence the regression results, and each individual in the regression model has a different weight. When the improved-LARS algorithm calculates the equiangular vector, we distinguish the effects of the different attribute variables by considering the joint correlation between the regression variables and the surplus variable. We evaluate the method proposed in this paper with experimental evidence from the Pima Indians Diabetes Data and two sets of evaluation indices.
In Section
Suppose that there are the multidimensional variables
The improved-LARS algorithm, which is based on the Forward Selection algorithm and the Forward Gradient algorithm, solves the LASSO problem well. The improved-LARS has an appropriate forward distance, lower complexity, and more relevance of information. Figure The improved-LARS calculates the correlation between Until some other individual, say When a third individual The LARS procedure runs until the residual error is less than a threshold or all the variables are involved in the approach, at which point the algorithm stops.
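The idea of repeatedly approaching the predictor most correlated with the current residual can be illustrated with a much simpler relative of LARS. The following is a minimal sketch of a forward stagewise procedure, not the improved-LARS algorithm itself: it moves one coefficient by a fixed small step per iteration instead of following the equiangular direction. The function name `forward_stagewise`, its parameters, and the toy data are assumptions made for illustration only.

```python
def forward_stagewise(columns, y, step=0.01, threshold=1e-6, max_iter=100000):
    """Illustrative forward stagewise regression (not the paper's improved-LARS).

    columns: list of predictor columns, assumed centered and comparably scaled.
    y: centered response values.
    """
    beta = [0.0] * len(columns)
    residual = list(y)
    for _ in range(max_iter):
        # Stop once the sum of squared residuals falls below the threshold.
        if sum(r * r for r in residual) < threshold:
            break
        # Inner product of every predictor with the current residual.
        corr = [sum(x * r for x, r in zip(col, residual)) for col in columns]
        # Pick the predictor with the largest absolute correlation.
        j = max(range(len(columns)), key=lambda k: abs(corr[k]))
        if corr[j] == 0.0:
            break  # no predictor is correlated with the residual any more
        # Take a small step in the direction of the sign of the correlation.
        delta = step if corr[j] > 0 else -step
        beta[j] += delta
        residual = [r - delta * x for r, x in zip(residual, columns[j])]
    return beta, residual


# Toy data (hypothetical): y equals 2 * x1, and x2 is orthogonal to x1.
x1 = [-1.0, 0.0, 1.0]
x2 = [1.0, -2.0, 1.0]
y = [-2.0, 0.0, 2.0]
beta, residual = forward_stagewise([x1, x2], y)
```

On this toy data the coefficient of `x1` moves toward 2 while the coefficient of `x2` stays at 0, because only `x1` is correlated with the residual. LARS reaches the same kind of solution in far fewer steps by computing the step length analytically.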
Basic steps of the LARS.
In Figure
In the process of improved-LARS stepwise regression, the angle regression treats all selected variables as equally important. However, each individual of
In Figure
The additional condition inevitably widens the range of values of the judgment condition. In order to keep the system stable, we limit the product in
After transformation, it may indicate the possibility that
The correlation of
The predictor direction when adding the multidimensional weight.
On the basis of Figure
Applying the aforementioned process to multidimensional high-order system, the collected
or
The collected result response is
There are many calculation methods of
PCA uses orthogonal transformation for dimension reduction in statistics [
The correlation coefficient matrix
where
We sort the multiple correlation coefficient of individual by multiple regression in statistical methods [
For
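As a rough illustration of the Independence Weight idea, the sketch below assumes the usual formulation: the multiple correlation coefficient R_j of each variable with all the others is computed, and the weight is the normalized reciprocal of R_j, so that a variable that is largely explained by the others receives a smaller weight. It uses the identity R_j^2 = 1 - 1/(C^{-1})_jj, where C is the correlation matrix, instead of running a separate multiple regression for each variable. All function names and the toy data are hypothetical, and the exact formula used in this paper may differ.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small dense matrices)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def independence_weights(columns):
    """Weights proportional to 1 / R_j (assumed formulation).

    Assumes no variable is completely uncorrelated with the others (R_j > 0)
    and that the correlation matrix is invertible.
    """
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    # Multiple correlation coefficient of variable j with all the others.
    r = [sqrt(max(0.0, 1.0 - 1.0 / inv[j][j])) for j in range(len(columns))]
    reciprocal = [1.0 / rj for rj in r]
    total = sum(reciprocal)
    return [v / total for v in reciprocal]


# Toy attribute columns (hypothetical data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
x3 = [5.0, 3.0, 4.0, 1.0, 2.0]
weights = independence_weights([x1, x2, x3])
```

In this toy example `x1` is the variable best explained by the other two, so it receives the smallest weight.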
Based on Independence Weight, CRITIC is a kind of objective weighting method proposed by Diakoulaki et al. [
The quantitative conflict between the
The greater
In order to obtain the numerical solution of stability,
For
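The CRITIC weighting can be sketched as follows, assuming the commonly cited formulation attributed to Diakoulaki et al.: each attribute is min-max normalized, its contrast intensity is measured by the standard deviation, its conflict with the other attributes by the sum of (1 - r_jk) over the correlation coefficients r_jk, and the weight is the normalized product of the two. The function names and the toy data are hypothetical, and the stabilizing transformation mentioned above is not reproduced here.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)


def critic_weights(columns):
    """CRITIC weights (assumed standard formulation) for attribute columns."""
    # Min-max normalize every attribute to the range [0, 1].
    norm = []
    for col in columns:
        lo, hi = min(col), max(col)
        norm.append([(v - lo) / (hi - lo) for v in col])
    n = len(norm[0])
    information = []
    for j, cj in enumerate(norm):
        mean = sum(cj) / n
        # Contrast intensity: sample standard deviation of the attribute.
        sigma = sqrt(sum((v - mean) ** 2 for v in cj) / (n - 1))
        # Conflict: how little the attribute agrees with the other attributes.
        conflict = sum(1.0 - pearson(cj, ck)
                       for k, ck in enumerate(norm) if k != j)
        information.append(sigma * conflict)
    total = sum(information)
    return [c / total for c in information]


# Toy attribute columns (hypothetical data).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
x3 = [5.0, 3.0, 4.0, 1.0, 2.0]
weights = critic_weights([x1, x2, x3])
```

Here all three columns have the same spread, so the weights differ only through the conflict term: `x3`, which is negatively correlated with both other attributes, receives the largest weight.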
The equiangular
We now can further describe the improved-LARS based on multidimensional algorithm; we begin at
The
The length of approach along the
Part_PCA, Independent Weight evaluation, and CRITIC are each added to the improved-LARS procedure in turn, so that three parallel algorithms are established to calculate the weight of each attribute.
The centralized weight
Then the new
When
The improved algorithm adds calculation steps for the weighting analysis, so the calculation time increases. However, the approach mechanism of each variable in the statistical model stays the same, so the space complexity is consistent with that of the original algorithm.
Because LASSO compresses the regression coefficients, the experimental data set should be sparse and should have one dependent variable that is easy to distinguish. We take the Pima Indians Diabetes Data Set provided by the Applied Physics Laboratory of the Johns Hopkins University as an example [
We use the ROC curve to present the results so that the performance of the proposed method can be evaluated more intuitively. Whether a participant's diabetes test is positive or negative is a binary classification problem, and the testing results fall into the following four types: TP (true positive), the testing result is positive and the actual condition is positive; FP (false positive), the testing result is positive but the actual condition is negative; TN (true negative), the testing result is negative and the actual condition is negative; FN (false negative), the testing result is negative but the actual condition is positive.
We take the following three measures, derived from the four basic statistics of the ROC space, as the inspection standards: ACC (accuracy), TPR (true positive rate), and NPV (negative predictive value).
ACC is the proportion of correct results, positive and negative together, among all tested people, (TP + TN)/(TP + FP + TN + FN). TPR, also called the Hit Rate, is the proportion of actually positive people who are correctly tested positive, TP/(TP + FN). NPV is the proportion of people who tested negative that actually are negative, TN/(TN + FN). ACC, TPR, and NPV tell us whether the result is better or worse than that of LARS.
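The three inspection standards follow directly from the four counts TP, FP, TN, and FN. The short sketch below uses the standard definitions of ACC, TPR, and NPV; the function name `roc_metrics` and the example counts are made up for illustration and are not results from the experiment.

```python
def roc_metrics(tp, fp, tn, fn):
    """Return ACC, TPR, and NPV computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "ACC": (tp + tn) / total,  # correct results among all tested people
        "TPR": tp / (tp + fn),     # actual positives that were tested positive
        "NPV": tn / (tn + fn),     # negative test results that are truly negative
    }


# Hypothetical counts, chosen only to show the arithmetic.
metrics = roc_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, ACC is 85/100, TPR is 40/45, and NPV is 45/50.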
Another measure by which we judge the result is SSR. The smooth turning point of the SSR curve corresponds to the optimal regression coefficients of the predictor variables.
Threshold
Figure
Curves about three kinds of inspection standard.
Figure
Therefore, the solutions of LASSO obtained when adding the Independence Weight are optimal, followed by those obtained with CRITIC and part_PCA. The improved algorithm significantly increases the NPV and TPR while ensuring that the ACC is not reduced. The veracity of LASSO’s solutions is improved by changing the approach direction through the added weighting.
Figure
The curve about the SSR of response and equiangular direction.
Synthesizing these three inspection standards, we find that adding multidimensional weight to the approach generally reduces the threshold value of the optimal solution (except for CRITIC). That is, the sum of the absolute values of the system’s regression coefficients is bounded by a smaller threshold, so the algorithm meets the requirements in a more extreme threshold range.
Table
Regression coefficient.
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Β | 1.7211 | 3.5409 | −0.4862 | −0.3665 | −0.6504 | 2.2141 | 0.7071 | −0.1810 |
| Βpart_PCA | 1.5300 | 3.5311 | −0.4269 | −0.3018 | −0.6181 | 2.0409 | 0.6727 | −0.0567 |
| | 1.5050 | 3.5294 | −0.4253 | −0.2863 | −0.6072 | 2.0066 | 0.6651 | −0.0356 |
| | 1.6449 | 3.4482 | −0.5019 | −0.3705 | −0.6302 | 2.1668 | 0.5471 | −0.1597 |
Table
Principles of
Comparison of | |
---|---|
Improved-LARS | |
Multidimensional weight LARS | |
part_PCA | |
Independence Weight | |
CRITIC | |
In this paper, we solve the LASSO with a method that considers both the variable selection and the approach direction of the LARS algorithm. We propose the LARS algorithm based on multidimensional weight to improve the veracity of LASSO’s solutions while keeping the advantages of LASSO’s parameter estimation: stable regression coefficients, a reduced number of parameters, and good consistency of parameter convergence. We verify the efficiency of the algorithm with the Pima Indians Diabetes Data Set. The precision of the calculated weights is limited when the dimension of the individuals is large, so in later studies we need to further optimize the embedded weighting algorithm in order to improve the accuracy and precision with which the regression algorithm chooses the approach variable and direction under weighting.
The authors declare that they have no conflicts of interest.
This study was supported by the National Natural Science Foundation of China (61170192, 41271292), Chinese Postdoctoral Science Foundation (2015M580765), the Fundamental Research Funds for the Central Universities (XDJK2014C039, XDJK2016C045), Doctoral Fund of Southwestern University (swu1114033), and the Project of Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ1403106).