Robust Group Identification and Variable Selection in Regression

The elimination of insignificant predictors and the combination of predictors with indistinguishable coefficients are the two issues raised in searching for the true model. Pairwise Absolute Clustering and Sparsity (PACS) achieves both goals. Unfortunately, PACS is sensitive to outliers due to its dependency on the least-squares loss function, which is known to be very sensitive to unusual data. In this article, the sensitivity of PACS to outliers has been studied. Robust versions of PACS (RPACS) have been proposed by replacing the least squares and nonrobust weights in PACS with MM-estimation and robust weights depending on robust correlations instead of Pearson's correlation, respectively. A simulation study and two real data applications have been used to assess the effectiveness of the proposed methods.


Introduction
The latest developments in data aggregation have generated huge numbers of variables. These large amounts of data pose a challenge to most standard statistical methods. In many regression problems, the number of variables is huge, and many of these variables are irrelevant. Variable selection (VS) is the process of selecting significant variables for use in model construction, and it is an important step in statistical analysis. Statistical procedures for VS aim to improve the model's prediction and provide interpretable models while retaining computational efficiency. VS techniques such as stepwise selection and best subset regression may suffer from instability [1]. To tackle the instability problem, regularization methods have been used to carry out VS. They have become increasingly popular, as they supply a tool with which VS is carried out during the process of estimating the coefficients in the model; examples include LASSO [2], SCAD [3], elastic-net [4], fused LASSO [5], adaptive LASSO [6], group LASSO [7], OSCAR [8], adaptive elastic-net [9], and MCP [10].
Searching for the correct model raises two matters: the exclusion of insignificant predictors and the combination of predictors with indistinguishable coefficients (IC) [11]. The approaches above can remove insignificant predictors but fail to merge predictors with IC. Pairwise Absolute Clustering and Sparsity (PACS, [11]) achieves both goals. Moreover, PACS is an oracle method for simultaneous group identification and VS.
Unfortunately, PACS is sensitive to outliers due to its dependency on the least-squares loss function, which is known to be very sensitive to unusual data. In this article, the sensitivity of PACS to outliers has been studied. Robust versions of PACS (RPACS) have been proposed by replacing the least squares and nonrobust weights in PACS with MM-estimation and robust weights depending on robust correlations instead of Pearson's correlation, respectively. RPACS can estimate the regression parameters and select the significant predictors simultaneously, while being robust to the existence of possible outliers.
The rest of this article proceeds as follows. In Section 2, PACS is briefly reviewed. The robust extension of PACS is detailed in Section 3. Simulation studies under different settings are presented in Section 4. In Section 5, the proposed robust PACS is applied to two real datasets. Finally, Section 6 concludes with a discussion.

A Brief Review of PACS
Under the linear regression model setup with standardized predictors x_{ij} and centered response values y_i, i = 1, 2, \ldots, n and j = 1, 2, \ldots, p, Sharma et al. [11] proposed the oracle method PACS for simultaneous group identification and VS. PACS has a lower computational cost than the OSCAR approach. In PACS, the equality of coefficients is attained by adding a penalty to the pairwise differences and pairwise sums of coefficients. The PACS estimates are the minimizers of the following:

\frac{1}{2} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \Big( \sum_{j=1}^{p} w_j |\beta_j| + \sum_{1 \le j < k \le p} w_{jk(-)} |\beta_k - \beta_j| + \sum_{1 \le j < k \le p} w_{jk(+)} |\beta_k + \beta_j| \Big),  (1)

where \lambda \ge 0 is the regularization parameter and the w's are nonnegative weights. The second term of the penalty encourages same-sign coefficients to be set equal, while the third term encourages opposite-sign coefficients to be set equal in magnitude.
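As a concrete illustration, the penalized criterion can be evaluated directly. The following is a minimal sketch (function and argument names are our own, not from [11]): a least-squares loss plus a sparsity term and the two pairwise fusion terms.

```python
import numpy as np

def pacs_objective(beta, X, y, lam, w, w_minus, w_plus):
    """Evaluate the PACS criterion (a sketch of Eq. (1)).

    w       : length-p weights on |beta_j| (sparsity)
    w_minus : p x p weights on |beta_k - beta_j| (same-sign fusion)
    w_plus  : p x p weights on |beta_k + beta_j| (opposite-sign fusion)
    """
    resid = y - X @ beta
    loss = 0.5 * np.sum(resid ** 2)          # least-squares loss
    sparsity = np.sum(w * np.abs(beta))      # first penalty term
    fuse_minus = 0.0
    fuse_plus = 0.0
    p = len(beta)
    for j in range(p):
        for k in range(j + 1, p):
            fuse_minus += w_minus[j, k] * abs(beta[k] - beta[j])
            fuse_plus += w_plus[j, k] * abs(beta[k] + beta[j])
    return loss + lam * (sparsity + fuse_minus + fuse_plus)
```

Minimizing this nondifferentiable criterion requires a dedicated solver (e.g., coordinate-descent-type algorithms); the function above only makes the structure of the penalty explicit.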
Choosing appropriate adaptive weights is very important for PACS to be an oracle procedure. Consequently, Sharma et al. [11] suggested adaptive PACS weights that incorporate correlations, given as follows:

w_j = |\tilde{\beta}_j|^{-1}, \quad w_{jk(-)} = |\tilde{\beta}_k - \tilde{\beta}_j|^{-1} (1 - r_{jk})^{-1}, \quad w_{jk(+)} = |\tilde{\beta}_k + \tilde{\beta}_j|^{-1} (1 + r_{jk})^{-1},  (2)

where \tilde{\beta} is a \sqrt{n}-consistent estimator of \beta, such as the ordinary least squares (OLS) estimates or other shrinkage estimates like ridge regression estimates, and r_{jk} is Pearson's correlation between the (j, k)th pair of predictors. Sharma et al. [11] suggest using ridge estimates as initial estimates for the \tilde{\beta}_j's to obtain weights that perform well in studies with collinear predictors.
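These adaptive weights can be sketched as follows (a sketch under the weighting form given above; the small `eps` guard against division by zero is our own addition, not part of [11]):

```python
import numpy as np

def adaptive_pacs_weights(beta_init, R, eps=1e-8):
    """Adaptive PACS weights from initial estimates beta_init and a
    correlation matrix R. Strongly positively correlated pairs with
    similar initial coefficients receive large difference-fusion
    weights, encouraging their coefficients to be set equal."""
    p = len(beta_init)
    w = 1.0 / (np.abs(beta_init) + eps)
    w_minus = np.zeros((p, p))
    w_plus = np.zeros((p, p))
    for j in range(p):
        for k in range(j + 1, p):
            w_minus[j, k] = 1.0 / ((np.abs(beta_init[k] - beta_init[j]) + eps)
                                   * (1.0 - R[j, k] + eps))
            w_plus[j, k] = 1.0 / ((np.abs(beta_init[k] + beta_init[j]) + eps)
                                  * (1.0 + R[j, k] + eps))
    return w, w_minus, w_plus
```

For example, two predictors with correlation 0.9 and equal initial coefficients get a very large `w_minus` entry, so the fused penalty drives their estimated coefficients together.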

Robust PACS
3.1. Methodology of Robust PACS. The satisfactory performance of PACS under normal errors has been demonstrated in [11]. However, high sensitivity to outliers is the main drawback of PACS: a single outlier can completely destroy the good performance of the PACS estimate.
Note that, in (1), the least-squares criterion is used between the predictors and the response, and the weighted penalty contains weights that depend on Pearson's correlation in their calculations. However, neither the least-squares criterion nor Pearson's correlation is robust to outliers. To achieve robustness in estimation and select the informative predictors robustly, the authors propose replacing the least-squares criterion with MM-estimation [12], since MM-estimators are efficient and have high breakdown points. Moreover, the nonrobust weights are replaced with robust weights that depend on robust correlations such as the fast consistent high breakdown (FCH) [13], reweighted multivariate normal (RMVN) [13], Spearman's (SP), and Kendall's (KN) correlations. The RPACS estimates minimize the following:

\sum_{i=1}^{n} \rho_1 \Big( \frac{y_i - \sum_{j=1}^{p} x_{ij}\beta_j}{\hat{\sigma}_n} \Big) + \lambda \Big( \sum_{j=1}^{p} w_j^R |\beta_j| + \sum_{1 \le j < k \le p} w_{jk(-)}^R |\beta_k - \beta_j| + \sum_{1 \le j < k \le p} w_{jk(+)}^R |\beta_k + \beta_j| \Big),  (3)

where \lambda \ge 0 is the regularization parameter and the w^R's are robust versions of the nonnegative weights described in (2).
Here \hat{\sigma}_n is a robust M-estimate of the scale of the residuals, defined as a solution of

\frac{1}{n} \sum_{i=1}^{n} \rho_0 \Big( \frac{r_i(\hat{\beta}_0)}{\sigma} \Big) = b,  (4)

where b is a constant, \hat{\beta}_0 is an initial high-breakdown estimate, and the function \rho_0 satisfies the following conditions: (1) \rho_0 is symmetric and continuously differentiable, and \rho_0(0) = 0; (2) there exists c > 0 such that \rho_0 is strictly increasing on [0, c] and constant on [c, \infty). The MM-estimator in the first part of (3) is defined as an M-estimator of \beta using a redescending score function \psi(u) = \rho_1'(u) and the scale \hat{\sigma}_n obtained from (4). It is a solution of

\sum_{i=1}^{n} \psi \Big( \frac{y_i - \sum_{j=1}^{p} x_{ij}\beta_j}{\hat{\sigma}_n} \Big) x_{ij} = 0, \quad j = 1, \ldots, p,  (5)

where \rho_1 is another bounded \rho function such that \rho_1 \le \rho_0.
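The \rho function and the M-scale step can be sketched numerically. The sketch below uses Tukey's bisquare, a standard choice satisfying conditions (1)-(2) (the fixed-point iteration and tuning constants c = 1.547, b = 0.5, which give a scale consistent for the normal standard deviation, are our choices for illustration, not prescribed by the article):

```python
import numpy as np

def rho_bisquare(u, c):
    """Tukey's bisquare rho: symmetric, rho(0) = 0, strictly
    increasing on [0, c] and constant (equal to 1) beyond c."""
    u = np.abs(u)
    return np.where(u <= c, 1.0 - (1.0 - (u / c) ** 2) ** 3, 1.0)

def m_scale(resid, c=1.547, b=0.5, tol=1e-10, max_iter=200):
    """Solve (1/n) sum rho_0(r_i / s) = b for s by fixed-point
    iteration (a sketch of the M-scale step in Eq. (4))."""
    s = np.median(np.abs(resid)) / 0.6745  # MAD starting value
    for _ in range(max_iter):
        mean_rho = np.mean(rho_bisquare(resid / s, c))
        s_new = s * np.sqrt(mean_rho / b)  # grows if rho mass > b
        if abs(s_new - s) < tol * s:
            return s_new
        s = s_new
    return s
```

For standard normal residuals, the computed scale should be close to 1, reflecting the consistency of this tuning at the normal model.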

3.2. Choosing the Robust Weights.
The process of choosing suitable weights is very important in order to obtain an oracle procedure [11]. The weights described in (2) depend on Pearson's correlation in their calculations. From a practical point of view, it is well known that Pearson's correlation is not resistant to outliers, and thus choosing the weights in (2) based on this correlation will give unreliable and misleading results. Consequently, in order to obtain robust weights, the correlation needs to be estimated by robust approaches. There are two types of robust versions of Pearson's correlation. The first type consists of estimators that are robust to outliers without regard for the general structure of the data, whereas the second type takes the general structure of the data into account when dealing with outliers [14]. KN and the MCD (minimum covariance determinant) are examples of the first and second types, respectively. Olive and Hawkins [13] proposed the FCH and RMVN methods as practical, consistent, outlier-resistant estimators of multivariate location and dispersion. Alkenani and Yu [15] employed the FCH and RMVN estimators instead of Pearson's correlation in canonical correlation analysis (CCA) to obtain a robust CCA, and showed that these estimators perform well under different outlier settings.
In this article, the FCH, RMVN, SP, and KN correlations have been employed instead of Pearson's correlation in order to obtain robust weights as follows:

w_j^R = |\tilde{\beta}_j^R|^{-1}, \quad w_{jk(-)}^R = |\tilde{\beta}_k^R - \tilde{\beta}_j^R|^{-1} (1 - r_{jk}^R)^{-1}, \quad w_{jk(+)}^R = |\tilde{\beta}_k^R + \tilde{\beta}_j^R|^{-1} (1 + r_{jk}^R)^{-1},

where r_{jk}^R is a robust version of Pearson's correlation, such as the FCH, RMVN, SP, or KN correlation, and \tilde{\beta}^R is a robust initial estimate of \beta; we suggest using robust ridge estimates as the initial estimates for the \tilde{\beta}_j^R's.
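For the SP and KN options, the robust pairwise correlation matrix can be sketched with SciPy (assumed available); the FCH and RMVN correlations would instead come from a robust multivariate location-dispersion estimator [13] and are not sketched here:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def robust_corr_matrix(X, method="spearman"):
    """Pairwise robust correlation matrix of the columns of X using
    Spearman's (SP) or Kendall's (KN) correlation."""
    n, p = X.shape
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            if method == "spearman":
                r, _ = spearmanr(X[:, j], X[:, k])
            else:
                r, _ = kendalltau(X[:, j], X[:, k])
            R[j, k] = R[k, j] = r
    return R
```

The resulting matrix can be passed directly to the weight construction above in place of the Pearson correlation matrix.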

Simulation Study
In this section, five examples have been used to assess the proposed RPACS methods by comparing them with the PACS method of [11]. Data have been generated from the regression model

y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i.

In all examples, the predictors are standard normal. The distributions of the error term \epsilon and of the predictors are contaminated by two heavy-tailed distributions: the t distribution with 5 degrees of freedom (t(5)) and the Cauchy distribution with location 0 and scale 1 (Cauchy(0, 1)). Different contamination ratios (5%, 10%, 15%, 20%, and 25%) were used. The performance of the methods is compared using the model error (ME) criterion for prediction accuracy, defined by ME = (\hat{\beta} - \beta)^T \Sigma (\hat{\beta} - \beta), where \Sigma represents the population covariance matrix of the predictors. The sample sizes were 50 and 100, and the simulated model was replicated 1000 times.
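One replicate of such a contaminated design, and the ME criterion, can be sketched as follows (function names are ours; here only the errors are contaminated, with t(5) draws, as one of the settings described above):

```python
import numpy as np

def simulate_contaminated(n, beta, Sigma, contam=0.10, rng=None):
    """One replicate: Gaussian predictors with covariance Sigma and
    errors where a `contam` fraction is replaced by t(5) draws."""
    rng = np.random.default_rng(rng)
    p = len(beta)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = rng.standard_normal(n)
    mask = rng.random(n) < contam          # rows to contaminate
    eps[mask] = rng.standard_t(5, size=mask.sum())
    y = X @ beta + eps
    return X, y

def model_error(beta_hat, beta, Sigma):
    """ME = (beta_hat - beta)' Sigma (beta_hat - beta)."""
    d = beta_hat - beta
    return float(d @ Sigma @ d)
```

Averaging `model_error` over the 1000 replicates gives the ME figures reported in Tables 1-5.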
Example 1. In this example, the true parameters are \beta = (2, 2, 2, 0, 0, 0, 0, 0)^T \in R^8. The first three predictors are highly correlated with pairwise correlation 0.7 and their coefficients are equal in magnitude, while the rest are uncorrelated.

Example 2. In this example, the true coefficients are \beta = (0.5, 1, 2, 0, 0, 0, 0, 0)^T \in R^8. The first three predictors are highly correlated with pairwise correlation 0.7 and their coefficients differ in magnitude, while the rest are uncorrelated.

Example 3. In this example, the true parameters are \beta = (1, 1, 1, 0.5, 1, 2, 0, 0, 0, 0)^T \in R^{10}. The first three predictors are highly correlated with pairwise correlation 0.7 and their coefficients are equal in magnitude, while the next three predictors have a lower pairwise correlation of 0.3 and coefficients of different magnitudes. The remaining predictors are uncorrelated.

Example 4. In this example, the true parameters are \beta = (1, 1, 1, 0.5, 1, 2, 0, 0, 0, 0)^T \in R^{10}. The first three predictors are correlated with pairwise correlation 0.3 and their coefficients are equal in magnitude, while the next three predictors have pairwise correlation 0.7 and coefficients of different magnitudes. The remaining predictors are uncorrelated.

Example 5. In this example, the true parameters are \beta = (2, 2, 2, 1, 1, 0, 0, 0, 0, 0)^T \in R^{10}. The first three predictors are highly correlated with pairwise correlation 0.7, and the next two predictors also have pairwise correlation 0.7, while the rest are uncorrelated. Note that both groups of highly correlated predictors have coefficients that are equal in magnitude.
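The block-correlation structures used in these examples can be encoded as a covariance matrix for the predictor draws (a small helper of our own; for Example 5, the blocks are indices {0, 1, 2} and {3, 4} with correlation 0.7):

```python
import numpy as np

def block_sigma(p, blocks, rho):
    """Covariance matrix: identity except that each listed block of
    indices gets pairwise correlation rho (equicorrelated blocks)."""
    Sigma = np.eye(p)
    for idx in blocks:
        for j in idx:
            for k in idx:
                if j != k:
                    Sigma[j, k] = rho
    return Sigma
```

Example usage for Example 5: `block_sigma(10, [(0, 1, 2), (3, 4)], 0.7)`, which is positive definite and can be fed to a multivariate normal generator.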
To avoid repetition, the observations about the results in Tables 1-5 have been summarized as follows.
From Tables 1, 2, 3, 4, and 5: when there is no contamination, PACS performs well compared with the proposed methods. As the contamination ratio of t(5) or Cauchy(0, 1) increases, the performance of PACS deteriorates while RPACS with all the robust weights remains stable, and the best performance is achieved by RPACS.RMVN and RPACS.FCH, respectively, for all sample sizes. The variations in the ME values of the RPACS estimates are similar across all robust weights under the different contamination settings and sample sizes, and they are smaller than the variations of the PACS estimates.

Analysis of Real Data
In this section, the RPACS methods with all the robust weights and the PACS method have been applied to real data. The NCAA sports data from Mangold et al. [16] and the pollution data from McDonald and Schwing [17] have been studied.
The response variable was centered and the predictors were standardized. To verify RPACS, the two datasets have been analyzed after including outliers in the response variable and the predictors. The two datasets have been contaminated at ratios of 5%, 10%, 15%, and 20% with draws from a multivariate t distribution with three degrees of freedom.
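This contamination scheme can be sketched as follows (a sketch under our own assumptions: contaminated rows are replaced by scaled multivariate-t(3) draws generated as normal over the square root of a scaled chi-square, and the `scale` factor is a hypothetical choice):

```python
import numpy as np

def contaminate_rows(X, y, frac, df=3, scale=5.0, rng=None):
    """Replace a `frac` fraction of (X, y) rows with multivariate-t(df)
    noise, returning contaminated copies."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    m = int(round(frac * n))
    rows = rng.choice(n, size=m, replace=False)
    # multivariate t = normal / sqrt(chi2 / df)
    g = rng.standard_normal((m, p + 1))
    chi = rng.chisquare(df, size=m) / df
    t_draws = scale * g / np.sqrt(chi)[:, None]
    Xc, yc = X.copy(), y.copy()
    Xc[rows] = t_draws[:, :p]
    yc[rows] = t_draws[:, p]
    return Xc, yc
```

The original arrays are left untouched, so the clean and contaminated fits can be compared on the same dataset.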
To evaluate the estimation accuracy of the RPACS methods, the correlation between the parameters estimated by each method under consideration and the parameters estimated by PACS without outliers, denoted Corr(\hat{\beta}, \hat{\beta}_{PACS,0}), is reported, along with the effective model size. From Tables 6 and 7, we have the following findings in terms of estimation accuracy and effective model size: (1) in the case of no contamination, the RPACS methods give results comparable to PACS; in addition, RPACS.RMVN and RPACS.FCH achieve better performance than RPACS.KN and RPACS.SP. (2) In the case of contamination, the performance of PACS is dramatically affected, while RPACS.RMVN and RPACS.FCH give very consistent results even at high contamination percentages. RPACS.KN and RPACS.SP are less efficient than RPACS.RMVN and RPACS.FCH across all contamination percentages.
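These two summaries can be computed directly (a sketch; the interpretation of "effective model size" as the number of distinct nonzero coefficient magnitudes, i.e., groups counted once, is our assumption, and the rounding tolerance is ours):

```python
import numpy as np

def eval_metrics(beta_hat, beta_ref, tol=1e-8):
    """Corr(beta_hat, beta_ref) and the effective model size, taken
    here as the number of distinct nonzero coefficient magnitudes."""
    corr = np.corrcoef(beta_hat, beta_ref)[0, 1]
    nz = np.abs(beta_hat[np.abs(beta_hat) > tol])
    size = len(np.unique(np.round(nz, 6)))
    return corr, size
```

For instance, an estimate (2, 2, 2, 0, 0) has effective model size 1: one group of three equal nonzero coefficients.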

Conclusions
In this paper, robust consistent group identification and VS procedures (RPACS) have been proposed, combining robustness with the simultaneous identification of relevant groups and VS. The simulation studies and the real data analyses demonstrate that the RPACS methods achieve better prediction accuracy and group identification than PACS when outliers exist in the response variable and the predictors. In general, the best performance is achieved by RPACS.RMVN and RPACS.FCH, respectively, for all sample sizes.

Table 1 :
ME results of Example 1.

Table 2 :
ME results of Example 2.

Table 3 :
ME results of Example 3.

Table 4 :
ME results of Example 4.

Table 5 :
ME results of Example 5.
5.2. Pollution Data (PD). The PD is taken from a study of the effects of different air pollution indicators and sociodemographic factors on mortality. The dataset is available from the website http://www4.stat.ncsu.edu/~boos/var.select/pollution.html. The data contain n = 60 observations and p = 15 predictors. The response is the total age-adjusted mortality rate.

Table 6 :
The Corr(\hat{\beta}, \hat{\beta}_{PACS,0}) and effective model size values for the methods under consideration based on the NCAA sports data.

Table 7 :
The Corr(\hat{\beta}, \hat{\beta}_{PACS,0}) and effective model size values for the methods under consideration based on the pollution data.