Robust Bayesian Regularized Estimation Based on 𝑡 Regression Model



Introduction
Following the pioneering work of Tibshirani [1], the Lasso (least absolute shrinkage and selection operator) has generated much interest in the statistical literature [2][3][4][5]. Consider the Gaussian linear regression model

$$y_i = \mathbf{x}_i^{\top}\beta + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where the $\mathbf{x}_i$'s are either fixed or random covariates and $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. $N(0, \sigma^2)$. The Lasso estimates are $\ell_1$-regularized least squares estimates, defined as

$$\hat\beta = \arg\min_{\beta}\; \|\mathbf{y} - X\beta\|_2^2 + \lambda\|\beta\|_1 \qquad (2)$$

for $\lambda \ge 0$, with the $n \times 1$ vector $\mathbf{y} = (y_1, \ldots, y_n)^{\top}$, the $n \times p$ matrix $X = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^{\top}$, and the $p \times 1$ vector $\beta$. The key advantage of the Lasso lies in its ability to perform simultaneous parameter estimation and variable selection, and the entire path of Lasso estimates over all values of $\lambda$ can be computed efficiently through a modification of the LARS (least angle regression) algorithm of Efron et al. [5]. The Gaussian assumption is not crucial in model (1), but it is useful for making the connection to the likelihood framework for regularized $t$ regression in Section 2. The Lasso estimator in (2) is equivalent to minimizing the penalized negative log-likelihood $-\ell(\beta; \mathbf{y})$ as a function of the regression coefficients $\beta$ with the $\ell_1$-penalty $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$; equivalence means here that we obtain the same estimator for a potentially different tuning parameter. The Lasso estimator in (2), however, does not provide an estimate of the variance parameter $\sigma^2$, which plays an important role especially when the error in the regression model has high variance.
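As an illustration of the estimator in (2), the following is a minimal numpy sketch of the Lasso fitted by cyclic coordinate descent; the function names and the $1/2n$ scaling of the squared-error term are our own choices for the sketch, not part of the original formulation.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso estimate via cyclic coordinate descent on
    (1/2n) * ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                      # current residual
    col_sq = (X ** 2).sum(axis=0) / n     # (1/n) * x_j' x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]        # remove coordinate j's contribution
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * beta[j]        # add back the updated contribution
    return beta
```

For large $\lambda$ every coordinate update is soft-thresholded to zero, reproducing the shrinkage-to-zero behavior that makes the Lasso a variable selection method.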
On the other hand, the Lasso estimate in (2), being based on the least squares loss, may perform poorly when the error distribution has heavier tails than the normal distribution, has large variance, or contains outliers, which has motivated much research on regularized estimation based on robust methods. Most existing robust regularized estimation methods replace the least squares loss function in (2) by some robust loss function, such as the Huber loss, the $\ell_1$ loss, or the quantile loss function. Li and Zhu [6] considered quantile regression with the Lasso penalty and developed its piecewise linear solution path. Wang et al. [7] investigated the least absolute deviation (LAD) estimate with the adaptive Lasso penalty (LAD-Lasso) and proved its oracle property. Wu and Liu [8] further discussed the oracle properties of the SCAD (smoothly clipped absolute deviation) and adaptive Lasso regularized quantile regression. Zou and Yuan [9] proposed a new efficient estimation and variable selection procedure called composite quantile regression. Chen et al. [10] studied the robust Lasso in the presence of noise with large variance; they considered two versions of the robust Lasso, which use a convex combined loss and the Huber loss criterion instead of the least squares criterion.
Besides the above existing robust regularized methods, which, in essence, replace the $\ell_2$ loss by other loss criteria, we can also consider robust estimation directly from the standpoint of the error distribution in model (1). It is well known that when the error in linear regression follows a normal distribution, the estimates are sensitive to outliers, which motivates us to use a robust error distribution in the linear regression model (1). The $t$ distribution has frequently been suggested in the literature as a robust alternative to the normal distribution and an extension of the traditional normal models. The $t$ distribution provides a convenient description for regression analysis when the residual term has a density with heavy tails and excess kurtosis.
Since the Lasso estimate in (2) can be interpreted as a Bayesian posterior mode estimate when the regression parameters have independent Laplace priors, Park and Casella [11], motivated by this connection, discussed Bayesian Lasso estimation for linear regression based on the Gibbs sampler. The Bayesian Lasso has several advantages: it conveniently provides interval estimates (Bayesian credible intervals) that can guide variable selection; the structure of the hierarchical model provides methods for selecting the Lasso parameter; and its performance sometimes exceeds that of non-Bayesian methods. Several authors subsequently studied the variable selection problem based on Bayesian methods, such as Park and Casella [11], Hans [12], Li et al. [13], and Kyung et al. [14].
The main goal of this paper is to extend the Bayesian regularized method to the $t$ regression model (4), yielding a robust version of Bayesian regularized estimation; its performance is expected to be better than that of other regularized methods when the error follows a symmetric heavy-tailed distribution or there are outliers in the regression model.
The rest of the paper is organized as follows. In Section 2, we introduce $t$ regression with the Lasso penalty. In Section 3, we build the framework of the Bayesian $t$ regression model with the adaptive Lasso penalty, present the corresponding Bayesian hierarchical model, and derive the Gibbs sampling estimation procedure. In Sections 4 and 5, we carry out simulation studies and a real data analysis to illustrate the performance of our new method. Some concluding remarks are contained in Section 6. We also discuss Bayesian $t$ regression with the group Lasso penalty and its Gibbs sampler in the Appendices.

𝑡 Regression Model and 𝐿₁-Norm Regularization
2.1. 𝑡 Regression Model. The $t$ distribution as an alternative to the normal distribution has frequently been suggested in the literature; for example, Lange et al. [15] introduced regression models with $t$-distributed errors as a robust extension of the traditional normal models; Liu and Rubin [16] discussed estimation of the $t$ distribution by the EM algorithm and its extensions; and Lin et al. [17] studied the heteroscedasticity diagnostics problem via score tests for $t$ linear regression models. The $t$ distribution has often been used in many real applications, especially in finance.
The probability density function of a $t$-distributed variable $y$ with location parameter $\mu$, dispersion parameter $\sigma$, and degrees of freedom $\nu$, denoted $t(\mu, \sigma^2, \nu)$, is

$$p\left(y \mid \mu, \sigma^2, \nu\right) = \frac{\Gamma\left((\nu+1)/2\right)}{\Gamma\left(\nu/2\right)\sqrt{\pi\nu}\,\sigma}\left(1 + \frac{(y-\mu)^2}{\nu\sigma^2}\right)^{-(\nu+1)/2}, \qquad (3)$$

where $\Gamma(\cdot)$ denotes the gamma function, $\mu, y \in \mathbb{R}$, and $\sigma > 0$.
The mean and variance of $y$ are $\mu$ (for $\nu > 1$) and $\nu\sigma^2/(\nu-2)$ (for $\nu > 2$), respectively. When $\nu = \infty$ and $\nu = 1$, the $t$ distribution reduces to the normal and Cauchy distributions, respectively. The univariate linear Student-$t$ regression model can be expressed as

$$y_i = \mathbf{x}_i^{\top}\beta + \varepsilon_i, \quad \varepsilon_i \sim t\left(0, \sigma^2, \nu\right), \quad i = 1, \ldots, n, \qquad (4)$$

where the $\mathbf{x}_i$'s are known $p$-vectors of covariates and $\beta = (\beta_1, \ldots, \beta_p)^{\top}$ is the unknown $p$-vector of parameters.
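To make the density (3) and its limiting cases concrete, here is a small self-contained Python check (the helper names are ours) that the density reduces to the Cauchy density at $\nu = 1$ and approaches the normal density as $\nu$ grows. Log-gamma is used so that large $\nu$ does not overflow.

```python
import math

def t_pdf(y, mu, sigma, nu):
    """Density of t(mu, sigma^2, nu) as given in (3), computed on the log scale."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(math.pi * nu) - math.log(sigma))
    log_kernel = -(nu + 1) / 2 * math.log1p((y - mu) ** 2 / (nu * sigma ** 2))
    return math.exp(log_c + log_kernel)

def normal_pdf(y, mu, sigma):
    """N(mu, sigma^2) density, the nu -> infinity limit of (3)."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# At nu = 1 the density is Cauchy: t_pdf(0, 0, 1, 1) equals 1/pi.
# For very large nu, t_pdf(., ., ., nu) is numerically close to normal_pdf.
```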

2.2. 𝑡 Regression with Lasso Penalty.
As discussed in Section 1, the $\ell_2$ loss in the Lasso is not robust to heavy-tailed error distributions and/or outliers. This indicates that the Lasso is not an ideal goodness-of-fit criterion in the presence of noise with large variance. We consider the following regularized $t$ regression, defined as

$$\hat\theta = \arg\min_{\theta}\left\{-\sum_{i=1}^{n}\ell_i\left(\beta, \sigma^2; y_i, \mathbf{x}_i\right) + \lambda\|\beta\|_1\right\}, \qquad (5)$$

where $\theta = (\beta^{\top}, \sigma)^{\top}$ and $\ell_i(\beta, \sigma^2; y_i, \mathbf{x}_i)$ is the log-likelihood function of $y_i$:

$$\ell_i\left(\beta, \sigma^2; y_i, \mathbf{x}_i\right) = \log\Gamma\!\left(\frac{\nu+1}{2}\right) - \log\Gamma\!\left(\frac{\nu}{2}\right) - \frac{1}{2}\log\left(\pi\nu\sigma^2\right) - \frac{\nu+1}{2}\log\!\left(1 + \frac{(y_i - \mathbf{x}_i^{\top}\beta)^2}{\nu\sigma^2}\right). \qquad (6)$$

Comparing (5) with (2), we take the parameter $\sigma^2$ into the optimization of the penalized maximum likelihood estimator.
Though we penalize only the parameter $\beta$, the variance parameter estimate $\hat\sigma^2$ is influenced indirectly by the amount of shrinkage $\lambda$ in (5).
There is a main drawback of the estimator in (5) [18]: it is not equivariant under scaling of the response [19]. More precisely, consider the transformation

$$\mathbf{y} \mapsto c\,\mathbf{y}, \quad \beta \mapsto c\,\beta, \quad \sigma \mapsto c\,\sigma, \quad c > 0, \qquad (7)$$

which leaves model (4) invariant, but this is not the case for the estimator in (5). We address this drawback by using the penalty term $\lambda\|\beta\|_1/\sigma$, leading to the following estimator:

$$\hat\theta = \arg\min_{\theta}\left\{-\sum_{i=1}^{n}\ell_i\left(\beta, \sigma^2; y_i, \mathbf{x}_i\right) + \lambda\,\frac{\|\beta\|_1}{\sigma}\right\}. \qquad (8)$$

Given the form of the regularized estimator in (8), the Lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent Laplace (i.e., double-exponential) priors:

$$\pi\left(\beta \mid \sigma^2\right) = \prod_{j=1}^{p}\frac{\lambda}{2\sigma}\exp\!\left(-\frac{\lambda\,|\beta_j|}{\sigma}\right). \qquad (9)$$

Park and Casella [11] argued that conditioning on $\sigma^2$ is important because it guarantees a unimodal full posterior.
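A short numerical sketch (our own code, with illustrative data) of why the penalty $\|\beta\|_1/\sigma$ in (8) restores scale equivariance: under the rescaling $\mathbf{y} \mapsto c\mathbf{y}$, $\beta \mapsto c\beta$, $\sigma \mapsto c\sigma$, the objective changes only by the additive constant $n\log c$, so the minimizer transforms accordingly.

```python
import math
import numpy as np

def t_neg_loglik(beta, sigma, y, X, nu):
    """Negative log-likelihood of the t regression model (4), i.e. -sum of (6)."""
    r = y - X @ beta
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(math.pi * nu) - math.log(sigma))
    return -len(y) * const + (nu + 1) / 2 * np.sum(np.log1p(r ** 2 / (nu * sigma ** 2)))

def objective(beta, sigma, y, X, nu, lam):
    """Scale-equivariant objective in (8): -loglik + lam * ||beta||_1 / sigma."""
    return t_neg_loglik(beta, sigma, y, X, nu) + lam * np.abs(beta).sum() / sigma
```

Rescaling leaves the residual-to-scale ratios and the penalty unchanged; only the $-n\log\sigma$ term shifts, by exactly $n\log c$.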

Bayesian Formulation of Robust 𝑡 Adaptive Lasso
3.1. Bayesian Adaptive Lasso 𝑡 Regression. It is well known that the ordinary Lasso has a conflict between consistency in model selection and optimal prediction. As a solution that achieves both estimation and model selection consistency, the adaptive Lasso [20] was proposed, and its model selection consistency and asymptotic normality under a certain rate of the shrinkage parameters $\lambda_j$ were proved. Extending the idea of the adaptive Lasso [20] to our proposed robust $t$ Lasso regression, we define the robust $t$ adaptive Lasso as

$$\hat\theta = \arg\min_{\theta}\left\{-\sum_{i=1}^{n}\ell_i\left(\beta, \sigma^2; y_i, \mathbf{x}_i\right) + \sum_{j=1}^{p}\lambda_j\,\frac{|\beta_j|}{\sigma}\right\}, \qquad (10)$$

where the $\lambda_j$'s allow unequal penalties for the coefficients, and we define the precision parameter $\eta = \sigma^{-2}$. Motivated by the connection with Bayesian estimation and (9), we consider a fully Bayesian analysis using a conditional Laplace prior on $\beta$:

$$\pi(\beta \mid \eta) = \prod_{j=1}^{p}\frac{\lambda_j\sqrt{\eta}}{2}\exp\!\left(-\lambda_j\sqrt{\eta}\,|\beta_j|\right). \qquad (11)$$

For any $a > 0$, we have the following equality [21]:

$$\frac{a}{2}e^{-a|z|} = \int_{0}^{\infty}\frac{1}{\sqrt{2\pi s}}\exp\!\left(-\frac{z^2}{2s}\right)\frac{a^2}{2}\exp\!\left(-\frac{a^2 s}{2}\right)ds, \qquad (12)$$

by which the Laplace prior (11) on $\beta$ can be written as a scale mixture of normals with exponential mixing densities:

$$\pi(\beta \mid \eta) = \prod_{j=1}^{p}\int_{0}^{\infty}\sqrt{\frac{\eta}{2\pi s_j}}\exp\!\left(-\frac{\eta\beta_j^2}{2s_j}\right)\frac{\lambda_j^2}{2}\exp\!\left(-\frac{\lambda_j^2 s_j}{2}\right)ds_j. \qquad (13)$$

On the other hand, from Lange et al. [15] and Liu and Rubin [16], $t(\mu, \sigma^2, \nu)$ can be treated as a mixture of normal and gamma distributions; that is, if $u \sim \Gamma(\nu/2, \nu/2)$ and $x \mid u \sim N(\mu, (u\eta)^{-1})$, then $x \sim t(\mu, \sigma^2, \nu)$. If we further put gamma priors on the parameters $\lambda_j^2$ and use the improper prior density $\pi(\eta) \propto 1/\eta$ on $\eta$, then we have the following Bayesian hierarchical model:

$$\begin{aligned}
y_i \mid \beta, \eta, u_i &\sim N\!\left(\mathbf{x}_i^{\top}\beta,\,(u_i\eta)^{-1}\right), \quad u_i \sim \Gamma\!\left(\frac{\nu}{2}, \frac{\nu}{2}\right), \quad i = 1, \ldots, n,\\
\beta_j \mid \eta, s_j &\sim N\!\left(0,\,\eta^{-1}s_j\right), \quad s_j \sim \mathrm{Exp}\!\left(\frac{\lambda_j^2}{2}\right), \quad \lambda_j^2 \sim \Gamma(a, b), \quad j = 1, \ldots, p,\\
\pi(\eta) &\propto \eta^{-1},
\end{aligned} \qquad (14)$$

where $\mathbf{u} = (u_1, \ldots, u_n)^{\top}$ and $\mathbf{s} = (s_1, \ldots, s_p)^{\top}$.
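The normal-gamma mixture representation underlying (14) is easy to verify by simulation; the following is a sketch with arbitrary illustrative values of $\mu$, $\sigma$, and $\nu$ (the variable names are ours).

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, nu, n = 0.0, 1.5, 8.0, 200_000
eta = sigma ** -2                          # precision parameter eta = sigma^{-2}

# u ~ Gamma(nu/2, nu/2) (shape/rate; numpy takes scale = 1/rate),
# then x | u ~ N(mu, (u * eta)^{-1}).
u = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
x = rng.normal(mu, np.sqrt(1 / (u * eta)))

# x should then behave like t(mu, sigma^2, nu), whose variance for nu > 2
# is nu * sigma^2 / (nu - 2) = 3.0 here; the sample variance should be close.
```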
Remark 1. The amounts of shrinkage $\lambda_j^2$ depend on the values of the hyperparameters $a$ and $b$. Because larger $a$ and smaller $b$ lead to heavier penalization, it is important to treat $a$ and $b$ as unknown parameters, to avoid fixing specific values that would affect the estimates of the regression coefficients.
Remark 2. Similar to Park and Casella [11], there is another existing method to obtain estimates of the adaptive Lasso parameters besides sampling from the posterior distribution of the $\lambda_j^2$'s: the empirical Bayes estimation method can be used to update the $\lambda_j^2$'s without placing priors on them. Our limited experience shows that the two methods perform well, but the empirical Bayes method can save some computing time; for more details, see Park and Casella [11] and Leng et al. [22].

3.2. Gibbs Sampler for All Parameters. Let $\beta_{-j}$ denote the parameter vector $\beta$ excluding the component $\beta_j$, $\mathbf{s}_{-j}$ the vector $\mathbf{s}$ excluding the component $s_j$, and $\mathbf{u}_{-i}$ the vector $\mathbf{u}$ excluding the component $u_i$. Based on the Bayesian hierarchical model (14), the joint posterior distribution of all parameters is given by (15). From expression (15), we can obtain the posterior full conditional distribution of every parameter, which yields a tractable and efficient Gibbs sampler that works as follows.
The full conditional distribution $\pi(s_j \mid \mathbf{y}, X, \beta, \eta, \mathbf{s}_{-j}, \mathbf{u}, \lambda_1^2, \ldots, \lambda_p^2, a, b)$ is

$$\pi\left(s_j \mid \cdot\right) \propto s_j^{-1/2}\exp\!\left\{-\frac{1}{2}\left(\frac{\eta\beta_j^2}{s_j} + \lambda_j^2 s_j\right)\right\}. \qquad (16)$$

Thus, the full conditional distribution of $s_j$ is a generalized inverse Gaussian distribution. Now we consider the full conditional distribution of $u_i$, which is given by

$$\pi\left(u_i \mid \cdot\right) \propto u_i^{(\nu+1)/2 - 1}\exp\left(-c_i u_i\right), \qquad (17)$$

where

$$c_i = \frac{\nu + \eta\left(y_i - \mathbf{x}_i^{\top}\beta\right)^2}{2}. \qquad (18)$$

So the full conditional distribution of $u_i$ is a gamma distribution, $\Gamma\left((\nu+1)/2,\, c_i\right)$.
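Assuming the hierarchical model described above, the gamma full conditional of the mixing weights $u_i$ can be sampled directly; a sketch (the function name is ours, and note that numpy's gamma sampler is parameterized by shape and scale $= 1/\text{rate}$):

```python
import numpy as np

def sample_u(y, X, beta, eta, nu, rng):
    """Draw the latent mixing weights u_i from their gamma full conditional,
    u_i | . ~ Gamma((nu + 1)/2, (nu + eta * r_i^2)/2)   (shape/rate),
    where r_i = y_i - x_i' beta."""
    r = y - X @ beta
    shape = (nu + 1) / 2
    rate = (nu + eta * r ** 2) / 2
    return rng.gamma(shape, 1 / rate)     # numpy uses scale = 1/rate
```

Observations with large residuals receive small $u_i$ and are therefore downweighted, which is the mechanism behind the robustness of the $t$ model.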
The full conditional distribution of $\eta$ is obviously just a gamma distribution. However, the full conditional posterior distribution of the degrees of freedom $\nu$ does not have a closed form. Fortunately, we note that this full conditional posterior distribution is log-concave, so the adaptive rejection sampling algorithm [23] can be used to sample from it. Adaptive rejection sampling is an efficient method for sampling from any univariate probability density function that is log-concave.
Remark 3. The Lasso was originally designed for variable selection, since the posterior mode of the $\beta_j$'s can be exactly zero due to the nature of the Laplace prior in (11). However, posterior draws of the $\beta_j$'s cannot be exactly zero, since they are drawn from a continuous posterior distribution. A post hoc thresholding rule may overcome this difficulty. Alternatively, Kyung et al. [24] recommended using the credible interval around the posterior mean to guide variable selection.
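Remark 3's credible-interval rule can be sketched as follows (a minimal illustration of the idea attributed to Kyung et al. [24]; the function name and the 95% level are our choices):

```python
import numpy as np

def select_by_credible_interval(beta_draws, level=0.95):
    """Select variable j if the equal-tailed (level)-credible interval
    for beta_j, computed from posterior draws (rows = draws, cols = j),
    excludes zero."""
    alpha = 1 - level
    lo = np.quantile(beta_draws, alpha / 2, axis=0)
    hi = np.quantile(beta_draws, 1 - alpha / 2, axis=0)
    return (lo > 0) | (hi < 0)
```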
Remark 4. For variable selection, several explanatory variables may sometimes be represented by a group of derived input variables. In this case, the selection of important variables corresponds to groups of variables. The group Lasso penalty [13, 25] takes the group structure into account and can do variable selection at the group level. For the Bayesian $t$ regression model with the adaptive group Lasso penalty, we give the details in the Appendices.

Simulation Studies
The main goal of this section is to assess the performance of our method (BAT.L) through simulations, in comparison with several Bayesian and non-Bayesian methods. The Bayesian methods include the Bayesian Lasso (BLS.L) [11] and the Bayesian adaptive Lasso (BALS.L) [22]. The non-Bayesian methods include the Lasso (LASSO) [1] and the adaptive Lasso (ALASSO) [20]. The data in the simulation studies are generated with $\mathbf{x}_i \sim N(\mathbf{0}, \Sigma)$, where the covariance matrix $\Sigma$ is an AR(1) matrix, that is, $\Sigma_{ij} = \rho^{|i-j|}$ with correlation $\rho = 0.5$, and the error distributions of $\varepsilon_i$ considered here cover six situations. The first choice is a standard normal distribution, $N(0, 1)$.
The third choice is a $t$ distribution with $\nu = 3$ degrees of freedom, $t(3)$.
For each simulation study and each choice of error distribution, we run 50 simulations. In each simulation, we generate a training set with 20 observations, a validation set with 20 observations, and a testing set with 200 observations. The validation set is used to choose the penalty parameters in LASSO and ALASSO; after the penalty parameters are selected, we combine the training and validation sets to estimate $\beta$. Since BAT.L, BLS.L, and BALS.L estimate the penalty parameters automatically and do not need a validation set, we likewise combine the training and validation sets for their estimation. The testing set is used to evaluate the performance of these methods. Within each simulation study, we consider three different settings for $\beta$.
Simulations 1, 2, and 3 are corresponding to the sparse case, very sparse case, and sparse recovery problem in the predictors, respectively.
Note that we intentionally choose error distributions that differ from the $t$ distribution and the Laplace distribution to see how the Bayesian $t$ regression adaptive Lasso estimate depends on the error assumption. Our simulation results show that, in terms of parameter estimation accuracy, the Bayesian $t$ adaptive Lasso method still performs well even when this error distribution assumption is violated.
Unless otherwise specified, the degrees of freedom $\nu = 4$ is used in all simulations for BAT.L. Since the true model is known, we can compute the median of mean absolute deviations (MMAD), that is,

$$\mathrm{MMAD} = \mathrm{median}\left\{\frac{1}{n}\sum_{i=1}^{n}\left|\mathbf{x}_i^{\top}\hat\beta - \mathbf{x}_i^{\top}\beta\right|\right\},$$

where the median of the mean absolute deviations is taken over the 50 simulations. We present the results in Table 1. The numbers in parentheses in the tables are the standard deviations of the MMAD, obtained from 500 bootstrap resamplings of the 50 mean absolute deviations.
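The MMAD criterion and its bootstrap standard deviation can be computed as follows (a sketch; the function names are ours):

```python
import numpy as np

def mmad(beta_hats, beta_true, X_test):
    """Median over simulations of mean(|X beta_hat - X beta_true|),
    the MMAD criterion used to compare the methods."""
    mads = [np.mean(np.abs(X_test @ b - X_test @ beta_true)) for b in beta_hats]
    return np.median(mads)

def bootstrap_sd(mads, n_boot=500, seed=0):
    """Standard deviation of the MMAD from bootstrap resampling of the MADs."""
    rng = np.random.default_rng(seed)
    mads = np.asarray(mads)
    stats = [np.median(rng.choice(mads, size=mads.size, replace=True))
             for _ in range(n_boot)]
    return np.std(stats)
```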
Simulations show that, in terms of the MMAD, our new regularized method performs better than the other four methods in general, especially for the nonnormal error distributions.

Land Rent Data
In this section, we demonstrate the performance of BAT.L, together with the other methods, on a real data set.
Weisberg [26] reported the land rent data set. The data were collected by Douglas Tiffany to study the variation in rent paid in 1977 for agricultural land planted with alfalfa. The variables are the average rent per acre planted to alfalfa ($y$), the average rent paid for all tillable land ($x_1$), the density of dairy cows (number per square mile) ($x_2$), the proportion of farmland used as pasture ($x_3$), and $x_4 = 1$ if liming is required to grow alfalfa and $x_4 = 0$ otherwise. The unit of analysis is a county in Minnesota; the 67 counties with appreciable rented farmland are included. The response variable alfalfa rent, which has mean 42.1661 and standard deviation 22.5866, concerns a high-protein crop that is suitable feed for dairy cows. It is thought that rent for land planted with alfalfa, relative to rent for other agricultural purposes, would be higher in areas with a high density of dairy cows, and that rents would be lower in counties where liming is required, since that would mean additional expense.
We first center all the variables so that the intercept need not be considered, standardize the predictors, and then use the above five regularized methods to fit the whole data set. For the three Bayesian methods, we use the Bayesian credible intervals to guide variable selection.
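The preprocessing step, centering all variables and standardizing the predictors, can be sketched as follows (our own helper):

```python
import numpy as np

def center_and_standardize(y, X):
    """Center y and the columns of X (so the intercept drops out) and
    scale the predictors to unit standard deviation."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    Xs = Xc / Xc.std(axis=0, ddof=0)
    return yc, Xs
```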

Concluding Remarks
In this paper, we studied the regularized $t$ regression model based on a Bayesian method. The new method extends the Bayesian Lasso [11] by replacing the least squares loss function with the negative log-likelihood of the $t$ distribution and the Lasso penalty with the adaptive Lasso penalty. Bayesian hierarchical models are developed and Gibbs samplers are derived for the new method. Both the simulation studies and the real data analysis show that BAT.L performs better than existing Bayesian and non-Bayesian methods when the error distribution has heavy tails and/or outliers. We also discussed the Bayesian $t$ regression model with the adaptive group Lasso penalty.
There is room to improve our methods. In this paper, we treated the dispersion parameter $\sigma$ in the $t$ regression model as constant. In fact, the dispersion parameter may differ across observations, which leads to the problem of regularized $t$ regression with a varying dispersion model; that is,

$$y_i = \mathbf{x}_i^{\top}\beta + \varepsilon_i, \quad \varepsilon_i \sim t\left(0, \sigma_i^2, \nu\right), \quad \log\sigma_i^2 = \mathbf{z}_i^{\top}\gamma,$$

where the $\mathbf{z}_i$'s are covariates, which constitute, in general although not necessarily, a subset of the $\mathbf{x}_i$'s. For the varying dispersion $t$ regression model, it is important to consider how to simultaneously perform variable selection for both the regression coefficients $\beta$ and the dispersion coefficients $\gamma$, based on Bayesian or non-Bayesian methods. Furthermore, it would also be interesting to consider variable selection for varying dispersion $t$ regression models with high-dimensional covariates.
Research in this aspect is ongoing.

A. Bayesian 𝑡 Regression with Adaptive Group Lasso Penalty
In the Appendices, we consider regression problems where some explanatory variables may be represented by a group of derived input variables. In this case, the selection of important variables corresponds to groups of variables. The group Lasso penalty [13, 25] takes the group structure into account and can do variable selection at the group level. Suppose that the predictors are grouped into $G$ groups and $\beta_g$ is the coefficient vector of the $g$th group of predictors $\mathbf{x}_{ig}$. Then $\beta = (\beta_1^{\top}, \ldots, \beta_G^{\top})^{\top}$ and $\mathbf{x}_i = (\mathbf{x}_{i1}^{\top}, \ldots, \mathbf{x}_{iG}^{\top})^{\top}$. Let $m_g$ be the dimension of the vector $\beta_g$ and let $K_g$ be a known $m_g \times m_g$ positive definite matrix ($g = 1, \ldots, G$). Define $\|\beta_g\|_{K_g} = (\beta_g^{\top}K_g\beta_g)^{1/2}$. We consider the following adaptive group Lasso $t$ regression:

$$\hat\theta = \arg\min_{\theta}\left\{-\sum_{i=1}^{n}\ell_i\left(\beta, \sigma^2; y_i, \mathbf{x}_i\right) + \sum_{g=1}^{G}\lambda_g\,\frac{\|\beta_g\|_{K_g}}{\sigma}\right\}. \qquad \text{(A.1)}$$

We pick the prior of $\beta_g$ as the multivariate Laplace prior

$$\pi\left(\beta_g \mid \eta\right) = C_g\exp\!\left(-\lambda_g\sqrt{\eta}\,\|\beta_g\|_{K_g}\right), \qquad \text{(A.2)}$$

where $C_g = \lambda_g^{m_g}\,\eta^{m_g/2}\,|K_g|^{1/2}\,2^{-(m_g+1)/2}(2\pi)^{-(m_g-1)/2}/\Gamma((m_g+1)/2)$. By the equality (12), the prior (A.2) can again be represented as a scale mixture of normals, and a Bayesian hierarchical model analogous to (14) can be constructed; the posterior full conditional distributions and the corresponding Gibbs sampler then follow in the same way as in Section 3.2. For the $\lambda_g^2$'s, we can use the empirical Bayes estimation method to update them without setting priors on the $\lambda_g^2$'s in the Bayesian hierarchical model.
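The $K_g$-norm and the adaptive group Lasso penalty term can be sketched in a few lines (illustrative code; with $K_g$ the identity, the norm reduces to the ordinary Euclidean norm):

```python
import numpy as np

def group_norm(beta_g, K_g):
    """||beta_g||_{K_g} = (beta_g' K_g beta_g)^{1/2} for positive definite K_g."""
    return float(np.sqrt(beta_g @ K_g @ beta_g))

def adaptive_group_penalty(beta_groups, K_mats, lams, sigma):
    """sum_g lam_g * ||beta_g||_{K_g} / sigma, the adaptive group Lasso penalty."""
    return sum(l * group_norm(b, K)
               for b, K, l in zip(beta_groups, K_mats, lams)) / sigma
```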

Table 1:
Simulation results of MMAD for different error distributions†. †Numbers in parentheses are standard deviations of the MMADs, obtained from 500 bootstrap resamplings. The bold numbers correspond to the smallest MMAD in each category.

Table 2:
MSE of the five different regularized methods for the land rent data.

We found that the five different methods all select $x_1$ and $x_2$ as important variables. Furthermore, in order to evaluate the performance of BAT.L together with the other methods, we randomly select 40 samples as the training set and the remaining 27 samples as the test set; the mean square errors on the test set are reported in Table 2 for the five regularized methods. As we can see from Table 2, BAT.L outperforms the other variable selection procedures on this data set.