Penalized Maximum Likelihood Method for a Class of Skewed Data Analysis

An extension of standard likelihood-based estimation and variable selection procedures for linear regression models under the skew-normal or the skew-t distribution is developed. This class of models provides a useful generalization of symmetric linear regression models, since the random error distributions cover symmetric as well as asymmetric and heavy-tailed distributions. A generalized expectation-maximization (EM) algorithm is developed for computing the ℓ1 penalized estimator. The efficacy of the proposed methodology and algorithm is demonstrated on simulated data.


Introduction
In recent years, nonnormal distributions have received substantial interest in the statistics literature. The growing need for more flexible tools to analyze data that exhibit nonnormal features, including asymmetry, multimodality, and heavy tails, has led to intense development of nonnormal model-based methods. Many such distributions have been successfully applied to numerous datasets from a wide range of fields, including the medical sciences, bioinformatics, environmetrics, engineering, economics, and finance. Some recent applications of skew-normal and skew-t models include [1-5].
The rich literature and active discussion of skew distributions, as well as regression analysis based on them, were initiated by the pioneering work of [6], in which the univariate skew-normal distribution was introduced. Following its generalization to the multivariate skew-normal distribution in [7], the number of contributions has grown rapidly. The idea of introducing additional parameters to regulate skewness in a distribution was subsequently extended to other parametric families. For a comprehensive survey of skew distributions, see, for example, the articles [4, 8-10].
Besides the skew-normal distribution, which plays a central role in these developments, the skew-t distribution has also received much attention. As a natural extension of the Student's t distribution, the skew-t distribution retains reasonable tractability and is more robust against outliers than the skew-normal distribution.
On the other hand, the Lasso is a popular method for regression analysis that uses an ℓ1 penalty to achieve a sparse solution. This idea has been broadly applied, for example, to generalized linear models by [11] and to Cox's proportional hazards model for survival data by [12]. Inspired by the important work of [13], we introduce a new approach to regression analysis based on a class of skew distributions by taking advantage of sparse penalization methods. We propose to fit an ℓ1 penalized linear model, with the penalty imposed on the regression coefficients of the location parameter of the skew distribution. Due to the sparse shrinkage property of the ℓ1 penalty, some components of the regression coefficient vector are estimated as exactly zero when the tuning parameter is properly chosen.
The rest of the paper is organized as follows. In Section 2, we introduce the skew-normal and skew-t distributions and present some of their properties. Section 3 outlines the EM-type algorithm and the penalized estimator for the proposed skew-distribution models. In Section 4, a method for choosing the tuning parameter is developed. The proposed methodology is illustrated in Section 5 through a simulation study.

The Skew-Normal Distribution and the Skew-t Distribution

The Skew-Normal Distribution. A random variable Y is said to follow the univariate skew-normal distribution with location parameter μ, scale parameter σ² > 0, and skewness parameter λ, written Y ∼ SN(μ, σ², λ), if its density function is

f(y) = (2/σ) φ((y − μ)/σ) Φ(λ(y − μ)/σ),

where φ(⋅) and Φ(⋅) denote the standard normal density function and cumulative distribution function, respectively. Note that if λ = 0, the density of Y reduces to the N(μ, σ²) density. If Y ∼ SN(μ, σ², λ), it admits the hierarchical (stochastic) representation

Y = μ + Δτ + Γ^{1/2} T₁,  τ = |T₀|,

where Δ = σδ, Γ = σ²(1 − δ²), δ = λ/√(1 + λ²), and T₀, T₁ are independent standard normal random variables. Equivalently, Y | τ ∼ N(μ + Δτ, Γ), with τ following the standard half-normal distribution. This representation is used for the EM algorithm in Section 3.
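For reference, the density and the stochastic representation above translate directly into code; the following is a minimal Python sketch (the function names sn_pdf and sn_rvs are ours, not from the paper).

```python
import numpy as np
from scipy.stats import norm

def sn_pdf(y, mu, sigma2, lam):
    """Density of SN(mu, sigma^2, lambda): (2/sigma) * phi(z) * Phi(lambda * z)."""
    sigma = np.sqrt(sigma2)
    z = (np.asarray(y) - mu) / sigma
    return 2.0 / sigma * norm.pdf(z) * norm.cdf(lam * z)

def sn_rvs(mu, sigma2, lam, size, rng=None):
    """Draws from SN(mu, sigma^2, lambda) via Y = mu + Delta*|T0| + sqrt(Gamma)*T1."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(sigma2)
    delta = lam / np.sqrt(1.0 + lam ** 2)
    t0, t1 = rng.standard_normal(size), rng.standard_normal(size)
    return mu + sigma * (delta * np.abs(t0) + np.sqrt(1.0 - delta ** 2) * t1)
```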

The Skew-t Distribution.
To achieve a higher degree of excess kurtosis and to accommodate heavy-tailed data, skew-t distributions were introduced by [14]. A univariate random variable Y follows the skew-t distribution, Y ∼ ST(μ, σ², λ, ν), if its density function is

f(y) = (2/σ) t_ν(d) T_{ν+1}(λ d √((ν + 1)/(ν + d²))),

where d = (y − μ)/σ, t_ν(⋅) and T_ν(⋅) denote, respectively, the density function and distribution function of a standard Student's t distribution with ν degrees of freedom, and the parameters μ, σ², and λ are as above. Clearly, if Y ∼ ST(μ, σ², λ, ν), it has the stochastic representation

Y = μ + σ Z / √U,

where Z ∼ SN(0, 1, λ) and U ∼ Γ(ν/2, ν/2) independently, and Γ(a, b) denotes the gamma distribution with shape parameter a and rate parameter b.
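Similarly, the stochastic representation Y = μ + σZ/√U gives a simple way to simulate skew-t variates; the sketch below assumes that representation (the function name st_rvs is illustrative).

```python
import numpy as np

def st_rvs(mu, sigma2, lam, nu, size, rng=None):
    """Draws from ST(mu, sigma^2, lambda, nu) via Y = mu + sigma * Z / sqrt(U),
    where Z ~ SN(0, 1, lambda) and U ~ Gamma(nu/2, rate = nu/2), independent."""
    rng = rng or np.random.default_rng()
    delta = lam / np.sqrt(1.0 + lam ** 2)
    t0, t1 = rng.standard_normal(size), rng.standard_normal(size)
    z = delta * np.abs(t0) + np.sqrt(1.0 - delta ** 2) * t1     # Z ~ SN(0, 1, lambda)
    u = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=size)    # rate nu/2 -> scale 2/nu
    return mu + np.sqrt(sigma2) * z / np.sqrt(u)
```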

Regression Model of Skew Distribution and EM Algorithm
In this section, we consider the linear regression model in which the observations follow skew-normal or skew-t distributions. The linear regression model is defined as

y_i = x_i^T β + ε_i,  i = 1, ..., n,

where the y_i are responses, β = (β_1, ..., β_p)^T is the vector of regression coefficients, x_i is a vector of explanatory variable values, and the random error ε_i follows SN(0, σ², λ) or ST(0, σ², λ, ν), respectively, so that the error distribution has location parameter zero. It follows that the log-likelihood function of the original parameter θ = (β, σ², λ) or θ = (β, σ², λ, ν), given the observed sample y = (y_1, ..., y_n)^T from the SN or ST model, is

ℓ(θ) = Σ_{i=1}^n log f(y_i; θ),

where f is of the form φ(⋅)Φ(⋅) or t_ν(⋅)T_{ν+1}(⋅) as in Section 2, according to the SN or ST distribution, respectively. The Newton-Raphson method could be applied directly to obtain the ML estimate from this log-likelihood, but in order to obtain a more efficient approach to estimation and variable selection for skew models, we discuss a more elaborate technique for finding the ML estimate based on an EM algorithm.

Firstly, for the skew-normal distribution, let τ = (τ_1, ..., τ_n)^T denote the latent half-normal variables of the hierarchical representation in Section 2.1, and write Δ = σδ and Γ = σ²(1 − δ²) with δ = λ/√(1 + λ²). The complete-data log-likelihood of the modified parameter φ = (β, Δ, Γ) associated with (y, τ) is then

ℓ_c(φ) = c − (n/2) log Γ − (1/(2Γ)) Σ_{i=1}^n (y_i − x_i^T β − Δτ_i)²,

where c is a constant that is independent of φ. Then consider the penalized log-likelihood defined by

PLL(φ) = ℓ(φ) − n Σ_{j=1}^p p_γ(|β_j|),

where ℓ(φ) is the observed-data log-likelihood written in terms of φ and p_γ(⋅) is a nonnegative penalty function; we use the Lasso penalty p_γ(|β_j|) = γ|β_j|, where γ is the tuning parameter and is not included in φ. The Lasso estimator is then defined as the maximizer of the penalized log-likelihood; that is,

φ̂ = argmax_φ PLL(φ).

Clearly, τ = (τ_1, ..., τ_n) can be viewed as "missing data"; in the following, we derive a generalized expectation-maximization algorithm for computing the ℓ1 penalized estimator. It turns out that the penalized estimator can be computed by iterative Lasso-penalized least squares. Set

Q(φ | φ^(k)) = E[ℓ_c(φ) | y, φ^(k)]

as the Q-function over φ at iteration step k, where E[⋅ | y, φ^(k)] denotes the conditional expectation over τ given y, evaluated at the current estimate φ^(k).
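Before turning to the E- and M-steps, it may help to see the observed-data log-likelihood and the penalized objective written out in code; the following Python sketch assumes the skew-normal case, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def sn_loglik(beta, sigma2, lam, X, y):
    """Observed-data log-likelihood for y_i = x_i' beta + e_i with e_i ~ SN(0, sigma^2, lambda)."""
    sigma = np.sqrt(sigma2)
    z = (y - X @ beta) / sigma
    return np.sum(np.log(2.0) - np.log(sigma) + norm.logpdf(z) + norm.logcdf(lam * z))

def penalized_loglik(beta, sigma2, lam, X, y, gamma):
    """Lasso-penalized objective PLL = loglik - n * gamma * sum(|beta_j|)."""
    return sn_loglik(beta, sigma2, lam, X, y) - len(y) * gamma * np.sum(np.abs(beta))
```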
At the E-step, by using the properties of the conditional expectation of the truncated normal distribution, we obtain

ê_{1,i} = E(τ_i | y_i, φ^(k)) = μ_i^(k) + M^(k) φ(μ_i^(k)/M^(k)) / Φ(μ_i^(k)/M^(k)),
ê_{2,i} = E(τ_i² | y_i, φ^(k)) = (μ_i^(k))² + (M^(k))² + μ_i^(k) M^(k) φ(μ_i^(k)/M^(k)) / Φ(μ_i^(k)/M^(k)),

where

μ_i^(k) = Δ^(k)(y_i − x_i^T β^(k)) / (Γ^(k) + (Δ^(k))²),  (M^(k))² = Γ^(k) / (Γ^(k) + (Δ^(k))²),

since τ_i given (y_i, φ^(k)) follows a normal distribution with these parameters truncated to (0, ∞). Then, combining the Q-function and the Lasso penalty, we set

Q_p(φ | φ^(k)) = Q(φ | φ^(k)) − nγ Σ_{j=1}^p |β_j|

as the modified Q-function.
At the M-step, update φ by maximizing Q_p(φ | φ^(k)) over φ in a sequence of conditional maximization steps. Firstly, the update of β solves a Lasso-penalized least squares problem with the working responses ỹ_i = y_i − Δ^(k) ê_{1,i}:

β^(k+1) = argmin_β (1/(2Γ^(k))) Σ_{i=1}^n (ỹ_i − x_i^T β)² + nγ Σ_{j=1}^p |β_j|;

then we obtain closed-form expressions for the other parameters,

Δ^(k+1) = Σ_{i=1}^n ê_{1,i}(y_i − x_i^T β^(k+1)) / Σ_{i=1}^n ê_{2,i},
Γ^(k+1) = (1/n) Σ_{i=1}^n [(y_i − x_i^T β^(k+1))² − 2Δ^(k+1) ê_{1,i}(y_i − x_i^T β^(k+1)) + (Δ^(k+1))² ê_{2,i}].

The above procedure can be summarized as follows.
For the skew-normal model:
Step 1: set initial values for β, σ², λ, and the tuning parameter γ.
Step 2: compute the values of ê_{1,i} and ê_{2,i} at the current iteration step as the "working observed values."
Step 3: solve the penalized least squares problem for the parameter β and obtain its new Lasso estimate.
Step 4: compute the new estimates of Δ and Γ (equivalently, σ² and λ).
For the skew-t model, Steps 2-4 are modified as follows.
Step 2: compute the values of ŝ_{1,i} and ŝ_{2,i}, together with the additional conditional moments required by the ST hierarchical representation, at the current iteration step as the "working observed values."
Step 3: solve the penalized least squares problem for the parameter β and obtain its new Lasso estimate.
Step 4: compute the new estimates of σ² and λ directly and then obtain the new estimate of ν.
Finally, in both cases:
Step 5: choose the tuning parameter γ by GCV.
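To make the iteration concrete, the following is a minimal Python sketch of the skew-normal case. It assumes the hierarchical representation and the truncated-normal moments given above, delegates the penalized β-update to scikit-learn's Lasso solver, and uses an illustrative mapping between the tuning parameter γ and Lasso's alpha as well as a simple convergence rule; it is a sketch of the idea rather than the paper's exact implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def em_lasso_skew_normal(X, y, gamma, n_iter=200, tol=1e-6):
    """EM with an l1-penalized M-step for y_i = x_i' beta + e_i, e_i ~ SN(0, sigma^2, lambda).

    Uses Delta = sigma*delta, Gamma = sigma^2*(1 - delta^2), delta = lambda/sqrt(1 + lambda^2);
    `gamma` plays the role of the tuning parameter."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # initial values
    Gamma, Delta = np.var(y - X @ beta), 0.0
    for _ in range(n_iter):
        beta_old = beta.copy()
        # ---- E-step: moments of tau_i | y_i (normal truncated to (0, inf)) ----
        e = y - X @ beta
        M2 = Gamma / (Gamma + Delta ** 2)
        mu_tau = Delta * e / (Gamma + Delta ** 2)
        M = np.sqrt(M2)
        ratio = norm.pdf(mu_tau / M) / norm.cdf(mu_tau / M)
        e1 = mu_tau + M * ratio                      # E(tau_i | y_i)
        e2 = mu_tau ** 2 + M2 + mu_tau * M * ratio   # E(tau_i^2 | y_i)
        # ---- M-step: Lasso-penalized least squares for beta ----
        y_work = y - Delta * e1                      # working responses
        alpha = gamma * Gamma                        # illustrative mapping to sklearn's scaling
        beta = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000).fit(X, y_work).coef_
        # ---- closed-form updates of Delta and Gamma ----
        e = y - X @ beta
        Delta = np.sum(e1 * e) / np.sum(e2)
        Gamma = np.mean(e ** 2 - 2.0 * Delta * e1 * e + Delta ** 2 * e2)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    sigma2 = Gamma + Delta ** 2
    lam = Delta / np.sqrt(Gamma)
    return beta, sigma2, lam
```

The skew-t case follows the same template, with the gamma mixing variable entering the E-step moments and an additional one-dimensional update for ν.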

Choice of Tuning Parameters
We now consider the choice of the tuning parameter. In applications, cross-validation (CV) or generalized cross-validation (GCV) is often used for choosing tuning parameters. Following the examples of [11, 15], we develop a componentwise deviance-based GCV criterion for the proposed models.
Let φ̃ be the EM estimate of the parameters φ obtained without penalty. For a given value of the tuning parameter γ, let β̂_γ be the maximum penalized likelihood estimate of the regression coefficients obtained by fixing the remaining components of φ at their values in φ̃. Denote the deviance function by

D(γ) = 2{ℓ(β̃) − ℓ(β̂_γ)},

where ℓ(⋅) is the log-likelihood regarded as a function of β with the other components fixed. Further, let ℓ''(β̂_γ) denote the second derivative of the log-likelihood function with respect to β evaluated at β̂_γ. We define a GCV criterion of the model as

GCV(γ) = D(γ) / [n{1 − e(γ)/n}²],

where e(γ) is the effective number of regression coefficients. It is given by

e(γ) = tr{[ℓ''(β̂_γ) − n Σ_γ(β̂_γ)]^{-1} ℓ''(β̂_γ)},

where Σ_γ(β̂_γ) = diag{p'_γ(|β̂_1|)/|β̂_1|, ..., p'_γ(|β̂_p|)/|β̂_p|}. The tuning parameter γ is chosen by minimizing GCV(γ).
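In practice, GCV(γ) is evaluated over a grid of candidate values. The sketch below assumes user-supplied fitting and deviance routines (fit_fn and deviance_fn are placeholders) and uses the number of nonzero coefficients as a simple surrogate for e(γ) instead of the trace formula above.

```python
import numpy as np

def select_gamma_by_gcv(X, y, grid, fit_fn, deviance_fn):
    """Choose the tuning parameter minimizing GCV(g) = D(g) / (n * (1 - e(g)/n)^2).

    fit_fn(X, y, g) should return the penalized coefficient estimate beta_hat;
    deviance_fn(X, y, beta_hat) should return the deviance D(g).
    The number of nonzero coefficients is used here as a surrogate for e(g)."""
    n = len(y)
    scores = []
    for g in grid:
        beta_hat = fit_fn(X, y, g)
        e_g = np.count_nonzero(beta_hat)
        scores.append(deviance_fn(X, y, beta_hat) / (n * (1.0 - e_g / n) ** 2))
    best = int(np.argmin(scores))
    return grid[best], np.asarray(scores)
```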

Simulation Study
In order to investigate the practical performance of our methodology, we undertake a simulation study to examine the behavior of estimation and variable selection. The dataset is generated as follows: let x_i = (x_{i1}, ..., x_{ip})^T be the covariate vector, where p = 30 and x_{i1}, ..., x_{ip} are independent and uniformly distributed on (−1, 1). Moreover, assume the response y_i is generated as

y_i = x_i^T β_0 + ε_i,

where β_0 is a sparse vector of true coefficients and ε_i is distributed according to one of the following five cases.
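For concreteness, one such configuration can be generated as follows; the sample size, the nonzero pattern of β_0, and the choice of a skew-normal error law below are illustrative only, since the study itself considers five different error settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 30                                     # n is illustrative; p = 30 as in the study
X = rng.uniform(-1.0, 1.0, size=(n, p))            # covariates ~ Uniform(-1, 1)

beta0 = np.zeros(p)
beta0[:3] = [2.0, -1.5, 1.0]                       # illustrative sparse true coefficients

# one illustrative error setting: skew-normal errors via the stochastic representation
lam, sigma2 = 3.0, 1.0
delta = lam / np.sqrt(1.0 + lam ** 2)
eps = np.sqrt(sigma2) * (delta * np.abs(rng.standard_normal(n))
                         + np.sqrt(1.0 - delta ** 2) * rng.standard_normal(n))

y = X @ beta0 + eps                                # responses
```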
Under the above settings, we fit three linear regression models, based on the normal, skew-normal, and skew-t distributions, to each dataset.
Table 1 shows the results of the simulation study, where "No." denotes the number of nonzero components of β̂, which reflects the effect of variable selection; "AE" denotes the average absolute error, (1/p) Σ_{j=1}^p |β_{0j} − β̂_j|, which reflects the accuracy of the estimates; and γ is the selected tuning parameter.
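In code, the two summaries are straightforward to compute from a fitted coefficient vector (variable names are illustrative):

```python
import numpy as np

def selection_summaries(beta0, beta_hat):
    """No. = number of nonzero estimated coefficients; AE = (1/p) * sum_j |beta0_j - beta_hat_j|."""
    num_nonzero = int(np.count_nonzero(beta_hat))
    ae = float(np.mean(np.abs(beta0 - beta_hat)))
    return num_nonzero, ae
```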
Examination of Table 1 reveals that (i) when the true distribution is normal, the differences among the three models are very small and the coefficients β can be recovered correctly; (ii) when the true distribution is nonnormal, the normal-based models fit the data poorly: not only does the number of nonzero components increase, but the average error also becomes large; (iii) when the true distribution is nonnormal, the models based on the skew distributions improve the fit relative to the normal-based models. From Figure 1(c), it is clear that the ST model is the most appropriate for fitting this dataset.

Conclusion
In this paper, we have proposed a class of linear regression models based on asymmetric distributions. An EM-type algorithm and Lasso-penalized estimates are developed by exploiting the statistical properties of the model, and they can be implemented efficiently in standard software such as SAS, R, and Matlab. We have shown that ℓ1 penalized maximum likelihood estimation can be carried out via a generalized EM algorithm that is equivalent to iterative ℓ1 penalized least squares for some of the parameters. We have observed that when the data are generated from a true skew-distribution model, the ℓ1 penalized models based on the EM algorithm not only identify the zero coefficients but are also significantly more accurate than ordinary maximum likelihood estimation.
Moreover, given recent advances in computational technology, it would be worthwhile to pursue Bayesian treatments via Markov chain Monte Carlo (MCMC) sampling methods in the context of this paper.

Figure 1: (a)-(c) Trace plots of the regression coefficients and profile of the tuning parameter.

Table 1: Results of the simulation study.