Modified Least Trimmed Quantile Regression to Overcome Effects of Leverage Points

Quantile regression estimates are robust to outliers in the y direction but are sensitive to leverage points. The least trimmed quantile regression (LTQReg) method was put forward to overcome the effect of leverage points. The LTQReg method trims the highest residuals based on a specified trimming percentage. However, leverage points do not always produce high residuals, and hence, the trimming percentage should be determined by the contamination ratio of the data, not fixed in advance by the researcher. In this paper, we propose a modified least trimmed quantile regression method based on reweighted least trimmed squares. Robust Mahalanobis distance and GM6 weights based on Gervini and Yohai's (2003) cutoff points are employed to determine the trimming percentage and to detect leverage points. A simulation study and real data are used to investigate the performance of our proposed methods.


Introduction
Quantile regression (QReg) has received much attention since the seminal work of Koenker and Bassett [1]. It can be considered one of the important statistical breakthroughs of recent decades. The desirable properties of quantile regression have led to its application in wide areas of science, such as medicine, finance, economics, agriculture, and the environment [2,3]. QReg extends the mean regression model to the conditional quantiles of the response variable distribution. Therefore, QReg is able to provide a much more detailed picture of the stochastic relationship among random variables.
Consider the following regression model:

y_i = x_i' β + ε_i,  i = 1, ..., n,  (1)

where y_i is the response variable, x_i is a (k × 1) vector of covariates, β is a vector of unknown parameters, and ε_i is the error term. For any τ-quantile in the interval (0, 1), the parameter β_τ can be estimated consistently as the solution to the following optimization problem:

β̂_τ = argmin_β Σ_{i=1}^{n} ρ_τ(y_i − x_i' β),  (2)

where ρ_τ(.) is the check function, defined as

ρ_τ(u) = u (τ − I(u < 0)),  (3)

where I(.) denotes the indicator function. One of the important advantages of quantile regression is its insensitivity to outliers and to heavy-tailed error distributions. This robustness of QReg to outliers arises from the nature of the check function shown in (3) (see [3][4][5]). However, like the M-estimator in regression, QReg is not robust when the predictor variables contain outliers, which are called high leverage points (HLPs) [6]. There have been some attempts to overcome the effect of HLPs and maximize the breakdown point of QReg. Giloni et al. [7] proposed a weighting method to increase the breakdown point and cope with HLPs, based on the blocked adaptive computationally efficient outlier nominators (BACON) method proposed by Billor et al. (2000), in which a clean subset is chosen via their algorithm. The limitation of the weighting method is that it can only be used with small numbers of regressors (often one or two regressor variables). Rousseeuw and Hubert [8] proposed regression depth as an extended version of regression quantiles. They pointed out that depth quantiles are robust to HLPs. Adrover et al. [9] presented a robust estimation method that is unaffected by leverage points and, at the same time, maximizes the breakdown point. The disadvantages of the weighting method and depth quantiles are computational complexity and nonstandard asymptotic distributions (Neykov et al. [10]).
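As a small numerical illustration of the check function and the minimization in (2), the following Python sketch fits a median regression by directly minimizing the check loss. This is our own illustration, not the authors' implementation: a generic Nelder-Mead optimizer stands in for the linear-programming algorithms normally used for QReg, and the data-generating model is an assumed example.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, tau):
    """Koenker-Bassett check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

def qreg(X, y, tau=0.5):
    """Fit quantile regression by minimizing the check loss directly.
    Nelder-Mead is used because the objective is piecewise linear."""
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept column
    obj = lambda b: np.sum(check_loss(y - X1 @ b, tau))
    b0 = np.linalg.lstsq(X1, y, rcond=None)[0]   # least-squares starting point
    res = minimize(obj, b0, method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 10000})
    return res.x

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 100)
y[:10] += 25.0                                   # outliers in the y direction only
beta = qreg(x.reshape(-1, 1), y, tau=0.5)        # median regression stays near (1, 2)
```

Because the contamination here is only in the y direction, the median regression recovers coefficients close to the true (1, 2), illustrating the robustness property claimed above.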
Recently, the least trimmed QReg was proposed by Neykov et al. [10] to reduce the effects of HLPs. This method is a generalization of the location estimator proposed by Tableman [11] and of the least trimmed absolute deviation estimator proposed by Hawkins and Olive [12]. Neykov et al. [10] proved the consistency of the least trimmed quantile regression method and discussed the breakdown point of the estimators.
The limitation of this method is that the trimming percentage is a constant, so the amount of trimmed data may be lower or higher than the actual contamination percentage of the data. The least trimmed quantile method minimizes the quantile residuals in (2) over a subset of size h out of the sample size n. However, it is important to note that leverage points do not necessarily produce large residuals. Therefore, this method does not correctly detect high leverage points.
In this paper, we propose a new algorithm that develops the least trimmed quantile regression method and overcomes these disadvantages of the existing methods. The new algorithm integrates the reweighted least trimmed method proposed by Čížek [13] with QReg to determine the trimming percentage, and uses the robust Mahalanobis distance (RMD) to identify the HLPs. In addition, we employ the Gervini and Yohai [14] technique to compute the cutoff point and new weights for the QReg.
Neykov et al. [10] proposed the least trimmed quantile regression (LTQReg) as an efficient and robust method to overcome the effect of HLPs on QReg. LTQReg is defined as follows:

β̂_LTQ = argmin_β Σ_{i=1}^{h} (ρ_τ(ε(β)))_{i:n},  (4)

where (ρ_τ(ε(β)))_{1:n} ≤ ... ≤ (ρ_τ(ε(β)))_{n:n} are the ordered values of ρ_τ(ε_i(β)), with ρ_τ(.) defined as in (2) and (3); that is, only the h smallest check-function residuals enter the objective. Neykov et al. [10] proved that when the trimming constant is h = (n + N(X) + 1)/2, the breakdown point of the LTQReg estimator is asymptotically equal to 0.50, where N(X) denotes the maximal number of observations x_i lying on the same hyperplane. Müller [20] and Neykov et al. [10] pointed out that N(X) = p − 1 when the carriers are in general position.
The LTQReg method keeps the smallest quantile errors in order to reduce the influence of leverage points. In this situation, we would like to ask the following question: will the error values of QReg and LS be high for all leverage points? We answer this question with the following example. Let us consider the simple linear regression model

y_i = β_0 + β_1 x_i + ε_i.  (5)

In order to illustrate the effect of leverage points and outliers on the error term, 20% of the observations are contaminated by replacing the first 10 observations with contaminated ones. We consider three cases of contamination: outliers, HLPs, and both outliers and HLPs simultaneously. The first 10 observations of the explanatory variable and the dependent variable are contaminated as follows: x_i ∼ U(−50, 50) and y_i ∼ U(50, 100). Least squares (LS) and QReg at three quantiles (0.25, 0.50, and 0.75) were then applied to the data. In this example, we want to investigate whether LS and QReg produce high errors in all contamination scenarios, which would be required for LTQReg to work.
For all three contamination cases, the fitted residuals are plotted in Figures 1-3, which clarify the influence of outliers, HLPs, and both on LS and QReg at the different quantiles. In Figures 1 and 3, we can see clearly that when the data are contaminated by outliers, the first 10 observations have the highest residuals for both LS and QReg at the different quantiles. On the contrary, in Figure 2, when the data are contaminated by HLPs, the residuals of LS and QReg are not inflated by the HLPs. From Figures 1-3, we can conclude that outlying observations have a direct effect on the residuals, whereas leverage points need not have any effect on the residuals. Hence, the LTQReg method is not an effective method for reducing the effect of leverage points, because it is based on trimming the highest (n − h) residuals.
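The observation that leverage points need not produce large least-squares residuals can be reproduced with a few lines of numpy. This is an illustrative setup of our own (a single leverage cluster), not the exact data behind Figures 1-3:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(1, 10, n)
y = 1.0 + x + rng.normal(0, 0.5, n)
# Plant 10 bad leverage points: extreme in x, off the true line in y.
x[:10], y[:10] = 50.0, 0.0

# Ordinary least-squares fit.
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
res = y - X @ b

lev_abs = np.mean(np.abs(res[:10]))    # residuals at the leverage points
clean_abs = np.mean(np.abs(res[10:]))  # residuals at the clean points
```

Because the leverage cluster pulls the fitted line toward itself, the contaminated points end up with *smaller* average absolute residuals than the clean points, so residual-based trimming would discard the wrong observations.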

Modified Least Trimmed Quantile Regression (MLTQReg)
In this section, we discuss the modified LTQReg method, which determines the contamination rate of the data and the best trimming percentage. Three modified methods will be presented, all based on the reweighted least trimmed squares (RWLTS) method proposed by Čížek [13], which relies on hard rejection weights [16]; these are combined with the LTQReg method to robustify the weighted least squares step. The hard rejection weights in the RWLTS method are defined as

w_i = I(|u_i| < t_n),  (6)

where the u_i's are the standardized regression residuals and t_n > 0 is the cutoff point adapted from Gervini and Yohai [14]. The cutoff point is computed by comparing the empirical distribution function G_n^+ of the standardized absolute residuals with the distribution function G_0^+ of the absolute residuals under the assumed model. The fraction of unusual observations in the sample (d_n) can be measured as

d_n = sup_{t ≥ k} {G_0^+(t) − G_n^+(t)}^+,  (7)

where k = 2.5 [16] and {.}^+ denotes the positive part. Therefore, the t_n value is set to the (1 − d_n)th quantile of G_n^+(t):

t_n = G_n^{+(−1)}(1 − d_n).  (8)

The procedure of the reweighted least trimmed squares method [13] can be described in two steps. The first step determines the trimming constant h based on the weights given in (6):

h = Σ_{i=1}^{n} w_i.  (9)

The second step applies the LTS method with the trimming constant h computed in the first step.
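A numerical sketch of the weights and cutoff in (6)-(9) is given below under the standard-normal reference model, so the half-normal cdf plays the role of G_0^+. The handling of the empirical cdf at its jump points is one reasonable convention, an assumption of ours rather than the authors' exact computation:

```python
import numpy as np
from scipy.stats import norm

def hard_rejection_weights(u, k=2.5):
    """Hard rejection weights with the adaptive cutoff of Gervini and Yohai.
    u: standardized residuals, assumed ~ N(0, 1) under the model."""
    n = len(u)
    a = np.sort(np.abs(u))
    G0 = 2.0 * norm.cdf(a) - 1.0            # cdf of |N(0,1)| at the order statistics
    Gn_left = np.arange(n) / n              # empirical cdf just below each order statistic
    gap = np.where(a >= k, G0 - Gn_left, 0.0)
    dn = max(gap.max(), 0.0)                # estimated fraction of outliers, eq. (7)
    if dn == 0.0:
        return np.ones(n), np.inf           # nothing flagged: keep everything
    i0 = min(int(np.ceil(n * (1.0 - dn))), n) - 1
    tn = a[i0]                              # adaptive cutoff t_n, eq. (8)
    w = (np.abs(u) < tn).astype(float)      # weights, eq. (6)
    return w, tn

rng = np.random.default_rng(2)
u = np.concatenate([rng.normal(0, 1, 100), [8.0, 9.0, 10.0]])  # three gross outliers
w, tn = hard_rejection_weights(u)
h = int(w.sum())                            # trimming constant, eq. (9)
```

On this sample, the three planted outliers fall above t_n and receive weight zero, and h stays close to the number of clean observations: roughly the data-driven behavior the method is designed for.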
To increase the breakdown point of the proposed method, a high breakdown estimator such as LTS, LMS, or S is used as the initial estimate (see [13,14]), and the robust weights are then used to improve the efficiency.
Next, we describe three algorithms based on the RWLTS that improve the LTQReg [10].

Reweighted Least Trimmed Quantile Regression (RWLTQReg)

In this method, we combine the RWLTS with the LTQReg to determine the trimming constant. The algorithm can be described as follows:

Step 1. Take the LTQReg estimator as an initial estimate with a high breakdown point and compute the standardized residuals u_i for i = 1, ..., n.
Step 2. Calculate the hard rejection weights for the standardized residuals as

w_i^Res = I(|u_i| < t_n),  (10)

where t_n is the Gervini and Yohai [14] cutoff point shown in (8).
Step 3. Calculate the trimming constant h_n based on the weights in equation (10), from the formula h_n = Σ_{i=1}^{n} w_i^Res.
Step 4. Apply the LTQReg based on the algorithm proposed by Neykov et al. [10] to a subset of size h_n. This procedure can be described as follows: (i) Set r = 0 and select a subset of size h_n from the sample. (ii) For this subset, use QReg to estimate the coefficients β_τ^r. (iii) For all observations in the sample, compute the residuals and then order their check-function values ρ_τ(ε_(1)) ≤ ... ≤ ρ_τ(ε_(n)). (iv) Then set r = r + 1, and take as the new subset the h_n observations with the smallest ordered values. (v) For the new subset, repeat Steps (ii), (iii), and (iv). This procedure is repeated until convergence.
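Steps (i)-(v) amount to "concentration" steps on the check-function residuals. The following is a compact sketch under our own simplifying assumptions: a crude median-based initial subset, a fixed h, and a generic Nelder-Mead optimizer in place of the authors' QReg implementation:

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

def qreg_fit(X1, y, tau, b0):
    """Minimize the check loss from a given starting point."""
    obj = lambda b: np.sum(check_loss(y - X1 @ b, tau))
    return minimize(obj, b0, method="Nelder-Mead",
                    options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 10000}).x

def ltqreg_csteps(X, y, h, tau=0.5, max_iter=50):
    """Concentration steps: repeatedly fit QReg on the current subset and keep
    the h observations with the smallest check-function residuals."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    idx = np.argsort(np.abs(y - np.median(y)))[:h]   # crude initial subset
    for _ in range(max_iter):
        b0 = np.linalg.lstsq(X1[idx], y[idx], rcond=None)[0]
        beta = qreg_fit(X1[idx], y[idx], tau, b0)
        new_idx = np.argsort(check_loss(y - X1 @ beta, tau))[:h]
        if set(new_idx) == set(idx):                 # subset stable: converged
            break
        idx = new_idx
    return beta, np.sort(idx)

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)
y[:8] += 30.0                                        # 10% outliers in y
beta, subset = ltqreg_csteps(x.reshape(-1, 1), y, h=70)
```

With y-direction contamination the trimmed fit recovers the true slope, and the converged subset excludes the eight contaminated observations.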

Modified Least Trimmed Quantile Regression Based on RMD (RMD-LTQReg).
In this algorithm, we combine the least trimmed quantile regression with the RMD to detect the leverage points. The algorithm is presented as follows:

Step 1. Compute the RMD as follows:

RMD_i = sqrt( (x_i − T(X))' C(X)^{−1} (x_i − T(X)) ),

where T(X) and C(X) are the location and shape estimates of the minimum volume ellipsoid (MVE).
Step 2. Compute the leverage weights

w_i^RMD = I(RMD_i ≤ K),

where K is the cutoff point computed as follows:

K = med(RMD_i) + c × mad(RMD_i),

where mad(v_i) = med |v_i − med v_j| and c is a constant, 2 or 3.
Step 3. As in Steps 1 and 2 of the RWLTQReg algorithm, compute the residual weights w_i^Res.
Step 4. Find the final weights by combining w_i^RMD with w_i^Res as follows:

w_i = w_i^RMD × w_i^Res.

Step 5. Hence, the trimming constant is h_n = Σ_{i=1}^{n} w_i.
Step 6. Apply Step 4 of the RWLTQReg algorithm to a subset of size h_n from the sample, assigning selection probability zero to the leverage points to ensure that we do not start with a bad subset (one containing leverage points); see Rousseeuw and Van Driessen [21]. This means the condition w_i^RMD ≠ 0 is satisfied (the subset is clean of leverage points).
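The RMD weighting of Steps 1 and 2 can be sketched as follows. Since a full MVE fit is involved, we substitute a deliberately simplified stand-in (coordinatewise-median location and the covariance of the half-sample nearest to it); this stand-in, like the planted-leverage test data, is our own assumption for illustration:

```python
import numpy as np

def rmd_weights(X, c=3.0):
    """Robust Mahalanobis distances with median + c*mad cutoff (Step 2).
    The median/"clean half" covariance below is a crude stand-in for the
    MVE estimates T(X), C(X) used in the paper."""
    n, p = X.shape
    T = np.median(X, axis=0)                       # robust location
    d0 = np.linalg.norm(X - T, axis=1)
    core = X[np.argsort(d0)[: (n + p + 1) // 2]]   # half-sample nearest the median
    Ci = np.linalg.inv(np.cov(core, rowvar=False)) # robust shape (inverse)
    diff = X - T
    rmd = np.sqrt(np.einsum("ij,jk,ik->i", diff, Ci, diff))
    mad = np.median(np.abs(rmd - np.median(rmd)))
    K = np.median(rmd) + c * mad                   # cutoff point K
    return rmd, K, (rmd <= K).astype(float)        # w_i^RMD

rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(100, 2))
X[:10] = rng.normal(20, 0.5, size=(10, 2))         # planted leverage points
rmd, K, w = rmd_weights(X)
```

The planted leverage points receive RMD values far above K and hence w_i^RMD = 0, regardless of how small their regression residuals are.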

Modified Least Trimmed Quantile Regression Based on GM6 Method (GM6-LTQReg).
The GM-estimator was proposed by Schweppe (see [22]) to reduce the influence of leverage points. Adrover et al. [9] showed that the breakdown point of the GM-estimator is never higher than 1/(p + 1). Coakley and Hettmansperger [23] proposed the GM6-estimator to increase the breakdown point of the GM-estimator by using the least trimmed squares (LTS) estimator as the initial fit and the RMD based on the MVE to downweight leverage points. In this paper, we suggest using the GM6 weights to modify the LTQReg; this modification is given by the following algorithm:

Step 1. For i = 1, ..., n, compute an initial estimate of the coefficients and the corresponding residuals ε_i using the high breakdown LTQReg estimator of Neykov et al. [10].
Step 2. Compute the scale of the residuals as se = 1.4826 × median of the largest (n − p) of the |ε_i|.

Mathematical Problems in Engineering
Step 3. Determine the standardized residuals u_i = ε_i/(w_i^0 × se), where w_i^0 is an initial leverage-based weight computed from the robust distances as

w_i^0 = min{1, c_p/RMD_i^2},

where c_p is a chi-square quantile cutoff (for example, the 97.5% point of χ_p^2).

Step 4. The hard rejection weights for the standardized residuals can be computed as

w_i = I(|u_i| < t_n),

where t_n is the Gervini and Yohai [14] cutoff point shown in (8).
Step 5. Hence, the trimming parameter is computed as h_n = Σ_{i=1}^{n} w_i.

Step 6. Apply Step 4 of the RWLTQReg algorithm to a subset of size h_n.
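Steps 1-5 of the GM6 weighting can be sketched as below. Several pieces are simplifying assumptions of ours rather than the authors' exact choices: the same crude MVE stand-in as before, a χ²-based initial weight in the Coakley-Hettmansperger style, and a fixed 2.5 rejection cutoff in place of the adaptive t_n of (8):

```python
import numpy as np
from scipy.stats import chi2

def gm6_weights(X, resid):
    """Sketch of the GM6-LTQReg weighting (Steps 1-5).
    resid: residuals from an initial high-breakdown fit (LTQReg in the paper)."""
    n, p = X.shape
    # Robust squared distances via a crude MVE stand-in.
    T = np.median(X, axis=0)
    core = X[np.argsort(np.linalg.norm(X - T, axis=1))[: (n + p + 1) // 2]]
    Ci = np.linalg.inv(np.cov(core, rowvar=False))
    diff = X - T
    rmd2 = np.einsum("ij,jk,ik->i", diff, Ci, diff)
    # Initial leverage weights, Coakley-Hettmansperger style (assumed 97.5% cutoff).
    w0 = np.minimum(1.0, chi2.ppf(0.975, df=p) / rmd2)
    # Robust residual scale from the largest n - p absolute residuals (Step 2).
    se = 1.4826 * np.median(np.sort(np.abs(resid))[-(n - p):])
    u = resid / (w0 * se)                # Step 3: leverage inflates |u_i|
    w = (np.abs(u) < 2.5).astype(float)  # Step 4: hard rejection (fixed cutoff here)
    return w, int(w.sum())               # Step 5: h_n = sum of weights

rng = np.random.default_rng(5)
n = 100
X = rng.normal(0, 1, size=(n, 2))
X[:10] = rng.normal(15, 0.5, size=(10, 2))   # leverage points in x only
beta_true = np.array([1.0, 1.0])
y = X @ beta_true + rng.normal(0, 1, n)      # leverage rows still fit the line
resid = y - X @ beta_true                    # so their raw residuals are small
w, h_n = gm6_weights(X, resid)
```

The key mechanism is visible in Step 3: a leverage point gets a tiny w_i^0, which inflates its standardized residual u_i and triggers rejection even when its raw residual is small.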

Simulation Study
In this section, a Monte Carlo simulation study is presented to compare the performance of the existing methods LTQReg [10] and QReg [1] with our proposed methods RWLTQReg, RMD-LTQReg, and GM6-LTQReg.
Following Neykov et al. [10], two explanatory variables (x_i1 and x_i2) are generated with large sample sizes (n = 100 and 200) from a classical heteroscedastic multiple linear regression model, where we assume the coefficients b_0 = b_1 = b_2 = 1 and the error term ε_i is distributed as N(0, 1). Two experiments are considered with different distributions for the explanatory and response variables and with three levels of contamination (δ = 10%, 20%, and 30%). The trimming percentage for the LTQReg is set to 0.20 and 0.30.

The First Experiment. The explanatory variables are generated from N(0, 1). The variables are contaminated at different percentages: the explanatory variables are contaminated as x_ij ∼ N(−10, 3), j = 1, 2, and the response variable is contaminated as y_i ∼ N(20, 3). The contamination is done by replacing clean data with outlying data in both the explanatory and response variables. Let m = δ × n; the explanatory variables are contaminated by replacing the clean observations i = integer(m/3), ..., m with outlying observations, whereas the response variable y_i is contaminated by replacing the clean observations i = 1, ..., integer(2m/3) with outlying observations. Figures 4 and 5 show the spread of the generated data. It can be seen clearly that these data contain leverage points (outlying in the x direction), outliers (outlying in the y direction), and influential observations (outlying in both the x and y directions at the same time).
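The replacement-contamination scheme above can be sketched in a few lines. Two hedges apply: the clean regression y = 1 + x1 + x2 + ε is a generic stand-in (the paper's exact heteroscedastic form is not reproduced here), and the second parameter of N(., 3) is treated as a standard deviation:

```python
import numpy as np

def generate_first_experiment(n, delta, rng):
    """Sketch of the first-experiment contamination scheme.
    Returns (x, y) with y-outliers in rows 1..2m/3 and x-leverage
    points in rows m/3..m; the overlap is outlying in both directions."""
    x = rng.normal(0, 1, size=(n, 2))
    y = 1.0 + x[:, 0] + x[:, 1] + rng.normal(0, 1, n)
    m = int(delta * n)
    # Outliers in y: replace observations i = 1, ..., integer(2m/3).
    y[: 2 * m // 3] = rng.normal(20, 3, 2 * m // 3)
    # Leverage points in x: replace observations i = integer(m/3), ..., m.
    x[m // 3 : m] = rng.normal(-10, 3, size=(m - m // 3, 2))
    return x, y

rng = np.random.default_rng(6)
x, y = generate_first_experiment(100, 0.20, rng)   # 20% contamination
```

With n = 100 and δ = 0.20, rows 1-13 are y-outliers and rows 7-20 are x-leverage points, so rows 7-13 are the influential observations outlying in both directions.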

The Second Experiment. In this experiment, the distribution of the explanatory variables is set to a normal distribution.
At the three quantiles 0.25, 0.50, and 0.75, the generated data at the different contamination percentages (10%, 20%, and 30%) are fitted via the proposed methods (RWLTQReg, RMD-LTQReg, and GM6-LTQReg) and the existing methods (QReg and LTQReg). The root mean squared error, RMSE = sqrt( (1/R) Σ (β̂_j − β_j^true)^2 ) over the R replications, and the mean absolute error (MAE) are computed to evaluate our proposed methods.
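The evaluation criteria can be written compactly as follows (an illustrative helper; the function and argument names are our own):

```python
import numpy as np

def rmse_mae(estimates, beta_true):
    """RMSE and MAE per coefficient over R replications.
    estimates: (R, p) array of fitted coefficient vectors."""
    err = np.asarray(estimates) - np.asarray(beta_true)
    rmse = np.sqrt(np.mean(err ** 2, axis=0))   # root mean squared error
    mae = np.mean(np.abs(err), axis=0)          # mean absolute error
    return rmse, mae

# Two toy replications of a 2-coefficient fit against the truth (2, 2).
rmse, mae = rmse_mae([[1.0, 2.0], [3.0, 2.0]], [2.0, 2.0])
```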
Tables 1 and 2 report the RMSE and MAE values for the first and second experiments. In these tables, the values for all methods at the three quantiles are shown in the rows, and the three levels of contamination are shown in the columns. LTQReg (20%) and LTQReg (30%) denote the least trimmed quantile method with 20% and 30% trimming, respectively. The results in these tables are the averages over 100 replications of the two experiments of the simulation study. The RMSE and MAE values of the QReg method at all levels of contamination are the highest. That is, this method is more affected than the other methods by outlying data that fall in both the x and y directions. On the contrary, we can see that the proposed method RMD-LTQReg has the lowest RMSE and MAE values in most cases. This indicates that the performance of RMD-LTQReg is better than the others. The RWLTQReg, in turn, performs better than the remaining methods. In addition, at the 20% and 30% contamination levels, the GM6-LTQReg performs better than LTQReg (20%) in most cases and better than LTQReg (30%) at the 30% contamination level.
In Figures 6-9, we can see that the SE values of the parameters estimated by the RMD-LTQReg method are the smallest in almost all cases, which indicates that the performance of the RMD-LTQReg method is the best among all studied methods. These figures also show clearly that the QReg method has high SE values, leading to the worst performance. In addition, the remaining methods showed close results in most cases and varied in some others; therefore, it is difficult to determine which of them is better than the others.

Real Data Applications
In this section, the "Star Cluster CYG OB1" dataset is considered to verify the performance of our proposed methods.
This dataset contains 47 observations with one explanatory variable, the logarithm of the effective temperature at the surface of a star; the response variable is the logarithm of the star's light intensity. Rousseeuw and Leroy [24] showed that the scatterplot of this dataset reveals two groups of observations. The first group includes the majority of the data, 43 stars, whereas the second group includes the remaining four stars (observations 11, 20, 30, and 34), which are classified as leverage points [24].
In this example, we consider three quantiles (0.25, 0.50, and 0.75) to examine the robustness of our proposed methods. Table 3 presents the RMSE and MAE values for all proposed and existing estimation methods at each quantile. It is clear that the QReg method has the highest RMSE and MAE values, whereas the RMD-LTQReg method, followed by the RWLTQReg method, has the best performance, as these methods have the smallest RMSE and MAE values; moreover, the RMD-LTQReg method detects the HLPs correctly. Also, we can see that LTQReg (30%) is better than both GM6-LTQReg and LTQReg (20%). Figure 10 shows the fitted residuals of the regression quantiles for the existing and proposed methods. We can see that the QReg method is dramatically affected by the leverage points. Even though the RWLTQReg has the lowest RMSE and MAE values in some cases, it is also affected by leverage points: it trims the observations that have high residuals but fails to trim the leverage points. The LTQReg (20%), LTQReg (30%), and GM6-LTQReg methods give similar fits in the chart, showing that these methods are also affected by the HLPs, though less so than QReg. The proposed RMD-LTQReg method shows good performance owing to its ability to trim the leverage points.

Conclusions
In this paper, we proposed a new estimation method to overcome the impact of leverage points in data, called modified least trimmed quantile regression. In particular, we proposed three methods based on the hard rejection weights used in reweighted least trimmed squares (Čížek [13]) to determine the trimming constant and to reduce the influence of leverage points. In our proposed methods, the cutoff point of Gervini and Yohai [14] is employed for QReg. Moreover, reweighted least trimmed squares, GM6 weights, and the robust Mahalanobis distance are developed for quantile regression.
To investigate the performance of our proposed methods, a simulation study and real data are considered.
The results indicate that the LTQReg performs poorly on data containing leverage points, because it trims observations that have high residuals, whereas leverage points do not always have high residuals. Although the RWLTQReg performs well, as evidenced by its small RMSE, MAE, and SE values, it is not able to remove the leverage points. The same holds for the GM6-LTQReg: even though it is able to determine the trimming parameter, it is also affected by leverage points. From the results, it is clear that the RMD-LTQReg method is the best estimation method and can avoid the effect of leverage points.

Data Availability
The "Star Cluster CYG OB1" dataset is considered to verify the performance of our proposed method. This dataset comes from the Hertzsprung-Russell diagram of the star cluster CYG OB1 and has been used by many researchers, such as Rousseeuw and Leroy [24], Adrover et al. [9], and Neykov et al. [10]. It is available in the package "robustbase" in R.