A New Robust Diagnostic Plot for Classifying Good and Bad High Leverage Points in a Multiple Linear Regression Model

Identification of high leverage points is crucial because they are responsible for inaccurate prediction and invalid inferential statements, as they have a large impact on the computed values of various estimates. It is essential to classify high leverage points into good and bad leverage points because only the bad leverage points have an undue effect on the parameter estimates. It is now evident that when a group of high leverage points is present in a data set, the existing robust diagnostic plot fails to classify them correctly. This problem is due to the masking and swamping effects. In this paper, we propose a new robust diagnostic plot that correctly classifies good and bad leverage points by reducing both masking and swamping effects. The formulation of the proposed plot is based on the Modified Generalized Studentized Residuals. We investigate the performance of our proposed method through a Monte Carlo simulation study and some well-known data sets. The results indicate that the proposed method improves the rate of detection of bad leverage points and also reduces the swamping and masking effects.


Introduction
In regression problems, several types of outliers exist, such as residual outliers, vertical outliers, and high leverage points. Any observation that has a large residual is referred to as a residual outlier. Vertical outliers (VO), or y-outliers, are those observations which are extreme or outlying in the y-coordinate. On the other hand, high leverage points (HLPs) are those observations which are extreme or outlying in the x-coordinate. HLPs can be classified into good leverage points (GLPs) and bad leverage points (BLPs). GLPs are those outlying observations in the explanatory variables that follow the pattern of the majority of the data, while BLPs are the opposite. BLPs have a large impact on the computed values of various estimates. On the other hand, GLPs contribute to the efficiency of an estimate (see [1][2][3][4]). As such, only BLPs should be weighted down, while GLPs should not be given low weights in the computation of the weighting function in any robust method. Nonetheless, it is now evident that most robust methods attempt to reduce the effect of outliers by weighting them down, irrespective of whether they are good or bad leverage points.
There are a number of good papers in the literature on the detection of HLPs (see [5][6][7][8]). However, those detection methods are focused mainly on the identification of HLPs, without taking their classification into good and bad into consideration. It is very important to make this classification, as only the BLPs are responsible for misleading conclusions about the fit of the regression model.
It is not easy to capture the existence of several types of outliers in multiple regression analysis by using a graphical method ([6]). If only one independent variable is being considered, the four types of data points can easily be observed from a scatter plot of y against the x variable. However, for more than one predictor variable, it is difficult to detect these outliers from a scatter plot. Not much work has focused on classifying HLPs into good and bad leverage points. Rousseeuw and Van Zomeren [2] proposed a robust diagnostic plot, or outlier map, which is more efficient than the nonrobust plot for classifying observations into four types of data points, namely, regular or good observations, vertical outliers, GLPs, and BLPs. We suspect that this plot fails to detect multiple outliers and high leverage points. Pison and Van Aelst [9] suggested a new plot using robust distances, obtained from robust location and scale estimators, to identify outliers and to find the empirical influence of observations. Similar to the Rousseeuw and Van Zomeren plot, they construct a graphical tool for multivariate models. Although the Pison and Van Aelst plot can be very useful in the model-building stage to evaluate the quality of a fit based on the number of detected outliers, it does not focus on classifying unusual data into VO, GLPs, and BLPs (see [9]). Hubert et al. [10] introduced a new diagnostic plot based on robust principal component analysis (PCA), denoted ROBPCA. The ROBPCA can distinguish between four types of observations for high dimensional data. The regular observations are close to the PCA subspace. GLPs lie close to the PCA subspace but far from the regular observations. We can also have VO, which have a large orthogonal distance to the PCA subspace. The BLPs have a large orthogonal distance and a projection on the PCA subspace that is remote from the typical projections. To draw the diagnostic plot, or outlier map, the robust score distance of each observation is plotted on the horizontal axis and the orthogonal distance of each observation on the vertical axis. They suggested that the cut-off point for the horizontal axis is sqrt(chi^2_{k,0.975}) and that the approximate cut-off value for the vertical axis equals (mu-hat + sigma-hat z_0.975), where mu-hat and sigma-hat are the estimated mean and standard deviation, respectively, and z_0.975 is the 97.5% quantile of the standard normal distribution. The ROBPCA method is efficient for high dimensional data but not for low dimensional data.
As such, we propose a suitable plot in this regard. The Modified Generalized Studentized Residuals for the identification of multiple vertical outliers and multiple high leverage points are discussed in Section 2. In Section 3, we propose a new robust diagnostic plot for classifying observations into the four categories. The proposed plot is based on the Modified Generalized Studentized Residuals (MGt_i). A simulation study and three numerical examples are presented in Sections 4 and 5, respectively. Finally, some concluding remarks are given in Section 6.

The Modified Generalized Studentized Residuals for the Identification of Multiple Outliers and Multiple High Leverage Points
Consider a multiple linear regression model as follows:

Y = X beta + epsilon, (1)

where Y is an (n x 1) vector of observations of the dependent variable, X is an (n x k) matrix of independent variables, beta is a (k x 1) vector of unknown regression parameters, epsilon is an (n x 1) vector of random errors with identical normal distribution epsilon ~ NID(0, sigma^2), and k is the number of independent variables. The linear regression model in (1) can be rewritten, observation by observation, as

y_i = x_i' beta + epsilon_i,  i = 1, 2, ..., n.

The ordinary least squares (OLS) estimates for the linear regression in (1) are given by

beta-hat = (X'X)^(-1) X'Y,

and the residuals can be expressed in terms of the true disturbances as

epsilon-hat = (I - H) epsilon,

where H = X(X'X)^(-1)X' is the projection matrix, or hat matrix, denoted by "H". The diagonal elements of the "H" matrix are called the hat values, denoted h_ii, given by

h_ii = x_i'(X'X)^(-1)x_i,  i = 1, 2, ..., n.

The h_ii values are often used as a classical diagnostic method to identify high leverage points. However, h_ii mostly fails to detect HLPs because it suffers from masking and swamping effects ([6, 7, 12, 13]). The masking problem happens when we fail to detect some outliers that are hidden by other outliers (an outlier is wrongly identified as an inlier), whereas swamping happens when we declare some inliers to be outliers ([14]). According to Barnett and Lewis (1994), masking is "the tendency for the presence of extreme observations not declared as outliers to mask the discordancy of more extreme observations under investigation as outliers" ([15]). On the other hand, the errors corresponding to the outliers may be very large, and this may lead to the swamping problem, in which clean observations are declared outliers by mistake. Hadi [13] suggested a single-case-deleted measure called the potentials. The diagonal elements of the potentials matrix, denoted p_ii, are given by (see [1, 13, 16])

p_ii = x_i'(X_(i)' X_(i))^(-1) x_i,

where X_(i) is the matrix X excluding the ith row. We can rewrite p_ii as a function of h_ii as

p_ii = h_ii / (1 - h_ii).

Although the potentials are effective in identifying a single high leverage point, they are not successful in identifying multiple HLPs (see [3, 17]). To remedy this problem, Imon [17] proposed the Generalized Potentials, denoted "GP", as a diagnostic method for multiple HLPs. The GP diagnostic method is able to detect multiple HLPs, but it is not adequately effective in identifying the exact number of HLPs. This is due to the choice of the initial basic subset, which is prone to masking effects. Habshah et al. [3] developed the Diagnostic Robust Generalized Potential (DRGP) to improve the rate of detection of HLPs. They divided the data into two sets: a remaining set, which contains the clean data, and a deletion set, which contains all suspect data. Let us denote the set of good cases (remaining in the analysis) by "R" and the set of bad cases (deleted from the analysis) by "D". The DRGP consists of two steps: in the first step, a robust method is used to identify the suspected HLPs, and in the second step, the generalized potential diagnostic approach is used to confirm the suspicion. Habshah et al. [3] pointed out that the low leverage points (if any) are put back into the estimation of the remaining subset sequentially (the observation with the smallest p_ii is substituted first), and the p_ii values are then recomputed. This process is continued until all members of the deletion set have been checked to determine whether or not they can be declared HLPs.
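To make the leverage diagnostics above concrete, here is a minimal sketch (Python with NumPy; the function and variable names are ours, not the paper's) that computes the hat values h_ii = x_i'(X'X)^(-1)x_i and the potentials p_ii = h_ii/(1 - h_ii):

```python
import numpy as np

def hat_values(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    XtX_inv = np.linalg.inv(X.T @ X)
    # x_i' (X'X)^{-1} x_i for every row i, without forming the full n x n H
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

def potentials(h):
    """Hadi's potentials p_ii = h_ii / (1 - h_ii)."""
    return h / (1.0 - h)

# toy design matrix: intercept plus two regressors
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
h = hat_values(X)
p = potentials(h)
```

A quick sanity check is that the hat values sum to the number of columns of X (the trace of H), and that p_ii equals the single-case-deleted form x_i'(X_(i)'X_(i))^(-1)x_i.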
The suspected HLPs are determined by the robust Mahalanobis distance (RMD), based on the minimum volume ellipsoid (MVE) developed by Rousseeuw and Leroy [6], as

RMD_i = [(x_i - T(X))' C(X)^(-1) (x_i - T(X))]^(1/2),  i = 1, 2, ..., n,

where T(X) and C(X) are the robust location and shape estimates of the MVE, respectively. Habshah et al. [3] suggested using the following cut-off value for the robust Mahalanobis distances:

Median(RMD_i) + 3 MAD(RMD_i),

where MAD is the median absolute deviation. Suppose that "R" contains (n - d) cases after d < (n - d) cases of "D" have been deleted. Once the remaining set has been determined, the second step of the DRGP is carried out to confirm the suspected HLPs by using the GP, denoted p*_ii and defined as

p*_ii = x_i'(X_R' X_R)^(-1) x_i  for i in D,
p*_ii = h_ii^(R) / (1 - h_ii^(R))  for i in R,

where h_ii^(R) = x_i'(X_R' X_R)^(-1) x_i. Observations whose p*_ii values are larger than the threshold

Median(p*_ii) + c MAD(p*_ii),

where c can be taken as a constant value of 2 or 3, are declared HLPs. The studentized residuals (internally studentized residuals) and t-studentized residuals (externally studentized residuals) are widely used measures for the identification of outliers (see Cook and Weisberg [18]). The studentized residuals are defined as follows:

t_i = epsilon-hat_i / (sigma-hat sqrt(1 - h_ii)),

where sigma-hat = [epsilon-hat' epsilon-hat / (n - k - 1)]^(1/2) is the standard deviation estimator of the residuals. The special case of (13) is called the t-studentized residual (denoted t*_i), given by (Chatterjee and Hadi [12])

t*_i = epsilon-hat_i / (sigma-hat_(i) sqrt(1 - h_ii)),

where sigma-hat_(i) is the standard deviation estimator of the residuals excluding the ith case. These two measures (t_i and t*_i) are also not very successful in detecting multiple outliers, due to masking and swamping effects ([3, 17]). Imon [17] suggested a Generalized Studentized Residual (denoted Gt_i) based on a group of deletions to identify multiple outliers. The generalized version of regression diagnostics first requires the selection of a deletion group "D" that contains all suspected influential cases, with the remaining cases forming "R".
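The first, robust step above can be sketched as follows (Python with NumPy and scikit-learn). Note that scikit-learn provides the minimum covariance determinant (MCD) rather than the MVE used in the paper, so MCD is used here as a closely related stand-in, and the helper names are ours:

```python
import numpy as np
from sklearn.covariance import MinCovDet

def robust_mahalanobis(X, random_state=0):
    """Robust Mahalanobis distances from robust location/shape estimates
    (MCD here; the paper uses the MVE)."""
    mcd = MinCovDet(random_state=random_state).fit(X)
    return np.sqrt(mcd.mahalanobis(X))  # mahalanobis() returns squared distances

def rmd_cutoff(rmd, c=3.0):
    """Cut-off value Median(RMD) + c * MAD(RMD)."""
    med = np.median(rmd)
    mad = np.median(np.abs(rmd - med)) / 0.6745  # MAD scaled for normal consistency
    return med + c * mad

# toy data: a standard normal cloud with three planted high leverage points
rng = np.random.default_rng(1)
Xtoy = rng.normal(size=(30, 2))
Xtoy[:3] = 10.0
rmd = robust_mahalanobis(Xtoy)
suspected = np.where(rmd > rmd_cutoff(rmd))[0]  # candidate HLPs for the deletion set
```

The indices in `suspected` would form the initial deletion set "D" that the second, confirmatory GP step then re-examines.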
The suspected influential cases cover outliers and HLPs separately, whereby outliers and HLPs are identified using the robust reweighted least squares (RLS) residuals (Rousseeuw and Leroy [6]) and the GP (Imon [8]), respectively. The union of the set of suspected outliers and the set of suspected HLPs forms the deletion set, which has d observations. However, the initial basic subset of Gt_i is not very stable, since it is based on the GP, which suffers from masking and swamping effects. In this regard, the DRGP is employed to remedy this problem. The Modified Generalized Studentized Residuals (MGt_i) are formulated based on the RLS and the DRGP as initial estimators. Once the "D" set is identified, the vector of estimated parameters from the remaining group, denoted beta-hat^(-D), is defined as

beta-hat^(-D) = (X_R' X_R)^(-1) X_R' Y_R.

The residual for the ith observation after the group deletion is given by

epsilon-hat_i^(-D) = y_i - x_i' beta-hat^(-D).

The ith externally studentized residual t*_i for the remaining group "R" is given by

t*_i = epsilon-hat_i^(-D) / (sigma-hat^(-D-i) sqrt(1 - h_ii^(-D))),  i in R,

where the diagonal elements of the hat matrix are given by

h_ii^(-D) = x_i'(X_R' X_R)^(-1) x_i.

By utilizing the results of Rao [19] on updating a least squares fit when an additional point i from the "D" set is brought in, the externally studentized residual for i not in R is defined as

t*_i = epsilon-hat_i^(-D) / (sigma-hat^(-D) sqrt(1 + h_ii^(-D))),  i not in R.

Subsequently, the Modified Generalized Studentized Residuals (MGt_i) for the whole data set are formulated by combining (17) and (21) as follows:

MGt_i = epsilon-hat_i^(-D) / (sigma-hat^(-D-i) sqrt(1 - h_ii^(-D)))  for i in R,
MGt_i = epsilon-hat_i^(-D) / (sigma-hat^(-D) sqrt(1 + h_ii^(-D)))  for i in D,

where sigma-hat^(-D) is the standard deviation of the "R" group residuals and sigma-hat^(-D-i) is the standard deviation of the "R" group residuals excluding the ith case.
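A minimal sketch of the MGt computation, assuming the deletion set D has already been fixed by the RLS/DRGP step (Python with NumPy; the leave-one-out variance update and all names are ours, not the paper's):

```python
import numpy as np

def mgt(X, y, D):
    """Group-deleted studentized residuals: fit OLS on the remaining set R,
    then studentize every residual against that fit (i in R uses the
    leave-one-out scale with sqrt(1 - h); i in D uses sqrt(1 + h))."""
    n, k = X.shape
    D = np.asarray(D)
    R = np.setdiff1d(np.arange(n), D)
    XtX_inv = np.linalg.inv(X[R].T @ X[R])
    beta = XtX_inv @ X[R].T @ y[R]
    resid = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverages w.r.t. X_R
    dof = len(R) - k
    sigma2 = resid[R] @ resid[R] / dof
    t = np.empty(n)
    for i in range(n):
        if i in D:
            # i outside the fitting set: prediction-type standardization
            t[i] = resid[i] / np.sqrt(sigma2 * (1.0 + h[i]))
        else:
            # leave-one-out scale estimate for i in R
            s2 = (dof * sigma2 - resid[i] ** 2 / (1.0 - h[i])) / (dof - 1)
            t[i] = resid[i] / np.sqrt(s2 * (1.0 - h[i]))
    return t

# toy data with one gross vertical outlier placed in the deletion set
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(25), rng.normal(size=25)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=25)
y[0] += 10.0
t = mgt(X, y, D=[0])
```

Because case 0 is excluded from the fit, its large residual is not masked and its |MGt| value is far beyond the usual cut-off of 3.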

Monte Carlo Simulation Study
In this section, a Monte Carlo simulation study is designed to evaluate the performance of our newly proposed method, the MGt-DRGP plot, in classifying observations into regular observations, vertical outliers, GLPs, and BLPs. The results are summarized in Tables 2 and 3. They present the percentage of correct detection of BLPs and the masking and swamping rates at different levels of contamination and different sample sizes.
The results clearly indicate that the MGt-DRGP plot has a superior ability to identify the correct number of BLPs compared to the OLS-MD and LMS-RMD plots, regardless of the number of regressor variables, the contamination rate, and the sample size. At a low level of contamination (alpha = 0.05), the MGt-DRGP plot identified the BLPs perfectly, without any masking or swamping effects. The LMS-RMD plot has a high percentage of detection of BLPs. However, it also has a high percentage of swamping, due to the weakness of the RMD method, which tends to swamp some low leverage points. Although the OLS-MD has a low rate of swamping, its performance in detecting BLPs is very poor, due to masking effects.
At moderate and high levels of contamination (alpha = 0.10, 0.15, and 0.20), the MGt-DRGP plot still outperforms the other plots. On the other hand, the performance of the LMS-RMD plot deteriorates, with a smaller percentage of correct detection and increasing rates of masking and swamping. It can be seen that the performance of the OLS-MD plot becomes very poor as the level of contamination increases. The MGt-DRGP plot consistently has a higher percentage of correct detection and the lowest rates of swamping and masking, irrespective of the number of independent variables k, the level of contamination, and the sample size.
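The three performance measures used in the comparison can be operationalized as follows (a sketch; these are the natural definitions of the rates, and the paper's exact bookkeeping may differ):

```python
def classification_rates(flagged, true_bad, n):
    """Correct-detection, masking, and swamping rates for one simulated run.

    flagged  : indices a method declares to be BLPs
    true_bad : indices of the planted BLPs
    n        : total number of observations
    """
    flagged, true_bad = set(flagged), set(true_bad)
    clean = set(range(n)) - true_bad
    detection = len(flagged & true_bad) / len(true_bad)  # planted BLPs found
    masking = len(true_bad - flagged) / len(true_bad)    # planted BLPs missed
    swamping = len(flagged & clean) / len(clean)         # clean cases mis-flagged
    return detection, masking, swamping

# example: 10 observations, cases 0-2 are the planted BLPs, method flags 0, 1, 5
det, mask, swamp = classification_rates([0, 1, 5], [0, 1, 2], 10)
```

Averaging these three quantities over the 5000 replications gives the percentages reported in the tables.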

Example and Discussion
In this section, three examples are considered to assess the performance of our proposed diagnostic method.

Artificial Data Set.
Our first example is an artificial data set given by Kamruzzaman and Imon [11]. This data set has three variables and contains 20 observations that are generated independently as uniform (0, 1). We suppose that the third variable is the response variable and the rest are the predictor variables. The boxplot in Figure 1 shows that this artificial data set does not have any outlier or high leverage point. We would like to evaluate the performance of our proposed diagnostic method in two situations. Firstly, we modified the data to see the effect of one BLP on the plots. To create a BLP, the ith case of the response and predictor variables (X1, X2, and Y) is replaced by arbitrarily large values; to create a GLP, only the ith case of the predictor variables (X1 and X2) is replaced by arbitrarily large values. Table 4 exhibits the original and the modified artificial data. The three plots (OLS-MD, LMS-RMD, and MGt-DRGP) were then applied to both modified artificial data sets. The values of the diagnostic methods and the corresponding plots are displayed in Table 5 and Figure 2, respectively. Since only a single BLP is present in the first modified data, we can clearly see that both the classical and the robust diagnostic plots are able to correctly detect the BLP (case 20), as shown in Figures 2(a), 2(b), and 2(c). However, for the second modified data, it is interesting to observe that only the MGt-DRGP plot is able to detect and classify the 5 outlying observations (cases 1, 2, 3, 4, and 20) into VO, GLPs, and BLPs correctly (see Figure 2(f)). Although the LMS-RMD plot is able to detect and classify those 5 outlying observations correctly, it swamped 4 observations (cases 8, 10, 15, and 18), as shown in Figure 2(e). The classical OLS-MD plot detects only 1 outlying observation (case 2) and masks 4 observations (cases 1, 3, 4, and 20), as shown in Figure 2(d).
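The quadrant logic of such a residual-versus-leverage diagnostic plot can be sketched as follows (Python with NumPy; the cut-off arguments are illustrative placeholders, not the paper's exact thresholds):

```python
import numpy as np

def classify_points(t, lev, t_cut=3.0, lev_cut=None):
    """Label each observation of a residual-vs-leverage plot:
    large |residual| and large leverage -> BLP; large leverage only -> GLP;
    large |residual| only -> VO; otherwise regular."""
    t = np.asarray(t, dtype=float)
    lev = np.asarray(lev, dtype=float)
    if lev_cut is None:
        # Median + 3 * MAD style cut-off on the leverage diagnostic
        med = np.median(lev)
        lev_cut = med + 3.0 * np.median(np.abs(lev - med))
    outlying = np.abs(t) > t_cut
    high = lev > lev_cut
    return np.where(outlying & high, "BLP",
           np.where(high, "GLP",
           np.where(outlying, "VO", "regular")))

# four stylized observations, one of each type
labels = classify_points(t=[0.1, 4.0, 0.2, 5.0],
                         lev=[0.01, 0.02, 9.0, 9.0], lev_cut=1.0)
```

Feeding MGt_i as `t` and the DRGP potentials as `lev` reproduces the four regions of the proposed plot.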
Next, we would like to justify which plot has identified and classified the suspect observations correctly. Since removing suspect observations leads to a drastic change in the coefficient estimates, a good classification plot is one which corresponds to the highest percentage changes in the various estimates. The percentage of change in estimate (PCE) is computed as

PCE = |(theta-hat_Original - theta-hat_Proposed) / theta-hat_Original| x 100,

where theta-hat_Original is the OLS parameter estimate for the original data, theta-hat_Proposed is the OLS estimate for the data set excluding the suspected cases (VO and BLPs), and "| . |" denotes the absolute value. Another criterion of a good plot is that, after deleting the suspect bad influential observations, there is a significant reduction in the standard errors of the estimates. Table 6 presents the regression coefficients, the standard errors of the coefficients, the coefficient of determination, the test statistics, and the PCE values for the second modified data in different situations: when we remove the single outlier (case 2) detected by the OLS-MD plot, and when we remove the outlying observations (cases 1, 2, 8, 10, 15, 18, and 20 and cases 1, 2, and 20) detected by the LMS-RMD and MGt-DRGP plots, respectively. The results of Table 6 clearly show that, when the suspected influential observations detected by the MGt-DRGP plot are removed, the standard errors of the coefficients are the smallest and the PCE values are the largest.
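The PCE criterion above amounts to an elementwise relative change; a sketch (the function name is ours):

```python
import numpy as np

def pce(theta_original, theta_proposed):
    """Percentage change in estimate, per coefficient:
    |(theta_original - theta_proposed) / theta_original| * 100."""
    theta_original = np.asarray(theta_original, dtype=float)
    theta_proposed = np.asarray(theta_proposed, dtype=float)
    return np.abs((theta_original - theta_proposed) / theta_original) * 100.0

# example: both coefficients move by half their original magnitude
change = pce([2.0, -1.0], [1.0, -1.5])
```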

Phosphorus Data Set.
The second example is the phosphorus data set, which is taken from Snedecor and Cochran [21]. The concentrations of phosphorus, in parts per million, in each of 18 soils were measured to investigate the source from which corn plants obtain their phosphorus. This data set has two predictor variables: X1 is the concentration of inorganic phosphorus in the soil and X2 is the concentration of organic phosphorus in the soil. The response variable Y is the phosphorus content of corn grown in the soil at 20 degrees C. Chatterjee and Hadi [1] showed, by using the studentized OLS residuals, that this data set has a single outlier (case 17). Table 7 shows the absolute values of MD, RMD, DRGP, t_i, the studentized LMS residuals, and MGt_i. The MD did not detect any HLP, whereas the RMD detected 2 HLPs (cases 1 and 6) and the DRGP detected 3 HLPs (cases 1, 6, and 10). The t_i detected only one outlier (case 17), whereas the studentized LMS residuals and MGt_i detected 2 outliers (cases 10 and 17).
Next, we would like to see the classifications made by the OLS-MD, LMS-RMD, and MGt-DRGP plots. From Figures 3, 4, and 5, we can see clearly that the OLS-MD plot identified one vertical outlier (case 17); the corresponding LMS-RMD and MGt-DRGP classifications are shown in Figures 4 and 5. It can be seen from Table 10 that the standard errors of the estimates when removing observations 16, 22, and 19 (identified by MGt-DRGP) are smaller than those when removing case 22 and cases 16 and 22, which are identified by OLS-MD and LMS-RMD, respectively.

Conclusion
In this paper, we proposed a new method for the identification of BLPs by means of a diagnostic plot. Most of the time, the classical OLS-MD plot fails to correctly identify the BLPs. The robust LMS-RMD plot is also not very successful in classifying observations into four categories. In this regard, we proposed the new MGt-DRGP plot, which is very successful in classifying observations into regular observations, vertical outliers, and good and bad leverage points. The Monte Carlo simulation study clearly shows that the MGt-DRGP plot can detect BLPs correctly with very low rates of masking and swamping. It is interesting to observe that the OLS-MD suffers from the masking problem and the LMS-RMD suffers from the swamping problem.

Figure 2 :
Figure 2: The OLS-MD, LMS-RMD, and MGt-DRGP plots for the first situation (plots a, b, and c) and the second situation (plots d, e, and f) of modified artificial data set.

Figure 3 :
Figure 3: The studentized OLS residual against MD for the Phosphorus data set.

Figure 4 :
Figure 4: The standardized LMS residual against RMD for the Phosphorus data set.

Figure 5 :
Figure 5: The Modified Generalized Studentized Residual against DRGP for the Phosphorus data set.

Figure 6 :
Figure 6: The Studentized OLS residual against MD for the Aircraft data set.

Figure 7 :
Figure 7: The Standardized LMS residual against RMD for the Aircraft data.

Figure 8 :
Figure 8: The Modified Generalized Studentized Residual against DRGP for Aircraft data.

Table 1 :
Scatter plot of DRGP against Modified Generalized Studentized Residuals.

The MGt-DRGP plot is compared with some existing plots, namely, the OLS-MD and LMS-RMD plots. The performances of these plots are evaluated based on the rate of correct detection of BLPs and the rates of masking and swamping. A good plot is one that has a higher percentage of correct detection of BLPs and smaller rates of masking and swamping. The experiments consider two numbers of explanatory variables, k = 2 and 3. In each experiment, four sample sizes, n = 20, 40, 100, and 200, and different percentages of BLPs (alpha = 0.05, 0.10, 0.15, and 0.20) are considered. The regular observations are generated according to a normal distribution with mean 0 and variance 1. In order to generate HLPs in a data set, the first 100 alpha % of the regular observations in the X and Y variables are replaced with BLPs. To generate BLPs, the first value of a BLP is kept fixed at 10 and subsequent values are created by multiplying the value's index, i, by 2. Each experimental run involves 5000 replications. The results of the Monte Carlo simulation study are summarized in Tables 2 and 3.

For the modified artificial data sets, clean observations (X1, X2, and Y) are replaced by arbitrarily large values, displayed in the brackets in Table 4. In the first situation, the original data is modified by substituting one good observation with one BLP (case 20). In the second situation, we would like to observe the combined effect of VO, GLPs, and BLPs; here the original data is modified by replacing good observations with 2 VO (cases 1 and 2), 2 GLPs (cases 3 and 4), and 1 BLP (case 20). To create a VO, only the ith observation of the response variable (Y) is replaced by an arbitrarily large value, and to create a GLP, the ith case of the predictor variables (X1 and X2) is replaced by arbitrarily large values.

Table 5 :
Values of MD, RMD, DRGP, t_i, LMS residuals, and MGt_i and their cut-off points (in the brackets) for the second modified artificial data set.

Table 6 :
The regression estimates, standard errors (in the brackets), and PCE for the second modified Artificial data set.

Table 7 :
Values of MD, RMD, DRGP, t_i, LMS residuals, and MGt_i and their cut-off points (in the brackets) for the Phosphorus data set.

Table 8 :
The regression estimates, standard errors (in the brackets), and PCE for Phosphorus data set.

Table 9 :
Values of MD, RMD, DRGP, t_i, LMS residuals, and MGt_i and their cut-off points (in the brackets) for the Aircraft data set.

Table 10 :
The regression estimates and standard errors of regression coefficients for the Aircraft data set.

Aircraft Data Set.
The last example is the Aircraft data set, which is taken from Gray [22]. This data set contains 23 cases with four predictor variables (aspect ratio, lift-to-drag ratio, weight of the plane, and maximal thrust); the response variable is the cost. From the results of Table 9, we can see that t_i identified one outlier (case 22), whereas the standardized LMS residuals and MGt_i identified cases 16 and 22 and cases 16, 19, and 22 as outliers, respectively. Moreover, the MD identified one HLP (case 14), while the RMD and DRGP detected cases 14, 20, and 22 and cases 19 and 22 as HLPs, respectively. The classification of the data into regular observations, vertical outliers, and good and bad leverage points is shown in Figures 6, 7, and 8.