Robust Control Charts for Monitoring Process Mean of Phase-I Multivariate Individual Observations

We propose $T^2$ control charts with high-breakdown robust estimators based on the reweighted minimum covariance determinant (RMCD) and the reweighted minimum volume ellipsoid (RMVE) to monitor multivariate individual observations in Phase-I data. We assessed the performance of these robust control charts through a large number of Monte Carlo simulations under different data scenarios and found that the proposed control charts perform better than existing methods.


Introduction
Control charts are widely used in industry to monitor and control processes. Generally, the construction of a control chart is carried out in two phases. The Phase-I data are analyzed to determine whether they indicate a stable (in-control) process, to estimate the process parameters, and thereby to construct the control limits. The Phase-II analysis consists of monitoring future observations against control limits derived from the Phase-I estimates to determine whether the process continues to be in control. However, trends, step changes, outliers, and other unusual data points in the Phase-I data can adversely affect the parameter estimates and the resulting control limits. That is, any deviation from the main assumption (in our case, independent and identically distributed observations from a normal distribution) may lead to an out-of-control situation. It is therefore very important to identify and eliminate these data points before calculating the control limits. In this paper, all such unusual data points are referred to as "outliers."
Multivariate quality characteristics are often correlated, and Hotelling's $T^2$ control chart [1, 2] is widely used to monitor the multivariate process mean. To implement Hotelling's $T^2$ control chart for individual observations in Phase-I, for each observation $x_i$ we calculate

$$T^2(x_i) = (x_i - \bar{x})^{\top} S_1^{-1} (x_i - \bar{x}), \tag{1}$$

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^{\top}$ is the $i$th $p$-variate observation ($i = 1, 2, \ldots, m$), and the sample mean $\bar{x}$ and sample covariance matrix $S_1$ are based on the $m$ Phase-I observations. In Phase-I monitoring, the $T^2(x_i)$ values are compared with the $T^2$ control limit derived by assuming that the $x_i$'s are multivariate normal, so that the control limits are based on the beta distribution with parameters $p/2$ and $(m - p - 1)/2$. However, the classical estimators, the sample mean and sample covariance, are highly sensitive to outliers, and hence robust estimation methods are preferred because they are not unduly influenced by outliers. Robust estimation methods are well suited to detecting multivariate outliers because their high breakdown points ensure that the control limits remain reasonably accurate. Sullivan and Woodall [3] proposed a $T^2$ chart with an estimate of the covariance matrix based on successive differences of observations and showed that it is effective in detecting a process shift. However, these charts are not effective in detecting multiple multivariate outliers because of their low breakdown point.
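As an illustration of the classical statistic described above, the following minimal sketch computes $T^2(x_i)$ for every observation using the sample mean and the sample covariance matrix with divisor $m - 1$. The function names and the small data set are hypothetical and chosen only for illustration; the beta-based control limit is not computed here.

```python
from statistics import fmean


def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def hotelling_t2(data):
    """Return T^2(x_i) for every row of data, using the classical mean and covariance."""
    m, p = len(data), len(data[0])
    mean = [fmean(row[j] for row in data) for j in range(p)]
    centered = [[row[j] - mean[j] for j in range(p)] for row in data]
    cov = [[sum(c[j] * c[k] for c in centered) / (m - 1) for k in range(p)]
           for j in range(p)]
    # T^2 = d' S^{-1} d, evaluated as d . (solution of S z = d)
    return [sum(d * s for d, s in zip(c, solve(cov, c))) for c in centered]


# Hypothetical bivariate data: seven clustered points and one gross outlier (last row).
data = [[2.1, 3.9], [1.8, 4.2], [2.4, 4.1], [2.0, 3.6],
        [2.2, 4.4], [1.9, 3.8], [2.3, 4.0], [6.0, 1.0]]
t2_values = hotelling_t2(data)
print([round(v, 2) for v in t2_values])
```

With classical estimates, the $T^2$ values always sum to $p(m - 1)$ and no single value can exceed $(m - 1)^2/m$, which is one way to see why a few gross outliers can mask themselves.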
Vargas [4] introduced two robust $T^2$ control charts based on robust estimators of location and scatter, namely, the minimum covariance determinant (MCD) and the minimum volume ellipsoid (MVE), for identifying outliers in Phase-I multivariate individual observations. Jensen et al. [5] showed that the $T^2_{\mathrm{MCD}}$ and $T^2_{\mathrm{MVE}}$ control charts perform better when outliers are present in the Phase-I data. Chenouri et al. [6] used reweighted MCD estimators for monitoring the Phase-II data without constructing Phase-I control charts. However, in many situations Phase-I control charts are necessary to assess the performance of the process and to identify outliers. We propose $T^2$ control charts based on the reweighted minimum covariance determinant (RMCD) and reweighted minimum volume ellipsoid (RMVE) estimators ($T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$) for monitoring Phase-I multivariate individual observations.
The remainder of the paper is organized as follows. In Section 2, we discuss the properties of a good robust estimator and briefly explain the MCD/MVE estimators and their reweighted versions. The proposed $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ control charts are given in Section 3, along with the control limits obtained from Monte Carlo simulations. We assess the performance of the proposed control charts in Section 4, and the implementation of the proposed methods is illustrated with a case example in Section 5. Our conclusions are given in Section 6.

Robust Estimators
The affine equivariance property of an estimator is important because it makes the analysis independent of the measurement scale of the variables as well as of transformations or rotations of the data. The breakdown point concept introduced by Donoho and Huber [7] is often used to assess robustness. The breakdown point is the smallest proportion of observations that can render an estimator meaningless. A higher breakdown point implies a more robust estimator, and the highest attainable breakdown point is 1/2, which is achieved by the median in the univariate case. For more details on affine equivariance and breakdown points, one may refer to Chenouri et al. [6] or Jensen et al. [5].
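The breakdown point can be illustrated in the univariate case with a minimal sketch that contaminates a small sample with an arbitrarily bad value. The data and the helper `contaminate` are hypothetical and used only for illustration.

```python
from statistics import mean, median

# Hypothetical in-control measurements.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7]


def contaminate(values, k, bad_value=1e6):
    """Replace the last k observations with an arbitrarily bad value."""
    return values[:len(values) - k] + [bad_value] * k


# The mean is ruined by a single bad point; the median holds until more than
# half of the observations are replaced.
for k in range(0, 5):
    sample = contaminate(clean, k)
    print(k, round(mean(sample), 2), median(sample))
```

In this sample of seven points, the median stays near 10 with up to three contaminated values and fails only when four are replaced, whereas one bad value is enough to carry the mean away.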
An estimator is said to be relatively efficient compared with another estimator if its mean square error is smaller for at least some values of the parameter. A robust estimator is considered good if it is affine equivariant and has a high breakdown point and high efficiency. In addition to these three properties, it should be possible to calculate the estimator in a reasonable amount of time, so that it is computationally efficient.
It is difficult to obtain an estimator that is both affine equivariant and robust, because affine equivariance and a high breakdown point do not come together easily. Lopuhaä and Rousseeuw [8] and Donoho and Gasko [9] showed that a finite sample breakdown point of $(m - p + 1)/(2m - p + 1)$ is difficult to attain for an affine equivariant estimator. The largest attainable finite sample breakdown point of any affine equivariant estimator of the location and scatter matrix with sample size $m$ and dimension $p$ is $(m - p + 1)/(2m)$ [10]. Therefore, relaxing the affine equivariance condition to invariance under orthogonal transformations makes it easier to find an estimator with the highest breakdown point.
The classical estimators of the location and scatter parameters, the sample mean vector and the sample covariance matrix, are affine equivariant, but their finite sample breakdown point is as low as $1/m$. The MCD and MVE estimators have the highest possible finite sample breakdown point, $(m - p + 1)/(2m)$. However, both of these estimators have very low asymptotic efficiency under normality. The reweighted versions of the MCD and MVE estimators have better efficiency without compromising the breakdown point and rate of convergence of MCD and MVE. In the next two subsections, we discuss the MCD and MVE estimators and their reweighted versions in detail.
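The dependence of the highest attainable breakdown point on the sample size and dimension can be tabulated with a short sketch. It assumes that the bound is read as the proportion $(m - p + 1)/(2m)$, ignoring integer rounding; the function name is hypothetical.

```python
def max_breakdown_point(m, p):
    """Largest attainable breakdown proportion (m - p + 1) / (2m), ignoring integer rounding."""
    if m <= p:
        raise ValueError("need more observations than dimensions")
    return (m - p + 1) / (2 * m)


# Rows: dimension p; columns: sample sizes m = 30, 50, 100, 150.
for p in (2, 6, 10):
    print(p, [round(max_breakdown_point(m, p), 3) for m in (30, 50, 100, 150)])
```

For $p = 10$ the bound is 0.35 at $m = 30$ but 0.47 at $m = 150$, which shows why larger dimensions call for larger samples if the breakdown point is to stay close to 1/2.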

MCD and RMCD Estimators.
The MCD estimators of the location and scatter parameters of the distribution are determined by a two-step procedure. In step 1, all possible subsets of observations of size $h = (m \cdot \gamma)$, where $0.5 \le \gamma \le 1$, are obtained. In step 2, the subset whose covariance matrix has the smallest determinant is selected. The MCD location estimator $\bar{x}_{\mathrm{MCD}}$ is defined as the average of the selected subset of $h$ points, and the MCD scatter estimator is given by $S_{\mathrm{MCD}} = c_{\gamma,p} \cdot d^{m}_{\gamma,p} \cdot C_{\mathrm{MCD}}$, where $C_{\mathrm{MCD}}$ is the covariance matrix of the selected subset, the constant $c_{\gamma,p}$ is the multiplication factor for consistency [11], and $d^{m}_{\gamma,p}$ is the finite sample correction factor [12]. Here $(1 - \gamma)$ represents the breakdown point of the MCD estimators. The MCD estimator attains its highest possible finite sample breakdown point when $h = (m + p + 1)/2$ and has an $m^{-1/2}$ rate of convergence, but it has very low asymptotic efficiency under normality. Computing the exact MCD estimators $(\bar{x}_{\mathrm{MCD}}, S_{\mathrm{MCD}})$ is computationally expensive or even impossible for large sample sizes in high dimensions [13], and hence various algorithms have been suggested for approximating the MCD. Hawkins and Olive [14] and Rousseeuw and van Driessen [15] independently proposed fast algorithms for approximating the MCD. The FAST-MCD algorithm of Rousseeuw and van Driessen finds the exact MCD for small datasets and gives a good approximation for larger datasets; it is available in the standard statistical software SPLUS, R, SAS, and Matlab.
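The two-step definition above can be sketched as an exhaustive search for a very small bivariate data set. This is only an illustration of the definition, not the FAST-MCD algorithm: it omits the consistency and finite sample correction factors, works only for $p = 2$, and uses hypothetical data and function names. The number of subsets grows combinatorially, which is why the exact search is impractical for realistic sample sizes.

```python
from itertools import combinations
from statistics import fmean


def mean_and_cov(points):
    """Sample mean vector and 2x2 covariance matrix (divisor n - 1) of bivariate points."""
    n = len(points)
    mx = fmean(x for x, _ in points)
    my = fmean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def exact_mcd(points, h):
    """Search every subset of size h and keep the one with the smallest covariance determinant."""
    best = None
    for subset in combinations(points, h):
        center, cov = mean_and_cov(subset)
        det = cov[0][0] * cov[1][1] - cov[0][1] ** 2
        if best is None or det < best[0]:
            best = (det, subset, center, cov)
    return best


# Hypothetical data: seven clustered points followed by three outliers.
points = [(2.1, 3.9), (1.8, 4.2), (2.4, 4.1), (2.0, 3.6), (2.2, 4.4),
          (1.9, 3.8), (2.3, 4.0), (6.0, 1.0), (6.5, 0.5), (5.8, 1.4)]
m, p = len(points), 2
h = (m + p + 1) // 2  # subset size giving the highest breakdown point
det, subset, center, raw_cov = exact_mcd(points, h)
print(h, center, det)
```

Here the selected subset consists only of points from the main cluster, so the raw location and scatter estimates are unaffected by the three outlying points.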
MCD estimators are highly robust, are equivariant, and can be calculated in a reasonable time using the FAST-MCD algorithm; however, they are not statistically efficient. A reweighting procedure helps to achieve both robustness and efficiency. That is, a highly robust but possibly inefficient estimator is computed first and is used as a starting point to detect outliers; the sample mean and covariance of the cleaned data set are then computed as in Rousseeuw and van Zomeren [16]. This consists of discarding those observations whose Mahalanobis distances exceed a fixed threshold value. MCD is currently the best choice for the initial estimator of such a two-step procedure because it combines robustness, equivariance, and computational efficiency with an $m^{-1/2}$ rate of convergence. Hence the RMCD estimators are the weighted mean vector

$$\bar{x}_{\mathrm{RMCD}} = \frac{\sum_{i=1}^{m} w_i x_i}{\sum_{i=1}^{m} w_i} \tag{2}$$

and the weighted covariance matrix

$$S_{\mathrm{RMCD}} = c_{\eta,p} \, d^{m}_{\gamma,\eta,p} \, \frac{\sum_{i=1}^{m} w_i (x_i - \bar{x}_{\mathrm{RMCD}})(x_i - \bar{x}_{\mathrm{RMCD}})^{\top}}{\sum_{i=1}^{m} w_i}, \tag{3}$$

where $c_{\eta,p}$ is the multiplication factor for consistency [11], $d^{m}_{\gamma,\eta,p}$ is the finite sample correction factor [12], and the weights $w_i$ are defined as

$$w_i = \begin{cases} 1 & \text{if } (x_i - \bar{x}_{\mathrm{MCD}})^{\top} S_{\mathrm{MCD}}^{-1} (x_i - \bar{x}_{\mathrm{MCD}}) \le q_{\eta}, \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

and $q_{\eta}$ is the $(1 - \eta)100\%$ quantile of the chi-square distribution with $p$ degrees of freedom.
This reweighting technique improves the efficiency of the initial MCD estimator while retaining (most of) its robustness. Hence the RMCD estimator inherits the affine equivariance, robustness, and asymptotic normality properties of the MCD estimators with improved efficiency.
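The reweighting step can be sketched for the bivariate case as follows. This is a simplified illustration under several assumptions: the initial robust center and scatter are supplied by hand rather than computed by MCD, the consistency and finite sample correction factors are omitted, the covariance is divided by the sum of the weights, and the cutoff uses the closed-form chi-square quantile that holds only for 2 degrees of freedom. The value `eta = 0.025` and all names are hypothetical choices.

```python
from math import log


def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance of a bivariate point, using the closed-form 2x2 inverse."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    (a, b), (_, d) = cov
    det = a * d - b * b
    return (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det


def reweight(points, center, cov, eta=0.025):
    """One reweighting step: give weight 1 to points within the cutoff, then re-estimate."""
    cutoff = -2.0 * log(eta)  # (1 - eta) quantile of chi-square with 2 degrees of freedom
    weights = [1 if mahalanobis_sq(pt, center, cov) <= cutoff else 0 for pt in points]
    kept = [pt for pt, w in zip(points, weights) if w]
    n = len(kept)  # equals the sum of the 0/1 weights
    mx = sum(x for x, _ in kept) / n
    my = sum(y for _, y in kept) / n
    sxx = sum((x - mx) ** 2 for x, _ in kept) / n
    syy = sum((y - my) ** 2 for _, y in kept) / n
    sxy = sum((x - mx) * (y - my) for x, y in kept) / n
    return weights, (mx, my), ((sxx, sxy), (sxy, syy))


# Hypothetical data (seven clustered points, three outliers) and hand-picked initial estimates.
points = [(2.1, 3.9), (1.8, 4.2), (2.4, 4.1), (2.0, 3.6), (2.2, 4.4),
          (1.9, 3.8), (2.3, 4.0), (6.0, 1.0), (6.5, 0.5), (5.8, 1.4)]
initial_center = (2.1, 4.0)
initial_cov = ((0.05, 0.0), (0.0, 0.08))
weights, rw_center, rw_cov = reweight(points, initial_center, initial_cov)
print(weights, rw_center)
```

Because every point that passes the cutoff contributes to the final estimates, the reweighted estimator uses more of the clean data than the initial subset of size $h$, which is the source of the efficiency gain.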

MVE and RMVE Estimators.
The MVE estimators of the location and scatter parameters of the distribution are determined in much the same way as the MCD estimators. As in the case of MCD, all possible subsets of data points of size $h = (m \cdot \gamma)$ (where $0.5 \le \gamma \le 1$) are obtained first. Then the ellipsoid of minimum volume that covers a subset is obtained to determine the MVE estimators. The MVE location estimator is the geometrical center of the ellipsoid, and the MVE scatter estimator is the matrix defining the ellipsoid itself, multiplied by an appropriate constant to ensure consistency [13, 16]. Thus, unlike the MCD estimator, the MVE estimator does not correspond to the sample mean vector and the sample covariance matrix of a subset.
Here $(1 - \gamma)$ represents the breakdown point of the MVE estimators, as in the case of MCD, and the estimator attains the highest possible finite sample breakdown point when $h = (m + p + 1)/2$ [8, 17]. The MVE estimator has an $m^{-1/3}$ rate of convergence and a nonnormal asymptotic distribution [17].
As with MCD estimators, MVE estimators are not efficient. Hence, a reweighted version similar to that for MCD has been proposed by Rousseeuw and van Zomeren [16]. Note that it has been shown more recently that the RMVE estimators do not improve on the convergence rate (and thus the 0% asymptotic efficiency) of the initial MVE estimator [8, 12]. Therefore, as an alternative, a one-step M-estimator can be calculated with the MVE estimators as the initial solution [13, 18], which results in an estimator with the standard $m^{-1/2}$ convergence rate to a normal asymptotic distribution. For more details on MCD/MVE estimators, one may refer to Chenouri et al. [6] or Jensen et al. [5].
Algorithms to determine the MVE/RMVE estimators are available in the statistical software SPLUS, R, SAS, and Matlab.

Robust Control Charts
We propose to use $T^2$ charts with robust estimators of the location and dispersion parameters based on RMCD/RMVE for monitoring the process mean of Phase-I multivariate individual observations. RMCD/RMVE estimators inherit the nice properties of the initial estimators, such as affine equivariance, robustness, and asymptotic normality, while achieving higher efficiency. We now define the robust $T^2$ control chart statistics with RMCD and RMVE estimators for the $i$th multivariate observation as

$$T^2_{\mathrm{RMCD}}(x_i) = (x_i - \bar{x}_{\mathrm{RMCD}})^{\top} S_{\mathrm{RMCD}}^{-1} (x_i - \bar{x}_{\mathrm{RMCD}}), \qquad T^2_{\mathrm{RMVE}}(x_i) = (x_i - \bar{x}_{\mathrm{RMVE}})^{\top} S_{\mathrm{RMVE}}^{-1} (x_i - \bar{x}_{\mathrm{RMVE}}). \tag{5}$$

The control limits for these statistics are obtained empirically. In the next subsection we apply Monte Carlo simulation to estimate quantiles of the distributions of $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ for several combinations of sample sizes and dimensions. For each dimension, we further introduce a method of fitting a smooth nonlinear model to arrive at the control limits for any given sample size.

Computation of Control Limits.
We performed a large number of Monte Carlo simulations to obtain the control limits. We generated $N = 200{,}000$ samples of size $m$ from a standard multivariate normal distribution $\mathrm{MVN}(0, I_p)$ with dimension $p$. Due to the invariance of the $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ statistics, these limits are applicable for any values of $\mu$ and $\Sigma$. Using the reweighted MCD/MVE estimators $\bar{x}_{\mathrm{RMCD}}$, $S_{\mathrm{RMCD}}$, $\bar{x}_{\mathrm{RMVE}}$, and $S_{\mathrm{RMVE}}$ with a breakdown value of $\gamma = 0.50$, the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ statistics for each observation in the data set were calculated using (5), and the maximum value attained for each data set of size $m$ was recorded. The empirical distribution of the maximum of $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ was inverted to determine the $(1 - \alpha)100\%$ quantiles. We used the R function "CovMcd()" in the "rrcov" package written by Todorov [19] to compute the RMCD/RMVE estimators.
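The simulation scheme above (generate in-control samples, record the maximum statistic of each sample, and take an upper quantile of those maxima) can be sketched as follows. To keep the sketch self-contained, it substitutes the classical sample mean and covariance for the RMCD/RMVE estimators, is restricted to $p = 2$, and uses far fewer replications than the 200,000 reported; the function names, seed, and replication count are hypothetical. The resulting numbers are therefore not the control limits of the proposed charts.

```python
import random


def max_t2(sample):
    """Largest classical T^2 value in one bivariate sample (sample mean and covariance)."""
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    sxx = sum((x - mx) ** 2 for x, _ in sample) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in sample) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in sample) / (n - 1)
    det = sxx * syy - sxy ** 2
    return max((syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
                + sxx * (y - my) ** 2) / det for x, y in sample)


def simulated_limit(m, alpha, n_sim=2000, seed=1):
    """Empirical (1 - alpha) quantile of the per-sample maximum under MVN(0, I_2)."""
    rng = random.Random(seed)
    maxima = sorted(
        max_t2([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(m)])
        for _ in range(n_sim)
    )
    index = min(n_sim - 1, int((1 - alpha) * n_sim))
    return maxima[index]


for alpha in (0.05, 0.01):
    print(alpha, round(simulated_limit(m=30, alpha=alpha), 2))
```

Taking the quantile of the per-sample maximum, rather than of the individual statistics, controls the overall false alarm probability for the whole Phase-I data set at $\alpha$.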
From Figures 1, 2, and 3, we can see that the nonlinear fit is very well supported by the high $R^2$ values, which helps us to determine the $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ control limits for any given sample size. The least squares estimates of the parameters $a_{1(p,\alpha)}$, $a_{2(p,\alpha)}$, and $a_{3(p,\alpha)}$ when $\gamma = 0.50$, for dimensions $p = 2, 3, \ldots, 10$ and $\alpha = 0.05$, $0.01$, and $0.001$, for the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ control charts are given in Table 1. Using these estimates, the control limits for $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ can be found using (6) for any sample size.
To implement a robust control chart, first collect a sample of $m$ multivariate individual observations with dimension $p$. Compute robust estimates of the mean and covariance matrix using R or any other software with $\gamma = 0.50$, and determine $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$. Outliers can be identified by comparing the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ values with the control limits obtained using (6) for the specific values of $m$, $p$, and $\alpha$ and the constants given in Table 1. The outlier-free data can then be used to construct the standard $T^2$ control chart for monitoring the Phase-II observations.

Performance Analysis
We assess the performance of the proposed charts when outliers are present due to a shift in the process mean. In their study, Jensen et al. [5] concluded that the $T^2_{\mathrm{MCD}}$/$T^2_{\mathrm{MVE}}$ control charts had better performance in terms of probability of signal. Hence, we compare the performance of our proposed method with the $T^2_{\mathrm{MCD}}$/$T^2_{\mathrm{MVE}}$ charts as well as with the standard $T^2$ chart based on classical estimators. Our study compares more combinations of dimension $p$, sample size $m$, and proportion of outliers $\pi$. For a particular combination of $p$, $m$, and $\pi$, a number of datasets are generated. Out of the $m$ observations generated, $m \cdot \pi$ are random data points generated from the out-of-control distribution, and the remaining $m \cdot (1 - \pi)$ observations are generated from the in-control distribution, so that the sample of $m$ data points contains some outliers. We set $\pi = 0.10$ and $0.20$ to ensure that the sample contains few outliers. Without loss of generality, we take the in-control distribution to be $N(0, I_p)$. The out-of-control distribution is multivariate normal with a small shift in the mean vector and the same covariance matrix. The amount of mean shift is defined through a noncentrality parameter $\delta$, which is given by

$$\delta = (\mu_1 - \mu)^{\top} \Sigma^{-1} (\mu_1 - \mu), \tag{7}$$

where $(\mu_1 - \mu)$ is the shift in the mean vector. The larger the value of $\delta$, the more extreme the outliers are. The proportion of datasets that had at least one $T^2_{\mathrm{RMCD}}$ or $T^2_{\mathrm{RMVE}}$ statistic greater than the control limit was calculated, and this proportion is the estimated probability of signal. We compared the performance of these charts with the standard $T^2$, $T^2_{\mathrm{MCD}}$, and $T^2_{\mathrm{MVE}}$ charts. The standard $T^2$ chart was included in our performance study as a reference because of its common usage.
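The data-generation scheme for the performance study can be sketched as below. The sketch assumes an identity covariance matrix and a shift with equal components, so that the noncentrality parameter reduces to the squared length of the shift; rounding $m \cdot \pi$ to the nearest integer is also an assumption, as are all function names and the seed.

```python
import random
from math import sqrt


def noncentrality(mu1, mu, sigma_inv):
    """delta = (mu1 - mu)' Sigma^{-1} (mu1 - mu) for a given inverse covariance matrix."""
    d = [a - b for a, b in zip(mu1, mu)]
    return sum(d[j] * sigma_inv[j][k] * d[k]
               for j in range(len(d)) for k in range(len(d)))


def equal_shift(delta, p):
    """Mean vector with equal components whose noncentrality is delta when Sigma = I_p."""
    return [sqrt(delta / p)] * p


def contaminated_sample(m, p, pi, delta, seed=0):
    """m observations: round(m * pi) from the shifted distribution, the rest from N(0, I_p)."""
    rng = random.Random(seed)
    n_out = round(m * pi)
    shift = equal_shift(delta, p)
    outliers = [[rng.gauss(s, 1) for s in shift] for _ in range(n_out)]
    inliers = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(m - n_out)]
    return outliers + inliers, n_out


identity = [[1.0, 0.0], [0.0, 1.0]]
shift = equal_shift(delta=10, p=2)
print(shift, noncentrality(shift, [0.0, 0.0], identity))
sample, n_out = contaminated_sample(m=50, p=2, pi=0.10, delta=10)
print(len(sample), n_out)
```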
The probability of signal was studied for $\delta = (0, 5, 10, 15, 20, 25, 30)$ and for some of the values of $m = (30, 50, 100, 150)$, $p = (2, 6, 10)$, and $\pi = (10\%, 20\%)$. Fifty thousand datasets of size $m$ were generated for each combination of $p$, $m$, and $\pi$, and the probability of signal was estimated for $\alpha = 0.05$, $0.01$, and $0.001$. We considered various combinations of $\mu_1$, $\mu_2$, and $\rho$ that determine $\delta$ as per (7) and found that the probability of signal is the same irrespective of the combination of $\mu_1$, $\mu_2$, and $\rho$. Hence we have considered $\mu_1 = \mu_2$ and $\rho = 0$ for the various values of $\delta$. We present only a selected set of plots to save space. The plots of the probability of signal for $\alpha = 0.05$ and $0.01$, $p = 2$ and $6$, and $m = 50$ and $100$ are given in Figures 4, 5, 6, and 7. For dimension $p = 10$, we used $m = 100$ and $150$, and the plots of the probability of signal are given in Figures 8 and 9.
From Figures 4-9, we can see that when the value of the noncentrality parameter is zero or close to zero, the probability of signal is close to $\alpha$, which is expected for an in-control process. As the value of the noncentrality parameter increases, the probability of signal also increases. Using this criterion, we select the best method for identifying the outliers. If the probability of signal does not increase as the noncentrality parameter increases, then it is clear that the estimator has broken down and is not capable of detecting the outliers.
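The estimated probability of signal described above is simply the fraction of simulated datasets in which at least one chart statistic exceeds the control limit. A minimal sketch follows; the statistic values and the limit are made-up numbers for illustration only, and the function name is hypothetical.

```python
def probability_of_signal(datasets_stats, limit):
    """Fraction of datasets with at least one statistic above the control limit."""
    signals = sum(1 for stats in datasets_stats if max(stats) > limit)
    return signals / len(datasets_stats)


# Made-up chart statistics for four simulated datasets of five observations each.
simulated = [
    [1.2, 0.8, 3.1, 2.0, 0.5],
    [0.9, 14.6, 1.1, 2.7, 0.3],   # one value above the limit
    [2.2, 1.9, 0.4, 1.0, 3.3],
    [12.8, 0.7, 13.5, 1.6, 2.1],  # two values above the limit still count as one signal
]
print(probability_of_signal(simulated, limit=12.0))
```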
A careful examination of these plots of the probability of signal for various values of $m$, $p$, and $\pi$ indicates that, for small values of $m$ and $p$, the $T^2_{\mathrm{RMVE}}$ charts perform well. As $p$ increases for a fixed value of $m$, the breakdown points of RMCD and RMVE get smaller, since the breakdown value is given by $(m - p + 1)/(2m)$. This suggests that the larger $p$ is, the larger $m$ needs to be in order to maintain the breakdown point, which is well demonstrated in Figures 8 and 9. In general, there was always one estimator, RMCD or RMVE, that was superior across all values of the noncentrality parameter, as long as the proportion of outliers was not large enough to cause the estimators to break down. This greatly simplifies the conclusions that can be made about when the RMCD or RMVE estimators are preferred to the MCD and MVE estimators.
Nevertheless, the $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ charts are preferred for the various combinations of $m$, $p$, and $\pi$, and some broad recommendations can be made on the choice between these two charts. When $m < 100$, the $T^2_{\mathrm{RMVE}}$ chart is best for small dimensions. When $m \ge 100$, the $T^2_{\mathrm{RMCD}}$ chart is preferred. As $m$ increases, the percentage of outliers that can be detected by the $T^2_{\mathrm{RMVE}}$ chart decreases. For both charts, when $p$ is higher, the number of outliers that can be detected decreases for smaller sample sizes. Thus, for Phase-I applications where the number of outliers is unknown, $T^2_{\mathrm{RMVE}}$ should be used only for smaller sample sizes, where it is also computationally feasible. $T^2_{\mathrm{RMCD}}$ should be used for larger sample sizes or when it is believed that there is a large number of outliers. When the dimension is large, larger sample sizes are needed to ensure that the estimator does not break down and lose its ability to detect outliers. Hence, for larger dimensions, $T^2_{\mathrm{RMCD}}$ is preferred with large sample sizes. For very small samples ($m < 30$), one may opt for higher values of $\gamma$, for which control limits need to be developed.

Case Example
To illustrate the applicability of the proposed control chart method, we discuss a real case example taken from the electronics industry. The data consist of 105 measurements of 3 axial components of acceleration measured by an accelerometer on an e-compass unit fixed on the objects. The mean vector and covariance matrix of the sample data were estimated under the classical, RMCD, and RMVE methods. A simple comparison of these estimates indicates that there are outliers in the Phase-I data. The plots of the $T^2$, $T^2_{\mathrm{RMCD}}$, and $T^2_{\mathrm{RMVE}}$ values along with the respective control limits at the 99% confidence level for the sample data are given in Figure 10. The control limits for $T^2$ are based on the beta distribution, and those for $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ are calculated using (6) for $p = 3$ and $m = 105$. From Figure 10, it is clear that both the $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ control charts signal 3 outliers, whereas the standard $T^2$ control chart signals none, even though all the charts show the same pattern. This indicates the effectiveness of the proposed robust control charts in identifying outliers.

Conclusions
The use of robust control charts in Phase-I monitoring is very important both for assessing the performance of the process and for detecting outliers. We propose $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ control charts for Phase-I monitoring of multivariate individual observations. The control limits for these charts are obtained empirically, and a nonlinear regression model is used to arrive at control limits for any sample size. The performance of the proposed charts was compared under various data scenarios using a large number of Monte Carlo simulations. Our simulation studies indicate that the $T^2_{\mathrm{RMVE}}$ control charts perform well for smaller sample sizes and smaller dimensions, whereas the $T^2_{\mathrm{RMCD}}$ control charts perform well for larger sample sizes and larger dimensions. We illustrated our proposed robust control chart methodology using a case study from the electronics industry.

Figure 4: Probability of signal for the $T^2$ control chart with different estimation methods for $p = 2$, $m = 50$.

Figure 5: Probability of signal for the $T^2$ control chart with different estimation methods for $p = 2$, $m = 100$.

Figure 6: Probability of signal for the $T^2$ control chart with different estimation methods for $p = 6$, $m = 50$.

Figure 7: Probability of signal for the $T^2$ control chart with different estimation methods for $p = 6$, $m = 100$.

Figure 8: Probability of signal for the $T^2$ control chart with different estimation methods for $p = 10$, $m = 100$.

Figure 9: Probability of signal for the $T^2$ control chart with different estimation methods for $p = 10$, $m = 150$.
The $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ charts are proposed for monitoring Phase-I multivariate individual observations. RMCD/RMVE estimators are statistically more efficient than MCD/MVE estimators and have a manageable asymptotic distribution. We empirically arrive at Phase-I control limits for the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ control charts for some specific sample sizes and fit a nonlinear model to determine control limits for any sample size for dimensions 2 to 10. Our simulation studies show that the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ control charts perform well compared with the $T^2_{\mathrm{MCD}}$/$T^2_{\mathrm{MVE}}$ control charts for monitoring Phase-I data.
The $T^2_{\mathrm{RMVE}}$ charts perform well, which is evident from all the plots presented here. When $p$ is large (see Figures 8 and 9), $T^2_{\mathrm{RMCD}}$ has a clear advantage over $T^2_{\mathrm{RMVE}}$. From these figures, we see that the standard $T^2$ control chart has little ability to detect the outliers and the $T^2$