Model Selection Approaches for Predicting Future Order Statistics from Type II Censored Data

This paper studies a discrimination problem for the location-scale family in the context of prediction from type II censored samples. Three model selection approaches and two types of predictors are proposed to predict future order statistics from censored data when several candidate distributions compete and the best underlying distribution is unclear. Two members of the location-scale family, the normal distribution and the smallest extreme value distribution, are used as candidates to illustrate the model competition under the proposed prediction methods. The performance of correct and incorrect selections under correct specification and misspecification is evaluated using Monte Carlo simulations. Simulation results show that model misspecification affects the prediction precision and that the three proposed model selection approaches perform well when more than one candidate distribution competes for the best underlying distribution. Finally, the proposed approaches are applied to three data sets.


Introduction
To save testing time and sample resources, censoring schemes are often used in life tests. Type I censoring and type II censoring are two popular schemes, based on censoring by test time and by number of failures, respectively. Many studies have evaluated the reliability of lifetime components using type I or type II censored tests; see, for example, [1][2][3][4][5][6].
In this study, we mainly restrict our attention to the type II censoring scheme and to predicting the censored observations for reliability evaluation when a discrimination problem is present. In the type II censoring scheme, $n$ identical components are placed on test simultaneously, and the experiment is terminated when the $r$th component fails. Thus the last $(n - r)$ components are censored. In many engineering applications, statistical methods cannot be applied directly to censored data. For example, most experimental design methods, such as factorial or fractional factorial designs, cannot be implemented with censored data. In such situations, a reliable procedure for predicting the censored or unobserved observations is required. Moreover, if the unobserved observations can be predicted and the censored data set transformed into a complete data set, parameter estimation becomes easier, especially in cases where the parameter estimators have no analytic solutions. Predicting the life length of the $s$th ($r < s \le n$) item is equivalent to predicting the life length of an $(n-s+1)$-out-of-$n$ system made up of $n$ identical components with independent life lengths. When $s = n$, this is better known as the parallel system. Various methods have been developed to predict censored data. Kaminsky and Nelson [7] provided interval and point prediction of order statistics. Fertig et al. [8] provided Monte Carlo estimates of the distribution percentiles to construct prediction intervals for samples from a Weibull or smallest extreme value (SEV) distribution.
Mathematical Problems in Engineering
Kaminsky and Rhodin [9] provided the maximum likelihood predictor (MLP) to predict the future order statistics and then estimate the unknown parameters. Wu et al. [10] proposed five new pivotal quantities to obtain prediction intervals of future order statistics from the Pareto distribution. Kundu and Raqab [11] described Bayesian inference and prediction for the two-parameter Weibull distribution. Panahi and Sayyareh [12] proposed parameter estimation and prediction of order statistics for the Burr type XII distribution. Some of these prediction methods are complex or require complex statistical models, so they are not easy to apply.
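The type II censoring scheme described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `type2_censored_sample`, the seed, and the Weibull parameters used to generate lifetimes are all assumptions made for the example.

```python
import random


def type2_censored_sample(lifetimes, r):
    """Return the r smallest lifetimes (observed failures) and the number censored."""
    if not 1 <= r <= len(lifetimes):
        raise ValueError("r must satisfy 1 <= r <= n")
    ordered = sorted(lifetimes)          # order statistics x_{1:n} <= ... <= x_{n:n}
    return ordered[:r], len(ordered) - r  # test stops at the r-th failure


rng = random.Random(2024)                 # fixed seed, illustration only
n, r = 13, 10                             # same (n, r) as Example 1
# Hypothetical lifetimes: Weibull with scale 1.5 and shape 2.0 (assumed values)
lifetimes = [rng.weibullvariate(1.5, 2.0) for _ in range(n)]
observed, n_censored = type2_censored_sample(lifetimes, r)
print(observed, n_censored)
```

Only the first $r$ order statistics are available to the analyst; the remaining $n - r$ values are known only to exceed the last observed failure, and they are the quantities to be predicted.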
To address this problem, Raqab [13] modified the MLP method and proposed four modified MLPs (MMLPs) to predict the future order statistics for the normal distribution (ND). To simplify the estimating equations, four types of modification were considered to approximate the hazard rate and extended hazard rate terms of an ND with unknown mean and known standard deviation. Yang and Tong [14] used the MMLP method to predict type II censored data from factorial experiments. They derived simple explicit solutions for the parameters of an ND with unknown mean and unknown standard deviation. Chiang [15] used another three MMLP procedures to predict type II censored data under the Weibull distribution. In his procedures, it is difficult to find a unique root for the parameter estimates. The MMLP method yields simple explicit parameter estimates only for the ND. For other commonly used distributions, the likelihood equations of the MMLP may be nonlinear and do not admit explicit solutions. Hence the MMLP loses its advantage in parameter estimation for other commonly used distributions.
Another important problem in life testing experiments is model selection based on the existing sample. In practical applications, many statistical distributions are much alike, especially with censored data, and the underlying distribution of product quality characteristics is usually unknown. Several distributions may fit the data well, yet their predictions may differ significantly. Therefore, correctly identifying the underlying distribution is an important issue that has long been studied. Dumonceaux and Antle [16] applied the ratio of maximized likelihood (RML) to discriminate between the lognormal and Weibull distributions. Kundu and Manglick [17] proposed statistical methods to discriminate between the lognormal and gamma distributions. Kundu and Raqab [18] proposed a selection procedure to discriminate between the generalized Rayleigh and lognormal distributions. Yu [19] provided a misspecification analysis method to discriminate between the ND and SEV for the design of experiments. Dey and Kundu [20] studied the discrimination problem between the lognormal and log-logistic distributions. Elsherpieny et al. [21] considered the discrimination problem between the Weibull and log-logistic distributions. Ashour and Hashish [22] provided a numerical comparison of the RML-procedure, S-procedure, and F-procedure in failure model discrimination. Pakyari [23] presented diagnostic tools based on the likelihood ratio test and the minimum Kolmogorov distance method to discriminate between the generalized exponential, geometric extreme exponential, and Weibull distributions. Elsherpieny et al. [24] provided a method to discriminate between the gamma and log-logistic distributions based on progressive type II censored data. Although the inference methods in the aforementioned studies are valuable, the impact of model misspecification on predicting future order statistics has not been well studied.
Among model discrimination problems, discrimination within the location-scale family of distributions is particularly important and has received much attention, owing to the well-developed theory and inferential procedures for this family. The main purpose of this paper is to address these issues and to provide satisfactory estimators of the parameters and predictors of future order statistics for type II censored data when the underlying distribution is unknown but is a member of the location-scale family. The major contributions of this study for censored data prediction are presented in Figure 1.
The rest of this paper is organized as follows. Section 2 presents materials and methods. In this section, statistical methods to obtain approximate predictors for type II right-censored variables are studied, and two prediction methods are proposed to predict the type II right-censored variables based on the AMLEs. The ND and SEV are considered as the candidate distributions competing to be the best distribution for obtaining the predictors of type II right-censored variables. In Section 3, we provide three algorithms that implement the three proposed model selection approaches to deal with the discrimination problem when obtaining the predictors of type II right-censored variables. An intensive simulation study is conducted in Section 4 to evaluate the performance of the proposed approaches. Three examples are then used in Section 5 to demonstrate the applications of the proposed methodologies. Some concluding remarks are provided in Section 6.

Methods for Approximate Predictors
In the PDF (1) and CDF (2), $\mu$ is the location parameter, $\sigma$ is the scale parameter, and $g(\cdot)$ and $G(\cdot)$ are the PDF and CDF of a member of the location-scale family. Please note that the capital notation $X_{s:n}$ in the predictive likelihood function (PLF) (3) is unknown and can be predicted based on the sample $\mathbf{x}$. Based on the method proposed by Raqab [13], the PLF of $X_{s:n}$, $\mu$ and $\sigma$ in (3) can be represented as a product of two likelihood functions: the PLF of $\mu$ and $\sigma$, denoted by $L_1$, and the PLF of $X_{s:n}$, denoted by $L_2$. These two likelihood functions are presented in (4) and (5), respectively. In practice, we can obtain the MLEs of $\mu$ and $\sigma$, denoted by $\hat{\mu}$ and $\hat{\sigma}$, respectively, by maximizing $L_1(\mu, \sigma; \mathbf{x})$ in (4), and then use $\hat{\mu}$ and $\hat{\sigma}$ as plug-in parameters in (5) to predict $X_{s:n}$. Let $z_{i:n} = (x_{i:n} - \mu)/\sigma$ for $i = 1, \ldots, r$, $Z_{s:n} = (X_{s:n} - \mu)/\sigma$ for $s = r + 1, \ldots, n$, and $\mathbf{z} = (z_{1:n}, z_{2:n}, \ldots, z_{r:n})$; then (4) and (5) can be rewritten as (6) and (7), with the associated terms given in (8) and (9). Because $\hat{\mu}$ and $\hat{\sigma}$ have no analytic expression, a numerical gradient method, for example, the Newton-Raphson method, is needed to obtain them from (8) and (9). To obtain proper initial values for the gradient method, we use the approximate MLEs (AMLEs) of $\mu$ and $\sigma$ from Hossain and Willan [25].
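For orientation, the factorization described above can be written out under the standard joint density of type II censored order statistics. The display below is a sketch under that assumption, written with $f$ and $F$ for the PDF and CDF of the location-scale member; the paper's own expressions (3)-(5) may differ in notation and in how the constants are arranged.

```latex
% Standard forms, assumed here for illustration (r < s <= n):
\begin{align*}
L(X_{s:n}, \mu, \sigma; \mathbf{x})
  &= \frac{n!}{(s-r-1)!\,(n-s)!}
     \prod_{i=1}^{r} f(x_{i:n})\,
     \bigl[F(X_{s:n}) - F(x_{r:n})\bigr]^{s-r-1}
     f(X_{s:n})\,\bigl[1 - F(X_{s:n})\bigr]^{n-s}, \\
L_1(\mu, \sigma; \mathbf{x})
  &= \frac{n!}{(n-r)!}
     \prod_{i=1}^{r} f(x_{i:n})\,
     \bigl[1 - F(x_{r:n})\bigr]^{n-r}, \\
L_2(X_{s:n}; \mu, \sigma, \mathbf{x})
  &= \frac{(n-r)!}{(s-r-1)!\,(n-s)!}\,
     \frac{\bigl[F(X_{s:n}) - F(x_{r:n})\bigr]^{s-r-1}
           f(X_{s:n})\,\bigl[1 - F(X_{s:n})\bigr]^{n-s}}
          {\bigl[1 - F(x_{r:n})\bigr]^{n-r}},
\end{align*}
% so that L = L_1 \times L_2.
```

Here $L_1$ is the usual likelihood of the $r$ observed failures, and $L_2$ is the conditional density of $X_{s:n}$ given $x_{r:n}$. This is why the parameters can be estimated from $L_1$ alone and then plugged into $L_2$ for prediction.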

Approximate Maximum Likelihood Predictors.
Once the MLEs $\hat{\mu}$ and $\hat{\sigma}$ are obtained, $X_{s:n}$ can be predicted using two approximation methods: the expected value prediction method and the Taylor series prediction method. The resulting predictor of $X_{s:n}$ based on the expected value prediction method is denoted by MLP$_E$, and the resulting predictor of $X_{s:n}$ based on the Taylor series prediction method is denoted by MLP$_T$.
The two approximation methods differ in how they approximate $h_1(z_{r:n}, Z_{s:n})$ and $h(Z_{s:n})$. Mehrotra and Nanda [26] proposed approximate maximum likelihood estimators for the ND and gamma distribution by replacing $h(\cdot)$ and $h_1(\cdot)$ by their respective expected values, and compared their efficiencies with those of the best linear unbiased estimators for these distributions. Balakrishnan and Cohen [27] used the Taylor series expansion of $h(z)$ and $g(z)/G(z)$ at the points $G^{-1}(p_i)$ to obtain modified MLEs of the parameters of the ND and Rayleigh distribution, where $p_i = i/(n + 1)$ for $i = 1, 2, \ldots, n$. The main point of their approach is that the likelihood equations involve complicated terms, so an explicit form for the MLE cannot be obtained. We follow their ideas to find an explicit form for the predictor of $X_{s:n}$. The expected value prediction method replaces $(\mu, \sigma)$ with $(\hat{\mu}, \hat{\sigma})$ and replaces $h_1(z_{r:n}, Z_{s:n})$ and $h(Z_{s:n})$ by their respective expected values in (10). According to Raqab [13], the expected values of $Z_{s:n}$, $h_1(z_{r:n}, Z_{s:n})$ and $h(Z_{s:n})$ are available in explicit form. The Taylor series prediction method replaces $(\mu, \sigma)$ with $(\hat{\mu}, \hat{\sigma})$ and replaces $h(Z_{s:n})$ and $h_1(z_{r:n}, Z_{s:n})$ with their Taylor series approximations at the points $G^{-1}(p_s)$ and $(G^{-1}(p_r), G^{-1}(p_s))$, respectively, in (10). In this study, we denote the MLP$_E$ and MLP$_T$ of $X_{s:n}$ under a candidate distribution $M$ by $\hat{X}^{M,1}_{s:n}$ and $\hat{X}^{M,2}_{s:n}$, respectively. The location-scale family contains many common distributions; widely used members include the ND, the SEV, and the logistic distribution. It is impractical to list the inference formulas for predicting $X_{s:n}$ under every widely used member of the family. In this study, we use the ND and SEV as candidates to illustrate the applications of the proposed methods, but the suggested algorithms can also be applied to cases with more than two candidate members. The ND and SEV are selected as candidates because the Weibull and lognormal distributions are two widely used distributions in life testing applications, and they can be transformed into the SEV and ND, respectively, by taking the log transformation.
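The log transformation mentioned above can be made explicit with a short derivation. The symbols $T$, $\beta$ (Weibull shape), and $\eta$ (Weibull scale) are introduced here only for illustration and are not notation from the paper.

```latex
% Weibull lifetime T with shape \beta > 0 and scale \eta > 0 (illustrative symbols):
\begin{align*}
P(T \le t) &= 1 - \exp\!\bigl\{-(t/\eta)^{\beta}\bigr\}, \qquad t > 0, \\
P(\log T \le x) &= 1 - \exp\!\bigl\{-\exp\bigl[\beta\,(x - \log\eta)\bigr]\bigr\}
                 = G_{\mathrm{SEV}}\!\left(\frac{x - \mu}{\sigma}\right),
\end{align*}
% with \mu = \log\eta, \sigma = 1/\beta, and G_{SEV}(z) = 1 - \exp(-e^{z}).
% Lognormal lifetime: by definition, \log T \sim N(\mu, \sigma^{2}), so
\begin{equation*}
P(\log T \le x) = \Phi\!\left(\frac{x - \mu}{\sigma}\right).
\end{equation*}
```

Both log lifetimes therefore belong to the location-scale family, with standard CDF $G_{\mathrm{SEV}}$ for the Weibull case and $\Phi$ for the lognormal case.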
If the underlying distribution is normal, the PDF of the normal distribution is given by (17). Using (17), with the standard normal density $g(z) = \phi(z)$, we obtain $\Psi(z) = -g'(z)/g(z) = z$. The MLEs of the normal distribution parameters are denoted by $\hat{\mu}$ and $\hat{\sigma}$. Replacing $\mu$ and $\sigma$ with $\hat{\mu}$ and $\hat{\sigma}$ in (6), we can express (6) in terms of $\Phi(\cdot)$, the CDF of the standard ND. According to (15) and (16), $h_1(z_{r:n}, Z_{s:n})$ and $h(Z_{s:n})$ can be replaced with their respective expected values in (10), and (10) can be rewritten accordingly. The values of $E(Z_{s:n})$ are available and have been tabulated by Teichroew [28]. Hence, the MLP$_E$ of $X_{s:n}$ for the ND can be derived in explicit form. For the Taylor series prediction method, the coefficients of the approximations are given in Appendix A. Equation (10) can then be rewritten, and the MLP$_T$ of $X_{s:n}$ can be obtained for $r + 1 \le s \le n$.
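The tabulated values $E(Z_{s:n})$ referred to above can also be approximated numerically. The sketch below integrates the standard density of the $s$th order statistic with Simpson's rule. The function name, integration limits, and step count are assumptions chosen for illustration, and this is not the method used by Teichroew [28].

```python
import math
from statistics import NormalDist


def expected_normal_order_stat(s, n, lower=-10.0, upper=10.0, steps=4000):
    """Approximate E(Z_{s:n}) for a standard normal sample by Simpson's rule."""
    if not 1 <= s <= n:
        raise ValueError("s must satisfy 1 <= s <= n")
    if steps % 2:
        steps += 1                       # Simpson's rule needs an even step count
    std = NormalDist()
    coef = s * math.comb(n, s)           # n! / ((s-1)! (n-s)!)

    def integrand(z):
        p = std.cdf(z)
        # z times the density of the s-th order statistic
        return z * coef * p ** (s - 1) * (1.0 - p) ** (n - s) * std.pdf(z)

    h = (upper - lower) / steps
    total = integrand(lower) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lower + k * h)
    return total * h / 3.0


# Expected standardized values for the censored positions in a sample with n = 13, r = 10
for s in (11, 12, 13):
    print(s, round(expected_normal_order_stat(s, 13), 4))
```

A useful sanity check is the known closed form $E(Z_{2:2}) = 1/\sqrt{\pi}$, together with the symmetry $E(Z_{s:n}) = -E(Z_{n-s+1:n})$.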
Based on the Taylor series prediction method, $h(Z_{s:n})$ and $h_1(z_{r:n}, Z_{s:n})$ are expanded in Taylor series at the points $G^{-1}(p_s)$ and $(G^{-1}(p_r), G^{-1}(p_s))$, respectively. The coefficients of the resulting approximations are given in Appendix B. Equation (10) can then be rewritten accordingly, and the MLP$_T$ of $X_{s:n}$ can be derived for $r + 1 \le s \le n$.
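The expansion points used by the Taylor series method are straightforward to compute. The following sketch evaluates $p_i = i/(n+1)$ and $G^{-1}(p_s)$ for the censored positions under the ND and the SEV. The helper names are hypothetical, and the SEV quantile assumes the standard form $G(z) = 1 - \exp(-e^{z})$.

```python
import math
from statistics import NormalDist


def plotting_positions(n):
    """Return p_i = i / (n + 1) for i = 1, ..., n."""
    return [i / (n + 1) for i in range(1, n + 1)]


def nd_quantile(p):
    """Standard normal quantile G^{-1}(p)."""
    return NormalDist().inv_cdf(p)


def sev_quantile(p):
    """Standard SEV quantile, assuming G(z) = 1 - exp(-exp(z))."""
    return math.log(-math.log(1.0 - p))


def expansion_points(n, r, quantile):
    """Return {s: G^{-1}(p_s)} for the censored positions s = r + 1, ..., n."""
    p = plotting_positions(n)
    return {s: quantile(p[s - 1]) for s in range(r + 1, n + 1)}


print(expansion_points(13, 10, nd_quantile))
print(expansion_points(13, 10, sev_quantile))
```

A first-order expansion about such a point $\xi$ has the generic form $h(Z) \approx h(\xi) + h'(\xi)(Z - \xi)$, which is linear in $Z$ and is what makes an explicit predictor possible.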

Three Model Selection Approaches
When several candidate distributions compete for the best underlying distribution and the user cannot identify which one is best, we suggest three approaches to discriminate among the candidates and obtain the predictor $\hat{X}_{s:n}$: the ratio of the maximized likelihood (RRML) approach, the modified $D_{sp}$ approach (abbreviated as the $D_{sp}$ approach), and the modified $D$ approach (abbreviated as the $D$ approach). The $D_{sp}$ and $D$ approaches are based on goodness-of-fit test methods. All three approaches can be implemented to obtain the predictor of $X_{s:n}$ using Algorithms 1-3.
Algorithm 1.
Step 1. Collect a type II censored sample of size $n$ with $r$ observed failure times, and consider $k$ candidate distributions.
Algorithm 2.
Step 1. Collect a type II censored sample of size $n$ with $r$ observed failure times.
Step 3. Based on the method proposed by Castro-Kuriss et al. [29], compute the modification of $D_{sp}$ for censored observations, where $u_{i:n} = G((x_{i:n} - \mu)/\sigma)$ and $G(\cdot)$, defined as in (2), is the CDF of the distribution assumed in model selection. Evaluate the value of $D_{sp}$ under each candidate distribution $M_j$ for $j = 1, 2, \ldots, k$.
Step 4. Let $\hat{X}^{2,l}_{s:n}$ be the predicted value of $X_{s:n}$ for $l = 1$ or 2; then $\hat{X}^{2,l}_{s:n}$ is the predicted value obtained under the candidate distribution with the smallest $\hat{D}_{sp}$.
If the candidate distributions are the ND and SEV, Steps 2, 3, and 4 in Algorithm 2 reduce to Step 2' and Step 3' as follows.
Step 2'. Obtain the MLEs $(\hat{\mu}, \hat{\sigma})$ under the ND and under the SEV. Obtain the predictors of $X_{s:n}$ under the ND and under the SEV for $s = r + 1, \ldots, n$ and $l = 1$ or 2.
Step 3'. Compute the modification of $D_{sp}$ for censored observations, where $u_{i:n} = G((x_{i:n} - \mu)/\sigma)$ and $G(\cdot)$, defined as in (2), is the CDF of the distribution assumed in model selection. Evaluate the values of $D_{sp}$ under the ND and the SEV at their respective MLEs.
Algorithm 3.
Step 1. Collect a type II censored sample of size $n$ with $r$ observed failure times.

Monte Carlo Simulations
A Monte Carlo simulation study was conducted in R to evaluate the performance of the three proposed approaches with the two prediction methods. We consider the ND and SEV as the candidate distributions competing for the best lifetime model in the simulation study. The type II censored samples, $x_{1:n}, \ldots, x_{r:n}$, used in the simulation were randomly generated from the ND and SEV with location parameter $\mu = 0$ and scale parameter $\sigma = 1$. Then, the $s$th order statistic is predicted and denoted by $\hat{X}_{s:n}$ for $s = r + 1, r + 2, \ldots, n$ and for the sample sizes $n = 20, 30, 40, 50$ and 60. For the purpose of comparison, the bias and mean square error (MSE) of $\hat{X}_{s:n}$ are evaluated over $N = 10000$ Monte Carlo runs, where $\hat{X}_{s:n,k}$ is the predicted value of $X_{s:n}$ obtained in the $k$th iteration of the simulation for $k = 1, \ldots, N$. All simulation results for the candidate distributions ND and SEV are displayed in Tables 1 and 2. From Tables 1 and 2, we notice that the bias and MSE are large when a misspecified model is used. The impact of misspecification depends on the values of $n$ and $r$. As $n$ or $r$ increases, the simulated bias and MSE decrease. We also find that the MSE of the Taylor series prediction method is smaller than that of the expected value prediction method when the sample size is 30 or larger.
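The bias and MSE computation described above can be sketched as follows. Two things here are assumptions made for illustration: the error in each run is measured against the realized value of $X_{s:n}$ in that run, and the predictor is a deliberately crude stand-in (the last observed failure) rather than MLP$_E$ or MLP$_T$.

```python
import random


def bias_and_mse(predictions, realized):
    """Average error and average squared error over Monte Carlo runs."""
    if not predictions or len(predictions) != len(realized):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - x for p, x in zip(predictions, realized)]
    m = len(errors)
    return sum(errors) / m, sum(e * e for e in errors) / m


def naive_predictor(observed):
    """Crude stand-in predictor (assumption): predict X_{s:n} by x_{r:n}."""
    return observed[-1]


def simulate(n, r, s, runs, seed=1):
    """Simulate standard normal samples, censor at r, and score predictions of X_{s:n}."""
    rng = random.Random(seed)
    preds, truth = [], []
    for _ in range(runs):
        sample = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        preds.append(naive_predictor(sample[:r]))   # only the first r values are used
        truth.append(sample[s - 1])                 # realized s-th order statistic
    return bias_and_mse(preds, truth)


bias, mse = simulate(n=20, r=16, s=18, runs=2000)
print(bias, mse)
```

Because $x_{r:n} < X_{s:n}$ for $s > r$, this stand-in always underpredicts, so its bias is negative. Replacing `naive_predictor` with an implementation of MLP$_E$ or MLP$_T$ would give the kind of comparison reported in the tables.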
To evaluate the performance of the three proposed model selection approaches for the MLP, Tables 3-5 report the simulation results for the three model selection approaches when the data are generated from the ND, and Tables 6-8 report the corresponding results when the data are generated from the SEV. The column "correct (%)" in Tables 3-8 is the correct model selection rate over all simulation runs. From Tables 3-8 we find that the three model selection approaches identify the correct underlying distribution with high probability. Moreover, the MSEs of these three approaches are close to the simulated MSEs obtained when the true underlying distribution is used. Overall, the correct model selection rates of the $D_{sp}$ approach and the $D$ approach are higher than that of the RRML approach when the sample size is smaller than 30. When the sample size is 30 or larger, the performance of the RRML approach improves and its correct model selection rate is higher than those of the $D_{sp}$ and $D$ approaches. Comparing the two MLPs, the MSEs of the expected value prediction method are smaller than those of the Taylor series prediction method when the sample size is smaller than 30. The proposed approaches perform well for large sample sizes.

Illustrative Examples
In this section, three numerical examples are presented to illustrate the proposed approaches in Sections 2-4.

Example 1.
The first data set consists of failure times of airplane components and was provided by Mann and Fertig [30]. Thirteen components were placed on test, and the test was terminated at the time of the 10th failure. The failure times (in hours) of the 10 components that failed were 0.22, 0.50, 0.88, 1.00, 1.32, 1.33, 1.54, 1.76, 2.50, 3.00.
Let $Y_1$ denote the logs of the ten observations. Figure 2 presents the histogram and the estimated PDFs of the ND and SEV. From Figure 2, it is difficult to decide on the best distribution for lifetime fitting because both candidate distributions fit this data set well. In this example, we use the $D_{sp}$ approach to discriminate between the competing models and apply the Taylor series prediction method to predict the censored future order statistics. The R source code of Example 1 can be found in Appendix C, and other designs can be obtained from the authors upon request. Using the Newton-Raphson algorithm, we obtained the MLEs of $\mu$ and $\sigma$ as $(\hat{\mu}, \hat{\sigma}) = (0.479, 0.938)$ for the ND and $(\hat{\mu}, \hat{\sigma}) = (0.821, 0.705)$ for the SEV.
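As a complementary check in the spirit of the RRML approach of Section 3, the censored log-likelihoods of the two candidates can be evaluated at the reported MLEs. The sketch below assumes the standard type II censored log-likelihood (with the constant term omitted) and the standard SEV form $G(z) = 1 - \exp(-e^{z})$. The function names are hypothetical, and this is not the authors' Appendix C code.

```python
import math
from statistics import NormalDist

# Example 1: observed failure times (hours); 13 units on test, 10 failures observed
failure_hours = [0.22, 0.50, 0.88, 1.00, 1.32, 1.33, 1.54, 1.76, 2.50, 3.00]
n = 13
y = [math.log(t) for t in failure_hours]   # log failure times, already in ascending order


def loglik_nd(mu, sigma, y, n):
    """Type II censored log-likelihood under the ND (constant term omitted)."""
    std = NormalDist()
    r = len(y)
    z = [(v - mu) / sigma for v in y]
    observed = sum(math.log(std.pdf(zi)) for zi in z) - r * math.log(sigma)
    censored = (n - r) * math.log(1.0 - std.cdf(z[-1]))   # n - r units survive past y_r
    return observed + censored


def loglik_sev(mu, sigma, y, n):
    """Type II censored log-likelihood under the SEV (constant term omitted)."""
    r = len(y)
    z = [(v - mu) / sigma for v in y]
    observed = sum(zi - math.exp(zi) for zi in z) - r * math.log(sigma)
    censored = -(n - r) * math.exp(z[-1])                 # log survival is -exp(z)
    return observed + censored


# Evaluate each candidate at the MLEs reported in the text (rounded to three decimals)
candidates = {
    "ND": loglik_nd(0.479, 0.938, y, n),
    "SEV": loglik_sev(0.821, 0.705, y, n),
}
selected = max(candidates, key=candidates.get)
print(candidates, selected)
```

With a sample this small, the two maximized log-likelihoods may be close, and the comparison is sensitive to the rounding of the reported estimates, so the outcome should be read as indicative only.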
The $D_{sp}$ values under the ND and SEV are 0.223 and 0.212, respectively. Because the $D_{sp}$ value obtained from the SEV is smaller than that obtained from the ND, we select the SEV as the best distribution for this data set. The Taylor series predictions of $(Y_{11:13}, Y_{12:13}, Y_{13:13})$ under the SEV with the censored sample can be obtained as $(\hat{Y}^{2,2}_{11:13}, \hat{Y}^{2,2}_{12:13}, \hat{Y}^{2,2}_{13:13})$.

Example 2.
For more information about this carbon fiber breaking strength data set, see Meeker and Escobar (1998). In this example, we assume that the censoring proportion is $r/n = 0.8696$ ($r = 20$, $n = 23$). Figure 3 presents the histogram and the estimated PDFs of the ND and SEV based on the type II right-censored data set. From Figure 3, it is difficult to decide on the best distribution from these two candidate distributions.
We use the $D$ approach for model selection in Example 2 and the expected value prediction method to predict the future order statistics.

Example 3.
The experiment on the pull-off performance of an assembly for use in automotive engine components, reported by Byrne and Taguchi [33] and further studied by Yang and Tong [14], is used to illustrate the methodologies developed in this study. The experiment was conducted to find a method to maximize the pull-off force. Four control factors that could influence the assembly's pull-off force were identified. Each run was repeated 8 times, and the pull-off force was recorded in pounds. Table 9 lists the four control factors with their levels and the complete data of this experiment. In this example, we assume that the censoring proportion is $r/n = 0.75$ ($r = 6$, $n = 8$). Please note that experimental design methods cannot be applied directly to censored data. It is therefore necessary to predict the unobserved data and use a pseudo-complete data set for the experimental design analysis.
We use the RRML approach for model selection and the Taylor series prediction method to predict the future order statistics in this example. After combining the uncensored data and the predicted censored data, the pseudo-complete data are shown in Table 10.

Conclusions
It can be difficult to identify the best model among several candidate distributions. The sample size, estimation methods, and goodness-of-fit testing methods can all affect the final result of model selection. In this study, we focus on providing reliable methods for predicting censored data that reduce the impact of model misspecification.
In this study, three model selection approaches are proposed for predicting the future order statistics from type II censored data, in which the quality characteristic is assumed to follow a location-scale family. The ND and SEV are considered as the candidate members of the location-scale family competing for the best underlying distribution. The ND arises from the log transformation of the lognormal distribution, and the SEV arises from the log transformation of the Weibull distribution. Discrimination between the lognormal and Weibull distributions is therefore equivalent to discrimination between the ND and SEV. Hence, both the ND and SEV are widely used in practical reliability applications. With any one of the three proposed approaches, robust predictions can be obtained even under model uncertainty. Three examples are used to illustrate the methodologies. Moreover, the performance of the three proposed approaches is evaluated using Monte Carlo simulations. Numerical results show that the three proposed model selection approaches are robust and effective in obtaining good predicted values for the censored future order statistics.
Comparing the three proposed approaches, we recommend using the $D_{sp}$ approach or the $D$ approach for model selection and the expected value prediction method to predict the future order statistics in small sample cases, that is, cases with sample size $n$ less than 30. For large sample cases (sample size $n$ of 30 or more), we recommend using the RRML approach for model selection and the Taylor series prediction method to predict the future order statistics. Simulation results show that the proposed approaches are robust and can substantially reduce the impact of model uncertainty. The proposed approaches can

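2.1. Approximate Maximum Likelihood Estimation.
Let $X_i$ denote the logarithm of the failure time of the $i$th item, and assume that $X_i$ follows a location-scale family with probability density function (PDF) and cumulative distribution function (CDF)
$$f(x; \mu, \sigma) = \frac{1}{\sigma}\, g\!\left(\frac{x - \mu}{\sigma}\right) \qquad (1)$$
and
$$F(x; \mu, \sigma) = G\!\left(\frac{x - \mu}{\sigma}\right). \qquad (2)$$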

Figure 1: The flow chart of the major contribution of this study.

Figure 2: The histogram and the estimated probability density functions of airplane component's failure time in Example 1.

Figure 3: The histogram and the estimated probability density functions of tests on endurance of deep groove ball bearings in Example 2.

Table 1: The corresponding bias and MSEs for different settings with model misspecification when the true distribution is ND.

Table 2: The corresponding bias and MSEs for different settings with model misspecification when the true distribution is SEV.

Table 3: The corresponding bias and MSEs for different settings of the RRML approach when the true distribution is ND.

Table 4: The corresponding bias and MSEs for different settings of the $D_{sp}$ approach when the true distribution is ND.

Table 5: The corresponding bias and MSEs for different settings of the $D$ approach when the true distribution is ND.

Table 6: The corresponding bias and MSEs for different settings of the RRML approach when the true distribution is SEV.

Table 7: The corresponding bias and MSEs for different settings of the $D_{sp}$ approach when the true distribution is SEV.

Table 8: The corresponding bias and MSEs for different settings of the $D$ approach when the true distribution is SEV.

Table 9: Factors with levels of each factor and complete data in the experiments. Factor C is insertion depth with Shallow (1), Medium (2) and Deep (3) levels. Factor D is percent adhesive in connector pre-dip with Low (1), Medium (2) and High (3) levels.

Table 10: The pseudo-complete data and results of model selection.