One-Step Dynamic Classifier Ensemble Model for Customer Value Segmentation with Missing Values

Scientific customer value segmentation (CVS) is the basis of efficient customer relationship management; customer credit scoring, fraud detection, and churn prediction are all CVS tasks. In real CVS applications, customer data usually contain many missing values, which may greatly degrade the performance of a CVS model. This study proposes a one-step dynamic classifier ensemble model for missing values (ODCEM). On the one hand, ODCEM integrates the preprocessing of missing values and classification modeling into one step; on the other hand, it uses multiple classifiers ensemble technology to construct the classification models. Empirical results on the credit scoring dataset “German” from UCI and the real customer churn prediction dataset “China churn” show that ODCEM outperforms four commonly used “two-step” models and the ensemble based model LMF, and that it can provide better decision support for market managers.


Introduction
In the increasingly fierce market competition, the traditional resources in enterprises, such as product quality, price, and production capacity, have been unable to bring new competitiveness for the enterprises; customer relationship management (CRM) has become the new resource from which enterprises can gain benefit continuously. The core objective of CRM is to maximize the customer value for enterprises [1]. As we all know, each customer has different value for an enterprise; usually about 80% profits are created by 20% customers. The enterprise can get the most profits only if it devotes the limited resources to the most valuable customers, develops specialized customer strategy, and designs different products and services for different customers [2]. Therefore, scientific customer value segmentation (CVS) is the base for efficient CRM. In the last decades, CVS has been applied to many key business processes such as customer retention, customer growth analysis, and customer acquisition, and it provides important support for decision making in CRM [3,4]. CVS is to classify customers according to their ability of creating value for enterprise, provide targeted products, services, and marketing models, enable enterprise to allocate resources more rationally, reduce cost effectively, and gain more profitable market penetration [5]. In fact, customer credit scoring, fraud detection, and churn prediction all belong to CVS, and their essential work is to classify customers according to different dimensions of customer value [6,7].
The research methods of CVS can be roughly divided into three categories: (1) qualitative analysis, in which the customer manager judges the value of each customer by selecting and reading the customer information (this method is highly subjective); (2) statistical analysis methods, which construct a statistical model for CVS (representative methods include logistic regression [8] and discriminant analysis [9]); (3) machine learning methods (with the development of information technology, models derived from machine learning have been used in CVS since the 1990s, such as support vector machine [10], decision tree [11], artificial neural network [12], and multiple classifiers ensemble technology [13]).
With the in-depth study of the theory and practice of CVS, it is discovered that many customer data collected through questionnaires, interviews, and other means often include lots of missing values (MVs) [14,15]. Yim et al. [15] have studied customer satisfaction through 450 questionnaires and found that 90 questionnaires contain a large number of MVs. The data not only from the questionnaires but also from the enterprises' CRM database often contain MVs. Take the enterprise Honeywell listed in Fortune Global 500 for example; although they have strict data collection standards, the missing rate of data in the customer database is still as high as 50% [14].
The MVs contained in CRM customer data influence the CVS effects to a large extent [16]. To solve this issue effectively, Kim et al. [17] have proposed a three-phase framework: (1) data collection and preprocess; (2) CVS modeling (it finds the main factors that affect customer value from customer information and constructs classification model to predict customer value based on these factors); (3) marketing strategies formulation (it proposes appropriate strategies for different customers to maximize their values for the enterprise according to the results obtained in Phase 2).
Scholars have done much research around the above framework. They usually preprocess the MVs to make the data complete and then establish the CVS model. The two steps are carried out independently, so we call this a "two-step" model. The simplest preprocessing method is listwise deletion (LD) [14], which deletes the instances with MVs from the dataset directly. For example, Subramania and Khare [18] utilized a pattern classification method in the diagnosis analysis of automotive warranty and service and adopted LD to handle MVs. This method is simple, but it can easily lose a lot of important information [19]. Therefore, imputation methods are more popular. For instance, Lessmann and Voß [20] proposed a support vector machine based hierarchical reference model for credit scoring and replaced MVs with the mean of the nonmissing values of the corresponding attribute before modeling; Paleologo et al. [21] presented a subagging model for credit scoring and replaced the missing values by either the maximum or the minimum of the nonmissing values of the attribute; Li and Wang [22] proposed a Bayesian network technology based attribute fatigue analysis model in product development and imputed the MVs by the EM method [23] before modeling.
The studies mentioned above have made important contributions to customer value segmentation with MVs. However, some scholars have found that many CVS models are sensitive to the data preprocessing method and that the results are unstable [24,25]. In addition, although imputation methods are the most popular preprocessing methods, they still have some disadvantages. The commonly used imputation methods are based on the missing-at-random assumption [26], so they all need to suppose that the data obey some distribution model. In practice, however, a variety of missing mechanisms are often intertwined. If the assumption and the model are unreasonable, the imputed data are prone to distortion, which may lead to serious estimation bias and affect the learning of subsequent classifiers [27]. Therefore, the "two-step" CVS models need further improvement.
This study introduces multiple classifiers ensemble technology [28] to CVS and constructs one-step dynamic classifier ensemble model (ODCEM) for MVs, which integrates the preprocess of missing values and the classification modeling into one step. The empirical results in a customer credit scoring dataset and a customer churn prediction dataset show that the proposed method is superior to other models in CVS performance.
The rest of this study is organized as follows: Section 2 briefly introduces the commonly used processing methods for MVs; Section 3 presents the working principle and detailed steps of ODCEM; Section 4 describes the experimental design and analyzes the results in detail; finally, Section 5 gives the conclusions and future work.

The Commonly Used Processing Methods for MVs
In general, most customer value segmentation models, such as artificial neural network, logistic regression, and support vector machine, require that the customer data be complete. If even one value of some attribute is missing, the model cannot be trained [29]. At present, the main methods of handling MVs fall into three categories: listwise deletion, imputation methods, and ensemble based methods.

Listwise Deletion.
Listwise deletion (LD) [14] is the simplest method of handling MVs. LD directly deletes any instance with MVs from the model training set. If the missing rate of the model training set (the ratio between the number of instances with MVs and the size of the dataset) is very small, LD is effective; if the missing rate is large and the MVs are not distributed randomly, LD may bias the data and lead to wrong conclusions [14]. Moreover, LD cannot be used when every instance in the model training set contains MVs or when the instances in the test set also contain MVs.
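As an illustration, LD and the dataset-level missing rate defined above can be sketched in a few lines of Python. The sketch assumes that an MV is encoded as None; the function names are ours and are not part of the original study.

```python
def listwise_delete(rows, labels):
    """Keep only the instances (rows) that contain no missing value (None)."""
    kept = [(r, y) for r, y in zip(rows, labels) if None not in r]
    return [r for r, _ in kept], [y for _, y in kept]


def dataset_missing_rate(rows):
    """Share of instances with at least one MV (the definition used in the text)."""
    return sum(1 for r in rows if None in r) / len(rows)


rows = [[1.0, 2.0], [None, 3.0], [4.0, 5.0], [6.0, None]]
labels = [0, 1, 0, 1]
clean_rows, clean_labels = listwise_delete(rows, labels)
```

In this toy example half of the instances are discarded, which shows how quickly LD loses information when the missing rate is not small.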

Imputation Methods.
At present, imputation methods are the most commonly used methods for dealing with MVs. Each MV is replaced by a value generated by some mechanism from the nonmissing values in the model training set. Scholars have proposed a variety of MV imputation methods, some of which are simple and widely applicable, such as mean substitution (MS) [30], $k$-nearest neighbours imputation (KI) [27], regression imputation (RI) [31], and EM imputation (EM) [23]. Mean substitution (MS) [30] usually divides the attributes into nominal and numeric ones when dealing with MVs. For a nominal attribute, the MV is replaced by the most common attribute value, while for a numeric attribute, the MV is replaced by the average of all nonmissing values of that attribute.
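A minimal sketch of mean substitution follows. It assumes that MVs are encoded as None and that the caller states which columns are nominal; the name `mean_substitution` and its `nominal_cols` parameter are illustrative choices rather than part of any cited method.

```python
from collections import Counter


def mean_substitution(rows, nominal_cols=()):
    """Replace each MV (None) by the column mode (nominal) or mean (numeric)."""
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        if not observed:
            fill.append(None)  # column entirely missing: nothing to impute from
        elif j in nominal_cols:
            fill.append(Counter(observed).most_common(1)[0][0])
        else:
            fill.append(sum(observed) / len(observed))
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]


data = [[1.0, "a"], [None, "b"], [3.0, None], [5.0, "b"]]
imputed = mean_substitution(data, nominal_cols={1})
```

Every MV of an attribute receives the same value, so the variance of the imputed attribute shrinks as the number of MVs grows.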
$k$-nearest neighbours imputation [27] uses an instance based algorithm to impute the MVs. Every time it finds an MV in the current instance, it computes the $k$ nearest neighbours of the instance and imputes a value from them. For a nominal attribute, the most common value among the neighbours is taken, and for a numeric attribute, the average value is used.
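The idea can be sketched as follows for numeric attributes only. Two points are our assumptions, since the text does not fix them: the distance uses only the attributes observed in both instances, and only instances that have the target attribute observed may act as neighbours.

```python
import math


def knn_impute(rows, k=2):
    """Impute each MV (None) with the mean of that attribute over the k nearest donors."""
    def dist(a, b):
        # Assumption: compare two instances on the attributes observed in both.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other instances whose attribute j is observed.
            donors = [r for t, r in enumerate(rows) if t != i and r[j] is not None]
            donors.sort(key=lambda r: dist(row, r))
            nearest = donors[:k]
            if nearest:
                result[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return result


data = [[1.0, 10.0], [1.2, 12.0], [9.0, 90.0], [1.1, None]]
filled = knn_impute(data, k=2)
```

Here the last instance borrows from the two instances that are close to it on the first attribute and ignores the distant third one.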
The RI method [31] first selects some independent attributes for predicting the MV and then constructs a regression equation to estimate the MV; that is, it replaces the MV with its conditional expectation. In detail, given an MV for an attribute $y$, suppose that $p$ attributes $x_1, x_2, \dots, x_p$ have been observed for that instance. The records in which these $p + 1$ attributes are all available define a training set, on which a regression model predicting $y$ from the $p$ predictors is fitted. Finally, the fitted model provides a prediction for the MV of $y$.
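For brevity, the following sketch restricts regression imputation to a single predictor and an ordinary least-squares line; the general method allows several predictors. MVs are assumed to be encoded as None, and the function and parameter names are hypothetical.

```python
def regression_impute(rows, target, predictor):
    """Fill MVs of column `target` from a least-squares line on column `predictor`."""
    out = [list(r) for r in rows]
    # Training set: records where both the target and the predictor are observed.
    train = [(r[predictor], r[target]) for r in rows
             if r[predictor] is not None and r[target] is not None]
    n = len(train)
    if n < 2:
        return out  # not enough complete records to fit a line
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    if sxx == 0:
        return out  # constant predictor: the slope is undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    for r in out:
        if r[target] is None and r[predictor] is not None:
            r[target] = intercept + slope * r[predictor]
    return out


data = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, None]]
completed = regression_impute(data, target=1, predictor=0)
```

The three complete records lie on a straight line, so the MV in the last record is replaced by the value of that line at the observed predictor.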
EM imputation [23] iteratively finds the maximum likelihood estimates from the observed data and consists of two steps: the expectation step (E-step) and the maximization step (M-step). In the E-step, it calculates the expectation of the complete-data sufficient statistics given the observed data and the current parameter estimates; in the M-step, it updates the parameter estimates by maximum likelihood based on the current values of the complete-data sufficient statistics. The two steps are repeated until the parameter estimates converge, and the expectation of each MV in the final E-step is taken as the imputed value.
In fact, all the above imputation methods are single imputation methods. Single imputation replaces each MV with one value, which cannot reflect the uncertainty of MVs well. Thus, Rubin [32] proposed the multiple imputation (MI) method. In this method, each MV is imputed $M$ times by the same imputation algorithm, which uses a model that incorporates some randomness. As a result, $M$ "complete" datasets are generated, and usually the average of the estimates across these datasets is used to generate the final imputed value. The main disadvantage of MI is its large computational cost.

Ensemble Based Methods.
Recently, some scholars have tried to use multiple classifiers ensemble technology to construct classification models for data with MVs. For example, Krause and Polikar [33] and Mohammed et al. [34] proposed the Learn++ method for missing features (abbreviated as LMF), which classifies data with MVs directly. It selects some attribute subsets from the whole attribute space, obtains a number of training subsets by mapping, and then trains a base classifier on each training subset. For each test instance $x^*$ (which may contain MVs), LMF finds the base classifiers that can classify it and combines their classification results by voting to get the final classification result. The empirical results show that LMF can achieve better classification performance. For a more detailed description of Learn++, please refer to [34].
In theory, LMF also follows a one-step ensemble strategy. However, Mohammed et al. have pointed out that LMF cannot classify a test instance with many MVs, because no usable base classifier can be found for it [34]. In addition, for each test instance $x^*$, LMF obtains the final classification result by combining the results of all available base classifiers, although redundancy may exist among them. Therefore, the classification performance can be expected to improve if an appropriate subset of classifiers is selected for the ensemble.

One-Step Dynamic Ensemble Model for Missing Values
Basic Idea.
In addition to missing values, customer value segmentation (CVS) data often suffer from other issues such as noise and imbalanced class distribution. In this study, we focus on CVS with missing values and propose a one-step dynamic classifier ensemble model (ODCEM) for missing values; other issues that may exist in the CVS dataset, such as imbalanced class distribution, need to be handled by preprocessing before the ODCEM model is constructed. The term "one-step ensemble" in ODCEM has two meanings: first, it integrates the preprocessing of missing values in Phase 1 with the customer classification modeling in Phase 2 of the "three-phase" CVS framework proposed by Kim et al. [17], which reduces the dependence on assumptions about the missing mechanism and the data distribution model; second, it introduces the multiple classifiers ensemble technique into customer value classification modeling.
Suppose a CVS problem contains $m$ attributes, and its training set $D_{\text{train}}$ and test set $D_{\text{test}}$ contain $n_1$ and $n_2$ customer instances, respectively. In addition, all the instances can be divided into $C$ classes according to their value for the enterprise, and both $D_{\text{train}}$ and $D_{\text{test}}$ contain some MVs. The ODCEM model mainly includes two phases: training base classifiers and classifying test instances. In the training phase, it first divides $D_{\text{train}}$ into three subsets $S_1$, $S_2$, and $S_3$ according to the missing rate of the instances (three subsets are used because they correspond to the low, middle, and high missing level sets, respectively). Then, in each subset, it assigns sampling weights to the attributes according to the number of MVs in each attribute, selects attribute subsets randomly, obtains a series of training subsets by mapping as in the random subspace method [35], deletes the instances with MVs from each training subset, and trains classification models that compose the base classifier pool (BCP). In the classification phase, for each test instance $x_i^* \in D_{\text{test}}$ ($i = 1, 2, \dots, n_2$), it finds the $K$ nearest neighbours of $x_i^*$ in $D_{\text{train}}$ to compose the local area $LA$ of $x_i^*$, selects from the BCP the classifiers with the best classification performance in $LA$ to classify $x_i^*$, and obtains the final classification result of $x_i^*$ by weighted voting. The process of the ODCEM model is described in Figure 1.
It is notable that, in the ODCEM model, if there are only a few MVs in $D_{\text{train}}$, most of the training instances are assigned to the low missing level subset $S_1$, and sufficient training instances are available to obtain base classifiers with good classification performance, which ensures a satisfactory CVS effect. If there are many MVs in $D_{\text{train}}$, the subsets $S_1$, $S_2$, and $S_3$ may all contain a certain number of instances, and base classifiers trained in $S_1$ or $S_2$, which contain fewer MVs, can still be found to classify the test instance $x^*$. In short, some base classifiers can always be found for a given test instance. Thus, ODCEM can largely make up for the disadvantage of the LMF method proposed by Mohammed et al. [34]. In the following, we describe the process of the ODCEM model in detail.

Train Base Classifiers.
To train the base classifiers, ODCEM first divides the training set $D_{\text{train}}$ into 3 subsets according to the missing rate of the instances. For each instance $x_i \in D_{\text{train}}$, its missing rate is defined as follows:

$$r_i = \frac{\text{miss}_i}{m}, \tag{1}$$

where $\text{miss}_i$ is the number of missing values in instance $x_i$ and $m$ is the total number of attributes. Clearly, $0 \le r_i \le 1$.
After computing the missing rate $r_i$ of every training instance ($i = 1, 2, \dots, n_1$, where $n_1$ is the number of training instances), we rank all training instances in ascending order of $r_i$. Thus, the nearer an instance is to the front, the fewer missing values it has, and it may even have none. Further, we divide the range of $r_i$ into 3 intervals, $[0, 1/3)$, $[1/3, 2/3)$, and $[2/3, 1]$, and divide the whole training set $D_{\text{train}}$ into 3 subsets $S_1$, $S_2$, and $S_3$ accordingly. Therefore, $S_1$ contains the instances with missing rate $r_i \in [0, 1/3)$, $S_2$ those with $r_i \in [1/3, 2/3)$, and $S_3$ those with $r_i \in [2/3, 1]$.
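The partition step can be sketched as follows, assuming that MVs are encoded as None. The function names are ours. The interval tests are written with integers so that instances whose missing rate is exactly 1/3 or 2/3 fall into the intended subset without floating-point rounding.

```python
def missing_rate(instance):
    """Missing rate of one instance: number of MVs divided by number of attributes."""
    return sum(v is None for v in instance) / len(instance)


def partition_by_missing_rate(rows):
    """Split instance indices into the low, middle and high missing-level subsets."""
    subsets = {1: [], 2: [], 3: []}
    for idx, row in enumerate(rows):
        miss, m = sum(v is None for v in row), len(row)
        # Compare 3*miss with m and 2*m instead of comparing miss/m with 1/3 and 2/3.
        if 3 * miss < m:
            subsets[1].append(idx)   # missing rate in [0, 1/3)
        elif 3 * miss < 2 * m:
            subsets[2].append(idx)   # missing rate in [1/3, 2/3)
        else:
            subsets[3].append(idx)   # missing rate in [2/3, 1]
    return subsets


train = [[1, 2, 3], [1, None, 3], [None, None, 3], [None, None, None], [4, 5, 6]]
parts = partition_by_missing_rate(train)
```

Because the intervals are fixed, the explicit ranking of the instances is not needed to build the three subsets; it only determines the order inside each subset.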
In order to train base classifiers in each subset $S_t$ ($t = 1, 2, 3$), we select a series of attribute subsets randomly, following the basic idea of random subspace (RSS) [35]. As the number of MVs often differs across attributes, an attribute with fewer MVs contains more information and should have a larger probability of being selected. For a nonempty subset $S_t$, the sampling weight $w_{tj}$ of attribute $A_j$ ($j = 1, 2, \dots, m$) in $S_t$ is given by (2) as a function of $q_{tj}$ and $|S_t|$, where $q_{tj}$ is the number of MVs in the column of attribute $A_j$ in subset $S_t$ and $|S_t|$ is the total number of instances in $S_t$. In particular, if $q_{tj} = 0$, that is, there is no MV in the column of attribute $A_j$, we set $w_{tj} = 2$ directly; namely, we assign a much larger sampling weight to this attribute than to the attributes with MVs. Finally, the weight vector $w_t = (w_{t1}, w_{t2}, \dots, w_{tm})$ is normalized, and the final sampling weight of attribute $A_j$ in $S_t$ is obtained:

$$\bar{w}_{tj} = \frac{w_{tj}}{\sum_{l=1}^{m} w_{tl}}. \tag{3}$$

We select $N$ attribute subsets randomly according to the sampling weight vector $\bar{w}_t = (\bar{w}_{t1}, \bar{w}_{t2}, \dots, \bar{w}_{tm})$; the number of attributes in each attribute subset is equal to half of the total number of attributes [35]. We then obtain $N$ training subsets by mapping. The training subsets obtained at this point may contain MVs, while commonly used classification models such as neural network, support vector machine, and logistic regression require a training set without MVs. Thus, we delete the instances with MVs from each training subset and, if the remaining training subset is nonempty, train a base classifier on it. Finally, all the trained base classifiers constitute the base classifier pool (BCP). It is worth noting that the number of base classifiers in the BCP varies across datasets; it may be affected by the missing rate of the dataset, the size of the dataset, and so forth.
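The construction of one training subset can be sketched as follows. The raw weight of an attribute with MVs is not recoverable from the text here, so the sketch assumes the stand-in $1 - q/|S|$, where $q$ is the number of MVs of the attribute and $|S|$ the subset size; only the special value 2 for an attribute without MVs and the normalization are taken from the text. MVs are encoded as None, attributes are drawn without replacement, and all function names are hypothetical.

```python
import random


def attribute_weights(rows):
    """Normalised sampling weights of the attributes in one missing-level subset."""
    size, m = len(rows), len(rows[0])
    raw = []
    for j in range(m):
        q = sum(r[j] is None for r in rows)   # MVs in the column of attribute j
        # ASSUMPTION: 1 - q/size stands in for the weight formula of the source;
        # only the special case "weight 2 when q == 0" is stated in the text.
        raw.append(2.0 if q == 0 else 1.0 - q / size)
    total = sum(raw)
    return [w / total for w in raw]


def sample_attribute_subset(weights, rng):
    """Draw half of the attributes without replacement, proportionally to weight."""
    candidates = list(range(len(weights)))
    chosen = []
    for _ in range(len(weights) // 2):
        pool = [weights[j] for j in candidates]
        if sum(pool) <= 0:
            break  # only fully missing attributes are left
        pick = rng.choices(candidates, weights=pool, k=1)[0]
        candidates.remove(pick)
        chosen.append(pick)
    return sorted(chosen)


def training_subset(rows, labels, attrs):
    """Project onto the chosen attributes, then delete instances that still have MVs."""
    projected = [([r[j] for j in attrs], y) for r, y in zip(rows, labels)]
    return [(x, y) for x, y in projected if None not in x]


rng = random.Random(0)
subset = [[1, None, 3, 4], [2, 5, None, 4], [3, 6, 7, 8]]
labels = [0, 1, 0]
weights = attribute_weights(subset)
attrs = sample_attribute_subset(weights, rng)
train_s = training_subset(subset, labels, attrs)
```

Repeating the last two calls $N$ times and training one classifier on every nonempty result yields the contribution of one missing-level subset to the BCP.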

Classify the Test Instances.
In the above section, a series of base classifiers were trained on the training set. In this section, we classify all the instances in the test set with the trained base classifiers. The ODCEM model proposed in this study belongs to dynamic classifier ensemble selection [36], in which a classifier subset is selected from the BCP for each test instance $x_i^* \in D_{\text{test}}$ ($i = 1, 2, \dots, n_2$).
To do so, the $K$ nearest neighbours of the test instance $x_i^*$ must be selected from the whole training set $D_{\text{train}}$ to compose the local area (called $LA$) of $x_i^*$. In this study, we use the Euclidean distance to measure the distance between $x_i^*$ and all instances in $D_{\text{train}}$. However, this distance sometimes cannot be calculated directly because there are MVs in $x_i^*$ or in the training instance. For the test instance $x_i^* = (x_{i,1}^*, x_{i,2}^*, \dots, x_{i,m}^*)$ and any instance $x_k = (x_{k,1}, x_{k,2}, \dots, x_{k,m})$ in $D_{\text{train}}$, we define two MV indication vectors $u_i = (u_{i,1}, u_{i,2}, \dots, u_{i,m})$ and $v_k = (v_{k,1}, v_{k,2}, \dots, v_{k,m})$, whose $j$th components indicate whether $x_{i,j}^*$ and $x_{k,j}$, respectively, are missing, where $j = 1, 2, \dots, m$. Further, we suppose that only $p$ ($p \le m$) attribute values are not missing in the test instance $x_i^*$; that is, exactly $p$ components of $u_i$ mark a nonmissing value. The distance between the test instance $x_i^*$ and the instance $x_k$ in $D_{\text{train}}$ is then calculated by (5) with the help of these indication vectors. In (5), when the distance between two instances cannot be calculated, we take it as $\infty$.
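One possible realization of the neighbour search is sketched below. The exact distance of the original model is not reproduced: the sketch assumes a Euclidean distance restricted to the attributes observed in both instances and averaged over their number, codes an observed value as 1 and an MV (None) as 0 in the indication vectors, and returns infinity when the two instances share no observed attribute. Function names are ours.

```python
import math


def indicator(instance):
    """MV indication vector: 1 where the value is observed, 0 where it is missing."""
    return [0 if v is None else 1 for v in instance]


def masked_distance(x_test, x_train):
    """Euclidean distance over the attributes observed in both instances (assumed form)."""
    u, v = indicator(x_test), indicator(x_train)
    shared = [j for j in range(len(u)) if u[j] * v[j] == 1]
    if not shared:
        return math.inf  # the distance cannot be calculated
    squared = sum((x_test[j] - x_train[j]) ** 2 for j in shared)
    # ASSUMPTION: average over the shared attributes so that pairs with
    # different numbers of shared attributes remain comparable.
    return math.sqrt(squared / len(shared))


def local_area(x_test, train_rows, k):
    """Indices of the k nearest training instances; unreachable ones are skipped."""
    dists = [(masked_distance(x_test, row), i) for i, row in enumerate(train_rows)]
    reachable = sorted(pair for pair in dists if pair[0] < math.inf)
    return [i for _, i in reachable[:k]]


train = [[0.0, 0.0], [1.0, None], [None, None], [5.0, 5.0]]
x_star = [1.0, 1.0]
area = local_area(x_star, train, k=2)
```

In this toy example the second training instance is the nearest one although one of its values is missing, and the third instance can never enter the local area.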
After finding the local area $LA$ of the test instance $x_i^*$ through (5), we select the suitable classifiers to classify the instances in $LA$ (a classifier is selected if $LA$ has no missing value in any column corresponding to the attributes of the feature subset used to train this classifier), calculate the classification accuracy of each selected classifier in $LA$, and keep the more accurate half of them. Let $H$ denote the number of classifiers kept and $acc_1, acc_2, \dots, acc_H$ their classification accuracies in $LA$. The test instance $x_i^*$ is then classified with the $H$ selected classifiers: each base classifier $f_h$ ($h = 1, 2, \dots, H$) outputs a probability estimate $P(f_h(x_i^*) = c_l \mid x_i^*)$, the probability that $x_i^*$ belongs to class $c_l$ ($l = 1, 2, \dots, C$, with $C$ the number of classes). The final prediction for $x_i^*$ is obtained through weighted voting:

$$\hat{y}(x_i^*) = \arg\max_{c_l} \sum_{h=1}^{H} \omega_h \, P(f_h(x_i^*) = c_l \mid x_i^*),$$

where $\omega_h = acc_h / \sum_{h'=1}^{H} acc_{h'}$, $h = 1, 2, \dots, H$.
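The selection and voting steps can be sketched as follows. The interface is hypothetical: the pool is a list of pairs of an attribute index tuple and a function that returns class probabilities for a projected instance, and classes are coded 0, 1, and so on. The requirement that the test instance itself be observed on the attributes of a classifier is our addition, since a classifier cannot otherwise be applied to it.

```python
def select_and_vote(x_test, area_rows, area_labels, pool, n_classes):
    """Dynamic selection on the local area followed by accuracy-weighted voting."""
    def usable(attrs):
        rows_ok = all(r[j] is not None for r in area_rows for j in attrs)
        # ASSUMPTION: the test instance must also be observed on these attributes.
        return rows_ok and all(x_test[j] is not None for j in attrs)

    scored = []
    for attrs, predict_proba in pool:
        if not usable(attrs):
            continue
        hits = 0
        for row, label in zip(area_rows, area_labels):
            probs = predict_proba([row[j] for j in attrs])
            hits += probs.index(max(probs)) == label
        scored.append((hits / len(area_rows), attrs, predict_proba))
    if not scored:
        return None  # no base classifier can handle this instance
    scored.sort(key=lambda item: item[0], reverse=True)
    best = scored[:max(1, len(scored) // 2)]   # keep the more accurate half
    total_acc = sum(acc for acc, _, _ in best)
    votes = [0.0] * n_classes
    for acc, attrs, predict_proba in best:
        weight = acc / total_acc if total_acc > 0 else 1.0 / len(best)
        probs = predict_proba([x_test[j] for j in attrs])
        for c in range(n_classes):
            votes[c] += weight * probs[c]
    return votes.index(max(votes))


def threshold_clf(v):
    """Toy classifier on one attribute: class 1 when the value exceeds 0.5."""
    return [0.1, 0.9] if v[0] > 0.5 else [0.9, 0.1]


def constant_clf(v):
    """Toy classifier that ignores its input and always favours class 1."""
    return [0.2, 0.8]


pool = [((0,), threshold_clf), ((1,), constant_clf)]
area_rows = [[0.1, 3.0], [0.2, 1.0], [0.9, 2.0]]
area_labels = [0, 0, 1]
prediction = select_and_vote([0.3, 2.5], area_rows, area_labels, pool, n_classes=2)
```

In the toy example the threshold classifier is perfect on the local area and is the only one kept, so it alone decides the class of the test instance.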

Pseudocode of Model.
The pseudocode of ODCEM model is as follows.
Input. Training set $D_{\text{train}}$, test set $D_{\text{test}}$, the number of nearest neighbours $K$, and the number $N$ of attribute subsets selected in each subset $S_t$ ($t = 1, 2, 3$).
Output. The final classification result of each instance $x_i^*$ in $D_{\text{test}}$.
(1) Calculate the missing rate $r_i$ of each instance in $D_{\text{train}}$ according to (1) and divide the entire training set into 3 subsets according to $r_i$: $S_1$, $S_2$, and $S_3$.
(2) For each subset $S_t$ ($t = 1, 2, 3$), if $S_t$ is nonempty, then
(2.1) calculate the sampling weights $\bar{w}_{tj}$ ($j = 1, 2, \dots, m$) of all attributes according to (3);
(2.2) select $N$ attribute subsets randomly from the attribute space according to $\bar{w}_t$ and obtain training subsets $S_{ts}$ ($s = 1, 2, \dots, N$) by mapping;
(2.3) delete the instances with MVs in $S_{ts}$, and if the remaining $S_{ts}$ is nonempty, train a base classifier with $S_{ts}$ and add it to the BCP.
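(3) For each test instance $x_i^* \in D_{\text{test}}$,
(3.1) find the $K$ nearest neighbours of $x_i^*$ in the entire training set $D_{\text{train}}$ according to (5) to compose its local area $LA$;
(3.2) select from the BCP the classifiers whose attribute subsets have no MV in $LA$, calculate their classification accuracy in $LA$, and keep the more accurate half;
(3.3) classify $x_i^*$ with the classifiers kept and combine their outputs by weighted voting to obtain the final classification result.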

The Empirical Study
In order to analyze the CVS performance of the proposed ODCEM, we conducted experiments on the credit scoring dataset "German" from UCI [37] and the credit card customer churn prediction dataset "China churn" from a commercial bank in Sichuan province, China. We compared the CVS performance of ODCEM with that of four commonly used "two-step" models for MVs, which impute MVs by $k$-nearest neighbours imputation (KI), mean substitution (MS), EM imputation (EM), and regression imputation (RI), respectively, and then construct multiple classifier systems with the subagging method [21]. It is worth noting that listwise deletion (LD) was not included in our experiments because of its obvious disadvantages, and multiple imputation (MI) was not selected as a benchmark because of its high computational cost. Finally, we also compared the one-step ensemble selection strategy ODCEM with the ensemble based method LMF proposed by Mohammed et al. [34].

Description of the Datasets.
The first dataset used in the study is "German," a credit card customer credit scoring dataset from Germany [37]. There are 20 attributes and 1 class label in the dataset; 7 attributes are numeric and 13 are qualitative. The class label has two states, which divide the customers into two classes: good credit and bad credit. There are 1000 customer instances in the dataset: 700 customers with good credit and 300 customers with bad credit. The dataset contains no MVs.
The second dataset is about churn prediction of credit card customer from one commercial bank in Sichuan province, China ("China churn"). The data interval is 2010.5-2010.12. According to the basic principle of churn prediction attribute selection and considering the availability of data, we selected 25 prediction attributes (see Table 1). For the customer class label, we defined the churn customer as someone who canceled card from May 2010 to October 2010 or did not consume for 3 months. After simple data cleaning, we obtained 3255 customer instances from the database finally, in which there are 302 churn customers and 2953 nonchurn customers. The churn rate is 9.28% and it belongs to class imbalanced dataset. Meanwhile, it includes a lot of MVs, and the missing rate of each attribute is in Table 1.

Experimental Design.
As there is no MV in the "German" dataset, we generated MVs artificially and analyzed the CVS performance of the various models under different missing conditions. Depending on the reason why the MVs are produced, the missing mechanism can be classified into three types [19]: MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random). Suppose that a dataset contains two parts, $X_{\text{obs}}$ and $X_{\text{mis}}$, where $X_{\text{obs}}$ stands for all observed data and $X_{\text{mis}}$ stands for all MVs. If $P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R \mid X_{\text{obs}})$, where $R$ is the event of concern (a value being missing), the mechanism is called MAR; if $P(R \mid X_{\text{obs}}, X_{\text{mis}}) = P(R)$, it is called MCAR; if neither condition holds, it is called MNAR. In this study, we considered all three mechanisms in the "German" dataset and set the missing level $\alpha$ to 5%, 10%, 20%, 30%, and 40%, respectively. To facilitate this process, we adopted the method proposed by Fayyad and Irani [38] to prediscretize the continuous attributes, and then the three kinds of MVs were produced as follows.
(1) MCAR. For each instance, a random number $r \in (0, 1)$ was generated. If $r < \alpha$ (the missing level), a number of attributes obtained with ceil() (the round-up function in Matlab, applied to a multiple of 20, the number of attributes in the "German" dataset) were selected randomly, and the values of these attributes were set to missing. Since the MVs of each instance are random, this mechanism belongs to missing completely at random.
(2) MAR. For each instance, a random number $r \in (0, 1)$ was generated, and if $r < \alpha$ (the missing level), two attributes $A_i$ and $A_j$ ($i < j$) were selected randomly. If $A_i = 1$ (after prediscretization in the "German" dataset, the smallest value of each attribute is exactly 1, so this condition is applicable to any attribute), the value of $A_j$ was set to missing. The MVs of $A_j$ are related only to the value of attribute $A_i$ and have nothing to do with its own values. Therefore, this mechanism belongs to missing at random.
(3) MNAR. For each instance, a random number $r \in (0, 1)$ was generated, and if $r < \alpha$ (the missing level), an attribute $A_j$ was selected randomly. If $A_j = 1$, the value of $A_j$ was set to missing. The MVs of $A_j$ are related to its own values, so this mechanism belongs to missing not at random.
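The three generation procedures can be sketched in Python as follows. The data are assumed to be prediscretized integers whose smallest value is 1, and MVs are written as None. For MCAR, the number of attributes set to missing is assumed to be the number of attributes times the random number, rounded up, because the argument of ceil() is not recoverable from the text; for MAR, the lower-indexed attribute is assumed to be the conditioning one. The function name and the use of Python's random module instead of Matlab are our choices.

```python
import math
import random


def inject_missing(rows, level, mechanism, rng):
    """Return a copy of `rows` with artificial MVs (None) under one mechanism."""
    out = [list(r) for r in rows]
    m = len(out[0])
    for row in out:
        r = rng.random()
        if r >= level:
            continue                      # this instance is left untouched
        if mechanism == "MCAR":
            # ASSUMPTION: ceil(m * r) attributes are set to missing.
            for j in rng.sample(range(m), math.ceil(m * r)):
                row[j] = None
        elif mechanism == "MAR":
            i, j = sorted(rng.sample(range(m), 2))
            if row[i] == 1:               # missingness of attribute j depends on attribute i only
                row[j] = None
        elif mechanism == "MNAR":
            j = rng.randrange(m)
            if row[j] == 1:               # missingness of attribute j depends on its own value
                row[j] = None
        else:
            raise ValueError("unknown mechanism: " + mechanism)
    return out


rows = [[(i + j) % 3 + 1 for j in range(5)] for i in range(60)]
mnar = inject_missing(rows, level=0.4, mechanism="MNAR", rng=random.Random(7))
```

Under MAR and MNAR at most one value per selected instance becomes missing, and only when the conditioning value equals 1, so the realized share of instances with MVs is lower than the nominal missing level.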

Mathematical Problems in
In this study, we chose support vector machine (SVM) to generate the base classifiers because of its popularity and success in various CVS tasks [18]. Thus, the four "two-step" CVS models are abbreviated as KI-SVM, MS-SVM, EM-SVM, and RI-SVM. When training an SVM, the choice of kernel function is very important. Through experimental comparison, we found that the classifier based on the radial basis kernel (RBK) obtained the best performance, so we chose it as the kernel function of SVM. Meanwhile, there are two important parameters in the ODCEM model: the number $K$ of nearest neighbours in the local area and the number $N$ of attribute subsets selected in each subset $S_t$ ($t = 1, 2, 3$). We conducted a sensitivity analysis of the two parameters on the two datasets with seven values of $K$ (3, 5, 7, 9, 11, 13, and 15) and seven values of $N$ (10, 15, 20, 25, 30, 35, and 40). The performance of ODCEM fluctuated somewhat with $K$ and was best when $K = 5$. As for $N$, the performance of ODCEM first rose as $N$ increased, peaked when $N$ was about 30, and fluctuated slightly when $N > 30$. The detailed analysis is omitted for reasons of space, and we set $K = 5$ and $N = 30$. Further, as for the subagging ensemble strategy used in the four "two-step" models, Paleologo et al. [21] found that a number of base classifiers between 20 and 50 is reasonable, and thus we set it to the maximum, 50, in the four "two-step" models as well as in the LMF model.
Further, the class distribution of both datasets is imbalanced. If we train CVS model with such data directly, then it tends to classify all customers as the majority class. There are many techniques to deal with class imbalance data, including oversampling and downsampling. In this study, what we concern most is not the optimal match between the six models referred and the methods handling class imbalance data, but the CVS performance of six models in the condition of MVs. Therefore, without loss of generality, we adopted oversampling method to balance the class distribution of training set and then conducted the experiments by the six models. In addition, to compare the performance of ODCEM proposed in this study with the other five models, we adopted 10-fold cross-validation (CV10) which divides the entire dataset into 10 equal parts randomly and takes one part as test set and the other nine parts as training set every time; the rotation of 10 times is called CV10.
As for the four "two-step" models, we imputed the MVs with SPSS 17.0 for EM-SVM and RI-SVM first and then conducted model training and classification on the platform of Matlab 2008b, while the other two simpler "two-step" models KI-SVM and MS-SVM, as well as ODCEM and ensemble based LMF, were implemented on the platform of Matlab 2008b directly. Finally, all experiments were performed on a dual-processor 3.0 GHz Pentium 4 Windows computer with 2 GB RAM, and the final classification result was the average of 10 times CV10 in each case.

Evaluation Criteria.
To evaluate the performance of the models considered in this study, we introduced the confusion matrix in Table 2. On this basis, four commonly used evaluation criteria were adopted [39]: (1) total accuracy = ((TP + TN)/(TP + FN + FP + TN)) × 100%; (2) the area under the receiver operating characteristic curve (AUC). The receiver operating characteristic (ROC) curve is an important evaluation criterion for classification models on data with imbalanced class distribution [40]. For a two-class problem, the ROC graph plots the true positive rate against the false positive rate, where the $y$-axis is the true positive rate (TP/(TP + FN) × 100%) and the $x$-axis is the false positive rate (FP/(FP + TN) × 100%). However, it is sometimes difficult to compare the ROC curves of different models directly, so AUC is more convenient and popular; (3) type I accuracy = (TP/(TP + FN)) × 100%; (4) type II accuracy = (TN/(FP + TN)) × 100%.

Experimental Results Analysis in "German" Dataset.
Table 3 shows the credit scoring performance of the six models on the "German" dataset with MCAR type MVs. The results show that (1) for total accuracy, AUC, and type I accuracy, ODCEM outperforms the other five models at every missing level; (2) for type II accuracy, ODCEM outperforms the other five models when the missing level $\alpha$ is 10% or 30% and performs worse than KI-SVM when $\alpha$ is 5%, 20%, or 40%. In class imbalanced CVS problems, we are usually more concerned with the AUC and type I accuracy of a model than with its type II accuracy. Thus, we can conclude that the overall credit scoring performance of ODCEM is better than that of the other models considered in this study on the "German" dataset with MCAR type MVs. In addition, Table 3 also shows the performance rank of the six models on each evaluation criterion at each of the five missing levels, and its last row gives the average rank, which can be regarded as a criterion of the overall performance of the models.
Therefore, we can rank the six models by overall performance for MCAR type MVs from high to low as follows: ODCEM, LMF, MS-SVM, KI-SVM, RI-SVM, and EM-SVM. It is notable that LMF outperforms the four "two-step" models and that EM-SVM performs worst in this case. Figure 2 shows the trend of the credit scoring performance of the six models on the "German" dataset with MCAR type MVs. It can be seen from the figure that the performance of each model does not decline obviously as the missing level increases but fluctuates greatly, and the general trend is first increasing and then decreasing. This may be related to the way the MCAR type MVs were produced in this study or to the inherent characteristics of the "German" dataset.
Similarly, the performance rank of the six models on each evaluation criterion at the five missing levels is also shown in Table 4. Thus, we can rank the six models by overall performance for MAR type MVs from high to low as follows: ODCEM, EM-SVM, RI-SVM, MS-SVM, KI-SVM, and LMF. Notably, EM-SVM is second only to ODCEM, and LMF performs the worst. The trend of customer credit scoring performance of the six models in the "German" dataset with MAR type MVs is shown in Figure 3. The performance of the six models drops quickly as the missing level increases. In particular, when the missing level is low, such as 5%, the ensemble-based model LMF achieves good classification performance, second only to that of ODCEM and EM-SVM, whereas at higher missing levels, such as 20% and above, LMF performs worse than the other five models. The results also demonstrate that the ODCEM model proposed in this study can overcome the disadvantages of the LMF model to a large extent and can achieve better performance than LMF at high missing levels.
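The average-rank criterion used to order the models can be sketched as follows. This is a minimal illustration, not the authors' procedure: the model names "A", "B", and "C" and all scores are hypothetical, and the handling of ties (tied models share the mean of the ranks they occupy) is an assumption, since the text does not state how ties are resolved.

```python
def rank_row(scores):
    """Rank the models in one table row (1 = best, higher score is better).

    Tied models share the mean of the ranks they occupy; this tie rule is an assumption.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the block of models tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k]] = mean_rank
        i = j + 1
    return ranks


def average_ranks(rows):
    """Average each model's rank over all rows (criterion and missing-level pairs)."""
    totals = {}
    for row in rows:
        for model, rank in rank_row(row).items():
            totals[model] = totals.get(model, 0.0) + rank
    return {model: total / len(rows) for model, total in totals.items()}


# Hypothetical scores for three placeholder models; not results from the paper.
rows = [
    {"A": 0.80, "B": 0.75, "C": 0.70},
    {"A": 0.78, "B": 0.78, "C": 0.60},
    {"A": 0.70, "B": 0.72, "C": 0.71},
]
avg = average_ranks(rows)
ordering = sorted(avg, key=avg.get)  # best (lowest average rank) first
print(avg, ordering)
```

A lower average rank indicates better overall performance, which is how the orderings reported in the text would be read off the last row of each table.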
The performance of the six models in the "German" dataset with MNAR type MVs is shown in Table 5. Although ODCEM only achieves performance comparable with that of MS-SVM and EM-SVM when the missing level is 5%, it performs better than the other models when the missing level is 10%, 20%, 30%, and 40%. Thus, we can conclude that the overall credit scoring performance of ODCEM is still better than that of the other five models with MNAR type MVs in the "German" dataset. Further, according to the average performance rank in the last row of Table 5, the six models can be ranked as follows: ODCEM, KI-SVM, MS-SVM, EM-SVM, LMF, and RI-SVM. The trend of credit scoring performance of the six models in the "German" dataset with MNAR type MVs is shown in Figure 4. The performance of the six models still does not decline obviously as the missing level increases, but different degrees of fluctuation appear, which is similar to that in MAR mechanism. At the same time, Figure 4 also shows that the performance fluctuation of ODCEM is the smallest, which demonstrates that the ODCEM model proposed in this study is the most robust to MNAR type MVs in the "German" dataset.
Note: the bold-face in Table 3 shows the maximum of each row. The numbers in parentheses are the ranks of the six models with the corresponding evaluation criterion in each row.
Note: the bold-face in Table 4 shows the maximum of each row. The numbers in parentheses are the ranks of the six models with the corresponding evaluation criterion in each row.
A comprehensive analysis of the credit scoring performance of the six models in the "German" dataset under the three missing mechanisms leads to the following conclusions: (1) the overall performance of the ODCEM model is always the best under all three missing mechanisms; (2) MVs of different missing mechanisms affect the performance of the six models to different degrees, and the effect is greatest under the MAR missing mechanism. This may be related to the ways in which MVs were generated under the three missing mechanisms.

Experimental Results Analysis in "China Churn" Dataset.
Table 6 shows the customer churn prediction performance of the six models in the "China churn" dataset. The total accuracy, AUC, type I accuracy, and type II accuracy of ODCEM are all the best, which shows that the ODCEM model can also achieve satisfactory performance on a real CVS dataset. Notably, the performance of LMF is only comparable with that of RI-SVM, and these two models perform the worst of the six. The "China churn" dataset contains a large number of MVs, and its average missing rate per instance is 8.06%. Combined with the analysis in Section 4.4, we can roughly conclude that the ODCEM model proposed in this study performs better than the LMF model proposed by Mohammed et al. [34] when there are many MVs in the dataset. Further, the performance of the six models in the "China churn" dataset is shown in Figure 5. The performance of the six models rises gradually as the ratio of instances used for training to the entire training set (abbreviated as ratio of instance) increases. In particular, the curve of ODCEM is always at the top, even when the ratio is very small, such as 10%. This demonstrates that, compared with the other five models, the churn prediction performance of ODCEM is the best when the models are trained on 10 training subsets of different sample sizes.
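The training-ratio experiment described above can be sketched as follows. The text does not specify how the 10 training subsets were drawn, so this sketch assumes ratios of 10%, 20%, ..., 100% and nested random subsets taken from a single shuffle; the function name `training_subsets` and the training-set size of 1000 are hypothetical.

```python
import random


def training_subsets(n_train, percents, seed=0):
    """Return {percent: sorted instance indices} for each training ratio.

    Assumption: subsets are nested, taken as prefixes of one random shuffle,
    so each larger subset contains every smaller one.
    """
    rng = random.Random(seed)
    order = list(range(n_train))
    rng.shuffle(order)
    subsets = {}
    for pct in percents:
        size = max(1, (pct * n_train) // 100)  # integer arithmetic avoids float rounding
        subsets[pct] = sorted(order[:size])
    return subsets


# Assumed ratios: 10%, 20%, ..., 100% of a hypothetical 1000-instance training set.
percents = list(range(10, 101, 10))
subsets = training_subsets(1000, percents)
for pct in percents:
    # Each model would be trained on subsets[pct] and evaluated on a fixed test set.
    print(pct, len(subsets[pct]))
```

Using nested subsets keeps the comparison across ratios from being confounded by entirely different samples, although independent draws at each ratio would be an equally plausible reading of the text.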

Further Discussions.
In Sections 4.4 and 4.5, the experiments in the "German" dataset and the "China churn" dataset verify the effectiveness of the proposed method. Accuracy and AUC show which model is the best and which is the worst. However, they do not show whether the differences between the better and the worse models are statistically significant [41]. Therefore, we conducted McNemar's test [42] to examine whether the proposed ODCEM model significantly outperforms the other five models considered in this study. Taking the real customer churn prediction dataset "China churn" as an example, the results of McNemar's test are shown in Table 7. For reasons of space, the results of McNemar's test for the "German" dataset with three missing mechanisms and five missing levels are omitted here; similar conclusions can be obtained from the "German" dataset.
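For readers unfamiliar with McNemar's test, the following minimal Python sketch shows how a chi-squared value and P value of the kind reported in Table 7 can be computed for a pair of classifiers evaluated on the same test instances. The function name and the toy predictions are hypothetical, and the use of the continuity correction is an assumption, since the text does not state which variant of the statistic was used.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b, continuity=True):
    """McNemar's test on paired predictions; returns (chi-squared, P value).

    b = instances model A classifies correctly and model B misclassifies,
    c = instances model B classifies correctly and model A misclassifies.
    The continuity correction is optional (its use here is an assumption).
    """
    b = c = 0
    for t, a, bb in zip(y_true, pred_a, pred_b):
        a_ok = a == t
        b_ok = bb == t
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree in correctness
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)
    chi2 = diff ** 2 / (b + c)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy paired predictions, purely illustrative (not data from the paper).
y_true = [1] * 30
pred_a = [1] * 25 + [0] * 5           # model A: 25 correct, 5 wrong
pred_b = [1] * 10 + [0] * 15 + [1] * 5  # model B: 15 correct, 15 wrong

chi2, p_value = mcnemar_test(y_true, pred_a, pred_b)
chi2_raw, p_raw = mcnemar_test(y_true, pred_a, pred_b, continuity=False)
print(chi2, p_value, chi2_raw, p_raw)
```

Only the instances on which the two models disagree enter the statistic, which is why the test is suited to comparing classifiers evaluated on the same test set.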
From Table 7, the following conclusions can be drawn.
(1) The proposed one-step ensemble selection model ODCEM significantly outperforms the four "two-step" models, as well as the ensemble-based model LMF, at the 1% statistical significance level.
(2) The performance of the LMF model is significantly poorer than that of EM-SVM at the 1% statistical significance level and significantly poorer than that of MS-SVM and KI-SVM at the 5% statistical significance level, while McNemar's test does not indicate that it performs worse than the RI-SVM model.
Further, we also compare the computational complexity of the six models and find that the complexity of MS-SVM is the lowest, followed by RI-SVM, KI-SVM, EM-SVM, LMF, and ODCEM. The time complexity of ODCEM and LMF is much the same, and it is slightly higher than that of EM-SVM.
Finally, based on the analysis in Sections 4.4 and 4.5, we can draw the following conclusions.
(1) The CVS performance of ODCEM is the best in both the UCI customer credit scoring dataset and the real customer churn prediction dataset, which shows that ODCEM has good adaptability and can be used for a variety of CVS tasks.
(2) The performance of the four "two-step" models varies considerably, both within the same dataset under the three missing mechanisms and across datasets ("German" and "China churn"), and the results are unstable. This demonstrates that customer value classification modeling is sensitive to the preprocessing method used for MVs in "two-step" CVS strategies and that the CVS performance depends on the missing mechanism, which is similar to the conclusion of Crone et al. [24].
(3) For LMF model, when there are only a few MVs, such as in "German" dataset with the missing mechanism of MAR and MNAR, it achieves comparable performance with KI-SVM, MS-SVM, and EM-SVM, especially under the mechanism of MAR; its performance is only poorer than that of ODCEM. However, when there are many MVs, for example, in "German" dataset with the missing mechanism MCAR and "China churn," its performance is poor. It shows that LMF model is not suitable for the CVS issues with lots of MVs, which is basically consistent with the experimental results of Mohammed et al. [34].
Note: the bold-face in Table 5 shows the maximum of each row. The numbers in parentheses are the ranks of the six models with the corresponding evaluation criterion in each row.
Note: the bold-face in Table 6 shows the maximum of each column.

Conclusions and Future Work
This study focuses on CVS problems with MVs and proposes a one-step dynamic classifier ensemble model (ODCEM) for MVs to overcome the disadvantages of the existing "two-step" models. On the one hand, the ODCEM model integrates the preprocessing of MVs and the classification modeling into one step; on the other hand, it uses multiple classifier ensemble technology in constructing the classification models. It can fully utilize the information in the nonmissing values of a dataset without imputation, thus reducing the dependence on assumptions about the data missing mechanism and the data distribution. The empirical results on the "German" dataset from UCI and the real customer churn prediction dataset "China churn" show that the CVS performance of ODCEM is better than that of four commonly used "two-step" models and the existing ensemble-based model LMF.
Note: the values in Table 7 are the Chi-squared values, and P values are in brackets.