A Robust Probability Classifier Based on the Modified χ²-Distance

We propose a robust probability classifier model to address classification problems with data uncertainty. A class-conditional probability distributional set is constructed based on the modified χ²-distance. Based on a “linear combination assumption” for the posterior class-conditional probabilities, we consider a classification criterion using the weighted sum of the posterior probabilities. An optimal robust minimax classifier is defined as the one with the minimal worst-case absolute error loss function value over all possible distributions belonging to the constructed distributional set. Based on the conic duality theorem, we show that the resulting optimization problem can be reformulated as a second order cone programming problem, which can be efficiently solved by interior-point algorithms. The robustness of the proposed model helps it avoid the “overlearning” phenomenon on training sets and thus keep a comparable accuracy on test sets. Numerical experiments validate the effectiveness of the proposed model and further show that it also provides promising results on multiple classification problems.


Introduction
Statistical classification has been extensively studied in the fields of machine learning and statistics. A typical classification problem is to design a linear or nonlinear classifier based on a known training set such that a new observation can be assigned to one of the known classes. Many classification models have been proposed, such as naive Bayes classifiers (NBC) [1, 2], artificial neural networks [3], and support vector machines (SVM) [4].
In real-world classification problems, the training data are often imprecise due to unavoidable observational noise in the process of data collection or to data approximation from incomplete samples. One way to handle the data uncertainty is to design a robust classifier in the sense that it has the minimal worst-case misclassification probability for the training sets. The idea of robustness has been widely applied in many traditional machine learning and statistics techniques, such as robust Bayes classifiers [5], robust support vector machines [6], and robust quadratic regressions [7]. Robust classifiers are closely related to the recently flourishing research on robust optimization. For more recent developments on robust optimization, we refer readers to the excellent book [8] and the reviews [9, 10].
Recently, [11, 12] proposed a robust minimax approach called the minimax probability machine to design a binary classifier. Unlike traditional methods, they make no assumption on the class-conditional distributions; only the mean and covariance matrix of each class are assumed to be known. Under this assumption, the designed classifier is determined by minimizing the worst-case probability of misclassification under all possible choices of class-conditional distributions with the given mean and covariance matrix. By reformulating the classifier design problem as a second order cone program, they show that the computational complexity of the proposed approach is similar to that of SVM. Because of its computational advantage and competitive performance with other current methods, this [...] as the soft-margin support vector machine, which uses the Hinge loss function [21, 22], and the logistic regression, which uses the negative log likelihood function [23]. Note that the absolute error function is essential in our model to obtain a tractable optimization problem. Numerical experiments on a real-world application validate the effectiveness of the proposed classifier and further show that it also performs well for multiple classification problems.
The paper proceeds as follows. Section 2 introduces the proposed robust minimax probability classifier based on the modified χ²-distance and discusses how to construct the desired distributional set $\mathcal{P}_\epsilon$. Section 3 provides an equivalent reformulation by handling the robust constraints and the robust objective separately. Numerical experiments on real-world data sets are carried out in Section 4 to validate the effectiveness of the proposed classifier. Section 5 concludes the paper and gives future research directions.

Classifier Models
In this section, a simple probability classifier is first presented and then extended to handle data uncertainty by introducing a distributional set $\mathcal{P}_\epsilon$. We also discuss how to construct this distributional set from the training data set.

Probability Classifier.
Bayes classifiers assign an observation $x$ to the $k^*(x)$th class, which has the maximal posterior probability; that is,
$$k^*(x) = \arg\max_{k \in K} P(k \mid x),$$
where $K$ denotes the set of classes and $P(k \mid x)$ is the posterior probability function, that is, the conditional probability that the sample belongs to the $k$th class, given that we know it has feature vector $x$.
Using Bayes' theorem, we have
$$P(k \mid x) = \frac{P(k)\,P(x \mid k)}{P(x)},$$
where $P(k)$ is the prior probability of the $k$th class, $P(x \mid k)$ is the conditional probability for the $k$th class, and $P(x)$ is the probability that a sample has feature vector $x$. Note that $P(x)$ is a constant if the values of the feature variables are known and thus can be omitted. To design an effective Bayes classifier, the key issue is estimating the class-conditional probability $P(x \mid k)$ or the joint probability $P(x, k)$. Theoretically, using the chain rule, we have
$$P(x, k) = P(k)\,P(x_1 \mid k)\,P(x_2 \mid k, x_1) \cdots P(x_{|J|} \mid k, x_1, \ldots, x_{|J|-1}),$$
where $J$ denotes the set of features. However, such an estimation method leads to the problem of "dimension disaster" (the curse of dimensionality). To address this issue, the naive Bayes classifier makes the following "conditional independence assumption":
$$P(x \mid k) = \prod_{j \in J} p^k_j(x),$$
where $p^k_j(x) := P(x_j \mid k)$ is the class-conditional probability that the observation $x$ belongs to the $k$th class based on the $j$th feature. Here we introduce another "linear combination assumption" for the class-conditional probability:
$$P(x \mid k) = \sum_{j \in J} \alpha^k_j\, p^k_j(x),$$
where $\alpha^k_j$ is a coefficient. Compared with the "conditional independence assumption," which uses the probabilistic information in terms of multiplication, the proposed "linear combination assumption" uses the probabilistic information in terms of a weighted sum. We will further discuss the rationality of this assumption at the end of this subsection.
Under this assumption, we have
$$P(k \mid x) \propto P(k)\,P(x \mid k) = \sum_{j \in J} w^k_j\, p^k_j(x),$$
where $w^k_j := P(k)\,\alpha^k_j$ denotes the probability weight of the $j$th feature for the $k$th class.
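To make the contrast between the "conditional independence assumption" and the "linear combination assumption" concrete, the following minimal Python sketch scores a single observation both ways. The priors, per-feature probabilities, and weights are made-up illustrative numbers, and the function names are hypothetical; none of them come from the experiments in this paper.

```python
from math import prod

def naive_bayes_scores(priors, feat_probs):
    """Score each class by its prior times the product of per-feature probabilities."""
    return [priors[k] * prod(feat_probs[k]) for k in range(len(priors))]

def linear_combination_scores(weights, feat_probs):
    """Score each class by a weighted sum of per-feature probabilities."""
    return [sum(w * p for w, p in zip(weights[k], feat_probs[k]))
            for k in range(len(weights))]

def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative numbers (assumed): 2 classes, 3 features.
priors = [0.5, 0.5]
feat_probs = [[0.9, 0.1, 0.5],   # class 0: per-feature probabilities
              [0.4, 0.6, 0.5]]   # class 1
weights = [[0.5, 0.0, 0.0],      # class 0 trusts feature 1 only
           [0.0, 0.5, 0.0]]      # class 1 trusts feature 2 only

nb = naive_bayes_scores(priors, feat_probs)
lc = linear_combination_scores(weights, feat_probs)
print(argmax(nb), argmax(lc))
```

With these numbers the product rule is dominated by the one small factor for class 0, while the weighted sum can ignore features it does not trust, so the two criteria pick different classes. This only illustrates the mechanics; it says nothing about which criterion is more accurate in practice.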
To obtain the optimal probability classifier based on the "linear combination assumption," it is natural to consider the following optimization problem: where $L(\cdot, \cdot) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is a prespecified loss function. In the following, we take the absolute error function as our loss function; that is, $L(a, b) = |a - b|$. In view of its probability property, it is straightforward to impose the following constraints on the posterior probability: Under such constraints, we have that where $|S| = \sum_{s \in S} \sum_{k \in K} y_{s,k}$. Thus the optimal probability classifier (PC) problem can be formulated as follows:
Admittedly, the "linear combination assumption" may not always hold. However, we justify the proposed classifier by the following facts.
(1) As an intuitive interpretation, note that $p^k_j(x)$ estimates the probability of the observation $x$ belonging to the $k$th class based only on the $j$th feature; thus it provides partial probabilistic information about the sample. Hence we can interpret the weight $w^k_j$ as a degree of trust in this information, and in this sense, the "linear combination assumption" is a way of combining evidence from different sources. Similar ideas can also be found in the theory of evidence; see the Dempster-Shafer theory [24, 25].
(2) In terms of classification performance, in the worst case, the proposed classifier may put all weight on one feature; in such a case, it is equivalent to a Bayes classifier based on a well-selected feature. If each class has its "typical" feature which can distinguish it from other classes, the proposed classifier has the ability to learn this property by putting different weights on different features for different classes and thus provides better classification performance. A real-life application on lithology classification problems also validates its classification performance by comparison with support vector machines and the naive Bayes classifier.
(3) Another advantage of the proposed classifier is its high computability. As we show in Section 3, the proposed classifier and its robust counterpart can be reformulated as second order cone programming problems and thus can be solved by interior-point algorithms in polynomial time.
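The way the absolute error loss simplifies in the PC problem can be sketched as follows. The notation and the assumptions are ours, not recovered from the text: $y_{s,k} \in \{0,1\}$ indicates whether sample $s$ belongs to class $k$ (exactly one class per sample), $f_{s,k} \in [0,1]$ is the modeled posterior (the weighted sum of per-feature probabilities), and the modeled posteriors of each sample sum to one over the classes.

```latex
% For y in {0,1} and f in [0,1] the absolute value can be removed:
|y_{s,k} - f_{s,k}| = y_{s,k}(1 - f_{s,k}) + (1 - y_{s,k}) f_{s,k}
                    = y_{s,k} + f_{s,k} - 2\, y_{s,k} f_{s,k}.
% Summing over samples and classes, with \sum_k y_{s,k} = 1 and \sum_k f_{s,k} = 1:
\sum_{s \in S} \sum_{k \in K} |y_{s,k} - f_{s,k}|
    = 2|S| - 2 \sum_{s \in S} \sum_{k \in K} y_{s,k} f_{s,k}.
```

Under these assumptions, minimizing the total absolute error is the same as maximizing $\sum_{s}\sum_{k} y_{s,k} f_{s,k}$, which is linear in the weights. This is one plausible reading of why the absolute error function keeps the model tractable.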

Robust Probability Classifier.
Due to observational noise, the true class-conditional probability distribution is often difficult to obtain. Instead we can construct a confidence distributional set which contains the true distribution. Unlike the traditional distributional sets in minimax probability machines, which only utilize the mean and covariance matrix, we construct our class-conditional probability distributional set based on the modified χ²-distance, which uses more information from the samples.
The modified χ²-distance $d(\cdot, \cdot) : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ is used in statistics to measure the distance between two discrete probability distribution vectors. For given $p = (p_1, \ldots, p_m)^T$ and $q = (q_1, \ldots, q_m)^T$, it is defined as
$$d(p, q) = \sum_{i=1}^{m} \frac{(p_i - q_i)^2}{q_i}.$$
Based on the modified χ²-distance, we present the following class-conditional probability distributional set: where $\hat{p}^k_{s,j}$ is the nominal class-conditional probability for the $s$th sample belonging to the $k$th class based on the $j$th feature and the prespecified parameter $\epsilon$ is used to control the size of the set.
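As a small illustration of the distance used to build the distributional set, the sketch below computes the modified χ²-distance between two discrete distributions in plain Python. It assumes the common convention in which the nominal distribution appears in the denominator, and the membership check is a simplified stand-in for the distributional set, not its exact definition. Both function names are hypothetical.

```python
def modified_chi2_distance(p, q):
    """Modified chi-square distance: sum_i (p_i - q_i)^2 / q_i.

    q plays the role of the nominal distribution and must be strictly positive.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    if any(qi <= 0 for qi in q):
        raise ValueError("nominal probabilities must be positive")
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def within_budget(p, q, eps):
    """Simplified membership check: is p within distance eps of the nominal q?"""
    return modified_chi2_distance(p, q) <= eps

# Example: a perturbed distribution against a uniform nominal one.
d = modified_chi2_distance([0.6, 0.4], [0.5, 0.5])
print(d)
```

A larger budget admits distributions further from the nominal one, which is the sense in which the size parameter controls how conservative the robust classifier is.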
To design a robust classifier, we need to consider the effect of data uncertainty on the objective function and the constraints. The robust objective is to minimize the worst-case loss function value over all possible distributions in the distributional set $\mathcal{P}_\epsilon$; the robust constraints ensure that all the original constraints are also satisfied for any distribution in $\mathcal{P}_\epsilon$. Thus the robust probability classifier problem is of the following form: Note that the above optimization problem has an infinite number of robust constraints and that its objective function itself contains an embedded subproblem. We will show how to solve such a minimax optimization problem in Section 3.

Construct the Distributional Set.
To get the distributional set $\mathcal{P}_\epsilon$, we need to define the parameter $\epsilon$ and the nominal probability $\hat{p}^k_{s,j}$. The selection of the parameter $\epsilon$ is application based, and we will discuss this issue in the numerical experiment section; next we provide a procedure to calculate $\hat{p}^k_{s,j}$.
For the $j$th feature, the following procedure takes an integer $N_j$, indicating the number of data intervals, as input and outputs the estimated probability $\hat{p}^k_{s,j}$ of the $s$th sample belonging to the $k$th class.
(1) Sort the samples in increasing order and divide them into $N_j$ intervals such that each interval has at least $\lfloor |S|/N_j \rfloor$ samples. Denote the $i$th interval by $\Delta_{j,i}$.
(2) Calculate the total number of samples in the $k$th class, $n_k$, the total number of samples in the $i$th interval, $n_{j,i}$, and the total number of samples belonging to the $k$th class in the $i$th interval, $n_{j,i,k}$.
(3) For the $s$th sample, if it falls into the $i$th interval, the class-conditional probability $\hat{p}^k_{s,j}$ is calculated by
Note that from the definition of $\mathcal{P}_\epsilon$, we can easily compute the upper bound $\overline{p}^k_{s,j}$ and the lower bound $\underline{p}^k_{s,j}$ for the true class-conditional probability $p^k_{s,j}$ as follows:
The above problems can be efficiently solved by a second order cone solver such as SeDuMi [26] or SDPT3 [27].
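The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the formula in step (3) is missing from the text, so the sketch assumes the estimate is the share of class-$k$ samples within the interval that the sample falls into. The function name, the tie handling, and the toy data are hypothetical.

```python
from collections import Counter

def nominal_probabilities(values, labels, num_intervals):
    """Equal-frequency binning of one feature, then per-interval class shares.

    Returns, for each sample, a dict {class: estimated probability}.
    Assumption: the estimate is (class-k count in interval) / (interval count).
    """
    n = len(values)
    if not 1 <= num_intervals <= n:
        raise ValueError("num_intervals must be between 1 and the sample count")
    # Step (1): sort samples by feature value and cut into intervals,
    # each holding at least floor(n / num_intervals) samples.
    order = sorted(range(n), key=lambda s: values[s])
    base, extra = divmod(n, num_intervals)
    bin_of = [0] * n
    start = 0
    for i in range(num_intervals):
        size = base + (1 if i < extra else 0)
        for s in order[start:start + size]:
            bin_of[s] = i
        start += size
    # Step (2): counts per interval and per (interval, class).
    n_i = Counter(bin_of)
    n_ik = Counter(zip(bin_of, labels))
    classes = sorted(set(labels))
    # Step (3): each sample inherits the class shares of its interval.
    return [{k: n_ik[(bin_of[s], k)] / n_i[bin_of[s]] for k in classes}
            for s in range(n)]

# Toy data (assumed): one feature, two classes, two intervals.
values = [0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1, 1.2]
labels = ["a", "a", "a", "b", "b", "b", "b", "a"]
probs = nominal_probabilities(values, labels, num_intervals=2)
print(probs[0], probs[7])
```

Samples with identical feature values may be split across neighbouring intervals by this simple cut; a production version would need a rule for ties.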

Solution Methods for RPC
In this section, we first reduce the infinite number of robust constraints to a finite set of linear constraints and then transform the inner robust objective function into a minimization problem by the conic duality theorem. Finally, we obtain an equivalent computable second order cone program for the RPC problem. The following analysis is based on the strong duality result in [8].
Consider a conic program of the following form:
$$\text{(CP)} \quad \min_{x} \left\{ c^T x : A_i x - b_i \in C_i,\ i = 1, \ldots, m \right\}$$
and its dual problem
$$\text{(DP)} \quad \max_{y_1, \ldots, y_m} \left\{ \sum_{i=1}^{m} b_i^T y_i : \sum_{i=1}^{m} A_i^T y_i = c,\ y_i \in C_i^*,\ i = 1, \ldots, m \right\},$$
where $C_i$ is a cone in $\mathbb{R}^{n_i}$ and $C_i^*$ is its dual cone defined by
$$C_i^* = \left\{ y \in \mathbb{R}^{n_i} : y^T z \ge 0,\ \forall z \in C_i \right\}.$$
A conic program is called strictly feasible if it admits a feasible solution $x$ such that $A_i x - b_i \in \operatorname{int} C_i$, $\forall i = 1, \ldots, m$, where $\operatorname{int} C_i$ denotes the interior of $C_i$.
Lemma 1 (see [8]). If one of the problems (CP) and (DP) is strictly feasible and bounded, then the other problem is solvable, and (CP) = (DP) in the sense that both have the same optimal objective function value.

Robust Constraints.
The following lemma provides an equivalent characterization for the infinite number of robust constraints in terms of a finite set of linear constraints which can be solved efficiently.
Lemma 2. For given $s$ and $k$, the robust constraint is equivalent to the following constraints:
Proof. First note that the distributional set can be represented as the Cartesian product of a series of projected subsets, where the projected subset on index $j$ is defined by [...] where the last equivalence comes from the strong duality between these two linear programs.
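The idea behind replacing infinitely many robust constraints with finitely many linear ones can be illustrated with a small sketch. It assumes, as suggested by the upper and lower bounds mentioned in the construction of the distributional set, that each uncertain probability varies independently within an interval. Under that assumption the worst case of a weighted sum is attained at an interval endpoint, so only the endpoints need to be checked. The function names and numbers are hypothetical, and this is not the exact statement of Lemma 2.

```python
def worst_case_range(weights, lower, upper):
    """Smallest and largest value of sum_j w_j * p_j over the box lower <= p <= upper."""
    lo_val = sum(min(w * lo, w * up) for w, lo, up in zip(weights, lower, upper))
    hi_val = sum(max(w * lo, w * up) for w, lo, up in zip(weights, lower, upper))
    return lo_val, hi_val

def robust_constraint_holds(weights, lower, upper):
    """Check that 0 <= sum_j w_j * p_j <= 1 for every p in the box."""
    lo_val, hi_val = worst_case_range(weights, lower, upper)
    return lo_val >= 0.0 and hi_val <= 1.0

# Illustrative intervals (assumed) for two features.
lower = [0.2, 0.1]
upper = [0.6, 0.5]
ok = robust_constraint_holds([0.5, 0.5], lower, upper)
too_large = robust_constraint_holds([1.5, 0.5], lower, upper)
print(ok, too_large)
```

Because the minimum and maximum per coordinate depend only on the sign of the weight, the same check can be written as linear constraints on the positive and negative parts of the weights, which is what makes the robust constraints tractable in an optimization model.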

Robust Objective Function.
In the RPC problem, the robust objective function is defined by an inner maximization problem. The following proposition shows that it can be transformed into a minimization problem over second order cones. To prove the following result, we utilize the concept of the conjugate function $d^*$ of the modified χ²-distance: where the function $[\cdot]_+$ is defined as $[x]_+ = \max\{x, 0\}$. For more details about conjugate functions, see [28].
where a second order cone is defined as
Proof. For given feasible $w$ satisfying the robust constraints, it is straightforward to show that the inner maximization problem is equivalent to the following minimization problem (MP): The above constraint can be further reduced to the following constraint: Next we show that the constraint about the conjugate function can be represented by second order cone constraints:
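For readers who want to see where the $[\cdot]_+$ term and the second order cone come from, the following derivation is a sketch under assumptions of ours: the modified χ²-distance is generated by the function $\phi(t) = (t-1)^2$ for $t \ge 0$, and the conjugate is $\phi^*(s) = \sup_{t \ge 0}\{st - \phi(t)\}$. The auxiliary variable $z$ and the bound $\tau$ are hypothetical names, and the representation shown is a standard one that may differ from the one used in the proof.

```latex
% Conjugate of phi(t) = (t-1)^2 on t >= 0.
\phi^*(s) = \sup_{t \ge 0} \left\{ s t - (t-1)^2 \right\}
% The unconstrained maximizer is t = 1 + s/2; it is feasible iff s >= -2.
          = \begin{cases} s + \dfrac{s^2}{4}, & s \ge -2, \\[4pt] -1, & s < -2, \end{cases}
          = \left[ 1 + \frac{s}{2} \right]_+^2 - 1.

% Second order cone representation of  phi^*(s) <= tau :
% introduce z with  z >= 1 + s/2,  z >= 0,  z^2 <= tau + 1,  and note that
z^2 \le \tau + 1
  \iff 4 z^2 + \tau^2 \le (\tau + 2)^2
  \iff \left\| \begin{pmatrix} 2z \\ \tau \end{pmatrix} \right\|_2 \le \tau + 2 .
```

The last line is a constraint of the form "Euclidean norm of an affine expression is at most an affine expression", which is exactly what a second order cone solver accepts.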

Numerical Experiments on Real-World Applications
In this section, numerical experiments on real-world applications are carried out to verify the effectiveness of the proposed robust probability classifier model. Specifically, we consider lithology classification data sets from our practical application. We compare our model with the regularized SVM (RSVM) and the naive Bayes classifier (NBC) on both binary and multiple classification problems.
All the numerical experiments are implemented in Matlab 7.7.0 and run on an Intel(R) Core(TM) i5-4570 CPU. The SDPT3 solver [27] is called to solve the second order cone programs in our proposed method and in the regularized SVM.

Data Sets.
Lithology classification is one of the basic tasks of geological investigation. To discriminate the lithology of the underground strata, various electromagnetic techniques are applied to the same strata to obtain different features, such as Gamma coefficients, acoustic wave, striation, density, and fusibility.
Here numerical experiments are carried out on a series of data sets from the boreholes T1, Y4, Y5, and Y6. All boreholes are located in the Tarim Basin, China. In total, there are 12 data sets used for binary classification problems and 8 data sets used for multiple classification problems. Each data set is randomly partitioned, based on a prespecified training rate tr ∈ [0, 1], into two subsets, a training set and a test set, such that the training set accounts for a fraction tr of the total number of samples.
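The random partition by training rate can be sketched as below. This is a minimal illustration, not the Matlab code used in the experiments: the function name, the rounding rule for the training-set size, and the seed are assumptions.

```python
import random

def split_by_training_rate(samples, tr, seed=None):
    """Randomly partition samples so that about a fraction tr goes to the training set."""
    if not 0.0 <= tr <= 1.0:
        raise ValueError("tr must lie in [0, 1]")
    rng = random.Random(seed)      # local generator, so the global state is untouched
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(tr * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Example: 100 sample indices with a training rate of 70%.
train, test = split_by_training_rate(range(100), tr=0.7, seed=0)
print(len(train), len(test))
```

Repeating the split with different seeds gives the kind of randomly generated instances over which averaged accuracies can be reported.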

Experiment Design.
The parameters in our models are chosen based on the size of the data set. The parameter $\epsilon$ depends on the number of classes and is defined as $\epsilon = \delta^2/|K|$, where $\delta \in (0, 1)$. The choice of $\epsilon$ can be explained as follows: if there are $|K|$ classes and the training data are uniformly distributed, then, for each probability $\hat{p}^k_{s,j} = 1/|K|$, its maximal variation range is between $\hat{p}^k_{s,j}(1 - \delta)$ and $\hat{p}^k_{s,j}(1 + \delta)$. The number of data intervals $N_j$ is defined as $N_j = |S|/(|K| \times \tau)$ such that, if the training data are uniformly distributed, each data interval contains $\tau$ samples of each class. In the following, we set $\delta = 0.2$ and $\tau = 8$.
We compare the performance of the proposed RPC model with the following regularized support vector machine model [6] (taking the $k$th class as an example): where $\hat{y}_{s,k} = 2y_{s,k} - 1$ and $\lambda_k \ge 0$ is a regularization parameter. As pointed out in [8], $\lambda_k \ge 0$ represents a trade-off between the number of training set errors and the amount of robustness with respect to spherical perturbations of the data points.
To make a fair comparison, in the following experiments we test a series of $\lambda_k$ values and choose the one with the best performance. Note that if $\lambda_k = 0$, we refer to this model as the classic support vector machine (SVM). See also [6] for more details on RSVM and its applications to multiple classification problems.
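The parameter choices described above can be checked numerically. The sketch below assumes that the set-size parameter equals the squared relative tolerance divided by the number of classes, and that the distance has the nominal probability in the denominator. The function names, the rounding of the interval count, and the example sizes are hypothetical.

```python
def set_size_parameter(delta, num_classes):
    """Set-size parameter: delta^2 divided by the number of classes."""
    return delta ** 2 / num_classes

def number_of_intervals(num_samples, num_classes, per_class):
    """Interval count |S| / (|K| * per_class), rounded down and at least 1 (rounding is an assumption)."""
    return max(1, num_samples // (num_classes * per_class))

delta, num_classes = 0.2, 4
eps = set_size_parameter(delta, num_classes)
p_nominal = 1.0 / num_classes

# If a single probability uses the whole budget, (p - p_nominal)^2 / p_nominal = eps,
# so its largest deviation from the nominal value is sqrt(eps * p_nominal).
max_dev = (eps * p_nominal) ** 0.5
print(eps, max_dev, delta * p_nominal)
print(number_of_intervals(640, num_classes, per_class=8))
```

The largest deviation equals the nominal probability times the tolerance, which matches the stated variation range between the nominal value scaled by one minus the tolerance and by one plus the tolerance.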

Test on Binary Classification.
In this subsection, RSVM, NBC, and RPC are implemented on 12 data sets for the binary classification problems using the cross-validation methods.
To improve the performance of RSVM, we transform the original data by the commonly used polynomial kernels [6]. Tables 1 and 2 show the averaged classification performances of RSVM, NBC, and the proposed RPC (over 10 randomly generated instances) for binary classification problems on the Y5 and T1 data sets, respectively. For each data set, we randomly partition it into a training set and a test set based on the parameter tr, which varies from 0.5 to 0.9. The highest classification accuracy on a training set among these three methods is highlighted in bold, while the best classification accuracy on a test set is marked with an asterisk.
Tables 1 and 2 validate the effectiveness of the proposed RPC for binary classification problems compared with NBC and RSVM. Specifically, in most cases, RSVM has the highest classification accuracy on training sets, but its performance on test sets is unsatisfactory. In most cases, the proposed RPC provides the highest classification accuracy on test sets. NBC provides better performance on test sets as the training rate increases. The experimental results also show that, for a given training rate, RPC can perform better on test sets than on training sets; thus it can avoid the "overlearning" phenomenon.
To further validate the effectiveness of the proposed RPC, we test it on 10 additional data sets, that is, T41-T45 and T61-T65. Table 3 reports the averaged performances of the three methods over 10 randomly generated instances when the training rate is set to 70%. Except for data sets T45, T63, and T64, RPC provides the highest accuracy on the test sets, and, for all the data sets, its accuracy is higher than 80%. As shown in Tables 1 and 2, the robustness of the proposed RPC guarantees its scalability on the test sets.

Test on Multiple Classification.
In this subsection, we test the performance of RPC on multiple classification problems by comparison with RSVM and NBC. Since the performance of RSVM is determined by its regularization parameter $\lambda$, we run a set of RSVM models with $\lambda$ varying from 0 to a sufficiently large number and select the one with the best performance on test sets.
Figures 1 and 3 plot the performances of the three methods on the Y5 and T1 training sets, respectively. Unlike the case of binary classification problems, we can see that RPC provides a competitive performance even on the training sets. One explanation is that RSVM can outperform the proposed RPC on training sets by finding the optimal separating hyperplane for binary classification problems, while RPC extends more robustly to multiple classification problems since it uses the nonlinear probability information of the data sets. The accuracy of NBC on the training sets also improves as the training rate increases.
Figures 2 and 4 show the performances of both methods on the Y5 and T1 test sets, respectively. We can see that, for most [...]
To further test the performance of RPC on multiple classification problems, we carry out more experiments on data sets M1-M6. Table 4 reports the averaged performances of the three methods on these data sets when the training rate is set to 70%. Except for the M5 data set, RPC always provides the highest classification performance among the three methods, and even for the M5 data set, its accuracy (88.0%) is very close to the best one (88.1%).
From the tested real-life application, we conclude that the proposed RPC has the robustness to provide better performance for both binary and multiple classification problems compared with RSVM and NBC. The robustness of RPC enables it to avoid the "overlearning" phenomenon, especially for the binary classification problems.

Conclusion
In this paper, we propose a robust probability classifier model to address the data uncertainty in classification problems. To quantitatively describe the data uncertainty, a class-conditional distributional set is constructed based on the modified χ²-distance. We assume that the true distribution lies in the constructed distributional set centered at the nominal probability distribution. Based on the "linear combination assumption" for the posterior class-conditional probabilities, we consider a classification criterion using the weighted sum of the posterior probabilities. The optimal robust probability classifier is determined by minimizing the worst-case absolute error value over all the possible distributions belonging to the distributional set.
Our proposed model introduces the recently developed distributionally robust optimization method into classifier design problems. To obtain a computable model, we transform the resulting optimization problem into an equivalent second order cone program based on the conic duality theorem. Thus our model has the same computational complexity as the classic support vector machine, and numerical experiments on a real-life application validate its effectiveness. On the one hand, the proposed robust probability classifier provides a higher accuracy compared with RSVM and NBC by avoiding overlearning on training sets for binary classification problems; on the other hand, it also has a promising performance for multiple classification problems.
Many important extensions of our model remain to be explored. Other forms of loss function, such as the mean squared error function and Hinge loss functions, should be studied to obtain tractable reformulations, and the resulting models may provide better performance. Probability models considering joint probability distribution information are also interesting research directions.

Table 1: Performances of RSVM, NBC, and RPC for binary classification problems on Y5 data set.

Table 2: Performances of RSVM, NBC, and RPC for binary classification problems on T1 data set.

Table 3: Performances of RSVM, NBC, and RPC for binary classification problems on other data sets when tr = 70%.

Table 4: Performances of RSVM, NBC, and RPC for multiple classification problems on T1 data set.

Mathematical Problems in Engineering