Imprecise imputation as a tool for solving classification problems with mean values of unobserved features

A method is proposed for solving a classification problem when only partial information about some features is available. This partial information comprises the mean values of the features for every class and the bounds of the features. In order to exploit the available information as fully as possible, a set of probability distributions is constructed, and two distributions that define the minimax and minimin strategies are selected from this set. Random values of the features are generated in accordance with the selected distributions by using the Monte Carlo technique. As a result, the classification problem is reduced to the standard model, which is solved by means of the support vector machine. Numerical examples illustrate the proposed method.


Introduction
There are several major data mining techniques, including classification, clustering, novelty detection, etc. We consider classification as a data mining technique used to predict an unobserved output value $y$ based on an observed input vector $x$. This requires us to estimate a predictor $f$ from training data, i.e., from a set of example pairs $(x, y)$. A particularly important problem of statistical machine learning is the binary classification problem, which can be regarded as the task of classifying objects into two classes (groups) in accordance with their properties or features. In other words, we have to classify each pattern $x$ into one of the classes by means of a discriminant function $f$. A common assumption in supervised learning is that training and predicted data are drawn from the same (unknown) probability distribution, i.e., training and predicted data come from the same statistical model. As a result, most machine learning algorithms and methods exploit this assumption which, unfortunately, often does not hold in practice. This may lead to a performance deterioration in the induced classifiers [1,2]. This problem may arise if we have imbalanced data [18] or in the case of rare events or observations [36]. The assumption also fails in the case of partially known or observed features. For instance, this happens when we know only some mean values of the features but cannot get their actual values during training.
One of the approaches to handle the above problem and to cope with the imbalance and possible inconsistencies of training and predicted data is the minimax strategy, for which the classification parameters are determined by minimizing the maximum possible risk of misclassification [1,2]. This is an "extreme" strategy of decision making. As pointed out in [1], minimax classifiers may be seen as over-conservative since their goal is to optimize the performance under the least favorable conditions. Therefore, it is interesting to simultaneously study the so-called minimin or optimistic strategy, for which the classification parameters are determined by minimizing the minimum possible risk of misclassification. This is another "extreme" strategy.
Taking the above into account, we propose a classification model using the minimax and minimin strategies for situations when a part of the features is observed, with precise values available for the different classes, but our initial information about the other part of the features is restricted to the mean values of the features for every class. In other words, we know only mean values (expectations) of some features and do not have any observations. This is very restrictive information, which should nevertheless be exploited. The features with this information will be called unobserved for simplicity. A typical example of the above situation is the production of reinforced concrete beams whose quality and strength depend on a number of parameters such as the weight of reinforcement bars, concrete materials, etc. If we have not observed or measured some of the parameters before, it is difficult to reject new beams or to classify them into two classes, defective (rejected) or of high quality, because we do not have a learning set of beams with the measured parameters. However, if we know, for instance, how much steel has been used up by manufacturing $N$ beams, then we are able to evaluate the average weight of steel in a beam. The information can also be elicited, for instance, from experts. Often, it is easy for experts to provide judgments about average values of a feature for every class because this information is the simplest and most understandable.
One of the simplest ways to solve the classification problem with the partial information is to assume that the mean values are observed values. In this case, we in fact replace an unknown probability distribution of a feature by a deterministic variable which takes one value corresponding to the mean value of this feature. Of course, this is a very strong assumption, which may lead to a significant performance deterioration, especially if the underlying probability distribution is not symmetric. Another way is to find the mean values of every observed feature and use the simplest classification algorithm considered by many authors, for instance, by [26]. However, in this case we lose some useful information which can be inferred from the observations.
In order to maximally exploit the available information about features, we propose another approach whose underlying ideas can be formulated as a combination of multiple imputation [24] and imprecise models of features.
As indicated in [25], imputation is a class of methods by which an estimate of the missing value or of its distribution is used to generate predictions from a given model. In particular, either a missing value is replaced with an estimate of the value, or the distribution of possible missing values is estimated and the corresponding model predictions are combined probabilistically. Various imputation treatments for missing values in training data are available that may be deployed at prediction time [3,9,14,15,20,21]. However, some treatments such as multiple imputation [24] are particularly suitable to induction. In particular, multiple imputation (or repeated imputation) is a Monte Carlo approach that generates multiple simulated versions of a data set such that each is analyzed and the results are combined to generate inference.
We do not know the probability distributions of data for unobserved features. However, the mean values of features and their boundary values produce a set of probability distributions bounded by some lower and upper cumulative distribution functions (CDFs). This leads to constructing the so-called p-boxes [6,12] from data. It should be noted that the considered set of distributions is not the set of parametric distributions having the same parametric form as the bounding distributions, but the set of all possible distributions restricted by the lower and upper bounds. This is an important feature of the approach proposed in this paper. A probability distribution is selected from the p-box in order to make a pessimistic decision, which maximizes the risk function as a measure of the classification error. In other words, the well-known minimax strategy is applied for solving the classification problem, which appears as an insurance against the worst case [23]. Another probability distribution is selected from the p-box in order to make an optimistic or minimin decision. A similar idea applied to regression models has been considered in [30,31,32]. So, the first idea is to consider the lower and upper probability distributions of feature data produced by the corresponding mean values and bounds for the feature values.
It should be noted that the obtained bounding probability distributions do not belong to standard types of probability distributions, and their convolution for combining features and for computing parameters of the discriminant function is an extremely hard problem. Therefore, in order to cope with this problem, the second idea is proposed. We can apply the Monte Carlo technique for generating random values of features, which are governed by the probability distributions selected from the p-boxes in accordance with the minimax and minimin strategies [4,27]. In a nutshell, for every example of the training set, we generate a (large) number of random values for the unobserved features. This is a multiple imputation technique, which has been applied to classification problems [10,25,37]. The main distinction of the proposed approach from the available ones is that it is based on some partial information about unobserved features and uses the p-boxes for generating random values of features.
After carrying out this procedure, the classification problem can be solved by means of standard methods, for instance, by means of the support vector machine (SVM). The Monte Carlo technique has also been applied to general classification problems [7,28]. It has been successfully applied to reliability analysis problems in the framework of classification models [16,17]. Of course, the Monte Carlo technique requires additional computation efforts. However, its main advantage is its simplicity. Moreover, we get the standard classification problem solved by standard available software tools.
We have to stress that there is no sense in applying the proposed model when we have some missing values among the observed values of a feature.The model has to be used when we do not have observations for some features at all and only mean values of the features and their bounds are known.
The paper is organized as follows. A statement of the well-known standard classification problem is given in Section 2. This statement is extended to the case of a set of probability distributions of training data in Section 3. In this section, two strategies, minimax and minimin, are formally introduced. The classification problem with mean values for a part of unobserved features is considered in Section 4. A general method for constructing the classification model from partial information about some features by using the set of probability distributions is described in the same section. The question of generating training data for the Monte Carlo simulation is addressed in Section 5. A way of reducing the classification problem with partial information about features to the standard problem and its solution by means of the SVM method is given in Section 6. Numerical examples with synthetic data and with real datasets, including the Iris, Pima Indian Diabetes, Mammographic masses, Parkinsons, Indian Liver Patient, Breast Cancer Wisconsin (Original), Breast Cancer Wisconsin (Diagnostic), Musk, and Lung-cancer datasets from the UCI Machine Learning Repository [13], are provided in Section 7.

The standard classification problem
The binary classification problem can be formulated as follows. There are predictor-response data with a binary response $y$ representing the observation of classes $y = -1$ and $y = 1$. The binary classification problem is to estimate a region in predictor space in which class 1 is observed with the greatest possible majority. Suppose we are given empirical data $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n) \in \mathbb{R}^l \times \{-1, +1\}$. Here $\{x_1, x_2, \dots, x_n\}$ is some nonempty set of patterns or examples; $y_1, \dots, y_n$ are labels or outputs taking the values $-1$ and $+1$; $l$ is the number of features. It is supposed that the number of elements in the training set belonging to the class $y$ is $n_y$ and their indices form the set of indices $N(y)$, i.e., we can write $n_{-1} + n_{1} = n$ and $N(y) = \{i : y_i = y\}$.
A classification problem is usually characterized by an unknown CDF $F_0(x, y)$ on $\mathbb{R}^l \times \{-1, +1\}$ defined by the training set of examples $x_i$ and their corresponding class labels $y_i$.
The main problem is to find a decision function $g(x)$ which accurately predicts the class label $y$ of any example $x$ that may or may not belong to the training set. In other words, we seek a function $g$ that minimizes the classification error, which is given by the probability that $g(x) \neq y$. One of the possible approaches for solving the problem is the discriminant function approach, which uses a real-valued function $f(x)$, called the discriminant function, whose sign determines the class label prediction: $g(x) = \mathrm{sgn}(f(x))$. The discriminant function $f(x)$ may be parametrized with some parameters $(w_0, w)$, $w = (w_1, \dots, w_l)$, that are determined from the training examples by means of a learning algorithm. In particular, the function $f(x)$ may be linear, i.e., $f(x) = \langle w, x \rangle + w_0$. Introduce also the notation $x_i^{(k)}$ for the $i$-th element of the vector $x_k$. Given the training data, the linear discriminant training problem is to minimize the following risk measure [33]:
$$R(w) = \int_{\mathbb{R}^l \times \{-1,+1\}} L(x, y)\, dF_0(x, y).$$
Here the loss function $L(x, y)$ usually takes a non-zero value when the sign of the discriminant function (the class label prediction) does not coincide with the class label $y$. The minimization of the risk measure is carried out over the parametric class of functions $f(x)$. In other words, the function $f(x)$ provides the minimum of $R(w)$ such that $R(w_{opt}) = \min_w R(w)$.
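The discriminant function approach can be illustrated with a minimal Python sketch. It is not part of the original computations: the function names, the tie rule at $f(x) = 0$, the 0-1 loss, and the toy data are assumptions made for illustration, and the risk is replaced by its sample average over the training pairs.

```python
def discriminant(w, w0, x):
    """Linear discriminant f(x) = <w, x> + w0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0


def predict(w, w0, x):
    """Class label g(x) = sgn(f(x)); f(x) = 0 is mapped to +1 (arbitrary tie rule)."""
    return 1 if discriminant(w, w0, x) >= 0 else -1


def zero_one_loss(f_value, y):
    """Loss 1 when the sign of f disagrees with the label y, otherwise 0."""
    return 0.0 if (1 if f_value >= 0 else -1) == y else 1.0


def empirical_risk(w, w0, data, loss=zero_one_loss):
    """Average loss over the training pairs (x_i, y_i): a sample estimate of R(w)."""
    return sum(loss(discriminant(w, w0, x), y) for x, y in data) / len(data)


# Hypothetical toy data: two examples per class.
data = [((1.0, 2.0), 1), ((2.0, 0.5), 1), ((-1.0, -1.0), -1), ((0.5, -3.0), -1)]
print(empirical_risk((1.0, 1.0), 0.0, data))  # all four examples on the correct side
print(empirical_risk((1.0, 0.0), 0.0, data))  # the last example is misclassified
```

Any other loss with the signature `loss(f_value, y)` can be passed in place of the 0-1 loss.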

The classification problem under a set of probability distributions
Let us represent the joint probability as $F_0(x, y) = F_0(x \mid y) P(y)$. Here $P(y)$ is the prior probability that an example $x$ belongs to the class $y$. Then we can rewrite the risk measure taking into account the two values of $y$:
$$R(w) = P(-1) R_{-1}(w) + P(1) R_{1}(w), \qquad R_y(w) = \int_{\mathbb{R}^l} L(x, y)\, dF_0(x \mid y).$$
By assuming that the features are independent, we can rewrite the above risk measures as
$$R_y(w) = \int_{\mathbb{R}^l} L(x, y) \prod_{i=1}^{l} dF_i(x_i \mid y).$$
Suppose that the distributions $F_i$ are unknown. However, we assume that some lower and upper bounds for a set $\mathcal{F}_i(y)$ of the CDFs $F_i(x \mid y)$ are known accurate to $w$, and they are $\underline{F}_i(x \mid y)$ and $\overline{F}_i(x \mid y)$, respectively. We can write
$$\mathcal{F}_i(y) = \{ F_i(x \mid y) : \underline{F}_i(x \mid y) \le F_i(x \mid y) \le \overline{F}_i(x \mid y) \ \text{for all } x \}.$$
In other words, there is an unknown precise "true" CDF $F_i(x \mid y) \in \mathcal{F}_i(y)$ for every $y \in \{-1, +1\}$ and every $i = 1, \dots, l$, but we do not know it and only know that it belongs to the set $\mathcal{F}_i(y)$. As has been mentioned, the set $\mathcal{F}_i(y)$ is not the set of parametric distributions having the same parametric form as the bounding distributions, but the set of all possible distributions restricted by the lower and upper bounds.

The minimax strategy
One of the possible strategies to derive an estimator is the minimax (pessimistic) strategy. According to the minimax strategy, we select a CDF from the set $\mathcal{F}_i(-1)$ and a CDF from the set $\mathcal{F}_i(1)$ such that the risk measures $R_{-1}(w)$ and $R_{1}(w)$ achieve their maximum for every fixed $w$. The minimax strategy can be explained in a simple way. We do not know a precise CDF $F_i$, and every CDF from $\mathcal{F}_i(y)$ can be selected. Therefore, we should take the "worst" distribution providing the largest value of the risk measure. The minimax criterion appears as an insurance against the worst case because it aims at minimizing the expected loss in the least favorable case [23].
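Denote $\mathcal{F}(y) = \mathcal{F}_1(y) \times \dots \times \mathcal{F}_l(y)$. Since the sets $\mathcal{F}(-1)$ and $\mathcal{F}(+1)$ are obtained independently for $y = -1$ and $y = 1$, respectively, the upper bound for the risk measure is
$$\overline{R}(w) = \sum_{y \in \{-1,+1\}} P(y) \max_{F(x \mid y) \in \mathcal{F}(y)} R_y(w).$$
The risk functional with respect to the minimax strategy is now of the form $\overline{R}(w_{opt}) = \min_w \overline{R}(w)$. Let us consider in detail the first problem $\max_{F(x \mid -1) \in \mathcal{F}(-1)} R_{-1}(w)$. Most loss functions $L(x, -1)$ applied in classification are increasing with $f$. This implies that the upper bound for $R_{-1}(w)$, i.e., the maximum of $R_{-1}(w)$ over all distributions from $\mathcal{F}(-1)$, is achieved at the bounding CDFs $\tilde{F}(x \mid -1)$ (see, for instance, Walley's paper [35]). Hence, there holds
$$\overline{R}_{-1}(w) = \int_{\mathbb{R}^l} L(x, -1) \prod_{i=1}^{l} d\tilde{F}_i(x_i \mid -1),$$
where
$$\tilde{F}_i(x_i \mid y) = \begin{cases} \underline{F}_i(x_i \mid y), & L(x, y) \text{ increases with } x_i, \\ \overline{F}_i(x_i \mid y), & L(x, y) \text{ decreases with } x_i. \end{cases}$$
The above condition can be rewritten in terms of the function [...]. In the same way, we can consider the second problem $\max_{F(x \mid 1) \in \mathcal{F}(1)} R_{1}(w)$. Most loss functions $L(x, 1)$ are decreasing with $f$. Therefore, the upper bound for $R_{1}(w)$ is achieved at the distribution $\tilde{F}(x \mid 1)$. This implies that
$$\overline{R}_{1}(w) = \int_{\mathbb{R}^l} L(x, 1) \prod_{i=1}^{l} d\tilde{F}_i(x_i \mid 1).$$
Finally, we get the upper bound for the risk measure $R(w)$, which is of the form
$$\overline{R}(w) = P(-1)\, \overline{R}_{-1}(w) + P(1)\, \overline{R}_{1}(w).$$
Now we have two tasks. First, we have to define the CDFs $\underline{F}_i(x \mid y)$ and $\overline{F}_i(x \mid y)$ from the available information for every $y = -1, 1$ and for every $i = 1, \dots, l$. Second, we have to define the prior probabilities of classes $P(-1)$ and $P(1)$.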

The minimin strategy
The minimin strategy can be regarded as a direct opposite of the minimax strategy. According to the minimin strategy, the risk measure $R$ is minimized over all probability distributions from the set $\mathcal{F}$ as well as over all values of parameters. The strategy can be called optimistic because it selects the "best" probability distribution from the set $\mathcal{F}$. Of course, the minimin strategy is of little interest. Nevertheless, we study it in order to compare "extreme" cases (minimax and minimin strategies).
Similarly to the minimax strategy, we can write
$$\underline{R}(w) = \sum_{y \in \{-1,+1\}} P(y) \min_{F(x \mid y) \in \mathcal{F}(y)} R_y(w).$$
Since loss functions $L(x, -1)$ applied in classification are increasing with $f$, the lower bound for $R_{-1}(w)$, i.e., the minimum of $R_{-1}(w)$ over all distributions from $\mathcal{F}(-1)$, is achieved at the distribution $\hat{F}(x \mid -1)$. The loss function $L(x, 1)$ is decreasing. Therefore, the lower bound for $R_{1}(w)$ is achieved at the distribution $\hat{F}(x \mid 1)$. Hence, there holds
$$\underline{R}_{y}(w) = \int_{\mathbb{R}^l} L(x, y) \prod_{i=1}^{l} d\hat{F}_i(x_i \mid y),$$
where
$$\hat{F}_i(x_i \mid y) = \begin{cases} \underline{F}_i(x_i \mid y), & L(x, y) \text{ decreases with } x_i, \\ \overline{F}_i(x_i \mid y), & L(x, y) \text{ increases with } x_i. \end{cases}$$
The optimization problem for computing the optimal values of parameters $w$ for the minimin strategy can be written as
$$\underline{R}(w_{opt}) = \min_w \left( P(-1)\, \underline{R}_{-1}(w) + P(1)\, \underline{R}_{1}(w) \right).$$

Mean values of features and a method for constructing the model
[...] We assume without loss of generality that the features with numbers $1, \dots, t$ are observed. However, the other $l - t$ features are unobserved, and we know only the conditional mean values $m_i(y)$ of the features for every class and their bounds $a_i$ and $b_i$, $i = t + 1, \dots, l$. How should the objects be classified in this case?
One of the simplest ways is to assume that the mean values are observed values. In other words, we can write $x_j^{(i)} = m_j(y)$, $j = t + 1, \dots, l$, for all $i = 1, \dots, n$, i.e., for all observations. This way can be applied when there are a lot of observations. However, when the amount of statistical data is small, the above replacement of observations by mean values may lead to incorrect classification. Moreover, we do not take into account the information about bounds of feature values here, which might be useful.
Another way is to find the mean values of every observed feature with the number $j = 1, \dots, t$ for every class as
$$m_j(y) = \frac{1}{n_y} \sum_{i \in N(y)} x_j^{(i)}.$$
Then we can exploit the simplest classification algorithm considered by many authors, for instance, by [26]. The algorithm is based on analyzing the distances between a predicted vector $x$ and the two vectors of mean values of features. The smallest distance determines the class of $x$. It has been noted in [26] that the proposed decision is the best we can do if we have no prior information about the probabilities of the two classes. However, we lose some useful information in this case, which can be inferred from the observations. Therefore, we have to develop a classification method which maximally exploits the available information about features.
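The mean-based algorithm mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only: the Euclidean distance, the function names, and the toy data are assumptions, since the text does not fix a particular distance.

```python
import math


def class_means(X, y):
    """Per-class mean vector m(y) of the feature vectors."""
    means = {}
    for label in (-1, 1):
        rows = [x for x, yi in zip(X, y) if yi == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means


def nearest_mean_predict(means, x):
    """Assign x to the class whose mean vector is closest (Euclidean distance assumed)."""
    return min(means, key=lambda label: math.dist(x, means[label]))


# Hypothetical toy data with two features.
X = [(4.0, 5.0), (3.0, 6.0), (6.0, 10.0), (7.0, 9.0)]
y = [-1, -1, 1, 1]
means = class_means(X, y)
print(means)
print(nearest_mean_predict(means, (4.0, 6.0)), nearest_mean_predict(means, (6.0, 9.0)))
```

For unobserved features the entries of `means` would be the given values $m_j(y)$ rather than sample averages.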
The first important assumption we use below is that the values of the $t$ observed features are governed by the nonparametric or empirical distribution.
In dealing with the unobserved features, we consider two cases or two important assumptions. The first one is that we have conditional expectations $m_j(y)$ defined for every class. The second one is that we have an unconditional expectation for every feature, which does not depend on the class. This case is less informative, but it is typical for many applications. It is reduced to the first case by accepting the equality $m_j(-1) = m_j(1)$.
By assuming that the observed features are governed by the empirical distribution, we can conclude that the distribution of the function $f^{(1)}$ is also empirical, i.e., its PDF is the weighted sum of Dirac functions $\delta(f^{(1)} - f_i^{(1)})$ with weights $1/n_y$. Hence, we obtain [...]. The precise CDFs $F_j$, $j = t + 1, \dots, l$, are unknown. However, we know the mean values of every feature with numbers $t + 1, \dots, l$ for every class $y$ and the bounds of their values. Therefore, we can construct a set of CDFs with some lower and upper bounds. Given the mean value $m_i(y)$ of the $i$-th feature and its bounds $a_i$, $b_i$, the lower $\underline{F}_i(x \mid y)$ and upper $\overline{F}_i(x \mid y)$ conditional CDFs of the $i$-th feature values are
$$\underline{F}_i(x \mid y) = \begin{cases} 0, & x < a_i, \\ \max\left(0, \dfrac{x - m_i(y)}{x - a_i}\right), & a_i \le x < b_i, \\ 1, & x \ge b_i, \end{cases} \qquad \overline{F}_i(x \mid y) = \begin{cases} 0, & x < a_i, \\ \min\left(1, \dfrac{b_i - m_i(y)}{b_i - x}\right), & a_i \le x < b_i, \\ 1, & x \ge b_i. \end{cases}$$
It should be noted that the expression for the upper bound $\overline{F}_i(x \mid y)$ can be obtained by using the natural extension [19,34], which can be represented as the following linear programming problem: [...]. Here $1\{z \le x\}$ is the indicator function taking the value 1 if $z \le x$. The lower bound $\underline{F}_i(x \mid y)$ can be obtained in the same way by solving the following programming problem: [...]. The same bounds have been obtained in a different way in the work [11].

Figure 1: The lower and upper probability distributions
The lower and upper CDFs are shown in Fig. 1, where $M = m(y) = 2$, $a = 1$, $b = 8$. The resulting bounds are optimal in the sense that they could not be any tighter under the given information. However, this does not mean that any distribution whose CDF is inscribed within this bounded probability region would have the same expectation $m_i(y)$. The obtained set is richer and produces the p-box. This leads to a more conservative and cautious solution of the classification problem. Now we have two problems. The first one is to determine the CDFs $F_j$, $j = t + 1, \dots, l$. The second problem is to solve an optimization problem for computing parameters $w$ by using the above expressions for the risk measure.
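As an illustration, the bounding CDFs of a feature with mean $m$ on $[a, b]$ can be evaluated with the short Python sketch below. It assumes the Markov-type bounds $(x - m)/(x - a)$ for the lower CDF and $(b - m)/(b - x)$ for the upper CDF, which agree with the generation procedure described later in the paper; the function names are hypothetical, and the values $m = 2$, $a = 1$, $b = 8$ are those of Fig. 1.

```python
def lower_cdf(x, m, a, b):
    """Lower bound on P(X <= x) for any X on [a, b] with mean m (a <= m <= b)."""
    if x >= b:
        return 1.0
    if x <= m:  # also covers x < a; the bound is zero up to the mean
        return 0.0
    return (x - m) / (x - a)


def upper_cdf(x, m, a, b):
    """Upper bound on P(X <= x) for any X on [a, b] with mean m (a <= m <= b)."""
    if x < a:
        return 0.0
    if x >= m:  # also covers x >= b; the bound is one from the mean onwards
        return 1.0
    return (b - m) / (b - x)


m, a, b = 2.0, 1.0, 8.0  # the values used in Fig. 1
for x in (0.5, 1.0, 1.5, 2.0, 5.0, 7.9, 8.0):
    print(x, lower_cdf(x, m, a, b), upper_cdf(x, m, a, b))
```

The lower CDF has a jump at $b$ and the upper CDF has a jump at $a$, which is why both bounding distributions mix a point mass with a continuous part.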
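Since the function $L(f, -1)$ is increasing, the upper bound for $R_{-1}(w)$ can be written as
$$\overline{R}_{-1}(w) = \frac{1}{n_{-1}} \sum_{i \in N(-1)} \int_{\mathbb{R}^{l-t}} L\left(w_0 + \langle x_i^{(1)}, w^{(1)} \rangle + \langle x^{(2)}, w^{(2)} \rangle, -1\right) \prod_{j=t+1}^{l} d\tilde{F}_j(x_j \mid -1). \qquad (6)$$
Here $x_i^{(1)}$ and $x^{(2)}$ collect the observed and unobserved features, and $w^{(1)}$, $w^{(2)}$ are the corresponding parts of $w$. The upper bound $\overline{R}_{-1}(w)$ depends only on the bounds for CDFs $\tilde{F}_j(x_j \mid -1)$. This is a very important property which will be used later.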
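The function $L(f, 1)$ is decreasing. This implies that the upper bound for $R_{1}(w)$ can be written as
$$\overline{R}_{1}(w) = \frac{1}{n_{1}} \sum_{i \in N(1)} \int_{\mathbb{R}^{l-t}} L\left(w_0 + \langle x_i^{(1)}, w^{(1)} \rangle + \langle x^{(2)}, w^{(2)} \rangle, 1\right) \prod_{j=t+1}^{l} d\tilde{F}_j(x_j \mid 1). \qquad (7)$$
Here the upper bound $\overline{R}_{1}(w)$ also depends only on the bounds for CDFs $\tilde{F}_j(x_j \mid 1)$. It should be noted that it is difficult to evaluate the integrals in (6)-(7) in explicit form in order to get some functions of parameters $w$, even for the simplest loss functions $L$. However, we can apply the standard Monte Carlo technique. By using this technique, random values of features with the indices $j = t + 1, \dots, l$ are generated in accordance with the CDFs $\tilde{F}_j(x_j \mid -1)$ for the class $y = -1$ and with the CDFs $\tilde{F}_j(x_j \mid 1)$ for the class $y = 1$. By generating $K_i$ random vectors of features $x_{i,k}^{(2)} = (x_{t+1}^{(i,k)}, \dots, x_{l}^{(i,k)})$, $k = 1, \dots, K_i$, for every $i \in N(y)$ and every $y = -1, 1$ in accordance with the CDF $\tilde{F}_j(x_j \mid -1)$ and the CDF $\tilde{F}_j(x_j \mid 1)$, we rewrite (6)-(7) as follows:
$$\overline{R}_{-1}(w) \approx \frac{1}{n_{-1}} \sum_{i \in N(-1)} \frac{1}{K_i} \sum_{k=1}^{K_i} L\left(w_0 + \langle x_i^{(1)}, w^{(1)} \rangle + \langle x_{i,k}^{(2)}, w^{(2)} \rangle, -1\right),$$
$$\overline{R}_{1}(w) \approx \frac{1}{n_{1}} \sum_{i \in N(1)} \frac{1}{K_i} \sum_{k=1}^{K_i} L\left(w_0 + \langle x_i^{(1)}, w^{(1)} \rangle + \langle x_{i,k}^{(2)}, w^{(2)} \rangle, 1\right).$$
Finally, we obtain the upper risk measure as a function of parameters $w$ as
$$\overline{R}(w) = \sum_{i=1}^{n} \frac{P(y_i)}{n_{y_i} K_i} \sum_{k=1}^{K_i} L\left(w_0 + \langle x_i^{(1)}, w^{(1)} \rangle + \langle x_{i,k}^{(2)}, w^{(2)} \rangle, y_i\right), \qquad (8)$$
where $x_j^{(i,k)} \sim \tilde{F}_j(x_j \mid -1)$ for $i \in N(-1)$ and $x_j^{(i,k)} \sim \tilde{F}_j(x_j \mid 1)$ for $i \in N(1)$. In fact, we extend the training set by generating the "missing" values of features. We reduce the learning problem with combined types of training information to the standard problem in which the training data consist of real and generated observations of all features. It is important to note that we do not replace the "missing" features by their mean values $m_i(y)$, $i = t + 1, \dots, l$. The "missing" values are replaced by a set of random values of features generated in accordance with the corresponding lower and upper CDFs.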
The optimization problem for computing parameters $w$ for the minimin strategy is of the same form as (8). However, the value $x_j^{(i,k)}$ is governed by the CDF $\hat{F}_j(x_j \mid -1)$ for $i \in N(-1)$ and by the CDF $\hat{F}_j(x_j \mid 1)$ for $i \in N(1)$. This is the only difference between the optimization problems for the minimax and minimin strategies.
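The extension of the training set by generated values can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original implementation (the paper's computations were carried out in R): the function names, the dictionary-based inputs, the bounds $[0, 20]$, and the choice of which bounding CDF is used for which class are hypothetical; the draws use inverse transform sampling from the lower and upper bounding CDFs of a feature with mean $m$ on $[a, b]$.

```python
import random


def draw(kind, m, a, b, rng):
    """One draw from the lower ("lower") or upper ("upper") bounding CDF of a
    feature with mean m on [a, b], by inverse transform sampling."""
    threshold = (b - m) / (b - a)
    r = rng.random()
    if kind == "lower":
        # point mass at b, continuous part on [m, b)
        return b if r >= threshold else (m - a * r) / (1.0 - r)
    # point mass at a, continuous part on (a, m]
    return a if r <= threshold else (m - b * (1.0 - r)) / r


def extend_training_set(X_obs, y, means, bounds, kinds, K, rng):
    """Repeat every observed example K times, each time appending freshly
    generated values of the unobserved features.

    means[label][j]  mean of the j-th unobserved feature for the class `label`
    bounds[j]        (a_j, b_j) bounds of the j-th unobserved feature
    kinds[label][j]  "lower" or "upper": which bounding CDF to sample from
    """
    X_ext, y_ext = [], []
    for x_obs, label in zip(X_obs, y):
        for _ in range(K):
            generated = [
                draw(kinds[label][j], means[label][j], bounds[j][0], bounds[j][1], rng)
                for j in range(len(bounds))
            ]
            X_ext.append(list(x_obs) + generated)
            y_ext.append(label)
    return X_ext, y_ext


# Hypothetical data: one observed feature, one unobserved feature.
X_obs = [(4.1,), (3.8,), (6.2,), (5.9,)]
y = [-1, -1, 1, 1]
means = {-1: [5.0], 1: [10.0]}
bounds = [(0.0, 20.0)]                 # assumed bounds
kinds = {-1: ["lower"], 1: ["upper"]}  # assumed choice of bounding CDFs
X_ext, y_ext = extend_training_set(X_obs, y, means, bounds, kinds, K=20, rng=random.Random(0))
print(len(X_ext), X_ext[0], y_ext[0])
```

Swapping the entries of `kinds` switches between the two strategies; the rest of the pipeline stays the same, and the extended set can be passed to any standard classifier.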
An important question is how to determine the functions $\tilde{F}_j$ and $\hat{F}_j$, or how to determine the type of dependence between $L$ and $x_j$. We can propose two possible ways of doing that. First, the dependence can be determined by experts or by a decision maker on the basis of a preliminary analysis of features and classes. Very often, we can evaluate how possible changes of the feature values affect the output variable $y$ on the basis of the physical meaning of the analyzed classification problem. This way is simple, but it cannot always be applied to classification problems. Second, we can enumerate $2^{l-t+1}$ variants of the CDFs $\tilde{F}_j$ and $\hat{F}_j$ by taking different lower and upper CDFs instead of $\tilde{F}_j$ and $\hat{F}_j$. In accordance with the minimax strategy, the optimal risk measure is the largest value of the risk measure $R(w)$ at the optimal parameters $w_{opt}$. The same procedure can be applied to the minimin strategy. However, in this case we search for the smallest value of the risk measure $R(w)$ at the optimal parameters $w_{opt}$.

A procedure for generation of random feature values
Let us consider how to generate random feature values in accordance with the above CDFs. First, we analyze the lower CDF. It can be seen from its form that the corresponding random variable is concentrated on two subsets. The first subset is the interval from $m_i(y)$ to $b_i$. The second is the point $b_i$. The probability that the random variable is in the interval is $(b_i - m_i(y))/(b_i - a_i)$. Therefore, a random number is generated in two steps. First, a random variable $r$ uniformly distributed in the interval $[0, 1]$ is generated. If $r$ is larger than $(b_i - m_i(y))/(b_i - a_i)$, then $x = b_i$, i.e., the generated number at the second step is $b_i$. If $r$ is smaller than $(b_i - m_i(y))/(b_i - a_i)$, then we use the well-known inverse transformation method. According to the method, the random number $x$ is computed through the inverse lower CDF, i.e.,
$$x = \frac{m_i(y) - a_i r}{1 - r}.$$
The right side of the above equality is obtained by means of the inverse transformation of the lower CDF. The same simulation procedure can be provided for the upper probability distribution. A random variable $r$ uniformly distributed in the interval $[0, 1]$ is generated. If $r$ is smaller than $(b_i - m_i(y))/(b_i - a_i)$, then $x = a_i$, i.e., the generated number at the second step is $a_i$. If $r$ is larger than $(b_i - m_i(y))/(b_i - a_i)$, then, according to the inverse transformation method, the random number $x$ is computed through the inverse upper CDF, i.e.,
$$x = \frac{m_i(y) - b_i (1 - r)}{r}.$$
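The two-step generation procedure can be sketched in Python as follows. The function names are hypothetical, and the numerical values $m = 2$, $a = 1$, $b = 8$ are taken from Fig. 1 only for illustration; the sketch assumes $a < b$ and $a \le m \le b$.

```python
import random


def sample_lower(m, a, b, rng):
    """Draw from the lower bounding CDF: point mass at b, continuous part on [m, b)."""
    threshold = (b - m) / (b - a)
    r = rng.random()  # uniform on [0, 1)
    if r >= threshold:
        return b
    return (m - a * r) / (1.0 - r)  # inverse of r = (x - m) / (x - a)


def sample_upper(m, a, b, rng):
    """Draw from the upper bounding CDF: point mass at a, continuous part on (a, m]."""
    threshold = (b - m) / (b - a)
    r = rng.random()
    if r <= threshold:
        return a
    return (m - b * (1.0 - r)) / r  # inverse of r = (b - m) / (b - x)


rng = random.Random(0)
lower_draws = [sample_lower(2.0, 1.0, 8.0, rng) for _ in range(20000)]
upper_draws = [sample_upper(2.0, 1.0, 8.0, rng) for _ in range(20000)]
# Share of draws at the point masses; should be close to (m - a)/(b - a) and (b - m)/(b - a).
print(sum(x == 8.0 for x in lower_draws) / len(lower_draws))
print(sum(x == 1.0 for x in upper_draws) / len(upper_draws))
```

Note that the draws from the lower CDF never fall below $m$ and the draws from the upper CDF never exceed $m$, so neither bounding distribution has mean $m$ itself; they are the extreme elements of the p-box, not members of the original set of distributions with the given mean.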

Hinge loss function and SVM
A procedure for computing optimal values of parameters $w$ depends on the loss function $L$. We consider the so-called hinge loss function, which is of the form $L(f, y) = \max(0, 1 - yf)$. This function is chosen in order to reduce the classification problem to the SVM method, which gives the opportunity to construct nonlinear classification models in a rather simple way.
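After substituting the hinge loss function into the objective function (8), we get the following optimization problem:
$$\min_{w, w_0} \sum_{i=1}^{n} \frac{P(y_i)}{n_{y_i} K_i} \sum_{k=1}^{K_i} \max\left(0, 1 - y_i \left( w_0 + \langle (x_i^{(1)}, x_{i,k}^{(2)}), w \rangle \right)\right).$$
It can be rewritten in a more compact form by denoting $x_{i,k} = (x_i^{(1)}, x_{i,k}^{(2)})$, so that the inner product becomes $\langle x_{i,k}, w \rangle$. Let us introduce a new optimization variable $G_{i,k} = \max\left(0, 1 - y_i (w_0 + \langle x_{i,k}, w \rangle)\right)$. Then we get the optimization problem
$$\min_{w, w_0, G} \sum_{i=1}^{n} \frac{P(y_i)}{n_{y_i} K_i} \sum_{k=1}^{K_i} G_{i,k} \qquad (9)$$
subject to
$$G_{i,k} \ge 1 - y_i \left( w_0 + \langle x_{i,k}, w \rangle \right), \qquad (10)$$
$$G_{i,k} \ge 0, \quad \forall i = 1, \dots, n, \ k = 1, \dots, K_i. \qquad (11)$$
So, we have a linear optimization problem with $(l + 1) + \sum_{i=1}^{n} K_i$ optimization variables and $2 \sum_{i=1}^{n} K_i$ constraints.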
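Let us add the standard Tikhonov regularization term $\frac{1}{2} \langle w, w \rangle$ (the most popular penalty or smoothness term) [29] to the objective function (9), together with the constant "cost" parameter $C$. The smoothness (Tikhonov) term can be regarded as a constraint which enforces uniqueness by penalizing functions with wild oscillation and effectively restricting the space of admissible solutions. A detailed analysis of regularization methods can also be found in the work [8]. Then we get the following quadratic programming problem:
$$\min_{w, w_0, G} \left( \frac{1}{2} \langle w, w \rangle + C \sum_{i=1}^{n} \frac{P(y_i)}{n_{y_i} K_i} \sum_{k=1}^{K_i} G_{i,k} \right) \qquad (12)$$
subject to (10)-(11). Instead of minimizing the primary objective function (12), a dual objective function, the so-called Lagrangian, can be formed, of which the saddle point is the optimum. The Lagrangian is
$$\mathcal{L} = \frac{1}{2} \langle w, w \rangle + C \sum_{i=1}^{n} \frac{P(y_i)}{n_{y_i} K_i} \sum_{k=1}^{K_i} G_{i,k} - \sum_{i=1}^{n} \sum_{k=1}^{K_i} \mu_{i,k} G_{i,k} - \sum_{i=1}^{n} \sum_{k=1}^{K_i} \varphi_{i,k} \left( G_{i,k} - 1 + y_i \langle (x_i^{(1)}, x_{i,k}^{(2)}), w \rangle + y_i w_0 \right).$$
Here $\mu_{i,k}$, $\varphi_{i,k}$, $i = 1, \dots, n$, $k = 1, \dots, K_i$, are Lagrange multipliers. Hence, the dual variables have to satisfy the positivity constraints $\mu_{i,k} \ge 0$, $\varphi_{i,k} \ge 0$ for all $i$, $k$. Hence, we get the simplified Lagrangian
$$\mathcal{L} = \frac{1}{2} \langle w, w \rangle + \sum_{i=1}^{n} \sum_{k=1}^{K_i} \varphi_{i,k} - \sum_{i=1}^{n} \sum_{k=1}^{K_i} \varphi_{i,k} y_i \langle (x_i^{(1)}, x_{i,k}^{(2)}), w \rangle.$$
Now we can divide all terms of the above objective function into two parts corresponding to the observed and unobserved features, respectively:
$$\langle (x_i^{(1)}, x_{i,k}^{(2)}), w \rangle = \langle x_i^{(1)}, w^{(1)} \rangle + \langle x_{i,k}^{(2)}, w^{(2)} \rangle.$$
Hence, we obtain the dual optimization problem
$$\max_{\varphi} \left( \sum_{i=1}^{n} \sum_{k=1}^{K_i} \varphi_{i,k} - \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{K_i} \sum_{j=1}^{n} \sum_{s=1}^{K_j} \varphi_{i,k} \varphi_{j,s} y_i y_j \left( \langle x_i^{(1)}, x_j^{(1)} \rangle + \langle x_{i,k}^{(2)}, x_{j,s}^{(2)} \rangle \right) \right)$$
subject to
$$\sum_{i=1}^{n} \sum_{k=1}^{K_i} \varphi_{i,k} y_i = 0, \qquad 0 \le \varphi_{i,k} \le C \frac{P(y_i)}{n_{y_i} K_i}.$$
Any data point for which $\varphi_{i,k} > 0$ is called a support vector. Let $S$ and $N_S$ denote the set of indices of the support vectors and their total number, respectively. Then one of the ways of computing the parameter $w_0$ is
$$w_0 = \frac{1}{N_S} \sum_{s \in S} \left( y_s - \sum_{i=1}^{n} \sum_{k=1}^{K_i} \varphi_{i,k} y_i \langle (x_i^{(1)}, x_{i,k}^{(2)}), x_s \rangle \right),$$
where $(y_s, x_s)$ is one of the support vectors. If we assume that $K_i = K$ for all $i = 1, \dots, n$ and the prior probabilities are defined as $P(y) = n_y / n$, then we rewrite the optimization problem as the same maximization subject to
$$\sum_{i=1}^{n} \sum_{k=1}^{K} \varphi_{i,k} y_i = 0, \qquad 0 \le \varphi_{i,k} \le \frac{C}{nK}, \quad i = 1, \dots, n, \ k = 1, \dots, K.$$
Finally, we can write the discriminant function
$$f(x) = \sum_{i=1}^{n} \sum_{k=1}^{K} \varphi_{i,k} y_i \left\langle (x_i^{(1)}, x_{i,k}^{(2)}), (x^{(1)}, x^{(2)}) \right\rangle + w_0.$$
The main advantage of the SVM is the use of kernels, which are functions that transform the input data to a high-dimensional space where the learning problem is solved. There are many types of kernel that may be used in an SVM. Acceptable kernels must satisfy Mercer's condition. Commonly used forms of kernels are linear, $K(x_i, x_j) = \langle x_i, x_j \rangle$, polynomial, $K(x_i, x_j) = (\gamma \langle x_i, x_j \rangle + r)^d$, and the Gaussian radial basis function (RBF), $K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2)$. Here $\gamma$, $r$, and $d$ are kernel parameters.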
The kernel functions allow us to significantly extend the class of discriminant functions that can be used in this approach.
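For the linear kernel, the reduction described in this section can be illustrated with a small Python sketch. It is not the dual quadratic programming solver derived above: as a stand-in, it minimizes the primal objective $\frac{1}{2}\langle w, w \rangle + \frac{C}{n'} \sum \max(0, 1 - y f(x))$ over the $n'$ rows of an (extended) training set by plain subgradient descent. The function names, the step size, the number of epochs, and the toy data are assumptions made for illustration.

```python
def hinge(f_value, y):
    """Hinge loss L(f, y) = max(0, 1 - y f)."""
    return max(0.0, 1.0 - y * f_value)


def objective(w, w0, X, y, C=1.0):
    """Regularized hinge risk: 0.5 <w, w> + (C / n) * sum of hinge losses."""
    n = len(X)
    risk = sum(hinge(sum(wj * xj for wj, xj in zip(w, x)) + w0, yi) for x, yi in zip(X, y))
    return 0.5 * sum(wj * wj for wj in w) + C * risk / n


def train_linear_svm(X, y, C=1.0, lr=0.05, epochs=1000):
    """Plain batch subgradient descent on the objective above (a stand-in
    for the dual QP; constant step size, no stopping rule)."""
    n, l = len(X), len(X[0])
    w, w0 = [0.0] * l, 0.0
    for _ in range(epochs):
        grad_w, grad_w0 = list(w), 0.0  # gradient of the Tikhonov term is w
        for x, yi in zip(X, y):
            f = sum(wj * xj for wj, xj in zip(w, x)) + w0
            if yi * f < 1.0:  # the hinge is active for this row
                for j in range(l):
                    grad_w[j] -= C / n * yi * x[j]
                grad_w0 -= C / n * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        w0 -= lr * grad_w0
    return w, w0


# Hypothetical, well separated toy data standing in for an extended training set.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (4.0, 4.0), (4.0, 5.0), (5.0, 4.0)]
y = [-1, -1, -1, 1, 1, 1]
w, w0 = train_linear_svm(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) + w0 >= 0 else -1 for x in X]
print(w, w0, predictions)
```

With equal $K_i = K$ and $P(y) = n_y / n$, every row of the extended set carries the same weight $C/(nK)$, which is what the factor `C / n` over the extended rows reproduces. A kernelized model would instead require a dual solver from a standard SVM library.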

Experimental design
We illustrate the method proposed in this paper via several examples. All computations have been performed using the statistical software R [22]. We investigate the performance of the proposed method and compare it with other methods dealing with missing data by considering the accuracy measure (ACC), which is the proportion of correctly classified cases on a sample of data, i.e., ACC is an estimate of a classifier's probability of a correct response. This measure is often used to quantify the predictive performance of classification methods, and it is an important statistical measure of the performance of a binary classification test. It can formally be written as $ACC = N_T / N$. Here $N_T$ is the number of test data for which the predicted class of an example coincides with its true class, and $N$ is the total number of test data. First we consider a numerical example with synthetic data. In this example, we generate instances with two features ($l = 2$) such that the second feature is unobserved. We generate 500 normally distributed random values for every feature with the expectations $m_1(-1) = 4$, $m_1(1) = 6$, $m_2(-1) = 5$, $m_2(1) = 10$, and the standard deviations $\sigma_1 = 1$ and $\sigma_2 = 3$, respectively. We take identical standard deviations for both classes in order to simplify the example. Moreover, we state the lower and upper bounds for values of the second feature [...] Here we use the available mean values of the second feature as values of the feature. We will call this strategy direct for short. The initially generated 500 normally distributed random values will be used for testing the resulting discriminant functions.
The ACC measures and the discriminant functions for the above three training sets will be indexed by numbers 1, 2, 3 corresponding to the minimax, minimin and direct strategies, respectively. We will use the linear and RBF kernels with the parameter $\gamma = 1/l$. By applying the above initial data, we get three discriminant functions corresponding to the three strategies (minimax, minimin, direct): [...] $f_2(x) = 0.009 x_1$ [...] $0.49 x_2 + 3.91$. The corresponding ACCs for linear and RBF kernels are shown in Table 1. One can see from the table that the optimistic and direct strategies provide better results in comparison with the minimax strategy. This can be explained by the use of the normal distribution (symmetric and unimodal) with rather small standard deviations for generating the random values of the second feature.
We replace the normal distribution of the second feature values by the truncated exponential distribution with the CDF $1 - \exp(-(x - a_2)/m_2(y))$ if $x < b_2$ and $1$ if $x \ge b_2$. This distribution is not symmetric, and its mean value cannot replace the corresponding random values. By taking the linear and RBF kernels, $n = 10$, $K = 20$, we get the following discriminant functions: $f_1(x) = 1.96 x_1 + 0.2 x_2 + 7.91$; $f_2(x) = 0.09 x_1$ [...] $0.47 x_2 + 4.2$; [...] The corresponding ACCs for linear and RBF kernels are shown in Table 2. It can be seen from the table that the minimax strategy provides better results. This follows from the fact that the minimax strategy takes into account the worst cases of the probability distribution of feature values. Of course, the exponential distribution used here is not the worst case, but it is not the best case either. We can immediately observe that making the probability distribution less favorable improves the minimax strategy in comparison with the minimin and direct strategies.
The proposed method has been evaluated on the following publicly available datasets: Iris, Pima Indian Diabetes, Mammographic masses, [13]. Table 3 gives a brief description of these datasets; more detailed information can be found in the respective data resources. For all data we use the repeated random sub-sampling validation procedure, i.e., we randomly split the dataset into two subsets. One of them (the training set with $n$ instances) is used to train the model, while the other (the test set with $N - n$ instances) is used to validate the model. We take $n/2$ instances from every class for training; they are randomly selected from the classes. The remaining instances in the dataset are used for validation. The parameter $\gamma$ of the RBF kernel for every dataset is chosen in order to maximize the accuracy measure by means of the following procedure. It is well known that letting $C$ and $\gamma$ grow exponentially is a practical method to identify good parameters. An $r \times r$ uniform grid in the logarithmic coordinate space ($C' = \log_2 C$, $\gamma' = \log_2 \gamma$) is usually used. A point in the grid represents a parameter pair $(C', \gamma')$. However, we fix the value $C = 100$ in order to reduce the number of experiments, because our main aim is to compare the proposed models with known models. So, we perform experiments on a uniform grid of 13 points where $\gamma$ takes the values $2^{-6}, \ldots, 2^{6}$ (that is, $\gamma' = -6, \ldots, 6$).
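The validation split and the one-dimensional search over $\gamma$ can be sketched as follows. This is an assumption-laden Python outline, not the authors' implementation: `stratified_split` and `select_gamma` are hypothetical helper names, and `evaluate` stands for any user-supplied function that trains an SVM with $C = 100$ and the given $\gamma$ and returns the resulting ACC. The toy `evaluate` below merely peaks at $\gamma = 4$ to show the selection step.

```python
import math
import random

C_FIXED = 100.0                                 # C is fixed, as in the text
GAMMA_GRID = [2.0 ** k for k in range(-6, 7)]   # 2^-6, ..., 2^6: 13 values

def stratified_split(labels, n_train, rng):
    """Return (train_idx, test_idx) with n_train/2 random indices per class."""
    half = n_train // 2
    train = []
    for cls in (-1, 1):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        if len(idx) < half:
            raise ValueError("not enough instances in class %d" % cls)
        train.extend(rng.sample(idx, half))
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train), test

def select_gamma(evaluate, gammas=GAMMA_GRID):
    """Return the grid value with the highest accuracy; evaluate(gamma) -> ACC."""
    return max(gammas, key=evaluate)

# Toy stand-in for "train with (C_FIXED, gamma) and measure ACC".
def toy_evaluate(gamma):
    return 1.0 - 0.05 * abs(math.log2(gamma) - 2.0)

labels = [-1] * 30 + [1] * 40
train_idx, test_idx = stratified_split(labels, 20, random.Random(0))
best_gamma = select_gamma(toy_evaluate)
print(len(train_idx), len(test_idx), best_gamma)
```

Repeating the split with fresh random seeds and averaging the resulting accuracies gives the repeated random sub-sampling estimate.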
From every dataset, we randomly select a feature to be treated as the one with missing values and compute its mean values for negative and positive labels, respectively. Moreover, we find the smallest and largest values of the selected feature, which will be used for determining the lower and upper cumulative distribution functions. Then we generate the random values of the selected feature $K = 20$ times for every instance. In sum, we have $nK$ instances. The above procedure is repeated $N = 50$ times such that the selected feature with missing values is chosen randomly in every iteration. In addition to the minimin (ACC1), minimax (ACC2) and direct (ACC3) strategies, we generate random values of the "missing" feature in accordance with the normal distribution and compute the corresponding accuracy measure ACC4. By using the RBF kernels and the parameter $C = 100$, we get the ACC measures for different values of $n$, which are shown in Table 4. These measures are mean values of the corresponding ACCs computed for every iteration. One can see from Table 4 that the proposed minimax strategy (ACC2) outperforms the direct strategy and the normal distribution imputation procedure for some real datasets. Of course, there are datasets for which the measures ACC3 or ACC4 are larger than ACC2. While the experiments with synthetic data show that the minimax strategy provides better results when the distribution of the feature values is not symmetric and its mean value cannot replace the corresponding random values, it is difficult to determine clear conditions for using the proposed model with real data. We can say that these conditions directly depend on the probability distribution of the feature values in real data. When we do not have this information, the proposed method should be used jointly with other models dealing with missing data.
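The Monte Carlo expansion of a training set from $n$ to $nK$ instances can be sketched as below. This Python outline is an illustration under assumptions, not the authors' code: the helper names are hypothetical, samplers for the lower and upper CDFs are not shown because their construction is given elsewhere in the paper, and the standard deviation passed to `normal_sampler` is an assumed input that the text does not specify.

```python
import random

def expand_training_set(X, y, feature, sampler, K, rng):
    """Replace X[i][feature] by K random draws, so n instances become n*K.

    sampler(label, rng) returns one value of the unobserved feature
    for an instance of the given class.
    """
    X_new, y_new = [], []
    for xi, yi in zip(X, y):
        for _ in range(K):
            row = list(xi)              # copy, so the original data stay intact
            row[feature] = sampler(yi, rng)
            X_new.append(row)
            y_new.append(yi)
    return X_new, y_new

def direct_sampler(class_means):
    """Direct strategy: always return the class mean of the feature."""
    return lambda label, rng: class_means[label]

def normal_sampler(class_means, sd):
    """Normal imputation baseline; sd is an assumed, user-chosen input."""
    return lambda label, rng: rng.gauss(class_means[label], sd)

def mean_accuracy(accs):
    """Average the ACC values obtained over the repeated iterations."""
    return sum(accs) / len(accs)

X = [[1.0, 0.0], [2.0, 0.0]]   # second column plays the unobserved feature
y = [-1, 1]
class_means = {-1: 5.0, 1: 10.0}
X_direct, y_direct = expand_training_set(
    X, y, 1, direct_sampler(class_means), K=20, rng=random.Random(0))
X_normal, y_normal = expand_training_set(
    X, y, 1, normal_sampler(class_means, 3.0), K=20, rng=random.Random(0))
print(len(X_direct), len(X_normal))
```

The expanded sets can then be passed to any standard SVM routine, and `mean_accuracy` collects the per-iteration ACC values into the averages reported per strategy.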

Conclusion
A classification problem under partial information about some features in the form of conditional expectations or mean values of features for every class has been studied in the paper. Its solution is based on the pessimistic (minimax) and optimistic (minimin) decision strategies.
What are the main advantages of the proposed method? First, the classification algorithm fully exploits the available information in the form of mean values of some features and the bounds of these features. At the same time, it does not employ any additional information which may be unjustified and incorrect. Nor does it use additional assumptions which may lead to incorrect prediction results. Second, the proposed method has a strong probabilistic background, which allows us to use it in arbitrary applications where the initial information is scarce. Third, the method exploits the well-known minimax and minimin strategies, which have a clear interpretation. A cautious decision strategy as an intermediate case between the pessimistic and optimistic strategies with a predefined caution parameter can also be studied in the same way. However, this is a direction for further research. Fourth, the method is reduced to the SVM. This allows us to construct non-linear classification models simply by using suitable kernels. Fifth, the method allows us to reduce the classification problem to the standard form. This implies that standard software can be applied for its implementation. The algorithm for computing the optimal parameters of every classification model can be easily implemented with standard functions of the statistical software package R or by using the well-known software library LIBSVM (A Library for Support Vector Machines) [5].
The numerical examples have illustrated that the minimax classifiers can provide more accurate results in many cases in spite of their over-conservative decisions. At the same time, the given experiments can be viewed as a preliminary study of the proposed framework for applying the imprecise models to classification problems with missing values. An additional study has to be carried out in order to fully determine when the proposed classifiers outperform the available classification models.
One can also see from the paper that the Monte Carlo technique is a versatile tool for dealing with partial information. Various classification problems under different types of partial and unreliable information could be solved in the same way. A detailed analysis of the corresponding classification models is another direction for further research.
At the same time, it is well known that one possible limitation of Monte Carlo methods is that the computational effort depends strongly on the number of samples, to which it is proportional. This implies that learning from large datasets may lead to a hard computational problem. However, first of all, the minimax strategy should be used when the number of instances in the training set is rather small, in order to provide robust classification. When the training set consists of a large number of instances, other models might give better results. Second, variance reduction techniques can be applied to the classification procedures to decrease the computational effort. This is also a topic of further research.
The proposed method can also be extended to the case of interval-valued mean values of unobserved features. In this case, the lower and upper CDFs are determined by the lower and upper mean values of the features.

The bounds of the second feature are $a_2 = 4$ and $b_2 = 14$. Then we randomly select $n = 10$ points (instances) with identical numbers of points (5) for both classes and get three training sets. The first and the second training sets are obtained in the following way. We generate the values of the second feature $K = 20$ times for every example. In sum, we have 200 examples. The values for the first training set are generated in accordance with the CDFs $\widetilde{F}_j(x_j \mid -1)$ for the class $y = -1$ and with the CDFs $\widetilde{F}_j(x_j \mid 1)$ for the class $y = 1$. This training set corresponds to the minimax strategy. The values for the second training set are generated in accordance with the CDFs $\widehat{F}_j(x_j \mid -1)$ for the class $y = -1$ and with the CDFs $\widehat{F}_j(x_j \mid 1)$ for the class $y = 1$. The second training set corresponds to the minimin strategy. To get the third training set, we replace all values of the second feature in the set of $n = 10$ examples by the expectations $m_2(y)$ for $y = -1$ and $y = 1$.

Table 1: The ACC measures for linear and RBF kernels by normally distributed

Table 4: The ACC measures for real datasets by different values of $n$. The measures are obtained with the RBF kernels and $C = 100$ and are mean values of the corresponding ACCs computed for every iteration.