Fuzzy One-Class Classification Model Using Contamination Neighborhoods

A fuzzy classification model is studied in the paper. It is based on the contaminated (robust) model which produces fuzzy expected risk measures characterizing classification errors. Optimal classification parameters of the models are derived by minimizing the fuzzy expected risk. It is shown that an algorithm for computing the classification parameters is reduced to a set of standard support vector machine tasks with weighted data points. Experimental results with synthetic data illustrate the proposed fuzzy model.


Introduction
A main goal of statistical machine learning is to predict an unobserved output value y based on an observed input vector x. A particularly important problem of statistical machine learning is classification, which can be regarded as the task of assigning objects to classes in accordance with their properties or features. Many models have been constructed for solving machine learning problems in recent decades. However, a large part of these models is based on restrictive assumptions, for instance, a large amount of training data, a known type of the noise probability distribution, point-valued observations, and so forth. At the same time, real applications often cannot satisfy all or some of these assumptions for several reasons. In order to relax some of the restrictive assumptions, many methods have been proposed. One direction for developing such methods is fuzzy classification, which applies the main ideas of fuzzy set theory to various classification problems.
Most fuzzy classification models can conditionally be divided into three groups. Two groups suppose that the observations are precise, but fuzzy sets are used to take into account different contributions of the observations or imprecision of results, for instance, imprecision of separating surfaces. According to models of the first group, a membership value or a membership function is assigned to every point of the training set in accordance with some rules [1-5]. As a result, different input points can make different contributions to the learning of the decision surface. Then the initial classification problem is solved as a weighted classification problem by means of known methods, for example, the support vector machine (SVM) approach proposed by Vapnik [6]. According to models of the second group, a fuzzy separating surface, for instance, a fuzzy hyperplane in the feature space, is constructed [7,8]. In other words, the result of solving the classification problem is a set of hyperplanes with a membership function derived from the model. The third group of models supposes that the learning observations themselves are interval-valued or fuzzy-valued [9-11] due to imperfection of measurement tools or imprecision of expert information if it is used as data. It should be noted that methods based on fuzzy sets can be used for generalizing two-class classification to multiclass classification. In particular, Wilk and Wozniak [12] proposed such a method by exploiting a fuzzy inference system.
The main difficulty of most fuzzy models is how to determine the membership functions for points of the training set. Therefore, a fuzzy classification model is proposed in this paper which constructs the membership function on the basis of available statistical data by using an extension of the well-known ε-contamination neighborhood or ε-contaminated (robust) models [13]. The main ideas underlying the proposed model are the following.
The empirical probability distribution accepted for deriving the empirical expected risk (a measure of the classification error) and used in the standard SVM is replaced by a set of probability distributions, which is produced by applying the ε-contaminated model. Here it is assumed that the contaminating probability distribution can be arbitrary. As a result, we can construct an imprecise classification model for a predefined value of ε. However, the main difficulty of using the ε-contaminated model is that the value of ε is unknown, and there are no rules or methods for its determination. Therefore, the next idea is to use all values of ε and to construct a model which covers all of them.
Note that the set of probability distributions produced by the ε-contaminated model for a fixed ε is convex. This implies that the expected risk, as an expectation of a loss function, lies in an interval with some lower and upper bounds. Moreover, we get a set of nested intervals for different values of ε ranging from 0 to 1, which can be regarded as a fuzzy set. This implies that we obtain a fuzzy expected risk parametrized by the classification parameters, that is, by the parameters of a separating surface. The next task is to find the classification parameters which minimize the fuzzy expected risk. This task is solved in the framework of the SVM by taking a special ranking index for ranking the fuzzy expected risk measures. The fuzziness of the expected risk measure strongly depends on the number of points in the training set. It is assumed that the fuzziness increases as the number of data points decreases.
It can be seen from the ideas above that the proposed model combines features of the first and the second groups of models.
In order to simplify the description of the proposed classification model, we consider the one-class classification model, also known as the novelty detection model [14-17]. Moreover, we restrict ourselves to studying the model proposed by Schölkopf et al. [16,18]. Nevertheless, this model can easily be extended to the case of binary classification. What is important here is to show the main principles for constructing the fuzzy classification model.
The paper is organized as follows. Section 2 presents the standard one-class classification problem proposed in [16,18]. The ε-contaminated robust model and its peculiarities are considered in Section 3. A set of probability distributions produced by the model and the fuzzy expected risk measure are studied in the same section. The SVM approach for computing the optimal classification parameters is provided in Section 4. In this section, one of the possible ways of decomposing the quadratic programming problem is considered. Numerical experiments with synthetic and some real data illustrating the accuracy of the proposed model are provided in Section 5. In Section 6, concluding remarks are given.

One-Class Classification
Suppose we have unlabeled training data $\{x_1, \ldots, x_n\} \subset X$, where $n$ is the number of observations and $X$ is some set, for instance, a compact subset of $\mathbb{R}^m$. According to [16,18], the well-known one-class classification (novelty detection) model aims to construct a function $f$ which takes the value +1 in a "small" region capturing most of the data points and −1 elsewhere. This can be done by mapping the data into the feature space corresponding to the kernel and by separating them from the origin with maximum margin.
Let $\phi$ be a feature map $X \to G$ such that the data points are mapped into an alternative higher-dimensional feature space $G$. In other words, this is a map into an inner product space $G$ such that the inner product in the image of $\phi$ can be computed by evaluating some simple kernel $K(x, y) = \langle \phi(x), \phi(y) \rangle$, such as the Gaussian kernel
$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{\sigma^2}\right).$$
Here $\sigma$ is the kernel parameter determining the geometrical structure of the mapped samples in the kernel space.
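As a small illustration, the kernel evaluation can be sketched in Python. The scaling $\exp(-\|x - y\|^2/\sigma^2)$ and the function names `gaussian_kernel` and `gram_matrix` are assumptions made for this sketch; other conventions divide by $2\sigma^2$ instead.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel with the assumed scaling exp(-||x - y||^2 / sigma^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / sigma ** 2)


def gram_matrix(points, sigma):
    """Matrix of pairwise kernel values K(x_i, x_j) for a list of points."""
    return [[gaussian_kernel(p, q, sigma) for q in points] for p in points]


# Three hypothetical points with two features each.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, sigma=1.0)
```

The Gram matrix is symmetric with ones on the diagonal, and smaller values of `sigma` make the kernel more local.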
The aim of classification is to find a hyperplane $f(x, w) = \langle w, \phi(x) \rangle - \rho = 0$ that separates the data from the origin with maximal margin. We use the parameter $\nu \in [0, 1]$, which is analogous to the parameter ν used in the ν-SVM [19]. It denotes the fraction of input data for which $\langle w, \phi(x_i) \rangle \le \rho$.
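To find the optimal parameters $w$ and $\rho$, the following quadratic program has to be solved:
$$\min_{w, \xi, \rho} \left( \frac{1}{2} \|w\|^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho \right)$$
subject to
$$\langle w, \phi(x_i) \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$
Slack variables $\xi_i$ are used to allow points to violate the margin constraints. Using multipliers $\alpha_i, \beta_i \ge 0$, we introduce a Lagrangian:
$$L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2} \|w\|^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho - \sum_{i=1}^{n} \alpha_i \left( \langle w, \phi(x_i) \rangle - \rho + \xi_i \right) - \sum_{i=1}^{n} \beta_i \xi_i.$$
It is shown in [19] that the dual problem is of the form
$$\min_{\alpha} \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)$$
subject to
$$0 \le \alpha_i \le \frac{1}{\nu n}, \quad \sum_{i=1}^{n} \alpha_i = 1.$$
The value of $\rho$ can be obtained as
$$\rho = \langle w, \phi(x_i) \rangle = \sum_{j=1}^{n} \alpha_j K(x_j, x_i)$$
for any $x_i$ with $0 < \alpha_i < 1/(\nu n)$. After substituting the obtained solution into the expression for the decision function $f$, we get
$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i K(x_i, x) - \rho \right). \tag{8}$$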
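How the decision function is evaluated once the dual variables are known can be sketched as follows. The sketch assumes the form $f(x) = \operatorname{sgn}(\sum_i \alpha_i K(x_i, x) - \rho)$ and a Gaussian kernel scaled by $\sigma^2$. The multipliers in the example are hypothetical values chosen by hand, not the output of a quadratic programming solver, and the function names are illustrative.

```python
import math


def gaussian_kernel(x, y, sigma):
    # Assumed scaling: exp(-||x - y||^2 / sigma^2).
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / sigma ** 2)


def offset(train, alpha, i, sigma):
    """rho recovered from a training point x_i assumed to lie on the hyperplane."""
    return sum(a * gaussian_kernel(xj, train[i], sigma) for a, xj in zip(alpha, train))


def decision(x, train, alpha, rho, sigma):
    """Return +1 (normal) or -1 (novel) from the sign of sum_i alpha_i K(x_i, x) - rho."""
    value = sum(a * gaussian_kernel(xi, x, sigma) for a, xi in zip(alpha, train)) - rho
    return 1 if value >= 0 else -1


# Hypothetical multipliers: nonnegative and summing to one, but NOT an actual QP solution.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
alpha = [0.5, 0.25, 0.25]
rho = offset(train, alpha, 1, sigma=1.0)

near = decision((0.2, 0.1), train, alpha, rho, sigma=1.0)
far = decision((5.0, 5.0), train, alpha, rho, sigma=1.0)
```

A point far from all training points receives a kernel sum close to zero, so its decision value is close to $-\rho$ and it is labelled as novel.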

Robust Models and Fuzzy Expected Losses
3.1. Robust Models and the Expected Risk. Robust models have been exploited in classification problems because they make it possible to avoid some strong assumptions underlying the standard classification models. As pointed out by Xu et al. [20], the use of robust optimization in classification is not new. There are many published results providing various robust classification and regression models (see, e.g., [21-24]) in which box-type uncertainty sets are considered. One of the popular robust classification models is based on the assumption that inputs are subject to additive noise, and every data point is only known to belong to the interior of a Euclidean ball. Another class of robust models is based on relaxing strong assumptions about the probability distribution of data points (see, e.g., [25]).
We consider a model which can be partially regarded as a special case of these models and is based on the framework of ε-contaminated (robust) models [13]. They are constructed by eliciting a Bayesian prior distribution $p = (p_1, \ldots, p_n)$ as an estimate of the true prior distribution. The ε-contaminated model is a class of probabilities which for fixed $\varepsilon \in (0, 1)$ and $p_i$ is the set
$$M(\varepsilon) = \{ (1 - \varepsilon) p_i + \varepsilon q_i,\ i = 1, \ldots, n \},$$
where $q_i$ is arbitrary and $q_1 + \cdots + q_n = 1$. The rate ε reflects the amount of uncertainty in $p$ [26]. In other words, we take an arbitrary probability distribution $q = (q_1, \ldots, q_n)$ from the unit simplex denoted by $S(1, n)$. According to these models, for $0 < \varepsilon < 1$, $M(\varepsilon)$ is the set of all probabilities with the lower bound $(1 - \varepsilon) p_i$ and the upper bound $(1 - \varepsilon) p_i + \varepsilon$. Of course, the assumption that $q$ is restricted only by the unit simplex $S(1, n)$ gives just one of the possible types of ε-contaminated models. Generally, there are many different assumptions which produce specific robust models.
Let us rewrite the problem in a general form of minimizing the expected risk [6]
$$R(w, \rho) = \int_X L(w, \phi(x))\, \mathrm{d}F_0(x).$$
Here the loss function $L(w, \phi(x))$ is the hinge loss function which is represented as
$$L(w, \phi(x)) = \max\{0, \rho - \langle w, \phi(x) \rangle\} - \rho\nu. \tag{10}$$
The standard SVM technique is to assume that $F_0$ is the empirical (nonparametric) probability distribution, whose use leads to the empirical expected risk
$$R_{\mathrm{emp}}(w, \rho) = \frac{1}{n} \sum_{i=1}^{n} L(w, \phi(x_i)).$$
The assumption of the empirical probability distribution means that every point $x_i$ has the probability $p_i = 1/n$. This assumption is too strong when the number of points is not large, and its validity is doubtful in this case. Therefore, in order to relax the strong condition on the probabilities of points, we apply the ε-contaminated model. According to the model, we replace the probability distribution $p = (1/n, \ldots, 1/n)$ by the set of probability distributions $M(\varepsilon) = \{ (1 - \varepsilon) n^{-1} + \varepsilon q_i \}$. In other words, there is an unknown precise "true" probability distribution, and we only know that it belongs to the set $M(\varepsilon)$.
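The construction of $M(\varepsilon)$ around the empirical distribution can be illustrated with a short sketch. The function names and the numbers in the example are hypothetical. The sketch only shows that every distribution $h_i = (1 - \varepsilon) p_i + \varepsilon q_i$ is again a probability distribution and stays between the bounds $(1 - \varepsilon) p_i$ and $(1 - \varepsilon) p_i + \varepsilon$.

```python
def contaminated_distribution(p, q, eps):
    """Return h_i = (1 - eps) * p_i + eps * q_i, one element of M(eps)."""
    return [(1.0 - eps) * pi + eps * qi for pi, qi in zip(p, q)]


def probability_bounds(p, eps):
    """Lower and upper bounds (1 - eps) * p_i and (1 - eps) * p_i + eps for every h_i."""
    return [((1.0 - eps) * pi, (1.0 - eps) * pi + eps) for pi in p]


n = 4
p = [1.0 / n] * n           # empirical distribution, p_i = 1/n
q = [0.0, 0.0, 1.0, 0.0]    # one arbitrary contaminating distribution (a simplex vertex)
h = contaminated_distribution(p, q, eps=0.2)
bounds = probability_bounds(p, eps=0.2)
```

Choosing `q` as a vertex of the simplex pushes one probability to its upper bound and all others to their lower bounds.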

The Fuzzy Expected Risk.
Let $h = (h_1, \ldots, h_n)$ be a probability distribution which belongs to the set $M(\varepsilon)$. Since the set $M(\varepsilon)$ is convex, the values of the expected risk for every value of ε are restricted by some lower and upper bounds such that every point of the interval of expected risk corresponds to a probability distribution from $M(\varepsilon)$. Note that $M(\varepsilon_1) \subseteq M(\varepsilon_2)$ for $\varepsilon_1 \le \varepsilon_2$. This implies that there holds
$$[R_L(w, \rho, \varepsilon_1); R_U(w, \rho, \varepsilon_1)] \subseteq [R_L(w, \rho, \varepsilon_2); R_U(w, \rho, \varepsilon_2)].$$
If we consider all possible values of the parameter ε, then we get the set of nested intervals $[R_L(w, \rho, \varepsilon); R_U(w, \rho, \varepsilon)]$ with the parameter ε. Moreover, the expected risk interval is reduced to a point for ε = 0, and the largest interval is obtained for ε = 1. This set of intervals can be viewed as a fuzzy set $R(w, \rho)$ of the expected risk with the membership function μ. A similar idea of obtaining fuzzy sets by means of the contamination robust model has been mentioned by Utkin and Zhuk [27].
The upper bound for the expected risk for fixed ε can be found as a solution to the following programming problem:
$$R_U(w, \rho, \varepsilon) = \max_{q \in S(1, n)} \sum_{i=1}^{n} \left( \frac{1 - \varepsilon}{n} + \varepsilon q_i \right) L(w, \phi(x_i)).$$
The obtained optimization problem is linear with optimization variables $q_1, \ldots, q_n$, but the objective function depends on $w$. Therefore, it cannot be directly solved by well-known methods. To overcome this difficulty, note that all points $q$ belong to the simplex $S(1, n)$ in a finite-dimensional space. According to general results of linear programming theory, an optimal solution to the above problem is achieved at one of the extreme points of the simplex, and the number of its extreme points is $n$. The extreme points of the simplex $S(1, n)$ are of the form
$$(1, 0, \ldots, 0),\ (0, 1, \ldots, 0),\ \ldots,\ (0, 0, \ldots, 1).$$
This implies that there holds
$$R_U(w, \rho, \varepsilon) = \frac{1 - \varepsilon}{n} \sum_{i=1}^{n} L(w, \phi(x_i)) + \varepsilon \max_{i = 1, \ldots, n} L(w, \phi(x_i)).$$
The same can be written for the lower bound:
$$R_L(w, \rho, \varepsilon) = \frac{1 - \varepsilon}{n} \sum_{i=1}^{n} L(w, \phi(x_i)) + \varepsilon \min_{i = 1, \ldots, n} L(w, \phi(x_i)).$$
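The reduction to the extreme points of the simplex can be checked numerically. The following sketch compares the closed-form bounds, written as $(1 - \varepsilon)$ times the mean loss plus $\varepsilon$ times the largest or smallest loss, with a direct enumeration of the $n$ vertices. The loss values are hypothetical numbers standing in for $L(w, \phi(x_i))$ at some fixed $w$, $\rho$, and the function names are illustrative.

```python
def risk_bounds(losses, eps):
    """Closed-form lower and upper expected risk over M(eps) with p_i = 1/n."""
    mean_loss = sum(losses) / len(losses)
    lower = (1.0 - eps) * mean_loss + eps * min(losses)
    upper = (1.0 - eps) * mean_loss + eps * max(losses)
    return lower, upper


def risk_at_vertex(losses, eps, j):
    """Expected risk when the contaminating distribution q is the j-th simplex vertex."""
    n = len(losses)
    return sum(((1.0 - eps) / n + (eps if i == j else 0.0)) * loss
               for i, loss in enumerate(losses))


losses = [0.0, 0.3, 1.2, 0.1]   # hypothetical values of L(w, phi(x_i))
lower, upper = risk_bounds(losses, eps=0.25)
vertex_risks = [risk_at_vertex(losses, 0.25, j) for j in range(len(losses))]
```

Evaluating `risk_bounds` for increasing values of `eps` also shows the nesting of the intervals: the interval collapses to the empirical risk at `eps = 0` and widens as `eps` grows.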

The Fuzzy Decision Problem and a Way for Its Solving.
Now we can write a new criterion of decision making about the optimal parameters $w$, $\rho$. The parameters $w_0$, $\rho_0$ are optimal iff for all $w$, $\rho$, there holds $R(w_0, \rho_0) \le R(w, \rho)$. The next question is how to compare the fuzzy sets. This is one of the most controversial questions in the fuzzy literature. Most ranking methods are based on transforming a fuzzy set into a real number called the ranking index. Here we use the index proposed in [28], which can be written in terms of the considered decision problem as
$$F(R(w, \rho)) = \int_0^1 \left( \eta R_L(w, \rho, \varepsilon) + (1 - \eta) R_U(w, \rho, \varepsilon) \right) \mathrm{d}\mu.$$
Here $\eta \in [0, 1]$ is a parameter of pessimism. It can be regarded as the caution parameter proposed in [29]. The caution parameter reflects the degree of ambiguity aversion. The more ambiguity averse the decision maker is, the higher is the influence of the lower interval limit of generalized expected utility. η = 1 corresponds to strict ambiguity aversion; η = 0 expresses maximal ambiguity seeking attitudes. It is assumed here that the variable μ is a function of ε, for example, μ = 1 − ε.
Then we write a new criterion of decision making. Parameters $w_0$, $\rho_0$ are optimal iff for all $w$, $\rho$, there holds
$$F(R(w_0, \rho_0)) \le F(R(w, \rho)).$$
It follows from the above that the optimal parameters $w$, $\rho$ can be obtained by solving the following optimization problem:
$$\min_{w, \rho} \int_0^1 \left( \eta R_L(w, \rho, \varepsilon) + (1 - \eta) R_U(w, \rho, \varepsilon) \right) \mathrm{d}\mu.$$
Denote $A = \int_0^1 \varepsilon\, \mathrm{d}\mu$. The optimal expected risk is now of the form
$$R(w_0, \rho_0) = \min_{w, \rho} R(w, \rho),$$
where
$$R(w, \rho) = \frac{1 - A}{n} \sum_{i=1}^{n} L(w, \phi(x_i)) + A \left( \eta \min_{i = 1, \ldots, n} L(w, \phi(x_i)) + (1 - \eta) \max_{i = 1, \ldots, n} L(w, \phi(x_i)) \right). \tag{22}$$
We can see that the objective function $R(w, \rho)$ consists of two main parts. The first part is the modified empirical expected risk. The second part can be regarded as the Hurwicz criterion with optimism parameter 1 − η, which is exploited in decision problems when we do not have information about states of nature. If $A$ takes values from 0 to 1, then the objective function is a convex combination of the Hurwicz criterion and the expected utility under the condition that the probabilities of states of nature are identical and equal to $1/n$. Moreover, $R(w, \rho)$ can also be regarded as the objective function with respect to the Hodges-Lehmann criterion [30] for η = 0. This means that we are not sure that the probability of every point is $1/n$, and this disbelief is compensated by a more guaranteed Hurwicz approach with some coefficient depending on the function μ(ε). It is interesting that the fuzzy classification problem has been transformed to the standard decision problem comprising several decision criteria.
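The passage from the ranking index to a convex combination can be checked numerically. The sketch below rests on two assumptions that should be read as such: the index is taken as the integral over μ of $\eta R_L + (1 - \eta) R_U$, and the membership function is μ = 1 − ε, which gives $A = 1/2$. The loss values are hypothetical, and the function names are illustrative.

```python
def interval_bounds(losses, eps):
    """Bounds R_L and R_U of the expected risk for a fixed contamination level eps."""
    mean_loss = sum(losses) / len(losses)
    return ((1.0 - eps) * mean_loss + eps * min(losses),
            (1.0 - eps) * mean_loss + eps * max(losses))


def ranking_index(losses, eta, steps=1000):
    """Midpoint-rule integral over mu of eta * R_L + (1 - eta) * R_U, with eps = 1 - mu."""
    h = 1.0 / steps
    total = 0.0
    for j in range(steps):
        mu = (j + 0.5) * h
        lower, upper = interval_bounds(losses, 1.0 - mu)
        total += (eta * lower + (1.0 - eta) * upper) * h
    return total


def closed_form(losses, A, eta):
    """(1 - A) * mean loss + A * (eta * min loss + (1 - eta) * max loss)."""
    mean_loss = sum(losses) / len(losses)
    hurwicz = eta * min(losses) + (1.0 - eta) * max(losses)
    return (1.0 - A) * mean_loss + A * hurwicz


losses = [0.0, 0.3, 1.2, 0.1]   # hypothetical values of L(w, phi(x_i))
numeric = ranking_index(losses, eta=0.3)
analytic = closed_form(losses, A=0.5, eta=0.3)
```

With `A = 0` the closed form reduces to the empirical risk, while `A = 1` leaves only the Hurwicz-type combination of the smallest and largest losses.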
Before solving the optimization problem for computing the parameters $w$, $\rho$, we have to define the function μ(ε) which determines the membership function of the fuzzy expected losses. As we have pointed out, the simplest way is to assume that μ = 1 − ε. Then we get
$$A = \int_0^1 (1 - \mu)\, \mathrm{d}\mu = \frac{1}{2}.$$
However, this way does not take into account the possible dependence of the fuzzy set (its fuzziness) on the number of points in the training set. We assume that the fuzziness has to increase as $n$ decreases.
Let us consider the meaning of $A$ in the objective function (22). If $A = 1$, then the optimal parameters are defined only by the two single points maximizing and minimizing $L(w, \phi(x_i))$. This is an extreme case in which we suppose that the empirical probability distribution is totally wrong. It takes place when we have a small number of points in the training set. Another extreme case is $A = 0$. This case corresponds to the standard approach in classification based on the empirical expected risk. It was shown by Vapnik [6] that the second case can be applied for a large number of observations. Hence, we can state that $A \to 0$ as $n \to \infty$ and $A \to 1$ for small values of $n$.

One of the possible functions satisfying the above conditions for
The next task is to minimize the function $R(w, \rho)$ over the parameters $w$ and $\rho$. This task will be solved in the framework of the SVM.

The SVM Approach
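In order to apply the SVM approach to the fuzzy classification problem, we use the hinge loss function (10). Let us add the standard Tikhonov regularization term $(1/2) \langle w, w \rangle$ (this is the most popular penalty or smoothness term) [31] to the objective function (22). The smoothness (Tikhonov) term can be regarded as a constraint which enforces uniqueness by penalizing functions with wild oscillation and effectively restricting the space of admissible solutions (we refer to [32] for a detailed analysis of regularization methods). Moreover, we introduce the following optimization variables:
$$\xi_i = \max\{0, \rho - \langle w, \phi(x_i) \rangle\},\ i = 1, \ldots, n, \qquad G = \max_{i = 1, \ldots, n} \xi_i.$$
This leads to the quadratic programming problem
$$\min_{w, \rho, \xi, G} \left( \frac{1}{2} \langle w, w \rangle - \rho\nu + \frac{1 - A}{n} \sum_{i=1}^{n} \xi_i + A \left( \eta \min_{i = 1, \ldots, n} \xi_i + (1 - \eta) G \right) \right)$$
subject to
$$\xi_i \ge \rho - \langle w, \phi(x_i) \rangle, \quad \xi_i \ge 0, \quad G \ge \xi_i, \quad i = 1, \ldots, n.$$
If we suppose that $\min_{i = 1, \ldots, n} \xi_i = \xi_k$, then the objective function can be decomposed into $n$ objective functions of the form
$$R_k(w, \rho) = \frac{1}{2} \langle w, w \rangle - \rho\nu + \frac{1 - A}{n} \sum_{i=1}^{n} \xi_i + A \left( \eta \xi_k + (1 - \eta) G \right), \quad k = 1, \ldots, n.$$
The optimal parameters $w_{\mathrm{opt}}$ and $\rho_{\mathrm{opt}}$ correspond to the smallest value of $R_k$, $k = 1, \ldots, n$.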
Instead of minimizing the primary objective function, a dual objective function, the so-called Lagrangian, can be formed, whose saddle point is the optimum. Moreover, if we minimize the primary objective function, the dual objective function denoted by $L_k$ has to be maximized. The Lagrangian is Here $\zeta_i$, $\psi_i$, $\varphi_i$, $i = 1, \ldots, n$, are Lagrange multipliers. Hence, the dual variables have to satisfy the positivity constraints $\zeta_i \ge 0$, $\varphi_i \ge 0$, $\psi_i \ge 0$ for all $i = 1, \ldots, n$. The saddle point can be found by setting the derivatives equal to zero: Here $1_k(i)$ is the indicator function taking the value 1 if $i = k$. Using (30)-(33), we get the following dual optimization problem: The function $f(x)$ can be rewritten in terms of Lagrange multipliers as Hence, we find the optimal value of $\rho$ by taking $f(x, w) = 0$; that is, there holds Let us write the Karush-Kuhn-Tucker complementarity conditions: It follows from the second condition that $G = \xi_i$ for a single value of $i = l$ such that $l = \arg\max_{i = 1, \ldots, n} \xi_i$. Here we assume that $\xi_1, \ldots, \xi_n$ do not coincide. Therefore, $\eta_i = 0$ for all $i \ne l$. Returning to the constraints (35)-(36), we get It follows from the previous constraints that the optimization problem (34)-(36) can be decomposed into $n$ problems: subject to (40) and $\sum_{i=1}^{n} \varphi_i = \nu$. The optimal values of $\varphi_i$, $i = 1, \ldots, n$, correspond to the smallest value of the objective functions $L_{k,l}$, $l = 1, \ldots, n$, and to the largest value of the objective functions $L_{k,l}$, $k = 1, \ldots, n$.
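The structure of the dual problems can be sketched as follows. The sketch is a derivation under stated assumptions and not a reproduction of the displays of the paper: it assumes the primal objective $R_k = \frac{1}{2}\langle w, w \rangle - \rho\nu + \frac{1 - A}{n} \sum_i \xi_i + A\eta\xi_k + A(1 - \eta)G$ with the constraints $\xi_i \ge \rho - \langle w, \phi(x_i) \rangle$, $\xi_i \ge 0$, $G \ge \xi_i$, and it assigns the multipliers $\varphi_i$, $\psi_i$, $\zeta_i$ to these three groups of constraints in that order, which is an assumption about the notation. Equation numbers are deliberately omitted.

```latex
\begin{align*}
L_k ={}& \tfrac{1}{2}\langle w, w\rangle - \rho\nu
        + \frac{1-A}{n}\sum_{i=1}^{n}\xi_i + A\eta\,\xi_k + A(1-\eta)\,G \\
      & - \sum_{i=1}^{n}\varphi_i\bigl(\xi_i - \rho + \langle w, \phi(x_i)\rangle\bigr)
        - \sum_{i=1}^{n}\psi_i\,\xi_i
        - \sum_{i=1}^{n}\zeta_i\,(G - \xi_i), \\[4pt]
\frac{\partial L_k}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{i=1}^{n}\varphi_i\,\phi(x_i), \qquad
\frac{\partial L_k}{\partial \rho} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\varphi_i = \nu, \\
\frac{\partial L_k}{\partial \xi_i} = 0 \;&\Rightarrow\;
  \varphi_i + \psi_i = \frac{1-A}{n} + A\eta\,1_k(i) + \zeta_i, \qquad
\frac{\partial L_k}{\partial G} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\zeta_i = A(1-\eta), \\[4pt]
\max_{\varphi,\,\zeta}\;& -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\varphi_i\varphi_j K(x_i, x_j)
\quad\text{s.t.}\quad
  \sum_{i=1}^{n}\varphi_i = \nu,\;\;
  0 \le \varphi_i \le \frac{1-A}{n} + A\eta\,1_k(i) + \zeta_i, \\
\text{if } \zeta_i = 0 \text{ for } i \ne l:\;&
  0 \le \varphi_i \le \frac{1-A}{n} + A\eta\,1_k(i) + A(1-\eta)\,1_l(i).
\end{align*}
```

Under these assumptions, each pair $(k, l)$ gives a one-class SVM dual in which every point carries its own upper bound on $\varphi_i$, which is in line with the reduction to SVM tasks with weighted data points.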
So, we have $n^2$ simple quadratic programming problems whose solutions can be obtained by means of well-known methods and tools.
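The enumeration of the $n^2$ subproblems can be sketched in a few lines. The sketch assumes, as a derived and not source-confirmed form, that the subproblem indexed by $(k, l)$ differs from the others only in the upper bounds $(1 - A)/n + A\eta\, 1_k(i) + A(1 - \eta)\, 1_l(i)$ imposed on the dual variables $\varphi_i$. The function names are hypothetical, and no quadratic programming solver is called.

```python
def weight_bounds(n, A, eta, k, l):
    """Assumed upper bounds on the dual variables for the subproblem (k, l)."""
    base = (1.0 - A) / n
    return [base
            + (A * eta if i == k else 0.0)
            + (A * (1.0 - eta) if i == l else 0.0)
            for i in range(n)]


def all_weight_bounds(n, A, eta):
    """Bounds for all n * n subproblems, keyed by the pair (k, l)."""
    return {(k, l): weight_bounds(n, A, eta, k, l)
            for k in range(n) for l in range(n)}


problems = all_weight_bounds(n=4, A=0.5, eta=0.5)
```

Each vector of bounds sums to one, so the constraint that the dual variables sum to ν stays feasible for every ν in [0, 1]. A weighted SVM solver would then be run once per pair, and the results compared as described above.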

Experiments
We illustrate the method proposed in this paper via several examples; all computations have been performed using the statistical software R. We investigate the performance of the proposed method and compare it with the standard SVM approach by considering the accuracy (ACC), which is the proportion of correctly classified cases in a sample of data and is often used to quantify the predictive performance of classification methods. ACC is an estimate of a classifier's probability of a correct response, and it is an important statistical measure of the performance of a one-class classification test. In novelty detection, ACC combines two accuracy measures: the normal accuracy rate, which measures how well the algorithm recognizes new examples of the known class, and the novelty accuracy rate, which does the same for examples of an unknown novel class. ACC can formally be written as
$$\mathrm{ACC} = \frac{1}{N} \sum_{i=1}^{N} I\left( f(x_i) = y_i \right),$$
where $N$ is the number of test examples, $I$ is the indicator function, and $y_i$ is the label of the $i$th test example $x_i$. We will denote the accuracy measure for the proposed model as $\mathrm{ACC}_{\mathrm{fuzzy}}$ and for the standard SVM as $\mathrm{ACC}_{\mathrm{st}}$.
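The accuracy measure and its two components can be computed as in the following sketch. The labels and predictions are hypothetical, with +1 denoting normal examples and −1 denoting novel ones, and the function names are illustrative.

```python
def accuracy(predictions, labels):
    """Proportion of test examples whose predicted label equals the true label."""
    correct = sum(1 for f, y in zip(predictions, labels) if f == y)
    return correct / len(labels)


def class_rate(predictions, labels, target):
    """Proportion of correctly recognized examples among those with true label `target`."""
    indices = [i for i, y in enumerate(labels) if y == target]
    return sum(1 for i in indices if predictions[i] == target) / len(indices)


labels = [1, 1, 1, -1, -1]          # hypothetical true labels
predictions = [1, 1, -1, -1, 1]     # hypothetical classifier outputs
acc = accuracy(predictions, labels)
normal_rate = class_rate(predictions, labels, 1)
novelty_rate = class_rate(predictions, labels, -1)
```

The overall accuracy equals the average of the two rates weighted by the class sizes in the test sample.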
All the experiments use a standard Gaussian radial basis function (GRBF) kernel with the kernel parameter σ = 0.04. Different values for the parameter σ have been tested, choosing those leading to the best results.
We consider the performance of our method with synthetic data having two features $x_1$ and $x_2$. The training set, consisting of two subsets, is generated in accordance with the normal probability distributions such that N Figure 1 illustrates how the contours $f(x, w) = 0$ depend on the parameter η for the fuzzy model (thick curve) and for the standard SVM (dashed curve) for a very small number of points in the training set (n = 10). The parameter η takes the values 0, 0.5, 1. One can see from the pictures that the region bounded by the "fuzzy" contour decreases with η. The value η = 0 provides the cautious strategy of decision making. It can be seen from the first picture that the region bounded by the "fuzzy" contour is larger than the same region corresponding to the standard model. The regions almost coincide for the optimistic strategy when η = 1 (see Figure 1(c)). Figures 2 and 3 illustrate similar dependencies for n = 20 and n = 40, respectively.
The accuracy measures for different conditions are shown in Table 1. It can be seen from the results given in the table that the accuracy measure of the fuzzy model is larger than that of the standard SVM. Of course, this relation holds when n is rather small and the contamination of the normal data has a large ratio (0.3 in the examples).
As a further example, we applied all analyzed models to the well-known "Iris" data set from the UCI Machine Learning Repository [33]. The data set contains 3 classes (Iris Setosa, Iris Versicolour, Iris Virginica) of 50 instances each. The number of features is 4 (sepal length in cm, sepal width in cm, petal length in cm, petal width in cm). We suppose that examples from the Iris Setosa class are abnormal. For the experiment, we randomly select n points such that $(1 - \varepsilon_0) n$ points are taken from the set of positively labelled examples and $\varepsilon_0 n$ points are taken from the negatively labelled examples. Here $\varepsilon_0 = 50/150 \approx 0.333$. The parameters for modelling are ν = 0.333, σ = 0.001. We investigate how the accuracy measures depend on the amount n of training data. In particular, if we take n = 50, then $\mathrm{ACC}_{\mathrm{fuzzy}} = 0.834$, $\mathrm{ACC}_{\mathrm{st}} = 0.773$. If we take n = 80, then $\mathrm{ACC}_{\mathrm{fuzzy}} = 0.833$, $\mathrm{ACC}_{\mathrm{st}} = 0.866$. One can see that the fuzzy approach provides better numerical results in comparison with the standard approach when the number of examples n is rather small.
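The sampling scheme of this experiment can be sketched as follows. The sketch uses integer identifiers as placeholders for the 100 normal and 50 abnormal examples instead of the actual measurements. Rounding $\varepsilon_0 n$ to the nearest integer, the fixed random seed, and the function name are assumptions made for the illustration.

```python
import random


def sample_training_set(normal, abnormal, n, eps0, rng):
    """Draw n labelled points, about eps0 * n of them from the abnormal examples."""
    n_abnormal = round(eps0 * n)        # rounding rule is an assumption
    n_normal = n - n_abnormal
    sample = [(x, 1) for x in rng.sample(normal, n_normal)]
    sample += [(x, -1) for x in rng.sample(abnormal, n_abnormal)]
    rng.shuffle(sample)
    return sample


rng = random.Random(0)                  # fixed seed only for reproducibility
normal_ids = list(range(100))           # placeholders for the two normal classes
abnormal_ids = list(range(100, 150))    # placeholders for the Iris Setosa class
train = sample_training_set(normal_ids, abnormal_ids, n=50, eps0=50 / 150, rng=rng)
```

Repeating the draw with different seeds and averaging the resulting accuracies would reduce the dependence of the comparison on a single random split.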

Conclusion
In this paper, a fuzzy one-class classification model has been proposed, which is based on applying the ε-contaminated model. The algorithm for computing the optimal parameters of classification is reduced to a finite number of standard SVM tasks with weighted data points, where the weights are assigned in accordance with a predefined rule derived from the comparison of fuzzy expected risk measures. It can easily be implemented with standard functions of the statistical software package R.

Advances in Fuzzy Systems
The reported experimental results with synthetic data and with the well-known "Iris" data set from the UCI Machine Learning Repository have illustrated that the proposed fuzzy model outperforms the standard approach for a small number of observations. Due to the proposed fuzzy model, we do not need to assign a certain value to the contamination parameter ε. However, we have to define the function μ(ε) and the parameter η.
It should be noted that the proposed model can easily be extended to the case of binary or multiclass classification. This is a direction for future work.
We have investigated only one fuzzy ranking index for comparing fuzzy numbers. However, there are many efficient indices whose application to classification problems in the framework of the proposed model could give better classification accuracy and better models. This is another direction for future work.

Table 1: Accuracy measures for different n and ν.