A Cost-Sensitive Sparse Representation Based Classification for Class-Imbalance Problem

Sparse representation has been successfully used in pattern recognition and machine learning. However, most existing sparse representation based classification (SRC) methods aim to achieve the highest classification accuracy, assuming the same loss for every type of misclassification. This assumption may not hold in many practical applications, as different types of misclassification can lead to different losses. Moreover, in real-world applications, many data sets have imbalanced class distributions. To address these problems, we propose a cost-sensitive sparse representation based classification (CSSRC) method for the class-imbalance problem, built on probabilistic modeling. Unlike traditional SRC methods, we predict the class label of a test sample by minimizing the misclassification losses, which are obtained by computing the posterior probabilities. Experimental results on the UCI databases validate the efficacy of the proposed approach in terms of average misclassification cost, positive class misclassification rate, and negative class misclassification rate. In addition, we sample test samples and training samples with different imbalance ratios and use F-measure, G-mean, classification accuracy, and running time to evaluate the performance of the proposed method. The experiments show that the proposed method performs competitively compared with SRC, CSSVM, and CS4VM.


Introduction
As a powerful tool for statistical signal modeling, sparse representation (or sparse coding) has been successfully used in pattern recognition [1], for example in texture classification [2] and face recognition [3,4], over the past few years. In [3], Wright et al. proposed the sparse representation based classification (SRC) method for face recognition under varying illumination and occlusion. SRC represents an input test image as a sparse linear combination of training images and assigns the test image to the class whose training samples best reconstruct it. In their work, they used an $\ell_1$-regularizer rather than an $\ell_0$-regularizer to regularize the objective function and then calculated the residuals between the original test sample and its reconstructions to identify the label of the query image. This sparse representation based classification framework has achieved great success in face recognition and has boosted research on sparsity-related machine learning methods.
Traditional classification algorithms [5], including SRC, are designed to achieve the lowest recognition error and assume the same loss for different types of misclassification. However, this assumption may not be suitable for many real-world applications. For example, in a face recognition system that controls access to a room, misclassifying a gallery subject as an impostor and denying entry causes inconvenience, whereas misclassifying an impostor as a gallery subject and granting entry may result in a serious loss. In such settings, the loss of misclassification should be taken into consideration, and "cost" information can be introduced to measure the severity of misclassification. In recent years, many cost-sensitive methods have been proposed. Typical works include the Cost-Sensitive Semisupervised Support Vector Machine (CS4VM) and Cost-Sensitive Laplacian Support Vector Machines (CSLSVM) proposed by Zhou et al. [6,7], a cost-sensitive Naïve Bayes method proposed by Fang et al. from the novel perspective of inferring the order relation [8], and a cost-sensitive approach proposed by Castro and Braga to improve the performance of the multilayer perceptron [9]. In [10], an instance weighting method was incorporated into various Bayesian network classifiers; it modifies their probability estimation and thereby makes them cost-sensitive. In [11], Lo et al. presented a basis expansions model for the cost-sensitive multilabel classification problem, where a basis function is an LP classifier trained on a random $k$-label set. In [12], Wan et al. proposed a cost-sensitive feature selection method for face recognition called Discriminative Cost-Sensitive Laplacian Score (DCSLS), which incorporates the idea of local discriminant analysis into the Laplacian Score.
In most applications, cost-sensitive learning coexists with class imbalance, and the goal is to minimize the total misclassification cost [13]. Class imbalance is considered one of the most challenging problems in machine learning and data mining. The imbalance ratio (the size of the majority class relative to the minority class) can be as large as 100, or even 10000. Much work has been done to address the class-imbalance problem, and cost-sensitive learning is an effective way to deal with imbalanced data classification. In recent years, cost-sensitive learning has been studied widely and has become one of the most important topics for solving the class-imbalance problem. In [14], Zhou and Liu studied empirically the effect of sampling and threshold-moving in training cost-sensitive neural networks and found that threshold-moving and soft-ensemble are relatively good choices. Other cost-sensitive learning methods improve existing methods. In [15], Sun et al. proposed cost-sensitive boosting algorithms, which are developed by introducing cost items into the learning framework of AdaBoost. Another strategy for the class-imbalance problem is to alter the distribution of the data set. In [16], Jiang et al. proposed a Minority Cloning Technique (MCT) for class-imbalanced cost-sensitive learning. MCT alters the class distribution of the training data by cloning each minority class instance according to the similarity between it and the mode of the minority class. Generally, users focus more on the minority class and consider misclassifying a minority class sample to be more expensive. In our study, we adopt the same strategy.
In [17], a probabilistic cost-sensitive classifier was proposed for face recognition; it uses a probabilistic model to estimate the posterior probability of a test sample and calculates all the misclassification losses from the posterior probabilities. Motivated by this probabilistic model and by probabilistic subspace clustering [17][18][19], we propose a new method to handle misclassification cost. In sparse representation, a training sample with a larger coefficient plays a more important role in reconstruction [20]. In the extreme case, the coefficient is 1 when the dictionary contains a sample identical to the query sample. This is analogous to a Gaussian distribution, under which a sample close to the mean vector has a higher probability. Inspired by the probabilistic model, we use the coefficient vector to calculate the posterior probabilities, rather than the distribution of the noise (residual) as in [17], which must first be estimated. The main advantage of our method is reduced computational complexity and cost, and its main contribution is obtaining the posterior probability from the coefficient vector of the sparse representation. After calculating all the misclassification losses from the posterior probabilities, the test sample is assigned to the class whose loss is minimal. Experimental results on UCI databases validate the effectiveness and efficiency of our method.
This paper is organized as follows. Section 2 outlines the related methods. Section 3 presents the details of the proposed algorithm. Section 4 reports the experiments. Finally, Section 5 concludes the paper and offers suggestions for future research.

Related Works
In this section, we briefly introduce some related works, including sparse representation based classification and the cost-sensitive learning framework.

Sparse Representation Based Classification.
Sparse representation based classification is a typical method in machine learning [3,21,22]. It uses labeled training samples from $c$ distinct object classes to learn a dictionary and to determine the label of an unseen test sample. We denote the data set with $n_i$ training samples from the $i$th class as a matrix

$$A_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,n_i}] \in \mathbb{R}^{m \times n_i},$$

and collect the training samples of all classes in $A = [A_1, A_2, \ldots, A_c] \in \mathbb{R}^{m \times n}$. Many distance-based methods are not robust in real-world applications because of various occlusions. To overcome this limitation, Wright et al. introduced the sparse representation based classification method to represent the query image. The linear representation of a test sample $y$ can then be written in terms of all training samples as

$$y = A x_0 \in \mathbb{R}^m,$$

where $x_0 = [0, \ldots, 0, \alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}, 0, \ldots, 0]^T \in \mathbb{R}^n$ is a coefficient vector whose entries are zero except those associated with the $i$th class. This motivates us to seek the sparsest solution by solving the following optimization problem:

$$\hat{x}_0 = \arg\min_x \|x\|_0 \quad \text{subject to} \quad Ax = y,$$

where $\|\cdot\|_0$ denotes the $\ell_0$-norm, which counts the number of nonzero entries in a vector. However, finding the sparsest solution (the $\ell_0$-norm minimization problem) is nonconvex and NP-hard. Generally, if the solution sought is sparse enough, the solution of the $\ell_0$-minimization problem is equal to the solution of the following $\ell_1$-minimization problem [4,22,23]:

$$\hat{x}_1 = \arg\min_x \|x\|_1 \quad \text{subject to} \quad Ax = y.$$

Real data are noisy, so the training samples may not represent the test sample exactly. To deal with the noise, Wright et al. extended the $\ell_1$-norm minimization problem to the following formulation:

$$y = A x_0 + z,$$

where $z$ is a noise term with bounded energy $\|z\|_2 < \varepsilon$.
The sparse solution $x_0$ can still be obtained by solving the following stable $\ell_1$-minimization problem:

$$\hat{x} = \arg\min_x \|x\|_1 \quad \text{subject to} \quad \|Ax - y\|_2 \le \varepsilon.$$

To better harness this linear structure, they classify $y$ based on how well the coefficients associated with the training samples of each class reproduce $y$. Let $\hat{x}$ be the solution of the stable $\ell_1$-minimization problem above. For each class $i$, let $\delta_i$ be the characteristic function that selects the coefficients associated with the $i$th class, so that $\delta_i(\hat{x}) = [0, \ldots, \alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}, \ldots, 0]^T$. Using these coefficients, one can approximate the given test sample $y$ as $\hat{y}_i = A\,\delta_i(\hat{x})$. They then compute the residual (Euclidean distance) $r_i(y)$ between $y$ and $\hat{y}_i$:

$$r_i(y) = \|y - A\,\delta_i(\hat{x})\|_2.$$

The label of the test sample $y$ is identified by minimizing $r_i(y)$:

$$\operatorname{identity}(y) = \arg\min_i r_i(y).$$

Here, for ease of understanding, we still preserve the original formulation. We can construct a multiclass cost matrix $C$, where $C_{ij}$ indicates the cost of misclassifying a sample of the $i$th class as the $j$th class. The diagonal elements of $C$ are all zero since there is no loss for correct recognition.
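Returning to the SRC rule (sparse coding followed by class-wise residuals), the following is a minimal Python sketch rather than the implementation used in the paper. It makes two assumptions that are not in the text: the constrained $\ell_1$ problem is swapped for its Lagrangian (lasso) form, solved with plain iterative soft thresholding (ISTA), and the names `ista_l1`, `src_classify`, and `lam` as well as the toy data are hypothetical.

```python
import numpy as np

def ista_l1(A, y, lam=0.01, n_iter=2000):
    """Approximately minimize 0.5*||Ax - y||_2^2 + lam*||x||_1 with ISTA.

    Assumption: this Lagrangian form stands in for the constrained
    l1-minimization problem described in the text.
    """
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))                          # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)    # soft threshold
    return x

def src_classify(A, labels, y, lam=0.01):
    """Return the class with the smallest reconstruction residual r_i(y)."""
    labels = np.asarray(labels)
    A = A / np.linalg.norm(A, axis=0, keepdims=True)  # unit l2-norm columns
    x_hat = ista_l1(A, y, lam)
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        delta_c = np.where(labels == c, x_hat, 0.0)   # keep class-c coefficients only
        residuals.append(np.linalg.norm(y - A @ delta_c))
    return classes[int(np.argmin(residuals))], x_hat

# Toy dictionary (hypothetical data): two columns per class.
A = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9],
              [0.1, 0.0, 0.1, 0.0]])
labels = [0, 0, 1, 1]
y = np.array([0.95, 0.05, 0.05])
label, x_hat = src_classify(A, labels, y)
print(label)
```

The toy test sample lies almost entirely in the span of the class-0 columns, so the class-0 residual is the smaller one.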
Cost-sensitive learning usually takes the misclassification cost as the objective function and identifies the label by minimizing the loss function. Given a test sample $y$ with predicted class label $\phi(y)$, the label is obtained by minimizing the objective function

$$\hat{\phi}(y) = \arg\min_{\phi(y)} \operatorname{loss}(y, \phi(y)),$$

where $\operatorname{loss}(y, \phi(y))$ is the loss incurred by predicting $\phi(y)$ for $y$, $\hat{\phi}(y)$ is the optimal prediction of $y$, and $G$ represents the gallery subjects in the classification problem.
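To make the minimum-loss rule concrete, here is a small sketch that assumes the loss is the standard expected (conditional) cost, $\sum_i P(i \mid y)\, C_{ij}$ for predicting class $j$. That form is an assumption for illustration, as are the function names, the cost values, and the posteriors.

```python
def expected_losses(posteriors, cost):
    """Expected cost of each possible prediction j.

    posteriors[i] = P(class i | y); cost[i][j] = cost of predicting j
    when the true class is i (zero on the diagonal).
    """
    n = len(posteriors)
    return [sum(posteriors[i] * cost[i][j] for i in range(n)) for j in range(n)]

def min_cost_label(posteriors, cost):
    """Index of the prediction with the smallest expected cost."""
    losses = expected_losses(posteriors, cost)
    return min(range(len(losses)), key=losses.__getitem__)

# Hypothetical 3-class setting: errors on class 2 are ten times as costly.
cost = [[0, 1, 1],
        [1, 0, 1],
        [10, 10, 0]]
posteriors = [0.5, 0.3, 0.2]

losses = expected_losses(posteriors, cost)
print(losses, min_cost_label(posteriors, cost))
```

In this example the most probable class is 0, yet the minimum-cost prediction is class 2, which is the behavior that separates cost-sensitive decisions from accuracy-driven ones.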

Cost-Sensitive SRC
In [5], Alpaydın calculated the residuals, that is, the Euclidean distances between the reconstructed samples and the original test sample $y$, to identify the class label of $y$. In cost-sensitive learning, the loss function introduced in the previous section is instead regarded as the objective function for identifying the label of a test sample. In a binary classification problem, there are two misclassification costs: we denote the cost of misclassifying the positive class as the negative class by $C_{10}$ and the cost of the converse error by $C_{01}$. A cost matrix can then be constructed as

$$C = \begin{pmatrix} 0 & C_{10} \\ C_{01} & 0 \end{pmatrix},$$

with rows indexed by the true label and columns by the predicted label, in the order $(l_1, l_0)$, where $l_1$ and $l_0$ represent the labels of the minority class and the majority class, respectively.
It is well known that the loss function can be related to the posterior probability, $P(\phi(y) \mid y) \approx P(l_i(y) \mid y)$, so the loss function can be rewritten in terms of the posterior probabilities $P(l_i(y) \mid y)$. The test sample $y$ belongs to the class with the higher probability. We now estimate $P(l_i(y) \mid y)$, $i = 0, 1$.
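As a worked illustration, assume the binary loss is the standard conditional risk with zero cost for correct decisions. This is an assumption made for the sketch, and the explicit formula used by the authors may differ. The cost-sensitive rule then reduces to a shifted threshold on the positive-class posterior:

```latex
\begin{align}
\operatorname{loss}(y, l_1) &= C_{01}\, P(l_0 \mid y), \qquad
\operatorname{loss}(y, l_0) = C_{10}\, P(l_1 \mid y), \\
\hat{\phi}(y) = l_1
  &\iff C_{01}\,\bigl(1 - P(l_1 \mid y)\bigr) < C_{10}\, P(l_1 \mid y) \\
  &\iff P(l_1 \mid y) > \frac{C_{01}}{C_{01} + C_{10}}.
\end{align}
```

For example, with $C_{01} = 1$ and $C_{10} = 10$ the threshold is $1/11 \approx 0.091$, so a sample is assigned to the minority class even when its posterior is well below $0.5$.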
In the coefficient vector, the larger an element is, the more important a role the corresponding training sample plays in reconstructing the test sample. Ideally, the test sample is represented only by training samples that share its class label, and no samples from a different class enter the linear combination. The posterior probability can therefore be related to the coefficient vector. Accordingly, we rewrite the solution of the stable $\ell_1$-minimization problem as $\hat{x} = [\alpha^+, \alpha^-]^T$, where $\alpha^+ \in \mathbb{R}^{n^+}$ and $\alpha^- \in \mathbb{R}^{n^-}$ represent the positive class coefficients and the negative class coefficients, respectively. Here, $n^+$ is the number of positive samples and $n^-$ is the number of negative samples in the dictionary. From $\alpha^+$ and $\alpha^-$ we obtain the posterior probabilities $P(l_1(y) \mid y)$ and $P(l_0(y) \mid y)$, substitute them into the loss function, and obtain the label of a test sample $y$ by minimizing the resulting loss. The whole process of CSSRC is described in Algorithm 1.
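The decision step can be sketched as follows. The posterior estimate used here, the share of absolute coefficient mass that falls on each class, is an assumption chosen for illustration and is not necessarily the formula used by the authors. The function names and the example coefficients are likewise hypothetical.

```python
def posterior_from_coefficients(alpha_pos, alpha_neg):
    """ASSUMED estimator: P(l_1 | y) = share of |coefficient| mass on the positive class."""
    s_pos = sum(abs(a) for a in alpha_pos)
    s_neg = sum(abs(a) for a in alpha_neg)
    total = s_pos + s_neg
    p_pos = 0.5 if total == 0 else s_pos / total  # uninformative if all coefficients are 0
    return p_pos, 1.0 - p_pos

def cssrc_label(alpha_pos, alpha_neg, c10, c01):
    """Return 1 (positive/minority) or 0 (negative/majority) by minimum expected cost.

    c10: cost of misclassifying a positive sample as negative.
    c01: cost of misclassifying a negative sample as positive.
    """
    p_pos, p_neg = posterior_from_coefficients(alpha_pos, alpha_neg)
    loss_if_predict_pos = c01 * p_neg
    loss_if_predict_neg = c10 * p_pos
    return 1 if loss_if_predict_pos <= loss_if_predict_neg else 0

# Hypothetical sparse coefficients: most of the mass sits on the negative class.
alpha_pos = [0.10, 0.05]
alpha_neg = [0.60, 0.25]

print(cssrc_label(alpha_pos, alpha_neg, c10=1, c01=1))   # equal costs
print(cssrc_label(alpha_pos, alpha_neg, c10=10, c01=1))  # positive errors cost 10x
```

With equal costs the sample is labeled negative; once a missed positive costs ten times as much, the same coefficients lead to a positive label.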

Algorithm 1 (CSSRC algorithm).
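Input. Dictionary $A \in \mathbb{R}^{m \times n}$, test sample $y \in \mathbb{R}^m$.
Output. The label $\hat{\phi}(y)$ of the test sample.
(1) Normalize the columns of $A$ to unit $\ell_2$-norm.
(2) Solve the $\ell_1$-minimization problem $\hat{x} = \arg\min_x \|x\|_1$ subject to $Ax = y$ or, alternatively, solve $\hat{x} = \arg\min_x \|x\|_1$ subject to $\|Ax - y\|_2 \le \varepsilon$. Assume the solution is $\hat{x} = [\alpha^+, \alpha^-]^T$.
(3) Calculate the loss function $\operatorname{loss}(y, \phi(y))$ from the posterior probabilities estimated with $\alpha^+$ and $\alpha^-$.
(4) Obtain the label of $y$: $\hat{\phi}(y) = \arg\min_{\phi(y)} \operatorname{loss}(y, \phi(y))$.

Experiments
We test the proposed method on seven UCI data sets [24]. Detailed information about these data sets is summarized in Table 1.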
In cost-sensitive learning, false positives (actual negative but predicted as positive, denoted FP), false negatives (actual positive but predicted as negative, FN), true positives (actual positive and predicted as positive, TP), and true negatives (actual negative and predicted as negative, TN) can be arranged in a confusion matrix, in which rows give the actual class and columns the predicted class:

                   Positive Class   Negative Class
Positive Class     TP               FN
Negative Class     FP               TN

For binary classification problems, four kinds of cost are needed, referred to as CTP, CFP, CTN, and CFN. CTP and CTN are the costs of true positives (TP) and true negatives (TN). In order to simplify the cost matrix, we set CTP = 0 and CTN = 0. CFN and CFP are the costs of false negatives (FN) and false positives (FP). We always assume that the cost of misclassifying positive class instances is much higher than the cost of misclassifying negative class instances, so we set CFN ≫ CFP. In this paper, CFP is set to a unit cost of 1, and CFN is assigned the values 5, 10, 15, ..., 50.

In our experiments, we adopt 10-fold cross-validation to obtain the average cost, and three evaluation criteria are used to evaluate the classification performance in the cost-sensitive experiments: average cost (AC), error rate of false acceptance (Err(IG)), and error rate of false rejection (Err(GI)). For the class-imbalance problem, we choose F-measure and G-mean to evaluate the performance. These criteria are defined following [25,26], where $|\mathrm{FP}|$ and $|\mathrm{FN}|$ represent the numbers of false acceptances and false rejections, respectively, and $N$, $N^+$, and $N^-$ represent the numbers of test samples, positive class samples, and negative class samples, with $N = N^+ + N^-$. To illustrate the performance of CSSRC, we compare it with sparse representation based classification (SRC), Cost-Sensitive Support Vector Machine (CSSVM), and Cost-Sensitive Semisupervised Support Vector Machine (CS4VM) in three experiments. The experiments are performed in Matlab 2014a on a computer with a 2.6 GHz Intel Xeon CPU.
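The evaluation criteria can be sketched in code under explicit assumptions. The normalizations used here (AC as total cost divided by the number of test samples, Err(IG) as $|\mathrm{FP}|/N^-$, Err(GI) as $|\mathrm{FN}|/N^+$) are assumed for illustration. F-measure and G-mean follow the verbal description given elsewhere in the paper, namely the harmonic and geometric means of the two per-class accuracies. The function name `evaluate` and the toy labels are hypothetical.

```python
import math

def evaluate(y_true, y_pred, c_fn=10.0, c_fp=1.0):
    """Cost-sensitive and imbalance-aware criteria for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_pos, n_neg = tp + fn, tn + fp
    n = n_pos + n_neg
    acc_pos = tp / n_pos if n_pos else 0.0  # accuracy on the positive class
    acc_neg = tn / n_neg if n_neg else 0.0  # accuracy on the negative class
    denom = acc_pos + acc_neg
    return {
        "AC": (c_fp * fp + c_fn * fn) / n,             # assumed normalization by N
        "Err(IG)": fp / n_neg if n_neg else 0.0,       # false acceptance rate (assumed)
        "Err(GI)": fn / n_pos if n_pos else 0.0,       # false rejection rate (assumed)
        "F-measure": 2 * acc_pos * acc_neg / denom if denom else 0.0,
        "G-mean": math.sqrt(acc_pos * acc_neg),
    }

# Toy labels (hypothetical): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

In the toy example there is one false rejection and two false acceptances, so with CFN = 10 and CFP = 1 the single false rejection dominates the average cost.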

The Effect of Cost for SRC.
The Housing data set is smaller than the other six data sets, so fewer samples are selected for its training and test sets. We randomly select 31 positive samples and 31 negative samples from Housing as test samples and 41 positive samples and 41 negative samples as training samples. From Abalone, Nursery, Letter, Pima, Cmc, and Car, we select 61 positive samples and 61 negative samples as test samples and 101 positive samples and 101 negative samples as training samples. We repeat the sampling 100 times and report the average results.
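The sampling protocol can be sketched as below. The sketch assumes that training and test samples are drawn without replacement and kept disjoint, which the text does not state explicitly. The function name and the index pools are hypothetical.

```python
import random

def balanced_split(pos_ids, neg_ids, n_train, n_test, rng):
    """Draw n_train + n_test ids per class without replacement, then split them."""
    pos = rng.sample(pos_ids, n_train + n_test)
    neg = rng.sample(neg_ids, n_train + n_test)
    train = pos[:n_train] + neg[:n_train]
    test = pos[n_train:] + neg[n_train:]
    return train, test

rng = random.Random(0)              # fixed seed so the sketch is repeatable
pos_ids = list(range(0, 300))       # hypothetical pool of positive sample indices
neg_ids = list(range(300, 1000))    # hypothetical pool of negative sample indices

# 101 + 101 training and 61 + 61 test samples, repeated 100 times.
splits = [balanced_split(pos_ids, neg_ids, 101, 61, rng) for _ in range(100)]
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each repetition would then train and evaluate every method on its own split, and the reported numbers are averages over the 100 repetitions.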
Experiment 1. We compare the performance of the four methods (CSSRC, SRC, CSSVM, and CS4VM) on Abalone, Nursery, Letter, Pima, Cmc, Housing, and Car. We set the cost ratio (the cost of false acceptance with respect to false rejection) to 10, and the results are summarized in Table 2. From Table 2, we can see that the proposed cost-sensitive approach achieves a lower average misclassification cost than the other three methods on every data set except Cmc, that is, on Abalone, Nursery, Letter, Pima, Housing, and Car. On Cmc, the average cost of CSSRC is higher than that of CS4VM but lower than that of the other two methods. The average cost of CSSRC is 0.5122 and that of CS4VM is 0.5105; the two are of the same order of magnitude. Overall, our method performs better than SRC, CSSVM, and CS4VM.

Experiment 2. Building on the setting of Experiment 1, we plot the two panels of Figure 1. For both the positive class and the negative class, the proposed method achieves a lower error rate on Nursery and Abalone when the cost ratio ranges from 5 to 50. Although CS4VM can obtain a lower error rate of false rejection, its error rate of false acceptance is very high, which can generate a serious total cost. Figure 1 shows that our method achieves a low error rate of false rejection and a low error rate of false acceptance simultaneously.
Experiment 3. In this experiment, we vary the cost ratio from 10 to 50, and the results are summarized in Table 3. The first row lists the cost ratios, and the first two columns give the data sets and classification methods, respectively. Experiment 2 uses only two data sets; to demonstrate the robustness of our method, more data sets are used in this experiment. Our proposed cost-sensitive SRC achieves a lower average cost on four data sets. Although its cost is not the lowest on Nursery and Letter, it is of the same order of magnitude as the lowest cost.
The above three experiments demonstrate the effect of the cost term for SRC. In particular, the comparison between SRC and CSSRC supports the conclusion that the cost term can improve the performance of SRC.

Solving Class-Imbalance Problem
Experiment 1. In this section, we address the class-imbalance problem. Table 1 summarizes the data sets we used; the data sets with an imbalance ratio higher than 10 are Nursery and Letter. In order to set a higher imbalance ratio, we select Nursery in this experiment. As before, we compare the performance of the four methods (SRC, CSSVM, CS4VM, and CSSRC) on Nursery. Since it is difficult to reflect the performance of a method on the class-imbalance problem, F-measure, G-mean, and classification accuracy are adopted. In this experiment, the imbalance ratio takes the values 1, 2, ..., 10. In the training set, the size of the minority class is 30 and the size of the majority class is 30 multiplied by the imbalance ratio. We select 61 positive samples and 61 negative samples as the test set and summarize the results in Figures 2 and 3; the sampling process is repeated 100 times and the average results are reported.
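The construction of the imbalanced training sets can be sketched as follows. The function name, the index pools, and the seed are hypothetical; only the sizes (30 minority samples, 30 times the imbalance ratio majority samples, ratios 1 to 10) come from the text.

```python
import random

def imbalanced_training_ids(minority_ids, majority_ids, ratio, rng, n_minority=30):
    """Sample 30 minority ids and 30 * ratio majority ids without replacement."""
    minority = rng.sample(minority_ids, n_minority)
    majority = rng.sample(majority_ids, n_minority * ratio)
    return minority, majority

rng = random.Random(0)                    # fixed seed so the sketch is repeatable
minority_ids = list(range(0, 400))        # hypothetical minority-class pool
majority_ids = list(range(400, 4000))     # hypothetical majority-class pool

sizes = {}
for ratio in range(1, 11):                # imbalance ratios 1, 2, ..., 10
    minority, majority = imbalanced_training_ids(minority_ids, majority_ids, ratio, rng)
    sizes[ratio] = (len(minority), len(majority))
print(sizes)
```

At ratio 10 the training set therefore holds 30 minority and 300 majority samples, while the test set described in the text stays balanced.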
Figure 2 shows the F-measure results on Nursery; the F-measure (the harmonic mean of the classification accuracy of the positive class and the classification accuracy of the negative class) is defined in Section 4.1. Our method achieves a higher F-measure than sparse representation based classification, Cost-Sensitive Support Vector Machine, and Cost-Sensitive Semisupervised Support Vector Machine. Moreover, the proposed method is more stable as the imbalance ratio increases. Similarly, in Figure 3, the G-mean (the geometric mean of the classification accuracy of the positive class and the classification accuracy of the negative class) of our method is higher than that of the other three methods.
Because it is difficult to evaluate methods for the class-imbalance problem, we additionally report classification accuracy, which is summarized in Table 4. Running time, which represents the computational cost of a method, is shown in Table 5. Our method obtains the highest classification accuracy and the lowest running time on Nursery. In this paper, we use the sparse representation coefficient vector to estimate the posterior probability, which reduces the computational complexity and cost.
Experiment 2. In this experiment, we intend to validate the applicability of our method to the class-imbalance problem. In Experiment 1, we tested the validity of our method when the class distribution of the training samples is imbalanced. Now, we also validate our method when the class distribution of the test samples is imbalanced. Table 1 summarizes the data sets we used; the imbalance ratio of Letter is 24.3. In order to set a higher imbalance ratio, we select Letter in this experiment. As before, we compare the performance of the four methods (SRC, CSSVM, CS4VM, and CSSRC) on Letter. In this experiment, the imbalance ratio takes the values 1, 2, ..., 10. In the training set, the size of the minority class is 30 and the size of the majority class is 30 multiplied by the imbalance ratio. We select 61 positive samples and 61 negative samples as the test set and summarize the results in Figures 4 and 5; the sampling process is repeated 100 times and the average results are reported.
Figure 4 shows the F-measure with imbalanced training samples, and Figure 5 shows the F-measure with imbalanced test samples. Figures 4 and 5 show that our method achieves stable and higher results on Letter than the other three methods. Although sparse representation based classification has an F-measure similar to that of our method in Figure 5, its running time is higher than that of our method (Table 6). Having compared the F-measure under imbalanced distributions of training and test samples, as well as the running time, we conclude that our method outperforms the other three methods and handles the class-imbalance problem well.

Conclusions and Future Works
In this paper, we propose a novel cost-sensitive SRC classifier. The proposed approach adopts a probabilistic model and the sparse representation coefficient vector to estimate the posterior probability and then obtains the label of a test sample by minimizing the misclassification losses. The experimental results show that the proposed cost-sensitive SRC has a comparable or even lower total cost, with higher accuracy, compared with the other three classification algorithms. The experiments also indicate that our method can handle the class-imbalance problem well. In real-world applications, nearly all data sets are class-imbalanced, and our method helps overcome the difficulty brought by the imbalanced distribution of data sets.
In order to simplify the cost matrix, we restricted our discussion to two-class problems. Extending the current work to the multiclass scenario is therefore a main direction for our future work.

Figure 1: Error rate of false acceptance and false rejection.

Figure 2: The result of F-measure on Nursery.

Figure 3: The result of G-mean on Nursery.

Figure 4: The result of F-measure on Letter (the distribution of training samples is imbalanced).

Figure 5: The result of F-measure on Letter (the distribution of test samples is imbalanced).
$A_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,n_i}] \in \mathbb{R}^{m \times n_i}$ and $n = \sum_{i=1}^{c} n_i$ is the number of all training samples, where $c$ is the number of classes in the training set. Given sufficient training samples of the $i$th class, any test sample $y \in \mathbb{R}^m$ from the same class will be approximately represented linearly by the training samples of class $i$: $y = \alpha_{i,1} v_{i,1} + \alpha_{i,2} v_{i,2} + \cdots + \alpha_{i,n_i} v_{i,n_i} = A_i \alpha_i$, where $\alpha_i = [\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,n_i}]^T \in \mathbb{R}^{n_i}$. Then, define a new matrix $A$ for the entire training set as follows: $A = [A_1, A_2, \ldots, A_c] \in \mathbb{R}^{m \times n}$.

Cost-Sensitive Function.
In multiclass cost-sensitive learning, consider $c$ gallery subjects with class labels $G = \{G_i\}_{i=1,2,\ldots,c}$ and many impostors, whose labels are $I$. In [7], Zhang and Zhou categorized the costs into three types: cost of false acceptance $C_{IG}$, cost of false rejection $C_{GI}$, and cost of false identification $C_{GG}$. Empirically, it is evident that $C_{IG}$, $C_{GI}$, and $C_{GG}$ are unequal. Given a cost setting specified by the users, the three costs are then rescaled as ratios of one another, following [7].

Table 1: Description of data sets.

Table 2: Average cost of the four methods (cost ratio 1 : 10).

Table 3: Average cost of methods on five data sets with cost ratio from 10 to 50.

Table 4: Classification accuracy on Nursery.

Table 5: Running time on Nursery.

Table 6: Running time on Letter.