A Novel Boundary Oversampling Algorithm Based on Neighborhood Rough Set Model: NRSBoundary-SMOTE

Rough set theory is a powerful mathematical tool introduced by Pawlak to deal with imprecise, uncertain, and vague information. The neighborhood-based rough set model extends rough set theory and divides a dataset into three regions, where the boundary region contains the samples in which the majority and minority classes overlap. Using this knowledge of the distribution of the original dataset, we oversample only those minority class samples in the boundary region that overlap with majority class samples. Thus, NRSBoundary-SMOTE expands the decision space for the minority class while shrinking the decision space for the majority class. Experiments on four kinds of classifiers show that NRSBoundary-SMOTE achieves higher accuracy than other methods when C4.5, CART, and KNN are used, but it performs worse than SMOTE with the SVM classifier.


Introduction
The imbalanced dataset problem in classification domains occurs when the number of instances that represent one class is much larger than that of the other classes. The minority class is usually the more interesting one from the point of view of the learning task. There are many situations in which imbalance occurs between classes, such as satellite image classification [1], risk management [2], and medical diagnosis [3, 4]. When studying problems with imbalanced data, using the classifiers produced by standard machine learning algorithms without adjusting the output threshold may well be a critical mistake [5].
At present, solutions to the problem of imbalanced dataset classification are developed at both the data and algorithmic levels [6]. At the data level, the objective is to rebalance the class distribution by resampling the data space, for example, by oversampling the minority class or undersampling the prevalent class. At the algorithm level, solutions try to adapt existing classifier learning algorithms to strengthen learning with regard to the minority class, such as cost-sensitive learning and ensemble learning. Resampling is convenient and effective; therefore, it is an often-used method for dealing with the class imbalance problem.
Previous research has improved resampling methods in many respects and proposed some effective resampling algorithms. SMOTE is an intelligent oversampling algorithm proposed by Chawla et al. [7]. Its main idea is to form new minority class samples by interpolating between minority class samples that lie close together. Thus, the overfitting problem is avoided and the decision space for the minority class is expanded; meanwhile, the decision space for the majority class is reduced, so many researchers have proposed improved methods. Dong and Wang [8] proposed Random-SMOTE, which differs from SMOTE in that it obtains new minority class samples by interpolating among three minority class samples. Yang et al. [9] proposed the ASMOTE algorithm, which considers not only the minority class samples but also the majority class samples that are near a minority class sample, preventing synthetic samples from overlapping the majority class samples. Han et al. [10] proposed Borderline-SMOTE. Their study considered the borderline minority samples, which are the most easily misclassified; they identified these borderline samples and generated synthetic samples from them. Compared with SMOTE, Borderline-SMOTE maintains the decision space for the majority class and enlarges the decision space for the minority class. However, when the number of minority class samples is much smaller than that of the majority class samples, most of the minority class samples are regarded as noise. Thus, few synthetic samples are generated, so the method improves accuracy only slightly. For these reasons, it is urgent to study more effective oversampling methods that generate high-quality synthetic samples and, in particular, provide a better way to distinguish the borderline minority class samples.
It is important to find an effective mathematical theory to express and process the uncertainty of the minority class samples. Rough set theory is a powerful mathematical tool introduced by Pawlak [11–14] to deal with imprecise, uncertain, and vague information. It has been successfully applied in fields such as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. Basically, the idea is to approximate a concept by three description sets, namely, the lower approximation, the upper approximation, and the boundary region. Rough set theory places the uncertain samples in the boundary region, which can be calculated as the upper approximation minus the lower approximation, and all three sets are computable. To date, many researchers have applied rough set theory to processing imbalanced data. Liu et al. [15] proposed the weighted rough set model for imbalanced data, which gives the minority class samples a higher weight so that the classifier focuses on them. Ramentol et al. [16] introduced a hybrid preprocessing approach that combines SMOTE with the upper approximation. This method filters the generated synthetic samples by comparing them with the majority class samples in the upper approximation: once a synthetic sample is similar to those majority class samples, it is removed, ensuring that the synthetic samples approximate the minority class samples. Grzymala-Busse et al. [17] altered the LEM2 algorithm by strengthening the rules to improve the classification of minority class samples.
The remainder of this paper is organized as follows. The basic concepts of neighborhood rough set models are presented in Section 2. The NRSBoundary-SMOTE algorithm, which oversamples the minority class samples in the boundary region, is developed in Section 3. Section 4 presents the experimental evaluation on 15 imbalanced UCI datasets [18] by 10-fold cross validation, which shows the validity of the proposed method. The paper is concluded in Section 5.

Neighborhood-Based Rough Set Model
Neighborhoods and neighborhood relations are a class of important concepts in topology. Lin [19] pointed out that neighborhood spaces are more general topological spaces than equivalence spaces and introduced the neighborhood relation into rough set methodology. Hu et al. [20] discussed the properties of neighborhood approximation spaces and proposed the neighborhood-based rough set model. They then used the model to build a uniform theoretical framework for neighborhood-based classifiers.
For convenience of description, some basic concepts of the neighborhood rough set model are first introduced here.
Definition 1 (see [20]). Given arbitrary $x_i \in U$ and $B \subseteq A$, the neighborhood $\delta_B(x_i)$ of $x_i$ in the subspace $B$ is defined as
$$\delta_B(x_i) = \{x_j \mid x_j \in U, \Delta_B(x_i, x_j) \le \delta\},$$
where $\Delta$ is a metric function. $\forall x_1, x_2, x_3 \in U$, it satisfies
(1) $\Delta(x_1, x_2) \ge 0$, and $\Delta(x_1, x_2) = 0$ if and only if $x_1 = x_2$;
(2) $\Delta(x_1, x_2) = \Delta(x_2, x_1)$;
(3) $\Delta(x_1, x_3) \le \Delta(x_1, x_2) + \Delta(x_2, x_3)$.
Consider that $x_1$ and $x_2$ are two objects and $A = \{a_1, a_2, \ldots, a_N\}$ is an $N$-dimensional space, where $f(x, a_i)$ denotes the value of sample $x$ on the $i$th dimension $a_i$. Then a general metric, named the Minkowsky distance, is defined as
$$\Delta_p(x_1, x_2) = \left(\sum_{i=1}^{N} |f(x_1, a_i) - f(x_2, a_i)|^p\right)^{1/p}.$$
When $p = 2$, it is the Euclidean distance $\Delta_2$. But the Euclidean distance can only be computed on continuous features; it is invalid for nominal features. Here, we compute those by using the Value Difference Metric (VDM) proposed by Stanfill and Waltz [21] in 1986. The distance $\delta$ between two corresponding feature values is defined as
$$\delta(v_1, v_2) = \sum_{i=1}^{C} \left|\frac{N_{1i}}{N_1} - \frac{N_{2i}}{N_2}\right|^k.$$
In the previous equation, $v_1$ and $v_2$ are the two corresponding feature values, $N_1$ is the total number of occurrences of feature value $v_1$, and $N_{1i}$ is the number of occurrences of feature value $v_1$ for class $i$. A similar convention also applies to $N_2$ and $N_{2i}$. $k$ is a constant, which is usually set to 1.
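As a minimal sketch of the two metrics above (the function names and array-based interface are ours, not from the paper; continuous features are assumed for the Minkowsky distance and a nominal feature column for VDM):

```python
import numpy as np

def minkowsky(x1, x2, p=2):
    """Minkowsky distance between two continuous feature vectors.
    p=1 gives the Manhattan distance, p=2 the Euclidean distance."""
    return float(np.sum(np.abs(np.asarray(x1, float) - np.asarray(x2, float)) ** p) ** (1.0 / p))

def vdm(v1, v2, values, labels, k=1):
    """Value Difference Metric between two nominal feature values.
    `values` is the feature column and `labels` the class column of the
    training set; k is usually set to 1."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    n1 = np.sum(values == v1)  # total occurrences of v1
    n2 = np.sum(values == v2)  # total occurrences of v2
    dist = 0.0
    for c in np.unique(labels):
        n1c = np.sum((values == v1) & (labels == c))  # v1 within class c
        n2c = np.sum((values == v2) & (labels == c))  # v2 within class c
        dist += abs(n1c / n1 - n2c / n2) ** k
    return dist
```

For example, `minkowsky([0, 0], [3, 4])` returns the Euclidean distance 5.0, and two nominal values with identical class distributions have a VDM of 0.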
Definition 2 (see [20]). Given a set of samples $U$, $N$ is a neighborhood relation on $U$, and $\{\delta(x_i) \mid x_i \in U\}$ is the family of neighborhood granules. Then we call $\langle U, N \rangle$ a neighborhood approximation space.
Definition 3 (see [20]). Given $\langle U, N \rangle$, for arbitrary $X \subseteq U$, two subsets of objects, called the lower and upper approximations of $X$ in terms of relation $N$, are defined as
$$\underline{N}X = \{x_i \mid \delta(x_i) \subseteq X, x_i \in U\}, \qquad \overline{N}X = \{x_i \mid \delta(x_i) \cap X \ne \emptyset, x_i \in U\}.$$
The boundary region in the approximation space is formulated as
$$BN(X) = \overline{N}X - \underline{N}X.$$

Definition 4 (see [20]). Given a neighborhood decision table $NDT = \langle U, A \cup D, V, f \rangle$, $X_1, X_2, \ldots, X_N$ are the object subsets with decisions $1$ to $N$, and $\delta_B(x_i)$ is the neighborhood information granule including $x_i$ and generated by attributes $B \subseteq A$. Then the lower and upper approximations of the decision $D$ with respect to attributes $B$ are defined as
$$\underline{N}_B D = \bigcup_{i=1}^{N} \underline{N}_B X_i, \qquad \overline{N}_B D = \bigcup_{i=1}^{N} \overline{N}_B X_i,$$
where
$$\underline{N}_B X = \{x_i \mid \delta_B(x_i) \subseteq X, x_i \in U\}, \qquad \overline{N}_B X = \{x_i \mid \delta_B(x_i) \cap X \ne \emptyset, x_i \in U\}.$$
The decision boundary region of $D$ with respect to attributes $B$ is defined as
$$BN(D) = \overline{N}_B D - \underline{N}_B D.$$
The decision boundary region is the object subset whose neighborhoods come from more than one decision class. On the other hand, the lower approximation of the decision, also called the positive region of the decision and denoted by $POS_B(D)$, is the subset of objects whose neighborhoods belong to only one of the decision classes.
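Definitions 3 and 4 can be sketched for continuous features with a Euclidean metric as follows; the function name, the fixed radius, and the toy data in the usage note are our assumptions for illustration:

```python
import numpy as np

def approximations(X, y, delta):
    """Split sample indices into the positive region (lower approximation
    of the decision) and the decision boundary region, per Definitions 3-4.
    X: (n, m) continuous feature array; y: class labels; delta: radius."""
    n = len(X)
    lower, boundary = [], []
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)  # distances to all samples
        neigh = y[d <= delta]                 # decision labels in delta(x_i)
        if np.all(neigh == y[i]):             # pure neighborhood:
            lower.append(i)                   #   x_i is in POS_B(D)
        else:                                 # mixed decisions:
            boundary.append(i)                #   x_i is in BN(D)
    return lower, boundary
```

For instance, with two well-separated 1D clusters of different classes plus a point sitting between them, the between-cluster point (and any point whose neighborhood reaches the other class) lands in the boundary region.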
To explain the samples in the lower approximation of the decision and in the boundary region, we give an example in Figure 1.
Example 5. Figure 1 gives an example of binary classification in a 2D space, where $X_1$ represents the majority class samples, labeled by boxes, and $X_2$ represents the minority class samples, labeled by circles. Consider samples $x_1$, $x_2$, and $x_3$; we assign circular neighborhoods to these samples. We can find that $\delta(x_1) \subseteq X_1$, $\delta(x_3) \subseteq X_2$, while $\delta(x_2)$ intersects both $X_1$ and $X_2$. According to the aforementioned definitions, $x_1 \in \underline{N}X_1$, $x_3 \in \underline{N}X_2$, and $x_2 \in BN(D)$.

Neighborhood Rough Set Boundary SMOTE Algorithm
SMOTE Algorithm. SMOTE generates synthetic minority class samples by interpolation. For each minority class sample $x$, it finds the $k$ nearest minority class neighbors and randomly selects one of them, $\hat{x}$; a synthetic sample is then generated as
$$x_{\text{new}} = x + \text{rand}(0, 1) \times (\hat{x} - x),$$
where rand(0, 1) refers to a random number between 0 and 1.
From a geometric point of view, SMOTE can be regarded as interpolating between two minority class samples. The decision space for the minority class is thereby expanded, which allows the classifier to make better predictions on unknown minority class samples.
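A minimal sketch of this interpolation, assuming continuous features and Euclidean distance (the function name and parameters are ours):

```python
import numpy as np

def smote_samples(minority, n_new, k=5, rng=None):
    """Plain SMOTE: pick a random minority sample x, pick one of its k
    nearest minority neighbors x_hat, and interpolate
    x_new = x + rand(0, 1) * (x_hat - x)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        d = np.linalg.norm(minority - minority[i], axis=1)
        # k nearest minority neighbours, excluding the sample itself
        neighbors = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbors)
        gap = rng.random()  # rand(0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on a segment between two minority samples, all generated points stay inside the convex hull of the minority class.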
The SMOTE algorithm is simple and effective in generating synthetic samples, and it avoids the overfitting problem. It expands the decision space for the minority class, but at the same time it may shrink, with high confidence, the decision space for the majority class. Thereby, it leads to poor prediction on unknown majority class samples. We now give an example to illustrate this drawback of SMOTE (see Figure 2).
In Figure 2, we apply SMOTE to generate synthetic samples for the minority class sample $x_1$. We find the $k$ ($k = 5$) nearest minority class samples of $x_1$, denoted by $x_2$, $x_3$, $x_4$, $x_5$, and $x_6$. According to Definition 3, $x_1$, $x_5$, and $x_6$ belong to the lower approximation of the decision. Furthermore, $x_5$ and $x_6$ are farther from $x_1$ than $x_2$, $x_3$, and $x_4$. If one generates synthetic samples between $x_1$ and $x_5$ or between $x_1$ and $x_6$, the synthetic samples (such as points c and d) will overlap with (or lie very close to) the majority class samples. Thereby, misclassification will occur easily. Therefore, it is important to find rational neighborhoods of the minority class samples while oversampling.

Neighborhood Rough Set Boundary SMOTE Algorithm.
In order to solve the aforementioned problem, we propose a new oversampling method, namely, Neighborhood Rough Set Boundary SMOTE (NRSBoundary-SMOTE). The proposed method consists of three steps. First, we compute the minority class samples in the boundary region and the majority class samples in the lower approximation of the decision. Second, for every such minority class sample, we generate synthetic samples by calling the SMOTE algorithm. Third, we select the rational synthetic samples, that is, those that do not affect the decision space of the majority class samples in the lower approximation of the decision.
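Under the same assumptions as before (continuous features, Euclidean distance, a fixed radius), the three steps might be sketched as follows; this is our illustrative reading, not the paper's reference implementation:

```python
import numpy as np

def nrs_boundary_smote(X, y, minority_label, delta, k=5, rng=None):
    """Sketch of the three NRSBoundary-SMOTE steps: (1) find minority
    samples in the boundary region and majority samples in the positive
    region, (2) interpolate SMOTE-style from each boundary minority
    sample, (3) keep only synthetic samples that stay outside the
    neighborhood of every positive-region majority sample."""
    rng = np.random.default_rng(rng)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    pure = np.array([np.all(y[dists[i] <= delta] == y[i]) for i in range(n)])
    boundary_min = np.where(~pure & (y == minority_label))[0]   # step 1
    pos_major = np.where(pure & (y != minority_label))[0]
    minority = np.where(y == minority_label)[0]
    accepted = []
    for i in boundary_min:                                      # step 2
        d = np.linalg.norm(X[minority] - X[i], axis=1)
        neighbors = minority[np.argsort(d)[1:k + 1]]
        j = rng.choice(neighbors)
        s = X[i] + rng.random() * (X[j] - X[i])
        # step 3: reject s if it invades the neighborhood of a
        # positive-region majority sample
        if len(pos_major) == 0 or np.min(
                np.linalg.norm(X[pos_major] - s, axis=1)) > delta:
            accepted.append(s)
    return np.array(accepted)
```

With a tight minority cluster overlapped by one majority point and a distant pure majority cluster, every minority sample is in the boundary region, and all synthetic samples stay inside the minority cluster, far from the positive-region majority samples, so all are accepted.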
In Figure 3, an example is given to further explain NRSBoundary-SMOTE. The samples inside the ellipse all belong to the boundary region, while the ones outside belong to the lower approximation of the decision. Now, we choose the minority samples in the boundary region for oversampling. We find the $k$ ($k = 5$) nearest neighbors of $x_1$, namely, $\{x_2, x_3, x_4, x_5, x_6\}$, and assume that the corresponding synthetic samples are a, b, c, d, and e. Obviously, one of these synthetic samples falls within the neighborhood of a majority class sample; that is, there is a risk that this majority sample will be classified into the minority class. Therefore, an effective method should be adopted to avoid this risk.
An effective rule is that a synthetic sample must not fall inside the neighborhood of any majority sample during oversampling. How to measure the neighborhood radius of a sample is then a primary issue. According to Definition 1, we should first obtain the threshold $\delta$. Here we compute $\delta$ as follows [20]:
$$\delta = \min(\Delta(s_i, s)) + w \times \mathrm{range}(\Delta(s_i, s)),$$
where $s_i$ ($i = 1, 2, \ldots, n$) is a training sample, $\min(\Delta(s_i, s))$ denotes the minimal distance between $s_i$ and the remaining samples $s$ (excluding $s_i$), and $\mathrm{range}(\Delta(s_i, s))$ denotes the range of $\Delta(s_i, s)$. In this case, $\delta$ is dynamically generated from the whole training set. In Section 4, we provide a suitable value domain for $w$, combined with experimental analysis.
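One plausible reading of this formula, with the minimum and range taken over the distances from a sample $s_i$ to the remaining samples, can be sketched as (function name and per-sample interface are our assumptions):

```python
import numpy as np

def neighborhood_radius(X, i, w):
    """delta = min(Delta(s_i, s)) + w * range(Delta(s_i, s)):
    the minimal distance from sample i to the remaining samples plus a
    w-fraction of the range of those distances."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X - X[i], axis=1)
    d = np.delete(d, i)  # exclude Delta(s_i, s_i) = 0
    return float(d.min() + w * (d.max() - d.min()))
```

For example, with 1D samples at 0, 1, and 3 and $w = 0.5$, the radius for the first sample is $1 + 0.5 \times (3 - 1) = 2$.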
Here we give the NRSBoundary SMOTE (see Algorithm 1) as follows.
Time Complexity Analysis of Algorithm 1. Assume that $|U| = n$ and the number of features is $m$. The time complexity of step 1 is $O(1)$; of step 2, $O(n)$; of step 3, $O(m \times n^2)$; of step 4, $O(k \times m \times n^2)$; and of step 5, $O(n)$. So the time complexity of Algorithm 1 is $O(k \times m \times n^2)$.

Space Complexity Analysis of Algorithm 1. The space complexity of Algorithm 1 is $O(m \times n)$.

Experimental Designing and Analysis
In this section, we first present the experimental setup, including the UCI datasets and the evaluation measures for imbalanced domains. Then we present the experimental analysis, which is divided into two parts: first we analyze the parameters of our method, and then we develop a comparative analysis against other oversampling methods using several classifiers.

Datasets.
In order to test the proposed algorithm, 15 UCI datasets with imbalanced rates ranging from 0.20 to 0.804 are downloaded from the machine learning data repository of the University of California at Irvine. There are four multiclass datasets and eleven two-class datasets. The multiclass datasets are modified to obtain two-class imbalance problems by taking the union of one or more classes as the minority class and the union of the remaining classes as the majority class. For missing values of continuous features, we fill in the average value; for missing values of nominal features, we fill in the most frequent value. The datasets are outlined in Table 1, sorted by imbalanced rate from low to high.

Experimental Evaluation in Imbalanced Domains.
The traditional evaluation usually uses the confusion matrix, shown in Table 2, where TP is the number of positive samples classified as positive, TN is the number of negative samples classified as negative, FN is the number of positive samples that are misclassified, and FP is the number of negative samples that are misclassified.
From Table 2, one can derive several useful evaluation measures as follows:
Precision = TP/(TP + FP); Recall = TP/(TP + FN); F-value = 2RP/(R + P), where R and P refer to Recall and Precision, respectively. Precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while Recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. From these formulas, we can decrease FP to increase Precision and increase TP to increase Recall. But in fact these two goals conflict, so we use the F-value to consider them comprehensively: only when Precision and Recall are both high will the F-value be high.
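The three measures follow directly from the four confusion-matrix counts; a small sketch (function name is ours):

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Precision, Recall, and F-value from the confusion-matrix counts."""
    precision = tp / (tp + fp)   # fraction of retrieved that are relevant
    recall = tp / (tp + fn)      # fraction of relevant that are retrieved
    f_value = 2 * recall * precision / (recall + precision)
    return precision, recall, f_value
```

For example, with TP = 8, FN = 2, FP = 2, TN = 88, all three measures equal 0.8; the F-value is the harmonic mean of Precision and Recall.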
Another appropriate metric for measuring the performance of classification over imbalanced datasets is the Receiver Operating Characteristic (ROC) graphic [22]. In these graphics, the tradeoff between the benefits (TP) and costs (FP) can be visualized; they acknowledge the fact that no classifier can increase the number of true positives without also increasing the false positives. The area under the ROC curve (AUC) [23] corresponds to the probability of correctly identifying which of two stimuli is noise and which is signal plus noise. AUC provides a single-number summary of the performance of learning algorithms.
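That probabilistic interpretation suggests a rank-based computation: the AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal sketch (function name is ours; ties count as one half):

```python
import numpy as np

def auc(scores, labels):
    """AUC as P(score of a random positive > score of a random negative),
    counting ties as 0.5; labels are 0 (negative) or 1 (positive)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1][:, None]   # positive scores as a column
    neg = scores[labels == 0][None, :]   # negative scores as a row
    # average over all positive-negative pairs
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())
```

With scores [0.9, 0.4, 0.35, 0.3] and labels [1, 0, 1, 0], three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75.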

The Experimental Results and Analysis.
In this paper, we use Recall, F-value, and AUC to evaluate our algorithm. The oversampling method SMOTE [7] and the classifiers C4.5, KNN, CART, and SVM [24] are used in our experiments; their source code is provided by the Weka software [25]. We also use the Java programming language to implement some other oversampling methods, namely, ASMOTE [9], Borderline-SMOTE [10], and SMOTE-RSB* [16]. For an objective comparison, the minority class is oversampled at 100% and the value of k is set to 5, as in SMOTE. All results are computed by 10-fold cross validation.
(1) NRSBoundary-SMOTE: Parameter Analysis. In the NRSBoundary-SMOTE algorithm, it is important to set a proper value of w, which controls the radius of the neighborhood. Here, we conduct a series of experiments to find the optimal parameter w. We try w from 0 to 0.2 with step 0.01 and compute the F-value using 10-fold cross validation. Figure 4 presents the F-value curves varying with w for several datasets: Pima, VC, Haberman, Transfusion, Colic, and CMC. From Figure 4, we can see a similar trend in these curves: the F-value increases at first and decreases after a threshold. So we recommend that w take values in the range [0.01, 0.05].
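The grid search described above can be sketched generically; the caller supplies an `evaluate` function mapping w to a cross-validated F-value (the function name and interface are ours):

```python
import numpy as np

def sweep_w(evaluate, w_grid=None):
    """Try w over a grid (0 to 0.2 with step 0.01, as in the experiments)
    and return the w with the best score together with that score.
    `evaluate` maps a value of w to a (cross-validated) F-value."""
    if w_grid is None:
        w_grid = np.arange(0.0, 0.201, 0.01)
    scores = [evaluate(w) for w in w_grid]
    return float(w_grid[int(np.argmax(scores))]), max(scores)
```

As a synthetic check, a score function peaking at w = 0.03 makes the sweep return the grid point closest to 0.03.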
(2) Comparative Analysis on C4.5. Tables 3, 4, and 5 give the comparative results of Recall, F-value, and AUC, respectively, computed for the different oversampling methods. None represents the original dataset without resampling.
From Tables 3, 4, and 5, we can see that NRSBoundary-SMOTE has higher accuracy on most datasets. The average Recall increases to 0.7182, while the values of the other methods are between 0.6130 and 0.6886. The average F-value increases to 0.6978, while the others are between 0.6505 and 0.6638. The average AUC is up to 0.7882, while the others are between 0.7615 and 0.7695. NRSBoundary-SMOTE thus has higher accuracy on all three evaluations than the other methods when the classifier is the decision tree C4.5. This shows that our strategy of oversampling and strengthening the minority class samples in the boundary region is feasible. SMOTE oversamples all the minority class samples. It can expand the decision space of the minority class, but it also decreases the decision space of the majority class. Although it can improve the Recall of the minority class, many majority class samples will be misclassified as minority class, which thereby decreases Precision; thus, the F-value results do not improve much. ASMOTE, similar to SMOTE, considers the near neighborhood of the majority class. It can reduce the conflict between synthetic samples and majority class samples and also expand the coverage space of the minority class samples. However, some of the synthetic samples are still similar to majority class samples, so the decision space of the majority class decreases as well.
Both Borderline-SMOTE and SMOTE-RSB* sift the synthetic samples more strictly than SMOTE, so few synthetic samples are generated when the datasets are highly imbalanced. Thus, compared with SMOTE, their improvement is not obvious.
NRSBoundary-SMOTE uses the neighborhood rough set model, which emphasizes oversampling the minority class samples in the boundary region, and thereby expands the coverage space of the minority class samples in the boundary region. Furthermore, it can improve the confidence degree of decision rules derived from the minority class samples in the boundary region (or uncertain area). What is more, it has little influence on the majority class samples in the lower approximation of the decision; in other words, it has little influence on the decision space of the majority class. Thus, the F-value results are improved.
(3) Comparative Analysis on KNN, CART, and SVM. In addition, to test the validity of the proposed method on different classifiers, KNN (k = 3), CART, and SVM are adopted on the 15 UCI datasets. The experimental results are shown in Figure 5, where the F-value is the average of the F-values on the 15 UCI datasets.
From Figure 5, we find that NRSBoundary-SMOTE has higher accuracy than the other methods when C4.5, CART, and KNN are used. On the contrary, it is worse than SMOTE on SVM. In the course of classification, C4.5, CART, and KNN are all based on measuring the distance between an unknown sample and the training samples or rules, and the process of computing neighborhoods in NRSBoundary-SMOTE is similar to these classifiers; therefore, NRSBoundary-SMOTE can perform better with them. But SVM works by constructing a separating hyperplane with the maximal margin, which is not taken into consideration by the NRSBoundary-SMOTE algorithm, so it has no better effect on SVM.
In the NRSBoundary-SMOTE algorithm, one can expand the decision space of the minority samples in the boundary region.

Figure 1: A sample with two classes.

Figure 2: SMOTE leads to a poor prediction on majority class.

Figure 4: F-value curves varying with w.