Immune Centroids Oversampling Method for Binary Classification

To improve classification performance in imbalanced learning, we propose a novel oversampling method, the immune centroids oversampling technique (ICOTE), which is based on an immune network. ICOTE generates a set of immune centroids to broaden the decision regions of the minority class space. The representative immune centroids are used as synthetic examples to resolve the imbalance problem. We use an artificial immune network to generate synthetic examples on clusters with high data densities, which addresses a limitation of the synthetic minority oversampling technique (SMOTE): it does not take groups of training examples into account. We further improve the performance of ICOTE by integrating it with ENN, that is, ICOTE + ENN. ENN disposes of the majority class examples that invade the minority class space, so ICOTE + ENN favors the separation of both classes. Our comprehensive experimental results show that the two proposed oversampling methods achieve better performance than the renowned resampling methods.


Introduction
The class imbalance problem typically occurs when, in a binary classification problem, there are more training examples of one class than of the other [1]. Under these circumstances, most standard algorithms fail to properly represent the distributive characteristics of complex imbalanced datasets and consequently provide unfavorable accuracies across examples of the two classes [2]. Furthermore, the minority class is usually the one of greatest interest from a learning point of view, and misclassifying it implies a great cost [3].
Standard classification learning algorithms are often biased towards the majority class (known as the negative class). Therefore, it is not unusual to see a higher misclassification rate for the minority class (i.e., the positive class) instances. To deal with this problem, a large number of approaches have been proposed to counter the sparsity in the distribution. Among them, the "synthetic minority oversampling technique" (SMOTE) [4] has become one of the most renowned approaches in this area. Providing new related information on the positive class to the learning algorithm can be essential, and this is completely different from undersampling the majority class. Batista et al. proposed an integration method called SMOTE + ENN [5], which uses Wilson's edited nearest neighbor (denoted as ENN) rule [6] to remove examples whose classes differ from the classes of at least half of their nearest neighbors. Han et al. presented two minority oversampling methods, borderline-SMOTE1 and borderline-SMOTE2 [7], in which only the minority class examples near the borderline are oversampled. Later, Bunkhumpornpat et al. published safe-level-SMOTE [8]. Their approach samples minority instances along the safe level, which is computed using the nearest neighboring minority instances. Ramentol et al. proposed another oversampling method that applies an editing technique based on rough set theory and the lower approximation of a subset [9].
In this paper we present an immune centroids oversampling technique (ICOTE) based on immune network theory. We use an aiNet model [10] to generate immune centroids of clusters of high data density. Our work resamples the minority class by introducing immune centroids of clusters of minority class examples. The resampling creates larger but less specific decision regions. We also integrate ICOTE with ENN, that is, ICOTE + ENN. ICOTE + ENN not only samples minority class examples but also disposes of majority class examples that invade the minority class space. We expect this hybrid method to outperform ICOTE in terms of the separation of both classes. Our experimental results show that both ICOTE and ICOTE + ENN achieve better performance than the existing methods when applied with three paradigms.
Computational Intelligence and Neuroscience
The rest of this paper is organized as follows. We review related work in Section 2. Section 3 presents our proposed oversampling methods ICOTE and ICOTE + ENN. Our experimental results and comparisons are shown in Section 4. Finally, we conclude this paper in Section 5.

Related Work
To deal with imbalanced datasets, a number of articles have studied different resampling techniques, which change the class distribution. These articles showed empirically that applying a preprocessing step to balance the class distribution is usually a useful solution [5, 11-13]. Furthermore, the main advantage of these techniques is that they are independent of the underlying classifier. Resampling techniques can be categorized into three groups or families: (1) undersampling methods, which create a subset of the original dataset by eliminating some instances (usually majority class instances), (2) oversampling methods, which create a superset of the original dataset by replicating some instances or creating new instances from the existing ones, and (3) hybrid methods, which combine both sampling approaches.
Jo and Japkowicz [15] discussed whether class imbalance is truly responsible for the degradation in classification performance or whether it can be explained in some other way. Their experiments suggest that the problem is not directly caused by class imbalance, but rather that class imbalance may yield small groups which, in turn, cause this degradation. SMOTE [4] and its successors enrich the minority class space without considering intrinsic data characteristics such as small groups. The SMOTE-based methods might therefore create synthetic examples that underrepresent actual clusters or that are attributable to noisy data. We describe how our method overcomes this inherent drawback in the subsequent section.
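To make the first two resampling families concrete, the following is a minimal sketch of their simplest members, random undersampling and random oversampling. It is not part of ICOTE; the function names, the "positive" label for the minority class, and the choice to balance the two classes exactly are assumptions made for illustration.

```python
import random

def _split(y, minority):
    """Return the indices of minority and majority examples."""
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    return min_idx, maj_idx

def random_undersample(X, y, minority="positive", seed=0):
    """Subset of the data: drop random majority examples until balanced."""
    rng = random.Random(seed)
    min_idx, maj_idx = _split(y, minority)
    keep = sorted(min_idx + rng.sample(maj_idx, len(min_idx)))
    return [X[i] for i in keep], [y[i] for i in keep]

def random_oversample(X, y, minority="positive", seed=0):
    """Superset of the data: replicate random minority examples until balanced."""
    rng = random.Random(seed)
    min_idx, maj_idx = _split(y, minority)
    extra = [rng.choice(min_idx) for _ in range(len(maj_idx) - len(min_idx))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]
```

Both functions assume a non-empty minority class that is smaller than the majority class. Replication adds no new information, which is the motivation for methods that create new instances instead.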

Our Methods
In this section, we first briefly introduce the basic concepts and knowledge of immune systems. After that, we present our oversampling method ICOTE based on immune network theory and its improved version ICOTE + ENN.

Immune Systems.
Before discussing our method, we sketch a few aspects of the human adaptive immune system. The immune system guards our bodies against infections caused by antigens. The surface receptors on B-cells (one kind of lymphocyte) are able to recognize specific antigens. The response of a receptor to an antigen can activate its hosting B-cell. An activated B-cell then proliferates and differentiates into memory cells. Memory cells secrete antibodies to neutralize the pathogens through complementary pattern matching. During the proliferation of the activated B-cells, a mutation mechanism creates diverse antibodies by altering the gene segments. Some of the mutants may be a better match for the corresponding antigen. In order to be protective, the immune system must learn to distinguish between our own (self) cells and harmful external (nonself) invaders. This process is called self/nonself discrimination: cells recognized as self do not promote an immune response, and the system is said to be tolerant to them, while cells that are not recognized as self provoke a reaction resulting in their elimination.
The immune network theory, as originally proposed in [16], hypothesizes a novel viewpoint of lymphocyte activities, natural antibody production, preimmune repertoire selection, tolerance and self/nonself discrimination, memory, and the evolution of the immune system. It was suggested that the immune system is composed of a regulated network of cells and molecules that recognize one another. The immune cells can respond either positively or negatively to the recognition signal (antigen or another immune cell or molecule). A positive response would result in cell proliferation, cell activation, and antibody secretion, while a negative response would lead to tolerance and suppression.
Learning in the immune system involves raising the population size and affinity of those lymphocytes that have proven themselves valuable by recognizing an antigen. Burnet [17] introduced clonal selection theory by modifying Jerne's theory. The theory states that, in a preexisting group of lymphocytes (specifically B-cells), a specific antigen activates only its counter-specific cell (i.e., selection), so that this particular cell is induced to multiply (producing its clones) for antibody production. With repeated exposures to the same antigen, the immune system produces antibodies of successively greater affinities. A secondary response elicits antibodies with greater affinity than a primary response. Based on the clonal selection principle, de Castro and von Zuben [18] proposed a computational implementation that explicitly takes into account the affinity maturation of the immune response. They also defined aiNet (an artificial immune network model) for data analysis [10]. The aiNet is an edge-weighted graph, not necessarily fully connected, composed of a set of nodes, called antibodies, and sets of node pairs, called edges, with a number called the weight, or connection strength, assigned to each edge. The aiNet clusters serve as internal images (mirrors) that map the existing clusters in the dataset into network clusters. The shape of the spatial distribution of antibodies follows the shape of the antigenic spatial distribution.

Immune Centroids Resampling.
In this paper we present a resampling method based on immune network theory. We use the aiNet model [10] to generate immune centroids of clusters of minority class examples. Before explaining ICOTE, we introduce some notation to describe the resampling method. Given a set of labeled examples, each with a class label in {negative, positive}, we measure the affinity (complementarity level) of the antigen-antibody match using the Euclidean distance. The Euclidean distance between two vectors $\vec{x}$ and $\vec{y}$ is

$$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$

where $n$ is the dimension of each vector. The antigen-antibody affinity is inversely proportional to the Euclidean distance: the smaller the distance, the higher the affinity, and vice versa.
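The distance-based affinity can be sketched in a few lines. The text only states that affinity is inversely proportional to the Euclidean distance; the mapping 1 / (1 + d) used below is an assumption chosen to avoid division by zero when an antibody coincides with an antigen, and the function names are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def affinity(antigen, antibody):
    """Illustrative affinity in (0, 1]: decreases as the distance grows.

    The form 1 / (1 + d) is an assumption of this sketch, not the paper's
    definition; any strictly decreasing function of d gives the same ordering.
    """
    return 1.0 / (1.0 + euclidean_distance(antigen, antibody))
```

Because only the ordering of affinities matters for selecting the best-matching antibodies, ranking by smallest distance is equivalent to ranking by highest affinity.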
Our ICOTE includes four major steps as follows.

Immune Centroids Generation.
There are three steps for generating immune centroids. First, the selected antibodies proliferate (clone) proportionally to their antigenic affinity: the higher the affinity, the larger the clone size for each selected antibody. Next, each antibody in the clone set undergoes mutation at a rate that is inversely proportional to the antigenic affinity of its parent antibody. Finally, we eliminate the memory antibodies with a low antigen-antibody affinity (clonal suppression) and those with a high antibody-antibody affinity (network suppression).

Denormalization.
Next, we denormalize the memory antibodies so that the synthetic examples are consistent with the sample distribution.

Attribute Replacement.
At the end, we put back the constant-value attributes by applying the de-fselect operation to each synthetic example.

Correspondingly, the algorithm is described as shown in Algorithm 1. ICOTE samples minority class examples to generate memory antibodies (immune centroids). The shape of the spatial distribution of the immune centroids follows that of the minority class examples. Therefore, ICOTE avoids introducing additional small groups or outliers through oversampling. For instance, we depict the immune centroids in Figure 4. Intuitively, each immune centroid (drawn as a star) shares the same group with one or several neighboring minority class examples. Introducing immune centroids for learning not only creates larger and less specific decision regions but also decreases the likelihood of overfitting, which is the major drawback of SMOTE [4].
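The three generation steps described earlier in this section (cloning, mutation, and suppression) can be illustrated with the following simplified sketch. It is not Algorithm 1: the clone-count rule, the mutation rule that moves a clone toward the antigen, the two thresholds, the single pass over the data, and the seeding of the antibody pool with slightly perturbed copies of the minority examples are all assumptions made to keep the example short. Inputs are assumed to be normalized to the unit hypercube.

```python
import math
import random

def dist(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def immune_centroids(antigens, n_select=3, max_clones=5,
                     clonal_threshold=0.15, network_threshold=0.05,
                     jitter=0.01, seed=0):
    """Return memory antibodies for normalized minority examples (antigens).

    All parameter names and default values are hypothetical.
    """
    rng = random.Random(seed)
    dim = len(antigens[0])
    max_dist = math.sqrt(dim)  # diameter of the unit hypercube [0, 1]^dim
    # Assumed initialization: one slightly perturbed antibody per antigen.
    antibodies = [[v + rng.uniform(-jitter, jitter) for v in ag]
                  for ag in antigens]
    memory = []
    for ag in antigens:
        # Selection: the n_select antibodies with the smallest distance,
        # that is, the highest affinity to this antigen.
        selected = sorted(antibodies, key=lambda ab: dist(ab, ag))[:n_select]
        for parent in selected:
            d = min(1.0, dist(parent, ag) / max_dist)  # 0 means best match
            # Cloning: higher affinity yields more clones.
            n_clones = max(1, round(max_clones * (1.0 - d)))
            for _ in range(n_clones):
                # Mutation: the step toward the antigen grows as the
                # affinity of the parent drops.
                rate = d * rng.random()
                clone = [p - rate * (p - a) for p, a in zip(parent, ag)]
                # Clonal suppression: drop clones with low antigen affinity.
                if dist(clone, ag) <= clonal_threshold:
                    memory.append(clone)
    # Network suppression: drop antibodies too similar to one already kept.
    centroids = []
    for ab in memory:
        if all(dist(ab, kept) >= network_threshold for kept in centroids):
            centroids.append(ab)
    return centroids
```

With two well-separated groups of minority examples, the returned centroids stay inside those groups: clones derived from an antibody of the other group remain too far from the antigen and are removed by clonal suppression, while near-duplicates inside a group are thinned by network suppression.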
We also propose an integrated method called ICOTE + ENN, which combines ICOTE with Wilson's edited nearest neighbor rule (i.e., ENN) [6]. In this integrated method, ICOTE oversamples minority class examples, and ENN discards "dirty" examples that deviate from their own class space. When the class of an example differs from the class of more than half of its nearest neighbors, the example is removed from the training set. The result of the integrated method is illustrated in Figure 5, which shows that the hybrid method separates the two class spaces. In the next section, we present our empirical results for the two methods.
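The ENN cleaning rule stated above can be sketched as follows. The neighborhood size is not given in the text, so k = 3 is an assumption, as are the function name and the plain-list data layout. Removal decisions are all made against the original training set rather than sequentially.

```python
import math

def edited_nearest_neighbors(X, y, k=3):
    """Return (X_kept, y_kept) after removing every example whose class
    differs from the class of more than half of its k nearest neighbors.

    k = 3 is an assumed default; the text does not state the value used.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances from example i to every other example (leave-one-out).
        others = sorted(
            (math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj))), j)
            for j, xj in enumerate(X) if j != i
        )
        neighbor_labels = [y[j] for _, j in others[:k]]
        disagree = sum(1 for label in neighbor_labels if label != yi)
        # Keep the example unless more than half of its neighbors disagree.
        if disagree * 2 <= len(neighbor_labels):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

In an ICOTE + ENN style pipeline, this function would be applied to the training set after the synthetic minority examples have been added.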

Experiments
In this section, we investigate the performance of our proposed oversampling methods ICOTE and ICOTE + ENN and compare them with the existing well-known oversampling methods.

Experimental Settings.
Our experiments are conducted with three base classifiers: kNN, C4.5, and SVM. We use these algorithms because they are available within the KEEL software tool [19]. In the experiments, the parameter values are set based on the recommendations of the corresponding authors. The specific settings are as follows.
(1) Instance based learning (kNN) [20]: in this algorithm, we set k = 1 and use the Euclidean distance metric.
(2) C4.5 decision tree [21]: for C4.5, we set a confidence level as 0.25 and the minimum number of item sets per leaf as 2 and use pruning.
(3) Support vector machines (SVM) [22]: for SVM, we choose Polykernel reference functions, setting an internal parameter 1.0 for the exponent of each kernel function and a penalty parameter of the error term as 1.0.
We conduct experiments on 38 datasets from the KEEL dataset repository [23], whose characteristics are summarized in Table 1, namely, the number of examples (#Ex.), number of attributes (#Atts.), and imbalance ratio (IR). The experiments are evaluated in terms of one of the popular metrics, the area under the ROC curve (AUC) [24,25]. The experimental results are obtained using 5-fold cross-validation.
We choose 5-fold cross-validation, because it can keep sufficient positive class instances in different folds. Thus, we can avoid additional problems in the data distribution which were discussed in [26,27], especially for highly imbalanced datasets.
We must point out that the dataset partitions employed in this paper are available from the KEEL dataset repository [23], so that researchers can use the same data partitions for comparisons.
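The reason a small number of folds helps can be seen from a minimal partitioning sketch. The experiments use the partitions distributed with KEEL; the code below is only an illustration, and it assumes class-stratified assignment, a hypothetical function name, and a fixed random seed.

```python
import random

def stratified_folds(labels, n_folds=5, seed=0):
    """Split example indices into n_folds folds, spreading each class
    as evenly as possible across the folds (assumed stratification)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds
```

For a dataset with 10 positive and 90 negative examples, five folds leave 2 positive examples in each test fold, whereas ten folds would leave only 1.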

Evaluation in Imbalanced Domains.
In imbalanced domains, a well-known approach to unify the per-class measures and to produce a single evaluation criterion is to use the receiver operating characteristic (ROC) graphic [24]. This graphic allows the visualization of the trade-off between the benefits ($TP_{rate}$) and costs ($FP_{rate}$), as it shows that a classifier cannot increase the number of true positives without also increasing the false positives. The area under the ROC curve (AUC) [25] corresponds to the probability of correctly identifying which one of two stimuli is noise and which one is signal plus noise. The AUC provides a single measure of a classifier's performance for evaluating which model is better on average. The AUC measure is computed as the area under this graphic; for a classifier that yields a single operating point $(FP_{rate}, TP_{rate})$, this area is

$$\mathrm{AUC} = \frac{1 + TP_{rate} - FP_{rate}}{2}.$$

AUC combines the individual measures of both the positive and negative classes, so we can use it to measure the quality of the results of different paradigms for imbalanced data.
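As a worked example, assume the classifier outputs hard labels, so that the ROC graphic consists of the single point $(FP_{rate}, TP_{rate})$ joined to (0, 0) and (1, 1). The area under that graphic is then (1 + TP_rate - FP_rate) / 2, which the sketch below computes from true and predicted labels. The function name and the "positive" label are hypothetical.

```python
def auc_single_point(y_true, y_pred, positive="positive"):
    """AUC of a crisp classifier: (1 + TP_rate - FP_rate) / 2."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("both classes must be present in y_true")
    tp_rate = tp / (tp + fn)  # benefits: fraction of positives recovered
    fp_rate = fp / (fp + tn)  # costs: fraction of negatives misclassified
    return (1.0 + tp_rate - fp_rate) / 2.0
```

For instance, with 2 positive and 4 negative examples, recovering 1 positive and misclassifying 1 negative gives TP_rate = 0.5 and FP_rate = 0.25, so AUC = (1 + 0.5 - 0.25) / 2 = 0.625. A classifier that always predicts the majority class scores 0.5, which is why this measure is more informative than accuracy on imbalanced data.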

Experimental Results.
In this section, we investigate the performance of the resampling methods on the imbalanced datasets listed in Table 1.
As shown in previous work [14, Table 4] on the KEEL datasets, SMOTE [4] and SMOTE + ENN [5] have the highest rank for the three classification algorithms (kNN, C4.5, and SVM) used in that study, and both ADASYN [28] and SL-SMOTE [29] achieve the second highest AUC values. We therefore select these four resampling algorithms and compare our ICOTE and ICOTE + ENN with them. The average AUC results of the different resampling methods with the three base learners kNN, C4.5, and SVM over all 38 datasets are shown in Table 2. We also report the results obtained with the three base learners directly, without any resampling technique, denoted as "none" in Table 2. Please note that our experimental results on each dataset are shown in the Appendix. From Table 2, we can see that our methods ICOTE and ICOTE + ENN perform much better than the other four resampling methods on all three base learners, and that ICOTE + ENN improves on the performance of ICOTE. Our experimental results also show that SMOTE and SMOTE + ENN perform better than SL-SMOTE and ADASYN. SMOTE + ENN slightly improves on the performance of SMOTE on all three base learners. Between SL-SMOTE and ADASYN, SL-SMOTE performs better. That is, ADASYN is the worst among the six resampling methods.
Besides the average results shown in Table 2, we also rank the resampling methods on each dataset with each base learner. The average ranks of each method with each base learner are shown in Figure 6. From Figure 6, we can see that the average rank of ICOTE + ENN is the best under any one of the three base learners. ICOTE ranks the second consistently. SMOTE + ENN always ranks the third. SMOTE always ranks the fourth. It is obvious that "none" (without using resampling techniques) performs the worst when either C4.5 or SVM is used as the base learner. Between SL-SMOTE and ADASYN, SL-SMOTE always performs better than ADASYN. These conclusions are consistent with the conclusions we made based on the average AUC, shown in Table 2.
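The average-rank computation behind this kind of comparison can be sketched as follows. The tie-handling rule (tied methods share the mean of the ranks they occupy) and the function name are assumptions, and the method names and scores in the usage note are invented for illustration and are not results from this paper.

```python
def average_ranks(scores):
    """Average rank of each method over datasets (rank 1 = highest score).

    scores maps a method name to a list with one score per dataset.
    Tied methods share the mean of the ranks they occupy (assumed rule).
    """
    methods = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for d in range(n_datasets):
        ordered = sorted(methods, key=lambda m: scores[m][d], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j over the run of methods tied with position i.
            j = i
            while (j + 1 < len(ordered)
                   and scores[ordered[j + 1]][d] == scores[ordered[i]][d]):
                j += 1
            shared = ((i + 1) + (j + 1)) / 2.0
            for m in ordered[i:j + 1]:
                totals[m] += shared
            i = j + 1
    return {m: totals[m] / n_datasets for m in methods}
```

With the made-up scores {"A": [0.9, 0.8, 0.7], "B": [0.8, 0.8, 0.9], "C": [0.7, 0.6, 0.8]}, method B obtains ranks 2, 1.5, and 1, for an average rank of 1.5, the best of the three.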
To find out which pairwise differences between these methods are statistically significant, we carry out a Shaffer post hoc test [30], which is shown in Tables 3-5. In these tables, a "+" symbol implies that the algorithm in the row is statistically better than the one in the column, "−" implies the contrary, and "=" means that the two algorithms compared show no significant difference. In brackets, the unadjusted p-value associated with each comparison is also presented. Shaffer's procedure rejects those hypotheses that have an unadjusted p-value ≤ 0.002. To explain why ICOTE and ICOTE + ENN obtain the highest performance, we emphasize two plausible reasons. The first is related to the addition of significant information within the minority class examples by including immune centroids of clusters. These immune centroids allow the formation of larger clusters that help the classifiers to separate both classes, and the cleaning procedure of ICOTE + ENN also benefits the generalization ability during learning. The second reason is that the immune centroids represent inherent clusters of the minority class examples and overcome ...

Conclusions
In this paper we present two oversampling methods based on immune network theory. We draw the following conclusions based on our experimental results and analyses.
(3) We compare our proposed methods ICOTE and ICOTE + ENN with representative resampling methods. Our experimental results show that our approaches yield significant improvements.