Imbalanced Learning Based on Logistic Discrimination

In recent years, the imbalanced learning problem has attracted increasing attention from both academia and industry. The problem concerns the performance of learning algorithms on data with severely skewed class distributions. In this paper, we apply the well-known statistical model logistic discrimination to this problem and propose a novel method to improve its performance. To account for class imbalance, we design a new cost function that takes into account the accuracies of both the positive class and the negative class as well as the precision of the positive class. Unlike traditional logistic discrimination, the proposed method learns its parameters by maximizing the proposed cost function. Experimental results show that, compared with other state-of-the-art methods, the proposed method performs significantly better on recall, G-mean, F-measure, AUC, and accuracy.


Introduction
Recently, the class imbalance problem, also called the skewed or rare class problem, has drawn significant interest from academia, industry, and government. For the two-class case, the problem is characterized by one class (the majority or negative class) having many more examples than the other (the minority or positive class) [1][2][3]. In many real-world applications, correctly predicting positive-class examples is more important than correctly predicting negative-class ones. For example, in cancer detection, most patients have common diseases and only a few have cancer, so effectively recognizing the cancer patients is what matters most. However, conventional classification methods such as C4.5, naive Bayes, and neural networks pursue high accuracy by assuming that all classes have similar sizes, so rare class examples are often overlooked and misclassified as the majority class [4,5].
Many approaches have been proposed to tackle this problem, and they can be roughly categorized into three levels: the data preprocessing level, the algorithm learning level, and the prediction postprocessing level. At the data preprocessing level, algorithms focus more on positive-class examples through one of three approaches: (1) running on rebalanced data sets obtained by manipulating the data space [6,7], for example by undersampling or oversampling; (2) actively selecting the more valuable examples for learning and leaving out the less informative ones to improve models' performance [8,9]; and (3) weighting the data space using information about misclassification costs to avoid costly errors [10]. Approaches at the algorithm learning level adjust existing classifier learning algorithms so that the learned models are biased towards correctly classifying positive-class examples, such as two-phase rule induction [11] and one-class learning. Existing approaches at the prediction postprocessing level focus more on the positive class by moving a decision threshold [12] or minimizing a cost function [13].
In this paper, we reconsider the imbalanced problem at the algorithm level and propose a novel method called ILLD (Imbalanced Learning Based on Logistic Discrimination) to tackle it. The work is motivated by the following observation: very few studies have examined logistic discrimination on the class imbalance problem, although it has many merits, including understandability, a solid theoretical basis, and, most importantly, high generalization ability. Unlike traditional logistic discrimination, ILLD achieves high performance on imbalanced data by maximizing the proposed cost function APM (Accuracy-Precision Based Metric), which takes into account the accuracies of both the positive class and the negative class as well as the precision of the positive class. Experimental results show that ILLD substantially improves the performance of logistic discrimination on recall, F-measure, G-mean, and AUC while keeping its high accuracy. Compared with other state-of-the-art classification methods, ILLD shows much better performance.
Computational Intelligence and Neuroscience
The rest of this paper is organized as follows: after presenting related work in Section 2, Section 3 describes the proposed imbalanced learning method; Section 4 presents the experimental results; and, finally, Section 5 concludes this work.

Related Work
2.1. Imbalanced Learning. Technically speaking, any data set that exhibits an unequal distribution between its classes can be considered imbalanced or skewed. In the community, however, only data sets exhibiting extreme imbalances are treated as imbalanced. There are two forms of imbalance, namely, within-class imbalance and between-class imbalance. In within-class imbalance, some subconcepts are represented by only a limited number of examples, which increases the difficulty of classifying examples correctly. In between-class imbalance, one class severely out-represents another [1,2]. The second form is the one usually discussed in the community.
There are many factors that influence the modeling of a capable classifier when facing rare events. Examples include the skewed data distribution which is considered to be the most influential factor, small sample size, separability, and existence of within-class subconcepts [14].
The skewed data distribution is often denoted by imbalance degree which is the ratio of the sample size of the positive class to that of the negative class. Reported studies indicate that a relatively balanced distribution usually attains a better result. However, to what imbalance degree the class distribution deteriorates the classification performance cannot be stated explicitly, since other factors such as sample size and separability also affect performance [1,2,14].
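As a small illustration of this definition, the following Python sketch computes the imbalance degree from a list of labels. The function name `imbalance_degree` and the "+" / "-" label encoding are assumptions made for the example, not notation taken from the paper.

```python
from collections import Counter


def imbalance_degree(labels, positive="+"):
    """Ratio of the positive-class sample size to the negative-class sample size."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    if n_neg == 0:
        raise ValueError("the data set contains no negative examples")
    return n_pos / n_neg


# Hypothetical data set: 5 positive and 95 negative examples.
labels = ["+"] * 5 + ["-"] * 95
degree = imbalance_degree(labels)  # 5 / 95, roughly 0.053
```

A value near 1 indicates a balanced data set; the smaller the value, the more severe the skew.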
A small sample size means that the available data are limited, so uncovering the regularities inherent in a small class is unreliable. In [15], the authors suggest that an imbalanced class distribution may not hinder classification when a large enough data set is provided.
The difficulty in separating the rare class from the prevalent class is the key issue of the imbalanced problem. If each class has highly discriminative patterns, then only simple rules are required to distinguish the classes. However, if the patterns of the classes overlap, discriminative rules are hard to induce [1,2,14].
Within-class subconcepts mean that a single class is composed of various subclusters or subconcepts. Instances of a class are collected from different subconcepts, and these subconcepts do not always contain the same number of instances. The presence of within-class subconcepts worsens the imbalanced distribution problem [14]. In general, we only consider the imbalanced data distribution in imbalanced learning and hold the other factors fixed.

2.2. Logistic Discrimination. Logistic discrimination, also called logistic regression, is a typical probabilistic statistical classification model [16], which has been widely used in many fields such as medicine and social surveys because of its understandability, solid theoretical basis, and, most importantly, high generalization ability. For the two-class case, logistic discrimination is defined as

$$p(y = + \mid \mathbf{x}) = \sigma(\mathbf{w}^{T}\mathbf{x}), \tag{1}$$

$$p(y = - \mid \mathbf{x}) = 1 - \sigma(\mathbf{w}^{T}\mathbf{x}), \tag{2}$$

where $\sigma(\cdot)$ is the logistic sigmoid function defined as

$$\sigma(a) = \frac{1}{1 + \exp(-a)}. \tag{3}$$

For a given data set $D = \{(\mathbf{x}_n, y_n) \mid n = 1, 2, \ldots, N\}$, where $y_n \in \{+, -\}$ is the label associated with example $\mathbf{x}_n$, the likelihood function of this model can be written as

$$p(D \mid \mathbf{w}) = \prod_{n=1}^{N} p(y = + \mid \mathbf{x}_n)^{1 - t_n} \, p(y = - \mid \mathbf{x}_n)^{t_n}, \tag{4}$$

where $t_n = 0$ if $y_n = +$ and 1 otherwise. Defining a cost function by taking the negative logarithm of the likelihood, we have the cross-entropy error function in the form

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \left\{ (1 - t_n) \ln p(y = + \mid \mathbf{x}_n) + t_n \ln p(y = - \mid \mathbf{x}_n) \right\}. \tag{5}$$

Logistic discrimination uses (5) as the cost function; however, (5) is not suitable for the class imbalance problem because the cross-entropy error function does not consider the importance of each class. To handle this problem, a novel cost function called APM (Accuracy-Precision Based Metric) is proposed, which takes into account the accuracies of both the positive and negative classes as well as the precision of the positive class. For more details refer to Section 3.
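The model and its cross-entropy error can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the posterior is taken to be the sigmoid of a linear score, the bias is absorbed as a constant first feature, and the function names and toy data are hypothetical.

```python
import math


def sigmoid(a):
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def prob_positive(w, x):
    """Posterior probability of the positive class for one example (assumed linear score)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def cross_entropy(w, X, y, eps=1e-12):
    """Negative log-likelihood of the labels; every example counts equally."""
    total = 0.0
    for x, label in zip(X, y):
        p = min(max(prob_positive(w, x), eps), 1.0 - eps)  # avoid log(0)
        total -= math.log(p) if label == "+" else math.log(1.0 - p)
    return total


# Toy data (hypothetical): the first feature is a constant 1.0 acting as the bias.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0]]
y = ["+", "-", "+", "-"]
loss_at_zero = cross_entropy([0.0, 0.0], X, y)  # every posterior is 0.5
loss_at_fit = cross_entropy([0.0, 2.0], X, y)   # a weight vector that separates the toy data
```

Because each example contributes one term to the sum, a data set dominated by negative examples yields an error dominated by the negative class, which is the weakness the text attributes to (5).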

2.3. Strategies to Handle Imbalanced Problem. The imbalanced problem arises from the scarce representation of the most important examples, so learned models tend to focus on normal examples and overlook the rare class examples. Many approaches have been proposed to handle the problem; they fall mainly into the following three categories.
(i) Data preprocessing based strategy. These techniques preprocess the given imbalanced data set to change the data distribution so that standard learning algorithms focus more on the cases that are relevant to the user. Reported preprocessing techniques can be categorized into three types: resampling, active learning, and weighting the data space. The objective of resampling is to rebalance the class distribution by resampling the data space. Commonly used resampling methods include randomly or informatively undersampling instances of the negative class [6], randomly oversampling examples of the positive class, oversampling based on clustering algorithms [17,18], and oversampling the positive class by creating new synthetic instances [7]. Resampling is often used to deal with imbalanced learning problems, but the real class distribution is always unknown and differs from data set to data set. Active learning selects the more valuable examples for learning models and leaves out the less informative ones, improving models' performance by interacting with the user. Several approaches based on active learning have been proposed. For example, Ertekin [9] presented an adaptive oversampling algorithm called VIRTUAL (Virtual Instances Resampling Technique Using Active Learning) that generates synthetic examples for the positive class during the training process, and Mi [19] developed a method that combines SMOTE and active learning with SVM. Weighting strategies aim to modify the training data distribution using information about misclassification costs; for example, Wang and Japkowicz [10] combined an ensemble of SVMs with asymmetric misclassification costs.
(ii) Algorithm based strategy. This strategy modifies existing classifier learning algorithms so that the learned models are biased towards the cases that matter most to the user. Many algorithm-based imbalanced learning approaches have been proposed; for example, Cao et al. [20] presented a framework for improving the performance of cost-sensitive neural networks that adopts Particle Swarm Optimization to optimize misclassification cost, feature subset, and intrinsic structure parameters, and Alejo et al. [21] proposed two strategies for dealing with imbalanced domains using RBF neural networks that include a cost function in the training phase.
(iii) Prediction postprocessing based strategy. Approaches of this type learn a standard model on the original data set and modify only the predictions of the learned model according to the user preferences and the imbalance of the data set. There are two main types of solutions: the threshold method and cost-sensitive postprocessing. In the former, each example is associated with a score that expresses the degree to which the example is a member of a class, and different classifiers are obtained by varying the threshold on this score above which an example is assigned to the class [12]. In the latter, several methods exist for making models cost-sensitive in a post hoc manner. This type of strategy has mainly been explored for classification tasks and changes only the model predictions to make them cost-sensitive [13].
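The threshold method in (iii) can be illustrated with a short Python sketch. The scores below are made-up values standing in for the positive-class scores of some trained model, and the function name is hypothetical; the point is only that lowering the threshold turns more examples into positive predictions without retraining.

```python
def predict_with_threshold(scores, threshold=0.5):
    """Label an example positive when its positive-class score reaches the threshold."""
    return ["+" if s >= threshold else "-" for s in scores]


# Hypothetical positive-class scores produced by an already trained model.
scores = [0.9, 0.45, 0.35, 0.2, 0.1]
default_predictions = predict_with_threshold(scores)        # threshold 0.5
lowered_predictions = predict_with_threshold(scores, 0.3)   # favors the positive class
```

With the default threshold only the first example is predicted positive; with the threshold lowered to 0.3 the first three are.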
In this paper, we propose a novel algorithm-based imbalanced learning method to improve the performance of logistic discrimination. In addition, we apply sampling techniques to logistic discrimination to enhance its performance. Two widely used sampling techniques are selected: random undersampling and random oversampling. The corresponding experimental results are presented in Section 4.
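The two sampling techniques can be sketched as follows. The paper does not state the target class ratio, so this example assumes rebalancing to a 1:1 ratio; the function names, the seed handling, and the "+" / "-" encoding are likewise assumptions, and the positive class is assumed to be the minority.

```python
import random


def _split_indices(y, positive):
    pos = [i for i, label in enumerate(y) if label == positive]
    neg = [i for i, label in enumerate(y) if label != positive]
    return pos, neg


def random_undersample(X, y, positive="+", seed=0):
    """Keep all positives and an equally sized random subset of the negatives."""
    rng = random.Random(seed)
    pos, neg = _split_indices(y, positive)
    keep = pos + rng.sample(neg, min(len(pos), len(neg)))
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, positive="+", seed=0):
    """Keep everything and duplicate random positives until both classes match in size."""
    rng = random.Random(seed)
    pos, neg = _split_indices(y, positive)
    if not pos:
        raise ValueError("oversampling needs at least one positive example")
    extra = [rng.choice(pos) for _ in range(max(len(neg) - len(pos), 0))]
    keep = pos + neg + extra
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): example i has feature value i; 4 positives, 16 negatives.
X = list(range(20))
y = ["+"] * 4 + ["-"] * 16
X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
```

Undersampling discards negative examples and so shrinks the training set, whereas oversampling keeps all original examples at the price of duplicated positives.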

Imbalanced Learning Based on Logistic Discrimination
3.1. Accuracy-Precision Based Metric. Traditional logistic discrimination learns its parameters by minimizing the cross-entropy error function defined in (5). However, this approach ignores the different costs of the classes, so the learned models perform poorly on the positive class. To tackle this problem, a novel cost function is proposed to guarantee that the learned models perform well on both the positive class and the negative class. Two quantities are defined for each class in (6), in terms of the class posterior probability of each example given by (1) or by (2): the first is an estimate of the number of examples correctly classified as that class, and the second is an estimate of the number of examples of that class that are incorrectly classified. For the two-class problem, these quantities are related as in (7). Let class "+" be the positive class as before; the cost function APM is then defined by (8). Since the two quantities estimate the numbers of correctly and incorrectly classified examples, APM is an estimate of the expression in (9), which is written in terms of accuracy_+, accuracy_−, and precision_+, where accuracy_+ is the accuracy (or recall) of the positive class (+). Similarly, accuracy_− is the accuracy (or recall) of the negative class (−) and precision_+ is the precision of the positive class (+). More details about these measures are discussed in Section 4.2. In this way, APM considers all three factors: the precision of the minority class and the recall of both the minority class and the majority class.

Algorithm 1
Input: training data set; a parameter greater than zero
Output: learned parameters $\mathbf{w}$
Process:
(1) randomly initialize $\mathbf{w}^{(0)}$;
(2) $\mathbf{H}^{(0)} = \mathbf{I}$ (unit matrix);
(3) $\mathbf{g}_0 = \nabla \mathrm{APM}$ // calculate the gradient of the objective function by (14)
(4) $\mathbf{w}^{(1)}$ …
…
(8) calculate the gradient of APM as $\mathbf{g}_k$ using (14);
(9) update $\mathbf{p}^{(k-1)}$ and $\mathbf{q}^{(k-1)}$ using (17);
(10) update $\mathbf{H}^{(k)}$ using (16);
…
Taking the gradient of APM (see (8)) with respect to $\mathbf{w}$ results in (10), whose terms are given by (11), (12), and (13). Combining (11), (12), (13), and (10), we obtain the gradient of APM defined by (8), which is given in (14). The proposed method for the imbalanced problem uses the quasi-Newton method BFGS, which relies on the gradient in (14) to learn its parameters. For more details refer to Section 3.2.
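To make the idea of estimating class-wise counts from posterior probabilities concrete, the sketch below shows one plausible reading of such a cost function. It is an assumption-laden illustration, not the definition in (8): the soft counts are sums of posterior probabilities, the three terms are added with equal weight, and all names and the toy data are hypothetical.

```python
import math


def sigmoid(a):
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def soft_apm(w, X, y):
    """Assumed form: soft recall(+) + soft recall(-) + soft precision(+)."""
    tp = fn = tn = fp = 0.0
    for x, label in zip(X, y):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # posterior of class "+"
        if label == "+":
            tp += p          # expected contribution to correctly classified positives
            fn += 1.0 - p    # expected contribution to misclassified positives
        else:
            tn += 1.0 - p    # expected contribution to correctly classified negatives
            fp += p          # expected contribution to misclassified negatives
    recall_pos = tp / (tp + fn) if tp + fn > 0 else 0.0
    recall_neg = tn / (tn + fp) if tn + fp > 0 else 0.0
    precision_pos = tp / (tp + fp) if tp + fp > 0 else 0.0
    return recall_pos + recall_neg + precision_pos


# Toy data (hypothetical): constant bias feature plus one informative feature.
X = [[1.0, 2.0], [1.0, 3.0]] + [[1.0, -1.0]] * 8
y = ["+", "+"] + ["-"] * 8
value_at_zero = soft_apm([0.0, 0.0], X, y)   # all posteriors equal 0.5
value_at_fit = soft_apm([0.0, 5.0], X, y)    # weights that separate the toy data
```

Unlike the cross-entropy error, each term here is normalized by a class-specific count, so the two positive examples influence the objective as much as the eight negative ones. Because the soft counts are smooth in the parameters, such an objective can be handed to a gradient-based optimizer.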

3.2. Algorithm. Based on the cost function APM proposed in Section 3.1, a novel imbalanced learning approach called ILLD (Imbalanced Learning Based on Logistic Discrimination) is proposed to tackle data imbalance. ILLD uses the quasi-Newton method BFGS [22][23][24], an iterative process, to maximize the cost function and learn the parameters. The iterative update is given by (15), in which the step length is taken along the Newton direction of the $k$th iteration and $\mathbf{H}$ is the approximate Hessian matrix calculated by (16), with $\mathbf{p}$ and $\mathbf{q}$ given by (17). The details of the learning process of ILLD are shown in Algorithm 1. ILLD first initializes $\mathbf{w}^{(0)}$ randomly and sets $\mathbf{H}^{(0)}$ to the unit matrix, whose diagonal elements equal 1 and whose other elements equal 0 (lines 1∼2), and calculates $\mathbf{w}^{(1)}$ using (11) based on $\mathbf{w}^{(0)}$ and $\mathbf{H}^{(0)}$ (lines 3∼4). Then ILLD optimizes the cost function APM to find the best parameter vector $\mathbf{w}$ (lines 4∼11). Specifically, for the $k$th iteration, ILLD calculates the gradient of APM as $\mathbf{g}_k$ using (14) and, based on $\mathbf{g}_k$ and $\mathbf{g}_{k-1}$, updates $\mathbf{p}^{(k-1)}$ and $\mathbf{q}^{(k-1)}$ using (17) (lines 8∼9). Then, it updates $\mathbf{H}^{(k)}$ using (16) (line 10) and, finally, updates $\mathbf{w}^{(k+1)}$ using (15) (line 11). The convergence rate of ILLD is ( 2 ) [22][23][24] and the stopping condition is that the absolute difference between the values calculated by (15) for two consecutive iterations is not larger than 0.001 (line 13).
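The overall shape of such a BFGS loop can be sketched in plain Python. Several choices here are assumptions rather than details from the paper: the analytic gradient (14) is replaced by a central-difference approximation, the step length is found by a simple backtracking search, the matrix `H` is maintained as an approximation of the inverse Hessian of the negated objective, and the stopping rule compares objective values of consecutive iterations. The toy objective is hypothetical and merely stands in for APM.

```python
def numerical_gradient(f, w, h=1e-6):
    """Central-difference gradient; stands in for an analytic gradient."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def bfgs_maximize(f, w0, tol=1e-3, max_iter=200):
    """Maximize f with BFGS; H approximates the inverse Hessian of -f."""
    n = len(w0)
    w = list(w0)
    H = [[float(i == j) for j in range(n)] for i in range(n)]  # start from the unit matrix
    g = numerical_gradient(f, w)
    value = f(w)
    for _ in range(max_iter):
        d = [sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]  # ascent direction
        slope = sum(di * gi for di, gi in zip(d, g))
        if slope <= 0.0:
            break  # gradient vanished, nothing left to gain
        # Backtracking line search: halve the step until the Armijo condition holds.
        step, accepted = 1.0, False
        while step > 1e-12:
            w_new = [wi + step * di for wi, di in zip(w, d)]
            new_value = f(w_new)
            if new_value >= value + 1e-4 * step * slope:
                accepted = True
                break
            step *= 0.5
        if not accepted:
            break
        g_new = numerical_gradient(f, w_new)
        p = [a - b for a, b in zip(w_new, w)]  # change in the parameters
        q = [a - b for a, b in zip(g, g_new)]  # change in the gradient of -f
        pq = sum(pi * qi for pi, qi in zip(p, q))
        if pq > 1e-12:  # curvature condition; otherwise keep the old H
            rho = 1.0 / pq
            Hq = [sum(H[i][j] * q[j] for j in range(n)) for i in range(n)]
            c = rho * rho * sum(qi * hi for qi, hi in zip(q, Hq)) + rho
            H = [[H[i][j] - rho * (p[i] * Hq[j] + Hq[i] * p[j]) + c * p[i] * p[j]
                  for j in range(n)] for i in range(n)]
        converged = abs(new_value - value) <= tol
        w, g, value = w_new, g_new, new_value
        if converged:
            break
    return w, value


def toy_objective(w):
    """Hypothetical concave objective with its maximum at (1.0, -0.5)."""
    return -(w[0] - 1.0) ** 2 - 2.0 * (w[1] + 0.5) ** 2


w_star, best_value = bfgs_maximize(toy_objective, [0.0, 0.0], tol=1e-12)
```

The default `tol=1e-3` mirrors the 0.001 threshold mentioned in the text; the example call tightens it only so that the toy maximum is located accurately. In practice a library routine would normally be preferred over a hand-written loop.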
3.3. Discussion. Unlike traditional logistic discrimination, which considers only overall performance, ILLD takes more factors into account through the accuracy-precision based metric. Indeed, this criterion involves the accuracies (or recalls) of both the positive class and the negative class as well as the precision of the positive class, all of which are derived from the prediction confusion matrix (discussed in Section 4.2). Thus ILLD considers not only the overall performance of logistic discrimination but also its performance on each class.
Considering all but the last term of APM defined by (8), we have (18). Similarly, considering all but the last term of (9), we have the evaluation measure AUC [2], as shown in (19). Therefore, the proposed measure (without the last term) is an estimate of AUC. Besides, comparing (18), (19), and the definition of G-mean,

$$G\text{-mean} = \sqrt{\mathrm{accuracy}_{+} \times \mathrm{accuracy}_{-}}, \tag{20}$$

we conclude that the proposed measure (without the last term) uses the arithmetic mean of the accuracies (or recalls) of the positive class and the negative class, instead of the geometric mean, as the cost function to supervise the learning process of logistic discrimination.
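A short worked comparison shows how the two means differ. The accuracy values below are made up for illustration only.

```latex
\begin{align*}
\text{Case A: } & \mathrm{accuracy}_{+} = 0.50,\ \mathrm{accuracy}_{-} = 0.98: &
  \tfrac{1}{2}(0.50 + 0.98) &= 0.74, & \sqrt{0.50 \times 0.98} &= 0.70,\\
\text{Case B: } & \mathrm{accuracy}_{+} = 0.00,\ \mathrm{accuracy}_{-} = 1.00: &
  \tfrac{1}{2}(0.00 + 1.00) &= 0.50, & \sqrt{0.00 \times 1.00} &= 0.00.
\end{align*}
```

In Case B a classifier that ignores the positive class entirely still obtains an arithmetic mean of 0.50, while its geometric mean collapses to zero. The arithmetic mean is therefore the more forgiving of the two, although both reward improvements on either class.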
Omitting the second term of both (11) and (12) yields (21). We observe from (21) that the proposed cost function combines the accuracy (or recall) and the precision of the positive class, as the F-measure does.

4.1. Data Sets and Experimental Setup. The 14 data sets used in this paper are randomly selected from the UCI repository [25]. Of these data sets, breast-Wisconsin, hepatitis, horse-colic, and ionosphere are imbalanced 2-class data sets. The others are 2-class imbalanced data sets derived from multiclass data sets by treating one class as the positive class and the union of all other classes as the negative class [26]. The imbalance degree of these data sets varies from 0.0376 (highly imbalanced) to 0.3696 (only slightly imbalanced), where the imbalance degree is defined as the ratio of the sample size of the positive class to that of the negative class. The details of the data sets are shown in Table 1, where #Degree is the imbalance degree, #Exs is the size of the data set, #Attrs is the number of attributes, and #Cls is the number of classes. For each data set, a 5 × 2-fold cross-validation [27] is performed.
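The two preparation steps described above, deriving a 2-class problem from a multiclass one and running a 5 × 2-fold cross-validation, can be sketched as follows. The function names are hypothetical, and the splits here are plain random halves; whether the original experiments used stratified folds is not stated, so that detail is left out.

```python
import random


def binarize(labels, positive_class):
    """Treat one class as positive and the union of all other classes as negative."""
    return ["+" if label == positive_class else "-" for label in labels]


def five_by_two_splits(n_examples, seed=0):
    """Yield (train_indices, test_indices) for 5 repetitions of 2-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(5):
        indices = list(range(n_examples))
        rng.shuffle(indices)
        half = n_examples // 2
        fold_a, fold_b = indices[:half], indices[half:]
        yield fold_a, fold_b  # train on the first half, test on the second
        yield fold_b, fold_a  # then swap the roles of the two halves


# Hypothetical multiclass labels turned into a 2-class problem.
binary_labels = binarize(["a", "b", "c", "a"], positive_class="a")
splits = list(five_by_two_splits(11, seed=1))  # 10 train/test pairs in total
```

Each repetition reshuffles the data, so the ten resulting train/test pairs give ten performance estimates per method and data set.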
To evaluate the performance of ILLD, we compare it with LD, LD-US, and LD-OS, where LD denotes traditional logistic discrimination (with the cross-entropy error function as the cost function) applied directly to the imbalanced problem, and LD-US and LD-OS denote LD run on data sets obtained by undersampling and by oversampling the training data sets, respectively. Prediction postprocessing approaches such as the threshold method [12] are not used for comparison, since the study in [28] concluded that moving the decision threshold, applying a sampling strategy, and adjusting the cost matrix produce classifiers with the same performance.

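4.2. Evaluation Metrics. Evaluation metrics are essential for assessing the effectiveness of an algorithm and, traditionally, accuracy is the most frequently used one. Considering the two-class classification problem and letting "+" and "−" be the positive and negative classes, respectively, examples can be categorized into four groups after a classification process, as denoted in the confusion matrix presented in Table 2, and thus the accuracy is defined as

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \tag{22}$$

where $TP$ and $TN$ are the numbers of correctly classified positive and negative examples, $FP$ is the number of negative examples classified as positive, and $FN$ is the number of positive examples classified as negative. However, the evaluation metrics used for the balanced problem are very different from those used for the imbalanced one, and accuracy is inadequate for imbalanced learning. In lieu of accuracy, other assessment metrics including recall, precision, F-measure, and G-mean are frequently adopted in the research community to evaluate the performance of models on imbalanced learning problems. These metrics are based on the accuracies of both the positive class and the negative class and the precision of the positive class; specifically,

$$\mathrm{accuracy}_{+} = \mathrm{recall} = \frac{TP}{TP + FN}, \qquad \mathrm{accuracy}_{-} = \frac{TN}{TN + FP}, \qquad \mathrm{precision}_{+} = \frac{TP}{TP + FP}. \tag{23}$$

Then, F-measure is defined as

$$F\text{-measure} = \frac{(1 + \beta^{2}) \cdot \mathrm{precision}_{+} \cdot \mathrm{accuracy}_{+}}{\beta^{2} \cdot \mathrm{precision}_{+} + \mathrm{accuracy}_{+}}, \tag{24}$$

where $\beta$ is a coefficient that adjusts the relative importance of precision versus recall (usually, $\beta = 1$), and G-mean is the geometric mean of accuracy_+ and accuracy_−, as in (20).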
From (24), the F-measure combines recall and precision into a single measure of classification effectiveness, with the relative importance of recall (accuracy_+) and precision (precision_+) weighted as determined by the user. Thus, the F-measure represents a harmonic mean of recall and precision. Like the F-measure, G-mean evaluates models' performance by combining two metrics; specifically, G-mean measures the balanced performance of a classifier using the geometric mean of the recall of the positive class and that of the negative class.
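The metrics discussed in this section can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function names, the dictionary keys, and the convention of returning 0.0 when a denominator is zero are choices made for this example, and the two prediction vectors are made up.

```python
import math


def confusion_counts(y_true, y_pred, positive="+"):
    """Return (TP, FP, TN, FN) for a two-class problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, beta=1.0):
    """Accuracy, class-wise recall, precision, F-measure, and G-mean."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    recall_pos = tp / (tp + fn) if tp + fn else 0.0
    recall_neg = tn / (tn + fp) if tn + fp else 0.0
    precision_pos = tp / (tp + fp) if tp + fp else 0.0
    denom = beta ** 2 * precision_pos + recall_pos
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall_pos,
        "precision": precision_pos,
        "f_measure": (1 + beta ** 2) * precision_pos * recall_pos / denom if denom else 0.0,
        "g_mean": math.sqrt(recall_pos * recall_neg),
    }


# Hypothetical data set with 5 positive and 95 negative examples.
y_true = ["+"] * 5 + ["-"] * 95
always_negative = evaluate(y_true, ["-"] * 100)
balanced_model = evaluate(y_true, ["+"] * 4 + ["-"] + ["+"] * 5 + ["-"] * 90)
```

The classifier that always predicts the negative class reaches an accuracy of 0.95 with a recall, F-measure, and G-mean of zero, whereas the second classifier has a slightly lower accuracy of 0.94 but a recall of 0.8. This is the sense in which accuracy alone is inadequate for imbalanced learning.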
In the case of soft-type classifiers, that is, classifiers that output a continuous numeric value representing the confidence that an example belongs to the predicted class, AUC (see (19)) is a commonly used measure of models' performance. AUC allows the evaluation of the best model on average. In this paper, we employ accuracy, recall, F-measure, G-mean, and AUC to evaluate classification performance on imbalanced data sets. Though accuracy is inadequate for evaluating classification performance, poor accuracy still indicates a bad classifier. An effective classifier should improve recall, F-measure, G-mean, or AUC without decreasing accuracy.

4.3. Experimental Results. To evaluate the performance of ILLD (the proposed method), ILLD is compared with LD, LD-US, and LD-OS (for more details about LD, LD-US, and LD-OS refer to Section 4.1). The results are reported in both tables and figures: Tables 3, 4, 5, 6, and 7 report the results of the four compared methods on accuracy, recall, G-mean, F-measure, and AUC, and Figures 1, 2, 3, and 4 report the ranks of the methods on recall, G-mean, F-measure, and AUC. In these tables, a bullet (an open circle) next to a result indicates that ILLD significantly outperforms (is outperformed by) the respective method (column) on the respective data set (row) in a pairwise t-test at the 95% significance level. The last row of each table gives the average results. The ranks of the methods shown in Figures 1, 2, 3, and 4 are calculated as follows [29,30]: on a data set, the best performing algorithm gets the rank of 1.0, the second best performing algorithm gets the rank of 2.0, and so on. In case of ties, average ranks are assigned.
Table 3 reports the accuracies of ILLD, LD-US, LD-OS, and LD. As shown in Table 3, ILLD outperforms LD on three data sets and is outperformed by LD on four in the t-test at the 95% significance level. Moreover, the average accuracy of ILLD is 1.19 percentage points lower than that of LD. These results are acceptable even though LD is better than ILLD, since we focus on imbalanced learning, for which accuracy is not an ideal evaluation metric. Compared with LD-US and LD-OS, ILLD shows significantly better performance on 13 and 7 of the 14 data sets, respectively, and its average accuracy is 11.47 and 2.74 percentage points higher, respectively. Table 4 and Figure 1 show the summarized results and the ranks of the four compared methods on recall, respectively.
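The ranking rule described above (rank 1.0 for the best method on a data set, average ranks for ties) can be sketched as follows. The function name is hypothetical and the scores in the example are invented for illustration; they are not results from the paper.

```python
def average_ranks(scores):
    """Rank methods on one data set: highest score gets rank 1; ties share the average rank."""
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        # Extend j over all methods tied with the method at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2.0  # average of the tied rank positions
        for name in order[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


# Invented scores for one data set, with a tie for first place.
example_scores = {"ILLD": 0.90, "LD-US": 0.90, "LD-OS": 0.80, "LD": 0.70}
example_ranks = average_ranks(example_scores)
```

Here the two tied methods share rank 1.5, and the remaining methods receive ranks 3.0 and 4.0. Averaging such per-data-set ranks over all data sets gives the average ranks reported in the figures.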
From Table 4, ILLD significantly outperforms LD on 10 of the 14 data sets, and the average recall of ILLD is 0.1454 higher than that of LD (recall ∈ [0, 1]). These results indicate that the proposed cost function APM is appropriate for the imbalanced problem and that ILLD can improve the performance of logistic discrimination on the positive class while keeping its high accuracy. Also, ILLD performs comparably to LD-US and outperforms LD-OS. Specifically, ILLD significantly outperforms LD-US and LD-OS on 2 and 3 data sets, respectively, and is outperformed only by LD-US, on one data set. Besides, Figure 1 shows that the average ranks of ILLD, LD-US, LD-OS, and LD are 1.61, 1.61, 3.0, and 3.78, respectively. Combined with the results in Table 3, this shows that LD-US achieves high recall by sacrificing the high accuracy of logistic discrimination. Table 5 and Figure 2 show the summarized results and the ranks of ILLD, LD-US, LD-OS, and LD on F-measure, respectively. From Table 5, ILLD performs much better than LD-US, LD-OS, and LD. Specifically, ILLD significantly outperforms them on 11, 7, and 6 of the 14 data sets, respectively, and is outperformed only by LD-OS, on one data set. Moreover, Figure 2 shows that ILLD wins on 14, 12, and 11 of the 14 data sets compared with LD-US, LD-OS, and LD, respectively. Besides, the F-measure of ILLD ranks first on 10 data sets.
G-mean summaries and the corresponding ranks of ILLD, LD-US, LD-OS, and LD are reported in Table 6 and Figure 3. Similar to the results shown in Table 4 and Figure 2, Table 6 shows that ILLD significantly outperforms LD-US, LD-OS, and LD on 7, 6, and 10 of the 14 data sets, respectively, and Figure 3 shows that ILLD wins on 13, 11, and 14 data sets compared with the three other methods, respectively. Besides, ILLD ranks first with an average rank of 1.36, followed by LD-US (2.5), LD-OS (2.78), and LD (3.36). Table 7 and Figure 4 depict the AUCs and the ranks of the four compared methods, respectively. On the 14 data sets, ILLD wins (significantly wins) on 11 (6), 12 (7), and 11 (6)

Conclusion
In this paper, we first construct a novel cost function called APM (Accuracy-Precision Based Metric), which considers the accuracies of both the positive class and the negative class as well as the precision of the positive class, and then propose a method called ILLD (Imbalanced Learning Based on Logistic Discrimination) to handle data imbalance. We also apply undersampling and oversampling to improve the performance of logistic discrimination on the imbalanced problem. Experimental results show that these methods can significantly improve the performance of logistic discrimination on the positive class, and that ILLD performs significantly better than other advanced methods on recall, F-measure, and G-mean while keeping the high accuracy of logistic discrimination.