The classification of imbalanced data has attracted increasing attention. Many significant methods have been proposed and applied in many fields, but more efficient methods are still needed. Although the hypergraph is an efficient tool for knowledge discovery, it may not be powerful enough to deal with data in the boundary region. In this paper, the neighborhood hypergraph is presented by combining rough set theory with the hypergraph. On this basis, a novel classification algorithm for imbalanced data based on the neighborhood hypergraph is developed, composed of three steps: hyperedge initialization, classification of the training data set, and hyperedge replacement. In 10-fold cross validation experiments on 18 data sets, the proposed algorithm achieves higher average accuracy than the compared methods.
The imbalanced dataset problem in classification domains occurs when the number of instances that represent one class is much larger than that of the other classes. The minority class is usually more interesting from the point of view of the learning task. There are many situations in which imbalance occurs between classes, such as satellite image classification [
Previous research improved resampling methods in many aspects and proposed some effective resampling algorithms. SMOTE is an intelligent oversampling algorithm that was proposed by Chawla et al. [
In recent years, with the rapid development of ensemble methods for classification, they have also been applied to imbalanced data classification. Ensemble learning is a machine learning paradigm in which multiple learners (called base learners) are trained to solve the same problem [
In the 1970s, Rumelhart and Norman proposed three types of human learning: accretion, tuning, and restructuring [
Although the hypernetwork has been widely used in solving various machine learning problems, it usually produces poor overall classification performance when dealing with class imbalance problems. Like most traditional classification algorithms, the hypernetwork assumes that the class distribution of the dataset is balanced. The goal of hypernetwork learning is to extract hyperedges (or decision rules) that cover as many samples as possible. Hyperedges that are critical for differentiating class membership are copied and added, while hyperedges with poor discriminative ability are discarded. However, within the context of class imbalance, many samples of the minority class are usually treated as noise. Therefore, the number of hyperedges corresponding to the majority class significantly surpasses that of hyperedges corresponding to the minority class. As a result, most minority samples are misclassified in a traditional hypernetwork. Thus, this paper attempts to combine the hypernetwork with rough sets to address the problem.
Rough set theory is a powerful mathematical tool introduced by Pawlak [
The remainder of this paper is organized as follows. The basic concepts on neighborhood rough set models are shown in Section
In 1970, Berge [
Given
then the binary relation
An example of hypergraph.
Neighborhoods and neighborhood relations are a class of important concepts in topology. Lin [
Given arbitrary
Consider that
However, Euclidean distance applies only to continuous features and is invalid for nominal ones. Here, nominal features are compared using the value difference metric (VDM) proposed by Stanfill and Waltz [
In the previous equation,
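As an illustration, a minimal Python sketch of such a mixed distance is given below. The helper names (`vdm`, `mixed_distance`), the plug-in probability estimates, and the HVDM-style combination of squared Euclidean terms with VDM terms under one square root are our own assumptions, not the paper's code.

```python
import math

def vdm(X, y, a, v1, v2, q=2):
    """Value difference metric between values v1 and v2 of nominal
    attribute a: sum over classes c of |P(c|a=v1) - P(c|a=v2)|^q."""
    n1 = sum(1 for row in X if row[a] == v1)
    n2 = sum(1 for row in X if row[a] == v2)
    if n1 == 0 or n2 == 0:
        return 1.0                      # unseen value: assume maximal difference
    d = 0.0
    for c in set(y):
        n1c = sum(1 for row, cls in zip(X, y) if row[a] == v1 and cls == c)
        n2c = sum(1 for row, cls in zip(X, y) if row[a] == v2 and cls == c)
        d += abs(n1c / n1 - n2c / n2) ** q
    return d

def mixed_distance(u, v, cont_idx, nom_idx, X, y):
    """Squared Euclidean terms over continuous attributes plus VDM terms
    over nominal attributes, combined under one square root."""
    d = sum((u[i] - v[i]) ** 2 for i in cont_idx)
    d += sum(vdm(X, y, a, u[a], v[a]) for a in nom_idx)
    return math.sqrt(d)
```

Continuous attributes should be normalized to a common scale before the two parts are combined, so that neither dominates the distance.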
Given
The vertices of a hypergraph represent the attributes of samples in some studies, such as [
An example of neighborhood hypergraph.
Given
Given
Given
Given
Given If If
Given
The lower approximation of decision
To explain how to divide the upper approximation, lower approximation, and boundary region, here we give an example (Example 1) in Figure
An example of upper, lower approximation, and boundary region.
In Figure
First, one calculates the sample set
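To make the partition concrete, here is a small runnable sketch under our own simplifications (plain Euclidean distance, a fixed radius `delta`, and hypothetical helper names):

```python
import math

def neighborhood(i, X, delta, dist):
    """Indices of all samples within radius delta of sample i."""
    return [j for j in range(len(X)) if dist(X[i], X[j]) <= delta]

def approximations(X, y, target, delta, dist):
    """Neighborhood lower/upper approximations and boundary region of
    the set of samples labelled `target`."""
    lower, upper = [], []
    for i in range(len(X)):
        labels = {y[j] for j in neighborhood(i, X, delta, dist)}
        if labels == {target}:          # neighborhood lies purely in the class
            lower.append(i)
        if target in labels:            # neighborhood touches the class
            upper.append(i)
    boundary = [i for i in upper if i not in lower]
    return lower, upper, boundary

X = [(0.10, 0.20), (0.15, 0.22), (0.90, 0.80), (0.25, 0.25)]
y = ['pos', 'pos', 'neg', 'neg']
print(approximations(X, y, 'pos', delta=0.2, dist=math.dist))
```

Samples whose neighborhoods mix classes fall into the boundary region, which is exactly the region the proposed method treats specially.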
The traditional hypernetwork model has limitations in the following aspects:
The proposed algorithm aims to tackle the imbalanced data classification problem from the following two aspects.
The flow chart of the algorithm is shown in Figure
The flow chart of algorithm.
Hyperedges are generated from the samples, which preserves the real distribution of the sample set and thereby provides a foundation for hyperedge selection. Meanwhile, some attribute values can be changed while generating a hyperedge. Thus, more decision rules are generated for sample classification, which can improve the accuracy of sample classification to some extent.
In this paper, the attribute set of hyperedges is
In order to process imbalanced datasets, two classes are considered in the following definitions.
Given
In this paper, the process of hyperedge generation consists of two stages: attribute inheritance and class confirmation.
Attribute inheritance.
Given
According to formula ( otherwise, generate a random number
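As a minimal sketch of this two-stage generation (a fixed count of re-drawn attributes `n_mutate` and a `value_ranges` table are our own illustration rather than the paper's formula):

```python
import random

def generate_hyperedge(sample, label, value_ranges, n_mutate=2):
    """Attribute inheritance: copy a sample's attribute vector, then
    re-draw a few randomly chosen attribute values. `value_ranges[a]`
    is a (min, max) tuple for a continuous attribute or a value set
    for a nominal one. The class is inherited from the sample and is
    confirmed against the training set later."""
    edge = list(sample)                              # inheritance stage
    for a in random.sample(range(len(edge)), min(n_mutate, len(edge))):
        rng = value_ranges[a]
        if isinstance(rng, tuple):                   # continuous attribute
            edge[a] = random.uniform(*rng)
        else:                                        # nominal attribute
            edge[a] = random.choice(sorted(rng))
    return {'values': edge, 'cls': label}
```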
One uses the generated hyperedge set to classify the training set. By analyzing the classification results, one can determine the classification accuracy of the hyperedge set and decide whether to replace hyperedges in it. By repeating the process of training-sample classification and hyperedge replacement, the distribution of the hyperedge set gradually approaches the real distribution of the training sample set.
Given
The following factors should be considered when using the neighborhood hypergraph to classify samples:
Given sample set
(
(
One uses the classification rules above to classify the training set. If the accuracy is higher than 0.95, one can output the hyperedge set. Otherwise, the hyperedge replacement operation should be adopted (see Section
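The exact decision rules depend on the definitions above; purely as an illustration, here is a plausible sketch in which a hyperedge votes for its class when it falls within the sample's neighborhood radius and votes are weight-adjusted. The helper names and the plain weighted majority vote are assumptions.

```python
def classify(sample, hyperedges, delta, dist):
    """Collect the hyperedges inside the sample's neighborhood radius
    and take a weighted majority vote over their classes."""
    votes = {}
    for e in hyperedges:
        if dist(sample, e['values']) <= delta:
            w = e.get('weight', 1.0)             # weight is optional here
            votes[e['cls']] = votes.get(e['cls'], 0.0) + w
    return max(votes, key=votes.get) if votes else None

def training_accuracy(X, y, hyperedges, delta, dist):
    """Accuracy of the hyperedge set on the training samples."""
    hits = sum(classify(x, hyperedges, delta, dist) == c
               for x, c in zip(X, y))
    return hits / len(X)
```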
In experimental evaluation, we conclude that
In the process of hyperedge initialization, part of the attribute values are generated randomly. As a result, some hyperedges are not suitable for sample classification. In order to acquire better performance, the poor hyperedges should be replaced by newly generated ones, namely, hyperedge replacement.
The algorithm divides the hyperedge set into upper approximation, lower approximation, boundary region, and negative region. The confidence degree of hyperedges in the lower approximation is 1; that of hyperedges in the boundary region is between 0 and 1; hyperedges whose confidence degree is 0 belong to the negative region. Hyperedges in the lower approximation are all retained because they are very helpful for classification. On the contrary, since hyperedges in the negative region are counterproductive for classification, they will be replaced. Hyperedges in the boundary region are dealt with by a threshold
It is composed of three steps. First, set the confidence degree threshold of each hyperedge (in this paper,
Second, find the hyperedges whose confidence degree is under the threshold in the hyperedge set.
According to Definitions
If
An example of situation 1.
In this case, one can assume that
It means that samples surrounding
Now, let us give an example below to explain the situation (see Figure
Hyperedges whose confidence degree is under
According to Formula (
Such hyperedges have the same class as only a few of the samples surrounding them, which results in a poor effect on classification. Thus, they should be replaced.
One selects a hyperedge
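Continuing the earlier sketches, hyperedge replacement might look as follows. The confidence computation mirrors the description above (the fraction of covered samples sharing the hyperedge's class), while the regenerate-from-random-samples policy is our own simplification; `generate_hyperedge` is the helper sketched earlier.

```python
import random

def confidence(edge, X, y, delta, dist):
    """Confidence degree: among the training samples inside the
    hyperedge's neighborhood, the fraction sharing the hyperedge's class."""
    covered = [c for x, c in zip(X, y) if dist(x, edge['values']) <= delta]
    if not covered:
        return 0.0
    return sum(c == edge['cls'] for c in covered) / len(covered)

def replace_poor_hyperedges(edges, X, y, value_ranges, theta, delta, dist):
    """Keep hyperedges whose confidence reaches the threshold theta
    (the lower approximation, confidence 1, is always kept) and
    regenerate the rest from randomly drawn training samples."""
    kept = [e for e in edges if confidence(e, X, y, delta, dist) >= theta]
    while len(kept) < len(edges):
        i = random.randrange(len(X))
        kept.append(generate_hyperedge(X[i], y[i], value_ranges))
    return kept
```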
In this paper, sample classification and hyperedge replacement are based on the neighborhood radius of a sample. According to Definition
Here we give the N-HyperGraph algorithm (see Algorithm
Step 1 (hyperedge initialization). According to the formula ( ):
  FOR each sample
    WHILE ( )
      Generate hyper-edge values of ...
      Calculate the distance between ...
      Calculate ...
      IF ...
      ELSE
        Calculate ...
        IF ...
        ELSE ...
        END IF
      END IF
    END WHILE
  END FOR

Step 2 (classification of the training data set).
  Calculate ...
  FOR each ...
    Calculate ...
    IF ...
    ELSE ...
    END IF
  END FOR
  Calculate the classification accuracy of the training data set:
  IF ...
  ELSE GOTO Step 3;
  END IF

Step 3 (hyperedge replacement).
  FOR each ...
    Calculate the confidence-degree ...
    IF ... END IF
  END FOR
  WHILE ( )
    Generate a new hyper-edge ...
  END WHILE
  GOTO Step 2;

RETURN
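Putting the three steps together, a skeleton of the whole loop (reusing the helpers sketched above) could read as follows. The iteration cap and the default value of `theta` are our own choices; only the 0.95 accuracy target comes from the text.

```python
def n_hypergraph(X, y, value_ranges, delta, dist,
                 theta=0.4, target_acc=0.95, max_iter=50):
    """Three-step skeleton: (1) initialize hyperedges from the samples,
    (2) classify the training set, (3) replace poor hyperedges; repeat
    (2)-(3) until the accuracy target or the iteration cap is reached."""
    edges = [generate_hyperedge(x, c, value_ranges) for x, c in zip(X, y)]
    for _ in range(max_iter):
        if training_accuracy(X, y, edges, delta, dist) >= target_acc:
            break
        edges = replace_poor_hyperedges(edges, X, y, value_ranges,
                                        theta, delta, dist)
    return edges
```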
There are two main parameters in the algorithm:
In order to test the proposed algorithm, one selects 18 UCI datasets downloaded from the machine learning repository of the University of California, Irvine. The imbalance rate ranges from 1.37 to 28.10. There are seven multiclass datasets and eleven two-class datasets. The multiclass datasets are modified to obtain two-class imbalance problems: the union of one or more classes is taken as the minority class, and the union of the remaining classes is labeled as the majority class. For missing values, continuous features are filled with the average value, and nominal features are filled with the most frequent value (a sketch of this preprocessing is given after the table). The datasets are outlined in Table
Data description.
Dataset | Size | Attribute | Class label (minority : majority) | Class distribution |
---|---|---|---|---|
Bupa | 345 | 6C | 01 : 02 | 145/200 |
Colic | 368 | 7C 15N | No : yes | 136/232 |
Reprocessed | 294 | 13C | 01 : 00 | 106/188 |
Machine | 209 | 7C | Others : 2 | 74/135 |
Labor | 57 | 8C 8N | Bad : good | 20/37 |
Tic | 958 | 9N | Negative : positive | 332/626 |
Iris | 150 | 4C | Iris-virginica : others | 50/100 |
Seed | 210 | 7C | 02 : others | 70/140 |
Vc | 310 | 6C | Normal : Abnormal | 100/210 |
Glass | 214 | 9C | 01, 02 : others | 68/146 |
Haberman | 306 | 3C | 02 : 01 | 81/225 |
Transfusion | 748 | 4C | 01 : 00 | 178/570 |
Abalone (7 : 15) | 494 | 7C 1N | 15 : 07 | 103/391 |
Balance-scale | 625 | 4C | B : others | 49/576 |
Abalone (9 : 18) | 731 | 7C 1N | 18 : 9 | 42/689 |
Yeast (POX : CTY) | 483 | 8C | POX : CYT | 20/463 |
Car | 1728 | 6N | Good : others | 69/1659 |
Yeast (ME2 : others) | 1484 | 8C | ME2 : others | 51/1433 |
C: continuous, N: nominal.
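As a minimal sketch of the preprocessing just described, assuming pandas and a hypothetical label column name:

```python
import pandas as pd

def preprocess(df, minority_labels, label_col='class'):
    """Fill missing values (mean for continuous columns, most frequent
    value for nominal ones) and collapse a multiclass label into a
    two-class minority/majority label."""
    for col in df.columns:
        if col == label_col:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())        # continuous: mean
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])  # nominal: mode
    df[label_col] = df[label_col].apply(
        lambda c: 'minority' if c in minority_labels else 'majority')
    return df
```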
The traditional evaluation usually uses the confusion matrix, shown in Table
Confusion matrix.
 | Predicted positive | Predicted negative
---|---|---
Positive | TP | FN
Negative | FP | TN
From Table
There are three evaluation measures given by the formulas: Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F-measure = 2 × Precision × Recall/(Precision + Recall).
Another appropriate metric that could be used to measure the performance of classification over imbalanced datasets is the receiver operating characteristic (ROC) graphics [
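These measures follow directly from the confusion matrix; a small sketch with the standard formulas (the helper names are ours):

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```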
In order to evaluate the performance of N-HyperGraph, it is compared with several algorithms from the related literature: SVM and J48 (C4.5) [
Comparative experimental results on Precision, Recall, and the other measures are reported in the following tables.
Precision.

Dataset | SVM | CS-EN-HN | SMOTE + C4.5 | SMOTE-RSB* + C4.5 | NRSBoundary-SMOTE + C4.5 | N-HyperGraph
---|---|---|---|---|---|---
Bupa | | 0.6827 | 0.5663 | 0.6581 | 0.5614 | 0.5497
Colic | 0.6857 | 0.5887 | 0.7391 | 0.8103 | 0.7687 |
Reprocessed | 0.0000 | 0.6344 | 0.7030 | | 0.6979 | 0.5558
Machine | 0.0000 | 0.7542 | 0.8250 | 0.8873 | | 0.9025
Labor | 0.9411 | | 0.6667 | 0.6667 | 0.8421 | 0.7733
Tic | 0.9908 | 0.6534 | 0.6882 | 0.8044 | 0.8100 |
Iris | | 0.5061 | 0.9057 | 0.8703 | 0.8888 | 0.8310
Seed | 0.9295 | | 0.9577 | 0.9577 | 0.9577 | 0.8614
Vc | 0.0000 | 0.6133 | 0.6696 | | 0.6695 | 0.4842
Glass | 0.8775 | | 0.6933 | 0.8214 | 0.7428 | 0.8652
Haberman | 0.4999 | 0.2334 | 0.4516 | 0.4590 | 0.4948 |
Transfusion | 0.4186 | 0.3139 | 0.4722 | 0.5299 | 0.5000 |
Abalone (7 : 15) | 0.7951 | 0.7552 | 0.8056 | 0.8155 | 0.8100 |
Balance-scale | 0.0000 | 0.1605 | 0.0000 | 0.0000 | 0.0000 |
Abalone (9 : 18) | 0.0000 | 0.3910 | 0.4167 | 0.4347 | 0.5000 |
Yeast (POX : CTY) | | 0.6371 | 0.6000 | 0.9268 | 0.7736 | 0.7000
Car | 0.5000 | 0.5100 | 0.6849 | 0.6849 | | 0.6731
Yeast (ME2 : others) | 0.0000 | 0.1272 | 0.3214 | | 0.4200 | 0.3265
Average | 0.5233 | 0.5853 | 0.6204 | 0.6816 | 0.6682 |
In order to compare the overall performance of the algorithms, the average value of each indicator is shown in Figure
The average value of each indicator.
Tables
Recall.

Dataset | SVM | CS-EN-HN | SMOTE + C4.5 | SMOTE-RSB* + C4.5 | NRSBoundary-SMOTE + C4.5 | N-HyperGraph
---|---|---|---|---|---|---
Bupa | 0.0413 | 0.7856 | 0.6483 | 0.5310 | 0.6621 |
Colic | 0.1764 | 0.6531 | 0.7500 | 0.6912 | 0.7574 |
Reprocessed | 0.0000 | 0.5999 | 0.6698 | 0.6509 | 0.6320 |
Machine | 0.0000 | 0.8404 | 0.8919 | 0.8514 | 0.8919 |
Labor | 0.8000 | 0.5000 | 0.5000 | 0.7000 | 0.8000 |
Tic | 0.6536 | | 0.7711 | 0.7680 | 0.7319 |
Iris | 0.9800 | 0.5000 | 0.9600 | 0.9400 | 0.9600 |
Seed | 0.9428 | 0.9428 | 0.9714 | 0.9714 | 0.9714 |
Vc | 0.0000 | 0.9500 | 0.7700 | 0.6400 | 0.7900 |
Glass | 0.6323 | | 0.7647 | 0.6764 | 0.7647 | 0.9000
Haberman | 0.0246 | 0.7732 | 0.5185 | 0.3457 | 0.5926 |
Transfusion | 0.1011 | 0.9117 | 0.4775 | 0.3989 | 0.5000 |
Abalone (7 : 15) | 0.6407 | | 0.8447 | 0.8155 | 0.7864 |
Balance-scale | 0.0000 | 0.5500 | 0.0000 | 0.0000 | 0.0000 |
Abalone (9 : 18) | 0.0000 | 0.8666 | 0.3571 | 0.4347 | 0.3809 |
Yeast (POX : CTY) | 0.1372 | 0.8000 | 0.4500 | 0.4751 | 0.8039 |
Car | 0.0145 | | 0.7246 | 0.7246 | 0.6956 |
Yeast (ME2 : others) | 0.0000 | 0.8700 | 0.3529 | 0.3137 | 0.4118 |
Average | 0.2858 | 0.8068 | 0.6346 | 0.6071 | 0.6740 |

Dataset | SVM | CS-EN-HN | SMOTE + C4.5 | SMOTE-RSB* + C4.5 | NRSBoundary-SMOTE + C4.5 | N-HyperGraph
---|---|---|---|---|---|---
Bupa | 0.0789 | | 0.6045 | 0.5878 | 0.6076 | 0.6422
Colic | 0.2807 | 0.5688 | 0.7445 | 0.7460 | 0.7630 |
Reprocessed | 0.0000 | 0.5374 | 0.6698 | 0.6831 | 0.6633 |
Machine | 0.0000 | 0.8404 | 0.8571 | 0.8690 | 0.8980 |
Labor | | 0.6666 | 0.5714 | 0.6829 | 0.8205 | 0.8171
Tic | 0.7876 | 0.7900 | 0.7273 | 0.7858 | 0.7689 |
Iris | | 0.4835 | 0.9320 | 0.9038 | 0.9230 | 0.9045
Seed | 0.9361 | 0.9545 | | | | 0.9209
Vc | 0.0000 | | 0.7163 | 0.6919 | 0.7248 | 0.6501
Glass | 0.7350 | | 0.7273 | 0.7419 | 0.7536 | 0.8808
Haberman | 0.0470 | 0.3475 | 0.4828 | 0.3944 | 0.5393 |
Transfusion | 0.1628 | 0.4582 | 0.4749 | 0.4551 | 0.5000 |
Abalone (7 : 15) | 0.7096 | 0.8595 | 0.8246 | 0.8155 | 0.7980 |
Balance-scale | 0.0000 | 0.2326 | 0.0000 | 0.0000 | 0.0000 |
Abalone (9 : 18) | 0.0000 | 0.5071 | 0.3846 | 0.3076 | 0.4324 |
Yeast (POX : CTY) | 0.2413 | 0.6493 | 0.5143 | | 0.7885 | 0.8066
Car | 0.0282 | 0.6728 | 0.7042 | 0.7042 | 0.6906 |
Yeast (ME2 : others) | 0.0000 | 0.1924 | 0.3364 | 0.3765 | 0.4158 |
Average | 0.3235 | 0.6191 | 0.6252 | 0.6409 | 0.6695 |

Dataset | SVM | CS-EN-HN | SMOTE + C4.5 | SMOTE-RSB* + C4.5 | NRSBoundary-SMOTE + C4.5 | N-HyperGraph
---|---|---|---|---|---|---
Bupa | 0.2026 | | 0.6441 | 0.6518 | 0.6433 | 0.5592
Colic | 0.4008 | 0.5828 | 0.7740 | 0.7910 | 0.8100 |
Reprocessed | 0.0000 | 0.5973 | | 0.7466 | 0.7311 | 0.7247
Machine | 0.0000 | 0.8371 | 0.8941 | 0.8949 | 0.9196 |
Labor | 0.8000 | 0.7071 | 0.6576 | 0.7534 | | 0.8232
Tic | 0.8015 | 0.8466 | 0.7926 | 0.8318 | 0.8156 |
Iris | 0.4427 | 0.5112 | | 0.9349 | 0.9499 | 0.9427
Seed | 0.6473 | 0.9623 | | | | 0.9511
Vc | 0.0000 | 0.7729 | 0.7941 | 0.7589 | | 0.6831
Glass | 0.7141 | | 0.8026 | 0.7938 | 0.8187 | 0.8897
Haberman | 0.2957 | 0.4884 | 0.6332 | 0.5431 | 0.6808 |
Transfusion | 0.1628 | 0.4582 | 0.6308 | 0.5956 | 0.6496 |
Abalone (7 : 15) | 0.6626 | 0.9565 | 0.8940 | 0.8808 | 0.8649 |
Balance-scale | 0.0000 | 0.6016 | 0.0000 | 0.0000 | 0.0000 |
Abalone (9 : 18) | 0.0000 | 0.8521 | 0.5884 | 0.4833 | 0.6100 |
Yeast (POX : CTY) | 0.3704 | 0.7088 | 0.6665 | 0.8552 | 0.8630 |
Car | 0.1195 | 0.9790 | 0.8453 | 0.8453 | 0.8285 |
Yeast (ME2 : others) | 0.0000 | 0.5086 | 0.5861 | 0.5566 | 0.6352 |
Average | 0.3118 | 0.7113 | 0.7158 | 0.7162 | 0.7475 |
AUC.

Dataset | SVM | CS-EN-HN | SMOTE + C4.5 | SMOTE-RSB* + C4.5 | NRSBoundary-SMOTE + C4.5 | N-HyperGraph
---|---|---|---|---|---|---
Bupa | 0.5181 | | 0.6468 | 0.6652 | 0.6401 | 0.6427
Colic | 0.5645 | 0.8265 | 0.7960 | 0.7855 | 0.8102 |
Reprocessed | 0.5000 | 0.6750 | | 0.7565 | 0.7817 | 0.7662
Machine | 0.5000 | 0.9201 | 0.9199 | 0.9359 | 0.9430 |
Labor | 0.8864 | 0.7500 | 0.7500 | 0.7655 | 0.8243 |
Tic | 0.8252 | | 0.8638 | 0.8941 | 0.8848 | 0.9959
Iris | | 0.7500 | 0.9408 | 0.9197 | 0.9468 | 0.9450
Seed | 0.9535 | 0.9780 | 0.9730 | | | 0.9535
Vc | 0.5000 | | 0.8107 | 0.8380 | 0.8277 | 0.7381
Glass | 0.7956 | | 0.8442 | 0.8640 | 0.8637 | 0.9400
Haberman | 0.5079 | 0.5866 | 0.6255 | 0.6174 | 0.6636 |
Transfusion | 0.5286 | 0.8558 | 0.6813 | 0.7007 | 0.7048 |
Abalone (7 : 15) | 0.7986 | | 0.8901 | 0.9046 | 0.8627 | 0.9910
Balance-scale | 0.5000 | 0.6256 | 0.5000 | 0.5000 | 0.5000 |
Abalone (9 : 18) | 0.6611 | 0.9333 | 0.7294 | 0.6514 | 0.7065 |
Yeast (POX : CTY) | 0.5000 | 0.8999 | 0.6881 | 0.8685 | 0.8762 |
Car | 0.5069 | 0.9792 | 0.9775 | 0.9775 | | 0.9888
Yeast (ME2 : others) | 0.5000 | 0.6790 | 0.7894 | 0.7412 | 0.6916 |
Average | 0.6398 | 0.8509 | 0.7898 | 0.7980 | 0.8054 |
We can find out from Tables
Overall, the experimental results of N-HyperGraph are better than those of the other algorithms. Since rough set theory is used in N-HyperGraph, it processes uncertain samples more effectively, especially in the boundary region of the hyperedge set. Moreover, weights are calculated through the neighborhood rough set model, so more hyperedges are involved in the class decision of a hyperedge, improving the accuracy. Owing to these two aspects of the proposed algorithm, the classification results are improved.
As SMOTE oversamples all minority class samples, it shrinks the decision space of the majority class. Although it can improve the Recall of the minority class, many majority class samples will be misclassified as the minority class, thereby decreasing Precision. SMOTE-RSB* filters the synthetic samples more strictly than SMOTE, so few synthetic samples are generated when datasets are highly imbalanced; thus, compared with SMOTE, its improvement is not obvious. NRSBoundary-SMOTE takes the neighborhood rough set into consideration and emphasizes resampling of minority class samples that belong to the boundary region, and thus improves the
In this paper, a new hypernetwork-based algorithm called N-HyperGraph is proposed to solve the problem of classifying imbalanced datasets. First, the hyperedge set is divided according to rough set theory. Then, poor hyperedges are replaced, taking the imbalance degree into account, in order to improve the accuracy. The experimental results on 18 UCI datasets with different degrees of imbalance show that the classification results of the proposed N-HyperGraph improve obviously in contrast with the other algorithms. However, N-HyperGraph costs much time, owing to the calculation of distances between hyperedges and samples. Thus, reducing the running time of the algorithm is our future work.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China (NSFC) under Grant nos. 61309014 and 61379114, Natural Science Foundation Project of CQ CSTC under Grant no. cstc2013jcyjA40063, and Doctor Foundation of Chongqing University of Posts and Telecommunications under Grant no. A2012-08.