A Novel Algorithm for Imbalance Data Classification Based on Neighborhood Hypergraph

The classification of imbalanced data has attracted increasing attention. Many significant methods have been proposed and applied in many fields, but more efficient methods are still needed. Although the hypergraph is an efficient tool for knowledge discovery, it may not be powerful enough to deal with data in the boundary region. In this paper, the neighborhood hypergraph is presented, combining rough set theory and the hypergraph. A novel classification algorithm for imbalanced data based on the neighborhood hypergraph is then developed, which is composed of three steps: initialization of hyperedges, classification of the training data set, and substitution of hyperedges. In 10-fold cross-validation experiments on 18 data sets, the proposed algorithm achieves higher average accuracy than the compared algorithms.


Introduction
The imbalanced dataset problem in classification domains occurs when the number of instances that represent one class is much larger than that of the other classes. The minority class is usually more interesting from the point of view of the learning task. There are many situations in which imbalance occurs between classes, such as satellite image classification [1], risk management [2], and medical diagnosis [3,4]. When studying problems with imbalanced data, using the classifiers produced by standard machine learning algorithms without adjusting the output threshold may well be a critical mistake [5]. At present, solutions to the problem of imbalanced dataset classification are developed at both the data and algorithmic levels [6]. At the data level, the objective is to rebalance the class distribution by resampling the data space, for example by oversampling the minority class and undersampling the prevalent class. At the algorithm level, solutions try to adapt existing classifier learning algorithms to strengthen learning with regard to the minority class, such as cost-sensitive learning, ensemble learning, and hypernetwork [7].
Previous research improved resampling methods in many aspects and proposed some effective resampling algorithms. SMOTE is an intelligent oversampling algorithm proposed by Chawla et al. [8]. Its main idea is to form new minority class samples by interpolating between several minority class samples that lie close together. Thus, the overfitting problem is avoided and the decision space for the minority class spreads further, while the decision space for the majority class is reduced. Many researchers have since proposed improved methods. Dong and Wang [9] proposed Random-SMOTE, which, unlike SMOTE, obtains new minority class samples by interpolating among three minority class samples. Yang et al. [10] proposed the ASMOTE algorithm, which chooses not only minority class samples but also the majority class samples near a minority class sample, so that synthetic samples do not overlap the majority class samples. Han et al. [11] proposed Borderline-SMOTE. Hu and Li [12] proposed the NRSBoundary-SMOTE algorithm, which expands the decision space for the minority class while shrinking the decision space for the majority class.
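The interpolation idea behind SMOTE can be illustrated with a minimal sketch. It assumes numeric features and Euclidean distance; the function name `smote_sample` and the parameter `k` are hypothetical names chosen for illustration, and the sketch is not the reference implementation of [8].

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic minority sample by SMOTE-style interpolation."""
    rng = rng or random.Random()
    # Pick a random minority sample as the base point.
    base = rng.choice(minority)
    # Sort the other minority samples by Euclidean distance to the base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    # Choose one of the k nearest minority neighbours.
    neighbour = rng.choice(others[:k])
    # Interpolate: base + gap * (neighbour - base), with gap drawn from [0, 1).
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical two-dimensional minority samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_sample(minority, k=2, rng=random.Random(0))
```

Because the synthetic point lies on the segment between two existing minority samples, it always stays inside the region already occupied by the minority class.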
In recent years, with the rapid development of ensemble methods for classification, they have been applied to imbalanced data classification. Ensemble learning is a machine learning paradigm where multiple learners (called base learners) are trained to solve the same problem [13]. Due to their outstanding performance, ensemble methods are applied to imbalanced datasets in combination with other techniques. Chawla et al. developed the SMOTEBoost algorithm by integrating AdaBoost (the most famous boosting algorithm) and the synthetic minority oversampling technique (SMOTE) [14]. Similarly to SMOTEBoost, RUSBoost also introduces data sampling into the AdaBoost algorithm; however, RUSBoost applies random undersampling to the majority class, whereas SMOTEBoost creates synthetic minority class instances by operating in the feature space [15]. Błaszczyński et al. integrated the selective data preprocessing method SPIDER with the Ivotes ensemble algorithm, developing the framework called IIvotes [16]. Besides, cost-sensitive learning has become an effective tool for the class imbalance problem, which involves two types: the binary classification problem and the multiclass classification problem. It can be implemented in two ways, rescaling and reweighting. Both aim at making the trained classification algorithms cost sensitive. Rescaling changes the distribution of samples in the training data. It has been used in cost-based sampling [17], REBALANCE [18], Rescalenew [19], and so on. Differing from rescaling, reweighting adjusts the class probability distribution in the classifier based on costs. It has been used in MetaCost [20], proposed by Domingos, and AdaCost [21], which Fan et al. derived from AdaBoost.
In the 1970s, Rumelhart and Norman proposed three types of human learning: accretion, tuning, and restructuring [22]. Based on their study, Professor Zhang proposed three basic principles of cognitive learning [23]: (1) continuity, (2) glocality, and (3) compositionality. He used a hypergraph as the form of representation and proposed a hypernetwork model, which can be used for cognitive learning and memory. A hypernetwork is a probabilistic graph with a large number of hyperedges. A hyperedge can be regarded as a component, a subject, or even a circuit. From the perspective of data, a hyperedge is a combination of sample attributes and a class. So far, the hypernetwork can only deal with discrete data, so a dataset must be discretized before it is used to build a hypernetwork classifier. A hypernetwork model includes three steps: (1) initializing a hypernetwork model according to the training dataset, (2) evolutionary learning of the hypernetwork, and (3) classification of the test dataset using the evolved hypernetwork. In step (1), a sample is used to generate many hyperedges, each inheriting some attributes of the sample and its class. In step (2), the operations of matching, selection, and replacement are repeatedly executed on the hyperedges. Learning starts from a randomly initialized hypernetwork; in each iteration, the fitness of each hyperedge is calculated for evaluation and ordering. Hyperedges with low fitness are replaced with newly generated hyperedges. In this way, hyperedges with high class discernibility in the pattern space can be found by the hypernetwork [7]. After the above steps, the hypernetwork model is built and classifies the test data through joint probability.
Although the hypernetwork has been widely used in solving various machine learning problems, it usually produces poor overall classification performance when dealing with class imbalance problems. Like most traditional classification algorithms, the hypernetwork assumes that the class distribution of a dataset is balanced. The goal of hypernetwork learning is to extract hyperedges (or decision rules) that cover as many samples as possible. Hyperedges that are critical for differentiating class membership are copied and added, while hyperedges with poor differentiating ability are discarded. However, within the context of class imbalance, many samples in the minority class are usually viewed as noise. Therefore, the number of hyperedges corresponding to the majority class significantly surpasses the number corresponding to the minority class. As a result, most of the minority samples are misclassified by a traditional hypernetwork. Thus, this paper attempts to combine the hypernetwork with rough sets to address the problem.
Rough set theory is a powerful mathematical tool introduced by Pawlak [24][25][26][27] to deal with imprecise, uncertain, and vague information. It has been successfully applied to fields such as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. Basically, the idea is to approximate a concept by three description sets, namely, the lower approximation, the upper approximation, and the boundary region. Rough set theory puts the uncertain samples in the boundary region, which is calculated as the upper approximation minus the lower approximation. Many researchers have applied rough set theory to the processing of imbalanced data [28,29].
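The three description sets can be made concrete with a minimal sketch of the classical (Pawlak) rough set approximations. It assumes the universe is already partitioned into equivalence classes; the function name `approximations` and the toy data are hypothetical.

```python
def approximations(equivalence_classes, target):
    """Return (lower, upper, boundary) of `target` with respect to a partition."""
    target = set(target)
    lower, upper = set(), set()
    for block in equivalence_classes:
        block = set(block)
        if block <= target:      # block lies entirely inside the concept
            lower |= block
        if block & target:       # block overlaps the concept
            upper |= block
    # The boundary region is the upper approximation minus the lower one.
    return lower, upper, upper - lower


# Hypothetical universe of six objects partitioned by indiscernibility.
classes = [{1, 2}, {3, 4}, {5, 6}]
concept = {1, 2, 3}
lower, upper, boundary = approximations(classes, concept)
```

Here objects 3 and 4 are indiscernible but only object 3 belongs to the concept, so both end up in the boundary region.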
The remainder of this paper is organized as follows. The basic concepts on neighborhood rough set models are shown in Section 2. The neighborhood hypergraph algorithm is developed in Section 3. Section 4 presents the experimental evaluation on 18 imbalanced UCI datasets [30] by 10-fold cross validation, which shows the validity of the proposed method. The paper is concluded in Section 5.

Hypergraph and Neighborhood Hypergraph Model
The Definition of Hypergraph. In 1970, Berge [31] used the hypergraph to define the hypernetwork. This was the first systematic establishment of undirected hypergraph theory, and it was applied to operations research by means of matroids.

The Neighborhood Hypergraph. Neighborhoods and neighborhood relations are a class of important concepts in topology. Lin [33] pointed out that neighborhood spaces are more general topological spaces than equivalence spaces and introduced the neighborhood relation into rough set methodology. Hu et al. [34] discussed the properties of neighborhood approximation spaces and proposed the neighborhood-based rough set model. They then used the model to build a uniform theoretic framework for neighborhood-based classifiers. For convenience of description, some basic concepts of the neighborhood rough set model are introduced first.
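Definition 2 (see [34]). Given arbitrary $x_i \in U$ and $B \subseteq C$, the neighborhood $\delta_B(x_i)$ of $x_i$ in the subspace $B$ is defined as
\[ \delta_B(x_i) = \{ x_j \mid x_j \in U,\ \Delta_B(x_i, x_j) \le \delta \}, \]
where $\Delta$ is a metric function. $\forall x_1, x_2, x_3 \in U$, it satisfies
\[ \Delta(x_1, x_2) \ge 0, \quad \Delta(x_1, x_2) = 0 \iff x_1 = x_2; \qquad \Delta(x_1, x_2) = \Delta(x_2, x_1); \qquad \Delta(x_1, x_3) \le \Delta(x_1, x_2) + \Delta(x_2, x_3). \]
Consider that $x_1$ and $x_2$ are two objects and $A = \{a_1, a_2, \ldots, a_N\}$ is an $N$-dimensional space, where $f(x, a_i)$ denotes the value of sample $x$ on the $i$th dimension $a_i$. Then a general metric, named the Minkowsky distance, is defined as
\[ \Delta_P(x_1, x_2) = \left( \sum_{i=1}^{N} \left| f(x_1, a_i) - f(x_2, a_i) \right|^{P} \right)^{1/P}, \]
which is the Euclidean distance when $P = 2$. However, the Euclidean distance can only be computed for continuous features; it is not valid for nominal features. Here, nominal features are compared by using the value difference metric (VDM) proposed by Stanfill and Waltz [35] in 1986. The distance between two corresponding feature values is defined as follows:
\[ d(V_1, V_2) = \sum_{i=1}^{n} \left| \frac{C_{1i}}{C_1} - \frac{C_{2i}}{C_2} \right|^{k}, \]
where the sum runs over the $n$ classes. In the previous equation, $V_1$ and $V_2$ are the two corresponding feature values, $C_1$ is the total number of occurrences of feature value $V_1$, and $C_{1i}$ is the number of occurrences of feature value $V_1$ for class $i$. A similar convention applies to $C_{2i}$ and $C_2$. $k$ is a constant, which is usually set to 1.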
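The Minkowsky distance, the VDM, and the neighborhood of a sample can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `minkowski`, `vdm`, and `neighborhood` and the toy data are hypothetical, and `vdm` assumes that both feature values occur at least once in the data.

```python
from collections import Counter


def minkowski(x1, x2, p=2):
    """Minkowsky distance between two numeric vectors (p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)


def vdm(values, labels, v1, v2, k=1):
    """Value difference metric between two values of one nominal feature."""
    totals = Counter(values)                    # occurrences of each value
    per_class = Counter(zip(values, labels))    # occurrences per (value, class)
    return sum(
        abs(per_class[(v1, c)] / totals[v1] - per_class[(v2, c)] / totals[v2]) ** k
        for c in set(labels)
    )


def neighborhood(samples, i, radius, p=2):
    """Indices of samples whose distance to sample i is at most `radius`."""
    return [j for j, s in enumerate(samples)
            if minkowski(samples[i], s, p) <= radius]


points = [[0.0, 0.0], [3.0, 4.0], [0.5, 0.0]]
colors = ["red", "red", "blue", "blue"]
classes = ["pos", "neg", "neg", "neg"]

euclid = minkowski(points[0], points[1])           # 3-4-5 triangle
color_gap = vdm(colors, classes, "red", "blue")
near_first = neighborhood(points, 0, radius=1.0)
```

Note that a sample always belongs to its own neighborhood, because its distance to itself is zero.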
In some of the literature, such as [36], the vertices of a hypergraph represent the attributes of samples. In this paper, however, the vertices of the hypergraph denote samples, and different samples on one hyperedge have the same attribute set. An example of a neighborhood hypergraph is shown in Figure 2.

Definition 7. Given = ⟨ , ⟩, for arbitrary ∈ , one knows ( ) ∈ { , }, where denotes the minority decision and denotes the majority decision. Then the sets of decision and decision in the hyperedge set are, respectively, defined as thus, the degree of imbalance for is defined as
Definition 8. Given = ⟨ , ⟩, for arbitrary ∈ , assume ( ) ∈ { , }, where denotes the minority decision and denotes the majority decision.
(1) If ( ) = , then (2) If ( ) = , then
Definition 9. Given = ⟨ , ⟩, is the attribute set of the samples, and is the decision of the samples. For an arbitrary hyperedge set ( ⊆ ), according to the decisions , the hyperedge set is divided into equivalence classes: 1 , 2 , . . . , . For arbitrary ⊆ , the upper approximation, lower approximation, boundary region, and negative region of the decision related to the set of attributes are, respectively, defined as
The lower approximation of the decision related to the set of attributes is also called the positive region, denoted by POS ( ). The size of the positive region reflects the degree of separability of the classification problem in a given attribute space; the bigger the positive region, the smaller the boundary.
To explain how the upper approximation, lower approximation, and boundary region are divided, an example (Example 1) is given in Figure 3.
In Figure 3, hyperedge 2 lies simultaneously in the neighborhoods of samples 1 and 2; in other words, it links samples 1 and 2. From the graph, it is easy to see that a hyperedge belongs to the neighborhood of a sample exactly when the symbol of the hyperedge lies inside that neighborhood.

Neighborhood Hypergraph Classification Algorithm
The traditional hypernetwork model is limited in the following aspects.
(1) It can only deal with discretized datasets.
(2) There is no special processing for the samples in the boundary region.
However, some advantages appear when rough set theory is combined with the hypernetwork.
(1) The hypernetwork can directly deal with numeric data, which avoids information loss.
(2) In the process of hyperedge learning, the hyperedge set is divided into three parts: upper approximation, lower approximation, and boundary region. In addition, hyperedges in the boundary region are processed specially, which improves classification accuracy.
The proposed algorithm tackles the imbalanced data classification problem in the following two respects.
(1) Improve the degree of imbalance of the hyperedge set. In a traditional hypernetwork, the class of a hyperedge is inherited directly from a sample, which does not help to improve the degree of imbalance of the hyperedge set. In this paper, when hyperedges are initialized, the classes of a fraction of the hyperedges are determined according to the classes of samples, which reduces the degree of imbalance of the hyperedge set to some extent.
(2) Set a classification condition. The classification process of the traditional hypernetwork does not take the degree of imbalance of the hyperedge set into consideration, resulting in low accuracy on the minority class. Here, a threshold equal to the square of the degree of imbalance is set as the classification condition. This makes the classifier pay more attention to the minority class and thus deal with the class imbalance problem appropriately.
The flow chart of the algorithm is shown in Figure 4. Each part of the flow chart is analyzed in the remainder of this section.

Hyperedge Initialization.
Hyperedges are generated from the samples, which preserves the real distribution of the sample set and thereby provides a foundation for hyperedge selection. Meanwhile, some attribute values can be changed while a hyperedge is generated. Thus, more decision rules are generated for sample classification, which can improve classification accuracy to some extent.
In this paper, the attribute set of a hyperedge is the universal set of attributes of the samples; that is to say, a hyperedge is exactly a sample and is denoted by a small dot in the figure. In hyperedge initialization, a value is assigned to each attribute and the class of each hyperedge is determined. In order to process imbalanced datasets, two classes are considered in the following definitions.
So the degree of imbalance for is defined as
In this paper, the process of hyperedge generation consists of two stages: attribute inheritance and class confirmation.
(1) Attribute inheritance: a hyperedge and a sample have the same attribute set, and the attribute values of the hyperedge are assigned partly based on the sample. Of all attributes of a hyperedge, 7/10 are selected randomly and inherit the corresponding attribute values of the sample. The remaining attribute values are generated randomly in the range of the corresponding attribute values of the sample. Figure 5 shows a sample and a hyperedge generated from it.
(2) Class confirmation: the class of a hyperedge is confirmed by the whole dataset. There are two cases as follows.
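The attribute inheritance stage (1) above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `init_hyperedge` is hypothetical, numeric attributes are assumed, and the "range" of an attribute is assumed to be its (min, max) interval over the training set, from which the non-inherited values are drawn uniformly.

```python
import random


def init_hyperedge(sample, attr_ranges, inherit_ratio=0.7, rng=None):
    """Build the attribute values of one hyperedge from one sample.

    attr_ranges[j] is the (min, max) of attribute j over the training set
    (an assumption about what "range" means here).
    """
    rng = rng or random.Random()
    n = len(sample)
    # Randomly choose which attributes inherit the sample's value (7/10 by default).
    inherited = set(rng.sample(range(n), round(inherit_ratio * n)))
    edge = []
    for j in range(n):
        if j in inherited:
            edge.append(sample[j])                 # inherited from the sample
        else:
            low, high = attr_ranges[j]
            edge.append(rng.uniform(low, high))    # random value inside the range
    return edge, inherited


# Hypothetical sample with ten attributes, each ranging over [0, 1].
sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ranges = [(0.0, 1.0)] * 10
edge, inherited = init_hyperedge(sample, ranges, rng=random.Random(1))
```

The class confirmation stage (2) is not covered by this sketch.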
Two factors should be considered when the neighborhood hypergraph is used to classify samples: (1) the numbers of majority hyperedges and minority hyperedges in the neighborhood of a sample; (2) the degree of imbalance of the hyperedge set. Combining the two, the classification method is presented.
In the experimental evaluation, we conclude that the square of the degree of imbalance is a good choice of threshold for enhancing the accuracy on the minority class, while the degree of imbalance itself is a poor one for classifying the minority class samples.
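How such a threshold could favor the minority class can be illustrated with a hedged sketch. Two assumptions are made purely for illustration and are not taken from the definitions above: the degree of imbalance is taken to be the ratio of minority to majority hyperedges in the whole hyperedge set (a value below 1), and a sample is assigned to the minority class when the minority-to-majority ratio among the hyperedges in its neighborhood reaches the threshold. The function name `classify` is hypothetical.

```python
def classify(n_minority, n_majority, imbalance_degree):
    """Label a sample from the hyperedge counts in its neighborhood.

    Assumptions (illustrative only): imbalance_degree is
    |minority hyperedges| / |majority hyperedges| over the whole hyperedge
    set, and the minority class wins when the local minority-to-majority
    ratio reaches the threshold imbalance_degree ** 2.
    """
    threshold = imbalance_degree ** 2
    if n_majority == 0:
        # No majority hyperedge nearby: minority wins if any minority
        # hyperedge is present; an empty neighborhood falls back to majority.
        return "minority" if n_minority > 0 else "majority"
    return "minority" if n_minority / n_majority >= threshold else "majority"


# One minority and ten majority hyperedges near the sample, imbalance degree 0.25.
label = classify(1, 10, 0.25)
```

Under these assumptions the squared threshold (0.0625) is smaller than the unsquared one (0.25), so a local ratio of 0.1 is enough for a minority decision, which is how squaring shifts attention toward the minority class.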

Hyperedge Replacement.
In the process of hyperedge initialization, part of the attribute values are generated randomly. As a result, some hyperedges are not suitable for sample classification. In order to achieve better performance, the poor hyperedges should be replaced by newly generated hyperedges, which is called hyperedge replacement. The algorithm divides the hyperedge set into upper approximation, lower approximation, boundary region, and negative region. The confidence degree of hyperedges in the lower approximation is 1. The confidence degree of hyperedges in the boundary region is between 0 and 1. Hyperedges whose confidence degree is 0 belong to the negative region. Hyperedges in the lower approximation are all retained because they are very helpful for classification. On the contrary, since hyperedges in the negative region are counterproductive for classification, they are replaced. Hyperedges in the boundary region are handled by a threshold: when the confidence degree of a hyperedge is less than the threshold, it is replaced. In this way, hyperedge replacement becomes more targeted and effective.
Hyperedge replacement is composed of three steps.
(2) Find the hyperedges in the hyperedge set whose confidence degree is below the threshold.
According to Definitions 7 and 8, one can calculate the confidence degree of each hyperedge following the three cases below ( denotes a hyperedge).

Case 1. If ( ) = , then hyperedge 1 is not in the neighborhood of any sample, as shown in Figure 6.
In this case, one can assume that hyperedge 1 is in the overlapped neighborhood of its five nearest samples. Then the confidence degree of hyperedge 1 can be calculated according to Formula (6).
Case 2 (Conf ( ) ≥ ). The samples surrounding hyperedge 1 have the same class as hyperedge 1. Thus hyperedge 1 is helpful for sample classification and should be retained.
Case 3 (0 ≤ Conf ( ) < ). An example is given in Figure 7 to explain this situation. According to Formula (4) and Definition 6, we know
This kind of hyperedge has the same class as only a few of the samples surrounding it, which results in a poor effect on classification. Thus, such hyperedges should be replaced.
(3) Generate new hyperedges and replace the poor ones. One selects from the hyperedge set a hyperedge that should be replaced and generates a sample. A new hyperedge is then initialized from this sample and replaces the selected hyperedge. The process is repeated until no hyperedge needs to be replaced.
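The retention rule and the replacement loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `confidence_of` and `make_hyperedge` are hypothetical callables standing in for the confidence computation and for hyperedge initialization, and `max_rounds` is an added safeguard that is not part of the description above.

```python
def needs_replacement(confidence, threshold):
    """Decide whether a hyperedge is replaced, given its confidence degree."""
    if confidence == 1.0:            # lower approximation: always retained
        return False
    if confidence == 0.0:            # negative region: always replaced
        return True
    return confidence < threshold    # boundary region: compare with threshold


def replace_poor_hyperedges(hyperedges, confidence_of, make_hyperedge,
                            threshold, max_rounds=100):
    """Regenerate poor hyperedges until none is left or max_rounds is reached."""
    edges = list(hyperedges)
    for _ in range(max_rounds):
        poor = [i for i, e in enumerate(edges)
                if needs_replacement(confidence_of(e), threshold)]
        if not poor:
            break
        for i in poor:
            edges[i] = make_hyperedge()
    return edges


# Toy run: each "hyperedge" is represented only by its confidence degree,
# and every regenerated hyperedge is assumed to have confidence 0.9.
result = replace_poor_hyperedges(
    [1.0, 0.0, 0.4, 0.7],
    confidence_of=lambda e: e,
    make_hyperedge=lambda: 0.9,
    threshold=0.6,
)
```

In the toy run, the hyperedges with confidence 0.0 (negative region) and 0.4 (boundary region, below the threshold) are regenerated, while those with confidence 1.0 and 0.7 are kept.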

Neighborhood Hypergraph Algorithm.
In this paper, sample classification and hyperedge replacement are based on the neighborhood radius of a sample. According to Definition 1, the neighborhood radius of sample $x_i$, denoted by $\delta$, is computed as follows [34]:
\[ \delta = \min(\Delta(x_i, s)) + w \cdot \operatorname{range}(\Delta(x_i, s)), \quad 0 \le w \le 1, \]
where $x_i$ ($i = 1, 2, \ldots, n$) is a training sample, $\min(\Delta(x_i, s))$ denotes the minimal value of the distance between $x_i$ and the remaining samples $s$ excluding $x_i$, and $\operatorname{range}(\Delta(x_i, s))$ denotes the value domain of $\Delta(x_i, s)$.
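A per-sample radius built from the minimal distance and the range of distances can be sketched as follows. The sketch assumes Euclidean distance, assumes that the two quantities are combined as the minimum plus a weighted range, and takes the range to be the maximum distance minus the minimum distance; the weight `w` and the function name `neighborhood_radius` are hypothetical.

```python
import math


def neighborhood_radius(samples, i, w=0.1):
    """Radius for sample i: min distance + w * (max distance - min distance)."""
    # Distances from sample i to every other training sample.
    dists = [math.dist(samples[i], s) for j, s in enumerate(samples) if j != i]
    return min(dists) + w * (max(dists) - min(dists))


# Hypothetical training samples; distances from the first one are 1, 3, and 5.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 0.0]]
radius = neighborhood_radius(samples, 0, w=0.5)
```

With `w = 0` the neighborhood reaches only the nearest other sample, and with `w = 1` it covers every training sample, so the weight controls how many hyperedges can fall inside a neighborhood.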
The N-HyperGraph algorithm is given in Algorithm 1. There are two main parameters in the algorithm: (1) the radius of the neighborhood; (2) the threshold of the confidence degree. The former controls the number of hyperedges, which increases as the radius increases. The latter ensures the quality of the hyperedges: the higher the threshold, the better the hyperedges that are obtained. Of course, when the threshold is too large, almost all hyperedges will be replaced, which is pointless.

Datasets.
In order to test the proposed algorithm, 18 UCI datasets were selected, downloaded from the machine learning data repository of the University of California, Irvine. The imbalance ratio ranges from 1.37 to 28.10. There are seven multiclass datasets and eleven two-class datasets. Multiclass datasets are modified to obtain two-class imbalance problems: one or more classes are merged to form the minority class, and one or more of the remaining classes are merged and labeled as the majority class. Missing values of continuous features are filled with the average value, and missing values of nominal features are filled with the most frequent value. The datasets are outlined in Table 1, sorted by imbalance ratio from low to high.

Experimental Evaluation in Imbalanced Domains. Traditional evaluation usually uses the confusion matrix, shown in Table 2, where TP is the number of positive samples classified as positive, TN is the number of negative samples classified as negative, FN is the number of positive samples that are misclassified, and FP is the number of negative samples that are misclassified.
From Table 2, several useful evaluation measures can be obtained as follows.
Three evaluation measures are used: Precision, Recall, and F-value. Precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while Recall (also known as sensitivity) is the fraction of relevant instances that are retrieved:
\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}. \]
From these formulas, Precision can be increased by decreasing FP, and Recall can be increased by increasing TP. In practice, however, the two goals conflict. The F-value is therefore used to consider them jointly: the F-value is high only when both Precision and Recall are high.
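These measures can be computed directly from the four confusion matrix counts, as in the following minimal sketch. Two assumptions are made: the F-value is taken to be the harmonic mean of Precision and Recall (equal weighting), and G-means is taken to be the geometric mean of the true positive rate and the true negative rate. The function name `imbalance_metrics` and the counts are hypothetical, and the sketch does not guard against empty denominators.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Precision, Recall, F-value, and G-means from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    # Assumption: F-value with equal weight on Precision and Recall.
    f_value = 2 * precision * recall / (precision + recall)
    # Assumption: G-means as the geometric mean of the two class-wise rates.
    g_means = math.sqrt(recall * specificity)
    return {"precision": precision, "recall": recall,
            "f_value": f_value, "g_means": g_means}


# Hypothetical result: 10 positive (minority) and 90 negative samples.
m = imbalance_metrics(tp=8, fn=2, fp=8, tn=82)
```

In this example Recall is 0.8 but Precision is only 0.5, and the F-value of about 0.62 reflects the weaker of the two rather than their arithmetic average.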
Another appropriate metric for measuring classification performance on imbalanced datasets is the receiver operating characteristic (ROC) graph [37]. In such a graph, the tradeoff between the benefits (TP) and costs (FP) can be visualized, and it acknowledges the fact that no classifier can increase the number of true positives without also increasing the false positives. The area under the ROC curve (AUC) [38] corresponds to the probability of correctly identifying which of two stimuli is noise and which is signal plus noise. AUC provides a single-number summary of the performance of learning algorithms.
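The probabilistic reading of AUC can be illustrated with a minimal sketch that counts, over all (positive, negative) pairs, how often the positive sample receives the higher score. The function name `auc` and the toy scores are hypothetical; ties are counted as one half, and labels are assumed to be 1 for positive and 0 for negative.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for three positive and three negative samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
area = auc(scores, labels)
```

Eight of the nine pairs are ordered correctly here; only the positive sample scored 0.35 is ranked below the negative sample scored 0.6.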

Experimental Results and Analysis.
Contrastive experiment results on Precision, Recall, F-value, G-means, and AUC for each algorithm are shown in Tables 3 to 7.
In order to compare the performance of the 5 algorithms, the average values of the different indicators for the 5 algorithms are shown in Figure 8. We can see from Tables 3-7 and Figure 8 that the Precision of the proposed algorithm N-HyperGraph is unstable. As mentioned before, the process of hyperedge initialization is based on the degree of imbalance of the training set. The generation of hyperedges depends on the degree of imbalance, so the generated hyperedges incline toward the minority samples. Thus, the proposed algorithm is not stable on Precision. However, it works better than SVM and J48 (C4.5) on Recall, F-value, G-means, and AUC.
Overall, the experimental results of N-HyperGraph are better than those of all the other algorithms. Since rough set theory is used in N-HyperGraph, it processes uncertain samples more efficiently, especially in the boundary region of the hyperedge set. What is more, weights are calculated through the neighborhood rough set model, so more hyperedges take part in the class decision of a hyperedge, which improves accuracy. These two aspects of the proposed algorithm improve the classification results.
As SMOTE oversamples all minority class samples, it decreases the decision space of the majority class. Although it can improve the Recall of the minority class, many majority class samples are misclassified as minority class, which decreases Precision. SMOTE-RSB filters the synthetic samples more strictly than SMOTE, so few synthetic samples are generated when datasets are highly imbalanced. Thus, compared with SMOTE, its improvement is not obvious. NRSBoundary-SMOTE takes the neighborhood rough set into consideration and emphasizes resampling of the minority class samples that belong to the boundary region, and thus improves the F-value. N-HyperGraph, in contrast, replaces hyperedges repeatedly according to the neighborhood rough set. The distribution of the hyperedge set gradually approaches the true distribution of the samples, which yields a more obvious improvement in classification performance. Besides, since CS-EN-HN can only deal with discrete data, the resulting information loss makes its performance worse than that of N-HyperGraph.

Conclusion
In this paper, a new algorithm based on the hypernetwork, called N-HyperGraph, is proposed to solve the problem of classifying imbalanced datasets. First, the hyperedge set is divided according to rough set theory. Then, poor hyperedges are replaced, taking the degree of imbalance into account, in order to improve accuracy. The experimental results on 18 UCI datasets with different degrees of imbalance show that the classification results of N-HyperGraph are clearly better than those of the other four algorithms. However, N-HyperGraph is time consuming, because the distances between hyperedges and samples must be calculated. Reducing the running time of the algorithm is therefore our future work.