AGNES-SMOTE: An Oversampling Algorithm Based on Hierarchical Clustering and Improved SMOTE

Aiming at the low classification accuracy of imbalanced datasets, an oversampling algorithm—AGNES-SMOTE (Agglomerative Nesting-Synthetic Minority Oversampling Technique), based on hierarchical clustering and improved SMOTE—is proposed. Its key procedures are to hierarchically cluster the majority samples and minority samples, respectively; divide minority subclusters on the basis of the obtained majority subclusters; select "seed samples" based on the sampling weight and probability distribution of each minority subcluster; and restrict the generation of new samples to a certain area by the centroid method during sampling. The combination of AGNES-SMOTE and SVM (Support Vector Machine) is presented to deal with imbalanced dataset classification. Experiments on UCI datasets are conducted to compare the performance of different algorithms from the literature. Experimental results indicate that AGNES-SMOTE excels at synthesizing new samples and improves SVM classification performance on imbalanced datasets.


Introduction
An imbalanced dataset is one in which some classes have far fewer instances than others. In the biclass case, the class with fewer samples is referred to as the minority class, and the class with more samples is the majority class [1]. In reality, there are many scenarios of imbalanced data classification, such as credit card fraud detection, information retrieval and filtering, and market analysis [2]. Conventional classifiers typically favor the majority class, giving rise to classification errors. The imbalance of sample sizes between two different classes is regarded as between-class imbalance, and imbalanced data distribution density within one class is within-class imbalance. Within-class imbalance forms multiple subclasses within the same class but with different data distributions [3,4]. Both kinds of imbalance cause classification errors. In addition, oversampling algorithms often cause problems such as synthetic sample overlap [5] and samples distributed "marginally" [6], which reduce classification performance. Therefore, how to improve conventional algorithms to handle imbalanced dataset classification and promote classification performance has become a research focus of data mining and machine learning.
Research on imbalanced datasets mainly covers data processing and classification algorithms [7,8]. Cost-sensitive learning [9] and ensemble learning [10] are representative classification algorithms. The most frequently used data-processing methods are oversampling and undersampling, which balance the two classes by increasing minority samples and decreasing majority samples, respectively. Data-level sampling methods are usually simple and intuitive. Undersampling usually causes information loss, while oversampling balances the original dataset without discarding samples. Thus, the latter is often adopted in data classification.
At present, the most frequently used oversampling method is the SMOTE algorithm proposed by Chawla's team [11] in 2002. It creates new synthetic samples by linear interpolation between a sample x and a sample y, where x is an existing minority sample and y is another minority sample picked randomly from the nearest neighbors of x.
This algorithm neglects the uneven data distribution within the minority class and the possibility of sample overlap during synthesis. Han Hui's team [12] suggested the Borderline-SMOTE algorithm in 2005, which divides minority samples into a boundary area, a safe area, and a dangerous area. It synthesizes samples only from the boundary area, which avoids selecting minority samples indiscriminately and producing the many redundant new samples caused by SMOTE. The ADASYN algorithm, proposed by He's team [13], automatically determines the number of samples to generate from each minority sample based on the data distribution: minority samples with more neighboring majority samples generate more new samples. Compared with SMOTE, it partitions the sample distribution more exhaustively. Cluster-SMOTE [14] adopts the K-means algorithm to cluster minority samples into subclusters and then applies the SMOTE algorithm to each; however, it neither determines the optimal number of subclusters nor calculates the sample count to generate per subcluster. K-means-SMOTE [15] combines K-means clustering with SMOTE. Compared with Cluster-SMOTE, it clusters the entire dataset, finds overlap and avoids oversampling in unsafe areas, restricts synthetic samples to the target area, and alleviates both within-class and between-class imbalance; meanwhile, it avoids noise samples and attains good results. CBSO [16] combines clustering with the data-generation mechanism of existing oversampling techniques to ensure that synthetic samples always lie in the minority-class area, avoiding erroneous samples.
Although the abovementioned oversampling methods indeed improve classification accuracy to a certain extent, they have the following deficiencies: (1) When oversampling, much attention is paid to between-class imbalance while within-class imbalance receives far less. (2) Clustering can address both between-class and within-class imbalance, but aliasing between the two classes is exacerbated, leading to overlapping synthetic samples; moreover, conventional k-means clustering requires the value of k to be set in advance, works best on spherical datasets, and is comparatively complex. (3) The minority-class boundary is not maximized, affecting synthetic sample quality. (4) Without restrictions on the destination area, synthetic samples end up distributed marginally. (5) Noise samples interfere with generation.
Based on the above discussion, this paper offers an improved oversampling method, AGNES-SMOTE. Its procedures are as follows: filter noise samples; adopt the AGNES algorithm to cluster minority samples, forming minority subclusters iteratively while considering the majority sample distribution during merging to avoid generating overlapping synthetic samples; repeat this operation until the distance between the two closest minority subclusters exceeds the set threshold. Then, determine sampling weights according to the sample size of each minority subcluster, calculate the probability distribution of each minority subcluster according to the distances between minority samples and their neighboring majority samples, and combine the two to select "seed samples" for oversampling. Restrict the generation of new samples to a certain area by the centroid method during synthesis: select a sample from all "seed samples," randomly select two neighboring minority samples from the subcluster where the selected seed sample is located, form a triangle with the three selected samples, and synthesize new samples on the lines from the three samples to the centroid. Compared with other algorithms, AGNES-SMOTE attains better results in the experiments.

Preliminary Theory
2.1. SMOTE Algorithm. The SMOTE algorithm alleviates data imbalance by artificially synthesizing new minority samples. For a sample X = (x_1, x_2, ..., x_n) and another sample Y = (y_1, y_2, ..., y_n), where the subscripts 1, 2, ..., n index the sample dimensions, the Euclidean distance D between X and Y is

D = sqrt(Σ_{i=1}^{n} (x_i − y_i)²).    (1)

For each sample X in the minority class, search for its K nearest minority neighbors and randomly select N samples from them; then interpolate between X and each selected neighbor:

X_new = X + rand(0, 1) × (Y_i − X),  i = 1, 2, ..., N,    (2)

where X_new is a new synthetic sample, X is the selected original sample, rand(0, 1) is a random number between 0 and 1, and Y_i is one of the N samples selected from the K nearest neighbors of X.
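As an illustration, the interpolation described above can be sketched in a few lines of Python. This is a minimal sketch: the function name, the random-number handling, and the neighbor bookkeeping are our own choices, not part of the original SMOTE specification.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: interpolate between a minority sample and
    a randomly chosen sample among its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # exclude each sample itself
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                # pick an original sample X
        j = rng.choice(neighbours[i])      # pick a neighbour Y_i
        gap = rng.random()                 # rand(0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each new point lies on the segment between two existing minority samples, every synthetic sample stays inside the minority class's bounding region, which is exactly the behavior the later sections of the paper set out to refine.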

AGNES Algorithm.
The conventional AGNES algorithm performs hierarchical clustering. It treats every data object as a cluster and gradually merges clusters according to certain criteria; for example, if the distance between two data objects in different clusters is the smallest, the two clusters may be merged. Merging is repeated until a termination condition is met. In AGNES, the distance between clusters is obtained by calculating the distance between the closest data objects in the two clusters, so a cluster can be represented by all of its objects.
Compared with aggregating samples by the conventional centroid method, the AGNES algorithm is more accessible, independent of initial values, and free from assumptions on the samples' distribution shape; it can also aggregate all samples together. Considering the influence of between-class and within-class imbalance on model performance, AGNES is more applicable to the uneven data distribution of within-class imbalance.
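For reference, conventional AGNES with single linkage can be sketched as follows. The threshold-based stopping rule mirrors the termination condition described above; the quadratic merge loop and function names are our own simplifications.

```python
import numpy as np

def agnes(points, threshold):
    """Agglomerative nesting (AGNES) sketch with single linkage: start
    with one cluster per point and repeatedly merge the two closest
    clusters until the smallest inter-cluster distance exceeds
    `threshold`. Returns clusters as lists of point indexes."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:                  # termination condition
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Note that, unlike k-means, nothing here depends on an initial choice of centers or on the clusters being spherical, which is the advantage the paragraph above appeals to.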

Divide Minority Clusters.
The AGNES-SMOTE algorithm is proposed in this paper to refine SMOTE and its improved variants. The algorithm first filters noise samples, then uses AGNES to cluster samples and divide the dataset into subclusters. During clustering, the average-distance method is used to calculate the distance between two subclusters: merge the two closest subclusters to form a new subcluster, reducing the number of subclusters by one, and continue merging the two closest subclusters until the distance between subclusters exceeds the set threshold. To avoid generating overlapping samples, the majority sample distribution must be considered.
Before clustering minority samples with AGNES, cluster the majority samples first to obtain the set of majority subclusters, which represents the majority class.
Then, judge the distances between the majority and minority subclusters. If the distances from a majority subcluster to two minority subclusters are both less than the minimum distance between those two minority subclusters, merging them would produce overlapping samples, so the two minority subclusters should not be merged. The specific steps are as follows:

Step 1. Given the original dataset I, use the K-nearest-neighbor rule to filter noise samples from I. Set K = 5 and traverse the samples in I. If more than 4/5 of a sample's K nearest neighbors in I belong to the opposite class, judge the sample as noise and eliminate it. The remaining samples constitute the sample set I′.
Step 2. Cluster the majority samples in I′: treat each sample as an independent subcluster, use formula (3) to calculate the distance between subclusters, and merge the two closest subclusters; repeat until the minimum distance exceeds the preset threshold T_h, yielding the majority subcluster set C_maj:

d(C_i, C_j) = (1/(|C_i|·|C_j|)) Σ_{p∈C_i} Σ_{q∈C_j} ‖p − q‖,    (3)

where p and q are samples in subclusters C_i and C_j, respectively, and |C_i| and |C_j| represent their sample sizes.
Step 3. Divide the minority samples according to the obtained majority subcluster set C_maj: treat each minority sample as a separate subcluster, obtaining the minority subcluster set C_min = {C_min1, C_min2, ..., C_minm}.
Step 4. Calculate the distance between every pair of minority subclusters with formula (3) and record the minimum distance D_min together with the indexes i and j of the corresponding subclusters.
Step 5. Traverse the majority subclusters in C_maj. If there is a majority subcluster C_majk whose distances to minority subclusters C_mini and C_minj are both less than the distance between those two minority subclusters, do not merge C_mini and C_minj, and set D_min to a large value so that this pair is not reconsidered. Otherwise, merge C_mini and C_minj into a new minority subcluster, reducing the number of minority subclusters by one.
Step 6. When a new minority subcluster C_minn is formed by merging, recalculate the distances between C_minn and the remaining minority subclusters in C_min with formula (3). Repeat Step 4 and Step 5 until the distance between the nearest minority subclusters exceeds the set threshold T_h; then stop merging and obtain the final minority subcluster set.

The threshold is the key condition for merging subclusters. To estimate the threshold T_h, first define the value d_avg. For each sample x_a in a minority subcluster C_mini, let d be the median of the distances from x_a to the remaining samples x_b of that subcluster, where |C_mini| denotes the subcluster's sample size; d_avg is the average of these median distances over all minority samples. Taking the average of median distances as the reference value avoids interference from noise samples. The threshold T_h is then defined as

T_h = f · d_avg,    (5)

where the parameter f is a distance adjustment factor that tunes T_h; the value range of f is discussed later.
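The noise filter of Step 1 and the threshold computation can be sketched as follows. This is an illustrative sketch: the helper names and the exact aggregation of the per-sample median distances into d_avg are our assumptions.

```python
import numpy as np

def filter_noise(X, y, k=5):
    """Step 1 sketch: drop a sample when more than 4/5 of its k nearest
    neighbours belong to the opposite class."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    keep = []
    for i in range(len(X)):
        nn = np.argsort(d[i])[:k]
        opposite = np.sum(y[nn] != y[i])
        if opposite / k <= 4 / 5:          # keep unless > 4/5 disagree
            keep.append(i)
    return X[keep], y[keep]

def merge_threshold(clusters, points, f=1.0):
    """Sketch of T_h = f * d_avg: d_avg averages, over all minority
    samples, the median distance from each sample to the rest of its
    subcluster. The median damps the influence of outliers."""
    medians = []
    for c in clusters:
        for i in c:
            rest = [j for j in c if j != i]
            if rest:
                medians.append(np.median(
                    [np.linalg.norm(points[i] - points[j]) for j in rest]))
    return f * float(np.mean(medians))
```

A larger f raises T_h, allowing more merges and hence fewer, larger minority subclusters, which is the trade-off examined in the parameter-f experiments later.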

Determine Sampling Weight and Probability Distribution.
In classification tasks, the imbalances of within-class samples and between-class samples will affect model performance.
The density of each subcluster varies with its sample size, and the sampling weight of each minority subcluster is determined by its denseness: a dense subcluster gets a small weight and a sparse subcluster a large one to avoid overfitting. Thus, the sampling weight W_i assigned to a minority subcluster varies with its size, as given by formula (6), in which N represents the number of minority subclusters and num_i represents the sample size of the i-th minority subcluster. From formula (6), the larger the sample size of a minority subcluster as a proportion of all minority samples, the smaller W_i becomes; that is, both the assigned weight and the number of synthetic samples shrink, eventually balancing the sample distribution within the class.
As shown in formula (7), the number of samples to synthesize for each minority subcluster is determined by W_i (the subcluster's sampling weight) and N_maj − N_min (the difference between the majority and minority sample sizes after excluding noise samples):

num_i = W_i × (N_maj − N_min).    (7)

In addition, minority samples closer to the decision boundary are more prone to misclassification, which increases their learning difficulty. Therefore, samples for oversampling must be selected carefully. To ensure the quality of synthetic samples, the probability distribution of each minority subcluster is introduced so as to select "seed samples" among the minority samples carrying important but hard-to-learn information. The probability of each sample being selected is denoted D_i and is given by formula (8), in which y_b is the b-th majority-class neighbor of sample x (1 ≤ b ≤ k), d_{xy_b} denotes the Euclidean distance between minority sample x and majority sample y_b, i indexes a sample in the minority subcluster, n is the subcluster's sample size, and k is the number of neighbors. From the formula, the probability of a sample being selected is determined by its distance to the majority-class boundary: minority samples closer to that boundary are selected with higher probability than samples far away, and these probabilities together constitute the probability distribution of the minority subcluster. In this way, the distribution characteristics of the samples are taken into account and the minority-class decision boundary is extended effectively.
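The weight and probability computations can be sketched as follows. Note the hedge: the inverse-size and inverse-distance normalizations below are our assumptions standing in for the paper's exact formulas; they reproduce the stated qualitative behavior (larger subclusters get smaller weights, boundary samples get higher selection probability) but are not the verbatim equations.

```python
import numpy as np

def sampling_weights(sizes):
    """Assumed form of the sampling weights: inversely proportional to
    subcluster size and normalised to sum to 1, so larger minority
    subclusters receive smaller weights."""
    inv = 1.0 / np.asarray(sizes, dtype=float)
    return inv / inv.sum()

def selection_probabilities(X_sub, X_maj, k=5):
    """Assumed form of the subcluster probability distribution: each
    minority sample's score is the inverse of its summed distance to
    its k nearest majority neighbours, normalised over the subcluster,
    so samples near the majority boundary are picked more often."""
    d = np.linalg.norm(X_sub[:, None, :] - X_maj[None, :, :], axis=2)
    d_knn = np.sort(d, axis=1)[:, :k].sum(axis=1)
    score = 1.0 / d_knn
    return score / score.sum()
```

The number of samples to synthesize per subcluster then follows formula (7), e.g. `np.round(sampling_weights(sizes) * (N_maj - N_min))`.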

Restrict the Generation of New Samples in a Certain Area.
Determine the number of samples to synthesize for each minority subcluster and select the "seed samples" according to each subcluster's probability distribution. Restricting the generation of new samples to a certain area improves classifier performance and prevents synthetic samples from being distributed marginally; therefore, the distribution of the newly generated samples must be taken into account during synthesis. Select a sample from the "seed samples," randomly select two neighboring minority samples from the subcluster where the selected seed sample is located, and form a triangle with the three selected samples as vertexes. Synthesize new samples on the lines from the three vertexes to the centroid, respectively; one triangle thus generates three new synthetic samples. The centroid method is adopted to restrict the generation of new samples to a certain area. Given three samples X_1, X_2, and X_3, their centroid X_T can be calculated by the following formula:

X_T = ((X_1 + X_2 + X_3)/3, (Y_1 + Y_2 + Y_3)/3),

where X_i represents the horizontal coordinates of the three vertexes and Y_i the vertical coordinates. This method pulls the new samples toward the centroid, which addresses the marginal distribution of new samples caused by SMOTE: when synthesizing, the generation of new samples is restricted to a certain area, and the synthetic samples move closer to the centroid.
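The centroid restriction can be sketched directly. One detail is an assumption on our part: we interpolate each vertex toward the centroid with a uniform random factor, by analogy with SMOTE's rand(0, 1); the paper's exact interpolation rule along those lines is not reproduced here.

```python
import numpy as np

def centroid_synthesis(x1, x2, x3, rng=None):
    """Form a triangle from a seed sample and two neighbouring minority
    samples, then interpolate each vertex toward the centroid, producing
    three new samples that necessarily stay inside the triangle."""
    rng = np.random.default_rng(rng)
    vertices = np.array([x1, x2, x3], dtype=float)
    centroid = vertices.mean(axis=0)        # X_T = (X1 + X2 + X3) / 3
    gaps = rng.random(3)                    # one rand(0, 1) per vertex
    return vertices + gaps[:, None] * (centroid - vertices)
```

Because every generated point lies on a segment from a vertex to the centroid, it cannot escape the triangle, which is precisely how this step prevents marginally distributed synthetic samples.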

AGNES-SMOTE Algorithm.
The procedures of the AGNES-SMOTE algorithm are depicted below. Use the K-nearest-neighbor rule to filter noise samples from the original dataset. Adopt the AGNES algorithm to cluster the majority samples into several majority subclusters. Then cluster the minority samples, merging the two closest subclusters on the basis of the obtained majority subclusters, and keep clustering until the distance between two minority subclusters exceeds the set threshold, thereby obtaining the minority subclusters. Assign a weight to each minority subcluster, calculate each subcluster's probability distribution, and combine the two to oversample the samples in each minority subcluster. Restrict the generation area of synthetic samples by the centroid method. The detailed Algorithm 1 is as follows: (1) Delete noise samples from the original dataset to obtain the ClearData dataset, and split ClearData into a majority sample group and a minority sample group. Use AGNES to cluster the majority group into majority subclusters; then cluster the minority group. When clustering, determine whether majority samples lie between the two nearest minority subclusters; if not, merge the minority subclusters (lines 1 to 10).
(2) Calculate the sample size of each obtained minority subcluster, assign it a sampling weight, calculate the number of samples to be synthesized, and then calculate the probability distribution of each minority subcluster (lines 15 to 23). (3) Finally, in each minority subcluster, select "seed samples" based on the number of samples to be synthesized and the probability distribution. Select a sample from all "seed samples," randomly select two neighboring minority samples from the subcluster where the selected seed sample is located, form a triangle with the three selected samples as vertexes, and synthesize new samples on the lines from the three samples to the centroid, respectively. Then, add the new synthetic samples to the synthetic sample group (lines 24 to 36).

Datasets.
In this paper, nine groups of UCI datasets [20] are selected for the experiment; their structures are listed in Table 2. Stratified random division is adopted to keep the imbalance ratio consistent between the training set and the test set. Tenfold cross-validation is used as the evaluation method: each dataset is divided into 10 parts, each part in turn serves as the verification set while the remaining nine parts form the training set, and the average of the 10 results is reported. The parameters of the SVM classifier are set as follows: the kernel function is a Gaussian radial basis function and the penalty factor C is 10.
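The evaluation setup can be reproduced in outline with scikit-learn. The toy dataset below stands in for the UCI data, and the gamma setting is an assumption (the paper only specifies an RBF kernel and C = 10).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Imbalanced toy data standing in for the UCI datasets (9:1 ratio).
X, y = make_classification(n_samples=300, weights=[0.9, 0.1],
                           random_state=0)

# RBF-kernel SVM with penalty factor C = 10, as in the paper.
clf = SVC(kernel='rbf', C=10, gamma='scale')

# Stratified 10-fold CV preserves the imbalance ratio in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring='f1')
print(scores.mean())
```

In the paper's pipeline, oversampling would be applied to the training folds only, so that the verification fold keeps its original imbalanced distribution.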

Determine Parameter f.
The performance of the AGNES-SMOTE algorithm is affected by its parameters to some extent.
The distance adjustment factor f controls subcluster merging during clustering. If f is too small, the number of minority subclusters becomes too large while each subcluster contains too few samples, which reduces the diversity of synthetic samples and causes overfitting. If f is too large, merged clusters will contain majority samples, resulting in overlap during synthesis.
As shown in Table 3, five datasets are used as test datasets to determine the range of parameter f. f = 1.0 indicates no adjustment of the threshold, so f = 1.0 is used as the axis around which f values are selected. The tests show that 3 datasets obtain their maximum F-measure value at f = 1.0, 1 dataset at f = 0.6, and 1 dataset at f = 1.5. Therefore, the reference range of parameter f should be between 0.3 and 1.5. When f > 2.5, F-measure values become similar because, as f grows, all subclusters eventually merge into one.

Experimental Results and Analysis
(1) Analysis of synthetic data distribution: artificial datasets are used to verify and compare the synthetic sample distributions of the newly proposed algorithm and SMOTE. In the figures below, red dots represent majority samples and black crosses represent minority samples and their synthetic samples. Compared with the original data in Figure 1, the samples synthesized by SMOTE (Figure 2) are distributed more toward the edge area and even mix into the majority samples, causing overlap; because the new synthetic samples are highly similar and repetitive, the within-class imbalance of the original dataset is not improved. Addressing the shortcomings seen in Figure 2, AGNES-SMOTE effectively filters noise samples; during clustering, it divides minority subclusters in consideration of the majority sample distribution, preventing new synthetic samples from mixing into the majority sample area and reducing noise impact. It assigns sampling weights to minority subclusters to achieve within-class balance, samples more of the easily misclassified marginal samples on the basis of the probability distribution to form a clear boundary between the two classes, and uses the centroid method to restrict the generation area of new samples, further guaranteeing the quality and diversity of synthetic samples. The resulting data distribution is shown in Figure 3.
(2) The experimental results in Table 4 indicate that AGNES-SMOTE achieves better AUC values on the Ecoli, Libra, Yeast1, Optical_digits, and Abalone datasets than the other sampling algorithms. AGNES-SMOTE attains large AUC values on Libra and Optical_digits because of their large imbalance ratios and rich features, which require more samples to be synthesized.
AGNES-SMOTE considers within-class imbalance, selects the samples to oversample, restricts the generation area of synthetic samples, and reduces synthetic sample overlap, ensuring synthetic sample quality and providing varied information for the classifier. AGNES-SMOTE has lower AUC values on the Haberman, Yeast1, and Liver datasets due to their smaller imbalance ratios and fewer features. The F-measure and G-mean values of SMOTE, K-means-SMOTE, Cluster-SMOTE, and AGNES-SMOTE on each dataset are listed in Tables 5 and 6.
Tables 5 and 6 indicate that the AGNES-SMOTE algorithm attains good F-measure values and G-mean values on most datasets. It greatly improves F-measure values and G-mean values on datasets Ecoli, Yeast1,

Conclusion
Regarding imbalanced dataset classification, the existing oversampling algorithms mainly deal with between-class imbalance and neglect within-class imbalance, and they ignore several problems: the samples to be oversampled are not selected, noise is not removed, synthetic samples overlap, and samples are distributed "marginally." To solve these problems, an oversampling algorithm, AGNES-SMOTE, based on hierarchical clustering and improved SMOTE, is presented in this paper. The algorithm follows these procedures: filter noise samples in the dataset; cluster majority samples and minority samples through the AGNES algorithm, respectively; divide minority subclusters in light of the obtained majority subclusters; select samples for oversampling based on the sampling weight and probability distribution of each minority subcluster; and restrict the generation of new samples to a certain area by the centroid method. Comparative experiments on data processing with different algorithms have been conducted. The experimental results indicate that AGNES-SMOTE improves the classification accuracy of minority samples and the overall classification performance. However, the new oversampling algorithm proposed in this paper is only available for biclass cases. In practice, most data fall into multiple categories, so optimized oversampling algorithms for multiclass data classification are expected in future work.

Data Availability
The data used to support the results of this study are available on the website: https://archive.ics.uci.edu/ml/index.php.

Conflicts of Interest
The authors declare that they have no conflicts of interest.