HSDP: A Hybrid Sampling Method for Imbalanced Big Data Based on Data Partition

The classical classifiers are ineffective in dealing with the problem of imbalanced big dataset classification. Resampling the datasets and balancing the sample distribution before training the classifier is one of the most popular approaches to resolve this problem. An effective and simple hybrid sampling method based on data partition (HSDP) is proposed in this paper. First, all the data samples are partitioned into different data regions. Then, the data samples in the noise minority samples region are removed and the samples in the boundary minority samples region are selected as oversampling seeds to generate the synthetic samples. Finally, a weighted oversampling process is conducted in which synthetic samples are generated in the same cluster as the oversampling seed. The weight of each selected minority class sample is computed as the ratio between the proportion of majority class samples among the neighbors of this selected sample and the sum of all these proportions. Generating synthetic samples in the same cluster as the oversampling seed guarantees that new synthetic samples are located inside the minority class area. Experiments conducted on eight datasets show that the proposed method, HSDP, is better than or comparable with the typical sampling methods in terms of F-measure and G-mean.


Introduction
In the era of big data, the tremendous amount of data generated by various real-world applications brings challenges to data mining. Among these challenges, classification of imbalanced datasets has drawn interest in various application areas. A dataset is imbalanced when the number of samples in one category is much smaller than the number of samples in other categories. If the samples come from two classes, the class with more samples is called the majority class, and the class with fewer samples is called the minority class. Our research focuses on the binary classification problem (two-class classification), in which the prediction of minority samples is more important because the cost of misclassifying minority samples is greater than the cost of misclassifying majority samples. The issue of binary classification of imbalanced data exists in various applications, such as medical diagnosis.
Most classifiers aim at maximizing the overall classification accuracy on a dataset. Therefore, when classifying imbalanced data, the classifier is biased toward the classification accuracy of the majority samples, causing low classification accuracy on the minority class. In addition, an imbalanced dataset combined with other difficulty factors, such as class overlapping, the presence of outliers, and small disjunctions, makes it more difficult for the classifier to predict the minority class [1]. Figure 1(a) shows the skewed distribution between classes, Figure 1(b) shows class overlapping, and Figure 1(c) shows the small disjunctions of the minority class. Therefore, how to improve the classification accuracy of minority samples while ensuring the overall classification performance of the classifier on imbalanced data is an urgent problem to be solved.
The remainder of this paper is organized as follows. Section 2 presents related works. Section 3 describes the proposed HSDP method. Section 4 introduces the experimental settings. Section 5 presents the experimental results and compares our approach with some typical techniques. Finally, the conclusion is drawn in Section 6.

Related Works
The techniques proposed to improve classification for imbalanced data can be categorized into two major groups: data-level methods and algorithm-level methods. The algorithm-level methods modify the classifier in order to improve its accuracy on imbalanced data; they mainly include cost-sensitive methods and ensemble learning methods. The data-level methods mainly include undersampling of the majority class [2] and oversampling of the minority class. In contrast to the algorithm-level methods, the data-level methods are conducive to enhancing the generalization ability of the model, and the oversampling methods have more advantages because they do not lose data sample information [3,4]. SMOTE is one of the most popular oversampling algorithms [5]. SMOTE first selects a random seed x from the minority samples and then randomly selects a sample y among its k neighbors in the same class. Finally, a new synthetic sample s is generated by linear interpolation. This can be expressed as
$$s = x + \mathrm{gap} \times (y - x) \qquad (1)$$
where gap is a random number between 0 and 1. Although the SMOTE algorithm has shown successful performance in various classification scenarios, it also has some weaknesses: (1) If a noisy sample is selected as the seed, the oversampling process may generate more noisy samples. (2) It does not consider the data distribution when generating the synthetic sample, thereby increasing the overlap between different classes [6]. (3) It oversamples uninformative minority samples because it chooses the minority seed with uniform probability, whereas the minority samples in the boundary area contain more information than those far from the boundary [7]. Therefore, researchers have proposed some improved versions of SMOTE. The Borderline-SMOTE algorithm [8] oversamples the borderline minority samples. However, Borderline-SMOTE sometimes generates new synthetic samples in unsuitable areas, such as noise regions and overlapped areas.
The ADASYN algorithm [9] pays more attention to the minority samples that are difficult to learn. It adaptively generates minority samples according to the ratio of majority class samples among the neighborhood samples.
The K-means-SMOTE algorithm [10] combats between-class imbalance and within-class imbalance. However, it does not provide a strategy for determining the optimal number of clusters k, which has a great impact on the performance of oversampling. The MWMOTE technique [11] analyzes the hard-to-learn minority samples and assigns them weights according to their importance in learning.
In summary, the methods above have mitigated some of the problems of SMOTE, but none of them has effectively solved all three problems. The proposed hybrid sampling method based on data partition therefore attempts to overcome all three. It is able to select proper minority samples for oversampling and to improve the synthetic sample generation scheme. The generation scheme covers the number of synthetic samples for each selected minority sample and the control of the location of the generated samples in data space.
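As a point of reference for the weaknesses listed above, the SMOTE interpolation step can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the reference implementation from [5]; the function name `smote_sample` and the toy points are hypothetical. Note that the seed is chosen by the caller and the neighbor is chosen uniformly, which is exactly what lets noisy or uninformative seeds be oversampled.

```python
import math
import random

def smote_sample(seed, minority, k=5, rng=random):
    """Generate one synthetic point from `seed` by SMOTE-style interpolation."""
    # k nearest minority neighbours of the seed (the seed itself is excluded)
    others = [p for p in minority if p is not seed]
    neighbours = sorted(others, key=lambda p: math.dist(seed, p))[:k]
    y = rng.choice(neighbours)      # neighbour picked uniformly at random
    gap = rng.random()              # random number in [0, 1)
    return [xs + gap * (ys - xs) for xs, ys in zip(seed, y)]

# Toy minority class: three nearby points and one distant point
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
synthetic = smote_sample(minority[0], minority, k=2, rng=random.Random(0))
```

With k = 2 the distant point is never a neighbor of the first sample, so the synthetic point stays on a segment between the seed and one of its two close neighbors.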

Overview.
The data samples present different distribution characteristics in data space, and the data distribution can be taken into account when undersampling or oversampling. Using different sampling methods in different regions may improve classification performance, so we propose the hybrid sampling method for imbalanced data based on data partitioning (HSDP). The method consists of four stages: (1) partitioning the space of the input imbalanced data into five regions; (2) removing the samples in the noise minority samples region; (3) clustering the minority samples with an agglomerative hierarchical clustering method; (4) the oversampling process. In the first stage, the data space is divided into five regions: the boundary minority samples region, the noise minority samples region, the safe minority samples region, the boundary majority samples region, and the safe majority samples region. The first two stages are performed because our aim is to oversample the borderline minority samples while ignoring the noisy minority samples. The basic idea is that the borderline samples are apt to be misclassified. In the third stage, the minority samples are clustered to ensure that the generated samples lie inside the minority class regions. In the fourth stage, the oversampling process adaptively generates synthetic samples for borderline minority samples in the same cluster as the oversampling seed, which ensures that the generated samples are located inside the minority class regions.

Data Partition.
According to the proportion of minority samples in neighborhoods of each minority sample, the data space is divided into five regions [12]: the boundary minority samples region, the noise minority samples region, the safe minority samples region, the boundary majority samples region, and the safe majority samples region, as shown in Figure 2.
Given an imbalanced training dataset $S$ and the minority class label class(min), the training dataset is first divided into the majority class set $S_{maj}$ and the minority class set $S_{min}$. Then, for each sample $x_i$ in $S_{min}$, we find the $k$ neighbors of $x_i$ with the K-nearest neighbor algorithm. Next, among these $k$ neighbors, the number of minority class samples $N_{bmin}$ is counted, and the majority class samples are put into the boundary majority samples region $S_{maborder}$. Finally, each minority sample is added to the corresponding region according to $N_{bmin}$. If $N_{bmin} = 0$, the minority class sample is added to the noise minority samples region $S_{miout}$. If $N_{bmin} = k$, the minority class sample is added to the safe minority samples region $S_{misafe}$. If $1 \le N_{bmin} < k$, the minority class sample is added to the boundary minority samples region $S_{miborder}$. The safe majority samples region $S_{masafe}$ and the set $S$ with the samples in $S_{miout}$ removed are determined at the end. The DP algorithm for the data partitioning is described as follows (Algorithm 1):
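The partition rule described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a brute-force neighbor search; the function name `partition`, the region keys, and the toy data are hypothetical and not taken from the paper.

```python
import math

def partition(samples, labels, minority_label, k=5):
    """Split sample indices into the five regions using k nearest neighbours."""
    regions = {"mi_noise": [], "mi_safe": [], "mi_border": [],
               "ma_border": set(), "ma_safe": set()}
    for i, (x, lab) in enumerate(zip(samples, labels)):
        if lab != minority_label:
            continue
        # k nearest neighbours of x among all other samples (brute force)
        order = sorted((j for j in range(len(samples)) if j != i),
                       key=lambda j: math.dist(x, samples[j]))[:k]
        n_min = sum(labels[j] == minority_label for j in order)
        # majority neighbours of a minority sample form the boundary majority region
        regions["ma_border"].update(j for j in order if labels[j] != minority_label)
        if n_min == 0:
            regions["mi_noise"].append(i)       # no minority neighbour: noise
        elif n_min == len(order):
            regions["mi_safe"].append(i)        # all neighbours are minority: safe
        else:
            regions["mi_border"].append(i)      # mixed neighbourhood: boundary
    all_majority = {j for j, lab in enumerate(labels) if lab != minority_label}
    regions["ma_safe"] = all_majority - regions["ma_border"]
    return regions

# Toy data: label 1 is the minority class, label 0 the majority class
samples = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (1.0, 0), (1.2, 0), (1.3, 0.1),
           (5, 5), (5.1, 5), (5, 5.1), (4.9, 5), (9, 9), (9.1, 9)]
labels = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
regions = partition(samples, labels, minority_label=1, k=3)
```

In this toy set the isolated minority point at (5, 5) is surrounded only by majority samples and ends up in the noise region, while the point at (1.0, 0) has a mixed neighborhood and becomes the only oversampling seed.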

Clustering Minority Class Samples Based on Hierarchical Clustering.
Most of the existing oversampling methods are K-NN based approaches. To generate a synthetic sample from the minority class sample B with k = 5, sample A may be chosen (as shown in Figure 3). In this way, the synthetic sample (shown by a square) may be located in the majority class region.
Our proposed method chooses the sample from the same cluster (Cluster1) of B. It ensures that A will not be chosen, because B and A are not in the same cluster.
Thus, the oversampling process is performed in a safe range, and the generated minority samples must be located inside the minority class region. The hierarchical clustering algorithm is used to cluster the minority class samples in this work. The key steps of the agglomerative hierarchical clustering algorithm are as follows:
(1) Assign each data sample to its own cluster initially.
(2) Find the two closest clusters and merge them into a single cluster. This reduces the total number of clusters by one.
(3) Compute the distance between the newly generated cluster and all the previous clusters.
(4) Repeat steps 2-3 until a termination condition is reached. The termination condition is either a number of clusters set in advance or a distance threshold.
However, when the agglomerative hierarchical clustering algorithm is used to cluster the minority class samples, deciding whether two minority clusters are merged requires examining not only the distance between the minority clusters but also the distribution of the majority class samples. Suppose the distance between two minority class clusters is $d_{min}$ and the distances from a certain majority class cluster to these two minority clusters are $d_1$ and $d_2$, respectively. If $d_1 < d_{min}$ and $d_2 < d_{min}$, these two minority class clusters cannot be merged. Therefore, modifications to the agglomerative hierarchical clustering algorithm have been made. First, the agglomerative hierarchical clustering algorithm is used to cluster the majority class samples to obtain the majority cluster set $C_{maj}$. Then, the minority class samples are clustered. The minority class cluster algorithm based on the hierarchical clustering algorithm [13] (MDH) is described below (Algorithm 2).
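The merge rule above can be illustrated with a small sketch. It makes two assumptions that the text does not fix: cluster distance is taken as the Euclidean distance between centroids, and at each iteration the closest pair that is below the threshold and not blocked by a majority cluster is merged. The names `can_merge` and `mdh` and the toy clusters are hypothetical, and the distance threshold is passed in as a plain argument.

```python
import math

def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate lists."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def cluster_distance(a, b):
    # Assumption: centroid-to-centroid Euclidean distance (linkage is not specified)
    return math.dist(centroid(a), centroid(b))

def can_merge(c_i, c_j, majority_clusters):
    """False if some majority cluster is closer to both clusters than they are to each other."""
    d_min = cluster_distance(c_i, c_j)
    return not any(cluster_distance(c_i, c) < d_min and cluster_distance(c_j, c) < d_min
                   for c in majority_clusters)

def mdh(minority_points, majority_clusters, threshold):
    """Agglomerative clustering of minority points with a majority-aware merge veto."""
    clusters = [[p] for p in minority_points]
    while len(clusters) > 1:
        pairs = sorted((cluster_distance(clusters[i], clusters[j]), i, j)
                       for i in range(len(clusters))
                       for j in range(i + 1, len(clusters)))
        # closest pair that is under the threshold and not vetoed
        pick = next(((i, j) for d, i, j in pairs
                     if d < threshold
                     and can_merge(clusters[i], clusters[j], majority_clusters)), None)
        if pick is None:
            break
        i, j = pick
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]            # j > i, so index i is unaffected
    return clusters

points = [[0.0, 0.0], [0.2, 0.0], [4.0, 0.0], [4.2, 0.0]]
between = [[2.0, 0.1], [2.2, -0.1]]          # majority cluster lying between the two groups
with_veto = mdh(points, [between], threshold=10.0)
without_veto = mdh(points, [], threshold=10.0)
```

With the majority cluster in between, the two minority groups stay separate even though the threshold alone would allow the merge; without it they collapse into a single cluster.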
Because the distance threshold $T_{hi}$ is the termination condition of the clustering process, the setting of $T_{hi}$ is particularly critical. In this work, $T_{hi}$ is computed as follows:
$$T_{hi} = d_{avg} \times r \qquad (2)$$
The parameter $d_{avg}$ represents the average distance from each minority class sample to the other minority samples in the set $S$. The parameter $r$ is used to tune the output distance of the clustering algorithm. The specific value of $r$ is analyzed in Section 5.
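A small sketch of the threshold computation follows. It assumes that the threshold is the product of the average pairwise minority distance and the factor r, which matches the role of r as a distance adjustment factor but should be read as an assumption; the function names and the toy points are hypothetical.

```python
import math
from itertools import combinations

def average_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs of points."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def distance_threshold(points, r=1.0):
    # Assumption: the threshold scales the average distance by r
    return r * average_pairwise_distance(points)

minority = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
d_avg = average_pairwise_distance(minority)   # pair distances are 5, 10, and 5
t_hi = distance_threshold(minority, r=0.8)
```

Under this assumption a smaller r gives a smaller threshold, so fewer merges happen and more, smaller clusters remain.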

Description of Hybrid Sampling Algorithm Based on Data Partition.
We propose a hybrid sampling algorithm based on data partition. First, the boundary region is obtained by the DP algorithm. Then, the total number of synthetic samples to be generated in the boundary region is calculated. Next, the weight $w_i$ of each sample $x_i$ in the boundary minority samples region $S_{miborder}$ is computed as the ratio between the proportion of majority class samples among the neighbors of this sample and the sum of all these proportions. Finally, for each sample $x_i$ in $S_{miborder}$, $g_i$ synthetic samples, a number proportional to $w_i$, are generated in the same cluster as $x_i$. The hybrid sampling algorithm based on data partition (HSDP) is implemented as follows (Algorithm 3):
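The weighting step can be sketched as below. The sketch takes the total number of synthetic samples as an input and rounds each share to the nearest integer; the rounding rule is an assumption, and the function name and the neighbor counts are hypothetical.

```python
def allocate_synthetic_counts(majority_neighbour_counts, k, total):
    """Share `total` synthetic samples among border seeds by majority-neighbour ratio."""
    ratios = [n / k for n in majority_neighbour_counts]   # proportion of majority neighbours
    ratio_sum = sum(ratios)
    weights = [r / ratio_sum for r in ratios]             # normalised so the weights sum to 1
    # Assumption: round each share to the nearest integer
    counts = [round(w * total) for w in weights]
    return weights, counts

# Four border seeds with 1, 2, 4, and 3 majority samples among their k = 5 neighbours
weights, counts = allocate_synthetic_counts([1, 2, 4, 3], k=5, total=20)
```

Seeds with more majority neighbors sit closer to the class boundary and therefore receive more synthetic samples. Every border seed has at least one majority neighbor, so the sum of ratios is never zero.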

The Time Complexity Analysis of HSDP Algorithm.
In the DP algorithm, suppose the number of minority samples is $N_{min}$ and the number of majority samples is $N_{maj}$. Each minority sample needs the distance to every other sample in order to find its neighbors. Therefore, the time complexity of calculating the distances between samples is $O(N_{min} \times (N_{min} + N_{maj}))$.
In the MDH algorithm, the distance between each minority cluster and the majority clusters needs to be calculated. Suppose that the current number of minority clusters is $n_{min}$ and the number of majority clusters is $n_{max}$. Calculating the distances between minority clusters and finding the two minority clusters with the smallest distance has time complexity $O(n_{min}^2)$. Then, the distances between each majority cluster and these two minority clusters are calculated, which has time complexity $O(n_{max})$, and these distances determine whether the two minority clusters are merged. The possible number of mergers is $n_{min}$. Therefore, the time complexity of the MDH algorithm is $O(n_{min} \times (n_{min}^2 + n_{max}))$.
In the HSDP algorithm, assume that the number of boundary minority samples is $N_{miborder}$, the number of minority class samples is $N_{min}$, and the number of majority class samples is $N_{maj}$. The time complexity of the step that determines the k neighbors of the boundary minority samples is $O(N_{miborder} \times (N_{min} + N_{maj}))$. In the step of sample generation, the time complexity is $O(N_{miborder} \times g_i)$.
According to the analysis of the above steps, the time complexity of the HSDP algorithm should be $O(N_{min} \times (N_{max} + N_{min}^2))$.

Figure 2 legend: the noise minority samples region; the safe majority samples region; the safe minority samples region; the boundary majority samples region; the boundary minority samples region.

(5) S_maj.add(x_i) // add the sample x_i into majority class dataset S_maj
(6) end if
(7) end for
(8) for x_i in S_min
(9) list = neighbors(x_i, k) // find k neighbor samples of each minority class sample
(10) N_bmin = 0 // initialize the number of minority class samples
(11) for z_j in list
(12) if (class(z_j) == class(min)) // label of sample z_j is class(min)
(13) N_bmin++

ALGORITHM 2: MDH Algorithm.
Input: majority class clusters C_maj = {C_maj_1, C_maj_2, ..., C_maj_m}; minority class samples without the samples of the noise minority samples region, S = {x_1, x_2, ..., x_n}; threshold T_hi
Output: minority class clusters C_min = {C_min_1, C_min_2, ..., C_min_m}
Process:
(1) d = 0 // initialize the minimum distance between clusters
(2) for (i = 0; i < n; i++)
(3) C_min_i = x_i // initialize the minority class cluster
(4) end for
(5) Matrix = d(C_min) // the distance matrix between clusters
(6) while (d < T_hi)
(7) d = dis(C_min_i, C_min_j) // the minimum distance d and the corresponding clusters C_min_i and C_min_j
(8) for C_i in C_maj // each majority class cluster
(9) if (dis(C_min_i, C_i) < d) && (dis(C_min_j, C_i) < d)
(10) U(C_min_i, C_min_j) = not // the flag of cluster mergence is not
(11) else
(12) C_min_i = C_min_i ∪ C_min_j // merge C_min_i and C_min_j into a single cluster
(13) C_min_j = ϕ
(14) C_min.length-- // reduce the total number of clusters by one
(15) end if
(16) end for
(17) end while

ALGORITHM 3: HSDP Algorithm.
Input: imbalanced dataset S
Output: balanced dataset S
Process:
Step 1: obtain $S_{misafe}$, $S_{miborder}$, $S_{masafe}$, and $S_{maborder}$ by the DP algorithm.
Step 2: count the number (m) of samples in $S_{misafe}$ and $S_{miborder}$, and the number (n) of samples in $S_{masafe}$ and $S_{maborder}$. Meanwhile, count the number (s) of samples in $S_{miborder}$.
Step 3: calculate the number of synthetic data samples that need to be generated for the minority class: $G = (n - m) \times b$, where $b$ is the synthesis scaling factor. $b = 1$ means a balanced dataset is obtained after the oversampling process.
Step 4: for each sample $x_i \in S_{miborder}$, calculate the ratio of majority class samples among the k neighbors of $x_i$. This ratio $r_i$ is defined as $r_i = N_i / k$, $i = 1, 2, \ldots, s$.
Step 5: the weight is determined by $w_i = r_i / \sum_{i=1}^{s} r_i$.
Step 6: calculate the number of synthetic data samples for each sample $x_i$ in the boundary minority samples region $S_{miborder}$: $g_i = w_i \times G$.
Step 7: for each sample $x_i$ in the boundary minority samples region, generate $g_i$ synthetic data samples according to the following steps:
Do the loop from 1 to $g_i$:
(a) Randomly select another sample $y$ from the same cluster as $x_i$.
(b) Generate a synthetic data sample: $x_{new} = x_i + p \times (y - x_i)$, where $p \in [0, 1]$.
End loop
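The generation loop of the HSDP algorithm, whose cost is the sample generation term in the analysis above, can be sketched as follows. This is a minimal illustration assuming the usual linear interpolation between the seed and a partner drawn from the seed's own cluster; the function name `generate_in_cluster` and the toy cluster are hypothetical.

```python
import random

def generate_in_cluster(seed, cluster, count, rng=random):
    """Create `count` synthetic points between `seed` and random members of its cluster."""
    partners = [p for p in cluster if p is not seed]
    if not partners:
        return []                    # a singleton cluster offers no partner to interpolate with
    synthetic = []
    for _ in range(count):
        y = rng.choice(partners)     # partner restricted to the seed's own cluster
        p = rng.random()             # interpolation factor in [0, 1)
        synthetic.append([xs + p * (ys - xs) for xs, ys in zip(seed, y)])
    return synthetic

cluster = [[1.0, 1.0], [1.4, 1.2], [0.8, 1.6]]
new_points = generate_in_cluster(cluster[0], cluster, count=5, rng=random.Random(42))
```

Each call does constant work per synthetic point, so generating the samples for all boundary seeds is linear in the total number of points requested.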

Dataset Description.
We test our algorithm on 8 imbalanced datasets from various fields. All these datasets are available from the KEEL Repository and the UCI Repository. Table 1 describes these datasets.
In this study, we consider binary classification. In the two-class problem, the majority samples are usually labeled as negative samples, and the minority samples are labeled as positive samples.

Evaluation Metrics.
For the classification of imbalanced data, the overall classification accuracy is not suitable for evaluating classifier performance, because a classification algorithm may achieve a better overall accuracy at the expense of a large prediction error on the minority class. Therefore, F-measure and G-mean are usually used to evaluate the performance of imbalanced classification algorithms.
F-measure and G-mean are calculated based on the confusion matrix, as shown in Table 2.
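Based on the confusion matrix, the following equations are derived, where TP and FN are the numbers of minority (positive) samples classified correctly and incorrectly, and TN and FP are the numbers of majority (negative) samples classified correctly and incorrectly:
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$
F-measure is computed as shown in formula (6):
$$F\text{-measure} = \frac{(1 + \beta^2) \times \mathrm{Recall} \times \mathrm{Precision}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}} \qquad (6)$$
F-measure is the harmonic mean of Recall and Precision, so a high F-measure requires both Recall and Precision to be high. Here β is a coefficient that indicates the relative importance of Recall and Precision (usually β = 1). G-mean is calculated as shown in formula (7):
$$G\text{-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} \qquad (7)$$
G-mean is the geometric mean of the minority class accuracy and the majority class accuracy, and it assigns equal importance to the performance of the classifier on the minority class and the majority class.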
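As a quick numerical illustration of the two metrics, the sketch below computes them from confusion matrix counts using the standard definitions, with the minority class as the positive class. The counts are made up for the example and are not results from this paper, and the function names are hypothetical.

```python
import math

def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def g_mean(tp, fn, tn, fp):
    """Geometric mean of the accuracy on the positive and on the negative class."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

# Made-up counts: 40 minority samples and 160 majority samples
tp, fn, fp, tn = 30, 10, 20, 140
f1 = f_measure(tp, fp, fn)            # precision 0.6, recall 0.75
g = g_mean(tp, fn, tn, fp)            # class accuracies 0.75 and 0.875
accuracy = (tp + tn) / (tp + fn + fp + tn)
```

In this example the overall accuracy is 0.85 while F-measure is about 0.67, which shows how accuracy can hide weak performance on the minority class. The functions do not guard against empty classes, so a denominator of zero would raise an error.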

Experiment and Result Analysis
The experiment platform is Anaconda. Since the purpose is to evaluate the proposed sampling method, we do not choose any special classifier; rather, we apply several of them, such as KNN and RF. To compare the performance of our proposed hybrid sampling method (HSDP) against other techniques, comparative experiments were carried out with SMOTE, ADASYN, and Borderline-SMOTE.

Analysis of Experimental Results.
To guarantee a fair comparison, the experiment uses 10-fold cross-validation. Tables 3-6 show the F-measure and G-mean values of the various algorithms on each dataset.
The best F-measure and G-mean results on each dataset are in boldface in the above tables. It is evident that KNN and RF combined with a sampling method are better than the same classifiers without a sampling method in most cases. On F-measure, the HSDP algorithm obtained the best result on 5 of the 8 datasets, and on G-mean it obtained the best result on 6.
This shows that the HSDP algorithm proposed in this paper can improve the classification of the minority class.
Compared with the SMOTE method and the ADASYN method, (1) the HSDP method does not oversample for all minority class samples but focuses on the minority samples in the boundary area that are more important in classification and (2) the HSDP method removes the noise data, thus avoiding the noisy samples generation.
In contrast to Borderline-SMOTE, our proposed HSDP method not only considers the importance of minority class samples in boundary area but also considers the distribution characteristics of data samples, avoiding any wrong synthetic sample generation.

Analyzing the Influence of the Parameter Value Used in HSDP Algorithm.
The parameters involved in the proposed method (HSDP) include the number of neighbor samples k and the distance adjustment factor r. The value of k cannot be too small, because a small k causes boundary minority class samples to be treated as noisy data and deleted by mistake. The value of r is used to control the number of clusters. With a smaller r value, the number of clusters increases and the number of samples in each cluster decreases, which results in a decrease in diversity when synthesizing samples.
In order to determine the optimal value range of r and k, we use Pima, Glass5, and Yeast3 as the test datasets. For the k values (k = 3, 5, 7, 9, and 11), the G-mean values are given in Table 7. The Pima dataset achieves the maximum G-mean value when k is 5, Glass5 when k is 9, and Yeast3 when k is 7. It is evident that the value of k is appropriate in the range of 5-9. For the r values (r = 0.6, 0.8, 1.0, 1.2, and 1.6), the G-mean values are given in Table 8. The Pima dataset achieves the maximum G-mean value when r is 0.8, Glass5 when r is 1.0, and Yeast3 when r is 1.2. It can be seen that the value of r is appropriate in the range of 0.8-1.2.

Conclusion
Data resampling is one of the effective methods to deal with imbalanced data classification. Aiming at the problems of undersampling and oversampling methods, this paper proposes a hybrid sampling method, HSDP, based on data partition. This method uses appropriate sampling methods for samples in different regions and assigns reasonable weights to the boundary minority samples. Furthermore, it is able to oversample the selected samples inside the minority class area in the data space. The effectiveness of the proposed method for imbalanced data classification was confirmed by experiments, yet the values of the parameters used in the algorithm were selected through repeated experiments. A future research direction is how to determine the parameter values of HSDP adaptively for different datasets.

Data Availability
We use datasets from KEEL Repository and UCI Repository; our method and related parameters are provided in our paper.

Conflicts of Interest
The authors declare that they have no conflicts of interest.