A Novel SMOTE-Based Classification Approach to Online Data Imbalance Problem

In many practical engineering applications, data are collected online. However, if the classes of these data are severely imbalanced, classification performance is restricted. In this paper, a novel classification approach is proposed to solve the online data imbalance problem by integrating a fast and efficient learning algorithm, the extreme learning machine (ELM), with a typical sampling strategy, the synthetic minority oversampling technique (SMOTE). To reduce the severe imbalance, granulation division is applied to the major-class samples according to their distribution characteristics, and the original samples are replaced by the resulting granule cores to prepare a balanced sample set. In the online stage, we first apply granulation division to the minor class and then conduct oversampling using SMOTE in the region around the granule core and the granule border. The training sample set is thereby gradually balanced and the online ELM model is dynamically updated. We also introduce fuzzy information entropy to prove theoretically that the proposed approach has a lower bound on model reliability after undersampling. Numerical experiments are conducted on two different kinds of datasets, and the results demonstrate that the proposed approach outperforms some state-of-the-art methods in terms of generalization performance and numerical stability.


Introduction
With more and more successful real applications, machine learning, as an efficient technique for data analysis and modeling, is becoming an important research field in aeronautics and mechanical engineering, for example, for dynamically building surrogate models to support system design or for identifying the characteristics of a system from given samples. Among many concrete topics, online learning and imbalanced classification have both received a lot of attention, and each has been studied extensively over the last decade. However, to the best of our knowledge, there are few studies that combine these two topics, that is, that address the data imbalance problem within an online learning procedure. We call it the online data imbalance problem; it arises widely in real engineering applications such as fault diagnosis and damage detection. Therefore, studying this problem is of great significance.
In this paper, we try to provide an efficient solution from the perspectives of sampling strategy and learning algorithm.
Considering the overall distribution characteristics of the dataset and the features of online learning, we present a novel SMOTE-based classification approach to the online data imbalance problem. This approach borrows the idea of granulation division to conduct oversampling and undersampling simultaneously. For the major class, we use granule cores to replace the original samples for undersampling, while, for the minor class, we first apply granulation division and then conduct oversampling using SMOTE in the region around the granule core and the granule border. This strategy follows the data distribution of the sample set, so it can be applied repeatedly in the offline and online stages to balance the total sample set. Moreover, we introduce an efficient online learning algorithm, the online sequential extreme learning machine (OS-ELM), which is combined with the proposed sampling strategy to achieve fast and robust online learning for imbalanced data. To verify the effectiveness of the proposed approach, we first prove theoretically, by means of fuzzy information entropy, that the proposed approach has a lower bound on model reliability after undersampling. The experimental results on two different kinds of datasets also demonstrate the competitive performance of the proposed approach.

Related Works
Research on the traditional data imbalance problem currently focuses on two strategies [1]. One is the data-based strategy, which aims to balance the original dataset, mainly through undersampling or oversampling techniques. The key issue for this kind of method is how to explore and respect the inner distribution characteristics of the sample set during sampling. The synthetic minority oversampling technique (SMOTE) [2] is widely used because of its simple form and straightforward idea, but it suffers from relatively low accuracy because it cannot exploit the data distribution of the sample set, especially in the online case. Therefore, many SMOTE-based methods have been developed by integrating other techniques. To improve the quality of synthetic samples, Verbiest et al. [3] introduced a fuzzy rough selection algorithm to reduce the noise generated by SMOTE after the balancing stage. Gao et al. [4] utilized particle swarm optimization to optimize the undersampling procedure of SMOTE and then introduced RBF classification to reduce the misclassified cases. To reduce the imbalance level between classes, Zeng et al. [5] integrated the kernel trick and SMOTE into a new support vector machine (SVM) algorithm for the data imbalance problem. As a combination with a learning algorithm, Jeatrakul et al. [6] introduced SMOTE to neural networks in order to improve the generalization performance. Granular computing, an abstract approach to data processing, has an intrinsic capacity to remodel the original data according to the data distribution and therefore places high value on preserving the raw information of the sample set. Granular computing has accordingly also been introduced to solve the data imbalance problem in SVM. For example, Wang et al. [7] utilized granulation division to hierarchically suppress the samples before SVM training. It is worth noting that although the methods discussed above can balance the sample set to some extent and thus improve the classification accuracy, they easily cause severe information loss if the distribution characteristics and features are not considered well.
The other is the algorithm-based strategy, which tries to improve classification efficiency by modifying the structure of the algorithm. For example, Hwang et al. [8] added weight factors to the Lagrange multipliers to improve the effectiveness of SVM on imbalanced data. To lessen the misclassification rate, Yu et al. [9] calculated the moving distance of the hyperplane by adjusting the decision threshold of SVM. Many other algorithms, such as cost-sensitive learning [10], weighted support vector machine [11], and weighted boost learning [12], were devoted to solving the data imbalance problem. Although this strategy has been researched thoroughly, most of the algorithms cannot be applied to the online case directly because they lack an online structure. Besides, when facing a large amount of data, it is generally hard for these algorithms to produce results quickly. As an extension of the single-hidden layer feedforward neural network (SLFN), extreme learning machines (ELMs), introduced by Huang et al. [13], are recognized for their high learning speed and good generalization capacity in many regression and pattern recognition problems. As a sequential extension of ELM, the online sequential ELM (OS-ELM) proposed by Liang et al. [14] can learn data one by one or chunk by chunk, with fixed or varying chunk size, at very high speed. Although ELM has also been developed for the data imbalance problem [15], OS-ELM does not appear to have been widely applied to data imbalance problems.
According to our literature survey, there is little research on the online data imbalance problem. By introducing a prior duplication strategy, Vong et al. [16] first generated synthetic minority class samples and then utilized OS-ELM to establish an online sequential prediction model. Focusing on the modeling of the data distribution in the online setting, Mao et al. [17] introduced the principal curve to exploit the inner structure of online data and then applied SMOTE to conduct oversampling by means of the distance from each sample to the principal curve. However, although this method overcomes many shortcomings of traditional methods, the principal curve is not well suited to datasets with no apparent distribution feature. We noticed another recent work [18] on this problem, which adopted granulation division to remodel the distribution characteristics and gave a theoretical analysis of the concrete information loss. Although it neglects some potential shortcomings of synthetic samples and its theoretical analysis needs considerable improvement, it is still an interesting attempt at this problem with reference value.

SMOTE.
SMOTE (synthetic minority oversampling technique) is a common oversampling method proposed by Chawla et al. [2]. In SMOTE, instead of merely duplicating data, the minority class is oversampled by creating synthetic instances in the feature space formed by an instance and its $k$-nearest neighbors, which effectively avoids the overfitting problem.
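This method is described as follows. Choose two samples, $x_1$ and $x_2$, randomly from the given minority sample set, where each sample has $n$ attributes. For $x_1$ and $x_2$, calculate the difference on the $j$th attribute; that is, $\mathrm{diff}_j = x_{2j} - x_{1j}$. Then, we obtain the $j$th attribute value of the new target sample according to
$$x_{\mathrm{new},j} = x_{1j} + \mathrm{rand}[0, 1] \times \mathrm{diff}_j,$$
where $\mathrm{rand}[0, 1]$ means a random number between 0 and 1. So the final synthetic sample of $x_1$ and $x_2$ is
$$x_{\mathrm{new}} = x_1 + \mathrm{rand}[0, 1] \times \mathrm{diff},$$
where $\mathrm{diff} = (\mathrm{diff}_1, \mathrm{diff}_2, \ldots, \mathrm{diff}_n)$.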
The number of repetitions is set according to the sampling rate, and the above process is repeated that many times. Merging the synthetic samples with the original samples gives the final minority sample set.
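The pairwise interpolation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `smote_pairwise`, the toy minority set, and the choice of one random factor per synthetic sample (rather than one per attribute) are assumptions made here.

```python
import numpy as np

def smote_pairwise(minority, n_synthetic, rng=None):
    """Create synthetic minority samples by interpolating random pairs (illustrative)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    synthetic = np.empty((n_synthetic, minority.shape[1]))
    for s in range(n_synthetic):
        # Pick two distinct minority samples x1 and x2 at random.
        i, j = rng.choice(len(minority), size=2, replace=False)
        x1, x2 = minority[i], minority[j]
        diff = x2 - x1                  # attribute-wise difference
        gap = rng.random()              # random factor in [0, 1), assumed shared by all attributes
        synthetic[s] = x1 + gap * diff  # point on the segment from x1 to x2
    return synthetic

# Toy minority set (hypothetical data): four points on the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_samples = smote_pairwise(minority, n_synthetic=6, rng=0)

# The final minority set merges the original and the synthetic samples.
balanced_minority = np.vstack([minority, new_samples])
```

Because every synthetic point lies on a segment between two existing samples, the new samples stay inside the region already covered by the minority class.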

Review of ELM and OS-ELM.
ELM was originally proposed for training single-hidden layer feedforward neural networks (SLFNs). It has been proved that, with at most $N$ hidden neurons, ELM can learn $N$ distinct samples with zero error using any bounded nonlinear activation function [19]. Based on this approximation ability, ELM has received wide attention and has been developed into various forms, for example, multioutput regression [20]. The most important feature of ELM is its speed, which results from its single-hidden layer structure requiring no iterative process. In ELM, all the hidden node parameters are randomly generated without tuning. As an extension of ELM, the online sequential extreme learning machine (OS-ELM) is a faster and more accurate algorithm, which has been widely used in many fields, such as pattern recognition and data mining. The process of OS-ELM is divided into two steps, the initialization phase and the sequential learning phase, and the detailed algorithm is described as follows [14].
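To make the basic (batch) ELM concrete, the following sketch draws the hidden node parameters at random and solves for the output weights with a pseudoinverse. It is an illustration under stated assumptions rather than the reference implementation: the sigmoid activation, the uniform range of the random weights, the function names, and the XOR toy data are all choices made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, rng=None):
    """Train a basic ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random input weights, never tuned
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden biases, never tuned
    H = sigmoid(X @ W + b)                                   # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                             # output weights via pseudoinverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# XOR toy problem (hypothetical data) with more hidden nodes than samples,
# so the training targets can be reproduced almost exactly.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([0.0, 1.0, 1.0, 0.0])
W, b, beta = elm_train(X, T, n_hidden=10, rng=0)
pred = elm_predict(X, W, b, beta)
```

No iteration is involved: the only fitted quantity is `beta`, which is why training is fast.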

Online Sequential Extreme Learning Machine Based on Granulation Division and SMOTE
To improve the classification accuracy of the minority class, we propose a new algorithm based on granulation division and SMOTE using the extreme learning machine. The main idea is to improve the accuracy of the minority class while reducing the information loss of the majority class.
For convenience in describing the algorithm, we first give some definitions. Suppose that $\{(x_i, t_i), i = 1, 2, \ldots, N_1\}$ and $\{(y_j, t_j), j = 1, 2, \ldots, N_2\}$ represent the majority sample set and the minority sample set, respectively, where $x_i$ and $y_j$ are $n$-dimensional vectors and the dimension indicates the number of features. $t_i = 1$ means that the corresponding sample belongs to the majority class and $t_j = 0$ means that it is a minority sample.
Definition 2 (granule dispersion). Granule dispersion represents the degree of scatter within a granule. The granule dispersion is inversely proportional to the number of samples in the granule and directly proportional to the maximum radius of the granule, that is, to the largest distance between a sample $x_i$ in the granule and the granule core $c$. The bigger the granule dispersion is, the more sparse and scattered the samples in the granule are, and thus the higher the information loss is when the granule core is used in place of the whole granule.
Definition 3 (sample weight). For each sample in the granule except the sample farthest from the granule core, oversampling is conducted using SMOTE. The sample weight is inversely proportional to the distance between the granule core and the virtual samples.
4.1. Offline Stage. First, we rebuild the imbalanced sample set using the proposed method to obtain the balanced sample set $\{(x_i, t_i) \mid i = 1, 2, \ldots, N_0\}$, and then we establish the initial model. The main idea is to undersample the majority class by keeping only the granule cores, which reduces the number of majority samples while ensuring that the distribution trend of the samples is consistent with the trend before undersampling. For the initial majority sample set $\{(x_i, t_i), i = 1, 2, \ldots, N_1\}$, the first granulation division is conducted. Then, we obtain the new majority sample set $\{(x'_i, t_i), i = 1, 2, \ldots, N_{11}\}$.
A clustering algorithm [21] is adopted to carry out the granulation division. We set the number of clusters $K_1$ in the first granulation division according to the overall distribution of the original samples. The lower and upper threshold values of the maximum granule radius are set to $[r_1, r_2]$, which guarantees that the distribution trend of the samples remains unchanged before and after the first undersampling. Then, we choose the $K_1$ clustering centers as the granule cores and use them to replace the original majority samples. Merging the new majority sample set $\{(x'_i, t_i), i = 1, 2, \ldots, N_{11}\}$ and the minority sample set $\{(y_j, t_j), j = 1, 2, \ldots, N_2\}$, we obtain the new training sample set $\{(x_i, t_i) \mid i = 1, 2, \ldots, N_0\}$.
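The undersampling step above can be sketched with a plain k-means loop standing in for the clustering algorithm of [21], which is not specified in detail here. The function name `granule_cores`, the toy Gaussian data, and the choice to return the maximum radius of each granule (so that it can be checked against the radius thresholds) are assumptions for illustration; the threshold check itself is not implemented.

```python
import numpy as np

def granule_cores(majority, n_granules, n_iter=50, rng=None):
    """Cluster the majority class; return granule cores and each granule's maximum radius."""
    rng = np.random.default_rng(rng)
    X = np.asarray(majority, dtype=float)
    # Start from randomly chosen majority samples (assumed initialization).
    cores = X[rng.choice(len(X), size=n_granules, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every sample to its nearest core.
        dist = np.linalg.norm(X[:, None, :] - cores[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_cores = cores.copy()
        for k in range(n_granules):
            members = X[labels == k]
            if len(members):
                new_cores[k] = members.mean(axis=0)  # core = mean of the granule
        if np.allclose(new_cores, cores):
            break
        cores = new_cores
    # Final assignment, used to measure the maximum radius of every granule.
    dist = np.linalg.norm(X[:, None, :] - cores[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    radii = np.array([dist[labels == k, k].max() if np.any(labels == k) else 0.0
                      for k in range(n_granules)])
    return cores, radii

# Hypothetical imbalanced data: 300 majority samples, 10 minority samples.
data_rng = np.random.default_rng(0)
majority = data_rng.normal(size=(300, 2))
minority = data_rng.normal(loc=3.0, size=(10, 2))

# About three granule cores per minority sample, as suggested in the text.
cores, radii = granule_cores(majority, n_granules=3 * len(minority), rng=0)

# Merge the cores (label 1, majority) with the minority samples (label 0).
X_train = np.vstack([cores, minority])
t_train = np.concatenate([np.ones(len(cores)), np.zeros(len(minority))])
```

Replacing each granule by its core shrinks the majority class from 300 to 30 samples while keeping one representative per local region.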
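Given the hidden activation function $g(x)$ and the number of hidden nodes $L$, choose the input weights $w_j$ and biases $b_j$, $j = 1, 2, \ldots, L$, randomly and calculate the hidden layer output matrix $H_0$:
$$H_0 = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_{N_0} + b_1) & \cdots & g(w_L \cdot x_{N_0} + b_L) \end{bmatrix}_{N_0 \times L}.$$
The output vector is $T_0 = [t_1, t_2, \ldots, t_{N_0}]^T$ and the output weight is
$$\beta_0 = H_0^{\dagger} T_0,$$
where
$$H_0^{\dagger} = (H_0^T H_0)^{-1} H_0^T,$$
and then, letting
$$P_0 = (H_0^T H_0)^{-1},$$
we have
$$\beta_0 = P_0 H_0^T T_0.$$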
(1) Granulation division for the majority class is conducted. We choose $K_1$ points uniformly as the initial granule cores according to the distribution trend of the samples, where $K_1$ is set to about three times the number of minority samples. In each iteration we obtain $K_1$ clustering centers, namely, granule cores, from the center update of the clustering algorithm,
$$c_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i,$$
where $n_k$ is the number of samples $x_i$ currently assigned to the $k$th granule, until the distance between each sample and its clustering center satisfies the threshold condition on the granule radius. Finally, we obtain $K_1$ clustering centers $C = \{c_1, c_2, \ldots, c_{K_1}\}$, namely, $K_1$ granule cores. The majority sample set now consists of these granule cores, and the imbalance ratio reduces to about 3 : 1.
(2) Granulation computing for the minority class: the value of $K_2$ is set to half the number of original minority samples. If a granule contains samples other than the granule core, we add virtual samples within the granule and at the granule boundary using SMOTE. Then, we obtain the new minority sample set by merging the virtual samples and the original minority samples. The detailed description is as follows.
Step 1. Choose the granule core $c$ as the center of a circle. We add virtual samples between the granule core and all the other samples $x_i$ in the granule using SMOTE:
$$x_{\mathrm{new}} = c + \mathrm{rand}[0, 1] \times (x_i - c).$$
The virtual samples are generated between the granule core and the other samples in the granule. Each time, $m_1$ virtual samples are generated; the value of $m_1$ is set according to the actual situation.
Step 2. According to
$$x_{\mathrm{new}} = c + \mathrm{rand}(1, \lambda_1] \times (x_i - c), \quad 1 < \lambda_1 \le 1.5,$$
$m_2$ virtual samples are generated along the directions from the granule core to most of the samples in the granule, excluding the sample farthest from the granule core, which ensures that the new virtual samples are not too far from the granule core and thus remain credible. Usually we set $m_2 \le (1/3) n_1$, where $n_1$ is the total number of samples in the granule. The new virtual samples expand the distribution range of the granule without affecting the overall credibility. Because the random number lies between 1 and $\lambda_1$, the new virtual samples are farther from the granule core than the raw samples they are generated from, while most of them still fall inside the granule.
Step 3. Set the sample weight of the virtual samples according to Definition 3 and then update the virtual samples. Merging the virtual samples and the original minority samples, we obtain the new minority sample set.
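Steps 1 and 2 above can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `oversample_granule`, the parameter names `m_inner`, `m_outer`, and `lam`, and the toy granule are hypothetical. The sample weighting of Step 3 is omitted because its formula is not available in the text.

```python
import numpy as np

def oversample_granule(core, members, m_inner, m_outer, lam=1.5, rng=None):
    """Generate virtual minority samples inside and near the border of one granule."""
    rng = np.random.default_rng(rng)
    core = np.asarray(core, dtype=float)
    members = np.asarray(members, dtype=float)
    dist = np.linalg.norm(members - core, axis=1)

    # Step 1: interpolate between the core and randomly chosen members (factor in [0, 1)).
    idx = rng.integers(len(members), size=m_inner)
    gaps = rng.uniform(0.0, 1.0, size=(m_inner, 1))
    inner = core + gaps * (members[idx] - core)

    # Step 2: extrapolate slightly past members (factor between 1 and lam),
    # skipping the member farthest from the core.
    candidates = np.delete(np.arange(len(members)), dist.argmax())
    idx = rng.choice(candidates, size=m_outer)
    gaps = rng.uniform(1.0, lam, size=(m_outer, 1))
    outer = core + gaps * (members[idx] - core)
    return inner, outer

# Hypothetical granule with six minority samples; the core is taken as their mean.
members = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0],
                    [0.0, -1.0], [2.0, 2.0], [0.5, 0.5]])
core = members.mean(axis=0)

# m_outer is kept at or below one third of the granule size, as the text suggests.
inner, outer = oversample_granule(core, members, m_inner=4, m_outer=2, lam=1.5, rng=0)
new_minority = np.vstack([members, inner, outer])
```

Excluding the farthest member in Step 2 bounds how far an extrapolated sample can land from the core, which is the credibility argument made in the text.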
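Let $\Phi_{k+1}$ denote the hidden layer output matrix of the new balanced chunk $\Omega_{k+1}$; the hidden layer matrix then becomes $H_{k+1} = [H_k^T \; \Phi_{k+1}^T]^T$. Update the network weight according to
$$\beta_{k+1} = H_{k+1}^{\dagger} T_{k+1},$$
where $T_{k+1}$ is the output vector obtained by appending the labels of the new chunk to $T_k$ and
$$H_{k+1}^{\dagger} = (H_{k+1}^T H_{k+1})^{-1} H_{k+1}^T.$$
Let $P_{k+1} = (H_{k+1}^T H_{k+1})^{-1}$. We have
$$H_{k+1}^T H_{k+1} = H_k^T H_k + \Phi_{k+1}^T \Phi_{k+1}$$
because
$$H_{k+1}^T H_{k+1} = [H_k^T \; \Phi_{k+1}^T] \begin{bmatrix} H_k \\ \Phi_{k+1} \end{bmatrix};$$
namely,
$$P_{k+1}^{-1} = P_k^{-1} + \Phi_{k+1}^T \Phi_{k+1}.$$
Inverting both sides of this equation with the Sherman-Morrison matrix inversion lemma, we obtain the recursive expression of $P_{k+1}$:
$$P_{k+1} = P_k - P_k \Phi_{k+1}^T (I + \Phi_{k+1} P_k \Phi_{k+1}^T)^{-1} \Phi_{k+1} P_k.$$
So $P_{k+1}$ can be calculated from $P_k$, which reduces the amount of calculation and greatly improves the computational efficiency. We obtain $H_{k+1}^{\dagger}$ by substituting the recursion for $P_{k+1}$ into the expression for $H_{k+1}^{\dagger}$ and then update the network weight $\beta_{k+1}$.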
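The initialization and the recursive update can be sketched as a small class. This follows the standard OS-ELM recursion of [14] rather than any code from the authors; the class name `OSELM`, the sigmoid activation, the weight range, and the toy data are assumptions. The output-weight update uses the equivalent incremental form $\beta_{k+1} = \beta_k + P_{k+1} \Phi^T (T_{\mathrm{new}} - \Phi \beta_k)$, which avoids rebuilding the full pseudoinverse.

```python
import numpy as np

class OSELM:
    """Minimal online sequential ELM sketch (illustrative, single output)."""

    def __init__(self, n_inputs, n_hidden, rng=None):
        rng = np.random.default_rng(rng)
        self.W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))  # fixed random input weights
        self.b = rng.uniform(-1.0, 1.0, size=n_hidden)              # fixed random biases
        self.P = None
        self.beta = None

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def init_fit(self, X0, T0):
        # Initialization phase: needs at least as many samples as hidden nodes.
        H0 = self._hidden(X0)
        self.P = np.linalg.inv(H0.T @ H0)
        self.beta = self.P @ H0.T @ T0

    def partial_fit(self, X, T):
        # Sequential phase: update P with the matrix inversion lemma, then beta.
        Phi = self._hidden(X)
        S = np.eye(len(X)) + Phi @ self.P @ Phi.T
        self.P = self.P - self.P @ Phi.T @ np.linalg.solve(S, Phi @ self.P)
        self.beta = self.beta + self.P @ Phi.T @ (T - Phi @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Hypothetical data: 30 samples for initialization, then three chunks of 10.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(60, 3))
T = (X[:, 0] + X[:, 1] > 0).astype(float)

model = OSELM(n_inputs=3, n_hidden=5, rng=0)
model.init_fit(X[:30], T[:30])
for start in range(30, 60, 10):
    model.partial_fit(X[start:start + 10], T[start:start + 10])

# Batch least-squares solution on all 60 samples, for comparison.
H_all = model._hidden(X)
beta_batch = np.linalg.lstsq(H_all, T, rcond=None)[0]
```

Up to rounding, the sequentially updated model gives the same predictions as the batch solution, while each update only inverts a matrix whose size equals the chunk length.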

The Reliability Analysis
As discussed above, we reduce the majority samples using granulation division in both the offline stage and the online stage. From the original majority sample set $\{(x_i, t_i), i = 1, 2, \ldots, N\}$, we keep only the most representative samples, the granule cores, to form the new balanced majority sample set. Although the imbalance is reduced to some extent, undersampling loses information because some samples are abandoned. To illustrate the rationality of the proposed method, we give a lower bound on the model reliability after undersampling based on information entropy [22], which indicates indirectly that the information loss in undersampling has an upper bound.
Suppose that the lost sample set in every online undersampling is $\{(x_i, t_i), i = 1, 2, \ldots, m\}$, where $x_i$ is a sample in the granule centered on the granule core $c$. As discussed in Section 4.2, we reject all samples in this granule except $c$. The sample weight of $x_i$ is defined according to Definition 3. The misclassification probability is then expressed in terms of $m$, the number of samples, and $\sum_{i=1}^{m} |x_i - c|$, the sum of the Euclidean distances between each sample and the granule core.
In the offline stage, the lost sample set is $\{(x_{i1}, t_i), i = 1, 2, \ldots, N_0\}$, where $x_{i1}$ is a sample in the $i$th granule. The granule core will join the final balanced sample set, representing the whole granule, and the misclassification rate is defined accordingly.
Theorem 4. Let $n$ be the number of samples in the majority set and let $F$ be the number of misclassified majority samples. Let $R_L$ represent the lower bound of model reliability. Because the binary classification result obeys the binomial distribution, the lower bound of model reliability can be obtained once the confidence coefficient $\gamma$ is determined:
$$\sum_{k=0}^{F} \binom{n}{k} R_L^{n-k} (1 - R_L)^k = 1 - \gamma,$$
where $F$ is negatively correlated with $R_L$. It can be seen that the fuzzy reliability is related only to the degree of dispersion.
Proof. According to the definition of fuzzy reliability, $\sum_{k=0}^{F} \binom{n}{k} R_L^{n-k} (1 - R_L)^k = 1 - \gamma$, $R_L$ reaches its maximum when $F$ is at its minimum for a fixed $n$. As can be seen from this relation, $R_L$ depends only on the granule dispersion. So the smaller the maximum radius of the granule is, the smaller the distance sum of the samples is, and the higher the number of samples in the granule is, the smaller the dispersion is and the bigger the value of $R_L$ is.
Proof. According to the definition of fuzzy reliability, the value of the reliability lower bound is related to the sum of the distances between the granule core and the other samples. The smaller the distance sum is, the smaller the dispersion is and the more compact the samples in the granule are, which leads to a bigger fuzzy reliability and a more reliable model.
Theorems 4 and 5 support the reasonability of the proposed algorithm from the viewpoint of information entropy. Consider the extreme case: if the granule dispersion is 0, namely, no undersampling by granulation division is performed, the misclassification rate of the majority class is almost 0; that is, $\lim_{R_L \to 1} H(\Phi) \to 0$, where $H(\Phi)$ denotes the information entropy, so no information entropy is provided and the information loss is 0, which accords with the practical situation.
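The binomial relation that defines the reliability lower bound can be solved numerically. The sketch below is an illustration, not part of the original analysis: it finds the reliability value at which the probability of observing at most the given number of misclassifications among the majority samples equals one minus the confidence coefficient. The function and argument names are hypothetical, and bisection is simply one convenient solver.

```python
from math import comb

def reliability_lower_bound(n, failures, confidence, tol=1e-12):
    """Solve sum_{k=0}^{F} C(n,k) R^(n-k) (1-R)^k = 1 - confidence for R by bisection."""
    target = 1.0 - confidence

    def prob_at_most_failures(r):
        # Probability of at most `failures` misclassifications when each sample
        # is classified correctly with probability r.
        return sum(comb(n, k) * r ** (n - k) * (1.0 - r) ** k
                   for k in range(failures + 1))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The probability grows with r, so move the bracket toward the target.
        if prob_at_most_failures(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical example: 10 majority samples at confidence 0.9.
r_no_errors = reliability_lower_bound(n=10, failures=0, confidence=0.9)
r_two_errors = reliability_lower_bound(n=10, failures=2, confidence=0.9)
```

With no misclassifications the relation reduces to $R^n = 1 - \gamma$, which gives a closed-form check, and allowing more misclassifications lowers the bound, matching the negative correlation stated in Theorem 4.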

Simulation Experiment
To demonstrate the effectiveness and superiority of the proposed algorithm, we conduct simulation experiments on a chessboard-shaped dataset with uneven density distribution and on imbalanced meteorological data of Macao from 2010 and 2011. We compare the experimental results of our algorithm with those of SVM (support vector machine) [23], OS-ELM (online sequential ELM) [14], and MCOS-ELM (Metacognitive OS-ELM) [16]. Among them, MCOS-ELM is an online extreme learning algorithm presented by Vong et al. [16] for the online data imbalance problem. For convenience, we call the proposed method DGSMOTE (Division of Granulation and SMOTE OS-ELM). Before training, we normalize the dataset. We take the average value of 30 trials as the final experimental result.

Construct the Chessboard-Shaped Dataset.
In the chessboard-shaped dataset, the majority samples and the minority samples each occupy eight cells of the chessboard. Data chunks are randomly generated in each cell according to the class of that cell. Ultimately, the numbers of majority and minority samples are 1000 × 8 and 100 × 8, respectively; that is, the class ratio is 10 : 1. The testing data are generated with the same method.
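A dataset of this shape can be generated as sketched below. The 4 × 4 board, the unit cell size, the uniform sampling inside each cell, and the function name `make_chessboard` are assumptions made for illustration; the text only fixes the number of cells per class and the sample counts.

```python
import numpy as np

def make_chessboard(n_major=1000, n_minor=100, size=4, rng=None):
    """Generate a size x size chessboard with alternating majority/minority cells."""
    rng = np.random.default_rng(rng)
    X, t = [], []
    for row in range(size):
        for col in range(size):
            is_major = (row + col) % 2 == 0          # alternate classes like a chessboard
            n = n_major if is_major else n_minor
            # Uniform points inside the unit cell whose lower-left corner is (col, row).
            pts = rng.uniform(0.0, 1.0, size=(n, 2)) + np.array([col, row])
            X.append(pts)
            t.append(np.full(n, 1 if is_major else 0))  # 1 = majority, 0 = minority
    return np.vstack(X), np.concatenate(t)

X_board, t_board = make_chessboard(rng=0)
n_majority = int((t_board == 1).sum())
n_minority = int((t_board == 0).sum())
```

On a 4 × 4 board each class occupies eight cells, which reproduces the 8000 to 800 split and the 10 : 1 class ratio described above.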

Experimental Results Analysis of Chessboard-Shaped Data.
In the offline stage, we first conduct undersampling on the majority samples. The resulting changes in the distribution of the chessboard-shaped data after the first granulation division are shown in Figure 1(b). In the online stage, after conducting granulation division, we apply the SMOTE algorithm to oversample the minority samples. The resulting changes in the dataset are shown in Figure 1(c).
It can be seen from Figure 1 that, compared with the original samples, the classes of the dataset are now nearly balanced. For the different classes of the chessboard-shaped dataset, Table 1 presents the changes in their numbers before and after applying the proposed algorithm.
In the simulation experiment, the activation function of the hidden layer is set to "sig" and the number of hidden nodes is set to 140. We take the mean value of 30 trials as the experimental result. The performance comparison of the four models is shown in Table 2.
From Table 2, although DGSMOTE's overall testing accuracy is not the highest among the models, its testing accuracy and testing time do not differ much from those of the other three algorithms. In addition, the minority training accuracy of DGSMOTE is much higher than that of the others. Compared with LS-SVM, DGSMOTE shows superior performance in both training and testing speed and in accuracy, which demonstrates the real-time capability of the proposed algorithm. Furthermore, DGSMOTE performs well when the G-mean is used to evaluate the specificity and sensitivity of the algorithm. It substantially improves the classification accuracy of the minority class with only a small decrease in majority accuracy. At the same time, it eliminates the bias that arises when traditional algorithms are applied to imbalanced data.
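For reference, the G-mean is the geometric mean of sensitivity (accuracy on the minority class) and specificity (accuracy on the majority class). The sketch below shows the computation on made-up labels; the function name and the convention that the minority class is labeled 0, as in the definitions earlier in the paper, are the only assumptions.

```python
from math import sqrt

def g_mean(y_true, y_pred, minority_label=0):
    """Geometric mean of sensitivity and specificity for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == minority_label and p == minority_label)
    fn = sum(1 for t, p in pairs if t == minority_label and p != minority_label)
    tn = sum(1 for t, p in pairs if t != minority_label and p != minority_label)
    fp = sum(1 for t, p in pairs if t != minority_label and p == minority_label)
    sensitivity = tp / (tp + fn)   # fraction of minority samples recognized
    specificity = tn / (tn + fp)   # fraction of majority samples recognized
    return sqrt(sensitivity * specificity)

# Hypothetical predictions: 4 minority samples (label 0) and 6 majority samples (label 1).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
score = g_mean(y_true, y_pred)
```

Unlike overall accuracy, the G-mean drops sharply when either class is poorly recognized, which is why it suits imbalanced data.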
To make the behavior of our algorithm easier to verify and observe, the classification results and the variation of minority accuracy with the number of hidden nodes on the chessboard-shaped dataset are shown in Figures 2 and 3, where the dark spots mark the misclassified samples. They also reflect the good generalization and learning performance of DGSMOTE.
From Figure 2, the classification accuracy of the minority samples is significantly lower than that of the majority samples when OS-ELM is used to classify the dataset; that is, the OS-ELM model is obviously biased. Compared with the improved OS-ELM algorithms, DGSMOTE has better overall classification performance with little effect on the majority classification accuracy. During undersampling and oversampling in our algorithm, both the overall distribution characteristics and the original features of the samples are fully considered, so the information loss of the balanced samples is low and stable. We then vary the number of hidden nodes for the three online extreme learning algorithms and, for each number of hidden nodes, obtain the accuracy of the corresponding algorithm as the mean value of 20 trials, which gives Figure 3. From the changes in minority accuracy, we see that the overall performance of the DGSMOTE algorithm is not only high but also stable.
Taking all the above indicators together, DGSMOTE shows effective generalization performance and strong learning ability. To display the sensitivity and specificity of our algorithm, we use ROC curves. The ROC curves of the four models on the chessboard-shaped dataset are shown in Figure 4.
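The area under a ROC curve (AUC) can be computed directly from classifier scores without drawing the curve: it equals the fraction of (minority, majority) pairs in which the minority sample receives the higher score, with ties counted as one half. The sketch below illustrates this on made-up scores; the function name and the data are hypothetical.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a positive sample outscores a negative one."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores: higher means "more likely minority".
minority_scores = [0.9, 0.8, 0.4]
majority_scores = [0.7, 0.3, 0.2, 0.1]
auc = auc_from_scores(minority_scores, majority_scores)
```

A value of 1 means that every minority sample is ranked above every majority sample, while 0.5 corresponds to random ranking.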
AUC denotes the area under the ROC curve. The larger the value of AUC is, the better the classification is. From Figures 1-3, we can see that the proposed DGSMOTE algorithm significantly outperforms the other three models, with better overall performance and a lower minority misclassification rate. Besides, it reduces the cost of misclassification because of its strong recognition capability and lower classification bias.
The Macao forecasting dataset is obtained from the website of the Macao Meteorological Bureau [24]. Compared with the chessboard-shaped dataset, it has fewer samples but more attributes and is a kind of flow distribution data. According to its features, we choose PM$_{10}$ and SO$_2$ from the six features as the two main characteristics for illustration. The changes after the first granulation division are shown in Figure 5(b).
After the first granulation division, the imbalance ratio of the new sample set is markedly lower than that of the original dataset.
In the online stage, we first conduct granular computing. Next, we apply the SMOTE algorithm to oversample the minority class. Figure 5(c) displays how the minority samples of the Macao forecasting data change after applying DGSMOTE.
It is obvious from Figure 5 that the sample data are nearly balanced after oversampling and undersampling. Table 3 shows the changes in the number of observations before and after being handled by our algorithm.
The next step is to use the new balanced sample set to establish the initial model of the online extreme learning machine. As before, the activation function of the hidden layer is set to "sig." According to the features of the forecasting data, the number of hidden nodes is set to 30. The four models established by the four algorithms are trained on the two forecasting datasets, respectively. The comparative performances of the models are presented in Tables 4 and 5.
As can be seen from Tables 4 and 5, compared with the other three algorithms, DGSMOTE can effectively increase the classification accuracy of the minority class. To make the validity and stability of our proposed algorithm clearer, the classification results and the variation of minority accuracy with the number of hidden nodes on the two datasets are shown in Figures 6, 7, and 8, respectively. In Figures 6 and 7, the dark spots mark the misclassified samples.
From Figures 6 and 7, our proposed algorithm has better recognition ability than the other three algorithms while avoiding a significant decline in majority class accuracy. That is to say, DGSMOTE effectively eliminates the bias that arises when the original OS-ELM algorithms are applied to imbalanced problems. In Figure 8, the curves of DGSMOTE are much smoother, without erratic fluctuation as the number of hidden nodes varies, which further confirms the favorable generalization and stability of our proposed algorithm.
We again use ROC curves to show the overall effect and performance of the proposed algorithm. According to the ROC curves in Figures 9 and 10, the DGSMOTE algorithm has an advantage over the other three algorithms and shows better overall performance and recognition ability when dealing with imbalanced flow distribution data. This indicates its research and application value for practical problems.

Conclusion
In this paper, a novel SMOTE-based classification approach motivated by practical engineering applications is proposed. In the offline stage, we conduct granulation division according to the distribution and the clustering characteristics of the majority samples. The central sample of each granule is used to replace the granule itself, and the balanced offline dataset is thereby obtained. In the online stage, we first apply granulation division to the minority class on the basis of the offline stage and then use SMOTE to oversample the minority samples. Our algorithm effectively increases the classification accuracy of the minority class while keeping the overall distribution unchanged and reducing the information loss of the majority samples.
Furthermore, an entropy-based theorem is used to verify the rationality of the proposed algorithm. The experimental results demonstrate that the overall generalization performance, classification efficiency, and classification accuracy on online imbalanced samples can be improved by applying granulation division to balance the dataset. For both online imbalanced small sample sets and large-scale data, our research is of theoretical significance and practical value.

Figure 1: The distribution of the offline chessboard-shaped data (a) before and (b) after the first granulation and (c) the result after using SMOTE.

Figure 5: The distribution of the forecasting data (a) before and (b) after the first granulation division and (c) the result after using SMOTE.
Figures 9 and 10 show the ROC curves of the four models on the Macao forecasting data in 2010 and 2011, respectively.

Figure 9: Comparison of the ROC curves on Macao forecasting data in 2010 for the four algorithms (a) SVM, (b) OS-ELM, (c) MCOS-ELM, and (d) DGSMOTE.

Table 1: Changes in the numbers before and after balancing the offline data.

Table 3: Changes in the numbers before and after balancing the offline data.

Table 4: Comparative results on Macao forecasting data in 2010.

Table 5: Comparative results on Macao forecasting data in 2011.