Cost-Sensitive Classification for Evolving Data Streams with Concept Drift and Class Imbalance

Class imbalance and concept drift are two primary challenges that exist concurrently in data stream classification. Although the two issues have each drawn considerable attention separately, their joint treatment remains largely unexplored. Moreover, the class imbalance issue becomes more complicated when the data stream also exhibits concept drift. A novel Cost-Sensitive based Data Stream (CSDS) classification algorithm is introduced to overcome the two issues simultaneously. CSDS considers cost information during both data preprocessing and classification. During data preprocessing, a cost-sensitive learning strategy is introduced into the ReliefF algorithm to alleviate class imbalance at the data level. In the classification process, a cost-sensitive weighting scheme is devised to enhance the overall performance of the ensemble. In addition, a change detection mechanism is embedded in the algorithm so that the ensemble can capture and react to drift promptly. Experimental results validate that our method obtains better classification results under different imbalanced, concept-drifting data stream scenarios.


Introduction
Data stream classification has attracted much attention in big data mining due to its presence in many real-world fields, such as social network analysis, weather prediction, online medical diagnosis, and weblog mining [1][2][3][4][5]. Concept drift is a common feature of data streams [6][7][8][9]; it refers to the phenomenon in which the target concepts of a stream change over time. Concept drift can deteriorate classification performance because a model trained on old concepts may be unsuitable for new concepts. For example, fashion trends in recommender systems may be influenced by customer behavior, and a weather forecast model may no longer be applicable as the season changes. Therefore, an efficient data stream learning model should be capable of capturing drifts promptly and updating the model accordingly [7,10].
A growing number of methodologies have been proposed for dealing with concept drift [9]. Among these techniques, the window-based method adopts a natural forgetting mechanism that adds new instances and eliminates outdated ones. The sliding window is the most frequently used window technique. It adopts a first-in-first-out structure to move past processed instances and ensure that the current window stores the latest instances. Because ensemble algorithms have the advantage of modularity and can quickly adapt to changes, ensemble-based methods are the most common methods for handling concept drift.
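The first-in-first-out behavior of a sliding window can be illustrated with a minimal Python sketch. The class name `SlidingWindow` and its methods are hypothetical names chosen for illustration and are not part of any particular stream-learning library.

```python
from collections import deque


class SlidingWindow:
    """Fixed-size FIFO window over a stream (illustrative helper, not a library API)."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("window size must be positive")
        self.buffer = deque(maxlen=size)

    def add(self, instance):
        """Insert the newest instance and return the evicted one (None if nothing was evicted)."""
        full = len(self.buffer) == self.buffer.maxlen
        evicted = self.buffer[0] if full else None
        self.buffer.append(instance)  # deque drops the oldest item when it is full
        return evicted

    def contents(self):
        """Return the instances currently stored, oldest first."""
        return list(self.buffer)


# Push five time-stamped instances through a window of size 3.
window = SlidingWindow(size=3)
evicted = [window.add(t) for t in range(5)]
```

After the five insertions, the window holds only the three most recent instances, and the two oldest ones have been forgotten.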
Although much work has been done on concept drift [5][6][7], the class imbalance problem [11] (i.e., instances of the negative class far outnumber those of the other classes) further increases the difficulty of addressing concept drift [12]. Class imbalance commonly exists in the real world. Examples include cancer diagnosis, financial fraud detection, and geological disaster prediction. For binary classification, the class that has more instances is called the majority class (negative class), and the other is the minority class (positive class). For example, in the online fraud identification of automobile insurance, fraudulent customers accounted for only 1% of the total customers in 100,000 instances. Correctly identifying this 1% of fraudulent instances can significantly reduce economic loss.
Several popular methods for dealing with the class imbalance issue [13][14][15][16][17][18] can be broken down into three main groups: data-level techniques, cost-sensitive learning, and ensemble methods. Cost-sensitive learning methods aim to minimize the total cost. Some researchers argue that the cost-sensitive strategy is the most effective and most frequently used technique for dealing with class imbalance [11].
Tailoring the cost-sensitive learning strategy and adapting it to a nonstationary environment to enhance the capability of dealing with class imbalance is meaningful work. In practice, constructing classifiers for evolving data streams with class imbalance is not a trivial task. It should address the following subproblems: (1) How can concept drift be handled? (2) How can class imbalance be managed?
To address these challenges, a novel cost-sensitive learning scheme, named Cost-Sensitive based Data Stream (CSDS), is devised to tackle the combined issue. The contributions are threefold: (1) A novel cost-sensitive variant of the ReliefF algorithm, named Cost-Sensitive ReliefF (CS-ReliefF), is proposed. CS-ReliefF considers cost information in feature weighting to address the class imbalance issue at the data level. (2) A dynamic cost-sensitive weighting mechanism is developed in the classification stage, incorporating cost values into the learning to alleviate class imbalance at the algorithm level. (3) The performance of our algorithm was evaluated on different kinds of class-imbalanced data stream benchmarks. The results demonstrated that CSDS achieves the best overall performance in G-mean, running time, and concept drift adaptation.

Class Imbalance Learning for Static Data.
Researchers have studied class imbalance classification on static datasets extensively [11]. The research work is mainly divided into three categories: data preprocessing techniques, cost-sensitive learning methods, and ensemble-based methods [13].
Data preprocessing techniques mainly alleviate the influence of class imbalance by changing the original data distribution. Undersampling and oversampling are two common data preprocessing techniques. The undersampling method balances the classes by deleting majority-class instances, resulting in information loss [14]. The oversampling technique balances the data by duplicating minority-class instances. However, due to the uncertainty in the synthesis of new instances, it may weaken the classifier's performance [15]. SMOTE [16] is the best-known oversampling algorithm; it synthesizes new minority instances near the original minority instances. However, it often results in overfitting.
Cost-sensitive learning solutions [17] assign different costs to different classes and seek to minimize the total cost. When a majority-class instance is misclassified as the minority class, a lower misclassification cost is assigned; when a minority-class instance is misjudged as the majority class, a higher misclassification cost is assigned. In this way, the effect of the skewed class distribution can be compensated. Most of these methods extend traditional machine learning methods to make them cost-sensitive. For example, the work in [18] introduced cost-sensitive strategies into the SVM algorithm to minimize cost-sensitive hinge losses. The AdaCost algorithm proposed in [19] adjusts the weights of misclassified instances by introducing a cost-sensitive weight function into AdaBoost. Sun et al. presented a series of algorithms based on cost-sensitive learning [20].
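The asymmetric-cost idea described above can be made concrete with a small Python sketch of a minimum-expected-cost decision rule. The function name and the cost values (1 for a false alarm, 10 for a missed minority instance) are illustrative assumptions, not values taken from the cited works.

```python
def min_expected_cost_class(probs, cost_matrix):
    """Pick the class with the lowest expected misclassification cost.

    probs[i] is the estimated probability of true class i;
    cost_matrix[i][j] is the cost of predicting class j when the truth is class i.
    """
    n_classes = len(probs)
    expected = [
        sum(probs[i] * cost_matrix[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    best = min(range(n_classes), key=lambda j: expected[j])
    return best, expected


# Class 0 = majority (negative), class 1 = minority (positive).
# ASSUMPTION: the costs below are made up purely for illustration.
COST = [
    [0.0, 1.0],   # true majority: predicting minority costs 1
    [10.0, 0.0],  # true minority: predicting majority costs 10
]
label, expected_costs = min_expected_cost_class([0.8, 0.2], COST)
```

Here the classifier estimates only a 20% chance of the minority class, yet the rule still predicts the minority class, because the expected cost of predicting the majority class (0.2 × 10 = 2.0) exceeds that of predicting the minority class (0.8 × 1 = 0.8).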

Data Streams Learning under Concept Drift and Class Imbalance.
Although many efforts have been made focusing on class imbalance or concept drift separately [28][29][30][31][32], the combination of the two issues in data stream classification has not yet drawn enough attention.
Gao et al. proposed a general framework, called Sample and Ensemble (SE), for addressing class imbalance issues under streaming scenarios [28]. SE divides each continuously arriving block into two groups: majority class instances and minority class instances. Then, SE collects the minority instances of the previous blocks and removes some of the majority class instances from the current chunk. Chen and He [29] introduced a novel ensemble solution, called Recursive Ensemble Approach (REA), for tackling class imbalance issues under a nonstationary environment. REA utilized the K-NN algorithm to measure the similarity between the minority class instances of the previous block and those of the current block, and chose previous minority instances to balance the classes of the current block. Polikar et al. [30] presented an algorithm named Learn++.NIE, based on the Learn++ framework [31], to deal with class imbalance in a data stream environment. Recently, Mirza et al. introduced an online version of the Extreme Learning Machine to solve the class imbalance issue [32].
In [33], a novel neural network framework based on a cost-sensitive strategy was devised for handling the class imbalance issue. Li et al. introduced an ensemble algorithm using a multiwindow strategy to handle class imbalance issues [34]. More specifically, three windows are designed in the algorithm: the current data block, the latest minority instances, and the pool of base classifiers. Lu et al. extended and improved the classic dynamic weighted majority (DWM) to deal effectively with the imbalance issue, naming the result Dynamic Weighted Majority for Imbalance Learning (DWMIL) [35]. Moreover, DWMIL used an underbagging strategy during data preprocessing to handle class imbalance. However, it has the drawback of overfitting. Zyblewski et al. proposed a dynamic classifier ensemble selection for imbalanced drifted data streams [36]. Most recently, Cano and Krawczyk proposed an algorithm called Kappa Updated Ensemble (KUE) [37], which utilized the Kappa statistic for dynamically updating the weights of base classifiers.
Nevertheless, some common problems exist in imbalanced data stream classification methods. Many of these algorithms can deal with only a specific type of concept drift. Besides, class imbalance often exists in the data stream together with concept drift, yet most algorithms focus on only one problem and do not fully consider the two issues simultaneously.

Cost-Sensitive Based Data Stream Algorithm.
A novel ensemble framework based on cost-sensitive feature selection is introduced in this study to handle the joint issue. As shown in Figure 1, the proposed algorithm primarily consists of four steps:
Step 1: Data preprocessing: a cost-sensitive feature selection method based on the ReliefF algorithm, named cost-sensitive ReliefF (CS-ReliefF), is devised. CS-ReliefF incorporates cost information into feature selection and selects a subset of features helpful in identifying the minority class. Hence, the feature set is more meaningful for effective prediction, and the dimensionality is reduced.
Step 2: Change detection: our algorithm employs change detection to capture changes explicitly, and when concept drift is detected, a new member classifier is built on the latest data.
Step 3: Classification module: a novel weighting scheme is used, in which the weight of each base classifier is updated based on its accuracy and its total misclassification cost on the latest data.
Step 4: Prediction: the weighted majority voting rule is used for predicting unknown instances.
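The weighted majority voting rule of Step 4 can be sketched in a few lines of Python. The function name and the example weights are hypothetical; in CSDS the weights would come from the cost-sensitive weighting scheme of Step 3.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Two low-weight members vote for class 0, one high-weight member votes for class 1.
vote = weighted_majority_vote([0, 0, 1], [0.2, 0.3, 0.9])
```

In this example the single highly weighted classifier outvotes the two weaker ones, which is the behavior that lets a recently built, well-performing member dominate after a drift.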

Cost-Sensitive ReliefF Algorithm.
A novel cost-based feature selection method, named the Cost-Sensitive ReliefF algorithm (CS-ReliefF), is proposed in this section. We adopt the ReliefF algorithm [38] mainly because it is simple, fast, and effective. More specifically, we tailored the well-known feature selection algorithm ReliefF into a cost-sensitive learning model, which takes cost information into account during feature selection. The main idea of the ReliefF algorithm is to weight features according to their contribution to classification. Specifically, ReliefF randomly selects an instance x_i with class value y and finds its k nearest neighbors from the same class and from each different class, denoted by H_j and M_j(y), respectively. It then updates the weights of all features based on their ability to distinguish neighboring instances.
Let x_i and x_j denote two instances with classes y_i and y_j. The function diff(f, x_i, x_j) is defined as the difference between the values of feature f for the two instances x_i and x_j, and it can be calculated according to

$$\mathrm{diff}(f, x_i, x_j) = \frac{\left|\mathrm{value}(f, x_i) - \mathrm{value}(f, x_j)\right|}{\max_f - \min_f},$$

where max_f and min_f represent the maximum and minimum values of f, respectively, and diff(f, x_i, x_j) ∈ [0, 1] reflects the discrimination between x_i and x_j on f. ReliefF first initializes the weights of all features to zero. Then, ReliefF randomly selects an instance x_i and searches for its k nearest neighbors. ReliefF updates the weight of each feature according to

$$W_f = W_f - \sum_{j=1}^{k} \frac{\mathrm{diff}(f, x_i, H_j)}{r\,k} + \sum_{y \neq y_i} \frac{P(y)}{1 - P(y_i)} \sum_{j=1}^{k} \frac{\mathrm{diff}(f, x_i, M_j(y))}{r\,k},$$

where P(y) is the prior probability of class y estimated from the training set, and r is a user-defined parameter indicating the number of iterations. Unlike ReliefF, the proposed CS-ReliefF algorithm updates W_f considering cost information according to equation (4). In this way, the CS-ReliefF algorithm tends to select features with low costs.
In equation (4), Cost_f is the test cost of f, and λ is the influence factor specified by the user. Cost_f is generated by a normal distribution, and the cost function is defined as follows:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

where μ and σ² are the mean and variance. To avoid the randomness of one sampling, the above process needs to be iterated r times. In our algorithm, the top p% of the features are selected to adapt to the dynamically changing feature space. We adopt p = 75 in the following experiments. The pseudocode is shown in Algorithm 1.
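The feature-weighting procedure can be sketched in plain Python as follows. The ReliefF part follows the standard diff function and weight update for numeric features. The cost term is only an assumed stand-in: since the exact form of the cost-sensitive update is not reproduced here, the sketch simply subtracts λ·Cost_f from each final weight, which at least matches the stated tendency to prefer low-cost features. The names `cs_relieff` and `select_top` are hypothetical.

```python
import math
import random


def diff(f, a, b, lo, hi):
    """Normalized difference of numeric feature f between instances a and b."""
    span = hi[f] - lo[f]
    return abs(a[f] - b[f]) / span if span > 0 else 0.0


def cs_relieff(X, y, k=1, r=10, cost=None, lam=0.0, seed=0):
    """ReliefF weights for numeric features, with an optional cost penalty."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    lo = [min(row[f] for row in X) for f in range(d)]
    hi = [max(row[f] for row in X) for f in range(d)]
    prior = {c: sum(1 for label in y if label == c) / n for c in set(y)}

    def dist(a, b):
        return sum(diff(f, a, b, lo, hi) for f in range(d))

    W = [0.0] * d
    for _ in range(r):
        i = rng.randrange(n)  # randomly selected instance x_i
        for c in prior:
            # k nearest neighbours of x_i inside class c (x_i itself excluded)
            idx = sorted((j for j in range(n) if y[j] == c and j != i),
                         key=lambda j: dist(X[i], X[j]))[:k]
            # hits lower the weight; misses raise it, scaled by the class prior
            scale = -1.0 if c == y[i] else prior[c] / (1.0 - prior[y[i]])
            for j in idx:
                for f in range(d):
                    W[f] += scale * diff(f, X[i], X[j], lo, hi) / (r * k)
    if cost is not None:
        # ASSUMPTION: illustrative penalty only, not the paper's equation (4).
        W = [w - lam * c for w, c in zip(W, cost)]
    return W


def select_top(W, p=75):
    """Indices of the top p% of features by weight (at least one feature)."""
    keep = max(1, math.ceil(len(W) * p / 100))
    return sorted(range(len(W)), key=lambda f: W[f], reverse=True)[:keep]
```

On a toy dataset where the first feature equals the class label and the second is unrelated to it, the first feature receives the larger weight and is the one retained by `select_top`.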

Cost-Sensitive Ensemble Learning
Let E = {C_1, C_2, ..., C_k} represent an ensemble with k base classifiers. CSDS uses a sliding window technique to divide data stream S into data blocks. More specifically, whenever a new instance at time j is observed and inserted into the window W, the instance observed at time j − |W| is discarded. When a new block B_i arrives or a change occurs, we evaluate the effect of features according to equation (4) and use this to select an effective feature subset F′ ⊆ F (refer to Section 3.2). CSDS uses B_i to build a new classifier C′ and assigns it a weight; it also reweights each base classifier C_i ∈ E. These weights are computed from the following quantities: MSE_r, the mean square error of a randomly predicting classifier, which is employed as the baseline for the current distribution; MSE_i, the mean square error of C_i on B_i at time t; and Cost_i, the total misclassification cost of C_i on B_i. When the ensemble is full, the worst classifier is removed, and the newly learned classifier is added to the ensemble. Our method adopts the weighted voting rule to make the final prediction. Moreover, the proposed algorithm utilizes the Drift Detection Method (DDM) [39], a change detection scheme that monitors the classification model's error rate, as the change detector. In fact, any change detection algorithm could be used. The pseudocode of the cost-sensitive-based data stream algorithm is shown in Algorithm 2.
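To illustrate how an error-rate-based change detector such as DDM operates, the following is a minimal Python sketch. It tracks the running error rate p and its standard deviation s = sqrt(p(1 − p)/n), remembers the point at which p + s was lowest, and signals a warning or a drift when p + s exceeds p_min + 2·s_min or p_min + 3·s_min, respectively. The class name `SimpleDDM`, the 30-instance warm-up, and the use of strict inequalities are simplifying assumptions of this sketch, not a description of the exact detector configuration used in the experiments.

```python
import math


class SimpleDDM:
    """Simplified Drift Detection Method over a stream of 0/1 prediction errors."""

    def __init__(self, min_instances=30, warning_level=2.0, drift_level=3.0):
        self.min_instances = min_instances
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        """Forget all statistics, e.g. after a drift has been signalled."""
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """Feed one outcome (True = misclassified); return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(error)
        if self.n < self.min_instances:
            return "stable"  # too few observations for a reliable estimate
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s  # remember the best state seen so far
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# A stream that is 10% wrong for 100 instances, then wrong on every instance.
ddm = SimpleDDM()
before = [ddm.update(i % 10 == 0) for i in range(1, 101)]
after = [ddm.update(True) for _ in range(100)]
```

In the ensemble described above, a "drift" signal would trigger building a new classifier C′ on the latest block, while the detector restarts its statistics for the new concept.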

Experiments and Analysis
The experiments were carried out on the Scikit-Multiflow framework [40], which is a Python platform based on popular open-source frameworks including scikit-learn and MOA [41]. It includes data stream generators, classification algorithms, and evaluation methods.

Data Benchmarks.
In the following experiments, we employed eight data stream benchmarks, including synthetic and real-world streams. The synthetic streams were generated by the stream generators in the Scikit-Multiflow framework. We adopted the ConceptDriftStream generator to simulate concept drift and used the ImbalancedStream generator to set the class imbalance rate (# majority instances : # minority instances). The description of the streams is shown in Table 1.
The Hyperplane stream, which simulates a d-dimensional hyperplane, is the most popular synthetic data for simulating gradual concept drift. A gradual drift stream with 1 M instances was generated in our experiments, and its imbalance rate was set to 5. The SEA dataset is the most commonly used dataset representing sudden drift scenarios in data stream mining. We use the data stream generator to generate a stream with sudden, recurring concept changes. The stream has a total of 1 M instances, and concepts reappear every 250 K instances. Each instance is described by three attributes and belongs to one of four concepts. The LED dataset contains data used to predict the digit shown on a seven-segment LED display. We chose the 24-attribute version of LED. We generated a mixed drift stream containing 1 M instances, including sudden and gradual concept drifts.
The Rotating spiral is a dataset with class imbalance and gradual concept drift. It is used to describe four types of spirals. The rotating spiral data stream contains 1 M instances, and the imbalance rate is 19. The Spam dataset is a representative imbalanced real-world dataset, which collects e-mail messages from the Spam ...
The Sensor dataset has 2,219,803 instances, and five attributes describe each instance. The data were collected from 54 sensors in the Intel Berkeley Research Lab over two months. Since attributes such as brightness and temperature change over time, the stream may contain concept drift.
The Electricity dataset is one of the most widely used real-world datasets. It was collected from the Electricity Market in Australia and contains 45,312 instances, each described by seven attributes. The purpose of this dataset is to predict whether the price of electricity will increase (up) or decrease (down) with changes in market demand and supply. The classes are approximately balanced.
The Airlines stream contains 539,383 instances, and eight attributes describe each instance. The class attribute of Airlines is delay, which indicates whether the flight is delayed.

Evaluation Metrics for Class Imbalance Learning.
Input: D: data in sliding window; F: feature set; k: number of neighbors; r: number of iterations
Output: Feature weight vector W
Begin
  for all f ∈ F do
    W_f = 0;
  end for
  for i = 1 to r do
    randomly select x ∈ D;
    sample x_j ∈ D; if y_j = y, then add x_j to H_i, otherwise add it to M_j, until |H_i| = |M_j| = k;
    for all f ∈ F do
      Update W_f according to equation (5);
    end for
  end for
  Select the top p% of the features;
end.
ALGORITHM 1: Cost-sensitive ReliefF (CS-ReliefF) algorithm.
ALGORITHM 2: Cost-sensitive based data stream algorithm.
The G-mean is the geometric mean of the recall of the abnormal class and that of the normal class, and it is often used to measure a classifier's ability to handle imbalanced data [11]. It is often applied in data streams with class imbalance to reduce the bias of the overall accuracy. For binary classification, the G-mean is as follows [42,43]:

$$G\text{-}mean = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}},$$

where TP, FN, TN, and FP denote the numbers of true positives, false negatives, true negatives, and false positives, respectively.
G-mean can be extended to multiclass cases. Assuming that there are m classes, G-mean is still the geometric average of the per-class correct rates, defined as

$$G\text{-}mean = \left(\prod_{i=1}^{m} G\text{-}mean_i\right)^{1/m},$$

where G-mean_i is the G-mean of the ith class.
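As a concrete illustration of why G-mean is preferred over accuracy on imbalanced data, the following Python sketch computes it as the geometric mean of per-class recalls, which covers both the binary and the multiclass case. The function name `g_mean` and the toy label vectors are illustrative assumptions.

```python
def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls (binary or multiclass)."""
    classes = sorted(set(y_true))
    product = 1.0
    for c in classes:
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        product *= hits / support  # recall of class c
    return product ** (1.0 / len(classes))


# Toy stream chunk: 8 majority (0) and 2 minority (1) instances.
y_true = [0] * 8 + [1] * 2
always_majority = [0] * 10                        # accuracy 0.8, ignores the minority
balanced_guess = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]   # recalls: 6/8 and 1/2

score_majority = g_mean(y_true, always_majority)
score_balanced = g_mean(y_true, balanced_guess)
```

A classifier that always predicts the majority class reaches 80% accuracy here but a G-mean of zero, because its minority recall is zero, whereas the second predictor scores sqrt(0.75 × 0.5) despite its lower accuracy.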

Experimental Results.
We verified the effectiveness of CSDS, which uses cost-sensitive strategies, in evolving data stream scenarios involving different types of drift and class imbalance. CSDS was compared with the following methods:
(i) VFDT: an incremental decision tree classifier based on the Hoeffding inequality, which guarantees the accuracy of the constructed decision tree with a certain probability.
(ii) AUE2: a block-based ensemble that combines an accuracy-based weighting mechanism with the incremental learning of the Hoeffding tree and aims to deal with various types of drift.
(iii) KUE: a dynamic weighting ensemble that utilizes the Kappa statistic to update the weights of base classifiers dynamically.
The evaluation generates an incremental learning curve of each metric over time. For a fair comparison, the maximum ensemble size of the compared ensemble algorithms was set to 15. We chose the Hoeffding tree as the base classifier. Performance is evaluated in terms of G-mean and time (averaged over 10 runs), as shown in Tables 2 and 3.

G-Mean Analysis.
Concerning the G-mean, CSDS achieved the best average ranking, as shown in Table 2. CSDS gained the best performance on four data streams: Hyperplane, SEA, Spam, and Electricity. We find that CSDS performs better in class-imbalanced data stream environments. VFDT obtains the worst performance. This is because it can neither address the class imbalance challenge nor deal with concept drift. CSDS employs the cost-sensitive learning strategy during both the data preprocessing and classification stages. CSDS uses the CS-ReliefF algorithm to incorporate cost information into feature selection and select a subset of features helpful in identifying minority classes. Therefore, the feature set is more meaningful for effective prediction, and the dimensionality is reduced. Simultaneously, a dynamic cost-sensitive weighting strategy is developed to reduce the effect of class imbalance at the algorithm level.

Time Analysis.
In terms of running time, VFDT performs best, followed by our algorithm, and KUE performs the worst. As shown in Table 3, we observe that the ensemble algorithms have certain advantages in G-mean, but they do not perform well in running time. Although the single decision tree classifier VFDT has apparent advantages in time, it performs the worst overall. In most cases, CSDS achieves a good compromise between G-mean and running time and adapts to drifts faster than the other ensemble methods. Our algorithm benefits from the modular characteristics of ensemble learning, which can better deal with recurring gradual drifts. Meanwhile, the change detection mechanism is embedded in our algorithm to capture sudden drift in time.
Next, we adopt graphical plots to visualize how the algorithms are affected by different kinds of change. The x-axis and y-axis denote the number of processed instances and the G-mean of the algorithms, respectively. The SEA dataset is used to simulate scenes with sudden changes and to test the ability to address sudden concept drift. In this scenario, the curves of the G-mean as the number of processed instances increases are shown in Figure 2. The performance of VFDT is the worst, followed by AUE2 and KUE, and CSDS is the best. Moreover, around the 50Kth instance, the G-mean values of all algorithms except CSDS undergo rapid fluctuations. This may be due to the concept drift detection mechanism, which can promptly capture a sudden drift and thereby establish a new classifier to adapt to such drift.
The LED dataset simulates mixed concept drift, that is, scenes with gradual and sudden concept drift. It is intended to verify the algorithm's responsiveness to mixed drift. Specifically, the dataset is a stream containing two gradual drifts and one abrupt concept drift. Halfway through the stream, the target concept suddenly shifts from one concept to another. As shown in Figure 3, all algorithms maintain a higher G-mean value when the data is relatively stable. When a sudden change occurs in the data stream, the performance of all algorithms except CSDS drops sharply. This may be because CSDS can capture different kinds of changes in a timely manner and reconstruct a new model to recover from concept drift quickly. Besides, CSDS provides the best overall performance by utilizing cost-sensitive learning strategies in feature selection and classification.
Changes in real-world streaming scenarios are often complex and variable, so evaluating on real-world streams can better verify classifiers' performance. Figure 4 illustrates how the G-mean curves on the Spam data stream change over time. All the curves show varying degrees of fluctuation, which implies that there may exist a drift in the stream. The dataset represents a real-world nonstationary scenario that includes class imbalance. Since VFDT and AUE2 cannot deal with imbalance, they perform poorly in a scenario where class imbalance and concept drift coexist.
In contrast, the curves of CSDS and KUE are not significantly affected by concept drift, since CSDS is oriented to the data stream's changing characteristics and responds to these problems quickly and in real time. Additionally, cost-sensitive learning strategies are adopted at the data level (feature selection) and the algorithm level (classifier weighting) to deal with imbalance effectively.
Finally, we adopted the nonparametric Friedman test with a significance level α = 0.05 to perform statistical tests on all competing algorithms [44]. The statistical test results show that the null hypothesis is rejected. That is, there is a significant difference between the algorithms. After that, we further employed the Nemenyi test [45] to verify whether the performance of our method is statistically different from that of the other algorithms. The result is shown in Figure 5. The results show that CSDS is significantly better than VFDT.

Conclusion
This study provides novel insight into how to utilize a cost-sensitive learning strategy to deal with class imbalance under dynamic streaming scenarios. An ensemble scheme based on a cost-sensitive strategy is devised to handle the combination of the two issues. Firstly, a cost-sensitive version of the ReliefF algorithm that incorporates cost information during data preprocessing is proposed to address the class imbalance issue at the data level. Secondly, a cost-sensitive classifier weighting scheme utilizing cost information is devised in the ensemble stage. Moreover, a change detection module is embedded in the ensemble to capture drift in real time. Finally, extensive experimental results show that our method is superior to the competing algorithms and gains the best trade-off between performance and resources, especially for nonstationary data streams with imbalanced classes. Furthermore, the statistical significance of the results was verified with a nonparametric Friedman test.
This study focuses on single-label data stream classification. Multilabel data streams are common in many real applications. In the future, we plan to extend cost-sensitive learning to the multilabel stream scenario.

Conflicts of Interest
The authors declare no conflicts of interest.