Applying Cost-Sensitive Extreme Learning Machine and Dissimilarity Integration to Gene Expression Data Classification

Embedding cost-sensitive factors into classifiers increases classification stability and reduces classification costs for large-scale, redundant, and imbalanced datasets such as gene expression data. In this study, we extend our previous work, the Dissimilar ELM (D-ELM), by introducing misclassification costs into the classifier. We name the proposed algorithm the cost-sensitive D-ELM (CS-D-ELM). Furthermore, we embed rejection costs into the CS-D-ELM to increase its classification stability. Experimental results show that the CS-D-ELM with embedded rejection costs effectively reduces the average and overall cost of the classification process, while the classification accuracy remains competitive. The proposed method can be extended to classification problems involving other redundant and imbalanced data.


Introduction
With the appearance of gene chips, the classification of gene expression data has moved to the molecular level [1]. The classification of gene expression data is a crucial component of next-generation cancer diagnosis technology [2]. For a particular tumor tissue with a series of known features, the classification of the gene array is believed to provide important information for identifying the tumor type and consequently to influence the treatment plan [3][4][5]. However, gene expression data are large-scale, highly redundant, and imbalanced, usually with a relatively small sample size. Specifically, the number of features can be a hundred times larger than the number of samples [6]. This property makes it difficult for most traditional classifiers, such as extreme learning machine (ELM) [7], support vector machine (SVM), and multilayer neural networks, to produce accurate and stable classification results. In 2012, we presented the integrated algorithm of Dissimilar ELM (D-ELM), which selectively eliminates ELMs on the basis of V-ELM and provides more stable classification results than individual ELMs [8,9].
Besides accuracy, classification cost is another important aspect of performance evaluation for classification problems. In cancer diagnosis, the cost of classifying a patient with cancer into the negative class (false negative) is much higher than that of classifying a patient without cancer into the positive class (false positive) [10]. Both false-negative and false-positive cases are misclassifications. However, a false negative can cost a human life because of the wrong medical treatment. Besides the misclassification cost, the rejection cost has in recent years also attracted attention in the development of cost-sensitive classifiers [11]. By considering misclassification and rejection costs, classifiers become more stable and reliable.
In this study, aiming to extend the D-ELM to increase its classification stability, we embed misclassification costs into D-ELM and name the proposed extension CS-D-ELM. Furthermore, we embed rejection costs into the CS-D-ELM to increase the classification stability of the proposed algorithm. The CS-D-ELM with embedded rejection costs achieves the minimum classification cost with competitive classification accuracy. We validated CS-D-ELM on several commonly used gene expression datasets and compared the experimental results with those of D-ELM, CS-ELM, and CS-SVM. The results show that the CS-D-ELM and the CS-D-ELM with embedded rejection costs both effectively reduce the overall misclassification costs and consequently enhance the classification reliability.
The rest of the paper is organized as follows. Related works, such as ELM, extensions of ELM, and cost-sensitive classifiers, are introduced in Section 2. In Section 3, the proposed algorithm is described in detail: the original D-ELM algorithm is extended by embedding misclassification costs and rejection costs. The experimental results are shown in Section 4. Conclusions, limitations, and future work are stated in Section 5.

Related Works
The best-known advantage of ELM [12][13][14] is its one-step training process, which results in a much faster learning speed than traditional machine learning techniques such as multilayer neural networks or support vector machine (SVM). The single-hidden-layer feedforward network (SLFN) on which ELM is based can also be applied to other research fields [15]. However, the classification accuracy of a single ELM is not stable. Integrated ELM algorithms have been developed to solve this problem. Wang et al. [16] proposed an upper integral network with an extreme learning mechanism. The upper integral extracts the maximum potential of efficiency for a group of features with interaction. Lan et al. [17] presented an enhanced integration algorithm with more stable performance and higher classification accuracy, the Ensemble of Online Sequential ELM (EOS-ELM). Tian et al. [18,19] used the Bagging Integrated Model and the modified AdaBoost RT, respectively, to modify the conventional ELM. Lu et al. [20] proposed several algorithms to reduce the computational cost of the Moore-Penrose inverse matrices for ELM. Zhang et al. [21] introduced an incremental ELM which combines the deep feature extraction ability of Deep Learning Networks with the feature mapping ability of the ELM. Cao et al. [22] presented the majority Voting ELM (V-ELM), which is widely used in various fields. Lu et al. [8,9] presented the integrated algorithms of Dissimilar ELM (D-ELM), which adapt better to different individual ELMs than the method in [22].
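The one-step training of an ELM mentioned above can be illustrated with a minimal sketch. It assumes a sigmoid activation, uniformly distributed random input weights, and a toy two-class dataset; the function names `train_elm` and `predict_elm` and all numeric settings are hypothetical choices for illustration, not the configuration used in this paper.

```python
import numpy as np

def train_elm(X, T, n_hidden, rng):
    """Train one ELM: random hidden layer, least-squares output weights."""
    n_features = X.shape[1]
    # Input weights and biases are drawn once at random and never updated.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # hidden layer output matrix
    # One-step solution of H @ beta = T via the Moore-Penrose pseudoinverse.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Return the index of the largest output node for each sample."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)

rng = np.random.default_rng(0)
# Toy, well-separated two-class data (assumption: not gene expression data).
X = np.vstack([rng.normal(-2.0, 1.0, size=(40, 5)),
               rng.normal(2.0, 1.0, size=(40, 5))])
y = np.array([0] * 40 + [1] * 40)
T = np.eye(2)[y]  # one-hot targets
W, b, beta = train_elm(X, T, n_hidden=20, rng=rng)
train_acc = float(np.mean(predict_elm(X, W, b, beta) == y))
```

Because the hidden layer is random, two ELMs trained with different seeds generally disagree on some samples, which is the instability that ensemble methods such as V-ELM and D-ELM address.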

Cost-Sensitive Classifiers.
In most integrated algorithms, the probabilities of samples belonging to given classes are calculated before the class labels of the samples are decided. However, if two or more probabilities are equal or close to each other, misclassification is likely to happen. Therefore, misclassification cost has been studied to improve the classification performance of integrated algorithms. Foggia et al. [23] proposed a method to calculate the misclassification value and false rejection value of multiexpert systems based on Bayesian decision rules. Experimental results showed that their method was optimal. In 2003, Zadrozny et al. [24] introduced cost-sensitive factors into machine learning techniques, which further reduced the classification costs. In 2010, Masnadi-Shirazi and Vasconcelos [25] introduced both misclassification costs and rejection costs into SVM, which improved the performance of cost-sensitive SVM algorithms. In 2011, Fu [26] proposed a cost-sensitive classification algorithm named Cost-MCP Boost for multiclass problems. The Cost-MCP Boost algorithm solved the cost merge problem that arises when multiclass cost-sensitive classifications are converted into two-class cost-sensitive classifications; the classification results were determined by the classes with smaller misclassification costs. In 2013, Cao et al. [27] proposed an optimized cost-sensitive SVM to deal with the imbalanced data learning problem.
Embedding classification costs into the ELM improves both the classification accuracy and the stability of the classifier [28]. Lu et al. [29][30][31] proposed cost-sensitive ELM for gene expression data classification. Experimental results showed that the misclassification cost dropped drastically and the classification accuracy rose. Zong et al. [32] and Mirza et al. [33] utilized a weighted ELM to deal with imbalanced data. By assigning different weights to samples following user instructions, the weighted ELM can be generalized to cost-sensitive ELM. Riccardi et al. [34] worked on a cost-sensitive AdaBoost algorithm based on ELM. The cost-sensitive ELM is used for ordinal regression and produces competitive results. Most recently, Fu et al. [35] showed some experimental results on the stability and generalization of ELM. The study provides useful guidelines for building ELM ensembles with cost-sensitive factors to produce more stable classification results. Wang et al. [36] indicated that samples for which the classifier output has higher fuzziness carry a greater risk of misclassification. They proposed a fuzziness category based divide-and-conquer strategy to improve classifier performance.

Cost-Sensitive Dissimilar Extreme Learning Machine (CS-D-ELM)
The ultimate goal of this study is to minimize the conditional risk:

$$R(i \mid x) = \sum_{j} P(j \mid x)\, C(i, j), \tag{1}$$

where $R(i \mid x)$ is the conditional risk of classifying the sample $x$ into class $i$, $P(j \mid x)$ is the probability that sample $x$ belongs to class $j$, and $C(i, j)$ is the risk that a sample belonging to class $j$ is misclassified into class $i$. Here $i$ and $j$ belong to the set $\{c_1, c_2, \ldots, c_m\}$, and $m$ is the number of classes.
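As a short worked derivation, consider the two-class case of the conditional risk. The notation is an assumption made for this illustration: $R(i \mid x)$ is the conditional risk of deciding class $i$, $P(j \mid x)$ is the probability that $x$ belongs to class $j$, $C(i, j)$ is the cost of assigning a class-$j$ sample to class $i$, and correct decisions are assumed to cost nothing, $C(c_1, c_1) = C(c_2, c_2) = 0$.

```latex
\begin{aligned}
R(c_1 \mid x) &= P(c_2 \mid x)\, C(c_1, c_2), \qquad
R(c_2 \mid x) = P(c_1 \mid x)\, C(c_2, c_1), \\
R(c_1 \mid x) < R(c_2 \mid x)
  &\iff \bigl(1 - P(c_1 \mid x)\bigr)\, C(c_1, c_2) < P(c_1 \mid x)\, C(c_2, c_1) \\
  &\iff P(c_1 \mid x) > \frac{C(c_1, c_2)}{C(c_1, c_2) + C(c_2, c_1)} .
\end{aligned}
```

With equal costs the bound is $1/2$ and the rule reduces to majority voting; with unequal costs the decision boundary shifts toward the class whose misclassification is cheaper.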

The General Form of D-ELM.
The D-ELM is an improved algorithm based on the majority Voting ELM [37]. It selects the most suitable ELM individuals after a training process in order to improve consistency in the voting procedure and thereby increase the classification accuracy. Suppose there are $N$ ELMs and $M$ training samples available for initialization. We construct a dissimilarity matrix to eliminate inappropriate ELM individuals. The remaining ELMs are mutually consistent and are able to provide more stable classification results. The dissimilarity matrix is defined by the degree of inconsistency between outputs. If the $i$th ELM and the $j$th ELM give equal judgement results $f_{ik}$ and $f_{jk}$ for sample $k$, we set $\mathrm{Dif}(f_{ik}, f_{jk}) = 0$ ($i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, N$; $k = 1, 2, \ldots, M$). Otherwise, we set $\mathrm{Dif}(f_{ik}, f_{jk}) = 1$. Let $\mathrm{Div}_{i,j} = \sum_{k=1}^{M} \mathrm{Dif}(f_{ik}, f_{jk})$ denote the difference between the $i$th ELM and the $j$th ELM. The dissimilarity matrix is expressed as

$$\mathrm{Div} = \begin{bmatrix} 0 & \mathrm{Div}_{1,2} & \cdots & \mathrm{Div}_{1,N} \\ \mathrm{Div}_{2,1} & 0 & \cdots & \mathrm{Div}_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Div}_{N,1} & \mathrm{Div}_{N,2} & \cdots & 0 \end{bmatrix}. \tag{2}$$

Obviously, Div is a matrix with zeros for all diagonal elements.
$\mathrm{Div}_i$ denotes the dissimilarity between the $i$th ELM and the rest of the ELMs. It is defined as

$$\mathrm{Div}_i = \sum_{j=1}^{N} \mathrm{Div}_{i,j}. \tag{3}$$

The average classification accuracy is denoted as $\bar{p}$. The ELM with the smaller $\mathrm{Div}_i$ value is eliminated when $0 < \bar{p} \le 0.5$, and the ELM with the bigger $\mathrm{Div}_i$ value is eliminated when $0.5 < \bar{p} < 1$. The remaining ELMs proceed to the voting procedure.
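The dissimilarity-based elimination described above can be sketched as follows. The sketch assumes that the dissimilarity of an ELM to the rest is the sum of its pairwise disagreement counts, and that a fixed number `n_remove` of ELMs is eliminated; both the helper names and `n_remove` are hypothetical, since the text does not state how many ELMs are removed.

```python
def dissimilarity_matrix(outputs):
    """outputs[i][k] is the label predicted by the i-th ELM for training sample k."""
    n = len(outputs)
    div = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Count the samples on which ELM i and ELM j disagree.
            div[i][j] = sum(a != b for a, b in zip(outputs[i], outputs[j]))
    return div

def select_elms(outputs, avg_accuracy, n_remove):
    """Return the indices of the ELMs kept after elimination."""
    div = dissimilarity_matrix(outputs)
    div_i = [sum(row) for row in div]  # dissimilarity of each ELM to the rest
    order = sorted(range(len(outputs)), key=lambda i: div_i[i])
    if avg_accuracy <= 0.5:
        removed = set(order[:n_remove])  # drop the smallest values
    else:
        removed = set(order[len(order) - n_remove:])  # drop the largest values
    return [i for i in range(len(outputs)) if i not in removed]

# Hypothetical outputs of four ELMs on five training samples.
outputs = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],  # disagrees with the others on most samples
]
div = dissimilarity_matrix(outputs)
kept = select_elms(outputs, avg_accuracy=0.8, n_remove=1)
```

In this toy case the average accuracy is above 0.5, so the ELM that disagrees most with the others (index 3) is removed and the three mutually consistent ELMs go on to vote.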
The overall D-ELM algorithm can be divided into three parts. First, $N$ independent ELMs are trained using the given data, and a number of ELMs are eliminated according to the dissimilarity criterion. Second, the remaining $K$ ELMs are trained again using the same number of hidden layer nodes and the same activation function. For each independent ELM, the input weights and hidden layer biases are randomly generated and unrelated to those of the other ELMs. Last, for each testing sample $tx$, the $K$ independent ELMs produce at most $K$ distinct classification results. A vector $(L_{K,tx}(c_1), L_{K,tx}(c_2), \ldots, L_{K,tx}(c_m))$, initialized to zero ($m$ is the number of classes), is used to store the classification results of $tx$ for the $K$ ELMs. For the $l$th ($l \in [1, \ldots, K]$) ELM classifier, if the classification result of $tx$ is $i$ ($i \in \{c_1, c_2, \ldots, c_m\}$), then the following operation is carried out:

$$L_{K,tx}(i) = L_{K,tx}(i) + 1. \tag{4}$$

The final vector $L_{K,tx}$ is obtained after all $K$ ELMs are processed. The probability of each class in the classification result is

$$P(i \mid tx) = \frac{L_{K,tx}(i)}{K}, \quad i \in \{c_1, c_2, \ldots, c_m\}. \tag{5}$$

After the conditional probabilities of the test sample $tx$ are calculated by D-ELM, if $tx$ is correctly classified into the $s$th class, the probability that $tx$ belongs to the $s$th class is bigger than the probability that it belongs to any other class; that is,

$$P(s \mid tx) > P(i \mid tx), \quad \forall i \neq s. \tag{6}$$

For example, in a two-class classification, the probabilities that a testing sample $tx$ belongs to the positive class $c_P$ and the negative class $c_N$ are $P(c_P \mid tx) = L_{K,tx}(c_P)/K$ and $P(c_N \mid tx) = L_{K,tx}(c_N)/K$, respectively.
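The vote counting and probability estimate described above can be sketched in a few lines. The class names and the ten votes below are hypothetical values chosen for illustration.

```python
from collections import Counter

def vote_probabilities(predictions, classes):
    """predictions: labels predicted for one test sample by the K retained ELMs."""
    k = len(predictions)
    counts = Counter(predictions)  # vote vector, one entry per class
    # Share of the K votes received by each class.
    return {c: counts.get(c, 0) / k for c in classes}

classes = ["positive", "negative"]
predictions = ["positive"] * 7 + ["negative"] * 3  # votes of K = 10 ELMs
probs = vote_probabilities(predictions, classes)
majority = max(classes, key=lambda c: probs[c])
```

Here seven of the ten ELMs vote for the positive class, so plain majority voting assigns the sample to that class regardless of how costly each kind of error is.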

Embedding Misclassification Costs into D-ELM.
For each test sample $tx$, it is not enough to know only the probability $P(i \mid tx)$ ($i \in \{c_1, c_2, \ldots, c_m\}$). When the costs are unequal, even if inequality (6) is satisfied, we cannot determine the class label $s$ of $tx$. Therefore, in this section, asymmetric misclassification costs are embedded in order to extend D-ELM to CS-D-ELM.
Suppose the probability $P(i \mid tx)$ is calculated by the D-ELM method in Section 3.1; the class label of $tx$ is then determined by taking the class $i$ with the least cost. The following equation is derived from (1):

$$R(i \mid tx) = \sum_{j} P(j \mid tx)\, C(i, j). \tag{7}$$

All class labels of test samples are recalculated according to the principle of minimizing the average misclassification cost. Let $y_{tx}$ be the class label assigned to sample $tx$ after integrating the misclassification cost information. After embedding the misclassification costs into the D-ELM, the classification result is

$$y_{tx} = \arg\min_{i \in \{c_1, \ldots, c_m\}} \sum_{j} P(j \mid tx)\, C(i, j), \tag{8}$$

where $P(j \mid tx) = L_{K,tx}(j)/K$ ($j \in \{c_1, c_2, \ldots, c_m\}$) is the probability calculated by D-ELM in Section 3.1 and $L_{K,tx}(j)$ is the number of the $K$ ELMs that vote for class $j$.
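The minimum-cost decision described above can be sketched as follows. The cost values and the dictionary layout `cost[(decided, true)]` are assumptions made for this example; correct decisions are assumed to cost nothing.

```python
def conditional_risk(probs, cost, decided):
    """Expected cost of assigning class `decided`.

    cost[(i, j)] is the cost of labelling a class-j sample as class i;
    missing entries (including i == j) are treated as zero.
    """
    return sum(p * cost.get((decided, true), 0.0) for true, p in probs.items())

def min_cost_label(probs, cost):
    """Pick the class whose conditional risk is smallest."""
    return min(probs, key=lambda i: conditional_risk(probs, cost, i))

probs = {"positive": 0.3, "negative": 0.7}  # hypothetical vote shares
cost = {
    ("negative", "positive"): 5.0,  # false negative (hypothetical value)
    ("positive", "negative"): 1.0,  # false positive (hypothetical value)
}
label = min_cost_label(probs, cost)
```

Although 70% of the votes favour the negative class, deciding "negative" has an expected cost of 0.3 × 5 = 1.5 while deciding "positive" costs 0.7 × 1 = 0.7, so the cost-sensitive rule selects the positive class.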

The CS-D-ELM Algorithm.
A general form of the CS-D-ELM algorithm can be described as follows:
(1) Set initial values for all $N$ ELMs.
(3) Calculate the hidden layer output matrix of the $i$th ELM.

Computational Intelligence and Neuroscience

Embedding Rejection Costs into CS-D-ELM.
Samples with low classification reliability are more likely to be misclassified. To reduce the high cost of misclassification, a rejection option is embedded into CS-D-ELM to prevent automatic classification of samples with low classification reliability. There are three kinds of rejection costs: (1) The costs that require further analysis for unclassified samples.
If the classification reliability of the test sample is no lower than the rejection threshold, the test sample is classified into the sth class. If it is lower than the rejection threshold, the test sample is rejected. For two-class classification problems with embedded misclassification costs and rejection costs, given a test sample where , ∈ {0, , }. The rejection threshold is determined on a case-by-case basis and depends on the specific dataset. An example of a particular value calculation can be found in Section 4.3.

Diabetes, Heart, and Leukemia datasets, respectively. The average misclassification costs produced by CS-D-ELM are lower than those based on CS-ELM and CS-SVM.
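As a sketch of the rejection option of Section 3, the following code adds a reject decision to the minimum-cost rule. It assumes that the classification reliability of a sample is its largest vote share; this measure, the `REJECT` marker, and the cost values are hypothetical choices for illustration, since the exact reliability function is not reproduced here.

```python
REJECT = "reject"

def classify_with_rejection(probs, cost, threshold):
    """Return the minimum-cost label, or REJECT if reliability is too low."""
    # Conditional risk of each candidate decision.
    risks = {
        i: sum(p * cost.get((i, j), 0.0) for j, p in probs.items())
        for i in probs
    }
    label = min(risks, key=risks.get)
    reliability = max(probs.values())  # assumed reliability measure
    return label if reliability >= threshold else REJECT

cost = {
    ("negative", "positive"): 5.0,  # false negative (hypothetical value)
    ("positive", "negative"): 1.0,  # false positive (hypothetical value)
}
confident = classify_with_rejection({"positive": 0.9, "negative": 0.1}, cost, 0.7)
uncertain = classify_with_rejection({"positive": 0.55, "negative": 0.45}, cost, 0.7)
```

The first sample is classified automatically because the ELMs largely agree, while the second is withheld for further analysis because the votes are nearly split.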

Experimental Results of CS-D-ELM with Embedded Rejection Costs.
Before the rejection costs are embedded into the CS-D-ELM, an important preprocessing step must be performed, namely the determination of the rejection threshold. Taking the Heart dataset as an example, on the basis of the invariant cost matrix, we set the rejection costs to (0, 1) = (1, 0) = 0.2. A randomly selected set of 300 samples is used as the training set, and the remaining samples are used as the testing set. Again, 30 independent experiments are performed and the average misclassification costs are recorded. Figure 7 shows the relationship between the rejection threshold and the average misclassification costs. The average misclassification cost is minimized while rejection threshold  Table 2. We use a Dell server with 32 cores at 2.26 GHz and 128 GB of RAM. The voting process is conducted with 100 ELM individuals. Under the assumption that time is not the top priority in this study, the following conclusion can be drawn from this  Table 3.
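The threshold determination described above can be sketched as a sweep over candidate thresholds that keeps the one with the lowest average cost. The rejection cost of 0.2 follows the text; the validation samples, the misclassification costs, the candidate grid, and the use of the largest vote share as reliability are assumptions made for this example.

```python
def decide(probs, cost, threshold):
    """Minimum-cost label, or "reject" when the largest vote share is too low."""
    risks = {
        i: sum(p * cost.get((i, j), 0.0) for j, p in probs.items())
        for i in probs
    }
    label = min(risks, key=risks.get)
    return label if max(probs.values()) >= threshold else "reject"

def average_cost(samples, cost, reject_cost, threshold):
    """Average cost over (vote shares, true label) pairs at one threshold."""
    total = 0.0
    for probs, true_label in samples:
        decision = decide(probs, cost, threshold)
        if decision == "reject":
            total += reject_cost
        else:
            total += cost.get((decision, true_label), 0.0)
    return total / len(samples)

def best_threshold(samples, cost, reject_cost, candidates):
    """First candidate threshold that attains the lowest average cost."""
    return min(candidates,
               key=lambda t: average_cost(samples, cost, reject_cost, t))

cost = {("negative", "positive"): 5.0, ("positive", "negative"): 1.0}
samples = [  # hypothetical validation data
    ({"positive": 0.9, "negative": 0.1}, "positive"),
    ({"positive": 0.1, "negative": 0.9}, "negative"),
    ({"positive": 0.6, "negative": 0.4}, "negative"),
    ({"positive": 0.55, "negative": 0.45}, "negative"),
]
candidates = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
t_star = best_threshold(samples, cost, reject_cost=0.2, candidates=candidates)
```

In this toy data, a low threshold lets the two borderline samples through and they are misclassified, whereas a threshold of 1.0 rejects everything; the average cost is lowest in between, which mirrors the shape of the curve one would expect when plotting cost against threshold.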
As shown in Tables 3 and 4, the G-means values are improved by CS-D-ELM and by CS-D-ELM with embedded rejection costs, and the improvement is larger with embedded rejection costs. The average misclassification costs of CS-D-ELM are smaller than those of D-ELM on all datasets, indicating that CS-D-ELM can effectively reduce the misclassification costs in the classification process. The average misclassification costs of CS-D-ELM with embedded rejection costs are lower than those of the standard CS-D-ELM on all datasets. Again, the same conclusion can be drawn:
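The two evaluation measures discussed above can be computed as in the following sketch. It assumes the usual two-class definition of the G-mean, the geometric mean of sensitivity and specificity, which may differ in detail from the computation behind Tables 3 and 4; the labels and cost values are hypothetical.

```python
import math

def g_mean(y_true, y_pred, positive):
    """Geometric mean of sensitivity and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

def average_misclassification_cost(y_true, y_pred, cost):
    """cost[(predicted, true)] is the cost of that error; correct decisions cost 0."""
    total = sum(cost.get((p, t), 0.0) for t, p in zip(y_true, y_pred))
    return total / len(y_true)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
cost = {(0, 1): 5.0, (1, 0): 1.0}  # false negative costs more than false positive
gm = g_mean(y_true, y_pred, positive=1)
avg_cost = average_misclassification_cost(y_true, y_pred, cost)
```

Unlike plain accuracy, the G-mean drops sharply when either class is poorly recognized, which makes it a common companion to cost measures on imbalanced data.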

Conclusion and Discussion
Traditional classification algorithms are all based on classification accuracy. When the misclassification costs are not equal, they cannot meet the requirement of minimum average misclassification cost in cost-sensitive classification. This paper first reconstructs the classification results by introducing probability estimation and misclassification costs into the classification process and proposes the cost-sensitive D-ELM algorithm, called CS-D-ELM. Furthermore, by embedding the rejection costs, the average misclassification costs are further reduced. Regarding computational complexity, the time taken by the voting procedure is negligible, as the training of each ELM is the most costly part. The speed of the proposed algorithm depends on the number of cores available for parallel execution. In our case, a 32-core server is used to run the CS-D-ELM with 100 ELM individuals. The overall running time of the proposed algorithm is three to four times that of an individual ELM. However, as the average running time of each ELM individual is generally less than one second, the overall running speed is still fast.
Embedding misclassification costs and rejection costs has proved useful for reducing classification cost and improving accuracy [8,11,30]. The way of embedding misclassification costs and rejection costs into the D-ELM can be employed for other cost-sensitive algorithms. As future work, we are implementing a cost-sensitive rotation forest algorithm (CS-RoF) for gene expression data classification. Similar algorithms can also be extended to classification problems involving other imbalanced datasets.