Neurofuzzy methods capable of selecting a handful of informative features are valuable in the analysis of high-dimensional datasets. A neurofuzzy classification scheme that creates appropriate linguistic features and simultaneously selects informative features for a high-dimensional dataset is presented and applied to the diffuse large B-cell lymphomas (DLBCL) microarray classification problem. The scheme combines embedded linguistic feature creation and tuning, feature selection, and rule-based classification in a single neural network framework. The adjustable linguistic features are embedded in the network structure via fuzzy membership functions. The network performs the classification task on the high-dimensional DLBCL microarray dataset either by direct calculation or by the rule-based approach, and 10-fold cross validation is applied to ensure the validity of the results. Very good results are achieved with both direct calculation and logical rules, showing that the network can select a small set of informative features from this high-dimensional dataset. Compared with other previously proposed methods, our method yields better classification performance.
Developing computational intelligence mechanisms that are not only highly accurate but also easily interpretable by humans is an interesting research topic. To achieve interpretability, linguistic features are more desirable than other types of features, so an algorithm for finding appropriate symbolic descriptors to represent ordinary continuous features must be developed for the classification mechanism. Enormous research effort in neural networks has been devoted to classification accuracy [
Due to the linguistic feature requirement, in some situations, depending on the classification model, input features are
We initially proposed a neurofuzzy method that could select features and simultaneously create rules for low-dimensional datasets [
In this paper, we investigate the usefulness of our proposed neurofuzzy method by applying it to the high-dimensional diffuse large B-cell lymphomas (DLBCL) microarray classification problem. The number of features in this microarray dataset (7,070 features) is much larger than in any dataset we have tried previously, and the number of samples is very small (77 samples). The problem is therefore very challenging, and it is interesting to see whether our method works on a dataset with a huge number of features but a very small number of samples. The informative features and rules found will be useful in the diagnosis of this kind of cancer, and the results will also indicate how well our method generalizes.
This paper is organized as follows. A neurofuzzy method with feature selection and rule extraction, together with its training scheme designed for a high-dimensional dataset, is described in Section
Three requirements govern the design of a neural network structure for rule extraction. The first is a simple network structure; therefore, only three layers (an input, a hidden, and an output layer) are constructed. Consistent with the first requirement, the second is a small number of linguistic variables. The set of linguistic terms
The neural network designed based on the aforementioned requirements is displayed in Figure
Our neurofuzzy classification model.
The number of nodes in the input layer equals the number of original input features. In the forward pass, each node is reproduced
The next layer is the hidden layer. In addition to calculating the fuzzy values corresponding to the specified linguistic terms of the original inputs, this layer classifies the term with the maximum membership value as the informative class. Since we use 3 linguistic terms, that is, small (S), medium (M), and large (L), for each original feature, the number of nodes in this hidden layer is 3 times that of the input layer. Each node uses a Gaussian membership function to specify the membership value of the original feature; that is,
For each original input, the most informative linguistic feature is defined as the one with the maximum membership value corresponding to the input value
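As a minimal sketch (not the authors' implementation), the Gaussian membership computation and the selection of the most informative linguistic term for one original feature can be written as follows; the centre and spread values are hypothetical:

```python
import numpy as np

def gaussian_membership(x, mean, sigma):
    """Gaussian membership value of x for one linguistic term."""
    return np.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))

def linguistic_layer(x, means, sigmas):
    """Map one original feature value to its three linguistic
    memberships (Small, Medium, Large) and flag the winning term."""
    terms = ["S", "M", "L"]
    mu = np.array([gaussian_membership(x, m, s) for m, s in zip(means, sigmas)])
    winner = terms[int(np.argmax(mu))]  # most informative linguistic term
    return mu, winner

# Hypothetical feature whose S/M/L centres are 0, 5, 10 with spread 2
mu, winner = linguistic_layer(6.2, means=[0.0, 5.0, 10.0], sigmas=[2.0, 2.0, 2.0])
```

Here the value 6.2 lies closest to the Medium centre, so "M" is flagged as the informative term for this input.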
In the output layer, the number of nodes equals the number of classes in the dataset, and the weights between this layer and the hidden layer are fully connected. Hence, we utilize this layer for the feature selection purpose: the importance of a linguistic feature is specified by its corresponding weight value. A sigmoid function is used as the activation function in this layer. Therefore, the output is
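The fully connected sigmoid output layer can be sketched similarly; the sizes and weight values below are hypothetical toy values, not taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer(hidden_out, weights, bias):
    """Fully connected sigmoid output layer: one node per class.
    hidden_out: memberships from all linguistic nodes (3 per feature).
    weights: shape (n_classes, n_hidden); their magnitudes later
    indicate how informative each linguistic feature is."""
    return sigmoid(weights @ hidden_out + bias)

# Toy sizes: 2 original features -> 6 hidden nodes, 2 classes
rng = np.random.default_rng(0)
h = rng.random(6)
W = rng.normal(size=(2, 6))
b = np.zeros(2)
y = output_layer(h, W, b)  # class scores in (0, 1)
```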
The algorithm for parameter tuning of the proposed model is slightly different from the conventional backpropagation algorithm, as shown in Algorithm
While (performance ≤ threshold)
    Compute the delta values (local gradient values) at the output nodes using (
    Update the weights between the hidden and output layers using (
    For each input feature
        If (the hidden node connected to the input feature belongs to the informative class)
            Compute the delta values (local gradient values) at the hidden nodes using (
            Update the mean and standard deviation using (
        Else
            Retain the original values
        End If
    End For
End While
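A compact Python sketch of one pass of this parameter-tuning loop. Since the paper's numbered equations are not reproduced here, the delta definitions, Gaussian gradient forms, and learning rates below are illustrative assumptions, not the authors' exact update rules:

```python
import numpy as np

def train_step(x, target, W, means, sigmas, lr_w=0.1, lr_g=0.001):
    """One update sketch following the pseudocode: output-layer weights
    are always updated; Gaussian parameters are updated only for the
    winning ('informative') linguistic term of each original feature."""
    n_feat = len(x)
    # Forward pass: memberships (n_feat, 3), then sigmoid outputs.
    H = np.exp(-((x[:, None] - means) ** 2) / (2 * sigmas ** 2))
    h = H.ravel()
    y = 1.0 / (1.0 + np.exp(-(W @ h)))
    # Delta at output nodes (squared error with sigmoid derivative).
    delta_o = (target - y) * y * (1.0 - y)
    # Update hidden-to-output weights.
    W += lr_w * np.outer(delta_o, h)
    # Back-propagated error at each hidden node.
    delta_h = W.T @ delta_o
    for i in range(n_feat):
        j = int(np.argmax(H[i]))  # informative term only; others retained
        k = i * 3 + j
        g = delta_h[k] * H[i, j]
        means[i, j] += lr_g * g * (x[i] - means[i, j]) / sigmas[i, j] ** 2
        sigmas[i, j] += lr_g * g * (x[i] - means[i, j]) ** 2 / sigmas[i, j] ** 3
    return W, means, sigmas, y
```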
The standard error backpropagation algorithm is used in the backward pass of the proposed model. Consider the
In the hidden layer, the update process follows the parameter tuning algorithm displayed at the beginning of this section. Only membership functions of the informative class are updated. Since Gaussian membership functions are chosen for the classified informative linguistic feature, we update 2 parameters, that is,
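Assuming the standard Gaussian form for the membership function (the paper's own equations are not reproduced in this excerpt), the chain-rule factors for the two updated parameters are:

```latex
h(x) = \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad
\frac{\partial h}{\partial \mu} = h(x)\,\frac{x-\mu}{\sigma^2}, \qquad
\frac{\partial h}{\partial \sigma} = h(x)\,\frac{(x-\mu)^2}{\sigma^3}
```

These factors explain why only the mean and the standard deviation need to be adjusted for each informative membership function.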
In Figure
Genes corresponding to the selected features.
Feature index | Gene |
---|---|
83 | MDM4 |
87 | STX16 |
207 | NR1D2 |
355 | DCLRE1A |
450 | PARK7 |
546 | ATIC |
931 | HG4263-HT4533_at |
2164 | CSRP1 |
2360 | NASP |
2479 | PGK1 |
6264 | HLA-DPB1_2 |
7043 | HLA-A_2 |
7052 | ITK_2 |
7057 | PLGLB2 |
(a) Initial membership functions of the Wisconsin breast cancer dataset. (b) Updated membership functions after training by the neurofuzzy classification model.
For the typical direct calculation in a neural network, the output is
The algorithm selects the informative linguistic features twice. The first selection considers the bipolar output from the hidden layer: a linguistic feature with an output of +1 is considered informative. The second selection considers the weight values between the hidden layer and the output layer: a larger weight value indicates a more informative feature. Considering the proposed network for logical rule extraction, the IF part is extracted from the hidden layer. As mentioned previously, we use 3 membership functions representing the linguistic terms
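The second selection stage can be sketched as ranking linguistic nodes by the magnitude of their hidden-to-output weights; the weight matrix below is a hypothetical stand-in:

```python
import numpy as np

TERMS = ["S", "M", "L"]

def rank_linguistic_features(W, class_idx, top_k=10):
    """Rank linguistic features for one class by the magnitude of their
    hidden-to-output weights (larger weight = more informative).
    W has shape (n_classes, 3 * n_original_features)."""
    w = W[class_idx]
    order = np.argsort(np.abs(w))[::-1][:top_k]
    # Each entry: (original feature index, linguistic term, weight value)
    return [(idx // 3, TERMS[idx % 3], w[idx]) for idx in order]

# Hypothetical weights: 4 original features (12 linguistic nodes), 2 classes
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 12))
top = rank_linguistic_features(W, class_idx=0, top_k=3)
```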
From the structure described earlier, the rule extracted from our network can be interpreted by 2 approaches, both of which combine all conditions into a single rule. The first approach is the “
In case of the class order “Class 2 then Class 1,” the rule will be slightly modified to Rule
The second approach creates a rule with the consideration of the order of informative linguistic features and class order. We call this approach the “
In case of the class order “Class 2 then Class 1,” the extracted rule will be Rule
An iterative partition method is designed for application to a dataset that has a large number of features. The idea is to partition the entire set of features into subclusters, each of which is used in informative feature selection. The structure of the algorithm is displayed in Figure
Iterative partition method for neurofuzzy method with feature selection and rule extraction for large dataset.
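The partition-select-combine idea can be sketched as follows; a simple variance-based selector stands in for the neurofuzzy selection step (the actual method trains the network on each subset), and the sizes are toy values:

```python
import numpy as np

def iterative_partition_select(X, y, subset_size, select_fn, n_keep):
    """Partition the full feature set into subsets, run the selector on
    each subset, and pool the survivors. `select_fn(X_sub, y)` is a
    stand-in for the neurofuzzy selection step and must return local
    column indices of the informative features within its subset."""
    n_features = X.shape[1]
    kept = []
    for start in range(0, n_features, subset_size):
        cols = np.arange(start, min(start + subset_size, n_features))
        local = select_fn(X[:, cols], y)
        kept.extend(cols[local])
    # The pooled survivors can themselves be re-selected until n_keep remain.
    return np.array(sorted(kept))[:n_keep]

# Toy demo: 50 features split into subsets of 10, keep 2 per subset
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 50))
y = rng.integers(0, 2, size=20)
pick_top2 = lambda Xs, y: np.argsort(Xs.var(axis=0))[::-1][:2]
selected = iterative_partition_select(X, y, subset_size=10,
                                      select_fn=pick_top2, n_keep=8)
```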
The diffuse large B-cell lymphomas (DLBCL) dataset consisting of 77 microarray experiments with 7,070 gene expression levels [
The selected informative features from the learning phase are used for the classification task with both direct calculation and logical rules. The 10-fold cross validation is performed in the experiments, and the results displayed are the averages on the validation sets over the 10 folds. During training, because the number of features was too large to use all 7,070 at once, the original features were divided into small subsets. The 10-fold cross validation was performed on each subset, the informative features from each subset were selected, and the final informative features were formed by combining them. The learning rate for weight updating was set to 0.1, and the learning rates used in updating
We tried several choices of the number of selected features (
Means (μ) and standard deviations (σ) of the Gaussian membership functions for the selected features: initial values.

Feature | μ (S) | μ (M) | μ (L) | σ (S) | σ (M) | σ (L)
---|---|---|---|---|---|---
83 | 185.636 | 254.651 | 323.665 | 69.014 | 69.014 | 69.014 |
87 | 200.414 | 255.633 | 310.852 | 55.219 | 55.219 | 55.219 |
207 | −164.730 | −89.712 | −14.693 | 75.018 | 75.018 | 75.018 |
355 | −22.187 | 11.675 | 45.538 | 33.862 | 33.862 | 33.862 |
450 | 3384.316 | 4126.221 | 4868.126 | 741.905 | 741.905 | 741.905 |
546 | 1810.664 | 2396.104 | 2981.544 | 585.440 | 585.440 | 585.440 |
931 | −148.675 | −26.558 | 95.558 | 122.117 | 122.117 | 122.117 |
2164 | 865.360 | 1032.987 | 1200.614 | 167.627 | 167.627 | 167.627 |
2360 | 1107.608 | 1373.013 | 1638.418 | 265.405 | 265.405 | 265.405 |
2479 | 25.192 | 42.494 | 59.795 | 17.301 | 17.301 | 17.301 |
6264 | 5419.876 | 6722.623 | 8025.370 | 1302.747 | 1302.747 | 1302.747 |
7043 | 11535.574 | 12983.805 | 14432.037 | 1448.232 | 1448.232 | 1448.232 |
7052 | 259.998 | 393.052 | 526.106 | 133.054 | 133.054 | 133.054 |
7057 | 422.074 | 516.844 | 611.614 | 94.770 | 94.770 | 94.770 |
Means (μ) and standard deviations (σ) of the Gaussian membership functions for the selected features: final values after training.

Feature | μ (S) | μ (M) | μ (L) | σ (S) | σ (M) | σ (L)
---|---|---|---|---|---|---
83 | 42.217 | 244.422 | 448.788 | 70.018 | 68.702 | 68.640 |
87 | 89.961 | 244.306 | 401.299 | 55.559 | 53.710 | 53.294 |
207 | −316.059 | −88.806 | 138.408 | 73.797 | 75.289 | 75.806 |
355 | −84.419 | 12.311 | 109.255 | 31.586 | 31.900 | 37.146 |
450 | 1811.843 | 4047.102 | 6272.187 | 746.321 | 748.685 | 750.148 |
546 | 594.454 | 2361.230 | 4125.430 | 590.479 | 591.615 | 595.343 |
931 | −404.615 | −26.100 | 348.347 | 124.371 | 124.241 | 125.706 |
2164 | 552.324 | 1048.629 | 1545.862 | 168.960 | 168.791 | 173.626 |
2360 | 566.830 | 1351.743 | 2133.887 | 262.543 | 263.800 | 265.277 |
2479 | −10.591 | 41.380 | 95.801 | 17.087 | 18.011 | 23.355 |
6264 | 2972.543 | 6890.328 | 10810.781 | 1308.614 | 1309.833 | 1308.325 |
7043 | 8683.574 | 12940.488 | 17192.838 | 1424.231 | 1424.378 | 1422.902 |
7052 | −21.680 | 387.083 | 797.666 | 137.781 | 137.389 | 139.518 |
7057 | 217.169 | 507.375 | 803.842 | 100.208 | 102.060 | 98.289 |
The classification rates on the validation sets of 10-fold cross validation achieved by using the direct calculation were 92.21%, 89.61%, 84.42%, and 84.42%, when the numbers of selected linguistic features were set to 14, 10, 5, and 3, respectively. These results show that the proposed method can select a set of informative features out of a huge pool of features. As shown in Table
Comparison between the proposed method and other algorithms on the DLBCL dataset.
Method | Number of features | Classification rate (%)
---|---|---
Naïve Bayes [ | 3–8 | 83.76
Our method without constrained weight initialization | 10 | 89.61
Decision trees [ | 3–8 | 85.46
Our method without constrained weight initialization | 14 | 92.21
 | 3–8 | 88.60
Weighted voting model [ | 30 | 92.20
VizRank [ | 3–8 | 93.03
Our method with constrained weight initialization | 10 | 97.40
SVM [ | 3–8 | 97.85
Our method with constrained weight initialization | 14 | 100.00
One of the strengths of our method is automatic rule extraction. Even though this approach usually does not perform as well as direct calculation, it provides rules that are understandable to humans, which is more desirable from the interpretation aspect than black-box direct calculation.
The
Top-10 linguistic features for DLBCL dataset selected by our method in 10-fold cross validation (most informative: right, least informative: left).
Cross validation | Class | Feature type | Feature ranking | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Linguistic | S | M | M | S | L | S | M | L | S | S |
Original | 7057 | 207 | 931 | 87 | 2479 | 7052 | 546 | 2360 | 83 | 7043 | ||
2 | Linguistic | S | M | L | L | M | S | L | L | L | S | |
Original | 450 | 7052 | 6264 | 207 | 83 | 2360 | 83 | 87 | 7043 | 546 | ||
2 | 1 | Linguistic | L | L | M | M | S | S | S | M | L | S |
Original | 355 | 2479 | 207 | 931 | 87 | 7052 | 83 | 546 | 2360 | 7043 | ||
2 | Linguistic | S | M | L | L | L | M | S | L | L | S | |
Original | 355 | 7052 | 83 | 6264 | 207 | 83 | 2360 | 87 | 7043 | 546 | ||
3 | 1 | Linguistic | S | M | S | M | L | S | M | L | S | S |
Original | 7052 | 931 | 7057 | 207 | 2479 | 87 | 546 | 2360 | 83 | 7043 | ||
2 | Linguistic | L | S | S | L | S | L | M | S | L | L | |
Original | 7057 | 2479 | 450 | 207 | 2360 | 83 | 83 | 546 | 87 | 7043 | ||
4 | 1 | Linguistic | M | L | M | L | M | S | S | S | L | S |
Original | 931 | 2164 | 207 | 2479 | 546 | 7052 | 87 | 83 | 2360 | 7043 | ||
2 | Linguistic | S | L | S | L | L | S | M | L | S | L | |
Original | 2479 | 6264 | 450 | 207 | 83 | 2360 | 83 | 87 | 546 | 7043 | ||
5 | 1 | Linguistic | L | M | L | M | S | S | M | L | S | S |
Original | 2164 | 931 | 2479 | 207 | 7052 | 87 | 546 | 2360 | 83 | 7043 | ||
2 | Linguistic | S | S | L | S | L | L | M | L | S | L | |
Original | 2479 | 450 | 6264 | 2360 | 83 | 207 | 83 | 87 | 546 | 7043 | ||
6 | 1 | Linguistic | L | M | S | M | L | S | S | M | L | S |
Original | 355 | 207 | 7052 | 931 | 2479 | 87 | 83 | 546 | 2360 | 7043 | ||
2 | Linguistic | L | L | S | L | M | L | S | L | S | L | |
Original | 7057 | 6264 | 450 | 207 | 83 | 83 | 2360 | 87 | 546 | 7043 | ||
7 | 1 | Linguistic | S | S | M | M | L | S | M | S | L | S |
Original | 7052 | 7057 | 931 | 207 | 2479 | 87 | 546 | 83 | 2360 | 7043 | ||
2 | Linguistic | L | S | S | L | L | S | M | L | S | L | |
Original | 6264 | 2479 | 450 | 207 | 83 | 2360 | 83 | 87 | 546 | 7043 | ||
8 | 1 | Linguistic | L | M | M | S | S | S | L | S | L | S |
Original | 355 | 546 | 931 | 7052 | 87 | 7057 | 2479 | 83 | 2360 | 7043 | ||
2 | Linguistic | L | L | S | M | S | L | L | L | S | L | |
Original | 7057 | 207 | 2360 | 83 | 2479 | 6264 | 83 | 87 | 546 | 7043 | ||
9 | 1 | Linguistic | S | S | M | L | M | S | M | S | L | S |
Original | 7057 | 7052 | 931 | 2479 | 207 | 87 | 546 | 83 | 2360 | 7043 | ||
2 | Linguistic | L | S | S | L | L | S | M | L | S | L | |
Original | 6264 | 450 | 2479 | 207 | 83 | 2360 | 83 | 87 | 546 | 7043 | ||
10 | 1 | Linguistic | M | L | M | S | S | S | M | S | L | S |
Original | 207 | 2479 | 931 | 87 | 7052 | 7057 | 546 | 83 | 2360 | 7043 | ||
2 | Linguistic | L | L | S | L | S | L | M | L | S | L | |
Original | 6264 | 7057 | 450 | 207 | 2360 | 83 | 83 | 87 | 546 | 7043 |
From Table
We show the overall feature informative levels from the 10-fold cross validation using top-10 features in Table
Feature informative levels from the 10-fold cross validation using top-10 features.
Original feature | Linguistic term | Informative level |
---|---|---|
Class 1 | ||
7043 | S | 10.0 |
2360 | L | 8.7 |
83 | S | 8.1 |
546 | M | 6.5 |
87 | S | 5.5 |
2479 | L | 4.2 |
7052 | S | 3.9 |
207 | M | 2.8 |
931 | M | 2.8 |
7057 | S | 1.9 |
355 | L | 0.3 |
2164 | L | 0.3 |
Class 2 | ||
7043 | L | 9.8 |
546 | S | 9.1 |
87 | L | 8.1 |
83 | M | 6.2 |
2360 | S | 5.5 |
83 | L | 5.5 |
207 | L | 4.1 |
6264 | L | 2.3 |
450 | S | 2.0 |
2479 | S | 1.4 |
7057 | L | 0.5 |
7052 | M | 0.4 |
355 | S | 0.1 |
Using the
The classification problem of diffuse large B-cell lymphoma (DLBCL) versus follicular lymphoma (FL) based on high-dimensional microarray data was investigated with our neurofuzzy classification scheme. Our direct calculation method achieved a 100% classification rate on the validation sets of 10-fold cross validation using only 14 of the 7,070 features in the dataset. These 14 features, including the genes MDM4, STX16, NR1D2, DCLRE1A, PARK7, ATIC, HG4263-HT4533_at, CSRP1, NASP, PGK1, HLA-DPB1_2, HLA-A_2, ITK_2, and PLGLB2, were automatically selected by our method. The method could also identify the informative linguistic features for each class. For the DLBCL class, the 5 most informative linguistic features were “HLA-A_2 is Small,” “NASP is Large,” “MDM4 is Small,” “ATIC is Medium,” and “STX16 is Small,” respectively. For class 2, the 5 most informative linguistic features were “HLA-A_2 is Large,” “ATIC is Small,” “STX16 is Large,” “MDM4 is Medium,” and “NASP is Small,” respectively. The terms Small, Medium, and Large of each original feature were determined automatically by our method. The informative linguistic features were used to create rules that achieved a 90.91% classification rate on the validation sets of 10-fold cross validation. Even though this rule creation approach yielded worse classification performance than the direct calculation, it is more desirable from the human interpretation aspect. Very good results are thus achieved on this standard high-dimensional dataset, and the set of selected informative genes will be useful for further investigation in bioinformatics or medicine. To ensure that this set of selected features can be used in general, it should be applied to more DLBCL versus FL cases.
The authors declare that there are no conflicts of interest.