On Feature Selection and Rule Extraction for High Dimensional Data: A Case of Diffuse Large B-Cell Lymphomas Microarrays Classification

Neurofuzzy methods that can select a handful of useful features are valuable in the analysis of high dimensional datasets. A neurofuzzy classification scheme that creates proper linguistic features and simultaneously selects informative features for a high dimensional dataset is presented and applied to the diffuse large B-cell lymphomas (DLBCL) microarray classification problem. The scheme combines an embedded linguistic feature creation and tuning algorithm, feature selection, and rule-based classification in one neural network framework. The adjustable linguistic features are embedded in the network structure via fuzzy membership functions. The network performs the classification task on the high dimensional DLBCL microarray dataset either by direct calculation or by the rule-based approach. 10-fold cross validation is applied to ensure the validity of the results. Very good results are achieved from both direct calculation and logical rules. The results show that the network can select a small set of informative features in this high dimensional dataset. Compared with previously proposed methods, our method yields better classification performance.


Introduction
Developing computational intelligence mechanisms that are not only highly accurate but also easily interpreted by humans is an interesting research topic. To achieve interpretability, linguistic features are more desirable than other types of features. An algorithm for finding appropriate symbolic descriptors to represent ordinary continuous features must therefore be developed for the classification mechanism. A great deal of research on neural networks has focused on classification accuracy [1][2][3][4]. Rules generated from a neural network were shown to perform better than those from a decision tree in noisy conditions [1]. The subset method, which conducts a breadth-first search for all the hidden and output nodes over the input links, was proposed in [2]. Knowledge insertion was applied to reduce training times and improve various features of the neural networks [3,4].
However, these algorithms are difficult to comprehend because of the large number of parameters and the complicated structure inside the networks. Methods to extract rules from neural networks without considering linguistic features have been proposed in [5,6]. Fuzzy sets are an appropriate choice for turning data into linguistic information that is more interpretable for humans [7][8][9][10][11]. Methods that transform numeric data into linguistic terms before training and then extract the rules were proposed in [12][13][14][15][16][17][18]. A supervised type of neural network with a structure that supports simple rule extraction was proposed in [12]. Rules were extracted from neural networks using structural learning based on the matrix of importance index [13]. Other rule extraction methods determine the typical fuzzy membership functions using the expectation maximization (EM) algorithm [14] or determine context-dependent membership functions for crisp and fuzzy linguistic variables, which allows different linguistic variables in different rules [15]. Logical rule extraction from data was proposed under the assumption that a set of symbolic or continuous valued predicate functions has been defined for some objects, thus providing values of features for categorization of these objects [16]. Good examples of neurofuzzy methods for a medical prediction problem and a biometric classification problem can be found in [17] and [18], respectively. Nevertheless, discovering proper linguistic features for continuous data representation in these methods involves a tradeoff between simplicity and accuracy.
Because of the linguistic feature requirement, in some classification models the number of input features is multiplied by the number of linguistic terms defined for each original feature. This problem is more serious in high dimensional datasets. High dimensional feature vectors may contain noninformative and redundant features that, in turn, cause unnecessary computation cost and make it difficult to create classification rules. Therefore, an algorithm for informative feature selection must be developed. Although some neurofuzzy methods have been utilized for feature selection in high dimensional datasets, they are not widely used. On the other hand, there are several research works on the use of neurofuzzy methods as a classifier [19,20]. In [21], a neurofuzzy method was utilized to select good features by utilizing a relationship between fuzzy criteria. In [22], a neurofuzzy scheme that combines neural networks and fuzzy rule base systems was proposed for simultaneous feature selection and fuzzy rule-based classification. There were three subprocesses in the learning phase. The network structure was changed in phases 2 and 3, and the parameters of the membership functions were fine-tuned in phase 3. Although our purpose is similar to that of [22], the network structures and learning methods are different. In addition, our method automatically creates linguistic features, and all parameters are modified automatically in one learning phase without changing the network structure or retraining the network, whereas in [22] there is more than one learning phase and the algorithm cannot fine-tune parameters without retraining a network with a different structure.
We initially proposed a neurofuzzy method that could select features and simultaneously create rules for low dimensional datasets [23,24]. Although the method in this paper is similar, the training process for a high dimensional dataset is different. As an indication of its usefulness, Chen and Lin adopted our method for skin color detection [25]. To keep the network structure uncomplicated, the main components, including the linguistic feature creation and tuning algorithm, feature selection, and rule-based classification, were embedded in one neural network mechanism. The problem with using linguistic features is defining a particular linguistic set for each feature in a given dataset. Fuzzy membership functions were embedded in our network for linguistic feature creation rather than using the ordinary features. The reason for this combined model is that the neural network's learning algorithms can be applied to the fuzzy membership functions. The original features were transformed into linguistic features and then classified into informative and noninformative classes, that is, either +1 or −1. Features with high weight values, referred to as the informative features, were therefore selected.
In this paper, we investigate the usefulness of our proposed neurofuzzy method by applying it to the high dimensional diffuse large B-cell lymphomas (DLBCL) microarray classification problem. The number of features in this microarray dataset (7,070 features) is much larger than what we have tried previously. Moreover, the number of samples is very small (77 samples). Therefore, this problem is very challenging, and it is interesting to see whether our method works on a dataset with a huge number of features but a very small number of samples. The informative features and rules found will be useful in the diagnosis of this kind of cancer. The results will also indicate how well our method generalizes.
This paper is organized as follows. A neurofuzzy method with feature selection and rule extraction and its training scheme designed for a high dimensional dataset is described in Section 2. The experimental setup, data description, experimental results, and discussion are given in Section 3. Section 4 concludes the paper.

Neurofuzzy Method with Feature Selection and Rule Extraction
Three requirements are considered in the design of a neural network structure for rule extraction. The first requirement is a low-complexity network structure; therefore, only three layers (an input, a hidden, and an output layer) are constructed. Consistent with the first requirement, the second requirement is a small number of linguistic variables. The set of linguistic terms {small, medium, large} is sufficient and readily understood. The final requirement is that only one best combination rule is used for classification. The combination rule is created with the consideration of the class order described in the next section.
The neural network designed based on the aforementioned requirements is displayed in Figure 1. The original features are fed forward to the input layer. Each original feature is replicated once for each specified linguistic term and used as the input to the hidden layer. Instead of using static linguistic features from preprocessing, we add the hidden layer with fuzzy logic membership functions. A Gaussian membership function is used to represent the membership value of each group. In addition to the weight updating equations, modifications of the center $c$ and the spread $\sigma$ are required during the training process. The second layer is the combination of fuzzification and informative linguistic feature classification. The class of each linguistic feature is decided and fed as the input to the output layer.
The number of nodes constructed in the input layer is the same as the number of original input features. For the forward pass, each node is replicated once per linguistic term; in our structure the number of linguistic terms is 3. The number of outputs from this layer is therefore triple that of the original input features and is represented by
$$y^{\text{in}}_{ij} = x_i, \tag{1}$$
where $i$ denotes the order number of the original input and $j = 1, 2, 3$ denotes the order number of the linguistic feature created for each original input. Between the input layer and the hidden layer, all connection weights are set to unity.
The next layer is the hidden layer. In addition to calculating fuzzy values corresponding to the specified linguistic terms of the original input, the maximum membership value is classified into the informative class. Since we use 3 linguistic terms, that is, small (S), medium (M), and large (L), for each original feature, the number of nodes constructed in this hidden layer is 3 times that of the input layer. Each node uses a Gaussian membership function to specify the membership values of the original feature; that is,
$$\mu_{ij} = \exp\left(-\frac{\left(y^{\text{in}}_{ij} - c_{ij}\right)^2}{2\sigma_{ij}^2}\right) \tag{2}$$
is the membership value of $y^{\text{in}}_{ij}$, which is the original input $i$ from the linguistic node $j$. Each node in this layer has two parameters to consider. The first parameter is the spread $\sigma_{ij}$ for each original feature $i$. The initial values of $\sigma_{ij}$ for $j$ = S, M, and L come from the spreads, divided by 3 ($\sigma_{ij}/3$), of all data points in linguistic term $j$. The second parameter is the center $c_{ij}$. The initial values of $c_{ij}$ for $j$ = S, M, and L are set to $(c_{ij} - (\sigma_{ij}/3))$, $c_{ij}$, and $(c_{ij} + (\sigma_{ij}/3))$, respectively, where $c_{ij}$ is the center of the membership set of linguistic term $j$ of original feature $i$.
For each original input, the most informative linguistic feature is defined as the feature with the maximum membership value corresponding to the input value $y^{\text{in}}_{ij}$. Only the parameters $c_{ij}$ and $\sigma_{ij}$ of the most informative linguistic variable, which are its mean and standard deviation, are modified. Consider two classes of linguistic terms, the informative class (the most informative linguistic term) and the noninformative class; the output $y^{\text{h}}_{ij}$ of the hidden layer is equal to +1 if the input $y^{\text{in}}_{ij}$ belongs to the informative class. Alternatively, $y^{\text{h}}_{ij}$ is equal to −1 if the input belongs to the noninformative class, for $j = 1, 2, 3$. Therefore, the output from the hidden layer is ±1. Each original feature has one informative linguistic term represented by +1 and two noninformative linguistic terms represented by −1. All outputs with identified informative values are used as input to the final layer.
In the output layer, the number of nodes is equal to the number of classes in the dataset. Weights are fully connected between this layer and the hidden layer. Hence, we utilize this layer for the feature selection purpose. The importance of the linguistic features is specified by the corresponding weight values. The sigmoid function is used as the activation function in this layer. Therefore the output is
$$y^{\text{o}}_k = \frac{1}{1 + \exp(-v_k)}, \tag{3}$$
where
$$v_k = \sum_{i} \sum_{j=1}^{3} w_{k,ij}\, y^{\text{h}}_{ij}. \tag{4}$$
The subscript $k$, ordering from 1 to the number of classes, represents the class indices of the dataset. $w_{k,ij}$ represents the weight connected between node $ij$ in the hidden layer and node $k$ in the output layer. The summation of the product between weights and outputs from the hidden layer is represented by $v_k$.
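The hidden-layer computation described above can be sketched as follows. This is a minimal illustration, assuming a Gaussian of the form exp(-(x - c)^2 / (2 sigma^2)); the function name `hidden_layer_output` and the toy centers and spreads are hypothetical and are not taken from the paper.

```python
import numpy as np

def hidden_layer_output(x, centers, spreads):
    """Fuzzify one sample and mark the most informative linguistic term.

    x:       shape (n_features,)    original feature values
    centers: shape (n_features, 3)  Gaussian centers for S, M, L
    spreads: shape (n_features, 3)  Gaussian spreads for S, M, L
    Returns (membership, bipolar), both of shape (n_features, 3).
    """
    # Input layer: replicate every feature once per linguistic term.
    y_in = np.repeat(x[:, None], 3, axis=1)
    # ASSUMPTION: Gaussian membership with a factor 2 in the denominator.
    membership = np.exp(-((y_in - centers) ** 2) / (2.0 * spreads ** 2))
    # The term with the largest membership is the informative one (+1),
    # the other two terms are noninformative (-1).
    winner = membership.argmax(axis=1)
    bipolar = -np.ones_like(membership)
    bipolar[np.arange(len(x)), winner] = 1.0
    return membership, bipolar

# Toy example with two features on a [0, 1] scale (hypothetical values).
x = np.array([0.1, 0.9])
centers = np.array([[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]])
spreads = np.full((2, 3), 0.2)
mu, y_h = hidden_layer_output(x, centers, spreads)
print(y_h)
```

In this toy case the first feature is closest to the Small center and the second to the Large center, so each row of `y_h` contains exactly one +1.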

Parameter Tuning.
The algorithm for parameter tuning of the proposed model, shown in Algorithm 1, is slightly different from the conventional algorithm.

Weight Modification.
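The standard error backpropagation algorithm is used in the backward pass of the proposed model. Consider the $k$th neuron of the output layer at iteration $n$; the error signal is defined by
$$e_k(n) = d_k(n) - y^{\text{o}}_k(n), \tag{5}$$
where $e_k(n)$ and $d_k(n)$ represent the error signal and desired output, respectively, and $y^{\text{o}}_k(n)$ is the actual output of the neuron.
Backpropagating from the output layer, the delta value (local gradient value) is defined as
$$\delta^{\text{o}}_k(n) = e_k(n)\,\varphi'_k(v_k(n)), \tag{6}$$
where $\varphi'_k(v_k(n)) = \partial y^{\text{o}}_k(n)/\partial v_k(n)$ and $v_k(n)$ is the weighted sum of the hidden layer outputs fed to neuron $k$. Given the learning rate $\eta_w$, the connected weights between the hidden layer and the output layer are updated by
$$w_{k,ij}(n+1) = w_{k,ij}(n) + \eta_w\,\delta^{\text{o}}_k(n)\,y^{\text{h}}_{ij}(n), \tag{7}$$
where $y^{\text{h}}_{ij}(n)$ is the output of hidden node $ij$.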
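One step of this weight update can be sketched as below. It is an illustrative delta-rule step for sigmoid output nodes, assuming 0/1 desired outputs and a flattened vector of bipolar hidden outputs; the function names are hypothetical, and the learning rate of 0.1 is the value reported later in the experiments.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def update_output_weights(w, y_h, d, lr=0.1):
    """One delta-rule step for the hidden-to-output weights.

    w:   shape (n_classes, n_hidden)  connection weights
    y_h: shape (n_hidden,)            bipolar hidden outputs (+1 / -1)
    d:   shape (n_classes,)           desired outputs (ASSUMPTION: 0/1 targets)
    """
    v = w @ y_h                      # weighted sum for each output node
    y_o = sigmoid(v)                 # sigmoid activation
    e = d - y_o                      # error signal
    delta = e * y_o * (1.0 - y_o)    # local gradient: e * d(sigmoid)/dv
    w_new = w + lr * np.outer(delta, y_h)
    return w_new, y_o, delta

# Toy example: 2 classes, 2 original features x 3 linguistic terms.
w = np.zeros((2, 6))
y_h = np.array([1.0, -1.0, -1.0, -1.0, -1.0, 1.0])
d = np.array([1.0, 0.0])
w_new, y_o, delta = update_output_weights(w, y_h, d)
print(sigmoid(w_new @ y_h))
```

After the step, the output of the desired class moves above 0.5 and the other output moves below it, which is the expected direction for this sample.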

Membership Function Parameter Modification.
In the hidden layer, the update process follows the parameter tuning algorithm displayed at the beginning of this section. Only membership functions of the informative class are updated. Since Gaussian membership functions are chosen for the classified informative linguistic feature, we update 2 parameters, that is, the center $c$ and the spread $\sigma$. Similar to the output layer, we perform the backpropagation algorithm in this hidden layer with a delta value (local gradient) defined for each hidden node. The parameters $c$ and $\sigma$ belonging to the informative linguistic features are updated at iteration $(n + 1)$ by the increments $\Delta c$ and $\Delta\sigma$, where $\eta_c$ and $\eta_\sigma$ are the learning rates for the parameters $c$ and $\sigma$, respectively.
In Figure 2(a), the Wisconsin breast cancer dataset with 9 original features and 683 data points is used to illustrate the initial membership functions. Figure 2(b) shows the corresponding membership functions with the parameters tuned after training. For the DLBCL dataset used in this research, the number of original features is too large. Therefore, we cannot display the initial membership functions of all features. However, the means and standard deviations of the initial membership functions of a set of 14 selected features are shown later in Table 2.
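To make the idea of tuning $c$ and $\sigma$ concrete, the sketch below shows one plausible gradient-style step for a single informative Gaussian membership function. It is an assumption for illustration only: the increments follow the partial derivatives of a Gaussian exp(-(x - c)^2 / (2 sigma^2)), the hidden-node delta is passed in as a given number, and the function name is hypothetical. It should not be read as the exact update equations of the method. The learning rates of 0.001 are the values reported in the experiments.

```python
import numpy as np

def update_membership(x, c, s, delta_h, lr_c=0.001, lr_s=0.001):
    """Gradient-style step for one informative Gaussian membership function.

    ASSUMPTION: delta_h is the local gradient of the hidden node, and the
    increments use d(mu)/dc and d(mu)/ds of
    mu = exp(-(x - c)^2 / (2 s^2)).
    """
    mu = np.exp(-((x - c) ** 2) / (2.0 * s ** 2))
    d_c = lr_c * delta_h * mu * (x - c) / s ** 2        # shift the center
    d_s = lr_s * delta_h * mu * (x - c) ** 2 / s ** 3   # widen or narrow
    return c + d_c, s + d_s

# Toy example: a positive delta pulls the center toward the input value.
c_new, s_new = update_membership(x=1.0, c=0.5, s=0.5, delta_h=1.0)
print(c_new, s_new)
```

With a positive delta the center moves slightly toward the input and the spread grows slightly; a negative delta would do the opposite.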

Rule Extraction Methods.
For the typical direct calculation in a neural network, the output of each output node is computed as shown in (3). The class decision is simply the class with the corresponding maximum output. For the rule extraction purpose, however, the weight values are used to verify the importance of the features after the training process. In each fold of the 10-fold cross validation, after the learning phase, the connection weights between the hidden layer and the output layer are sorted to prioritize the informative features, as described in more detail below. The algorithm selects the informative linguistic features twice. The first selection is done by considering the bipolar output from the hidden layer. A linguistic feature with an output of +1 is considered an informative feature. The second selection is done by considering the weight values between the hidden layer and the output layer. A larger weight value indicates a more informative feature. In the proposed network for logical rule extraction, the IF part is extracted from the hidden layer. As mentioned previously, we use 3 membership functions representing the linguistic terms {S, M, L} for each original feature. The summation of the product of weights and outputs from the hidden layer is interpreted as the logical OR in the extracted rules. The final classified class from the output layer is interpreted as THEN in the classification rules. After finishing the training phase, the weights are sorted. We use the final values of the weights to select the "Top $N$" informative features.
From the structure described earlier, the rule extracted from our network can be interpreted by 2 approaches. Both approaches combine all conditions into only 1 rule. The first approach is the "simple OR" rule. All $N$ features are used to create a simple OR rule for each class. For example, assume a 2-class problem with $N$ set to 3. For class 1, suppose the first informative order is "Feature 1 is Large," the second informative order is "Feature 5 is Medium," and the third informative order is "Feature 3 is Small." For class 2, suppose the first informative order is "Feature 10 is Small," the second informative order is "Feature 5 is Small," and the third informative order is "Feature 6 is Large." In case of the class order "Class 1 then Class 2," the rule automatically generated from the simple OR approach of our proposed method is as Rule 1.
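The weight-based ranking can be sketched as follows. This is a minimal illustration covering only the second selection step (sorting by weight); it assumes the weights of one output node are arranged as a (feature, term) matrix, uses 0-based feature indices, and the function name and weight values are hypothetical.

```python
import numpy as np

TERMS = ("Small", "Medium", "Large")

def top_linguistic_features(weights, n_top):
    """Rank the linguistic features of one class by output-layer weight.

    weights: shape (n_features, 3), one weight per (feature, term) hidden node
    Returns (feature_index, term) pairs, most informative first.
    """
    # Sort the flattened weights in descending order and keep the top n_top.
    order = np.argsort(weights, axis=None)[::-1][:n_top]
    feats, terms = np.unravel_index(order, weights.shape)
    return [(int(f), TERMS[t]) for f, t in zip(feats, terms)]

# Hypothetical weights for a class with two original features.
weights = np.array([[0.1, 0.2, 0.9],
                    [0.8, 0.0, -0.3]])
ranking = top_linguistic_features(weights, n_top=2)
print(ranking)
```

Here the two largest weights belong to "feature 0 is Large" and "feature 1 is Small", so those become the first and second informative orders for this class.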
In case of the class order "Class 2 then Class 1," the rule will be slightly modified to Rule 2.
The second approach creates a rule with the consideration of the order of informative linguistic features and the class order. We call this approach the "layered" rule. All $N$ linguistic features are used with the consideration of the order of informative features and the class order. In case of the class order "Class 1 then Class 2," the rule automatically generated from the layered approach for the same scenario as above is as Rule 3.
In case of the class order "Class 2 then Class 1," the extracted rule will be Rule 4.
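The two rule forms can be sketched as small functions. This is an illustrative reading of the simple OR and layered rules for the 2-class example above; the function names, the dictionary representation of a sample, and the choice to return `None` when no condition fires are assumptions, since the text does not say what happens in that case.

```python
def simple_or_rule(sample_terms, class_conditions, class_order):
    """Simple OR rule: the first class in class_order with any matching
    condition wins."""
    for cls in class_order:
        if any(sample_terms.get(f) == t for f, t in class_conditions[cls]):
            return cls
    return None  # ASSUMPTION: no decision when nothing matches

def layered_rule(sample_terms, class_conditions, class_order):
    """Layered rule: walk the informative ranks; within each rank the
    classes are tested in class_order."""
    n_ranks = max(len(c) for c in class_conditions.values())
    for rank in range(n_ranks):
        for cls in class_order:
            conds = class_conditions[cls]
            if rank < len(conds):
                feature, term = conds[rank]
                if sample_terms.get(feature) == term:
                    return cls
    return None  # ASSUMPTION: no decision when nothing matches

# Conditions from the example in the text, most informative first.
class_conditions = {
    1: [(1, "Large"), (5, "Medium"), (3, "Small")],
    2: [(10, "Small"), (5, "Small"), (6, "Large")],
}
# Hypothetical sample: feature number -> linguistic term with maximum membership.
sample = {1: "Small", 3: "Small", 5: "Small", 6: "Medium", 10: "Large"}

print(simple_or_rule(sample, class_conditions, class_order=(1, 2)))
print(layered_rule(sample, class_conditions, class_order=(1, 2)))
```

For this sample the simple OR rule with class order "Class 1 then Class 2" answers class 1 (because "Feature 3 is Small" matches), while the layered rule reaches "Feature 5 is Small" first and answers class 2, which shows how the two approaches can disagree.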

Neurofuzzy Method with Feature Selection and Rule Extraction for High Dimensional Dataset via Iterative Partition Method.
An iterative partition method is designed for application to a dataset that has a large number of features. The idea is to partition the entire set of features into subclusters. Each cluster is used in informative feature selection. The structure of the algorithm is displayed in Figure 3. The first step in the algorithm is to define the desired number of features ($N$) to be used in the final step. The original dataset is partitioned into subsets. Each subset is used as an input to the neurofuzzy method. All selected features are then combined to create the dataset with selected features. The partitioning and feature selection are iteratively performed until the desired number of informative features is achieved.
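The iterative partition loop can be sketched as below. This is a schematic illustration, not the training procedure itself: `score_fn` stands in for training the neurofuzzy network on one feature subset and reading off feature importances, and the names `subset_size`, `keep_per_subset`, and `mean_gap_score` are hypothetical. The toy scorer is a simple class-mean gap, used only so the example runs on its own.

```python
import numpy as np

def iterative_partition_select(X, y, n_final, subset_size, keep_per_subset, score_fn):
    """Partition the features into subsets, keep the best features of each
    subset, and repeat until at most n_final features remain."""
    if not 0 < keep_per_subset < subset_size:
        raise ValueError("keep_per_subset must be positive and smaller than subset_size")
    selected = np.arange(X.shape[1])
    while len(selected) > n_final:
        if len(selected) <= subset_size:
            # Last round: one subset, keep exactly n_final features.
            chunks, keep = [selected], n_final
        else:
            chunks = [selected[i:i + subset_size]
                      for i in range(0, len(selected), subset_size)]
            keep = keep_per_subset
        survivors = []
        for chunk in chunks:
            scores = score_fn(X[:, chunk], y)
            best = np.argsort(scores)[::-1][:keep]
            survivors.extend(chunk[best].tolist())
        selected = np.array(survivors)
    return selected

def mean_gap_score(X, y):
    """Stand-in scorer (NOT the neurofuzzy network): gap between class means."""
    return np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))

# Synthetic data: 200 noise features, two of which separate the classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 7] += 5.0
X[y == 1, 150] += 5.0
chosen = iterative_partition_select(X, y, n_final=2, subset_size=50,
                                    keep_per_subset=5, score_fn=mean_gap_score)
print(sorted(chosen.tolist()))
```

In this run the first round reduces 200 features to 20 (5 from each of 4 subsets) and the second round keeps the final 2. If the per-subset quota overshoots, the function can return fewer than `n_final` features.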

Experimental Results and Discussion
The diffuse large B-cell lymphomas (DLBCL) dataset consisting of 77 microarray experiments with 7,070 gene expression levels [27] was utilized in this research. It was made available to the public at http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. There are two classes in the dataset: diffuse large B-cell lymphoma (DLBCL), referred to as class 1, and follicular lymphoma (FL), referred to as class 2. These 2 types of B-cell lineage malignancies have very different clinical presentations, natural histories, and responses to therapy [27]. Because DLBCLs are the most common lymphoid malignancy in adults, a method that can efficiently classify these 2 lymphomas is very desirable. The dataset contains 58 samples of the DLBCL class and 19 samples of the FL class.
The informative features selected in the learning phase are used for the classification task in both direct calculation and logical rules. 10-fold cross validation is performed in the experiments. The results displayed are the average results on the validation sets over the 10 cross validations. During the training step, because the number of features was too large, the original features were divided into small subsets rather than using the entire 7,070 features at once. The 10-fold cross validation was performed on each subset. The informative features selected from each subset were then combined to form the final set of informative features. The learning rate for weight updating was set to 0.1, and the learning rates used in updating $c$ and $\sigma$ were set to 0.001.

Classification Results on DLBCL Microarrays by Direct Calculation Using Selected Features.
We tried several choices of the number of selected features ($N$) and found that $N$ = 14 was adequate for this problem. After training, 14 features were automatically selected by our method. The set of features that yielded the best results on validation sets among those in 10-fold cross validation consisted of features 83 (MDM4), 87 (STX16), 207 (NR1D2), 355 (DCLRE1A), 450 (PARK7), 546 (ATIC), 931 (HG4263-HT4533 at), 2164 (CSRP1), 2360 (NASP), 2479 (PGK1), 6264 (HLA-DPB1 2), 7043 (HLA-A 2), 7052 (ITK 2), and 7057 (PLGLB2). The genes corresponding to the selected features are shown in Table 1. The values of means and standard deviations of all membership functions of the 14 selected features before and after training are shown in Tables 2 and 3, respectively.
The classification rates on the validation sets of 10-fold cross validation achieved by using the direct calculation were 92.21%, 89.61%, 84.42%, and 84.42%, when the numbers of selected linguistic features were set to 14, 10, 5, and 3, respectively. These results show that the proposed method can select a set of informative features out of a huge pool of features. As shown in Table 4, the classification performance is comparable to that of previously proposed methods [26,27]. Moreover, rather than using random initial weights between the hidden layer and the output layer, we used the weights obtained in the current cross validation as the initial weights for the next cross validation. This constrained weight initialization yielded 100.00%, 97.40%, 90.91%, and 92.21% using the direct calculation, when the numbers of selected linguistic features were set to 14, 10, 5, and 3, respectively. To ensure that the set of all parameters achieved here could get 100% correct classification on this dataset, we used the networks to classify all 77 microarrays (rather than considering the results from 10-fold cross validation, in which the outputs could be different when the random groups are different). We found that each of the 10 networks from 10-fold cross validation still yielded 100% correct classification on the entire dataset.

Rule 1:
IF "Feature 1 is Large" OR "Feature 5 is Medium" OR "Feature 3 is Small" THEN "Class is 1"
ELSEIF "Feature 10 is Small" OR "Feature 5 is Small" OR "Feature 6 is Large" THEN "Class is 2"
END

Rule 2:
IF "Feature 10 is Small" OR "Feature 5 is Small" OR "Feature 6 is Large" THEN "Class is 2"
ELSEIF "Feature 1 is Large" OR "Feature 5 is Medium" OR "Feature 3 is Small" THEN "Class is 1"
END

Rule 3:
IF "Feature 1 is Large" THEN "Class is 1"
ELSEIF "Feature 10 is Small" THEN "Class is 2"
ELSEIF "Feature 5 is Medium" THEN "Class is 1"
ELSEIF "Feature 5 is Small" THEN "Class is 2"
ELSEIF "Feature 3 is Small" THEN "Class is 1"
ELSEIF "Feature 6 is Large" THEN "Class is 2"
END

Rule 4:
IF "Feature 10 is Small" THEN "Class is 2"
ELSEIF "Feature 1 is Large" THEN "Class is 1"
ELSEIF "Feature 5 is Small" THEN "Class is 2"
ELSEIF "Feature 5 is Medium" THEN "Class is 1"
ELSEIF "Feature 6 is Large" THEN "Class is 2"
ELSEIF "Feature 3 is Small" THEN "Class is 1"
END

Classification Results on DLBCL Microarrays by Logical Rule Using Selected Features.
One of the strengths of our method is automatic rule extraction. Even though this approach usually does not perform as well as the direct calculation, it provides rules that humans can understand. This is more desirable from the human interpretation aspect than the black-box direct calculation. The $N$ selected linguistic features were used to create rules using both the simple OR and layered approaches as mentioned in Section 2.2. A classification rate of 90.91% on the validation sets using only 5 selected linguistic features was achieved using the simple OR rule with the class order "Class 1 then Class 2," where Class 1 and Class 2 denote the DLBCL class and the FL class, respectively. For more details of the results, Table 5 shows the top-10 linguistic features for the DLBCL dataset selected by our method in each cross validation of the 10-fold cross validation. The classification rates in the 10 cross validations were 75.00%, 100.00%, 100.00%, 75.00%, 100.00%, 87.50%, 75.00%, 100.00%, 100.00%, and 100.00%, respectively. The classification rate from the layered rule was 81.82% using the same top 5 linguistic features. The classification rates in the 10 cross validations were 75.00%, 87.50%, 100.00%, 87.50%, 62.50%, 75.00%, 75.00%, 85.71%, 85.71%, and 85.71%, respectively. When using the aforementioned constrained weight initialization, the classification rates were the same. We also tried increasing the number of selected features to 10, but it did not help. The results were the same as those using 5 features.

Rule 6:
IF "HLA-A 2 is Small" THEN "Class is DLBCL"
ELSEIF "HLA-A 2 is Large" THEN "Class is FL"
ELSEIF "NASP is Large" THEN "Class is DLBCL"
ELSEIF "ATIC is Small" THEN "Class is FL"
ELSEIF "MDM4 is Small" THEN "Class is DLBCL"
ELSEIF "STX16 is Large" THEN "Class is FL"
ELSEIF "ATIC is Medium" THEN "Class is DLBCL"
ELSEIF "MDM4 is Medium" THEN "Class is FL"
ELSEIF "STX16 is Small" THEN "Class is DLBCL"
ELSEIF "NASP is Small" THEN "Class is FL"
END
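As a consistency check on the reported figures, the overall rule-based rates appear to be pooled over all 77 validation samples rather than averaged over the folds. The sketch below assumes fold sizes of 8 for the first seven folds and 7 for the last three; these sizes are inferred from the per-fold percentages and are not stated in the text.

```python
# ASSUMPTION: fold sizes inferred from the per-fold percentages; they sum to 77.
fold_sizes = [8, 8, 8, 8, 8, 8, 8, 7, 7, 7]

# Per-fold classification rates (%) as reported for the two rule types.
simple_or = [75.00, 100.00, 100.00, 75.00, 100.00, 87.50, 75.00, 100.00, 100.00, 100.00]
layered = [75.00, 87.50, 100.00, 87.50, 62.50, 75.00, 75.00, 85.71, 85.71, 85.71]

def pooled_rate(rates, sizes):
    """Convert per-fold rates to correct counts and pool over all samples."""
    correct = [round(r / 100.0 * n) for r, n in zip(rates, sizes)]
    return 100.0 * sum(correct) / sum(sizes)

print(round(pooled_rate(simple_or, fold_sizes), 2))
print(round(pooled_rate(layered, fold_sizes), 2))
```

Under these assumed fold sizes the pooled rates reproduce the reported 90.91% (70 of 77) and 81.82% (63 of 77), whereas a plain average of the ten simple OR fold rates would give 91.25%.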
From Table 5, it can be seen that the sets of informative features for class 1 and class 2 are different across the cross validations. That will result in a number of different rules. However, in a real application we have to come up with the best rules among them. We propose to use the summation of the feature informative level over all 10 cross validations. The feature informative level is simply defined by the informative order. For example, if only 10 linguistic features are considered, the most informative one will have a feature informative level of 1.0, the second one will have 0.9, and so on. Hence, the tenth most informative feature will have a feature informative level of 0.1, and the remaining features will get a feature informative level of 0.0. The informative levels of each feature are then summed across all cross validations to yield the overall informative level of that feature.
We show the overall feature informative levels from the 10-fold cross validation using top-10 features in Table 6. In this case, the highest possible informative level is 10.0. Out of 42 linguistic features for each class, there were only 12 and 13 linguistic features with nonzero informative level for class 1 and class 2, respectively. That means the remaining features did not appear at all in the top-10 list of any cross validation. The results showed that, for class 1, the 5 most informative linguistic features, ranked from the most informative to the least informative, were "Feature 7043 (HLA-A 2) is Small," "Feature 2360 (NASP) is Large," "Feature 83 (MDM4) is Small," "Feature 546 (ATIC) is Medium," and "Feature 87 (STX16) is Small," respectively. For class 2, the 5 most informative linguistic features were "Feature 7043 is Large," "Feature 546 is Small," "Feature 87 is Large," "Feature 83 is Medium," and "Feature 2360 is Small," respectively. It is worth noting that the last statement can also be "Feature 83 is Large" because it has the same feature informative level of 5.5 as "Feature 2360 is Small." This information was used to create rules as described in Section 2.2. Hence, the rule extracted using the simple OR approach is as Rule 5.
Using the layered approach, the extracted rule is given as Rule 6.
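The summation of informative levels can be sketched in a few lines. This is a minimal illustration of the scoring just described; the function name and the placeholder feature labels are hypothetical and do not come from Table 5.

```python
from collections import defaultdict

def overall_informative_levels(rankings, top_k=10):
    """Sum rank-based informative levels over all cross-validation folds.

    rankings: one list per fold, linguistic features ordered from the most
              to the least informative.
    With top_k = 10 the levels are 1.0, 0.9, ..., 0.1; features outside the
    top_k list of a fold contribute 0.0 for that fold.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for rank, feature in enumerate(ranking[:top_k]):
            totals[feature] += (top_k - rank) / top_k
    return dict(totals)

# Two hypothetical folds with placeholder labels.
folds = [
    ["Feature A is Small", "Feature B is Large", "Feature C is Medium"],
    ["Feature B is Large", "Feature A is Small"],
]
levels = overall_informative_levels(folds)
print(sorted(levels.items(), key=lambda kv: -kv[1]))
```

With 10 folds and `top_k=10`, a feature ranked first in every fold reaches the maximum overall level of 10.0.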

Conclusion
The classification problem of diffuse large B-cell lymphoma (DLBCL) versus follicular lymphoma (FL) based on high dimensional microarray data was investigated using our neurofuzzy classification scheme. Our direct calculation method achieved a 100% classification rate on the validation sets of 10-fold cross validation by using only 14 out of 7,070 features in the dataset. These 14 features, corresponding to genes MDM4, STX16, NR1D2, DCLRE1A, PARK7, ATIC, HG4263-HT4533 at, CSRP1, NASP, PGK1, HLA-DPB1 2, HLA-A 2, ITK 2, and PLGLB2, were automatically selected by our method. The method could also identify the informative linguistic features for each class. For the DLBCL class, the 5 most informative linguistic features were "HLA-A 2 is Small," "NASP is Large," "MDM4 is Small," "ATIC is Medium," and "STX16 is Small," respectively. For the FL class, the 5 most informative linguistic features were "HLA-A 2 is Large," "ATIC is Small," "STX16 is Large," "MDM4 is Medium," and "NASP is Small," respectively. The terms Small, Medium, and Large of each original feature were automatically determined by our method. The informative linguistic features were used to create rules that achieved a 90.91% classification rate on the validation sets of 10-fold cross validation. Even though this rule creation approach yielded worse classification performance than the direct calculation, it is more desirable from the human interpretation aspect. Very good results were thus achieved on this standard high dimensional dataset. The set of selected informative genes will be useful for further investigation in the fields of bioinformatics and medicine. To ensure that this set of selected features can be used in general, it should be applied to more DLBCL versus FL cases.

Figure 2: (a) Initial membership functions of the Wisconsin breast cancer dataset.(b) Updated membership functions after training by the neurofuzzy classification model.

Figure 3: Iterative partition method for neurofuzzy method with feature selection and rule extraction for large dataset.

Table 1: Genes corresponding to the selected features.

Table 2: Means and variances of the membership functions {S, M, L} of the 14 selected features before training for the DLBCL dataset.

Table 3: Means and variances of the membership functions {S, M, L} of the 14 selected features after training for the DLBCL dataset.

Table 4: Comparison between the proposed method and other algorithms on the DLBCL dataset.

Table 6: Feature informative levels from the 10-fold cross validation using top-10 features.