Integrated Use of Statistical-Based Approaches and Computational Intelligence Techniques for Tumors Classification Using Microarray

With the recent development of biotechnologies, cDNA microarray chips are increasingly applied in cancer research. Microarray experiments can lead to a more thorough grasp of the molecular variations among tumors because they allow the expression levels of thousands of genes in cells to be monitored simultaneously. Accordingly, how to successfully discriminate tumor classes using gene expression data is an urgent research issue and plays an important role in the study of carcinogenesis. To refine the large dimension of the gene data and effectively classify tumor classes, this study proposes several hybrid discrimination procedures that combine statistical-based techniques and computational intelligence approaches. A real microarray data set is used to demonstrate the performance of the proposed approaches. In addition, the results of cross-validation experiments reveal that the proposed two-stage hybrid models are more efficient in discriminating the acute leukemia classes than the established single-stage models.


Introduction
The recent development of cDNA microarray technologies has made it possible to analyze thousands of genes simultaneously and has led to the prospect of providing an accurate and efficient means for classifying and diagnosing human cancers [1-20]. Advances in microarray discrimination methods promise to greatly advance cancer diagnosis, especially in situations where tumors are clinically atypical. The main challenge of microarray analysis, however, is the overwhelming number of genes compared to the smaller number of available tumor samples, that is, a very large number of variables relative to the number of observations [10, 21-23]. As a consequence, the issue of developing an accurate discrimination method for tumor classification using gene expression data has received considerable attention recently.
Many approaches have been proposed for tumor classification using microarray data [10, 22-33]. The existing methods can be divided into two types: statistical-based methods [10, 22, 24-26] and computational intelligence methods [22, 27-33]. Because the dimension of the gene data is very large while only a few observations are available, it is necessary to reduce and refine the whole data set before performing the classification tasks. While most related works have focused on the use of a single technique for tumor classification, little research has been done on the integrated use of several techniques to classify tumor classes. To achieve high accuracy for a particular classification problem with smaller computational time, hybrid evolutionary computation algorithms are commonly used to optimize the resolution process [34-36]. As a consequence, this study aims to develop several effective two-stage hybrid discrimination approaches that integrate statistical methods and computational intelligence methods for tumor classification based on gene expression data.
The remainder of this paper is structured as follows. The second section reviews several existing approaches considered in our comparison study. The third section addresses the proposed hybrid approaches for tumor classification. The fourth section shows classification results from the cross-validation. The final section reports the research findings and concludes this study.

Review of Established Methods
Consider a two-class classification problem. Let x_j = (x_{1j}, x_{2j}, ..., x_{pj}) be the gene expression profile vector, where x_{ij} is the expression level of the i-th gene in the j-th tumor sample, i = 1, 2, ..., p, j = 1, 2, ..., n. Let y_j be a binary disease status variable (1 for the case group G_1 and -1 for the control group G_2, as a general example). Accordingly, the microarray data may be summarized as the set

{(y_j, x_j) : j = 1, 2, ..., n}.

The following sections briefly review several well-known established microarray classification methods.
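As a small illustration of this data layout, the following Python fragment (with made-up expression values, not the real leukemia data) represents the two-class microarray set as label/profile pairs:

```python
# Toy two-class microarray data: each sample j is a pair (y_j, x_j),
# where y_j in {1, -1} is the disease status and x_j holds the p gene
# expression levels. All numbers here are illustrative only.
samples = [
    ( 1, [2.1, 0.3, 1.7]),   # case group (y = 1)
    ( 1, [1.9, 0.1, 1.5]),
    (-1, [0.2, 1.4, 0.6]),   # control group (y = -1)
    (-1, [0.4, 1.6, 0.8]),
]
n = len(samples)             # number of tumor samples
p = len(samples[0][1])       # number of genes per sample
labels = [y for y, _ in samples]
```

In the real leukemia data set discussed later, p is far larger than n, which is exactly the dimensionality problem the paper addresses.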

Logistic Regression.
The microarray discrimination approach using the logistic regression (LR) model has also been studied for disease classification [22, 25, 26].
The structure of the logistic regression model can be briefly described as follows. Let π be the conditional probability of the event {y_j = 1} given a series of independent variables (x_1, x_2, ..., x_p). The logistic regression model is then defined as

logit(π) = ln(π / (1 - π)) = b_0 + b_1 x_1 + b_2 x_2 + ... + b_p x_p.

A collinearity diagnosis procedure should be conducted first to exclude variables exhibiting high collinearity. After the collinearity diagnosis, the remaining variables are used for logistic regression modeling and testing. Afterward, using logistic regression with the Wald-forward method, we can identify the significant independent variables, say x*_1, x*_2, ..., x*_q.

Artificial Neural Network.
The feedforward artificial neural network (ANN) is composed of layers of interconnected neurons. The input of node j is a weighted sum of the outputs of the previous layer,

net_j = Σ_i w_{ij} o_i,

where i is a neuron in the previous layer, w_{ij} is the connection weight from neuron i to neuron j, and o_i is the output of node i. The sigmoid activation functions are given by

o_j = 1 / (1 + exp(-(net_j + θ_j))),

where net_j is the input signal from the external source for a node j in the input layer and θ_j is a bias. The conventional technique used to derive the connection weights of the feedforward network is the generalized delta rule [37].
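The logistic model's probability (and, likewise, the sigmoid activation used in the ANN) can be computed directly; the coefficients below are illustrative, not fitted to any real data:

```python
import math

def logistic_prob(x, beta0, beta):
    """P(y = 1 | x) under the logistic regression model:
    logit(p) = beta0 + sum_i beta_i * x_i."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))   # inverse of the logit link

# Illustrative (made-up) coefficients and input:
# z = -1.0 + 0.5*1.0 + 0.25*2.0 = 0, so the probability is exactly 0.5.
p1 = logistic_prob([1.0, 2.0], beta0=-1.0, beta=[0.5, 0.25])
```

The same `1 / (1 + exp(-z))` shape serves as the sigmoid activation of each ANN node, with `z` replaced by the node's net input plus its bias.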

Support Vector Machine.
To classify tumor classes using microarray data, the discrimination method based on the support vector machine (SVM) has also been discussed [22, 30-33]. The SVM algorithm can be described as follows. Let {(y_j, x_j)}, j = 1, ..., n, x_j ∈ R^p, y_j ∈ {-1, 1}, be the training set of input vectors and labels, where n is the number of sample observations, p is the dimension of each observation, and y_j is the known target. The algorithm seeks the hyperplane w · x + b = 0, where w is the normal vector of the hyperplane and b is a bias term, that separates the data of the two classes with maximal margin width 2/||w||. To obtain the optimal hyperplane, the SVM solves the following optimization problem:

min (1/2)||w||^2  subject to  y_j (w · x_j + b) ≥ 1,  j = 1, ..., n.    (10)

Because it is difficult to solve (10) directly, the SVM transforms the optimization problem into its dual problem by the Lagrange method; the Lagrange multipliers must be nonnegative real coefficients. With slack variables ξ_j, (10) is transformed into the following constrained (soft-margin) form [38]:

min (1/2)||w||^2 + C Σ_j ξ_j  subject to  y_j (w · x_j + b) ≥ 1 - ξ_j,  ξ_j ≥ 0.    (11)

In (11), C is the penalty factor and determines the degree of penalty assigned to an error. Typically, a linear separating hyperplane cannot be found for all application data. For problems that cannot be linearly separated in the input space, the SVM employs the kernel method to transform the original input space into a high-dimensional feature space, where an optimal linear separating hyperplane can be found. Common kernel functions are the linear, polynomial, radial basis function (RBF), and sigmoid kernels. Although several choices for the kernel function are available, the most widely used is the RBF, defined as [39]

K(x_i, x_j) = exp(-γ ||x_i - x_j||^2),

where γ denotes the width of the RBF. Consequently, the RBF kernel and the multiclass SVM method are used in this study [40].
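The RBF kernel is simple to evaluate; a minimal Python sketch (inputs are arbitrary toy vectors):

```python
import math

def rbf_kernel(x1, x2, gamma):
    """RBF kernel K(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)

# Identical inputs give the maximal kernel value of 1.0.
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
# Distant inputs decay toward 0: here ||x1 - x2||^2 = 3^2 + 4^2 = 25.
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)
```

The width γ controls how fast similarity decays with distance, which is why it is tuned jointly with the penalty C in the experiments below.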

Multivariate Adaptive Regression Splines.
Multivariate adaptive regression splines (MARS) have also been applied for tumor classification using gene expression data [22, 30]. The general MARS function can be represented as

f(x) = a_0 + Σ_{m=1}^{M} a_m Π_{k=1}^{K_m} [s_{km} (x_{v(k,m)} - t_{km})]_+,

where a_0 and a_m are the parameters, M is the number of basis functions (BF), K_m is the number of knots, s_{km} takes values of either 1 or -1 and indicates the right or left sense of the associated step function, v(k, m) is the label of the independent variable, and t_{km} is the knot location. The optimal MARS model is chosen in a two-step procedure. First, a large number of basis functions are constructed to fit the data initially. Second, basis functions are deleted in order of least contribution using the generalized cross-validation (GCV) criterion. To measure the importance of a variable, we can observe the decrease in the calculated GCV value when the variable is removed from the model. The GCV is defined as

GCV(M) = (1/n) Σ_{j=1}^{n} [y_j - f̂_M(x_j)]^2 / [1 - C(M)/n]^2,

where n is the number of observations and C(M) is the cost penalty measure of a model containing M basis functions.
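The building block of the MARS expansion is the one-sided hinge [s(x - t)]_+; a minimal Python sketch with an arbitrary knot:

```python
def hinge(x, t, s):
    """MARS basis factor [s * (x - t)]_+ : a one-sided linear spline
    with knot location t and sense s in {1, -1}."""
    return max(0.0, s * (x - t))

# A right-sense and a left-sense hinge at knot t = 2.0:
right = hinge(3.0, t=2.0, s=1)    # max(0, 3 - 2) = 1.0 (active to the right)
left = hinge(3.0, t=2.0, s=-1)    # max(0, -(3 - 2)) = 0.0 (active to the left)
```

Products of such hinges form the basis functions, and the forward/backward selection described above decides which products survive under the GCV criterion.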

The Proposed Hybrid Discrimination Methods
The two-stage hybrid procedure is commonly used in various fields such as financial distress warning systems [41, 42], the medical area [43], statistical inference [44, 45], and statistical process control [36, 46-48]. To obtain the best accuracy for a specific classification problem, hybrid evolutionary computation algorithms are commonly used to optimize the resolution process [34-36]. In this section, several two-stage hybrid discrimination methods that integrate statistical-based approaches and computational intelligence methods are proposed for tumor classification based on gene expression microarray data. The proposed methods involve five components: the FLDA, LR, MARS, ANN, and SVM classifiers. In stage 1, influencing variables are selected using LR or MARS. In stage 2, the selected important influencing variables are taken as the input variables of FLDA, LR, ANN, SVM, or MARS. The following sections address the proposed approaches.
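The two-stage idea can be sketched generically in Python; `select_stage` and `classify_stage` below are hypothetical placeholders standing in for the actual LR/MARS selection and FLDA/ANN/SVM classifiers, not implementations of them:

```python
def two_stage_classify(X, y, select_stage, classify_stage):
    """Generic two-stage hybrid: stage 1 picks informative gene
    indices (e.g., via LR or MARS); stage 2 fits a classifier
    (e.g., FLDA, ANN, or SVM) on the reduced profiles."""
    keep = select_stage(X, y)                       # indices of selected genes
    X_red = [[row[i] for i in keep] for row in X]   # reduced input matrix
    return classify_stage(X_red, y), keep

# Toy stand-ins for illustration only: keep the first two genes,
# and "classify" by the sign of the label sum.
sel = lambda X, y: [0, 1]
clf = lambda X, y: (1 if sum(y) >= 0 else -1)
model, kept = two_stage_classify([[1, 2, 3], [4, 5, 6]], [1, -1], sel, clf)
```

Plugging different selectors and classifiers into this skeleton yields the eight hybrid combinations (LR-FLDA, LR-ANN, ..., MARS-SVM) compared in the experiments.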

The Cross-Validation Experiments
This study performs a series of cross-validation experiments to compare the proposed approaches with those previously discussed in the literature. This study considers a leukemia dataset that was first described by Golub et al. [5] and was examined in Dudoit et al. [10] and Lee et al. [22]. This dataset contains 6817 human genes and was obtained from Affymetrix high-density oligonucleotide microarrays. The data consist of 25 cases of acute myeloid leukemia (AML) and 47 cases of acute lymphoblastic leukemia (ALL).
Since the dimension of the data is very large (p = 6817) but there are only a few observations (n = 72), it is essential to reduce and refine the whole set of genes (independent variables) before we can construct the discrimination model. To refine the set of genes, Golub et al. [5], Dudoit et al. [10], and Lee et al. [22] proposed methods based on subjective ratios to select genes. It is well known that the two-sample t-test is the most popular test for differences in means between two groups. For the sake of strictness, instead of using a somewhat arbitrary criterion like those used in Golub et al. [5], Dudoit et al. [10], or Lee et al. [22], this study applies the two-sample t-test with a significance level of 0.0001 to select the influencing genes. The results are given in Table 1. The significant variables selected using the two-sample t-test then serve as the input variables of the established single-stage discrimination methods reviewed in Section 2 and the proposed two-stage hybrid methods introduced in Section 3. To examine the presence of collinearity, the variance inflation factor (VIF) was calculated. As shown in Table 2, all the VIF values are less than 10. Consequently, there is no high collinearity among these variables. In addition, this study adopts the suggestions of Dudoit et al. [10] and Lee et al. [22] and performs a 2 : 1 cross-validation (training set : test set).
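The per-gene screening statistic is the standard pooled-variance two-sample t statistic; a minimal Python sketch with toy expression values (the actual study thresholds the corresponding p-value at 0.0001, which is not reproduced here):

```python
import math
from statistics import mean, variance

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic comparing one gene's
    expression values in group a versus group b."""
    na, nb = len(a), len(b)
    # Pooled sample variance with na + nb - 2 degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Toy gene: strongly separated groups give a large |t|, so the gene
# would be retained by the screening step.
t_stat = two_sample_t([5.1, 4.9, 5.3], [1.0, 1.2, 0.8])
```

In the real procedure, this statistic is computed for each of the 6817 genes and only genes whose two-sided p-value falls below 0.0001 are kept.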
The difficulty with the ANN is that the design parameters, such as the number of hidden layers and the number of neurons in each layer, have to be set before the training process can proceed. The user has to select the ANN structure and set the values of certain parameters for the ANN modeling process.

> # Find the best parameters gamma & cost
> p <- seq(-1, 1, 1)
> obj <- tune.svm(y ~ ., data = train, sampling = "cross", gamma = 2^(p), cost = 2^(p))

However, there is no general and explicit approach for selecting optimal parameters for ANN models [49]. Accordingly, the selection of design parameters for the ANN may be based on a trial-and-error procedure.
This study employs the highest accurate classification rate (ACR) as the criterion for selecting the ANN topology. The topology is denoted {I-H-O-η}, where the entries stand for the number of neurons in the input layer, the number of neurons in the hidden layer, the number of neurons in the output layer, and the learning rate, respectively. Too few hidden nodes would limit the network's generalization capability, while too many hidden nodes may result in overtraining or memorization by the network. Since there are 11 input nodes and one output node in this study, the numbers of hidden nodes tested were 9, 10, 11, 12, and 13, and the learning rates were chosen as 0.1, 0.01, and 0.001. After performing the ANN modeling, this study found that the {11-9-1-0.01} topology has the best ACR results.
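The trial-and-error topology search amounts to an exhaustive loop over the candidate grid, keeping the setting with the highest ACR. In this Python sketch, `train_and_score` is a toy stand-in for actually training and evaluating an ANN, so the numbers are illustrative only:

```python
def search_topology(hidden_options, lr_options, train_and_score):
    """Exhaustively try each (hidden nodes, learning rate) pair and
    return the (score, hidden, lr) triple with the best score."""
    best = None
    for h in hidden_options:
        for lr in lr_options:
            acr = train_and_score(h, lr)
            if best is None or acr > best[0]:
                best = (acr, h, lr)
    return best

# Toy scoring function that happens to peak at (9, 0.01), mirroring
# the {11-9-1-0.01} topology reported in the text; a real search
# would train the network and measure its ACR instead.
score = lambda h, lr: 1.0 - abs(h - 9) * 0.01 - abs(lr - 0.01)
best = search_topology([9, 10, 11, 12, 13], [0.1, 0.01, 0.001], score)
```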
This study also performed SVM modeling on the microarray dataset. The two parameters, C and γ, are the most important factors affecting the performance of the SVM. The grid search method uses exponentially growing sequences of C and γ to determine good parameters; the parameter set of C and γ that generates the highest ACR is considered the ideal set. Here, the best values for C and γ are 2 and 0.5, respectively. The SVM package was used to run the dataset, and the corresponding output is displayed in Algorithm 1. Observing Algorithm 1, in the case of C = 2 and γ = 0.5, we have ACR = 100% for the initial training stage. In the testing stage, using the same parameter settings (i.e., C = 2 and γ = 0.5), we obtain ACR = 25% and ACR = 93.75% for AML and ALL, respectively. Accordingly, ACR = 70.83% for the full sample. For MARS modeling, the results are displayed in Table 3. During the selection process, four important explanatory variables were chosen; the corresponding relative importance indicators are shown in Table 3. As a consequence, those four important variables serve as the input variables for the hybrid modeling process. In addition, the ACR results for each model are listed in Table 4.
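The exponential grid search over C and γ can be sketched as follows in Python; `cv_score` is a toy stand-in for the cross-validated ACR that a real tuner (such as the `tune.svm` call shown earlier) would compute:

```python
def grid_search(exponents, cv_score):
    """Evaluate every (C, gamma) pair on an exponentially growing
    grid 2^e and return the pair with the highest score."""
    grid = [(2.0 ** e_c, 2.0 ** e_g) for e_c in exponents for e_g in exponents]
    return max(grid, key=lambda cg: cv_score(*cg))

# Toy score that peaks at C = 2, gamma = 0.5, the values reported in
# the text; a real search would run cross-validated SVM training here.
score = lambda C, g: -((C - 2.0) ** 2 + (g - 0.5) ** 2)
C_best, g_best = grid_search([-1, 0, 1], score)
```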
The rationale behind the proposed hybrid discrimination methods is to obtain fewer but more informative variables by performing the first-stage LR or MARS modeling. The selected significant variables then serve as the inputs for the second-stage discrimination approach. In this study, the significant variables selected by LR modeling are x_1, x_2, x_7, and x_8, and those selected by MARS modeling are x_2, x_6, x_7, and x_8. For the hybrid LR-ANN model, the {4-6-1-0.01} topology provided the best ACR results; for the MARS-ANN hybrid model, the {4-6-1-0.01} topology also gave the best ACR results. Additionally, for both LR-SVM and MARS-SVM modeling, the best values for C and γ are the same, namely 2 and 0.5, respectively.
For each of the thirteen different approaches, FLDA, LR, ANN, SVM, MARS, LR-FLDA, LR-ANN, LR-SVM, LR-MARS, MARS-FLDA, MARS-LR, MARS-ANN, and MARS-SVM, this study presents the corresponding ACRs in Table 4. Comparing the ACRs for AML, while LR has the highest ACR (62.50%) among the 5 single-stage methods, both LR-SVM and MARS-LR achieve the highest ACR (75.00%) among the 8 two-stage methods. Apparently, the two-stage methods provide better classification performance. Comparing the ACRs for ALL, the single-stage methods FLDA, ANN, and SVM give the highest ACR (93.75%), and the two-stage methods LR-ANN, LR-MARS, and MARS-ANN achieve the same ACR (93.75%); here the single-stage and two-stage methods perform similarly. As shown in Table 4, among the thirteen methods, the two-stage hybrid model LR-MARS has the highest ACR (83.33%) for the full sample. As a consequence, the proposed two-stage hybrid approaches are more efficient for tumor classification than the established single-stage methods. In addition, Table 5 lists the overall averaged ACRs and the associated standard errors (in parentheses) for the single-stage and two-stage methods. Comparing the single-stage and the proposed two-stage methods in Table 5, one can observe that the proposed methods almost always provide more accurate results than the single-stage methods. Although the single-stage methods have a larger averaged ACR value than the two-stage methods in classifying ALL, the difference is small. Moreover, Table 5 shows that the proposed two-stage approaches have smaller standard errors in all cases, which implies the robustness of the mechanisms. Figure 1 provides a comparison with respect to the overall improvement percentage over the single-stage method. From Figure 1, it can be seen that the two-stage approaches are more robust than the single-stage method.

Conclusions
This study proposes several two-stage hybrid discrimination approaches for tumor classification using microarray data. The proposed approaches integrate several frequently used statistical-based discrimination methods and computational intelligence classification techniques. Based on the cross-validation results in Table 4, it can be observed that the proposed hybrid method LR-MARS is more appropriate for discriminating the tumor classes.
Computational intelligence methodology is useful in many areas of application and can deal with complex and computationally intensive problems. Using several computational intelligence techniques, this study develops two-stage hybrid discrimination approaches for tumor classification. The proposed hybrid models are not the only discrimination methods that can be employed, and further research can build on this work. For example, one can combine other computational intelligence techniques, such as rough set theory [50] or the extreme learning machine, with neural networks or support vector machines to refine the structure further and improve the classification accuracy. Extensions of the proposed two-stage hybrid discrimination method to other statistical techniques or to multistage discrimination procedures are also possible. Such work deserves further research and is a direction for our future study.

Figure 1 :
Figure 1: Improvement of the proposed approach in comparison with the single stage method.

Table 1 :
The influencing genes selected by using the two-sample t-test with a significance level of 0.0001.

Table 3 :
The relative importance of four explanatory variables for MARS modelling.

Table 4 :
ACRs for thirteen approaches using cross-validation.

Table 5 :
Overall averaged ACR and the associated standard error (in parentheses) for single stage and two-stage methods.