Gene-Based Multiclass Cancer Diagnosis with Class-Selective Rejections

Supervised learning of microarray data is receiving much attention in recent years. Multiclass cancer diagnosis, based on selected gene profiles, are used as adjunct of clinical diagnosis. However, supervised diagnosis may hinder patient care, add expense or confound a result. To avoid this misleading, a multiclass cancer diagnosis with class-selective rejection is proposed. It rejects some patients from one, some, or all classes in order to ensure a higher reliability while reducing time and expense costs. Moreover, this classifier takes into account asymmetric penalties dependant on each class and on each wrong or partially correct decision. It is based on ν-1-SVM coupled with its regularization path and minimizes a general loss function defined in the class-selective rejection scheme. The state of art multiclass algorithms can be considered as a particular case of the proposed algorithm where the number of decisions is given by the classes and the loss function is defined by the Bayesian risk. Two experiments are carried out in the Bayesian and the class selective rejection frameworks. Five genes selected datasets are used to assess the performance of the proposed method. Results are discussed and accuracies are compared with those computed by the Naive Bayes, Nearest Neighbor, Linear Perceptron, Multilayer Perceptron, and Support Vector Machines classifiers.


Introduction
Cancer diagnosis, based on gene expression profiling, have improved over the past 40 years. Many microarray technologies studies were developed to analyze the gene expression. These genes are later used to categorize cancer classes. Two different classification approaches can be used: class discovery and class prediction. The first is an unsupervised learning approach that allows to separate samples into clusters based on similarities in gene expression, without prior knowledge of sample identity. The second is a supervised approach which predicts the category of an already defined sample using its gene expression profiles. Since these classification problems are described by a large number of genes and a small number of samples, it is crucial to perform genes selection before the classification step. One way to identify informative genes pointed in [1] is the test statistics.
Researches show that the performance of supervised decisions based on selected gene expression can be comparable to the clinical decisions. However, no classification strategy is absolutely accurate. First, many factors may effectively decrease the predictive power of a multiclass problem. For example, findings of [2] imply that information useful for multiclass tumor classification is encoded in a complex gene expression and cannot be given by a simple one. Second, it is not possible to find an optimal classification method for all kinds of multiclass problems. Thus, supervised diagnosis are always considered as an important adjunct of traditional diagnostics and never like its substitute.
Unfortunately, supervised diagnosis can be misleading. They may hinder patient care (wrong decision on a sick patient), add expense (wrong decision on a healthy patient) or confound the results of cancer categories. To overcome 2 Journal of Biomedicine and Biotechnology these limitations, a multi-SVM [3] classifier with classselective rejection [4][5][6][7] is proposed. Class-selective rejection consists of rejecting some patients from one, some, or all classes in order to ensure a higher reliability while reducing time and expense costs. Moreover, any of the existing multiclass [8][9][10] algorithms have taken into consideration asymmetric penalties on wrong decisions. For example, in a binary cancer problem, a wrong decision on a sick patient must cost more than a wrong decision on a healthy patient. The proposed classifier handles this kind of problems. It minimizes a general loss function that takes into account asymmetric penalties dependant on each class and on each wrong or partially correct decision.
The proposed method divides the multiple class problem into several unary classification problems and train one ν-1-SVM [11][12][13] coupled with its regularization path [14,15] for each class. The winning class or subset of classes is determined using a prediction function that takes into consideration the costs asymmetry. The parameters of all the ν-1-SVMs are optimized jointly in order to minimize a loss function. Taking advantage of the regularization path method, the entire parameters searching space is considered. Since the searching space is widely extended, the selected decision rule is more likely to be the optimal one. The stateof-art multiclass algorithms [8][9][10] can be considered as a particular case of the proposed algorithm where the number of decisions is given by the existing classes and the loss function is defined by the Bayesian risk.
Two experiments are reported in order to assess the performance of the proposed approach. The first one considers the proposed algorithm in the Bayesian framework and uses the selected microarray genes to make results comparable with existing ones. Performances are compared with those assessed using Naive Bayes, Nearest Neighbor, Linear Perceptron, Multilayer Perceptron, and Support Vector Machines classifiers, invoked in [1]. The second one shows the ability of the proposed algorithm solving multiclass cancer diagnosis in the class-selective rejection scheme. It minimizes an asymmetric loss function. Experimental results show that, a cascade of class-selective classifiers with class-selective rejections can be considered as an improved supervised diagnosis rule. This paper is outlined as follows. Section 2 presents a description of the model as a gene selection task. It introduces the multiclass cancer diagnosis problem in the class-selective rejection scheme. It also proposes a supervised training algorithm based on ν-1-SVM coupled with its regularization path. The two experiments are carried out in Section 3, results are reported, compared and discussed. Finally, a conclusion is presented in Section 4.

Models and Methods
This section describes the multiclass cancer diagnosis based on microarray data. Feature selection is evoked as a first process in a gene-based cancer diagnosis. Test statistics are used as a possible way for informative genes identification [1]. Once genes selection is processed, a classification problem should be solved. The multiclass cancer diagnosis problem, formulated in the general framework of classselective rejection, is introduced. A solution based on ν-1-SVM [11][12][13] is proposed. First a brief description of ν-1-SVM and the derivation of its regularization path [14,15] is presented. Second, the proposed algorithm [3] is explained. It allows to determine a multiclass cancer diagnosis that minimizes an asymmetric loss function in the class-selective rejection scheme.

Genes Selection Using Test Statistics.
Gene profiles are successfully applied to supervised cancer diagnosis. Since cancer diagnosis problems are usually described by a small set of samples with a large number of genes, feature or gene selection is an important issue in analyzing multiclass microarray data. Given a microarray data with N tumor classes, n tumor samples and g genes per sample, one should identify a small subset of informative genes that contribute most to the prediction task. Various feature selection methods exist in literature. One way pointed in [1] is to use test statistics for the equality of the class means. Authors of [1] formulate first the expression levels of a given gene by a one-way analysis of variance model. Second, the power of genes in discriminating between tumor types is determined by a test statistic. The discrimination power is the value of the test evaluated at the expression level of the gene. The higher the discrimination power is, the more powerful the gene is in discriminating between tumor types. Thus, genes with higher power of discrimination are considered as informative genes.
Let Y j p be the expression level from the pth sample of the jth class, the following general model is considered: Y j p = μ j + j p for j = 1, . . . , N; p = 1, . . . , n j with N j=1 n j = n. (1) In the model μ j represents the mean expression level of the gene in class w j , j p are independent random variables and E( j p ) = 0, V ( j p ) = σ 2 j < ∞ for j = 1, . . . , N; p = 1, . . . , n j . For the case of homogeneity of variances, the ANOVA F or F test [16] is the optimal one testing the means equality hypothesis. With heterogeneity of variances, the task is challenging. However, it is known that, with a large number of genes present, usually in thousands, no practical test is available to locate the best set of genes. Thus, the authors of [1] studied six different statistics.
(i) ANOVA F test statistic, the definition of this test is where Y j· = nj p=1 Y j p /n j and Y ·· = N j=1 n j Y j· /n, . For simplicity, is used to indicate the sum taken over the index j. Under means equality hypothesis and assuming variance homogeneity, this test has a distribution of F N−1,n−N [16].
Journal of Biomedicine and Biotechnology 3 (ii) Brown-Forsythe test statistic [17], given by Under means equality hypothesis, B is distributed approximately as F N−1,τ where (iii) Welch test statistic [18], defined as with ω j = n j /s 2 j and h j = ω j / ω j . Under means equality hypothesis, W has an approximate distribution of F N−1,τω where (iv) Adjusted Welch test statistic [19]. It is similar to Welch statistic and defined to be where ω * j = n j /(Φ j s 2 j ) with Φ j chosen such that 1 ≤ Φ j ≤ (n j − 1)/(n j − 3) and h * j = ω * j / ω * j . Under means equality hypothesis, W * has an approximate distribution of (v) Cochran test statistic [20]. This test statistic is simply the quantity appearing in the numerator of the Welch test statistic W, that is, Under means equality hypothesis, C has an approximate distribution of χ 2 N−1 . (vi) Kruskal-Wallis test statistic. This is the well-known nonparametric test given by where R j is the rank sum for the jth class. The ranks assigned to Y j p are those obtained from ranking the entire set of Y j p . Assuming each n j ≥ 5, then under means equality hypothesis, H has an approximate distribution of χ 2 N−1 [21].
These tests performances are evaluated and compared over different supervised learning methods applied to publicly available microarray datasets. Experimental results show that the model for gene expression values without assuming equal variances is more appropriate than that assuming equal variances. Besides, under heterogeneity of variances, Brown-Forsythe test statistic, Welch test statistic, adjusted Welch test statistic, and Cochran test statistic, perform much better than ANOVA F test statistic and Kruskal-Wallis test statistic.

Multitumor Classes with Selective Rejection.
Once gene selection is processed, the classification problem should be solved. Let us define this diagnosis problem in the classselective rejection scheme. Assuming that the multiclass cancer problem deals with N tumor classes noted w 1 . . . w N and that any patient or sample x belongs to one tumor class and has d informative genes, a decision rule consists in a partition Z of R d in I sets Z i corresponding to the different decision options. In the simple classification scheme, the options are defined by the N tumor classes. In the classselective rejection scheme, the options are defined by the N tumor classes and the subsets of tumor classes (i.e. assigning patient x to the subset of tumor classes {w 1 , w 3 } means that x is assigned to cancer categories w 1 and w 3 with ambiguity).
The problem consists in finding the decision rule Z * that minimizes a given loss function c(Z) defined by where c i j is the cost of assigning a patient x to the ith decision option when it belongs to the tumor class w j . The values of c i j being relative since the aim is to minimize c(Z), the values can be defined in the interval [0; 1] without loss of generality. P j is the a priori probability of tumor class w j and P(D i /w j ) is the probability that patients of the tumor class w j are assigned to the ith option.

μ-1-SVM.
To solve the multiclass diagnosis problem, an approach based on ν-1-SVM is proposed. Considering a set of m samples of a given tumor classes The decision function f λ X (·) is parameterized by λ = νm (with 0 ≤ ν < 1) to control the number of outliers. It is designed by minimizing the volume of R λ under the constraint that all the samples of X, except the fraction ν of outliers, must lie in R λ . In order to determine R λ , the space of possible functions f λ X (·) is reduced to a Reproducing Kernel Hilbert Space (RKHS) with kernel function K(·, ·). Let Φ : X → H be the mapping defined over the input space X. Let ·, · H be a dot product defined in H . The kernel K(·, ·) over X × X is defined by: 4 Journal of Biomedicine and Biotechnology Without loss of generality, K(·, ·) is supposed normalized such that for any x ∈ X, K(x, x) = 1. Thus, all the mapped vectors Φ(x p ), p = 1, . . . , m are in a subset of a hypersphere with radius one and center O. Provided K(·, ·) is always positive and Φ(X) is a subset of the positive orthant of the hypersphere. A common choice of K(·, ·) is the Gaussian This yields f λ X (·) as the solution of the following convex quadratic optimization problem: where ξ p are the slack variables. This optimization problem is solved by introducing lagrange multipliers α p . As a consequence to Kuhn-Tücker conditions, w λ is given by which results in The dual formulation of (13) is obtained by introducing Lagrange multipliers as min α1,...,αm A geometrical interpretation of the solution in the RKHS is given by Figure 1. f λ X (·) and b λ define a hyperplane W λ orthogonal to f λ X (·). The hyperplane W λ separates the Φ(x p )s from the sphere center, while having b λ / w λ H maximum which is equivalent to minimize the portion S λ of the hypersphere bounded by W λ that contains the set Tuning ν or equivalently λ is a crucial point since it enables to control the margin error. It is obvious that changing λ leads to solve the optimization problem formulated in (16) in order to find the new region R λ . To obtain great computational savings and extend the search space of λ, we proposed to use ν-1-SVM regularization path [14,15]. Regularization path was first introduced by Hastie et al. [14] for a binary SVM. Later, Rakotomamojy and Davy [15] developed the entire regularization path for a ν-1-SVM. The basic idea of the ν-1-SVM regularization path is that the parameter vector of a ν-1-SVM is a piecewise linear function of λ. Thus the principle of the method is to start with large λ, (i.e., λ = m − ) and decrease it towards zero, keeping track of breaks that occur as λ varies.
As λ decreases, w λ H increases and hence the distance between the sphere center and W λ decreases. Samples move from being outside (non-margin SVs with α λ p = 1 in Figure 1) to inside the portion S λ (non-SVs with α λ p = 0). By continuity, patients must linger on the hyperplane W λ (margin SVs with 0 < α λ p < 1) while their α λ p s decrease from 1 to 0. α λ p s are piecewise-linear in λ. Break points occur when a point moves from a position to another one. Since α λ p is piecewise-linear in λ, f λ (·) and b λ are also piecewiselinear in λ. Thus, after initializing the regularization path (computing α λ p by solving (16) for λ = m − ), almost all the α λ p s are computed by solving linear systems. Only for some few integer values of λ smaller than m, α λ p s are computed by solving (16) according to [15].
Using simple linear interpolation, this algorithm enables to determine very rapidly the ν-1-SVM corresponding to any value of λ.

Multiclass SVM Based on μ-1-SVM.
Given N classes and N trained ν-1-SVMs, one should design a supervised decision rule Z, moving from unary to multiclass classifier by assigning samples to a decision option. To determine the decision rule, first a prediction function should decide the winning option. A distance measure between x and the training class set w j , using the ν-1-SVM parameterized by λ j , is defined as follows: where θ λj is the angle delimited by w λj and the support vector as shown in Figure 1. cos(θ λj ) is a normalizing factor which is used to make all the d λ j (x) comparable. Using Φ(x) = 1 in (17) leads to the following: Since the α λj p are obtained by the regularization path for any value of λ j , computing d λj is considered as an easyfast task. The distance measure d λj (x) is inspired from [22]. When data are distributed in a unimodal form, the d λj (x) is a decreasing function with respect to the distance between a sample x and the data mean. The probability density function is also a decreasing function with respect to the distance from the mean. Thus, d λj (x) preserves distribution order relations. In such case, and under optimality of the ν-1-SVM classifier, the use of d λj (x) should reach the same performances as the one obtained using the distribution. Figure 1: Training data mapped into the feature space on a portion S λ of a hypersphere.

Non-margin SV
In the simplest case of multiclass problems where the loss function is defined as the error probability, a patient x is assigned to the tumor class maximizing d λj (x).
To extend the multiclass prediction process to the classselective scheme, a weighted form of the distance measure is proposed. The weight β j is associated to d λj . β j reflects an adjusted value of the distance d λj according to the penalty associated with the tumor class w j . Thus, introducing weights leads to treat differently each tumor class and helps solving problems with different costs c i j on the classification decisions.
Finally, in the general case where the loss function is considered in the class-selective rejection scheme, the prediction process can be defined as follows: a blinded sample x is assigned to the ith option if and only if Thus, in contrast to previous multiclass SVMs, which construct the maximum margin between classes and locate the decision hyperplane in the middle of the margin, the proposed approach resembles more to the robust Bayesian classifier. The distribution of each tumor class is considered and the optimal decision is slightly deviated toward the class with the smaller variance.
The proposed decision rule depends on σ, ν and β vectors of σ j , ν j and β j for j = 1, . . . , N. Tuning ν j is the most time expensive task since changing ν j leads to solve the optimization problem formulated in (16). Moreover, tuning ν j is a crucial point, it enables to control the margin error. In fact, it was shown in [11] that this regularization parameter is an upper bound on the fraction of outliers and a lower bound on the fraction of the SVs. In [9,23] a smooth grid search was supplied in order to choose the optimal values of ν. The N values ν j s were chosen equal to reduce the computational costs. However, this assumption reduces the search space of parameters too. To avoid this restriction, the proposed approach optimizes all the ν j with j = 1, . . . , N corresponding to the Nν-1-SVMs using regularization path and consequently explores the entire parameters space. Thus the tuned ν j are most likely to be the optimal ones. The parameter σ are set equals σ 1 = σ 2 = · · · = σ N .
The optimal vector of σ j , λ j and β j , j = 1, . . . , N, is the one which minimizes an estimator of c(Z) using a validation set. Since the problem is described by a sample set, an estimator c(Z) of c(Z) given by (11) is used: where P j and P(D i /w j ) are the empirical estimators of P j and P(D i /w j ), respectively. The optimal rule is obtained by tuning λ j , β j and σ j so that the estimated loss c(Z) computed on a validation set is minimum. This is accomplished by employing a global search for λ j and β j and an iterative search over the kernel parameter. For each given value σ of the parameter kernels, ν-1-SVMs are trained using the regularization path method on a training set. Then the minimization of c(Z) over a validation set is sought by solving an alternate optimization problem over λ j and β j which is easy since all ν-1-SVM solutions are easily interpolated from the regularization path. σ is chosen from a previously defined set of real numbers [σ 0 , . . . , σ s ] with s ∈ ℵ. Algorithm 1 elucidates the proposed approach.

Experimental Results
In this section, two experiments are reported in order to assess the performance of the proposed approach. First, the cancer diagnosis problem is considered in the traditional Bayesian framework. Five gene expression datasets and five supervised algorithms are considered. Each gene dataset was selected using the six test statistics of [1]. The decisions are given by the possible set of tumor classes and the loss function is defined as the probability of error to make results comparable with those of [1]. Second, in order to show the advantages of considering the multiclass cancer diagnosis in class-selective rejection scheme, one gene dataset is considered and studied with an asymmetric loss function. A cascade of classifiers with rejection options is used to ensure a reliable diagnosis. For both experiments, the loss function was minimized by determining the optimal parameters β j and λ j for j = 1, . . . , N for a given kernel parameter σ and by testing different values of σ in the set [2 −3 , 2 −2 , 2 −1 , 2 0 , 2 1 , 2 2 ]. Finally, the decision rule which minimizes the loss function is selected.
The cancer diagnosis problem was considered in the traditional Bayesian framework. Decisions were given by the set of possible classes and loss function was defined by the error risk. This means that in (20) c i j are defined according 1 θ := ∅ 2 C := ∅ 3 for σ ← σ 0 to σ s do 4 / * Using the Training Set * / 5 for j ← 1 to N do 6 T r a i nν-1-SVM on w j , namely solving the QP (16)  7 Derive the regularization path for w j , namely compute the α λj s 8 end 9 / * Using the Validation Set * / 10 λ := λ 0 11 β := β 0 12 repeat Assign x to a decision ψ i according to (19)    .
to the Table 2. The performance of the proposed method was measured by evaluating its accuracy rate and it was compared to results obtained by the five predictors evoked in [1]: Naive Bayes, Nearest Neighbor, Linear Perceptron, Multilayer Perceptron Neural Network with five nodes in the middle layer, and Support Vector Machines with secondorder polynomial kernel.
To compute the generalization accuracy of the proposed classifier, Leave One Out (LOO) resampling method is used to divide a gene dataset of n patients into two sets, a set of n − 1 patients and a test set of 1 blinded patient. This method involves n separate runs. For each run, the first set of n − 1 samples is divided using 5 Cross-validation (5-CV) into a training set and a validation set. Nν-1-SVMs are trained using the training set for all values of ν j . The decision is obtained by tuning the parameters β j , λ j and σ j for j = 1, . . . , N so that the loss function computed on the validation set is minimum. Optimal parameters are then used to build the decision rule using the whole n − 1 samples. The blinded test set is classified according to this rule. The overall prediction error is the sum of the patients misclassified on all n runs. Table 3 reports errors of the proposed algorithm, the average value and the median value of the 5 classifiers prediction errors reported in [1] when 50 informative genes are used. Table 4 reports values when 100 informative genes are used. F, B, W, W * , C, and H represent the six test statistics.
Experimental results show that, for ovarian, NCI, lung cancer and lymphoma multiclass genes problems, the proposed approach achieves competitive performances compared to the 5 classifiers reported in [1]. For these datasets, prediction errors of the proposed approach are less than the mean and median values of the 5 classifiers prediction errors reported in [1]. However, for leukemia72, the proposed Table 3: Prediction errors of the proposed classifier, mean and median values of the 5 classifiers prediction errors according to [1] with 50 informative selected genes.          Table 8: Confusion matrix of the 50 W * lung cancer problem with class-selective rejection using cost matrix defined in Table 7. Total of misclassified is equal to 10, total of partially and totally rejected samples is equal to 8. algorithm performances are almost in the same range of those provided by the 5 classifiers reported in [1]. The proposed approach prediction error is equal, or in the worst case, slightly higher than the mean and median errors. Moreover, we can note that focussing on the test statistics comparison, experimental results confirm those of [1]. B, W and W * can be the most performing tests under variances heterogeneity assumptions.

Class-Selective Rejection Framework.
In the following, we present the study of lung cancer problem in the classselective rejection scheme. Lung cancer diagnosis problem is determined by the gene expression profiles of 67 lung tumors and 6 normal lung specimens from patients whose clinical course was followed for up to 5 years. The tumors comprised 41 Adenocarcinomas (ACs), 16 squamous cell carcinomas (SCCs); 5 cell lung cancers (LCLCs) and 5 small cell lung cancers (SCLCs). ACs are subdivided into three subgroups 21 AC of group 1 tumors, 7 AC of group 2 tumors and 13 AC of group 3 tumors. Thus, the multiclass diagnosis cancer consists of 7 classes.
Authors in [28] observed that AC of group 3 tumors shared strong expression of genes with LCLC and SCC tumors. Thus, poorly differentiated AC is difficult to distinguish from LCLC or SCC. Confusion matrices (Tables 5  and 6) computed in the Bayesian framework, with 50 W * and 50 H prove well these claims. It can be noticed that 8 of the 16 misclassified 50 W * patients and 8 of the 15 misclassified 50 H patients correspond to confusion between these three subcategories. Therefore, one may define a new decision option as a subset of these three classes to reduce error.
Moreover, same researches affirm that distinction between patients with nonsmall cell lung tumors (SCC, AC and LCLC) and those with small cell tumors or SCLC is extremely important, since they are treated very differently. Thus, a confusion or wrong decision among patients of nonsmall cell lung tumors should cost less than a confusion between nonsmall and small lung cells tumors. Besides, one may provide an extra decision option that includes all the subcategories of tumors to avoid this kind of confusion. Finally, another natural decision option can be the set of all  classes, which means that the classifier has totally withhold taking a solution. Given all these information, the loss function can be empirically defined according to the asymmetric cost matrix given in Table 7. Solving 50 W * lung cancer problem in this scheme leads to the confusion matrix presented in Table 8. As a comparison with Table 5, one may mainly note that the number of misclassified patients decreases from 16 to 10 and 8 withhold decisions or rejected patients. This partial rejection contributes to avoid confusion between nonsmall and small lung cells tumors and reduces errors due to indistinctness among LCLC, SCC and AC3. Besides, according to the example under study, no patient is totally rejected. It is an expected result since initially (Table 5) there was no confusion between normal and tumor samples.
To take a decision concerning the rejected patients, we may refer to clinical analysis. It is worth to note that for partially rejected patients, clinical analysis will be less expensive in terms of time and money than those on completely blinded patients. Moreover, a supervised solution can be also proposed. It aims to use genes selected from another test statistic in order to assign rejected patients to one of the possible classes. According to Tables 3 and 4, prediction errors computed on same patients using genes selected by different test statistics may decrease since errors of two different test statistics do not occur on the same patients. Thus, we chose 50 H lung cancer dataset to reclassify the 8 rejected patients of Table 8. Five of them were correctly classified while three remained misclassified. Results are reported in Table 9. The number of misclassified patients decreases to 13 which is less than all the prediction errors obtained with 50 informative genes (lung cancer problem prediction errors of Table 3). In fact, many factors play an important role in the cascade classifiers system such as the asymmetric costs matrix which has been chosen empirically, the choice of test statistics, the number of classifiers in a cascade system,. . . . Such concerns are under study.

Conclusion
Cancer diagnosis using genes involve a gene selection task and a supervised classification procedure. This paper tackles the classification step. It considers the problem of gene-based multiclass cancer diagnosis in the general framework of class-selective rejection. It gives a general formulation of the problem and proposes a possible solution based on ν-1-SVM coupled with its regularization path. The proposed classifier minimizes any asymmetric loss function. Experimental results show that, in the particular case where decisions are given by the possible classes and the loss function is set equal to the error rate, the proposed algorithm, compared with the state of art multiclass algorithms, can be considered as a competitive one. In the class-selective rejection, the proposed classifier ensures higher reliability and reduces time and expense costs by introducing partial and total rejection. Furthermore, results prove that a cascade of classifiers with class-selective rejections can be considered as a good way to get improved supervised diagnosis. To get the most reliable diagnosis, the confusion matrix defining the loss function should be carefully chosen. Finding the optimal loss function according to performance constraints is an promising approach [30] which is actually under investigation.