Combining Dissimilarities in a Hyper Reproducing Kernel Hilbert Space for Complex Human Cancer Prediction

DNA microarrays provide rich profiles that are used in cancer prediction by considering the gene expression levels across a collection of related samples. Support Vector Machines (SVM) have been applied to the classification of cancer samples with encouraging results. However, they rely on Euclidean distances that fail to reflect accurately the proximities among sample profiles. Non-Euclidean dissimilarities therefore provide additional information that should be considered to reduce the misclassification errors. In this paper, we incorporate a linear combination of non-Euclidean dissimilarities into the ν-SVM algorithm. The weights of the combination are learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) using a Semidefinite Programming algorithm. This approach allows us to incorporate a smoothing term that penalizes the complexity of the family of distances and avoids overfitting. The experimental results suggest that the proposed method helps to reduce the misclassification errors in several human cancer problems.


Introduction
DNA microarray technology provides a way to monitor the expression levels of thousands of genes simultaneously across a collection of related samples. This technology has been applied particularly to the prediction of different types of human cancer with encouraging results [1]. Support Vector Machines (SVM) [2] are powerful machine learning techniques that have been applied to the classification of cancer samples [3]. However, the categorization of different cancer types remains a difficult problem for classical SVM algorithms. In particular, the SVM is based on Euclidean distances that fail to reflect accurately the proximities among the sample profiles [4]. Classifiers based on different non-Euclidean dissimilarities frequently misclassify different subsets of patterns because each dissimilarity reflects complementary features of the data. Therefore, they should be integrated in order to reduce the fraction of patterns misclassified by the base dissimilarities.
In this paper, we introduce a framework to learn a linear combination of non-Euclidean dissimilarities that better reflects the proximities among the sample profiles. Each dissimilarity is embedded in a feature space using the Empirical Kernel Map [5,6]. After that, learning the dissimilarity is equivalent to optimizing the weights of the linear combination of kernels. Several approaches have been proposed to this aim. In [7,8] the kernel is learnt by optimizing an error function that maximizes the alignment between the input kernel and an idealized kernel. However, this error function is not related to the misclassification error and is prone to overfitting. To avoid this problem, [9] learns the kernel by optimizing an error function derived from Statistical Learning Theory. This approach includes a term to penalize the complexity of the family of kernels considered. However, this algorithm is not able to incorporate infinite families of kernels and does not overcome the overfitting of the data.
In this paper, the combination of distances is learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) following the approach of hyperkernels proposed in [10]. This formalism exhibits a strong theoretical foundation and is less sensitive to overfitting. Moreover, it allows us to work with infinite families of distances. The algorithm has been applied to the prediction of different kinds of human cancer. The experimental results suggest that the combination of dissimilarities in a Hyper Reproducing Kernel Hilbert Space improves the accuracy of classifiers based on a single distance, particularly for nonlinear problems. Besides, our approach outperforms the Lanckriet formalism, especially for multicategory problems, and is more robust to overfitting. This paper is organized as follows. Section 2 introduces the proposed algorithm, the material, and the methods employed. Section 3 illustrates the performance of the algorithm in the challenging problem of gene expression data analysis. Finally, Section 4 draws conclusions and outlines future research trends.

Distances for Gene Expression Data Analysis.
An important step in the design of a classifier is the choice of a proper dissimilarity that reflects the proximities among the objects. However, the choice of a good dissimilarity is not an easy task. Each measure reflects different features of the data, and the classifiers induced by the dissimilarities frequently misclassify a different set of patterns. In this section, we briefly comment on the main differences among several dissimilarities proposed to evaluate the proximity between biological samples considering their gene expression profiles. For a deeper description and definitions see [11].
Let $x = [x_1, \ldots, x_d]$ be the vectorial representation of a sample, where $x_i$ is the expression level of gene $i$. The Euclidean distance evaluates whether the gene expression levels differ significantly across different samples:
$$d_{\mathrm{euc}}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}. \tag{1}$$
An interesting alternative is the cosine dissimilarity. This measure becomes small when the ratio between the gene expression levels is similar for the two samples considered. It differs significantly from the Euclidean distance when the data is not normalized by the $L_2$ norm:
$$d_{\mathrm{cos}}(x, y) = 1 - \frac{x^{T} y}{\|x\|_2 \, \|y\|_2}. \tag{2}$$
The correlation measure evaluates whether the expression levels of genes change similarly in both samples. Correlation-based measures tend to group together samples whose expression levels are linearly related. The correlation differs significantly from the cosine if the means of the sample profiles are not zero. This measure is more sensitive to outliers:
$$d_{\mathrm{cor}}(x, y) = 1 - \frac{\sum_{i=1}^{d} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{j=1}^{d} (x_j - \bar{x})^2 \, \sum_{j=1}^{d} (y_j - \bar{y})^2}}, \tag{3}$$
where $\bar{x}$ and $\bar{y}$ are the means of the gene expression profiles.
The Spearman rank dissimilarity is less sensitive to outliers because it computes a correlation between the ranks of the gene expression levels:
$$d_{\mathrm{spear}}(x, y) = 1 - \frac{\sum_{i=1}^{d} (x'_i - \bar{x}')(y'_i - \bar{y}')}{\sqrt{\sum_{j=1}^{d} (x'_j - \bar{x}')^2 \, \sum_{j=1}^{d} (y'_j - \bar{y}')^2}}, \tag{4}$$
where $x'_i = \mathrm{rank}(x_i)$, $y'_j = \mathrm{rank}(y_j)$, and $\bar{x}'$, $\bar{y}'$ are the mean ranks. An alternative measure that helps to overcome the problem of outliers is the Kendall-τ index, which is related to the Mutual Information probabilistic measure [11]:
$$d_{\mathrm{kend}}(x, y) = 1 - \frac{\sum_{i=1}^{d} \sum_{j=1}^{d} C_{x_{ij}} C_{y_{ij}}}{d(d-1)}, \tag{5}$$
where $C_{x_{ij}} = \mathrm{sign}(x_i - x_j)$ and $C_{y_{ij}} = \mathrm{sign}(y_i - y_j)$.
Finally, the dissimilarities have been transformed using the inverse multiquadratic kernel because this transformation helps to discover certain properties of the underlying structure of the data [12,13]. The inverse multiquadratic transformation is based on the inverse multiquadratic kernel, defined as follows:
$$k(x, y) = \frac{1}{\sqrt{\|x - y\|^2 + c^2}}, \tag{6}$$
where $c$ is a smoothing parameter. Considering that $\|x - y\|$ is the Euclidean distance, (6) can be rewritten in terms of a dissimilarity $d(x, y)$ as follows:
$$h(x, y) = \frac{1}{\sqrt{d(x, y)^2 + c^2}}. \tag{7}$$
The above nonlinear transformation gives more weight to small dissimilarities, particularly when $c$ becomes small.
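As an illustration of the measures discussed in this section, the following minimal sketch computes each dissimilarity for two expression profiles in plain Python. It assumes the common "one minus similarity" form for the cosine, correlation, Spearman, and Kendall-τ measures, and it breaks rank ties by position; both choices are assumptions made for the example, and all function names are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_dissimilarity(x, y):
    """1 - cosine similarity (assumed form)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def correlation_dissimilarity(x, y):
    """1 - Pearson correlation: the cosine form applied to centered profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine_dissimilarity([a - mean_x for a in x],
                                [b - mean_y for b in y])

def ranks(x):
    """Ranks 1..d; ties are broken by position (a simplification)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0.0] * len(x)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman_dissimilarity(x, y):
    """Correlation dissimilarity computed on the ranks."""
    return correlation_dissimilarity(ranks(x), ranks(y))

def kendall_dissimilarity(x, y):
    """1 - Kendall tau built from the signs of pairwise differences."""
    d = len(x)
    sign = lambda v: (v > 0) - (v < 0)
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j])
                for i in range(d) for j in range(d))
    return 1.0 - total / (d * (d - 1))

def inverse_multiquadratic(dissimilarity, c=1.0):
    """Inverse multiquadratic transformation of a dissimilarity value."""
    return 1.0 / math.sqrt(dissimilarity ** 2 + c ** 2)
```

Two profiles that differ only by a scale factor are far apart in the Euclidean sense but have zero cosine, correlation, Spearman, and Kendall-τ dissimilarity, which illustrates why the measures carry complementary information.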

ν-Support Vector Machines.
Support Vector Machines [2] are powerful classifiers that are able to deal with high dimensional and noisy data while keeping a high generalization ability. They have been widely applied in cancer classification using gene expression profiles [1,14]. In this paper, we will focus on the ν-Support Vector Machine (ν-SVM). The ν-SVM is a reparametrization of the classical C-SVM [2] that allows the regularization parameter to be interpreted in terms of the number of support vectors and margin errors. This property helps to control the complexity of the approximating functions in an intuitive way. This feature is desirable for the application we are dealing with because the sample size is frequently small and the resulting classifiers are prone to overfitting.
Let $\{(x_i, y_i)\}_{i=1}^{m}$ be the training set codified in $\mathbb{R}^d$. We assume that each $x_i$ belongs to one of the two classes labeled by $y_i \in \{-1, 1\}$. The SVM algorithm looks for the linear hyperplane $f(x; w) = w^{T} x + b$ that maximizes the margin $\gamma = 2/\|w\|^2$. $\gamma$ determines the generalization ability of the SVM. The slack variables $\xi_i$ allow for classification errors and are defined as
$$\xi_i = \max\{0, \, 1 - y_i f(x_i; w)\}.$$
For the ν-SVM, the hyperplane that minimizes the prediction error is obtained by solving the following optimization problem [2]:
$$\min_{w, b, \xi, \rho} \; \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge \rho - \xi_i, \;\; \xi_i \ge 0, \;\; \rho \ge 0, \tag{8}$$
where the variable $\rho$ replaces the unit margin used in the slack definition above, and $\nu$ is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. Therefore, this parameter controls the complexity of the approximating functions. The optimization problem can be solved efficiently in the dual space, and the discriminant function can be expressed exclusively in terms of scalar products:
$$f(x) = \sum_{i=1}^{m} \alpha_i y_i \langle x, x_i \rangle + b, \tag{9}$$
where $\alpha_i$ are the Lagrange multipliers in the dual optimization problem. The ν-SVM algorithm can be easily extended to the nonlinear case by substituting the scalar products with a Mercer kernel [2]. Besides, non-Euclidean dissimilarities can be incorporated into the ν-SVM via the kernel of dissimilarities [5].
Finally, several approaches have been proposed in the literature to extend the SVM to deal with multiple classes. In this paper, we have followed the one-against-one (OVO) strategy. Let $k$ be the number of classes; in this approach $k(k-1)/2$ binary classifiers are trained and the appropriate class is found by a voting scheme. This strategy compares favorably with more sophisticated methods and is computationally more efficient than the one-against-rest (OVR) approach [15].
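The one-against-one voting scheme described above can be sketched as follows. This is only an illustration: the binary classifiers are hypothetical nearest-center rules on one-dimensional inputs standing in for trained ν-SVMs, and ties in the vote are broken by class order, which is an assumption of the example.

```python
from collections import Counter
from itertools import combinations

def train_pairwise(classes, train_binary):
    """Build one binary classifier per unordered pair: k(k-1)/2 in total."""
    return {(a, b): train_binary(a, b) for a, b in combinations(classes, 2)}

def ovo_predict(x, classes, classifiers):
    """Each pairwise classifier votes for one class; the most voted class wins."""
    votes = Counter(classify(x) for classify in classifiers.values())
    # Ties are resolved in favor of the class listed first (assumption).
    return max(classes, key=lambda c: votes[c])

# Hypothetical stand-in for a trained binary nu-SVM: pick the nearer center.
CENTERS = {"A": 0.0, "B": 5.0, "C": 10.0}

def nearest_center_classifier(a, b):
    return lambda x: a if abs(x - CENTERS[a]) <= abs(x - CENTERS[b]) else b

classes = sorted(CENTERS)
classifiers = train_pairwise(classes, nearest_center_classifier)
prediction = ovo_predict(4.2, classes, classifiers)
```

With three classes, three pairwise classifiers are built, and each test sample receives exactly three votes.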

Empirical Kernel Map.
The Empirical Kernel Map allows us to incorporate non-Euclidean dissimilarities into the SVM algorithm using the kernel trick [5,13].
Let $d : X \times X \to \mathbb{R}$ be a dissimilarity and $R = \{p_1, \ldots, p_n\}$ a subset of representatives drawn from the training set. Define the mapping $\phi : X \to \mathbb{R}^n$ as
$$\phi(z) = D(z, R) = [d(z, p_1), d(z, p_2), \ldots, d(z, p_n)]. \tag{10}$$
This mapping defines a dissimilarity space where feature $i$ is given by $d(\cdot, p_i)$.
The set of representatives $R$ determines the dimensionality of the feature space. The choice of $R$ is equivalent to selecting a subset of features in the dissimilarity space. Due to the small number of samples in our application, we have considered the whole training set as representatives. Notice that it has been suggested in the literature [13] that, for small samples, reducing the set of representatives does not help to improve the classifier performance.
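To make the mapping concrete, the sketch below embeds samples in the dissimilarity space and builds a Gram matrix from the embedded vectors. Taking the kernel as the plain inner product of the mapped vectors is an assumption of this example, as are the function names and the Manhattan dissimilarity used for the toy data.

```python
def empirical_kernel_map(z, representatives, dissimilarity):
    """Map a sample to its vector of dissimilarities to the representatives."""
    return [dissimilarity(z, p) for p in representatives]

def dissimilarity_gram_matrix(samples, representatives, dissimilarity):
    """Gram matrix of inner products in the dissimilarity space (assumed kernel)."""
    features = [empirical_kernel_map(z, representatives, dissimilarity)
                for z in samples]
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]

def manhattan(x, y):
    """Toy non-Euclidean dissimilarity used only for this illustration."""
    return sum(abs(a - b) for a, b in zip(x, y))

samples = [[0, 0], [3, 4]]
# As in the text, the whole training set acts as the set of representatives.
gram = dissimilarity_gram_matrix(samples, samples, manhattan)
```

Because the Gram matrix is a matrix of inner products, it is positive semidefinite by construction, whatever dissimilarity is plugged in.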

Learning a Linear Combination of Dissimilarities in an HRKHS.
In order to learn a linear combination of non-Euclidean dissimilarities, we follow the approach of hyperkernels developed by [10]. To this aim, each distance is embedded in an RKHS via the Empirical Kernel Map presented in Section 2.3. Next, a regularized quality functional is introduced that incorporates an $l_2$-penalty over the complexity of the family of distances considered. The solution to this regularized quality functional is searched in a Hyper Reproducing Kernel Hilbert Space. This allows the quality functional to be minimized using an SDP approach.
Let $X_{\mathrm{train}} = \{x_1, x_2, \ldots, x_m\}$ and $Y_{\mathrm{train}} = \{y_1, y_2, \ldots, y_m\}$ be a finite sample of training patterns, where $y_i \in \{-1, +1\}$. Let $K$ be a family of positive semidefinite kernels. Our goal is to learn a kernel of dissimilarities $k \in K$ that represents the combination of dissimilarities and minimizes the following empirical quality functional:
$$Q_{\mathrm{emp}}(f, X, Y) = \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, f(x_i)) + \frac{\lambda}{2}\|f\|^2_{H}, \tag{11}$$
where $l$ is a loss function, $\|\cdot\|_{H}$ is the $L_2$ norm defined in a reproducing kernel Hilbert space, and $\lambda$ is a regularization parameter that controls the balance between the training error and the generalization ability. By virtue of the representer theorem [2], we know that the minimizer of (11) can be written as a kernel expansion:
$$f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x). \tag{12}$$
However, if the family of kernels $K$ is complex enough, it is possible to find a kernel that achieves zero error by overfitting the data. To avoid this problem, we introduce a term that penalizes the kernel complexity in an HRKHS (a rigorous definition of the HRKHS is provided in the appendix):
$$Q_{\mathrm{reg}}(k, X, Y) = Q_{\mathrm{emp}}(f, X, Y) + \frac{\lambda_Q}{2}\|k\|^2_{\underline{H}}, \tag{13}$$
where $\|\cdot\|_{\underline{H}}$ is the $L_2$ norm defined in the Hyper Reproducing Kernel Hilbert Space generated by the hyperkernel $\underline{k}$, and $\lambda_Q$ is a regularization parameter that controls the complexity of the resulting kernel.
The following theorem allows us to write the solution to the minimization of this regularized quality functional as a linear combination of hyperkernels in an HRKHS.

Theorem 1 (Representer theorem for Hyper-RKHS [10]).
Let $X$, $Y$ be the combined training and test set; then each minimizer $k \in \underline{H}$ of the regularized quality functional $Q_{\mathrm{reg}}(k, X, Y)$ admits a representation of the form
$$k(x, x') = \sum_{i,j=1}^{m} \beta_{ij} \, \underline{k}((x_i, x_j), (x, x')) \tag{14}$$
for all $x, x' \in X$, where $\beta_{ij} \in \mathbb{R}$ for each $1 \le i, j \le m$.
However, we are only interested in solutions that give rise to positive semidefinite kernels. The following condition over the hyperkernels [10] allows us to guarantee that the solution is a positive semidefinite kernel.
Property 1. Given a hyperkernel $\underline{k}$ with elements such that for any fixed $\underline{x} \in \underline{X}$, the function $k(x_p, x_q) = \underline{k}(\underline{x}, (x_p, x_q))$, with $x_p, x_q \in X$, is a positive semidefinite kernel, and $\beta_{ij} \ge 0$ for all $i, j = 1, \ldots, m$, then the kernel
$$k(x_p, x_q) = \sum_{i,j=1}^{m} \beta_{ij} \, \underline{k}((x_i, x_j), (x_p, x_q)) \tag{15}$$
is positive semidefinite.
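Now, we address the problem of combining a finite set of dissimilarities. As we mentioned in Section 2.3, each dissimilarity can be represented by a kernel using the Empirical Kernel Map. Next, the hyperkernel is defined as
$$\underline{k}(\underline{x}, \underline{x}') = \sum_{i=1}^{M} c_i \, k_i(\underline{x}) \, k_i(\underline{x}'), \tag{16}$$
where $\underline{x}$ and $\underline{x}'$ are pairs of samples, $M$ is the number of dissimilarities, each $k_i$ is a positive semidefinite kernel of dissimilarities, and $c_i$ is a constant $\ge 0$. Now, we show that $\underline{k}$ is a valid hyperkernel. First, $\underline{k}$ is a kernel because it can be written as a dot product
$$\underline{k}(\underline{x}, \underline{x}') = \langle \Phi(\underline{x}), \Phi(\underline{x}') \rangle, \qquad \Phi(\underline{x}) = \left(\sqrt{c_1}\, k_1(\underline{x}), \ldots, \sqrt{c_M}\, k_M(\underline{x})\right). \tag{17}$$
Next, the resulting kernel (15) is positive semidefinite because for all $\underline{x}$, $\underline{k}(\underline{x}, (x_p, x_q))$ is a positive semidefinite kernel and $\beta_{ij}$ can be constrained to be $\ge 0$. Besides, the linear combination of kernels is a kernel and therefore is positive semidefinite. Notice that $\underline{k}(\underline{x}, (x_p, x_q))$ is positive semidefinite if $c_i \ge 0$ and the $k_i$ are pointwise positive for training data. Both RBF and multiquadratic kernels verify this condition.
Finally, we show that the resulting kernel is a linear combination of the original $k_i$. Substituting the expression of the hyperkernel (16) in (15), the kernel is written as
$$k(x_p, x_q) = \sum_{i,j=1}^{m} \beta_{ij} \sum_{l=1}^{M} c_l \, k_l(x_i, x_j) \, k_l(x_p, x_q). \tag{18}$$
Now the kernel can be written as a linear combination of base kernels:
$$k(x_p, x_q) = \sum_{l=1}^{M} \left[ c_l \sum_{i,j=1}^{m} \beta_{ij} \, k_l(x_i, x_j) \right] k_l(x_p, x_q). \tag{19}$$
Therefore, the above kernel introduces into the ν-SVM a linear combination of base dissimilarities represented by $k_l$ with coefficients $\gamma_l = c_l \sum_{i,j=1}^{m} \beta_{ij} \, k_l(x_i, x_j)$.
The previous approach can be extended to an infinite family of distances. In this case, the space that generates the kernel is infinite dimensional. Therefore, in order to work in this space, it is necessary to define a hyperkernel and to optimize it using an HRKHS. Let $k$ be a kernel of dissimilarities. The hyperkernel is defined as follows [10]:
$$\underline{k}(\underline{x}, \underline{x}') = \sum_{i=0}^{\infty} c_i \left( k(\underline{x}) \, k(\underline{x}') \right)^{i}, \tag{20}$$
where $c_i \ge 0$ and $i = 0, \ldots, \infty$. In this case, the nonlinear transformation to feature space is infinite dimensional. In particular, we are considering all powers of the original kernels, which is equivalent to transforming the original dissimilarities nonlinearly:
$$\Phi(\underline{x}) = \left( \sqrt{c_0}, \sqrt{c_1}\, k(\underline{x}), \sqrt{c_2}\, k(\underline{x})^2, \ldots, \sqrt{c_n}\, k(\underline{x})^n \right), \tag{21}$$
where the dimensionality of the space, determined by $n$, is infinite in this case.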
As we mentioned in Section 2.1, nonlinear transformations of a given dissimilarity provide additional information that may help to improve the classifier performance.
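The way the coefficients $\gamma_l = c_l \sum_{i,j} \beta_{ij} k_l(x_i, x_j)$ turn a set of base Gram matrices into a single combined kernel can be sketched as follows. In the method described here the $\beta_{ij}$ come out of the SDP optimization; in this illustration they are fixed by hand, and the function names and toy matrices are hypothetical.

```python
def combination_weights(base_grams, beta, c):
    """gamma_l = c_l * sum_ij beta_ij * k_l(x_i, x_j) for each base kernel l."""
    m = len(beta)
    return [c_l * sum(beta[i][j] * gram[i][j]
                      for i in range(m) for j in range(m))
            for gram, c_l in zip(base_grams, c)]

def combined_gram(base_grams, gamma):
    """Combined kernel: k(x_p, x_q) = sum_l gamma_l * k_l(x_p, x_q)."""
    m = len(base_grams[0])
    return [[sum(g * gram[p][q] for g, gram in zip(gamma, base_grams))
             for q in range(m)] for p in range(m)]

# Two toy base kernels on m = 2 training samples.
k1 = [[1.0, 0.5], [0.5, 1.0]]
k2 = [[1.0, 0.0], [0.0, 1.0]]
beta = [[1.0, 0.0], [0.0, 1.0]]   # non-negative, chosen by hand for the example
c = [0.5, 0.5]                    # c_i = 1/M with M = 2 base kernels

gamma = combination_weights([k1, k2], beta, c)
kernel = combined_gram([k1, k2], gamma)
```

Since both the $c_l$ and the $\beta_{ij}$ are non-negative and the base kernels are pointwise positive, every $\gamma_l$ is non-negative and the combined matrix stays positive semidefinite.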
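As for the finite family, it can be easily shown that $\underline{k}$ is a valid hyperkernel provided that the kernels considered are pointwise positive. The inverse multiquadratic kernel satisfies this condition. Next, we derive the hyperkernel expression for the multiquadratic kernel. Choosing $c_i = (1 - \lambda_h)\lambda_h^{i}$ in (20) and computing the infinite sum, one has the following expression for the harmonic hyperkernel:
$$\underline{k}(\underline{x}, \underline{x}') = (1 - \lambda_h) \sum_{i=0}^{\infty} \left( \lambda_h \, k(\underline{x}) \, k(\underline{x}') \right)^{i} = \frac{1 - \lambda_h}{1 - \lambda_h \, k(\underline{x}) \, k(\underline{x}')}. \tag{22}$$
$\lambda_h$ is a regularization term that controls the complexity of the resulting kernel. In particular, larger values of $\lambda_h$ give more weight to strongly nonlinear kernels, while smaller values give coverage for wider kernels. In this paper, we have considered the inverse multiquadratic kernel defined in (6). Substituting it in (22), one gets the inverse multiquadratic hyperkernel:
$$\underline{k}(\underline{x}, \underline{x}') = \frac{1 - \lambda_h}{1 - \lambda_h \left[ \left( \|x - x'\|^2 + c^2 \right) \left( \|x'' - x'''\|^2 + c^2 \right) \right]^{-1/2}}, \tag{23}$$
where $\underline{x} = (x, x')$ and $\underline{x}' = (x'', x''')$.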
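A quick numerical check can make the harmonic hyperkernel less abstract. The sketch below assumes the weights $c_i = (1 - \lambda_h)\lambda_h^i$ used for harmonic hyperkernels in the hyperkernel literature [10] and compares the closed form with a truncated power series; the function names are hypothetical, and the two kernel values are arbitrary numbers in $(0, 1]$ chosen for illustration.

```python
def harmonic_hyperkernel(k_a, k_b, lambda_h=0.6):
    """Closed form (1 - lambda_h) / (1 - lambda_h * k_a * k_b).

    k_a and k_b are base kernel values for two pairs of samples;
    the series converges when lambda_h * k_a * k_b < 1.
    """
    return (1.0 - lambda_h) / (1.0 - lambda_h * k_a * k_b)

def harmonic_series(k_a, k_b, lambda_h=0.6, terms=200):
    """Truncated series sum_i c_i (k_a * k_b)^i with c_i = (1 - lambda_h) lambda_h^i."""
    return (1.0 - lambda_h) * sum((lambda_h * k_a * k_b) ** i
                                  for i in range(terms))

closed_form = harmonic_hyperkernel(0.8, 0.5)
truncated = harmonic_series(0.8, 0.5)
```

The agreement between the two values shows that the single closed-form expression implicitly sums all powers of the base kernel, which is what allows an infinite family to be handled with a finite computation.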

ν-SVM in an HRKHS.
In this section, we detail how to learn the kernel for a ν-Support Vector Machine in an HRKHS. First, we will introduce the optimization problem and next, we will explain shortly how to solve it using an SDP approach.
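We start with some notation that is used in the ν-SVM algorithm. For $p, q, r \in \mathbb{R}^n$, $n \in \mathbb{N}$, let $r = p \circ q$ be defined as element-by-element multiplication, $r_i = p_i \times q_i$. The pseudoinverse of a matrix $K$ is denoted by $K^{\dagger}$. Define the hyperkernel Gram matrix $\underline{K}$ by $\underline{K}_{ijpq} = \underline{k}((x_i, x_j), (x_p, x_q))$, the kernel matrix $K = \mathrm{reshape}(\underline{K}\beta)$ (reshaping an $m^2$ by 1 vector, $\underline{K}\beta$, to an $m \times m$ matrix), $Y = \mathrm{diag}(y)$ (a matrix with $y$ on the diagonal and zero otherwise), $G(\beta) = YKY$ (the dependence on $\beta$ is made explicit), and $\mathbf{1}$ a vector of ones.
The ν-SVM considered in this paper uses an $l_1$ soft margin, where $l(x_i, y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$. This error is less sensitive to outliers, which is a convenient feature for microarray datasets. Let $\xi_i$ be the slack variables that allow for errors in the training set. Substituting in (13) $Q_{\mathrm{emp}}$ by the one optimized by the ν-SVM (8), the regularized quality functional in an HRKHS can be written as
$$\min_{k \in \underline{H}} \min_{f \in H} \; \frac{1}{m}\sum_{i=1}^{m} \xi_i - \nu\rho + \frac{1}{2}\|f\|^2_{H} + \frac{\lambda_Q}{2}\|k\|^2_{\underline{H}} \quad \text{s.t.} \quad y_i f(x_i) \ge \rho - \xi_i, \;\; \xi_i \ge 0, \;\; \rho \ge 0, \tag{24}$$
where $\nu$ is the regularization parameter that achieves a balance between the training error and the complexity of the approximating functions, and $\lambda_Q$ is a parameter that penalizes the complexity of the family of kernels considered. The minimization of the previous equation leads to the following SDP optimization problem [10]:
$$\begin{aligned} \min_{\beta, \gamma, \eta, \xi, \chi} \quad & \frac{1}{2} t_1 - \chi\nu + \frac{\xi^{T}\mathbf{1}}{m} + \frac{\lambda_Q}{2} t_2 \\ \text{s.t.} \quad & \chi \ge 0, \;\; \eta \ge 0, \;\; \xi \ge 0, \;\; \beta \ge 0, \\ & \|\underline{K}^{1/2}\beta\| \le t_2, \\ & \begin{bmatrix} G(\beta) & z \\ z^{T} & t_1 \end{bmatrix} \succeq 0, \end{aligned} \tag{25}$$
where $z = \gamma y + \chi\mathbf{1} + \eta - \xi$. The value of $\alpha$ which optimizes the corresponding Lagrange function is $G(\beta)^{\dagger} z$, and the classification function, $f = \mathrm{sign}(K(\alpha \circ y) - b_{\mathrm{offset}})$, is given by
$$f = \mathrm{sign}\left(K G(\beta)^{\dagger} (y \circ z) - \gamma\right), \tag{26}$$
where $K$ is built from the hyperkernel defined in Section 2.4, which represents the combination of dissimilarities considered. Finally, the proposed algorithm can be easily extended to deal with multiple classes via a one-against-one (OVO) approach. This strategy is simple, computationally more efficient than the OVR approach, and compares well with more sophisticated multicategory SVM methods [15].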
2.6. Implementation. The optimization problem (25) was solved using the SeDuMi 1.1R3 [16] and YALMIP [17] SDP optimization packages running under MATLAB. As there are $m^2$ coefficients $\beta_{ij}$ in the SDP problem, the computational complexity is high. However, it can be significantly reduced if the hyperkernel $\{\underline{k}((x_i, x_j), \cdot) \mid 1 \le i, j \le m\}$ is approximated by a small fraction of terms, $p \ll m^2$, for a given error. In particular, we have chosen an $m^2 \times p$ truncated lower triangular matrix $G$ which approximates the hyperkernel matrix to an error $\delta = 10^{-6}$ using the incomplete Cholesky factorization method [18].
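The low-rank idea behind this approximation can be illustrated with a small pivoted incomplete Cholesky routine in plain Python. This is a generic textbook-style sketch, not the routine of [18] nor the code used with SeDuMi and YALMIP: the stopping rule (trace of the residual below a tolerance) and the function names are assumptions, and the factor is lower triangular only up to the pivot permutation.

```python
import math

def incomplete_cholesky(K, tol=1e-6):
    """Return the rows of G such that K is approximately G G^T.

    K is a symmetric positive semidefinite matrix given as a list of lists.
    Columns are added greedily until the trace of K - G G^T drops below tol.
    """
    n = len(K)
    residual_diag = [K[i][i] for i in range(n)]
    G = [[] for _ in range(n)]
    while sum(residual_diag) > tol and len(G[0]) < n:
        pivot = max(range(n), key=lambda i: residual_diag[i])
        root = math.sqrt(residual_diag[pivot])
        # New column: residual of column `pivot`, scaled by the pivot root.
        column = [(K[i][pivot] - sum(a * b for a, b in zip(G[i], G[pivot]))) / root
                  for i in range(n)]
        for i in range(n):
            G[i].append(column[i])
            residual_diag[i] -= column[i] ** 2
    return G

def reconstruct(G):
    """Compute G G^T to check the quality of the approximation."""
    return [[sum(a * b for a, b in zip(gi, gj)) for gj in G] for gi in G]

# A rank-one matrix is recovered with a single column (p = 1 instead of n = 3).
rank_one = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
factor = incomplete_cholesky(rank_one)
```

When the spectrum of the hyperkernel matrix decays quickly, the number of columns $p$ needed to reach the tolerance is far smaller than $m^2$, which is what reduces the size of the SDP.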

Datasets and Preprocessing.
The gene expression datasets considered in this paper correspond to several human cancer problems. The first dataset was obtained from 77 patients with diffuse large B-cell lymphoma (DLBCL, 58 samples) or follicular lymphoma (FL, 19 samples), who were subjected to transcriptional profiling using the oligonucleotide Affymetrix gene chip hu6800 containing probes for 6817 genes [19]. The second dataset consists of frozen tumor specimens from newly diagnosed, previously untreated MLBCL patients (34 samples) and DLBCL patients (176 samples). They were hybridized to the Affymetrix hgu133b gene chip containing probes for 44000 genes [20]. In both cases the raw intensities have been normalized using the RMA algorithm [21] available from the Bioconductor package [11]. The third problem we address concerns the clinically important issue of metastatic spread of the tumor. The determination of the extent of lymph node involvement in primary breast cancer is the single most important risk factor in disease outcome, and here the analysis compares primary cancers that have not spread beyond the breast to ones that have metastasized to axillary lymph nodes at the time of diagnosis. We identified tumors as "reported negative" (24 samples) when no positive lymph nodes were discovered and "reported positive" (25 samples) for tumors with at least three identifiably positive nodes [22]. All assays used the human HuGeneFL Genechip microarray containing probes for 7129 genes. The fourth dataset [23] addresses the clinical challenge concerning medulloblastoma due to the variable response of patients to therapy. Whereas some patients are cured by chemotherapy and radiation, others have progressive disease. The dataset consists of 60 samples containing 39 medulloblastoma survivors and 21 treatment failures. Samples were hybridized to Affymetrix HuGeneFL arrays containing 5920 known genes and 897 expressed sequence tags.
All the datasets have been standardized by subtracting the median and dividing by the interquartile range. The rescaling was performed based only on the training set to avoid bias.
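The rescaling step can be sketched as follows: the median and interquartile range of each gene are estimated on the training samples only and then reused unchanged on the test samples. The function names are hypothetical, the quartile convention (`method="inclusive"`) is an assumption, and the guard for genes with zero interquartile range is an addition made for the example.

```python
from statistics import median, quantiles

def fit_robust_scaler(train_rows):
    """Estimate per-gene median and interquartile range on the training set."""
    columns = list(zip(*train_rows))
    centers = [median(col) for col in columns]
    scales = []
    for col in columns:
        q1, _, q3 = quantiles(col, n=4, method="inclusive")
        # Guard against constant genes (assumption: leave them unscaled).
        scales.append((q3 - q1) or 1.0)
    return centers, scales

def apply_robust_scaler(rows, centers, scales):
    """Rescale any set of samples with statistics learnt on the training set."""
    return [[(value - center) / scale
             for value, center, scale in zip(row, centers, scales)]
            for row in rows]

train = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]   # 5 samples, 2 genes
test = [[7, 70]]

centers, scales = fit_robust_scaler(train)
scaled_train = apply_robust_scaler(train, centers, scales)
scaled_test = apply_robust_scaler(test, centers, scales)
```

Fitting the statistics inside each cross-validation fold, rather than once on the whole dataset, keeps information from the test samples out of the preprocessing.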

Journal of Biomedicine and Biotechnology
Regarding the identification of multiple classes of cancer, we have considered three different datasets. The first one consists of 49 samples of breast cancer generated using 1-channel oligonucleotide Affymetrix HuGeneFL [1]. The second and third datasets consist of 58 and 129 samples from diffuse large B-cell lymphoma with survival data. Four different subclasses can be identified. Data preparatory steps have been performed by the authors of the primary study [1]. The 10% of oligonucleotides with the smallest interquartile range were filtered out to remove genes whose expression level is constant across samples.

Performance Evaluation.
In order to ensure an honest evaluation of all the classifiers, we have performed a double loop of cross-validation [15]. The outer loop is based on stratified tenfold cross-validation that iteratively splits the data into ten sets, one for testing and the others for training. The inner loop performs stratified ninefold cross-validation over the training set and is used to estimate the optimal parameters while avoiding overfitting. The stratified variant of cross-validation keeps the same proportion of patterns for each class in the training and test sets. This is necessary in our problem because the class proportions are not equal. Finally, the measure considered to evaluate the classifiers has been accuracy. This metric computes the proportion of samples correctly classified. The accuracy is easy to interpret and allows us to compare with the results obtained by previously published studies.
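The double loop can be sketched in a few lines. The splitter below deals the samples of each class to the folds in round-robin order, which keeps the class proportions roughly constant; shuffling is omitted so that the example is deterministic, and all names are hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Deal the indices of each class to the folds in round-robin order."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

def nested_splits(labels, outer=10, inner=9):
    """Yield (train, test, inner_folds); inner folds partition the training set."""
    for test in stratified_folds(labels, outer):
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        inner_folds = stratified_folds([labels[i] for i in train], inner)
        # Map positions within the training set back to original indices.
        yield train, test, [[train[p] for p in fold] for fold in inner_folds]

def accuracy(y_true, y_pred):
    """Proportion of samples correctly classified."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Class sizes of the first dataset: 58 DLBCL and 19 FL samples.
labels = [1] * 58 + [-1] * 19
splits = list(nested_splits(labels))
```

Parameters would be selected on the inner folds only, and the outer test fold would be used once, to report the accuracy of the selected configuration.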

Parameters for the Classification Algorithm.
The parameters for the ν-SVM and for the classifiers based on a linear combination of dissimilarities have been set up by a nested stratified tenfold cross-validation procedure [15]. This method avoids overfitting, as described in Section 2.8, and takes into account the asymmetric distribution of class priors.
For the ν-SVM we have considered both linear and inverse multiquadratic kernels. The optimal parameters have been obtained by a grid search strategy over the following sets of values: $\nu \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$ and $\sigma \in \{d/2, d, 2d\}$, where $d$ denotes the dimensionality of the input space.
Additionally, for the finite family of distances, $c_i = 1/M$, where $M$ is the number of dissimilarities considered, and $\lambda_Q = 1$ because the misclassification errors are hardly sensitive to the regularization parameter that controls the kernel complexity. Finally, for the infinite family of dissimilarities, the regularization parameter $\lambda_h$ in the harmonic hyperkernel (22) has been set to 0.6, which gives adequate coverage of various kernel widths. Smaller values emphasize only wide kernels. All the base kernels of dissimilarities have been normalized so that they all have the same scale.
Regarding the Lanckriet [9] formalism, which allows a finite set of dissimilarities to be combined, several values for the regularization parameter $C$ have been tried, $C \in \{0.1, 1, 10, 100, 1000\}$. A grid search strategy has been applied to determine the best values for both the kernel parameters and the regularization parameter. The kernel matrices have been normalized by the trace, as recommended in the original paper.

Gene Selection.
Gene selection can significantly improve the classifier performance [24]. Therefore, we have evaluated the classifiers for the following subsets of genes: {280, 146, 101, 56, 34}. The ν-SVM is robust against noise and is able to deal with high dimensional data. However, the empirical evidence suggests that considering a larger subset of genes, or even the whole set of genes, increases the misclassification errors. The genes are ranked according to the ratio of between-group to within-group sums of squares defined in [25]:
$$\mathrm{BW}(j) = \frac{\sum_{i} \sum_{k} I(y_i = k) \left( \bar{x}^{(k)}_{\cdot j} - \bar{x}_{\cdot j} \right)^2}{\sum_{i} \sum_{k} I(y_i = k) \left( x_{ij} - \bar{x}^{(k)}_{\cdot j} \right)^2}, \tag{27}$$
where $x_{ij}$ is the expression level of gene $j$ in sample $i$, $\bar{x}^{(k)}_{\cdot j}$ and $\bar{x}_{\cdot j}$ denote, respectively, the average expression level of gene $j$ for class $k$ and the overall average expression level of gene $j$ across all samples, $y_i$ denotes the class of sample $i$, and $I(\cdot)$ is the indicator function. Next, the top ranked genes are chosen. This feature selection method is simple but compares well with more sophisticated methods [24]. Finally, the ranking of genes has been carried out considering only the training set to avoid bias. Therefore, feature selection is repeated in each iteration of cross-validation.
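The ranking criterion is simple enough to sketch directly. The code below computes the ratio of between-group to within-group sums of squares for every gene and keeps the top ranked ones; the function names and the toy data are hypothetical, and assigning an infinite score to a gene with zero within-group variation is an assumption of the example.

```python
def bw_ratio(X, y):
    """Between-group to within-group sum of squares for every gene (column)."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between = within = 0.0
        for k in classes:
            members = [v for v, label in zip(column, y) if label == k]
            class_mean = sum(members) / len(members)
            between += len(members) * (class_mean - overall_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in members)
        scores.append(between / within if within > 0 else float("inf"))
    return scores

def top_genes(X, y, n_top):
    """Indices of the n_top genes with the largest BW ratio."""
    scores = bw_ratio(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:n_top]

# Four training samples, two genes: gene 0 separates the classes, gene 1 does not.
X_train = [[1.0, 4.0], [2.0, 6.0], [9.0, 5.0], [10.0, 5.0]]
y_train = ["A", "A", "B", "B"]

scores = bw_ratio(X_train, y_train)
selected = top_genes(X_train, y_train, n_top=1)
```

As stated above, the ranking should be recomputed on the training portion of every cross-validation split, so `top_genes` would be called once per fold.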

Results and Analysis
The proposed algorithms have been applied to the identification of several human cancer samples using microarray gene expression data.
First, we address several binary categorization problems. Table 2 reports the accuracy for the two combination approaches proposed in this paper. The first one considers the finite set of dissimilarities introduced in Section 2.1. The second one considers an infinite family of distances obtained by transforming nonlinearly the base dissimilarities to feature space. We have compared with the ν-SVM based on the best distance (linear and nonlinear kernel) and the classical ν-SVM. The performance of the Lanckriet formalism [9], which allows us to incorporate a finite linear combination of dissimilarities, is also reported. Before computing the kernel of dissimilarities, all the distances have been transformed using the multiquadratic kernel introduced in Section 2.1. This nonlinear transformation helps to improve the accuracy for all the techniques evaluated. From the analysis of Table 2, the following conclusions can be drawn.
Table 3: Accuracy for the ν-SVM using a linear combination of non-Euclidean dissimilarities in an HRKHS. The ν-SVM based on the best distance, the classical ν-SVM, and the Lanckriet formalism have been taken as a reference.
(i) The ν-SVM based on a finite set of distances improves on the ν-SVM based on the best dissimilarity for the brain prognosis and Lymphoma datasets. The error is not reduced for Lymphoma cell B and Breast LN. This may be explained by the ratio (var/samp.) in Table 1, which suggests that both datasets are quite noisy and nonlinear. The combination of a finite set of dissimilarities is not able to improve the separation between classes and slightly increases the overfitting of the data. Similarly, our algorithm helps to improve on the SVM based on coordinates, particularly for the previous problems. We also report that working directly from a dissimilarity matrix may help to reduce the misclassification errors.
(ii) The infinite family of distances outperforms the ν-SVM based on the best distance, regardless of the kernel considered, for all the datasets. The improvement is more relevant in brain cancer prognosis. Brain cancer prognosis is a complex problem according to the original study [23], and the nonlinear transformations of the dissimilarities help to reduce the misclassification errors. Besides, the infinite family improves on the accuracy of the finite family of distances, particularly for Lymphoma cell B and Breast LN. This suggests that both datasets are nonlinear.
(iii) The Lanckriet formalism and the finite family of dissimilarities perform similarly. However, the infinite family of distances outperforms the Lanckriet formalism, particularly for brain and Lymphoma cell B, which are more complex problems.
(iv) The best distance depends on the dataset considered.
Next we move to the categorization of multiple cancer types. Table 3 compares the proposed algorithms with ν-SVM based on the best distance (linear and nonlinear kernel) and the classical ν-SVM. The accuracy for the Lanckriet formalism has also been reported. Our approach considers an infinite family of distances obtained by transforming nonlinearly the base dissimilarities to feature space.
Before computing the kernel of dissimilarities, all the distances have been transformed using the multiquadratic kernel introduced in Section 2.1. From the analysis of Table 3, the following conclusions can be drawn.
(i) The combination of non-Euclidean dissimilarities helps to improve on the SVM based on the best dissimilarity, regardless of the kernel considered, for the first two datasets. The error is slightly larger for the third dataset, which may suggest that the problem is linear.
(ii) Our algorithm improves the SVM based on coordinates. The experimental results suggest that the nonlinear transformations of the dissimilarities help to increase the separation among classes.
(iii) The hyperkernel classifier outperforms the Lanckriet formalism for multicategory problems. As the number of classes grows, the number of samples per class decreases and the Lanckriet formalism seems to be less robust to overfitting.
Finally, notice that our algorithm allows us to work with applications in which only a dissimilarity is defined. Moreover, we avoid the complex task of choosing a dissimilarity that properly reflects the proximities among the sample profiles.

Conclusions
In this paper, we propose two methods to incorporate a linear combination of non-Euclidean dissimilarities into the ν-SVM algorithm. The family of distances is learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) using a Semidefinite Programming approach. A penalty term has been added to avoid the overfitting of the data. The algorithm has been applied to the classification of complex human cancer samples. The experimental results suggest that the combination of dissimilarities in a Hyper Reproducing Kernel Hilbert Space improves the accuracy of classifiers based on a single distance, particularly for nonlinear problems. Besides, this approach outperforms the Lanckriet formalism, especially for multicategory problems, and is more robust to overfitting. Future research trends will focus on learning the combination of dissimilarities for other classifiers such as k-NN.