Determining an optimal decision model is an important but difficult combinatorial task in imbalanced microarray-based cancer classification. Although the multiclass support vector machine (MCSVM) has already made important contributions in this field, its performance largely depends on three aspects: the penalty factor C, the type of kernel, and the kernel's parameters. To improve the performance of this classifier in microarray-based cancer analysis, this paper proposes the PSO-PCA-LGP-MCSVM model, which is based on particle swarm optimization (PSO), principal component analysis (PCA), and a multiclass support vector machine (MCSVM). The MCSVM uses a hybrid linear-Gaussian-polynomial (LGP) kernel that combines the advantages of three standard kernels (linear, Gaussian, and polynomial) in a novel manner, where the linear kernel is linearly combined with the Gaussian kernel embedding the polynomial kernel. Further, this paper proves that the LGP kernel satisfies the conditions of a valid kernel. To reveal the effectiveness of our model, several experiments were conducted and the obtained results were compared between our model and three single kernel-based models, namely, PSO-PCA-L-MCSVM (utilizing a linear kernel), PSO-PCA-G-MCSVM (utilizing a Gaussian kernel), and PSO-PCA-P-MCSVM (utilizing a polynomial kernel). For the comparison, two binary and two multiclass imbalanced standard microarray datasets were used. Experimental results in terms of three extended assessment metrics (
Cancer is a disorder caused by excessive and uncontrolled cell division in a body. A total of 9.6 million people died of cancer in 2018 [
This challenge encourages the application of data mining techniques, especially the use of gene expression data, in determining the types of cancer cells. The level of gene expression can duly indicate the activity of a gene in a body cell based on the number of messenger ribonucleic acids (mRNAs). Gene expression data are well known to contain information about the disease that may be present in the gene sample, which may help experts in treating or preventing the disease [
Though next-generation sequencing (NGS), especially RNA-sequencing (RNA-Seq), is slowly replacing microarrays when analyzing and identifying complex mechanisms in gene expression, e.g., in the gene expression-based cancer classification problem, it is relatively expensive compared to microarrays. Since microarrays have been used for a long time, robust statistical and operational methods exist for their processing [
The DNA microarray technology has the capability of determining the level of thousands of genes concurrently in a given experiment, which so far has facilitated the development of cancer classification by the use of gene expression data [
Clinical decision support is the most recent application of DNA microarrays in the medical domain. This support can take the form of disease diagnosis or predicting clinical outcomes in response to a treatment. Currently, the two major areas in medicine that are drawing much attention in this regard are management of cancer and other contagious diseases [
With the rapid development of artificial intelligence (AI), many researchers have widely applied machine-learning algorithms, such as the artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbor (KNN), to gene expression-based cancer diagnosis. For instance, artificial neural networks (ANNs) have been proposed for microarray gene classification due to their superior ability to map input-output structured data. Khan and Meltzer utilized an ANN in analyzing microarray gene data from patients with small round blue-cell tumours [
Furey proposed an SVM based on a simple kernel to carry out gene expression data analysis, which turned out to perform remarkably [
Based on these previous research studies, it is evident that the SVM has already made an important contribution to the field of microarray-based cancer classification. However, many researchers have pointed out that though the SVM is a promising classifier for microarray-based cancer classification, its performance largely depends on three aspects: the penalty parameter C of the classifier, the type of kernel utilized, and the kernel's parameters [
To improve the classification accuracy of the SVM classifier, some techniques have been presented to search for the optimal model parameters, such as the grid-search and the gradient descent [
Recently, some meta-heuristic techniques, such as particle swarm optimization (PSO), genetic algorithm (GA), bat algorithm (BA), and dragonfly algorithm (DA) have attained promising results when utilized in tuning SVM classifier’s parameters [
PSO has a number of desirable properties, including simplicity of implementation, scalability in dimension, and good empirical performance, and it is computationally efficient compared to other optimization techniques [
The objective of this research is threefold: to construct an MCSVM classifier with three different standard kernel functions (linear, Gaussian, and polynomial), to use PCA to reduce the dimensional complexity of the considered microarray datasets, and to optimize all the parameters of this model using PSO.
The overall structure of this paper takes the form of five sections, including this introductory section. The remaining part of this paper proceeds as follows: a detailed presentation of the proposed model is presented in Section
Microarray gene expressions can differ by an order of magnitude. Thus, it is necessary to normalize these data to improve the performance of subsequent microarray data analysis stages like gene selection/feature extraction, clustering, and classification [
In this paper, the microarray gene expressions are linearly transformed from the interval
The min-max normalization has the advantage of preserving exactly all the relationships among the original gene data values and does not introduce any bias [
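Assuming the target interval is [0, 1] (the interval itself is not reproduced above), min-max normalization applied per gene can be sketched as follows; the tiny expression matrix is illustrative only:

```python
import numpy as np

def min_max_normalize(X):
    """Linearly rescale each gene (column) of X into [0, 1] via
    x' = (x - min) / (max - min), preserving all relationships
    among the original gene expression values."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # guard against constant genes
    return (X - col_min) / col_range

# Illustrative 3-sample x 2-gene expression matrix
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])
X_norm = min_max_normalize(X)
```

Because the transform is affine per gene, the ordering of expression values within each gene is unchanged, which is the bias-free property noted above.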
One of the major challenges encountered in working with DNA microarray data is their high dimensionality that is coupled with a relatively small sample size. While there is a plethora of crucial information that can be derived from these large datasets, their high-dimensional nature can often hide the critical information. Thus, a process that can reduce the dimensionality complexity of this type of data is required. In addition, a dimensionality reduction step will minimize errors obtained in the subsequent classification stage [
In this paper, principal component analysis (PCA), which includes calculating the proportion of variance for each eigenvector, is used. The steps of this algorithm are as follows:

1. Let the preprocessed microarray samples be $x_1, \dots, x_n$.
2. Compute the mean (centroid) $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$.
3. Compute the covariances (the degree to which the genes are linearly correlated) and form the covariance matrix $C = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^{T}$.
4. Compute the eigenvalues $\lambda_j$ of $C$ and the corresponding eigenvectors $v_j$, satisfying $C v_j = \lambda_j v_j$.
5. Sort the eigenvalues in descending order, i.e., $\lambda_1 \ge \lambda_2 \ge \cdots$, together with their corresponding eigenvectors.
6. Select the first $k$ eigenvectors and form the principal component (loadings) matrix $W = [v_1, \dots, v_k]$.
7. Compute the dimensionally reduced microarray gene expression data $y_i = W^{T}(x_i - \mu)$.
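The PCA procedure described above can be sketched in a few lines of NumPy; the data and the number of retained components are illustrative:

```python
import numpy as np

def pca_reduce(X, k):
    """PCA via covariance eigendecomposition: center the data, form the
    covariance matrix, sort eigenvalues descending, keep the first k
    eigenvectors, and project."""
    mu = X.mean(axis=0)                   # centroid
    Xc = X - mu                           # centered samples
    C = np.cov(Xc, rowvar=False)          # covariance matrix (genes as columns)
    eigvals, eigvecs = np.linalg.eigh(C)  # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]     # eigenvalues in descending order
    W = eigvecs[:, order[:k]]             # loadings: first k eigenvectors
    return Xc @ W, mu, W                  # PC scores + transform parameters

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # illustrative 20-sample x 5-gene matrix
scores, mu, W = pca_reduce(X, k=2)
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, hence the explicit descending sort before selecting the loadings.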
Hence, the analysis reduces the high-dimensional original microarray datasets to
To measure the generalization error for each considered model, per-fold PCA was adopted. This is achieved by first conducting a separate PCA on each calibration set and then applying the resulting transformation to the corresponding validation set: the means of the calibration set are subtracted from the validation set, and the centered validation data are projected onto the principal components of the calibration set. The underlying assumption, which justifies this process, is that the testing and training sets are derived from the same distribution.
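The per-fold application just described can be sketched as follows; the calibration and validation arrays are illustrative:

```python
import numpy as np

def fit_pca(calib, k):
    """Fit PCA on the calibration set only (mean + first k loadings)."""
    mu = calib.mean(axis=0)
    C = np.cov(calib - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    W = vecs[:, np.argsort(vals)[::-1][:k]]
    return mu, W

def apply_pca(data, mu, W):
    """Project new data using the calibration-set mean and loadings,
    exactly as described for the validation set above."""
    return (data - mu) @ W

rng = np.random.default_rng(1)
calib = rng.normal(size=(40, 6))   # illustrative calibration set
valid = rng.normal(size=(10, 6))   # illustrative validation set
mu, W = fit_pca(calib, k=3)
valid_scores = apply_pca(valid, mu, W)
```

Fitting the transform on the calibration data alone and reusing its mean and loadings on the validation fold is what prevents information from the held-out fold leaking into the model.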
The MCSVM classifier is based on Vapnik–Chervonenkis (VC) dimension of the statistical learning theory and the structural risk minimization [
The main objective of the MCSVM is to map the preprocessed, nonlinearly inseparable microarray gene expression data into a high-dimensional manifold
The parameter
The feature space
The common kernel functions that are utilized as continuous predictors include [

- Linear kernel: $K(x, y) = x^{T}y$
- Polynomial kernel: $K(x, y) = (x^{T}y + c)^{d}$, where $c \ge 0$ is a constant offset and $d$ is the polynomial degree
- Gaussian kernel: $K(x, y) = \exp\left(-\frac{\|x - y\|^{2}}{2\sigma^{2}}\right)$, where $\sigma$ is the kernel width
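These three standard kernels can be written directly; a minimal sketch with illustrative parameter values (offset `c`, degree `d`, width `sigma` are not the paper's tuned values):

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x . y
    return np.dot(x, y)

def polynomial_kernel(x, y, c=1.0, d=2):
    # K(x, y) = (x . y + c)^d, with illustrative offset c and degree d
    return (np.dot(x, y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
```

For orthogonal unit vectors such as `x` and `y` here, the linear kernel vanishes while the Gaussian kernel decays smoothly with distance, which illustrates the global-versus-local distinction discussed next.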
These MCSVM kernel functions can be broadly categorized as local kernel functions and global kernel functions. Samples that are far apart have a great impact on global kernel values, while samples close to each other greatly influence local kernel values. The linear and polynomial kernels are good examples of global kernels, while the Gaussian (radial basis function) kernel is a local kernel [
Relatively speaking, among the single kernel functions, the linear kernel extracts global features from samples better, the polynomial kernel has good generalization ability, and the Gaussian kernel (the most widely used kernel) has good learning ability. Thus, a single kernel function-based MCSVM classifier applied to a given problem, such as gene expression data, may not simultaneously attain good learning ability, proper global feature extraction, and good generalization capability. To overcome this limitation, two or more kernel functions can be combined [
In trying to build a kernel model that has better global feature extraction, good learning, and prediction abilities, the work presented in this paper combines the merits of two global kernels (linear and polynomial) and one local kernel (Gaussian). This paper therefore proposes a novel kernel “linear-Gaussian-polynomial (LGP)” kernel, which is formulated as follows:
In this paper, we utilize different values of
The LGP kernel function takes better global feature extraction ability from the linear kernel, good prediction ability from the polynomial kernel, and better learning ability from the Gaussian kernel. Mercer's theorem provides the necessary and sufficient conditions for a valid kernel function. It states that a kernel function is a permissible kernel if the corresponding kernel matrix is symmetric and positive semidefinite (PSD) [
A kernel matrix can be validated as PSD by determining its spectrum of eigenvalues: a symmetric matrix is positive semidefinite if and only if all its eigenvalues are nonnegative. Considering this, for the proposed kernel to be permissible, it must satisfy Mercer's theorem. This validity can be proved by using the Taylor expansion of the exponential function in equation (
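The eigenvalue check described above can be carried out numerically; a minimal sketch (the Gaussian kernel and the sample data are illustrative):

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build the kernel (Gram) matrix K with K[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(K, tol=1e-10):
    """A symmetric kernel matrix is PSD iff all its eigenvalues are >= 0
    (up to a small numerical tolerance)."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2))  # Gaussian kernel, gamma = 1
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))  # illustrative samples
K = gram_matrix(X, gauss)
```

A numerical check on sample data does not replace the Mercer-theorem proof, but it is a quick sanity test: any symmetric matrix with a negative eigenvalue, such as [[0, 1], [1, 0]], fails it immediately.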
From equation (
Functions of Mercer’s kernels
Functions of a Mercer kernel
Since the proposed hybrid LGP kernel combines three valid Mercer kernels, i.e., the linear, Gaussian, and polynomial kernels, it is also a valid Mercer kernel that can be used for training and classification with the multiclass support vector machine (MCSVM).
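As a numerical illustration of this closure property, the sketch below builds a Gram matrix from one plausible reading of the LGP construction, a linear kernel linearly combined with a Gaussian computed in the polynomial kernel's feature space, and confirms that its eigenvalues are nonnegative. The weight `w`, width `gamma`, and polynomial parameters are illustrative assumptions, not the paper's exact formula or tuned values:

```python
import numpy as np

def k_lin(x, y):
    return np.dot(x, y)

def k_poly(x, y, c=1.0, d=2):
    return (np.dot(x, y) + c) ** d

def k_lgp(x, y, w=0.5, gamma=0.1):
    """Hypothetical LGP-style hybrid: a Gaussian evaluated in the polynomial
    kernel's feature space (via the kernel trick), linearly combined with
    the linear kernel. The paper's exact LGP formula is not reproduced here."""
    # squared distance between phi(x) and phi(y) in the polynomial feature space
    d2 = k_poly(x, x) + k_poly(y, y) - 2.0 * k_poly(x, y)
    return w * k_lin(x, y) + (1.0 - w) * np.exp(-gamma * d2)

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))  # illustrative samples
K = np.array([[k_lgp(X[i], X[j]) for j in range(10)] for i in range(10)])
min_eig = float(np.linalg.eigvalsh(K).min())
```

Both building blocks are Mercer kernels and a convex combination of Mercer kernels is again a Mercer kernel, so the smallest eigenvalue of the hybrid Gram matrix stays nonnegative up to floating-point error.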
During the training phase of the MCSVM classifier, the proposed LGP-MCSVM applies a nonlinear transformation to the microarray gene sample points to obtain the corresponding kernel matrix, from which the classification results are obtained.
Currently, there is no widely accepted method for optimizing these parameters. The “grid-search (GS)” with exponentially growing sequences of combination
In this paper, the particle swarm optimization (PSO) technique is adopted to search for the best parameter combinations for the considered models [
Parameters and their respective ranges.
The parameters that need to be determined in the PSO algorithm include the dimension of the search space
Initial PSO parameters setting.
Parameter | Value |
---|---|
Maximum number of iterations | 50 |
Inertial weight | 1 |
Number of particles/swarm size | (1) PSO + L-MCSVM = 10; (2) PSO + G-MCSVM = 20; (3) PSO + P-MCSVM; (4) PSO + LGP-MCSVM = 80 |
Cognition learning factor | 2.0 |
Social learning factor | 2.0 |
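With the reported settings (inertia weight 1, cognition and social factors both 2.0, 50 iterations), the PSO search can be sketched as follows. The sphere objective is a hypothetical stand-in for the cross-validated classification error, and all names are illustrative:

```python
import numpy as np

def pso_minimize(f, dim, n_particles=10, iters=50, w=1.0, c1=2.0, c2=2.0, seed=0):
    """Minimal PSO sketch: velocities follow
    v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x), positions x += v."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, size=(n_particles, dim))    # particle positions
    v = np.zeros_like(x)                               # particle velocities
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    g = pbest_val.argmin()
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        better = vals < pbest_val                      # update personal bests
        pbest[better] = x[better]
        pbest_val[better] = vals[better]
        g = pbest_val.argmin()                         # update the global best
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, gbest_val

sphere = lambda p: float(np.sum(p ** 2))  # toy objective, minimum 0 at origin
best_pos, best_val = pso_minimize(sphere, dim=2)
```

In the actual model, `f` would be the 5-fold cross-validated error of an MCSVM trained with the parameter vector `p`, so each particle evaluation is far more expensive than this toy objective.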
The main process of the proposed algorithm is outlined as follows:

1. Transform the cancer microarray data into the right format for the SVM package.
2. Load a cancer microarray dataset.
3. Randomly divide the loaded microarray data into two sets: a training set and a testing set.
4. Initialize the PSO parameters, such as the population size, the maximum number of iterations, and the considered multiclass SVM parameters.
5. Adopt PSO to search for the optimal solution of the particles in the global space by using 5-fold cross-validation that incorporates per-fold PCA feature extraction. The 5-fold cross-validation incorporating PCA proceeds as follows:
(i) For each held-out fold, carry out PCA on the data present in the remaining 4 folds (the calibration set) to generate a loadings matrix.
(ii) Transform the calibration set into a set of principal component (PC) scores using the generated loadings matrix.
(iii) Build the considered SVM classification model with a set of parameter values using the PC score data from step (ii).
(iv) Transform the held-out test fold data using the same loadings matrix.
(v) Compute the classification accuracy of the SVM classification model built in step (iii) on the transformed test fold.
(vi) For the considered parameter set, store the optimal parameter values, i.e., the set of parameters that yields the highest classification accuracy.
(vii) Report the optimal parameters for the considered model.
6. Carry out PCA on the whole training set (i.e., the training set obtained in step 3) to generate a loadings matrix.
7. Transform this whole training set into a set of PC scores using the loadings matrix from step 6.
8. Build an optimal model for the considered SVM classifier using the optimal parameter values obtained in step 5 and the PC score data from step 7.
9. Transform the whole testing set (i.e., the testing set obtained in step 3) into a set of PC scores using the training-set loadings matrix.
10. Compute the classification accuracy of the optimal SVM classification model built in step 8 using the transformed testing set data from step 9, and report this test classification accuracy.
The schematic diagram in Figure
Scheme of the proposed PSO-PCA-LGP-MCSVM algorithm.
It is important to mention that the whole analysis process is conducted using the LIBSVM framework in MATLAB [
To assess the performance of the proposed PSO-PCA-LGP-MCSVM algorithm, several experiments were conducted on four publicly available datasets: the Colon dataset [, the Leukemia (AML-ALL) dataset [, the St. Jude Leukemia dataset [, and the Lung Cancer dataset [. A summary of all the datasets utilized in this research can be found in Table
The cancer microarray datasets utilized in this paper.
Category | Dataset | Sample size | Number of genes | Number of classes |
---|---|---|---|---|
Two-class | AML-ALL | 72 | 7129 | 2 |
Colon | 62 | 2000 | 2 | |
Multiclass | St. Jude | 215 | 12558 | 7 |
Lung | 203 | 3312 | 5 |
Due to the small number of instances in the considered datasets, all the datasets were initially split into two disjoint sets: the training set and the test set. Utilizing 5-fold cross-validation, the training set was randomly divided further into 5 subsets (approximately) equal in size. Each time 4 subsets were selected as the calibration set and the remaining subset was used as the validation set. This process was repeated 5 times. Finally, the average of classification accuracy on the validation set was used as one of the evaluation metrics. It is important to point out that by using 5-fold cross-validation to dynamically divide the microarray training samples, the considered models turn out to be more stable and objective.
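The random 5-fold partition of the training set described above can be sketched as follows; the training-set size of 45 is illustrative only:

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Randomly partition n training-sample indices into 5 (approximately)
    equal-sized, disjoint folds, as described above."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), 5)

folds = five_fold_indices(45)  # hypothetical training set of 45 samples
```

Each fold in turn serves as the validation set while the remaining four form the calibration set, so every training sample is validated exactly once across the 5 repetitions.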
The percentage proportion for the calibration, validation, and test sets for all the considered microarray datasets is presented in Table
Percentage proportion for the calibration, validation, and test sets.
Dataset | % proportion for calibration set | % proportion for validation set | % proportion for test set |
---|---|---|---|
AML-ALL | 61.1 | 15.3 | 23.6 |
Colon | 58.1 | 14.5 | 27.4 |
St. Jude | 57.7 | 14.4 | 27.9 |
Lung | 57.1 | 14.3 | 28.6 |
When the samples in a dataset are unevenly distributed among the classes (as is the case for microarray datasets), the classification task must be treated as learning in an imbalanced domain. The majority class then biases the data mining algorithms, skewing their performance towards it [
Most algorithms simply compute the accuracy on the basis of the percentage of correct samples.
However, in the case of microarrays, these results are highly deceiving since the minority classes hold minimal effects on the overall classification accuracy. Thus, a consideration of a complete confusion matrix (Table
Confusion matrix for a two-class problem.
Positive prediction | Negative prediction | |
---|---|---|
Positive class | True positive (TP) | False negative (FN) |
Negative class | False positive (FP) | True negative (TN) |
The description in Table
Two of the most frequently used metrics for the class imbalance problem, namely,
The overall classification accuracy (
However, all these evaluation metrics are appropriate for estimating binary-class imbalance tasks. To extend them for multiclass, the following transformations should be considered [
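Since the exact multiclass transformations are cited rather than shown here, the sketch below illustrates one common extension: deriving one-vs-rest TP/FN/FP/TN counts per class from the confusion matrix of the table above, then macro-averaging per-class recall. The choice of macro recall is an assumption for illustration, not necessarily the paper's exact extended metric:

```python
import numpy as np

def confusion_counts(y_true, y_pred, cls):
    """One-vs-rest TP/FN/FP/TN for a single class, per the table above."""
    pos, hit = (y_true == cls), (y_pred == cls)
    tp = int(np.sum(pos & hit))
    fn = int(np.sum(pos & ~hit))
    fp = int(np.sum(~pos & hit))
    tn = int(np.sum(~pos & ~hit))
    return tp, fn, fp, tn

def accuracy(y_true, y_pred):
    """Overall classification accuracy: fraction of correct predictions."""
    return float(np.mean(y_true == y_pred))

def macro_recall(y_true, y_pred):
    """Macro-averaged per-class recall, one common multiclass extension
    that is not dominated by the majority class."""
    recalls = []
    for cls in np.unique(y_true):
        tp, fn, _, _ = confusion_counts(y_true, y_pred, cls)
        recalls.append(tp / (tp + fn))
    return float(np.mean(recalls))

# Illustrative 3-class labels with one minority-class error
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
```

Here overall accuracy is 5/6, yet the per-class view reveals that all the error falls on class 0, which is exactly the distinction such extended metrics are designed to expose.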
The experimental results for the 4 classification models on the 4 microarray datasets are reported in Tables
Accuracy of all considered models on the four microarray datasets.
Models | Colon | Lung | AML-ALL | St. Jude |
---|---|---|---|---|
PSO + L-MCSVM | | 0.9596 | | 0.9422 |
PSO + P-MCSVM | 0.8235 | | | |
PSO + G-MCSVM | 0.8235 | 0.9608 | 0.9412 | 0.9572 |
PSO + LGP-MCSVM | | | | |
Values in bold represent the best result and values in italic denote the worst in each column, respectively.
Models | Colon | Lung | AML-ALL | St. Jude |
---|---|---|---|---|
PSO + L-MCSVM | | 0.9246 | 0.9328 | 0.7870 |
PSO + P-MCSVM | 0.8211 | | | |
PSO + G-MCSVM | 0.8211 | 0.9306 | | 0.8477 |
PSO + LGP-MCSVM | | | | |
Values in bold represent the best result and values in italic denote the worst in each column, respectively.
Models | Colon | Lung | AML-ALL | St. Jude |
---|---|---|---|---|
PSO + L-MCSVM | | 0.9791 | | 0.9557 |
PSO + P-MCSVM | 0.8235 | | | |
PSO + G-MCSVM | 0.8235 | 0.9792 | | 0.9661 |
PSO + LGP-MCSVM | | | | |
Values in bold represent the best result and values in italic denote the worst in each column, respectively.
From these tables, the following observations can be made:

- The Lung and St. Jude datasets are slightly sensitive to the class imbalance while Colon and AML-ALL are not, as shown by the difference between Accuracy and the imbalance-aware metrics.
- The hybrid kernel boosted the classification performance of the multiclass SVM on three datasets, i.e., Colon, Lung, and St. Jude; these improvements are better portrayed by the imbalance-aware metrics.
- Of all the considered models, PSO-PCA-P-MCSVM reported the lowest performance in all the considered metrics for all four datasets. However, it is important to note that a promising kernel can still be obtained if the polynomial kernel is embedded into the exponential kernel.
In summary, compared with the single kernel-based models (i.e., PSO-PCA-L-MCSVM, PSO-PCA-G-MCSVM, and PSO-PCA-P-MCSVM), the proposed PSO-PCA-LGP-MCSVM model, based on a hybrid linear-Gaussian-polynomial (LGP) kernel with better global feature extraction ability, good prediction ability, and better learning ability, has an attractive classification ability in cancer diagnosis using both imbalanced binary and multiclass microarray datasets. Moreover, due to the excellent global searching ability of particle swarm optimization, the hybrid kernel-based MCSVM can be effectively optimized for a wider range of classification problems.
Techniques to choose or construct suitable kernel functions and to optimally tune their parameters for the MCSVM have received considerable and critical attention in imbalanced microarray-based cancer diagnosis. A novel classification model, PSO-PCA-LGP-MCSVM, based on an MCSVM with a hybrid linear-Gaussian-polynomial (LGP) kernel, is proposed in this paper. The LGP kernel combines the advantages of three standard kernels, i.e., the linear, Gaussian, and polynomial kernels, in a novel manner, where the linear kernel is linearly combined with a polynomial kernel that is embedded into a Gaussian kernel. Using PSO to optimally tune the LGP kernel-based MCSVM resulted in better generalization, learning, and predicting abilities, as evidenced by the promising results in terms of three extended measures
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was fully supported by the African Development Bank (AfDB), through the Ministry of Education, Kenya Support for Capacity Building.
The results presented in Tables