Gene expression data comprising thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features from a large number of genes. Recently, many researchers have devoted themselves to feature selection using diverse computational intelligence methods. However, in the process of selecting informative genes, many computational methods have difficulty selecting small subsets for cancer classification because of the huge number of genes (high dimension) compared to the small number of samples, noisy genes, and irrelevant genes. In this paper, we propose a new hybrid algorithm, HICATS, which incorporates the imperialist competition algorithm (ICA), which performs global search, and tabu search (TS), which conducts fine-tuned search. To verify the performance of HICATS, we tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The proposed method proved superior to other related works, including the conventional version of the binary optimization algorithm, in terms of classification accuracy and the number of selected genes.
DNA microarray technology can measure the expression levels of thousands of genes simultaneously in biological tissues and produce cancer databases based on gene expression data [
Because gene expression data have high dimensionality relative to the small number of samples and contain noisy and irrelevant genes, conventional classification methods cannot be applied effectively to gene classification and yield poor classification accuracy. Given these inherent properties of gene data, efficient algorithms are needed to solve this problem in reasonable computational time. Therefore, many supervised machine learning algorithms, such as Bayesian networks, neural networks, and support vector machines (SVMs), combined with feature selection techniques, have been used to process the gene data [
Metaheuristic algorithms, as a kind of random search technique, cannot guarantee finding the optimal solution every time. Because a single metaheuristic algorithm is often trapped in an immature solution, recent research has shifted towards hybrid methods. Kabir et al. [
In this paper, we therefore concentrate on the imperialist competition algorithm, a recent swarm intelligence optimization algorithm inspired by sociopolitical behavior, to address feature selection from gene expression data. It starts with an initial population and searches the solution space through specially designed operators to converge to an optimal or near-optimal solution. Although ICA has proved to be a promising search technique for optimization problems, it is easily trapped in local optima, which prevents it from reaching better results. Tabu search (TS), as a local search technique, can make up for this deficiency. It avoids convergence to local optima by means of a flexible memory system that includes an aspiration criterion and a tabu list. Because TS is a local search, its convergence speed depends largely on the initial solution. The parallelism of the population in ICA helps TS find promising regions of the search space quickly. The hybrid algorithm HICATS thus combines the advantages of ICA and TS for feature selection.
The rest of the paper is organized as follows. Section
ICA is a population-based stochastic optimization technique, which was proposed by Atashpaz-Gargari and Lucas [
The flowchart of ICA scheme.
Tabu search (TS) was proposed by Glover in 1986 [
As a global search metaheuristic, ICA is well suited to combinatorial optimization problems; however, the diversity of the population is greatly reduced after some generations, and the algorithm may converge prematurely. TS, as a local search technique, can exploit the neighbors of current solutions to obtain better candidates, but it takes much time to reach the global or near-global optimum. Incorporating TS into ICA as a local improvement strategy enables the method to maintain population diversity and avoid misleading local optima. Each binary coded string (country) represents a set of genes, which is evaluated by the fitness function. TS is applied to the imperialist of each empire to select the new imperialist and avoid premature convergence. The framework of the proposed algorithm HICATS is shown in Figure
The framework of the proposed algorithm HICATS.
Set the parameters of the algorithm and initialize the countries with a binary representation (0 and 1). Evaluate each country in the population using a support vector machine (SVM) classifier. The fitness is determined by the SVM classification accuracy and the number of selected genes in the feature subset. Then empires are generated according to the fitness values of the countries.
Apply TS to the imperialist of each empire. Generate and evaluate the neighbors of the imperialist. Select a new solution according to the tabu list and the aspiration criterion to replace the old imperialist.
Apply a learning mechanism on colonies which is the same as Baldwinian Learning (BL) mechanism [
Compare the objective values between imperialist and its colonies in the same empire. Exchange the positions of imperialist and its colony when a colony is better than its imperialist.
Calculate the total power of an empire and compare all empires; then eliminate the weakest empire when it loses all of its colonies.
If the termination condition (the predefined max iterations) is not fulfilled, go back to Step
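The empire formation in the first step and the imperialist/colony exchange described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `form_empires` and `exchange_positions`, the dictionary layout of an empire, and the round-robin distribution of colonies are assumptions made for the example (the text only states that empires are generated according to fitness values).

```python
def form_empires(countries, fitness, n_imperialists):
    """Rank countries by fitness; the best ones become imperialists."""
    ranked = sorted(countries, key=fitness, reverse=True)
    empires = [{"imperialist": c, "colonies": []}
               for c in ranked[:n_imperialists]]
    # Assumption: the remaining countries are dealt out round-robin.
    # The paper distributes colonies according to fitness instead.
    for i, colony in enumerate(ranked[n_imperialists:]):
        empires[i % n_imperialists]["colonies"].append(colony)
    return empires


def exchange_positions(empire, fitness):
    """Promote the best colony if it is fitter than its imperialist."""
    colonies = empire["colonies"]
    if not colonies:
        return
    best = max(range(len(colonies)), key=lambda i: fitness(colonies[i]))
    if fitness(colonies[best]) > fitness(empire["imperialist"]):
        # Swap roles: the colony becomes imperialist and vice versa.
        empire["imperialist"], colonies[best] = (
            colonies[best], empire["imperialist"])
```

With the settings reported later in the paper (15 countries, 4 imperialists), `form_empires` would yield 4 empires sharing 11 colonies. Any callable that scores a country can be passed as `fitness`; higher is assumed to be better.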
HICATS thus integrates two quite different search strategies for feature selection: the ICA operations explore new regions and provide good starting solutions for TS, while TS exploits the neighbors of the imperialist for better candidates and uses its memory system to avoid getting trapped in local optima. The evaluation function, which combines the accuracy of the SVM with the number of selected genes in the feature subset, helps HICATS find the most salient features with less redundancy. A reliable selection method for gene classification should have high classification accuracy and contain few redundant genes. Details of each component of HICATS are described in the following sections.
In this paper, we randomly generate a binary coded string (country) composed of 0s and 1s whose length equals the dimension of the gene expression data. A value of 1 indicates that the corresponding gene is selected, while a value of 0 indicates that it is not. To illustrate, consider the following example. Assume that the gene data have 10 dimensions (10 features:
An illustrated example with generated subset and individual representation.
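The binary encoding can be illustrated with a short sketch. The function names and the example bit string below are hypothetical and chosen only for illustration; gene positions are reported as 0-based indices.

```python
import random


def random_country(n_genes, rng):
    """Generate a random binary string with one bit per gene."""
    return [rng.randint(0, 1) for _ in range(n_genes)]


def selected_genes(country):
    """Return the 0-based indices of the genes whose bit is 1."""
    return [i for i, bit in enumerate(country) if bit == 1]


# Hypothetical 10-gene country: genes 0, 3, 5 and 8 are selected.
example = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
print(selected_genes(example))  # [0, 3, 5, 8]

# A random country for a dataset with 10 genes.
print(random_country(10, random.Random(0)))
```

The length of each country always equals the number of genes in the dataset, so the size of the selected subset is simply the number of 1 bits.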
After generating the population, we should evaluate the countries and initialize empires composed of imperialists and colonies. The fitness value of a country is estimated by the fitness function
In HICATS, assimilation is an important operation that can substantially help the evolution of the colonies. In this paper, the idea of continuous BL is introduced into HICATS for the assimilation of colonies by their imperialist. This strategy utilizes specific differential information from the imperialist, that is, the differential information between imperialist and colony
A colony is assimilated by an imperialist.
The feature selection of gene expression data needs to consider the classification accuracy and the number of selected informative genes. Hence, the fitness function is defined as follows:
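As a rough illustration of how the two criteria could be traded off, the sketch below assumes a weighted sum of the classification accuracy and the fraction of genes that were discarded. The functional form, the name `fitness`, and the default weights 0.8 and 0.2 are assumptions made for this example, not the authors' definition.

```python
def fitness(accuracy, n_selected, n_total, w_acc=0.8, w_size=0.2):
    """Hypothetical weighted fitness; higher is better.

    accuracy   -- classification accuracy in [0, 1]
    n_selected -- number of genes with bit 1 in the country
    n_total    -- total number of genes in the dataset
    """
    if n_selected == 0:
        # An empty subset cannot be classified, so score it lowest.
        return 0.0
    reduction = 1.0 - n_selected / n_total
    return w_acc * accuracy + w_size * reduction


# At equal accuracy, the smaller subset receives the higher score.
print(fitness(1.0, 3, 5327) > fitness(1.0, 10, 5327))  # True
```

Under this assumed form, accuracy dominates the score, and the subset size mainly breaks ties between countries that classify equally well.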
In HICATS, each colony can be assimilated by its imperialist and thereby improve itself, so the algorithm converges quickly. However, the classical ICA easily falls into local optima. Therefore, in this paper TS performs exploitation, searching for better solutions near the current imperialist and escaping from local optima. How the neighboring solutions and the tabu list are produced is very important in TS. In our study, nearby solutions are produced by applying the NOT operation to one bit of the solution. For example, if the gene expression data have 10 dimensions and the current country is
Producing nearby solutions in TS.
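The neighbor generation and one tabu search move can be sketched as follows. This is a simplified illustration under stated assumptions: the tabu list is assumed to store recently flipped bit positions in a fixed-length queue, and the aspiration criterion is assumed to admit a tabu move only when it improves on the best fitness found so far. The names `flip_neighbors` and `tabu_step` are hypothetical.

```python
from collections import deque


def flip_neighbors(solution):
    """Yield (position, neighbor) pairs that differ in exactly one bit."""
    for i in range(len(solution)):
        neighbor = list(solution)
        neighbor[i] = 1 - neighbor[i]  # NOT operation on bit i
        yield i, neighbor


def tabu_step(current, best_fitness, fitness, tabu):
    """Move to the best admissible neighbor of `current`.

    A move is admissible if its position is not tabu, or if it beats
    the best fitness found so far (assumed aspiration criterion).
    Returns the new solution and the updated best fitness.
    """
    chosen = None
    for pos, neighbor in flip_neighbors(current):
        score = fitness(neighbor)
        if pos in tabu and score <= best_fitness:
            continue  # tabu move without aspiration
        if chosen is None or score > chosen[2]:
            chosen = (pos, neighbor, score)
    if chosen is None:
        return current, best_fitness
    pos, neighbor, score = chosen
    tabu.append(pos)  # a deque with maxlen drops the oldest entry
    return neighbor, max(best_fitness, score)


# Usage with a toy fitness (number of 1 bits) and a tabu tenure of 2.
tabu = deque(maxlen=2)
solution, best = tabu_step([0, 0, 0], 0, sum, tabu)
print(solution, best)  # [1, 0, 0] 1
```

In HICATS the starting solution would be the imperialist of an empire and the fitness would come from the SVM-based evaluation; the toy fitness here only keeps the example self-contained.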
In this paper, except for SRBCT, which was obtained by continuous image analysis, the gene microarray datasets were obtained by the oligonucleotide technique. Presently, there is no standard method for processing gene expression data. Therefore, we designed the algorithm HICATS to perform feature selection and improve classification accuracy. The datasets consist of 10 sets of gene expression data, which can be downloaded from
Cancer-related human gene microarray datasets used in this study.
Dataset name | Description |
---|---|
9_Tumors | Oligonucleotide microarray gene expression profiles for the chemosensitivity profiles of 232 chemical compounds |
11_Tumors | Transcript profiles of 11 common human tumors for carcinomas of the prostate, breast, colorectum, lung, liver, gastroesophagus, pancreas, ovary, kidney, and bladder/ureter |
Brain_Tumor 1 | DNA microarray gene expression profiles derived from 99 patient samples. The medulloblastomas included primitive neuroectodermal tumors, atypical teratoid/rhabdoid tumors, malignant gliomas, and the medulloblastomas activated by the sonic hedgehog pathway |
Brain_Tumor 2 | Transcript profiles of four malignant gliomas, including classic glioblastoma, nonclassic glioblastoma, classic oligodendroglioma, and nonclassic oligodendroglioma |
Leukemia 1 | DNA microarray gene expression profiles of acute myelogenous leukemia (AML) and acute lymphoblastic leukemia (ALL) of B-cell and T-cell |
Leukemia 2 | Gene expression profiles of a chromosomal translocation to distinguish mixed-lineage leukemia, ALL, and AML |
Lung_Cancer | Oligonucleotide microarray transcript profiles of 203 specimens, including lung adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinomas, small-cell lung carcinomas, and normal lung tissue |
SRBCT | cDNA microarray gene expression profiles of small, round blue cell tumors, which include neuroblastoma, rhabdomyosarcoma, non-Hodgkin’s lymphoma, and the Ewing family of tumors |
Prostate_Tumor | cDNA microarray gene expression profiles of prostate tumors. Based on MUC1 and AZGP1 gene expression, the prostate cancer can be distinguished as a subtype associated with an elevated risk of recurrence or with a decreased risk of recurrence |
DLBCL | DNA microarray gene expression profiles of DLBCL, in which the DLBCL can be identified as cured versus fatal or refractory disease |
Description of gene expression datasets.
Dataset number | Dataset name | Number of samples | Number of genes | Number of classes |
---|---|---|---|---|
1 | 9_Tumors | 60 | 5726 | 9 |
2 | 11_Tumors | 174 | 12533 | 11 |
3 | Brain_Tumors 1 | 90 | 5920 | 5 |
4 | Brain_Tumors 2 | 50 | 10367 | 4 |
5 | Leukemia 1 | 72 | 5327 | 3 |
6 | Leukemia 2 | 72 | 11225 | 3 |
7 | Lung_Cancer | 203 | 12600 | 5 |
8 | SRBCT | 83 | 2308 | 4 |
9 | Prostate_Tumor | 102 | 10509 | 2 |
10 | DLBCL | 77 | 5469 | 2 |
The parameter values for HICATS are shown in Table
Parameter settings for HICATS.
Parameters | Values |
---|---|
The number of countries | 15 |
The number of imperialists | 4 |
The number of colonies | 11 |
The number of iterations (generations) | 50 |
 | 0.8 |
 | 0.2 |
In this paper, the hybrid algorithm HICATS, which incorporates ICA and TS, is used to perform feature selection for gene expression data. TS is embedded in ICA to prevent the method from getting trapped in a local optimum, while applying TS to the imperialist improves performance and speeds up the convergence of TS.
The experimental results, comprising the classification accuracy and the number of selected genes obtained by HICATS over 10 independent runs on the 10 datasets (11_Tumors, 9_Tumors, SRBCT, Leukemia 1, Leukemia 2, DLBCL, Prostate_Tumor, Lung_Cancer, Brain_Tumors 1, and Brain_Tumors 2), are shown in Tables
The computational results obtained by our proposed algorithm HICATS for 10 independent runs on 11_Tumors, 9_Tumors, and SRBCT datasets.
Runs | 11_Tumors Acc. (%) | 11_Tumors selected genes | 9_Tumors Acc. (%) | 9_Tumors selected genes | SRBCT Acc. (%) | SRBCT selected genes |
---|---|---|---|---|---|---|
1 | | | 75.00 | 245 | 100 | 10 |
2 | 96.55 | 302 | 76.67 | 262 | 100 | 14 |
3 | 94.83 | 330 | 75.00 | 233 | 100 | 15 |
4 | 95.40 | 268 | 75.00 | 249 | 100 | 13 |
5 | 96.55 | 290 | 76.67 | 257 | | |
6 | 96.55 | 356 | 81.67 | 242 | 100 | 12 |
7 | 94.83 | 323 | | | 100 | 16 |
8 | 94.83 | 349 | 76.67 | 238 | 100 | 9 |
9 | 95.98 | 275 | 81.67 | 247 | 100 | 9 |
10 | 95.40 | 295 | 81.67 | 253 | 100 | 10 |
Ave. | 95.86 | 307.5 | 78.33 | 248.5 | 100 | 11.70 |
The computational results obtained by our proposed algorithm HICATS for 10 independent runs on Leukemia 1, Leukemia 2, and DLBCL datasets.
Runs | Leukemia 1 Acc. (%) | Leukemia 1 selected genes | Leukemia 2 Acc. (%) | Leukemia 2 selected genes | DLBCL Acc. (%) | DLBCL selected genes |
---|---|---|---|---|---|---|
1 | | | 100 | 8 | 100 | 4 |
2 | 100 | 3 | 100 | 10 | | |
3 | 100 | 3 | 100 | 6 | 100 | 5 |
4 | 100 | 3 | 100 | 6 | 100 | 3 |
5 | 100 | 3 | 100 | 7 | 100 | 4 |
6 | 100 | 3 | 100 | 8 | 100 | 3 |
7 | 100 | 3 | | | 100 | 4 |
8 | 100 | 3 | 100 | 7 | 100 | 5 |
9 | 100 | 3 | 100 | 5 | 100 | 6 |
10 | 100 | 3 | 100 | 6 | 100 | 4 |
Ave. | 100 | 3 | 100 | 6.80 | 100 | 4.10 |
The computational results obtained by our proposed algorithm HICATS for 10 independent runs on Prostate_Tumor, Lung_Cancer, Brain_Tumor 1, and Brain_Tumor 2 datasets.
Runs | Prostate_Tumor Acc. (%) | Prostate_Tumor selected genes | Lung_Cancer Acc. (%) | Lung_Cancer selected genes | Brain_Tumor 1 Acc. (%) | Brain_Tumor 1 selected genes | Brain_Tumor 2 Acc. (%) | Brain_Tumor 2 selected genes |
---|---|---|---|---|---|---|---|---|
1 | 98.04 | 8 | 95.57 | 6 | | | 94 | 5 |
2 | 97.06 | 7 | 96.06 | 6 | 93.33 | 12 | 90 | 6 |
3 | | | 96.06 | 9 | 94.44 | 9 | 94 | 7 |
4 | 98.04 | 7 | 95.57 | 8 | 91.11 | 10 | 92 | 5 |
5 | 97.06 | 6 | 96.06 | 7 | 93.33 | 8 | 92 | 3 |
6 | 98.04 | 7 | 97.04 | 11 | 92.22 | 14 | 94 | 8 |
7 | 97.06 | 10 | 96.06 | 8 | 91.11 | 7 | 92 | 4 |
8 | 98.04 | 8 | 96.06 | 7 | 93.33 | 9 | | |
9 | 98.04 | 9 | 96.06 | 9 | 94.44 | 6 | 90 | 9 |
10 | 98.04 | 5 | | | 93.33 | 8 | 94 | 8 |
Ave. | 97.75 | 7.2 | 96.16 | 7.8 | 93.10 | 8.9 | 92.60 | 5.8 |
To verify the effectiveness of the proposed algorithm, we first compare the performance of HICATS with pure ICA, using SVM as the classifier under the same experimental conditions; we then compare HICATS with other optimization algorithms on several benchmark classification datasets. The comparison results, including the best classification accuracy and the number of selected genes obtained by HICATS and ICA, are given in Table
Classification accuracies and selected genes obtained by HICATS and ICA for gene expression data.
Datasets | HICATS Acc. (%) | HICATS selected genes | ICA Acc. (%) | ICA selected genes |
---|---|---|---|---|
9_Tumors | | | 76.67 | 282 |
11_Tumors | | | 95.98 | 293 |
Brain_Tumor 1 | | | 91.11 | 8 |
Brain_Tumor 2 | | | 92 | 5 |
Leukemia 1 | | | 97.50 | 7 |
Leukemia 2 | | | 97.32 | 8 |
Lung_Cancer | | | 95.57 | 12 |
SRBCT | | | 100 | 10 |
Prostate_Tumor | | | 97.06 | 6 |
DLBCL | | | 97.50 | 5 |
Classification accuracies of gene expression data obtained via different classification methods.
Datasets | | NN (non-SVM) | PNN (non-SVM) | OVR (MC-SVM) | OVO (MC-SVM) | DAG (MC-SVM) | WW (MC-SVM) | CS (MC-SVM) | HICATS (SVM, OVR) |
---|---|---|---|---|---|---|---|---|---|
9_Tumors | 78.33 | 19.38 | 34.00 | 65.10 | 58.57 | 60.24 | 62.24 | 65.33 | |
11_Tumors | 93.10 | 54.14 | 77.21 | 94.68 | 90.36 | 90.36 | 94.68 | 95.30 | |
Brain_Tumor 1 | 94.44 | 84.72 | 79.61 | 91.67 | 90.56 | 90.56 | 90.56 | 90.56 | |
Brain_Tumor 2 | 94.00 | 60.33 | 62.83 | 77.00 | 77.83 | 77.83 | 73.33 | 72.83 | |
Leukemia 1 | 100 | 76.61 | 85.00 | 97.50 | 91.32 | 96.07 | 97.50 | 97.50 | |
Leukemia 2 | 100 | 91.03 | 83.21 | 97.32 | 95.89 | 95.89 | 95.89 | 95.89 | |
Lung_Cancer | 96.55 | 87.80 | 85.66 | 96.05 | 95.59 | 95.59 | 95.55 | 96.55 | |
SRBCT | 100 | 91.03 | 79.50 | 100 | 100 | 100 | 100 | 100 | |
Prostate_Tumor | 92.16 | 79.18 | 79.18 | 92.00 | 92.00 | 92.00 | 92.00 | 92.00 | |
DLBCL | 100 | 89.64 | 80.89 | 97.50 | 97.50 | 97.50 | 97.50 | 97.50 | |
(
The number of selected genes from datasets between HICATS and IBPSO.
Datasets | HICATS genes selected | HICATS percentage of genes selected | IBPSO genes selected | IBPSO percentage of genes selected |
---|---|---|---|---|
9_Tumors | | 0.045 | 2941 | 0.51 |
11_Tumors | | 0.022 | 3206 | 0.26 |
Brain_Tumor 1 | | 0.001 | 754 | 0.13 |
Brain_Tumor 2 | | 0.0003 | 1197 | 0.12 |
Leukemia 1 | | 0.0006 | 1034 | 0.19 |
Leukemia 2 | | 0.0004 | 1292 | 0.12 |
Lung_Cancer | | 0.0005 | 1897 | 0.15 |
SRBCT | | 0.004 | 431 | 0.19 |
Prostate_Tumor | | 0.0005 | 1294 | 0.12 |
DLBCL | | 0.0005 | 1042 | 0.19 |
Average | | 0.00097 | 1117.6 | 0.15 |
The convergence graphs of the best and average classification accuracy obtained by HICATS for 9_Tumors, 11_Tumors, SRBCT, and DLBCL are shown in Figures
The convergence graphs of the best and average accuracy classification by HICATS algorithm on 9_Tumors and 11_Tumors datasets.
The convergence graphs of the best and average accuracy classification by HICATS algorithm on SRBCT and DLBCL datasets.
In this paper, a hybrid algorithm, HICATS, which incorporates the binary imperialist competition algorithm and tabu search, is used to perform feature selection, and an SVM with the one-versus-the-rest strategy serves as the evaluator of HICATS for gene expression data classification problems. This work combines the advantages of two different search mechanisms to obtain higher classification accuracy on gene expression data. In general, the classification performance of HICATS is as good as that of IBPSO; however, HICATS is superior to IBPSO and the other methods in terms of the number of selected genes. To avoid premature convergence of the imperialists, the local search strategy TS is embedded in ICA and applied to the imperialist of each empire, where it exploits the neighbors of the imperialist to speed up convergence and assist the imperialist's evolution. Experimental results show that our method effectively classifies the samples with a reduced set of genes. In future work, the imperialist competition algorithm will be combined with other intelligent search strategies to select informative genes.
The authors declare that they have no competing interests.