According to the recent reports of the World Health Organization (WHO), stomach cancer is the fifth most common cancer in the world and more than 70% of the new cases of stomach cancer occurred in developing countries (mainly in China) [
Traditionally, LN metastasis diagnosis is mainly implemented by preoperative imaging such as abdominal ultrasonography (US) and computed tomography (CT), but their diagnostic accuracy is limited. It was reported that the detection rate of lymph nodes around the stomach was 18.7% in CT and 5.0% in US [
Recently, increasing evidence suggests a critical role for DNA methylation in human carcinogenesis [
In this study, we grouped the stomach cancer data into three categories (normal, LN metastasis negative, and LN metastasis positive) according to the clinical information. A three-step feature selection method was applied to identify the key genes. To evaluate the reliability of the selected biomarkers, we used the random forest algorithm to predict the categories with and without the three-step feature selection method. The results showed that prediction accuracy improved substantially with the selected biomarkers, which indirectly supports their reliability.
Feature selection is commonly used to remove irrelevant and redundant features from the original feature set. The minimum redundancy maximum relevance (mRMR) method uses mutual information to find a set of features that have the highest relevance to the target class and are also maximally dissimilar to each other. However, mRMR is computationally expensive. In this paper, differential methylation analysis was therefore combined with mRMR to perform the preliminary feature selection. To further obtain the most informative features for classification, a wrapper feature selection method based on a genetic algorithm was then used to obtain the final optimal features.
To preliminarily obtain the probes that are closely related to the phenotype, we applied DMR analysis, which aims to identify probes that are significantly differentially methylated between phenotypes. We compared the methylation status of each probe in the normal samples with that in the cancer samples, and the methylation status of each probe in the LN-negative samples with that in the LN-positive samples. Differentially methylated probes were determined with the Mann–Whitney
The density of the mean difference and BH-adjusted
The classic mRMR method was applied to filter the previously selected probes, and the probes were ranked according to their mRMR scores. Since there is no explicit threshold, only the top 10% of probes were retained and used as input to the next feature selection step. The results of mRMR filtering are shown in Figure
The distribution of mRMR scores with respect to features. The dashed line corresponds to the 10% cutoff used. (a) Normal versus cancer. (b) LN negative versus LN positive.
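The greedy mRMR ranking and the 10% cutoff described above can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes the methylation values have already been discretized, it uses the difference form of the criterion (relevance minus mean redundancy), and the names `mutual_information`, `mrmr_rank`, and `top_fraction` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def mrmr_rank(features, labels):
    """Greedy mRMR: return feature indices in order of selection.

    `features` is a list of columns, each a list of discretized values.
    At each step the feature maximizing
    relevance(feature, labels) - mean redundancy(feature, selected)
    is added to the selected set.
    """
    relevance = [mutual_information(f, labels) for f in features]
    remaining = list(range(len(features)))
    selected = []

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(mutual_information(features[j], features[s])
                         for s in selected) / len(selected)
        return relevance[j] - redundancy

    while remaining:
        best = max(remaining, key=score)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected


def top_fraction(ranked, fraction=0.10):
    """Keep the leading `fraction` of a ranked list (at least one item)."""
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]


# Toy data (hypothetical): feature 1 duplicates feature 0, feature 2 is
# less relevant but carries information that feature 0 does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 1, 1],
]
order = mrmr_rank(features, labels)
kept = top_fraction(order)
```

In the toy data the duplicated feature is ranked last even though its relevance equals that of the first feature, which is the redundancy penalty at work. The greedy search evaluates mutual information for every candidate against every selected feature, so its cost grows quickly with the number of probes; this is consistent with applying differential methylation analysis first to reduce the candidate set.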
Performing feature selection with a genetic algorithm requires framing feature selection as an optimization problem and encoding each candidate solution as a binary string. In this paper, the random forest algorithm was used as the fitness function of the genetic algorithm, and the receiver operating characteristic (ROC) value was used to measure fitness. The details are discussed in the Materials and Methods section. The normal versus cancer classification and the LN negative versus LN positive classification were treated independently.
For the normal versus tumor classification, the summary of the ROC values in each iteration of the genetic algorithm is shown in Figure
The results of genetic algorithm-based feature selection with respect to the normal versus tumor classification. (a) The fitness improvement in the process of iteration. (b) The distribution of the number of selected probes.
The results of the genetic algorithm for the LN negative versus LN positive classification are shown in Figure
The results of genetic algorithm-based feature selection with respect to the LN negative versus LN positive classification. (a) The fitness improvement in the process of iteration. (b) The distribution of the number of selected probes.
To illustrate the necessity and effectiveness of the feature selection procedure, we compared the performance of the random forest using the three-step-selected probes with that of the random forest using only the differentially methylated probes. We generated 100 random training and testing splits for evaluation and used the AUROC (area under the ROC curve) as the performance measure. The AUROC of a classifier is the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Simply put, a larger AUROC means higher discriminatory power. The box plots in Figure
The distribution of the AUC value with different methods. (a) AUC value with different methods with respect to the normal versus tumor classification. (b) AUC value with different methods with respect to the LN negative versus LN positive classification.
From the plots, we can see that with the three-step feature selection procedure, the classifier performs better than with DMR analysis alone for both the normal versus tumor and the LN negative versus LN positive classifications. Moreover, both the three-step feature selection and the DMR-only analysis perform well (all AUC values greater than 0.99) for the normal versus tumor classification.
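The probabilistic interpretation of the AUROC used in this evaluation can be made concrete with a short sketch. It computes the AUROC directly from its definition, as the fraction of positive/negative pairs in which the positive instance receives the higher score, with ties counted as one half. The function name `auroc` and the example scores are hypothetical and are not taken from the study.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    `scores` are classifier outputs (higher means more likely positive)
    and `labels` are 1 for positive and 0 for negative instances.
    Tied pairs contribute 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one of the four positive/negative pairs is
# ranked in the wrong order.
example = auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred samples; rank-based formulas give the same value more efficiently for larger data.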
The clinical data and the TCGA level 3 DNA methylation data were downloaded from The Cancer Genome Atlas (TCGA) project [
The sample number for each phenotype.
Normal | Cancer, LN negative | Cancer, LN positive | Cancer, unclassified |
---|---|---|---|
27 | 94 | 189 | 12 |
To identify differentially methylated probes, for each probe, we ranked the samples and compared only the lower methylation quintile sample to the upper methylation quintile sample between two phenotypes using the Mann–Whitney
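A per-probe rank test followed by Benjamini-Hochberg (BH) adjustment can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in this study: the p-value uses the normal approximation to the Mann–Whitney statistic without tie or continuity correction, the quintile-based sample selection is omitted, and the function names, the probe identifiers, and the cutoff `alpha=0.05` are hypothetical.

```python
from math import erfc, sqrt


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney p-value (normal approximation, no tie
    or continuity correction)."""
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0
            for a in x for b in y)
    mean = n1 * n2 / 2.0
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return erfc(abs(z) / sqrt(2.0))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differential_probes(group_a, group_b, alpha=0.05):
    """Return probes whose BH-adjusted p-value is below `alpha`.

    `group_a` and `group_b` map probe IDs to lists of methylation
    values, one list per phenotype. `alpha` is an assumed cutoff.
    """
    probes = sorted(group_a)
    pvals = [mann_whitney_p(group_a[p], group_b[p]) for p in probes]
    return [p for p, q in zip(probes, bh_adjust(pvals)) if q < alpha]


# Hypothetical data: probe_1 differs between groups, probe_2 does not.
group_a = {"probe_1": [0.80 + 0.01 * i for i in range(10)],
           "probe_2": [0.50 + 0.01 * i for i in range(10)]}
group_b = {"probe_1": [0.20 + 0.01 * i for i in range(10)],
           "probe_2": [0.50 + 0.01 * i for i in range(10)]}
hits = differential_probes(group_a, group_b)
```

Adjusting across all probes matters because hundreds of thousands of probes are tested at once; without it, a fixed p-value cutoff would admit many probes by chance alone.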
Genetic algorithms are optimization tools that search for a solution by simulating evolution through random variation and natural selection. For feature selection, the individuals are subsets of candidate features encoded as binary strings, in which each bit indicates whether the corresponding feature is included in the subset. The parameters used for the genetic algorithm were set as follows [
- Population size: 100
- Maximum number of generations: 100
- Selection method: tournament selection with size = 2
- Elitism rate: 10 individuals
- Crossover: 2-point crossover with probability 0.6
- Mutation: random mutation with probability 0.05
The initial population was created by producing chromosomes with a random 30% of the predictors. The fitness function of every individual was defined as the ROC value of the classification method.
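The configuration above can be illustrated with a minimal genetic algorithm sketch. Several points are assumptions rather than details confirmed by the text: mutation is applied per bit, elite individuals are copied unchanged, and the fitness function is passed in as a callable. In the study the fitness is the ROC value of a random forest; here a hypothetical `toy_fitness` stands in so that the example is self-contained. All function names are illustrative.

```python
import random


def tournament(population, fitnesses, rng, size=2):
    """Return the fittest of `size` randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitnesses[i])]


def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def mutate(chrom, rng, rate):
    """Flip each bit independently with probability `rate` (assumed)."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chrom]


def ga_select(n_features, fitness, pop_size=100, generations=100,
              elite=10, p_cross=0.6, p_mut=0.05, init_fraction=0.3,
              seed=0):
    """Return (best chromosome, best fitness per generation)."""
    rng = random.Random(seed)
    k = max(1, int(n_features * init_fraction))
    population = []
    for _ in range(pop_size):
        # Each initial chromosome switches on a random 30% of features.
        chosen = set(rng.sample(range(n_features), k))
        population.append([int(i in chosen) for i in range(n_features)])
    history = []
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        ranked = sorted(range(pop_size), key=lambda i: fitnesses[i],
                        reverse=True)
        history.append(fitnesses[ranked[0]])
        # Elitism: carry the best individuals over unchanged.
        next_pop = [population[i][:] for i in ranked[:elite]]
        while len(next_pop) < pop_size:
            a = tournament(population, fitnesses, rng)
            b = tournament(population, fitnesses, rng)
            if rng.random() < p_cross:
                a, b = two_point_crossover(a, b, rng)
            next_pop.extend([mutate(a, rng, p_mut), mutate(b, rng, p_mut)])
        population = next_pop[:pop_size]
    fitnesses = [fitness(c) for c in population]
    best = max(range(pop_size), key=lambda i: fitnesses[i])
    history.append(fitnesses[best])
    return population[best], history


INFORMATIVE = {0, 1, 2, 3, 4}  # hypothetical "true" features


def toy_fitness(chrom):
    """Stand-in for the classifier ROC value: reward informative
    features and lightly penalize every other selected feature."""
    hits = sum(chrom[i] for i in INFORMATIVE)
    return hits - 0.1 * (sum(chrom) - hits)


best, history = ga_select(40, toy_fitness, pop_size=30, generations=30)
```

Because the elite individuals are copied without mutation and the toy fitness is deterministic, the best fitness per generation never decreases. With a random forest as the fitness function, each evaluation involves training a classifier, so the population size and the number of generations dominate the running time.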
Stomach cancer is the fifth most common cancer in the world, and most new cases occur in developing countries, especially in China. A growing body of evidence has demonstrated that LN metastasis is an independent risk factor for stomach cancer recurrence in patients following curative resection, and that the overall survival of LN metastasis-negative stomach cancer patients is significantly longer than that of LN metastasis-positive patients.
Given the critical role of DNA methylation in human carcinogenesis, this study focused on predicting LN metastasis status from DNA methylation data. Because DNA methylation data have an inherent disadvantage, namely a small number of samples relative to the large number of probes, we applied a three-step feature selection procedure to extract a small subset of representative features. First, we applied differential methylation analysis to identify the probes that are significantly differentially methylated between phenotypes. Then, the mRMR method was used to remove the redundant features remaining after the first filtering step. Finally, a wrapper method based on a genetic algorithm was used to perform the final feature selection. We obtained 20 probes related to 39 genes as inputs for the normal versus tumor prediction, and 12 probes related to 14 genes as inputs for the LN negative versus LN positive prediction (see Table
Identified biomarkers for each prediction.
Normal versus tumor biomarkers | LN negative versus LN positive biomarkers |
---|---|
|
|
To evaluate the effect of the three-step feature selection on prediction performance, we downloaded the DNA methylation data and clinical data from the TCGA project. The AUROC was used as the performance measure. The experimental results showed that the three-step feature selection can substantially improve prediction performance, especially when predicting LN negative versus LN positive. The source code used in this paper can be obtained at
The authors declare that they have no conflicts of interest.
This work was partially funded by the State Key Development Program for Basic Research of China (2013CB967402) and the National Natural Science Foundation of China (31671299, 61603161). The authors would like to thank the reviewers in advance for their comments.