Esophageal squamous cell cancer (ESCC) is one of the most common fatal human cancers. Identifying biomarkers for early detection could be a promising strategy to decrease mortality. Previous studies used microarray techniques to identify more than one hundred genes; for clinical use, however, a small set of biomarkers is desirable. This study proposes a sequential forward feature selection algorithm to design decision tree models for discriminating ESCC from normal tissues. Two potential biomarkers, RUVBL1 and CNIH, were identified and validated on two publicly available microarray datasets. To test the discrimination ability of the two biomarkers, 17 pairs of expression profiles of ESCC and normal tissues from Taiwanese male patients were measured using microarray techniques. The classification accuracies of the two biomarkers in all three datasets were higher than 90%. Interpretable decision tree models were constructed to analyze the expression patterns of the two biomarkers. RUVBL1 was consistently overexpressed in all three datasets, whereas CNIH expression was inconsistent, possibly because the major risk factors for ESCC differ across geographic areas.
Esophageal cancer is the sixth most common fatal human cancer in the world [
Early detection of ESCC could be a promising strategy to decrease mortality. Microarray techniques are extensively used to measure the expression levels of a large number of genes simultaneously and provide a better understanding of the molecular mechanism of ESCC carcinogenesis. Microarray expression data can be analyzed to identify, and give insight into, clinical biomarkers for ESCC detection. Several efforts have been made to study gene expression profiles and differentially expressed genes for discovering biomarkers using microarray techniques [
The incorporation of classification and feature selection algorithms has been widely used to identify promising features for various classification problems such as ubiquitylation sites [
For detecting ESCC with biomarkers, simple decision tree methods capable of generating human-interpretable rules were chosen instead of black-box methods such as support vector machines (SVM). In this study, a sequential forward feature selection algorithm is proposed to identify the genes best suited to decision tree classification; it is capable of selecting a small set of biomarkers with human-interpretable rules. Two publicly available microarray datasets obtained from the Gene Expression Omnibus (GEO) database [
In order to identify and validate genomic biomarkers for ESCC, two microarray datasets of GSE23400 [
We selected 17 incident male ESCC patients who regularly consumed tobacco and alcoholic beverages to validate the candidate biomarkers obtained from the above two datasets. All of them underwent total esophagectomy at Kaohsiung Medical University Hospital (KMUH). One pair of resected tumor and adjacent normal tissue from each patient was immediately put into a portable container with dry ice and then transferred to and maintained in a nitrogen tank until analysis. After review by a qualified pathologist (Dr. CC Wu), the tumor parts were found to have cancer cells in >80% of the tissues, whereas the normal parts were microscopically tumor-free. This study was in compliance with the Helsinki Declaration and approved by the internal review board of KMUH. All patients provided their written informed consent.
Total RNA from each pair was isolated by a single-step guanidinium isothiocyanate method using the Trizol Reagent total RNA Purification Kit (Invitrogen Inc., USA) according to the manufacturer’s instructions. The yield and quality of RNA were assessed by spectrophotometry and the Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA). All paired samples had an A260/A280 ratio between 1.8 and 2.2 and an A260/A230 ratio above 1.0 and were therefore eligible for the subsequent array experiment.
After RNA isolation, cDNA was prepared from each sample by Reverse Transcription System (Cat: A3500, Promega Corporation, USA). For the appropriate efficiency of reverse transcription, 1
Human oligonucleotide DNA microarrays (Human Whole Genome OneArray) from Phalanx Biotech Group (Hsinchu, Taiwan) were used. The Human Whole Genome OneArray (HOAv4.3, Phalanx Biotech Group, Taiwan) contains 32,050 60-mer oligonucleotide probes, including 28,703 probes corresponding to the annotated genes in Unigene v175 and RefSeq database, 2,265 experimentally defined probes, and 1,082 control probes.
The detailed experimental method is described elsewhere [
Spots in each array whose foreground median intensity at 532 nm was greater than or equal to the background median intensity plus 3 standard deviations at 532 nm were flagged as “Present” and included in further analysis. To evaluate the quality of each array in the entire array experiment, three evaluation steps were performed: basic, reproducible, and diagram. In the basic step, three parameters were considered: the percentage of “Present” spots among all spots, the average intensity of “Present” spots, and the coefficient of variation of intensity for control spots across the entire array. If any two parameters of an array fell outside 1.5-fold of the interquartile range (25th–75th) of the same parameters for all arrays, that array was excluded. The remaining arrays were then evaluated in the reproducible step, which repeated arrays of the same sample passed when their Pearson’s correlation coefficient was larger than 0.95 and their “2-fold percentage” was less than 15%. The “2-fold percentage” was the percentage of probes, among all probes, for which the ratio of the same probe between two arrays exceeded 2-fold. In the final diagram step, the density plot of repeated arrays was used to examine the intensity profile of each array. An array passed if its profile was similar to those of the other arrays in the same phenotype group. When an array passed all three steps, the raw spot intensities were log2-transformed for subsequent analysis. To adjust for systematic experimental variation and dye effects, global Lowess normalization was performed within repeated arrays of the same sample and between samples. A spot was included in further analysis when it was “Present” in at least one of the qualified arrays.
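The “Present” flag and the reproducible step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the software used in this study: the function names are hypothetical, the standard deviation is assumed to be that of the background signal, the correlation is assumed to be computed on the intensities as supplied, and “exceeded 2-fold” is assumed to apply in either direction.

```python
from math import sqrt
from statistics import mean


def is_present(fg_median, bg_median, bg_sd):
    """Flag a spot as "Present": foreground >= background + 3 SD (assumed background SD)."""
    return fg_median >= bg_median + 3 * bg_sd


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long intensity lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def two_fold_percentage(a, b):
    """Percentage of probes whose intensity ratio between two arrays exceeds 2-fold.

    Assumption: the ratio is taken in either direction, and intensities are positive.
    """
    exceeded = sum(1 for u, v in zip(a, b) if max(u, v) / min(u, v) > 2)
    return 100.0 * exceeded / len(a)


def passes_reproducibility(a, b, min_r=0.95, max_two_fold_pct=15.0):
    """Reproducible step: r > 0.95 and 2-fold percentage < 15% (thresholds from the text)."""
    return pearson(a, b) > min_r and two_fold_percentage(a, b) < max_two_fold_pct
```

For example, two repeated arrays with intensities `[100, 200, 300, 400]` and `[105, 195, 310, 390]` would pass this check, whereas a repeat in which one of four probes differs 3-fold has a 2-fold percentage of 25% and would fail.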
Decision tree algorithms generate interpretable rules, here based on gene expression for ESCC classification, and are widely used in various classification and regression problems such as immunogenic peptides [
There are more than forty thousand probes in a microarray experiment. The selection of informative probes for discriminating between ESCC and normal tissues is a crucial step in biomarker identification. Although the decision tree algorithm J48 has a built-in function for feature selection, incorporating other feature selection algorithms can generate decision trees with higher classification accuracy [
In this study, a sequential forward feature selection algorithm (SFFS) is proposed to identify useful biomarkers for discriminating between ESCC and normal tissues. The selection process is based on the accuracy of leave-one-out cross-validation (LOOCV) using the decision tree algorithm J48. Given a dataset of sample size
The SFFS algorithm uses a greedy selection strategy under a monotonicity assumption. In contrast to univariate feature selection methods, the SFFS algorithm considers the interaction effects of sequentially selected probes on accuracy and is therefore expected to perform better. The SFFS algorithm is applied only to the training dataset to identify potential biomarkers.
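The greedy, LOOCV-driven selection can be illustrated with the following minimal Python sketch. It is not the implementation used in this study: J48 (Weka's implementation of C4.5) is replaced here by a tiny threshold tree that splits on Gini impurity, the stopping rule (stop when adding a probe no longer improves LOOCV accuracy) is an assumption, and the function names and toy data are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Most frequent class label."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_tree(X, y, depth=0, max_depth=3):
    """Tiny threshold tree (a stand-in for J48, not C4.5 itself)."""
    if depth == max_depth or len(set(y)) == 1:
        return majority(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:  # every feature is constant: no split possible
        return majority(y)
    _, j, t = best
    li = [i for i, row in enumerate(X) if row[j] <= t]
    ri = [i for i, row in enumerate(X) if row[j] > t]
    return (j, t,
            fit_tree([X[i] for i in li], [y[i] for i in li], depth + 1, max_depth),
            fit_tree([X[i] for i in ri], [y[i] for i in ri], depth + 1, max_depth))


def predict(tree, row):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are labels
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def loocv_accuracy(X, y, features):
    """Leave-one-out accuracy of the tree restricted to the given feature indices."""
    correct = 0
    for i in range(len(X)):
        train_X = [[row[f] for f in features] for k, row in enumerate(X) if k != i]
        train_y = [lab for k, lab in enumerate(y) if k != i]
        tree = fit_tree(train_X, train_y)
        correct += predict(tree, [X[i][f] for f in features]) == y[i]
    return correct / len(X)


def sffs(X, y):
    """Greedy forward selection: add the probe that most improves LOOCV accuracy."""
    selected, best_acc = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [(loocv_accuracy(X, y, selected + [f]), f) for f in remaining]
        acc, f = max(scored, key=lambda s: s[0])
        if acc <= best_acc:  # assumed stopping rule: no further improvement
            break
        selected.append(f)
        remaining.remove(f)
        best_acc = acc
    return selected, best_acc


# Hypothetical toy data: probe 1 separates the classes, probes 0 and 2 are constant.
X = [[5.0, 1.0, 7.0], [5.0, 2.0, 7.0], [5.0, 3.0, 7.0], [5.0, 4.0, 7.0],
     [5.0, 10.0, 7.0], [5.0, 11.0, 7.0], [5.0, 12.0, 7.0], [5.0, 13.0, 7.0]]
y = ["normal"] * 4 + ["ESCC"] * 4
print(sffs(X, y))  # → ([1], 1.0)
```

Because each candidate probe is scored together with the probes already selected, the search can pick up a probe that is weak on its own but useful in combination, which a univariate ranking would miss.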
To evaluate classifiers for their prediction performance, the leave-one-out cross-validation method is applied as it is widely used as an objective evaluation method for error rate estimation [
To identify potential biomarkers for ESCC, a microarray dataset GSE23400 is fetched from GEO database for the following analysis. GSE23400 consists of 53 pairs of ESCC and adjacent normal tissue from 53 patients in China. For each probe,
Because overexpressed genes are more useful for clinical diagnosis than downexpressed genes, only significantly differentially expressed probes with adjusted
To identify potential biomarkers for discriminating ESCC from adjacent normal tissues, a sequential forward feature selection (SFFS) algorithm is proposed to determine the best probe set giving the highest leave-one-out cross-validation (LOOCV) accuracy using a decision tree algorithm J48. By applying the proposed SFFS algorithm, two probes giving the highest LOOCV accuracy were selected as potential biomarkers whose gene names are RUVBL1 and CNIH.
The selection process of SFFS is shown in Figure
Selection results of the sequential forward feature selection algorithm.
RUVBL1 alone can be utilized to discriminate ESCC from normal tissues. In contrast, CNIH alone is not suitable for this purpose with an LOOCV accuracy of 75.47%. However, the combined use of RUVBL1 and CNIH provides the best LOOCV accuracy. The decision tree models trained on the whole dataset GSE23400 for RUVBL1, CNIH, and both of RUVBL1 and CNIH are shown in Figure
Decision tree classifiers based on GSE23400 dataset using (a) RUVBL1, (b) CNIH, and (c) both RUVBL1 and CNIH.
To externally validate the two potential biomarkers, RUVBL1 and CNIH, another dataset, GSE20347, consisting of 17 pairs of ESCC and normal tissues from patients in China, was fetched from the GEO database. LOOCV was applied to the GSE20347 dataset to validate the discrimination ability of RUVBL1 and CNIH for ESCC. RUVBL1 alone yields a high accuracy of 97.06% in GSE20347, consistent with its accuracy in GSE23400. The same accuracy obtained from 10-fold cross-validation (10-CV) demonstrates the usefulness of the biomarker. However, CNIH alone failed to discriminate ESCC from adjacent normal tissues, with an LOOCV accuracy of 44.18%. When both RUVBL1 and CNIH are used, the accuracy of the decision tree model remains unchanged.
Figures
Decision tree classifiers based on GSE20347 dataset using (a) RUVBL1 and (b) CNIH.
After the two biomarkers had been identified and validated on the two publicly available datasets, a total of 34 gene expression profiles from 17 pairs of matched tumor and adjacent normal tissues were measured and collected to test the discriminating ability of the two biomarkers for ESCC. The 34 profiles were generated using the Human Whole Genome OneArray (HOAv4.3, Phalanx Biotech Group, Taiwan), a platform different from the Affymetrix U133A chips used to generate the two public datasets.
The LOOCV accuracy using RUVBL1 was evaluated first. The sensitivity, specificity, and accuracy are 94.12%, 76.47%, and 85.29%, respectively. Using both RUVBL1 and CNIH, the sensitivity, specificity, and accuracy are 88.24%, 94.12%, and 91.18%, respectively. The results show that specificity and overall accuracy improve when CNIH is incorporated, although sensitivity decreases. The LOOCV and 10-CV accuracies are exactly the same. The improvement in accuracy is consistent with that in GSE23400. A comparison of classification accuracies on three datasets is shown in Table
Classification accuracies using biomarkers of RUVBL1 and CNIH.
Biomarker | GSE23400 | GSE20347 | 17 pairs |
---|---|---|---|
RUVBL1 | 95.28% | 97.06% | 85.29% |
RUVBL1 + CNIH | 99.06% | 97.06% | 91.18% |
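As a worked check, the sensitivity, specificity, and accuracy reported for the 17-pair dataset can be reproduced from confusion counts. The counts below are not stated in the text; they are inferred here as the only integer counts out of 17 tumor and 17 normal samples that match the reported percentages, and the function name is hypothetical.

```python
def metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages rounded to 2 decimals."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return tuple(round(100 * v, 2) for v in (sensitivity, specificity, accuracy))


# Inferred counts (assumption): 17 tumor and 17 normal profiles.
print(metrics(tp=16, fn=1, tn=13, fp=4))  # RUVBL1 alone → (94.12, 76.47, 85.29)
print(metrics(tp=15, fn=2, tn=16, fp=1))  # RUVBL1 + CNIH → (88.24, 94.12, 91.18)
```

Under these inferred counts, adding CNIH corrects three false positives at the cost of one additional false negative, giving 31 of 34 profiles classified correctly instead of 29.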
The decision tree models trained on the 34 profiles using RUVBL1 and both genes of RUVBL1 and CNIH are shown in Figures
Decision tree classifiers based on our dataset using (a) RUVBL1 and (b) both RUVBL1 and CNIH.
This study proposed a feature selection-based method to discover a small subset of genes that discriminates ESCC from normal tissues. The method is based on a sequential forward feature selection algorithm that designs decision tree models for classifying expression profiles of ESCC and normal tissues. Two genes, RUVBL1 and CNIH, were discovered with a high LOOCV accuracy of 99.06% in a published dataset, GSE23400 (available in the GEO database), consisting of 53 pairs of ESCC and normal tissues. The gene set was validated in another dataset, GSE20347, which consists of 17 pairs of ESCC and normal tissues and uses the same platform as GSE23400. A high LOOCV accuracy of 97.06% for GSE20347 shows the discrimination ability of the two genes.
To further test the two genes, microarray techniques were applied to measure gene expression profiles of Taiwanese patients. The dataset consists of 17 pairs of ESCC and normal tissues. An LOOCV accuracy of 91.18% obtained using RUVBL1 and CNIH shows their potential as biomarkers for ESCC. Each gene alone performs worse in GSE23400 and in our dataset. This suggests that the two genes should be used together to obtain the best performance. The 10-CV accuracies demonstrate that the performance remains the same when fewer training samples are available.
No relationship between ESCC and the two newly identified biomarkers, RUVBL1 and CNIH, has been reported previously. The decision tree models show that RUVBL1 is overexpressed in all three datasets of patients in China and Taiwan. CNIH, however, is overexpressed in the dataset of patients in China (GSE23400) and downexpressed in the dataset of patients in Taiwan. In GSE20347, also from patients in China, CNIH is neither over- nor downexpressed.
RUVBL1 plays important roles in chromatin remodeling, transcriptional and developmental regulation, DNA repair, and apoptosis. RUVBL1 is ubiquitously and highly expressed in thymus and testis [
CNIH is involved in the selective transport and maturation of TGF-alpha family proteins. There is no research reporting the relationship between CNIH gene and cancers. Interestingly, the expression of CNIH mRNA is affected by tetrachlorodibenzodioxin [
In conclusion, using feature selection and decision tree models on two publicly available microarray datasets, we found that two genes (RUVBL1 and CNIH), particularly RUVBL1, could be useful clinical biomarkers for discriminating cancer from normal tissues in Taiwanese ESCC patients. Collecting a larger dataset for independent testing could further validate the robustness of the two biomarkers. Future work is needed to study the mechanism of these two genes in the carcinogenesis of ESCC.
The authors declare that they have no conflicts of interest.
This work was supported by the National Science Council of Taiwan (NSC 101-2311-B-037-001-MY2 and 101-2314-B-037-043), the National Health Research Institutes (NHRI-EX102-10226PC), Kaohsiung Medical University Hospital (KMUH99-9M12), and Kaohsiung Medical University Research Foundation (KMU-Q110015, KMU-Q102012, and KMU-ER013).