Colorectal cancer (CRC), which arises through a multistep process under the influence of multiple factors, is one of the most common life-threatening cancers worldwide. Identifying high-risk populations is critical for early diagnosis and for improving the overall survival rate. Among the many genetic and environmental factors, which group contributes most to colorectal carcinogenesis remains contentious. For this reason, this study collects relatively complete information on genetic variations and environmental exposure for both CRC patients and cancer-free controls, and develops a multimethod ensemble model for CRC-risk prediction that is trained and tested on these big data. Our results demonstrate that (1) the explored genetic and environmental biomarkers are linked to CRC by biological function-based or population-based evidence, (2) the model can efficiently predict the risk of CRC after its parameters are optimized on the big CRC-related data, and (3) our proposed heterogeneous ensemble learning model (HELM) and generalized kernel recursive maximum correntropy (GKRMC) algorithm have high predictive power. Finally, we discuss why HELM and GKRMC can outperform the classical regression algorithms and outline related subjects for future study.
Over the past decades, new strategies have been developed to decrease the incidence and improve the prognosis of colorectal cancer (CRC), ranging from the popularization of regular screening in individuals older than 50 years for prevention to the adoption of new technologies such as laparoscopic surgery, neoadjuvant chemotherapy, and bio-targeted therapy for more precise and individualized treatment. However, CRC is still one of the important contributors to cancer worldwide [
Several mathematical models have already been developed to process different types of data for CRC occurrence prediction. For low dimensional data, Wu et al. [
For this reason, to avoid the shortcomings of previous methods when they are applied to data as complicated as those collected in the KOJACH study mentioned above, we propose a robust CRC predictive model based on our latest study [
The research results indicate that (
Finally, we analyze why both the HELM and GKRMC algorithms outperform the alternatives and discuss future work on the CRC predictive model.
The data used in this study is from the hospital-based case-control study of colorectal cancer in Chongqing, China, by the Department of Toxicology at the Third Military Medical University [
Food intake is evaluated by our previously developed Semi-Quantitative Food Frequency Questionnaire [
The items in the dataset include general information (such as gender and age), polymorphism distribution of genes related to ethanol metabolism (the distribution of homozygotes and heterozygotes of gene loci), and demographic characteristics, food, and lifestyle habits (smoking and alcohol consumption). To avoid any bias, a standard questionnaire is generated in which each survey item has a specific definition. The examination is carried out as a face-to-face query. Several survey items, such as the amount of alcohol and cigarettes consumed, are quantitatively estimated. Using age 60 as the demarcation point, the surveyed patients are divided into the elderly group and the young/middle-aged group. Alcohol consumption is divided into healthy drinking (including people who do not drink and people who drink no more than 15 g per day) and nonhealthy drinking (including people who drink more than 15 g per day). Based on smoking habits, the participants are divided into nonsmokers and smokers (including those who had quit smoking).
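The grouping rules described above can be sketched in a few lines of Python. This is only an illustration: the `Participant` record and its field names are hypothetical, and the text does not say which group a participant aged exactly 60 belongs to, so the sketch assumes 60 counts as elderly.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    # Hypothetical record; field names are not taken from the questionnaire.
    age: int                  # years
    alcohol_g_per_day: float  # estimated daily alcohol intake in grams
    ever_smoked: bool         # True for current smokers and those who quit


def age_group(p: Participant) -> str:
    # Assumption: age 60 itself falls in the elderly group.
    return "elderly" if p.age >= 60 else "young/middle-aged"


def drinking_group(p: Participant) -> str:
    # Non-drinkers (0 g) and drinkers of at most 15 g/day are "healthy".
    return "healthy" if p.alcohol_g_per_day <= 15 else "nonhealthy"


def smoking_group(p: Participant) -> str:
    # Former smokers are counted as smokers, as in the text.
    return "smoker" if p.ever_smoked else "nonsmoker"


def encode(p: Participant) -> tuple:
    """Return the three categorical labels for one participant."""
    return (age_group(p), drinking_group(p), smoking_group(p))
```

For example, `encode(Participant(62, 0.0, True))` places a 62-year-old non-drinking former smoker in the elderly, healthy-drinking, and smoker groups.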
This study employs these data to build the predictive CRC model with biological classification, dimensionality reduction, and regression analysis stages, which will be illustrated in detail in the next section.
The biological classification is carried out from the perspective of medical science to divide the original dataset into four subclasses, which are as follows: (
This study employs three broadly used dimensionality reduction methods, namely, principal component analysis, entropy of information, and relief method to obtain the mutually explored biomarkers for each subclass.
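As a rough illustration of an entropy-based criterion, the sketch below scores a categorical feature by its information gain, that is, by how much knowing the feature reduces the Shannon entropy of the case/control label. Whether the study ranks features by exactly this quantity is an assumption; the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels within each feature value, weighted by group size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]
informative = ["a", "a", "b", "b"]    # perfectly aligned with the labels
uninformative = ["a", "b", "a", "b"]  # independent of the labels
```

In this toy case the informative feature removes all label uncertainty, while the uninformative one removes none, so a ranking by gain would keep the first and drop the second.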
The key idea of relief is to iteratively estimate feature weights according to their ability to discriminate between neighboring patterns. In each of the iterations, a pattern
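The weight-update idea can be made concrete with a minimal sketch of the basic two-class Relief procedure: for a randomly drawn sample, find its nearest neighbor of the same class (hit) and of the other class (miss), then raise the weight of features that differ on the miss and lower it for features that differ on the hit. The range-normalized absolute difference and the parameter names are assumptions of this sketch, not details confirmed by the study.

```python
import random


def relief_weights(X, y, n_iterations=100, seed=0):
    """Basic two-class Relief feature weights for numeric features."""
    rng = random.Random(seed)
    n_features = len(X[0])
    # Scale each feature difference by its range so weights are comparable.
    spans = []
    for j in range(n_features):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(a, b, j) for j in range(n_features))

    weights = [0.0] * n_features
    for _ in range(n_iterations):
        i = rng.randrange(len(X))
        hits = [k for k in range(len(X)) if k != i and y[k] == y[i]]
        misses = [k for k in range(len(X)) if y[k] != y[i]]
        hit = min(hits, key=lambda k: distance(X[i], X[k]))
        miss = min(misses, key=lambda k: distance(X[i], X[k]))
        for j in range(n_features):
            # Reward separation from the miss, penalize separation from the hit.
            weights[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n_iterations
    return weights


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
w = relief_weights(X, y)
```

On this toy data the discriminative feature receives a clearly positive weight and the noisy one does not, which is the behavior used for feature ranking.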
After the biological classification and dimensionality reduction stages, we use logistic regression (LR), support vector machine (SVM), heterogeneous ensemble learning model (HELM), kernel recursive least squares (KRLS) [
Minimizing the distance 2/
Workflow of HELM algorithm.
For
initialize the weight distribution for
based on the sample distribution compute the error compute the weight update the weight for each sample end, obtain the ensemble learning classifier calculate the accuracy of end, assign a weight
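The final combination step of such a workflow can be sketched as accuracy-weighted voting over heterogeneous base classifiers. This is a simplified illustration under stated assumptions, not the authors' HELM: the inner sample-reweighting loop is omitted, the base classifiers are assumed to be already trained callables returning 0 or 1, and all names are hypothetical.

```python
def accuracy(clf, X, y):
    """Fraction of samples for which the classifier predicts the true label."""
    return sum(clf(x) == label for x, label in zip(X, y)) / len(y)


def weighted_vote(classifiers, weights, x):
    """Combine binary (0/1) predictions using per-classifier weights."""
    score = sum(w for clf, w in zip(classifiers, weights) if clf(x) == 1)
    return 1 if score >= sum(weights) / 2 else 0


def build_ensemble(classifiers, X_val, y_val):
    """Weight each already-trained classifier by its validation accuracy."""
    weights = [accuracy(clf, X_val, y_val) for clf in classifiers]
    return lambda x: weighted_vote(classifiers, weights, x)


# Toy base learners standing in for a linear and a nonlinear classifier.
linear_rule = lambda x: 1 if x[0] + x[1] > 1.0 else 0
nonlinear_rule = lambda x: 1 if x[0] * x[1] > 0.2 else 0
constant_rule = lambda x: 1  # weak learner that always predicts 1

X_val = [(0.1, 0.2), (0.9, 0.8), (0.6, 0.6), (0.2, 0.3)]
y_val = [0, 1, 1, 0]
ensemble = build_ensemble([linear_rule, nonlinear_rule, constant_rule], X_val, y_val)
```

Because the constant rule is right on only half of the validation samples, it receives a smaller weight and cannot override the two accurate learners when they agree.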
Generalized Kernel Recursive Maximum Correntropy Initialization: Computation: Iterate for
Using the matrix inversion lemma [
Substituting (
The weight vector can be expressed explicitly as a linear combination of the transformed data; that is,
Then we obtain the GKRMC algorithm, in which the coefficients update follows (
Network topology of GKRMC at
Over the past decades, epidemiological studies have proposed a number of candidate factors implicated in CRC risk, which fall into two groups: genetic factors and nongenetic factors. The genetic group consists of many SNPs, and the nongenetic group comprises several kinds of environmental factors. According to their biological characteristics and the manner in which human beings are exposed to environmental factors over a lifetime, the raw big CRC-related genetic and environmental data can be classified into four biological categories: SNPs, demographic characteristics, lifestyles, and foods as in Table
Results of biological classification.
Categories | Illustration |
---|---|
SNPs | Polymorphism distribution of genes |
Demographic characteristics | Factors such as age, sex, body weight, income level, and education, which represent individual biological or social-psychological features |
Lifestyles | Behavioral factors, such as smoking and alcohol drinking |
Foods | The amount of food intake |
To process the SNP, demographic characteristic, lifestyle, and food datasets, the SPCA, entropy, and relief methods are each applied to every dataset.
Table
The results by SPCA method.
Categories | Selected features |
---|---|
SNPs | rs10046, rs10505477, rs1152579, rs1229984, rs1255998, rs1256030, rs1256049, rs1271572, rs12953717, rs1329149, rs16941669, rs17033, rs1801132, rs2075633, rs2077647, rs3798758, rs3820033, rs4767939, rs4767944, rs4939827, rs676387, rs6905370, rs6983267, rs7296651, rs7837688, rs827421, rs886205, rs928554, rs9322354, rs9340799 |
Demographic characteristics | Cholesterol, blood triglyceride, psychological trauma, depression, age, exercise, BMI, physical activity, activity, marriage status, emotion status |
Lifestyles | Smoking, drinking, coffee consumption, drinking and smoking in the same time point, tea consumption |
Foods | Grains, melons, bean products, roots, vegetables, fruits, eggs and milk, mushrooms, oil, seasoning, meat, seafood, pickles |
When the relief algorithm is applied to extract key features from the dataset, we regard features with high weights as those most strongly associated with colorectal cancer. The result of relief algorithm is shown in Figure
Feature selection by relief algorithm: (a) SNP features (feature indices correspond to columns B(1) to AU(46) of Supplementary S1), (b) demographic characteristic features (feature indices correspond to columns A(1) to U(21) of Supplementary S2), (c) lifestyle features (feature indices correspond to columns B(1) to I(8) of Supplementary S3), and (d) food features (feature indices correspond to columns B(1) to O(14) of Supplementary S4).
Table
The results by entropy method.
Categories | Selected features |
---|---|
SNPs | rs6983267, rs1256030, rs10046, rs928554, rs1152579, rs6905370, rs676387 |
Demographic characteristics | Age, BMI, blood triglyceride, depression, mental stress, psychological trauma |
Lifestyles | Drinking and smoking in the same time point, drinking |
Foods | Vegetables, nuts, mushrooms, seasoning, pickles, grains |
Regarding the results of Figure
The results by relief method.
Categories | Selected features |
---|---|
SNPs | rs10505477, rs1256030, rs1801132, rs2071454, rs2075633, rs2228480, rs2249695, rs2486758, rs3798758, rs4767939, rs4767944, rs4939827 |
Demographic characteristics | Age, BMI, physical activity, activity, family number, emotion status, temperament, mental stress, psychological trauma, depression, cholesterol |
Lifestyles | Drinking, tea consumption, drinking and smoking in the same time point |
Foods | Nuts, vegetables, meat, eggs and milk, seafood |
Figure
Venn plots of (a) SNPs, (b) demographic characteristics, (c) lifestyle, and (d) food.
Figure
Figure
Figure
In total, 36 features are selected by at least two of the SPCA, entropy, and relief methods.
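The overlap step can be sketched as a simple count over the per-method feature sets, keeping any feature that at least two of the three methods select. Reading "mutually explored" as "selected by at least two methods" is an interpretation; the function name is illustrative, and the example uses the lifestyle rows reported in the SPCA, entropy, and relief tables.

```python
from collections import Counter


def selected_by_at_least(method_results, k=2):
    """Return the features that appear in at least k of the given feature sets."""
    counts = Counter(
        feature
        for features in method_results.values()
        for feature in set(features)  # count each feature once per method
    )
    return {feature for feature, count in counts.items() if count >= k}


lifestyle = {
    "SPCA": ["smoking", "drinking", "coffee consumption",
             "drinking and smoking in the same time point", "tea consumption"],
    "entropy": ["drinking and smoking in the same time point", "drinking"],
    "relief": ["drinking", "tea consumption",
               "drinking and smoking in the same time point"],
}
shared = selected_by_at_least(lifestyle)
```

For the lifestyle category this keeps drinking, tea consumption, and drinking and smoking in the same time point, and drops smoking and coffee consumption, which only SPCA selects.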
By
Biomarkers | |
---|---|
rs10046 | 0.0172 |
rs1256030 | 0.0004 |
rs676387 | 0.0015 |
rs6983267 | 0.0000 |
Age | 0.0152 |
BMI | 0.0019 |
Physical activity | 0.0030 |
Emotion status | 0.0247 |
Mental stress | 0.0213 |
Cholesterol | 0.0000 |
Drinking and smoking in the same time point | 0.0000 |
Vegetables | 0.0000 |
Seafood | 0.0023 |
Table
Mutually explored biomarkers.
Categories | Biomarkers |
---|---|
SNPs | rs10046, rs1256030, rs676387, rs6983267 |
Demographic characteristics | Age, BMI, physical activity, emotion status, mental stress, cholesterol |
Lifestyles | Drinking and smoking in the same time point |
Foods | Vegetables, seafood |
According to the dimensionality reduction analysis, 13 biomarkers are selected as classification features across the four biological datasets. Next, we employ the LR, SVM, KRLS, HELM, and GKRMC algorithms to build the predictive cancer model based on these selected features.
Table
The definition of the classification measurement.
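Measure | Formula | Illustration |
---|---|---|
Sensitivity | $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$ | TP: the number of true positives; FN: the number of false negatives |
Specificity | $\mathrm{TN}/(\mathrm{TN}+\mathrm{FP})$ | TN: the number of true negatives; FP: the number of false positives |
Precision | $\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$ | TP: the number of true positives; FP: the number of false positives |
Accuracy | $(\mathrm{TP}+\mathrm{TN})/(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN})$ | TP: the number of true positives; TN: the number of true negatives; FP: the number of false positives; FN: the number of false negatives |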
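The four measures follow directly from the counts of a binary confusion matrix, using the standard definitions. The sketch below is a minimal illustration with made-up counts; the function name is hypothetical and the numbers are not results from this study.

```python
def classification_measures(tp, fn, tn, fp):
    """Compute the standard measures from binary confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "precision": tp / (tp + fp),              # positive predictive value
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Illustrative counts only.
measures = classification_measures(tp=90, fn=10, tn=30, fp=20)
```

Reporting sensitivity and specificity together matters on imbalanced data, because accuracy alone can look acceptable while one class is rarely identified correctly.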
The dataset contains 1298 case-control samples, of which 369 are cases and 929 are controls. Cross validation [
The mutually explored biomarkers.
Measure | LR | SVM | KRLS | HELM | GKRMC |
---|---|---|---|---|---|
Sensitivity | 0.9251 ± 0.0256 | 0.9255 ± 0.0233 | 0.9694 ± 0.0137 | 0.9621 ± 0.0066 | 0.9762 ± 0.0175 |
Specificity | 0.1876 ± 0.0437 | 0.2300 ± 0.0459 | 0.0262 ± 0.0145 | 0.1680 ± 0.0033 | 0.0864 ± 0.0408 |
Precision | 0.7351 ± 0.0288 | 0.7372 ± 0.0315 | 0.7184 ± 0.0170 | 0.7400 ± 0.0066 | 0.7418 ± 0.0197 |
Accuracy | 0.7095 ± 0.0217 | 0.7163 ± 0.0258 | 0.7049 ± 0.0213 | 0.7305 ± 0.0087 | 0.7351 ± 0.0230 |
Predictive performance for the LR, SVM, KRLS, HELM, and GKRMC.
Most previous studies agree that genetic factors, environmental factors, and their interactions all play important roles in CRC tumorigenesis [
These big datasets are classified into four categories in the biological classification stage. Of all the explored potential biomarkers, 13 are then screened out in the dimensionality reduction stage: 4 SNPs, 6 demographic characteristics, 1 lifestyle factor, and 2 foods.
Unlike a purely mathematical formulation, the biological rationality of such a model depends on whether the selected biomarkers can be explained as validated etiological factors of colorectal cancer, supported either by population-based association studies or by experimental studies of biological function-based mechanisms. Only then can these explored biomarkers be used as the classifiers for the predictive model to assess the risk of colorectal cancer in the regression analysis stage.
In fact, results from substantial epidemiological studies of CRC risk and protective factors provide evidence for the association between each category and the risk of CRC. For the genetic variations, at least 2 (rs10046, rs6983267) of the 4 currently selected SNPs listed in Table
For demographic factors, almost all of the 6 selected factors have been reported as unfavorable factors for CRC risk in numerous previous studies [
For lifestyles, alcohol drinking and smoking are established as two significant risk factors for CRC [
For foods, extensive epidemiological and experimental studies confirm their important roles in the development of CRC. For example, higher consumption of vegetables and seafood is generally associated with a relatively lower CRC risk owing to their relatively high content of antioxidant nutrients such as dietary fiber, vitamins, and long-chain unsaturated fatty acids [
In general, the 13 currently explored biomarkers can be used as the classifiers in the regression analysis stage, which is supported by this manually reviewed experimental evidence [
Although LR and SVM may perform very well for linear systems, their performance degrades in nonlinear and non-Gaussian situations [
To overcome the shortcomings of both linear and conventional nonlinear regression algorithms, this study proposes a heterogeneous ensemble learning model (HELM) and a generalized kernel recursive maximum correntropy (GKRMC) algorithm to increase the predictive power of the model. Next, we analyze why HELM and GKRMC can outperform the LR, SVM, and KRLS algorithms.
HELM is an ensemble learning algorithm, which integrates linear and nonlinear classifiers to classify the data points. Based on the previous study [
The cost function of GKRMC (see (
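The robustness of a correntropy-type cost to outliers can be illustrated with a small sketch. It assumes the generalized Gaussian kernel $G_{\alpha,\beta}(e) = \frac{\alpha}{2\beta\Gamma(1/\alpha)}\exp(-|e/\beta|^{\alpha})$ that is commonly used to define generalized correntropy; the exact cost function of the paper is not reproduced here, and the function and parameter names are illustrative.

```python
import math


def generalized_gaussian(e, alpha=2.0, beta=1.0):
    """Generalized Gaussian density at error e (alpha: shape, beta: scale)."""
    norm = alpha / (2.0 * beta * math.gamma(1.0 / alpha))
    return norm * math.exp(-abs(e / beta) ** alpha)


def correntropy_loss(errors, alpha=2.0, beta=1.0):
    """Mean of G(0) - G(e): a per-sample loss that saturates for large |e|."""
    g0 = generalized_gaussian(0.0, alpha, beta)
    return sum(g0 - generalized_gaussian(e, alpha, beta) for e in errors) / len(errors)


def mean_squared_error(errors):
    """Mean squared error, which grows without bound with the error size."""
    return sum(e * e for e in errors) / len(errors)


clean = [0.1, -0.2, 0.05]
with_outlier = clean + [50.0]  # one grossly corrupted sample
```

With the single outlier, the mean squared error is dominated by that one sample, whereas the correntropy-type loss can never exceed `generalized_gaussian(0.0)`, so the outlier's influence on the fit is capped.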
In conclusion, this study proposes a robust CRC-risk predictive model to analyze big data containing information on genetic variations and environmental exposure for CRC patients and cancer-free controls. The results indicate that both the genetic and the environmental factors explored by our model play significant roles in the occurrence of CRC and that the proposed HELM and GKRMC can increase the predictive power of the model.
However, this novel predictive model is only a first step in predicting the risk of CRC tumor growth. Beyond the environmental factors and SNPs involved in the current model, if other factors such as pathway-pathway and pathway-environment interactions are included, there will be a higher chance of finding a set of variations that may serve as integrative biomarkers, as shown in other research [
The authors declare that they have no conflicts of interest.
This work was supported by the General Program from National Natural Science Foundation of China (nos. 81273156, 30771841, 61372138, and 61372152), Chongqing Excellent Youth Award and the Chinese Recruitment Program of Global Youth Experts, and the Fundamental Research Funding of the Chinese Central Universities (nos. XDJK2014B012 and XDJK2016A00).