Accurate classification of human cancer, including its primary site, is important for understanding cancer and for developing effective therapeutic strategies. The large volume of available somatic mutation data offers a valuable opportunity to investigate cancer classification using machine learning. Here, we explored the patterns of 1,760,846 somatic mutations identified from 230,255 cancer patients, along with gene function information, using a support vector machine. Specifically, we performed a multiclass classification experiment over 17 tumor sites using the gene symbol, somatic mutation, chromosome, and gene functional pathway as predictors for 6,751 subjects. The baseline model using only gene features achieved an accuracy of 0.57, which improved to 0.62 when mutation and chromosome information was added. Among the predictable primary tumor sites, the prediction of five primary sites (large intestine, liver, skin, pancreas, and lung) could achieve the performance with more than 0.70 in
Cancer is a complex disease, which is driven by the combination of genetic, environmental, and lifestyle factors. Among these factors, the combination of multiple genes driving cancer development varies considerably among cancer types and patients [
For cancer classification, the fundamental method is mainly based on the cell of origin or their histological types [
Recently, next-generation sequencing approaches have been applied to cancer studies, including whole genome sequencing, whole exome sequencing, targeted gene sequencing, whole transcriptome sequencing, genome-wide microRNA sequencing, and epigenomics, providing the highest resolution (base-pair resolution) of genetic and genomic information in cancer. These datasets provide an unprecedented opportunity for systematic and integrated investigation of the molecular mechanisms of cancer. For example, Vogelstein et al. systematically analyzed the mutation landscapes in 96 cancer types reported from 127 publications, providing deep insights into the cancer genomic architecture [
In this study, we proposed a novel cancer site classification framework by investigating somatic mutations through machine learning approaches. The somatic mutation information includes (1) patient information, (2) mutation-associated genes, and (3) mutation-associated chromosomes. We extracted these types of information from the database COSMIC (Catalogue of Somatic Mutations In Cancer) [
The main purpose of this study is to test if the somatic mutation features and mutation-related information are useful or have the power to predict the primary cancer site since more than a million somatic mutations in cancer genomes have been reported, collected, and systematically analyzed. To address this important question, we took advantage of the data in COSMIC, which is the most comprehensive, annotation-based database for the somatic mutations from numerous patients with cancer type information. Figure
Study design using somatic mutations to classify primary tumor sites by machine learning model. In order to precisely represent the mutations, we generated a feature
The COSMIC database is established to collect, store, and display somatic mutations and related information extracted from the primary literature on human cancers as well as those identified from cancer genome projects [
To normalize the gene names to official gene symbols, we used a two-step strategy. First, we used the mutation positions from the COSMIC data to map the gene regions using the UCSC Genome Browser based on the GRCh37 genome annotation [
To clean the data, we removed the records that do not have the information about gene name, sample ID, primary site, or mutation description. Additionally, we removed the mutations that were involved in fusion genes because they do not have a single-mutation position. Eventually, the filtered dataset contained 230,255 patients, 22,111 unique genes, and 1,760,846 mutations.
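The cleaning rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the field names (`gene_name`, `sample_id`, `primary_site`, `mutation_description`) and the `is_fusion` flag are hypothetical and do not reflect the actual COSMIC export columns.

```python
# Hypothetical field names; the real COSMIC export uses its own column names.
REQUIRED_FIELDS = ("gene_name", "sample_id", "primary_site", "mutation_description")


def clean_records(records):
    """Keep records that have every required field and are not fusion mutations."""
    kept = []
    for rec in records:
        # Drop records that lack any of the four required annotations.
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            continue
        # Drop mutations in fusion genes (marked here by an assumed boolean flag).
        if rec.get("is_fusion", False):
            continue
        kept.append(rec)
    return kept


example = [
    {"gene_name": "SPEN", "sample_id": "S1", "primary_site": "lung",
     "mutation_description": "Complex"},
    {"gene_name": "", "sample_id": "S2", "primary_site": "breast",
     "mutation_description": "Substitution-Missense"},
    {"gene_name": "SP1", "sample_id": "S3", "primary_site": "liver",
     "mutation_description": "Substitution-Missense", "is_fusion": True},
]
cleaned = clean_records(example)
```

In this toy input, only the first record survives: the second lacks a gene name and the third is flagged as a fusion.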
The KEGG pathway database manually collects and annotates the molecular interactions and regulations among genes and then draws pathway maps [
In this study, we mainly explored somatic mutations and their related information for classifying the primary cancer site. From the filtered data obtained above, we extracted 7,251 patients who had at least ten mutations. Patients with very few mutations are more likely to be outliers in the dataset and provide too little information for a model to distinguish their label from those of other patients, which makes it harder to train a good predictive model. In contrast, patients with more mutations are more likely to share features, which supports training and helps the model find more reliable patterns. We chose ten as the threshold because the resulting set of over seven thousand patients is large enough for machine learning experiments and the number of features generated per patient at this threshold remains manageable for modeling.
We further filtered out several minority classes of primary tumor sites, each of which has fewer than 60 patients in the dataset, such as “Bone,” “Meninges,” and “Eye.” Thus, a final set of 6,751 patients was used in this study. Each of these patients was diagnosed with cancer at one of the 17 primary tumor sites. Table
Distribution of primary tumor sites.
Primary tumor site | Number of patients | Percentage (%) |
---|---|---|
Lung | 970 | 14.43 |
Breast | 967 | 14.39 |
Large intestine | 654 | 9.73 |
Haematopoietic and lymphoid tissue | 644 | 9.58 |
Kidney | 491 | 7.31 |
Ovary | 490 | 7.29 |
Liver | 400 | 5.95 |
Central nervous system | 377 | 5.61 |
Prostate | 374 | 5.56 |
Endometrium | 261 | 3.88 |
Pancreas | 252 | 3.75 |
Autonomic ganglia | 222 | 3.30 |
Skin | 184 | 2.74 |
Oesophagus | 174 | 2.59 |
Urinary tract | 110 | 1.64 |
Upper aerodigestive tract | 91 | 1.35 |
Stomach | 60 | 0.89 |
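The two patient-level filters described before the table (at least ten mutations per patient, then removal of sites with fewer than 60 patients) could be expressed as follows. This is a sketch under assumptions: the input dictionaries are hypothetical data structures, and counting site sizes after the mutation filter is our reading of the text rather than a confirmed detail.

```python
from collections import Counter


def select_patients(mutation_counts, primary_site, min_mutations=10, min_patients=60):
    """Return the patient IDs that survive both filters.

    mutation_counts: dict patient_id -> number of mutations
    primary_site:    dict patient_id -> primary tumor site label
    """
    # Step 1: keep patients with at least `min_mutations` mutations.
    enough = [p for p, n in mutation_counts.items() if n >= min_mutations]
    # Step 2: count remaining patients per site and drop minority sites
    # (assumed to be evaluated after step 1).
    site_sizes = Counter(primary_site[p] for p in enough)
    return [p for p in enough if site_sizes[primary_site[p]] >= min_patients]


# Toy example with a reduced class-size threshold so the effect is visible.
counts = {"p1": 12, "p2": 3, "p3": 10, "p4": 25}
sites = {"p1": "lung", "p2": "lung", "p3": "lung", "p4": "eye"}
selected = select_patients(counts, sites, min_mutations=10, min_patients=2)
```

Here `p2` is dropped for having too few mutations and `p4` because its site is left with a single patient. With the default `min_patients=60`, a site with exactly 60 patients, such as stomach in the table, is retained.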
From the COSMIC data, we collected mutations and their corresponding mutated genes and chromosomes to represent the genetic characteristics of each patient. Following the mutation descriptions in COSMIC and our filtering procedure, this process yielded twelve unique mutation types: four general categories (substitution, insertion, deletion, and complex) and eight more specific descriptions (substitution-nonsense, substitution-missense, substitution-coding silent, substitution-intronic, insertion-in frame, insertion-frameshift, deletion-in frame, and deletion-frameshift). Table
Mutation description.
Mutation description | Definition |
---|---|
Substitution | A mutation involving the substitution of a single nucleotide |
Substitution-nonsense | A substitution mutation resulting in a termination codon, foreshortening the translated peptide |
Substitution-missense | A substitution mutation resulting in an alternate codon, altering the amino acid at this position only |
Substitution-coding silent | A synonymous substitution mutation which encodes the same amino acid as the wild type codon |
Substitution-intronic | A substitution mutation outside the coding domains; no interpretation is made as to its effect on splice sites or nearby regulatory regions |
Insertion | An insertion of novel sequence into the gene |
Insertion-in frame | An insertion of nucleotides which does not affect the gene’s translation frame, leaving the downstream peptide sequence intact |
Insertion-frameshift | An insertion of novel sequence which alters the translation frame, changing the downstream peptide sequence (often resulting in premature termination) |
Deletion | A deletion of a portion of the gene’s sequence |
Deletion-in frame | A deletion of nucleotides which does not affect the gene’s translation frame, leaving the downstream peptide sequence intact |
Deletion-frameshift | A deletion of nucleotides which alters the translation frame, changing the downstream peptide sequence (often resulting in premature termination) |
Complex | A compound mutation which may involve multiple insertions, deletions, and substitutions |
Instead of directly using individual mutation descriptions, we combined them with their corresponding gene symbols to precisely represent the mutations. This resulted in 79,865 unique combinations of gene symbols and mutation descriptions in the dataset, such as “CHDC2_Insertion-Frameshift,” “SPEN_Complex,” and “SP1_Substitution-Missense.” In this paper, we use “
Since the human somatic mutation landscape is related to chromosome [
Besides the mutation-related information, we further integrated the KEGG dataset to provide the functional knowledge of the genes involved in the patients’ mutations. There are 285 unique pathways for the 21,286 genes.
Therefore, in this study, we defined four features:
In the data we collected, each sample contains the set of features present in one patient. We represented the features collected across all patients as a feature vector, as is standard in machine learning. All features in the vector were binary: “1” indicates that the feature is present and “0” that it is absent. We then constructed a data matrix in which each row contains all the features for one patient and each column corresponds to one feature across all patients.
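To make the binary representation concrete, the sketch below builds a small patient-by-feature matrix from gene, gene-plus-mutation, chromosome, and pathway features. The prefixes (`gene:`, `mut:`, `chr:`, `path:`), the input tuple layout, and the pathway lookup are illustrative assumptions, not the encoding used in the study.

```python
def patient_features(mutations, gene_pathways):
    """Collect the set of feature names present in one patient.

    mutations:     list of (gene, mutation_description, chromosome) tuples
    gene_pathways: dict gene -> list of pathway names (hypothetical lookup)
    """
    feats = set()
    for gene, description, chromosome in mutations:
        feats.add("gene:" + gene)
        # Gene symbol bound to its mutation description, e.g. "SPEN_Complex".
        feats.add("mut:" + gene + "_" + description)
        feats.add("chr:" + chromosome)
        for pathway in gene_pathways.get(gene, []):
            feats.add("path:" + pathway)
    return feats


def build_matrix(per_patient_features):
    """Return (columns, rows); each row is a 0/1 list for one patient."""
    columns = sorted(set().union(*per_patient_features))
    rows = [[1 if c in feats else 0 for c in columns]
            for feats in per_patient_features]
    return columns, rows


# Toy input: two patients; the pathway name is a placeholder.
pathways = {"SP1": ["example_pathway"]}
patients = [
    [("SPEN", "Complex", "1")],
    [("SP1", "Substitution-Missense", "12"), ("SPEN", "Complex", "1")],
]
columns, rows = build_matrix([patient_features(m, pathways) for m in patients])
```

With this toy input the matrix has seven columns. The first patient has ones only for the three SPEN-derived features, while the second patient has ones in every column.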
With respect to the classification method, we implemented a one-versus-all multiclass classification scheme to identify the primary tumor site based on patients’ mutation-associated features and the gene pathway feature. For each primary tumor site, we trained a binary classifier to distinguish patients belonging to that site from all other patients. Each classifier was a support vector machine (SVM) with a linear kernel implemented by LIBLINEAR [
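As an illustration of how a one-versus-all scheme turns per-site binary classifiers into a single prediction, the sketch below scores a feature vector with each linear classifier and returns the site with the largest decision value. The text does not state how the binary outputs were combined, so the argmax rule is an assumption (it is a common convention), and the weights shown are made up rather than trained.

```python
def decision_value(weights, bias, x):
    """Linear decision value w.x + b for one site-versus-rest classifier."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_site(classifiers, x):
    """Pick the site whose binary classifier gives the largest decision value.

    classifiers: dict site -> (weights, bias)
    """
    return max(classifiers,
               key=lambda site: decision_value(*classifiers[site], x))


# Made-up weights for three sites over three binary features.
classifiers = {
    "lung": ([1.0, -1.0, 0.0], -0.5),
    "liver": ([-1.0, 1.0, 0.0], -0.5),
    "skin": ([0.0, 0.0, 1.0], -0.5),
}
first = predict_site(classifiers, [1, 0, 0])
second = predict_site(classifiers, [0, 0, 1])
```

In practice the weights and biases would come from training one linear SVM per site on the binary feature matrix.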
We performed the multiclass classification experiments on the
We conducted the experiments using 10-fold cross validation. All patient samples were split into ten folds with stratification so that the class distribution in each fold is similar to that of the original dataset. In turn, we treated one fold as the test set and the remaining nine folds as the training set, so that model training and testing were repeated 10 times. As a result, each patient received a predicted primary tumor site from the predictive model. We computed the accuracy as a global metric to evaluate different feature combinations. We also evaluated the performance of prediction on each primary tumor site by precision, recall, and
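A stratified split of this kind can be sketched with the standard library alone. The round-robin assignment and the fixed random seed below are illustrative choices, not details taken from the study.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, balancing classes across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal members round-robin so each fold gets a near-equal share of the class.
        for i, idx in enumerate(members):
            folds[(i + offset) % k].append(idx)
        offset += len(members)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


labels = ["lung"] * 20 + ["liver"] * 10
folds = stratified_folds(labels, k=10)
splits = list(train_test_splits(folds))
```

For this toy label list, every fold contains two lung and one liver sample, and each of the ten splits trains on 27 samples and tests on 3.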
We also used microaverage and macroaverage methods to report the accuracy. In the microaverage accuracy (miAccuracy),
Following the study design in Figure
We have trained seven predictive models using different combinations of feature sets. The specific features for each combination, sizes of features, and the accuracies as their global scores are shown in Table
Micro- and macroaveraged accuracies of seven combinations of gene symbols with three other features.
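Feature combination | Number of features | miAccuracy | maAccuracy (mean) | maAccuracy (SD) |
---|---|---|---|---|
Gene symbol | 21,286 | 0.57 | 0.57 | 0.019 |
Gene symbol + mutation | 101,151 | 0.58 | 0.58 | 0.019 |
Gene symbol + pathway | 21,571 | 0.58 | 0.58 | 0.010 |
Gene symbol + chromosome | 21,311 | 0.60 | 0.60 | 0.022 |
Gene symbol + mutation + pathway | 101,436 | 0.60 | 0.60 | 0.013 |
Gene symbol + mutation + chromosome | 101,176 | 0.62 | 0.62 | 0.021 |
Gene symbol + mutation + chromosome + pathway | 101,461 | 0.60 | 0.60 | 0.015 |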
Note: miAccuracy represents the microaverage accuracy; maAccuracy represents the macroaverage accuracy, which is reported in mean and standard deviation (SD) over 10 accuracies from 10-fold cross validation.
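The two averages can be illustrated as follows. The macroaverage follows the table note (mean and SD of the per-fold accuracies). For the microaverage, we assume the usual definition of pooling all test predictions across folds before computing accuracy. Using the sample standard deviation (`statistics.stdev`) is also an assumption.

```python
from statistics import mean, stdev


def fold_accuracy(y_true, y_pred):
    """Fraction of correct predictions within one fold."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_macro_accuracy(fold_results):
    """fold_results: list of (y_true, y_pred) pairs, one per test fold."""
    # Microaverage (assumed): pool every prediction, then compute one accuracy.
    correct = sum(sum(t == p for t, p in zip(yt, yp)) for yt, yp in fold_results)
    total = sum(len(yt) for yt, _ in fold_results)
    # Macroaverage: compute accuracy per fold, then summarize across folds.
    per_fold = [fold_accuracy(yt, yp) for yt, yp in fold_results]
    return {
        "miAccuracy": correct / total,
        "maAccuracy_mean": mean(per_fold),
        "maAccuracy_sd": stdev(per_fold),
    }


# Two toy folds of unequal size, so the two averages differ.
results = micro_macro_accuracy([
    (["a", "a", "b", "b"], ["a", "a", "b", "a"]),
    (["a", "b"], ["a", "b"]),
])
```

With stratified folds of nearly equal size, as in the study, the two averages are expected to be close, which is consistent with the identical miAccuracy and mean maAccuracy values in the table.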
Using the best model set, we predicted the whole dataset using 10-fold cross validation and evaluated the performance on every primary tumor site by precision, recall, and
Precision, recall, and
Primary tumor site | Precision | Recall | F-measure |
---|---|---|---|
Large intestine | 0.88 | 0.85 | 0.87 |
Liver | 0.88 | 0.72 | 0.79 |
Skin | 0.91 | 0.61 | 0.73 |
Pancreas | 0.75 | 0.67 | 0.71 |
Lung | 0.66 | 0.75 | 0.70 |
Endometrium | 0.91 | 0.52 | 0.67 |
Kidney | 0.72 | 0.62 | 0.66 |
Haematopoietic and lymphoid tissue | 0.50 | 0.75 | 0.60 |
Breast | 0.50 | 0.75 | 0.60 |
Central nervous system | 0.63 | 0.51 | 0.56 |
Ovary | 0.40 | 0.49 | 0.44 |
Prostate | 0.46 | 0.35 | 0.40 |
Autonomic ganglia | 0.45 | 0.28 | 0.34 |
Oesophagus | 0.81 | 0.20 | 0.31 |
Urinary tract | 0.83 | 0.09 | 0.16 |
Upper aerodigestive tract | 1.00 | 0.05 | 0.10 |
Stomach | 0.60 | 0.05 | 0.09 |
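The per-site scores can be computed from the true and predicted labels as sketched below. The third score in the table is consistent with the harmonic mean of precision and recall (commonly called the F-measure or F1 score), which is what `f_measure` implements. Returning zero when a site is never predicted is a convention assumed here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_site_scores(y_true, y_pred):
    """Return dict site -> (precision, recall, F-measure)."""
    scores = {}
    for site in sorted(set(y_true)):
        tp = sum(t == site and p == site for t, p in zip(y_true, y_pred))
        predicted = sum(p == site for p in y_pred)
        actual = sum(t == site for t in y_true)
        # Precision is set to 0 when the site is never predicted (assumed convention).
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual
        scores[site] = (precision, recall, f_measure(precision, recall))
    return scores


y_true = ["lung", "lung", "lung", "liver", "liver", "skin"]
y_pred = ["lung", "lung", "liver", "liver", "lung", "lung"]
scores = per_site_scores(y_true, y_pred)
liver_f = round(f_measure(0.88, 0.72), 2)
```

Applying `f_measure` to the rounded liver values in the table (precision 0.88, recall 0.72) gives 0.79, matching the reported score. Small discrepancies for other rows can arise because the table values are themselves rounded.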
The average precision and recall were 0.70 and 0.49, respectively. This predictive model could achieve the precision of 0.75 or higher in 8 out of 17 primary tumor sites, recall of 0.60 or higher for 8 out of 17, and
In this study, we performed a systematic exploration of the somatic mutations and their related features for cancer classification using a machine learning approach and the most comprehensive somatic mutation dataset so far. The study filtered the somatic mutation data from COSMIC, identified the best feature combination, and predicted the primary tumor sites using the machine learning methods.
Machine learning approaches have been applied to cancer prognosis and prediction [
The model tended to perform worst for the four primary sites with the smallest sample sizes. Specifically, “Oesophagus,” “Urinary tract,” “Upper aerodigestive tract,” and “Stomach” had the smallest numbers of samples, and they were also ranked at the bottom according to
One important output of this study is the best feature combination (
To test if the high-level function-associated features could improve the performance of cancer site classification, we explored the KEGG pathway that mutation-associated genes are involved in. However, in our study, there is no improvement of performance by integrating the
Our prediction model utilized the
Summary of genes and samples used in the primary tumor site prediction.
Primary tumor site | Number of genes | Number of true positives |
---|---|---|
Large intestine | 18,066 | 555 |
Liver | 19,778 | 287 |
Skin | 10,898 | 113 |
Pancreas | 3,364 | 170 |
Lung | 18,423 | 724 |
Endometrium | 18,234 | 137 |
Kidney | 10,601 | 302 |
Haematopoietic and lymphoid tissue | 14,545 | 486 |
Breast | 6,327 | 723 |
Central nervous system | 2,773 | 192 |
Ovary | 8,169 | 238 |
Prostate | 5,875 | 132 |
Autonomic ganglia | 1,425 | 62 |
Oesophagus | 6,200 | 34 |
Urinary tract | 3,288 | 10 |
Upper aerodigestive tract | 1,013 | 5 |
Stomach | 86 | 3 |
Among the 17 primary tumor sites, five primary tumor sites achieved better performance, according to their
Comparison among five sets of the top 50 genes used in the machine learning modeling for five primary tumor sites (large intestine, liver, lung, pancreas, and skin).
In this exploratory study, we demonstrated that somatic mutation information can be used for cancer classification. As this is a first attempt at predicting cancer sites, we see many opportunities to improve the performance using genetic and genomic information in future work. First, refining the features might improve the performance of machine learning experiments in several ways. (1) The first is the identification and analysis of the most frequently mutated genes across multiple primary sites. (2) The second is reducing the redundancy of feature sets with automatic dimension reduction techniques. Two types of methods can be used. One comprises algorithms that do not use label information, such as principal component analysis [
In conclusion, our application of machine learning to somatic mutations could predict some primary tumor sites, such as the large intestine, liver, skin, pancreas, and lung. Because cancer treatment relies not only on the known cancer site but also on the underlying molecular profile (e.g., cancer driver mutations), and because cancer cells migrate to multiple sites at the metastatic stage, predicting cancer sites from mutation profiles may aid the development of molecular therapeutics. This study represents the first large-scale prediction of primary tumor site using comprehensive, publicly available somatic mutations through a machine learning approach.
All the authors declare no conflict of interests.
Yukun Chen and Jingchun Sun contributed equally.
This project is partially supported by National Institutes of Health Grants (R01LM011177, P30CA68485, P50CA095103, and P50CA098131), Vanderbilt-Ingram Cancer Center’s Breast Cancer SPORE pilot grant (to Zhongming Zhao), Ingram Professorship Funds (to Zhongming Zhao), and Cancer Prevention & Research Institute of Texas (CPRIT R1307) Rising Star Award (to Hua Xu).