New data sources for the analysis of cancer data are rapidly supplementing the large number of gene-expression markers used for current methods of analysis. Significant among these new sources are copy number variation (CNV) datasets, which typically enumerate several hundred thousand CNVs distributed throughout the genome. Several useful algorithms allow systems-level analyses of such datasets. However, these rich data sources have not yet been analyzed as deeply as gene-expression data. To address this issue, the extensive toolsets used for analyzing expression data in cancerous and noncancerous tissue (e.g., gene set enrichment analysis and phenotype prediction) could be redirected to extract a great deal of predictive information from CNV data, in particular those derived from cancers. Here we present a software package capable of preprocessing standard Agilent copy number datasets into a form to which essentially all expression analysis tools can be applied. We illustrate the use of this toolset in predicting the survival time of patients with ovarian cancer or glioblastoma multiforme and also provide an analysis of gene- and pathway-level deletions in these two types of cancer.
Copy number variations (CNVs) are promising DNA-level biomarkers of cancer subtype. CNVs can influence the phenotypes of cancer by disrupting (i.e., removing) or duplicating (i.e., adding) copies of a gene [
Previous studies describe several software packages useful for analysis of CNVs [
In the past decade, a number of useful and well-organized toolsets were developed for analyzing cancer phenotypes or genotypes. However, most of these software packages are used to analyze gene-expression datasets. For example, machine learning algorithms, including unsupervised clustering methods and supervised classification methods, are applied to classify cancer phenotype, survival time, cancer metastasis, and so forth. Such algorithms draw distinctions based on tissue RNA signatures for their predictive and classification tasks. Certain tissue samples contain macroscopic DNA variations which extend RNA variations based on CNVs, especially in cancer.
Particularly in cancer tissue, distinctions, classifications, and predictions based on such DNA variations may be useful. Indeed, DNA variation in a tumor changes more slowly than RNA variation and thus may be considered less noisy. Therefore, it is worthwhile to consider DNA-based algorithms.
If DNA copy number datasets (e.g., Agilent datasets) could be reprocessed into formats that parallel RNA expression signatures obtained from microarrays, it would be possible to construct DNA-based algorithms that parallel existing algorithms based on RNA. To the extent that a primary contributor to the expression level of a gene in cancerous tissue is the corresponding gene copy number, this type of analysis, using DNA copy number microarrays, could be considered as a proxy for RNA microarrays. Given that RNA signatures are more time-dependent than cancer-cell DNA signatures, the latter provide a more stable set of biomarkers for use in prediction of survival time, chemotherapy response, and other outcomes.
Here we present an algorithm that reprocesses Agilent DNA copy number signatures into a format that parallels the microarray signatures used in standard software packages for prediction and classification. Because copy number datasets are provided with probe IDs, we first need to download gene information to convert Agilent probe positions to gene regions. The following describes the steps for generating CN array (CNAR) datasets corresponding to known genes.
We downloaded 216 ovarian cancer (OV) and 215 glioblastoma multiforme (GBM) CNA datasets, produced by Harvard Medical School using the Agilent Human Genome Comparative Genomic Hybridization Microarray 244A platform (HG-CGH-244A), updated in 2012 from the TCGA data portal. The portal also provided probe IDs with probe positions on the chromosome covered by 60 base pairs (bp). We also selected two classes of survival datasets for classification. From the GBM datasets, we extracted 23 samples from patients whose survival time was less than 100 days (short class) and 23 samples from patients with survival time greater than 1500 days (long class). From the OV datasets, we extracted 18 samples from patients whose survival time was less than 1 year (short class) and 18 samples from patients whose survival time was greater than 5 years (long class). The survival times distinguishing the long and short classes differed between GBM and OV because GBM is a more fatal disease.
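The two-class selection described above can be sketched as follows. This is a minimal illustration, assuming survival times are available per sample in days; the function name `split_survival_classes` and the sample IDs are hypothetical, and the thresholds shown are the GBM cutoffs quoted in the text (for OV, 1 year and 5 years would have to be converted to days, which is an assumption about the input units).

```python
def split_survival_classes(survival_days, short_below, long_above):
    """Return (short, long) lists of sample IDs.

    Samples whose survival time lies between the two cutoffs are dropped,
    so the two classes are well separated.
    """
    short = sorted(s for s, d in survival_days.items() if d < short_below)
    long_ = sorted(s for s, d in survival_days.items() if d > long_above)
    return short, long_


# Hypothetical sample IDs and survival times (days), for illustration only.
toy_survival = {"S1": 45, "S2": 99, "S3": 100, "S4": 800, "S5": 1500, "S6": 2100}

# GBM cutoffs from the text: short < 100 days, long > 1500 days.
gbm_short, gbm_long = split_survival_classes(
    toy_survival, short_below=100, long_above=1500
)
```

Samples lying exactly on a cutoff (100 or 1500 days here) fall into neither class, which follows the strict "less than" and "greater than" wording of the text.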
We downloaded Agilent genome region from BioMart (
We extended each gene region by including 2000 bp for the promoter region and 100,000 bp on each side; we refer to the extended region as a pseudogene. Each chromosome includes approximately 20 million base pairs, and the number of genes in each chromosome is about 1000. Therefore, an expansion of 100,000 bp on each side is reasonable. In addition, the region involved in a CNV is commonly much larger than a gene. We note that the range of CNV is commonly 1 kb or larger [
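The region extension can be sketched as below. This is one possible reading of the step, not the package's actual code: it assumes 1-based inclusive coordinates, that the 2000 bp promoter is added on the upstream side according to strand, and that the 100,000 bp flank is then added on both sides. The function name `extend_gene_region` and the optional `chrom_length` clamp are hypothetical.

```python
PROMOTER_BP = 2_000    # promoter extension quoted in the text
FLANK_BP = 100_000     # flank added on each side, quoted in the text


def extend_gene_region(start, end, strand, chrom_length=None):
    """Extend a gene interval (1-based, inclusive) by promoter and flanks.

    Assumption: the promoter lies upstream of the gene, i.e. before `start`
    on the "+" strand and after `end` on the "-" strand.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if strand == "+":
        start -= PROMOTER_BP
    else:
        end += PROMOTER_BP
    start -= FLANK_BP
    end += FLANK_BP
    # Clamp to the chromosome boundaries.
    start = max(1, start)
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


# Illustrative coordinates only.
plus_region = extend_gene_region(500_000, 520_000, "+")
minus_region = extend_gene_region(500_000, 520_000, "-")
```

If the promoter were instead treated as strand-independent, only the two promoter lines would change; the flank logic would stay the same.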
First, we generated a sparse matrix called the base matrix, with dimensions 21,856 (genes) by 227,612 (probe IDs). Each component
We combined all CNA datasets from Step
These comprehensive datasets allow a large class of gene-expression software to be utilized to study DNA rather than RNA signatures. In Figure
Procedure for generation of CNAR datasets. (a) Basic matrix in Step
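The procedure can be illustrated with a toy example. The sketch below rests on assumptions that this section does not spell out: the base matrix is taken to be a binary membership matrix (1 where a probe falls inside an extended gene region), and the gene-level CNAR value is taken to be the mean of the member probes' copy number values. The function names `build_base_matrix` and `cnar_from_probes` are hypothetical, and a dense NumPy array is used only because the example is tiny; at the scale described (21,856 genes by 227,612 probes) a sparse representation would be needed.

```python
import numpy as np


def build_base_matrix(gene_regions, probe_positions):
    """Binary genes-by-probes matrix: 1 if the probe lies in the gene region.

    gene_regions:    list of (chrom, start, end), inclusive coordinates
    probe_positions: list of (chrom, position)
    """
    base = np.zeros((len(gene_regions), len(probe_positions)))
    for i, (g_chrom, g_start, g_end) in enumerate(gene_regions):
        for j, (p_chrom, p_pos) in enumerate(probe_positions):
            if p_chrom == g_chrom and g_start <= p_pos <= g_end:
                base[i, j] = 1.0
    return base


def cnar_from_probes(base, cna):
    """Average probe-level values (probes x samples) into genes x samples.

    Genes with no member probe get NaN (assumed handling).
    """
    counts = base.sum(axis=1, keepdims=True)
    sums = base @ cna
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)


# Toy data: three gene regions, four probes, two samples (illustrative values).
genes = [("chr7", 100, 200), ("chr7", 150, 300), ("chr13", 10, 50)]
probes = [("chr7", 120), ("chr7", 180), ("chr7", 250), ("chr1", 20)]
cna = np.array([[0.2, -0.4], [0.6, 0.0], [1.0, 0.4], [5.0, 5.0]])

base = build_base_matrix(genes, probes)
cnar = cnar_from_probes(base, cna)
```

Because overlapping extended regions share probes, neighbouring genes receive correlated values, which is consistent with CNVs typically spanning more than one gene.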
To test our newly generated CNAR datasets, we implemented well-known machine learning algorithms including unsupervised cluster methods and supervised classification methods: consensus clustering, silhouette clustering, and the support vector machine (SVM). The Fisher criterion method [
We applied the newly generated CNAR datasets to an SVM with feature selection to classify survival in patients with OV (18 short and 18 long samples) and GBM (23 short and 23 long samples). To avoid bias, the training of the classifier, the choice of the number of features, and feature selection were done strictly within the training datasets. For evaluation, we used standard leave-one-out cross validation (LOOCV). In OV, the best classification accuracy of long versus short survival was 83.33%, using four features. For GBM, the best classification accuracy of long versus short survival was 82.61%, using 20 selected features. The accuracies obtained using CNAR datasets were higher than those obtained using gene-expression microarray datasets (63.64% and 72.35% for OV and GBM, respectively).
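The evaluation protocol, in which feature selection is repeated inside each leave-one-out fold using only the samples kept for training, can be sketched as follows. Two points are assumptions for illustration: the Fisher score is written in a common form, (difference of class means)² / (sum of class variances), and a nearest-centroid rule stands in for the SVM so that the example needs only NumPy. The data are synthetic, and the function names are hypothetical.

```python
import numpy as np


def fisher_scores(X, y):
    """Fisher score per feature for binary labels (0/1); X is samples x features."""
    X0, X1 = X[y == 0], X[y == 1]
    numerator = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    denominator = X0.var(axis=0) + X1.var(axis=0) + 1e-12  # avoid division by zero
    return numerator / denominator


def loocv_accuracy(X, y, n_features):
    """Leave-one-out accuracy with feature selection done inside each fold."""
    correct = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i           # hold out sample i
        X_train, y_train = X[train], y[train]
        # Rank features on the training samples only, then keep the top ones.
        top = np.argsort(fisher_scores(X_train, y_train))[::-1][:n_features]
        # Stand-in classifier (not an SVM): assign to the nearest class centroid.
        c0 = X_train[y_train == 0][:, top].mean(axis=0)
        c1 = X_train[y_train == 1][:, top].mean(axis=0)
        x = X[i, top]
        pred = int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))
        correct += int(pred == y[i])
    return correct / len(y)


# Synthetic data: 36 samples (18 per class), 50 features, one informative feature.
rng = np.random.default_rng(0)
y = np.array([0] * 18 + [1] * 18)
X = rng.normal(size=(36, 50))
X[y == 1, 0] += 8.0

acc = loocv_accuracy(X, y, n_features=1)
```

Selecting features on all samples before cross validation would leak information from the held-out sample into the feature ranking and inflate the reported accuracy, which is why the ranking sits inside the loop.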
In the OV datasets, because selecting four features gave the best results, we collected the four selected features from every training set during LOOCV. STOML3, which includes eight CN probes from chr13:39482884 to chr13:39554832, was selected in all training sets. The boxplots of STOML3 compare short-survival patients with long-survival patients (Figure
(a) Plot for STOML3 in short and long classes of OV patients using CNAR datasets. (b) Plot for ZNF488 in short and long classes of GBM patients. The
We applied the CNAR datasets to the unsupervised consensus clustering algorithm [
(a) Consensus clustering using CNAR datasets with 204 genes from 215 GBM samples from
We also subjected OV CNAR datasets to the same procedure. In that case, we downloaded 216 samples and selected 175 genes with variance of 0.4, and the results are shown in Figure
(a) Consensus clustering using CNAR datasets with 175 genes from 216 OV samples from
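The gene filtering that precedes clustering can be sketched as a variance filter. This assumes that "variance of 0.4" denotes a minimum across-sample variance threshold, which is an interpretation rather than something stated explicitly; the function name `select_high_variance_genes` and the gene labels are hypothetical.

```python
import numpy as np


def select_high_variance_genes(cnar, gene_names, min_variance=0.4):
    """Keep genes whose variance across samples is at least `min_variance`.

    cnar: genes x samples matrix; NaN entries are ignored in the variance.
    """
    variances = np.nanvar(cnar, axis=1)
    keep = variances >= min_variance
    kept_names = [name for name, k in zip(gene_names, keep) if k]
    return kept_names, cnar[keep]


# Toy matrix: three genes, four samples (illustrative values only).
toy_cnar = np.array([
    [0.0, 0.0, 0.0, 0.0],    # constant, variance 0
    [-1.0, 1.0, -1.0, 1.0],  # variance 1
    [0.1, 0.2, 0.1, 0.2],    # variance 0.0025
])
toy_names = ["GENE_A", "GENE_B", "GENE_C"]

kept_names, kept_matrix = select_high_variance_genes(toy_cnar, toy_names)
```

The filtered matrix can then be passed to any clustering routine that expects an expression-style genes-by-samples input.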
In addition, we selected the gene with the largest gain in copy number, EGFR, using standard deviation from the GBM CNAR dataset. The distribution of CNAR in chromosome 7 is provided in Figure
(a) Examples of CNAR across chromosome 7 for selected patients. The
We subjected CNAR datasets derived from OV and GBM samples to gene set enrichment analysis (GSEA) [
Pathways enriched in OV and GBM: categories of molecular interaction and reactions, from KEGG, are indicated: (1.1) carbohydrate metabolism, (1.2) energy metabolism, (1.3) lipid metabolism, (1.8) metabolism of cofactors and vitamins, (1.9) metabolism of terpenoids and polyketides, (1.11) xenobiotic biodegradation and metabolism, and (2.2) translation.
Pathways enriched in OV | FDR | Enriched in class |
---|---|---|
Pentose and glucuronate interconversion (1.1) | 0 | Long |
Androgen and estrogen metabolism | 0 | Long |
Porphyrin and chlorophyll metabolism (1.8) | 0 | Long |
Aminoacyl tRNA biosynthesis (2.2) | 0 | Long |
Carbon fixation in photosynthetic organisms (1.2) | 0 | Long |
3-Chloroacrylic acid degradation | 0 | Short |
Caprolactam degradation (1.11) | 0 | Short |
Glycolysis and gluconeogenesis (1.1) | 0 | Short |
Atrazine degradation (1.11) | 0 | Short |
Polycyclic aromatic hydrocarbon degradation (1.11) | 0 | Short |

Pathways enriched in GBM | FDR | Enriched in class |
---|---|---|
Linoleic acid metabolism (1.3) | 0 | Short |
Arachidonic acid metabolism (1.3) | 0 | Short |
Terpenoid backbone biosynthesis (1.9) | 0 | Short |
Ether lipid metabolism (1.3) | 0 | Short |
Pentose and glucuronate interconversion (1.1) | 0 | Long |
Porphyrin and chlorophyll metabolism (1.8) | 0 | Long |
In this study, we generated new datasets that represent DNA variation in a format that parallels variation in RNA expression and subjected the data to established machine learning algorithms. Most previous studies identified changes in DNA using CN alteration and then interpreted the relationship with genes by looking up the chromosome region. These new CN alteration-based datasets, which we call CN array (CNAR) datasets, could be used directly for identification of genetic variation at the RNA level. In addition, the CNAR datasets enable straightforward visualization of gain or loss of pseudogenes along the chromosome. In our analyses, we used several existing methods, including clustering, classification, and GSEA. We applied CNAR datasets to two clustering methods, consensus clustering and silhouette clustering; the clustering performance was superior to that obtained using RNA expression. We also applied CNAR datasets to the SVM classification algorithm for OV and GBM patients stratified by short and long survival times; the accuracies were 83.33% and 82.61%, respectively. When we subjected CNAR datasets to GSEA, we found two metabolic pathways enriched in both OV and GBM at an FDR of zero. These new datasets enable many applications, including clustering of cancer subtypes, prediction of survival times, and classification of cancer metastasis, by analyzing DNA alterations using tools developed for RNA-level analysis. Such analyses may provide novel insights into the biological mechanisms underlying cancer. The major limitation of this analysis is the fixed extension of the gene region. To date, there is no established guidance on how far a gene region should be extended for copy number alteration analysis. Therefore, further study is needed to develop an extension method suited to each gene region.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the National Research Foundation Grant funded by the Korean Government (NRF-2013S1A2A2034953).