Identifying Mutually Exclusive Gene Sets with Prognostic Value and Novel Potential Driver Genes in Patients with Glioblastoma

The pathogenesis and prognosis of glioblastoma (GBM) remain poorly understood. Mutual exclusivity analysis can distinguish driver genes and pathways from passenger ones. The purpose of this study was to identify mutually exclusive gene sets (MEGSs) that have prognostic value and to detect novel driver genes in GBM. The genomic alteration profile and clinical information were derived from The Cancer Genome Atlas, and the MEGSA method was used to identify the MEGS. Next, we performed survival analysis and constructed a risk prediction model for prognostic stratification. Leave-one-out cross-validation and permutation test were used to evaluate its performance. Finally, we identified 21 statistically significant MEGSs. We found that the MEGS in the RB pathway was significantly associated with poor prognosis, after adjusting for age and gender (HR = 1.837, 95% CI: 1.192–2.831). Based on the risk prediction model, 208 (80.9%) and 49 (19.1%) patients were assigned to high- and low-risk groups, respectively (log-rank: p < 0.001, adjusted p=0.001). Additionally, we found that SPTA1, a novel gene involved in the MEGS, was mutually exclusive with members of cell cycle, P53, and RB pathways. In conclusion, the MEGS in the RB pathway had considerable clinical value for GBM prognostic stratification. Mutated SPTA1 may be involved in GBM development.


Introduction
Glioblastoma (GBM) is the most common and biologically aggressive primary brain tumor [1,2]. Each year, it affects over 10,000 new patients in the United States [3]. Despite improvements in diagnostic and therapeutic approaches, patients with GBM have poor prognosis. e median overall survival (OS) time is 12-17 months [2,[4][5][6]. To improve the prognosis of GBM, it is important to understand the carcinogenic mechanism of GBM. Tumor development is primarily driven by the accumulation of lifetime somatic alterations [7,8]. erefore, identifying and understanding the genetic and pathway abnormalities that drive the initiation and progression of GBM are critical for the development of effective therapies [2]. e development of the next-generation sequencing has accumulated a large amount of genomic data. e major tasks of analyzing these data are identifying driver alterations that contribute to cancerogenesis and investigating their functional interactions. ese tasks can be approached via mutual exclusivity (ME) analysis [9,10]. Mutual exclusivity of genomic alterations, indicating that genes belonging to the same functional pathway tend not to mutate simultaneously in the same patient, has been observed in various cancer types [11,12]. Over 25% of wellknown cancer genes show an mutual exclusivity (ME) pattern [7]. Detecting an ME pattern is important to understand the tumorigenic mechanisms and identify drug targets. Currently, several methods based on mutual exclusivity have been proposed to uncover novel infrequent cancer drivers and investigate their functional relationship [9,10,13]. Mutually exclusive gene set analysis (MEGSA), proposed by Hua et al., is a new model to discover mutually exclusive gene sets (MEGSs) from de novo or existing biological pathways. Simulation studies have indicated that MEGSA outperformed other methods, such as Dentrix, MDPFinder, Multi-Dentrix, and Mutex, in statistical power and their capability for identifying specific MEGSs, especially for highly imbalanced MEGSs [9]. However, one limitation of this analysis is that only nonsynonymous point mutations were taken into consideration when Hua et al. identified MEGSs in patients with GBM. Consequently, only one mutually exclusive gene pair (PTEN and IDH) was found. Compared with other types of somatic genetic alterations, copy number variation (CNV) accounts for a large fraction of genomic alterations in cancer [14] and plays a critical role in carcinogenesis [14]. erefore, it is necessary to take CNVs into account when performing mutual exclusivity analysis. Other studies have identified ME patterns related to GBM; however, no study has analyzed their prognostic values [9,10,13,[15][16][17][18][19][20]. erefore, one purpose of this study is to identify MEGSs and detect novel infrequently driver genes in GBM by integrating nonsynonymous single-nucleotide variants (SNVs) and copy number variations (CNVs) using MEGSA. A further objective is to assess the prognostic value of specific MEGSs.

Data.
e preprocessed GBM genomic variant dataset was derived from Multi-Dentrix, which contained 398 alterations (nonsynonymous SNVs and CNVs) and 261 patients [18,21]. Data preparation for GBM was described in reference [18]. e clinical data were downloaded from e Cancer Genome Atlas (TCGA). Samples with incomplete survival information were excluded, and 257 patients with GBM were enrolled in survival and leave-one-out crossvalidation (LOOCV) analysis.

Identifying MEGSs.
In this study, the MEGSA was employed to identify MEGSs and novel driver genes. In brief, MEGSA consists of three parts. First, a likelihood ratio (LRT) statistic for testing mutual exclusivity was constructed. Second, global null hypothesis (GNH) analysis was performed to test whether the set of M genes contains an MEGS of any size. ird, the optimal MEGS was identified using model selection [9].
Suppose that A 0 is an MEGS with N rows that correspond to patients and m columns that represent genes. e entity a ik denotes the mutation status which is 1 if the gene k is mutated for the subject i or 0 otherwise. We defined the set of model parameters, Θ � (c, P, Π), using coverage, c, genespecific background mutation rate, Π � (π 1 , . . . , π m ), and gene-relative mutation frequencies in A 0 , P � (p 1 , . . . , p m ).
erefore, under the assumption of p k ∝ π k , the total log likelihood across N subjects is defined as e LRT is calculated as S � 2(log L(c 1 , Π 1 ; A 0 ) − log L(c 0 , Π 0 ; A 0 )) (H 0 : c � 0 versus H 1 : c > 0), with an asymptotically null distribution of 0.5χ 2 0 + 0.5χ 2 1 . e GNH test is completed in three steps: (1) e multiple-path search algorithm is performed to determine the minimum p values for gene sets with different size (denoted as p k (k � 2, . . . , K)). (2) e permutation test is used to adjust the p k values and obtain Q k (k � 2, . . . , K). Intuitively, Q k measures the significance that searches only for MEGSs of size k. (3) Finally, the overall statistic is defined as θ � min(Q 2 , . . . , Q K ).
Considering two significant putative MEGSs (Q k < θ 1− α ), MEGS1 has two genes (G 1 , G 2 ) with a nominal p value of p 1 and MEGS2 has three genes (G 1 , G 2 , G 3 ) with a nominal p value of p 2 based on LRT. e null hypothesis of model selection would be that none of the M-2 genes (G 3 , . . . , G M ) are mutually exclusive of (G 1 , G 2 ). We chose MEGS2 if p 2 < p 0 and p 0 was chosen according to permutations with a false-positive rate <5%. is procedure was repeated until the size of the MEGS reached its preset maximum value, k, or the hypothesis test no longer rejected H 0 .

Selection and Validation of MEGSs Related to Prognosis.
We transformed the gene mutation profile to the MEGS mutation profile by assuming that the MEGS was mutated in a patient if any gene in the gene set was mutated [9]. Univariate and multivariable Cox proportional hazards models were constructed to assess the association between the MEGS, clinical characteristics, and 5-year survival. Next, we developed a risk prediction model based on the prognostic index of the multivariable Cox model for prognostic stratification and evaluated its performance using LOOCV [22,23]. For each leave-one-out step, the risk score was calculated for the patient who was removed for testing. Following this, each patient was classified into the high-or low-risk group based on whether the risk score was above or below the cut-off value [23]. e cutoff among N risk scores was defined using maximally selected log-rank statistics [4,23,24]. Survival curves of the high-and low-risk groups were estimated using the Kaplan-Meier method, and significance was assessed using the log-rank test. To overcome any overfitting bias, the permutation test was used to adjust the log-rank p value. In brief, we randomly permuted the correspondence of survival time and censoring indicators to covariates and repeated the entire LOOCV process. e adjusted p value was calculated as the proportion of permutations whose log-rank statistics were greater than or equal to the value of the statistic for the original data [22,23].
All analyses were considered statistically significant if p < 0.05. All analyses were performed using R 3.3.2 and SAS 9.2. Table S1).

Prognosis Stratification Based on Risk Prediction Model.
We developed a risk prediction model based on the prognostic index of the multivariable Cox model to divide the patients into low-and high-risk groups. Taking practical and statistical significance into consideration, we chose 0.82 as the cut-off value using the maximally selected log-rank statistics (Figure 2(a)).
ere were 49 (19.1%) and 208 (80.9%) patients in the low-and high-risk groups, respectively. Figure 2(b) shows the Kaplan-Meier curves for the low-and high-risk groups (log-rank: p < 0.001). e adjusted log-rank p value calculated via the permutation test (1000 times) was 0.001. e univariate Cox model indicated that the mortality risk within 5 years in the high-risk group was 1.953 times higher than that in the low-risk group (Table 3).

Discussion
In this study, we identified MEGSs in GBM by integrating nonsynonymous SNVs and CNVs. Most genomic alterations that were involved in MEGSs were enriched in core pathways (RB, P53, and RTK/RAS/PI(3)K pathways) required for GBM pathogenesis [1,6,21], providing an important validation for the MEGSA. e most significant MEGSs included 3 genomic alterations: CDKN2A deletion, CDK4 amplification, and RB1 mutations (covered 87.7%). ese genes are core members of the RB pathway, which plays a central role in the regulation of cell proliferation. In quiescent cells, hypophosphorylated RB (active) binds E2F to prevent cell progression through the G1/S cell checkpoint, whereas in the proliferating cell, the D-cyclin/CDK4/6 complex phosphorylates RB (inactive) leading to the release of E2F, which, in turn, induces genes required for DNA synthesis and cell growth. CDKN2A-p16 INK4A is a negative regulator of the RB pathway, and CDKN2A-p16 INK4A competes with D-cyclins to bind CDK4/ 6, which prevents the formation of the D-cyclin/CDK4/6 complex [1,6,29]. Intuitively, any genomic alteration, including CDKN2A deletion, CDK4 amplification, and RB1 mutations, can inactivate RB, resulting in cell proliferation. Moreover, our results showed that the ME pattern in the RB pathway was associated with poor prognosis in GBM. Previous studies have shown that disrupting the RB pathway is associated with prognosis of various human cancers [30][31][32][33][34][35][36][37]. Immunohistochemical analysis has shown that the underexpressed RB protein in gastric adenocarcinoma [36] and low expression of p16 (encoded by CDKN2A) in oral carcinoma [31], vertical growth phase melanoma [32], esophageal squamous cell carcinoma [35], and GBM [38] significantly predict poor patient survival. Bäcklund et al. have reported that any loss of CDKN2A and RB or the amplification of CDK4 in anaplastic astrocytoma (AA) was associated with decreased survival [39]. Furthermore, poor prognosis in patients with an abnormal RB pathway may be due to high resistance caused by RB silencing to etoposide (VP-16) [40]. An interpretation about mutual exclusivity, referred to as synthetic lethality, is that the secondary driver alteration within the same pathway is detrimental to cells and may result in cell death [12,13,41]. erefore, our study results provided a clue to the development of tumor molecular targeted therapies. Additionally, we developed a risk prediction model for prognosis stratification. e leave-oneout cross-validation and permutation test results revealed the effectiveness of the developed model in our study. ME analysis can overcome the limitations linked to the frequency-based method for large sample size and detect less frequent mutated genes [9,10,13]. Given that CDKN2A, TP53, RB1, PTEN, NF1, CDK4, MDM2, EGFR, PDGFRA, IDH1, and MET are well-known genes associated with GBM, the observed mutual exclusivity suggests that SPTA1 and CAPZA2 may be cancer genes. SPTA1, which is one of the most recurrent genes involved in MEGS, encodes α-spectrin. α-Spectrin and ß-spectrin are assembled into spectrin, which is an actin crosslinking and molecular scaffold protein that determines cell shape and membrane protein location [42,43]. Alterations in SPTA1 are associated with colorectal cancer [44,45] and small-cell lung cancer [42]. However, to date, the carcinogenic mechanism of mutated SPTA1 remains unknown. Previous studies have shown that nonerythroid α-spectrin interacts with proteins that are related to several cellular processes, such as DNA synthesis, cell cycle progression, and signal transduction, which are consistent with our findings [46]. We found that SPTA1 mutations were mutually exclusive to the core members of the RB, P53, and cell cycle pathways. Taken together, these data indicate that mutated SPTA1 may be related to abnormal cell proliferation and apoptosis in GBM development.
CAPZA2 encodes the human actin-capping protein α-subunit. e function of the actin-capping protein is to block the growth of actin filaments by capping the barbed   end [47]. e CAZ2 protein is overexpressed in breast cancer, and the F-actin-capping protein is linked to renal cell carcinoma [48,49]. Moreover, Mueller et al. have observed CAPZA2 amplification in glioma, which was in good agreement with our observation [50]. However, the investigations about the role of CAPZA2 in cancer are rare. It is possible that CAPZA2 may play a role in tumor-specified cell motility [48]. Recently, Ohishi et al. found that CAPZA2 negatively regulates cell invasion [51], which indicates that amplified CAPZA2 may be a favorable prognosis marker in cancer. To explore the prognostic value of SPTA1 and CAPZA2, survival analyses were performed. However, our results showed that neither SPTA1 (p � 0.764) nor CAPZA2 (p � 0.213) had significant associations with survival in patients with GBM after adjusting for age, gender, and CDK4(A)/RB1/CDKN2A(D). ese data suggest that alterations in SPTA1 and CAPZA2 may be linked to the formation of GBM alone. Nonetheless, the mutually exclusive gene set CDK4(A)/CDKN2A(D)/RB1 was involved in the formation of GBM and predicted the prognosis of GBM. e main limitation of this study was the lack of external validation, which makes our results less reliable. erefore, a study using a larger patient cohort and an experiment with cell lines are required to validate our findings and allow more reliable conclusions to be reached.

Conclusions
In summary, we derived 21 MEGSs by integrating nonsynonymous single-nucleotide variants and copy number variations. In these MEGSs, only the ME pattern in the RB pathway predicted the prognosis of patients with GBM after adjusting for age and gender. is finding may help researchers develop molecular targeted therapies and identify high-risk GBM for better treatment. Additionally, we obtained several less frequent cancer genes, which may extend our knowledge on the pathogenesis of GBM.

Data Availability
e results published here are based on the data generated by the TCGA database at http://cancergenome.nih.gov/. e preprocessed GBM genomic variant dataset was derived from the study by Leiserson et al. [18].

Disclosure
An earlier version of this manuscript has been presented as a conference abstract at the 4th World Congress on Cancer Research & erapy.

Conflicts of Interest
e authors declare no conflicts of interest.