Uncovering Driver DNA Methylation Events in Nonsmoking Early Stage Lung Adenocarcinoma

As smoking rates decrease, proportionally more cases with lung adenocarcinoma occur in never-smokers, while aberrant DNA methylation has been suggested to contribute to the tumorigenesis of lung adenocarcinoma. It is extremely difficult to distinguish which genes play key roles in tumorigenic processes via DNA methylation-mediated gene silencing from a large number of differentially methylated genes. By integrating gene expression and DNA methylation data, a pipeline combined with the differential network analysis is designed to uncover driver methylation genes and responsive modules, which demonstrate distinctive expressions and network topology in tumors with aberrant DNA methylation. Totally, 135 genes are recognized as candidate driver genes in early stage lung adenocarcinoma and top ranked 30 genes are recognized as driver methylation genes. Functional annotation and the differential network analysis indicate the roles of identified driver genes in tumorigenesis, while literature study reveals significant correlations of the top 30 genes with early stage lung adenocarcinoma in never-smokers. The analysis pipeline can also be employed in identification of driver epigenetic events for other cancers characterized by matched gene expression data and DNA methylation data.


Introduction
As a leading cause of death worldwide, lung cancer is mainly attributed to smoking in both men and women [1,2], of which the most common histological subtype is adenocarcinoma. However, as smoking rates decrease, proportionally more cases occur in never-smokers [3]. Lung adenocarcinoma in never-smokers shows obvious distinctions in clinical and molecular mechanism to those cigarette smoking [4]. Both genetics and epigenetics in cancer genomes have been suggested to account for the development of lung adenocarcinoma.
As one of the vital epigenetic mechanisms, DNA methylation regulates gene expression without alterations in DNA sequence [5,6] and plays key roles in X chromosome inactivation, genome stability, chromatin structure, embryonic development, differentiation, and maintenance of pluripotency in normal somatic cells [7,8]. Genome-scale methylation-profiling techniques have confirmed the existence of widespread aberrations of DNA methylation patterns in human cancer genome [9][10][11][12]. Studies of DNA methylation have suggested that both global DNA hypomethylation and gene-specific hypermethylation may contribute to the initiation and progression of tumorigenesis, as well as gene body methylation [13][14][15]. It is challenging but of great significance to distinguish genes whose methylation changes are crucial in cancer occurrence, progression, or metastasis from genes whose methylation changes merely have effects on the process of tumorigenesis in cancer research and therapy [13]. Unlike somatic mutations in the genome, DNA methylation is inherently reversible and serves as potential drug targets in cancer intervention [16,17].
Numerous studies have focused on discovering genes whose DNA methylation potentially plays key roles in tumorigenesis of lung adenocarcinoma, including integration of genome-scale DNA methylation and gene expression [18][19][20][21]. The main idea of these works is to search genes whose gene expression fluctuations are highly correlated to DNA methylation changes. However, there is a deficiency derived on the complexity of the gene expression regulation. Both genetic and epigenetic alterations can contribute to gene expression as well as other transcriptional factors in sophisticated manners in complex diseases [22,23]. In tumors, a differential gene expression may be induced by an aberrant DNA methylation in the promoter of the gene but also may be a consequence regulated by its upstream genes in regulatory mechanisms. These appeal to a great attention in uncovering driver DNA methylations, which play major roles in methylation-associated gene silencing and drive malignant transformation [5,13]. In this work, we refine the generalized description of driver methylation as two properties. (1) Driver DNA methylation should induce distinctive expressions in tumors with differential DNA methylation (T-DM) when compared to expressions in matched adjacent nontumor (normal) and tumors with nondifferential DNA methylation (T-NDM), and (2) driver methylation should induce a distinct regulation module in the network perspective. The first property guarantees the major role of DNA methylation in the regulation of gene expression, while the second property guarantees the functional effects of driver genes on tumorigenesis.
Focusing on genes differentially expressed among matched adjacent nontumors (normal), tumors with aberrant DNA methylation (T-DM), and tumors without aberrant DNA methylation (T-NDM), we integrate genomewide DNA methylation data and gene expression data to uncover driver methylation events in never-smokers in early stage lung adenocarcinoma. Differential network analyses show significant changes of DNA methylation-responsive modules in network topology across normal, T-DM, and T-NDM, which imply potential mechanisms of identified driver genes underlying the tumorigenesis.

Data Sets.
Both the DNA methylation data and gene expression data are downloaded from NCBI Gene Expression Omnibus (GEO) with accession number GSE32867 [18]. The series contains 59 samples with paired genome-scale DNA methylation profiling and gene expression. Stage I and stage II are merged as early stage and stages III-IV are labeled with late stage [18]. After removing noisy data [18], 22 samples are labeled with "never smoking" and "early stage" simultaneously. Paired DNA methylation data and gene expression data of these 22 samples are collected to further analysis. Probes in gene expression data are firstly mapped to Entrez gene ID and expression values sharing same Entrez gene IDs are averaged among samples.

Schematic
Overview of the Analysis Pipeline. The schematic overview of the analysis pipeline is shown in Figure 1, and detailed procedures are described in the following sections. Figure 1(a) shows a brief schematic overview of this procedure. The difference matrix is firstly created to measure differences of beta values of DNA methylation between tumor and normal. The kernel probability distribution with normal smoothing function is used to estimate the probability density distribution for each probe in the difference matrix (Figure 1(a)). The hypothesis is that the differences of beta values for given probes come from distributions with the mean 0 and unknown variances. The cumulative density function (CDF) is used to estimate the probability of a beta value falling within given interval. Hypermethylation and hypomethylation are determined by the upper bound CDF > 0.95 and the lower bound CDF < 0.05, respectively. For each probe, tumors are partitioned into two groups, tumors with differential methylation group (T-DM) and tumors without differential methylation group (T-NDM).

Candidate Driver Gene Selection.
Then, the two-sample -test is used to evaluate differential expression under conditions [24], and values are adjusted by the procedure introduced by Storey [25]. The mapping from DNA methylation to gene expression is performed by shared Entrez gene ID. Probes remain if the mapped genes are differentially expressed in T-DM when compared to normal and T-NDM (adjusted value < 0.05), which implies that the differential methylation of given probes in T-DM is more likely to induce significant expression changes. Probes mapping to same genes are removed if hypermethylation and hypomethylation coexist in more than 5 samples. Then samples in T-DMs and T-NDMs merge, respectively, by shared Entrez gene ID and serve as T-DM and T-NDM of the gene.
We then search for genes whose expressions are highly discriminative and consistent in T-DM when compared to normal and T-NDM. Many types of statistics, such as Wilcoxon score, Pearson correlation coefficient (PCC), or mutual information (MI), could be used to score the relationship between gene expression and class labels, and a -score method is used in this work [26]. For a given gene, let be the gene expression levels across samples with class and the discriminative score ( , ) is defined as the t-test statistic. To determine whether the discriminative level of the gene among groups is consistent, we permute the class by 1000 times and obtain a background distribution of the discriminative scores ( , ) derived on the gene expression levels and permuted class . Genes with significant values ( value < 0.05) among groups (normal versus T-DM and T-DM versus T-NDM) are considered differentially methylated and served as candidates for further analysis.  Driver responsive module For each candidate gene, a subset of DM responsive genes is collected and DM responsive modules are constructed by the CLR method. Candidate driver genes are ranked by differential scores derived on the differential network analysis.

Detection of DNA Methylation-Responsive
candidate gene , we firstly recognize a set of genes whose expressions are highly discriminative among groups defined by DNA methylation profiles of . These genes are potentially responsive to aberrant DNA methylation of . The Context Likelihood of Relatedness (CLR) method [27] is used to assess regulatory relationships among these genes. CLR estimates MI for each pair of variables and corrects the MI via a background-corrected procedure. In particular, for mutual information ( ; ), CLR scores the relatedness between a pair of variables and by the joint likelihood measurement: where and are the mean and standard deviation derived on the empirical distribution of MI between and arbitrary variables ( = 1, 2, . . . , ) and ( ; ) is the mutual information of and .
CLR employs B-spline smoothing and discretization method [28] to estimate the MI for a pair of variables. However, it is time-consuming in this work under diversiform conditions and permutations. Thus, we use the following estimation method to calculate MI for pair of variables and [29]; that is, where is the PCC of and . An experienced threshold is necessary when CLR is employed. A larger threshold results in a higher precision but a smaller size of responsive modules. The size of more than 70% modules is less than three when = 4.46, while the size of 80% modules is larger than 3 when = 4.46 and approximate ranking lists of top 30 genes are obtained when falls in the interval between 3.96 and 5.46. Thus, we set = 4.46 in this work.

Scoring Candidate Driver
Genes by Differential Network Analysis. Differential network analysis reveals dynamic changes of pathways and potential mechanisms in complex diseases including cancers [30]. For each candidate gene, we calculate CLR scores for edges in responsive modules under normal and T-NDM. Differential scores are calculated to estimate network differences among groups. The differential score (DS) is yielded by the following equation: where is the CLR score of the th edge and is the number of edges in driver methylation-responsive module. Then candidate genes are prioritized by DS scores in descending order.

Results
We focus on the detection of differentially methylated genes which play key roles in tumorigenesis ("driver methylation gene") and modules responsive to aberrant methylation of these genes. Rather than genes with consistent expressions to DNA methylation levels in whole tumors, we detect genes differentially expressed and consistent with DNA methylation in T-DM when compared to normal and T-NDM.

Identification of Candidate Driver Genes in Tumorigenesis.
By integrating DNA methylation and corresponding gene expression data, the samples are partitioned into three groups (normal, T-DM, and T-NDM) for each gene (Figure 1(a)). Firstly, we remove genes that are not differentially expressed in T-DM when compared to normal and T-NDM. Then a permutation test is performed to determine the significance of the consistency of gene expression changes in T-DM when compared to T-NDM. To obtain a significant level of differences, we randomly permute T-DM and T-NDM and calculate differences. After 1000 times permutation, a background distribution of differences is constructed. After removing genes with the absolute mean beta value less than 0.1, 135 genes remain in the candidate list (see Supplementary File in Supplementary Material available online at http://dx.doi.org/10.1155/2016/2090286). We perform a functional enrichment analysis using DAVID [31,32]. Of these 135 genes, 115 are annotated to GO terms including cancerrelated functions such as response to stimulus, development process, cell differentiation, cell adhesion, cell growth and cell death, DNA repair, and apoptosis, which imply potential relationships between cancers and these 135 genes.

Detection Responsive Modules of Candidate Driver Genes.
Biological network reveals cell's functional organization [33].
To characterize the functional implications of candidate driver genes in tumorigenesis, we detect modules responsive to differential methylation of candidate driver genes (Section 2). Totally, 130 of 135 modules have at least one edge when the threshold of CLR is set to 4.46, and the mean size of 130 modules is 15.

Prioritization of Candidate Driver Genes by Differential
Network Analysis. We argue that a driver DNA methylation can induce not only a distinctive gene expression in T-DM, but also a distinctive module responsive to the alteration. We score each candidate driver gene by analysis of the differential level of the responsive module. Candidate driver genes are ranked by differential scores in descending order. We testify the significance of the differential score to a background distribution derived from random permutations. For a given candidate driver gene, genes are randomly selected from its possible responsive genes with module size maintained, and a new module is constructed by CLR with = 4.46 as well as a differential score. A sequence of DS consisting of random differential scores is obtained after 1000 times random permutation. Of 135 candidate driver genes, 130 genes pass the test with value < 0.01.
We also perform a differential network analysis of responsive modules under different CLR thresholds from 1.96 to 6.96 with step 0.5. Almost all modules obtain significant differential scores under CLR cutoffs (Supplementary File). Table 1 lists details of top 30 genes.

Discussion
We build two lists as background to testify the accuracy of the ranked list. The first consists of genes that show absolute mean fold change larger than 0. We test the accuracy of our list to Standard Lit and Standard Sel; Figure 2(a) shows the ROC curves with AUC = 0.686 and AUC = 0.628, respectively, which means that over half of genes in two standard lists are high-ranked in our list. The ranked list is also validated by literature annotation. Of the top 30 genes, 27 genes are previously reported to be cancer-relevant, while 17 of them are lung cancer or nonsmall-cell lung cancer-related (Table 1). We also annotate responsive modules of top 30 ranked genes to KEGG signaling pathways. Among them, responsive modules for 18 genes are enriched with KEGG signaling pathways with significance level value < 0.01, which imply significant relations of these responsive modules to cancer processes (Table 2) and indicate potential mechanism changes induced by aberrant DNA methylation. The KEGG signaling pathways are collected from MsigDB [63,64].
Of 30 top ranked genes, FAM107A, MAMDC2, SOX17, TCF21, PTPRH, and CDO1 have been previously reported with aberrant DNA methylation in lung cancer [18,34,45,46,52,57]. All these genes obtain higher occurrences ( > 19) in lung adenocarcinoma. AGR2, CDH13, CRYAB, MX2, SH100P, and SH3GL2 are reported with aberrant gene expression [38,40,47,54,58,59], while AGR2, CDH13, and MX2 are of high occurrences in aberrant DNA methylation ( ≥ 18). Differential expression of these genes has been reported playing crucial roles in key pathways in tumorigenesis or serving as potential prognostic targets. With higher occurrences, the correlation of differential gene expression and aberrant DNA methylation of AGR2, CDH13, and MX2 have been reported relevant to lung adenocarcinoma [18]. Alpha B-crystallin (CRYAB) is one of the important members of the small heat-shock protein family with aberrant DNA methylation occurring in 12 of 22 samples. The upregulated expression of CRYAB is reported relevant to the poor survival of patients with non-small-cell lung cancer (NSCLC) [38]. Interestingly, we find a contrary expression pattern in early stage lung adenocarcinoma in nonsmoking patients (Figure 3). A decreased expression is observed in both T-DM ( value = 8.20 − 11) and T-NDM ( value = 7.72 − 8) when compared to normal, while a relatively weak difference is also observed between T-DM group and T-NDM group (mean fold change difference = 0.07, value = 0.15), which implies multiple mechanisms in regulation of CRYAB, as well as DNA hypermethylation. The responsive module of CRYAB is highly changed in normal and T-NDM (DS = 14.508, value = 3.84 − 10). The similar case is SH3GL2, deletion of which downregulates tumor growth by modulating EGFR signaling [47].
Another interesting case is S100P, which has been reported as a key gene in tumor progression in both initial stage and advanced stage in lung adenocarcinoma [60]. The gene shows distinctive expressions among normal, T-DM, and T-NDM. There are nearly no changes existent in gene expression between normal and T-NDM, while in T-DM, upregulation is observed, which implies that the upregulation of S100P may be an important step in the early stage of lung adenocarcinomas.
Also some genes are relevant to cancers but lung cancer from literature study (COL5A2 [51], SPARCL1 [35], EFEMP2 [39], MSR1 [49], and DOCK2 [61]). APARCL1 and DOCK2 have shown downregulation in types of cancer [36,61], while both of them show downregulated gene expressions in T-DM with high occurrences of DNA hypermethylation. Similar to CRYAB, EFEMP2 shows contrary expression patterns in our observation compared to which in gliomas [39]. EFEMP2 has high occurrences of DNA hypermethylation and downregulated gene expression in totally 20 samples, while 2 samples in T-NDM show little differences when compared to matched normal. COL5A2 also shows T-DM specific upregulation of gene expression and DNA hypermethylation with high occurrences.
We show the responsive module of MSR1 in Figure 4(a) as a representation of responsive modules of cancer-related genes. All these genes exhibit significant changes in responsive modules in T-DM when compared to normal and T-NDM.
Besides cancer-related genes, three genes ARL14, CELSR3, and WFDC3 are also observed in our list. These three genes show T-DM specific expression changes (Figure 3), and regulatory correlations in responsive modules show significant differences in T-DM when compared to normal and T-NDM (Figures 4(b)-4(d)) which also imply potential roles of the three genes in the tumorigenesis of lung adenocarcinoma.

Conclusions
By integration of gene expression and DNA methylation data, we analyzed 22 matched lung adenocarcinoma/nontumor lung pairs for nonsmokers in early stage lung adenocarcinoma. By focusing on differences in gene expression patterns and responsive modules derived from T-DM compared to those in normal and T-NDM, we proposed a pipeline by employing a differential network analysis strategy. Totally, 135 candidate genes are analyzed, and top 30 genes are well studied in this work. All 135 genes are differentially expressed in T-DM when compared to matched normal and T-NDM, while 130 of them show significant changes in regulatory correlations of responsive modules. Literature mining of top 30 genes indicates a high proportion of lung cancerrelevant genes, which implies potential risks of these genes to disturb functions and pathways via differential methylation mechanisms, and further drives the tumorigenesis of lung adenocarcinoma in early stage. In conclusion, we provide a bioinformatics pipeline to identify driver genes with aberrant DNA methylation by fully considering differential expression and network changes in T-DM, normal, and T-NDM. The analysis pipeline can also be employed in identification of driver genes with aberrant DNA methylation of other cancers characterized by paired gene expression and DNA methylation.