Integrated Analysis of DNA Methylation and mRNA Expression Profiles Data to Identify Key Genes in Lung Adenocarcinoma

Introduction. Lung adenocarcinoma (LAC) is the most frequent type of lung cancer and has a high metastatic rate at an early stage. This study is aimed at identifying LAC-associated genes. Materials and Methods. GSE62950 downloaded from Gene Expression Omnibus included a DNA methylation dataset and an mRNA expression profiles dataset, both of which included 28 LAC tissue samples and 28 adjacent normal tissue samples. The differentially expressed genes (DEGs) were screened by Limma package in R, and their functions were predicted by enrichment analysis using TargetMine online tool. Then, protein-protein interaction (PPI) network was constructed using STRING and Cytoscape. Finally, LAC-associated methylation sites were identified by CpGassoc package in R and mapped to the DEGs to obtain LAC-associated DEGs. Results. Total 913 DEGs were identified in LAC tissues. In the PPI networks, MAD2L1, AURKB, CCNB2, CDC20, and WNT3A had higher degrees, and the first four genes might be involved in LAC through interaction. Total 8856 LAC-associated methylation sites were identified and mapped to the DEGs. And there were 29 LAC-associated methylation sites located in 27 DEGs (e.g., SH3GL2, BAI3, CDH13, JAM2, MT1A, LHX6, and IGFBP3). Conclusions. These key genes might play a role in pathogenesis of LAC.


Introduction
As the most common type of lung cancer and a non-small cell lung carcinoma (NSCLC) [1], lung adenocarcinoma (LAC) is characterized by gland or duct formation and massive mucus production [2]. LAC generally originates in peripheral lung tissue, and this is in contrast with squamous cell lung cancer and small cell lung cancer (SCLC), which are both apt to be located more centrally in lungs [3,4]. In US, LAC accounts for approximately 40% of lung cancers [5]. Smoking is the main cause of LAC, and the disease has a high metastasis rate even at an early stage [4]. Therefore, it is necessary to identify key genes in LAC and develop effective therapeutic schedule.
The WNT/TCF signaling can promote osseous and brain metastasis of LAC cells through targeting HOXB9 and LEF1 which mediates chemotactic invasion and colony outgrowth [6]. Coexpression of Oct4 and Nanog, which are homeobox transcription factors important for self-renewal of stem cells, commands epithelial-mesenchymal transdifferentiation, mediates tumor-initiating ability, and contributes to metastasis of LAC [7]. As a noncoding RNA, MALAT-1 enhances motility of LAC cells via regulating motilityassociated gene expression in transcriptional and posttranscriptional levels [8,9]. BRAF mutation is frequently detected in human LAC, indicating that BRAF may serve as a therapeutic target for a subset of patients with the disease [10,11]. Overexpression of caveolin-1 is essential for mediating filopodia formation, which may promote invasion of LAC cells [12]. In spite of the above researches, the mechanisms of LAC still remain unclear.
Recently, along with the development of chip technology, microarray data have been obtained and uploaded to Gene Expression Omnibus (GEO) for us to study [13]. Using microarray data GSE62950, we screened the differentially expressed genes (DEGs) between LAC tissue samples and adjacent normal tissue samples. And their potential functions were predicted by Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Disease Ontology (DO) enrichment analysis. Then, the interrelationships between these DEGs were analyzed by protein-protein interaction 2 BioMed Research International (PPI) network and module analysis. At last, LAC-associated methylation sites were identified and mapped to the DEGs to obtain LAC-associated DEGs.

Microarray Data.
Microarray data GSE62950 was downloaded from the database of GEO (http://www.ncbi.nlm.nih .gov/geo/), which included a DNA methylation dataset and an mRNA expression profiles dataset. The DNA methylation dataset and the mRNA expression profiles dataset separately were based on the platform of GPL8432 Illumina HumanRef-8 WG-DASL v3.0 and GPL8490 Illumina HumanMethyla-tion27 BeadChip (HumanMethylation27 270596 v.1.2), and both of them included 28 LAC tissue samples and 28 adjacent normal tissue samples.

Data Preprocessing.
Normalized series matrix file of mRNA expression profiles was downloaded directly. After beta matrix of DNA methylation data was downloaded, primary methylation signals were preprocessed by Methylation Module V1.9 [14] in BeadStudio V3.1.0.0 to obtain normalized beta matrix. Those methylation sites which had no signal values in one or more samples were removed.

DEGs
Screening. Using Limma package [15] in R, the DEGs between LAC tissue samples and adjacent normal tissue samples were identified. The values of DEGs were adjusted by Benjamini-Hochberg (BH) method [16]. The adjusted value < 0.05 and |log 2 fold-change(FC)| ≥ 1 were taken as the thresholds.

GO, KEGG, and DO Enrichment
Analysis. GO is utilized for predicting potential functions of gene products in three categories (biological process, BP; cellular component, CC; and molecular function, MF) [17]. The KEGG database can be used for systematic analysis of gene functions, which connects genomic information with corresponding functional information [18]. The DO database contains a comprehensive knowledge of human diseases and is applied to annotate diseases [19]. Using TargetMine online tool [20], GO, KEGG, and DO enrichment analyses were performed for the DEGs. The values of enriched terms were corrected by Holm-Bonferroni [21]. The adjusted value < 0.05 was used as the cut-off criterion.

PPI Network and Module
Construction. The STRING [22] database was utilized to search PPI pairs among the DEGs. And combined score > 0.4 was used as the cut-off criterion. Then, the PPI network of the DEGs was visualized by Cytoscape (http://www.cytoscape.org/) [23]. Subsequently, igraph package [24] in R was used to calculate connectivity degrees of nodes (proteins) in the PPI network, and nodes with higher degrees were taken as hub nodes.
Furthermore, MCODE plugin [25] in Cytoscape was used to screen modules from the PPI network. Using BinGO plugin [26], GO functional enrichment analysis was conducted for the genes in each module.

Screening of LAC-Associated Methylation
Sites. Using genefilter package [27] in R, the methylation sites which had higher beta value variations within groups compared with variations among groups were deleted. And the value < 0.05 was used as the cut-off criterion. Then, the methylation sites located in sex chromosomes were removed. Finally, the associations between the screened methylation sites and LAC were analyzed by CpGassoc package [28] in R. The value < 0.05 was taken as the threshold.

Correlation Analysis of Methylation Sites and DEGs.
According to the annotation information of the DNA methylation profiles, the nearest genes to the LAC-associated methylation sites were obtained and then mapped to the DEGs. At last, LAC-associated methylation sites of the DEGs were gained.

Validation of Methylation Sites and DEGs in LAC Tissue
Samples. The RNASeqV2 data of LAC were downloaded from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) database, which included 515 LAC tissue samples and 59 adjacent normal tissue samples. Using Limma package [15] in R, the DEGs with the adjusted value < 0.05 and |log 2 FC| ≥ 1 were also screened. Meanwhile, the methylation data of LAC were also downloaded from TCGA database, which included 459 LAC tissue samples and 32 adjacent normal tissue samples. Moreover, LAC-associated methylation sites were also identified and mapped to the DEGs using the same methods as the above.

Data Preprocessing and DEGs
Screening. After preprocessing, total 18626 beta values of DNA methylation data and 18626 gene expression values of mRNA expression profiles separately were obtained. Compared with adjacent normal tissue samples, there were a total of 913 DEGs (including 409 upregulated and 504 downregulated genes) in LAC tissue samples. In the heat map of the DEGs, LAC tissues could be definitely separated from adjacent normal tissues by the DEGs (Figure 1).

GO, KEGG, and DO Enrichment Analysis.
Using Target-Mine online tool, functions of the DEGs were predicted by GO, KEGG, and DO enrichment analyses. For the upregulated genes, the enriched GO functions included multicellular organismal catabolic process ( value = 8.59 − 07) in BP category, as well as extracellular region ( value = 2.67 − 04) and extracellular space ( value = 0.002024) in CC category (Table 1(a)). The enriched KEGG pathways for upregulated genes included protein digestion and absorption ( value = 0.001244) and cell cycle ( value = 0.026338, which involved cell division cycle 20, CDC20; cyclin B2, CCNB2; and mitotic arrest deficient 2-like 1, MAD2L1) (  (Figure 2(a)). Particularly, MAD2L1 (degree = 36), AURKB (degree = 38), CCNB2 (degree = 40), and CDC20 (degree = 42) had higher degrees, and they can interact with each other in the PPI network. Several modules were screened from the PPI network for upregulated genes, and the largest module (module 1) is also showed in Figure 2(b). GO functional enrichment analysis showed that the genes in module 1 were involved in mitosisassociated terms. The PPI network for downregulated genes had 233 nodes and 368 interactions ( Figure 3). Connectivity degrees of the nodes in the PPI network were calculated, and cGMPdependent protein kinase II (PRKG2, degree = 11), VEcadherin (CDH5, degree = 12), WNT3A (degree = 15),

Screening of LAC-Associated Methylation Sites.
After screening, total 8856 methylation sites were obtained. Then, the associations between the screened methylation sites and LAC were analyzed under the threshold of value < 0.05. As a result, 230 LAC-associated methylation sites were identified.

Correlation Analysis of Methylation Sites and DEGs.
There were 29 LAC-associated methylation sites located in 27 DEGs (e.g., Src homology 3 domain GRB2-like 2, SH3GL2; brain-specific angiogenesis inhibitor 3, BAI3; Hcadherin, CDH13; junctional adhesion molecule 2, JAM2; metallothionein 1A, MT1A; LIM-homeobox containing 6, LHX6; and insulin-like growth factor binding protein-3, IGFBP3) ( Table 2). And all of the methylation sites were within 2 kb from transcriptional start sites (TSSs) of the DEGs. For the 29 methylation sites, their methylation indexes in LAC tissue samples were compared with that in adjacent normal tissue samples. The methylation indexes of 19 methylation sites were significantly increased, and the downstream genes mediated by those 19 methylation sites were downregulated. Nevertheless, 1 methylation site had significantly decreased methylation index, and its downstream genes were upregulated.  of GSE62950 were also detected in the methylation data downloaded from TCGA database.

Discussion
In this study, 913 DEGs including 409 upregulated and 504 downregulated genes were identified in LAC tissue samples compared with adjacent normal tissue samples. Total 230 LAC-associated methylation sites were identified, among which 29 LAC-associated methylation sites were located in 27 DEGs. Afterwards, the RNASeqV2 data and methylation data of LAC were downloaded from TCGA database to validate the obtained methylation sites and DEGs. There were 691 common DEGs (such as WNT3A, MAD2L1, AURKB, CCNB2, CDC20, SH3GL2, BAI3, CDH13, JAM2, MT1A, and IGFBP3) between the 913 DEGs identified in the microarray data of GSE62950 and the 4893 DEGs identified in the LAC sample data downloaded from TCGA database. In addition, the 29 LAC-associated methylation sites identified in the microarray data of GSE62950 were also detected in the methylation data downloaded from TCGA database. Functional enrichment indicated that WNT3A was involved in single-multicellular organism process and cell periphery. Overexpression of Wnt5a, which belongs to Wnt family that encodes signaling glycoproteins, promotes invasion of NSCLC during tumor progression [29,30]. Via activating JNK pathway, Wnt-7a and Fzd-9 signaling plays role in inducing the receptor tyrosine kinase inhibitor Sprouty-4 and cadherin proteins and is essential for maintaining epithelial differentiation and inhibiting transformed cell growth in some NSCLC patients [31]. In the PPI network for downregulated genes, WNT3A had higher degrees. These suggested that WNT3A might play a role in LAC. Besides, CDC20, CCNB2, and MAD2L1 were enriched in the pathway of cell cycle. Meanwhile, MAD2L1 and AURKB were involved in DO terms of type cancer, germ cell cancer, embryoma, and embryonal cancer. Results of immunohistochemistry suggest that CDC20 can be a negative marker in prognosis of patients with resected NSCLC, especially adenocarcinoma [32]. Overexpressed CDK5RAP3 and CCNB2, as well as suppressed RAGE, may be promising biomarkers in lung adenocarcinoma [33]. The 5-year overall survival rates of LAC patients with low CCNB2 mRNA levels were significantly higher than that with high levels, and overexpressed CCNB2 mRNA can independently predict a poor prognosis in patients with LAC [34,35]. Immunohistochemical analysis indicates AURKB, which mediates chromosome segregation during mitosis, is frequently overexpressed in primary lung carcinomas [36,37]. Immunohistochemistry shows that overexpression of cell division cycle associated 8 (CDCA8) and AURKB can result in bad outcome of lung cancer patients; thus, suppression of the CDCA8-AURKB pathway is a potential therapeutic strategy for lung cancer [38]. Semiquantitative RT-PCR shows that mitotic arrest defective protein 2 (MAD2) is overexpressed in a high percentage of lung cancers, and multivariate analysis suggests that high-level MAD2 can be a prognostic marker independently [39]. In the PPI network for upregulated genes, MAD2L1, AURKB, CCNB2, and CDC20 had higher degrees, and they can interact with each other. Therefore, MAD2L1, AURKB, CCNB2, and CDC20 might be implicated in LAC by interacting with each other.
Additionally, LAC-associated methylation sites were identified and mapped to the DEGs. And there were 29 LAC-associated methylation sites located in 27 DEGs (e.g., SH3GL2, BAI3, CDH13, JAM2, MT1A, LHX6, and IGFBP3). Loss of SH3GL2 is frequently detected in NSCLC and SH3GL2 can mediate cellular growth and invasion through interacting with EGFR [40]. CDX2, VIL1, and BAI3 levels have significant differences in SCLC and large-cell neuroendocrine lung carcinoma (LCNEC); therefore, they can be diagnostic markers of these tumor types [41]. Tumor suppressor gene CDH13, located on chromosome 16q24.2-3, is downregulated in lung cancer and its aberrant methylation may be a potential marker for cancer detection [42][43][44]. Via mediating 1 integrin subunit and ERK activation in human dermal lymphatic endothelial cells (HDLEC), junctional adhesion molecule-C (JAM-C) contributes to lymphangiogenesis and nodal metastasis, suggesting that JAM-C may be a target for treating lymphatic metastases in NSCLC [45]. Overexpression of metallothionein (MT) can be used as an independent predictor of short-term survival in SCLC patients enduring chemotherapy [46,47]. Previous study indicates that LHX6 is a candidate tumor suppressor gene that has epigenetic silencing in patients with lung cancer [48]. In NSCLC, methylation status of IGFBP-3 before cisplatin therapy seems to be a biomarker of prognosis, helping to select appropriate therapeutic method for patients [49,50]. These declared that SH3GL2, BAI3, CDH13, JAM2, MT1A, LHX6, and IGFBP3 might relate to LAC.