A Prognostic 14-Gene Expression Signature for Lung Adenocarcinoma: A Study Based on TCGA Data Mining

Background. Lung adenocarcinoma (LUAD), a major and often fatal subtype of lung cancer, causes substantial mortality and shows heterogeneous prognostic outcomes. This study aimed to identify key genes and to develop a prognostic signature to guide the treatment of patients with LUAD. Method. RNA expression profiles and clinical data from 522 LUAD patients were downloaded from The Cancer Genome Atlas (TCGA) database. Differentially expressed genes (DEGs) between normal tissues and LUAD samples were extracted and analyzed. A 14-DEG signature for survival prediction in LUAD patients was then developed by univariate and multivariate Cox regression analyses. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed to predict the potential biological functions and pathways of these DEGs. Results. Twenty-two of the 5924 DEGs in the TCGA dataset were associated with the overall survival (OS) of LUAD patients. Fourteen DEGs were finally selected and included in our development and validation model by risk score analysis. ROC analysis indicated that the specificity and sensitivity of this signature were high. Functional enrichment analyses indicated that these DEGs might regulate genes that affect the release of sequestered calcium ion into the cytosol and pathways associated with Vibrio cholerae infection. Conclusion. Our study developed a novel 14-DEG signature that provides more efficient and informative prognostic information beyond conventional clinicopathological factors for survival prediction in LUAD patients.


Introduction
Lung cancer remains the leading cause of cancer-related mortality worldwide [1]. Non-small-cell lung cancer (NSCLC) is the most common type and is mainly subdivided into adenocarcinoma (LUAD), squamous cell carcinoma (LUSC), and large cell carcinoma (LCC) [2,3]. Over the past decades, LUAD has become the predominant form, accounting for approximately 40% of all lung cancers [4]. LUAD is characterized by distinct epidemiological, clinicopathological, and molecular properties [5]. Despite improvements in diagnosis and therapy over the past 30 years, biomarkers for early detection, for predicting populations with high relapse and mortality rates, and for identifying targeted or immunological therapies for lung cancer patients remain unsatisfactory. Thus, identifying effective prognostic biomarkers is critical for the diagnosis and treatment of LUAD patients.
Differentially expressed genes (DEGs), which are regulated at the level of gene transcription, are implicated in diverse biological processes. Gene-expression profiling has made some progress in predicting overall survival (OS) in NSCLC [1,6,7]. Mascaux et al. showed that immune activation and immune escape in the tumor microenvironment (TME) occur before lung cancer invasion [7]. Given the importance of DEGs in cancer research, their roles as biomarkers and as drivers of tumor oncogenesis and suppression have been identified. However, there are no definitive and effective biomarkers for predicting the 5-year survival rate of LUAD patients, which makes clinical prognosis difficult. Therefore, investigating DEGs may lead to noninvasive biomarkers for LUAD.
Although several gene or long noncoding RNA expression signatures, including programmed death-ligand 1 (PD-L1), have recently been proposed for predicting OS in NSCLC [6,8-10], the prognostic value of these gene-profile biomarkers remains limited. Identifying DEG signatures related to patient OS in standard clinical samples may promote the development of molecular subtyping and potential therapeutic targets. LUAD and LUSC differ in epidemiology, molecular characteristics, and prognosis [5]. Although several prognostic DEG signatures have been discovered for NSCLC [11,12], few studies have identified the prognostic value of DEG biomarkers for LUAD patients in a large cohort. Therefore, we focused on a DEG signature of LUAD that has not been previously published.
In this study, we identified a 14-DEG signature as a predictor of survival risk in LUAD patients using a cohort of 522 cases from The Cancer Genome Atlas (TCGA) database. The signature, which distinguishes patients with good and poor survival, was developed by Cox regression analysis and a survival-associated risk score model. To assess its reliability, the specificity and sensitivity of the model were examined by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The AUC confirmed good sensitivity and specificity of the prognostic model, while multivariate Cox regression analysis and stratified analysis indicated that the predictive capacity of the 14-DEG signature was independent. In addition, functional enrichment analysis suggested that the 14 DEGs may be involved in the progression of LUAD through the release of sequestered calcium ion into the cytosol and pathways associated with Vibrio cholerae infection. Our findings may therefore provide insight into the predictive capacity of DEG signatures in LUAD.

Materials and Methods
2.1. The LUAD Patient Dataset. The RNA-Seq dataset of patients with LUAD, including clinical features, was downloaded from the TCGA database (https://cancergenome.nih.gov/). Patients were included only if they had complete RNA expression profiles and clinical information (age, gender, TNM stage, survival status, and survival time).

2.2. Differentially Expressed Gene Screening in LUAD. Raw gene-level counts were used in our analysis. All data processing and normalization were performed using Perl and R version 4.0.0. The gene expression profiling data of the 522 LUAD samples and 59 normal samples were downloaded from the TCGA database. DEGs between the normal and LUAD groups were identified with the "edgeR" package from Bioconductor in R [13]. |log2FC| > 2 and adjusted p value < 0.01 were set as the thresholds for screening DEGs.
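The screening thresholds can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which used edgeR in R; it assumes the log2 fold changes and adjusted p values have already been computed, and the `GeneResult` record, the `screen_degs` helper, and the gene names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    """One row of a differential expression table (hypothetical layout)."""
    gene: str
    log2_fc: float  # log2 fold change, LUAD versus normal
    adj_p: float    # multiple-testing adjusted p value


def screen_degs(results, lfc_cutoff=2.0, p_cutoff=0.01):
    """Keep genes with |log2FC| > lfc_cutoff and adjusted p < p_cutoff.

    Returns two lists of gene names: upregulated and downregulated.
    """
    up, down = [], []
    for r in results:
        if abs(r.log2_fc) > lfc_cutoff and r.adj_p < p_cutoff:
            (up if r.log2_fc > 0 else down).append(r.gene)
    return up, down


# Toy input with made-up values, for illustration only.
toy = [
    GeneResult("GENE_A", 3.1, 1e-5),   # passes, upregulated
    GeneResult("GENE_B", -2.4, 0.004), # passes, downregulated
    GeneResult("GENE_C", 2.0, 1e-8),   # fails: |log2FC| is not strictly > 2
    GeneResult("GENE_D", 4.0, 0.02),   # fails: adjusted p is not < 0.01
]
up, down = screen_degs(toy)
print(up, down)
```

Both inequalities are strict here, matching the thresholds as stated in the text.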

2.3. Cox Regression Analysis. The RNA-seq expression values were log2-transformed to normalize the data. Univariate Cox regression analysis using the "survival" R package was performed to assess the association between DEG expression and patient survival. DEGs with p value < 1.0e-06 in the univariate analysis were considered candidate DEGs associated with OS.
A stepwise multivariate Cox regression analysis was then performed to identify the predictive model with the best explanatory and informative efficacy and to determine the independent predictive capacity of the resulting 14-DEG signature for LUAD patients.
2.4. Risk Score and ROC Curve. A risk score was calculated for each patient based on the multivariate Cox regression analysis as $\mathrm{RiskScore} = \sum_{i=1}^{N} \mathrm{Exp}(i) \cdot \mathrm{coe}(i)$, where $N$ is the number of prognostic DEGs, $\mathrm{Exp}(i)$ is the expression value of DEG $i$, and $\mathrm{coe}(i)$ is its regression coefficient in the multivariate Cox model. Patients were classified into high-risk and low-risk groups according to the median risk score. A Kaplan-Meier overall survival curve of the different stages was plotted, and the hazard ratio was calculated. The log-rank test was then used to assess survival differences between the high-risk and low-risk groups. The sensitivity and specificity of the DEG prognostic model for predicting clinical outcome were evaluated by calculating the area under the curve (AUC) of the receiver operating characteristic (ROC) curve with the "survivalROC" R package [14].
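The risk score and the median split can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's R code: the coefficients and expression values are made up, the helper names `risk_score` and `split_by_median` are hypothetical, and assigning a score equal to the median to the low-risk group is an assumption the text does not specify.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Sum of Exp(i) * coe(i) over the genes in the signature."""
    return sum(coefficients[g] * expression[g] for g in coefficients)


def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the median score.

    Assumption: a score exactly equal to the median is labeled 'low'.
    """
    cutoff = median(scores.values())
    groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
    return cutoff, groups


# Made-up Cox coefficients and log2 expression values, for illustration only.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
patients = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "P2": {"GENE_A": 4.0, "GENE_B": 2.0},
    "P3": {"GENE_A": 6.0, "GENE_B": 0.0},
    "P4": {"GENE_A": 1.0, "GENE_B": 2.0},
}
scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
cutoff, groups = split_by_median(scores)
print(scores, cutoff, groups)
```

A positive coefficient raises the score as expression increases, so genes with positive coefficients push a patient toward the high-risk group.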

2.5. Differential Analysis of Scores and DEG Expression with Clinicopathological Stages. The clinicopathological data corresponding to the LUAD samples were downloaded from TCGA. The independence of the RiskScore from clinical parameters such as age, gender, and tumor stage was assessed, with the Kruskal-Wallis rank sum test or the log-rank test used as the significance test. In addition, the differential expression of the DEGs between distinct clinicopathological stages was analyzed and plotted.
2.6. Function Enrichment Analysis. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were carried out for the DEGs with the clusterProfiler R package. Only terms with p value < 0.05 were considered significantly enriched.

Results
Of the 522 LUAD patients, 47.89% were 65 years old or younger and 52.11% were older than 65 (Table 1). Females accounted for 53.64% and males for 46.36% of these patients. Of the 522 patients, 280 were classified as stage I, 130 as stage II, 86 as stage III, and 26 as stage IV. The survival time of the 522 LUAD patients was 902.51 ± 892.15 days.

3.2. Differentially Expressed Genes in LUAD Patients. According to the defined criteria, a total of 5924 DEGs (5147 upregulated and 777 downregulated) were identified between LUAD and normal samples (Figure 2(a)). Unsupervised hierarchical cluster analysis (Figure 2(b)) showed that the LUAD samples could be clearly distinguished from the normal controls by the expression of the DEGs. All 5924 DEGs were used for survival analysis. To identify DEGs related to patient survival in LUAD, univariate Cox regression analysis was performed on all DEG expression data. With a significance threshold of 1.0E-06, a set of 22 DEGs was selected. These DEGs were entered into stepwise multivariate Cox regression analysis, which finally yielded the 14-gene signature.
3.3. The Development of the 14-Gene Prognostic Model. A risk score was calculated for each patient from the expression levels of the 14 DEGs, and the patients were divided into high-risk and low-risk groups according to the median risk score (0.89). The log-rank test was used to assess the survival differences. As depicted in Figure 3(a), Kaplan-Meier curves showed that the high-risk group was associated with poor prognosis (p = 7e-16). ROC curves indicated that the AUC of the 14-gene signature was 0.769 (Figure 3(b)), suggesting that the 14-gene signature had good specificity and sensitivity in predicting the OS of LUAD patients.
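For readers unfamiliar with how a Kaplan-Meier curve is built, the following minimal Python sketch computes the product-limit estimate for one group of patients. It is only an illustration with made-up follow-up times, not the study's analysis, which relied on R packages; the function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time of each patient (e.g. in days)
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0  # patients leaving the risk set at time t (deaths + censored)
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Toy data for illustration only: five patients, two of them censored.
toy_times = [100, 200, 200, 300, 400]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(t, round(s, 3))
```

Running the same function separately on the high-risk and low-risk groups gives the two curves that a log-rank test would then compare.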

3.4. The 14-DEG Signature Independence from Conventional Clinical Factors. Multivariate Cox regression analysis showed that the 14-DEG signature risk score had predictive ability independent of other clinical factors (p = 7e-16, shown in Figure 3(a)). We also found that TNM stage was an independent factor for predicting the OS of LUAD patients (p < 0.001) (Figure 4(a)). Therefore, stratification analysis was performed to examine whether the 14-gene signature could provide predictive value for patients within the same TNM stage. Because the number of stage IV samples was too small to draw reliable conclusions (n = 26), stratification analysis was carried out only in stage I, II, and III patients. The log-rank test for patients in stage I showed that the 14-DEG signature could distinguish patients with significantly different survival times (p = 0.00018, Figure 4(b)). Similar predictive outcomes were achieved in stage II (p = 1e-05) and stage III (p = 9e-05) patients (Figures 4(c) and 4(d)). In addition, the distinct expression of DEGs between samples of different clinicopathological stages in Figure 4(e) showed that DEG expression was positively related to clinicopathological stage. Altogether, these results indicated that the prognostic capability of the 14-DEG signature for predicting survival of LUAD patients was independent of conventional clinical factors.
In the GO analysis, release of sequestered calcium ion into cytosol (BP), endoplasmic reticulum lumen (CC), and GO:0008191~metalloendopeptidase inhibitor activity (MF) were mainly clustered, respectively (Figure 5(a)). The top 10 GO terms are shown in Table 3. The DEGs were enriched in three KEGG pathways, which mainly focused on tumor metabolism: hsa05110: Vibrio cholerae infection, hsa04141: protein processing in endoplasmic reticulum, and hsa04020: calcium signaling pathway (Table 4, Figure 5(b)).

Discussion
NSCLC is a global health threat with high morbidity and mortality, up to 0.6 and 0.1 percent, respectively [11]. LUAD accounts for more than 40% of lung cancer patients, making it predominant among NSCLC subtypes. Owing to tumor heterogeneity, conventional prognostic systems such as TNM staging are sometimes deficient in risk stratification and clinical outcome estimation. Therefore, improved prognostic approaches are urgently needed. Increasing evidence suggests that DEGs play indispensable roles in the tumorigenesis, TNM staging, and progression of lung cancer. Although several studies have identified DEGs with prognostic value in NSCLC, especially in LUSC [10,11], few have analyzed DEG expression specifically in LUAD. Moreover, because LUAD and LUSC are vastly distinct diseases at the molecular, pathological, and clinical levels, with distinct driver genetic changes and responses to chemotherapy or targeted therapy [4,5], single-gene expression models are insufficient for accurate prediction of LUAD outcomes. Therefore, we focused on the molecular prognostic DEG signature patterns in LUAD.
In this study, a 14-DEG signature related to the overall survival of LUAD patients was identified. By means of univariate Cox regression analysis and stepwise multivariate Cox regression analysis, a novel 14-gene signature (including C1QTNF6, ERO1A, MELTF, ITGB1-DT, RGS20, FETUB, NTSR1, and LINC02178, among others) was obtained. We calculated the RiskScore of each patient from the formula and the expression of the selected DEGs. The patients were divided into high- and low-risk groups by the median RiskScore (0.89), and the survival curves were then obtained from the survival data of all LUAD patients. C1QTNF6 has recently been identified as a novel biomarker associated with worse outcome in lung adenocarcinoma patients [15]. Combined expression of protein disulfide isomerase and endoplasmic reticulum oxidoreductin 1-α (ERO1A) is a poor prognostic marker for non-small-cell lung cancer [16]. The level of melanotransferrin (MELTF) in tissue and serum serves as a prognostic marker of gastric cancer, and patients with high serum MELTF levels have a poor prognosis [17]. Integrin and gene network analysis demonstrated that ITGA5 and ITGB1 are prognostic in non-small-cell lung cancer [18]. Regulator of G protein signaling 20 (RGS20) was identified as a molecular marker for LUAD because it enhances cancer cell aggregation, migration, invasion, and adhesion [19,20]. Fetuin-B (FETUB) was reported as a candidate plasma biomarker related to the severity of lung function in COPD [21]. Neurotensin (NTS) and its receptor (NTSR1) promote EGFR, HER2, and HER3 overexpression and their autocrine/paracrine activation in LUAD, and their expression is increased in 60% of lung cancer patients.
In a previous clinical study, NTSR1 overexpression predicted poor 5-year OS in a population of stage I lung adenocarcinoma patients treated by surgery alone [22]. In addition, LINC02178, LINC01312, AL353746.1, DRAXINP1, and LINC02310 were identified as prognostic markers for predicting the survival of LUAD patients by genome-scale analysis [23]. All 14 genes identified in this study were associated with high risk, indicating that their expression was positively related to risk. Moreover, MELTF, AC034223.2, and AC034223.1 were identified as related to LUAD for the first time in our study.
The carcinogenesis of LUAD is a multistep process hallmarked by a series of genetic alterations. To gain further insight into the functional roles of the 14 DEGs, the correlation between their expression levels and the coexpressed protein-coding genes was analyzed. In the present study, we performed GO and KEGG enrichment analyses to explore the functions of the predictive DEGs. The results indicated that the 14 prognostic DEGs were involved in significant functional processes, such as release of sequestered calcium ion into cytosol (BP), endoplasmic reticulum lumen (CC), and metalloendopeptidase inhibitor activity (MF), and were enriched in KEGG pathways including Vibrio cholerae infection, protein processing in endoplasmic reticulum, and calcium signaling pathway. Therefore, it is reasonable to infer that the fourteen prognostic DEGs participate in the progression of LUAD through these LUAD-related biological pathways.

Conclusions
In summary, this study identified a novel 14-DEG prognostic signature that could predict the survival risk of LUAD patients. The signature showed prognostic capacity independent of clinicopathological factors and could predict survival outcomes of LUAD patients within the same TNM stage. It could be used to identify patients with high risk scores who may need more effective and individualized therapy. It could not only serve as a novel potential biomarker for survival risk stratification of LUAD patients but also provide a better understanding of the molecular mechanisms involved in the development of LUAD. However, further molecular investigations, such as exploring the underlying mechanisms of these DEGs in LUAD development and validating the signature in independent cohorts with large sample sizes from institutions across the country or the world, are necessary to confirm the accuracy and stability of the prediction signature.

Table 4 (partial). Protein processing in endoplasmic reticulum, 0.041902, 1; hsa04020, Calcium signaling pathway, 0.049161, 1.
Oxidative Medicine and Cellular Longevity

Data Availability
The data used to support the findings of this study are included within the article.