Identification of 5 Gene Signatures in Survival Prediction for Patients with Lung Squamous Cell Carcinoma Based on Integrated Multiomics Data Analysis

Background Lung squamous cell carcinoma (LSCC) is a frequently diagnosed cancer worldwide, and it has a poor prognosis. The current study is aimed at developing the prediction of LSCC prognosis by integrating multiomics data including transcriptome, copy number variation data, and mutation data analysis, so as to predict patients' survival and discover new therapeutic targets. Methods RNASeq, SNP, CNV data, and LSCC patients' clinical follow-up information were downloaded from The Cancer Genome Atlas (TCGA), and the samples were randomly divided into two groups, namely, the training set and the validation set. In the training set, the genes related to prognosis and those with different copy numbers or with different SNPs were integrated to extract features using random forests, and finally, robust biomarkers were screened. In addition, a gene-related prognostic model was established and further verified in the test set and GEO validation set. Results We obtained a total of 804 prognostic-related genes and 535 copy amplification genes, 621 copy deletions genes, and 388 significantly mutated genes in genomic variants; noticeably, these genomic variant genes were found closely related to tumor development. A total of 51 candidate genes were obtained by integrating genomic variants and prognostic genes, and 5 characteristic genes (HIST1H2BH, SERPIND1, COL22A1, LCE3C, and ADAMTS17) were screened through random forest feature selection; we found that many of those genes had been reported to be related to LSCC progression. Cox regression analysis was performed to establish 5-gene signature that could serve as an independent prognostic factor for LSCC patients and can stratify risk samples in training set, test set, and external validation set (p < 0.01), and the 5-year survival areas under the curve (AUC) of both training set and validation set were > 0.67. Conclusion In the current study, 5 gene signatures were constructed as novel prognostic markers to predict the survival of LSCC patients. The present findings provide new diagnostic and prognostic biomarkers and therapeutic targets for LSCC treatment.


Introduction
The incidence and mortality of lung cancer have been increasing annually all over the world in the past few decades [1], allowing lung cancer to become a leading cause of male cancer death and the second most frequent cause of female cancer death right behind breast cancer [2]. Lung squamous cell carcinoma (LSCC) is the second most common pathological type of lung cancer, second to lung adenocarcinoma. LSCC accounts for 40%-50% of all lung cancer cases, and its early symptoms are not obvious and atypical; thus, most patients are already at a middle or late stage by the time of diagnosis [3]. In recent years, advances in scientific research and clinical practice and achievements have been made in understanding the mechanism of the occurrence and development of lung cancer; moreover, predictive screening indicators and targeted drug therapy have also been improved. However, scientific research and achievements concerning lung adenocarcinoma and LSCC are limited; therefore, regarding such a research gap, the study of LSCC is highly urgent and necessary.
Multiomics data, such as cancer genome mapping (TCGA) and therapies applied to research (TARGET) projects, have the potential to generate effective treatments, as they can be used effectively to predict disease progression [4]. Liu et al. predicted the prognosis of high-risk diseases by integrating the data of copy number variations (CNVs) and single-nucleotide polymorphisms (SNPs) [5]. SNPs account for 1% of the human genomes and exist in coding or noncoding regions that affect exon splicing or transcription [6]. SNPs have been considered predictive markers of complex diseases [7] and have been found to be associated with many common diseases, including type II diabetes [8], Crohn's disease [9], schizophrenia [10], and breast cancer [11]. However, the functional significance and gene/variant alleles of novel disease-related SNPs studied by genomewide association studies (GWAS) or next-generation sequencing (NGS) data remained a challenge to be investigated. CNVs are defined as sequence variants, which ranged from 50 bps to a few megabits (Mb) in size, and have deletions, duplicates, triplets, insertions, complex genome rearrangements (CGR), and other CNVs [12]. CNVs have more than 10 times differences in heritable sequences compared with single nucleotide variants (SNVs) in the general population [13], and its genome-wide map has been comprehensively studied [14].
Moreover, in some other studies, more than one gene is identified; for example, a prognostic risk model constructed by 4 abnormally methylated genes (VAX1, CH25H, AdCyAP1, and Irx1) has been found to be able to predict the survival rate of LSCC patients [17]. A previous study established 4-gene expression signature clustering models with 14 genes collected from cluster patient samples and indicated that the signature could effectively help predict the prognosis of LSCC patients and improve treatment strategies [18]; however, this also poses difficulties and challenges in clinical applications, as selecting the suitable signature is not an easy work. Therefore, it is essential to define and validate an effective genetic signature model for predicting LSCC prognosis.
Gene expression profile, single-nucleotide mutation, and CNVs of patients with LSCC were obtained from large datasets in TCGA and GEO databases. Prognostic markers were screened by integrating genomics and transcriptome data to establish a 5-gene signature, and its ability to predict survival was further verified by an internal test set and an external validation set. We found that the 5-gene signature is involved in important biological processes and pathways of LSCC, and similar results were also shown by GSEA analysis, suggesting that the 5-gene signature could effectively predict the prognostic risk of patients with LSCC. Thus, the signature established in the current study could provide a basis for better understanding of the molecular mechanism of LSCC prognosis.

Univariate Cox Proportional Hazard Regression Analysis.
Guo et al. [20] previously performed univariate Cox proportional risk regression analysis with the training dataset for each gene to screen genes significantly related to overall survival (OS) of patients; p < 0:05 was also defined as the threshold in the present study.
2.3. Analysis of CNV Data. GISTIC was widely used to detect both broad and focal (potentially overlapping) recurring events; GISTIC 2.0 [21] software was used to identify genes with significant amplification or deletion according to the parameter thresholds of amplified or absent fragments > 0:1 and p < 0:05.

Genetic Mutation
Analysis. In order to identify the genes with significant mutations, Mutsig 2.0 software was used to identify the genes with significant mutations in the maf file of TCGA mutation data, with a threshold value of p < 0:05.
2.5. Construction of Prognostic Gene Signature. Genes significantly associated with patient OS and those with amplification, deletion, and mutation were selected and subjected to random survival forest algorithm to rank genes that showed prognostic values [22]. As previously described by Meng et al. [23], R package random survival forest was used for screening, with the Monte Carlo iteration number set as 100 and the previous progress number set as 5; moreover, a gene with relative importance greater than 0.27 was identified as a characteristic gene. Additionally, multivariate Cox regression analysis was carried out, and the following risk scoring model was constructed: where n is the number of prognostic genes, Exp k is the expression value of the prognostic genes, and e HR k is the estimated regression coefficiency of genes in the multivariate Cox regression analysis.
2.6. Functional Enrichment Analyses. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was performed using the R package clusterprofiler [24] to identify overrepresented GO terms in three categories (biological processes, molecular function, and cellular component) and KEGG pathway. For the analysis, a FDR < 0 :05 was considered statistically significant.
GSEA [25] was performed by JAVA program (http:// software.broadinstitute.org/gsea/downloads.jsp) using MSigDB [26] C2 Canonical pathway gene set collection, which contains 1320 gene sets. After performing 1000 permutations, gene sets with a p value lower than 0.05 were considered to be significantly enriched.

Statistical
Analysis. The Kaplan-Meier (KM) curve was plotted by using the median risk score in each dataset as a cutoff to compare the risk of survival between the highrisk group and the low-risk group. Multivariate Cox regression analysis was conducted to examine whether gene markers were independent prognostic factors. Statistical significance was defined as p < 0:05. The AUC analysis was performed using the R package pROC, and the heat map was drawn using the R package pheatmap. All analyses applied default parameters except for special instructions, which are performed in R software version 3.4.3.

Analysis of Multiomics Data to Identify Genes Associated with Overall Survival of Patients with LSCC.
For the samples of the TCGA training set, univariate Cox regression analysis was performed to establish a relationship between OS of patients and gene expression. 804 prognostically significant genes were identified, and information of the 20 genes with the highest significance is shown in Table 2.

Gene Set for the Identification of Genomic Variation.
For CNV data in TCGA, GISTIC 2.0 was used to identify genes with significant amplification or deletion, with parameter thresholds of amplification or deletion > 0:1 and p < 0:05. Figure 1(a) shows significantly amplified fragments of the LSCC, and a total of 535 genes were amplified. Among them, EGFR was significantly amplified at the 7p11.2 segment (q = 1:33E − 16); CD72 was significantly amplified on the 9p13.3 segment (q value = 1.38E-07); CDK3 was significantly amplified on the 17q25.1 segment (q value = 0.0092281). Figure 1(b) shows the segments of the LSCC genome with significant deletion, and a total of 621 genes were deleted. A significant loss of CDKN2A on the 9p21.3 segment (q value = 7.83E-116) and a significant loss of FOXP1 on the 3p13 segment (q value = 6.47E-21) were observed; moreover, RB1 was absent in the 13q14.2 segment (q value = 0.0012441).
For the TCGA mutation annotation data, Mutsig2 used to identify genes with significant mutations screened a total of 388 genes with significant mutation frequencies. The distribution of synonymous mutations, missense mutations, frame-     insertion or deletion, frame-shifting, nonsense mutations, shear sites, and other nonsynonymous mutations in TCGA patients showed the most significant p values (Figure 1(c)), and CDKN2A, PTEN, TP53, RB1, PIK3CA, and some other genes were found closely related to the occurrence and development of LSCC.

Functional Analysis of CNV Genes and Mutated
Genes. In order to analyze the function of genomic mutant genes, a total of 1385 amplified and deleted genes and significantly mutated genes identified were integrated. GO biological process and KEGG functional enrichment analysis were performed on the 2261 genes. The results of KEGG enrichment analysis revealed that the 1385 genes were significantly enriched in the mTOR signaling pathway, cell apoptosis, autophagy, EGFR tyrosine kinase inhibitor resistance, non-small cell lung cancer, B cell signaling pathway, and other KEGG biological pathways related to the development of cancer (Figure 2(a)).
In the category of the biological process, the 1385 genes were mainly enriched in epidermal development, epidermal cell differentiation, keratinocyte differentiation, and other GO terms (Figure 2(b)). Noticeably, these terms are also closely related to the occurrence and development of cancer; in other words, these genomic mutations are closely related to tumors.

Identification of a 5-Gene Signature for LSCC Survival.
First, a total of 804 candidate prognostic genes of gene variants and prognostic genes were integrated, and we finally identify 51 genes with amplification, deletion, and mutation as candidate genes. Furthermore, a random survival forest algorithm is used for feature selection, and the relationship between the error rate and the number of classification trees is shown in Figure 3(a). Genes with relative importance of > 0.27 served as the final signature, and finally, 5 genes were obtained (Table 3). These genes play important roles in the regulation of tumor-related pathways and biological processes; however, their expression levels did not always show a high AUC for the prediction of tumor prognosis ( Figure S1). The importance of out-of-bag of these 5 genes was ranked and is shown in Figure 3(b). A 5-gene signature was established using multivariate Cox regression analysis, and the model is as follows: The risk score of each sample was calculated, and the samples were grouped according to the mid-value of the risk score (cutoff = -0:05950035). A significant difference in prognosis, which is a carcinogenic signature, was identified between the high-risk group and the low-risk group (Figure 3(c)). The 3-year AUC of the 5-gene signature in the training set was 0.76 (Figure 3(d)). The relationship between the expressions of the 5 genes and risk score was observed; specifically, high expressions of SERPIND1, COL22A1, and LCE3C were found correlated with a high risk, and these genes are therefore considered risk factors, while highly expressed HIST1H2BH and ADAMTS17 were correlated with a low risk and could be regarded as protective factors.

Verification of the Robustness of the 5-Gene Signature
Model. In order to determine the robustness of the 5-gene signature model, the risk score of each sample was calculated in the test set, and the samples were divided into two groups according to the threshold of the training set, with significant prognostic differences observed between the two groups (Figure 4(a)). ROC analysis showed that the 5-year AUC reached 0.68 (Figure 4(b)). Furthermore, the analysis of the relationship between the expressions of the 5 genes and risk score revealed that SERPIND1, COL22A1, and LCE3C were associated with a high risk and were seen as risk factors, while HIST1H2BH and ADAMTS17 were indicative of low risk; thus, the two could serve as protective factors. This is also consistent with the training set results (Figure 4(c)). In conclusion, the model showed highly effective prognosis classification results in the TCGA dataset.
In order to verify the classification performance of the 5gene signature model in data from different data platforms, GEO platform data GSE42127 and GSE37745 were taken as the external dataset. The signature model was used to calculate the risk score of each sample, and the cutoff of the training set was used to divide the samples into the high-risk and low risk groups. The results demonstrated that the prognosis of the low-risk group was significantly better than that of the high-risk group (Figure 5(a)). Moreover, ROC analysis showed that the 3-year AUC reached 0.79 ( Figure 5(b)). The relationship between the expressions of 5 genes and the risk score was also consistent with the training set ( Figure 5(c)). In addition, the similar results were observed in the GSE37745 dataset, 5-year AUC was found to be 0.74, and the OS between the two groups showed a significant difference ( Figure 6). In conclusion, our 5-gene signature model    In order to allow those models to be more compara-ble, we calculated the risk score of each sample in the TCGA using the same method based on the corresponding genes in the 4 models. The ROC of each model was examined, and the samples were divided into the high-risk and low-risk groups based on the median risk score, and the OS prognosis difference between the two groups of samples was calculated. The KM curve of OS showed that the prediction performance of the four models was less accurate than our 5-gene signature model (Figures 7(a)-7(d)). To further compare the predictive performance of these models on TCGA samples, the "rms" package in R was used to calculate the restricted mean survival curves of the 4 models and our model, and the results demonstrated that our model has the highest C-index among the total 5 models investigated (Figure 7(e)); noticeably, our 5-gene model also showed more advantages in long-term survival prediction. Furthermore, we compared the 5-gene signature and the prediction results of the 4 models by the DCA curve, and results showed that the performance of the 5-genes signature model established in the current study was higher than those of the other four (Figure 7(f)).

GSEA Analysis on Enriched Pathway Differences between the High-Risk Group and the Low-Risk Group.
In the TCGA training set, GSEA used to identify the significantly enriched pathways in the high-risk group and the low-risk group screened a total of 41 significantly enriched pathways (Table 5). Among them, KEGG cell adhesion molecules (cams), KEGG ECM receptor interaction, KEGG JAK STAT signaling pathway, and KEGG focal adhesions were all significantly related to the occurrence, development, and metastasis of LSCC (Figure 8).

Discussion
Lung cancer is a leading cause of cancer deaths, and the incidence of the cancer is increasing worldwide. In all lung cancer cases, LSCC causes an annual death of at least   14 BioMed Research International 400,000. Similar to many other cancers, lung cancer patients are usually at advanced stages by the time of diagnosis, suggesting that there is nearly no available treatment for the patients. However, early diagnosis and surgical resection can significantly improve the survival rate of lung cancer patients. The development of molecular biomarkers play an important role in personalized medicine and current precision medicine [27]; therefore, there is an urgent need for a classifier for predicting the prognosis of LSCC patients with poor prognosis and designing customized therapies. In our current study, transcriptome, copy number variation, and mutation data were mined from TCGA to search for obtaining novel prognostic markers for LSCC. Interestingly, we found that the gene signature constructed from 5 differentially expressed genes demonstrated great prediction performance for LSCC. TCGA data with large-scale genome analysis allowed it to be possible to examine the molecular characteristics associated with LSCC results [28]. In 2015, Huang et al. analyzed gene and miRNA expressions, DNA methylation, and CNV data of 129 LSCC specimens in TCGA, and they established a genome-wide integration network by using variance expansion factor regression and isolated lung cancer subnetwork by the Bayesian method [29]. LSCC patients with a 4-gene expression signature among 14 dif-ferentially expressed feature genes were at a high risk of developing a poor prognosis. Gao et al. also reported that 12 of the 133 abnormally expressed miRNAs were correlated with OS in the TCGA LSCC cohort [30]. The AUC of our 5-gene signature was close to 0.7 in the training set, test set, and verification set. All these genes had abnormal genome mutations, which allows an easy clinical detection. In a word, our 5-gene signature had high AUC and involved fewer genes; thus, it was conducive to clinical transformation.

BioMed Research International
We were also interested in investigating the prospective molecular mechanisms of these 5 genes. Therefore, GSEA analysis was conducted to explore related gene enrichment pathways. SERPIND1, COL22A1, and LCE3C are risk factors, while ADAMTS17 is a protective factor in gene signature. Hereinto, SERPIND1 acts as a potential oncogene in the development of tumor, including in lung cancer [31,32]. In head and neck cancer, highly expressed COL22A1 mRNA is statistically correlated with reduced disease-free survival and is significantly associated with lymph node metastasis [33]. HIST1H2BH is associated with the prognosis of cervical cancer patients [34]. Knocking down Adamts17 expression induces the apoptosis of breast cancer cells and inhibits cancer cell growth [35]. However, LCE3C has not been shown to be associated with tumor. In this study, for the first time, LCE3C was found to be a new prognostic marker for lung adenocarcinoma. Meanwhile, our GSEA analysis results showed that the pathway enriched by the 5-gene signature was significantly correlated with the pathway and biological process of the occurrence and development of LSCC. These results indicated that our 5-gene model has a potential clinical application value and could provide a potential target for the clinical diagnosis of LSCC patients. It should be noted that though potential candidate genes for tumor prognosis were identified in large samples through bioinformatics techniques, some limitations still exist in the present research. Firstly, the sample lacked certain clinical follow-up information; thus, we did not

Conclusion
In conclusion, in this study, a 5-gene signature prognostic stratification system has been developed, and the model demonstrated great AUC in both the training set and validation set and was independent of clinical features. Compared with clinical features, gene classifier can improve the prediction of

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors stated they have no conflicts of interest.