Construction and Validation of a Novel Prognostic Signature for Intestinal Type of Gastric Cancer

Background Intestinal type of gastric cancer (IGC) is the largest subtype of gastric cancer (GC) by Lauren classification. The purpose of the present study was to construct a prognostic signature for IGC patients, based on high-grade dysplasia (HGD) and IGC tissues, to improve prognostic accuracy.
Methods The microarray datasets and associated clinical characteristics of HGD and IGC were obtained from the Gene Expression Omnibus (GEO) database. Based on the differential expression analysis between HGD and IGC, the prognostic-related differentially expressed genes (DEGs) were identified in a training set by univariate Cox regression analysis. The least absolute shrinkage and selection operator (LASSO) regression was used to construct an optimal prognostic signature. The enrichment analysis was performed by using Gene Set Enrichment Analysis (GSEA). The performance of the nomogram was assessed by the calibration curve and concordance index (C-index). The results were validated by using a testing set.
Results We identified 35 prognostic-related DEGs in the training set. A nine-gene signature was established by LASSO analysis. The nine-gene signature was an independent risk factor in both the training and testing sets. The area under the curve (AUC) values of receiver operating characteristic (ROC) analysis were 0.733 and 0.700 for the training and testing sets, respectively. In GSEA, gene expression in the high-risk group was enriched in hedgehog signaling, epithelial-mesenchymal transition, and angiogenesis. The nomogram for IGC showed good performance, with a C-index of 0.81 (95% CI: 0.76-0.86) and 0.70 (95% CI: 0.63-0.77) in the training and testing sets, respectively.
Conclusion We identified and verified a nine-gene signature for the prognostic prediction of IGC patients, which might help identify subgroups of IGC patients and select more suitable therapeutic options.


Introduction
Gastric cancer (GC) is the fifth most common cancer and the third leading cause of cancer-related deaths worldwide, with 27,510 new cases and 11,410 deaths since 2019 [1]. Accurate prognostic models are needed to identify the subset of patients at high risk of death so that they can receive timely treatment. Previous studies have established several types of prognostic signatures for GC patients, including a six-gene signature, a five-gene signature, a 24-long noncoding RNA (lncRNA) signature, and a 14-lncRNA signature [2][3][4][5].
Based on Lauren classification, GC can be divided into intestinal type (IGC), diffuse type (DGC), and mixed type [6]. The tumorigenesis of IGC primarily results from environmental factors, such as Helicobacter pylori (H. pylori) infection, and is mostly associated with geriatric patients [7,8]. DGC is more commonly observed in younger individuals and has a worse prognosis [9]. The carcinogenesis of IGC is a complicated multistep process that progresses through chronic atrophic gastritis (CAG), intestinal metaplasia (IM), low-grade dysplasia (LGD), and high-grade dysplasia (HGD) and eventually to carcinoma [10]. The carcinogenic pathways of DGC are mostly attributed to genomic aberrations and are less associated with environmental factors and the chronic inflammatory cascade [11,12]. Furthermore, by screening gene expression profiles, Jinawath et al. indicated that IGC and DGC arise through different mechanisms of gastric carcinogenesis [13]. Therefore, there is a great need to construct a novel prognostic signature specifically for IGC patients.
HGD is an obvious premalignant lesion, requiring aggressive treatments such as endoscopic interventions [14]. A meta-analysis illustrated that the progression rate from HGD to GC was 16 times higher than that from LGD to GC [15]. In the present study, we conducted differential gene expression analysis of multiple gene expression profiles between HGD and IGC to determine the potential mechanisms in this progression. Subsequently, we identified prognostic-related differentially expressed genes (DEGs) between HGD and IGC. The prognostic model was constructed based on the prognostic-related DEGs and was validated for IGC patients in both the training and testing sets. Finally, we integrated the prognostic signature and clinical factors to establish a clinical nomogram and assessed its accuracy in predicting the survival rates of IGC patients.

Microarray Datasets.
All microarray data were downloaded from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/). We identified and downloaded three microarray datasets (GSE55696, GSE87666, and GSE130823) that included HGD and IGC samples [4,16,17]. Moreover, six independent microarray datasets of IGC patients who underwent gastrectomy were included in the current study: GSE26901, GSE26899, GSE66229, GSE26253, GSE29272, and GSE13861 [18,19]. The detailed information of each dataset is listed in Table 1. The clinical information of IGC patients was collected from the corresponding literature. We randomly divided the IGC patients into training and testing sets of equal size for validation.

Data Processing.
The workflow of the current study is shown in Figure 1. The raw CEL format files or gene expression matrices were normalized by using the normalizeBetweenArrays function of the limma package in R (https://bioconductor.org/biocLite.R). To reduce noise and batch effects in the microarray gene expression data, batch normalization was performed by using the sva and limma packages in R. Differentially expressed genes (DEGs) between HGD and IGC were selected by using |log2 fold change (FC)| > 0.58 and false discovery rate (FDR) < 0.05 in R.
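The selection thresholds can be illustrated with a minimal Python sketch. The study itself used R; the function names and toy values below are hypothetical, and the FDR is assumed here to be the Benjamini-Hochberg adjusted P value, which the text does not state explicitly. A cutoff of |log2 FC| > 0.58 corresponds to roughly a 1.5-fold change.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log2_fc, p_values, fc_cutoff=0.58, fdr_cutoff=0.05):
    """Keep genes with |log2 FC| > fc_cutoff and FDR < fdr_cutoff."""
    fdr = benjamini_hochberg(p_values)
    return [
        (gene, "up" if fc > 0 else "down")
        for gene, fc, q in zip(genes, log2_fc, fdr)
        if abs(fc) > fc_cutoff and q < fdr_cutoff
    ]


# Toy input (hypothetical values, not taken from the study).
genes = ["G1", "G2", "G3", "G4"]
log2_fc = [1.20, -0.90, 0.30, 0.70]
p_values = [0.001, 0.010, 0.020, 0.400]
degs = select_degs(genes, log2_fc, p_values)
```

In this toy input, G3 fails the fold-change cutoff and G4 fails the FDR cutoff, so only G1 and G2 are retained. The "up" and "down" labels simply follow the sign of the log2 fold change.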

Enrichment Analysis of DEGs between HGD and IGC.
Gene Ontology (GO) term and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were performed using the stringi and ggplot2 packages in R. GO terms are organized into three domains: biological process (BP), cellular component (CC), and molecular function (MF). FDR < 0.05 was considered statistically significant. The top 10 terms of each domain for GO analysis and the top 30 terms for KEGG analysis were presented.

Protein-Protein Interaction (PPI) Network Analysis of DEGs.
The detailed description of PPI network construction has been introduced in our previously published article [20]. The PPI network was analyzed using the STRING database (http://string-db.org/) [21]. The PPI pairs with an interaction score > 0.7 were considered significant. The PPI network was constructed by using Cytoscape 3.6.1. Moreover, subclusters in the PPI network were identified by using the Molecular Complex Detection (MCODE) plug-in of Cytoscape [22]. The selection criteria for the subclusters were as follows: MCODE score ≥ 6, degree cutoff = 2, node score cutoff = 0.2, and k-score = 2. The hub genes in the PPI network were selected by calculating the degree with the cytoHubba plug-in of Cytoscape [23].
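The degree-based hub selection described above amounts to counting, for each gene, the interactions that pass the score cutoff. The following is a minimal Python sketch of that idea, not the cytoHubba implementation; the function name is hypothetical and the interaction scores are invented for illustration (only the gene symbols come from the text).

```python
from collections import Counter


def hub_genes(edges, min_score=0.7, top_n=10):
    """Rank genes by degree using only edges whose score exceeds min_score."""
    degree = Counter()
    for gene_a, gene_b, score in edges:
        if score > min_score:
            degree[gene_a] += 1
            degree[gene_b] += 1
    # Sort by descending degree, then by name for a stable order.
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical edge list: (gene, gene, interaction score).
edges = [
    ("PTPRC", "LCK", 0.95),
    ("PTPRC", "CD3G", 0.90),
    ("PTPRC", "ITGAM", 0.85),
    ("LCK", "CD3G", 0.92),
    ("IL6", "TNF", 0.99),
    ("IL6", "CCR5", 0.60),  # below the 0.7 cutoff, so it is ignored
]
top = hub_genes(edges, top_n=3)
```

Ties in degree are broken alphabetically here purely to make the output deterministic; the source does not say how ties were handled.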

Identification of Prognostic-Related Genes.
To evaluate the prognostic values of the DEGs between HGD and IGC, a univariate Cox regression analysis was performed in the training set by using the survival package in R. Subsequently, the least absolute shrinkage and selection operator (LASSO) regression was used to build an optimal prognostic signature for IGC patients by using the glmnet package in R. The prognostic risk score for overall survival (OS) was calculated as the sum of the expression values of the signature genes, each weighted by its regression coefficient in the multivariate Cox regression analysis. Receiver operating characteristic (ROC) curves were used to evaluate the accuracy of the prognostic value in IGC by using the survivalROC package in R.
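The risk score described here is a weighted sum of expression values. A minimal Python sketch follows, with hypothetical gene names, coefficients, and expression values, since the actual coefficients are not given in this section. The median-based grouping mirrors the split used in the survival analysis; assigning scores equal to the median to the low-risk group is an assumption.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Sum of expression values weighted by their regression coefficients."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def split_by_median(scores):
    """Label a patient 'high' if the score is above the median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients and expression values (not from the study).
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
patients = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "P2": {"GENE_A": 6.0, "GENE_B": 2.0},
    "P3": {"GENE_A": 4.0, "GENE_B": 8.0},
    "P4": {"GENE_A": 8.0, "GENE_B": 4.0},
}
scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
groups = split_by_median(scores)
```

A positive coefficient raises the score as expression increases, and a negative coefficient lowers it, so the sign of each coefficient indicates whether a gene behaves as a risk or protective factor in the model.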

Identification of DEGs between HGD and IGC.
We integrated multiple microarray datasets, containing 43 HGD and 41 IGC tissues. After differential expression analysis, 637 DEGs were identified between HGD and IGC, including 602 upregulated genes and 35 downregulated genes.

Functional and Pathway Enrichment Analysis.
To better understand the potential functions of DEGs between HGD and IGC, GO and KEGG analyses were performed. The results in the BP category were mainly enriched in immune-associated terms, such as T cell activation, regulation of lymphocyte activation, and leukocyte cell-cell adhesion (Figure 2(a)). The significantly enriched CC terms included the external side of the plasma membrane, receptor complex, and endocytic vesicle (Figure 2(a)). Furthermore, cytokine receptor binding, cytokine activity, and cytokine binding were primarily enriched in the MF category (Figure 2(a)). The KEGG enrichment analysis revealed that the DEGs were primarily enriched in cytokine-cytokine receptor interaction, chemokine signaling pathway, and cell adhesion molecules (Figure 2(b)).

Construction of PPI Network and Subclusters.
The PPI network was visualized by using Cytoscape (Figure 2(c)). According to the MCODE plug-in, three modules were identified in the PPI network (Figures 2(d)-2(f)). The KEGG enrichment analysis showed that genes of the module in Figure 2(d) were mainly enriched in the chemokine signaling pathway, cytokine-cytokine receptor interaction, and Toll-like receptor signaling pathway. Cell adhesion molecules, human T-cell leukemia virus type 1 (HTLV-I) infection, and T cell receptor signaling pathway were significantly enriched in the module presented in Figure 2(e). In the module shown in Figure 2(f), the analysis revealed significant enrichment of malaria, transcriptional misregulation in cancer, and hematopoietic cell lineage pathways. After calculating the degree of each gene in the PPI network by cytoHubba, the top 10 hub genes of the PPI network were PTPRC, IL6, LCK, ITGAM, TNF, CCR7, GNG2, CCR5, CXCR4, and CD3G.

Characteristics of IGC Patients.
A total of 503 IGC patients were enrolled in the current study, including 126 (25.05%) females and 377 (74.95%) males. The detailed information of IGC patients is presented in Table 2. The IGC patients with stage I, II, III, and IV disease accounted for 17.30%, 29.03%, 37.17%, and 16.50%, respectively. The survival data of 7 IGC patients could not be acquired. After removing these 7 patients, the training set and the testing set each contained 248 IGC patients. The clinical characteristics were not significantly different between the training and testing sets (Table 3).
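The split arithmetic can be checked with a short Python sketch: 503 enrolled patients minus 7 without survival data leaves 496, which divides into two sets of 248. The shuffling procedure, the seed, and the patient identifiers below are assumptions, as the text does not describe how the random assignment was implemented.

```python
import random


def split_equally(patient_ids, seed=0):
    """Shuffle the IDs reproducibly and cut the list in half."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seed value is an arbitrary choice
    half = len(ids) // 2
    return ids[:half], ids[half:]


total_enrolled = 503
missing_survival = 7
# Hypothetical identifiers for the patients with survival data.
eligible = [f"IGC_{i:03d}" for i in range(total_enrolled - missing_survival)]
training, testing = split_equally(eligible)
```

Fixing the seed makes the partition reproducible, which matters when the training and testing sets are later compared for differences in clinical characteristics.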

Assessment of the Prognostic Values of DEGs and Construction of a Prognostic Signature for IGC Patients.
To screen for genes related to prognosis, univariate Cox regression analysis was performed, and a total of 35 DEGs were identified as prognostic-related genes. The forest plot presents the hazard ratio and P value of each prognostic-related gene (Figure 3(a)). Subsequently, a total of 9 genes were selected by LASSO regression analysis to construct the prognostic signature. After dividing the patients into high-risk and low-risk groups based on the median risk score, the Kaplan-Meier survival analysis for the training set showed that the IGC patients with high-risk scores had a significantly lower OS rate than those with low-risk scores (P = 4.85 × 10^-7, Figure 4(a)). To assess the accuracy of the risk model, ROC analysis was performed for risk score, sex, age, and stage; the areas under the ROC curves (AUC) were 0.733, 0.583, 0.642, and 0.707 for the training set, respectively (Figure 4(b)). In addition, the distribution of risk scores, survival status, and expression values of the 9 DEGs was presented [...] Figure 4(f).
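The Kaplan-Meier curves compared between risk groups are built from a simple product-limit calculation. The sketch below is a minimal pure-Python version with hypothetical follow-up times; it is not the R code used in the study, and it does not compute the log-rank P value reported above.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if death was observed at times[i] and 0 if censored.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical follow-up in months for two small risk groups.
high_risk = kaplan_meier([5, 8, 12, 20], [1, 1, 0, 1])
low_risk = kaplan_meier([10, 25, 30, 40], [0, 1, 0, 0])
```

Censored patients contribute to the number at risk until their last follow-up but never trigger a drop in the curve, which is why the low-risk toy group has only one step.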

Validation of Prognostic Model for IGC Patients.
To validate the prognostic value of the risk scoring model, Kaplan-Meier survival analysis was performed in the testing set, and the result was consistent with that in the training set (P = 1.45 × 10^-3, Figure 5(a)). ROC curve analysis showed that the AUC of risk score, sex, age, and stage in the testing set were 0.700, 0.573, 0.601, and 0.567, respectively (Figure 5(b)). In the testing set, the patients in the low-risk group had a significantly better OS than those in the high-risk group, which was consistent with the training set (Figures 5(c)-5(e)).
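The AUC values above can be read as the probability that a randomly chosen patient who died has a higher score than a randomly chosen survivor. The following Python sketch computes that quantity for a binary outcome; it is a simplification that ignores censoring, whereas the study used the time-dependent survivalROC package. The scores and outcomes are hypothetical.

```python
def auc(scores, labels):
    """Probability that a positive case outranks a negative one (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores and survival status (1 = died, 0 = alive).
risk = [0.2, 0.9, 0.4, 0.7, 0.1, 0.6]
died = [0, 1, 1, 1, 0, 0]
risk_auc = auc(risk, died)
```

An AUC of 0.5 corresponds to a predictor no better than chance, which gives context for the values near 0.57 reported for sex and stage in the testing set.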

Exploration of Enriched Pathways between High-Risk and Low-Risk Cohorts.
In order to further elucidate the potential mechanisms, GSEA with hallmark gene sets was conducted between high-risk and low-risk cohorts (Table 4). The gene expression in the high-risk group was enriched in myogenesis, hedgehog signaling, epithelial-mesenchymal transition (EMT), ultraviolet (UV) response down (DN), angiogenesis, and apical junction (Figure 6). Furthermore, the low-risk group was enriched in oxidative phosphorylation, interferon gamma response, and interferon alpha response (Figure 6).

Construction of Nomogram for IGC Patients.
A nomogram for OS was constructed by age, sex, adjuvant chemotherapy, risk score, and stage (Figure 7(a)).
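The abstract reports the nomogram's performance as a concordance index (C-index). A minimal Python sketch of Harrell's C-index is given below to show what that number measures; it assumes a single risk value per patient (for a nomogram this would be the total points), uses hypothetical data, and is not the R implementation used in the study.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died before patient j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical survival times (months), event indicators, and risk values.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
risk = [2.5, 1.0, 1.8, 0.7, 0.2]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 indicates random ordering and 1.0 indicates perfect ordering, so the reported values of 0.81 and 0.70 fall between these extremes.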

Discussion
It has been reported that IGC constitutes the largest proportion of GC by Lauren classification [26]. The tumorigenesis of IGC is a complex process, and HGD is a key precancerous lesion with specific pathologic characteristics. Therefore, HGD and IGC were a reasonable pair to select for further analysis. In the present study, we identified 637 DEGs by comparing 43 HGD tissues with 41 IGC tissues. The pathway and GO term enrichment analyses may suggest potential mechanisms of tumorigenesis from HGD to IGC. We found that immune-related and inflammation-related pathways, such as the T cell activation pathway, were significantly enriched. Accumulating evidence shows that chronic inflammation may induce progression from CAG to IM, which may increase the likelihood of GC [27]. Moreover, the maintenance of IM with chronic inflammation and the progression of spasmolytic polypeptide-expressing mucosa (SPEM) to IM driven by proinflammatory signals can predispose an individual to dysplasia [28]. Blocking T cell activation during H. pylori infection may inhibit and reverse established preneoplastic lesions [29]. Consistent with these reports, our results suggest that T cell activation may play an important role in gastric carcinogenesis.
The KEGG analysis identified several potentially significant signaling pathways in the progression from HGD to IGC. We found that the Th1 and Th2 cell differentiation pathway was significantly enriched in the progression from HGD to IGC. Similarly, Ren et al. found that the significant differences between gastritis with and without cancer and dysplasia indicated a shift from a Th1 to a Th2 helper cell pattern of cytokine secretion [30]. In addition, we also [...]
Through Cox and LASSO regression analysis, we constructed a predictive risk model for IGC patients based on nine prognostic-related DEGs. After dividing the IGC patients into high-risk and low-risk groups, Kaplan-Meier analysis and ROC curves indicated that the model performed well in both the training and validation cohorts. To the best of our knowledge, this is the first nomogram constructed for predicting the OS of IGC patients who underwent gastrectomy. Moreover, the calibration curves and C-indexes showed good concordance. Zhang et al. reported that the five-gene signature for GC patients achieved a higher C-index for OS than the six-gene and 24-lncRNA signatures [4]. Most importantly, our nine-gene signature for IGC patients showed a higher C-index than the five-gene signature for GC.
Figure 5: Validation of the nine-gene signature for IGC patients in the testing cohort. (a) Kaplan-Meier analysis showed that IGC patients with high-risk scores had a shorter OS than those with low-risk scores (P = 1.445 × 10^-3). (b) ROC analysis showed that the AUC of risk score, sex, age, and stage in the testing group were 0.7, 0.573, 0.601, and 0.567, respectively. (c-e) An overview of the survival status, the distributions of the risk score for each patient, and heatmaps for the nine-gene signature in the testing group. (f) Multivariate Cox analysis for IGC patients in the testing cohort also identified the nine-gene signature as an independent risk factor.
Disease Markers
To further explore the potential molecular mechanisms distinguishing the high-risk and low-risk groups defined by the risk score, GSEA with the Hallmark pathway database was performed. The results illustrated that the high-risk cohort was significantly enriched in hedgehog signaling, EMT, angiogenesis, and apical junction pathways. The hedgehog signaling pathway plays a critical role in gastric development, homeostasis, and tumorigenesis [27,32]. Activation of the EMT pathway could induce gastric epithelial cells to turn into mesenchymal cells, causing tumor metastasis by attenuating cell-cell adhesion and altering cell polarity [33]. Angiogenesis is a hallmark of solid tumor development and also an important prerequisite for tumor growth and metastasis [34]. Overall, the IGC patients with high-risk scores were characterized by high invasiveness and fast growth. Therefore, antiangiogenic therapies, such as Bevacizumab and Ramucirumab, may be more suitable for high-risk IGC patients based on our risk score.
The strength of our study is that, to our knowledge, it presents a novel gene signature for evaluating prognosis in IGC patients. However, there are still several limitations in the present study. First, we only incorporated the microarray datasets that included the clinical characteristics and survival data for IGC patients. Second, data regarding H. pylori infection status were not available.

Conclusion
In the present study, we identified and verified a novel nine-gene signature for the prognostic prediction of IGC patients, which might identify subgroups of IGC patients with different risk scores. The nomogram could accurately predict the prognosis for IGC patients. The nine-gene signature may help to select more suitable therapeutic options for different subgroups of IGC patients.

Conflicts of Interest
The authors have declared that no competing interest exists.