Identification of a 5-Gene-Based Scoring System by WGCNA and LASSO to Predict Prognosis for Rectal Cancer Patients

Background Although accumulating evidence suggested that a molecular signature panel may be more effective for the prognosis prediction than routine clinical characteristics, current studies mainly focused on colorectal or colon cancers. No reports specifically focused on the signature panel for rectal cancers (RC). Our present study was aimed at developing a novel prognostic signature panel for RC. Methods Sequencing (or microarray) data and clinicopathological details of patients with RC were retrieved from The Cancer Genome Atlas (TCGA-READ) or the Gene Expression Omnibus (GSE123390, GSE56699) database. A weighted gene coexpression network was used to identify RC-related modules. The least absolute shrinkage and selection operator analysis was performed to screen the prognostic signature panel. The prognostic performance of the risk score was evaluated by survival curve analyses. Functions of prognostic genes were predicted based on the interaction proteins and the correlation with tumor-infiltrating immune cells. The Human Protein Atlas (HPA) tool was utilized to validate the protein expression levels. Results A total of 247 differentially expressed genes (DEGs) were commonly identified using TCGA and GSE123390 datasets. Brown and yellow modules (including 77 DEGs) were identified to be preserved for RC. Five DEGs (ASB2, GPR15, PRPH, RNASE7, and TCL1A) in these two modules constituted the optimal prognosis signature panel. Kaplan-Meier curve analysis showed that patients in the high-risk group had a poorer prognosis than those in the low-risk group. Receiver operating characteristic (ROC) curve analysis demonstrated that this risk score had high predictive accuracy for unfavorable prognosis, with the area under the ROC curve of 0.915 and 0.827 for TCGA and GSE56699 datasets, respectively. This five-mRNA classifier was an independent prognostic factor. Its predictive accuracy was also higher than all clinical factor models. A prognostic nomogram was developed by integrating the risk score and clinical factors, which showed the highest prognostic power. ASB2, PRPH, and GPR15/TCL1A were predicted to function by interacting with CASQ2/PDK4/EPHA67, PTN, and CXCL12, respectively. TCL1A and GPR15 influenced the infiltration levels of B cells and dendritic cells, while the expression of PRPH was positively associated with the abundance of macrophages. HPA analysis supported the downregulation of PRPH, RNASE7, CASQ2, EPHA6, and PDK4 in RC compared with normal controls. Conclusion Our immune-related signature panel may be a promising prognostic indicator for RC.


Introduction
Previously, colon cancer (CC) and rectal cancer (RC) are considered a single tumor entity (called colorectal cancer (CRC)) [1]. However, recent studies indicate that there are significant differences in epidemiology, pathology, molecular mechanisms, and the response to treatments [1]. The risk of developing RC is estimated to be four times higher than that of CC, and RC patients have a lower 5-year survival rate than CC patients (60% versus 72%) due to a poor response to current treatment options [1][2][3]. Thus, it is of particular importance to explore approaches to early separate RC patients with a high death risk and then provide improved specialized care to further reduce the overall mortality rate.
With the developments in sequencing technology and bioinformatics, recent scholars suggest that a molecular signature panel may be more effective for the prognosis prediction than routine clinical characteristics (such as the tumornode-metastasis (TNM) stage) [4,5]. Li et al. identified a four-mRNA signature panel as an independent prognostic factor for CRC. This four-mRNA signature panel can effectively predict the prognosis of CRC patients, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.730. The stratified analysis indicated that the patients belonging to the same T stage (T3+T4), N stage (N1+N2), or TNM stage (III+IV) can also be stratified into the high-risk and low-risk groups using this 4-gene signature panel [6]. The study of Zuo et al. revealed that a six-mRNA signature panel had a significant prognostic value to discriminate the high-risk patients from the low-risk patients, with an AUC of 0.711 and 0.683 for 3-year and 5-year survival, respectively. This 6-gene signature panel was an independent factor of OS after adjustment for clinical factors and can predict different survival outcomes in patients with the early (or advanced) TNM stage [7]. Sun et al. developed a 12-gene expression signature panel to precisely predict the prognosis for CC patients, which could distinguish poor from good prognosis patients within stage II/III [8]. Chen et al. found that the signature panel consisting of 16 gene pairs formed by 24 genes had a better prognostic ability than the TNM stage (AUC: 0.724 vs. 0.703; concordance index (C-index): 0.869 vs. less than 0.8) at 5 years [9]. Although there were also studies to identify mRNAs associated with the prognosis of RC patients [10][11][12][13], no reports specifically focused on the signature panel and compared its prognostic values with clinical factors.
In the present study, we aimed to (1) screen RC-related genes by weighted gene coexpression network analysis (WGCNA) [14], (2) develop a reliable mRNA signature panel for the prediction of overall survival (OS) in patients with RC using the least absolute shrinkage and selection operator (LASSO) method [15,16], (3) validate its superior prognostic performance to various clinical features by stratified analysis and comparison of the AUC and C-index, and (4) establish a clinicopathologic-mRNA nomogram to improve the prediction accuracy clinically.

Materials and Methods
2.1. Data Access. The RNA-seq expression data (fragments mapped per kilobase of exon per million reads mapped, level 3) were retrieved from The Cancer Genome Atlas (TCGA; https://portal.gdc.cancer.gov) database using "TCGA-rectal adenocarcinoma (READ)" as the keyword. There were 177 READ cases and 10 controls in this dataset. However, only 162 READ cases provided clinical information. Thus, TCGA dataset (including 162 cases and 10 controls) served as the training dataset for our following analyses. Furthermore, GSE123390 (platform: Affymetrix Human Transcriptome Array 2.0) and GSE56699 (platform: Illumina HumanHT-12 WG-DASL V4.0 R2 expression beadchip) microarray datasets were also obtained from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) database using "rectal cancer" as the keyword. GSE123390 was used for the WGCNA model validation because it investigated the mRNA expression profile in human RC tissues (N = 28) and controls (N = 5). GSE56699 was used for the survival model validation because it provided the prognosis information for 61 of 72 RC cases. The study flowchart is illustrated in Figure 1.

Identification of Differentially Expressed Genes (DEGs).
In TCGA-READ and GSE123390 datasets, the DEGs were screened between RC and controls using the limma package of R (version 3.34.7; https://bioconductor.org/packages/ release/bioc/html/limma.html) [17]. False discovery rate ð FDRÞ < 0:05 and |log 2 FC ðfold changeÞ | >0:5 were defined as the statistical threshold. All DEGs in these two datasets were subjected to the hierarchical clustering analysis using the pheatmap package of R (version 1.0.8; https://cran.rproject.org/web/packages/pheatmap), and the heat map generated was used to assess the heterogeneity of gene expression patterns between RC and controls. The Draw Venn Diagram online tool (http://bioinformatics.psb.ugent.be/webtools/ Venn) was utilized to identify the shared DEGs between TCGA-READ and GSE123390 datasets. The functions of common DEGs were analyzed using the gProfiler tool (http://biit.cs.ut.ee/gprofiler/gost). Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome pathways with an FDR < 0:05 were considered to be statistically significant.

WGCNA.
To screen genes associated with the development of RC, WGCNA was performed using the WGCNA package in R (version 1.61; https://cran.r-project.org/web/ packages/WGCNA/index.html) [14], by which highly correlated mRNAs could be clustered into the same coexpression modules. WGCNA included six steps: (1) calculation of the expression and connectivity correlations of mRNAs between TCGA-READ and GSE123390 datasets; (2) selection of the soft threshold power (β) according to the scale-free topology criterion; (3) calculation of the topological overlap matrix dissimilarity between genes in TCGA-READ to build the dendrogram and identification of modules (cutHeight = 0:995 and minSize ≥ 50) by the Dynamic Tree Cut method [18]; (4) assessment of the preservation (Z-score > 5 and p < 0:05) of modules in two datasets using the modulePreservation statistics [19]; (5) enrichment of DEGs to modules using the hypergeometric algorithm ½ f ðk, N, M, nÞ = Cðk, MÞ * Cðn − k, N − MÞ/Cðn, NÞ [20]; and (6) association of coexpression modules with clinical information.
2.4. Protein-Protein Interaction (PPI) Network. The PPI pairs among DEGs in crucial modules were identified using the Search Tool for the Retrieval of Interacting Genes (STRING; version 11.0; https://string-db.org) database [21]. Only 2 Analytical Cellular Pathology interactions with a combined score > 0:4 were selected to construct the PPI network using Cytoscape software (version 3.4; http://www.cytoscape.org/) [22]. The functions of genes in the PPI network were analyzed using the KEGG, GO, and Reactome implements in the STRING. Significant GO terms, KEGG pathways, and Reactome pathways were chosen using an FDR < 0:05 as the statistical threshold.
2.5. Development of the Prognostic Signature Panel. The univariate Cox regression analysis was used to screen OSassociated genes from the DEGs in the preserved modules. The DEGs with a log-rank p < 0:05 in the univariate analysis were entered into the multivariate Cox regression model to identify independent predictors. An L1-penalized (LASSO) Cox proportional hazard model in the penalized package (version 0.9-5; http://bioconductor.org/packages/penalized/) [15,16] was further applied on these independent prognostic DEGs to select the optimal subset of signature panels. The risk score model was established based on the expression levels of prognostic DEGs (Exp DEGs ) and their LASSO coefficients (β DEGs ): Risk score = β mRNA1 × Exp mRNA1 +⋯β mRNAn × Exp mRNAn : The patients were divided into the high-risk group and the low-risk group by using the median risk score as the cutoff. The OS differences between the high-risk group and the low-risk group were compared according to the Kaplan-Meier survival curve analysis and log-rank test. The predictive accuracy of the risk score was estimated through the AUC calculated from the ROC curve. These analyses were first carried out for TCGA-READ dataset and then validated in the GSE56699 dataset.
Moreover, univariate and multivariate Cox analyses were applied using TCGA-READ cohort to evaluate whether the risk score was independent of other clinical variables for the prognosis prediction. Kaplan-Meier survival curve analysis was used to identify whether the risk score was also an effective tool for stratification of patients with the same clinical characteristics. A nomogram that incorporated the risk score and clinical prognostic factors was developed for the prediction of 3-year and 5-year OS rates. The predictive power of the nomogram was assessed in terms of the AUC and the C-index (which was calculated using the survcomp package, http://www.bioconductor.org/packages/release/ bioc/html/survcomp.html). Kaplan-Meier curve to assess the survival difference between high-risk and low-risk groups; ROC curve, C-index calculation to assess the predictive accuracy; Univariate, multivariate and stratification analyses to confirm the prognostic independence GSE56699 PPI network by STRING Function enrichment by DAVID Immune association by TIMER Protein expression by HPA 3 Analytical Cellular Pathology TCGA-PRAD (B cells, CD4+ T cells, CD8+ T cells, neutrophils, macrophages, and dendritic cells) were estimated from the Tumor Immune Estimation Resource (TIMER; https:// cistrome.shinyapps.io/timer/) based on Spearman's correlation test. p < 0:05 adjusted by the tumor purity was considered significant.

Verification of Protein Expressions of Prognostic Genes.
The immunohistochemical results for the prognosticrelated DEGs and their interaction genes in rectal and normal tissues were downloaded from the Human Protein Atlas (HPA) to confirm their protein expression levels.  Table S1).

Identification of RC-Related Modules by WGCNA.
The correlation analysis showed that there were positive correlations in the expression level (cor = 0:46, p < 1e − 200) and the connectivity (cor = 0:23, p < 8:1e − 100) of RNAs between the training dataset TCGA and the validation dataset GSE123390. Using TCGA dataset, the soft threshold power 10 was chosen to create networks with a scale-free topology (R 2 reached 0.9 for the first time; the mean connectivity was equal to 1). The dendrogram displayed that the genes were clustered into 9 modules using TCGA dataset (Figure 3(a)). These modules were also validated in the analysis of the GSE123390 dataset ( Figure 3(b)). Among them, blue, brown, grey, red, turquoise, and yellow modules were considered to be preserved because of their Z-score > 5 and p value < 0.05 (Table 1). Brown and yellow modules were significantly enriched by DEGs (enrichment fold > 1 and p value < 0.05), suggesting they may be particularly crucial for the development of RC (Table 1). The genes in the brown and yellow modules were also found to be significantly associated with the pathologic M, pathologic N, pathologic T, pathologic stage, survival time, and death of RC patients (Figure 3(c)).

Construction of a PPI Network.
The 77 DEGs in the brown and yellow modules were uploaded to the STRING database to obtain their interaction relationships. As a result, 281 interaction pairs between 69 DEGs were identified (such as peripherin-(PRPH-) PTN, ankyrin repeat and SOCS box containing 2-(ASB2-) CASQ2/PDK4/EPHA6, and G protein-coupled receptor 15 (GPR15)/TCL1A-CXCL12), which were used to construct the PPI network ( Figure S1). Function enrichment analysis obtained 30 GO biological process terms (such as regulation of heart contraction: CASQ2; regulation of ion transport: CASQ2 and CXCL12; regulation of tissue remodeling: PDK4; negative regulation of leukocyte tethering or rolling: CXCL12; and negative regulation of dendritic cell apoptotic process: CXCL12), 4 KEGG pathways (such as metabolism of xenobiotics by cytochrome P450 and chemical carcinogenesis), and 4 Reactome pathways (such as muscle contraction: CASQ2) (Table S3).
Univariate and multivariate Cox regression analyses were also performed using the risk score and other clinical factors to confirm the independence of our five-mRNA signature 4 Analytical Cellular Pathology The results displayed that after multivariable adjustments for clinicopathological factors, the risk score remained significantly associated with patients' OS (Table 3), suggesting this five-mRNA-based classifier was an independent prognostic factor. In addition, the risk score was shown to have the ability to further classify the population with age   (Figure 6(a)). Therefore, the risk score should be integrated with the clinical factors to better predict the prognosis clinically, based on which a prognostic nomogram was developed ( Figure 6(b)). As expected, the AUC (0.976) and the C-index (0.913) of the nomogram were higher than those of any clinical factor model and the risk score model (Figure 6(a)).

Discussion
In the present study, we established a risk score model based on five prognostic DEGs (ASB2, GPR15, PRPH, RNASE7, and TCL1A). ROC curve analysis indicated that this 5-mRNA signature panel can accurately predict the prognosis for patients with RC, with the AUC of 0.915 and 0.827 for the training and validation datasets, respectively. The prognostic performance of our risk score seemed to be better than that of previously reported signature panels developed for CRC (such as 4 genes: AUC = 0:722 for the external dataset and 0.607 for the internal dataset [4]; 6 genes: AUC = 0:683 [7]; and 9 genes: AUC = 0:741 [23]) or CC (16 gene pairs: AUC = 0:724 [9]; 9 genes: AUC = 0:676 [22]). In addition, in the study of Zuo et al. [7], they performed a subgroup analysis to confirm whether the 6-gene signature panel was effective for colon adenocarcinoma (COAD) and READ. As a result, the AUC was, respectively, 0.653 and 0.74 for COAD  7 Analytical Cellular Pathology and READ, which were both lower than that of our signature panel. Moreover, in line with other signature panels identified for CRC or CC patients [6][7][8][9]24], our risk score was demonstrated to be an independent factor for the prognosis prediction and stratify the survival of patients with the same TNM stage. Also, the AUC and the C-index of the risk score were higher than those of age (AUC To better guide prognostication in clinical practice, some authors suggest that molecular prognostic models and clinicopathological models should be combined [24][25][26][27], which showed the highest predictive power compared with anyone. In agreement with these studies, our results also showed that the nomogram that integrated the five-mRNA classifier and four clinical risk factors (age, pathologic M, pathologic N, and pathologic stage) had the highest AUC (0.976) and Cindex (0.913).
Although all of the 5 signature genes were not included in the previous signature panels for CRC [6-9, 24], some of  Analytical Cellular Pathology them were found to be associated with the progression and prognosis of CRC (including ASB2, GPR15, TCL1A, and PRPH) [28][29][30][31]. High ASB2 expression was shown to predict a short relapse-free survival for patients with CRC [28]. Deletion of ASB2 in hematopoietic cells inhibited the shortening of the colon and the tumor load in mice. Function analysis indicated that ASB2 may exert tumor-promoting roles by decreasing Th1, Th17, and cytotoxic CD8+ T cell response which are beneficial for protection against tumor progression [28]. Consistent with the study of Spinner et al. [28], our results also demonstrated that patients with a high level of ASB2 may have a 12.505-fold higher risk of possessing a worse OS rate than those with low expression. Also, ASB2 was predicted to interact with the chemotaxis-associated EPHA6 gene which could be hypermethylated to result in downregulated EPHA6 expression by anti-inflammatory interleukin-6 [32,33], a Th17 cell biomarker [34]. The knockdown of EPHA6 decreased prostate cancer cell invasion in vitro and reduced lung and lymph node metastasis in vivo [35]. High EPHA6 expression was associated with a lower OS rate in patients with breast cancer [36] and our RC (HR = 1:11). In addition to EPHA6, we also predicted that ASB2 may be involved in RC by influencing the expressions of CASQ2 and PDK4. Highly expressed CASQ2 [37] and PDK4 [38] were reported to be significantly correlated with poor OS and disease-free survival in cancer patients.
Overexpression of PDK4 promoted cell proliferation, invasion, and tumor growth in vivo [38]. These prognosis conclusions of CASQ2 (HR = 1:12) and PDK4 (HR = 1:45) were also confirmed in our study on RC. TCL1A is crucial for cancer development by expressing in a subpopulation of immune B cells (CD3-/CD19+/CD10+/CD34-). A high TCL1A/CD20 (B cell) ratio or TCL1A expression was shown to correlate with improved survival [39,40]. In agreement with other cancers [39,40], our study also showed that TCL1A was a protective risk for the survival of RC patients (HR = 0:530) and positively associated with the abundance of B cells. Furthermore, we predicted TCL1A may interact with CXCL12, a chemokine gene that was speculated to be involved in the regulation of the dendritic cell apoptotic process in our function enrichment analysis. High CXCL12 expression was reported to confer a survival advantage for breast cancer patients [41] and stage III CC [42]. Silencing of CXCL12 by transforming growth factor-β in mesenchymal stromal cells of the primary tumor site promoted the tumor metastasis by increasing the expression of CXCR7, a CXCL12 receptor [43]. A meta-analysis showed that immunotherapy with DCs significantly improved OS at 6 months, 1 year, 3 years, and 5 years of patients with hepatocellular carcinoma [44]. Coculture of DCs significantly inhibited liver cancer stem cell growth in vitro and in vivo [45]. Consistent with these findings, we also found that the expression of TCL1A was significantly positively correlated with the infiltration of DCs. Using TCGA and the genotype-tissue expression data, Wang and Wang found that GPR15 was significantly lowly expressed in COAD and READ compared with normal tissues [30], which was validated in our study. GPR15 expression was significantly positively correlated with the prognosis of patients with COAD (that is, the high expression had a longer OS) [30], which was also observed in our study of READ. Wang and Wang believed that GPR15 may be a tumor suppressor by regulating a serial of genes enriched in immune systems and increasing the infiltration of B cells (in neck squamous carcinoma, lung adenocarcinoma, and stomach adenocarcinoma), CD4+ T cells, and DCs (in neck squamous carcinoma and stomach adenocarcinoma) [30]. Similarly, we found that the expression of GPR15 was positively associated with the levels of tumorinfiltrating immune B cells and DCs and predicted that GPR15 could interact with DC-related CXCL12 to participate in RC progression. CD133+ human umbilical hematopoietic progenitor cells were revealed to promote the proliferation and invasion of CRC cells in vitro and enhance tumor growth and metastasis in vivo by upregulating PRPH  11 Analytical Cellular Pathology [31]. However, the mechanisms of PRPH in CRC remained unclear. In this study, we speculated that PRPH may function by interacting with downstream PTN. Tumor-associated macrophages increased the proportion of cancer stem cells in lymphoma by secreting PTN [46]. Upregulated PTN promoted tumor cell proliferation and inhibited apoptosis and
Meta-analysis showed that high expression of PTN was significantly associated with an advanced TNM stage and a poor OS in tumor patients [48]. Similar to these studies, we also reported that PRPH was positively associated with tumorassociated macrophages. Although RNASE7, encoding an antimicrobial peptide, was not demonstrated to be associated with CRC, the roles of itself and its family members in other cancers may indirectly verify our conclusions. Scola et al. reported that the expression of RNASE7 was gradually reduced during the malignant transformation process, showing the highest expression in healthy skin and the lowest expression in oral squamous cell carcinoma [49]. The low expression of RNase family members contributed to the loss of immune defense against bacterial infections [50][51][52][53], which is an important 13 Analytical Cellular Pathology cause for the initiation of cancer. The knockdown of RNase L increased prostate cancer cell migration [54] and enhanced tumor growth and metastasis following implantation in the mouse prostate [55], the mechanism of which was related with an increased cell surface expression of integrin β1 and activation of the focal adhesion kinase-sarcoma pathway and the Ras-related C3 botulinum toxin substrate 1guanosine triphosphatase activity [54]. Colorectal tumors with lower levels of RNase H2 exhibited a significantly shorter survival time [56]. In line with these studies, we also  14 Analytical Cellular Pathology demonstrated that RNASE7 was downregulated in RC and patients with a higher level of RNASE7 had a longer OS compared with controls. Some limitations should be acknowledged in this study. First, the prognostic signature panel was developed and validated based on the survival information retrospectively collected from the public datasets (TCGA and GSE56699). Prospective trials needed to be performed in our hospital to further verify the prognostic value of this signature panel. Second, the expressions of these signature DEGs were also identified using public TCGA and GSE123390 datasets. In these datasets, the number of samples in the normal group was quite smaller than that in the cancer group. This imbalance may cause a statistical problem. Consistent sample size in the RC and control groups should be designed to further confirm their expressions. Third, clinical (PCR, immunohistochemistry, and Pearson's correlation), in vitro (coimmunoprecipitation, knockdown, overexpression, or coincubation of immune cells), and in vivo (tumor transplantation, mimics, siRNA transfection, and immunotherapy) experiments should be conducted to explore the PPI relationships between our signature genes (PRPH-PTN, ASB2-CASQ2/PDK4/EPHA67, and GPR15/TCL1A-CXCL12) and assess the functions of our signature genes in the progression of RC (especially RNASE7, which was not reported in CRC previously). Fourth, other grouped variable selection methods (such as Elastic net and CoxBoost) [57] for identification of prognostic signature panels should be used individually or jointly with LASSO to identify more effective prognostic indicators for RC. Fifth, the expression, prognostic power, and functions of signature genes should be compared between the CC and RC samples.

Conclusion
Our study developed a five-mRNA signature panel (ASB2, GPR15, PRPH, RNASE7, and TCL1A) as an immunerelated prognostic biomarker for RC. This signature panel exhibited excellent accuracy to stratify the patients with a higher death risk. The nomogram that combined the risk score and clinical features (age, pathologic M, pathologic N, and pathologic stage) may be more effective in guiding the clinical decision-making of personalized treatment.