Screening Prognosis-Related lncRNAs Based on WGCNA to Establish a New Risk Score for Predicting Prognosis in Patients with Hepatocellular Carcinoma

Background Hepatocellular carcinoma (HCC) remains an important cause of cancer death. The molecular mechanism of hepatocarcinogenesis and prognostic factors of HCC have not been completely uncovered. Methods In this study, we screened out differentially expressed lncRNAs (DE lncRNAs), miRNAs (DE miRNAs), and mRNAs (DE mRNAs) by comparing the gene expression of HCC and normal tissue in The Cancer Genome Atlas (TCGA) database. DE mRNAs were used to perform Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Then, the miRNA and lncRNA/mRNA modules that were most closely related to the survival time of patients with HCC were screened to construct a competitive endogenous RNA (ceRNA) network by weighted gene coexpression network analysis (WGCNA). Moreover, univariable Cox regression and Kaplan-Meier curve analyses of DE lncRNAs and DE mRNAs were conducted. Finally, the lasso-penalized Cox regression analysis and nomogram model were used to establish a new risk scoring system and predict the prognosis of patients with liver cancer. The expression of survival-related DE lncRNAs was verified by qRT-PCR. Results A total of 1896 DEmRNAs, 330 DElncRNAs, and 76 DEmiRNAs were identified in HCC and normal tissue samples. Then, the turquoise miRNA and turquoise lncRNA/mRNA modules that were most closely related to the survival time of patients with HCC were screened to construct a ceRNA network by WGCNA. In this ceRNA network, there were 566 lncRNA-miRNA-mRNA regulatory pairs, including 30 upregulated lncRNAs, 16 downregulated miRNAs, and 75 upregulated mRNAs. Moreover, we screened out 19 lncRNAs and 14 hub mRNAs related to prognosis from this ceRNA network by univariable Cox regression and Kaplan-Meier curve analyses. Finally, a new risk scoring system was established by selecting the optimal risk lncRNAs from the 19 prognosis-related lncRNAs through lasso-penalized Cox regression analysis. In addition, we established a nomogram model consisting of independent prognostic factors to predict the survival rate of HCC patients. Finally, the correlation between the risk score and immune cell infiltration and gene set enrichment analysis were determined. Conclusions In conclusion, the results may provide potential biomarkers or therapeutic targets for HCC and the establishment of the new risk scoring system and nomogram model provides the new perspective for predicting the prognosis of HCC.


Background
In addition to being the sixth most common cancer in the world, liver cancer is also the fourth leading cause of cancer death, with 841080 new cases and 781631 deaths in 2018 [1]. In the United States, there were 42030 new cases of liver cancer and 31780 deaths in 2019 [2]. Due to the lack of special clinical manifestations in patients with early hepatocellular carcinoma (HCC), 70%-80% of patients are in advanced stages when they experience symptoms and have missed the opportunity for radical resection [3]. Although the current treatment of HCC includes surgical resection, transplantation, chemotherapy, radiotherapy, radiofrequency ablation, targeted therapy, transcatheter arterial chemoembolization (TACE), and immunotherapy, the overall survival rate has not changed significantly and the 5-year recurrence rate after surgery is still up to 70% [4]. Therefore, it is of great significance to clarify the molecular mechanisms of the occurrence and development of HCC and to identify new molecular markers to improve its clinical efficacy.
As a kind of noncoding RNA (ncRNA) without protein coding ability, long noncoding RNAs (lncRNAs) are transcripts more than 200 nucleotides in length [5]. Thousands of lncRNAs have been found due to the development of high-throughput RNA sequencing (RNA-seq). Originally, lncRNAs were thought to be the noise of genomic transcription, a byproduct of RNA polymerase II transcription, and have no biological function [6]. In fact, lncRNAs play important regulatory roles in different cellular processes, especially in different types of tumors [7][8][9]. Increasing evidence has shown that lncRNAs play a key role in the occurrence, invasion, and distant metastasis of HCC through cell differentiation regulation, epigenetic regulation, and cell cycle regulation [7,10].
MicroRNAs (miRNAs) are a kind of noncoding RNA with a length of approximately 20-22 nucleotides. They can bind to a target protein coding gene in its 3 ′ -untranslated region (3 ′ -UTR) based on sequence complementarity, thus affecting the stability of the mRNA or interfering with protein translation [11,12]. It was found that miRNAs are abnormally expressed and dysfunctional in a variety of tumors and play an important role in the occurrence and development of tumors, including gastric cancer [13], colorectal cancer [14], ovarian cancer [15], breast cancer [16], and HCC [17].
Weighted gene coexpression network analysis (WGCNA) is a systematic biological approach that describes intergene correlation patterns through RNA sequencing or microarray data [18]. It usually constructs a scale-free gene coexpression network to explore the correlation between the gene set and clinical characteristics, and it can recognize highly correlated genes and aggregate them into the same module [19]. However, many studies have only focused on the differential expression between genes, ignoring the high correlation between genes, and WGCNA can make up for these defects. Therefore, WGCNA plays an important role in identifying potential biomarkers or new therapeutic targets [20].
In recent years, studies have shown that the mutual regulation between lncRNAs and miRNAs plays an important role in the development of tumors [21]. The competitive endogenous RNA (ceRNA) hypothesis, first proposed in 2011, suggests that lncRNAs can act as ceRNAs to bind to miRNAs, affecting the regulation of miRNAs on target mRNAs and thus regulating the expression of related target genes [22]. As an open-source sequencing database, The Cancer Genome Atlas (TCGA) platform contains clinicopathological information and corresponding bioinformatics data for more than 30 types of human cancer, which is helpful for the comprehensive analysis of the regulatory function of the lncRNA-miRNA-mRNA ceRNA network in the path-ogenesis of cancer [23]. These networks play an important role in understanding gene interactions and identifying potential biomarkers. However, the prognostic value of the lncRNA-related ceRNA regulatory network in HCC is still unclear.
In this study, we selected differentially expressed lncRNAs (DElncRNAs), miRNAs (DEmiRNAs), and mRNAs (DEmRNAs) between cancer and normal tissues by using data from the TCGA database. In addition, we also performed cluster analysis, biological function enrichment analysis, and pathway enrichment analysis on these DElncR-NAs, DEmiRNAs, and DEmRNAs. Then, we constructed a coexpression network based on these differentially expressed genes (DEGs) to determine the modules related to clinical features by using the WGCNA-based systems biology method. According to the module data, the most relevant modules for survival and prognosis were selected to construct the ceRNA network. In this ceRNA network, there were 566 lncRNA-miRNA-mRNA regulatory pairs, including 30 upregulated lncRNAs, 16 downregulated miRNAs and 75 upregulated mRNAs. Moreover, we screened out 19 lncRNAs, and 14 hub mRNAs related to prognosis from this ceRNA network by univariable Cox regression and Kaplan-Meier curve analyses. Finally, we used the 19 prognosisrelated lncRNAs to establish a new risk scoring system by lasso-penalized Cox regression analysis. The flow chart of the whole study was shown in supplement Figure S1. In addition, we established a nomogram model consisting of independent prognostic factors to predict the survival rate of HCC patients.

Data Collection and
Preprocessing. HCC gene sequencing data were downloaded from the TCGA database (https:// cancergenome.nih.gov/), and then, the original data were standardized for further analysis. First, the samples must have the detection data of lncRNAs, miRNAs, and mRNAs. Then, the samples that had other malignant tumors, no stage information (including pathological stage and TNM stage), and no age information were removed. Finally, 224 tumor samples and 27 normal samples were used for analysis. To identify DEGs between HCC samples and normal samples, such as lncRNAs, miRNAs, and mRNAs, we used a random variance model (RVM) to compare data between groups. The screening parameters of lncRNAs and mRNAs were set to p < 0:05, false discovery rate ðFDRÞ < 0:05, and fold change ðFCÞ > 1:5. The screening parameters of miRNAs were set to p < 0:05 and FC > 1:2. The data were obtained from the TCGA database and did not involve ethical issues.
in the sample, and blue represents the low expression value of the DEGs in the sample.

Functional Enrichment
Analysis. The function of the DEmRNAs was analyzed using the Gene Ontology (GO) database (http://www.geneontology.org), and the signaling pathways involved in the DEGs were analyzed using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.kegg.jp/). Fisher's exact test and the multiple comparison test were used to calculate the significance level of each function and signaling pathway (p < 0:05).

Construction of the Weighted Gene Coexpression
Network. According to the DElncRNAs, DEmiRNAs, and DEmRNAs and their expression information in HCC samples, cluster analysis was carried out to determine whether there were outlier samples. If they exist, the outlier samples need to be removed before analysis. WGCNA is a freely accessible R package [18] for performing weighted correlation network analysis. First, the optimal soft-thresholding power (β) value was calculated based on the expression data of the DElncRNAs, DEmiRNAs, and DEmRNAs in the HCC samples. Then, the coexpression matrix was constructed according to the soft threshold and the adjacency between genes was calculated. According to the similarity between genes, the coefficient of dissimilarity between genes was deduced and the cluster dendrogram of genes was obtained. Thus, the modules of lncRNAs/mRNAs and miRNAs were identified and the hierarchical cluster dendrogram of the genes in the module was displayed. Finally, according to the setting of the phenotypic information of the grouping traits, the correlation between the gene module and the phenotype was calculated and the trait-related module was identified.

Construction and Analysis of the ceRNA Network.
According to the predicted miRNA-target binding relationship and the expression relationship in HCC samples, the negatively correlated miRNA-mRNA and miRNA-lncRNA pairs were screened. Combining the differentially expressed mRNAs, miRNAs, and lncRNAs, the coexpression network of the ceRNA (lncRNA-miRNA-mRNA) network was constructed. The lncRNA-miRNA-mRNA network was constructed and visualized with Cytoscape V3.7.
2.7. Protein-Protein Interaction (PPI) Network Analysis. The PPI network between the encoded proteins of the DEmRNAs was constructed by using the STRING protein interaction database. Protein interaction data from the STRING database were downloaded and imported into Cytoscape. Then, the protein interaction network was constructed using Cytoscape software.
2.8. Prognostic Gene Screening. HCC gene detection data with survival information were selected from the TCGA database. According to the expression of the DEGs in the sample, a univariate Cox regression model was used to analyze the relationship between the overall survival of HCC patients and the DEGs in the ceRNA network and p < 0:05 was considered to be significant. Then, the selected DEGs were visualized using the survival curve in the survival analysis.
2.9. Construction of the Risk Score System for Prognostic Prediction. After removing samples with missing survival time, 220 patients were randomly divided into two cohorts: a training cohort (n = 132) and a test cohort (n = 88). Then, univariate Cox proportional hazards regression was performed on lncRNAs in the training group and lncRNAs significantly related to OS in HCC patients were included in subsequent analysis. Next, lasso regression was used to select the potential risk genes and to eliminate genes that overfit the model. Finally, we used Cox proportional hazards regression to build a prognostic risk model. The calculation of the risk score used the following formula: riskscore = coefficient ðgene1Þ × expression value of ðgene1Þ + coefficient ðgene2Þ × expression value of ðgene2Þ + ⋯+coefficient ðgeneNÞ × expression value of ðgeneNÞ. Using the median risk score of the training cohort as a cutoff value, all HCC patients were divided into a highrisk group and a low-risk group.

Establishment and Evaluation of the Nomogram Model
for Predicting the Survival Rate of HCC Patients. Univariate and multivariate Cox regression analyses were used to screen independent factors related to prognosis, and a nomogram model was established and visualized for the obtained independent prognostic factors. To evaluate the prediction ability of the nomogram, we used resampling technology to carry out statistical inspection and drew the calibration curve. The closer the calibration curve is to the 45°line, the better the predictive power of the model constructed by the factors. 2.12. Statistical Analysis. All statistical analyses were performed using R software, and the Pearson correlation coefficient test was used to evaluate the rank correlation among the different variables. Kaplan-Meier curves and the log-rank test were used for survival data analysis. Univariate Cox regression analysis was used for survival factor analysis. Multivariate Cox regression analysis was used to determine independent prognostic factors, and a time-dependent receiver operating characteristic (ROC) curve was plotted to evaluate the accuracy of the prognostic prediction model.

GO and Pathway Analysis of DEmRNAs in HCC.
Compared with normal liver tissue, there were 1896 DEmRNAs in HCC tissues, including 953 upregulated DEmRNAs and 943 downregulated DEmRNAs. These upregulated and downregulated DEmRNAs were analyzed by using the GO database. These genes were enriched by terms in the GO database to determine their functions. The upregulated DEmRNAs were mainly enriched in rRNA processing, SRP-dependent cotranslational protein targeting to the membrane, translational initiation, nuclear-transcribed mRNA catabolic process, nonsense-mediated decay, viral transcription, and translation ( Figure 1(d)), while the downregulated DEmRNAs were mainly enriched in the oxidation-reduction process, regulation of complement activation, complement activation, classical pathway, receptor-mediated endocytosis, and proteolysis ( Figure 1(e)). Then, we performed KEGG pathway enrichment analysis and found that the upregulated DEmRNAs were mainly correlated with the ribosome, metabolic pathways, protein processing in the endoplasmic reticulum, spliceosome, and cell cycle (Figure 1(f)), while the downregulated DEmRNAs were mainly correlated with metabolic pathways, complement and coagulation cascades, carbon metabolism, valine, leucine and isoleucine degradation, and fatty acid degradation (Figure 1(g)).

Construction and Analysis of the Weighted Coexpression
Network. According to the expression information of the DElncRNAs, DEmiRNAs, and DEmRNAs in the samples, cluster analysis was carried out and the results showed that there were no outlier samples (Figures 2(a) and 2(b)). To determine the relative balance between scale independence and mean connectivity, the network topology with a soft threshold power of 1 to 20 was analyzed. Finally, we determined that the optimal β value of mRNAs/lncRNAs was 7 ( Figure 2(c)) and the optimal β value of miRNAs was 4 ( Figure 2(d)) in the coexpression network analysis. According to the consistent topological overlap and the corresponding module colors represented by the color row, a gene dendrogram was obtained by clustering the dissimilarity. Each colored row represents a color-coded module that contains a set of highly connected genes. Finally, nine modules were generated in the lncRNA/mRNA coexpression network ( Figure 2(e)) and five modules were generated in the miRNA coexpression network (Figure 2(f)). Then, we calculated and mapped the relationship between each module and the corresponding clinical features. From the miRNA correlation module (Figure 2(h)), we found that the turquoise module had the largest negative correlation (module-trait weighted correlation = −0:67; the number of DEmiRNA = 22) with the tumors related to the miRNA coexpression network and had the largest positive correlation (module-trait weighted correlation = 0:17) with survival time. Therefore, for the miRNA module, we selected the turquoise module to construct the ceRNA network. In addition, from the lncRNA/mRNA correlation module (Figure 2(g)), we found that the turquoise module had the largest positive correlation (module-trait weighted correlation = 0:54; the number of DEmRNAs and the number of DElncRNAs are 481 and 48, respectively) with the tumor related to the lncRNA/mRNA coexpression network and the largest negative correlation with survival time (module-trait weighted correlation = −0:27). Therefore, in the lncRNA/mRNA module, the turquoise module was selected to construct the ceRNA network. Finally, the turquoise lncRNA/mRNA module was combined with the turquoise miRNA module to construct the ceRNA network.

Construction and Verification of the Prognostic Risk
Model. To explore the significance of risk genes in predicting the prognosis of HCC patients, the risk score of each patient was calculated by the estimated regression coefficient and expression level of the risk lncRNAs. The calculation formula is as follows: training cohort risk score = ð0:1895 × expression of CTD − 2510F5:4Þ + ð0:1516 × expression of DSTNP2Þ. According to the median risk score, patients in the training cohort were divided into a high-risk group  9 Journal of Immunology Research and a low-risk group. We created a Kaplan-Meier curve based on the log-rank test, which showed that the prognosis of the high-risk group was worse than that of the low-risk group (p < 0:05) (Figure 6(a)). Then, we ranked the risk scores of the patients in the training cohort and generated a distribution map according to the survival status of each patient. A heat map was used to describe the expression of risk genes in the two prognosis groups (Figure 6(b)). Finally, we used a time-dependent ROC curve to test the accuracy of the model in predicting 1-, 3-, and 5-year overall survival. The area under curve (AUC) values of the prediction model were 0.854 at 1 year, 0.756 at 3 years, and 0.756 at 5 years ( Figure 6(c)). To verify the accuracy of the prognostic risk model, we used it to analyze the test cohort and the whole TCGA cohort. Kaplan-Meier survival curve analysis in the test cohort (Figure 6(d)) and the whole TCGA cohort (Figure 6(g)) showed that the prognosis of the high-risk group was worse than that of the low-risk group (p < 0:05). The distribution of risk scores, survival status, and risk gene expression in the test cohort and the whole TCGA cohort are shown in Figures 6(e) and 6(h) and were similar to the results of the training cohort. The ROC analysis results showed that the AUCs at 1 year, 3 years, and 5 years in the test cohort were 0.736, 0.668, and 0.73 and the AUCs at 1 year, 3 years, and 5 years in the whole TCGA cohort were 0.798, 0.723, and 0.751, respectively (Figures 6(f) and 6(i)). These results indicate that our prognostic risk model can accurately predict the prognosis of patients with HCC.

Analysis of Independent Prognostic Factors. Univariate
Cox regression analysis showed that vascular tumor invasion, TNM stage, and risk score were related to the prognosis of patients with HCC (Figure 7(a)). To further confirm independent prognostic factors, we performed multivariate Cox regression analysis (Figure 7(b)). The results also showed that vascular tumor invasion, TNM stage, and risk score were significantly related to prognosis and might be considered independent prognostic factors.

Discussion
As the most common pathological type of liver cancer, the early symptoms of HCC are not obvious. Most patients are not diagnosed until they are in advanced stages; their prognosis is poor, and the treatment process can be a very painful experience. Therefore, it is very important to identify effective prognostic biomarkers and explore potential regulatory networks. The ceRNA hypothesis has been considered to be a new method of gene regulation through the competitive binding of miRNAs in HCC [28].
To further explore the regulatory network of prognosisrelated molecules of HCC, we first used the TCGA database to identify the DElncRNAs, DEmiRNAs, and DEmRNAs in HCC and normal liver tissues. Then, GO and KEGG pathway enrichment analyses were carried out to further explore the main biological processes and regulatory pathways involved in these DEmRNAs. Next, we used WGCNA to identify the modules and selected the turquoise lncRNA/mRNA module and turquoise miRNA module, which were most closely related to the occurrence of HCC and the survival time of patients with HCC. Finally, based on the miRNA prediction website and the DElncRNAs, DEmiRNAs, and DEmRNAs in the above two modules, negatively correlated miRNA-lncRNA and miRNA-mRNA relationship pairs were constructed and a ceRNA network was generated by Cytoscape software. GO enrichment analysis revealed that the DEmR-NAs in the ceRNA network were mainly related to mRNA splicing via the spliceosome, cell division, the viral process, and the mitotic cell cycle, and KEGG enrichment analysis showed that the DEmRNAs were mainly related to metabolic pathways, the cell cycle, the spliceosome, and DNA replication. Abnormal regulation of cell division and the mitotic cell cycle are key to the occurrence of cancer [29]. Moreover, many studies have shown that metabolic pathways play an   important role in HCC [30,31]. In this ceRNA network, we found that 30 lncRNAs may regulate the expression of 75 mRNAs by competitively binding 16 miRNAs. Then, the interaction among proteins was presented through the PPI network. Finally, the survival analysis of these 15 hub mRNAs showed that the high expression of 14 hub mRNAs (H2AFZ, HNRNPA1, RAN, SNRPD1, H2AFX, NASP, PPIA, CSNK1D, NAP1L1, DARS2, FARSB, SMARCC1, TPM3, and ZCCHC17) was related to the poor prognosis of patients with HCC. Before this, Bai et al. [32] found that PPIA may be a potential marker of gastric cancer, while Sun et al. [33] confirmed that the increased expression of DSN1 is related to the poor survival of patients with HCC. However, their mechanism of action has not been elucidated. Our study found that the regulatory axis of lncRNA SNHG3/mir-139-5p/DSN1 and lncRNA SNHG3/let-7c-5p/DSN1 may provide direction for the study of their mechanism of action. Moreover, some studies have shown that miRNA-2 promoted carcinogenic activity by upregulating the expression of RAN in HCC cells [34]; the high expression of HNRNPA1 promoted the invasion of gastric cancer cells [35]; lncRNA CDKN2B-AS1 promoted the growth of HCC by regulating the let-7c-5p/NAP1L1 axis [36]. Therefore, these results may also be the reason why the high expression of these hub genes is associated with poor prognosis in patients with HCC. Univariable Cox regression analysis of the DElncRNAs in the ceRNA network revealed that 19 lncRNAs (AC016747. In recent years, the role of lncRNAs in a variety of cancers has been widely reported. Many experimental studies have shown that lncRNAs play an important role in many biological processes, such as cell cycle regulation, DNA damage, signal transduction, and epigenetic regulation [37]. Moreover, lncRNAs play a role through the competitive binding of miR-NAs to regulate their target mRNAs [38]. For example, lncRNA SNHG3 can promote the progression of breast cancer [39], gastric cancer [40], lung cancer [41], laryngeal carcinoma [42], renal cell carcinoma [43], and liver cancer [44]. Zhang et al. [45] found that lncRNA SNHG3 induced EMT in HCC cells via miR128/CD151 cascade activation and that high expression of SNHG3 was associated with poor survival outcomes in HCC patients. Our research results also predicted that the high expression of lncRNA SNHG3 can be used as a biomarker of poor prognosis in patients with HCC and that SNHG3 may competitively bind multiple miRNAs to affect mRNA expression. Wang and Qin [46] found that lncRNA CTD-2510F5.4 may be involved in the pathogenesis of gastric cancer and has potential as a biomarker for the diagnosis and prognosis of gastric cancer. Similarly, we also found that the high expression of CTD-2510F5.4 was related to the poor prognosis of patients with HCC and has the potential to be a biomarker of HCC.
To identify the impact of these lncRNAs on survival prognosis, we constructed a risk prediction model. First,