Construction and Analysis of Hepatocellular Carcinoma Prognostic Model Based on Random Forest

Methods Transcriptome data and clinical data of HCC were downloaded from the TCGA database. Screen important genes based on the random forest method, combined with differential expression genes (DEGs) to screen out important DEGs. The Kaplan‒Meier curve was used to evaluate its prognostic significance. Cox regression analysis was used to construct a survival prognosis prediction model, and the ROC curve was used to verify it. Finally, the mechanism of action was explored through GO and KEGG pathway enrichment and GeneMANIA coexpression analyses. Results Seven important DEGs were identified, three were highly expressed and four were lowly expressed. Among them, GPRIN1, MYBL2, and GSTM5 were closely related to prognosis (P < 0.05). After the survival prognosis prediction model was established, the survival analysis showed that the survival time of the high-risk group was significantly shortened (P < 0.001), but the ROC analysis indicated that the model was not superior to staging. Twenty coexpressed genes were screened, and enrichment analysis indicated that glutathione metabolism was an important mechanism for these genes to regulate HCC progression. Conclusion This study revealed the important DEGs affecting HCC progression and provided references for clinical assessment of patient prognosis and exploration of HCC progression mechanisms through the construction of predictive models and gene enrichment analysis.


Introduction
Liver cancer is the ffth most common cancer and the fourth leading cause of cancer-related death worldwide [1]. HCC is the most common type of liver cancer, accounting for about 75%-85%, and has the characteristics of a high mortality and high metastasis and recurrence rate [2]. Studies have shown that genetic mutations, chromosomal aberrations, molecular signaling pathways, and epigenetic dysregulation are all associated with the development of HCC [3]. At present, in addition to traditional surgical resection, radiofrequency ablation, transarterial chemotherapy, and other methods have been developed for the treatment of liver cancer [4].
Undeniably, surgical resection is still the most efective treatment for HCC. However, due to the insidious onset of HCC, many patients have already lost the opportunity for surgery when they come to the clinic. Even with surgical resection, the 5-year recurrence rate is as high as 70% [5], and the 5-year overall survival (OS) rate is only 15%-19% [6].
It is well known that the tumor microenvironment plays a crucial role in the occurrence and development of tumors. Te interaction between various signaling molecules in the microenvironment is also a hot topic in tumor-related research. Te occurrence of HCC is closely related to the infammatory response of the environment, and 90% of HCC is associated with infammation [7]. In the state of liver infammation, the dysregulation of the interaction between cytokines, chemokines, and growth factors is an important cause of liver cancer [8,9]. Te original research on the IL-6, IL-1, and TGF-beta infammatory molecules based on recent years for immune checkpoint research to further explore the development mechanism of HCC provided new insights. Studies have found that immune checkpoint molecules, such as programmed death-1(PD-1), cytotoxic T-lymphocyte antigen 4 (CTLA4), lymphocyte activating gene 3 protein (LAG-3), and mucin domain molecule 3 (TIM-3), are upregulated on liver cancer cells and tumor-specifc T cells, which can lead to CD8 + T cell apoptosis and poor prognosis in patients [10]. At the same time, we cannot ignore that there are a large number of proangiogenic factors produced by cancer cells or tumor-infltrating lymphocytes or macrophages in the tumor microenvironment, such as vascular endothelial growth factor (VEGF), which can promote tumor angiogenesis [11]. Angiogenesis is indispensable for tumor development, invasion, and metastasis [12]. On this basis, targeted drugs such as sorafenib, lenvatinib, VEGF inhibitors, and immune checkpoint inhibitors (ICIs) have signifcantly improved the prognosis of patients in recent years, but the overall treatment efect is still poor due to the changes in HCC heterogeneity and the continuous emergence of phenotypic drug resistance [13][14][15][16][17][18][19]. Terefore, it is particularly urgent to fnd a way to evaluate the disease early and take personalized treatment measures to improve the prognosis of patients.
Te development of liver cancer is a multistep process caused by changes in signaling pathways triggered by multiple genes, and it shows high heterogeneity within tumors, between tumors, and between patients [20][21][22][23][24]. DEGs play an important role in this process. Terefore, considering the high heterogeneity of HCC, the limited treatment methods, and the poor prognosis of patients, it is more urgent to further explore the development mechanism of HCC and new survival and prognostic models. Nault et al. frst identifed a genetic marker associated with the development of HCC in 2013 [25]. Subsequent studies have also shown that gene mutations occurring in HCC can be used as biological markers for targeted therapies [26]. Although it has been demonstrated that programmed death ligand-1 (PD-L1) inhibitors combined with antiangiogenesis therapy can signifcantly improve the prognosis of patients with HCC [27], intervention-related toxicity and difculty in determining the optimal dosing phase have hindered further beneft for patients [28,29]. Terefore, the search for HCC-related dysfunctional genes is particularly important to elucidate the mechanisms underlying the development of the disease and to improve the prognosis of patients. Tanks to the rapid development of sequencing technology, many disease-related marker genes have been identifed one after another, which has laid a solid foundation for the screening of HCC-related genes and the establishment of prognostic models. Public databases such as Te Cancer Genome Atlas (TCGA) are useful tools to screen microarray data for DEGs associated with the development of HCC [30,31].
In this study, we used a random forest algorithm to identify key genes expressed in HCC in the TCGA database and screened DEGs between HCC and normal samples. On this basis, 7 important DEGs were fnally screened. Subsequently, we performed enrichment analysis on these 7 important DEGs and analyzed the expression levels of these genes in diferent clinical states. Furthermore, we performed survival analysis and COX regression analysis, constructed a prognostic risk score model, and plotted the receiver operating characteristic (ROC) curve. Finally, 20 coexpressed genes were screened by Gene-MANIA, and GO and KEGG enrichment analyses were performed to further explore their biological functions and molecular pathways. Te DEGs of HCC discovered in this study, as well as the constructed survival prognosis prediction model, are expected to provide new insights into the clinical treatment and biological mechanisms of HCC.

Data Source.
Te data for our study were extracted from Te Cancer Genome Atlas (TCGA; https://portal.gdc.cancer. gov, up to July 31, 2022) database, which contains transcriptomic data of 374 HCC tumor samples and 50 normal samples, as well as 370 clinical samples and related data. We used data obtained from the TCGA database using Illumina HiSeq Systems, and the sequencing data format was a Counts fle.

Random Forest Screening for Important Genes.
Build a random forest model using the random forest package [32]. First, calculate the average model false positive rate based on out-of-band data for all genes and select 400 as the optimal number of trees to include in the random forest. Next, build a random forest model and use the Gini coefcient method to obtain dimension importance values for the random forest model. Te genes with the top 30 importance values were selected for subsequent analysis.

Identifcation of DEGs.
Expression data downloaded from TCGA were analyzed using the Limma package of R version 4.2.0 [33], and fold diferential expression was calculated after removing or averaging probe sets without corresponding gene symbols or genes with multiple probe sets, respectively. Te criteria for setting the DEG were as follows: genes with adjusted P value <0.05 and |logFC (fold change)| ≥ 1. Plot volcano plots using the ggplot2 package. Next, DEGs and genes screened by randomForest were combined to extract important diferentially expressed genes.

Expression of DEGs in Diferent Clinical States.
We further investigated the expression of DEGs and their association with diferent clinical states of HCC: event, age, gender, and stage. Violin plots were drawn using the ggplot2 package of the R software. Diferences in gene expression among diferent groups were analyzed using SPSS 27.0. Defnitions: * P < 0.05, * * P < 0.01, * * * P < 0.001.

Construction and Validation of an HCC Prognostic Risk
Scoring Model. Kaplan-Meier survival analysis was performed on the important DEGs using the R software survival and survminer packages, and the related survival curves were drawn.
Based on univariate and multivariate Cox regression analyses, a prognostic risk score model was constructed. According to the risk score grouping, prognostic analysis and Cox regression analysis were performed to verify whether the risk score could be used as an independent risk factor for evaluating the survival and prognosis of HCC patients. Te specifcity and sensitivity of the risk scoring model were verifed using the R software pROC package, and the ROC was drawn using the ggplot2 package.
2.6. Enrichment Analyses of Important DEGs. Analysis of Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment plays a very important role in the annotation of gene products and the study of molecular mechanisms [34,35]. We used enrichGO and enrichKEGG packages in the R language for enrichment, and adjusted P values <0.05 were considered signifcant; enrichment point plots were drawn using the ggplot2 package.

Analysis of the Relationship between DEGs Genes.
We constructed a coexpression network of these genes using GeneMANIA (https://www.genemania.org/) [36] and identifed associations within them. Subsequently, we further carried out enrichment analysis on the coexpressed genes of important DEGs, intending to explore their biological functions and molecular mechanisms.

Identifcation of Important DEGs.
Te research fowchart of this study is shown in Figure 1. First, we downloaded the expression profling data of LIHC from the TCGA database along with clinical data. To fnd the genes with the greatest infuence on the phenotype, we used the random forest method to screen. Te relationship between the error of the reference model and the number of decision trees is shown in Figure 2(a). We selected 400 trees as the parameters of the fnal model, and the model error was stable at this time. We evaluated the fnal results using the Gini coefcient method and selected the top 30 genes as candidate genes (Table 1 and Figure 2(b)).
Next, we screened 1,564 DEGs using the Limma method and plotted the volcano (Figure 2(c)). After interacting with 30 candidate genes, 7 important DEGs were fnally screened ( Table 2). Analysis of the expression profles of these seven DEGs revealed that MAFG-DT, GPRIN1, and MYBL2 were highly expressed in the tumor group, and LINC00907, GSTZ1, CCN1, and GSTM5 were lowly expressed in the tumor group (Figure 2(d)).

Expression and Clinical Parameters of Important DEGs in Patients.
To further analyze the relationship between these seven important DEGs and clinical status, we used SPSS to compare the expression diferences of each gene under diferent groups, and used the ggplot2 package of R software to draw a violin plot (Figures 3-6).
According to the outcome of patience, we divided the patients into the normal group, live group, and dead group. Te results showed that compared with the normal group, the live and death groups showed signifcant diferences in each gene (P < 0.001). In addition, GSTZ1 expression was signifcantly higher in the live group than that in the death group (P � 0.042489), while MAFG (P � 0.047899), GPRIN1 (P � 0.003478), and MYBL2 (P � 0.000164) were signifcantly lower ( Figure 3).
Stratifed analysis by age revealed that GSTZ1 expression was signifcantly lower in the 41-60 years group than that in the 61-80 years group (P � 0.037252), while the opposite was true for MYBL2 (P � 0.017321). Te expression of  LINC00907 in the ≤20 years group was signifcantly higher than that in the 21-40 years group (P � 0.035828). In addition, the expression of CCN1 was signifcantly lower in the group of 41-60 years (P � 0.03869) and 61-80 years (P � 0.026584) than in that of the over 80-year-old group.
Tere was no signifcant diference in the expression of the remaining genes among the age groups (P > 0.05) ( Figure 4). By analyzing the expression levels of each gene in patients of diferent genders, we found that the expression level of GSTZ1 (P � 0.010058) in female patients was signifcantly lower than that in male patients, while LINC00907 (P � 0.000144) was just the opposite ( Figure 5).
Te expression level of MAFG-DT was signifcantly lower in stage I than in stages II (P � 0.006818) and III (P � 0.000635) when grouped according to the HCC stage. However, the expression level of GSTZ1 in stage I was signifcantly higher than that in stage III (P � 0.041631). In addition, GPRIN1 expression was signifcantly lower in stage I compared with stage II (P � 0.001831) and stage III (P � 0.001096), and the same diference was observed with MYBL2 (P � 0.009085; P � 0.000282). In contrast, the    Figure 2: Identifcation of important DEGs: (a) the efect of the number of decision trees on the error rate. Te X-axis represents the number of decision trees, and the Y-axis represents the error rate. When the number of decision trees is about 400, the error rate is relatively stable; (b) results of the Gini coefcient method in random forests. Te X-axis represents genetic variables and the Y-axis represents the importance index; (c) the volcano diagram; (d) the heatmap diagram. expression level of GSTM5 in stage III was signifcantly lower than that in stages I (P � 0.032924) and II (P � 0.030159) ( Figure 6).

Impact of Important DEGs on Patients' OS.
We performed survival analysis using the survival and survminer packages of R software and used survival curves combined with log-rank tests to assess the impact of important DEGs on patient OS. As shown in Figure 7, high expression of GPRIN1 and MYBL2 indicated better prognosis of patients (P � 0.002; P � 0.00047), while patients with low expression of GSTM5 had better prognosis (P � 0.01). However, MAFG-DT, GSTZ1, LINC00907, and CCN1 were not associated with patient prognosis (all P > 0.05).

Construction and Validation of a Prognostic Risk Model for HCC Patients.
First, the selected seven important DEGs were combined with survival time and survival status and then included in a multivariate Cox regression analysis (Table 3 and Figure 8(a)), and a prognostic risk score model was constructed based on the results. Te risk score calculation method is 7 important as the product of DEGs expression level and risk coefcient. Te specifc risk score model is risk score � "MAFG-DT" * 0.069645 − "GSTZ1" * 0.070909 + "GPRIN1" * 0.009885 + "MYBL2" * 0.4614 18 + "LINC00907" * 0.010721 + "CCN1" * 0.227363 − "GSTM5" * 0.116514.
We divided 346 HCC samples into a high-risk group and a low-risk group, with 173 cases in each group, using the median risk value (2.4211) as the cutof value. Survival     analysis showed that there was a signifcant diference between the two groups (P � 0.00017), and the survival time of the low-risk group was signifcantly longer than that of the high-risk group (Figure 8(b)). Te results of univariate Cox regression analysis showed that the risk score model was signifcantly correlated with survival time and survival status (hazard ratio � 0.5066, 95% Confdence interval � 0.353-0.727, P � 0.000224); further multivariate Cox regression analysis results also showed that the risk score model was closely related to the prognosis status (hazard ratio � 0.5549, 95% confdence interval � 0.3767-0.8173, P � 0.00287) ( Table 4 and Figure 8(c)).
Te subsequent proportional hazard analysis also confrmed that patients with a lower risk score had a better prognosis (Figure 8(d)).
Finally, using ROC to evaluate the relationship between clinical characteristics and the prognosis of HCC patients, the results showed that the risk scoring model (AUC = 0.582) was the second most important factor for prognosis after stage (AUC = 0.659) (Figure 9).

Enrichment Analysis and Interaction Analysis.
To further explore the mechanism of action and signaling pathway of important DEGs, we performed GO and KEGG enrichment  analyses on them. GO enrichment analysis found that important DEGs were mainly enriched in the glutathione metabolic process (P � 0.000117), glutathione transferase activity (P � 1.91E − 05), and transferase activity, transferring alkyl or aryl (other than methyl) groups (P � 0.0001) ( Table 5 and Figure 10(a)). KEGG enrichment analysis showed that important DEGs were mainly enriched in the following seven pathways: tyrosine metabolism (P � 0.01318), glutathione metabolism (P � 0.020815), chemical carcinogenesis-DNA adducts (P � 0.02516), drug metabolism-cytochrome P450 (P � 0.026244), platinum drug resistance (P � 0.026605), metabolism of xenobiotics   by cytochrome P450 (P � 0.02841), and drug metabolism-other enzymes (P � 0.029131) ( Table 6 and Figure 10(b)).
To understand the interaction network of these genes, we used the GeneMANIA database for analysis. Te results showed that these genes were in a complex PPI network, with physical Interactions of 77.64%, coexpression of 8.01%, predicted of 5.37%, colocalization of 3.63%, genetic Interactions of 2.87%, pathway of 1.88% and shared protein domains of 0.60% (Figure 10(c)). GO enrichment analysis of these 25 coexpressed genes found that, for biological processes, they were mainly enriched in the glutathione metabolic process (P � 2.27E − 10), benzene-containing compound metabolic process (P � 4.18E − 08), and cellular modifed amino acid metabolic process (P � 1.40E − 07); in terms of cell composition, it is mainly located in the intercellular bridge (P � 2.36E − 06), transcription regulator complex (P � 2.59E − 05), and transcription repressor complex (P � 0.000121); and its molecular functions are mainly involved in glutathione transferase activity (P � 1.56E − 10), glutathione binding (P � 4.64E − 10), and oligopeptide binding (P � 7.28E − 10) (Table 7 and Figure 10(d)). KEGG enrichment analysis showed that these genes were mainly enriched in the following pathways: cellular senescence (P � 3.38E − 08), glutathione metabolism (P � 1.50E − 07), chemical carcinogenesis-DNA

Discussion
At present, many studies have found many genes that afect HCC, but the mechanism of HCC occurrence and development is still not very clear, and there is an urgent need to further explore the factors afecting its development and prognosis. Although several previous studies have analyzed gene signatures related to HCC prognosis [30,[37][38][39], these studies did not further screen genes that are more closely related to patient survival after screening DEGs. Terefore, in this study, we used the random forest and limma algorithms to screen out 30 important genes and 1,564 DEGs, respectively, and then took the intersection of the two to further screen out 7 important DEGs: MAFG-DT, GSTZ1, GPRIN1, MYBL2, LINC00907, CCN1, and GSTM5. Subsequent enrichment analysis, expression profling analysis, survival analysis, and the constructed prognostic prediction model indicated that they are closely related to the occurrence and prognosis of HCC.
Among the seven important DEGs, MAFG-DT (logFC � 2.295817), GPRIN1 (logFC � 2.444281), and MYBL2 (logFC � 3.861042) were all signifcantly elevated in liver cancer samples (Figure 2(d)). MAFG-DT is an oncogenic long noncoding RNA (lncRNA), and many previous studies have shown that overexpression of MAFG-DT can promote the proliferation and metastasis of cancer cells [40][41][42][43]. High expression of MAFG-DT is signifcantly associated with poor prognosis in bladder and breast cancer patients [44,45]. In this study, MAFG-DT was highly expressed in liver cancer patients, and the normal group was signifcantly diferent from the liver cancer group after grouping by age, gender, and stage (Figures 3-6). In addition, after grouping according to the high and low expression of MAFG-DT, the survival time of the low expression group was higher than that of the high expression group, although there was no signifcant diference (P � 0.19) (Figure 7). As a member of the GPRIN family, GPRIN1 acts on the classical G protein-coupled receptor pathway [46]. GPRIN1 is closely related to cancer, and it can promote the proliferation and metastasis of lung cancer by promoting the epithelial-mesenchymal transition of lung cancer cells [47]. But interestingly, Zhou et al. found that GPRIN1 is downregulated in gastric cancer cells and tissues, and it can inhibit the invasion and metastasis of gastric cancer [48]. Our results showed that GPRIN1 was highly        (Figure 7). Terefore, we speculate that due to the diferent molecular biological mechanisms of diferent cancers, GPRIN1 promotes the malignant behavior of tumors in lung cancer and liver cancer but plays an inhibitory role in gastric cancer. MYBL2, a transcription factor in the myeloblastosis family, plays a key role in cell proliferation, diferentiation, and cell cycle. Previous studies have shown that MYBL2 is highly expressed in cancers such as ovarian cancer and breast cancer and afects the prognosis of patients [49,50].
Our study on liver cancer also proved that MYBL2 is highly expressed in cancer tissues, and the high-expression group has a poor prognosis (P � 0.00047) (Figure 7). In addition, LINC00907 (logFC � −2.6057), CCN1 (logFC � −2.05925), GSTZ1 (logFC � −2.34197), and GSTM5 (logFC � −2.8954) were lowly expressed in liver cancer tissues. Trough survival analysis, only GSTM5 was found to be associated with patient prognosis (P � 0.01) (Figure 7). Te glutathione S-transferase (GST) gene family plays an important role in detoxifcation in the body, protecting cells from damage by poisons and charged particles. Te GST family is closely related to the occurrence and development of many tumors, and the same GSTM5 as a member is abnormally expressed in ovarian cancer and nonsmall cell lung cancer [51,52]. In this study, the expression of GSTM5 was decreased in HCC tissues and suggested a poor prognosis, which was also consistent with the previous fndings, further indicating that GSTM5 plays a key role in tumorigenesis.
Next, we constructed a risk-scoring model based on the multivariate Cox regression analysis of 7 important DEGs. Kaplan-Meier survival analysis showed that the high-risk group had a signifcantly lower survival time (P � 0.00017) (Figure 8(b)). However, subsequent ROC analysis showed that the risk-scoring model was not dominant compared to the stage (AUC = 0.659 vs 0.582) ( Figure 9). However, in the ROC analysis, the AUC of the risk model is second only to the stage. Combined with the results of the Kaplan-Meier survival analysis, we believe that the risk model still has a certain signifcance. Especially when the patient is in the early stage of the disease, and the stage cannot indicate the disease progression, early individualized treatment according to the risk score has extremely important value for improving the prognosis of patients.
To further study the molecular signaling pathways of important DEGs, we performed enrichment analysis and coexpression analysis. As we discussed before, both GSTZ1 and GSTM5 belong to the GST family, so the screened important DEGs were mainly enriched in the glutathione metabolic process and glutathione transferase activity (P � 0.00011692; P � 1.91E − 05) ( Table 5), and further enrichment analysis of their interacting proteins showed the same results (P � 2.27E − 10; P � 1.56E − 10) ( Table 7). Te KEGG enrichment analysis showed that glutathione metabolism (P � 1.50E − 07) was still a major enrichment pathway, and interestingly, we found that more genes were enriched in cellular senescence (P � 3.38E − 08) ( Table 8). In addition to the central gene MYBL2, the cellular senescence pathway has its associated genes: LIN9, LIN37, LIN54, E2F4, RBL1, and FOXM1, which together constitute the DREAM complex and play an important role in cell cycle regulation [53]. Tis proves that MYBL2 can also promote the progression of HCC by interfering with cell cycle regulation.
However, this study still has some limitations. Firstly, the data for model construction and validation were obtained from the TCGA database. Tis public database contains incomplete information, and the data are all retrospective. Terefore, prospective real-world studies are necessary to verify the reliability of the model. It should be noted that in the process of research, it is necessary to comprehensively collect data at all stages of disease progression, such as blood samples and imaging data, etc., to eliminate information distortion caused by incomplete data collection as far as possible. To improve the representativeness of the results, multicenter studies are also necessary. Secondly, the data used in this study were only from the TCGA database, which may make the results lack representative. Terefore, data from other databases, such as GEO and Oncomine databases, can be selected for subsequent analysis and crossvalidation. Tirdly, the diagnostic efcacy of the risk score model constructed in this study was not superior to staging, although the results of its KM analysis were benefcial. One of the reasons for this may be due to the bias caused by the data analysis using only the TCGA database. However, compared with the stage, the risk scoring model has more important signifcance in the evaluation of patients in the early stage of the disease to improve the prognosis. Te performance of this model will be tested in clinical cohorts in the future. Finally, the seven important DEGs screened in this study are currently only data demonstrations. We will carry out in vitro and in vivo experiments to further explore the specifc molecular pathways of these genes in HCC.

Conclusion
In conclusion, we innovatively used a combination of random forest and Limma to screen out the important DEGs for HCC development. Expression analysis and survival analysis were performed, indicating that these genes are closely related to the survival of HCC patients. Te subsequently constructed prognostic risk scoring model has good predictive value for the prognosis of HCC patients, and combining it with other clinical indicators can provide an efective reference for clinical treatment. Subsequent enrichment analysis and coexpression analysis showed that seven important DEGs were closely related to cellular senescence and glutathione metabolism, which also provided new ideas for further research on the molecular mechanism of HCC occurrence and development. In brief, the early risk score model provided in this study can be used as a reference for subsequent personalized treatment of patients and ultimately help to improve prognosis.

Data Availability
Te data used to support the fndings of this study are included within the article.

Conflicts of Interest
Te authors declare that there are no conficts of interest.

Authors' Contributions
Yikai Wang and Xiongtao Liu conceived and designed the study. Yikai Wang analyzed data, created graphs, and wrote the manuscript. Le Ma, Pengjun Xue, and Bianni Qin checked the analysis results. Ting Wang, Bo Li, and Lina Wu helped collect data and references. Liyan Zhao checked the article's format and language. Xiongtao Liu is responsible for the overall content as a guarantor.