The Use of Machine Learning to Create a Risk Score to Predict Survival in Patients with Hepatocellular Carcinoma: A TCGA Cohort Analysis

Introduction. Hepatocellular carcinoma (HCC) accounts for approximately 90% of primary liver malignancies and is currently the fourth most common cause of cancer-related death worldwide. Due to varying underlying etiologies, the prognosis of HCC differs greatly among patients. It is important to develop ways to help stratify patients upon initial diagnosis to provide optimal treatment modalities and follow-up plans. The current study uses Artificial Neural Network (ANN) and Classification Tree Analysis (CTA) to create a gene signature score that can help predict survival in patients with HCC. Methods. The Cancer Genome Atlas (TCGA-LIHC) was analyzed for differentially expressed genes. Clinicopathological data were obtained from cBioPortal. ANN analysis of the 75 most significant genes predicting disease-free survival (DFS) was performed. Next, CTA results were used for creation of the scoring system. Cox regression was performed to identify the prognostic value of the scoring system. Results. A total of 363 patients diagnosed with HCC were analyzed in this study. ANN provided 15 genes with normalized importance >50%. CTA resulted in a set of three genes (NRM, STAG3, and SNHG20). Patients were then divided into 4 groups based on the CTA tree cutoff values. The Kaplan–Meier analysis showed significantly reduced DFS in groups 1, 2, and 3 (median DFS: 29.7 months, 16.1 months, and 11.7 months, p < 0.01) compared to group 0 (median not reached). Similar results were observed when overall survival (OS) was analyzed. On multivariate Cox regression, higher scores were associated with significantly shorter DFS (1 point: HR 2.57 (1.38–4.80), 2 points: HR 3.91 (2.11–7.24), and 3 points: HR 5.09 (2.70–9.58), p < 0.01). Conclusion. Long-term outcomes of patients with HCC can be predicted using a simplified scoring system based on tumor mRNA gene expression levels. This tool could assist clinicians and researchers in identifying patients at increased risk for recurrence to tailor specific treatment and follow-up strategies for individual patients.


Introduction
Hepatocellular carcinoma (HCC) is the most common primary tumor of the liver and a leading cause of cancer death worldwide [1]. Within the USA, an estimated 42,230 new cases of HCC and 30,230 deaths were projected for 2021 [2]. Despite recent advances in therapeutic intervention, such as liver transplantation, surgical resection, locoregional therapies, and chemotherapy, the recurrence and overall survival rates remain poor [3]. Patients with localized HCC usually have 5-year OS rates of 30%, whereas the rate is less than 5% for patients with distant metastasis [4]. Etiologic factors, including underlying liver disease, as well as stage at presentation vary greatly between patients. In addition, intratumoral heterogeneity limits the ability to predict outcomes and to develop individualized therapeutic strategies for patients [5].
Over the last decade, numerous attempts have been made to find biomarkers that can detect HCC in early stages, help predict disease-free survival (DFS) and overall survival (OS), and establish guidelines for long-term prognosis of HCC [6]. Traditional serum markers, particularly alpha-fetoprotein (AFP) and AFP mRNA, have been found to be prognostic [7]. However, they rely on significant tumor burden and often have poor sensitivity and specificity in relation to the cutoff value used; taking this into consideration, their usefulness is often questionable [8].
Recent years have shown a rapid development of predictive biomarkers with advances in the understanding of tumor biology and the use of data mining through bioinformatics. A large series of studies has described the role of tissue and serum markers, oncogenes, tumor suppressor genes, and microRNAs in HCC prognosis [9][10][11]. However, the majority of scoring systems that have been developed are often complicated and impractical in the real-world setting [11,12].
The aim of the current study was to create an easy-to-calculate gene-based risk score using machine learning to predict outcomes in patients with hepatocellular carcinoma using the Cancer Genome Atlas (TCGA) public database. Here, we found that the developed risk score was able to stratify patients into different risk groups for shorter DFS and OS.
In general, such models should not be viewed as replacements for good clinical judgment but as additional instruments to assist clinicians in counseling and choosing individualized treatment strategies for every patient.

Methods
RNA-Seq and corresponding clinical data for liver hepatocellular carcinoma (LIHC) were obtained from the TCGA database [13]. A list of 363 samples was obtained. We used GEPIA, a web server for cancer and normal gene expression profiling, to determine the top 75 genes with the highest impact on DFS [14]. cBioPortal was used to extract mRNA gene expression and sociodemographic data as well as clinical characteristics [15]. Clinicopathological data of the study population were limited to age, sex, ethnicity, tumor stage, and histologic grade.
We next performed ANN analysis to determine the relative weight of the chosen genes and their impact on DFS. For this purpose, a 10-fold cross-validation methodology was used, in which the whole dataset was randomly divided and 90% of the patients were selected for the training step and 10% were selected for the final testing. The final model was the one that maximized the correct classification of patients by DFS outcomes. The importance of independent predictors represented a measure of how much the predicted values changed with variations of the independent variables. Genes with a normalized importance >50% were used for subsequent CTA. CTA did not require assumptions on the distribution of variables or linearity of the data and could handle highly skewed or multimodal continuous variables [16]. The output of CTA provided cutoff values for the top three genes predicting DFS. We then used a simple scoring system (0 or 1 point) to give points for each gene based on the individual gene cutoff levels that were derived through CTA. Patients were then grouped based on their total scores (0-3). DFS and OS were obtained by Kaplan-Meier survival analyses (log-rank test). Furthermore, we examined the association of the new scoring system and multiple other variables with DFS and OS using Cox proportional hazards regression analysis.
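The 10-fold partitioning described above can be illustrated with a minimal sketch. It assumes standard k-fold cross-validation, in which each fold serves once as the roughly 10% test set; the source does not name the software, random seed, or exact splitting routine, so the function name `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Assumption: plain shuffled k-fold; the study's actual procedure may differ.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that the first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# With 363 patients, each test fold holds 36 or 37 patients (about 10%).
folds = list(k_fold_indices(363, k=10))
```

Every patient appears in exactly one test fold, so each model is evaluated on patients it did not see during training.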

Neural Network Analysis and Classification Tree Analysis.
A total of 363 patients with biopsy-proven HCC were derived from the TCGA-LIHC database. The 75 most significant genes that predict DFS along with normalized mRNA expression levels were derived from GEPIA and cBioPortal. ANN identified 15 genes with normalized importance >50% (Figure 1). We next used these 15 genes to perform CTA. Here, we identified Nuclear Envelope Membrane Protein (NRM (Ensembl: ENSG00000137404)), Stromal Antigen 3 (STAG3 (Ensembl: ENSG00000066923)), and Small Nucleolar RNA Host Gene 20 (SNHG20 (Ensembl: ENSG00000234912)) as the strongest independent predictors of DFS. Detailed CTA along with node cutoff values for each gene can be obtained from Figure 2.
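The >50% filter can be sketched as follows. This assumes that normalized importance is each predictor's raw importance divided by the largest raw importance, expressed as a percentage, which is a common convention in neural network software but is not stated in the source. The gene labels and raw values below are invented placeholders, not study data.

```python
def normalized_importance(raw_importance):
    """Scale raw importances so that the most important predictor equals 100%."""
    top = max(raw_importance.values())
    if top <= 0:
        raise ValueError("at least one predictor must have positive importance")
    return {gene: 100.0 * value / top for gene, value in raw_importance.items()}


def select_genes(raw_importance, threshold=50.0):
    """Return genes whose normalized importance is strictly above the threshold."""
    norm = normalized_importance(raw_importance)
    return sorted(gene for gene, pct in norm.items() if pct > threshold)


# Hypothetical raw importances for four placeholder genes.
raw = {"GENE_A": 8.0, "GENE_B": 5.0, "GENE_C": 4.0, "GENE_D": 2.0}
selected = select_genes(raw)  # GENE_C sits at exactly 50% and is excluded
```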

Score Development.
Following the initial steps of ANN and CTA, we performed survival analysis for each gene. CTA provided node cutoff values, and based on these numbers, we divided the population into groups below and above the cutoff (Figure 2). Survival analysis showed that patients with mRNA expression levels for NRM and SNHG20 above the CTA cutoff had significantly worse DFS and OS (p < 0.01), whereas patients with STAG3 mRNA levels above the cutoff had significantly better DFS and OS (p < 0.01) (A and C in Figures 3(a) and 3(b)).
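How a classification tree arrives at a node cutoff for a continuous expression value can be illustrated with a single-split search. This sketch assumes a CART-style criterion (minimizing weighted Gini impurity over candidate midpoints) and a binary outcome label; the source does not specify which tree algorithm or splitting criterion was used, and the toy values are invented.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_cutoff(values, labels):
    """Return (cutoff, weighted_impurity) for the best single split.

    Candidate cutoffs are midpoints between consecutive distinct sorted values.
    Returns (None, inf) if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_impurity, best_cut = float("inf"), None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_impurity, best_cut = weighted, (lo + hi) / 2.0
    return best_cut, best_impurity


# Toy data: low expression without the event (0), high expression with the event (1).
cutoff, impurity = best_cutoff([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0, 0, 0, 1, 1, 1])
```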
Based on the prediction of survival for each gene, we developed a simple risk score. Patients received 1 point each for NRM and SNHG20 mRNA levels above the respective cutoff (prediction of worse survival) and 1 point for STAG3 mRNA levels below the cutoff (prediction of worse survival). A simplified table with scores for individual gene levels can be obtained from Figure 2. Patients were then grouped based on their overall score into 0-3 points.
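The scoring rule can be written as a short function. The cutoff values themselves are reported only in Figure 2, so the `Cutoffs` container and the numbers used in the example are hypothetical placeholders; how a value exactly equal to a cutoff is scored is also an assumption (here it receives no point).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    """CTA-derived mRNA expression cutoffs (placeholder container)."""
    nrm: float
    stag3: float
    snhg20: float


def risk_score(nrm, stag3, snhg20, cutoffs):
    """Return the 0-3 point score from tumor mRNA expression levels."""
    score = 0
    score += 1 if nrm > cutoffs.nrm else 0        # high NRM: worse survival
    score += 1 if snhg20 > cutoffs.snhg20 else 0  # high SNHG20: worse survival
    score += 1 if stag3 < cutoffs.stag3 else 0    # low STAG3: worse survival
    return score


# Hypothetical cutoffs, NOT the values from Figure 2.
example_cutoffs = Cutoffs(nrm=1.0, stag3=1.0, snhg20=1.0)
high_risk = risk_score(nrm=2.0, stag3=0.5, snhg20=2.0, cutoffs=example_cutoffs)
low_risk = risk_score(nrm=0.5, stag3=2.0, snhg20=0.5, cutoffs=example_cutoffs)
```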

Patient Demographics and Kaplan-Meier Survival Analysis.
Patient demographics for the entire cohort and each scoring group can be obtained from Table 1. Among the 363 patients, the majority were male (n = 244, 67.2%), with a mean age of 60 ± 13 years. The most common characteristics were White race (n = 177, 48.8%), Stage 1 disease (n = 167, 48.7%), alcohol as the underlying risk factor (n = 108, 29.8%), and histologic grade 2 (n = 169, 47.2%). The distribution for each score was 20.1% (0 points), 30.9% (1 point), 28.4% (2 points), and 20.7% (3 points). There were no significant differences in baseline demographic parameters among the groups. The Kaplan-Meier survival analysis showed that the developed scoring system could stratify patients for DFS (0 points, median DFS: not reached; 1 point, median DFS: 29.7 months; 2 points, median DFS: 16.1 months; 3 points, median DFS: 11.7 months, p < 0.01). Similarly, the same score was able to stratify patients for decreased OS. This information can be derived from Figures 3(a) and 3(b) (overall score comparison (A), NRM (B), STAG3 (C), and SNHG20 (D)).
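To make the terms "median DFS" and "median not reached" concrete, the following is a minimal sketch of the Kaplan-Meier product-limit estimator. It is a plain-Python illustration on invented follow-up times, not the software or data used in the study; the function names are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    events[i] is 1 if the event (recurrence or death) was observed at times[i]
    and 0 if the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients who share this follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


def median_survival(curve):
    """First time at which survival drops to 0.5 or below; None if not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Invented follow-up times in months.
curve = kaplan_meier([5, 8, 12, 12, 20, 30], [1, 0, 1, 1, 0, 1])
median = median_survival(curve)
not_reached = median_survival(kaplan_meier([5, 10, 15, 20], [1, 0, 0, 0]))
```

A group whose estimated survival never falls to 50% during follow-up, as reported for the 0-point group, has no defined median, which this sketch represents by returning `None`.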

Predictors of DFS and OS.
We next performed Cox regression analysis and found that on univariate analysis, patients with higher scores had significantly worse DFS (

Discussion
The significant increase in mortality rates from primary hepatobiliary cancers, particularly over the past decade, has coincided with a rapidly growing interest in effective biomarker-driven approaches to determine prognosis and risk of death in patients undergoing treatment [17].
Estimating the individual patient's risk of recurrence or death following tissue diagnosis is helpful for physicians and patients. With an estimate of long-term prognosis, physicians can better tailor follow-up, and patients have the opportunity to make decisions regarding treatment options and future care. It is therefore of high importance to develop readily available diagnostic tools to predict DFS and OS in patients diagnosed with hepatic malignancies.
In the current study, we used Neural Network and Decision Tree Analysis to create a genetic signature score to aid in prediction of DFS and OS in patients with HCC. Using the above techniques, we found that the tumor expression levels of STAG3, SNHG20, and NRM significantly differed among patients. With the help of CTA, we transformed the gene expression levels into a scoring system (0-3 points) that adequately stratified patients by risk for shorter DFS and OS. The calculated scoring system remained a significant predictor for shorter DFS and OS following multivariable Cox regression adjustment.
Given the vast differences among patients and the inherent molecular heterogeneity of the disease and cancer genetics, personalized medicine in cancer can be particularly effective [18]. Recent studies have shown the use of cancer genomic analysis to discover biomarkers for drug sensitivity, drug resistance, and predictors of outcomes along with establishing personalized oncology by targeting HER2-positive patients in breast cancer [19,20]. It is worth noting that several prior studies have evaluated the importance of the genes that we used in developing this score [21][22][23][24][25]. STAG3 is a subunit of the cohesin complex that regulates the cohesion of sister chromatids during cell division. It has been found to be important in DNA repair and meiosis and to act as a tumor suppressor gene. The loss of STAG3 has been associated with increased metastasis and drug resistance in melanoma [25,26]. Similarly, NRM has been shown to play critical roles in chromatin organization, gene regulation, and signal transduction. NRM serves as a scaffold for numerous transcription factors and as a regulator of transcription and cell division. Its presence and prognostic value in cancers have been less frequently evaluated; however, some studies suggest that the upregulation of NRM leads to decreased apoptosis along with enhanced cell migration and advanced cancer stage [23,24]. Lastly, SNHG20 has been shown to directly predict poor prognosis in HCC patients. High SNHG20 expression can be detected within the tumor but not the healthy background. The SNHG20/EZH2/E-cadherin pathway was also identified as the potential mechanism in promoting tumor progression and epithelial-mesenchymal transition [22].
As with all retrospective studies, there are several limitations associated with this analysis. First and foremost, TCGA cohort analysis provides data from untreated tumors. As a result, any genomic change that happens due to intervention is unaccounted for from a genetic standpoint. In addition, only limited clinicopathologic data are available, with a lack of information on tumor size, lymphovascular invasion, resection margin, etc. Within the USA, studies have shown that Asians have the highest incidence of HCC, followed by Blacks, Hispanics, and non-Hispanic Whites. The TCGA-LIHC database underrepresents Black patients. This makes the findings of the study less generalizable, and they will therefore need to be confirmed in a cohort that is more representative of the current HCC population within the USA [27]. Furthermore, the calculated risk score was created using Neural Network analysis with an intrinsic training and testing cohort. An extended retrospective study to validate the score is currently underway at our institution.

Conclusion
The current study used individual patient tumor genomic data to develop a three-gene predictive score to stratify patients by their risk for shorter DFS and OS. This study serves to deepen our understanding of how a patient's individual genetic profile can be utilized to better understand their prognosis and consequently improve and individualize their treatment.

Data Availability
The results published or shown here are in whole or part based upon data generated by TCGA Research Network: https://www.cancer.gov/tcga.

Disclosure
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflicts of Interest
The authors declare that they have no conflicts of interest.