Based on Integrated Bioinformatics Analysis Identification of Biomarkers in Hepatocellular Carcinoma Patients from Different Regions

Accumulating statistics show that liver cancer is the second leading cause of cancer-related death worldwide, and about 80% of cases are hepatocellular carcinoma (HCC). Because the molecular mechanism underlying HCC pathology is not yet fully understood, identifying reliable predictive biomarkers is a practical way to improve patients' outcomes. Principal component analysis (PCA) showed that, in the 1557 samples from the Gene Expression Omnibus (GEO), the grouped data came from distinct populations, and the mean tumor purity of tumor tissues, estimated with the estimate package in R, was 0.765. After integrating the differentially expressed genes (DEGs), we obtained 266 genes. A protein-protein interaction (PPI) network built from these DEGs contained 240 nodes and 1747 edges. FOXM1 was the core gene in module 1 and was highly associated with the FOXM1 transcription factor network pathway, while FTCD was the core gene in module 2 and was enriched in the metabolism of amino acids and derivatives. The expression levels of hub genes were consistent with The Cancer Genome Atlas (TCGA) database. There were also correlations among the top ten genes in the up- and downregulated DEGs. Finally, Kaplan–Meier curves and receiver operating characteristic (ROC) curves were plotted for the top five genes in the PPI network. All of them except CDKN3 were closely associated with overall survival. In this study, we identified potential biomarkers and the biological processes they are involved in, which may provide new ideas for clinical diagnosis and treatment.


Introduction
Liver cancer is highly fatal and is the second leading cause of cancer-related death worldwide [1,2]. Globally, approximately 80% of liver cancers are estimated to be HCC [3], which makes it a public health challenge that needs widespread attention. HCC is a multigene disease caused by the interaction of multiple cancer-promoting and cancer-suppressing genes with the microenvironment, and its molecular mechanism is still unclear. Thus, the identification of new potential therapeutic targets is urgently needed.
In recent years, despite advances in our knowledge of the genetic factors involved, death rates have continued to rise rapidly [4,5]. If HCC patients can be diagnosed early, the survival rate may be greatly improved by liver resection [6,7]. However, because most patients with HCC are diagnosed late, their physical condition is often not good enough to withstand the risk of surgery [8,9]. Worse still, the survival rate of patients with advanced HCC is further reduced by widespread resistance to chemotherapy. For example, sorafenib, a multikinase inhibitor, has long been widely used to treat patients with advanced HCC [10,11], but patients invariably develop sorafenib resistance, and the drug provides only a limited survival benefit [12]. As a result, new diagnostic and prognostic markers for HCC are badly needed; they might facilitate early diagnosis and guide treatment decisions to improve patients' survival and quality of life.
In this study, we downloaded the expression matrices of six datasets from the Gene Expression Omnibus (GEO) database, comprising 630 adjacent normal and 927 tumor tissues. PCA, tumor purity evaluation, and differentially expressed gene (DEG) analysis were performed in R. A total of 266 DEGs were obtained, consisting of 81 upregulated and 185 downregulated genes. FunRich was used for all enrichment analyses, and Cytoscape was used to build the network diagram. We found that upregulated genes were closely related to the mitotic cell cycle, whereas downregulated genes were enriched in the lipid and lipoprotein derivative pathway. To further explore the role of these DEGs, we divided the PPI network into several independent modules. FOXM1 and FTCD were the core genes of the two highest-scoring modules, respectively. The former is enriched in the FOXM1 transcription factor network pathway, while the latter is mainly enriched in the metabolism of amino acids and derivatives pathway. Finally, The Cancer Genome Atlas (TCGA) data were used to test our results and to predict overall survival in relation to the five hub genes in the PPI network. We found correlations among the hub genes, which might reveal potential signaling pathways in HCC, and four of the five hub genes were associated with poor overall survival of HCC patients. Recognizing biomarkers that play a key role in HCC progression can provide new insights into the development, prognosis, and treatment of HCC.

Tumor Purity Estimation and Differential Expression Gene Analysis.
All the gene expression profile data originated from the GEO database (https://www.ncbi.nlm.nih.gov/geo). The GEO database holds a large number of liver cancer mRNA microarray datasets, and the included datasets had to meet the following conditions: (1) the microarray data were available; (2) they contained at least 100 samples; and (3) they included both tumor and adjacent normal tissues. Therefore, we selected the following datasets: GSE25097, GSE36376, GSE45436, GSE54236, GSE64041, and GSE112790. The GEOquery package in R/Bioconductor (version 3.6.1, https://www.r-project.org) was used to download the gene expression and probe annotation information for the selected datasets. Then, the estimate package was used to estimate tumor purity, and the limma package was used for data normalization and differential expression analysis. Genes with FDR < 0.05 and |log2 FC| ≥ 1 were considered significant DEGs and were visualized with the ggplot2 package.
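The significance rule above (FDR < 0.05 and |log2 FC| ≥ 1) can be illustrated with a minimal Python sketch. This is not the limma workflow used in the study; it only shows how the two thresholds combine once per-gene statistics are available. The `GeneStat` record, the `classify_degs` helper, the placeholder gene names, and all numeric values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class GeneStat:
    gene: str
    log2_fc: float  # log2 fold change, tumor vs. adjacent normal
    fdr: float      # multiple-testing-adjusted P value


def classify_degs(stats, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Split genes into up- and downregulated DEGs using both thresholds."""
    up, down = [], []
    for s in stats:
        if s.fdr < fdr_cutoff and abs(s.log2_fc) >= lfc_cutoff:
            (up if s.log2_fc > 0 else down).append(s.gene)
    return up, down


# Hypothetical values, not taken from the study.
example = [
    GeneStat("GENE_A", 2.3, 0.001),
    GeneStat("GENE_B", -1.8, 0.004),
    GeneStat("GENE_C", 0.6, 0.0001),  # significant, but effect too small
    GeneStat("GENE_D", -3.0, 0.20),   # large effect, but not significant
    GeneStat("GENE_E", 1.0, 0.049),   # sits exactly on the fold-change cutoff
]
up_genes, down_genes = classify_degs(example)
print(up_genes, down_genes)
```

A gene must pass both filters: GENE_C fails on effect size and GENE_D on FDR, so neither is counted as a DEG.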

Enrichment Analysis.
We divided the DEGs into two categories, upregulated and downregulated, and ranked them in descending order of absolute value. Since the six datasets were not from the same platform, we used the RRA package to integrate the DEGs. FunRich (version 3.1.3, http://www.funrich.org), a stand-alone software tool, was our primary tool for functional enrichment analysis [13]. In the present study, FunRich was used to analyze biological process, biological pathway, cellular component, and molecular function.
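To give a feel for what integrating ranked gene lists from several datasets involves, the sketch below aggregates lists by mean normalized rank. This is a deliberately simpler stand-in and is not the RRA algorithm itself, which scores genes with order statistics. The function name, the rule that a missing gene receives the worst normalized rank (1.0), and the placeholder gene lists are all assumptions made for illustration.

```python
def mean_rank_aggregate(ranked_lists):
    """Aggregate ranked gene lists by mean normalized rank (lower is better).

    A gene absent from a list is assigned the worst normalized rank, 1.0.
    This is a simplified stand-in, not the RRA method.
    """
    genes = set().union(*ranked_lists)
    scores = {}
    for gene in genes:
        total = 0.0
        for lst in ranked_lists:
            if gene in lst:
                total += (lst.index(gene) + 1) / len(lst)
            else:
                total += 1.0
        scores[gene] = total / len(ranked_lists)
    # Sort by score, breaking ties alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical ranked lists from three datasets (best gene first).
dataset_a = ["G1", "G2", "G3", "G4"]
dataset_b = ["G2", "G1", "G4", "G3"]
dataset_c = ["G1", "G3", "G2"]  # G4 was not measured here
aggregated = mean_rank_aggregate([dataset_a, dataset_b, dataset_c])
for gene, score in aggregated:
    print(gene, round(score, 3))
```

Normalizing each rank by the list length is what makes lists of different sizes, such as those from different microarray platforms, comparable.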

PPI Network and Module Analysis.
The Search Tool for the Retrieval of Interacting Genes database (STRING, https://string-db.org) provides information on protein interactions, drawn mainly from structural predictions and literature reports [14]. A combined score ≥ 0.4 was used as the cutoff value, and the filtered node information was saved locally for subsequent visualization. Then, we used Cytoscape (version 3.6.0, https://cytoscape.org) to build the protein-protein interaction (PPI) network, and the Cytoscape plugin Molecular Complex Detection (MCODE) was applied to detect notable modules in this network [15]. Network modules, as one of the characteristics of protein networks, may have specific biological significance. The default advanced option parameters in MCODE (degree cutoff = 2, node score cutoff = 0.2, and k-core = 2) already met our requirements, so we did not modify them. Moreover, modules with score ≥ 5 were used for further pathway enrichment analysis, which can help to explore the potential biological functions of DEGs.
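The filtering step described above can be sketched in a few lines of Python: keep interactions whose combined score reaches the cutoff, then rank proteins by how many partners they retain. The study did this with STRING and Cytoscape; the helper names, the edge tuple layout, the 0 to 1 score scale, and the toy edges below are assumptions for illustration. If an exported file reports scores on a 0 to 1000 scale, the equivalent cutoff would be 400.

```python
from collections import defaultdict


def build_network(edges, min_score=0.4):
    """Build an undirected adjacency map from (a, b, combined_score) tuples.

    Scores are assumed to lie in [0, 1]; edges below min_score are dropped.
    """
    neighbors = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def rank_by_degree(neighbors):
    """Return (node, degree) pairs, most connected first."""
    return sorted(
        ((node, len(adj)) for node, adj in neighbors.items()),
        key=lambda kv: (-kv[1], kv[0]),
    )


# Hypothetical interactions, not taken from the study.
toy_edges = [
    ("P1", "P2", 0.90),
    ("P1", "P3", 0.70),
    ("P1", "P4", 0.45),
    ("P2", "P3", 0.50),
    ("P4", "P5", 0.20),  # below the cutoff, so P5 drops out entirely
]
network = build_network(toy_edges)
degree_ranking = rank_by_degree(network)
print(degree_ranking)
```

Ranking by degree is only one notion of a hub; MCODE additionally weights nodes by local neighborhood density, which this sketch does not attempt to reproduce.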

Analysis for Expression Level and Correlation of the Hub Genes.
The Gene Expression Profiling Interactive Analysis (GEPIA, http://gepia.cancer-pku.cn) is an online tool that supports analyses including gene expression analysis and correlation analysis [16]. GEPIA uses data from TCGA and the Genotype-Tissue Expression project (GTEx, http://commonfund.nih.gov/GTEx/) that have been processed with a standard pipeline. Drawing on this large body of data, we used GEPIA to compare the expression of hub genes in liver hepatocellular carcinoma (LIHC) tissues and normal tissues and visualized the results as boxplots.
GEPIA offers three correlation coefficients (Pearson, Spearman, and Kendall), and any expression dataset from TCGA and/or GTEx can be used to examine the relationship between two genes.
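As a rough illustration of how two of these coefficients differ, the following Python sketch computes Pearson and Spearman correlation from scratch. It is not GEPIA's implementation; the function names and the expression values are hypothetical. Spearman is obtained by applying Pearson to ranks, with tied values sharing their average rank.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression values for one rising and one falling gene.
gene_up = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_down = [10.0, 8.0, 7.0, 3.0, 1.0]
r_pearson = pearson(gene_up, gene_down)
r_spearman = spearman(gene_up, gene_down)
print(round(r_pearson, 4), round(r_spearman, 4))
```

Because the second series falls monotonically but not linearly, Spearman reaches its extreme value while Pearson stays slightly short of it, which is the practical reason to check more than one coefficient.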

Overall Survival Analysis and ROC Curve Analysis of Hub Genes.
Kaplan-Meier plotter (KM plotter, http://kmplot.com/analysis/) is an openly accessible database and the largest dataset covering breast, ovarian, lung, and gastric cancer [17]. This database is rich in gene expression data and overall survival information from TCGA, which we used to draw survival curves with hazard ratios, 95% confidence intervals, and log-rank P values. Receiver operating characteristic (ROC) curve analysis was employed to verify the diagnostic performance of hub genes, with 3 years set as the prediction time. Multivariate Cox proportional hazards regression analysis was performed on the hub genes. The risk score for predicting overall survival was calculated as $\text{risk score} = \sum_{i=1}^{n} (\text{coef}_i \times \text{Expr}_i)$, where $\text{coef}_i$ is the regression coefficient of gene $i$, $\text{Expr}_i$ is its expression level, and $n$ is the number of genes in the model. Then, according to the mean risk score, samples were divided into low- and high-risk groups. Finally, survival analysis and ROC curve analysis of the risk score were performed using the same method as described above.
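The risk score described above is a coefficient-weighted sum of expression values, followed by a split at the mean. A minimal Python sketch of that arithmetic follows; the coefficients, gene names, sample names, and expression values are hypothetical, and the choice to place a sample whose score equals the mean in the low-risk group is an assumption, since the text does not say how such ties are handled.

```python
def risk_score(coefs, expr):
    """Sum of coefficient * expression over the genes in the model."""
    return sum(coefs[gene] * expr[gene] for gene in coefs)


def split_by_mean(scores):
    """Label each sample 'high' if its score exceeds the mean, else 'low'."""
    mean = sum(scores.values()) / len(scores)
    groups = {
        sample: ("high" if value > mean else "low")
        for sample, value in scores.items()
    }
    return mean, groups


# Hypothetical Cox coefficients and expression levels.
coefs = {"GENE_A": 0.5, "GENE_B": -0.25}
samples = {
    "S1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "S2": {"GENE_A": 6.0, "GENE_B": 2.0},
    "S3": {"GENE_A": 4.0, "GENE_B": 8.0},
}
scores = {name: risk_score(coefs, expr) for name, expr in samples.items()}
mean_score, risk_groups = split_by_mean(scores)
print(scores, round(mean_score, 4), risk_groups)
```

A positive coefficient raises the score as expression rises, while a negative one lowers it, so a gene with a protective association pulls samples toward the low-risk group.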

Tumor Purity Estimation of Tumor Tissue in Datasets and Identification of DEGs.
A total of 630 normal and 927 tumor samples were selected in this study (Table 1). PCA showed that the normal control group and the tumor group could be clearly separated in each of the six datasets (Figures 1(a)-1(f)).
Then, we calculated the tumor purity of the 927 liver tumor tissues with the estimate algorithm. As shown in Figures 1(g) and 1(h), tumor purity ranged from 0.179 to 0.979, and 55.8% of the tumor samples had a purity above the mean value of 0.765. After differential expression analysis of each dataset, 81 upregulated and 185 downregulated genes were finally identified by RRA (Supplementary Table S1). Compared with adjacent normal tissues, the expression of each of these genes in tumor tissues was upregulated or downregulated at least 2-fold (Figures 2(a)-2(f)). We sorted the upregulated and downregulated genes in ascending order of FDR and created a heatmap of the top 20 genes (Figure 2(g)).

Enrichment Analysis of DEGs.
DEGs from the six independent datasets were integrated and imported into FunRich for enrichment analysis. The biological processes for upregulated genes were mainly associated with spindle assembly and cell cycle (Figure 3(a) and Supplementary Table S2), while the molecular functions involved protein binding and kinase binding (Figure 3(b) and Supplementary Table S2). In addition, the key gene among the upregulated genes was FOXM1, which was highly correlated with transcription factor activity and the FOXM1 transcription factor network (Supplementary Table S2). The functional enrichment of downregulated genes was associated with metabolism, catalytic activity, and energy pathways (Figures 3(c) and 3(d)). The pivotal gene in this group, FTCD, was enriched in methyltransferase activity, energy pathways, and histidine catabolism.
Through the biological pathway enrichment analysis, we found that upregulated genes were closely related to mitotic cell cycle, DNA replication, mitotic G1-G1/S phases, and the ATM pathway (Figure 4(a) and Supplementary Table S2). In contrast, downregulated genes were enriched in the lipid and lipoprotein derivative pathway (Figure 4(b) and Supplementary Table S2).

PPI Network Establishment and Pathway Analysis of Network Module.
After submitting the gene list to the STRING website, we obtained a network of 240 nodes and 1747 edges (combined score ≥ 0.4). Then, the network diagram was drawn in Cytoscape based on the STRING data (Figure 5(a) and Supplementary Table S3). Interestingly, most of the nodes with higher connectivity were upregulated genes, which placed them closer to the center of the circle. Four modules with score ≥ 5 were detected via MCODE (Figures 5(b)-5(e)). As shown in Figure 5(b), the hub nodes in module 1 were FOXM1, CCNA2, AURKA, CDKN3, and CDC20. As shown in Figure 5(c), FTCD, HRG, AGXT, C8A, and TAT were the nodes with the highest connectivity in module 2. Among the four modules, only module 3 contained protein nodes expressed by both up- and downregulated genes, including AFP, PLG, CRP, FABP1, and SPP1. The results of the pathway analysis for the two modules with the highest scores are shown in Figures 5(f) and 5(g) and Supplementary Table S4. The most significant pathways in module 1 and module 2 were mitotic cell cycle and phase 1-functionalization of compounds, respectively.

Expression Level and Correlation of Hub Genes.
A total of 419 samples were selected for gene expression level analysis, including 369 tumor tissues and 50 normal liver tissues. As shown in Figures 6(a)-6(e), the expression of the five hub genes in cancer tissues was significantly higher than that in normal ones. Moreover, correlation analysis showed that the increased expression of these genes was strongly correlated with the decreased expression of FTCD (Figures 6(f)-6(j)); a heatmap of the correlation coefficients between hub genes is shown in Figure 6(k).

Discussion
In the present study, we detected a total of 266 DEGs. FOXM1 was the most connected upregulated gene in the PPI network, with 44 edges. Increasing evidence suggests that FOXM1 is elevated in many tumors, such as intrahepatic cholangiocarcinoma, oesophageal adenocarcinoma, gastric cancer, cervical cancer, and HCC [18][19][20][21][22][23][24]. Since FOXM1 can promote the proliferation and invasion of cancer cells, high FOXM1 expression may give rise to poor prognosis and a low survival rate [22][24][25][26][27][28][29][30][31]. Moreover, studies of colorectal and gastric cancer found that FOXM1 contributes to tumor angiogenesis [23,32]. A large body of previous evidence indicates that FOXM1 directly or indirectly affects the occurrence and development of HCC, which is in line with our results [27][28][29][33][34][35][36][37][38][39]. In addition, in an in vivo study of HCC, tumor growth in FOXM1-deficient mice was completely arrested, suggesting that FOXM1 has the potential to become an independent biomarker of HCC [31]. KIF4A has been confirmed to be a downstream target of FOXM1, and the expression level of KIF4A is positively correlated with that of FOXM1. Overexpression of both genes leads to excessive cell proliferation and promotes tumor development [22]. We likewise observed a significant increase in the expression of KIF4A in our results (P = 3.18e-07), consistent with previous studies [22]. MicroRNAs play an active role in HCC as well. For instance, the expression of microRNA-135a transcribed by FOXM1 can affect the prognosis and survival rate of patients with HCC [40]. We did not build a competing endogenous RNA (ceRNA) network in this project to find potential downstream noncoding RNAs for FOXM1, which should be investigated in future studies.
FTCD was the core gene in module 2 and showed a certain correlation with FOXM1. Additionally, FTCD has been found useful for distinguishing early HCC from benign tumors, suggesting that it might be a potential marker for early diagnosis of HCC [41]. Our enrichment results for FTCD were consistent with prior reports that decreased FTCD expression impedes the histidine degradation pathway, which leads to poor performance of methotrexate [42]. Therefore, we infer that a lack of response to methotrexate in patients with HCC may be associated with abnormal expression of FTCD. In addition, the available evidence indicates that autoimmune hepatitis has a 0.6% to 0.7% probability of inducing HCC [43,44]. Interestingly, increased expression of FTCD can prevent the progression of autoimmune hepatitis by reducing the number of circulating autoreactive T cells [45]. Thus, a low level of FTCD might contribute to a high incidence of HCC and serve as a useful biomarker for primary HCC. Further research will be needed to clarify the role of FTCD in tumorigenesis.
In addition to FOXM1, CCNA2, AURKA, CDKN3, and CDC20 ranked at the forefront of the PPI network. CCNA2, a core cell cycle regulator, plays a critical role and is highly expressed from S phase to early mitosis [46,47]. It has been reported that high expression of CCNA2 might induce hepatocyte nodular proliferation [48]. In addition, exosomal circRNAs secreted from liver adipocytes promoted tumor growth by controlling the miR-34a level and activating the USP7/CCNA2 signaling pathway [49]. These findings indicate that CCNA2 directly or indirectly influences the development of HCC and is a vital part of that process. High expression of AURKA, which is implicated in the regulation of the cell cycle and cell division [53], has previously been detected in different cancer types as well [50][51][52]. HCC is no exception: AURKA is also described as an oncoprotein and therapeutic target in this disease. Microarray analysis indicated that AURKA phosphorylated and stabilized hepatoma upregulated protein [54]. Moreover, one study revealed that AURKA can, in turn, give rise to malignant phenotypes of HCC by regulating HIF-1α through activation of the AKT and p38-MAPK signaling pathways [55]. As a result, we conjecture that it may function as a cancer-promoting gene. Excessive replication of the centrosome is considered a common feature of almost all human cancers, and CDKN3 has the ability to prevent this abnormality [57]. Further research indicated that CDKN3 seemed to play a tumor-suppressive role through the CDC2 signaling pathway [58]. However, a bioinformatics analysis for identification of molecular target genes in HCC revealed that the relative expression levels of CDKN3 were significantly upregulated in tumor tissues, which agrees with our results [59]. Therefore, we conjecture that this result may be due to positive feedback regulation in the tumor microenvironment, which requires subsequent experiments to verify.
Since the expression level of CDKN3 was not associated with the prognosis of patients with HCC in our study, CDKN3 cannot be counted as a candidate biomarker. CDC20 plays a pivotal role in the regulation of chromosome segregation and the timely end of mitosis [60]. Abnormal CDC20 expression has been detected in most human cancers [61][62][63], and CDC20 knockdown caused mitotic arrest that efficiently killed slippage-prone and apoptosis-resistant cancer cells [64], supporting an oncogenic role of CDC20. In conclusion, combining literature reports with our findings, FOXM1, FTCD, CCNA2, AURKA, and CDC20 are promising biomarkers of HCC, while whether CDKN3 can be regarded as a biomarker for HCC requires further study. Prior to our research, there were already some reports on bioinformatics analysis of critical genes in HCC [65][66][67][68]. Nevertheless, our research has several advantages: first, the datasets we selected contain as many as 1557 samples and cover multiple regions; second, we performed PCA on these datasets first to confirm that tumor and normal tissues formed two distinct populations; third, tumor purity estimation allowed us to show readers the quality of the tumor samples in this study; fourth, we established a multivariate Cox proportional hazards regression model based on the hub genes to improve on the accuracy of prediction by a single prognostic factor; finally, we performed a correlation analysis between the upregulated and downregulated genes, which may reveal potential signal transduction pathways in HCC. Our research also has the following shortcomings: first, our results were based only on analysis of GEO and TCGA data and have not been verified experimentally; second, there may be differences in gene expression among different types of tumors, which should be addressed in future experiments.

Conclusion
In conclusion, this integrated bioinformatics analysis was based on 1557 tumor and adjacent normal tissues from the GEO database. Tumor and normal samples came from distinct populations, and more than half of the tumor samples (55.8%) had a purity above 0.765. A total of 266 genes were identified as candidate HCC biomarkers, which were enriched in signaling pathways closely related to cell proliferation and metabolic function. FOXM1, CCNA2, AURKA, CDKN3, and CDC20 were at the core of these genes, which may open up new horizons for the diagnosis, prognosis, and treatment of HCC patients.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
No conflicts of interest exist in the submission of this manuscript.