Microarray Data Mining and Preliminary Bioinformatics Analysis of Hepatitis D Virus-Associated Hepatocellular Carcinoma

Several studies have demonstrated that chronic hepatitis delta virus (HDV) infection is associated with a worsening of hepatitis B virus (HBV) infection and increased risk of hepatocellular carcinoma (HCC). However, there is limited data on the role of HDV in the oncogenesis of HCC. This study is aimed at assessing the potential mechanisms of HDV-associated hepatocarcinogenesis, especially to screen and identify key genes and pathways possibly involved in the pathogenesis of HCC. We selected three microarray datasets: GSE55092 contains 39 cancer specimens and 81 paracancer specimens from 11 HBV-associated HCC patients, GSE98383 contains 11 cancer specimens and 24 paracancer specimens from 5 HDV-associated HCC patients, and 371 HCC patients with the RNA-sequencing data combined with their clinical data from the Cancer Genome Atlas (TCGA). Afterwards, 948 differentially expressed genes (DEGs) closely related to HDV-associated HCC were obtained using the R package and filtering with a Venn diagram. We then performed gene ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis to determine the biological processes (BP), cellular component (CC), molecular function (MF), and KEGG signaling pathways most enriched for DEGs. Additionally, we performed Weighted Gene Coexpression Network Analysis (WGCNA) and protein-to-protein interaction (PPI) network construction with 948 DEGs, from which one module was identified by WGCNA and three modules were identified by the PPI network. Subsequently, we validated the expression of 52 hub genes from the PPI network with an independent set of HCC dataset stored in the Gene Expression Profiling Interactive Analysis (GEPIA) database. Finally, seven potential key genes were identified by intersecting with key modules from WGCNA, including 3 reported genes, namely, CDCA5, CENPH, and MCM7, and 4 novel genes, namely, CDC6, CDC45, CDCA8, and MCM4, which are associated with nucleoplasm, cell cycle, DNA replication, and mitotic cell cycle. The CDCA8 and stage of HCC were the independent factors associated with overall survival of HDV-associated HCC. All the related findings of these genes can help gain a better understanding of the role of HDV in the underlying mechanism of HCC carcinogenesis.


Introduction
Hepatocellular carcinoma (HCC) is the sixth most commonly diagnosed cancer and the fourth leading cause of cancerrelated mortality globally [1,2] and the second in China [3]. More than 80% of all HCC causes are associated with infection with hepatitis B virus (HBV), hepatitis C virus (HCV), and hepatitis delta virus (HDV) [4]. Approximately 292 million people worldwide are chronically infected with HBV, which causes liver injury that can progress to cirrhosis, resulting in HCC, liver failure, and eventually death [5,6]. The HDV is known as the satellite of HBV and affects 15-20 million people in the world [7]. Concurrent HBV and HDV infections significantly increase both the incidence and mortality of HCC among patients with chronic hepatitis B (CHB) [8]. HDV is a kind of defective RNA virus which uses HBV envelope protein for successful spread in hepatocytes [9]. Although the risk of HCC is thought to be higher when a HBV-infected patient is superinfected with HDV, the molecular mechanisms of carcinogenesis remain unclear [10]. Chronic hepatitis D (CHD) is more severe than any other type of hepatitis, but its carcinogenesis mechanism remains poorly understood. Additionally, it has been found that the intrahepatic HBV DNA levels in patients with HDV-associated HCC and non-HCC cirrhosis are significantly reduced [11]. This phenomenon of HDVmediated inhibition of HBV replication suggests that the effects of HDV are mediated through a unique molecular mechanism. While HBV and HCV are both included in International Agency for Research on Cancer (IARC) group 1 (high evidence of carcinogenicity to humans), HDV was assigned several years ago to group 3 (not sufficient evidence of carcinogenicity) [12], due to inadequacy to support the contribution of HDV to HBV-induced HCC. Due to the dependency of HDV on HBV, there are still controversies regarding the increased risk of HCC development in chronically HDVinfected patients [4], and the available data on the particular mechanism by which HDV contributes to HCC are sparse. With the development of genomics and other "-omics" disciplines, substantial omics data from HCC specimens have been accumulated [13]. Therefore, researchers have taken advantage of gene ontology (GO) and signal pathway analysis tools to identify and characterize many differentially expressed genes (DEGs).
In the present study, the GSE55092 and GSE98383 mRNA expression profile datasets were retrieved from Gene Expression Omnibus (GEO) online [14,15]. And the RNAsequencing data of 371 HCC patients and their clinical data were from the Cancer Genome Atlas (TCGA). We performed DEG analysis between cancerous specimens of HBVassociated HCC and HDV-associated HCC patients and their respective paracancerous specimens using Linear Models for Microarray Data (LIMMA) [16] and other packages implemented in R/Bioconductor [17]. We aimed to investigate potential HDV carcinogenesis mechanisms by PPI network, Gene Expression Profiling Interactive Analysis (GEPIA), and Weighted Gene Coexpression Network Analysis (WGCNA), particularly to screen and identify key genes and pathways to determine their possible role in HCC pathogenesis, and to help determine their mechanism of inhibition of HBV replication and their effects in diagnosis, treatment and prognosis.

Methods and Materials
2.1. Acquisition of Data and Preprocessing. The RNAsequencing data and clinical data of 371 HCC patients were downloaded from TCGA (http://cancergenome.nih.gov/. http://cancergenome.nih.gov/). The expression of genes was represented by fragments per kilobase of exon per million fragments mapped (FPKM). Microarray data were available at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO, https://www.ncbi .nlm.nih.gov/geo/) database. The inclusion criteria for selection GEO datasets in this study were as follows: (1) hepatocellular carcinoma containing cancer and paracancer tissue; (2) HBsAg positive at least 6 months with serum HBV DNA positive; (3) anti-HDAg positive with serum HDV RNA positive (applies only to filter HDV-associated HCC dataset); and (4) sample size more than 10 with data unbiased. The exclusion criteria were as follows: (1) the second liver cancer, (2) HCV-related HCC, (3) dataset biased, and (4) no paracancer tissue and carcinoma tissue present at the same patients. We searched the GEO database using "Hepatitis D Virus," "Hepatocellular Carcinoma," and "Homo sapiens" as keywords, and there were only 2 results, between which only GSE98383 met our criteria for HDV-associated HCC. Similarly, we searched the keywords "Hepatitis B Virus," "Hepatocellular Carcinoma," and "Homo sapiens" and obtained 581 search results. The flow chart of screening HBV-associated HCC could be seen in Figure 1, and only GSE55092 was suited for an in-depth study.
Microarray data GSE55092 [18] and GSE98383 [11] were generated using the GPL570 Affymetrix HG-U133 Plus 2.0 Array platform, and we performed the analysis on the data from whole liver tissue. GSE55092 contains 39 cancer specimens and 81 paracancer specimens from 11 HBV-associated HCC patients (average age = 57:7 ± 7:7 years; 10 male patients and 1 female patient). GSE98383 contains 11 cancer specimens and 24 paracancer specimens from 5 HDVassociated HCC patients (average age = 57 ± 3 years, 5 male patients). The comparison of baseline characteristics between patients with HBV-associated HCC and HDV-associated HCC is shown in Table 1.
We downloaded the GSE55092 and GSE98383 datasets, normalized them by the Affy package of the R Bioconductor, and then converted the gene expression profile at the probe level into gene symbol level and removed the duplicated symbols. When numerous probes were mapped to one gene, the average value was defined as the expression level of that gene. According to the description of the uploader, an unsupervised multidimensional scaling (MDS) of all specimens obtained from HBV-associated HCC patients showed a clear separation between two distinct clusters that corresponded to cancer areas and paracancerous areas. A similar separation between two clusters that corresponded to cancerous areas and paracancerous areas was observed for the specimens from HDV-associated HCC patients. Therefore, all the data are available for the identification of DEGs.
2.2. Identification of DEGs. DEG analysis refers to the identification of genes with significantly different expression levels between two groups through multiple analysis modes [19]. We performed differential expression analysis using Bayes t -statistics from the LIMMA implemented in the R Bioconductor and corrected p values for multiple testing using the Benjamini-Hochberg method [20]. We identified the DEGs in primary cancerous specimens of HCC patients by comparing them with paracancerous normal specimens of the same HCC patients. The absolute value of log2-fold change (FC) was set to ≥1.0, and a p value of <0.01 was used as the    3 BioMed Research International significance criteria, and genes that met these criteria were used for further analysis. The final step is to use Venn diagrams to identify DEGs closely related to HDV-associated HCC [21].

GO Enrichment and Pathway
Analysis. DAVID program (https://david.ncifcrf.gov/) [22] is a bioinformatics resource comprising a biological database and a set of annotation and analytical tool that intuitively integrates functional genomic annotations with graphics. In this study, the DEGs were submitted to DAVID for GO [23] and KEGG [24] enrichment analyses, which included biological process (BP), cellular component (CC), molecular function (MF), and related biological metabolic pathways. A p value < 0.05 was considered statistically significant.

Weighted Gene Coexpression Network Construction and
Module-Clinical Characteristic Associations. We compared DEGs with the genes of 371 HCC patients downloaded from TCGA website. Expression data of the matched genes from TCGA were applied to find gene modules significantly associated with clinical trait (stage of HCC) by WGCNA [25]. In this analysis mode, the soft thresholding acts as the lowest power based on the criterion of approximate scale-free topology [26], and analogous modules would be merged together due to the similarity. The heatmap of module-clinical characteristic relationship could reveal modules significantly associated with clinical characteristics.

Construction of PPI Networks and Module
Analysis. The visual protein-to-protein interaction (PPI) networks of DEGs were predicted using the web resource Search Tool for the Retrieval of Interacting Genes (STRING) [27] to search the STRING database (https://string-db.org/), which contains over 5,000 organisms as well as their over 24.6 million proteins and over 2 billion interactions. We correlated the target DEGs with the STRING database and set the significant threshold to the highest confidence level (interaction score ≥ 0:900). Subsequently, we used Cytoscape [28], a software to construct PPI networks and analyze highly interconnected modules using the built-in Molecular Complex Detection (MCODE) clustering algorithm. The parameters were set by default except for the K-core value which was equal to 8.
2.6. Validation of Module Gene. First, we uploaded the potential genes identified by PPI-network analysis to GEPIA [29] (http://gepia.cancer-pku.cn/, an online server containing TCGA/GTEx datasets) to validate the gene expression consistency between the microarray datasets (GSE55092 and GSE98383) and TCGA/GTEx HCC dataset, setting the threshold parameters as follows: |log2FC| cutoff ≥ 1:0 and p value cutoff < 0.01. Afterwards, we performed the overall survival analysis as follows: we divided the patients in TCGA/G-TEx dataset into high and low expression groups with the TPM (transcripts per kilobase million) midvalue as a breakpoint; a log-rank test was used to determine significance at p < 0:05. Finally, we took the intersection of related genes to the OS of HCC by GEPIA and genes contained in the hub module obtained by WGCNA and got the key genes.

Univariate and Multivariate Cox Proportional Hazards
Model Analysis. We performed univariate and multivariate Cox proportional hazards model analysis in patients from TCGA, to find independent factors associated with the overall survival of HCC.

Result
3.1. Comparison of Baseline Characteristics. GSE98383 was the only dataset of HDV-associated HCC that met our criteria, as to the screen of HBV-associated HCC dataset, overall 581 datasets were enrolled and screened for eligibility and 3 datasets met inclusion criteria. Of these 3 datasets, the data processing platform of GSE55092 was GPL570, which was the same with GSE98383, while GSE22058 and GSE94660 were different, so we took GSE55092 to stand for HBVassociated HCC for study. The number of datasets and reasons for exclusion are shown in Figure 1. The HDVassociated HCC patients in GSE98383 had higher levels of alanine aminotransferase (ALT), aspartate aminotransferase (AST), total bilirubin (TB), prothrombin time (PT), platelet counts (PLT), and inflammatory activity grade than the HBV-associated HCC patients in GSE55092, which is consistent with the characteristics of CHD of the most severe hepatitis. On the other hand, there was no significant difference in sex, age, tumor grade, and tumor size between the two groups.

Identification of DEGs.
We identified DEGs from the microarray GSE55092 and GSE98383 datasets using the LIMMA package, setting |log2-FC| to ≥1.0 and adjusted p value to <0.01 as the criteria. By comparing the cancerous and paracancerous specimens in GSE55092 up to 1,375, DEGs were identified, comprising 518 upregulated and 857 downregulated genes (Table S1). A similar comparison in GSE98383 contains 1,605 DEGs, including 592 upregulated and 1,013 downregulated genes (Table S1). Volcano plots of the GSE55092 and GSE98383 microarrays are shown in Figures 2(a) and 2(b), respectively. We used Venn diagrams to determine the DEGs closely related to HDV-associated HCC ( Figure 2(c)), and 948 DEGs (373 upregulated and 582 downregulated) were identified and selected for further analysis (Table S2).

GO Enrichment and Pathway Analysis.
In order to further screen HDV-associated HCC potential target genes among these DEGs, GO and pathway analysis were performed on HDV-associated HCC DEGs using p value < 0.05 as the threshold (Figure 2(d)). The results are presented in Figure 2(d) and show the TOP-7 GO terms (BP, CC, and MF) and KEGG pathway terms significantly enriched in the DEGs. Additionally, the TOP-5 annotations of the DEGs are shown in Table 2. In the BP series, the DEGs were mostly enriched for genes related to the cellular response to chemical stimulus and organic substance, defense response, and cell adhesion. In the CC series, the DEGs were primarily enriched for genes involved in cell surface, side of membrane, membrane-bounded vesicle, external side of plasma membrane, and proteinaceous extracellular matrix. In the MF 4 BioMed Research International  BioMed Research International series, the DEGs were predominantly enriched for genes associated with glycoprotein binding, molecular function regulator, cytokine binding, heparin binding, and glycosaminoglycan binding. In the KEGG pathway series, the enrichment for DEGs was mainly in the chemokine signaling pathway, Staphylococcus aureus infection, transcriptional misregulation in cancer, focal adhesion, and leukocyte transendothelial migration.

Construction of PPI Networks and Gene Module Analysis.
We uploaded the DEGs onto the STRING online tool and analyzed them with the Cytoscape software. We then selected 353 nodes and 939 edges with the highest confidence (scores > 0:900) to construct the PPI networks (Figure 4(a)). Then, the MCODE plugin filtered out three important gene modules. Genes within module 1 and module  Table 5).
3.6. Validation of Module Genes. We compared the gene expression changes between the three module genes of HDV-associated HCC (a total of 52 genes) and the validation HCC (TCGA/GTEx) datasets in the GEPIA website to verify whether their expression in both datasets is consistent. We noticed that CCL21 and FPR1 in module 1 as well as XAF1 in module 2 were downregulated in tumors compared to normal specimens in the HCC datasets, which is in accordance with the HDV-associated HCC specimens. However, IFI6, IFI27, and ISG15 in module 2, which were downregulated in cancerous specimens compared to paracancerous normal specimens, were conversely expressed in HCC datasets. The genes in module 3, including CDC6, CDC45, CDCA5, CDCA8, CENPH, MCM4, MCM7, and TCEB1, were upregulated in tumor compared to normal specimens in the HCC datasets, which is consistent with the HDV-associated HCC patients. All the box plots comparing gene expression are shown in Figure 5(a). For further verification, 11 genes whose gene expression trends are consistent with HCC datasets were selected and used to conduct overall survival analysis. In Figure 5(b), there were the upregulated genes (CDC6, CDC45, CDCA5, CDCA8, CENPH, MCM4, MCM7, and TCEB1) which are associated with a lower survival rate in the high expression group than in the low expression group. The reason for excluding CCL21, FPR1, IFI6, IFI27, ISG15, and XAF1 is that their p value did not comply with the standards or the opposite gene expression. All the 8 retained potential genes are related to the nucleoplasm, and most of them are related to the mitotic cell cycle process, cell cycle, and DNA replication ( Figure 5(c)). Taken together, the intersection of the above validated 8 genes and 320 genes in MEturquoise module by WGCNA, the potential 7 key genes (CDC6, CDC45, CDCA5, CDCA8, CENPH, MCM4, and MCM7) were found ( Figure 5(d)).

Identification of Independent Factors of Overall Survival of HCC.
We performed the univariate analysis in 371 patients from TCGA and found that CDCA8, stage, CDC45, CDC6, CDCA5, MCM4, CENPH, MCM7, sex, and age were significantly associated with OS of HCC. The multivariate Cox    BioMed Research International proportional hazards model showed that CDCA8 and stage of HCC were independent factors of OS of HCC (Table 6).

Discussion
As the virus causing the most severe type of hepatitis, HDV affects 15-20 million people worldwide, but its specific pathogenic mechanism remains unclear. Accordingly, we undertook to find potential genes and pathways involved in the pathogenesis of this disease through text mining to help explain the underlying carcinogenic mechanism of HDV as well as the HBV inhibitory mechanism.
In this study, we compared cancerous and paracancerous specimens of patients suffering from HBV or HDVassociated HCC with the aim of identifying potential genes closely related to HDV-associated HCC. The study identified 373 upregulated DEGs and 582 downregulated DEGs. These DEGs were subjected to GO and KEGG annotation and enrichment analyses. In addition, we constructed PPI networks and sorted out 353 nodes with 939 edges, from which the three most significant modules were selected and 52 central nodes/genes were selected for validation using the GEPIA database. In the module confirmed by the WGCNA that was significantly associated with clinical features including event, OS, stage, grade, and age, only CDC6, CDC45, CDCA5, CDCA8, CENPH, MCM4, and MCM7 were consistent with genes identified by the PPI network, which were found to be significantly correlated with nucleoplasm, cell cycle, DNA replication, and mitotic cell cycle. Univariate and multivariate Cox proportional hazards model analysis showed the stage of HCC and CDCA8 are the independent factors associated with the OS of HCC.
Cell division cycle 6 (CDC6) is thought to be significantly associated with pancreatic cancer and colorectal cancer (CRC) [30]. It has a pivotal role in regulating the process of DNA replication as well as tumorigenesis; its overexpression could interfere with the expression of tumor suppressor genes (INK4/ARF) through the mechanism of epigenetic modification [31]. During the S phase of DNA replication in eukaryotic cells, cell division cycle 45 (CDC45) is an essential component of CMG (CDC45-MCM-GINS) helicase. CDC45 acts as a hubprotein, significantly upregulated in cancerous tissues from CRC and non-small-cell lung cancer (NSCLC) patients, and promotes tumor progression [32]. It    Type I interferon signaling pathway  12 2:85E − 27  BST2, IFI27, IFI6, IFIT1, IRF7, IRF9, ISG15, MX1, MX2 [33]. In addition, MCM4 is also a member of significant predictors of poor prognosis in CRC patients [34]. Moreover, MCM7 is also a promising  Figure 5: Validation of the expression data and survival curve of hub genes from the 3 modules using the GEPIA database and functional and pathway enrichment analysis. (a) The box plots that verify whether the expression of these DEGs is consistent with that in the LIHC datasets. Among the downregulated genes, CCL21 and FPR1 (module 1) as well as XAF1 (module 2) are consistent with the HCC datasets, while IFI6, IFI27, and ISG15 (module 2) are not. All 8 upregulated genes (module 3) are consistent with the LIHC datasets. (b) The genes are associated with overall survival whose expression is consistent with that in the LIHC datasets. biomarker for early diagnosis of gastric cancer and even a predictor of meningioma recurrence after surgery [35,36]. Another study found that high expression of MCM7 may be involved in the progression of HCC through the MCM7cyclin D1 pathway, and MCM7 may serve as a prognostic marker for patients with HCC [37]. The cell division cycleassociated protein 5 (CDCA5) is a member of the CDCA family that comprises CDCA1-8. It plays a crucial role as a regulator of sister-chromatid cohesion and separation during cell division, and its upregulation has been shown to be associated with various cancers, including breast cancer, esophageal squamous cell carcinoma, CRC, and HCC [38,39]. Also, a study found that the activation of the ERK and AKT pathways may be involved in the regulation of HCC cell proliferation by CDCA45 [39]. CDCA8 is an essential regulator of mitosis, and its overexpression is significantly associated with bladder cancer, cutaneous melanoma, and the progression and prognosis of breast cancer [40,41]. Centromere protein H (CENPH) is considered to be an essential part of the active centromere complex, and its overexpression is highly related to poor prognosis in renal cell carcinoma, nasopharyngeal carcinoma, CRC, and HCC [42,43]. Another study found that CENPH may promote the proliferation of HCC through the mitochondrial apoptosis pathway [43].
Most of the abovementioned genes are significantly associated with the cell cycle and DNA replication, and their overexpression may affect the replication of HBV DNA, thereby promoting the unique phenomenon of HDV inhibits HBV replication. All the findings related to these genes may also help us understand the mechanisms of HDV-induced liver injury and HCC. In the future, we will further verify those genes' function by performing animals, cells, and clinical trials.

Conclusions
In summary, 7 potential candidate genes closely related to HDV-associated HCC were identified in this study. Through comparative analysis with previous studies, these genes were found to be involved in many pathways related to tumorigenesis which provided clues to elucidate the mechanism of hepatitis D virus-induced HCC or its unique molecular mechanism in the inhibition of HBV replication. However, additional in-depth molecular biological research on these candidate genes closely related to HDV-associated HCC is necessary to confirm their functions.