Identification of a Two-lncRNA Signature with Prognostic and Diagnostic Value for Hepatocellular Carcinoma

Background Accumulating evidence has revealed the important role of long noncoding RNAs (lncRNA) in tumorigenesis and progression of hepatocellular carcinoma (HCC). This study aimed to identify potential lncRNAs that can serve as diagnostic and prognostic signatures for HCC. Methods Expression profiling analysis was performed to identify differentially expressed lncRNAs (DElncRNA) between HCC and matched normal samples by integrating two independent microarray datasets. Functional Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were explored by Gene Set Variation Analysis. The prognostic and diagnostic models were developed based on two DElncRNAs. Real-time PCR was used to quantify the relative expressions of candidate lncRNAs. Results Two robust DElncRNAs were identified and verified by quantitative PCR between HCC and matched normal samples. Function enrichment analysis revealed that they were associated with the wound healing process. The two lncRNAs were subsequently used to construct a prognostic risk model for HCC. Patients with high-risk scores estimated by the model showed a shorter survival time than low-risk patients (P < 0.001). Besides, the two lncRNA-based HCC diagnostic models exhibited good performance in discriminating HCC from normal samples on both training and test sets. The values of area under the curve (AUC) for early (I–II) and late (III–IV) HCC detection were 0.88 and 0.93, respectively. Conclusions The two wound healing-related DElncRNAs showed robust performance for HCC prognostic prediction and detection, implying their potential role as diagnostic and prognostic markers for HCC.


Introduction
Hepatocellular carcinoma (HCC) is one of the major human malignancies and ranks as the second leading cause of cancer-related deaths worldwide [1]. Over the past few decades, although progress in surgical techniques and systemic treatments have improved the overall prognosis of patients with liver cancer, the incidence rate continues to increase and clinical outcome is extremely dismal [2]. It is estimated that half of the worldwide new liver cancer cases and deaths occurred in China every year [3,4]. Multiple factors including hepatitis C (HCV) and B virus (HBV) infection, toxins (aflatoxin B1), chronic alcohol abuse, and nonalcoholic fatty liver disease have been identified as risk causes of HCC [5,6]. e classic prognostic model, tumornode-metastasis (TNM) staging, as well as molecular biomarkers such as serum alpha-fetoprotein (AFP) levels, have been utilized for the diagnosis of HCC and for predicting its prognostic outcome to therapy [7]. However, due to its high heterogeneity, the prognosis and response to chemotherapy differ largely among patients with a similar stage [8].
erefore, the search for effective biomarkers for early diagnosis and prognosis is indispensable.
lncRNAs are a class of noncoding RNAs with a length longer than 200 nucleotides (nt) that show little or no protein-coding capacity [9,10]. In recent years, emerging studies have indicated the crucial roles of lncRNAs in pathological processes or tumorigenesis and metastasis of various cancers [11][12][13]. Some dysregulated lncRNAs can activate key oncogenic networks, such as the activation of epithelial-to-mesenchymal transition, Wnt, and TGF-β signaling pathways, promoting cancer metastasis [14]. For example, silencing of MALAT-1 by siRNA decreases cell proliferation and inhibits HCC migration and invasion [15]. Increased expression of GAS5 downregulates the vimentin and upregulates the E-cadherin level in HCC cells [16]. Given their important function in tumor development, lncRNAs as prognostic signatures in multiple cancers including renal cancer, glioblastoma, colorectal cancer, lymphoma, and others have been explored in previous studies [17][18][19][20][21]. To date, a variety of lncRNAs or lncRNA groups for HCC prognosis prediction have been uncovered by multiple studies [21][22][23][24][25][26]. While to the best of our knowledge, many lncRNA signatures were developed only from a single source of the dataset and few of them were both for HCC prediction of diagnosis and prognosis. So it is necessary to identify new lncRNA signatures for the early diagnosis and prognosis of HCC. e public databases of e Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) containing abundant cancer data allow researchers to exploit the potential tumorigenesis mechanisms, novel molecular subtypes, and prognostic biomarkers for various cancers [27][28][29][30][31]. In the current study, two novel lncRNAs were identified from two independent datasets based on our reannotation method. en prognostic risk models, as well as prediction diagnostic models were constructed with the two lncRNAs expression profile. We further assessed the relationship of 2-lncRNAs signature with clinicopathological parameters of the stage, grade, age, and molecular subtypes. In addition, functional enrichment analysis revealed the potential roles of 2 lncRANs in biological processes and pathways. ese results indicated better performance of our 2-lncRNA signature for both the diagnostic and prognostic prediction of HCC.

Sample Collection and Data Preparation.
We collected 14 HCC and adjacent normal FFPE samples from the Department of Oncology I of Seventh People's Hospital of Shanghai University of Traditional Chinese Medicine. e study was approved by the hospital ethics committee (No. KY2020106) and patients have been informed of the purpose of collected samples. All information regarding patient privacy has been anonymized. e dataset consisted of 11 male and 3 female patients, with a median age of 57 (Table 1).
e expression values of lncRNA based on the Reads Per Kilobases per Million (RPKM) mapped reads as well as the phenotypes and prognostic data were downloaded from the TCGA Liver Hepatocellular Carcinoma (LIHC) database (https://portal.gdc.cancer.gHCC/). Two independent expression data, GSE70880 [32] and GSE101728 [33], were downloaded from the GEO database (https://www.ncbi.nlm. nih.gov/geo/), which included 7 and 16 normal and tumor paired samples, respectively. We selected these two datasets in our study because both of them were lncRNA + mRNA microarray platforms. Additionally, the GSE144269 dataset [34] was retrieved as an independent testing set, which consisted of 140 RNA-seq samples from 70 HCC tumors and matched normal tissues. e information on the platforms and numbers of samples of each dataset were provided in Supplemental Table 1.

Re-Annotation of lncRNAs.
e pipeline for re-annotation of lncRNA microarray probes relies on a custom Perl script [35] and the sequence alignment program of BLAT [36]. Probe sequences provided by Agilent (https://www. agilent.com/) were BLATed against the latest noncoding RNA sequences from Ensembl (https://asia.ensembl.org/ Homo_sapiens/Info/Index) database with the parameter "-t � DNA -q � DNA -maxGap � 0 -out � blast8 -fastMap." e noncoding gene alignment reports were then parsed and only the best probe-lncRNA alignment entries were kept. In order to check the consistency and detect misalignments of probes, the re-annotated probes were then compared with the probe annotation details obtained from GPL21047.

Differential Expression Analysis.
Two independent datasets of GSE70880 and GSE101728 downloaded from the GEO database were used for the identification of differentially expressed lncRNAs (DElncRNAs) between the tumor and paired normal samples. To define the DElncRNAs, the | log2 fold change| > 1 and P < 0.05 were set as the threshold by using the limma R package [37]. e similarity between tumor and normal samples was evaluated by using an affinity propagation (AP) clustering algorithm [38].

Prognostic Risk Model Construction.
Univariate Cox analysis using R survival package (https://CRAN.R-project. org/package�survival) was conducted for the identification of prognosis-associated DElncRNAs with overall survival (OS) in the TCGA HCC training set (Table 2). DElncRNAs with log-rank test P < 0.05 was considered as seed lncRNAs for Cox LASSO [39] regression with 10-fold cross-validation (CV). By 1000 iterations of Cox LASSO regression with 10fold CV using the R package glmnet (with the default parameter), the shrunken lncRNAs with nonzero coefficients were selected as potential prognostic lncRNAs. Multivariate Cox regression with 1000 times bootstrapping was further performed to calculate the contribution of the 2 DElncRNAs in survival predictions. e risk score of each patient was then evaluated based on the 2 DElncRNAs expression profile which is described as follows: where Expr and β are the expression value and multivariate regression coefficient of 2 DElncRNAs, respectively. e DElncRNAs with β > 0 were defined as high-risk signatures while those with β < 0 were defined as protective genes. e patients were divided into high (Risk-H) and low (Risk-L) risk groups according to median risk score.  Table 3. e PCR procedure consisted of pre-denaturing 95°C for 2 min, followed by 45 cycles of 95°C for 30 s, 58°C for 30 s, and 72°C for 30 s. e qPCR assay was repeated three times for each gene on each sample, and the corresponding Cq values were obtained separately. e relative expression values (2 −ΔΔCq ) of DYNLL1-AS1 and RP11-116D2.1 for each repeat were calculated according to the Cq values of GAPDH. e average of the three 2 −ΔΔCq values was used as the relative expression values of these two genes.   Primer   Journal of Oncology

Functional Analysis of DElncRNAs.
e co-expressed mRNA with DElncRNAs was identified by performing Pearson's correlation analyses between the expression of lncRNAs and protein-coding genes based on the RNA-seq data from the TCGA HCC cohort. Protein-coding genes with a correlation coefficient >0.5 and a false positive rate <0.05 were considered as the lncRNA-related genes. Functional annotation of the lncRNA-related genes including Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment were conducted by the GeneCodis3 website tool [40] (https://genecodis.cnb.csic.es/). Significantly enriched categories were identified with the threshold of Hyp_c < 0.05 and limited to GO terms in the "Biological Process" (GOTERM-BP-DIRECT) and KEGG pathway categories.

Construction of HCC Diagnostic Model.
Binomial logistic regression was used to develop a diagnostic model for the detection of HCC patients based on the expression profile of 2 DElncRNAs. First, all samples from the TCGA HCC cohort were randomly separated into training and validation sets, and no significant differences were found in baseline characteristics between the two groups. e diagnostic accuracy was quantified by the area under the receiver operating characteristic (ROC) curve (AUC) [41]. Finally, the Younden index identified the optimal sensitivity and specificity. To avoid the bias of randomly grouping in a single time, the bootstrapping method with 100 times resampling was used to further evaluate the accuracy of the training set and validation set.

Statistical Analysis.
All statistical analyses were conducted using R software (version 3.6.1). Paired Student's ttest was used for the comparison of lncRNAs between tumor and paired normal samples. Comparisons for two groups and multiple groups were performed for continuous variables using the Wilcox rank test and Kruskal-Wallis rank test, respectively. Categorical variable independence was established using a χ 2 test. e false discovery rate was calculated using BH method.

Identification of Differentially Expressed lncRNAs.
Our re-annotation method for the noncoding gene expression probes in GPL21047 generated 23111 best-matched lncRNAs of which 79.8% were confirmed by comparing to Agilent official data (https://www.agilent.com/) indicating high reliable expression of lncRNAs ( Figure 1(a)). APClustering analysis showed systematic variations in the expression of lncRNAs and mRNAs between paired HCC and normal samples from GSE70880 and GSE101728 datasets. Samples of the two datasets were self-segregated into the HCC and normal clusters (Figures 1(b) and 1(c)). ese normalized microarray expression data were used for the identification of differentially expressed lncRNAs (DElncRNAs). A total of 2713 (up/down: 1468/1245) and 4389 (up/down: 1622/2767) DElncRNAs were identified in GSE70880 and GSE101728 datasets, respectively. As for both datasets, 412 DElncRNAs (Supplementary Table 2) were shared accounting for 15.3% and 9.38% of their total DElncRNAs (Figure 1(d)). Further analysis of 412 overlapped DElncRNAs revealed that 186 DElncRNAs were significantly upregulated in HCC tissues and 226 DElncRNAs were significantly downregulated in HCC tissues (Figures 1(e) and 1(f)), indicating that the activation as well as suppression of certain biological processes regulated by these DElncRNAs potentially involved in the developing of HCC.

Development of HCC Prognostic Risk Model.
e 363 TCGA HCC patients were randomly separated into training set (n � 181) and testing set (n � 182) ( Table 2). Among 412 DElncRNAs identified in both in GSE70880 and GSE101728 datasets, 45 of them were detected to be also expressed in the TCGA HCC dataset. en LASSO Cox regression method with 10-fold cross-validation was used to construct a multivariate prognosis risk model among the 45 DElncRNAs (Figures 2(a) and 2(b)). As a result, two lncRNAs (DYNLL1-AS1 and RP11-116D2.1) remained for constructing a prognostic signature based on their expression levels and their multivariate Cox coefficients (Table 4). e performance of the 2-lncRNA signature was evaluated by a timedependent ROC curve with the area under the curve (AUC) of 0.72, 0.71, and 0.58 for 1-, 3-, and 5-year overall survival prediction (Figure 2(c)). In the training set, HCC patients were divided into a high-risk group (Risk-H, n � 91) and a low-risk group (Risk-L, n � 90) according to the median risk score. Kaplan-Meier analysis revealed that the Risk-H group had a significantly poorer prognosis than the Risk-L group (P < 0.001, HR � 3.46) (Figure 2(d)). For the testing set, patients were also divided into a Risk-H (n � 92) and Risk-L group (n � 92) by using the same risk score model and cutoff value in the training set. e AUC for predicting 3-year overall survival reached 0.715 and Kaplan-Meier analysis also revealed a differential prognosis between the two groups (P � 0.0012 and HR � 3.03) (Figures 2(e) and 2(f )), which suggested that the 2-lncRNA signature had a good performance in the prediction of 3-year overall survival. Gene set variation analysis (GSVA) for Risk-H and Risk-L groups showed that cancer-related pathways such as cell cycle, P53 signaling pathway, and pathways in cancer had a significantly positive correlation with risk score (Figure 2(g)). While in contrast, metabolic pathways such as histidine metabolism, beta-alanine metabolism, and primary bile acid biosynthesis had significant negative correlations with risk scores (Figure 2(g)).

Functional Analysis of the Two lncRNAs.
e expression profile of the two lncRNAs was then evaluated in tumor and paired normal tissues from TCGA HCC and our inner HCC datasets. e results indicated that DYNLL1-AS1 was significantly upregulated and RP11-116D2.1 was significantly downregulated in the TCGA dataset (Figure 3(a)). We observed similar expression trends for these two genes between 14 HCC and adjacent normal samples, although no significant difference was found in DYNLL1-AS1 (Figure 3(b), Supplementary Table 3). Co-expression analysis identified 881 DYNLL1-AS1-related genes and 458 RP11-116D2.1-related genes, of which 174 protein-coding genes were overlapped (Figure 3(c)). KEGG pathway enrichment analysis suggested that DYNLL1-AS1-related genes are significantly enriched in pathways in cancer, focal adhesion, and the Wnt signaling pathway (Figure 3(d)). For RP11-116D2.1-related genes, they were also involved in pathways in cancer, as well as spliceosome, and RNA transport (Figure 3(e)). e overlapped significantly enriched pathways between them were spliceosome, cell cycle, oocyte meiosis, and lysine degradation (Figure 3(f )) which indicated potential interactions among the two lncRNAs.

Correlation of 2-lncRNA Prognostic Risk Model with
Clinical Characteristics. Further analyses were performed to determine the association of the 2-lncRNA-based risk score with clinical-pathological factors in the TCGA dataset, including pathologic stage, histological grade, and age. Kaplan-Meier analysis showed a significantly different prognosis between Risk-H and Risk-L groups classified by AJCC stage I and Grade 1 and 3-4 (P < 0.05), indicating that the 2-lncRNA-based risk model was an independent prognostic factor of an early stage and grade (Figures 4(a)  and 4(b)). Patient age also showed no significant correlation with risk scores either in Risk-H or Risk-L group (Figure 4(c)). Multivariate Cox regression analysis confirmed that the 2-lncRNA prognostic risk score was the most important hazard factor for HCC (HR � 2.25, 95% CI � 1.23-4.12, P < 0.001) (Figure 4(d)). Considering the infection of the virus has a great impact on HCC, we further compared the patient risk scores in HBV-infected, HCVinfected, and non-infected groups and found that the risk scores of HBV-infected samples were significantly much higher than HCV-infected and non-infected groups (Figure 4(e)). Based on the molecular level survival analysis, the LIHC cohort of the TCGA project was clustered into three molecular subtypes [42], we found that the patients in iCluster (iC) 2 with the best prognosis showed significantly lower risk scores than the patients in other two subtypes, meanwhile, the patients in iCluster 1 with the worst prognosis had the highest risk scores (Figure 4(f )). Similar results were observed among the patients from different immune subtypes. Patients in immune subtype C1 with the worst prognosis had the highest risk scores than other immune subtype patients (Figure 4(g)). e immune subtype C1 was characterized by the immune module of wound healing [43]. Given that high-risk score was mostly distributed in C1, we examined the relation between C1 signatures (wound healing-related signatures including 1 cells, 2 cells, proliferation, and wound healing) and risk score. Besides 1 cells, the other three signatures were all positively correlated with risk scores (P < 0.01, Figure 5).

Development of HCC Diagnostic Model Based on the 2-lncRNA Signature.
e logistic regression method was performed to establish a diagnostic risk score (dRS) model with the 2 lncRNAs. e expression profile of 2 lncRNAs showed that the expression of DYNLL1-AS1 was much lower than RP11-116D2.1 and their expression represented a significantly negative correlation (Figure 6(a)) suggesting a complemental expression pattern of the 2 lncRNAs. e violin plot (Figure 6(b)) showed that the dRS value was significantly upregulated in HCC tissues in the entire patient cohort. We further evaluated the performance of dRS model in discriminating HCC patients from normal controls. e ROC curve for training set and validation set was plotted (Figures 6(c) and 6(d)) with AUC of 0.889 (95%CI:      (Table 5).

Validation of the Diagnostic Model in Independent Testing
Set. e GSE144269 dataset containing 70 HCC tumor tissues and paired normal samples was used as a testing set to evaluate the performance of dRS model. e expression of DYNLL1-AS1 was upregulated in cancer samples, while RP11-116D2.1 showed an opposite trend (Figure 7(a)). Correlation analysis indicated that they were negatively correlated with each other (Figure 7(b)), which was consistent with the results of the TCGA dataset. e risk scores of cancer samples estimated by the model were significantly higher than that of normal samples (Figure 7(c)). ROC curve analysis showed that the model achieved an AUC of 0.87 (95% CI: 0.81-0.93) with the optimal sensitivity and specificity of 77.1% and 84.3%, respectively, as determined by the maximized Youden index (Figure 7(d)).

Discussion
Over the past decades, great efforts have been made in developing signatures for the prognostic prediction of HCC. However, no biomarkers have been shown to effectively predict the survival of HCC patients to date, partly due to its high heterogeneity causes such as virus infection, alcohol consumption, as well as immune disorders [44,45]. Given the high morbidity and mortality of HCC, it is crucial and urgently needed to develop effective biomarkers for the prognosis prediction of HCC. Several studies have shown important implications of molecular biomarkers such as aberrantly expressed genes and abnormal methylation events for outcome prediction and therapy decisions     [17,20,46,47]. Recently, lncRNAs were reported as critical regulators in various diseases including cancers [12,13]. Moreover, their potential utility as prognostic biomarkers for HCC was also demonstrated in a few studies [18,21]. e current study identified a series of differentially expressed lncRNAs by integrating two independent gene expression microarray datasets, among which 412 DElncRNAs were shared by the two datasets. Two lncRNAs were subsequently selected by performing LASSO regression with 10-fold cross-validation (CV). Results of multiple datasets indicated that expressions of DYNLL1-AS1 were significantly upregulated in cancer cells, while it was the opposite for RP11-116D2.1. We also observed a similar tendency in our custom dataset. Notably, no significant difference in RP11-116D2.1 expression was found in our dataset (P � 0.094), which we supposed might be related to several reasons. Firstly, the small sample size could lead to errors when using a two-sample t-test, as only 14 samples were quantified in this study, and the power was only 0.93 for paired t-test, which is slightly lower than 0.95 (n � 14). According to our clinical data ( Table 1), most of the samples were stage II, which leads to a more unbalanced distribution of clinical stages compared with the TCGA dataset. Secondly, the degradation, fragmentation, and chemical modification of nucleic acid frequently occur in FFPE tissues, which will affect the quantification of gene expression    [48]. Nevertheless, our results showed a similar trend of upregulated DYNLL1-AS1 in HCC samples, and the effects caused by FFPE samples can be eliminated by the relative quantification base on the internal reference gene, making this result reliable to some extent. Survival analysis indicated that patients with higher expressions of DYNLL1-AS1 were found to correlate with shorter survival time, while the elevated expressions of RP11-116D2.1 were associated with longer survival time.
e two lncRNAs were then used to construct a prognostic prediction model for HCC. ROC curve analysis suggested that the model was robust for HCC survival prediction. e 2-lncRNA signature showed a higher AUC value than the 5-lncRNA signature [49] and nearly the same AUC as 7-lncRNA signature [50]. Obviously, it is more feasible for researchers to carry out further investigations of less lncRNAs. ese results revealed that the two lncRNAs could serve as promising prognostic factors for HCC.
Previous studies have revealed that HBV infection was a risk factor for HCC prognosis [51]. A significantly higher risk score estimated by this model was also observed in the HBV-infected group than in the non-infected group. Besides, the prognostic model based on lncRNAs developed by Xiwen Liao et al. showed a good performance in HBV-related HCC [52], which was comparable with our model in terms of AUC values. Our findings implied their potential association with HBV infection.  16 Journal of Oncology Functional enrichment analysis indicated that genes correlated with DYNLL1-AS1 and RP11-116D2.1 significantly enriched in the Wnt signaling pathway, chemokine signaling pathway, and VEGF signaling pathway. Interestingly, we observed a significant correlation between the 2-lncRNA signature and wound healing subtype (C1) of HCC. e wound healing response involves several phases, including the formation of a fibrin clot at the wound site and the infiltration of neutrophils in the early stage, where these immune cells can release a plethora of cytokines and chemokines [53]. Another critical phase is epithelial regeneration, which relates to the migration and proliferation of fibroblasts [54]. Additionally, ECM remodeling is another important event in wound healing [55]. In this study, the risk score estimated by the 2-lncRNA signature showed a strong correlation with the enriched score of the ECM receptor pathway. ese findings suggested critical roles of the two lncRNAs in inflammatory pathways, however, more robust experiments are needed to investigate their potential relationship with the inflammatory response. e testing and abdominal ultrasound of serum biomarker alpha-fetoprotein (AFP) were widely recommended for routine surveillance of HCC in high-risk patients (US) according to many HCC guidelines [56], however, it has been excluded from the surveillance and diagnosis criteria in the guidelines published in 2014 [57]. Other serum biomarkers such as AFP-L3, DCP, interleukin-6, interleukin-10, and squamous cell carcinoma antigen were also investigated, while these serum-based tests lack adequate sensitivity and specificity for effective surveillance [58][59][60]. In this study, we further developed a model for HCC detection based on the two lncRNA expression profiles. e performance of the 2-lncRNA diagnostic model achieved AUC values of 0.889 and 0.913 in the training and validation sets, respectively, which was better than the efficiency of serum biomarkers. Moreover, we also obtained an AUC of 0.87 (95%CI: 0.81-0.93) in the independent testing set of GSE144269. For early stage (I-II) HCC, the diagnostic model achieved an AUC of 0.88. Previous studies have demonstrated the favorable efficiency of lncRNAs for HCC detection. Our results showed a comparable performance relative to other promising biomarkers such as DANCR [61], HULC [62], and Linc00152 [63], suggesting that two lncRNAs could be promising candidates in diagnosing ontogenesis of HCC.

Conclusions
In this study, we identified two differentially expressed lncRNAs from multiple datasets verified their expressions in our custom HCC samples. e 2-lncRNA signature showed robust performance for HCC detection and prognosis prediction. However, some limitations of the study need to be considered. For example, further experiments should be carried out to demonstrate their potential roles in HBV infection as well as their relationship with wound healing. eir performance for HCC detection and prognosis prediction needs to be evaluated in more clinical samples. Besides, the exact mechanisms of the two lncRNAs in HCC tumorigenesis and progression are still not well-studied although this study has comprehensively revealed their aberrant expressions between normal and cancer samples.

Data Availability
Data from cancer patients used in this study were collected from the publicly available de-identified hepatocellular carcinoma data set from Gene Expression Omnibus (GEO) and e Cancer Genome Atlas (TCGA). All the intermediate data are supplied in this manuscript and the supplementary materials.

Conflicts of Interest
All authors declare no conflicts of interest.