Identification of Long Noncoding RNA Biomarkers for Hepatocellular Carcinoma Using Single-Sample Networks

Objective Many studies have found that long noncoding RNAs (lncRNAs) are differentially expressed in hepatocellular carcinoma (HCC) and closely associated with the occurrence and prognosis of HCC. Since patients with HCC are usually diagnosed in late stages, more effective biomarkers for early diagnosis and prognostic prediction are in urgent need. Methods The RNA-seq data of liver hepatocellular carcinoma (LIHC) were downloaded from The Cancer Genome Atlas (TCGA). Differentially expressed lncRNAs and mRNAs were obtained using the edgeR package. The single-sample networks of the 371 tumor samples were constructed to identify the candidate lncRNA biomarkers. Univariate Cox regression analysis was performed to further select the potential lncRNA biomarkers. By multivariate Cox regression analysis, a 3-lncRNA-based risk score model was established on the training set. Then, the survival prediction ability of the 3-lncRNA-based risk score model was evaluated on the testing set and the entire set. Function enrichment analyses were performed using Metascape. Results Three lncRNAs (RP11-150O12.3, RP11-187E13.1, and RP13-143G15.4) were identified as the potential lncRNA biomarkers for LIHC. The 3-lncRNA-based risk model had a good survival prediction ability for the patients with LIHC. Multivariate Cox regression analysis proved that the 3-lncRNA-based risk score was an independent predictor for the survival prediction of patients with LIHC. Function enrichment analysis indicated that the three lncRNAs may be associated with LIHC via their involvement in many known cancer-associated biological functions. Conclusion This study could provide novel insights to identify lncRNA biomarkers for LIHC at a molecular network level.


Introduction
Hepatocellular carcinoma, one of the most common cancers worldwide, is the third leading cause of worldwide mortality for various cancers, and its incidence rate per year remains increasing rapidly [1][2][3]. The risk factors for HCC include infection with hepatitis B virus (HBV) or hepatitis C virus (HCV), aflatoxin B1 intake, alcohol consumption, nonalcoholic fatty liver disease, and some hereditary diseases [4][5][6]. Since patients with HCC are usually diagnosed at late stages, when medication is no longer effective, understanding the molecular mechanisms of HCC and identifying biomarkers for early diagnosis and treatment seem to be essential [7,8].
Long noncoding RNAs (lncRNAs) are non-proteincoding transcripts longer than 200 nucleotides. According to the well-known central dogma of molecular biology, genetic information is stored in protein-coding genes [9,10], for which noncoding RNA (ncRNAs) have been considered as "junk genes" or "transcriptional noise" for a long time [11]. However, with the development of both experimental technology and computational methods, an increasing number of lncRNAs have been discovered in human transcriptome. Over the past decades, several researches have shown that lncRNAs are involved in almost the whole life cycle of cells through different mechanisms, and they have played diverse and important roles in many fundamental and critical biological processes, including transcriptional regulation, epigenetic regulation, organ or tissue development, cell differentiation and apoptosis, cell cycle control, metabolic processes, and chromosome dynamics [12][13][14][15]. Recently, several lncRNAs have been demonstrated to be associated with the development and survival in patients with different kinds of cancers, including HCC [16][17][18]. Moreover, many studies have highlighted the molecular mechanism and biological characters of lncRNAs in HCC occurrence and progression, and the result revealed that some lncRNAs can also serve as valuable prognostic predictors for HCC patients [19][20][21]. Despite precision medicine, which uses molecularly targeted therapy against malignant tumors and speeds up progress toward the discovery of novel molecular targets with the diagnostic and prognostic value [22], the management of patients with HCC remains problematic [23,24]. Therefore, there is an urgent requirement to identify many more lncRNA biomarkers for HCC.
One key to achieving personalized medicine is to elucidate the molecular mechanisms of individual specific diseases, which generally result from the dysfunction of individual specific molecular network rather than the malfunction of single molecules. With rapid advances in highthroughput technologies, applying molecular networks to analyze human complex disease is attracting increasing wide attention [25,26]. A molecular network, e.g., a gene regulatory network or a coexpression network, can be generally estimated by correlation coefficients of molecule pairs from expression or sequence data of multiple samples [27,28]. In recent years, based on biological and clinical data, a number of network-based methods were proposed not only to identify disease modules and pathways but also to elucidate molecular mechanisms of disease development at the network level [29][30][31]. Many studies have shown that network-based biomarkers are superior to traditional single-molecule biomarkers for accurately characterizing disease states due to their additional information on interactions and networks [28,30,32,33]. In particular, a single-sample network is considered to be reliable for accurately characterizing the specific disease state of an individual. It can be directly used to identify the biomarkers and further elucidate the molecular mechanisms of a disease for individual patients [29].
In this study, we aimed to identify the lncRNA biomarkers for the patients with LIHC based on the RNA-seq data from TCGA. By constructing the single-sample networks for the 371 tumor patients, we obtained three lncRNA biomarkers associated with overall survival (OS) of LIHC patients. Then, a 3-lncRNA-based risk score model was established on the training set, which could effectively predict the OS of LIHC patients. The independence and the predictive ability for survival prediction of the 3-lncRNA-based risk score were validated on the testing set and the entire set.

Differential Expression Analysis.
EdgeR is a Bioconductor package for differential expression analysis of replicate count data, which was widely used in the differential expression analysis of high-throughput sequencing data. In our study, we extracted the expression profile of mRNAs and lncRNAs from RNA-seq count data, which was normalized using the edgeR package (version 3.22.5). Those mRNAs and lncRNAs with zero expression value in more than 10% samples were discarded. The differentially expressed lncRNAs and mRNAs were calculated by edgeR at a threshold of FDR < 0:05 and | log 2 ðfold changeÞ | >1.
2.3. Construction of the Single-Sample Networks. In our study, the differentially expressed mRNAs and lncRNAs which had the same gene names were removed. As a result, 3329 differentially expressed mRNAs and 956 differentially expressed lncRNAs for each sample were left for further investigation. A single-sample network was constructed based on statistical perturbation analysis of a single test sample against a group of given reference samples, which can accurately characterize the disease state of an individual or a sample. The more detailed description about the single-sample network can be found in Ref. [34]. In our study, we took 50 normal samples as the reference samples, while 371 tumor samples were the test samples. Firstly, based on the gene expression data of the reference samples, a reference correlation network can be constructed by computing the Pearson correlation coefficient (PCC) between lncRNA-lncRNA and lncRNA-mRNA pairs, which was denoted as N r . Then, adding a test sample s to the reference samples, another perturbed correlation network was obtained in the same way, which was denoted as N p . By comparing the difference of the two correlation networks, we can get a single-sample network N ssn for this test sample sðN ssn = jN r − N p jÞ. Finally, 371 single-sample networks were obtained in our study.
For convenience, we transformed each single-sample network to an adjacency matrix Δ D, and the element ΔD i,j represents the ΔPCC of the edge for a pair of molecules in the single-sample network. As the theoretical foundation for this method, if there were obvious differences between the reference samples and the single sample s in terms of the gene expression pattern, adding the tumor sample s to the reference samples would cause significant changes of the PCC on some edges in the perturbed network. We assumed that if a lncRNA might be a key biomarker, the sum of the Δ PCC of the edges linked by the lncRNA would be higher than others. Then, a vector SD was used to represent the sum of ΔPCC of the edges linked by each lncRNA in a singlesample network, which was denoted as where SD i is the sum of ΔPCC of all the edges linked by the ith lncRNA in a single-sample network and can be calculated by Consequently, all the 371 single-sample networks can be represented by a matrix M 956×371 : , where i denotes the ith lncRNA and j denotes the jth tumor sample. Different from the method in Ref. [34], we designed a ranking system to identify the candidate lncRNA biomarker according to the matrix M. Firstly, each column in matrix M was sorted by the value size of SD ij , and the items in the column returned to the corresponding lncRNA in matrix ML. Then, we calculated the frequency of each lncRNA in the top K rows of matrix ML and the top 5% lncRNAs were retained. Finally, in order to improve the effectiveness of our method, we took the intersection of the candidate lncRNAs under different K (K = 5,10,20,30) as the final candidate lncRNA biomarkers by SSN. The flowchart of the ranking system is shown in Figure 1.

Survival Analysis.
The differentially expressed lncRNAs, of which the expression level was zero that exceeded 10% of all subjects, were removed from the prognostic analysis. In the meantime, clinicopathological features, including survival information, were also achieved from TCGA. Samples without sufficient clinical data were omitted. Finally, the prognostic analysis included a total of 307 tumor samples with the expression data from 956 lncRNAs. The univariate and multivariate Cox regression analyses were conducted to evaluate the association between the variables and the OS of LIHC patients on the training set, the testing set, and the entire set, and statistical significance was assessed using p < 0:05. The Kaplan-Meier plots were employed to observe the prognosis effect. The ROC curve analysis was used to evaluate the prognosis performance for 5-year survival rate. All analyses were performed on the R3.4.3 framework.
2.5. Construction of the Risk Score Model. The 307 patients were randomly divided into two datasets. 154 patients were used as the training set to build a risk score model based on the candidate lncRNA biomarkers, while the other 153 patients were used as the testing set to evaluate the predictive ability of the risk score model. Firstly, the multivariate Cox regression analysis was used to evaluate the association between the expression of the lncRNA biomarkers and the patient's OS in the training set. Then, a risk score model was built by linear combination of the expression levels of the lncRNA biomarkers weighted by their Cox regression coefficients. The calculation formula of the risk score model was as follows: where i represents the total number of the lncRNA biomarkers, Exp indicates the expression profiles of lncRNA, and Coe is the estimated regression coefficient of the ith lncRNA derived from the multivariate Cox analysis. Based on this formula, each patient with LIHC had an RS and the median RS was treated as a cut-off point to divide the patients into the low-risk group and the high-risk group.
2.6. Pathway and Functional Enrichment Analysis. Metascape (http://metascape.org/gp/index.html) is a free gene annotation and analysis resource that helps biologists make sense of one or multiple gene lists [35]. To understand the function of the identified lncRNA biomarkers in our study, the Gene Ontology (Go) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was performed by Metascape. A value of p < 0:01 was set as the cut-off for significance.

Differentially Expressed lncRNAs and mRNAs in LIHC.
We downloaded the expression data of 60483 RNA, from which 14822 lncRNAs and 19814 mRNAs were obtained by the gene type data reported by the genome GRCh38.p13. By calculating with edgeR, a total of 956 lncRNAs and 3329 mRNAs were considered to be differentially expressed between tumor samples and normal samples. The volcano plot of the differentially expressed mRNAs and lncRNAs is shown in Figure 2.

The Candidate lncRNAs Identified by Single-Sample
Networks. In our study, the top 48 lncRNAs (5% of 956 lncRNAs) that had the highest frequency of occurrence in all the 371 tumor samples remained as the candidate lncRNA biomarkers by the ranking system. Furthermore, by taking the intersection of the candidate lncRNAs identified under different KðK = 5, 10, 20, 30Þ, 27 lncRNAs were obtained as the final candidate lncRNA biomarkers by the ranking system according to 371 single-sample networks. The detailed differential expression information of these lncRNAs is listed in Table S1.  Figure 3. The detailed description of the three lncRNAs is shown in Table 1.   3.5. Evaluation of the Risk Score Model in the Testing Set and the Entire Set. To assess the robustness of the 3-lncRNAbased risk score model in OS prediction for LIHC patients, we further examined it in the testing set and the entire set. The same 3-lncRNA-based risk score model and cut-off point derived from the training set were used to divide the testing set and the entire set into the high-risk group and the low-risk group (n = 77 and 77, n = 154 and 153, respectively). In the testing set, the Kaplan-Meier curve showed that the survival of LIHC patients in the low-risk group exhibited a longer OS as compared to those in the high-risk group ( Figure 5(b), p = 0:0037). The median survival of patients in the high-risk group and the low-risk group was 2.29 years and 6.69 years, respectively. In the entire set, a similar result was shown that patients in the high-risk group exhibited a shorter OS as compared to those in the low-risk group ( Figure 5(c), p = 0:0011). The median survival of patients in the high-risk group and the low-risk group was 3.39 years and 5.52 years, respectively. The AUC of 5-year ROC curve in the testing set and the entire set was 0.704 and 0.664, implying a good prognostic capacity of the 3-lncRNA-based risk score (Figures 5(e) and 5(f)). , which revealed that the 3-lncRNA-based risk score was an independent predictor of OS for LIHC patients. More detailed results are provided in Table 2.

Pathway and Functional Enrichment Analysis.
We were also interested in the molecular mechanisms of the three lncRNAs. Unfortunately, little publication was found on the functional mechanism of these lncRNAs. Finally, we performed Pearson correlation analyses between the three lncRNAs and protein-coding genes based on their expression levels in TCGA LIHC cohort. The proteincoding genes that correlated with at least 1 of the three lncRNAs (Pearson coefficient > 0:45, p < 0:01) were considered to be significantly correlated coexpressed genes of the three lncRNAs, which are listed in Table 3. Then, pathway and process enrichment analyses were carried out with the following ontology sources in Metascape: KEGG pathways  and GO biological processes. All genes in the genome were used as the enrichment background. Terms with p < 0:01, minimum count = 3, and enrichment factor > 1:5 were collected and grouped into clusters based on membership similarities. We found that the coexpressed genes of three lncRNAs were mainly associated with terms related to endothelial cell migration, ncRNA metabolic process, cell cycle arrest, etc. (Figure 6). The retrieved data is also provided in Table S2.

Discussion
HCC is the most common type of liver cancer, accounting for about 80% of the tumours in this organ. Although there are advances in the diagnosis and treatment of HCC, the OS time of patients with HCC remains poor. Recently, molecular biomarkers have been identified throughout the various clinical stages and pathological tissue types, and multiple studies have revealed that differential expression of lncRNAs played critical roles in cancer development, indicating their potential as novel biomarkers for cancer diagnosis and prognosis [36,37]. However, until now, only a few lncRNAs had been experimentally verified for HCC. Thus, it is necessary to identify more potential lncRNA biomarkers for HCC.
In the present study, we analyzed the expression profiles and clinical information of LIHC patients from TCGA database. The differential expressed lncRNAs and mRNAs were screened to construct the single-sample networks for 371 LIHC patients. According to the 371 single-sample networks, 27 candidate lncRNAs were selected by the ranking system designed for identifying the lncRNA biomarkers from the single-sample networks. Furthermore, by using the univariate Cox regression, three lncRNAs (RP11-150O12.3, RP11-187E13.1, and RP13-143G15.4) were selected as the lncRNA biomarkers associated with patients' OS. Next, multivariate Cox regression was performed in the training set to establish a 3-lncRNA-based risk score model based on the expressions and relative contributions in the multivariate Cox regression model. The 3-lncRNA-based risk score had a good prognosis prediction ability for OS of patients with LIHC, which was tested in the testing set and the entire set. Moreover, we performed multivariate Cox regression analyses using the 3-lncRNA-based risk score and other clinical data in the training set, the testing set, and the entire set. The result showed that the 3-lncRNA-based risk score was an independent predictor for the OS of LIHC patients.
The results of function enrichment analysis showed that these KEGG pathways and functional categories of the three lncRNA biomarkers were all closely associated with tumorigenesis. For instance, the transcription factor p53 is a tumor suppressor, activating downstream targets to trigger cell cycle arrest and apoptosis [38], and several lncRNAs participate in the p53 regulatory network and serve as p53 regulators or effectors [39]. Moreover, a recent global transcriptome study identified that sixteen p53 target lncRNAs forming a pathway web constitute tumor suppressor signature with high diagnostic power [40]. Importantly, Jelena et al. revealed that p53 might play multifunctional roles in different stages in HCC [41]. Moreover, the metastatic characteristics of liver cancer are the key factors affecting the survival and prognosis of tumor patients, and the process of cell migration is involved in tumor metastasis [42]. The COP9 signalosome is a highly conserved protein complex implicated in diverse biological functions that involve ubiquitin-mediated proteolysis [43]. Also, research demonstrates that COP9 signalosome is an important regulator of cell cycle and cell survival mediating the proliferation of HCC cells and highlight that COP9 signalosome might be a promising strategy for anti-HCC therapy [44]. The unfolded protein response is a signal transduction cascade, which acts as a quality control mechanism for protein synthesis and consequently can act to protect cells from an adverse external microenvironment. The study declared that the unfolded protein response is activated in the majority of HCC and may play a role in chemoresistance to the most widely used chemotherapy agent, doxorubicin, and have effects on newer antiestrogen and multikinase inhibitor therapies [45]. The present results indicated that the three lncRNAs might have an important role in LIHC via their involvement in these known cancerassociated biological functions.
Besides, previous studies have demonstrated that RP11-150O12.3 was significantly and independently associated with survival of HCC patients [46]. Moreover, RP11-150O12.3 also has been recognized as an independent   C7orf61, LSM8, AP1S1, ELOB, TSR3, PSMG3, COPS9, COPS6, ZNF593, LOC101927420,  CLDND2, ST7-AS1, C6orf52, LOC106660606, B9D1, FIS1, PRSS3, S100A13, S100A16, NSUN5, LMNA,  LAGE3, ACD, SNHG19, HSPB1, TMEM54, LINC00896, TMSB10, SFN, BRI3, EIF3B, HRAS, MRPL28,  MIR6132, NT5C, PTCD1, RASSF1-AS1, MINCR, LOC105371849, METTL26, DPM3, BCL7C, NUDT1,  AP1M2, MCM7, BOLA2B, JOSD2, LOC400684, PUSL1, ARF4-AS1, UQCC3, POLR2J, NAA10,  LOC101928659, PGP, S100P, POP7, LAMTOR4 predictor of gastric cancer prognosis [47], which also related to survival time of patients with colorectal adenocarcinoma [48]. RP13-143G15.3 is abnormally expressed in liver cancer and may act as an important role of tumor suppressor in human HCC [49]. The results of these studies provided some grounding to validate our findings. In our study, the single-sample networks were constructed for identifying the lncRNA biomarkers associated with LIHC patients. Actually, a single-sample network in this study is not a real molecular network for each sample but a perturbation network for a single sample against the reference network. It mainly reflects the variation between normal and disease samples in terms of interactions or a network. Similarly, a differential expression of a gene is not the real gene expression level for each sample but the variation of the gene expression between normal and disease samples. The advantage of the single-sample network is that it cannot only characterize the personalized features for all the single samples but also can be directly applied to the data analysis of single samples at the molecular network level, which is superior to traditional single molecular biomarker.
There are several limitations to the present study. First, the primary purpose of the present study was to identify lncRNA biomarker for LIHC at the molecular network level. The lncRNA biomarkers selected in our study should be further investigated incorporating more specific clinical characteristics to fully understand the associations involved. Second, the three lncRNA biomarkers were only validated in the datasets from TCGA database, which required additional confirmation in other large cohorts of LIHC patients in the future. Third, the lncRNA biomarkers identified in our study have no validation in fresh samples and experiment studies.

Conclusion
In this study, we constructed the single-sample networks for LIHC patients and identified three lncRNA biomarkers associated with the OS of LIHC patients. A 3-lncRNA-based risk score model with the ability to effectively predict the survival of LIHC patients was established, which was validated to be an independent predictor for the OS of LIHC patients. Functional enrichment analysis revealed that three lncRNAs may be associated with LIHC via their involvement in many known cancer-associated biological functions. This study provides a new perspective for identifying the lncRNA biomarkers at the molecular network level.