Genome-Wide Identification of Long Noncoding RNAs in Human Intervertebral Disc Degeneration by RNA Sequencing

Long noncoding RNAs (lncRNAs) are emerging as crucial players in a myriad of biological processes. However, the precise mechanism and functions of most lncRNAs are poorly characterized. In this study, we presented genome-wide identification of lncRNAs in the patients with intervertebral disc degeneration (IDD) and spinal cord injury (control) using RNA sequencing (RNA-seq). A total of 124.6 million raw reads were yielded using Hiseq 2500 platform and approximately 88% clean reads could be aligned to human reference genome in both IDD and control groups. RNA-seq profiling indicated that 1,854 lncRNAs were differentially expressed (log2 fold change ≥ 1 or ≤−1, p < 0.05), in which 1,530 could potentially target 6,386 genes via cis-regulatory effects. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis for these target genes suggested that lncRNAs were involved in diverse pathways, such as lysosome, focal adhesion, and MAPK signaling. In addition, a competing endogenous RNA (ceRNA) network was constructed for analyzing the function of lncRNAs. Further, quantitative real time PCR (qRT-PCR) was used to confirm the differentially expressed lncRNAs and ceRNA network. In conclusion, our results present the first global identification of lncRNAs in IDD and may provide candidate diagnostic biomarkers for IDD treatment.


Introduction
Intervertebral disc degeneration (IDD) is implicated as the major contributor of low back pain, which inflicts global burden with severe health care [1,2]. Aetiology of IDD is complex, with both environmental and genetic factors playing roles in the disease process [3,4]. For example, genetic polymorphisms in a number of genes, such as collagen I, collagen IX, collagen XI, aggrecan, extracellular matrix-degrading enzymes, inflammatory cytokines (IL-1, IL-6, and TNF ), Fas/FasL, and vitamin D receptors, have been associated with an increased risk of IDD [3]. Moreover, aberrant expression of noncoding RNAs (ncRNAs) has also been reported to be involved in the occurrence of IDD [5,6].
NcRNAs are transcripts without protein-coding capacity, which are largely grouped into two major classes based on transcript size, small ncRNAs with length less than 200 nucleotides (nt), and long ncRNAs (lncRNAs, >200 nt) [7]. Small ncRNAs include many types, such as microRNAs (miRNAs), small interfering RNAs (siRNAs), and PIWIinteracting RNAs (piRNA), which have regulatory roles during diverse biological events [8]. Among these small ncRNAs, miRNAs are studied extensively in IDD, and we compared the expression of miRNAs between IDD and spinal cord injury specimens in previous study [9]. LncRNAs are initially considered as nonfunctional byproducts of RNA polymerase II transcripts and have been paid much attention recently because they may act as important cis-or trans-regulators in a wide variety of biological functions, such as gene expression, genome imprinting, chromatin modification, and epigenetic regulation as well [10][11][12]. Moreover, lncRNAs have been applied in the treatment of disease, such as cancer, neurodegenerative, and psychiatric diseases [13,14].
Recently, Wan et al. have applied microarray method to identify 1,806 significantly differentially expressed lncR-NAs in IDD compared to spinal cord injury group [15], whereas, unlike microarray method detecting known lncR-NAs from an RNA pool [16], RNA sequencing (RNA-seq) Table 1: Basic specimen's information.

Sample
Age Gender  MRI grade  IDD 1  55  F  V  IDD 2  60  F  IV  IDD 3  63  F  V  IDD 4  29  M  IV  IDD 5  34  M  IV  IDD 6  is independent of currently available genome annotation or sequence and can be used to discover previously unknown lncRNAs in an unbiased manner [17].
In current study, we performed a comprehensive transcriptome analysis in IDD and control groups using RNA-seq and identified lncRNAs with differential expression between them. To explore the function of lncRNAs, we predicted their potential targets with cis-regulatory effects, which were then put into gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) for further analysis. In addition, it has been reported that lncRNA may function as competing endogenous RNA (ceRNA) in regulating gene expression. These lncRNAs could share miRNA response elements (MREs) with the transcripts of specific mRNAs and thus affect expression of these mRNAs [18,19]. In combination with our previous miRNA profile [9], we constructed a ceRNA network to analyze the potential functions of lncRNAs. Furthermore, 10 differentially expressed lncRNAs and ceRNA network were confirmed by quantitative real time PCR (qRT-PCR). Taken together, our results may provide more candidate diagnostic biomarkers and treatment targets for IDD patients.
This study was approved by the Human Ethics Committees Review Board at Xi'an Jiaotong University, Xi'an, China, and written informed consent was obtained from all participants.

RNA Extraction and Sequencing.
Total RNA from six IDD and six spinal cord injury samples were extracted separately using TRIzol reagent (Invitrogen, Carlsbad, CA, USA) in accordance with the manufacturer's instructions. Then, DNase I was added to remove contaminating genomic DNA. RNA quality and quantity were measured using a NanoDrop spectrophotometer (Thermo Scientific, USA) and Agilent 2100 Bioanalyzer (Agilent Technologies, USA). RNA integrity was determined by 1% gel electrophoresis. Equal amounts of total RNA from the IDD and spinal cord injury samples were pooled into experimental and control groups, respectively.
As for high throughout sequencing, ribosomal RNA (rRNA) was depleted from total RNA using the Ribo-Zero6 rRNA Removal Kit (Human/Mouse/Rat; Epicentre, USA) according to manufacturer's protocol. The cDNA libraries were prepared using an ScriptSeq6 v2 RNA-Seq Library Preparation Kit (Epicentre, USA) and were sequenced on Illumina HiSeq 2500 with 101 bp paired-end reads at the YingBio Tech, Shanghai, China.

RNA-Seq Reads Mapping and Transcriptome Assembly.
The raw reads were first filtered to remove the adapter sequences and low-quality sequences using Trim Galore [21]. Next, clean reads were mapped to the human GRCh37 reference genome using TopHat [22]. To construct transcriptome, the mapped reads were assembled using Cufflinks [23]. The cuffcompare program was used to merge the RefSeq, Ensembl, Gencode, UCSC, Noncode, and Lncipedia humanknown genes into one set of gene annotation for comparison with the assembled transcripts [23].

Pipeline for the Identification of lncRNA.
To identify novel reliable lncRNAs from IDD, we employed a highly stringent pipeline to remove transcripts with evidence for proteincoding potential ( Figure 1). Firstly, single-exon transcripts and the transcripts with length less than 200 nt were filtered. Next, three independent algorithms, coding potential calculator (CPC), coding potential assessment tool (CPAT), and phylogenetic codon substitution frequency (PhyloCSF), were applied to extract high reliable potential noncoding transcripts. A positive CPC score indicated a protein-coding potential transcript, whereas CPC value < 0 was considered as noncoding transcripts [24]. The CPAT coding probability score for protein determination varied from 0.364 to 0.44 for human [25], and negative PhyloCSF score indicated noncoding transcripts [26]. Here, we selected a quite stringent threshold for PhyloCSF score < −20 as ncRNA. Further, transcripts with CPC < 0, CPAT < 0.364, and PhyloCSF < −20 that encoded any protein domains cataloged in the Pfam database were filtered out utilizing HMMER software [27].

LncRNA Classification.
Depending on their relationships with the neighboring protein-coding genes, the identified lncRNAs can be classified to six categories: (1) sense or (2) antisense, the lncRNA transcript overlaps one or more exons of another transcript in the same or opposite DNA strand, respectively; (3) bidirectional, expression of lncRNAs is in the same direction as a neighboring coding transcript in the same chain; (4) intronic, lncRNAs derive wholly from within an intron of a second transcript; (5)   lncRNAs lie within the genomic interval between two genes; (6) small RNA (sRNA) host lncRNA [7,28]. The differentially expressed lncRNAs were annotated with the following priority: sRNA host lncRNA > intronic lncRNA > sense lncRNA > antisense lncRNA > bidirectional lncRNA > intergenic lncRNA.
2.6. Differential Expression Analysis. The lncRNA and mRNA sequence reads of the IDD and control groups were normalized to fragments per kilobase of transcript per million mapped reads (FPKM) values [29]. Cuffdiff 2.0 was used to detect differentially expressed lncRNAs between the IDD and control groups (log 2 fold change (FC) ≥ 1 or ≤ −1, < 0.05) [23]. Those genes differentially expressed in IDD might be IDD specific genes.

GO and Pathway
Analysis. For each differential expressed lncRNA, the nearest upstream and downstream (within 10 kb) protein-coding neighbors were identified as their cisregulatory potential targets. To explore the roles of these target genes, we performed GO (http://www.geneontology.org) and pathway analysis. GO terms are comprised of biological process (BP), cellular component (CC), and molecular function (MF). Pathway analysis is a functional analysis that maps genes to the KEGG (http://www.genome.jp/kegg/) pathways. KEGG allowed us to determine the biological pathways that there is a significant enrichment of differential expressed mRNAs.

Construction of the ceRNA Network.
In our previous study, 51 miRNAs were found to be differentially expressed in the IDD group compared with the spinal cord injury group [9]. To construct the ceRNA network, the interactions between differentially expressed lncRNAs and miRNAs were predicted by miRcode (http://www.mircode.org/) at first, and then, RNAs that were targeted by miRNAs with luciferase reporter assay support were from TarBase (http://www .microrna.gr/tarbase).

RNA Sequencing and Transcriptome Reconstruction.
Two cDNA preparations were sequenced from IDD and spinal cord injury groups. In total, 63,592,624 and 61,015,766 raw reads were generated from control and IDD RNA libraries, respectively. After discarding low-quality reads and adaptor sequences, we obtained approximately 59 million clean reads in both libraries. To identify putative lncRNAs from deep sequencing data, we developed a stringent filtering pipeline (Figure 1). At the beginning, clean reads were aligned to the human GRCh37 reference genome using TopHat [22]. Approximately 88% of the reads were aligned onto the human genome and 42.3% were uniquely mapped in both groups (Table 2). Then, transcripts were reconstructed using Cufflinks [29]. The reconstructed transcripts were annotated using the cuffcompare program from the Cufflinks package. Further, putative lncRNAs were screened from unknown transcripts according to the analysis work flow shown in Figure 1. Finally, we obtained 177,975 lncRNAs, of which 63 were novel. The average length of putative novel lncRNAs was 1,067 nt, varying from 205 (XLOC 001763) to 6217 nt (XLOC 037168) (Supplementary Table 2).

Differentially Expressed lncRNAs and mRNA Analysis.
The transcript expression levels were normalized to FPKM values by the Cufflinks software. The volcano plots show the differential expression of transcripts in IDD and the control groups (Supplementary Figure 1). 1,854 lncRNAs (916 lncRNAs were upregulated and 938 lncRNAs downregulated)   Table 3). Here, we presented the top 10 downregulated and upregulated mRNA and lncRNAs in Table 3. Among these, NONHSAT031859 (log2FC = 6.04) was the most significantly upregulated and NONHSAT006310 (log2FC = −7.58) was the most significantly downregulated. To classify the differentially expressed lncRNAs according to their position with protein-coding genes, we identified that vast majority were sense lncRNA (1302), followed by intergenic lncRNA (252), intronic lncRNA (134), sRNA host lncRNA (64), antisense lncRNA (59), and bidirectional lncRNA (35), as well as eight lncRNAs that could not be classified well (Supplementary Table 3).
In total, 2,804 (1,444 upregulated and 1,360 mRNAs downregulated) mRNAs were significantly differentially expressed in IDD compared with the control group (Supplementary Table 4). The top 10 downregulated and upregulated mRNAs were presented in Table 4.

Prediction of the Potential Target Genes of lncRNAs.
It was considered that many lncRNAs function as cis-regulators, given that the expression of lncRNA is significantly correlated with their neighboring protein-coding genes as potential targets [31]. Here, we found that 1,530 differentially expressed lncRNAs were transcribed near (<10 kb) their protein-coding neighbor, and a total of 6,386 putative targets were collected (Supplementary Table 5). The number of putative targets for each lncRNA varied greatly. For example, ENST00000565580 had 30 target genes, maximum among these lncRNAs, followed by ENST00000523461 and ENST00000544639 with 27 and 26 target genes, respectively, which might indicate that these lncRNAs conferred to a broad-spectrum regulation. Simultaneously, 207 lncRNAs targeted only one gene, which might suggest a unique regulatory function played by these lncRNAs. To further understand the functions and associated pathways of these target genes, we performed GO and pathway analyses. The enriched GO terms of these potential targets involved many basic biological events, such as intracellular, protein binding, and cellular protein metabolic process (Figure 2(a)) (Supplementary Table 6). Based on our KEGG pathway analysis, the most enriched pathways corresponding to the differentially expressed lncRNAs were associated with lysosome, RNA transport, and aminoacyl-tRNA biosynthesis (Figure 2(b)) (Supplementary Table 7).

ceRNA Network Construction.
A number of studies have demonstrated that some lncRNAs can serve as miRNA sponges to interact competitively with miRNAs to inhibit miRNA availability to mRNAs. Our previous results demonstrated that 25 miRNAs were upregulated and 26 were downregulated in the IDD group compared with the spinal cord injury group [9]. To examine the posttranscriptional regulatory function of differentially expressed lncRNAs in IDD, we constructed a lncRNA-miRNA-mRNA ceRNA network in  combination with previous miRNA expression profile (Figure 3). In this ceRNA network, two downregulated miRNAs, has-miR-34a and has-miR-148a, were negatively correlated with their corresponding target genes E2F3 and VEGFA and ACVR1, respectively. Moreover, upregulated lncRNA PART1 harbor potential MRE for both has-miR-34a and has-miR-148a.

Verification of RNA-Seq Data by qRT-PCR.
To confirm the expression patterns of lncRNAs identified by RNA-seq, 10 differentially expressed lncRNAs including five upregulated lncRNAs and five downregulated ones were selected for validation by qRT-PCR in 12 lumbar disc samples. Of these 10 lncRNAs, eight showed the same trends of up-and downregulation as the sequencing data (Figure 4(a)). In addition, expression level of mRNAs and lncRNA in ceRNAs was also verified by qRT-PCR (Figure 4(b)). Moreover, in the 12 lncRNAs and mRNAs, expression level of 9 genes showed significantly statistical differences between IDD group and control ( Figure 5) ( < 0.05). Taken together, our these results showed good correlation between qRT-PCR and RNA-seq data and different expression level of genes.

Discussion
In this study, we present the comprehensive identification and analysis of IDD specific lncRNAs using RNA-seq. Transcript predicted to have coding potential filtered by CPAT, CPC, and PhyloCSF, respectively. Thereby more reliable putative novel lncRNAs were attained. We identified 1,854 lncRNAs and 2,804 protein-coding genes which are differentially expressed in IDD group compared with the spinal cord injury group. In addition, the expression levels of 10 lncRNAs with significant differential expression in the IDD were verified by qRT-PCR.
Comparing our results with previous microarray study [20], the differentially expressed lncRNAs were totally different. However, we found that differentially expressed mRNAs showed partial overlap (Supplementary Table 8). For example, CHN1 and MAD1L1 exhibited quite similar upregulation and downregulation in two studies, respectively, while DGKZ and PTP4A3 were much more induced in RNA-seq data. Moreover, KEGG pathway analysis of the differentially expressed mRNAs was also partially overlaid, including lysosome, focal adhesion, and ubiquitin mediated proteolysis (Supplementary Figure 2) [15]. Samples taken from patients of different ages and high throughput technology may explain this big difference. Microarray is greatly dependent on designed probes and hence cannot comprehensively characterize dynamic and relatively low expression of lncRNAs. Also, lncRNAs have strong tissue specificity, and many lncRNAs are not identified at present [16,32].
The functions of lncRNAs could be indirectly reflected by the functions of neighbor protein-coding genes. To better understand the regulatory roles of the putative lncRNA candidates, we also employed GO and KEGG to analyze the function of potential target genes of lncRNAs. Among the top 20 enriched KEGG terms, lysosome, endocytosis, and phagosome were relevant to autophagy which is a catabolic process involving the degradation of dysfunctional cellular components by lysosomal systems [33]. Ma et al. have reported that autophagy may play an important role in IDD, and silent information regulation 2 homolog-1 (SIRT1) protected degenerative human nucleus pulposus (NP) cells against apoptosis via promotion of autophagy [34]. An array of cardiovascular risk factors such as age, smoking, hypertension, high cholesterol, and diabetes has been reported to be related to IDD and back pain [35]. Cardiovascular risk related pathways were enriched in our results, including dilated cardiomyopathy, arrhythmogenic right ventricular cardiomyopathy (ARVC), and hypertrophic cardiomyopathy (HCM). Intervertebral disc is the largest avascular organ in the human body and has been identified as immuneprivileged organ [36]. Exposure of NP to the immune system is able to evoke autoimmune reaction, which play an essential role in pathophysiology of IDD [37,38]. Epithelial cell signaling in Helicobacter pylori infection, Staphylococcus aureus infection, and antigen processing and presentation might be associated with the breakdown of immune privilege resulting in IDD at last. Further, it has been reported that p38 MAPK gene expression is upregulated in senescent human annulus fibrosus (AF) cells compared to nonsenescent cells [39]. Both miRNA and lncRNA have been shown to be associated with IDD. In combination with our previous miRNA profile, we constructed a ceRNA network, of which upregulated lncRNA PART1 and downregulated has-miR-34a and has-miR-148a were the core elements. The relative expression level * * * * * * * * * Figure 5: The relative expression level of 12 lncRNAs and mRNAs. The first four differentially expressed lncRNAs were upregulated in IDD, the second four differentially expressed lncRNAs were downregulated in IDD, and the third four differentially expressed genes were one lncRNA (PART1) and three mRNAs (ACVR1, E2F3, and VEGFA) in ceRNA network. The asterisk ( * ) means significantly different expression between IDD group and control.
In conclusion, the identified lncRNAs that are differentially expressed in IDD and control samples could have functional roles in IDD development. This study provides novel insights into the discovery and annotation of IDD development-related lncRNAs in mammalian genome.