Identification of 170 New Long Noncoding RNAs in Schistosoma mansoni

Long noncoding RNAs (lncRNAs) are transcripts generally longer than 200 nucleotides with no or poor protein coding potential, and most of their functions are also poorly characterized. Recently, an increasing number of studies have shown that lncRNAs can be involved in various critical biological processes such as organism development or cancer progression. Little, however, is known about their effects in helminths parasites, such as Schistosoma mansoni. Here, we present a computational pipeline to identify and characterize lncRNAs from RNA-seq data with high confidence from S. mansoni adult worms. Through the utilization of different criteria such as genome localization, exon number, gene length, and stability, we identified 170 new putative lncRNAs. All novel S. mansoni lncRNAs have no conserved synteny including human and mouse. These closest protein coding genes were enriched in 10 significant Gene Ontology terms related to metabolism, transport, and biosynthesis. Fifteen putative lncRNAs showed differential expression, and three displayed sex-specific differential expressions in praziquantel sensitive and resistant adult worm couples. Together, our method can predict a set of novel lncRNAs from the RNA-seq data. Some lncRNAs are shown to be differentially expressed suggesting that those novel lncRNAs can be given high priority in further functional studies focused on praziquantel resistance.


Introduction
The trematode Schistosoma mansoni is the primary parasite species responsible for schistosomiasis, a chronic debilitating disease. It is considered one of the most devastating tropical diseases in the world with at least 258 million people infected. Furthermore, 800 million people were living in endemic areas at risk of infection with more than 200,000 deaths each year [1,2]. Its transmission has been reported in more than 78 countries, especially in tropical and subtropical areas such as Central and South America, Africa, and Southeast Asia [3,4].
Although, in the last few decades, several drugs have been used for the treatment of schistosomiasis, with praziquantel (PZQ) representing the only widely effective agent used [5,6]. It is effective against all species of schistosomes that infect humans and is relatively cheap and easy to use, but PZQ does not provide a cure, since young schistosomula are mostly resistant to its anthelmintic effects [7]. This drug provides some relief to treated patients. However, young parasites, due to their intrinsic resistance to PZQ, escape elimination during treatment, grow to maturation, and begin to release eggs [5]. This mechanism of resistance is worrying, because, under this 2 BioMed Research International ineffective pressure, drug resistance may also arise in humans, as in murine models [8,9].
The S. mansoni genome is structured in 7 pairs of autosomes and one pair of sex chromosomes (female = ZW, male = ZZ). Chromosomes range in size from 18 to 73 MB and can be distinguished by size, shape, and the C-banding technique [10]. According to the latest annotation, the S. mansoni genome is still considered as a draft with 380 Mb and 885 scaffolds. Despite this, about 81% of the bases are organized in these chromosomes. More than 45% of the predicted genes were modified, and the total number was reduced from 11,807 to 10,852 [11].
This parasite has a complex life cycle that involves many larval stages, an intermediate snail, and a final mammalian host. It is believed that the difference and developmental complexity observed between the different evolutionary stages and environments depend on the regulation of gene expression [12]. Several molecules are responsible for gene expression regulation, especially long noncoding RNAs (lncRNAs). They are defined as transcripts longer than 200 nucleotides and do not encode proteins presenting several regulatory functions. They can interact with DNA, RNA molecules, and transcription factors, participating in various biological processes, mainly gene regulation [13].
While our knowledge of the mechanisms and scope of lncRNA-mediated regulation is growing, our understanding of how lncRNAs themselves are regulated is still quite limited [14]. Regulating lncRNA expression would be expected to be an important cellular consideration given that lncRNAs have been implicated in regulating a variety of processes in eukaryotes including imprinting, dosage compensation, cell cycle regulation, pluripotency, retrotransposon silencing, meiotic entry, and telomere length [15][16][17][18]. They can also play important roles in numerous disease and physiological metabolism processes, such as X-chromosome inactivation, embryonic development, and pluripotency maintenance [16,19].
These findings deeply changed disease pathobiology comprehension and led to the emergence of new biological concepts about human diseases, including the parasitic disease. Several methodologies were created to characterize and identify this RNA subtype. Noteworthy, S. mansoni researchers used these techniques and described a great picture of these lncRNAs and their participation in the disease processes. To date two studies have been published predicting lncRNAs molecules in schistosomes [20,21] and only one [21] describes possible lncRNAs' functions of 181 sequences from 7431 total predicted lncRNAs in 5 life cycle stages in this parasite, including canonically spliced putative lincRNA and spliced lncRNAs that are antisense to protein coding genes. These functions were predicted considering that lncRNAs may act by regulating their flanking protein coding gene neighbors in many processes to the rapid adaptation of the parasite to several environments.
In this study, we aimed to predict novel lncRNAs in S. mansoni via a computational pipeline and investigate features including possible functions of these sequences. Here we describe a complete new set of 170 lncRNAs in adult worms, from which we selected 15 for expression analysis in male, female, and also, for the first time, praziquantel-resistant worms which are known to have a differential expression profile in several important genes in relation to susceptible worms [8,22,23]. Our results show a differential expression profile of lncRNAs and reinforce the importance of this RNA subtype in schistosome biology.

Ethics Statement.
All experiments that involve animals were authorized by the Ethical Committee for Animal Care of the Federal University of Ouro Preto (CEUA-UFOP protocol 2011/55). These procedures were conducted in accordance with the accepted national and international regulations for laboratory animal use and care.

2.2.
Parasites. The S. mansoni LE strain was maintained by routine passage through Biomphalaria glabrata snails and BALB/c mice. The infected snails were induced to shed cercariae under light exposure for 2 hours. Adult worm parasites were obtained by liver perfusion of mice after 50 days of infection. The S. mansoni LE praziquantel-resistant (LE-PZQ) strains were obtained following a described method for inducing resistance to PZQ using infected B. glabrata snails [24]. Infected snails were treated 3 times with 100 mg/kg PZQ for five consecutive days with a one-week interval between them. Then, after this treatment the cercariae both from treated snails (LE-PZQ strains) and from nontreated snails (LE strains susceptible) were used to infect two groups of mice. These mice were treated 45 days after infection with 200, 400, or 800 mg/kg PZQ with three PZQ treatments, each treatment administered on 5 consecutive days, with 1week interval, for selection of less susceptible parasites to PZQ following the method developed by Couto et al. [24]. Then, the LE-PZQ adult worms were obtained by mice liver perfusion, washed in RPMI 1640 (Sigma Chemical Co.), quick-frozen in liquid nitrogen, and stored at -80 ∘ C until use.

Data Set Download.
The latest annotation data for S. mansoni (Genome Assembly release v5.2) were downloaded from GeneDB database (http://www.genedb.org) [25]. In order to perform prediction analysis in this pipeline, a RNAseq library from 7-week-old mixed sex adult worms was selected and downloaded from ArrayExpress database [11] in the FASTQ format (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-451.

Initial Transcriptome
Assembly for S. mansoni. S. mansoni genome FASTA file and the annotation data in the GTF format were downloaded from GeneDB. One paired RNAseq library was downloaded in the FASTQ format provided by sequencing using the Illumina Genome Analyzer IIx platform [11]. The quality was then analyzed using FastQC 0.11.4 [26]. The main parameters analyzed were the quality of the scores on the bases (quality of the Phred set as greater than 20 where it considers 1 error per 100 bases), per sequence quality scores, per sequence GC content, and removal of possible adapters used in the sequencing. After that, these reads were trimmed using Trimmomatic [27], and the following parameters were used: HEADCROP:15, LEADING:20, TRAILING:20, SLIDINGWINDOW:5:20, and MINLEN:50. Next filtering the quality of the files in the FastQC and removing the adapters, all the reads were mapped using the S. mansoni reference genome with the STAR aligner [28]. In order to use this program, it was necessary to index the genome in the format compatible with STAR. Once the indexing was performed, the next step was to perform the mapping of the reads with the S. mansoni genome as reference using STAR. The output SAM files of STAR were then converted to its compressed BAM format, and then indexed and sorted with SAMtools for further analysis [29]. Subsequently, the BAM files were assembled by Cufflinks 2.2.1 [30] using de novo mode to assemble transcripts for 3 S. mansoni stage samples. Finally, assembly of transcripts was performed via Cufflinks, and the final file was submitted to the following pipeline filters for prediction of lncRNAs in S. mansoni. These steps were written in and performed with the use of custom Perl and Python scripts.

Identification of Novel lncRNAs Candidates.
Cuffcompare was the first step used to compare the results generated by the ab initio assembly with the known annotations present in GeneDB in GTF format. The consensuses of novel transcripts were then used for further analysis. As a result of Cuffcompare, all the assemblies that were detected as new transcripts were categorized into 12 different categories according to their location compared with the S. mansoni reference genes [30]. We kept the three following classes: unknown intergenic transcript (u), a transfrag falling entirely within a reference intron (i), and exonic overlap with reference on the opposite strand (x). This final file was submitted to the following filters written in the programming languages Python and Perl.
In the second step a filter was used to extract the sequences equal or greater than 200 nucleotides. The third step was done using CPAT (Coding Potential Assessment Tool), an alignment-free method to predict RNA coding potential using four sequence features [31]. Only the transcripts classified by CPAT as noncoding transcripts were added to a new FASTA file. For our proposed filter, we used a prebuilt logit fly model as the classifier with optimum cutoff (CP) 0.39 (CP >=0.39 indicates coding sequence; CP < 0.39 indicates noncoding sequence) [31]. The fourth step consisted of the extraction of transcripts that presented putative ORF (Open Reading Frame) smaller than 300 nt using OrfPredictor server [32]. Transcripts with protein coding potential generally have ORF greater than 300 nt in size, generating proteins greater than or equal to 100 aa (amino acids) [33,34]. In the fifth step we used an FPKM cutoff based on the distribution of the lncRNA expression level.
Only FPKM values ≥2 were included in the final analysis. The sixth step was performed manually using the NCBI database as a reference and lncRNAs sequences from other works [21]. At this step, we manually removed other RNA types, transposons, and predicted protein coding genes that have homology in other Schistosoma species. The seventh and final step was performed to remove all the known lncRNAs in S. mansoni using BLAST version 2.7.1 [35,36].

lncRNAs Features.
The novel adult lncRNAs predicted were characterized in terms of genomic localization, exon number, transcript length, and log 2 FPKM using R and R Studio with ggplot2 packaged [37].

Gene Ontology Enrichment
Analyses. The possible lncR-NAs' functions were hypothesized with Gene Ontology (GO) enrichment analysis approach (http://geneontology.org/). In this analysis, all neighboring genes up to 100,000 bases upstream and downstream of the lncRNAs were selected and then their functions were evaluated [38]. In this way, lncRNAs may act by binding to these protein coding genes by regulating their translation. The list obtained with all GO terms was created and plotted with R version 3.4.2, and R Studio version 1.0.136. The analysis method was based on Fisher's exact test and the -log10 (P value) was used to denote the significance of the GO term enrichment.

LncRNAs Expression Analysis.
Approximately 100 mg adult worms were used for total RNA extraction performed according to the manufacturer's specifications (SV Total RNA Isolation System). Expression levels of 15 lncRNAs were quantified by RT-qPCR with Applied Biosystems ABI 7300 by using SYBR-Green PCR Master Mix (Roche5). We designed specific primers for each lncRNA, endogenous control, and positive control using Gene Runner version 6.5.46 (Supplementary Table S1). For the investigated transcripts, three biological replicates were performed and normalized to the endogenous control with specific primers for S. mansoni EIF4E [39]. A long terminal repeat (LTR) retrotransposon with 58% cover and 95% identity with S. mansoni Saci-4 LTR retrotransposon was used as a comparison parameter because it is a transcript found expressed in all chromosomes. Expression levels were calculated according to the 2 −ΔCt method [40] using the Applied Biosystems 7300 software. All these experiments were performed following MIQE guidelines [41].

Statistical Analysis.
Statistical analysis for RT-qPCR was performed using GraphPad Prism, version 7.0 (San Diego, CA, USA). One-way and two-way analysis of variance (ANOVA), following by Tukey multiple comparisons, were performed to investigate significant differential expression of transcripts throughout the sexes and treatments. In all cases, the differences were considered significant when P values were <0.05.

Initially Assembled Transcripts.
A stranded RNA-seq data set of the whole transcriptome was used to assemble transcripts (Figure 1(a)). We first trimmed more than 10 million raw reads (FASTQ reads) obtaining a total of 6 million clean reads. More than 5 million (82.18%) of them were mapped with STAR to the S. mansoni genome (v5.2) and calculation of summary statistics (Supplementary Table S2). De novo assembly from the aligned fragments was performed using Cufflinks with all the ab initio default parameters to generate  15,776 transcripts, which were then processed through the described pipeline.

Computational Pipeline to Predict Novel lncRNAs.
Our computational pipeline applied multiple filters on these transcripts (Figure 1(b) Table S3).

lncRNAs Features.
To determine S. mansoni lncRNAs features, the genomic localization, exon number, transcript length, and log 2 FPKM (Figure 2) were analyzed. In this data set we found that the majority of lncRNAs were found on chromosome 1, sexual ZW, and the scaffolds. The majority of the intronic lncRNAs were found on the sexual ZW and the unfinished scaffolds (Figure 2(a)). Moreover, lncRNAs had few exons per transcript (2-3) (Figure 2(b)) with most of them having transcript lengths between 200bp and 2000bp ( Figure 2(c)). Finally, many of these lncRNAs could be confidently detected, with FPKM expression level between 2 and 9 ( Figure 2(d)).
We also selected the first three lncRNAs and the LTR retrotransposon to verify the expression between sexes into four groups: Control-Male, Control-Female, PZQ-Male, and PZQ-Female (PZQ-Male and PZQ-Female are related to S. mansoni LE praziquantel-resistant strains) (Figure 3(b)). All the 4 Control-Female expressions were higher than Control-Male and PZQ-Female.

Target Gene Prediction.
GO enrichment analyses for the neighborhood target genes were made on three different aspects, namely, biological process (BP), molecular function (MF), and cellular component (CC). The 10 significantly overrepresented GO terms included 389 genes involved in BP, 555 genes involved in MF, and 768 genes involved in CC (Figure 4). The largely enriched and meaningful BP terms were related to metabolism, transport, and biosynthesis. The enriched MF terms were predominantly related to binding including catalytic activity and nucleotide binding. As for CC, the most enriched terms were cell, intracellular, and cytoplasm. A detailed table of all 15 expressed lncRNAs was made including the neighboring coding genes and their respective GO entries (Supplementary Table S4).

Discussion
Several molecular and biochemical experiments revealed that lncRNAs may play diverse roles and functions in cellular biology. However, given the high abundance of lncRNAs and the poor-genetic conservation between species, the study of these molecules is extremely intriguing because they can be key regulators of species-specific biological processes. Notably, there are great prospects for a better understanding of S. mansoni biology, but much work will be needed to elucidate the specific role of lncRNAs in this parasite lifestyle. Thus, pipeline development for the S. mansoni genome is of extreme importance due to the lack of adapted and specific methodologies for the genus Schistosoma.
Current methodologies for predicting lncRNA genes are species-specific following the genome intrinsic characteristics of each species which increase data generation sensitivity and specificity. Each methodology used follows a flow chart that best adapts the steps and programs used [42][43][44][45]. This is a very important point because S. mansoni has particular characteristics since some of the repetitive fractions of DNA consist of tandemly repeated ribosomal genes of which there are 500-1000 copies per genome that represent 1.8-3.6% of the total DNA and nonribosomal repetitive sequences that comprise at least a further 2.0% of the total DNA [46].
In this pipeline, we introduced a manual step to remove other RNAs, transposons, and predicted protein coding genes in other Schistosoma species. In Figure 1, 39% of the initially putative lncRNAs showed significant homology with retrotransposons. This finding indicates an important difference of our pipeline. In addition, 33.6% of the putative lncRNAs identified were identical to those recently described for S. mansoni [21], reinforcing the hypothesis that the steps which were described in both studies are ideal for the identification of lncRNA in the genus Schistosoma.
In this work, we selected some lncRNAs to initiate expression studies, using lncRNA assumptions as criteria that are located in the proximity of genes involved with important pathways to S. mansoni survival, like the ubiquitin proteasome system, ribosomal protein (Supplementary Table S4), and others. In total, 100% of novel lncRNAs analyzed were expressed in adult worm couples (Figure 3). This is consistent with data from Vasconcelos et al. [21] who found similar percentages of expressed lncRNAs in S. mansoni adult worms.
The act of transcribing lncRNAs can have profound consequences on the ability of nearby genes to be expressed. For example, transcription of a lncRNA across the promoter region of a downstream protein coding gene directly interferes with transcription factor binding and thus prevents the protein coding gene from being expressed [47,48]. We detected high levels of Sm-lncRNA 5 expression related to proteasome B1 subunit, responsible for peptidylglutamyl activity. Curiously, it is the lowest activity of the proteasome of S mansoni [49]. The proteasomes form a pivotal component for the ubiquitin proteasome system (UPS). It is implicated in protein ubiquitination, proteolysis, and degradation and was set as essential to S. mansoni biology [49]. Moreover, the Sm-lncRNA 12 is differentially expressed in S. mansoni and occupies the neighborhood of E2 gene [50], leading us to hypothesize this participation on ubiquitin proteasome regulation. These findings reinforce themselves because differentially expressed lncRNAs, identified from human placentas, may regulate their associated mRNAs through several mechanisms and connect the UPS with infectioninflammation pathways [51].
Another set of putative lncRNAs, neighbor genes related to protein synthesis described in this work (Supplementary  Table S4), are related to protein synthesis. Recently, lncRNAs have emerged as key players in the stress responses in plants [52] and eukaryotic cellular senescence [53]. The stress response is a feature of the S. mansoni lifestyle in both vertebrate and invertebrates hosts [54].
Furthermore, lncRNA levels dynamically change in response to various drugs. These alterations affect gene expression involved in several cells function such as cycle arrest, inhibition of apoptosis, and DNA damage repair [55]. S. mansoni exposure to a drug, particularly PZQ, which is used in a single dose or repeatedly in reinfection, may induce drug resistance or reduced susceptibility over time [56]. The difference in gene expression between male and female in PZQ resistance worms suggest that lncRNAs can also be involved in drug resistance mechanisms in S. mansoni.
Several studies have provided insights into how genomic neighborhoods could influence gene expression levels, with important consequences for evolution, development, and disease [57].

Conclusions
In conclusion, we have identified a novel set of 170 lncRNAs in S. mansoni expressed in male, female, and PZQ resistant adult worms. These observations suggest that lncRNAs may be significant in parasite biology and be useful therapeutic targets. Further studies are required to dissect the function and mechanism of action of these RNA subtypes in normal biology and life cycle progression.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.