De Novo Assembly and Characterization of Sophora japonica Transcriptome Using RNA-seq

Sophora japonica Linn (Chinese Scholar Tree) is a shrub species belonging to the subfamily Faboideae of the pea family Fabaceae. In this study, RNA sequencing of S. japonica transcriptome was performed to produce large expression datasets for functional genomic analysis. Approximate 86.1 million high-quality clean reads were generated and assembled de novo into 143010 unique transcripts and 57614 unigenes. The average length of unigenes was 901 bps with an N50 of 545 bps. Four public databases, including the NCBI nonredundant protein (NR), Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Cluster of Orthologous Groups (COG), were used to annotate unigenes through NCBI BLAST procedure. A total of 27541 of 57614 unigenes (47.8%) were annotated for gene descriptions, conserved protein domains, or gene ontology. Moreover, an interaction network of unigenes in S. japonica was predicted based on known protein-protein interactions of putative orthologs of well-studied plant genomes. The transcriptome data of S. japonica reported here represents first genome-scale investigation of gene expressions in Faboideae plants. We expect that our study will provide a useful resource for further studies on gene expression, genomics, functional genomics, and protein-protein interaction in S. japonica.


Introduction
Sophora japonica Linn (Chinese Scholar Tree) is a shrub of the pea family Fabaceae. It grows into a lofty tree 10-20 m tall that produces a fine, dark brown timber. It is not only a kind of popular ornamental tree, but also a valuable nectar tree, offering delicious and healthy food. Moreover, dried flowers and buds of Sophora japonica, containing many kinds of components such as flavones, tetraglycosides, isoflavones, and isoflavone tetraglycosides [1], are used as useful herb to treat hemorrhoids and hematemesis in China, Japan, and Korea [2]. In spite of its medicinal and economic value, not much genomic or transcriptomic information is available for S. japonica. As of September 2013, only 74 nucleotide sequences and 35 proteins from S. japonica were available in GenBank. Hence, generation of genomic and transcriptome data is necessary to help further studies on S. japonica.
In the latest decade, the emergence of the next generation sequencing (NGS) technology offers a fast and effective way for generation of transcriptomic datasets in nonmodel species using various platforms such as Roche 454, Illumina HiSeq, and Applied Biosystems SOLiD [3][4][5]. Compared to the whole-genome sequencing, RNA-seq, which is considered as a cost-effective and ultra-high-throughput DNA sequencing technology, is a revolutionary advance in the functional genomic research [6]. In this approach, sequences of the expressed parts of the genome are produced [7] to identify genes [8] and explore the low abundance transcripts [9]. Due to the many advantages, RNA-seq is specifically attractive for nonmodel organisms without genomic sequences [10][11][12][13].
In this study, we used RNA-seq technology to investigate the transcriptome of S. japonica from three tissues. Using Illumina sequencing platform, a total of 86139654 reads of S. japonica transcriptome were produced. Those were assembled into 57614 unigenes and annotated for functionality. Furthermore, the protein-protein interaction network of expressed genes in S. japonica was constructed. This is the first S. japonica and Styphnolobium genus transcriptome data generated by RNA-seq technology. The information provides a good resource for further gene expression, genomics, and functional studies in S. japonica.

Method
2.1. RNA Preparation and Sequencing. S. japonica (provided by the Yangzhou eight strange Memorial) was grown in an open-air place in Jiangsu Province, Eastern China. Total RNA was extracted using TRIzol method (Invitrogen) from three different tissues: tender shoots, young leaves, and flower buds. RNA was isolated from every tissue and mixed together in equal proportion for cDNA preparation.
The poly-A mRNA was isolated from the total RNA using poly-T oligo-attached magnetic beads (Illumina). After purification, fragmentation buffer (Ambion, Austin, TX) was added to digest the mRNA to produce small fragments. These small fragments were used as templates to synthesize the first-strand cDNA with superscript II (Invitrogen) and random hexamer primers. The synthesis of the second strand was performed in a solution containing the reaction buffer, dNTP, RNaseH, and DNA polymerase I using Truseq RNA sample preparation Kit. Next, these cDNA fragments were handled with end repair using T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase (Invitrogen). Illumina's paired-end adapters were then ligated to the two ends of cDNA fragments. The adapter sequences were as follows: read1 adapter: AGATCG-GAAGAGCACACGTC and read2 adapter: AGATCGGAA-GAGCGTCGTGT. The products from this ligation reaction were electrophoresed on a 2% (w/v) agarose gel (certified low range ultragrade agarose from Bio-Rad) and purified according to appropriate size of DNA fragments suitable for Illumina sequencing. Then the sequencing library was constructed according to the protocol of the Paired-End Sample Preparation kit (Illumina). Sequencing was done with an Illumina Hiseq 2000. Raw read sequences are available in the Short Read Archive database from National Center for Biotechnology Information (NCBI) with the accession number SRR964825.

De Novo Assembly.
After removal of adaptor sequences along with low quality reads and reads of larger than 5% unknown sequences, the resting were assembled into unitranscripts and unigenes by Trinity [14].
We used RSEM [15] to quantify expression levels of each unique transcript (see additional file 1 in Supplementary Material available online at http://dx.doi.org/10.1155/2014/ 750961). Results were reported in units of TPM (transcripts per million). After counting fraction of each isoform, we used length × isoform percent as a standard to choose unigenes (see additional file 2).

Functional Annotation and Classification.
All assembled unigenes, longer than 300 bps, were further analyzed to predict putative gene descriptions, conserved domains, gene ontology (GO) terms, and association with metabolic pathways. First of all, all the unigenes were searched in the protein databases including NCBI NR, Swiss-Prot, and clusters of orthologous groups (COG) [16] through BLASTALL procedure (ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.18/) with an E-value < 1.0E − 6. After obtaining the features of the best BLASTX hits from the alignments, putative gene names and "CDS" (coding DNA sequences) were determined. Subsequently, according to the NR annotation, we took advantage of Blast2GO [17] software to predict GO terms of molecular function, cellular component, and biological process. After obtaining GO annotation for all unigenes, GO functional classification of the unigenes performed using WEGO software [18] and exhibited the distribution of gene functions at the second level. Unigene sequences were also compared to the COG database to predict and classify possible gene functions based on orthologies. Association of unigenes with the KEGG pathways was determined using BLASTX against the Kyoto Encyclopedia of Genes and Genomes database [19]. The KEGG pathways annotation was performed in the KEGG Automatic Annotation Server (KAAS) (http:/www.genome.jp/tools/kaas/) [20].
To obtain the potential protein coding sequences from all unigenes, we first predicted all the open reading frames (ORFs). According to the BLASTP results against NR database, we chose the correct ORFs as potential protein coding sequences. And the longest ORFs from the unigenes without BLASTP results were considered as referential protein coding sequences (additional file 3).

Construction and Topological Analysis of Protein
Interaction Network. The interaction network of unigenes in S. japonica was constructed in the form of nodes and edges where nodes represent genes and edges represent interactions between genes. First, we downloaded protein-protein interactions (PPI) and sequences of six species Arabidopsis thaliana, Arabidopsis lyrata, Oryza sativa subsp. Japonica, Brachypodium distachyon, Populus trichocarpa, and Sorghum bicolor from STRING database that is a precomputed database for the detection of protein-protein interactions [21]. Then, the protein sequences of genes from PPIs were searched against the unigenes datasets in our study to find homologies by TBLASTN ( -value < 1.0 − 6). The TBLASTN hits with identity >50% and covering query gene >80% were identified as the candidate interacting genes of the network. According to the known PPI network of the above six species, the interaction network of S. japonica was constructed using the homologous unigenes from the TBLASTN searches.
The topological features such as the degree distribution of nodes, degree correlation, clustering coefficient ( ), and shortest path length ( ) were determined for the resultant networks. To each node of the network, we assigned a degree , which is the number of its neighbors. We calculated the degree distribution of the giant component (i.e., the probability ( ) that a protein has edges) [22] using the equation where is the number of nodes and ( ) is the number of nodes with degree . The degree correlation, which is characterized by analyzing the average degree of nearest neighbors , [23], is defined by The clustering coefficient ( ) was defined as the average probability with which two neighbors of a node were also neighbors to each other. For instance, if a node had links, and among its nearest neighbors there were links, then the clustering coefficient of [23] was calculated using the equation .
The shortest path length ( ) between two nodes was defined as the minimum number of intermediate nodes that must be traversed to go from one node to another [23]. The average shortest path length was the shortest path length averaged over all the possible pairs of nodes in the network.

De Novo Sequence Assembly of S. japonica Transcriptome.
Total RNA from three different tissues (tender shoots, young leaves, and flower buds) was extracted and blended in equal proportions for Illumina sequencing. A total of 86.1 million high-quality clean reads with total of 8700105054 nucleotides (nt) sequences were produced with an average length of 101 bps for each short read ( Table 1). As a result of the absence of the genomic sequences of S. japonica, the transcripts were assembled de novo from all high-quality reads by Trinity [14]. A total of 143010 unique transcripts (UTs) were predicted from the clean sequence reads, with an average length of 1482 bps and an N50 of 1155 bps. The majority of UTs (33045) were between 100 and 500 bps, which accounted for 23.1% of total UTs shown in Figure 1(a). Then after removing redundancy, 57614 unigenes were generated with an average length of 901 bps. As shown in Figure 1(b), the length of the unigenes ranged from 300 bps to more than 3000 bps.
The quality score distribution across all bases and over all sequences was shown in additional files 4 and 5, revealing that most of the sequences have quality score larger than 30. To further evaluate the quality of the dataset, we compared the unigenes from S. japonica with other species using BLASTX (additional file 6). The result showed that more than half of unigenes that are having significant BLAST hits were mapped to soybean, which was consistent with our expectation.

Functional Annotation and Classification of S. japonica
Transcriptome. In order to annotate the transcriptome of S. japonica, a total of 57614 unigenes were first examined against the NR database in NCBI using BLASTX with an -value cut-off of 1 −6 , which showed 27507 (47.7%) having significant BLAST hits ( Table 2). The -value distribution of significant hits revealed that 67.8% of matched sequences had strong homology (smaller than 1.0 − 50), while the other homologous sequences (32.2%) had -values in the range of 1.0 − 50-1.0 − 6 ( Figure 2(a)). The distribution of sequence similarities represented that most of the BLASTX hits (95.3%) were in the range between 40% and 100%. Only 4.7% of hits had sequence similarity values less than 40% (Figure 2(b)).
The protein coding sequences of unigenes were also compared with the protein database at Swiss-Prot by BLASTX. A total of 20463 of 57614 unigenes (35.5%) showed hits at an -value threshold of ≤1.0 − 6 ( Table 2). More than half of the matched sequences (53.7%) had strong homologies with -values of ≤ 1.0 − 50, and the remaining unigenes had -values between 1.0 − 50 and 1.0 − 6 ( Figure 2(c)). The distribution of sequence similarities against Swiss-Prot was different than that obtained against the NR database. While 75.0% of query sequences against Swiss-Prot had similarities between 40% and 100%, only 25.0% of sequences had strong homologies with <40% identity (Figure 2(d)). Thus by combining the results of sequence similarity searches from NR and Swiss-Prot database, we identified a final set of 27541 unigenes.

Gene Ontology (GO)
Classification. GO terms were predicted for each assembled unigene to characterize functionality of gene products on the basis of their sequence similarities to known proteins in the Nr database. Of the 57614 unigenes of S. japonica, a total of 15063 unigenes were assigned to at least one of the three main GO categories: cellular component (11860, 20.6%), biological process (11643, 20.2%), and molecular function (11160, 19.4%). These GO terms were further subdivided into 51 sub-categories (Table 2, Figure 3, and additional file 7). Among these categories, the "cell, " "cell part, " "cellular process, " "organelle, " "metabolic process, " "catalytic activity, " and "binding" terms were found to have association with relatively more number of unigenes than other GO terms. The relative abundance of unigenes associated with cellular processes (9396) and metabolic processes (9010) in the biological processes category implied that the S. japonica tissues used in the study processed extensive metabolic activities.       Figure 4, and additional file 8). The top five categories based on number of orthologies were (1) "general function prediction only" (13.8%); (2) "translation, ribosomal structure, and biogenesis" (11.9%); (3) "replication, recombination, and repair" (11.1%); (4) "posttranslational modification, protein turnover, and chaperones" (10.5%); and (5) "amino acid transport and metabolism" (6.7%). The two categories comprising "RNA processing and modification" and "chromatin structure and dynamics" consisted of 19 and 12 unigenes (0.5%), respectively, representing the two small COG classifications.
In conclusion, 27541 unigenes were annotated using NR, Swiss-Prot, COG, and KEGG databases. These unigenes had BLASTX scores with -values ≤ 1.0 − 6. Among these, 1561 unigenes showed hits in all the four public databases (NR,  Swiss-Prot, COG, and KEGG) providing the best functional annotations of those unigenes ( Table 2). These annotations provide a valuable resource to investigate further processes, structures, functions, and pathways of S. japonica in future studies.

Construction and Topological Analysis of Protein-Protein
Interaction Network in S. japonica. The interaction network was constructed using the annotated unigenes of S. japonica in comparison with genes with at least one known  Figure 5(a) showed the degree distribution ( ) = 0.23 K −0.94 (the least square fit of associations) which implies the sale-free characteristics of the component. The degree-correlation of the giant component is shown in Figure 5(b). The decay behavior of with suggests the disassortative mixing of nodes. We plotted the clustering coefficient of a node with links in Figure 5(c) yet. The guideline is ( ) ∼ −1 , the scaling law which reflects the hierarchy of the giant component. Besides, we compared the clustering coefficient and the shortest path length of the giant component with Erdős-Rényi, Watts-Strogatz, and Barabási-Albert models of the same nodes and links . The data in Table 3 shows high clustering coefficient and small path length. Those suggest the small-world properties of the network.

Discussion
Sophora japonica Linn is an economically important species for several reasons. It is commonly used to afforest cites and highways for their adaptability to ecology and environment. It also provides useful products such as honey and lumber for human use [2]. Apart from such ecological and economical values of pagoda tree, it has a unique mythological importance to Chinese people. The pagoda tree mentioned in the famous Chinese idiom story "A Fond Dream of Nanke" is believed to still present in the yard of the Yangzhou eight strange Memorial at present, in Jiangsu Province. This story tells that more than one thousand years ago, a person named Nanke drunk and rested against the pagoda tree having a dream. In his dream, he became the prime minister of the kingdom of pagoda tree. After waking up, he found that the kingdom of pagoda tree was the nest of ants under the pagoda tree. Nowadays, people believe that the tree is over 2000 years old. However, very little research has been done with this important species to understand its genome. Recently, high-throughput RNA sequencing has offered a new avenue to generate abundant sequence information from any organism [24,25]. The data obtained from RNA-seq projects are also helpful in inferring the basic biological, molecular, and cellular processes [19,20]. Genomes of many plant species have been studied by de novo transcriptome analysis, such as willow [26], Cocos nucifera [27], tea plant [10], and pineapple [28]. In this study, we used Illumina RNA-seq technology to sequence the Sophora japonica plant transcriptome and predicted a large number of expressed genes in S. japonica. We obtained 8.7 G bps coverage with 86.1 million high-quality clean reads. Using de novo software Trinity, we generated 57614 unigenes. Our results revealed that 27541 unigenes (47.8% of all assembled unigenes) were functionally annotated and involved in different biological processes.
Using the sequences of the predicted unigenes, we constructed a protein-protein interaction network to understand gene interactions in S. japonica. We identified a giant and 88 small components of the network. The best fit of the degree distribution of the giant component demonstrated that it was a scale-free network in which a few proteins interacted with high connectivity [29]. Like other biological networks, the giant component displays disassortative mixing that ensures connection of high-degree nodes with low-degree nodes. It is likely that the disassortativity may reduce the proportion of the important edges among hubs and increase the stability of biological networks when compared to the assortative network.
In addition, the giant component of the network also exhibited small-world properties including the high clustering coefficient as well as the smaller and shortest path length ( Table 3), suggesting that the neighbors of one node have close associations among each other in the network. The smaller shortest path length is an indicative of minimal distance between a node and its target to minimize energy involved in interactions between proteins. At the same time, the scaling law of the average clustering coefficient as a function of the degree (Figure 5(c)) indicates the hierarchical structure which reflects the evolutionary patterns associated with various organizational levels of the network. The combination of many local variations, which affect the small but highly interacted nodes, slowly affects the properties of the larger but less interacted nodes [30]. Such a process during evolution ensures both stability and low energy consumption in an efficient protein-protein interaction.

Conclusion
In this study, we applied Illumina RNA sequencing and de novo assembly approach to study the S. japonica transcriptome for the first time. Totally, about 86.1 million reads assembled into 57614 unigenes were generated with an average length of 1321 bps. Among these unigenes, 27541 unigenes obtained annotation with gene descriptions from NR, Swiss-Prot, COG, and KEGG databases. This study demonstrated that the RNA-seq technology could be used as a rapid and efficient method for de novo transcriptome analysis of nonmodel plant organisms that provides a good resource of gene expression data for further analysis. A protein-protein interaction network of expressed genes was constructed in S. japonica. The topological analysis revealed that degree correlation of the giant component was disassortative and had small-world properties. This result implied that the proteinprotein interaction network in S. japonica might have resulted from a long-term evolution to ensure both stability and low energy consumption protein-protein interactions.