De Novo Assembly and Transcriptome Characterization of Canine Retina Using High-Throughput Sequencing

We performed transcriptome sequencing of canine retinal tissue by 454 GS-FLX and Ion Torrent PGM platforms. RNA-Seq analysis by CLC Genomics Workbench mapped expression of 10,360 genes. Gene ontology analysis of retinal transcriptome revealed abundance of transcripts known to be involved in vision associated processes. The de novo assembly of the sequences using CAP3 generated 29,683 contigs with mean length of 560.9 and N50 of 619 bases. Further analysis of contigs predicted 3,827 full-length cDNAs and 29,481 (99%) open reading frames (ORFs). In addition, 3,782 contigs were assigned to 316 KEGG pathways which included melanogenesis, phototransduction, and retinol metabolism with 33, 15, and 11 contigs, respectively. Among the identified microsatellites, dinucleotide repeats were 68.84%, followed by trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides in proportions of 25.76, 9.40, 2.52, and 0.96%, respectively. This study will serve as a valuable resource for understanding the biology and function of canine retina.


Introduction
The retina is composed of a neural cell layer and a retinal pigment epithelial cell layer. Two types of photoreceptor cells, rods and cones, in the neural cell layer convert light signals to changes in membrane potential, organized through complex layers of the neural cells and transmitted to the brain through the fibers of the optic nerve [1,2]. The retinal pigment epithelium (RPE) plays an important role in supporting the function of the photoreceptor cells and serves as a bloodretina barrier [3]. Photoreceptor cells and the RPE are of interest physiologically as well as pathologically in relation to retinal degeneration [4][5][6][7]. Furthermore, photoreceptor cells serve as a model for the investigation of the development and differentiation of neural cells [8] and its normal function is dependent upon each cell type working properly in a coordinated fashion. Multiple disorders, that is, diabetes, agerelated macular degeneration, inherited retinal degeneration (IRD), cancer, and so forth, affect the retina and cause vision loss at all ages [9].
Massively parallel, high-throughput sequencing platforms have provided possibility for genome-wide observations of the transcriptional makeup of retina genes. The transcriptome is the complete set of transcripts in a cell at a specific stage or under given physiological condition [10]. Understanding cell development, physiology, and disease may be improved by genome-wide characterization of the retinal transcriptome [11,12]. High-throughput mRNA 2 Genetics Research International sequencing allows simultaneous transcript discovery and abundance estimation [13]. High-throughput sequencing data have a wide dynamic range of transcript expression for quantification and identification of rare transcripts within the constraints of the depth of coverage. With microarray technologies, transcripts can only be detected based on prior knowledge required for probe placement [14]. The size and structure of transcripts can be accurately measured by RNA-Seq, as compared to array hybridization, which does not provide any information on transcript size and splice variation. The microarray based gene expression of canine has been reported for lungs, brain, heart, kidney, liver, lymph node, pancreas, skeletal muscle, and spleen tissues [15]. The characterization of causative mutations for retinal blindness disorders has been of limited success due to poor availability of information on gene expression and underlying molecular mechanisms that trigger degenerative processes. To the best of our knowledge, there are no reports on transcriptome profiling of retinal tissue of dog. Hence, the present study was undertaken with the objective of developing a catalog of genes expressed in canine retina and their functional annotation.

Tissue Collection.
The retinal tissues of both eyes were taken from a female nondescript dog, approximately 4-5 years old, that had an automobile accident and succumbed during the treatment at the Department of Veterinary Surgery & Radiology, AAU, Anand, Gujarat, India. Tissues were washed with sterile phosphate buffer saline solution, transferred immediately in the "RNA later," and stored in liquid nitrogen for downstream processing.

mRNA Extraction, Library Preparation, and Sequencing.
Total RNA was extracted from 100 mg of both tissues using TRIzol (Invitrogen Life Technologies, CA) reagent as per the manufacturer's instructions. DNase treatment was given to remove the DNA contamination. The quantity and quality of RNA were evaluated using NanoDrop1000 spectrophotometer (Thermo Fisher Scientific) as well as Bioanalyzer 2100 (Agilent Technologies, CA). Total mRNA was isolated from total RNA sample using mRNA isolation kit (Roche Diagnostics, Switzerland) as described in the manufacturer's protocols. Total isolated mRNA was again quality checked on Bioanalyzer 2100 using RNA 6000 nano Chip kit (Agilent Technologies, CA). cDNA and library preparations were carried out using kits of 454 GS-FLX sequencing and Ion Torrent mRNA library preparation. Both platforms based cDNA libraries were sequenced on 454 GS-FLX and Ion Torrent PGM sequencers as per the manufacturer's instructions. The brief steps for sequencing are mRNA fragmentation, adapter ligation, cDNA preparation, emulsion PCR based library amplification, and library enrichment.

Read Mapping and Gene Expression
Analysis. The generated reads of both datasets were pooled and subjected to quality screening using PRINSEQ. Reads with less than 60 bases of read length, with mean read quality of less than 20, and with duplicate reads were removed [16]. The quality screened data was processed for the mapping and gene expression analysis. Reads were mapped to canine annotated genome assemblies using CLC Genomics Workbench 4.9 software. To quantify gene expression, the RNA-Seq analysis tool was used as previously described [17] allowing for no more than 2 mismatches per read. The annotated genome was downloaded from NCBI for canine genome build CanFam3.1.

Functional Annotation of Transcripts.
The genes that were expressed with RPKM (reads per kilobase of exon model per million mapped reads) value of ≥0.5 were taken for functional annotation. Genes which were expressed with RPKM < 0.5 were excluded from the functional annotation. The genes (gene ontology) were annotated using Database for Annotation, Visualization and Integrated Discovery (DAVID) Version 6.7 [18] and analyzed for gene enrichment using Functional Annotation Tool. Genes associated with retinal function were retrieved from RetNet retinal information network database (https://sph.uth.edu/retnet/home.htm).

Sequence Data Processing and De Novo
Assembly. The downstream analysis was carried out from total data obtained from 454 GS-FLX and Ion Torrent. The duplicate reads, chimeric reads, minimum length (<30 bp), quality mean read length (<20), and end trimming were performed using PRINSEQ tool [16]. Quality screened data was then processed for assembly. Reads were assembled using contig assembly program-3 (CAP3) [19] with default parameters (overlap similarity score cutoff 90, overlap percent identity cutoff 90, mismatch score (−5), and base quality cutoff 20).

Comparative Annotation of Assembled Contigs.
The annotations of contigs were carried out by BLASTx searches of all contigs with reference to the NCBI nonredundant (nr) database using Blast2GO [13] with -value cutoff of 1 * 10 −6 . The assembled sequences were subjected to KEGG pathways assignment using the online KEGG [20] Automatic Annotation Server (KAAS) (http://www.genome.jp/kegg/kaas/) Ver. 1.67x with default parameters. KEGG pathway analyses of contigs were performed on KASS server using Bidirectional Best Hit.

Full-Length cDNA Prediction.
All assembled transcripts (using CAP3) were submitted to in-house local BLASTx against protein sequence database of Canis lupus familiaris with -value cutoff of 10 −5 for identification of full-length cDNA. Prediction of full-length cDNAs was identified using online tool Target Identifier [21] with -value cutoff of 10 −5 . The cDNA sequence was recognized as a full-length cDNA only if it has the start codon (ATG) and poly(A) tail codon.

ORF Identification.
The assembled contigs were uploaded to online tool ORFPredictor [22] to identify open reading frames in the assembled contigs with an -value cutoff of 10 −5 .

Identification of SSR/Microsatellites and Repeat Elements.
The SSRs were identified from assembled sequences using SSR Locator [23] with threshold of 6 for di-and 5 for tri-, tetra-, penta-, and hexanucleotide repeats. . Among the reads mapping uniquely to protein coding genes, 34.05% located within exon reads and 14.63% on exonexon reads. Nearly 48.00% located within the introns and 2.75% in the exon-intron reads ( Table 2). The relatively high proportion of reads assigned to introns is not uncommon when the sequencing library preparation includes random priming of the mRNA [24].

Functional Annotation of Retina Expressed Genes (GO Analysis).
The functional annotations of genes expressed in retinal tissue were performed using DAVID 6.7 web based annotation tool [18] which provides dynamic, controlled vocabulary and hierarchical relationships for the gene products in three categories: biological process, molecular function, and cellular component. Gene enrichment of GO terms was significant ( value < 0.01) in biological pathway, molecular function, and cellular component. The biological process was enriched in a total of 28 GO terms, cellular component in 34 GO terms, and molecular function in 27 GO terms.

GO: Biological
Process. The enrichment of expressed genes in biological process was observed for 316 genes in 28 GO terms which ranged between 25 and 6 genes. The assignment of GO terms ( value < 0.01) included intracellular signaling cascade (35 genes), phosphate metabolic process and phosphorus metabolic process (29 genes), protein localization (28 genes), protein transport and establishment of protein localization (26 genes), phosphorylation (25 genes), protein amino acid phosphorylation (23 genes), sensory perception of light stimulus and visual perception (17 genes), small GTPase mediated signal transduction (15 genes), macromolecule catabolic process (14 genes), response to radiation and enzyme linked receptor protein signaling pathway (11 genes), and response to light stimulus (9 genes). However, cell adhesion and biological adhesion (18 genes), macromolecule catabolic process (14 genes), cellular macromolecule catabolic process (13 genes), intracellular transport (13 genes), and membrane organization (10 genes) were enriched with value ≤ 0.05 ( Figure 1, Additional File S1, Sheet 1, in Supplementary Material available online at http://dx.doi.org/10.1155/2015/638679). The enrichment of genes in GO terms, namely, sensory perception of light stimulus, visual perception, response to light stimulus, response to radiation, cell adhesion, and biological adhesion, was consistent with retinal transcriptome of aged human and rat [25,26].

Retina-Specific Gene Expression Profiles.
In an attempt to determine the retina associated expression of candidate genes, expression of 160 genes out of 221 genes (based on RetNet (https://sph.uth.edu/retnet/home.htm) gene list, January 2014) was detected. Among that, candidate genes like RHO, PDC, RPGR, CNGB1, SLC1A2, SLC1A3, SLC24A1, PRCD, PDE6G, PDE6A, PDE6B, PDE6H, and PDE6C were detected (expression values were presented in Additional File S1, Sheet 4). The GO analysis revealed that all of these genes were enriched in the categories of sensory perception of light stimulus, visual perception, response to radiation, and response to light stimulus (Table 3).

Assembly of Transcriptome and Comparative Analysis.
The reads were assembled using CAP3 which generated a total of 29,683 contigs with N50 of 619 and mean length of 560.9 bases ( Table 4). Most of the contigs were in the range between 400 bp and 500 bp (Figure 4). The assembled sequences were compared against the NCBI nr database (ftp://ftp.ncbi.nlm.nih.gov/blast/db) using BLASTx ( -value 1 * 10 −6 ). Of the 29,683 assembled sequences, 12,498 (42.10%) contigs had significant hits corresponding to a single or more than one unique accession number to the nr database. The 6 Genetics Research International   sequences hits with nr database using BLASTx were 42.10% with known functions.

Full-Length cDNA Prediction.
Full-length cDNAs are important resources for many genetic and genomic researchers and useful to predict protein sequences [27]. All contigs were analysed by online tool Target Identifier to identify potential full-length cDNAs with complete open reading frame (ORF) in assembled transcriptome of canine retina. A total of 3,827 full-length, 914 short full-length, 331 ambiguous, 2,932 partial (5 -sequenced partial), and 2,913 3 -sequenced partial sequences were identified with a cutoff -value of 10 −5 ( Figure 6).

Open Reading Frame (ORF) Identification.
Open reading frame identification in RNA-Seq is important in gene prediction and identification of candidate protein coding regions [27]. In the present study, out of 29,683 total assembled contigs of canine retina transcriptome, 29,418 (99%) open reading frames (ORFs) were identified with an average length of 285 bp ranging from 50 bp to 5,696 bp ( Figure 7). The remaining 265 contigs contained no ORFs, which indicates that these contigs were noncoding or originated from untranslated regions (UTRs).

KEGG Pathway Annotation.
In order to identify the active biological pathways in canine tissue, the assembled contigs were used to obtain the enzyme commission (EC) against the Kyoto Encyclopaedia of Genes and Genomes (KEGG) database. A total of 3,782 contigs were assigned to 3,570 enzyme commission (EC) numbers ( Table 5). The assignments of contigs with metabolism pathways were predominant. Enzymes involved in retinal tissue metabolism were further classified into 12 subcategories. There were considerably a higher number of enzymes participating in the metabolism of carbohydrate (238 pathways), lipid (167

Discussion
We performed RNA-Seq of canine retina using 454 GS-FLX and Ion Torrent PGM. Mapping identified the expression of 10,360 genes out of 28,455 reference genes of canine genome. The expressed genes were utilized for the gene ontology analysis and functional annotation.
In the retinal transcriptome, 316 genes were found to be enriched into the 28 molecular function GO category ( value < 0.01). The enrichment of genes in GO terms, namely, sensory perception of light stimulus, visual perception, response to light stimulus, response to radiation, cell adhesion, and biological adhesion, was consistent with retinal transcriptome of aged human and rat [25,26]. The enrichment of genes to these functional categories was further established by RetNet (https://sph.uth.edu/retnet/home.htm) database. These visual functions together serve as the series of events   Figure 5: BLASTx hits distribution with -value 10 −6 against NCBI nr database and sequence similarity (a) and species distribution (b).
required for an organism to receive a stimulus, convert it to a molecular signal, and recognize and characterize the signal. Cellular component refers to the place in the cell where a gene product is active. These terms reflect our understanding of eukaryotic cell structure. In the retinal transcriptome, 293 genes were found to be enriched into the 34 cellular component GO category ( value < 0.01). The cellular component GO terms assigned to retinal transcriptome of aged human and mice also showed the GO terms like plasma membrane in human [25] and rat [26]. Molecular function refers to the elemental activity or task performed, or potentially performed, by individual gene products. In the retinal transcriptome, 293 genes were found to be enriched into the 27 molecular function GO category ( value < 0.01).
The BLASTx search of contigs revealed 42% of the hits matching Canis lupus whereas 8.12% matched Homo sapiens. Further -value score and top hits distribution shows that maximum sequences were at the range of 90-100% similarity to reference. However, the poor annotation efficiency could be due to insufficient sequences in public databases for phylogenetically closely related species to date and limited sequence similarity of assembled contigs against NCBI nr database ( Figure 5). Additionally, sequences without annotations may represent poorly conserved regions (e.g., untranslated regions (UTRs)) in Canis lupus familiaris. These values were higher than those in the comparable BLAST results from most other published studies using shotgun generated de novo transcriptomes [28][29][30].
Full-length cDNAs are the valuable source of information for many genetic and genomic researchers and can be useful to predict the protein sequences [27]. A total of 3,827 fulllength cDNA sequences were identified which will serve as base for further cloning and gene expression analysis. A total of 29,418 (99%) ORFs were predicted from 29,683 contigs.  These predicted ORFs indicate that most of the contigs have a protein coding sequence and derived from the exonic region of genes. The average length of predicted ORFs was 285 bp. The assembled transcriptome contigs can serve as a reference for cSNPs (Coding SNP) identification from transcriptome data for multiple canine breeds. ORF analysis would enable us to discriminate synonymous and nonsynonymous SNPs and to identify nonsense mutations in canines. Next-generation sequencing has identified ORF in Anopheles funestus [31] and plant species [32]. However, there are no reports of predicted ORFs identification in canine to date.
The transcriptome sequencing provides an excellent source for mining and development of gene-associated markers [33,34]. Microsatellites or SSRs are molecular markers that are widely distributed in a genome. They consist of repeated core sequences of 2-6 base pairs in length. SSRs have proven to be an efficient tool for performing QTL analysis, constructing genetic linkage, and evaluating the level of genetic variation in a species on account of high diversity, abundance, neutrality, and codominance of microsatellite markers [35,36]. In the identified SSRs, the most dominant SSRs were dinucleotide repeats [(AG/GA) , (CT/TC) , (AC/CA) , (GT/TG) , (AT/TA) , and (CG/GC) ] followed by trinucleotides [GGC, GCC (12.16%), TCC (9.46%), GGA (8.78%), GTC (6.42%), and TTC] (Figure 8). Unlike dog retina, the dinucleotide repeat of AC/GT type is the most abundant in Liaoning cashmere goat [37] compared to other vertebrates [38] but is different from plants [39]. Shotgun sequencing has identified numerous SSRs in plant species [40]. However, there are no reports of SSRs in dog retinal sample in India. We excluded mononucleotide SSRs in our analysis because of the common homopolymer errors that can occur in 454 GS-FLX and Ion Torrent sequencing data. These SSRs markers may offer a valuable resource for genetic variation study and further genetic investigations to be required in large dataset.
Vision is one of the most fascinating mechanisms of the interactions of a biological system and the process of phototransduction where the electromagnetic radiation is converted into biologically recognizable signals by the retinal photoreceptor cell. The phototransduction cascade of vertebrate serves as a benchmark system in signal transduction for a number of light stimuli including the remarkable ability of rod cells to respond reliably to single photon [41,42]. In this study, we detected numerous contigs encoding EC number to phototransduction pathway. Strunnikova et al. [27] noted that several of the highly expressed signature genes encode proteins involved in visual cycle, melanogenesis, and cell adhesion in retinal pigment epithelium. The melanogenesis is essential for removal of toxic substances from the choroid and protects the retina from oxidative and chemical stress [43,44]. The retinol metabolism (biosynthesis of Vitamin A) is essential for the life of all chordates. It has numerous important functions including a role in vision, maintenance of epithelial surfaces, and immune competence [27]. Multiple genes have been characterized to encode the components of this cycle and linked to many human retinal diseases [45].

Conclusion
In the present study, we have analyzed transcriptome data and identified >10000 expressed genes in canine retinal tissue. The enrichment of genes in GO terms, namely, sensory perception of light stimulus, visual perception, response to radiation, and response to light stimulus, suggested the abundance of genes specifically present in retinal tissue and involved in vision related processes. Several highly expressed genes encoding proteins involved in melanogenesis, phototransduction, and retinol metabolism were identified in the retina. Moreover, a large number of cDNA SSRs were predicted which can be used for subsequent marker development, genetic linkage, and QTL analysis. Overall, the canine retina transcriptome represents a valuable resource for future functional and comparative genomic studies for effective way to treat vision related problem of this globally vulnerable species.