Application of Serial Analysis of Gene Expression to the Study of the Gene Expression Profile of Leishmania infantum chagasi Promastigote

This study describes the application of the LongSAGE methodology to study the gene expression profile in promastigotes of Leishmania infantum chagasi. A tag library was created using the LongSAGE method and consisted of 14,208 tags of 17 bases. Of these, 8,427 (59.3%) were distinct. BLAST searches with the 1,645 most abundant tags showed that 12.8% of them identified the coding sequences of genes, while 82% (1,349/1,645) identified one or more genomic sequences that did not correspond to open reading frames. Only 5.2% (84/1,645) of the tags did not align to any position in the L. infantum genome. The UTR size of Leishmania and the lack of CATG sites in some transcripts were decisive for the generation of tags in these regions. Additional analysis will allow a better understanding of the expression profile and the discovery of key genes in this life cycle stage.


Introduction
Visceral leishmaniasis (VL) is a disease caused by the protozoan Leishmania chagasi (in the New World) and L. infantum or L. donovani (in the Old World) transmitted between humans and other mammals through the bite of sand flies of the genera Lutzomyia and Phlebotomus. It is commonly accepted that L. infantum is genetically identical to L. chagasi [1], which is currently named L. infantum chagasi [2]. VL occurs in 65 countries, with an estimated annual incidence of 500,000 cases. The majority (90%) of these occur in rural areas and suburbs of large urban centers in five countries: Bangladesh, India, Nepal, Sudan, and Brazil [3].
Leishmania parasites have a life cycle characterized by a flagellated promastigote stage inside the gut of the insect vector and a nonflagellated amastigote stage within macrophages of the host [4]. Each of these life cycle stages presents a set of biochemical, genetic, and morphological traits that are either unique to trypanosomatids, such as mitochondrial minicircles, glycosomes, and RNA editing, or used to a greater extent than in other organisms, such as membrane proteins anchored by glycosyl phosphatidylinositol (GPI), polycistronic transcription, and trans-splicing, among others [5].
The publication of the genome sequences of L. major [6] and L. infantum [7] has enabled new approaches to the comparative study of gene expression profiles across different stages of the Leishmania life cycle [8][9][10] and during differentiation [11] and has fostered a better understanding of various aspects of its biology and pathogenesis.
The serial analysis of gene expression (SAGE) is a powerful methodology that can be used to obtain the complete gene expression profile of a cell or tissue. It is based on the following principles: first, a short nucleotide sequence of 9 or 10 base pairs (or tag) contains sufficient information to identify an mRNA transcript as unique. Second, the concatenation of dozens of tags enables serial analysis of multiple transcripts, by determining the sequence of multiple tags within a single clone [12].
A major advantage of SAGE is that it can be applied to study the profile of expressed genes in organisms whose genome is not yet sequenced or publicly available. Moreover, in the years since the original description of SAGE, changes have been reported in various stages of the protocol, some of which have enabled a reduction of the quantity of initial RNA [13] or an increase in the size of the tag generated [14][15][16]; these changes have made the method even more specific.
Recently, tag libraries that use SAGE (ShortSAGE) were constructed with the aim of identifying differentially expressed genes between the promastigote and axenic amastigote of L. donovani [9]. The comparison of these libraries showed that about 90% of genes were expressed in both stages. A total of 968 genes revealed statistically different transcript levels between promastigotes and axenic amastigotes. Of these, most (642) were derived from amastigotes [9]. Another application of SAGE in Leishmania was described by Guerfali et al. [10] to investigate changes in the transcriptome of L. major and human macrophages infected with L. major. The comparison between libraries of promastigotes and cultured macrophages that were infected or uninfected by the parasite led to the identification of human genes whose expression profile may be relevant to the pathogenesis of the disease. Regarding the parasite, genes showing differential expression in the intracellular stage were also identified, including amastins, oxidative stress proteins, and several ribosomal proteins [10].
One of the main problems regarding the application of SAGE in Leishmania sp. is the lack of a database that maps tags to genes. The absence of such data could be a result of difficulties in the mapping process. It is believed that characteristics inherent to the Leishmania sp. genome (such as the percentage of GC bases and the sizes of the coding regions (open reading frames, ORFs) and of the 5′ and 3′ untranslated regions (UTRs)) can negatively influence the process of mapping a tag to a gene. In addition, one characteristic of the SAGE methodology is that tags can be generated from any transcript region where there is a site for the anchoring enzyme NlaIII [12]. Thus, depending on the size of the 3′ UTR, it is possible that many tags are generated from this region, without corresponding to ORFs. Additionally, depending on their size and base composition, it is possible that some ORFs do not contain a site for the anchoring enzyme, generating tags that are exclusive to UTRs.
This study describes the construction of a tag library of L. i. chagasi promastigotes using LongSAGE. It analyzes the role of genomic and methodological characteristics that may influence tag generation and tag-to-gene mapping and evaluates the applicability of the methodology for studying gene expression in Leishmania.
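The positional effect described above can be illustrated with a minimal in silico sketch in Python. It assumes a mature transcript supplied as a plain DNA string with known ORF coordinates; the function name `longsage_tag` and the toy sequences are hypothetical and are not part of the original study. Because the cDNA is captured through its poly(A) tail, the sketch takes the 3′-most CATG and reports the 17 bases downstream of it, together with the region in which the site falls.

```python
ANCHOR = "CATG"  # NlaIII recognition site
TAG_LEN = 17     # bases kept downstream of the anchoring site


def longsage_tag(transcript, orf_start, orf_end):
    """Return (tag, region) for the 3'-most CATG, or None if no tag is possible.

    orf_start and orf_end are 0-based, end-exclusive ORF coordinates
    within the transcript (an assumption of this sketch).
    """
    seq = transcript.upper()
    site = seq.rfind(ANCHOR)          # 3'-most anchoring site
    if site == -1:
        return None                   # transcript has no NlaIII site
    tag = seq[site + len(ANCHOR): site + len(ANCHOR) + TAG_LEN]
    if len(tag) < TAG_LEN:
        return None                   # site too close to the 3' end
    if site < orf_start:
        region = "5'UTR"
    elif site < orf_end:
        region = "ORF"
    else:
        region = "3'UTR"
    return tag, region


# Toy transcript: 4-base 5' UTR, 16-base ORF, 3' UTR carrying its own CATG.
utr5 = "GGGG"
orf = "ATGAAACATGCCCTAA"
utr3 = "TTTCATGACGTACGTACGTACGTAGGG"
print(longsage_tag(utr5 + orf + utr3, 4, 20))
# ('ACGTACGTACGTACGTA', "3'UTR")
```

In this toy case the ORF contains a CATG, yet the tag still derives from the 3′ UTR because a second site lies further downstream; the longer the 3′ UTR, the more likely this becomes.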

Leishmania Cultivation.
Strain 2230 promastigotes of L. i. chagasi were used. The sample was isolated from a patient diagnosed with VL at the Natan Portella Institute for Tropical Diseases (IDTNP), Teresina, Piauí, Brazil. The promastigotes were grown in Schneider medium (Sigma) supplemented with 10% (v/v) fetal bovine serum (FBS) and 2% (v/v) human urine at pH 7.2 and maintained at 25 °C.

RNA Extraction and Assessment.
Total RNA was extracted from approximately 8 × 10⁷ promastigotes using TRIzol LS Reagent (Invitrogen, Frederick MD, USA) according to the manufacturer's protocol. The integrity of the RNA was assessed on a 1% agarose gel under denaturing conditions (MOPS/formamide), and quantification was carried out using a spectrophotometer at 260 nm. About 30 µg of total RNA was used in the procedure.

SAGE Procedure for L. i. chagasi Promastigotes.
The I-SAGE Long kit (Invitrogen) was used to construct the L. i. chagasi library according to the manufacturer's recommendations. The main steps were as follows: polyadenylated mRNAs were captured by oligo(dT) linked to magnetic beads and used for cDNA synthesis. The cDNA was digested with 60 U of NlaIII (anchoring enzyme), and the 3′ ends of the cDNAs were isolated using the beads. The resulting 3′ cDNA was divided into two equal portions and ligated to two LongSAGE adapters, A and B. The long tags were released by the enzyme MmeI and ligated to form ∼130 bp ditags. Dilutions of 1:40 of the product were amplified in 27 PCR cycles (300 reactions in total). The precipitated PCR products were separated on a 12% polyacrylamide gel, the 130 bp bands were cut out, and the precipitated DNA was digested with NlaIII. The digestion products were separated on a 12% polyacrylamide gel, and ∼34 bp ditags were cut out and ligated to form concatemers. The concatemers were separated on a 6% polyacrylamide gel, and the fractions of 250-500 and 500-800 bp were isolated and cloned into the pZErO-1 vector digested with SphI. Vectors with concatemers were introduced into E. coli DH10B by electroporation, and the isolated recombinant vectors were used as templates for sequencing in the ABI Prism 3100 (Applied Biosystems).

Sequence Analysis and Database.
Raw sequences were first analyzed with the EditSeq tool (Lasergene, v4.01, DNASTAR Inc., Madison, WI) to confirm the presence of tags, identifiable by the occurrence of the CATG motif (NlaIII site) at 34 bp intervals. SAGE tags were then directly extracted from the sequencing files using the program SAGE2000 (Version 4.5), which also looks for duplicates and quantifies the individual tags.
To identify the gene corresponding to each Leishmania tag, conventional BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) was carried out with the 1,645 tags that had two or more copies in the library against the L. infantum nucleotide collection. When a tag presented 100% sequence identity with a genomic sequence, the closest ORF (gene ID and name), its distance from the tag in bp, and the orientation were noted.
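The extraction step can be sketched in a few lines of Python. This is an illustration of the general principle only, not the SAGE2000 implementation: it assumes each read is a concatemer in which ditags are delimited by CATG, that each ditag holds one 17-base tag at its 5′ end and a second tag, as its reverse complement, at its 3′ end, and that ditags of 32 to 38 bases are acceptable (a hypothetical tolerance around the nominal 34 bp). Repeated ditags are counted once, on the usual assumption that they are PCR duplicates.

```python
from collections import Counter

TAG_LEN = 17
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_tags(reads, min_len=32, max_len=38):
    """Count 17-base tags in concatemer reads, skipping duplicate ditags."""
    seen_ditags = set()
    tags = Counter()
    for read in reads:
        for ditag in read.upper().split("CATG"):
            if not (min_len <= len(ditag) <= max_len):
                continue  # vector sequence, truncated or fused ditags
            if ditag in seen_ditags or revcomp(ditag) in seen_ditags:
                continue  # same ditag already counted (either orientation)
            seen_ditags.add(ditag)
            tags[ditag[:TAG_LEN]] += 1            # tag read left to right
            tags[revcomp(ditag)[:TAG_LEN]] += 1   # tag from the other end
    return tags


# Toy concatemer with two distinct ditags; the first one is repeated.
d1 = "A" * 17 + "G" * 17
d2 = "ACGTACGTACGTACGTA" + "A" * 17
counts = extract_tags(["CATG" + d1 + "CATG" + d2 + "CATG" + d1 + "CATG"])
print(sum(counts.values()), len(counts))  # 4 4
```

The ratio of `len(counts)` to `sum(counts.values())` is the fraction of distinct tags in a library, which is the quantity reported for the library as a whole.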
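The bookkeeping involved in this mapping step can be sketched as follows. The sketch substitutes an exact string search on both strands for BLAST, which is a simplification that is only equivalent for 100% identity matches over the full tag; the function names, the coordinate convention (0-based, end-exclusive), and the toy genome and ORF table are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_tag(genome, tag):
    """Return (start, end, strand) for every perfect match on either strand."""
    hits = []
    for strand, query in (("+", tag), ("-", revcomp(tag))):
        start = genome.find(query)
        while start != -1:
            hits.append((start, start + len(query), strand))
            start = genome.find(query, start + 1)
    return hits


def nearest_orf(hit, orfs):
    """Return (orf_name, distance_bp) for the ORF closest to a hit.

    orfs is a list of (name, start, end); the distance is 0 when the
    hit overlaps the ORF.
    """
    start, end, _ = hit

    def gap(orf):
        _, orf_start, orf_end = orf
        if end <= orf_start:
            return orf_start - end   # hit lies upstream of the ORF
        if start >= orf_end:
            return start - orf_end   # hit lies downstream of the ORF
        return 0

    best = min(orfs, key=gap)
    return best[0], gap(best)


genome = "T" * 10 + "GATTACAGATTACAGAT" + "C" * 10
orfs = [("geneA", 0, 5), ("geneB", 40, 60)]
for hit in find_tag(genome, "GATTACAGATTACAGAT"):
    print(hit, nearest_orf(hit, orfs))
# (10, 27, '+') ('geneA', 5)
```

A tag that returns more than one hit corresponds to the multi-region case, where the tag alone cannot distinguish between gene copies or family members.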

Gene Identification by BLAST.
The BLAST search showed that 82% (1,349/1,645) of the tags aligned to one or more positions of the L. infantum genome that do not correspond to ORFs. Of these, 42% (690/1,349) aligned closer to the 5′ end of a gene, while the remainder (659) aligned closer to the 3′ end. The mean distance between the tag and the closest gene was 718 bp for both regions. About 80% of tags were less than 1 kb away from the closest gene (Table 3). About 75% of these tags aligned to just one region in the genome, whereas 8.5% (141/1,645) of them identified more than one region that, in most cases, flanked gene copies or genes belonging to the same family. Only [...] is an anchoring-enzyme site or a different 5′ or 3′ sequence generated by alternative splicing or cleavage. Thus, the expression level of a gene corresponds to the sum of the counts of the different tags that identify it. This concept was used to identify the genes most expressed by L. i. chagasi. It shows that the number of times that a tag appears in the library is not the most important factor, but rather the total number of times that the gene it identifies appears in the library, which justifies the need for mapping all the tags in the library. With this strategy, the most expressed genes for L. i. chagasi were those listed in Table 4, which include translation elongation factors, several members of the histone family (H1, H3, H4), several members of the ribosomal protein family (S and L), tubulins, heat shock proteins (hsp 70 and hsp 83), and hypothetical proteins.
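The gene-level summation described above amounts to a simple aggregation, sketched here in Python. The tag identifiers, gene names, and counts are invented for illustration, and `tag_to_gene` stands for whatever tag-to-gene assignment the mapping step produced.

```python
from collections import Counter


def gene_expression(tag_counts, tag_to_gene):
    """Sum the counts of all tags assigned to the same gene."""
    totals = Counter()
    for tag, count in tag_counts.items():
        gene = tag_to_gene.get(tag)
        if gene is not None:          # unmapped tags are left out
            totals[gene] += count
    return totals


# Hypothetical library: tag3 is the most abundant single tag,
# but geneA is the most expressed gene once its two tags are summed.
tag_counts = {"tag1": 5, "tag2": 4, "tag3": 7, "tag4": 2}
tag_to_gene = {"tag1": "geneA", "tag2": "geneA", "tag3": "geneB"}
print(gene_expression(tag_counts, tag_to_gene).most_common())
# [('geneA', 9), ('geneB', 7)]
```

The toy example shows why ranking by individual tag count can differ from ranking by gene, and why unmapped tags limit the completeness of the gene-level profile.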

Comparison between Libraries.
Compared with the SAGE libraries of L. donovani and L. major promastigotes [9,10], the results were highly consistent. In all libraries the most expressed genes were those related to cellular metabolism, such as histones, tubulins, ribosomal proteins, and elongation factors. However, in the L. i. chagasi library we found tags matching genes encoding proteins not found in the other libraries, such as heat shock proteins, an ATP-dependent zinc metallopeptidase, and more genes encoding hypothetical proteins. These proteins may play an important and possibly not yet identified role for the parasite. It is important to verify their sequence, expression, and cellular location to assess their cellular function.

Discussion
SAGE is a sequence-based technique for studying global gene expression that enables researchers to obtain the complete gene expression profile of a cell or tissue, even if its genes are not previously known [12]. At least two variants of traditional SAGE (14 bp tag) have been developed to improve the specificity of the technique with respect to mapping tags to genes: LongSAGE, which generates tags of 21 bp (with CATG) [14], and SuperSAGE, which generates tags of 26 bp (with CATG) [16].
In this study, a tag library of L. i. chagasi promastigotes constructed with LongSAGE was presented. A total of 14,208 long tags were generated, of which 8,427 were distinct. This is the first description of a SAGE tag library in L. i. chagasi and is the only one for LongSAGE in Leishmania. The advantage of LongSAGE over conventional SAGE is that the information content of a 21 bp LongSAGE tag is appreciably higher than that of a 14 bp tag of conventional SAGE [12].
As shown, more than 80% of the tags originated from noncoding sequences. This characteristic seems to be more pronounced in Leishmania sp. than in other organisms. In a SuperSAGE library (tags of 26 bp) from human bone marrow, about 90% of the tags identified the corresponding gene by coding sequence. The high percentage of tags aligning to noncoding sequence in Leishmania raises an important question: why do the vast majority of the tags align perfectly to genomic sequence but not to coding sequence? An attempt was made to answer this question by analyzing the following factors: (1) it is possible that not all ORFs have a site for the anchoring enzyme NlaIII, which is used to delimit the tags, while others may have more than one site; (2) characteristics such as the base composition and relative size of the UTRs and ORF of each transcript can determine the tag position.
To investigate the impact of ORFs without an NlaIII site on the results, 7,993 ORFs of L. infantum deposited in NCBI (www.ncbi.nlm.gov/mapview) were downloaded and analyzed for the presence (and quantity) or absence of the CATG site. The mapping of CATG sites was restricted to coding sequences and identified 251 ORFs (3.2% of coding sequences in the genome) without the enzyme site. This and the results obtained from other genomes (such as L. major and Plasmodium falciparum) showed that, as expected, the smaller the ORF and the lower its GC content, the lower the occurrence of the enzyme site. Certainly, the small percentage of ORFs with no CATG site does not account for the large number of tags (80%) generated from noncoding sequence, but it identifies transcripts that can generate tags exclusively from their 3′ or 5′ UTR sequences.
It is known that UTRs play crucial roles in regulating posttranscriptional gene expression, including the efficiency of nuclear export, subcellular localization, mRNA stability, and translation efficiency [17]. In higher eukaryotes, the mean size of the 5′ UTR varies from 100 to 200 nucleotides, while for the 3′ UTR the average size can be as high as 800 nucleotides [17,18]. Compared with higher eukaryotes, the 5′ and 3′ UTRs of trypanosomatids are larger. The 5′ UTR can vary from 100 to 300 bp, while the 3′ UTR may have a size of 1.5 to 2 kb [6]. The size differences between the 3′ UTRs of humans and Leishmania sp. may partly explain the fact that over 80% of L. i. chagasi tags aligned with sequences other than ORFs.
As shown, the observed mean distance between tags and their nearest ORF was the same (718 bp) for both the 5′ and 3′ regions, indicating that (1) it is possible that many tags mapped closer to the 5′ end of a gene are, in fact, located within the 3′ UTRs of upstream ORFs or (2) the 5′ UTR sequences of Leishmania sp. are longer than initially anticipated [6]. In Trypanosoma brucei, the average 5′ UTR length (based on the splice-acceptor site) was 184 bp and the average predominant 3′ UTR length was 604 bp [19]. Another possibility is that these (especially the largest) can represent unpredicted ORFs.
Since most tags identified a 5′ or 3′ UTR, the corresponding genes need to be confirmed. Gene annotation or confirmation using the tags that did not match L. infantum ORFs as primers in RAGE [20] or GLGI [21] and the building of a database of accurately annotated tags will be the focus of the next steps in this study.
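A scan of this kind can be sketched in Python as below. The sketch assumes the coding sequences are already loaded into a dictionary keyed by gene ID (file parsing is left out), and the three toy ORFs are invented. Because CATG is its own reverse complement, scanning one strand is sufficient, and because the motif cannot overlap itself, `str.count` gives the exact number of sites.

```python
def catg_site_summary(orfs):
    """Count NlaIII (CATG) sites per ORF and list the ORFs without any.

    orfs maps a gene ID to its coding sequence (a hypothetical input format).
    Returns (sites_per_orf, orfs_without_site, fraction_without_site).
    """
    sites = {name: seq.upper().count("CATG") for name, seq in orfs.items()}
    without = sorted(name for name, n in sites.items() if n == 0)
    fraction = len(without) / len(orfs) if orfs else 0.0
    return sites, without, fraction


toy_orfs = {
    "orf1": "ATGCATGTAA",       # one site
    "orf2": "ATGAAATAA",        # no site: any tag must come from a UTR
    "orf3": "ATGCATGCATGTAA",   # two sites
}
sites, without, fraction = catg_site_summary(toy_orfs)
print(sites)    # {'orf1': 1, 'orf2': 0, 'orf3': 2}
print(without)  # ['orf2']
```

Applied to a full set of coding sequences, the same function returns the number of ORFs lacking the site and their share of the total.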
Many studies have examined the Leishmania transcriptome using DNA microarrays [22,23]. However, unlike SAGE, the DNA microarray is a closed platform. Recently, new software tools for analysis and sequence alignment have been described [24,25], but none permits the comparison of SAGE libraries. In addition, the digital format of SAGE data enables the truncation of 17-base tags to 10-base tags. This makes it possible to compare libraries built at different times and places, a double benefit of LongSAGE over DNA microarrays.
As with the libraries of L. donovani and L. major [9,10], the most expressed genes in the L. i. chagasi library were those with constitutive expression, such as those related to DNA and protein metabolism: histones, translation elongation factors, ribosomal proteins, ubiquitin, and tubulin. However, at least two groups of genes were more expressed in L. i. chagasi than in the other libraries: genes encoding heat shock proteins (HSP-83 (LinJ.33.0350) and HSP-70 (LinJ.28.3000)) and an ATP-dependent zinc metallopeptidase (LinJ.18.0620). The significance of the high expression of these genes is still not fully understood. It is known that HSP-70 and HSP-83 are constitutively expressed, but the accumulation of mRNA and the increased rate of translation are only found at high temperatures [26], such as the one that the parasite encounters in the vertebrate host. Moreover, zinc metallopeptidases are numerous in Leishmania [6]. The main one, GP63, is more expressed in metacyclic promastigotes and plays an important role in the resistance to lysis mediated by the complement system in the early stages of infection [27]. It is possible that the metallopeptidase gene identified in the L. i. chagasi library plays a role similar to that of GP63. In addition, in the L. i. chagasi library we found tags matching genes encoding hypothetical proteins with no ortholog found in the other libraries, except one (LinJ.13.0270), whose ortholog (LmjF.13.0450) was abundantly expressed in L. major promastigotes. These proteins may play an important and possibly not yet identified role for these parasites. It is important to verify their sequence, expression, and cellular location to assess their cellular function.
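The truncation mentioned above can be sketched in Python as follows. The sketch assumes tags are stored 5′ to 3′ starting immediately after the CATG site, so that the first 10 bases of a 17-base tag correspond to a short SAGE tag; the tag sequences and counts are invented.

```python
from collections import Counter


def truncate_library(tag_counts, length=10):
    """Collapse long tags to their first `length` bases, summing the counts."""
    short = Counter()
    for tag, count in tag_counts.items():
        short[tag[:length]] += count
    return short


long_library = {
    "ACGTACGTACGTACGTA": 3,
    "ACGTACGTACTTTTTTT": 2,   # same first 10 bases as the tag above
    "TTTTTTTTTTTTTTTTT": 1,
}
print(truncate_library(long_library))
# Counter({'ACGTACGTAC': 5, 'TTTTTTTTTT': 1})
```

As the toy example shows, distinct long tags can collapse into a single short tag, so the truncated library gains comparability with short-tag libraries at the cost of some specificity.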
A weakness of our study is that it presents results from a library of L. i. chagasi promastigotes. Although it does not reveal the gene expression profile of the form found in the vertebrate host (amastigotes) and thus makes little contribution to understanding the pathogenesis of VL, it is of great value for understanding the biology of Leishmania in the insect vector. From the methodological point of view, LongSAGE is a powerful tool for studying gene expression in Leishmania.

Conclusions
In conclusion, a tag library constructed with the LongSAGE method from promastigotes of the parasite L. i. chagasi was presented. It was shown that characteristics of the method and of the Leishmania sp. genome itself (such as UTR and ORF sizes) may have negative impacts on the process of mapping tags to genes.
LongSAGE revealed the gene expression profile in promastigotes of L. i. chagasi in culture. The most expressed genes were related to cellular metabolism, whereas genes preferentially expressed in the amastigote stage were also represented in the promastigote form. Additional analysis will allow a better understanding of the expression profile and the discovery of key genes in this life cycle stage.