Cross-Genome Comparisons of Newly Identified Domains in Mycoplasma gallisepticum and Domain Architectures with Other Mycoplasma species

Accurate functional annotation of protein sequences is hampered by important factors such as the failure of sequence search methods to identify relationships and the inherent diversity in function of proteins related at low sequence similarities. Earlier, we employed an intermediate sequence search approach to establish new domain relationships in the unassigned regions of gene products at the whole-genome level, taking Mycoplasma gallisepticum as a specific example. In this paper, we report a detailed comparison of the conservation status of the domains and domain architectures of the gene products that bear our newly predicted domains amongst 14 other Mycoplasma genomes and discuss the probable implications for the organisms. Some of the domain associations, observed in Mycoplasma that afflict humans and non-human primates, are involved in regulation of solute transport and DNA binding, suggesting specific modes of host-pathogen interactions.


Introduction
Progress in DNA sequencing technology has produced the whole genomes of many important organisms including humans. The proper utilization of such sequence information requires understanding of the function of each protein in the database. The ever-increasing gap between the number of sequences deposited in databases and the numbers with accurate functional annotation is a big concern to the scientific community. The goal of functional genomics is to determine the function of proteins predicted from the sequencing projects [1,2]. To reach this goal, computational approaches can assist in the classification of functional genomics targets.
Functional and evolutionary relationships can be inferred from sequence comparisons, especially at high sequence identities. The established computational methods for function detection depend primarily on homology matching to genes with known functions, employing programs such as FASTA [3] and BLAST [4].
Nevertheless, establishing homology is not straightforward and provides limited coverage. Over the past few years, many new methods have emerged to organize proteins; some of them are highly automated, and others are curated. Position-specific iterative BLAST (PSI-BLAST) can be used to extend the search to distantly related homologues [5]. Some of the other methods rely on the hierarchical classification of proteins into families, such as the superfamilies/families in the PIR-PSD [6] and the protein groups in ProtoMap [7]. A few other methods organize proteins into families of domains, such as Pfam [8] and SMART [9]. Others rely on sequence motifs or conserved regions, such as in PROSITE [10] and PRINTS [11]. Databases like CATH [12], SCOP [13], and FSSP [14] employ structural data to organize proteins into domains. Others are integrations of various family classifications, such as InterPro [15]. However, each of these databases is useful for particular needs, most of them rely on high sequence similarity for accurate function annotation transfer, and no classification scheme is by itself adequate for addressing all genomic annotation needs [16]. The Gene Ontology (GO) consortium provides a controlled vocabulary to describe the function of a protein [17].
Identification of domains at the sequence level most often relies on the detection of global and local sequence alignments between a given target sequence and domain sequences found in databases such as Pfam [8] and SMART [9]. However, sequence-based methods often fail under low sequence identity conditions. The intermediate sequence approach has been shown to be more effective in enhancing the coverage of homology searches and in connecting remotely related proteins of common function [18]; about 70% improvement over direct search is possible using this method [18]. Using a similar approach for domain assignment, we showed earlier that domain assignment could be substantially enhanced in the family of genes containing adenylyl cyclases [19]. This computation-intensive search protocol, PURE, was further developed as a web tool [20]. Next, we implemented our method at the whole-genome level, taking the small-genome organism Mycoplasma gallisepticum as a specific example [21]. This paper reports cross-genome comparisons of 14 Mycoplasma genomes to study the conservation of domains and domain architectures involving the new domain associations that we identified in Mycoplasma gallisepticum.
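The intermediate sequence idea described above can be illustrated with a minimal sketch. It is not the PURE implementation: the function name `assign_via_intermediates`, the two dictionaries, and all identifiers in the toy data are hypothetical and stand in for parsed similarity-search hits and parsed direct domain assignments.

```python
from collections import Counter


def assign_via_intermediates(query, homologues, direct_domains):
    """Collect candidate domains for `query` through intermediate sequences.

    homologues:     dict mapping a sequence ID to the IDs it hits in a
                    similarity search (hypothetical parsed search output).
    direct_domains: dict mapping a sequence ID to the domains found for it
                    by a direct profile search (hypothetical parsed output).
    Returns a Counter: domain -> number of intermediates supporting it.
    """
    support = Counter()
    for intermediate in homologues.get(query, []):
        # Count each domain once per intermediate, even if it is repeated.
        for domain in set(direct_domains.get(intermediate, [])):
            support[domain] += 1
    return support


# Toy data (illustrative only): the query has no direct domain hit,
# but two of its three intermediates carry the same domain.
homologues = {"query_1": ["inter_A", "inter_B", "inter_C"]}
direct_domains = {
    "query_1": [],
    "inter_A": ["Peptidase_M23"],
    "inter_B": ["Peptidase_M23"],
    "inter_C": [],
}

candidates = assign_via_intermediates("query_1", homologues, direct_domains)
print(candidates.most_common())
```

Counting the number of supporting intermediates is one simple way to rank candidate domains; a real protocol would also need alignment-based validation of each indirect relationship.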
As shown in the earlier papers, the PURE approach is effective in establishing remote domain relationships [20,21] and can be useful when the user fails to assign domains to a sequence by using direct search methods like Pfam [8]. We also showed, by comparing different versions of the Pfam database, that the PURE approach can give a good hint at the domains that are going to be assigned in the updated Pfam database [19].
Mycoplasma constitutes a unique group of bacteria best characterized as lacking peptidoglycan and having one of the smallest genomes of all free-living prokaryotes. Members of this group also represent important pathogens of humans, animals, and plants. Over the last few years, the genomes of many Mycoplasma species have been sequenced, reinforcing comparative genome studies that permit a better understanding of their metabolism and their relations with their hosts. Phylogenetic analyses indicate that Mycoplasmas have undergone a degenerative evolution from related, low G+C content, Gram-positive eubacteria [22,23]. Mycoplasmas possess no complete routes for amino acid synthesis and degradation, implying that these monomers must be acquired either from their hosts or from a culture medium, depending upon membrane transporters [24]. Exogenous peptides are an important source of amino acids. Indeed, bacteria have evolved peptide transport systems that also assist in responses to environmental changes, mediating functions such as quorum sensing, sporulation, pheromone transport, and chemotaxis [25].

Materials and Methods
Complete protein sequences of 14 different Mycoplasma genomes were obtained from the National Center for Biotechnology Information website [26]. The species considered in our study include Mycoplasma gallisepticum strain R (total number of proteins in the genome: 726) and Mycoplasma genitalium strain G37 (477) (Table 1).
We assigned domain regions to the Mycoplasma gallisepticum protein sequences by scanning them against the HMM profiles in the Pfam-A database (version 21.0) [8], which consists of 8957 families, using the standalone version of Hmmpfam from the HMMER suite [28] with an E-value cutoff of 0.1.
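Once domain hits have been filtered at the E-value cutoff, the unassigned regions of a protein are the stretches left uncovered by any accepted hit. The following is a minimal sketch of that bookkeeping, not the code used in this study. The function name `unassigned_regions`, the hit tuple layout, and the `min_len` threshold of 30 residues are assumptions made for illustration; only the 0.1 E-value cutoff comes from the text above.

```python
def unassigned_regions(length, hits, evalue_cutoff=0.1, min_len=30):
    """Return uncovered stretches of a protein as (start, end) tuples.

    length: protein length in residues.
    hits:   iterable of (start, end, evalue) domain hits, 1-based inclusive
            (hypothetical layout for parsed profile-search output).
    min_len: shortest stretch reported (an assumed, illustrative threshold).
    """
    # Keep only hits that pass the E-value cutoff, ordered along the protein.
    accepted = sorted((s, e) for s, e, ev in hits if ev <= evalue_cutoff)

    regions = []
    next_free = 1  # first residue not yet covered by an accepted hit
    for start, end in accepted:
        if start - next_free >= min_len:
            regions.append((next_free, start - 1))
        next_free = max(next_free, end + 1)

    # Stretch between the last accepted hit and the C-terminus.
    if length - next_free + 1 >= min_len:
        regions.append((next_free, length))
    return regions


# Toy protein of 400 residues; the third hit fails the E-value cutoff.
toy_hits = [(50, 150, 1e-20), (140, 220, 0.05), (300, 350, 2.0)]
print(unassigned_regions(400, toy_hits))
```

Overlapping accepted hits are merged implicitly by tracking the first uncovered residue, so the result does not depend on how the hits overlap.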
The HMMTOP [29] server was used for transmembrane helix prediction, and a standalone version of the COILS [30] program was used for coiled-coil region prediction. We used PSI-BLAST [5] (with three iterations and an expectation cutoff value of 0.001) to search for similar sequences. During the BLAST searches, the low-complexity filter was turned on. The nonredundant database [31] was used for sequence similarity searches. A standalone version of PSIPRED [32] was used for secondary structure prediction. Multiple sequence alignments were performed using the CLUSTALW program [33].

Results and Discussion
Earlier analysis revealed 71 new domain relationships in the Mycoplasma gallisepticum genome, which correspond to 62 unassigned regions [21]. 22 domains, which are in the border [34], wherein some of these domains might have lost their function and diverged beyond recognition by direct search methods. The intermediate sequences through which these domain relationships are established are predominantly prokaryotic in origin and have relatively few hits in the PSI-BLAST search. Our analysis also revealed the presence of extra copies of domains such as RMMBL, Lactamase B, ABC membrane, ABC tran, Lipoprotein X, SBP bac 5, ATP synt ab N, Helicase C, tRNA anti, and GTP EFTU in the Mycoplasma gallisepticum genome. Because of the limited coding capacity of their genomes, Mycoplasmas lack many enzymatic pathways characteristic of most bacteria; consequently, Mycoplasma genes encode many proteins with functions related to catabolism and metabolite transport while encoding few anabolic proteins [35]. Most of these newly predicted domains are related to transport functions. Despite low sequence identities, these domains could have critical functions in nutrient transport. Some of the interesting examples are explained below.
Protein NP_853190.1 was a completely unassigned protein. Our method predicted a peptidase M23 (Peptidase family M23) domain relationship in the protein. Members of this family are zinc metallopeptidases and have a characteristic HxH motif [36], and the current gene product also preserves this functional motif in the unassigned region. We found this domain in Mycoplasma gallisepticum only through indirect searches, and the unassigned sequence has less than 20% sequence identity with typical peptidase M23 members, albeit with few indels in the alignment (Figure 1). The low sequence identity could explain why this region is not associated with the domain in direct searches. The peptidase M23 domain is present in only two other Mycoplasma members (Mycoplasma mobile and Mycoplasma pulmonis). Interestingly, the chaperonin (cpn60 or GroEL) domain is absent from these species but is present in the Mycoplasma gallisepticum genome. Peptidases and chaperonins are components of protein homeostatic mechanisms. Molecular chaperones promote protein folding and prevent protein misfolding and aggregation, while certain proteases function primarily to degrade improperly folded proteins [37,38]. It has been hypothesized that the protein homeostatic process in Mollicute organisms has shifted through evolution towards favoring protein degradation rather than protein folding [39]. Peptidase M23 is present in M. mobile and M. pulmonis (Figure 2) along with other peptidases, whereas GroEL is completely absent from these genomes; this may explain the need for more peptidases to degrade improperly folded proteins. In M. gallisepticum, by contrast, the presence of GroEL may reduce the pressure on peptidases like peptidase M23, and the sequence could have diverged substantially.
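Checking whether a predicted region retains the HxH motif can be done with a simple pattern scan. The sketch below is illustrative only: the function name `find_hxh` and the toy sequences are hypothetical, and a bare HxH pattern occurs frequently by chance, so a match is supporting evidence only when it falls at the expected position in an alignment with known family members.

```python
import re

# H, any residue, H. The lookahead makes overlapping matches visible,
# e.g. "HAHAH" contains HxH at positions 1 and 3.
HXH = re.compile(r"(?=(H[A-Z]H))")


def find_hxh(sequence):
    """Return (1-based position, tripeptide) for every HxH occurrence."""
    return [(m.start() + 1, m.group(1)) for m in HXH.finditer(sequence.upper())]


# Toy sequences, not taken from any real protein.
print(find_hxh("AAHGHAA"))
print(find_hxh("HAHAH"))
print(find_hxh("MKTLLV"))
```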
The full-length sequence NP_852865.1 was unassigned; that is, no sequence domains were observed and recorded. Our method indirectly assigned amino-terminal Lactamase B and carboxy-terminal RMMBL (RNA-metabolizing metallo-beta-lactamase) domains in the sequence. The initial PSI-BLAST search against the nonredundant database picked up sequences belonging to more than 100 different species, including Homo sapiens, at very low expectation values. In the Hmmpfam search, all the hits showed identical domain architectures, with amino-terminal Lactamase B and carboxy-terminal RMMBL domains and very good E-values. The metallo-beta-lactamase fold contains five sequence motifs. The first four motifs are found in Lactamase B (PF00753) and are common to all metallo-beta-lactamases. The fifth motif appears to be specific to function. RMMBL represents the fifth motif from metallo-beta-lactamases involved in RNA metabolism.
Multiple sequence alignment of predicted regions with typical Lactamase B and RMMBL (Figures 3 and 4) conserved. The domains and domain architecture are conserved across Mycoplasmataceae members (Figure 5). It has been documented that the paralogs present in Mycoplasma genitalium (MG139 and MG423) and Mycoplasma pneumoniae (MPN280 and MPN261), along with those in other bacteria [40,41], could be inactive forms. These inactive forms could be confined to a modulatory function that helps regulate enzymatic activity, as already suggested by Aravind [41]. Acquisition of new functions beyond the ancestral enzymatic one is also possible [41]. Owing to low sequence identities (<20%) with typical Lactamase B members, only one copy of the Lactamase B domain was initially recognized in the Mycoplasma gallisepticum genome (NP_852802.1). Our analysis revealed that there is a putative paralog of this domain in this genome, as in other Mycoplasma genomes.
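Sequence identity figures such as the <20% quoted above depend on how identity is defined. As a minimal sketch, the function below computes percent identity from two rows of an alignment, under the assumption that the denominator is the number of columns in which both sequences have a residue; other conventions (alignment length, shorter sequence length) give different values. The function name and the toy alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both sequences have a residue.

    aligned_a, aligned_b: two rows of the same alignment (equal length).
    Columns containing a gap in either row are skipped (an assumed
    convention; it is not stated in the text which one was used).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment: 5 comparable columns, 3 of them identical.
print(percent_identity("ACD-EF", "ACE-EG"))
```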
An SBP bac 5 (bacterial extracellular solute-binding proteins, family 5) domain relationship is established in NP_853298.1 (see Table 2), which was initially a full-length unassigned sequence. Cross-genome comparisons revealed that this domain is present in all the Mycoplasma species except Mycoplasma mobile, Mycoplasma pneumoniae, and Mycoplasma synoviae (Figure 6). This domain is involved in peptide and nickel transport. Mycoplasmas have a reduced genome size and are highly dependent on the environment for nutrient absorption [35]. The presence of an extra SBP bac 5 domain could help in peptide uptake by the organisms.
Mycoplasma species were classified into six different groups according to host specificities (as mentioned earlier), and the newly predicted domains were classified based on the host specificities (Table 3; see Supplementary Table S1 in Supplementary Material available online at doi: 10.1155/2011/878973). A few domains are group specific, while the majority are found in all the groups. The group-specific domains perhaps imply selectivity for the hosts, owing to functions that may be directly or indirectly required for the survival of the pathogen. We found that two domains, namely, HNH (endonuclease) and HTH 5 (helix-turn-helix motif containing transcription factor), are specific to M. mobile (found in the freshwater tench fish, Tinca tinca). Five domains, namely, GMP synt C (GMP synthase CTD), HHH (helix-hairpin-helix motif involved in DNA binding), Methyltransf 3 (O-methyltransferases), SBP bac 1 (bacterial extracellular solute-binding protein), and Transposase mut (transposase, Mutator family, with DNA-based transposition activity), were found primarily in human-specific and primate group-specific pathogens. Most of these species-specific domains are involved in DNA binding and have transcription factor functions. One of them, GMP synt C (GMP synthase CTD), is associated with GATase (glutamine amidotransferase class-I) and Peptidase C26 domains to form a gene product in M. penetrans involved in GMP biosynthesis. Amongst the human- and primate-specific pathogens, M. penetrans has the largest genome (1,358,633 nt) and the maximum number of proteins (1037) among all 14 Mycoplasma species analyzed in this study (Table 1), suggesting that this organism may possess additional genetic information involved in its unique infection process. This organism lacks pyrimidine biosynthetic machinery but circumvents this problem by using orotate-related metabolism (again unique to M. penetrans) [33].
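Identifying group-specific domains amounts to asking, for each domain, whether every species that carries it belongs to a single host group. A minimal sketch of that test follows. The function name `group_specific_domains`, the group labels, and the toy mappings are illustrative assumptions; they use only a simplified subset of the observations described above and are not the full data of Table 3.

```python
def group_specific_domains(domain_species, species_group):
    """Map each domain that occurs in exactly one host group to that group.

    domain_species: dict mapping a domain name to the set of species carrying it.
    species_group:  dict mapping a species to its host group label.
    """
    specific = {}
    for domain, species in domain_species.items():
        groups = {species_group[s] for s in species}
        if len(groups) == 1:
            specific[domain] = next(iter(groups))
    return specific


# Simplified toy subset; group labels are illustrative.
species_group = {
    "M. mobile": "fish",
    "M. penetrans": "human/primate",
    "M. arthritidis": "rodent",
    "M. mycoides": "ovine/caprine",
}
domain_species = {
    "HNH": {"M. mobile"},
    "GMP_synt_C": {"M. penetrans"},
    "TGS": {"M. arthritidis", "M. mycoides"},  # spans two host groups
}

print(group_specific_domains(domain_species, species_group))
```

The same function can be reused with a motile/non-motile labelling of the species instead of host groups, since it only depends on the species-to-group mapping.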
On the other hand, the presence of a purine biosynthesis-related protein (GMP synthase) assists with the purine part of nucleotide biosynthesis. The larger genome size and number of proteins are also consistent with the presence of a GMP synt C domain specific to M. penetrans. Such an inspection of domain architectures in proteins containing these newly predicted domains was carried out for all host-group-specific domains. It revealed that, except for GMP synt C, all other domains are present as single domains in complete protein sequences. Most of the newly predicted domains are transcription factors not only involved in nucleotide biosynthesis but also specifically involved in the regulation of solute transport. This fact emphasizes the importance of solute transfer across the membrane in conditions of minimal genomes. Host-group-wise comparative analysis revealed that the TGS domain is present in two groups, rodents and ovine/caprine. Among the rodent-specific pathogens, it is present only in M. arthritidis 158L3-1, whereas it is present in all three species of the ovine/caprine host group. The TGS domain is named after threonyl-tRNA synthetase (ThrRS), GTPase, and guanosine-3′,5′-bis(diphosphate) 3′-pyrophosphohydrolase (SpoT). Its presence in proteins like GTPases suggests a role in ligand (nucleotide) binding or some regulatory function, but there is no direct information about its function [35]. However, in M. mycoides, it is present in association with other domains in two different proteins. One of them is GTP diphosphokinase, involved in the guanosine tetraphosphate metabolic process, which points to the possible involvement of the TGS domain in nucleotide biosynthetic machinery. M. mycoides, which also infects cattle (causing contagious bovine pleuropneumonia (CBPP)), has the second largest genome (1,211,703 nt) and number of proteins (1016) among the 14 Mycoplasma species under consideration (Table 1), explaining the presence of additional genetic information [34].
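The architecture inspection described above, in which a domain is checked for whether it occurs alone or together with other domains, can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the function name `architecture_context`, the protein identifiers, and the domain order within each toy architecture are hypothetical.

```python
def architecture_context(architectures, domain):
    """Separate proteins containing `domain` into single- and multi-domain cases.

    architectures: dict mapping a protein ID to its ordered list of domains.
    Returns (single, associated): a list of protein IDs in which `domain`
    is the only domain, and a dict mapping protein ID to partner domains.
    """
    single = []
    associated = {}
    for protein, domains in architectures.items():
        if domain not in domains:
            continue
        partners = [d for d in domains if d != domain]
        if partners:
            associated[protein] = partners
        else:
            single.append(protein)
    return single, associated


# Toy architectures; IDs and domain order are invented for illustration.
toy_architectures = {
    "toy_1": ["GATase", "Peptidase_C26", "GMP_synt_C"],
    "toy_2": ["HNH"],
    "toy_3": ["HNH"],
}

print(architecture_context(toy_architectures, "GMP_synt_C"))
print(architecture_context(toy_architectures, "HNH"))
```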
Some of the domains are specific to the motile group; for example, HHH, HNH, HTH 5, Peptidase M23, and SBP bac 1 are specific to the motile group, whereas GMP synt C,   Table S2). Inspecting the domain architectures for all the domains specific to the motility group, we found that they were not associated with any other domain in the complete protein sequence, except for the HHH domain in M. pneumoniae. Even in M. pneumoniae, HHH (helix-hairpin-helix motif, a small DNA-binding motif) was associated with three different ligase domains involved in replication, repair, and recombination events. Therefore, although there is no obvious link between the presence or absence of these domains and motility function, these distantly related domains have perhaps acquired new functions, which may be required for motility of the pathogens.

Conclusions
The investigation of sequence information among closely related genomes helps in tracing the appearance, disappearance, and reappearance of genes or their close homologues in closely related bacterial genomes. Generally, functional annotation transfer is accomplished by phylogenomics-based methods that exploit strong phylogenetic relationships and are based on the closest orthologue identified [42]. Apart from different sequence homology-based methods, microarray expression data along with machine learning techniques like Support Vector Machines (SVM) are integrated for functional annotation [43]. Although such methods are useful, GO annotations could be more comprehensive with regard to the biological process part or the cellular component part than for the exact molecular function [44]. Protein classification methods along with Gene Ontology terms are very useful tools in protein functional annotation. However, the best hit with respect to sequence identity may not be the correct protein to use for annotation transfer, since paralogous protein sequences from the same organism share high identity but may vary in function.
In this study, newly and indirectly identified domains in Mycoplasma gallisepticum have been compared across 14 Mycoplasma species. This study showed that some of the newly identified domains are specific to the Mycoplasma gallisepticum genome. Such genome-specific domains will perhaps provide important clues to the physiological and pathogenic specificities of the organism.