Research Article Identification and Analysis of Novel Amino-Acid Sequence Repeats in Bacillus anthracis str. Ames Proteome Using Computational Tools

We have identified four repeats and ten domains that are novel in proteins encoded by the Bacillus anthracis str. Ames proteome using automated in silico methods. A “repeat” corresponds to a region comprising fewer than 55 amino-acid residues that occurs more than once in the protein sequence and is sometimes present in tandem. A “domain” corresponds to a conserved region with more than 55 amino-acid residues that may be present as single or multiple copies in the protein sequence. The novel regions correspond to (1) 57-amino-acid-residue PxV domain, (2) 122-amino-acid-residue FxF domain, (3) 111-amino-acid-residue YEFF domain, (4) 109-amino-acid-residue IMxxH domain, (5) 103-amino-acid-residue VxxT domain, (6) 84-amino-acid-residue ExW domain, (7) 104-amino-acid-residue NTGFIG domain, (8) 36-amino-acid-residue NxGK repeat, (9) 95-amino-acid-residue VYV domain, (10) 75-amino-acid-residue KEWE domain, (11) 59-amino-acid-residue AFL domain, (12) 53-amino-acid-residue RIDVK repeat, (13) (a) 41-amino-acid-residue AGQF repeat and (b) 42-amino-acid-residue GSAL repeat. Each repeat or domain type is characterized by specific conserved sequence motifs. We discuss the presence of these repeats and domains in proteins from other genomes and their probable secondary structure.


INTRODUCTION
Anthrax is a disease of herbivores and other mammals, including humans, caused by Bacillus anthracis str. Ames, a Gram-positive, rod-shaped, nonmotile, spore-forming bacterium [1]. It forms endospores that cause inhalational anthrax. During the course of disease, endospores are taken up by alveolar macrophages, where they germinate in the phagolysosomal compartment. Vegetative cells then escape from the macrophage, eventually infecting blood. Expression of the major plasmid-encoded virulence determinants, the tripartite toxin and a poly-D-glutamic acid capsule, is essential for full pathogenicity [2]. Key virulence genes are found on the plasmids pXO1 and pXO2 [1]. The 60 MDa plasmid pXO2 carries genes required for the synthesis of an antiphagocytic poly-D-glutamic acid capsule [3]. The 110 MDa plasmid pXO1 [4] is required for the synthesis of the anthrax proteins: edema factor, lethal factor, and protective antigen. These proteins act in binary combinations to produce two anthrax toxins: edema toxin (protective antigen and edema factor) and lethal toxin (protective antigen and lethal factor) [5]. The chromosome encodes potential virulence factors that include haemolysins, enterotoxins, phospholipases, proteases, metalloproteases, and iron-acquisition proteins.
The chromosome of B. anthracis str. Ames contains three homologues of the sortase transpeptidase, which is responsible for attachment of secreted proteins to peptidoglycan on the cell surface of Gram-positive bacteria [6]. A range of important surface proteins, including enzymes and virulence-related MSCRAMMs (microbial surface components recognizing adhesive matrix molecules), are anchored to the cell wall in Gram-positive bacteria by sortase, a transpeptidase in Staphylococcus aureus that cleaves polypeptides at a conserved LPxTG motif near the carboxyl terminus and covalently links them to penta-glycine crossbridges in peptidoglycan [7,8]. Nearly 34 candidate surface proteins that have sortase attachment sites and SLH domains were identified. Two putative sortase-attached proteins of B. anthracis str. Ames have internalin-like repeats [9]. The chromosome of B. anthracis str. Ames also contains the csaAB genes for binding of proteins with S-layer homology (SLH) domains to polysaccharide. The SLH domain is a repetitive modular element that is present in several bacterial cell surface proteins and is involved in noncovalent association with peptidoglycan-associated polymers [10]. The SLH domain comprises 55 amino-acid residues [11], and the role of most proteins with SLH domains on the surface of B. anthracis str. Ames is unknown at present [12]. However, these surface proteins may mediate as yet uncharacterized interactions between B. anthracis str. Ames and its external environment and could be targets for vaccine and drug design. Read et al. [12] reported the complete genome sequence of B. anthracis str. Ames. It comprises 5 227 293 base pairs and 5508 genes with an overall G+C content of 35.4%. Of these, 2762 are functional genes, 1212 are conserved hypothetical genes, 657 genes are of unknown function, and 877 genes are annotated as hypothetical proteins.
As the complete genome sequence of B. anthracis str. Ames is available [12], we set out to systematically identify and analyze all the amino-acid sequence repeats in this proteome. In a general context, a "repeat" corresponds to a region comprising fewer than 55 amino-acid residues that occurs more than once, sometimes in tandem, along the primary sequence; examples are the YVTN repeats in various cell surface proteins and the WD repeats present in proteins that perform a variety of functions. On the other hand, a "domain" refers to a region of the protein that comprises more than 55 amino-acid residues and does not contain internal sequence repeats. According to the crystallographer's definition, a domain represents a region of the protein capable of folding independently as a stable unit. A domain can also exist in multiple copies and there can be several different domains per protein; examples are the SH2, SH3, and PH domains present in signal transduction proteins. Repeats and domains are characterized by conserved sequence motifs that may be identified according to the conservation of individual amino-acid residues at equivalent positions in multiple sequence alignments. In the absence of experimental data, structural information can be obtained from secondary structure or fold prediction studies in silico. Information about identified domains and repeats is available in databases such as SMART, INTERPRO, and PFAM. SMART (simple modular architecture research tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures [13]. INTERPRO is a searchable database that provides information on sequence, function, and annotation. It is an integrated documentation resource for protein families, domains, and sites [14]. PFAM is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. It can be used to view the domain organization of proteins [15]. We believe that a systematic sequence analysis will provide information on novel repeats and domains in the B. anthracis str. Ames proteome that have not been identified so far.
The B. anthracis str. Ames proteome contains several known repeats and domains. Some of these domains are as follows. (1) The BRCT (breast cancer carboxy terminal) domain was first identified as a 100-amino-acid tandem repeat at the C-terminus of the tumor suppressor gene product BRCA1, in which germline mutations lead to nearly 50% of familial breast cancers. Most BRCT domain-containing proteins participate in DNA damage checkpoint or DNA repair pathways and transcription regulation [16]. The BRCT domain is an evolutionarily conserved module that exists in a large number of proteins from prokaryotes to eukaryotes. (2) The Excalibur (extracellular calcium binding) domain contains a conserved DxDxDGxxCE motif, which is strikingly similar to the Ca2+-binding loop of the calmodulin-like EF hand domains, suggesting an evolutionary relationship. (3) The Cna B domain forms a stalk in the Staphylococcus aureus collagen-binding protein that presents the ligand-binding domain away from the bacterial cell surface. (4) The CBS (cystathionine beta synthase) domain is a small intracellular module of 60 amino-acid residues, mostly found in two or four copies within a protein, and occurs in several proteins in all kingdoms of life. Tandem pairs of CBS domains can act as binding domains for adenosine derivatives. In some cases, CBS domains may act as sensors of cellular energy status by being activated by AMP and inhibited by ATP. (5) The Par B (par B like nuclease) domain cleaves single-stranded DNA, nicks supercoiled plasmid DNA, and exhibits 5′-3′ exonuclease activity. (6) The KH (K homology) domain comprises 70 amino-acid residues and is involved in RNA binding. (7) The PAS and PAC domains, comprising 300 and 45 amino-acid residues, respectively, mediate signal transduction. (8) The PASTA domain is an extracellular module of 70 amino-acid residues that folds into a globular architecture consisting of three β-strands and an α-helix, which aids in penicillin binding. (9) The NEAT (near transporter) domain is a 125-amino-acid-residue conserved region consisting mainly of β-strands. The NEAT domain appears to be associated with iron transport in several Gram-positive species, some of which are pathogenic. (10) The SLH domain is present in several bacterial cell surface proteins and is involved in noncovalent association with peptidoglycan-associated polymers. It comprises 55 amino-acid residues, and the predicted secondary structure comprises two α-helices flanking a short β-strand [11].
The repeats present in the B. anthracis str. Ames proteome are as follows. (1) RHS repeats are 21 amino-acid residues long and are involved in carbohydrate binding. (2) TPR (tetratricopeptide) repeats are 34 amino-acid residues long and are involved in protein-protein interactions. (3) EZ HEAT repeats are 37-47 amino-acid residues long and occur in tandem in a number of cytoplasmic proteins that are involved in intracellular transport processes. Arrays of HEAT repeats consist of 3 to 36 units forming a rod-like helical structure and appear to function as protein-protein interaction surfaces. (4) Ankyrin repeats are about 33 amino-acid residues long and occur in at least four consecutive copies; the core of the repeat appears as a helix-loop-helix structure and is involved in protein-protein interactions. (5) LRRs (leucine-rich repeats) are 20 amino-acid residues long; each repeat consists of a β-strand and an α-helix that are oriented in an antiparallel manner. LRRs occur in proteins involved in signal transduction, DNA repair, and cell adhesion, and in transmembrane receptors and extracellular matrix proteins [17].
Andrade et al. [18] reviewed methods to identify repeats in proteins and the relationship between repeat sequences and their associated functions. Repeats may be identified by manual examination if the sequence similarity is very high and the repeats are present in tandem. Repeats are thought to arise from gene duplication and recombination events. Protein domains may exist either as single or multiple copies, whereas repeats always exist as multiple copies [18,19]. Programs such as BLASTP [20] are also useful in detecting internal and homologous repeats in a protein database. With the BLAST program, the presence of repeats in a query protein sequence can be inferred if (a) the same region of the query is aligned against two or more distinct regions of a second protein; or (b) different regions of the query are aligned against the same region of a second protein [18].
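The two BLAST criteria (a) and (b) can be expressed as a simple test on pairs of local alignments. The following Python sketch is illustrative only: the `Hit` record, the `overlap_fraction` helper, and the 0.8 and 0.2 overlap thresholds are assumptions introduced here, not values taken from [18] or from the BLAST program.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One local alignment between the query and a second protein.

    Coordinates are 1-based and inclusive (hypothetical record layout).
    """
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def overlap_fraction(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Return the overlap length divided by the length of the shorter interval."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    shorter = min(a_end - a_start, b_end - b_start) + 1
    return overlap / shorter


def suggests_repeat(hits: List[Hit], same: float = 0.8, distinct: float = 0.2) -> bool:
    """Check every pair of hits against criteria (a) and (b).

    `same` and `distinct` are assumed thresholds for calling two regions
    "the same" or "distinct"; they are not prescribed by the source.
    """
    for i, first in enumerate(hits):
        for second in hits[i + 1:]:
            q = overlap_fraction(first.q_start, first.q_end, second.q_start, second.q_end)
            s = overlap_fraction(first.s_start, first.s_end, second.s_start, second.s_end)
            # (a) same query region, two distinct regions of the second protein
            if q >= same and s <= distinct:
                return True
            # (b) different query regions, same region of the second protein
            if s >= same and q <= distinct:
                return True
    return False
```

For example, two hits that share the query region 65-121 but map to non-overlapping regions of the second protein satisfy criterion (a), whereas a single hit or two hits that differ in both coordinates do not.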
Several web-based methods are available for ab initio identification of sequence repeats in proteins. For example, RADAR (rapid automatic detection and alignment of repeats) [21] uses an automatic algorithm for segmenting a query sequence into repeats; it identifies short composition-biased repeats as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in a query sequence. The Rep program [22] uses an iterative algorithm based on score distributions from profile analysis. This procedure allows the identification of homologues at alignment scores lower than the highest optimal alignment score for nonhomologous sequences. The PROSPERO program [23] is well suited to large-scale self-comparison of protein sequences. It uses a formula that accurately assesses the significance of protein repeat similarities, allowing for the existence of gaps, and also takes into account sequence length and composition. The TRUST (tracking repeats using significance and transitivity) program [24] exploits the concept of transitivity of alignments as well as a statistical scheme optimized for the evaluation of repeat significance. Starting from significant local suboptimal alignments, the application of transitivity makes it possible to (1) identify distant repeat homologues for which no alignments were found; (2) gain confidence about consistently well-aligned regions; and (3) recognize and reduce the contribution of nonhomologous repeats. This assessment step yields a virtually noise-free profile representing a generalized repeat with high fidelity. The authors have demonstrated that TRUST is a useful and reliable tool for mining tandem and nontandem repeats in protein sequence databases and for predicting multiple repeat types with varying intervening segments within a single sequence. Once statistically significant repeats are detected, construction of a multiple sequence alignment provides insight into the extent of sequence homology among members of the new protein family and allows identification of the conserved sequence motifs.
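The idea of transitivity described above (if segment A aligns with B, and B aligns with C, then A and C are candidate copies of the same repeat even without a direct alignment) can be illustrated with a small grouping routine. This Python sketch is not the TRUST algorithm: it assumes that alignments have already been reduced to pairs of exactly matching segment coordinates, ignores significance scoring, and uses a plain union-find structure chosen here for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based inclusive; hypothetical layout


def transitive_groups(pairs: Iterable[Tuple[Segment, Segment]]) -> List[Set[Segment]]:
    """Group segments that are linked directly or transitively by alignments."""
    parent: Dict[Segment, Segment] = {}

    def find(x: Segment) -> Segment:
        # Register unseen segments as their own root, then follow parent links.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups: Dict[Segment, Set[Segment]] = {}
    for segment in list(parent):
        groups.setdefault(find(segment), set()).add(segment)
    # Sort groups by their smallest segment for a stable output order.
    return sorted(groups.values(), key=lambda group: sorted(group))
```

With the alignments (1-40, 45-84) and (45-84, 90-129), the three segments end up in one group although 1-40 and 90-129 were never aligned directly, which corresponds to point (1) above.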
We have implemented TRUST on a personal computer in our laboratory and used it to identify amino-acid sequence repeats in the proteins of B. anthracis str. Ames proteome. We have identified four repeats and ten domains that are novel in the proteome of B. anthracis str. Ames. Further analysis corresponding to searches of the completed and unfinished genome databases identified some of these to be present in other bacterial genomes.

METHODS
We downloaded the entire proteome of B. anthracis str. Ames in FASTA format from the website http://www.ncbi.nlm.nih.gov. The TRUST program was downloaded and installed on local Pentium IV computers running Linux. The TRUST server, together with the source code, is available at http://ibivu.cs.vu.nl/programs/trustwww. The TRUST program was run on all the sequences in this proteome. Based on the size of the TRUST output file, protein sequences with no internal repeats were discarded automatically; that is, only those protein sequences that contain repeats were retained. The repeats and domains currently annotated in the INTERPRO database often comprise more than 25 amino-acid residues; therefore, in this work, we considered only repeats with more than 25 amino-acid residues for further analysis. The proteins thus selected were submitted in batch mode to the SMART online program (http://smart.embl-heidelberg.de/smart/batch.pl) [13]. Proteins found by manual inspection of the SMART results to contain known repeats or domains were discarded. Only those repeats that were not identified by the SMART database were retained for further analysis.
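The length-based selection described above can be sketched in a few lines of Python. The input layout (a dictionary mapping a protein identifier to a list of repeat coordinates), the function names, and the protein identifiers in the usage note are hypothetical; the actual TRUST output format is not reproduced here. The 25-residue cut-off comes from this section and the 55-residue boundary from the Introduction; treating a region of exactly 55 residues as a domain is an assumption, made because the SLH domain is described as comprising 55 residues.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # (start, end), 1-based inclusive

MIN_REPEAT_LENGTH = 25  # only regions longer than this are kept (from the text)
DOMAIN_LENGTH = 55      # repeat/domain boundary; the exact-55 case is an assumption


def region_length(region: Region) -> int:
    """Number of residues covered by an inclusive (start, end) region."""
    start, end = region
    return end - start + 1


def select_candidates(repeats_by_protein: Dict[str, List[Region]]) -> Dict[str, List[Region]]:
    """Drop proteins without repeats and repeats of 25 residues or fewer."""
    selected: Dict[str, List[Region]] = {}
    for protein_id, regions in repeats_by_protein.items():
        kept = [r for r in regions if region_length(r) > MIN_REPEAT_LENGTH]
        if kept:
            selected[protein_id] = kept
    return selected


def classify(length: int) -> str:
    """Label a region as 'repeat' or 'domain' by its length."""
    return "domain" if length >= DOMAIN_LENGTH else "repeat"
```

For instance, a made-up protein "protA" with regions 65-121 and 130-186 would be retained and both regions labeled as domains (57 residues each), while a protein whose only repeated regions are 20 residues long would be discarded.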
We downloaded the NCBI NR (release date: April 22, 2005) and UNIPROT (release date: April 23, 2005) databases and installed BLAST-2.2.10 on the local Linux computers (OS: Fedora Core-2, Pentium-IV 3.00 GHz, 1 GB RAM, 80 GB hard disk). Using automatic shell scripts, these protein sequences were then searched against the NCBI NR database using the PSI-BLAST program [25] for three iterations and against the UNIPROT database using the BLASTALL program. The proteins confirmed by the BLAST program to contain repeats were retained and were tested for presence in the offline versions of the INTERPRO (Database: iprscan DATA 10.0, Applications: iprscan V4.1, iprscan binn4.x Linux) and PFAM (release date: April 26, 2005) databases. A final check was made using the online versions of INTERPRO and PFAM. This series of steps is summarized in the flowchart shown in Figure 1.
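The shell scripts themselves are not given, so the following Python sketch only indicates what the two search steps might look like with the legacy NCBI BLAST 2.2.x executables, in which PSI-BLAST is run as `blastpgp` and BLASTP through `blastall`. The database names, file names, and function names are hypothetical placeholders, and the functions only build the argument lists without running anything.

```python
from typing import List


def psiblast_command(query_fasta: str, database: str = "nr",
                     iterations: int = 3, output: str = "psiblast.out") -> List[str]:
    """Argument list for a PSI-BLAST search with legacy blastpgp.

    -i query file, -d database, -j maximum number of iterations, -o output file.
    """
    return ["blastpgp", "-i", query_fasta, "-d", database,
            "-j", str(iterations), "-o", output]


def blastp_command(query_fasta: str, database: str = "uniprot",
                   output: str = "blastp.out") -> List[str]:
    """Argument list for a protein-protein search with legacy blastall.

    -p program name, -i query file, -d database, -o output file.
    """
    return ["blastall", "-p", "blastp", "-i", query_fasta,
            "-d", database, "-o", output]
```

On a machine with the legacy BLAST package and formatted databases installed, each list could be passed to `subprocess.run(command, check=True)`; looping over the selected FASTA files would then reproduce the batch behavior of a shell script.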
The repeats that are not present in any of these databases were considered to be novel repeats or domains, depending upon (1) the number of times they occur in the protein sequences, and (2) the length of the amino-acid sequence region. The novel repeats and domains thus identified in the B. anthracis str. Ames proteome were subjected to PSI-BLAST analysis in order to identify other proteins from databases that contain these repeats and domains. The multiple sequence alignment program ClustalW [26] was used to assess the extent of sequence conservation, and secondary structure prediction was carried out using the PHD method [27].
... of amino-acid sequence repeats. SMART database analysis identified that 302 entries do not have a SMART description. Based on their absence in the INTERPRO and PFAM databases and the length of the repeat sequence (greater than 25 amino-acid residues), we identified about 120 proteins (data not shown) in the B. anthracis str. Ames proteome that contain novel amino-acid sequence repeats. We added the further constraint that the repeats identified by the TRUST program should also be identified as repeats by the BLAST program. Subsequent online INTERPRO and PFAM searches confirmed that these domains and repeats have not been reported before. In this work, we have identified four repeats and ten domains that are not within or part of previously reported repeats, and our findings are therefore novel. Further analysis identified some of these in the proteins of other bacterial genomes. The conserved amino-acid residues observed in multiple sequence alignments produced with the CLUSTALW program were used to describe sequence motifs characteristic of these novel repeats and domains. Often, more than one sequence motif is associated with a repeat or domain, and the amino-acid sequence patterns characteristic of these repeats are represented according to the PROSITE description [28]. Ponting et al. [29] have earlier used a similar approach to identify novel domains and repeats in Drosophila melanogaster.
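Motifs written in PROSITE pattern syntax can be searched for in a sequence by translating them into regular expressions. The Python sketch below handles a common subset of the syntax (elements separated by "-", "x" for any residue, "[..]" for alternatives, "{..}" for excluded residues, "(n)" or "(n,m)" for repetition, and "<" or ">" as terminus anchors). The function names are introduced here, and writing the motif names of this work as patterns such as "P-x-V" or "I-M-x(2)-H" is an assumption made for illustration; the full patterns used by the authors are not reproduced.

```python
import re
from typing import List


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition count such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # Single letters and [..] groups are already valid regex syntax.
        count = ""
        if low and high:
            count = "{%s,%s}" % (low, high)
        elif low:
            count = "{%s}" % low
        parts.append(prefix + core + count + suffix)
    return "".join(parts)


def find_motif(sequence: str, pattern: str) -> List[int]:
    """Return 1-based start positions of non-overlapping motif matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() + 1 for m in regex.finditer(sequence)]
```

As a usage example on a made-up sequence, `find_motif("AAPGVKKPLV", "P-x-V")` reports matches starting at positions 3 and 8.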

RESULTS AND DISCUSSION
In this work, we identified four repeats and ten domains that have not been reported before in the B. anthracis str. Ames proteome. The repeats and domains described in 1 to 6 and 9 are also present in some other bacterial organisms; 7, 8, 10, and 11 are Bacillus-specific; 12 and 13 are specific to Bacillus anthracis str. Ames. Lists of the proteins containing these novel repeats and domains are shown in Tables 1(a) to 1(k). These tables indicate the protein identifiers (Gene or Swall ID), the number of amino-acid residues in the protein, a description of the protein, and other well-characterized repeats and domains present in the protein. Some sequences representing these repeats or domains share lower than 15% pairwise sequence identity. However, these sequences retain the conserved motifs and the positions of secondary structure elements in the multiple sequence alignment. For all the proteins, the amino-acid sequences corresponding to each representative repeat are shown in the multiple sequence alignments (see Figures 2 to 14). Conservation of the position of secondary structural elements is indicated in the multiple sequence alignments. The schematic figures used to represent these repeats and domains are shown in Figures 15 to 27. These figures (drawn to an approximate scale) reflect the relative proximity and location of individual repeats and domains along the primary sequence. We discuss each of these novel repeats and domains below.
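The pairwise sequence identities quoted for each repeat and domain below are percentages derived from aligned sequences. Since the exact convention used is not stated, the following Python sketch adopts one common definition as an assumption: identical residues divided by the number of alignment columns in which neither sequence has a gap. The function name and the gap character "-" are likewise choices made here.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    # Return 0.0 when no column can be compared, to avoid division by zero.
    return 100.0 * identical / compared if compared else 0.0
```

Other conventions, such as dividing by the length of the shorter sequence or by the full alignment length, give different values for gapped alignments, so identities are only comparable when computed the same way.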

57-amino-acid-residue PxV domain
The 251-amino-acid-residue protein corresponding to the GENE ID BA2292 and described as a hypothetical protein contains two copies of a 57-amino-acid-residue region. Further BLAST searches using the sequence corresponding to the region (65-121) as a query identified 24 proteins that are described as hypothetical (see Table 1(a)). The length of the proteins varied between 196 and 488 amino-acid residues. The multiple sequence alignment for this domain, shown in Figure 2, reveals a characteristic PxV sequence motif, where x is any amino-acid residue. The pairwise identities between sequences corresponding to the PxV domain varied between 15-96%. The secondary structure of the PxV domain is predicted to comprise four β-strands, as shown in Figure 2. The representative domain architecture of proteins containing the PxV domain is shown in Figure 15.

122-amino-acid-residue FxF domain
The 293-amino-acid-residue protein corresponding to the GENE ID BA0881 and described as a conserved domain protein contains two copies of a 122-amino-acid-residue region. Further BLAST searches using the sequence corresponding to the region (55-176) as a query identified 10 proteins (see Table 1(b)). The proteins containing this region are described as either conserved or hypothetical proteins. This region occurs as two copies in the proteins of the B. anthracis, B. cereus, B. thuringiensis, Geobacillus kaustophilus, Clostridium tetani, Clostridium novyi, and Desulfotomaculum reducens genomes. The length of the proteins varied between 262 and 305 amino-acid residues. The multiple sequence alignment for this domain reveals the characteristic sequence motif FxF (Figure 3), and we refer to this as the FxF domain. The pairwise sequence identities for this domain vary between 18-97%. The secondary structure of the FxF domain is predicted to comprise one α-helix and five β-strands, and the representative domain architecture of proteins containing this domain is shown in Figure 16.

111-amino-acid-residue YEFF domain
The 510-amino-acid-residue protein corresponding to the GENE ID BA3695 and described as an S-layer protein contains a 111-amino-acid-residue region that is present as two copies. Further BLAST searches using the sequence corresponding to the region (247-357) as a query identified 13 proteins (see Table 1(c)). This domain is characterized by the conserved sequence motifs YEFF, RGD, FTY, GKD, and FVEH. We refer to this 111-amino-acid region as the YEFF domain. The pairwise sequence identities for the YEFF domain varied between 36-96%. The consensus secondary structure predicted for this domain suggests mainly β-strands, and the conserved sequence motifs YEFF and FTY are associated with β-strands; see Figure 4. The representative domain architecture of proteins containing this domain is shown in Figure 17. It is intriguing that each domain contains the RGD sequence motif, which is found in proteins of the extracellular matrix. Many viruses enter their host cells via the RGD motif-integrin interaction, and synthetic peptides containing this RGD motif are active modulators of cell adhesion [30]. The RGD motif was originally identified as the sequence within fibronectin that mediates cell attachment. This motif has now been found in numerous other proteins and supports cell adhesion. The integrins, a family of cell surface proteins, act as receptors for cell adhesion molecules. A subset of the integrins recognizes the RGD motif within their ligands, the binding of which mediates both cell-substratum and cell-cell interactions [31]. The presence of the RGD motif and the SLH domain suggests that the YEFF domain-containing proteins are also present on the cell surface and mediate protein-protein interactions.

109-amino-acid-residue IMxxH domain
The 266-amino-acid-residue protein corresponding to the GENE ID BA1021 and described as a hypothetical protein contains two copies of a 109-amino-acid-residue region. Further BLAST searches using the sequence corresponding to the region (4-112) as a query identified 22 proteins (see Table 1(d)). The pairwise sequence identities for the IMxxH domain vary between 5-98%. The secondary structure of the IMxxH domain is predicted to comprise four α-helices, as shown in Figure 5. The representative domain architecture of proteins containing this domain is shown in Figure 18.

84-amino-acid-residue ExW domain
The 246-amino-acid-residue protein corresponding to the GENE ID BA4310 and described as a hypothetical protein contains two copies of an 84-amino-acid-residue region. Further BLAST searches using the sequence corresponding to the domain (45-128) as a query identified 25 proteins (see Table 1(f)). The secondary structure of this domain is predicted to comprise five β-strands, and the conserved sequence motif is associated with one of the β-strands, as shown in Figure 7. The representative domain architecture of proteins containing this domain is shown in Figure 20.

104-amino-acid-residue NTGFIG domain
The 232-amino-acid-residue protein corresponding to the GENE ID BA2665 and described as a hypothetical protein contains two tandem copies of a 104-amino-acid-residue region. Further BLAST searches used the sequence corresponding to the region beginning at residue 16 as a query. This region occurs twice in every protein of the Bacillus species, as shown in Table 1(g). We refer to this as the NTGFIG domain, based on the conserved sequence motif that is present in the N-terminal part. The pairwise sequence identities between sequences corresponding to this domain varied between 31-99%. The secondary structure of this domain is predicted to comprise three α-helices and two β-strands, as shown in Figure 8. The representative domain architecture of proteins containing this domain is shown in Figure 21.

36-amino-acid-residue NxGK repeat
The 193-amino-acid-residue protein corresponding to the GENE ID BA3686 and described as a hypothetical cytosolic protein contains two copies of a 36-amino-acid-residue region. Further BLAST searches using the sequence corresponding to the region (94-129) as a query identified 9 hypothetical proteins containing this repeat region from the organisms B. anthracis, B. thuringiensis, B. thuringiensis serovar israelensis, Bacillus weihenstephanensis, and B. cereus (see Table 1(h)). The length of the proteins varied between 189 and 193 amino-acid residues, and each protein also contains a SAP domain at the N-terminus, in addition to the novel repeat described here. A SAP domain consists of two α-helices and is a DNA-binding motif that is involved in chromosomal organization [32]. Therefore, we believe that these repeats might also participate in a similar function. The multiple sequence alignment for this repeat identified the NxGK sequence motif (Figure 9). The pairwise sequence identities between sequences corresponding to NxGK repeats varied between 36-97%. The secondary structure is predicted to comprise an α-helix, and the conserved sequence motif described above is also associated with the α-helix. The representative domain architecture of proteins containing the NxGK repeats is shown in Figure 22.

95-amino-acid-residue VYV domain
The 225-amino-acid-residue protein corresponding to the GENE ID BA1701 and described as a hypothetical protein contains two tandem copies of a 95-amino-acid-residue region. Further BLAST searches using the sequence corresponding to this region as a query identified proteins described as hypothetical, including proteins of B. thuringiensis serovar israelensis and DSY3134 of Desulfitobacterium hafniense Y51. The length of the proteins varied between 227 and 1674 amino-acid residues (see Table 1(i)). In RBTH 03882, this region occurs ten times in tandem. The multiple sequence alignment for this domain identified the characteristic sequence motifs GDxV and VYV (see Figure 10). For the sake of simplicity, we refer to this 95-amino-acid region as the VYV domain. The pairwise sequence identities between sequences corresponding to VYV domains varied between 29-95%. The secondary structure of the VYV domain is predicted to comprise five β-strands. The representative domain architecture of proteins containing the VYV domains is shown in Figure 23.

75-amino-acid-residue KEWE domain
The 262-amino-acid-residue protein corresponding to the GENE ID BA3147 and described as a hypothetical protein contains three tandem copies of a 75-amino-acid-residue region. Further BLAST searches using the sequence corresponding to the region (34-108) as a query identified this domain in 6 proteins that are described as hypothetical proteins (see Table 1(j)). This domain may exist as 2, 3, or 4 copies in these proteins. The length of the proteins identified varied between 178 and 344 amino-acid residues. The pairwise sequence identities between sequences corresponding to these regions varied between 22-69%. These domains are present in tandem and are associated with the SPY, MIN, LYP, KEWE, and FWT conserved sequence motifs, as indicated in the multiple sequence alignment (see Figure 11). We refer to this as the KEWE domain; the KEWE motif occurs at the C-terminus of the domain. The secondary structure of the KEWE domain is predicted to comprise three α-helices, as shown in Figure 11. The representative domain architecture of proteins containing the KEWE domain is shown in Figure 24.

59-amino-acid-residue AFL domain
The 290-amino-acid-residue protein corresponding to the GENE ID BA3065 and described as a hypothetical protein contains two copies of a 59-amino-acid-residue region. Further BLAST searches using the sequence corresponding to the region (13-71) as a query identified that this region occurs twice in the proteins with GENE IDs BAS2851 and Bant 01003715 of B. anthracis strains and in the protein with GENE ID BcerKBAB4DRAFT 1832 of Bacillus weihenstephanensis, and once in the protein with GENE ID RBTH 02124 of Bacillus thuringiensis serovar israelensis (see Table 1(k)). The lengths of the proteins, which are described as hypothetical proteins, varied between 145 and 297 amino-acid residues. The multiple sequence alignment for this domain identified two characteristic sequence motifs: RFxI and AFL (see Figure 12). We refer to this as the AFL domain. The sequence identities shared between AFL domains varied between 38-91%. The secondary structure of the AFL domain is predicted to comprise one α-helix and two β-strands, and the conserved sequence motif AFL is part of the α-helix. The representative domain architecture of proteins containing the AFL domain is shown in Figure 25.

53-amino-acid-residue RIDVK repeat
The 159-amino-acid-residue protein corresponding to the GENE ID BA0482 and described as a conserved domain protein contains two copies of a 53-amino-acid region. BLAST did not identify this repeat in any other proteins; therefore, this repeat appears to be unique to B. anthracis str. Ames. The multiple sequence alignment for this repeat identified three characteristic sequence motifs: ITV, IGD, and RIDVK (Figure 13). We refer to this as the RIDVK repeat. The sequence identity shared between the two RIDVK repeats in BA0482 is 45%. The secondary structure of the RIDVK repeat is predicted to comprise three β-strands. The representative domain architecture of the protein containing the RIDVK repeat is shown in Figure 26.

(a) 41-amino-acid-residue AGQF repeat and (b) 42-amino-acid-residue GSAL repeat
The protein corresponding to the GENE ID BA4081, which comprises 462 amino-acid residues and is described as a conserved domain protein, contains two novel repeat types. The repeat types are 41 and 42 amino-acid residues long, and each is present as two copies in BA4081. BLAST searches identified these repeats to be specific to this protein alone.
(a) The sequence alignment for the 41-amino-acid-residue repeat identified two characteristic sequence motifs: DLG and AGQF (Figure 14(a)). We refer to this as the AGQF repeat. The AGQF motif occurs in the C-terminal part of the repeat region. The sequence identity shared between the two AGQF repeats is about 34%. The secondary structure of the AGQF repeat is predicted to comprise one α-helix. The representative domain architecture of the protein containing the AGQF repeat is shown in Figure 27.
(b) The sequence alignment for the 42-amino-acid-residue tandem repeat, which is glycine-rich, identified three characteristic sequence motifs: GYI, GSAL, and TING (Figure 14(b)). We refer to this as the GSAL repeat. The sequence identity shared between the two GSAL repeats is 52%. The secondary structure of the GSAL repeat is predicted to comprise one α-helix and one β-strand. The representative domain architecture of the protein containing the GSAL repeat is shown in Figure 27. This protein also contains a 27-amino-acid-residue Ribosomal S7 region that is sandwiched between the 41-amino-acid-residue AGQF repeat and the 42-amino-acid-residue GSAL repeat. These two repeats are specific to this protein alone and are therefore specific to B. anthracis str. Ames.
From the analysis of the B. anthracis proteome, we observed that the novel repeats and domains are present in all the strains that have been sequenced so far, such as Ames, Ames ancestor, Sterne, and A2012. This suggests that these strains of B. anthracis have diverged recently. We also observed that the domains PxV, FxF, YEFF, VxxT, ExW, and VYV are present in proteins from several bacterial organisms. The domains NTGFIG, KEWE, and AFL and the NxGK repeats are specific to Bacillus. It is interesting to note that the domains VYV and AFL are present in all the B. anthracis strains while absent from B. cereus genomes. The repeats RIDVK, AGQF, and GSAL are also present only in the strains of B. anthracis. This analysis accounts for some of the differences between the closely related B. anthracis and B. cereus genomes. The identification of these novel domains and repeats in subsequently sequenced genomes will add value to their annotation.

CONCLUSIONS
A systematic analysis using computational tools identified four novel repeats and ten novel domains in the B. anthracis str. Ames proteome. Further database searches showed that some of these novel repeats and domains are also present in other bacterial genomes. The NxGK repeats are associated with a SAP domain. The SAP domain is a DNA-binding motif that is involved in chromosomal organization. Therefore, we believe that these repeats may also participate in a similar function. The YEFF domain-containing proteins carry the RGD motif and may be involved in cell adhesion. The identification of novel repeats and domains in the B. anthracis proteome may be useful for annotation. The presence of the VYV and AFL domains in all the B. anthracis strains and their absence from B. cereus genomes point to some differences between these two genomes, which are otherwise closely related.

ACKNOWLEDGMENTS
The first author thanks UGC, New Delhi, for a JRF fellowship. The second author thanks DBT, New Delhi, for project funding. The third author thanks DBT, New Delhi, for research funding.