Comparative Genomic Assessment of Novel Broad-Spectrum Targets for Antibacterial Drugs

Single and multiple resistance to antibacterial drugs currently in use is spreading, since they act against only a very small number of molecular targets; finding novel targets for anti-infectives is therefore of great importance. All protein sequences from three pathogens (Staphylococcus aureus, Mycobacterium tuberculosis and Escherichia coli O157:H7 EDL993) were assessed via comparative genomics methods for their suitability as antibacterial targets according to a number of criteria, including the essentiality of the protein, its level of sequence conservation, and its distribution in pathogens, bacteria and eukaryotes (especially humans). Each protein was scored and ranked based on weighted variants of these criteria in order to prioritize proteins as potential novel broad-spectrum targets for antibacterial drugs. A number of proteins proved to score highly in all three species and were robust to variations in the scoring system used. Sensitivity analysis indicated the quantitative contribution of each metric to the overall score. After further analysis of these targets, tRNA methyltransferase (trmD) and translation initiation factor IF-1 (infA) emerged as potential and novel antimicrobial targets very worthy of further investigation. The scoring strategy used might be of value in other areas of post-genomic drug discovery.


Introduction
Within two decades of the introduction of penicillin, the majority of the existing classes of antibacterial drugs had been discovered by systematic screening of natural product libraries. Remarkably, no new chemical classes of active antibacterial drugs were successfully introduced for a further 30 years (Hancock and Knowles, 1998). Table 1 shows the very restricted set of modes of action of the major antibacterial drugs currently in use.
Microorganisms have also shown themselves to be extremely versatile in overcoming the effects of antibacterial drugs. Bacteria have developed a variety of resistance mechanisms, and lateral gene transfer mechanisms allow this resistance to be passed between different bacterial strains and species (Davies, 1994; Heinemann, 1999). Antibacterial resistance has developed steadily as new agents have been introduced, and the past 10-15 years have shown a dramatic increase in the occurrence of resistant populations of microbes in both community and hospital environments (Struelens, 1998).
Measures such as chemical modification of existing antibacterial drugs and the development of inhibitors of resistance genes will have a significant impact on antibacterial therapy in the short term. However, it is obvious that new drug targets need to be found.
Table 1 is after Chopra et al. (2002).
To this end, genomic approaches are providing a new strategy by revealing new molecular targets that are giving rise to novel antibacterial agents (Allsop and Illingworth, 2002; Dougherty et al., 2002; Haney et al., 2002; Isaacson, 2002; Ji, 2002; McDevitt and Rosenberg, 2001), as these new agents are unlikely to face the current problems of established mechanisms of resistance (McDevitt and Rosenberg, 2001). In anti-infective research, the inevitable selection for resistant strains means that drugs with multiple targets may be preferred (e.g. multiple penicillin-binding proteins or multiple forms of two-component systems; Stephenson and Hoch, 2002; Stephenson and Hoch, 2004). In other pharmaceutical areas it is encouraging that the rational utility of traditional targets is being confirmed by systematic knock-out studies (Zambrowicz and Sands, 2003). With the release of data from numerous sequencing projects, the number of potential drug targets has increased massively. However, not all of these molecules will become drug targets (Hopkins and Groom, 2002), and the big challenge is to select the targets most relevant for a given situation (Terstappen and Reggiani, 2001).
Machine learning methods seek to devise new ideas and hypotheses from more or less unstructured data (Gillies, 1996; Kell and Oliver, 2004; Mitchell, 1997; Mjolsness and DeCoste, 2001).
It has been shown that such data-driven strategies can be used to identify novel drug targets (Spaltmann et al., 1999). A number of metrics are chosen which should be properties of a potential drug target, such as essentiality and specificity. Each potential target in a genome of interest is scored for these properties. These scores can be weighted differently to add more or less emphasis to any particular property. This scoring system can be tuned so that targets which have already been identified score highly, showing that the scoring system is capable of identifying useful targets. Previously unidentified genes may also score highly, and these can be prioritized as potential drug targets for further study. The top-scoring gene in the study carried out by Spaltmann et al. (1999) on antifungal targets was α,α-trehalose-phosphate synthase, a gene which had never before been suggested as a potential drug target. This shows that postgenomic research has much to offer in terms of novel target identification (Allsop and Illingworth, 2002; Buysse, 2001; Dougherty et al., 2002; Glass et al., 2002; Haney et al., 2002; Isaacson, 2002; Ji, 2002; Knowles and King, 1998; McDevitt and Rosenberg, 2001; Payne et al., 2001a, 2001b; Willins et al., 2002).
In the present study a number of criteria were chosen on which to characterize proteins as targets. These were suggested by the extensive literature on the subject (see e.g. Alksne, 2002; Allsop and Illingworth, 2002; McDevitt and Rosenberg, 2001; Projan, 2002; Spaltmann et al., 1999; Terstappen and Reggiani, 2001). A full list of the criteria used is given in the Methods section.

Data collection and motives
Data were collected from three pathogenic bacterial species, Staphylococcus aureus, Escherichia coli O157:H7 EDL993 and Mycobacterium tuberculosis. These species were chosen as they represent a broad cross-section of bacterial types. Targets which prove to score well in these three species will probably be good targets across a broad spectrum of pathogens.
The entire set of sequences of proteins encoded by S. aureus, E. coli O157:H7 EDL993 and M. tuberculosis was downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html). Each protein was then characterized by a number of criteria which could then be used to prioritize the most suitable proteins as potential antibacterial targets.
A Perl program carried out most of the characterization automatically (see Figure 1 for an overview). The header of each protein sequence was parsed to find the GenInfo Identifier (gi) number and name of the protein. If the function of the protein was known, or if a function had been assigned to the protein on the basis of sequence homology, then this was noted.
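The parsing step can be illustrated with a short sketch. It is written in Python rather than the Perl used in the study, and it assumes NCBI-style FASTA headers of the form `>gi|number|db|accession| description [organism]`. The function name `parse_defline`, the keyword test for unknown function and the identifiers in the example are all hypothetical.

```python
import re

# Assumed header layout: ">gi|<number>|<db>|<accession>| <description> [<organism>]"
DEFLINE = re.compile(r"^>gi\|(\d+)\|[^|]*\|[^|]*\|\s*(.*?)\s*(?:\[(.*)\])?$")


def parse_defline(line):
    """Return (gi number, protein name, function_known) for one FASTA header."""
    match = DEFLINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    gi, name, _organism = match.groups()
    # Assumption: annotations containing these words mean 'function unknown'.
    lowered = name.lower()
    function_known = not any(word in lowered for word in ("hypothetical", "unknown"))
    return int(gi), name, function_known


# Made-up identifiers, for illustration only.
print(parse_defline(">gi|12345|ref|XX_000001.1| tRNA methyltransferase [Staphylococcus aureus]"))
```

A real pipeline would also need to handle headers that deviate from this layout; here they simply raise an error.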
Each protein was then submitted to a BLAST (Altschul et al., 1990, 1997) search (BLASTp, using default parameters except for an 'expectation value' of 0.01) against a local copy of the SwissProt database (ftp://ftp.ebi.ac.uk/pub/). The SwissProt database was used because it is well curated, well annotated and non-redundant, and because its consistent format makes entries easy to parse. There also exist a large number of associated files and websites which use SwissProt-style codes (for species and gene/protein names). Using SwissProt therefore allows these resources to be integrated easily into the program, thus making efficient automation possible.
The results of each BLAST search were parsed to find how many homologues of this protein existed in bacteria, pathogenic bacteria, eukaryotes, mice and Lactobacillus. This was done by comparing the SwissProt species ID code of each hit against a look-up table that listed the classification of the organism (http://ca.expasy.org/cgi-bin/speclist). A list of bacteria treated as pathogenic in this study is given in Table 2. Bacteria may or may not act as pathogens, depending on the circumstances and the host, and so the list given here covers a broad range of pathogens but is perhaps not completely comprehensive.
The presence of homologues in mice was considered important for two reasons: it allows targets which are present in higher organisms to be further down-weighted, and a target's absence in mice will make later animal trials more effective. Lactobacillus spp. are considered to be beneficial or probiotic bacteria, so using this metric might help to prioritize targets which diminish any unwanted side-effects of a new drug.
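The look-up step described above might be sketched as follows. This assumes that each hit is identified by a SwissProt-style entry name such as `TRMD_ECOLI`, where the part after the underscore is the species code. The table `SPECIES_CLASSES` and the categories assigned in it are illustrative placeholders, not the classification used in the study (which is given in Table 2).

```python
from collections import Counter

# Hypothetical look-up table: SwissProt species code -> organism categories.
# The category assignments here are for illustration only.
SPECIES_CLASSES = {
    "ECOLI": {"bacteria"},
    "STAAU": {"bacteria", "pathogen"},
    "MYCTU": {"bacteria", "pathogen"},
    "LACPL": {"bacteria", "lactobacillus"},
    "MOUSE": {"eukaryote", "mouse"},
    "YEAST": {"eukaryote"},
}


def count_homologues(hit_entry_names):
    """Count BLAST hits per organism category from SwissProt entry names."""
    counts = Counter()
    for entry in hit_entry_names:
        # The species code is whatever follows the last underscore.
        species = entry.rsplit("_", 1)[-1]
        # Species missing from the table contribute to no category.
        for category in SPECIES_CLASSES.get(species, ()):
            counts[category] += 1
    return counts


hits = ["TRMD_ECOLI", "TRMD_STAAU", "TRMD_MYCTU", "TRMD_LACPL"]
print(count_homologues(hits))
```

Keeping the classification in a single table means the definition of "pathogen" can be changed without touching the counting logic.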
The scores of BLAST hits against pathogens were also parsed to find how well conserved a particular gene is amongst pathogens. A protein that is well conserved across many pathogens will make a better target for broad-spectrum antibacterial drugs. A high degree of conservation may also mean that mutations in the protein are not tolerated, such that resistance is less likely to emerge. The numbers of identical residues in each pathogenic hit compared to the query sequence were summed and then divided by the number of hits against pathogens. This number was normalized by dividing by the length of the query sequence, to give a ratio of conservation for this protein across pathogens.
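As a worked illustration of this ratio, the following sketch computes the mean number of identical residues per pathogen hit and divides it by the query length. The function name and the choice to return zero when there are no pathogen hits are assumptions made for the example.

```python
def conservation_ratio(identities_per_hit, query_length):
    """Mean identical residues per pathogen hit, normalised by query length."""
    if not identities_per_hit:
        return 0.0  # assumption: no pathogen hits means no measurable conservation
    mean_identities = sum(identities_per_hit) / len(identities_per_hit)
    return mean_identities / query_length


# Three hypothetical pathogen hits against a 250-residue query.
print(conservation_ratio([200, 150, 100], 250))
```

With these made-up numbers the mean is 150 identical residues, so the protein is conserved over roughly three-fifths of its length on average.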
The query protein was submitted to BLAST separately against the human genome (protein sequences) (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/protein/) and the number of hits was recorded. The closest hit against a human protein was also recorded, with a ratio of similarity given by the number of positive residue matches (matches where amino acids are identical or have similar biochemical properties) divided by the length of the query sequence. The number of positives was chosen so as to err on the side of caution. Any drug designed against a particular bacterial protein may act just as well against a human protein, even if certain key residues are not identical. Similarity of residues may be enough for activity. This metric was included so that potential targets which were not so similar to human proteins would not be so heavily penalized. Even if a human homologue does exist, it may still be possible to design a drug that selectively targets the bacterial protein.
The query gene was then again submitted to the BLAST program to find homologues which are known antibacterial targets or whose structures have been deciphered. This time an 'expectation value' of 1 × 10⁻¹⁰ was used, as to infer suitability as a target or structural similarity it was thought safer to report only very close homologues. After running the BLAST algorithm, the output was parsed to find whether the query gene was homologous to a known antibacterial target. This was done by comparing the SwissProt gene ID against a list of SwissProt IDs (from the ExPASy website: http://ca.expasy.org/enzyme/) of proteins that are known antibacterial targets (Chittum and Champney, 1995; Egebjerg et al., 1989; Kornder, 2002; Lin et al., 1997; Neu and Gootz, 1996; Schnappinger and Hillen, 1996). Of course, not all current drug targets are perfect examples; indeed, many of the drugs that target them are toxic to humans and resistance has begun to emerge in many cases. Nevertheless, treatments which utilize these targets have been shown to be effective in disease control, and so novel targets possessing similar characteristics to known targets may be useful.
The SwissProt species and protein ID codes of each hit in the BLAST results were compared to a look-up table (ftp://beta.rcsb.org/pub/pdb/uniformity/derived_data/) to find out whether any homologues of the query gene had an entry in the PDB database (http://www.rcsb.org/pdb/). A protein with a known structure is more attractive from the point of view of further research, as structure-based drug design can be carried out straightaway. A protein with sequence homology to a protein of known structure is likely to have a similar structure (although this is not always true) and so may be favoured as a potential novel drug target.
Each protein was then submitted to several more restricted BLAST searches against selected bacterial genomes. The BLAST searches were restricted by gi number, specifically to the gi numbers of genes found to be essential or involved in virulence. The genomes chosen are listed in Table 3.
These genomes were selected as they cover a wide range of bacterial types, and also because they are well characterized and are amongst the few species for which this work has been carried out to any great extent. For those species for which this kind of work has not been done, genomics methods may allow us to predict essentiality or involvement in virulence. Proteins that have significant hits against essential genes or genes involved in virulence are likely to have the same characteristics themselves and so may score highly as potential drug targets. The more 'model' genomes in which the gene is found to be essential, the more likely it is that this gene is indeed essential for the query species, and also has greater potential as a target for a broad-spectrum antibacterial drug.
Table 3. List of genomes used for restricted BLAST searches against essential genes or genes involved in virulence
Essential genes
  Bacillus subtilis (Kobayashi et al., 2003)
  Escherichia coli K12 (http://www.shigen.nig.ac.jp/ecoli/pec/About.html)
  Mycobacterium tuberculosis (Sassetti et al., 2003)
  Staphylococcus aureus (Forsyth et al., 2002)
Virulence genes
  Bacillus anthracis (Hoffmaster and Koehler, 1999; Koehler, 2002)
  Escherichia coli O157:H7 EDL993 (Brunder et al., 2001; Sharma and Dean-Nystrom, 2003; Stuber et al., 2003; Wang et al., 2002)
  Mycobacterium tuberculosis (Triccas and Gicquel, 2000)
  Neisseria meningitidis (Sun et al., 2000)
  Staphylococcus aureus (Dunman et al., 2001)
Having assigned each gene in the query genome values for a number of characteristics, these values could then be weighted, summed and ranked to produce a list of high-priority potential targets. This ranking approach was used instead of a machine learning-based approach, as the 'training set' of known antibacterial targets is very small and not necessarily optimal (see Introduction). While the ranking approach is more subjective, it does allow targets to be prioritized which score better according to our metrics than currently known targets.
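The weight, sum and rank procedure can be sketched in a few lines. The metric names, weights and gene labels below are invented for illustration and do not correspond to the values in Table 4. The sketch also assumes that every metric has already been scaled so that a higher value is more favourable (for example, a 'human similarity' score that is high when the protein is dissimilar to any human protein).

```python
def rank_targets(metric_scores, weights):
    """Return (gene, total) pairs sorted by weighted total score, best first."""
    totals = {
        gene: sum(weights[metric] * value for metric, value in scores.items())
        for gene, scores in metric_scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights and per-gene metric values.
weights = {"essential": 10, "conservation": 5, "human_similarity": 8}
metric_scores = {
    "geneA": {"essential": 1.0, "conservation": 0.8, "human_similarity": 1.0},
    "geneB": {"essential": 0.0, "conservation": 0.9, "human_similarity": 0.5},
}
for gene, total in rank_targets(metric_scores, weights):
    print(gene, total)
```

Because the weights are passed in separately from the raw metric values, the same raw data can be re-ranked under any number of alternative scoring systems.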

Assigning weights and the robustness of target prioritization
A number of different weighting schemes were tried so that the weighting scheme could be refined to reflect the relative importance of the various metrics. After a weighting scheme was run on the raw data, the scores for each metric could be summed and the total scores of the targets then ranked. The refinement of the weightings was done by carrying out a sensitivity analysis on the metric scores for the top few ranking targets. Sensitivity analysis is more normally used in biology to find the importance, or so-called control coefficients (http://dbk.ch.umist.ac.uk/mca_home.htm; Fell, 1996; Heinrich and Schuster, 1996; Kell and Westerhoff, 1986), by which each enzyme controls the flux through a metabolic pathway, but can in fact be used to find the relative importance of any variable which contributes to a total. The sensitivity of overall metric A to individual metric v_i is given by
$$C^{A}_{v_i} = \frac{\partial A}{\partial v_i} \cdot \frac{v_i}{A} \qquad (1)$$
Here a more discretized sensitivity analysis was done for each target by taking the score of each metric of the target, finding 1% of this score, dividing this number by the total score and multiplying by 100. When this is done for all metrics, these 'contributions' sum to 1. Thus, sensitivity analysis asks, 'By altering the score of one variable by 1%, what percentage change would this induce in the total score?'. These sensitivity analyses could clearly show when some variables were exerting too much or too little influence on the total score and therefore the weights could be optimized accordingly. This novel approach proved very useful in carefully modifying the scoring systems.
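The discretized calculation described above can be written out as a short sketch. It assumes that the total score is a plain sum of the weighted metric scores; the function name and the example values are hypothetical.

```python
def contributions(weighted_scores):
    """Percentage change in the total caused by a 1% change in each metric."""
    total = sum(weighted_scores.values())
    if total == 0:
        raise ValueError("total score is zero; contributions are undefined")
    # 1% of a metric's score, divided by the total, times 100.
    # For an additive total this reduces to score / total.
    return {
        metric: (0.01 * score) / total * 100
        for metric, score in weighted_scores.items()
    }


# Hypothetical weighted scores for one target.
example = {"conservation": 10.0, "essentiality": 30.0, "distribution": 60.0}
result = contributions(example)
print(result)
print(sum(result.values()))
```

Because the total is additive, each contribution equals the metric's share of the total, which is why the contributions sum to 1 across all metrics.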
Using different weighting schemes also allowed the analysis of how robust a particular high-ranking target was to the weighting scheme. Clearly, a target which scores highly due to having favourable characteristics in one highly weighted metric is less good than one which ranks highly under a number of different scoring systems.
For each of five different scoring systems (Table 4) used on S. aureus, E. coli O157:H7 EDL993 and M. tuberculosis the top 20 ranking targets were recorded. These top 20 lists could then be checked against each other to see whether robust targets had emerged. The top 20 lists were then cross-checked to see whether any targets were robust in all three species (see Table 5). This 'voting' approach can be seen as combining the output of several weak learners, which is known to be a very effective approach to data mining (Bauer and Kohavi, 1999; Dietterich, 2000; Hastie et al., 2001).
The first scoring system was designed to give most influence to those metrics which were felt to be the most important and least influence to those felt to be least important. Homology to essential genes in M. tuberculosis and S. aureus, and homology to virulence genes in Bacillus anthracis were weighted lower than homology to essential and virulence genes in other organisms. This was done to reflect the quality of the data for these organisms, as different methods were used and lists of essential and virulence genes are not always complete. Under the second scoring system all metrics were weighted equally, so that a maximum score for one metric would be the same as for another. For the other three scoring systems most of the metrics were weighted as under the first system. However, in the third scoring system homology to virulence genes was given greater influence, in the fourth homology to essential genes was given greater weight, and in the fifth the level of conservation of the target in pathogens was given more importance.
Table 4. The five different scoring systems used to test the influence of the scoring system on the ranking of targets
Table 5 (final rows only). Target; robustness; total score
  Chromosomal replication initiator protein (dnaA); 10; 12 716
  UDP-N-acetylmuramoylalanyl-D-glutamate-2,6-diaminopimelate ligase (murE), ranked 10; 9; 12 573
These targets rank highly in all three species used and rank in the top 20s of most of the scoring systems used. Robustness is how many times the gene ranks in the top 20 under five different scoring systems across the three species used, giving a maximum robustness score of 15. Total score is the sum of the scores for this target in all scoring systems across all species used. The maximum possible total score is 24 120. * Indicates that the gene is a known target of an antibacterial drug (murA is targeted by fosfomycin and rpsD is a target of tetracyclines).
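The robustness count can be illustrated with a minimal sketch. It assumes one top-ranking list per combination of species and scoring system (15 lists in the study), and that homologous genes in different species have already been mapped to a shared name. The gene lists below are invented and shortened for the example.

```python
from collections import Counter


def robustness(top_lists):
    """Count how many top-ranking lists each gene appears in."""
    votes = Counter()
    for top in top_lists:
        votes.update(set(top))  # a gene counts at most once per list
    return votes


# Three hypothetical (shortened) top lists.
top_lists = [
    ["trmD", "infA", "murA"],
    ["infA", "trmD", "dnaA"],
    ["trmD", "murE", "murA"],
]
for gene, count in robustness(top_lists).most_common():
    print(gene, count)
```

A gene that appears in every list reaches the maximum robustness, which is the number of lists supplied.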

Further investigation of high-scoring targets
Having narrowed down the number of potential drug targets using the methods outlined above, the highest-scoring targets could then be investigated in greater detail. The top genes were again subjected to a BLAST search against the SwissProt database to determine in which pathogens they were present. The databases GenBank, EMBL and DDBJ were also searched via ENTREZ (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi) to find whether or not a copy of the query gene existed in a specific pathogen, in case this had been missed by searching only the SwissProt database. PROSITE (http://us.expasy.org/prosite/) was also searched to find any conserved motifs not identified by BLAST, which could be used to find more distantly related homologues of the query gene. This approach was able to identify any human sequences which, although not closely related in terms of sequence homology, could be very similar in terms of structure and biochemical properties to the query gene. Multiple sequence alignments and phylogenetic trees were created using ClustalX (Thompson et al., 1997) and Mega2.1 (Kumar et al., 2001). This was done to determine how distinct the genes in these pathogens were from those homologues in non-pathogens and eukaryotes, and just how well the 'active sites' of these genes were conserved across the different pathogenic species. The available literature was also searched to gain more insights into these suggested targets.

Scores
According to the scoring systems used, the majority of genes in S. aureus, E. coli O157:H7 EDL993 and M. tuberculosis would make very poor antibacterial targets (see Figures 2-4). In all three bacteria there are also only a few high-scoring genes. The highest ranking of these seem to be fairly robust and tend to rank in the top 20, regardless of which scoring system is used (see Table 5).
It is also apparent there are no targets which are perfect in every way. To obtain a perfect score in the present metrics a target should be present in one copy, be present in all pathogens but not in non-pathogens, eukaryotes or humans. It should be perfectly conserved across all pathogens. Its function should be known, it should be homologous to a known target, homologous to essential and virulence genes in all the model genomes used and its structure should be known. Here even the highest-ranking targets achieve only some 50% of the perfect score.
This is perhaps discouraging, as it means that there is little possibility for the development of a 'magic bullet' drug that is highly effective, specifically targets only pathogens, is easy to develop and is immune to the problems of emerging resistance. However, this never was a likely prospect.
The unusual peaks in the distribution graphs are due to genes of unknown function that, when submitted to BLAST with an expectation value of 0.01, did not return any hits. The two peaks in E. coli O157:H7 EDL993 target scores occur for the same reason, except that the peak at the higher score is due to genes that return no hits but have been assigned some sort of function, presumably by other methods.

Scoring systems and sensitivity analysis
The first scoring system used was designed to reflect what is thought to be most important in terms of the properties an antibacterial drug target should possess. Hence, the first scoring system rates as very important the properties 'species distribution', 'conservation in pathogens', 'similarity to human' and 'homology to essential genes'. A target which does not perform well on any one of these criteria will probably not make a good drug target. The emphasis accorded to these properties means that targets which are not present in a wide range of pathogens, are not well conserved, are very similar to targets in humans, or are not essential will not be able to score highly and thus will not be prioritized. The metric 'species distribution' is weighted so that a target will receive the maximum score if it is present in all the bacteria treated as pathogenic by this study and in no nonpathogens. It is unlikely that this maximum would ever be awarded to a target, and so this property is given a very high weighting to compensate for this fact. The other useful properties a target may possess are, in a sense, bonuses and are scored to reflect this. A target does not necessarily need to be (directly) involved in virulence in order for a drug to neutralize an infection. However, involvement in virulence may bring benefits to using a target, in that the target should be absent from most non-pathogens and also absent from humans. The existence of homologues in humans does not matter per se; rather, it is the similarity (or lack) of the target to a human homologue which is important. Again, this is reflected in the scoring system, with the number of human homologues being less important than proximity. Of course the lack of any human homologues will bring other benefits, such as the reduced need for QSAR studies to find lead compounds that will selectively target only the bacterial version of a protein. 
In a similar way 'known function' and 'entry in PDB' are not crucial properties that a potential target must possess. They simply imply that something is already known about these targets which can be used as a jumping-off point for further investigation. 'Copy number' could be potentially important as, if a protein exists in several copies and a drug targets only one copy, it is possible that a non-targeted copy could take over the function of the target, thus rendering the drug useless. However, it is likely that drugs which can disrupt the function of one copy will disrupt the function of both. Therefore, copy number is given a low/intermediate weighting in the scoring system. (Multiple paralogues also make the development of resistance much less likely.) The properties 'eukaryotic homologues' and 'mouse homologues' are not crucial, but again may make the process of drug development easier. A target without homologues in animal models will be more useful when it comes to carrying out animal trials. The lack of eukaryotic homologues may also allow the use of the target in developing drugs for livestock. The property 'homology to known target' is weighted fairly low in the first scoring system. Homology to a known target may imply something about the biochemical properties of a target which may be relevant to the drug design process. However, targets currently in use are not necessarily the best available, as problems with toxicity and resistance indicate. This study is also aimed at finding novel targets, which is why this property was not treated as being very important. The other scoring systems used are variants of this first scoring system, with the exception of the second, which simply treats all metrics as being equally important. The function of these scoring systems is to perturb the top-ranking targets, to weed out those targets which only perform well because of the vagaries of one scoring system.
We used sensitivity analyses on the top five ranking targets of S. aureus under the five different scoring systems (Figures 5-9) and also show the sensitivity analyses of lowest-ranking (Figure 10) and middle-ranking (Figure 11) targets from S. aureus under the first scoring system used. It can be seen that in most cases homology to virulence genes does not make much of a contribution to the total score of the top-ranking targets. This is generally because, unlike essential genes, virulence genes occur in only a limited spectrum of pathogens. As virulence genes tend to be more or less specific to one mode of pathogenicity, they do not occur in such a broad spectrum of organisms and consequently perform poorly in the metrics 'distribution in bacteria' and 'distribution of homologues'. Similarly, when proteins do show homology to virulence genes in one pathogen, they often have no hits against virulence genes from other pathogens (e.g. see Figure 5).
The top-ranking targets in any scoring system also seem to have very similar sensitivity profiles, each generally being homologous to essential genes in all of the species used, having no homologues in humans and being well-conserved in a wide range of pathogens. Where target profiles differ is in the extent of the gene's distribution amongst pathogens, the extent of conservation (although not much), whether or not a structure is known, whether or not the gene is homologous to a known target, and whether or not the target is homologous to any virulence genes. From Figure 10 it seems that in the profiles of the lowest-ranking targets the metric 'similarity to human homologue' becomes relatively more important. This is simply because these targets score so poorly on most of the other metrics. Many of these lower-ranking targets are hypothetical or unknown proteins which do not return many (if any) hits when submitted to BLAST, so are poorly characterized and score accordingly. Figure 11 shows that the profiles of middle-ranking targets vary. These targets score well on several criteria but poorly in others, such as the number of human homologues or homology to essential genes.
Figure 5. The contribution of each metric to the total score of the top five ranking targets in Staphylococcus aureus, based on the first scoring system used. Numbers refer to metrics described in Table 6. trmD is the top ranking (1st) target under this scoring system.

High-ranking targets
A number of genes not only rank consistently highly under the five scoring systems used, but also appear in the top 20 targets in all three of the genomes used in this study (see Table 5). Of these, UDP-N-acetylglucosamine 1-carboxyvinyltransferase (murA) is a known target of fosfomycin, and the 30S ribosomal subunit protein S4 (rpsD) is a known target of tetracyclines. Of the others, dnaE, murC, murD and murE have been previously suggested as potential drug targets (Bouhss et al., 1997; El Zoeiby et al., 2003; Inoue et al., 2001; Marmor et al., 2001; Projan, 2002; Tanner et al., 1996), and work carried out on murC, murD and murE has revealed effective inhibitors of these proteins (El Zoeiby et al., 2003; Marmor et al., 2001; Tanner et al., 1996). The ribosome is of course currently heavily targeted by antibacterial drugs and suggestions for further work on ribosomal proteins have been made previously (Knowles and King, 1998). The existence of several known and previously suggested targets in the overall top 10 ranking in a sense validates this study, as it indicates that this method of target prioritization is indeed able to identify useful targets. There follows a brief discussion of the potential of some of the novel targets suggested.
Figure 8. The contribution of each metric to the total score of the top five ranking targets in Staphylococcus aureus, based on the fourth scoring system used. Numbers refer to metrics described in Table 6. trmD is the top ranking target under this scoring system, while IF-1 (infA) ranks fifth.
Figure 9. The contribution of each metric to the total score of the top five ranking targets in Staphylococcus aureus, based on the fifth scoring system used. Numbers refer to metrics described in Table 6. IF-1 is the top ranking target under this scoring system, while trmD ranks second.
Figure 10. The contribution of each metric to the total score of the five lowest ranking targets in Staphylococcus aureus based on the first scoring system used. Numbers refer to metrics described in Table 6.
Figure 11. The contribution of each metric to the total score of five mid-ranking targets in Staphylococcus aureus based on the first scoring system used. Numbers refer to metrics described in Table 6.

tRNA methyltransferase (trmD)
tRNA methyltransferase (trmD) catalyses the transfer of a methyl group from S-adenosyl-L-methionine (AdoMet) to G37 within a subset of bacterial tRNA species, which have a G residue at the 36th position (Ahn et al., 2003). It is essential for the maintenance of the correct reading frame during translation. As an enzyme it is probably a better target than those requiring the inhibition of protein-protein interactions, although we note that progress in finding inhibitors of these is now being made (Oneyama et al., 2002; Paulmurugan et al., 2004).
The structure of the enzyme has been determined and is available from the Protein Data Bank (http://www.rcsb.org/pdb/) (Accession Nos 1UAJ, 1UAK, 1UAL and 1UAM) (Ahn et al., 2003). The active site regions of the enzyme, which bind to AdoMet and tRNA, are known and are illustrated in Figure 12. It can be seen that these active site regions are highly conserved. This is encouraging from the point of view of designing a broad-spectrum drug to target this enzyme and also in terms of the reduced potential for resistant mutants emerging.
TrmD has, to our knowledge, never been recommended as an antibacterial drug target in the scientific literature, although it (along with a large batch of other bacterial proteins) has been patented on the basis of experiments in S. aureus in connection with its use as a drug target (United States Patent and Trademark Office http://www.uspto.gov/, Patent No. 6 187 541). Figure 13 shows a neighbour-joining tree of trmD sequences in pathogenic and non-pathogenic bacteria. As can be seen, trmD is present in many pathogens, both Gram-positive and Gram-negative, indicating that it has the potential to be a very good broad-spectrum antibacterial drug target. The enzyme appears to function as a dimer (Ahn et al., 2003), so one would probably seek to target the active site.
Table 6. Metrics, as numbered in the sensitivity analysis figures
  1 Copy number
  2 Distribution of homologues (i.e. how many homologues in non-pathogens vs. pathogens?)
  3 Distribution in pathogens (i.e. how many distinct pathogens is the gene present in, and in how many discrete non-pathogens?)
  4 Conservation in pathogens
  5 Number of homologues in eukaryotes
  6 Number of human homologues
  7 Similarity to human homologue
  8 Number of mouse homologues
  9 Lactobacillus plantarum homologues
  10 Function known?
  11 Homology to known target
  12 Homology to essential gene in Bacillus subtilis
  13 Homology to essential gene in Escherichia coli K12
  14 Homology to essential gene in Mycobacterium tuberculosis
  15 Homology to essential gene in Staphylococcus aureus
  16 Homology to virulence gene in Bacillus anthracis
  17 Homology to virulence gene in Escherichia coli O157:H7 EDL993
  18 Homology to virulence gene in Mycobacterium tuberculosis
  19 Homology to virulence gene in Neisseria meningitidis
  20 Homology to virulence gene in Staphylococcus aureus
  21 Homology to PDB entry
It is also important to know in which pathogens trmD is absent. Table 7 shows which of the pathogens (as defined by this study) are not known to possess trmD. It is important to note, however, that only a small number of these strains have been extensively sequenced (those marked *). Thus, a copy of the gene may exist in these species/strains despite its absence from the databases. As can be seen, there are only a small number of species/strains which have been entirely sequenced and which do not possess a copy of trmD.
[Figure legend fragment; beginning not recovered] ... and M. tuberculosis were used as BLAST queries and the (non-redundant) hits from these searches were combined. BLAST was run with an expectation value of 0.01. A sequence from Acinetobacter calcoaceticus was removed, as it was considerably shorter than the others. Active site regions are highlighted by black boxes and labels show the function of the active site after Ahn et al. (2003). Sequences were aligned using ClustalX (Thompson et al., 1997).
Figure 13. Neighbour-joining tree showing the distribution of trmD in pathogenic and non-pathogenic bacteria. Sequences were aligned in ClustalX (Thompson et al., 1997) (trmD from Acinetobacter calcoaceticus was removed, as it is considerably shorter than the other trmD sequences). The tree was created using Mega 2.1 using default parameters (Kumar et al., 2001). Branches leading to Gram-negative bacteria are coloured blue, those leading to Gram-positive bacteria in red, and those leading to Cyanobacteria in dark green. The tree is rooted using DCA3 from Brassica juncea and DCA1 from Arabidopsis thaliana as the outgroup (branches highlighted in purple). trmD sequences from bacteria treated as pathogenic by this study are marked with a red diamond. Numbers on branches show bootstrap support for groupings based on 100 replicates (values <50 are not shown). Scale bar shows number of substitutions per site.
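The hit-pooling step mentioned in the figure legend (several query sequences, an expectation value cutoff of 0.01, and a non-redundant union of hits) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name pool_hits, the tuple layout and the example identifiers are hypothetical, and the hits are assumed to have already been parsed from BLAST output into (query, subject, E-value) rows.

```python
from typing import Dict, Iterable, List, Tuple

# One parsed BLAST hit: (query_id, subject_id, e_value).
# This layout is an assumption made for the sketch.
Hit = Tuple[str, str, float]


def pool_hits(hits: Iterable[Hit], max_evalue: float = 0.01) -> Dict[str, float]:
    """Combine hits from several queries into a non-redundant set.

    Returns {subject_id: best (lowest) E-value}, keeping only hits
    whose E-value does not exceed max_evalue.
    """
    best: Dict[str, float] = {}
    for _query, subject, evalue in hits:
        if evalue > max_evalue:
            continue  # hit fails the expectation value cutoff
        if subject not in best or evalue < best[subject]:
            best[subject] = evalue  # keep each subject once, with its best E-value
    return best


# Hypothetical toy data: three queries, one subject found twice, one weak hit.
example_hits: List[Hit] = [
    ("query_A", "subj_1", 1e-40),
    ("query_B", "subj_1", 1e-35),
    ("query_B", "subj_2", 0.5),
    ("query_C", "subj_3", 0.005),
]
pooled = pool_hits(example_hits)
print(sorted(pooled))
```

Keeping the lowest E-value per subject is one simple way to make the combined set non-redundant; other choices (for example, keeping the longest alignment) would be equally defensible.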
While there is a version of trmD in humans, the human version of the protein covers only the C-terminal end of the bacterial protein, as illustrated by Figure 12B. This may allow a selective drug to be developed which targets the bacterial but not the human version of the protein.
Another issue of importance in the development of a new drug is ease of assay development (Allsop, 1998). If a copy of a protein exists in yeast, then in vitro assay development is fairly straightforward via haploinsufficient phenotype (hp)-based strategies (Giaever et al., 1999). However, no trmD homologue exists in yeast, so an alternative strategy must be used here. Other assays will of course be necessary for cellular functional assays and analysis in vivo.

Translation initiation factor IF-1 (infA)
Translation initiation factor IF-1 scores well on all criteria except for homology to virulence genes, homology to known targets and the number of homologues in eukaryotes. IF-1 is very well conserved in pathogens, and this contributes significantly to its high score.
The precise function of initiation factor IF-1 is unknown. However, it is known to be one of a number of factors essential for the establishment of the correct reading frame during translation (Dahlquist and Puglisi, 2000). It is therefore one determinant of translation accuracy. IF-1 is essential for cell viability and cells deficient in IF-1 exhibit few polysomes (Cummings and Hershey, 1994).
IF-1 is also well conserved across all the species/strains in which it is present, and thus a drug could be designed to attack this target in a broad spectrum of pathogens. A number of residues are identical in all sequences, perhaps indicating a high selection pressure against mutation at these positions. This is encouraging from the point of view of drug resistance. Any resistant strains arising through mutation of these residues could be severely attenuated compared to the wild-type form.
It has been observed that IF-1 contains a repeated sequence motif (S1-RM) which is also found in ribosomal protein S1 (whose function is to enhance translational initiation in Gram-negative bacteria) (Gribskov, 1992). Thus, a drug designed to target this motif could attack two different essential gene products at the same time, which would be highly advantageous from the point of view of drug resistance. This motif appears to be involved in RNA binding. However, this motif is also found in eukaryotic translation initiation factor α chains (http://us.expasy.org/prosite/) and several copies of this exist in humans. Figure 14 shows the sequences of IF-1 in pathogens aligned against the three eIF-1α sequences found in humans. A number of positions in the alignment are highly conserved in both humans and pathogens, so some skill may be required to develop a drug which targets pathogens but is not toxic to humans. This highlights the problem of using overall sequence similarity to determine the close functional and biochemical relatives of a gene product.
Nevertheless, there do exist cases of successful drugs which target proteins which are also present in humans, such as the antifungal strobilurins which target cytochrome bc1 (Weber et al., 1990). It can also be seen from Figure 14 that there are a number of positions in the protein where all or most pathogenic sequences have one residue (or biochemically similar residues) but where human sequences possess a biochemically different residue. Therefore, there is still plenty of potential for the development of a drug which targets pathogens without interfering with the human form of the protein.
Figure 14. Alignment of IF-1 proteins in pathogens and human eIF-1α sequences (bottom three sequences, marked *). Some identical sequences from close relatives of species/strains shown are omitted for ease of presentation. The alignment was created using ClustalX (Thompson et al., 1997). As can be seen, a number of positions are highly conserved in both human and pathogen sequences. Black arrows mark positions where the majority of pathogens share the same or similar residues but where human sequences possess a different residue.
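The kind of position marked in Figure 14 (most pathogen sequences share the same or a similar residue, while the human sequences carry a biochemically different one) can be located programmatically in an alignment. The following Python sketch is only illustrative: the residue grouping in GROUPS, the 0.8 threshold, the function name and the toy sequences are all assumptions, not the criteria actually used to draw the arrows in the figure.

```python
from typing import Dict, List

# Coarse biochemical residue classes; this particular grouping is an
# illustrative assumption, not the one used in the study.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
CLASS_OF: Dict[str, int] = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def discriminating_columns(pathogen: List[str], human: List[str],
                           min_fraction: float = 0.8) -> List[int]:
    """Return 0-based alignment columns where at least min_fraction of the
    pathogen rows share one residue class and no human row has that class.

    All rows must be the same length (already aligned). Gap characters
    have no class; columns with a gap in any human row are skipped.
    """
    columns: List[int] = []
    for col in range(len(pathogen[0])):
        counts: Dict[int, int] = {}
        for seq in pathogen:
            cls = CLASS_OF.get(seq[col])
            if cls is not None:
                counts[cls] = counts.get(cls, 0) + 1
        if not counts:
            continue  # column is all gaps in the pathogen rows
        top_class, top_count = max(counts.items(), key=lambda kv: kv[1])
        if top_count / len(pathogen) < min_fraction:
            continue  # pathogens do not agree strongly enough
        human_classes = {CLASS_OF.get(seq[col]) for seq in human}
        if None in human_classes:
            continue  # gap or unknown residue in a human row
        if top_class not in human_classes:
            columns.append(col)
    return columns


# Hypothetical toy alignment (not real IF-1 or eIF-1α sequences).
pathogen_rows = ["MKRDV", "MKKDI", "MRKEL"]
human_rows = ["MSGDF", "MTGDW"]
print(discriminating_columns(pathogen_rows, human_rows))
```

Columns reported by such a scan are only candidates; whether a difference at a given position can actually be exploited for selectivity depends on the structure and on the role of that residue.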
The structure of IF-1 has been determined and is available from the Protein Data Bank (http://www.rcsb.org/pdb/) (Accession Nos 1HRO and 1AH9). IF-1 is known to function as a monomer. Figure 15 shows a phylogenetic tree of IF-1 and its homologues. It can be seen that the eukaryotic (chloroplast) sequences and those from bacteria can be split fairly well. Sequences from humans and pathogenic bacteria can be split easily. The sequences from pathogens and non-pathogens cannot be split, at least at the whole-sequence level, so it may prove impossible to develop a drug which targets IF-1 only in pathogens. This is not necessarily something to be overly concerned with, however, as curing the disease is more important than preserving the commensal flora.
There are a number of pathogens in which IF-1 has not been found (see Table 8). Again, this does not necessarily mean these pathogens do not possess a copy, as in many cases whole-genome sequencing has not been carried out. As can be seen, there are only a small number of species which have been entirely sequenced and in which IF-1 is absent.

Conclusions
This study has used a simple but rational collection of criteria on which to rate bacterial gene products as potential broad-spectrum antibacterial drug targets. We assessed all the proteins from S. aureus, E. coli O157:H7 EDL993 and M. tuberculosis on criteria such as distribution, essentiality and involvement in virulence. All the proteins from each of these organisms were ranked in order of suitability as a potential drug target. It has been shown that although only a small proportion of gene products in any of the genomes would make useful drug targets, those which do rank highly do so fairly independently of the scoring system used. From these rankings it has been found that not only do a number of proteins rank highly under most or all of the different scoring systems used, but they also rank highly in all three genomes used. These targets have been described in some further detail and are left as suggestions for further in-depth analysis.
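The ranking logic summarized above (score each protein as a weighted combination of criteria, rank, then check which proteins stay near the top when the weights are varied) can be sketched in a few lines of Python. The metric names, values and weights below are invented for illustration and are not those used in the study; rank_proteins and robust_top are hypothetical helper names.

```python
from typing import Dict, List

Metrics = Dict[str, Dict[str, float]]  # protein id -> {criterion: value}


def rank_proteins(metrics: Metrics, weights: Dict[str, float]) -> List[str]:
    """Rank protein ids by the weighted sum of their criterion values,
    highest score first (ties broken alphabetically by id)."""
    scores = {
        pid: sum(weights.get(name, 0.0) * value for name, value in vals.items())
        for pid, vals in metrics.items()
    }
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


def robust_top(metrics: Metrics, schemes: List[Dict[str, float]],
               top_n: int) -> List[str]:
    """Return the proteins that fall in the top_n under every weighting scheme."""
    tops = [set(rank_proteins(metrics, w)[:top_n]) for w in schemes]
    return sorted(set.intersection(*tops))


# Hypothetical toy data: three proteins, three criteria, three weightings.
toy_metrics: Metrics = {
    "protA": {"essential": 1.0, "conservation": 0.9, "human_similarity": 0.1},
    "protB": {"essential": 1.0, "conservation": 0.4, "human_similarity": 0.8},
    "protC": {"essential": 0.0, "conservation": 0.95, "human_similarity": 0.0},
}
toy_schemes = [
    {"essential": 3.0, "conservation": 2.0, "human_similarity": -2.0},
    {"essential": 1.0, "conservation": 1.0, "human_similarity": -1.0},
    {"essential": 5.0, "conservation": 1.0, "human_similarity": -3.0},
]
print(robust_top(toy_metrics, toy_schemes, top_n=2))
```

In this toy example only one protein remains in the top two under all three weightings, which mirrors the idea of a target that ranks highly independently of the scoring system. Applying the same intersection across genomes would give the analogous cross-species check.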