Analysis of Known Bacterial Protein Vaccine Antigens Reveals Biased Physical Properties and Amino Acid Composition

Many vaccines have been developed from live attenuated forms of bacterial pathogens or from killed bacterial cells. However, an increased awareness of the potential for transient side-effects following vaccination has prompted an increased emphasis on the use of sub-unit vaccines, rather than those based on whole bacterial cells. The identification of vaccine sub-units is often a lengthy process and bioinformatics approaches have recently been used to identify candidate protein vaccine antigens. Such methods ultimately offer the promise of a more rapid advance towards preclinical studies with vaccines. We have compared the properties of known bacterial vaccine antigens against randomly selected proteins and identified differences in the make-up of these two groups. A computer algorithm that exploits these differences allows the identification of potential vaccine antigen candidates from pathogenic bacteria on the basis of their amino acid composition, a property inherently associated with sub-cellular location.


Introduction
During the past 200 years the use of vaccines to control infectious diseases caused by bacterial pathogens has generally proved to be both effective and safe (Poland, 1999; Wilson and Marcuse, 2001). Many of these vaccines were discovered using an empirical approach (Nilsson, 2002) and included live attenuated forms of bacterial pathogens, killed bacterial cells and individual components of the bacterium (sub-units). Although many bacterial vaccines are still widely used, a shift towards reliance on antibiotics for the control of infectious diseases occurred during the latter half of the twentieth century.
The recent appearance of antibiotic-resistant strains of many bacterial pathogens (Gould, 2002; Russell, 2002) has prompted a resurgence of interest in the use of vaccines to prevent disease. However, some vaccines are not considered to offer appropriate levels of protection against infection and there are still many infectious diseases for which effective vaccines are not available (Poland et al., 2002). In addition, an increased awareness of the potential for transient or longer-term side-effects following vaccination (Wilson and Marcuse, 2001) has prompted an emphasis on the use of sub-unit vaccines.
Whilst empirical approaches to the selection of vaccine sub-units are still employed, bioinformatics approaches to select candidate protein sub-units from bacterial genome sequences have been used more recently (De Groot et al., 2001; Gomez et al., 2000; Montgomery, 2000; Nilsson, 2002). This approach has been termed 'reverse vaccinology' (Gomez et al., 2000; Rappuoli, 2001).
Generally, 'in silico' approaches to the identification of vaccine antigens have relied on the assumption that candidate proteins will be located on the outer surface of, or exported from, the bacterium. Amino acid composition has been shown to be useful in the prediction of the sub-cellular location of proteins (Feng, 2002). Some workers have identified open reading frames (ORFs) that encode proteins possessing a signal sequence and screened this dataset to exclude proteins with transmembrane domains (Gomez et al., 2000; Pizza et al., 2000), and to include proteins with lipoprotein attachment sites (Chakravarti et al., 2000; Gomez et al., 2000) or other motifs associated with cell surface anchoring (Pizza et al., 2000; Ross et al., 2001). Other programs have been used to predict epitopes that bind to T cell receptors or major histocompatibility complexes to assist in vaccine design and development (Bond et al., 2001; Grandi, 2001; Mallios, 1999, 2001; Savoie et al., 1999). Whilst these various approaches have yielded novel sub-unit vaccines, the predictive power of these methods may be limited by our knowledge of protein export and processing pathways in different bacterial species, by the assumption that vaccine antigens will be surface-located and by our limited knowledge of the molecular architecture of outer membrane proteins.
We have set out to investigate whether the biophysical properties of reported protein vaccine antigens are significantly different from a representative control protein dataset.

Construction of vaccine antigen dataset
Bacterial vaccine antigens were identified by patent and literature searches. To qualify for inclusion, the candidate, whole or part of the protein or corresponding DNA must have been shown to induce a protective response in an appropriate animal model after immunization. The amino acid sequences of the vaccine antigens were obtained from publicly available sequence databases, primarily from the National Centre for Biotechnology Information (http://www.ncbi.nlm.nih.gov).

Construction of control dataset
A control dataset was constructed that mirrored the vaccine antigen dataset with respect to the proportion of entries from each genus. For each entry in the vaccine antigen dataset we randomly selected 35 proteins from the proteome of a representative species from the same genus. Where possible, the same species from the vaccine antigen dataset was used for the control dataset. In cases where an appropriate genome sequence was available but had not been annotated, the proteome was predicted using Glimmer, a system for finding genes in microbial DNA (http://www.tigr.org/software.glimmer) (Delcher et al., 1999). Where no completed genome sequence was available for any member of the genus represented in the vaccine antigen dataset, all of the known proteins from a chosen species were downloaded from the publicly available protein sequence database (NCBI) and 35 proteins were then randomly selected.
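The sampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_control_dataset`, the genus-to-proteome dictionary and the placeholder protein identifiers are hypothetical, and the text does not say whether a protein could be drawn more than once when several antigens share a genus (here each draw of 35 is without replacement, but separate draws are independent).

```python
import random

def build_control_dataset(antigen_genera, proteomes, per_antigen=35, seed=0):
    """Draw `per_antigen` random proteins for every vaccine antigen entry.

    antigen_genera: one genus name per vaccine antigen entry (hypothetical input).
    proteomes: dict mapping genus name -> list of protein identifiers
               from the representative species (hypothetical input).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    control = []
    for genus in antigen_genera:
        pool = proteomes[genus]
        # sample without replacement within a single draw
        control.extend(rng.sample(pool, per_antigen))
    return control

# Placeholder data: three antigen entries from two genera.
proteomes = {
    "Neisseria": [f"NM{i:04d}" for i in range(2000)],
    "Yersinia": [f"YP{i:04d}" for i in range(4000)],
}
antigen_genera = ["Neisseria", "Neisseria", "Yersinia"]
control = build_control_dataset(antigen_genera, proteomes)
print(len(control))  # 3 entries x 35 proteins
```

With 72 antigen entries, the same procedure yields 72 x 35 = 2520 control proteins, and the genus proportions of the control dataset mirror those of the antigen dataset by construction.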

Removal of signal sequences
Known signal sequences of vaccine antigens were removed from entries in the vaccine antigen dataset. Proteins without a reported signal sequence, and all proteins in the control datasets, were separated into Gram-negative and Gram-positive entries and analysed using SignalP (Nielsen et al., 1997; http://www.cbs.dtu.dk/services/SignalP). Predicted signal sequences were removed to create two further datasets on which all comparisons of the vaccine antigen and control datasets were done.

Construction of sub-cellular location protein datasets
A search of the SWISSPROT database (Swiss Institute of Bioinformatics; http://www.expasy.ch/sprot; Bairoch and Apweiler, 2000) identified proteins with defined sub-cellular locations for each of the bacterial species used to construct the control dataset. No entries were available in SWISSPROT for Corynebacterium diphtheriae, so Corynebacterium glutamicum proteins were used instead. Any entries where the sub-cellular location of the protein was listed as 'putative', 'by similarity' or 'suggested' were omitted from the datasets. Separate datasets were constructed for each subcellular location, producing cytoplasmic (736 proteins), inner membrane (265 proteins), periplasmic

Analysis of physical properties of proteins in the control and vaccine antigen datasets
Predicted molecular weights and predicted isoelectric points (pI) of proteins were calculated. Each protein in the control and vaccine antigen datasets was scored for hydrophobicity (Kyte and Doolittle, 1982), flexibility (Bhaskaram and Ponnuswamy, 1988), bulkiness (Zimmermann et al., 1968) and relative mutability (Dayhoff et al., 1978). The statistical significance of any differences was calculated by the Mann-Whitney test (Mann and Whitney, 1947; Wilcoxon, 1945). For all analyses, a p score of <0.05 was considered to be significant.
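As an illustration of scoring a protein on a published scale, the sketch below averages the Kyte and Doolittle (1982) hydropathy values over a sequence. The text does not state whether scores were whole-sequence means or window-based, so the whole-sequence mean used here is an assumption, and the function name `mean_hydropathy` is hypothetical.

```python
# Kyte and Doolittle (1982) hydropathy index for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(sequence):
    """Mean hydropathy over all standard residues in the sequence.

    Assumption: one whole-sequence average per protein; non-standard
    residue codes (e.g. X) are skipped.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard residues")
    return sum(values) / len(values)

hydrophobic = mean_hydropathy("LLIVVA")   # positive: mostly apolar residues
hydrophilic = mean_hydropathy("KKDENR")   # negative: mostly charged or polar residues
print(hydrophobic, hydrophilic)
```

The same pattern applies to the flexibility, bulkiness and relative mutability scales: only the lookup table changes.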

Calculation of amino acid composition of control and vaccine antigen datasets
The percentage amino acid composition of every protein was calculated. Statistically significant differences in amino acid composition between the control and vaccine antigen datasets were calculated by the Mann-Whitney test (Mann and Whitney, 1947; Wilcoxon, 1945).
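A minimal sketch of this step, using only the Python standard library, is given here. The function names and the example sequences are hypothetical. The Mann-Whitney p value uses the normal approximation without tie or continuity corrections, which is a simplification chosen for brevity and not necessarily what the authors' statistics package did.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def percent_composition(sequence):
    """Percentage of each standard amino acid in one protein."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    return {aa: 100.0 * residues.count(aa) / len(residues) for aa in AMINO_ACIDS}

def mann_whitney(xs, ys):
    """Return (U statistic for xs, two-sided p value).

    U counts the pairs with x > y (ties count one half). The p value is
    the plain normal approximation: no tie or continuity correction.
    """
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    n, m = len(xs), len(ys)
    mean_u = n * m / 2
    sd_u = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical toy sequences: compare lysine content between two groups.
vaccine_seqs = ["MKKTNSGA", "KKNTSGAA", "NKTSKGAT"]
control_seqs = ["MLLIVALC", "LLVIFAMC", "ILVLAFCM"]
lys_vaccine = [percent_composition(s)["K"] for s in vaccine_seqs]
lys_control = [percent_composition(s)["K"] for s in control_seqs]
u, p = mann_whitney(lys_vaccine, lys_control)
print(u, p)
```

In the study, the test would be repeated for each of the 20 amino acids, with p < 0.05 taken as significant.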

Development of scoring algorithms
The amino acid composition of each dataset was calculated as described above and the statistically significant differences noted. A score table was then produced, based on these differences. Each amino acid score was calculated using the mean dataset scores, as follows:

\[
\text{Amino acid score} = \frac{\%\ \text{composition of vaccine antigen dataset} - \%\ \text{composition of control dataset}}{\%\ \text{composition of control dataset} / 10}
\]

Amino acids more frequently found in the vaccine antigen dataset compared against the control dataset received a positive score, while those depleted in the vaccine antigen dataset received a negative score. Those that showed no statistically significant difference between the two datasets scored 0.
The scoring scale devised from the above analysis was used to score proteins in the vaccine antigen and control datasets as follows:

\[
\text{Protein score} = \frac{\sum \text{amino acid scores}}{\text{number of amino acids in the protein}}
\]

The vaccine antigen scoring scale was applied to proteins from the sub-cellular datasets and the predicted proteome of Streptococcus pneumoniae strain R6 (Hoskins et al., 2001).
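The two scoring steps can be sketched as follows. The sketch reads the amino acid score as the difference in mean percentage composition divided by one tenth of the control composition, and the protein score as the sum of per-residue scores divided by protein length. That reading, the function names and the example percentages are assumptions made for illustration; the numbers are not values from Table 4.

```python
def amino_acid_score(vaccine_pct, control_pct, significant=True):
    """Score one amino acid from mean % composition in each dataset.

    Returns 0 when the difference between datasets was not significant.
    """
    if not significant:
        return 0.0
    return (vaccine_pct - control_pct) / (control_pct / 10)

def protein_score(sequence, score_table):
    """Sum of per-residue scores divided by the number of residues."""
    if not sequence:
        raise ValueError("empty sequence")
    total = sum(score_table.get(aa, 0.0) for aa in sequence.upper())
    return total / len(sequence)

# Hypothetical mean compositions (vaccine %, control %, significant?).
example = {
    "K": (9.0, 6.0, True),    # enriched in antigens -> positive score
    "L": (8.0, 10.0, True),   # depleted in antigens -> negative score
    "A": (8.1, 8.0, False),   # not significant -> 0
}
score_table = {aa: amino_acid_score(v, c, sig) for aa, (v, c, sig) in example.items()}
print(score_table)
print(protein_score("KKLA", score_table))
```

Because the protein score is normalised by length, proteins of very different sizes can be ranked on the same scale.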

Construction of histograms
The distributions of scores from dataset comparisons are represented as histograms. Proteins from each of the two datasets being compared (a query dataset and a control) were scored according to published scales (for hydrophobicity, flexibility, bulkiness and relative mutability) or using the scales generated from amino acid sequences, as described previously. The scores from the query and control datasets were then combined and ranked. The range of scores generated was divided into 25 equal parts (histogram bins) that were used to represent the x axis of the histogram. The upper limit of each bin is used as the axis label. The y axis shows the percentage of proteins from each dataset that lies within each range of scores.
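The binning procedure can be sketched as below: the pooled score range is split into 25 equal-width bins, each bin is labelled by its upper limit, and each dataset is reported as a percentage of its own size. The function name `percentage_histogram` and the example scores are hypothetical, and placing the maximum score in the last bin is an implementation choice not spelled out in the text.

```python
def percentage_histogram(query, control, bins=25):
    """Return (upper bin limits, % of query per bin, % of control per bin)."""
    combined = list(query) + list(control)
    lo, hi = min(combined), max(combined)
    width = (hi - lo) / bins or 1.0  # avoid zero width if all scores are equal

    def percentages(scores):
        counts = [0] * bins
        for s in scores:
            # the maximum score falls in the last bin rather than a 26th one
            idx = min(int((s - lo) / width), bins - 1)
            counts[idx] += 1
        return [100.0 * c / len(scores) for c in counts]

    upper_limits = [lo + width * (i + 1) for i in range(bins)]
    return upper_limits, percentages(query), percentages(control)

# Hypothetical scores for a small query and control dataset.
query_scores = [0.0, 5.0, 10.0]
control_scores = [10.0, 20.0, 25.0]
limits, query_pct, control_pct = percentage_histogram(query_scores, control_scores)
print(limits[0], limits[-1])
```

Expressing each dataset as a percentage of its own size keeps the two curves comparable even when one dataset is far larger, as with 72 antigens against 2520 control proteins.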

Composition of the vaccine antigen and control dataset
In total, 72 non-homologous vaccine antigens were identified, originating from 32 bacterial species in 23 genera (Table 1) with 26 originating from Gram-positive bacteria and 46 from Gram-negative bacteria (for the purposes of this study, mycobacteria were treated as Gram-positive bacteria). A control dataset of 2520 proteins was constructed by randomly selecting 35 proteins from each representative species for each entry in the vaccine antigen dataset (Table 2). The size of the control dataset was selected so that it was approximately the number of proteins encoded by a typical bacterial genome. These vaccine antigen and control datasets were used for all subsequent comparisons. Of the proteins in the vaccine antigen dataset, 52 (72%) were identified as having signal

Physical properties of proteins in the vaccine antigen and control datasets
The isoelectric points (pI) and molecular weights were predicted for all proteins in the vaccine antigen and control datasets. The results were ranked and the distributions displayed as histograms (Figure 1a, b). The two-peak pattern of pI values seen with both the control and vaccine antigen (positive) datasets was also seen with the predicted proteomes analysed from Escherichia coli, Mycobacterium tuberculosis, Neisseria meningitidis and Streptococcus pneumoniae (data not shown). The median values for each dataset were calculated and the Mann-Whitney test was applied. A comparison of the positive and control datasets revealed statistically significant differences for both molecular weight and pI.

Amino acid composition of vaccine antigen and control datasets
We analysed the amino acid compositions of the proteins in the vaccine antigen and control datasets using scales for hydrophobicity, flexibility, bulkiness or relative mutability, according to previously reported scoring methods (Bhaskaram and Ponnuswamy, 1988; Dayhoff et al., 1978; Kyte and Doolittle, 1982; Zimmermann et al., 1968). The output from each of these analyses was displayed

Development of scoring algorithm
Although the differences between the vaccine antigen and control datasets using the various published scales were statistically significant, the separation of the distributions was poor, with a high percentage of one dataset falling within 1 SD of the mean of the other dataset (Table 3). We have devised a scoring system based on the average amino acid composition of all of the proteins in the positive and control datasets (Table 4). This scoring table was used to score individual proteins in the vaccine antigen and control datasets, and the results of this analysis were displayed as a histogram (Figure 3). The difference between the positive and control datasets scored in this way was statistically significant, and the distributions of the scores were also better separated, with only around 18% of one dataset falling within 1 SD of the mean of the other dataset (Table 3).
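The overlap measure used here (the percentage of one dataset lying within 1 SD of the mean of the other) can be sketched as follows. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or population form was used.

```python
import statistics

def percent_within_one_sd(scores, reference):
    """Percentage of `scores` lying within 1 SD of the mean of `reference`.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mean = statistics.mean(reference)
    sd = statistics.stdev(reference)
    inside = sum(1 for s in scores if abs(s - mean) <= sd)
    return 100.0 * inside / len(scores)

# Hypothetical scores: reference has mean 2 and sample SD 2.
reference_scores = [0, 2, 4]
other_scores = [1, 3, 5, 10]
overlap = percent_within_one_sd(other_scores, reference_scores)
print(overlap)
```

A lower percentage indicates better separation of the two distributions, which is why the value of around 18% for the composition-based scale compares favourably with the published scales.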

Vaccine scoring algorithm applied to other datasets
We considered that the differences in amino acid composition of the vaccine antigen and control datasets might reflect the differences in the likely cellular locations of the proteins. Therefore we applied the scoring algorithm to groups of proteins with known cellular locations (cytoplasmic, inner membrane, periplasmic, outer membrane or secreted) and compared each sub-cellular dataset against both the vaccine antigen and control datasets. There was no significant difference between the scores of known bacterial vaccine antigens and the scores of outer membrane or secreted proteins (Table 5). The control dataset showed no bias to any one sub-cellular location.

Table legend: For each scale listed, the percentage of scores in each dataset that falls within 1 SD of the mean of the control database or of the vaccine antigen database is given.

Vaccine scoring algorithm applied to a test proteome
To evaluate the algorithm, we analysed the proteome of S. pneumoniae R6 (2043 proteins) and ranked the proteins by score. The vaccine antigen database contains four entries from S. pneumoniae. Pneumococcal surface protein A (PspA) was the highest ranked (10th) of these four known protective antigens, with the other three vaccine antigens ranking within the top 10% (within the first 204 proteins when ranked by score; Table 6). Potential vaccine candidates from S. pneumoniae N4 (Wizemann et al., 2001), and known pneumococcal virulence factors that may also have potential as vaccine antigens (Jedrzejas, 2001), were also found within the top 10% of proteins when ranked by our scoring algorithm. Of the five proteins identified by Wizemann et al., SP101, a conserved hypothetical protein with a signal peptidase II cleavage site motif, had the lowest ranking of all vaccines and potential vaccine antigens at 376 (Table 6). Predicted signal sequences were removed from the S. pneumoniae R6 proteome and the proteins were ranked again as described above. Slight changes in rankings were observed; however, all but SP101 were again found to rank within the top 10% (Table 6). Of the top 100 pneumococcal proteins ranked by our algorithm, 31 were predicted to possess a signal sequence.

Table legend: The vaccine antigen scale was used to score proteins from either the vaccine antigen or control dataset and datasets of proteins from various cellular locations. The p score (the probability that two datasets share the same median) was calculated by the Wilcoxon Rank Sum test.
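The ranking step can be sketched as below: proteins are sorted by descending score, and a rank is then checked against the top-10% cut-off. The function names and the toy score dictionary are hypothetical; the ranks 10 and 376 and the proteome size of 2043 are the values reported in the text. How tied scores were ordered is not stated, so ties here simply keep their input order.

```python
def rank_proteome(scores):
    """Map protein name -> rank (1 = highest score).

    `scores` is a dict of protein name -> protein score. Ties keep their
    insertion order (an assumption; tie handling is not specified).
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}

def in_top_fraction(rank, total, fraction=0.10):
    """True if `rank` falls within the top `fraction` of `total` proteins."""
    return rank <= int(total * fraction)

# Toy example with hypothetical names and scores.
ranks = rank_proteome({"protA": 0.1, "protB": 0.9, "protC": 0.5})
print(ranks)

# Ranks reported in the text for the 2043-protein S. pneumoniae R6 proteome.
print(in_top_fraction(10, 2043))    # PspA, ranked 10th
print(in_top_fraction(376, 2043))   # SP101, ranked 376th
```

With 2043 proteins the cut-off is rank 204, so a protein ranked 10th lies inside the top 10% and one ranked 376th lies outside it.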

Discussion
The genome sequences of many bacterial pathogens have now been determined and this has prompted significant work to investigate how these genome sequences can be interpreted to provide improved pre-treatments or therapies for disease. Previous workers have used a range of methods to identify vaccine antigens. Some workers have assumed that vaccine antigens are located on the surface of the bacterium, and used algorithms that predict the cellular location to interrogate the predicted bacterial proteome for novel vaccine candidates (Gomez et al., 2000). Others have used algorithms to locate proteins with sequence homology to known vaccines (Moxon et al., 2002). Such techniques would fail to predict new families of vaccine candidates. Other reported methods involve the identification of tandem repeats at the 5′ end of a gene, since such repeats have been associated with some virulence genes (Hood et al., 1996). However, many virulence-associated genes lack such repeats and would not have been identified using this method. We have extended these approaches to identify the likely properties of vaccine antigens by comparing the amino acid composition of known protein vaccine antigens with those of randomly selected proteins in a control dataset. It has been a generally held hypothesis that secreted or surface-located proteins are most likely to induce a protective immune response (Grandi, 2001). In silico methods have therefore been employed to identify potential vaccine antigens by predicting secreted proteins by searching for signal sequences (Chakravarti et al., 2000; Gomez et al., 2000; Janulczyk and Rasmussen, 2001). Our analysis has confirmed for the first time that a higher proportion of protein vaccine antigens have signal sequences when compared to the control dataset (72% vs. 14%).
Protein antigens that have no classic leader sequence, such as ESAT-6 from M. tuberculosis (Li et al., 1999; Olsen et al., 2001; Sorensen et al., 1995), would not be identified by methods that search for signal sequences. Using our scoring algorithm, ESAT-6 was ranked 92nd out of the 3918 proteins in the entire predicted proteome of M. tuberculosis (i.e. in the top 3%).
Both the predicted pI values and the predicted molecular weights of the proteins in the positive dataset differed significantly from those of the control dataset. The bimodal pattern of the pI values occurred with all of the datasets analysed and confirms previous observations with bacterial and archaeal proteomes (Van Bogelen et al., 1999). Since proteins are generally less soluble around their isoelectric points, and the cytoplasm has a pH value near to neutrality, it has been suggested that cytoplasmic proteins rarely have a neutral pI.
Our analysis has revealed that the hydrophobicity, bulkiness, flexibility and mutability of vaccine antigens are significantly different from these properties of our control dataset. As most vaccine antigens previously described are surface-exposed or secreted, they are more likely to be in contact with surrounding media. This might be reflected in their hydrophobicity and may therefore explain the differences seen between the two datasets using hydrophobicity as a scale. The difference in mutability could reflect the ability of pathogens to alter their antigenic presentation and thereby evade the host's immune system. Phenotypic variation in the relevant cell-surface proteins has been seen amongst clinical isolates of some species, suggesting that antigenic proteins can mutate and evolve during the period of infection (Peterson et al., 1995). This could also account for the differences seen in the comparisons of bulkiness, molecular weights and flexibility since the use of small, flexible residues on a protein surface may also reflect the capability to mutate. The difference in molecular weight reflects the size ranges of the two datasets. The control dataset ranges from 1.62 to 252 kDa, whilst the vaccine antigen dataset ranges from 7.69 to 367 kDa. The overlap between the two datasets does not allow this property to be used to predict vaccine antigen proteins. The greatest difference in separation of distribution between the vaccine antigen and control datasets was achieved when amino acid compositions were compared. The algorithm we derived exploits these differences.
Using Streptococcus pneumoniae R6 as a test proteome, our scoring algorithm was able to rank the known antigens included in our vaccine antigen dataset within the top 10% of S. pneumoniae proteins. Other bacterial proteomes have also been ranked using our scoring algorithm, and the known vaccine antigens occur most frequently in the top 10% of scores (data not shown). Other virulence factors and potential vaccine candidates from S. pneumoniae were also ranked within the top 10% of scores.
This study demonstrates a fast and efficient scoring system that utilizes amino acid composition as a tool for the prediction of vaccine candidates. Construction of the vaccine antigen dataset has confirmed that a high proportion of known antigens have signal sequences. Since this scoring system is based on amino acid composition, secreted and outer membrane proteins score highly using the algorithm described. However, since this method does not rely on sequence similarity or motifs, it should also identify vaccine candidates lacking such features that other prediction tools, using these criteria, may miss. Ranking proteomes by this method has shown that known protective antigens score highly, independently of cellular location or possession of signal sequences. In contrast to previous methods, our algorithm uses data derived only from bacterial proteins and therefore is specific for use with bacterial genomes. This scoring system therefore provides a fast and efficient method of ranking whole bacterial proteomes for potential vaccine antigen candidates. We aim to use the datasets and algorithms to predict novel vaccine candidates from pathogenic bacteria that will form the basis for clinical trials.