InterPro (The Integrated Resource of Protein Domains and Functional Sites)

The family and motif databases PROSITE, PRINTS, Pfam and ProDom have been integrated into a powerful resource for protein secondary annotation. As of June 2000, InterPro had processed 384 572 proteins in SWISS-PROT and TrEMBL. Because the contributing databases have different clustering principles and scoring sensitivities, the combined assignments complement each other for grouping protein families and delineating domains. The graphic displays of all matches above the scoring thresholds enable judgements to be made on the concordances or differences between the assignments. The website links can be used to analyse novel sequences and to query the proteomes of 32 organisms, including the partial human set, by domain and/or protein family. An analysis of selected HtrA/DegQ proteases demonstrates the utility of this website for detailed comparative genomics. Further information on the project can be found at the European Bioinformatics Institute at http://www.ebi.ac.uk/interpro/.


Introduction
Secondary databases describing functional sites and domains are key resources for identifying structure–function relationships in protein sequences. PROSITE, initiated in 1988, is the archetype for these databases. Although many more have been developed since, they have used different approaches to the recognition of conserved sequence. These have been variously described as patterns, clusters, motifs, signatures or fingerprints, but they all represent structurally conserved features related to protein function. The initial phase of the InterPro project has unified four complementary methods: PROSITE (Hofmann et al., 1999); PRINTS; Pfam (Bateman et al., 2000); and ProDom (Corpet et al., 2000), each of which can be applied to an individual protein sequence using a formal score-based recognition cut-off. The first InterPro release (June 2000) contains 3052 entries, representing 574 domains, 2418 families, 46 repeats and 14 post-translational modification sites. In addition to comprehensive protein annotation, the objectives of the project include the incorporation of new protein families and domains. It will also be used to enhance the automated functional annotations in TrEMBL, the protein database supplementing SWISS-PROT, and is being applied as a proteome annotation tool for genome projects.

Site outline
The home page (Figure 1) includes a link to the Documentation page, which in turn links to the release notes, user manual, a list of deleted InterPro entries, the database schema, a fully annotated sample entry and references for the member databases. Brief descriptions of, and links to, the member databases are given on the Databases page. The Search pages include a form for searching the underlying ORACLE database directly by InterPro, Pfam, PRINTS, PROSITE, SWISS-PROT or TrEMBL accession numbers and names, database names, entry types, etc. There is also a Sequence Retrieval System (SRS) link for more complex and/or multiple database queries. A key feature of the search page is the ability to query a sequence directly against all three source databases and get the results back in the InterPro graphical format described below. This would be useful either for a novel sequence or for one that had entered TrEMBL since the last InterPro run. An extra feature is the option to include a Smith and Waterman search against the SWISS-PROT+TrEMBL (SPTR or SWALL) non-redundant protein database.

The protein and protein family displays
Before going further into what the site has to offer, the best way to grasp the concepts behind InterPro is to inspect the graphic display for an individual protein entry and follow the domain and family links. This can be done via the example given on the home page, but it is preferable to use a multidomain protein for which you already have some background knowledge. Those less familiar with the theory and practice of protein sequence analysis will find useful information in the InterPro user manual, as well as in the documentation and publications from the source databases. Figure 2 illustrates the concordance and discrepancies between the source entries for an Escherichia coli HtrA/DegQ gene product that contains a serine protease domain and two C-terminal PDZ domains. In this example the PROSITE profiles and the hidden Markov models (HMMs) in Pfam have recognized both the trypsin and one of the PDZ domains, although these would be classified separately in the two source databases. PRINTS provides a more discriminatory six-element fingerprint that is specific to the HtrA/DegQ family of serine proteases, a subfamily (or ‘child’ in InterPro terminology) of the larger ‘parent’ trypsin family. However, PRINTS recognizes only one of the two PDZ domains because many family members have only one rather than two PDZ domains. It should also be noted that the distal PDZ domain has been recognized by PROSITE but not by Pfam at the cut-off scores used.
From an individual protein entry there can be three types of links. First, each of the graphic signatures has a link back to the source database: PS for PROSITE, PF for Pfam and PR for PRINTS. The second type of link is to an InterPro domain entry; e.g. all PDZ domains, regardless of what protein families they cluster with, are linked to the domain entry IPR001478, as shown in Figure 3.
The numbers within this domain classification indicate further subtleties and the complementarity of the InterPro approach. While PROSITE and Pfam recognize (by application of their individual scoring cut-offs) 567 and 601 PDZ domains, respectively, the combination has an increased detection sensitivity, recognizing 629. The utility of this is shown in Figure 2, where PROSITE scored the second PDZ domain that Pfam and PRINTS ‘missed’. This is a non-trivial result in that it immediately suggests the non-equivalence of the two domains, i.e. the sequence of the terminal domain is not a simple repeat.
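The arithmetic behind these counts can be made explicit. The sketch below assumes, as the text implies but does not state outright, that the combined total of 629 is the union of the PROSITE and Pfam hit sets; under that assumption, inclusion-exclusion gives the number of hits shared by both methods and the number found by each method alone. The function and variable names are illustrative only and are not part of InterPro.

```python
def overlap_from_union(n_a: int, n_b: int, n_union: int) -> dict:
    """Split two hit sets into shared and method-specific counts.

    Uses inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
    Assumes n_union really is the size of the union of the two sets.
    """
    both = n_a + n_b - n_union
    if both < 0 or both > min(n_a, n_b):
        raise ValueError("counts are inconsistent with a simple union")
    return {"both": both, "only_a": n_a - both, "only_b": n_b - both}


# Counts quoted in the text for the PDZ domain entry (IPR001478):
# a = PROSITE, b = Pfam, union = combined InterPro total.
pdz = overlap_from_union(n_a=567, n_b=601, n_union=629)
print(pdz)
```

If the union assumption holds, 539 hits are shared, 28 are found by PROSITE alone and 62 by Pfam alone; the second PDZ domain in Figure 2 would fall into the PROSITE-only group.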
The second type of InterPro entry describes a protein family as defined by the extended alignments in Pfam and/or the extended motif groups in PRINTS. The protease in Figure 2 links to the family-specific IPR001940 entry for the HtrA/DegQ proteases. This family (Figure 4), numbering 79 members in this first release, is defined by conformity to the PRINTS HtrA/DegQ motif group, which includes one proximal PDZ domain, and by ‘membership’ of the parent trypsin protease family IPR001254. As indicated in Figure 5, the parent (superfamily) for the HtrA proteases is the trypsin family, IPR001254. This entry also shows an apparent increase in sensitivity achieved by combining the results of the PROSITE and Pfam profile recognition methods for the detection of 1332 trypsin-like sequences. However, the entry also indicates that only 852 proteins have an active-site serine motif that conforms to the 11-element PROSITE regular expression, PS00135. The sensitivity, defined as true hits/(true hits + false negatives), for PS00135 is reported as 94% within SWISS-PROT, and almost all the false negatives were attributable to loss of a conserved Gly in the first element. These apparent discrepancies in total family numbers illustrate the power of InterPro to highlight significant differences between the recognition methods revealed by this scale of analysis. In this case we could conclude that 36% of the profile matches are catalytically inactive. However, an alternative interpretation is that the sensitivity of PS00135 may be lower for the expanded phylogenetic range of trypsin proteases now processed by InterPro compared with the 301 sequences matching PS00135 in SWISS-PROT, the majority of which are mammalian proteins. Support for this interpretation is provided by the HtrA/DegQ protease family. None of the 79 members has an exact match to the PS00135 active-site serine motif, but many have been experimentally verified as proteases.
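The two quantities used in this argument can be written out as a short calculation. This is a minimal sketch: the first function restates the definition of sensitivity given in the text, and the second reproduces the 36% figure from the 1332 profile matches and the 852 matches to PS00135. The function names are illustrative, and the true-hit and false-negative counts in the final line are hypothetical values chosen only to give 94%, since the text does not report the underlying counts.

```python
def sensitivity(true_hits: int, false_negatives: int) -> float:
    """Sensitivity as defined in the text: true hits / (true hits + false negatives)."""
    return true_hits / (true_hits + false_negatives)


def fraction_without_motif(profile_matches: int, motif_matches: int) -> float:
    """Fraction of profile matches that lack a match to the motif."""
    return (profile_matches - motif_matches) / profile_matches


# Figures quoted in the text for the trypsin family (IPR001254) and PS00135.
print(f"{fraction_without_motif(1332, 852):.0%}")

# Hypothetical counts, for illustration of the definition only.
print(f"{sensitivity(true_hits=94, false_negatives=6):.0%}")
```

The first value is the proportion of profile matches with no PS00135 match; whether those sequences are truly inactive or simply missed by the regular expression is the question the text goes on to discuss.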
In addition to the HtrA/DegQ example, the InterPro trypsin family has spawned two additional ‘children’: IPR00136, the alpha-lytic endopeptidases, and IPR001314, the chymotrypsins. These are both defined by their scores against PRINTS signatures, although the latter is not classified as a separate family on the basis of accepted protease nomenclature (Barrett et al., 1998).

Proteome analysis
InterPro is one of the tools the SWISS-PROT group at the EBI are using to give a perspective on domain structure and function, gene duplication and protein families in different genomes. An analysis has already been produced for the 31 proteomes from completely sequenced organisms and the incomplete human set (http://www.ebi.ac.uk/proteome/). The page of links for the human proteome so far is shown in Figure 6. A total of 18 988 proteins were analysed from a combination of all human SWISS-PROT and TrEMBL entries. The number that had an assignment to at least one domain was 13 275, i.e. 69% of all human proteins. Comparative figures for fly, worm and yeast are 54%, 54% and 52%, respectively. This difference could mean that the domain collections (and therefore the cut-off scores) are historically skewed towards mammalian proteins. Alternatively, the more primitive phyla may have taxon-specific domains which have not yet been analysed.
There are a large number of ways to query and compare the data from the Proteomics page, depending on the objectives of the analysis. The example chosen in Figure 7 shows the top 20 human protein domains compared with the three other completely sequenced eukaryotes. The results allow instant comparative genomic observations, so long as caution is exercised. For example, the conclusions that the fly, worm and yeast contain 3%, 2% and 1.8% protein kinases, respectively, but no histocompatibility proteins, seem solid enough. However, a visual inspection of the yeast Ig and [...] Following on from the HtrA/DegQ example above, we can see that yeast has one trypsin and two PDZ domains. Inspection of the single yeast trypsin assignment shows that these have converged in the entry for P53920, i.e. it contains both a trypsin and a PDZ domain, but confined to the N-terminal section of the 997-residue protein (Figure 8). The intriguing explanation was provided in a review of the HtrA proteases some time ago (Pallen and Wren, 1997). The authors established by conventional sequence comparison that P53920 contains an ancient tandem duplication of an HtrA/DegQ protease, similar to Figure 2, with two PDZ domains, but that the second trypsin domain had lost essential catalytic positions. Although the catalytic triad can be discerned for the N-terminal section, the BLAST score is only 5e-7 against the bacterial HtrA/DegQ proteases and the score drops to 0.001 for the C-terminal section. The consequences, seen in Figure 8, are that only the N-terminal trypsin domain and one of the PDZs have scored above the thresholds for the PROSITE and Pfam profiles. The single PROSITE profile match therefore brings P53920 into the IPR001254 trypsin family, but the PRINTS match score was below the threshold that would have assigned it to IPR001940, the HtrA/DegQ subfamily. The distal C-terminal section of the ancient duplication has such low residual sequence similarity that no domains were recognized.
Although the InterPro family and domain assignments ‘missed’ for the yeast P53920 could be considered false negatives, it is important to note that neither catalytic activity nor PDZ-mediated complexing has yet been documented for P53920.
According to the table in Figure 7, yeast contains only one other PDZ domain; in this case the protein is the probable 26S proteasome regulatory subunit P27, SWISS-PROT P40555. Inspection and a BLAST search indicate that this may be a true positive. The PRINTS alignment set that defines the HtrA/DegQ protease family is derived from a set of 40 exclusively prokaryotic sequences. The finding of one distantly related member in yeast therefore raises the question of its possible occurrence in other eukaryotes. The InterPro top 200 entries for C. elegans (http://www.ebi.ac.uk/proteome/CAEEL/interpro/top200.html) list 57 PDZ-containing proteins and 14 trypsin family assignments. A manual inspection of the trypsin assignments reveals two false positives but no overlaps with PDZs, i.e. there are no candidate IPR001940 members in the worm. Interrogation of the fly top 200 assignments (http://www.ebi.ac.uk/proteome/DROME/interpro/top200.html) gives 65 PDZs and 193 trypsins. These sets are too large to inspect manually, but the intercept for the IPR001940 members can easily be found by using an SRS query. This establishes that the fly has only one HtrA/DegQ protease, TrEMBL Q9VFJ3. The human proteome set has 142 PDZs and 141 trypsin matches. The SRS query locates the intercept of these two features in three human HtrA/DegQ proteases: two full-length sequences (Q92743 and O43464) and one partial sequence (Q9UNS5). The graphic display for the first of these is shown in Figure 9.
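The ‘intercept’ queries described here amount to a set intersection: find the proteins whose InterPro assignments include every entry of interest. The sketch below is not SRS query syntax; it is a plain Python stand-in for the same logic. The assignment table is a toy example limited to the two yeast proteins and the trypsin and PDZ assignments described in the text, and the function name `proteins_with_all` is hypothetical.

```python
# InterPro entries named in the text.
TRYPSIN = "IPR001254"
PDZ = "IPR001478"

# Toy assignment table: accession -> set of InterPro entries.
# Only the two yeast proteins discussed in the text are included.
assignments = {
    "P53920": {TRYPSIN, PDZ},  # trypsin and PDZ in the N-terminal section
    "P40555": {PDZ},           # probable 26S proteasome regulatory subunit
}


def proteins_with_all(table, required):
    """Return sorted accessions whose assignments include every required entry."""
    required = set(required)
    return sorted(acc for acc, entries in table.items() if required <= entries)


print(proteins_with_all(assignments, [TRYPSIN, PDZ]))
```

With a full proteome table in place of the toy dictionary, the same function would return the trypsin and PDZ overlap that the text obtains through SRS for fly and human.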
Comparison between Figures 9 and 2 shows interesting similarities and differences between the bacterial and human representatives of this family across such a large phylogenetic distance. The human protein has only one PDZ but has gained two N-terminal domains, as defined by Pfam assignments: one insulin-related growth factor binding domain and one Kazal protease inhibitor domain. Using SRS to check for the intercepts of IPR001940, PF00050 and PF00219 finds five hits: the three human sequences already referred to above and two additional sequences from rat. This particular domain–family combination is therefore uniquely mammalian in the current protein databases.
These comparative genomic observations raise many intriguing questions about the biology and evolutionary history of these proteases that are beyond the scope of this review. However, the examples above show that this type of information can be rapidly acquired by using the tools available from the InterPro website. They can also be cross-checked using conventional database searching and sequence analysis tools from the parent site (http://www.ebi.ac.uk/Tools/index.html).

Conclusions
InterPro is destined to become the pre-eminent secondary database for protein annotation. This combination of motif and family profile databases into a single package offers a more discriminatory and comprehensive protein analysis resource than the individual databases. The fact that the contributing databases are built on different concepts produces a valuable compensation effect, i.e. their tendencies for false positives and false negatives are skewed in different directions. The key feature of the combined graphic display is that it will often allow a clear judgement between false positives, borderline matches and likely true positives, although it should be borne in mind that the latter should ideally be verified by experiment.
The ability to interrogate and compare the entire proteomes of organisms by domain and/or protein family distributions and combinations is another key feature of InterPro. As illustrated by the protease examples, inspection of the differential cut-off results can reveal subtleties of scientific significance that would be difficult or tedious to track down by other methods.
Considering the future, the automation capacity of InterPro should be able to keep pace with the genome projects. It is intended that the Ensembl human genome annotation output, already processed through Pfam, will eventually run through InterPro (Apweiler, personal communication). If this output is blended effectively with human TrEMBL by redundancy removal, this will provide an immediate comparative functional overview of the human proteome. The next production release of InterPro will integrate more of the ProDom database that was only partially incorporated in the first release. It is also planned to incorporate the Gene Ontology (GO) classification system (http://genome-www.stanford.edu/GO/), and eventually other signature databases will be included, such as Blocks and SMART (Henikoff et al., 2000; Schultz et al., 2000).