The Cadherin Superfamily in Anopheles gambiae: a Comparative Study With Drosophila melanogaster

The cadherin superfamily is a diverse and multifunctional group of proteins with extensive representation across genomes of phylogenetically distant species that is involved in cell–cell communication and adhesion. The mosquito Anopheles gambiae is an emerging model organism for the study of innate immunity and host–pathogen interactions, where the malaria parasite induces a profound rearrangement of the actin cytoskeleton at critical stages of infection. We have used bioinformatics tools to retrieve present sequence knowledge about the complete repertoire of cadherins in A. gambiae and compared it to that of the fruit fly, Drosophila melanogaster. In A. gambiae, we have identified 43 genes coding for cadherin extracellular domains that were re-annotated to 38 genes and represent an expansion of this gene family in comparison to other invertebrate organisms. The majority of Drosophila cadherins show a 1 : 1 Anopheles orthologue, but we have observed a remarkable expansion in some groups in A. gambiae, such as N-cadherins, that were recently shown to have a role in the olfactory system of the fruit fly. In vivo dsRNA silencing of overrepresented genes in A. gambiae and other genes showing expression at critical tissues for parasite infection will likely advance our understanding of the problems of host preference and host–pathogen interactions in this mosquito species.

Cadherins have been reviewed and generally categorized into several groups: classical (type I and II), desmosomal, protocadherins, Flamingo cadherins and a group of unique members that do not fit into any of the previous groupings (Nollet et al., 2000;Tepass et al., 2000). Type I cadherins have five extracellular cadherin domains and a conserved His-Ala-Val (HAV) region in the first ectodomain (Blaschuk et al., 1990). Type II cadherins show a similar domain structure to type I cadherins, but do not have the conserved HAV motif (Tanihara et al., 1994). The desmosomal group is divided into two subgroups, desmocollins and desmogleins, which are present at desmosomal junctions (King et al., 1997). Protocadherins form the largest cadherin subgroup in mammals (Frank and Kemler, 2002). Most of the mammalian protocadherins are expressed in the central nervous system, and are involved in tissue morphogenesis and the formation of neuronal pathways from early development (Redies, 2000;Shapiro and Colman, 1999;Yagi and Takeichi, 2000). The Flamingo subgroup is defined by the presence of a seven-pass transmembrane region instead of the single transmembrane domain of the other subgroups (Usui et al., 1999). Fat-type and Ret-like cadherins are examples of cadherin members that do not comply with the criteria of any of the subgroups described above.
The diversity of processes in which cadherins are involved has generated much interest in this family. Several members of the family have been extensively studied in model organisms, particularly D. melanogaster and Caenorhabditis elegans, whose complete repertoires were determined and compared (Hill et al., 2001).
Female A. gambiae mosquitoes, vectors of human malaria, require a vertebrate blood meal to complete their life cycle. A. gambiae is strongly anthropophilic, but the molecular mechanisms that explain the recognition of the human host, mainly through olfactory cues, are still not well understood (Hallem et al., 2004). During the blood meal, mosquitoes can become infected with the agents of malaria, parasites of the genus Plasmodium.
A. gambiae is an emerging model organism with special relevance for the study of innate immunity and host-pathogen interactions (Christophides et al., 2002). The recent determination of its complete genome sequence (Holt et al., 2002;Mongin et al., 2004) offers an unprecedented opportunity for the identification of new factors involved in host preference and host-pathogen interactions that might determine the ability of this mosquito to transmit malaria. The comparison of a family of proteins known to be fundamental for cell-cell interactions and cell signalling in two insect dipteran species which have very different lifestyles might be an important contribution towards this goal and for a more global and deeper understanding of cadherin functions and evolution.
In this study, we have used bioinformatics methods to retrieve the complete repertoire of A. gambiae cadherins, to compare it to that of D. melanogaster and to use this information to identify members of this superfamily with possible relevance for the mosquito life cycle. This information is also relevant to understanding the cadherin superfamily and its evolution.

Selection of protein sequences for the study
The study was based on a set of protein sequences from D. melanogaster (Table 1) and A. gambiae (Table 2) available in public databases with characteristic cadherin domains (known or predicted). This set was analysed and compared in the two species under study.
The identification of protein sequences presenting cadherin domains was performed by combining previous work on this family in D. melanogaster (Hill et al., 2001), with assignments for this domain available in SUPERFAMILY (http://supfam.mrclmb.cam.ac.uk/SUPERFAMILY/) (Gough et al., 2001). SUPERFAMILY is a database of identified domains within proteins of known structure using Hidden Markov Models (HMM), relying on the Structural Classification of Proteins (SCOP) database (Andreeva et al., 2004;Murzin et al., 1995). The SCOP database consists of a hierarchical classification of the domains of all proteins of known structure according to their evolutionary and structural relationships. The current SUPERFAMILY database (Madera et al., 2004) uses versions 3.1 and 19.2a of the D. melanogaster and A. gambiae genomes, respectively, for the assignments to all predicted proteins in these organisms. The sequences selected were those that matched for cadherin domains by HMM with an expectation value score below 0.001. D. melanogaster sequences fulfilling the above criteria were retrieved from FlyBase (http://flybase.bio.indiana.edu/) and A. gambiae sequences were retrieved from Ensembl (http://www.ensembl.org).
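The selection criterion above (cadherin domain matches with an expectation value below 0.001) can be illustrated with a minimal sketch. The tuple layout, the domain label "Cadherin" and the protein identifiers are hypothetical and do not reproduce the SUPERFAMILY export format; only the 0.001 cutoff comes from the text.

```python
from collections import Counter

E_VALUE_CUTOFF = 0.001  # threshold stated in the text


def cadherin_domain_counts(assignments, cutoff=E_VALUE_CUTOFF):
    """Count cadherin-domain matches per protein, keeping matches below the cutoff.

    `assignments` is an iterable of (protein_id, domain_name, e_value) tuples.
    This layout is an assumption made for illustration only.
    """
    counts = Counter()
    for protein_id, domain_name, e_value in assignments:
        if domain_name == "Cadherin" and e_value < cutoff:
            counts[protein_id] += 1
    return dict(counts)


# Made-up records, for illustration only.
example = [
    ("protA", "Cadherin", 1e-20),
    ("protA", "Cadherin", 5e-4),
    ("protA", "EGF", 1e-10),      # not a cadherin domain
    ("protB", "Cadherin", 0.02),  # above the cutoff, so rejected
    ("protC", "Cadherin", 1e-6),
]
selected = cadherin_domain_counts(example)
print(selected)
```

Proteins absent from the returned dictionary have no cadherin match below the cutoff and would not enter the study set under this criterion.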
The designations used in this report for the sequences considered are the terms attributed by the genome sequencing projects, or those previously given by researchers.

Identification of other domains, signal peptides and transmembrane helices
The protein sequences identified as having one or more cadherin domains were inspected for other features and domains, using the following servers: (a) InterPro metaserver (Zdobnov and Apweiler, 2001), which includes several databases and scanning methods to check additional protein domains; (b) SignalP 3.0 server (Dyrlov Bendtsen et al., 2004;Nielsen et al., 1999), with default options for eukaryotes to identify signal peptide sequences; (c) TMHMM server (Krogh et al., 2001), with default parameters for transmembrane helices, intracellular and extracellular region prediction; (d) SMART server (Schultz et al., 1998), to perform a quick domain inspection and to check InterPro matches.
The sets used were the non-redundant public protein database UniProt (Apweiler et al., 2004) and, in the case of Drosophila, a second set was obtained by screening a library of more than 9000 cDNAs (http://www.fruitfly.org/sequence/dlcDNA.shtml) using FASTA (Pearson and Lipman, 1988). The program implementation used for the first set was fasta3, available at the European Bioinformatics Institute (EBI) website with default parameters (http://www.ebi.ac.uk/fasta33/), an upper expectation value of 0.001 and sequence identity higher than 50%. For the cDNA database, a locally installed version of FASTA within the GCG (Wisconsin) package version 10.0 was used, with default parameters and an upper expectation value of 0.05. For this second database, the FASTA algorithm showed undesired sensitivity to intron presence; therefore, the Smith-Waterman algorithm (EMBOSS-water implementation) and SIM4 (Florea et al., 1998) were used, both with default parameters. Although no new possible corrections were detected by this approach for D. melanogaster, in the case of A. gambiae we detected several matches in the UniProt database that were identical and extended the predicted protein sequences.
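As a rough illustration of the two acceptance thresholds quoted for the UniProt search (an upper expectation value of 0.001 and sequence identity higher than 50%), the following sketch filters a list of similarity hits. The `Hit` record, its field names and the identifiers are hypothetical; it does not parse real FASTA output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One similarity-search hit (hypothetical record layout)."""
    query: str
    subject: str
    e_value: float
    identity_pct: float


def filter_hits(hits, max_e_value=0.001, min_identity_pct=50.0):
    """Keep hits at or below the upper E-value and strictly above the identity floor."""
    return [
        h for h in hits
        if h.e_value <= max_e_value and h.identity_pct > min_identity_pct
    ]


# Made-up hits, for illustration only.
hits = [
    Hit("ag_pred_1", "uniprot_X", 0.0, 100.0),
    Hit("ag_pred_1", "uniprot_Y", 1e-5, 42.0),   # identity too low
    Hit("ag_pred_2", "uniprot_Z", 0.01, 88.0),   # E-value too high
    Hit("ag_pred_3", "uniprot_W", 0.001, 50.5),
]
kept = filter_hits(hits)
print([h.subject for h in kept])
```

For the cDNA search the text gives an upper expectation value of 0.05 but no identity threshold, so calling `filter_hits(hits, max_e_value=0.05)` with the default identity floor would be an assumption rather than the reported setting.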

EST information
Whenever possible, we considered EST sequence information in order to confirm the structure of the predicted Anopheles genes. In two cases, the ESTs associated with the gene annotation extended the predicted ends of the gene. In other cases, the ESTs supported the predictions. These aspects are summarized in Table 3.

Multiple alignments
The cadherin sequences that were identified and re-annotated according to the previously described procedures were aligned using ClustalX 1.83 (Thompson et al., 1997), and a tree representation (Figure 2) was generated from the multiple alignment by the same program using the neighbour-joining method. In these alignments, we have included the longest sequences (either the predicted or the experimentally obtained fragment); proteins known to have splice variants were represented by the longest isoform. Sequences shorter than 350 amino acids in length were not considered, so as not to distort the alignment (see Table 4).
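The two selection rules stated above (one sequence per gene, taking the longest isoform, and exclusion of sequences shorter than 350 amino acids) can be sketched as follows. The record layout and identifiers are hypothetical; the sketch prepares the input for an aligner and does not perform the alignment itself.

```python
MIN_LENGTH = 350  # residues; shorter sequences were left out of the alignment


def select_for_alignment(records, min_length=MIN_LENGTH):
    """Pick the longest isoform per gene, then drop sequences below min_length.

    `records` is an iterable of (gene_id, isoform_id, sequence) tuples
    (an assumed layout). Returns {gene_id: (isoform_id, sequence)}.
    """
    longest = {}
    for gene_id, isoform_id, sequence in records:
        current = longest.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            longest[gene_id] = (isoform_id, sequence)
    return {
        gene_id: record
        for gene_id, record in longest.items()
        if len(record[1]) >= min_length
    }


# Made-up sequences of the required lengths, for illustration only.
records = [
    ("geneX", "isoform_a", "M" * 400),
    ("geneX", "isoform_b", "M" * 500),  # longest isoform of geneX
    ("geneY", "isoform_a", "M" * 349),  # too short, excluded
    ("geneZ", "isoform_a", "M" * 350),  # exactly at the limit, kept
]
chosen = select_for_alignment(records)
print({gene: isoform for gene, (isoform, _) in chosen.items()})
```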

Merged genes
In this study, the Ensembl designation for Anopheles genes and proteins was abbreviated by omitting the prefixes ENSANGG000000 and ENSANGP000000, respectively. In A. gambiae, by inspecting the positions of the predicted sequences (Figure 3), seven sets of predicted cadherin proteins were identified whose gene sequences are adjacent on their respective chromosomes. Merging the adjacent predicted genes resulted in gene products with more cadherin domains. Figure 4 shows the domain architecture of the Anopheles cadherin repertoire with the proposed gene mergers.
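One way to flag candidate mergers from chromosome positions is to group predicted genes that lie close together on the same chromosome arm. The sketch below is an assumption about how such an inspection could be automated: the coordinate layout, the gene names and the `max_gap` distance are all hypothetical, and the text does not state a numerical adjacency criterion. Strand and intervening non-cadherin genes are ignored.

```python
def adjacent_groups(genes, max_gap):
    """Group genes on the same chromosome arm separated by at most max_gap bases.

    `genes` is a list of (gene_id, chromosome, start, end) tuples (assumed layout).
    Returns a list of gene-id lists; only groups with two or more genes are returned.
    """
    groups = []
    current = []
    for gene in sorted(genes, key=lambda g: (g[1], g[2])):
        same_arm = bool(current) and gene[1] == current[-1][1]
        if same_arm and gene[2] - current[-1][3] <= max_gap:
            current.append(gene)
        else:
            if len(current) > 1:
                groups.append([g[0] for g in current])
            current = [gene]
    if len(current) > 1:
        groups.append([g[0] for g in current])
    return groups


# Made-up coordinates, for illustration only.
genes = [
    ("g1", "3R", 100, 200),
    ("g2", "3R", 250, 400),
    ("g3", "3R", 10000, 11000),  # far from g2, not grouped
    ("g4", "2L", 260, 300),
    ("g5", "2L", 320, 500),
]
candidates = adjacent_groups(genes, max_gap=100)
print(candidates)
```

Groups found this way are only candidates; as in the study, a merger would still need support from database matches or experimental sequence.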

Cytoplasmic domains
The D. melanogaster cadherin genes encode cytoplasmic domains of 43-968 amino acid residues, in agreement with Hill et al. (2001). For A. gambiae, taking into account only those sequences for which a transmembrane region was predicted (and considering that some might be incomplete at their 3′ ends), the cytoplasmic regions vary (ca. 38-539 residues).

Similar and unique cadherins in D. melanogaster and A. gambiae
The domain architectures of vertebrate and invertebrate cadherins show several differences (Oda and Tsukita, 1999). The classification adopted in this report follows, to some extent, previous studies (Angst et al., 2001;Nollet et al., 2000;Tepass et al., 2000). For simplification and discussion of organization, cadherins will be considered 'classical' (showing a conserved cytoplasmic domain that can bind to catenins) or 'non-classical', and this second category includes sub-groups of 'Fat-like', 'Flamingo', 'Ret-like' and 'Other'. The Drosophila cadherin Dachsous is included in the 'Fat-like' sub-group as it has a large number of cadherin ectodomains but, as mentioned in Gooding et al.

Figure 2.
Radial tree representation of the cadherin proteins in A. gambiae and D. melanogaster. Sequences were aligned using ClustalX and the tree derived with the neighbour-joining algorithm. Drosophila sequences are referred to by their gene name. Anopheles sequences are referred to by their Ensembl entry, omitting the prefix ENSANGP000000. Proteins known to have splice variants are represented by the longest available sequence. Sequences shorter than 350 amino acids in length were omitted. * Experimental fragment is the longest available sequence; ** Proposed merger not validated experimentally; *** Sequence extended by overlapping EST.

with a FASTA e-value of zero to two transcripts of this Anopheles gene (proteins 10 175 and 10 219) with 87% sequence identity, which demonstrates the high conservation of these sequences along their entire length. Besides this mosquito gene, this species also has six other genes that encode proteins with similar domain organization (with minor differences in terms of cadherin repeat number), to which this sequence and the fruit fly CadN align by BLAST with an e-value of zero. All the sequences form a cluster (Figure 2), suggesting that this subgroup has experienced an expansion in Anopheles. All of these mosquito genes are localized on the same chromosome arm (3R) and adjacent to each other (Figure 3).
The cadN gene product has putative orthologues in other species by reciprocal BLAST analysis, such as C. elegans W02B9.1, C. briggsae CBG07964, and novel predictions in the zebrafish, Danio rerio, (ENSDARG00000001983); chicken (ENSGALG00000004630); and the pufferfish, Fugu rubripes, (SINFRUG00000151656). The cadN gene is reported to be expressed in neurons, also regulating neuronal morphogenesis (Iwai et al., 1997) and is proposed to be involved in synaptic target specificity (Lee et al., 2001).
A. gambiae is highly anthropophilic and finds human hosts largely through olfactory cues (Hallem et al., 2004). Recently, N-cadherins were implicated in D. melanogaster olfaction (Hummel and Zipursky, 2004;Zhu and Luo, 2004). In spite of the eight possible alternatively spliced isoforms for N-cadherin, the expression of one isoform is sufficient to rescue all affected phenotypes (Zhu and Luo, 2004).
Shotgun (DE-cadherin)
Anopheles 19 477 is a putative orthologue for Shotgun (Drosophila E-cadherin), but the protein fragment currently available in public databases shows differences in terms of domain architecture, viz. the assignment of an EGF domain by SMART and the non-detection, to date, of a transmembrane region or a classical cadherin cytoplasmic segment. Moreover, it is possible to distinguish a synteny-based hit with 28 193, again with several differences at the domain architecture level, such as the number of cadherin extracellular repeats. The gene product 09 575 also clusters with these sequences (Figure 2).
The shotgun gene is a classical epithelial cadherin, expressed in a broad range of tissues, and it has been shown to be required for tissue integrity, oogenesis and cell rearrangements during morphogenesis (Haag et al., 1999;Oda et al., 1997;Tepass et al., 1996).
It should be stressed that we are considering Anopheles sequences that are probably incomplete, and future experimental work should therefore help to clarify the gene architecture and protein domain organization of these sequences.

Non-classical
Fat-like
fat/05 443, fat2/12 062
Of the possible orthologues between the Drosophila and Anopheles sequences, Fat and 07 226 show 60% identity in Smith-Waterman local pairwise alignment, and Fat2 and 14 551 show 47% sequence identity. Nevertheless, for Fat2, besides the possibility of distinguishing a synteny-based orthologue (02 131), one of the proposed merging genes also clusters with these sequences (Figure 2). This possible union is not yet confirmed, but the domain structure of the putative product shows some similarities with Fat-like sequences.
Particularly in the case of the Fat and 07 226 proteins, there is a remarkable similarity between the two sequences, even in the cytoplasmic region; both of them have BLAST sequence alignments with an e-value of zero to several fat-like proteins in other species, such as rat Fat, Fat 2 and Fat 3, human Fat, mouse Fat 1 cadherin and zebrafish Fat, all of which have similar domain structures.
The Drosophila fat gene controls cell growth (Agrawal et al., 1995;Garoia et al., 2000) by acting as a tumour-suppressor gene (Bryant et al., 1993) and it is involved in planar polarity (Casal et al., 2002;Fanto et al., 2003;Rawls et al., 2002). It has been shown that fat2 is the true orthologue of the vertebrate fat-like cadherins (Tepass et al., 2000), and more recently Castillejo-Lopez et al. (2004) have reported its involvement in the formation of tubular organs.
dachsous/07 504
In the case of Drosophila dachsous, there is an Anopheles sequence presenting a best reciprocal hit (09 993), which demonstrates 58% sequence identity and similar domain organization. Additionally, dachsous has a possible synteny-based orthologue in 29 455, but its predicted protein domain organization is substantially different. Considering the close proximity of the two Anopheles genes, and the domain organization of the products coded, it is possible that they should be merged but, as we did not find any experimental evidence to confirm this, we have considered them separately.
The dachsous gene is involved in the control of imaginal disc morphogenesis (Clark et al., 1995); it is shown to have a role in planar polarity (Casal et al., 2002;Eaton, 2003), as well as in regulating dorsal-ventral signalling in the Drosophila eye (Rawls et al., 2002).
Flamingo
flamingo/01 056 (GPRstn)
As in D. melanogaster and C. elegans (Hill et al., 2001), A. gambiae has one seven-helix transmembrane cadherin, which leads us to consider Flamingo and GPRstn as orthologues. The protein sequences match with a FASTA e-value of zero with 66% identity, presenting the same number of cadherin domains, as well as of EGF, LamG, GPS (G-protein-coupled receptor proteolytic site) and HMR (hormone receptor) domains. Also, their extracellular and cytoplasmic regions are of similar length and conservation.
In BLAST searches, the Flamingo and 01 238 proteins produce sequence matches with e-values of zero, to several proteins containing a seven-helix transmembrane region, viz: mouse mFmi1, MEGF2, CELSR1, CELSR3; rat CELSR2, CELSR3; human CELSR1, CELSR2, CELSR3 and CLR1; C. elegans FMI-1; and a hypothetical protein of C. briggsae (CBG09454), all of which have high sequence similarity to each other; and a D. rerio sequence (CAE30365). The extracellular domain organization of the above sequences is very similar, except for the D. rerio entry which does not have cadherin domains based on the current domain assignments. However, in terms of overall sequence observation, there is some degree of conservation within vertebrate and invertebrate groups, but not between them.
Flamingo is known to be involved in planar polarity (Chae et al., 1999;Usui et al., 1999) and, more recently, to be engaged in neuronal differentiation, dendritic development (Sweeney et al., 2002) and target interactions in the Drosophila visual system (Lee et al., 2003;Senti et al., 2003).
Ret-like
cad96Ca/16 655
In the case of Cad96Ca, Anopheles presents a possible orthologue, 19 144 (coded by gene 16 655), which encodes a signal peptide, a cadherin domain, a transmembrane region and a cytoplasmic segment with a tyrosine kinase domain, similarly to the Drosophila sequence. The two sequences have 53% sequence identity in Smith-Waterman pairwise alignment, showing high conservation, particularly in the cytoplasmic region.
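The percentage identities quoted from Smith-Waterman local pairwise alignments can be made concrete with a minimal sketch of the algorithm. This is not the EMBOSS water implementation used in the study: the scoring scheme here (match +2, mismatch -1, linear gap penalty -2) is an assumption chosen for brevity, whereas protein alignments normally use a substitution matrix and affine gap penalties, so the numbers it yields are not comparable to those reported.

```python
def smith_waterman_identity(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, percent identity over the local alignment).

    Simplified scoring (assumed, not the EMBOSS defaults): fixed match and
    mismatch scores and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; local alignment floors scores at 0.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    identical = aligned = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            identical += int(a[i - 1] == b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        aligned += 1

    percent = 100.0 * identical / aligned if aligned else 0.0
    return best, percent


# Toy sequences, for illustration only: the shared block "CCC" aligns locally.
result = smith_waterman_identity("AAACCCGGG", "TTCCCTT")
print(result)
```

Identity is computed here over the columns of the local alignment, including gap columns; other conventions for the denominator exist and give slightly different percentages.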
Other
CG4655 CG4509 (cad86C)/23 672 16 780
This Drosophila possible merger proposed by Hill et al. (2001) is still considered as such, as no new experimental evidence supports the existence of a unique gene. Nevertheless, the identification of a similar possible merger in Anopheles (and its confirmation by a match in UniProt), 27 524 19 269, with 61% sequence identity from Smith-Waterman local pairwise alignment to the Drosophila union, seems to support the likelihood of the proposed arrangement. However, it is important to remember that the Drosophila genome annotation was used to annotate Anopheles and this might influence to some extent the gene structure proposed for Anopheles genes.
cad74A/18 042
The predicted gene products have similar length and extracellular regions and an equal number of cadherin domains, matching with a FASTA e-value of zero and 55% identity.
cad87A/08 806
The two protein sequences have a Smith-Waterman identity of 58% and an equal number of cadherin domains, as well as a high conservation in their extracellular region. However, the current Anopheles sequence is shorter by 112 amino acids.
cad89D/08 765
Similarly to the observations for Cad87A, the proteins encoded by these genes show considerable similarity in their extracellular domains, with 41% sequence identity. As before, the Anopheles sequence is shorter (1821 amino acids, whereas Cad89D has 2240 residues) and no signal peptide has been identified.
cad99C/18 160
The protein sequences coded by these genes match with a FASTA e-value of zero and have 58% identity. Similarly to what was reported for Cad89D, the Anopheles fragment is smaller (by 98 amino acids) and no signal peptide has been identified so far.
cad96Cb/09 438
In the case of Cad96Cb, there is an Anopheles sequence representing a best reciprocal hit (11 927) which demonstrates sequence similarity and similar domain organization. A Smith-Waterman pairwise alignment shows 39% identity between the sequences.
cad88C/15 646
The Anopheles gene 15 646 is a putative orthologue of the fruit fly cad88C. However, the mosquito gene product has differences in terms of the number of cadherin repeats and sequence length (121 amino acids shorter). The protein sequences are 49% identical as shown by pairwise alignment.
calsyntenin (cals)/08 654
The Cals protein has a best reciprocal hit with Anopheles 11 143 protein, coded for by gene 08 654. The protein sequences show 59% identity by pairwise local alignment. The current Anopheles fragment in SwissProt (Q7QIW3) does not have a transmembrane region or signal peptide predicted. By best BLAST reciprocal hit, Cals has possible orthologues with other invertebrate and vertebrate species: B0034.3 from C. elegans, CBG02547 of C. briggsae, CLSTN2 from Homo sapiens, mouse Clstn1, Q7ZTX9 from D. rerio, and novel predictions from rat (ENSRNOG00000016398), chicken (ENSGALG00000005310) and F. rubripes (SINFRUG00000127288).
Cals is reported to be involved in synaptic transmission by binding synaptic Ca2+ with its cytoplasmic domain (Vogt et al., 2001).
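The best reciprocal hit criterion used for several of the putative orthologue pairs in this section can be sketched as follows: two proteins are paired only when each is the other's top-scoring hit in the opposite proteome. The input layout, scores and identifiers are hypothetical, and ties or near-equal scores, which matter for expanded families, are not handled.

```python
def reciprocal_best_hits(hits_a_to_b, hits_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a.

    Each argument maps a query id to a list of (subject_id, score) tuples,
    with higher scores being better (assumed layout).
    """
    def best(hits):
        return {
            query: max(subjects, key=lambda s: s[1])[0]
            for query, subjects in hits.items()
            if subjects
        }

    best_ab = best(hits_a_to_b)
    best_ba = best(hits_b_to_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up search results, for illustration only.
fly_to_mosquito = {
    "dm1": [("ag1", 900), ("ag2", 300)],
    "dm2": [("ag2", 500)],
    "dm3": [("ag1", 400)],  # ag1 prefers dm1, so dm3 stays unpaired
}
mosquito_to_fly = {
    "ag1": [("dm1", 880)],
    "ag2": [("dm1", 310), ("dm2", 490)],
}
pairs = reciprocal_best_hits(fly_to_mosquito, mosquito_to_fly)
print(pairs)
```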

Remaining cadherin repertoire
The remaining Anopheles sequences appear, to date, to have no notable similarity to the Drosophila cadherins and no possible Drosophila orthologues, beyond all having one or more cadherin domains.

Conclusions
Cadherin ectodomains are distributed among the encoded products of 17 D. melanogaster and 43 A. gambiae putative genes, which suggests an expansion of this protein family in A. gambiae. We propose seven possible gene mergers for Anopheles based on chromosome location analysis and neighbourhood inspection. Of these, five were confirmed by sequence matches in public databases, so Anopheles should now be considered to have 38 cadherin genes. If the two additional unions are confirmed by future sequence data, a further reduction to 36 genes should then be considered.
Our chromosome localizations of the cadherin genes orthologous between D. melanogaster and A. gambiae (Figure 3) are in general agreement with the results reported by Zdobnov et al. (2002), in which the correspondence between chromosomes of the two species using 1 : 1 orthologues and microsynteny blocks was analysed. Specifically, chromosomal arm 2L of Drosophila is conserved relative to the Anopheles 3R arm, the same being the case for Dm3R and Ag2R; the Anopheles 2L chromosome hosts the majority of the Drosophila 2R and 3L orthologues. The only exceptions seem to be Cad96Ca and Cad86C and their respective orthologues. Moreover, the existence of several 1 : 1 orthologues is a promising contribution for subsequent work in functional genomics in the two species.
Among the identified genes, the group of N-cadherins is of particular interest because it has been dramatically expanded in A. gambiae. In Drosophila, there are two genes coding for this type of protein (one of which has eight possible transcripts), but in Anopheles it is possible to identify seven genes (one of which has four different possible transcripts).
The present study indicates that both experimental and theoretical work will be needed to confirm the possible gene unions, as well as the 5′ and 3′ ends of the genes, as the majority of sequences in public databases are still incomplete.
The elucidation of the patterns of tissue expression of cadherins in A. gambiae should guide the selection of candidates for further work in the problem of host preference and host-pathogen interactions. The possibility of in vivo gene silencing by RNA interference provides a powerful approach to