Estimation of Synteny Conservation and Genome Compaction Between Pufferfish (Fugu) and Human

Background: Knowledge of the amount of gene order and synteny conservation between two species gives insights into the extent and mechanisms of divergence. The vertebrate Fugu rubripes (pufferfish) has a small genome with little repetitive sequence, which makes it attractive as a model genome. Genome compaction and synteny conservation between human and Fugu were studied using data from public databases. Methods: Intron length and map positions of human and Fugu orthologues were compared to analyse relative genome compaction and synteny conservation, respectively. The divergence of these two genomes by genome rearrangement was simulated and the results were compared to the real data. Results: Analysis of 199 introns in 22 orthologous genes showed an eight-fold average size reduction in Fugu, consistent with the ratio of total genome sizes. There was no consistent pattern relating the size reduction in individual introns or genes to gene base composition in either species. For genes that are neighbours in Fugu (genes from the same cosmid or GenBank entry), 40–50% have conserved synteny with a human chromosome. This figure may be underestimated by as much as two-fold, due to problems caused by incomplete human genome sequence data and the existence of dispersed gene families. Some genes that are neighbours in Fugu have human orthologues that are several megabases and tens of genes apart. This is probably caused by small inversions or other intrachromosomal rearrangements. Conclusions: Comparison of observed data to computer simulations suggests that 4000–16 000 chromosomal rearrangements have occurred since Fugu and human shared a common ancestor, implying a faster rate of rearrangement than seen in human/mouse comparisons.


Introduction
Comparative genomics has great potential for maximizing the value of genome sequencing projects. Sydney Brenner and colleagues (Brenner et al., 1993; Elgar et al., 1996) proposed the pufferfish Fugu rubripes as a model genome for use in dissecting the human genome. As a vertebrate, Fugu is expected to have a similar gene repertoire to human. However, its genome, at ~400 Mb, is approximately 7.5 times smaller than that of human. The reduced amount of repetitive sequence and high gene density make this small genome attractive to molecular biologists.
There are two main factors that will determine whether Fugu will be genuinely useful as a model vertebrate. Fugu genes must show sufficient similarity to their human orthologues to enable the isolation of a Fugu gene with a human (or other mammalian) DNA probe, and vice versa. Furthermore, knowledge of the extent of linkage conservation between the two genomes will indicate the feasibility of positional cloning using map information extrapolated from one species to the other. Several regions of conserved synteny (but not necessarily conserved gene order) have already been reported between these two genomes (e.g. Baxendale et al., 1995; Trower et al., 1996; Elgar et al., 1999 and references in Table 2).
Exploring the relationship between the human and pufferfish genomes in terms of the extent of synteny conservation and patterns of genome compaction could give insights into the evolution of vertebrate genomes, and could also provide more information on the usefulness of Fugu as a model genome. However, at present it is not known how large the syntenic regions are, or how well the gene order is conserved between Fugu and human. Recent research on the zebrafish (Danio rerio) indicated that for some groups of genes, synteny is conserved in the human but the order of the genes along the syntenic chromosome is different in the two species. Moreover, many mammalian genes have two zebrafish orthologues, and this is probably due to whole genome or chromosomal duplications that occurred in bony fish (including zebrafish and Fugu) after their divergence from the tetrapod lineage (Amores et al., 1998; Gates et al., 1999). It is also not known whether the compaction of the Fugu genome relative to the human is uniform throughout the genome, particularly in view of the uneven distribution of genes in the human genome (Ikemura and Wada, 1991; Duret et al., 1995; Deloukas et al., 1998).
Here we have made a comparative genomics study of Fugu and the human to investigate the phenomenon of genome compaction and to estimate the level of synteny conservation. There is no genetic map for Fugu (it is not possible to breed this fish in the laboratory), so gene linkage is only discernible at the level of genes that were sequenced on the same cosmid or other clone contig. We used two sources of Fugu sequence data: large contiguous genomic sequences determined by a variety of laboratories and obtained from GenBank; and 'cosmid skimming' data from the Fugu Landmark Mapping Project at the UK MRC HGMP-RC (Elgar, 1996; Elgar et al., 1999). The human map data were obtained from two sources: the Online Mendelian Inheritance in Man database (OMIM 1999); and the physical map of about 30 000 genes (GeneMap '98) constructed from radiation hybrid data by Deloukas et al. (1998).

Materials and methods
Fugu sequence data
SwissProt version 37 (27 July 1999) contains 5406 human proteins. These were compared to the database of Fugu 'skimmed' cosmids using TBLASTN (Altschul et al., 1990) with the BLOSUM62 scoring matrix and the SEG filter (Wootton and Federhen, 1996). To remove obvious paralogous hits, only the top hit for each query was retained (provided that it had P ≤ 10^-15) as well as weaker hits that were within a factor of 10^5 of the top hit. The results of this BLAST search, including human map information, are available online (http://biotech.bio.tcd.ie/~amclysag/skimmed.html). A 'skimmed' cosmid was adjudged to contain two genes if two non-overlapping subclones hit different mapped human proteins that are <40% identical in sequence and had P ≥ 10^-15 in a BLASTP search. Overlapping Fugu cosmids were identified manually and reduced to one entry in Table 1.
Fugu proteins from completely sequenced cosmids were compared to the database of human sequences from GeneMap '98 using the TBLASTN programme with the SEG filter. Only hits with P ≤ 10^-15 that were no more than 10^5 times less likely than the top hit were accepted. Only the best hit per chromosome was included in further analysis.
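The hit-retention rule described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `filter_hits`, the (subject, P-value) tuple layout and the example values are assumptions made for the example. Only the two thresholds (10^-15 and a factor of 10^5) come from the text.

```python
def filter_hits(hits, p_cutoff=1e-15, window=1e5, strict=False):
    """Keep the top hit (if significant enough) plus near-equal weaker hits.

    hits   : list of (subject_id, p_value) pairs for one query (hypothetical layout).
    strict : if True, every retained hit must also pass p_cutoff, as described
             for the fully sequenced cosmids; if False, only the top hit must.
    """
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: h[1])      # smallest P value first
    top_p = ranked[0][1]
    if top_p > p_cutoff:                           # top hit not significant enough
        return []
    limit = top_p * window                         # within a factor of 10^5 of the top hit
    kept = [h for h in ranked if h[1] <= limit]
    if strict:
        kept = [h for h in kept if h[1] <= p_cutoff]
    return kept


# Invented P values, for illustration only.
example = [("HUMAN_A", 1e-40), ("HUMAN_B", 1e-37), ("HUMAN_C", 1e-20)]
kept = filter_hits(example)
```

In this toy case the third hit is dropped because it is more than 10^5 times less significant than the top hit, even though it passes the absolute cutoff.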
Some of the limitations on the analysis of the skimmed cosmids become apparent when the results are compared with the fully sequenced cosmids. Cosmid 168J21 has been fully sequenced under Accession No. AJ010348 (Cottage et al., 1999). The full sequence has three annotated proteins, all of which had human homologues on chromosome 3. In the analysis of the skimmed cosmid sequence only one gene was found. As all three human orthologues are in the SwissProt database, it must be the case that the cosmid subclones do not include the coding sequences of the other two genes.

Human GeneMap '98 sequences
Deloukas et al. (1998) compiled a map (GeneMap '98) of human gene-based markers by radiation hybrid mapping. This includes approximately 30 000 genes. By electronic PCR (Schuler, 1997) they found the corresponding genomic sequence, mRNA and/or EST from the public databases. These results are updated weekly and were downloaded from the NCBI FTP site on 21 December 1998.
A BLAST database of human sequences represented on this radiation hybrid map was created. In order to have comparable map units, only the data from the GeneBridge4 panel (Gyapay et al., 1996) were included. Some parts of the genome are represented more than once in the ePCR output because they have been sequenced more than once as genomic sequence, mRNA and/or EST. Redundancies of this kind were removed, preferentially keeping genomic sequences over mRNA over unfinished sequences over ESTs. The final database had 28 133 entries, totalling 226 506 753 nucleotides.
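The redundancy filter can be illustrated with a short sketch. The record layout (marker, sequence type, accession), the type labels and the accessions are hypothetical. Only the preference order (genomic, then mRNA, then unfinished sequence, then EST) is taken from the text.

```python
# Lower number = preferred sequence type, following the order given in the text.
PRIORITY = {"genomic": 0, "mRNA": 1, "unfinished": 2, "EST": 3}


def pick_representatives(entries):
    """Return one (seq_type, accession) per marker, keeping the preferred type.

    entries: iterable of (marker_id, seq_type, accession) tuples (hypothetical layout).
    """
    best = {}
    for marker, seq_type, accession in entries:
        current = best.get(marker)
        if current is None or PRIORITY[seq_type] < PRIORITY[current[0]]:
            best[marker] = (seq_type, accession)
    return best


# Invented records, for illustration only.
records = [
    ("marker1", "EST", "acc_est_1"),
    ("marker1", "mRNA", "acc_mrna_1"),
    ("marker2", "unfinished", "acc_unf_2"),
    ("marker2", "genomic", "acc_gen_2"),
    ("marker3", "EST", "acc_est_3"),
]
representatives = pick_representatives(records)
```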
Some markers in GeneMap '98 are listed with several allocated map positions. In these cases we used either the position found in several independent experiments or the position with the highest confidence value, as determined by Deloukas et al. (1998). Distances within the genome were estimated by counting the number of intervening genes in GeneMap '98. We then adjusted these values for missing data by multiplying this number by 80 000/30 000 (assuming the human genome contains 80 000 genes and the map contains 30 000 genes).
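The distance correction amounts to a simple rescaling, sketched below. The function names and the toy map are assumptions for illustration. The value 226 is the count of mapped sequences quoted in the Discussion for the interval between wnt10b/ARF3 and erbB3.

```python
def intervening_mapped(map_order, gene_a, gene_b):
    """Count mapped genes lying strictly between two genes on an ordered map."""
    i, j = map_order.index(gene_a), map_order.index(gene_b)
    return abs(i - j) - 1


def scale_to_genome(mapped_between, total_genes=80_000, mapped_genes=30_000):
    """Correct a count of mapped intervening genes for genes missing from the map."""
    return mapped_between * total_genes / mapped_genes


toy_map = ["g1", "g2", "g3", "g4", "g5"]           # hypothetical ordered markers
between = intervening_mapped(toy_map, "g1", "g5")  # g2, g3 and g4 lie in between
estimate = scale_to_genome(between)

erbb3_gap = round(scale_to_genome(226))            # about 603 genes
```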

Computer simulation of genomic rearrangement
In order to make this simulation as realistic as possible, paralogues were assigned at the frequencies observed in the real data. Of the 91 Fugu proteins analysed, 78 had hits in the database of mapped human sequences. The distribution of hits is as follows: 47 hit one human sequence, 14 hit two, eight hit three, two hit four, and families of seven, 11, 12, 15, 39, 42, and 59 human proteins were observed once each. More extensive human protein family size data from an intragenome comparison (Imanishi et al., 1997) were used to confirm these results in an independent simulation.

Compaction of Fugu introns
The Fugu genome is much smaller than the human genome, but by virtue of being vertebrate is presumed to have a similar gene repertoire (Brenner et al., 1993). The difference in size must therefore be primarily due to differences in non-coding DNA, including both intergenic and intronic DNA. In vertebrate genomes there is a correlation between gene length and G+C content, with long genes being rare in G+C-rich isochores (Duret et al., 1995). This suggested that there might be a correlation between base composition and the size difference between a human gene and its Fugu homologue.
Orthologous Fugu and human introns were identified by finding orthologous genomic sequences in GenBank, aligning the protein sequences using the Gap programme (with default settings) of the GCG package, and mapping intron locations onto the protein alignment. Introns were designated orthologous if they were in the same phase and occurred at precisely the same position in the protein alignment produced by Gap. No allowance was made for possible intron sliding during evolution. Using this method, 199 pairs of orthologous introns from 22 genes were found. There were only six cases where we could say with confidence that an intron had been gained or lost after the divergence of these two species. These were all cases where there was an unambiguous alignment of the two protein sequences, and where an intron was present in one sequence but there was no equivalent intron nearby or out of phase in the other organism. Non-coincident introns and introns in ambiguous alignments were excluded from further analysis. Recent research by Hurst et al. (1999) tentatively suggests that there may be a dichotomy in the relationship of synonymous GC content and intron size, with warm-blooded vertebrates showing a negative correlation, as previously observed, and cold-blooded vertebrates (including Fugu) showing a positive correlation. However, this is not borne out here. In our dataset there is no correlation between intron size and GC3 content of the genes that house them.
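The criterion for calling two introns orthologous (same alignment position and same phase, with no allowance for sliding) can be sketched as a simple lookup. The tuple layout and the coordinates below are invented for illustration and do not correspond to any gene in the dataset.

```python
def pair_orthologous_introns(human_introns, fugu_introns):
    """Pair introns that share both alignment position and phase.

    Each intron is (alignment_position, phase, length_bp); this layout is an
    assumption. Returns a list of (human_length, fugu_length) pairs.
    """
    fugu_by_site = {(pos, phase): length for pos, phase, length in fugu_introns}
    pairs = []
    for pos, phase, length in human_introns:
        site = (pos, phase)
        if site in fugu_by_site:          # exact match only: no intron sliding
            pairs.append((length, fugu_by_site[site]))
    return pairs


# Toy example: only the first human intron has an exact Fugu counterpart.
human = [(10, 0, 1200), (55, 2, 800), (90, 1, 300)]
fugu = [(10, 0, 100), (55, 1, 90), (120, 0, 80)]
pairs = pair_orthologous_introns(human, fugu)
```

The second human intron is not paired because the Fugu intron at the same position is in a different phase; under the rule in the text such introns are treated as non-coincident and excluded.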
Genes were assigned to three equal-sized groups according to their G+C content at codon third positions (GC3) in human, and the lengths of equivalent introns were compared (Figure 1A). The sum of the lengths of all 199 introns in Fugu was 59 392 bp, just over eight times smaller than the sum of the lengths of all the human introns (488 726 bp). The large introns of GC3-poor genes are seen to be severely compacted. The compaction averages are 2.9, 6.0 and 14.6, respectively, for the high-, medium-, and low-GC3 groups of genes (Figure 1A), which is broadly consistent with expectations. One-fifth of the Fugu introns (41 of the 199) are actually larger than their human counterparts (many only marginally so), and most of these are in genes with high GC3 in the human (Figure 1B). However, for the majority of introns (Figure 1B) there does not appear to be any consistent relationship between intron lengths in the two species, or between these and GC3 in their host genes.
The compaction of individual genes, instead of individual introns, was also calculated (Figure 1C, D). Compaction was calculated by dividing the sum of the lengths of introns in a human gene by the sum of the lengths of their Fugu orthologues (excluding any non-coincident introns). The compaction values range from 46 (in the APP gene; Villard et al., 1998) down to values of less than 1 in two genes (growth hormone and int1/wnt1), where the Fugu gene is larger than the human one. If the GC3 content of a gene and the compaction of its introns are related, then one would expect the greatest compaction to be between human genes with low GC3 and Fugu genes with high GC3. Rather surprisingly, there does not appear to be any relationship between the degree of compaction and the base composition in either species (Figure 1C), or the amount of interspecies difference in base composition (Figure 1D). The two most severely compacted genes have similar GC3 content in Fugu and human (Figure 1D).
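The compaction measure is a ratio of summed intron lengths, as in the short sketch below. The function name and the toy gene are assumptions. The two totals (488 726 bp and 59 392 bp) are the values reported above for all 199 intron pairs.

```python
def compaction(human_intron_lengths, fugu_intron_lengths):
    """Human total intron length divided by the Fugu total for orthologous introns."""
    return sum(human_intron_lengths) / sum(fugu_intron_lengths)


# Overall compaction from the totals reported in the text.
overall = compaction([488_726], [59_392])

# Toy gene with three orthologous intron pairs (invented lengths).
toy_gene = compaction([1200, 800, 400], [100, 120, 80])

# A value below 1 means the Fugu introns are, in total, longer than the human ones.
expanded = compaction([300], [450])
```

The overall ratio comes to roughly 8.2, matching the statement that the Fugu introns are just over eight times smaller in total.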

Synteny conservation between Fugu and human
Synteny conservation between two species can be measured in two directions. We can ask, 'What proportion of genes that are syntenic in species A are also syntenic in species B?', or conversely, 'What proportion of genes that are syntenic in B are also syntenic in A?'. These are two distinct quantities, as becomes obvious if one considers a hypothetical case where one of the species has only a single chromosome. The only syntenic genes that are known in Fugu are those that have been sequenced on the same clone; there are no large-scale maps of chromosomes. Therefore, we measured Fugu/human synteny conservation in terms of the proportion of neighbouring genes (from the same clone or GenBank entry) in Fugu that are syntenic in human. We also applied various limits to the physical distance permitted between the syntenic genes in human. Two separate datasets were analysed, as described below.
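The asymmetry between the two directions can be seen in a toy example in which one species has a single chromosome. The gene and chromosome names are invented, and the pairwise definition used here is a simplification chosen for illustration.

```python
from itertools import combinations


def conserved_fraction(chrom_a, chrom_b):
    """Fraction of gene pairs syntenic in species A that are also syntenic in B.

    chrom_a, chrom_b: dicts mapping gene name -> chromosome name.
    Returns None if no pair is syntenic in A.
    """
    genes = sorted(set(chrom_a) & set(chrom_b))
    syntenic_in_a = [(g, h) for g, h in combinations(genes, 2)
                     if chrom_a[g] == chrom_a[h]]
    if not syntenic_in_a:
        return None
    kept = sum(chrom_b[g] == chrom_b[h] for g, h in syntenic_in_a)
    return kept / len(syntenic_in_a)


single = {"g1": "chr1", "g2": "chr1", "g3": "chr1", "g4": "chr1"}  # one chromosome
multi = {"g1": "chrA", "g2": "chrA", "g3": "chrB", "g4": "chrB"}   # two chromosomes

single_to_multi = conserved_fraction(single, multi)  # 2 of 6 pairs conserved
multi_to_single = conserved_fraction(multi, single)  # 2 of 2 pairs conserved
```

Every pair that is syntenic in the two-chromosome species is trivially syntenic in the single-chromosome species, but the reverse holds for only a third of the pairs.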

Synteny conservation: 'cosmid skimming' data
The HGMP-RC Fugu landmark mapping project (Elgar, 1996; Elgar et al., 1996; Elgar et al., 1999) surveyed the Fugu genome by limited sequencing ('skimming') of a large number of genomic cosmid clones. Sets of shotgun sequence reads for 850 randomly chosen cosmids are publicly available from their website (http://fugu.hgmp.mrc.ac.uk/). The data consist of 40 303 sequence reads, with an average of 47 reads per cosmid and 486 bp per read. Each read is assumed to contain no more than one gene.
Because these sequences are short and largely unannotated, we compared them to human data from SwissProt, rather than GeneMap '98 (which contains a large number of EST sequences). Cytogenetic map positions for 3963 of the 5406 human proteins in SwissProt were obtained by following links to OMIM. All 5406 proteins were searched against the Fugu cosmid database, using TBLASTN (Altschul et al., 1990). Putative orthologous relationships were identified as described in Materials and methods.
A Fugu cosmid was considered 'informative' (i.e. it appeared to contain more than one gene, and so contained linkage information) if two different sequence reads hit two different mapped human sequences which did not themselves show significant sequence identity to one another. We identified 48 informative cosmids, containing 58 links between nearby Fugu genes (Table 1). For 26 of these links (45%), the human homologues are on the same chromosome (i.e. synteny was conserved). The same Fugu Landmark Mapping Project data were recently analysed by Elgar et al. (1999). They reported that 'three-quarters' of informative cosmids showed synteny to human. However, it is difficult to account for the differences between our results and theirs as they do not specify what stringency they imposed on the definition of orthology, neither do they indicate which cosmids displayed an orthologous relationship with which human sequences. Perhaps the greatest discrepancy between these analyses is in the number of informative cosmids found (349 by Elgar et al. compared to 48 in this study). We expect that this difference is due to a greater stringency employed by us in the designation of orthologues (as described in Materials and methods).
Table 1 notes: (a) The '+' column refers to conserved linkages between Fugu and human, and the '−' column refers to non-conserved linkages. (b) All SwissProt IDs are truncated, omitting '_HUMAN' from each one.

Synteny conservation: complete Fugu genomic sequences
We examined the GenBank annotation of all Fugu sequences greater than 5 kb long to look for sequences that coded for two or more proteins. The 21 GenBank entries that fit this criterion (Table 2) total just under 0.9 Mb and encode 91 annotated proteins (some putative). Genes from the same GenBank entry have a known linkage relationship in the Fugu genome because they were sequenced contiguously. The proteins encoded by these Fugu sequences were compared using TBLASTN to the database of human nucleotide sequences whose map positions are known in GeneMap '98 (Deloukas et al., 1998). For some of the Fugu sequences, our results confirm previously published analyses (Aparicio et al., 1997; Armes et al., 1997; Schofield et al., 1997; Miles et al., 1998; Brunner et al., 1999; Gellner and Brenner, 1999; Reboul et al., 1999).
The results were examined to look for candidate conserved syntenous regions between human and Fugu. This was facilitated by a new method for displaying the relative positions of the homologues in the two species. In many cases, such as in the example shown in Figure 2, there was more than one candidate human chromosomal region for conserved synteny. In Figure 2 the Fugu sequence (AF056116) appears to have conserved synteny with human chromosome 12 by virtue of having several top-scoring BLAST hits to human genes that map close together on that chromosome, largely as described by Gellner and Brenner (1999). What is interesting is that regions on chromosomes 7, 17 and 2 also show synteny with this Fugu sequence (including matches to Fugu proteins not having homologues on chromosome 12: genes 3, 4, 6 and 14; Figure 2). These are the human chromosomes that contain the HOX clusters and this indicates that the similarity of these human chromosomes to each other extends beyond those clusters, as has been suggested by others (Ruddle et al., 1994).
To examine synteny conservation in a quantitative way, instead of simply the presence or absence of genes on the same chromosome, we calculated the proportion of Fugu close neighbours (genes from the same GenBank entry) whose homologues were within a specified distance, x, of each other in human. We use the term 'proximity conservation' to denote this property of genes remaining within a specified distance of each other (regardless of gene order). To allow for the uneven distribution of genes in the human genome, the distance x was expressed in terms of the estimated number of intervening genes, instead of in the physical map units (cR) that were used in GeneMap '98 (Deloukas et al., 1998). The number of intervening genes was estimated from GeneMap '98 by counting the number of intervening genes appearing on the map between the genes of interest and scaling by a factor of 80 000/30 000 to allow for unsequenced genes. This allows for gene density variation within and between chromosomes. Where more than one human sequence had been assigned to the same map position by Deloukas et al. (1998), these sequences were arbitrarily assigned an order.
The results are summarised in Table 3. Only 18% of Fugu neighbours have sequenced human homologues that are within 10 genes of one another. This increases to 39% within a limit of 200 intervening genes, and to a maximum of 47% within a limit of 4000 intervening genes (this is effectively no limit, because it is approximately the size of a chromosome). The last value is similar to the synteny estimate from Table 1 (which has no limit on the intervening distance).
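The proximity measure can be sketched as below. This is a simplified pairwise version for illustration: Table 3 defines x relative to the nearest neighbour in a syntenous group, whereas this sketch scores each pair of Fugu neighbours independently. The data layout and toy positions are assumptions.

```python
def proximity_conservation(pairs, max_gap, scale=80_000 / 30_000):
    """Fraction of Fugu neighbour pairs whose human homologues lie within max_gap genes.

    pairs: list of ((chrom, rank), (chrom, rank)), where rank is the index of
           the human homologue on that chromosome's gene map (hypothetical layout).
    scale: correction for genes missing from the map (80 000/30 000 in the text).
    """
    conserved = 0
    for (chrom_a, rank_a), (chrom_b, rank_b) in pairs:
        if chrom_a != chrom_b:
            continue                                   # not syntenic in human
        gap = max(abs(rank_a - rank_b) - 1, 0) * scale  # estimated intervening genes
        if gap <= max_gap:
            conserved += 1
    return conserved / len(pairs)


# Invented positions: one close pair, one distant pair, one non-syntenic pair.
toy = [
    (("12", 100), ("12", 103)),
    (("12", 100), ("12", 400)),
    (("12", 5), ("7", 6)),
]
within_10 = proximity_conservation(toy, 10)
within_4000 = proximity_conservation(toy, 4000)
```

As in Table 3, widening the permitted distance can only raise the measured proportion, and a very large limit reduces the measure to plain synteny conservation.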

Computer simulation of genomic rearrangement
We used computer simulations to try to relate the observed level of proximity conservation to the number of genomic rearrangements that have occurred since the divergence of Fugu and human. The simulation started with a linear array of 80 000 genes, representing the current gene order in Fugu. Varying numbers of rearrangements were made in a copy of this genome (representing human) by randomly choosing two endpoints in the genome and inverting the segment in between. To reflect the missing data in the human map, randomly chosen genes were marked 'unmapped' until only 30 000 remained (the number of genes in Deloukas et al., 1998). Pairs of genes that are neighbours in Fugu were then examined to see if they are neighbours in human, similar to the method of analysis in Tables 1 and 3.
To make the simulation more realistic, we modelled the presence of gene families. Because more than half of all human genes are still not included in the human gene map, there is a real possibility that if the human orthologue of a Fugu gene is not mapped, the Fugu gene would mistakenly be paired with a mapped human paralogue instead. This could reduce the estimated level of synteny conservation. Simulating this problem requires knowledge of the distribution of gene family sizes, which we addressed in two ways. First, we used the distribution of the numbers of human BLAST hits to the Fugu proteins considered in Table 2 (plus annotated putative proteins, totalling 91) as an approximation of the distribution of family sizes. Second, we used the distribution calculated by Imanishi et al. (1997) from an all-against-all FASTA comparison of human proteins translated from mapped entries in DDBJ/EMBL/GenBank. In both cases the family sizes were scaled by a factor of 8/3 to account for unsequenced and unmapped genes. The latter (within-genome) method has the advantage that all the hits to a protein represent paralogues, whereas with between-genome comparisons the orthologues must be identified and removed before gene families can be examined. The results from the two methods were similar and only those using the Fugu data are presented here. Paralogous gene families were randomly assigned among the 80 000 genes in the simulated genome, according to the distributions described above. This process resulted in each simulated Fugu gene having one human orthologue, and possibly also a list of human paralogues, analogous to a list of BLAST hits. Some of the orthologues and paralogues could be 'unmapped'. Linkage conservation was measured by looking for the human homologues of 1000 pairs of adjacent Fugu genes, chosen at random. If the human orthologue of one (or both) of the Fugu genes in the pair was 'unmapped', a mapped paralogue from the list was used instead where possible.
Figure 2. Comparison of Fugu sequence AF056116 (Gellner and Brenner, 1999) against a database of mapped human sequences (GeneMap '98). The relative positions of the best hits of each of the 15 annotated Fugu proteins from this cosmid are shown for each chromosome in turn. The horizontal axis represents position (measured in centiRays) on the human chromosome in question, and each vertical axis represents the relative order (1–15) of the Fugu genes on the Fugu cosmid. White dots designate the top-scoring TBLASTN hit for each Fugu protein; black dots indicate weaker hits (that are within a factor of 10^5 of the strongest hit). The genes are in the following order in Fugu: 1, ACVR1B; 2, ALR; 3, fhh; 4, R05D3.2-like protein; 5, 138E3.2-like protein; 6, Ikaros-like; 7, wnt1; 8, wnt10b; 9, ARF3; 10, erbB3; 11, PAS1; 12, rpl41; 13, 178O23.1-like protein; 14, diaphanous-like protein; 15, LRP1. In addition to the matches shown here (based on data in GeneMap '98), genes 1, 7, 12 and 15 also have homologues on chromosome 12q13 (Kenmochi et al., 1998; Gellner and Brenner, 1999). The positions of Hox clusters A, B, C and D are represented by crosses on chromosomes 7, 17, 12 and 2, respectively.
Table 3. Observed levels of synteny conservation between completely sequenced Fugu cosmids and human. In cases where there is more than one candidate human chromosome, * marks the human chromosome with the highest number of top-scoring BLAST hits, which was used in the calculation of the totals at the bottom. Some of these relationships to human chromosomes have previously been described by the original authors (Aparicio et al., 1997; Armes et al., 1997; Schofield et al., 1997; Miles et al., 1998; Brunner et al., 1999; Gellner and Brenner, 1999; Reboul et al., 1999). The quantity x is the largest allowed distance (in genes) between one of the human homologues and its nearest neighbour in the syntenous group. For the parts of the genome studied here, the intervals of x = 5, 10, 20, 50, 200, 1000 and 4000 genes correspond to average physical distances of 0.50, 1.27, 2.81, 7.05, 29.14, 144.01 and 494.04 cR, respectively.
Figure 3. Extent of proximity conservation in real and simulated datasets. Proximity was measured allowing different gene distances between the human homologues of pairs of linked Fugu genes. The line marked 'OBS' graphs the observed data (Table 3). Average results for 30 computer simulations with 0, 8000, 16 000, 32 000 and 80 000 rearrangement breakpoints are shown (8000 breakpoints = 4000 rearrangements). The x axis is the limit used for the distance permitted between two human genes that are homologues of two Fugu neighbours, expressed in terms of the estimated number of intervening genes on the chromosome.
The extent of linkage conservation in human was then calculated, allowing various intervals between the human homologues. The simulation was run 30 times, looking at 1000 pairs of genes each time, with the average results shown in Figure 3. The most striking feature in the simulation results is that the presence of paralogues in an incompletely sequenced genome has a substantial effect on the measured extent of linkage conservation. If there have been no genomic rearrangements (top curve in Figure 3), gene order conservation (and thus proximity conservation) should be 100%. However, the measured level is only 37%, because for many gene pairs, one or both of the human orthologues is 'unmapped' and a mapped paralogue at some other location in the genome has been used instead. This makes many linkages appear broken artefactually. Our measures of proximity conservation in the real data may also be underestimated to a similar degree (see Discussion). When the observed data from the fully sequenced Fugu cosmids (Table 3) are plotted on the same axes, the initial slope is much greater than for the simulations (Figure 3). Possible reasons for this are discussed below. At large window sizes the line is approximately the same as the simulations with 8000–32 000 breakpoints.
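The core of the rearrangement simulation (random inversions on a linear gene array, followed by scoring of adjacent Fugu pairs) can be sketched as follows. This is a scaled-down illustration under stated assumptions, not the original programme: the genome is reduced to 8000 genes with 3000 mapped so that it runs quickly, gene families and paralogue substitution are not modelled, and pairs with an unmapped member are simply excluded from scoring. All names and defaults are hypothetical.

```python
import random


def simulate(n_genes=8000, n_mapped=3000, n_inversions=400,
             n_pairs=1000, window=200, seed=0):
    """Return the fraction of scored Fugu neighbour pairs within `window` genes in human."""
    rng = random.Random(seed)
    human = list(range(n_genes))          # Fugu order is 0..n-1; human starts as a copy
    for _ in range(n_inversions):
        i, j = sorted(rng.sample(range(n_genes + 1), 2))   # two random breakpoints
        human[i:j] = human[i:j][::-1]                      # invert the segment between them

    position = [0] * n_genes              # position of each gene in the human order
    for pos, gene in enumerate(human):
        position[gene] = pos

    mapped = set(rng.sample(range(n_genes), n_mapped))     # genes present on the map
    # Simplification: only score Fugu neighbours whose orthologues are both mapped.
    eligible = [g for g in range(n_genes - 1) if g in mapped and g + 1 in mapped]
    if not eligible:
        return None

    conserved = 0
    for _ in range(n_pairs):
        g = rng.choice(eligible)
        intervening = abs(position[g] - position[g + 1]) - 1
        if intervening <= window:
            conserved += 1
    return conserved / n_pairs


no_rearrangement = simulate(n_inversions=0)
many_rearrangements = simulate(n_inversions=2000)
```

Because paralogue substitution is left out, this sketch returns 100% conservation when no inversions are applied, unlike the 37% reported above for the full model. The gap between the two is the artefact that the gene-family modelling was designed to capture.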

Discussion
Although we confirmed the compaction of Fugu genes with respect to their human orthologues, we did not observe any strong relationship between gene compaction and the synonymous G+C content of the gene in either species. This may be an artefact of the sample analysed, or it may indicate true randomness in the compaction of the Fugu genome. There is an inverse relationship between the average compaction of the genes in each GC3 content category and their average GC3 content, which is consistent with expectations based on the lengths of genes in G+C-rich isochores in vertebrate genomes (Duret et al., 1995). However, this relationship is not strong enough to be predictive for individual genes.
The incomplete nature of the human genome data, and the uncertainty regarding whether homologues found in BLAST searches are orthologues or paralogues, reduce our power to examine synteny conservation between Fugu and human. The measured proximity conservation depends not only on whether the genes remain close or not, but also on whether they are mapped and sequenced, and on whether there are paralogues of these genes in the mapped data. The simulations (Figure 3) suggest that the combination of incomplete sequence sets and the presence of gene families may cause the level of proximity conservation to be underestimated substantially, perhaps two-fold.
There is an obvious discrepancy between the slope of the graph of proximity conservation in real data from fully sequenced cosmids, as compared to the results from computer simulations (Figure 3). The observed proximity conservation rises steeply to 37% at a window size of 100 intervening genes, and then plateaus to a shape more like the simulated data. This suggests that the assumptions underlying the simulation are incorrect in some way.
The steep rise may be attributable to three primary factors. One possibility is that the real data are not a random sample of genes from the two organisms. A bias may result from Fugu's role as a model vertebrate genome, inevitably influencing the selection of cosmids for complete sequencing. Cosmids with hypothesised synteny conservation with mammalian genomes may have been chosen preferentially. At least five of the Fugu complete sequences used had known synteny conservation with human chromosomes prior to sequencing (Armes et al., 1997; Sandford et al., 1997).
Second, lack of resolution and incomplete data in GeneMap '98 may affect the results. The arbitrary ordering of human genes that lie in the same radiation hybrid map interval could inflate apparent distances in human, although this effect is unlikely to be significant because the average number of genes per interval in the GeneMap '98 data used here is only 1.98. At least one distance in Table 3 has been overestimated due to missing data in GeneMap '98. This occurs with the genes TSC2 and PKD1 (Fugu Accession No. AF013614), which are neighbours in both species. However, PKD1 is not present in the map and instead our method identified a PKD1-like sequence elsewhere on chromosome 16 (Loftus et al., 1999).
A third factor may be that our model of rearrangements is too simple. Our model assumed a random distribution of breakpoints throughout the genome, but comparative analysis of the human and mouse maps has shown that, although interchromosomal rearrangements seem to have random endpoints, the number of intrachromosomal rearrangements is more than expected at random (Ehrlich et al., 1997; Nadeau and Sankoff, 1998). The steep incline at the beginning of the graph may indicate a high frequency of small inversions or other small intrachromosomal rearrangements. Inversions of small segments of chromosome would disrupt gene adjacencies while preserving gene vicinities. This has been proposed by Gilley and Fried (1999), who noticed that some genes that are adjacent in Fugu are 2–4 Mb apart in human. Further examples from our study include wnt10b, ARF3 and erbB3. These genes are adjacent in Fugu (Gellner and Brenner, 1999). In human, wnt10b and ARF3 are adjacent but erbB3 is separated from them by an estimated distance of 603 genes (226 mapped GenBank sequences scaled by 8/3 to allow for missing data) or 7.5 Mb [estimated from the map distance of 31 cR; chromosome 12 has an average of 234 kb/cR (Gyapay et al., 1996)].
It is likely that the initial portions of the simulations in Figure 3 are not directly comparable with the observed data. However, as the window size gets larger the graph lines are approximately parallel to the plot of the observed data. From these, an estimate of the extent of rearrangement since the divergence of these two lineages 400 million years ago is 8000–32 000 breakpoints (i.e. 4000–16 000 reciprocal translocations or inversions). This is higher than expected from comparisons of the human and mouse genomes, which diverged 100 million years ago and have only had an estimated 180 rearrangements (Nadeau and Taylor, 1984; Nadeau and Sankoff, 1998). Adjusting our simulations to incorporate a bias towards small rearrangements would only increase the estimated number of rearrangements since the Fugu–human divergence, making the discrepancy in rates even greater. Our estimate of the number of breakpoints depends somewhat on the estimated number of genes in the genome. Simulations based on a gene number of 61 000 instead of 80 000 (see Dunham et al., 1999) led to an estimate of roughly 6000–12 000 breakpoints. Simulations using a gene number of 143 000 (recently suggested by Incyte Pharmaceuticals; see Dickson, 1999) produced the unexpected result that no rearrangements were called for: the number of proximities observed in Fugu exceeded what would be expected due to the now-sparse sampling of genes, so either Incyte's estimate is unrealistic or some of the orthologues listed in Table 3 are actually paralogues.
Another possible shortcoming of our analysis is the presence of short ESTs (which are not necessarily coding sequence) in the human DNA database used here, resulting in an overestimate of the frequency with which we can expect to find orthologues in this dataset from an amino acid level search. However, this is unlikely to have a great effect on the results, because we found that 78% of a random sample of over 500 human proteins submitted to TREMBL after we downloaded GeneMap '98 were represented in the database. The gene family data are also likely to be oversimplified, as they are based on results from only 91 Fugu proteins. The Imanishi et al. (1997) data are from a larger set of proteins but are not as easy to relate to the human dataset used in this analysis.
Because we have approached the question of synteny conservation from the perspective of known gene adjacencies in Fugu, the proposed genome duplication in the bony fish lineage (Amores et al., 1998; Meyer and Schartl, 1999), followed by differential gene loss, should not influence the results. If genes in the ancestral genome were ordered ABCD and this was duplicated in the fish lineage, differential gene loss could result in paralogous chromosomes, one bearing AC and another bearing BD. If synteny of these genes had not been disturbed, then the human genome would still contain the four genes arranged ABCD. If we were counting conservation of human linkages in Fugu, then we might plausibly have selected genes A and B for analysis and found that they are not syntenous in Fugu, an artefact of gene loss rather than genome rearrangement. However, as we are starting from the complementary viewpoint (given known relationships in Fugu), the only possible questions are 'Are A and C syntenous in the human genome?' and 'Are B and D syntenous in the human genome?', and the answer is yes in both cases. It is, however, possible that differential gene loss (after the genome duplication) in the Fugu lineage has contributed to the reduction of some intergenic distances as compared to human (e.g. the distance from A to C in the hypothetical example). This may also contribute to the steep initial slope seen in Figure 3. One example of apparent differential gene loss may already have been discovered in the case of the genes IGF2 and TH (insulin-like growth factor and tyrosine hydroxylase), which are adjacent in Fugu but separated by one intervening gene (insulin) in human (E. Chen et al., unpublished; GenBank Accession No. AL021880; Lucassen et al., 1993). Patterns of gene loss and gene order evolution should become clearer when more long homologous sequences from these species become available.
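The hypothetical ABCD example can be written out as a toy check. The chromosome labels are invented. The gene arrangement follows the scenario described above, in which duplication and differential loss leave Fugu with one chromosome bearing AC and another bearing BD.

```python
# Human retains the ancestral arrangement ABCD on one chromosome (hypothetical label h1).
human_chrom = {"A": "h1", "B": "h1", "C": "h1", "D": "h1"}

# Fugu after duplication and differential gene loss: AC on one copy, BD on the other.
fugu_chrom = {"A": "f1", "C": "f1", "B": "f2", "D": "f2"}


def syntenic(gene_1, gene_2, chrom_of):
    """True if both genes lie on the same chromosome in the given species."""
    return chrom_of[gene_1] == chrom_of[gene_2]


# Starting from known Fugu neighbours and asking about human:
fugu_to_human = [syntenic(a, b, human_chrom) for a, b in [("A", "C"), ("B", "D")]]

# Starting from human neighbours A and B and asking about Fugu:
human_to_fugu = syntenic("A", "B", fugu_chrom)
```

Both Fugu pairs are syntenic in human, whereas the human pair A and B appears unlinked in Fugu purely because of gene loss. This is why measuring from the Fugu side avoids the artefact.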