PLoS Biol. 2005 Oct; 3(10): e314.
Published online 2005 Sep 6. doi:  10.1371/journal.pbio.0030314
PMCID: PMC1197285

Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate

Peter Holland, Academic Editor


The hypothesis that the relatively large and complex vertebrate genome was created by two ancient, whole genome duplications has been hotly debated, but remains unresolved. We reconstructed the evolutionary relationships of all gene families from the complete gene sets of a tunicate, fish, mouse, and human, and then determined when each gene duplicated relative to the evolutionary tree of the organisms. We confirmed the results of earlier studies that there remains little signal of these events in numbers of duplicated genes, gene tree topology, or the number of genes per multigene family. However, when we plotted the genomic map positions of only the subset of paralogous genes that were duplicated prior to the fish–tetrapod split, their global physical organization provides unmistakable evidence of two distinct genome duplication events early in vertebrate evolution indicated by clear patterns of four-way paralogous regions covering a large part of the human genome. Our results highlight the potential for these large-scale genomic events to have driven the evolutionary success of the vertebrate lineage.


It has long been hypothesized that the increased complexity and genome size of vertebrates have resulted from two rounds (2R) of whole genome duplication (WGD) occurring in early vertebrate evolution, thus providing the requisite raw materials [1]. This seemed to be supported by the long-standing speculation that humans have about 100,000 genes, roughly four times the number expected for invertebrate genomes, but this is now known to be incorrect, with the actual human gene count being closer to 30,000 [2,3]. Conflicting analyses have since made the hypothesis very controversial, with some studies supporting 2R (e.g., [4–8]), others seeing only a single round of WGD (e.g., [9–11]), and still others refuting WGD altogether by concluding that nothing greater than limited segmental duplications has occurred (e.g., [12,13]).

The 2R hypothesis had been bolstered by observations that a few gene families, e.g., Hox clusters [14], follow a “4:1 rule” in the numbers of vertebrate to invertebrate genes. However, comparison of the complete genome sequences of human [2,3] and Drosophila [15] revealed that less than 5% of homologous gene families follow the 4:1 rule [12]. Further, although two sequential duplications are expected to generate the evolutionary topology (AB)(CD) for the descendant genes, rather than (A)(BCD), the relationships of vertebrate multigene families do not generally show this pattern, as indicated by early studies using only a few genes [16] and confirmed as complete genome sequences became available [2,13]. (However, for a different view, see [17].) Several studies have incorporated data from sparse sampling of genes from taxa thought to have branched near these purported duplications, including lamprey [18], hagfish, amphioxus [17,19–22], and Ciona [23]. Although these results are useful for timing duplications, the conclusions could never be viewed as definitively resolving this issue, because these products could alternatively have been generated by duplications of individual genes or short gene segments rather than by WGDs. Even duplicating all of the genes in a genome individually is quite different from a whole genome duplicating simultaneously.

There are several reasons why this has been a difficult issue to resolve. After duplication, only a minority of gene pairs will adopt a new function (“neofunctionalization”) or partition old functions (“subfunctionalization”) quickly enough to escape disabling mutations that would lead to their eradication [24]; therefore, rampant gene loss rapidly erases this signal of genome duplication. Further, four-member gene families, even those with the (AB)(CD) topology, can be generated by two rounds of duplications of individual genes or of segments much smaller than the entire genome, generating a condition that cannot be differentiated on this basis from 2R followed by many gene losses. This alternative scenario seems especially plausible because recent analyses have shown that gene duplications occur much more frequently than had been thought, with the typical rate being sufficient to duplicate an entire genome equivalent every 100 million years (MY) [25,26]. Until recently, no complete genome sequence was available from an outgroup that is closely related to vertebrates, and all methods of phylogenetic reconstruction are less accurate with more distant relatives such as Drosophila and yeast [20]. Lastly, there has to date been no method to accurately and comprehensively cluster genes into homologous families, because methods that rely on sequence similarity alone are highly subject to artifactual association of slowly evolving paralogs and to erroneous exclusion of the more rapidly evolving genes.

Fortunately, as has been shown convincingly for the yeast genome and for Arabidopsis [27–30], evidence of an ancient genome duplication can be seen in the large-scale pattern of the physical locations of homologous genes, even when the great majority of the duplicated genes have been lost. Studies have shown that the human genome also has multiple regions of colinear paralogous gene copies [4,21,22,31–37], but considered the arrangements of too small a number of genes and genomic regions to be comprehensive. This approach is now available for a large-scale evaluation of the vertebrate 2R hypothesis, because complete (i.e., at least draft quality) genome sequences are available for the tunicate Ciona intestinalis [38] (a basal chordate outgroup) and the vertebrates Takifugu rubripes [39] (a pufferfish “fugu”), Mus musculus [40] (mouse), and human [2,3]. Figure 1 illustrates how the signal of two rounds of genome duplication could be retained by the large-scale pattern in location of duplicated genes, in which many tracks of paralogous duplicates (which may not contain identical subsets of genes) each occur at exactly four positions in the genome, i.e., “tetra-paralogons.” No similar signal would be generated by repeated duplications of genes or even large gene segments; only WGDs would result in such global organization of paralogous genes.

Figure 1
Pattern Predicted for the Relative Locations of Paralogous Genes from Two Genome Duplications


Gene Clustering and Duplication Timing

A graph-based method was used with the complete gene sets of the four chordates (98,517 total genes; see Table 1 for details of each step in the analysis) to generate clusters such that each contains all, and only, those genes that descended from a single gene in their common ancestor (Figure 2). A multiple sequence alignment and a maximum likelihood evolutionary tree were constructed for each cluster, then a Web browser interface was built so that each can be viewed individually. (For more details and updates that include more taxa, see the “PhIGs” [Phylogenetically Inferred Groups] Web site at http://phigs.org/.) We could then easily determine when each gene duplicated relative to lineage splitting by comparing these gene trees with the known evolutionary relationships of the animals. For example, a gene duplication that is specific to only one animal's lineage is seen as two genes from the same genome clustered together. A gene that duplicated once in the unique common ancestor of mouse and human would generate a tree that groups gene copy 1 of human and mouse and, separately, gene copy 2 of human and mouse. Put more generally, gene duplications that are shared by more than one species are seen as a replication of the phylogeny of the descendant organisms for each gene copy. Of course, gene losses and various combinations of these processes are seen as well. Figure 3 shows all possible gene topologies along with how each would be interpreted.

Figure 2
Overview of the Building of a Gene Cluster and Phylogenetic Tree Shown by a Hypothetical Example
Figure 3
Hypothetical Phylogenetic Tree Showing All Possible Types of Gene Relationships and How They Are Most Parsimoniously Interpreted
Table 1
Overview of the Process for Analyzing the Complete Gene Sets with Number of Genes Included at Each Step

This reveals that 46.6% of the ancestral chordate genes appear in duplicate in one or more of the vertebrate lineages, with 34.5% having at least one duplication before the divergence of fish from tetrapods and 23.5% having at least one duplication afterward. (Some of these are counted twice, having had duplications both before and after the fish–tetrapod split.) This means that there are 3,753 gene duplications placed at the base of Vertebrata, which is remarkable because the ancestral genome would be reasonably estimated to have had fewer than 20,000 genes, which is the case for the tunicate as well as other invertebrate outgroups. However, as can be seen in Figure 4, gene duplications are in large numbers on every branch of the tree, making it unclear whether this, in itself, indicates a significant acceleration in duplication rate. Additionally, of the gene clusters with duplications basal to the fish–tetrapod split, 20.6% have had one duplication event, 10.8% have had two, and 5.1% have had more than two, counter to the expectation from 2R, and casting further doubt on the significance of this for the 2R hypothesis.

Figure 4
Phylogenetic Analysis of the Four Chordates with Drosophila as an Outgroup

Gene Family Membership

An early observation in support of 2R was that several gene families have expanded from a single member in invertebrates to having four members for some vertebrates. Previous studies, confirmed in this analysis, have shown that this is not generally true for vertebrate multigene families [12]. As can be seen in Figure S1, there is no peak at four for gene family membership for any vertebrate. In fact, even gene duplications do not predominate; for each vertebrate species considered individually, one member per cluster is the largest category, accounting for 55%, 57%, and 59% of the fugu, mouse, and human genes, respectively, with 53.4% of the gene clusters having no duplication events whatsoever. Thus, there is no signal of 2R remaining in gene family membership, despite early anecdotal observations to the contrary.
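The tabulation behind this comparison can be illustrated with a minimal sketch. It assumes, purely for illustration, that each cluster is a list of labels of the form `species|gene`; the function name and the label format are hypothetical and are not taken from the analysis pipeline.

```python
from collections import Counter

def membership_histogram(clusters, species):
    """Count clusters by the number of genes they hold for one species.

    clusters: list of lists of 'species|gene' labels (hypothetical format).
    Returns a Counter mapping family size -> number of clusters.
    """
    counts = Counter()
    for cluster in clusters:
        size = sum(1 for label in cluster if label.split("|")[0] == species)
        if size:  # ignore clusters in which this species has no gene
            counts[size] += 1
    return counts

# Toy data: two single-copy human families and one two-member family.
toy_clusters = [
    ["ciona|c1", "human|h1", "mouse|m1"],
    ["ciona|c2", "human|h2", "human|h3", "fugu|f1"],
    ["ciona|c3", "human|h4"],
]
human_hist = membership_histogram(toy_clusters, "human")
print(human_hist)
```

Under the 2R expectation without gene loss, such a histogram would peak at four; the observation reported above is that it peaks at one.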

Determination of Concordantly Duplicated Regions

To test the extent to which the 3,753 early duplication events that are timed to the base of Vertebrata were generated as part of larger scale, multigene duplications, we examined the relative positions of these resulting paralogs in the human genome (which is currently the best assembled and annotated vertebrate genome). These results are shown in Figure 5 (and more comprehensively in Figure S2) in which the linear array of genes for each chromosome is used to query for paralogs generated by any duplication event prior to the fish–tetrapod split. It is apparent from these figures that there is a large-scale pattern of genome segments that are concordant in having similar arrangements of paralogous genes. We quantified this by identifying all cases in which two or more different early-duplicating genes are within a 100-gene window, then, for each, querying all other places in the genome, using a sliding window to count the number of cases in which their respective paralogs are within both 50 genes upstream as well as 50 genes downstream from that point. There is a distinct pattern of having multiple chromosomes matching with long linear stretches of paralogous genes. This indicates that these duplications occurred in very large segments, consistent with the hypothesis of WGD(s). Having matches to three other chromosomal segments is the dominant category, as can be seen by the darker coloring in Figures 5 and S2 and in the histogram of Figure 6. These patterns, with each genomic region corresponding in gene arrangement to sets of paralogs in three other genomic segments, are strong support for the 2R hypothesis.

Figure 5
Plot of the Genomic Positions of Paralogous Pairs of Human Genes that Arose from Duplications Predating the Fish–Tetrapod Split
Figure 6
Histogram Showing the Lower Bound Estimate of N-fold Redundancy Using the Analysis Reported in Figure 5

Although the 4-fold (i.e., including the query segment) category is the most prevalent, it accounts for only 25% of the genome. Nonetheless, it is striking that this remains the largest category despite approximately 450 MY of evolution. This constitutes a strong signal of 2R, and could not reasonably have been generated by a series of smaller duplication events. For the latter to have generated this pattern, multiple duplications of the same region (or its resulting duplicates) would have to have occurred three times, and have done so for many regions throughout the genome. We would expect, rather, that independent, random duplications would follow a Poisson distribution; this contrasting situation is seen when the same analysis is done with all human gene paralogs generated by duplication after the split of fish and tetrapods (not shown). Even if we were to consider the alternative of a single WGD followed by subsequent independent duplications of large segments, it would be difficult to explain why these would have been predominantly 2-fold for previously duplicated regions. The most parsimonious explanation for the observed pattern can only be 2R.

Tetra-Paralogon Detection

To further establish 2R, we evaluated these sets of paralogs for whether this 4-fold matching indicates that they fall into tetra-paralogons, as illustrated in Figure 1. We formalized this by first identifying paralogons (paralogous genomic segments) containing the same set of at least two duplicated gene pairs, while allowing for a maximum of 100 unduplicated genes in between (similar to the approach in [10]). (The allowance of 100 genes is arbitrary, but the results are not critically dependent on this number, which is only used to find the blocks of paralogous genes.) We infer that duplicated genes in paralogons are likely to have arisen from a single duplication involving all contained, duplicated genes, and that the unique, intervening genes have resulted from differential gene deletions and subsequent genome rearrangements.

We identified 2,953 paralogous human gene pairs that are inferred to have resulted from 1,912 genes that duplicated prior to the divergence of the fish and tetrapod lineages (with some gene losses also). Of these paralogous genes, 32.4% are still in 386 detectable paralogons comprising 772 individual genomic segments, containing from two to 42 gene pairs (Table S1). Of these 772 genomic segments, 454 comprise tetra-paralogons (Figures 7A and S3, Table 2) as shown hypothetically in Figure 1, in which overlapping sets of paralogs fall into 4-fold groups. (Unfortunately, it was not possible for us to evaluate the hypothesis of an additional genome duplication unique to ray-finned fish [41,42] because of the generally poor contiguity of the fugu draft assembly.)

Figure 7
An Arbitrarily Selected Subset of the Human Genome Showing the Physical Relationships Among Paralogous Genes
Table 2
Distribution of the Human Genome's Tetra-Paralogons by Chromosome under the Most Permissive Model for Signal Detection

In contrast, when looking at the gene pairs that arose from a duplication event after the divergence of the fish and mammal lineages (see Figure 4), we find only 11% are detected in paralogons in the human genome, indicating that these duplications have less commonly included large segments of the genome (Figure 7B). This is especially interesting in that their relative recency would make it more likely that any large duplications would remain detectable, reinforcing the contrast with the large-scale structure of those earlier duplications. By looking specifically for tandemly duplicated genes by defining them as paralogs on the same chromosome that are separated by fewer than 10 intervening genes, we can recognize that 50% of these human gene pairs arose from tandem duplication, compared with 6% for the human gene pairs that arose before the divergence of the fish and tetrapod lineages.
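The tandem-duplication criterion stated above (same chromosome, fewer than 10 intervening genes) can be written as a short sketch. The representation of a gene position as a `(chromosome, gene_index)` tuple and the function names are assumptions made here for illustration.

```python
def is_tandem(site_a, site_b, max_intervening=10):
    """True if two paralogs lie on the same chromosome with fewer than
    max_intervening genes between them.

    Each site is a (chromosome, gene_index) tuple, where gene_index is the
    rank of the gene along the chromosome (hypothetical representation).
    """
    chrom_a, index_a = site_a
    chrom_b, index_b = site_b
    intervening = abs(index_a - index_b) - 1
    return chrom_a == chrom_b and intervening < max_intervening

def tandem_fraction(pairs):
    """Fraction of paralog pairs classified as tandem duplicates."""
    if not pairs:
        return 0.0
    return sum(is_tandem(a, b) for a, b in pairs) / len(pairs)

toy_pairs = [
    (("chr1", 5), ("chr1", 15)),   # 9 intervening genes: tandem
    (("chr1", 5), ("chr1", 16)),   # 10 intervening genes: not tandem
    (("chr2", 7), ("chr9", 8)),    # different chromosomes: not tandem
    (("chr3", 40), ("chr3", 41)),  # adjacent genes: tandem
]
print(tandem_fraction(toy_pairs))
```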


No detectable signal of WGD exists in the analysis of gene family membership. There is no peak at four genes per family for any of the vertebrates (Figure S1) as might result from 2R. Presumably this results from a great number of subsequent gene losses that have erased this signal. Likewise, the phylogenetic timing of the duplication events is also inconclusive, because duplications are common on every branch (see Figure 4). Although there is a somewhat greater number assigned to the base of vertebrates, there is no reliable way to evaluate the significance of this. In fact, even if this larger number could be found to be statistically significant, it may simply indicate that this was a period with an accelerated duplication of individual genes or multigene segments or a reduction in the rate of gene loss, rather than indicating WGD.

Conclusive evidence for 2R is seen only when data from gene families, phylogenetic trees, and genomic map position are all taken together, as has been advocated by others [21,32,43]. When examining the genomic map position of only those genes in the human genome that trace their ancestry back to a duplication event at the base of vertebrates, a clear pattern of tetra-paralogons emerges, indicating that 2R occurred at the base of vertebrates. This signal remains most clearly in 25% of the human genome that forms the largest category in the analysis shown in Figures 5 and 6, but we also find that 72% of all human genes are included in the total extent of all of the paralogons that overlap with these regions, providing the least constrained estimate of the portion of the human genome still retaining structure from the 2R. This is the outside estimate, because some portion could have as well been the result of segmental duplications of regions earlier established by WGD. This is in contrast to the pattern seen for the many other gene duplications, which generated paralogs that are predominantly arranged in tandem.

This is particularly compelling considering that this signal has survived more than 450 MY of genome rearrangements and the loss of many genes. We can imagine the effect that duplications, translocations, inversions, and deletions (and combinations thereof) would have had on this analysis: (1) Duplications would cause an increase beyond the 4-fold category; (2) translocations would decrease the 4-fold category if they are pervasive enough to clear large regions of paralogs; (3) inversions can either cause a decrease in the number of chromosomes hit by moving paralogous genes beyond the detection of the sliding window analysis or cause an increase by spreading some paralogous genes across the boundaries into adjacent segments; both of which can be exacerbated by gene translocations that blur the edges of the corresponding regions; and (4) deletions would generally increase the 3-fold chromosome category at the expense of the 4-fold category, and a deletion that occurred between the two WGDs would increase the 2-fold chromosome category. Additionally, in some cases, a few individual gene deletions or translocations may have eliminated the links between pairs of duplicated genes. Through these, and combinations of these events, the original 4-fold co-linearity established by 2R (or something less than the perfect 4-fold pattern, if these duplications were long separated) has been eroded.

These tetra-paralogons are spread across nearly all human chromosomes (Table 2). Notably, chromosome Y does not have any tetra-paralogons, perhaps due to its relatively recent origin and small number of genes, or perhaps this indicates a more rapid rate of gene movement. Chromosome 21 also has no tetra-paralogons, and Chromosome 18 has only one that is small. These chromosomes, and other regions without tetra-paralogons, could be of recent origin or they could have undergone multiple rearrangement events that would have destroyed the signal.

Although our study does not specifically address the effect that 2R has had on vertebrate evolution, we note two interesting observations. First, the vast majority of duplicated genes were subsequently deleted, indicating that relatively few genes may have been responsible for the increased complexity seen in vertebrates. Second, it is possible that many genes were loosed from constraint after the genome duplications and experienced an accelerated rate of sequence change before returning to single copy, and it is possible that this has played some role in the evolution of vertebrate complexity [44].

The mechanism of these genome duplication events, whether two separate rounds of either auto- or allo-tetraploidy or a single octoploidy, remains uncertain. We speculate that the most likely scenario is two rounds of closely spaced auto-tetraploidization events, based on the following observations. For most sets of tetra-paralogs, some pairs within the set extend over a longer region than others, indicating two distinct duplication events. If, alternatively, there had been a single octoploidy, then we would have to hypothesize multiple occasions in which two of the four descendant genomic segments lost the same sets of genes independently, which seems unlikely. The phylogenetic trees for the gene families are not consistently nested, as would be expected in the case of allo-tetraploidy or two widely spaced auto-tetraploidy events. Finally, tree topologies of genes within paralogy blocks are not always congruent, indicating that the process of gene loss and rediploidization spanned the duplication events [17].

It remains unclear to what extent such large-scale genomic events have driven macroevolutionary change versus the regular accumulation of small mutations, as is the central tenet of the classical model of evolution. We imagine that rapid and extensive evolutionary change could possibly be an emergent property of having all genes duplicated at the same time, allowing this expanded gene repertoire to evolve together, and so reach a greater level of interaction and complexity than could evolve from cumulative single gene duplications. WGDs have occurred in many lineages, including frogs [45,46], fish [41,42,47], yeast [27–30], Arabidopsis [27–30], and corn and several other crop species [48], all of which are being studied by modern genomics techniques. We view the broad and pervasive distribution of these tetra-paralogons in the human genome, despite the remarkably small number of genes remaining in duplicate, as robust evidence that 2R occurred at the base of Vertebrata, and anticipate that future studies will soon illuminate the roles this has played in the evolutionary success of the vertebrate lineage.

Materials and Methods

Obtaining chordate sequences

Sequences and gene annotations of the tunicate Ciona intestinalis and the pufferfish Takifugu rubripes were obtained from the Department of Energy Joint Genome Institute Web site at http://www.jgi.doe.gov. Sequences for Homo sapiens (version 19.34b.2) and Mus musculus (version 19.32.2) were obtained from the Ensembl project website at http://www.ensembl.org. For genes with multiple transcripts, only the longest protein sequence was taken, resulting in 15,852 Ciona, 37,241 fugu, 22,444 mouse, and 22,980 human genes. Table 1 shows an overview of the methods along with the numbers of genes and clusters included after each step.
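The selection of one protein per gene can be sketched as follows. This is an illustrative reconstruction, assuming the annotations have already been parsed into `(gene_id, protein_sequence)` pairs; the function name and input format are hypothetical.

```python
def longest_protein_per_gene(transcripts):
    """Keep only the longest protein sequence for each gene.

    transcripts: iterable of (gene_id, protein_sequence) pairs, one per
    annotated transcript (hypothetical input format).
    Ties are resolved in favor of the first sequence seen.
    """
    longest = {}
    for gene_id, protein in transcripts:
        if gene_id not in longest or len(protein) > len(longest[gene_id]):
            longest[gene_id] = protein
    return longest

toy_transcripts = [
    ("geneA", "MKT"),
    ("geneA", "MKTLLV"),
    ("geneB", "MSS"),
]
selected = longest_protein_per_gene(toy_transcripts)
print(selected)
```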


The objective of the gene clustering is to reconstruct groups of genes such that each includes all (and only) the descendants of a single gene in the ancestral chordate. The underlying assumption is that all of the vertebrate genes in such a cluster will have a higher degree of similarity to each other than they will to their ortholog in Ciona, because they have arisen after the Ciona–vertebrate divergence by either gene duplication or lineage splitting. We conceptually translated all genes into protein sequences; then, for each Ciona protein sequence, the best match to any vertebrate protein was found using BLASTP [49]. Likewise, for each vertebrate protein, the best Ciona match was found. This list of best Ciona–vertebrate hits was then ordered by raw score. A graph was constructed such that each protein sequence appears at a node and the raw BLASTP scores between each pair form the weight of each edge. These sequences were then grouped by using the pairs of best hits as seeds for a single linkage clustering of the graph with the minimum edge score of the seed. This recruits to each cluster any sequence with greater similarity to the individual seed sequences than they have to each other, ensuring that genes with similarity due to a duplication before the split of Ciona and Vertebrata are properly apportioned into separate clusters. Any cluster that attempts to use a protein that has already been assigned is eliminated to reduce ambiguity, and any cluster with greater than 100 members in a single species is eliminated. Figure 2 illustrates this clustering process.
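The seeded single-linkage step described above might be sketched roughly as follows. This is not the authors' code. It assumes that pairwise raw scores are supplied as a dictionary keyed by unordered protein pairs, that seeds are processed from the highest score downward (the text states only that the list was ordered by raw score), and that recruitment requires a score strictly greater than the seed score. The filter on clusters with more than 100 members in one species is omitted for brevity, and all names are hypothetical.

```python
from collections import defaultdict, deque

def cluster_from_seeds(scores, seeds):
    """Seeded single-linkage clustering on a weighted similarity graph.

    scores: dict mapping frozenset({a, b}) -> raw similarity score.
    seeds:  list of (ciona_protein, vertebrate_protein) best-hit pairs,
            each of which must have an entry in scores.
    """
    neighbors = defaultdict(dict)
    for pair, score in scores.items():
        a, b = tuple(pair)
        neighbors[a][b] = score
        neighbors[b][a] = score

    # Assumption: the strongest seed pairs are processed first.
    ordered = sorted(seeds, key=lambda s: scores[frozenset(s)], reverse=True)
    assigned = set()
    clusters = []
    for seed in ordered:
        threshold = scores[frozenset(seed)]
        members = set(seed)
        queue = deque(seed)
        while queue:  # single linkage: recruit through any current member
            node = queue.popleft()
            for other, score in neighbors[node].items():
                if score > threshold and other not in members:
                    members.add(other)
                    queue.append(other)
        if members & assigned:
            continue  # reuses an already assigned protein: drop the cluster
        assigned |= members
        clusters.append(members)
    return clusters

toy_scores = {
    frozenset({"C1", "H1"}): 100, frozenset({"H1", "M1"}): 300,
    frozenset({"H1", "F1"}): 250, frozenset({"M1", "F1"}): 260,
    frozenset({"C1", "M1"}): 90,  frozenset({"C2", "H2"}): 80,
    frozenset({"H2", "M2"}): 200, frozenset({"H1", "H2"}): 50,
}
toy_clusters = cluster_from_seeds(toy_scores, [("C1", "H1"), ("C2", "H2")])
print(toy_clusters)
```

In the toy data, H1 and H2 resemble each other less (score 50) than either resembles its Ciona best hit, so they are placed in separate clusters, which is the behavior the text describes for duplications predating the Ciona–vertebrate split.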

Phylogenetic analysis

A multiple sequence alignment for each cluster was created using ClustalW 1.81 [50]. This alignment was then trimmed by eliminating all positions with gap characters. If the remaining length of the multiple sequence alignment was less than 100 amino acids, the entire cluster was eliminated. Phylogenetic trees were constructed by using the quartet puzzling maximum likelihood method as implemented in TREE-PUZZLE 5.1 [51] using the JTT model of amino acid substitution and a gamma distribution of rates over eight rate categories with 10,000 puzzling steps used to assess reliability. Any tree whose nodes were not strictly bifurcating was eliminated. Even with strict requirements for membership in the clusters, for reliable sequence alignment, and for confidence of evolutionary analysis, we generated 6,641 gene family clusters that include 39,136 (39.7%) of 98,517 total chordate genes (see Table 1), of which 3,096 had duplicated vertebrate genes and 1,621 produced trees that are strictly bifurcating (i.e., having no polytomies).
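The gap-trimming and length filter can be illustrated with a minimal sketch. It assumes the alignment is held as a dictionary of equal-length strings with `-` as the gap character; the function name and the in-memory format are assumptions, not part of ClustalW or TREE-PUZZLE.

```python
def trim_gapped_columns(alignment, min_length=100):
    """Remove every alignment column that contains a gap character.

    alignment: dict mapping sequence name -> aligned sequence, all of equal
    length, with '-' marking gaps (hypothetical in-memory format).
    Returns the trimmed alignment, or None if fewer than min_length
    columns remain (the cluster would then be discarded).
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [column for column in columns if "-" not in column]
    if len(kept) < min_length:
        return None
    return {
        name: "".join(column[i] for column in kept)
        for i, name in enumerate(names)
    }

toy_alignment = {"human": "MK-TL", "mouse": "MKATL", "fugu": "M-ATL"}
# A small min_length is used here only so that the toy example survives.
trimmed = trim_gapped_columns(toy_alignment, min_length=3)
print(trimmed)
```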

Identification of node types

Each node of each tree was classified in comparison to the known evolutionary relationships of the animals. For example, if the gene cluster tree contains exactly four members, and one from each animal, then the parsimonious inference is that no gene duplication occurred. In the case of a similar cluster, but where one member is missing, this is a gene loss in a single group. Gene cluster trees can show duplications specific to individual lineages by having two genes clustered together for the same animal. A gene duplication that occurred before the split of fish from tetrapods is seen as a duplicated tree of the animal relationships after their splitting from Ciona and, similarly, a gene duplication that occurred in tetrapods but before the split of the mouse and human lineages is seen as a duplication of the mouse–human group. Combinations of these, such as a duplication for tetrapods followed by a loss in one of the tetrapod lineages, are also seen and scored (see Figure 3).
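One simple way to express this kind of node classification is a species-overlap rule: a node is treated as a duplication when its two child clades share at least one species, and the duplication is dated by the set of species found below the node. The sketch below uses that heuristic as a stand-in; it is not claimed to be the procedure used in the study. Trees are written as nested tuples with leaves labeled `species|gene`, and all names are hypothetical.

```python
def leaf_species(tree):
    """Set of species found at the leaves of a nested-tuple gene tree."""
    if isinstance(tree, str):
        return {tree.split("|")[0]}
    left, right = tree
    return leaf_species(left) | leaf_species(right)

def classify_node(node):
    """Classify an internal node of a rooted, strictly bifurcating gene tree.

    Assumes the species tree (((human, mouse), fugu), ciona).
    """
    left, right = node
    left_species, right_species = leaf_species(left), leaf_species(right)
    if not left_species & right_species:
        return "speciation"
    union = left_species | right_species
    if len(union) == 1:
        return "lineage-specific duplication (" + next(iter(union)) + ")"
    if "ciona" in union:
        return "duplication before tunicate-vertebrate split"
    if "fugu" in union:
        return "duplication before fish-tetrapod split"
    return "duplication in tetrapod ancestor"

copy1 = (("human|1", "mouse|1"), "fugu|1")
copy2 = (("human|2", "mouse|2"), "fugu|2")
print(classify_node((copy1, copy2)))
print(classify_node(("human|1", "mouse|1")))
print(classify_node((("human|1", "mouse|1"), "human|2")))
```

The last call shows how a loss is absorbed: the two clades jointly span human and mouse, so the duplication is placed in the tetrapod ancestor and the missing second mouse copy is read as a subsequent loss.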

The sorting of orthologous and paralogous relationships for each gene cluster provides an effective tool for improving the inferences of gene function by allowing annotations from well studied genomes to be transferred to the orthologous genes of other species. Inferring function from orthology is expected to be more accurate than using sequence similarity alone, since the latter tends to incorrectly associate slowly evolving paralogs. We provide a web based resource for this sorting at http://phigs.org/.

Detecting concordantly duplicated genomic regions

An overview of the genomic distribution and patterns of paralogous gene arrangement was created by plotting the chromosomal location for all genes having a duplication before the fish–tetrapod divergence for each chromosome. We performed a sliding window analysis, looking upstream and downstream of each gene in turn, for 50 genes on each side, and asking whether paralogs generated prior to the fish–tetrapod split occur within 100 genes of each other at another genomic location. This is the most conservative approach for detecting this signal. The results are shown in Figures 5 and S2.
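A rough sketch of the sliding-window test for a single query gene is given below. It assumes that gene order along the query chromosome is a list of gene identifiers and that the early-duplicated paralogs of each gene are given as `(chromosome, gene_index)` sites; these structures and names are hypothetical. The sketch does not treat hits on the query chromosome itself in any special way, which a full analysis would need to consider.

```python
def window_hits(order, i, paralog_sites, flank=50, max_gap=100):
    """Chromosomes matched by the window around the query gene order[i].

    order: gene identifiers in chromosomal order for the query chromosome.
    paralog_sites: dict gene -> list of (chromosome, gene_index) positions
        of its paralogs from duplications before the fish-tetrapod split.
    A chromosome is a hit when it carries a paralog of some upstream gene
    and a paralog of some downstream gene within max_gap genes of each other.
    """
    upstream = order[max(0, i - flank):i]
    downstream = order[i + 1:i + 1 + flank]

    def sites(genes):
        return [site for gene in genes for site in paralog_sites.get(gene, [])]

    hits = set()
    for chrom_up, pos_up in sites(upstream):
        for chrom_down, pos_down in sites(downstream):
            if chrom_up == chrom_down and abs(pos_up - pos_down) <= max_gap:
                hits.add(chrom_up)
    return hits

toy_order = ["g0", "g1", "g2", "g3", "g4"]
toy_sites = {
    "g0": [("chr12", 300)],
    "g1": [("chr7", 10)],
    "g3": [("chr7", 40), ("chr12", 5)],
}
# flank=2 only because the toy chromosome is five genes long.
toy_hits = window_hits(toy_order, 2, toy_sites, flank=2)
print(toy_hits)
```

Under these assumptions, the n-fold redundancy at a query position would be the number of matched segments plus one for the query segment itself.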

Detecting paralogons and tetra-paralogons

A paralogon was defined as two chromosomal locations in the same genome that have the same set of gene pairs, allowing for a maximum of 100 unduplicated genes between. This significantly expands the set of regions that can be detected by the sliding window analysis. These were detected for the entire human genome considering separately only those paralogs generated by duplications prior to the split between fish and tetrapods versus those arising from duplications after. Each paralogon was tested for its membership in additional paralogous-region pairs. When a segment pairs with three different paralogons, we considered all six possible pairings of the four regions. If, and only if, all six possible combinations are confirmed as paralogons, the group was defined as a tetra-paralogon (see Figure 1). A minimum reconstruction of the signal of 2R remaining in the human genome is found in the extent of complete 4-fold overlap and the maximum by extending these regions to include the complete extent of all contributing paralogons. Genome and chromosome coverage were calculated by summing the number of genes that are encompassed by this more expansive reconstruction and dividing by the total number of genes.
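The requirement that all six pairings of four regions be confirmed can be sketched as below. The sketch assumes that genomic segments have already been reduced to discrete labels and that the detected paralogons are available as a set of unordered label pairs; real segments overlap only partially, so this is a simplification, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def is_tetra_paralogon(segments, paralogons):
    """True if all six pairings of four segments are known paralogons.

    paralogons: set of frozenset({segment_a, segment_b}) pairs.
    """
    return all(frozenset(pair) in paralogons
               for pair in combinations(segments, 2))

def find_tetra_paralogons(paralogons):
    """Return every group of four segments whose six pairings all exist."""
    partners = defaultdict(set)
    for pair in paralogons:
        a, b = tuple(pair)
        partners[a].add(b)
        partners[b].add(a)
    found = set()
    for segment, others in partners.items():
        # A segment must pair with three others to anchor a candidate group.
        for trio in combinations(sorted(others), 3):
            group = (segment,) + trio
            if is_tetra_paralogon(group, paralogons):
                found.add(frozenset(group))
    return found

toy_paralogons = {
    frozenset(pair) for pair in [
        ("A", "B"), ("A", "C"), ("A", "D"),
        ("B", "C"), ("B", "D"), ("C", "D"),
        ("A", "E"),  # E pairs only with A, so it joins no tetra-paralogon
    ]
}
tetras = find_tetra_paralogons(toy_paralogons)
print(tetras)
```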

Inclusion of Drosophila genes and phylogenetic analysis

For each cluster that contained only a single gene from each of the four chordate species, the highest scoring BLASTP match to the Drosophila melanogaster gene set [15] was added. We then performed a multiple sequence alignment for each of these 766 sets of genes, followed by phylogenetic analysis of this concatenated data set as above. The resulting tree was rooted at the midpoint with branch lengths proportional to the amount of amino acid substitution as estimated by TREE-PUZZLE 5.1 [51].

Supporting Information

Figure S1

Histogram of Gene Cluster Membership:

The numbers of genes per cluster are shown for each of the three vertebrates individually as well as for all three grouped together. There is no peak at four for any species, or at 12 as the total for all (or 16 for all, considering that there may have been a further genome duplication for fish), indicating that gene losses have been common and have eradicated this signal of genome duplications.

(30 KB JPG).

Figure S2

Plot of the Genomic Positions of Paralogous Pairs of Human Genes that Arose from Duplications Pre-dating the Fish–Tetrapod Split:

On the x-axis is each chromosome arranged from p to q telomeres. On the y-axis is each of the 22 human autosomes plus the X and Y chromosomes. For each query gene on the x-axis, a “hit” is scored if the subject chromosome contains a paralog generated by a gene duplication prior to the fish–tetrapod split. The lower portion of each panel plots the n-fold redundancy along the query chromosome as defined by pairs of paralogs detected in a sliding window analysis. See the Materials and Methods section for details, but briefly, for every human query gene, a window was considered of 50 genes to the left and 50 genes to the right, with a “hit” obtained for the subject chromosome if it includes the early-duplicated paralogs of genes on each side of the query. The expected value of four for the 2R hypothesis is highlighted in a darker shade of blue.

(5.2 MB DOC).
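The sliding window rule in this caption can be sketched in Python. This is an interpretation of the caption, not the authors' code: the input layout (an ordered gene list plus a mapping from each gene to the chromosomes carrying its early-duplicated paralogs) and the function names are assumptions, and the sketch does not settle whether the query chromosome itself should count toward the n-fold redundancy.

```python
from typing import Dict, List, Set


def window_hits(
    genes: List[str],
    paralog_chroms: Dict[str, Set[str]],
    i: int,
    flank: int = 50,
) -> Set[str]:
    """Subject chromosomes holding paralogs of genes on BOTH sides of genes[i].

    genes          : gene ids in order along one query chromosome
    paralog_chroms : gene id -> chromosomes with its early-duplicated paralogs
    flank          : number of genes considered on each side of the query
    """
    left = genes[max(0, i - flank):i]
    right = genes[i + 1:i + 1 + flank]
    left_chroms: Set[str] = set()
    for g in left:
        left_chroms |= paralog_chroms.get(g, set())
    right_chroms: Set[str] = set()
    for g in right:
        right_chroms |= paralog_chroms.get(g, set())
    return left_chroms & right_chroms


def hits_along_chromosome(
    genes: List[str], paralog_chroms: Dict[str, Set[str]], flank: int = 50
) -> List[Set[str]]:
    """Apply window_hits to every query gene on the chromosome."""
    return [window_hits(genes, paralog_chroms, i, flank) for i in range(len(genes))]


# Toy chromosome with five genes and a flank of two genes per side.
toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_paralogs = {"g1": {"chr2"}, "g2": {"chr7"}, "g4": {"chr2"}, "g5": {"chr12"}}
toy_hits = hits_along_chromosome(toy_genes, toy_paralogs, flank=2)
```

The size of each returned set gives one possible per-gene redundancy count that could be plotted along the query chromosome.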

Figure S3

Illustration of Whole Genome 4-Fold Paralogy:

The lines connect paralogous genes in the human genome that originated in duplications that occurred after the tunicate–vertebrate split but before the fish–tetrapod split. Numerals around the outside of the figure refer to human chromosome numbers.

(103 KB JPG).

Table S1

Paralogons in the Human Genome:

Paralogons in the human genome are defined as having two or more pairs of paralogous genes separated by no more than 100 intervening genes (see Materials and Methods). A and B in the header refer to the first and second chromosome in considered pairs. The columns labeled “Start” and “End” define the extent of each paralogon by numbered genes. The number of paralogous gene pairs defining the paralogon and the total number of genes encompassed by the region are indicated.

(344 KB DOC).
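The paralogon definition above can be sketched as a simple clustering of paralogous pairs. The approach shown (single-linkage grouping of pairs whose members lie within 100 intervening genes of each other on both chromosomes) is an assumption about how the definition might be applied, not a description of the authors' implementation; the function name and the output field names are hypothetical.

```python
from typing import Dict, List, Tuple

# A pair is (gene index on chromosome A, gene index on chromosome B).
Pair = Tuple[int, int]


def find_paralogons(
    pairs: List[Pair], max_gap: int = 100, min_pairs: int = 2
) -> List[Dict[str, int]]:
    """Group paralogous pairs that are close together on both chromosomes."""
    n = len(pairs)
    parent = list(range(n))

    def find(x: int) -> int:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            (a1, b1), (a2, b2) = pairs[i], pairs[j]
            # Intervening genes between positions p and q: |p - q| - 1.
            if abs(a1 - a2) - 1 <= max_gap and abs(b1 - b2) - 1 <= max_gap:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Pair]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(pairs[i])

    result = []
    for members in groups.values():
        if len(members) >= min_pairs:
            a = [p[0] for p in members]
            b = [p[1] for p in members]
            result.append({
                "start_a": min(a), "end_a": max(a),
                "start_b": min(b), "end_b": max(b),
                "n_pairs": len(members),
            })
    return sorted(result, key=lambda r: r["start_a"])


toy_pairs = [(10, 200), (60, 250), (500, 900), (150, 300)]
toy_paralogons = find_paralogons(toy_pairs)
```

In the toy data, three pairs chain into one region while the isolated pair at (500, 900) is discarded because a paralogon needs at least two pairs.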


We thank R. Baker, M. P. Francino, M. Medina, J. Schwarz, and Y. Valles for helpful comments on the manuscript, and S. Rash and W. Huang for technical assistance. This work was performed under the auspices of the U.S. Department of Energy, Office of Biological and Environmental Research, by the University of California, Lawrence Berkeley National Laboratory, under contract No. DE-AC03-76SF00098.

Competing interests. The authors have declared that no competing interests exist.


2R: two rounds of whole genome duplication
MY: million years
WGD: whole genome duplication


Author contributions. PD and JB conceived and designed the experiments. PD performed the experiments. PD and JB wrote the paper.

Citation: Dehal P, Boore JL (2005) Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol 3(10): e314.


  1. Ohno S. Evolution by gene duplication. Berlin: Springer-Verlag; 1970. 160 pp.
  2. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
  3. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. The sequence of the human genome. Science. 2001;291:1304–1351.
  4. Lundin LG. Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse. Genomics. 1993;16:1–19.
  5. Meyer A, Schartl M. Gene and genome duplications in vertebrates: the one-to-four (-to-eight in fish) rule and the evolution of novel gene functions. Curr Opin Cell Biol. 1999;11:699–704.
  6. Spring J. Vertebrate evolution by interspecific hybridization—are we polyploid? FEBS Lett. 1997;400:2–8.
  7. Wang Y, Gu X. Evolutionary patterns of gene families generated in the early stage of vertebrates. J Mol Evol. 2000;51:88–96.
  8. Larhammar D, Lundin LG, Hallbook F. The human Hox-bearing chromosome regions did arise by block or chromosome (or even genome) duplications. Genome Res. 2002;12:1910–1920.
  9. Guigo R, Muchnik I, Smith TF. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol. 1996;6:189–213.
  10. McLysaght A, Hokamp K, Wolfe KH. Extensive genomic duplication during early chordate evolution. Nat Genet. 2002;31:200–204.
  11. Gu X, Wang Y, Gu J. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat Genet. 2002;31:205–209.
  12. Friedman R, Hughes AL. The temporal distribution of gene duplication events in a set of highly conserved human gene families. Mol Biol Evol. 2003;20:154–161.
  13. Friedman R, Hughes AL. Pattern and timing of gene duplication in animal genomes. Genome Res. 2001;11:1842–1847.
  14. Popovici C, Leveugle M, Birnbaum D, Coulier F. Homeobox gene clusters and the human paralogy map. FEBS Lett. 2001;491:237–242.
  15. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195.
  16. Hughes AL. Phylogenies of developmentally important proteins do not support the hypothesis of two rounds of genome duplication early in vertebrate history. J Mol Evol. 1999;48:565–576.
  17. Furlong RF, Holland PW. Were vertebrates octoploid? Philos Trans R Soc Lond B Biol Sci. 2002;357:531–544.
  18. Escriva H, Manzon L, Youson J, Laudet V. Analysis of lamprey and hagfish genes reveals a complex history of gene duplications during early vertebrate evolution. Mol Biol Evol. 2002;19:1440–1450.
  19. Holland PW, Garcia-Fernandez J, Williams NA, Sidow A. Gene duplications and the origins of vertebrate development. Dev Suppl. 1994:125–133.
  20. Holland PW. More genes in vertebrates? J Struct Funct Genomics. 2003;3:75–84.
  21. Abi-Rached L, Gilles A, Shiina T, Pontarotti P, Inoko H. Evidence of en bloc duplication in vertebrate genomes. Nat Genet. 2002;31:100–105.
  22. Panopoulou G, Hennig S, Groth D, Krause A, Poustka AJ, et al. New evidence for genome-wide duplications at the origin of vertebrates using an amphioxus gene set and completed animal genomes. Genome Res. 2003;13:1056–1066.
  23. Leveugle M, Prat K, Popovici C, Birnbaum D, Coulier F. Phylogenetic analysis of Ciona intestinalis gene superfamilies supports the hypothesis of successive gene expansions. J Mol Evol. 2004;58:168–181.
  24. Wolfe KH. Yesterday's polyploids and the mystery of diploidization. Nat Rev Genet. 2001;2:333–341.
  25. Lynch M, O'Hely M, Walsh B, Force A. The probability of preservation of a newly arisen gene duplicate. Genetics. 2001;159:1789–1804.
  26. Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290:1151–1155.
  27. Wong S, Butler G, Wolfe KH. Gene order evolution and paleopolyploidy in hemiascomycete yeasts. Proc Natl Acad Sci U S A. 2002;99:9272–9277.
  28. Vision TJ, Brown DG, Tanksley SD. The origins of genomic duplications in Arabidopsis. Science. 2000;290:2114–2117.
  29. Kellis M, Birren BW, Lander ES. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004;428:617–624.
  30. Dietrich FS, Voegeli S, Brachat S, Lerch A, Gates K, et al. The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science. 2004;304:304–307.
  31. Katsanis N, Fitzgibbon J, Fisher EM. Paralogy mapping: identification of a region in the human MHC triplicated onto human chromosomes 1 and 9 allows the prediction and isolation of novel PBX and NOTCH loci. Genomics. 1996;35:101–108.
  32. Pebusque MJ, Coulier F, Birnbaum D, Pontarotti P. Ancient large-scale genome duplications: phylogenetic and linkage analyses shed light on chordate genome evolution. Mol Biol Evol. 1998;15:1145–1159.
  33. Gibson TJ, Spring J. Evidence in favour of ancient octaploidy in the vertebrate genome. Biochem Soc Trans. 2000;28:259–264.
  34. Vienne A, Rasmussen J, Abi-Rached L, Pontarotti P, Gilles A. Systematic phylogenomic evidence of en bloc duplication of the ancestral 8p11.21–8p21.3-like region. Mol Biol Evol. 2003;20:1290–1298.
  35. Luke GN, Castro LF, McLay K, Bird C, Coulson A, et al. Dispersal of NK homeobox gene clusters in amphioxus and humans. Proc Natl Acad Sci U S A. 2003;100:5292–5295.
  36. Castro LF, Furlong RF, Holland PW. An antecedent of the MHC-linked genomic region in amphioxus. Immunogenetics. 2004;55:782–784.
  37. Castro LF, Holland PW. Chromosomal mapping of ANTP class homeobox genes in amphioxus: Piecing together ancestral genomes. Evol Dev. 2003;5:459–465.
  38. Dehal P, Satou Y, Campbell RK, Chapman J, Degnan B, et al. The draft genome of Ciona intestinalis: Insights into chordate and vertebrate origins. Science. 2002;298:2157–2167.
  39. Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 2002;297:1301–1310.
  40. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562.
  41. Van de Peer Y, Taylor JS, Meyer A. Are all fishes ancient polyploids? J Struct Funct Genomics. 2003;3:65–73.
  42. Jaillon O, Aury JM, Brunet F, Petit JL, Stange-Thomann N, et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature. 2004;431:946–957.
  43. Horton AC, Mahadevan NR, Ruvinsky I, Gibson-Brown JJ. Phylogenetic analyses alone are insufficient to determine whether genome duplication(s) occurred during early vertebrate evolution. J Exp Zool B Mol Dev Evol. 2003;299:41–53.
  44. Seoighe C, Johnston CR, Shields DC. Significantly different patterns of amino acid replacement after gene duplication as compared to after speciation. Mol Biol Evol. 2003;20:484–490.
  45. Tymowska J, Fischberg M, Tinsley RC. The karyotype of the tetraploid species Xenopus vestitus Laurent (Anura: Pipidae). Cytogenet Cell Genet. 1977;19:344–354.
  46. Jeffreys AJ, Wilson V, Wood D, Simons JP, Kay RM, et al. Linkage of adult alpha- and beta-globin genes in X. laevis and gene duplication by tetraploidization. Cell. 1980;21:555–564.
  47. Taylor JS, Van de Peer Y, Braasch I, Meyer A. Comparative genomics provides evidence for an ancient genome duplication event in fish. Philos Trans R Soc Lond B Biol Sci. 2001;356:1661–1679.
  48. Blanc G, Wolfe KH. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell. 2004;16:1667–1678.
  49. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.
  50. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680.
  51. Schmidt HA, Strimmer K, Vingron M, von Haeseler A. TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics. 2002;18:502–504.

Articles from PLoS Biology are provided here courtesy of Public Library of Science