Philos Trans R Soc Lond B Biol Sci. 2008 Apr 27; 363(1496): 1445–1451.
Published online 2008 Jan 11. doi:  10.1098/rstb.2007.2234
PMCID: PMC2614225

Beyond linear sequence comparisons: the use of genome-level characters for phylogenetic reconstruction


The first whole genomes to be compared for phylogenetic inference were those of mitochondria, which provided the first sets of genome-level characters for phylogenetic reconstruction. Most powerful among these characters has been the comparison of the relative arrangements of genes, which has convincingly resolved numerous branch points, including those that had remained recalcitrant even to very large molecular sequence comparisons. Now the world faces a tsunami of complete nuclear genome sequences. In addition to the tremendous amount of DNA sequence that is becoming available for comparison, there is also a potential for many more genome-level characters to be developed, including the relative positions of introns, the domain structures of proteins, gene family membership, the presence of particular biochemical pathways, aspects of DNA replication or transcription, and many others. These characters can be especially convincing owing to their low likelihood of reverting to a primitive condition or occurring independently in separate lineages, thereby reducing the occurrence of homoplasy. The comparisons of organelle genomes pioneered the way for using such features for phylogenetic reconstructions, and, as ever more genomic sequence becomes available, further use of genome-level characters will almost certainly play a major role in outlining the relationships among major animal groups.

Keywords: genome, evolution, phylogeny, phylogenetically inferred groups, genome-level characters, gene family

1. Why do we need anything other than molecular sequence comparisons?

Over the past few decades, the comparison of nucleotide and amino acid sequences has revolutionized our understanding of the evolutionary relationships for many groups of organisms. The broader field of systematics has been reinvigorated and a generation of evolutionary biologists has come to accept that molecular sequence comparisons are an essential component for inferring phylogeny of any group. These studies have led to extensive revision of animal systematics and overturning of previous reliance on the features of the coelom and segmentation (Adoutte et al. 1999).

In the 1980s, when comparing molecular sequences for phylogenetic inference was first becoming common, some asserted with great confidence that all evolutionary relationships would soon be convincingly resolved solely with this type of data, leading to much consternation. However, some of the relationships that were equivocal in early molecular studies have remained highly recalcitrant even with much more DNA sequence data in hand. There are several potential explanations, including: (i) multiple nucleotide or amino acid substitutions may have occurred at a single site, obscuring any accumulated signal; (ii) convergent or parallel substitutions may have occurred among different lineages due to having only 4 (for nucleotides) or 20 (for amino acids) possible character states, exacerbated by convergent biases in base composition (Naylor & Brown 1998), which may even cause ever increasing confidence measures for incorrect associations with ever larger datasets (Phillips et al. 2004); (iii) the analysis may show artefactual association of the more rapidly changing lineages (Felsenstein 1978), including the attraction of long branches to the base of the in-group in association with the out-group (which is almost always a long branch; Philippe & Laurent 1998); (iv) in some cases, non-orthologous gene copies may be inadvertently compared among various lineages due to ancestral gene duplications followed by differential losses, or due to incomplete sampling; (v) differing views of scientists on alignments, exclusion sets and weighting schemes frequently cannot be arbitrated based on objective criteria and can lead to radically different phylogenetic reconstructions and (vi) the most difficult problems are when the time of shared ancestry is short relative to the subsequent time of divergence, where there has been little opportunity to accumulate signal and ample time for it to have been erased.

Molecular sequence comparison is now a mature field that has influenced the culture of systematics. Many have come to expect that the future of systematics will be dominated by creating ever more sophisticated methods for teasing a weak signal from noisy data. This causes concern that differing preferences for various methods will ensure that no consensus on many evolutionary relationships will ever be reached.

However, an alternative is possible: there may be other, less explored types of characters that could be powerful for resolving these contentious relationships. There is no doubt that comparisons of some characters have identified certain robust synapomorphies (shared and derived character states) that have supported long-standing, little contested evolutionary relationships, such as the monophyly of mammals, tetrapods and echinoderms. These synapomorphies are subjectively judged to be characters so unlikely to revert to an earlier condition or to occur multiple times in parallel that they could only have arisen once in the common ancestor of the group. Can new sets of characters be found that would meet these criteria to provide confident resolution of some problematic evolutionary relationships? Although there is a broad range of character types to explore, we focus here specifically on the comparison of features of genomes.

2. Comparisons of mitochondrial genomes have laid the foundation

The sequences from mitochondrial genes and genomes have been used extensively for phylogenetic inference, with complete mtDNA sequences being publicly available for more than 1000 animal species. (For a summary of the characteristics of animal mtDNAs, see Boore (1999).) It has long been argued (e.g. Boore & Brown 1998) that the relative arrangement of the (normally) 37 genes in animal mitochondrial genomes is an especially powerful type of character for phylogenetic inference, and it constitutes the first set of genome-level features to be used extensively for animal phylogeny. Briefly summarized, these genes are present in nearly all animal groups, are unambiguously homologous and can potentially be rearranged into an enormous number of states such that convergent rearrangements are very unlikely (and demonstrated to be uncommon). In the cases where it has been studied, all genes on each strand are transcribed together (Clayton 1992), so selection on gene arrangements is expected to be minimal. A summary of the evolutionary relationships convincingly demonstrated by this type of data (and in many cases left unresolved by all other studies) is found in Boore (2006), but here are a few of the more significant conclusions of deep-branch phylogenetic relationships: (i) the superphylum Eutrochozoa includes cestode platyhelminths (von Nickisch-Rosenegk et al. 2001) and the phylum Phoronida (Helfenbein & Boore 2004); (ii) Sipuncula is closely related to Annelida rather than to Mollusca (Boore & Staton 2002); (iii) Annelida is more closely related to Mollusca than to Arthropoda (Boore & Brown 2000); (iv) Arthropoda is monophyletic and, within this phylum, Crustacea is united with Hexapoda to the exclusion of Myriapoda and Onychophora (Boore et al. 1995, 1998) and (v) Pentastomida is not a phylum, but rather a type of crustacean, and joins with Cephalocarida and Maxillopoda to the exclusion of other major crustacean groups (Lavrov et al. 2004).
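The statement that these genes can be rearranged into an enormous number of states can be made concrete with a rough count. The sketch below is an illustration added here, not a calculation taken from the cited papers. It assumes a single circular molecule carrying n distinct genes, each of which may lie on either strand, and it treats arrangements that differ only in the choice of starting gene, or in which strand is read, as identical. Under those assumptions the count is (n − 1)! × 2^(n − 1); the function name is hypothetical.

```python
from math import factorial


def circular_signed_arrangements(n_genes: int) -> int:
    """Count distinct gene orders on one circular, double-stranded genome.

    Assumptions (illustrative, not from the source): all genes are distinct,
    each gene can sit on either strand, and two arrangements are the same if
    they differ only by rotation or by reading the opposite strand.
    """
    if n_genes < 1:
        raise ValueError("need at least one gene")
    # n! orders x 2^n strand choices, divided by n rotations and by 2 for
    # the reverse-complement reading of the same molecule.
    return factorial(n_genes - 1) * 2 ** (n_genes - 1)


if __name__ == "__main__":
    count = circular_signed_arrangements(37)
    print(f"37 genes: about {count:.2e} possible arrangements")
```

For the typical 37 genes this simple count is on the order of 10^52, which is the sense in which an identical arrangement arising twice by chance would be very unlikely. The count ignores any mechanistic bias in how rearrangements actually occur.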

3. Nuclear genomes, a treasure trove of phylogenetic characters

By a great margin, more DNA sequence is being generated than ever before. The facilities built and the techniques developed for sequencing the human genome are now focusing on many other organisms. The nine largest genome sequencing centres (table 1) collectively can now produce well over 170 billion nucleotides of DNA sequence per year, which would be approximately 57-fold coverage of the human genome. Imminently, there will be complete genomes of at least draft quality for many dozens of animals representing a phylogenetically diverse sample and including several equivocally placed lineages (figure 1; table 2).
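The coverage figure quoted above follows from a one-line division. The short sketch below reproduces it, assuming a haploid human genome of roughly 3 billion nucleotides (a round number supplied here, not stated in the text); the function and variable names are hypothetical.

```python
def fold_coverage(bases_sequenced: float, genome_size: float) -> float:
    """Return how many times over a genome of the given size is covered."""
    return bases_sequenced / genome_size


if __name__ == "__main__":
    annual_output = 170e9   # nucleotides per year, figure quoted in the text
    human_genome = 3.0e9    # assumed approximate haploid genome size
    print(round(fold_coverage(annual_output, human_genome)))  # prints 57
```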

Figure 1
This reconstruction of the major branches of animal evolution is used to plot the numbers of taxa with complete genome sequences done and underway. The taxonomic ranks shown are arbitrary, split for illustration, but not meant to be consistent among the ...
Table 1
URLs for the largest public DNA sequencing centres
Table 2
Complete nuclear genome sequencing projects done and underway as summarized in figure 1. (Asterisks indicate genomes currently funded to only low coverage.)

In these genomic data are many higher-order features, beyond the linear sequences, that constitute genome-level characters that are potentially useful for phylogenetic reconstruction, including: (i) gene content, including components of multiunit complexes such as the ribosome, spliceosome, DNA replication machinery, or oxidative phosphorylation enzymes and the presence versus the absence of particular biochemical pathways (e.g. de Rosa et al. 1999; Fitz-Gibbon & House 1999; Snel et al. 1999, 2005; House & Fitz-Gibbon 2002; Huson & Steel 2004); (ii) the relative arrangements of genes (Boore & Brown 1998); (iii) movements of genes among intracellular compartments (i.e. plastid, mitochondrion, nucleus; e.g. Nugent & Palmer 1991); (iv) insertions of segments of DNA, including transposons and numts (Fukuda et al. 1985; Richly & Leister 2004); (v) variation in intron positions (e.g. Qiu et al. 1998); (vi) secondary structures of rRNAs or tRNAs (e.g. Murrell et al. 2003); (vii) details of genome-level processes, such as the rearrangements that generate antibody diversity (Frieder et al. 2006) and (viii) deviations from the ‘universal’ genetic code (Telford et al. 2000; Santos et al. 2004). Many others are likely to be found.

Of course, the reliability of these features can only be assessed by the study of their consistency with other characters, and several are already suspect. For example, convergent gene losses may be common as organisms independently evolve smaller genomes or no longer experience selection for maintaining a particular biochemical pathway; in contrast, convergent gain of genes seems much less likely. Independent evolution of smaller genomes may also lead to parallel losses of the most expendable structures in the RNA or protein genes. There is a certain time horizon that limits the usefulness of any particular type of character; for example, once retroelements degrade in the sequence beyond the point where the insertion can be reliably inferred to be of single origin, the insertion is no longer useful as a phylogenetic character. Certain changes in the genetic code and in the tRNA secondary structures of mitochondria are known to have occurred convergently (although occasional homoplasy has not disqualified the use of either morphological characters or molecular sequence comparisons). There is also a problem in the case of closely spaced sequential internodes where random partitioning of polymorphisms, including those of genome-level characters, can lead to incorrect inference of phylogeny (e.g. Salem et al. 2003; see Boore (2006) for additional caveats and precautions).

Already there have been important insights gained from comparing such features, including: (i) tarsiers have been shown to be the sister group to the clade of monkeys and apes rather than the prosimians based on the patterns of SINE element integration (Schmitz et al. 2001); (ii) patterns of SINE and LINE insertions have also supported the monophyly of toothed plus baleen whales, the position of hippopotamuses as the sister group to cetaceans, the placement of camels as the most basal cetartiodactyls (Nikaido et al. 1999), and the paraphyly of river dolphins (Nikaido et al. 2001); (iii) animal interphylum relationships have been clarified by the comparisons of the gene membership within Hox clusters (de Rosa et al. 1999) and (iv) a study of the presence of spliceosomal introns supports the monophyly of Actinopterygia and clarifies several relationships within the group, including the basal position of bichirs (Venkatesh et al. 1999). For further discussion, see Murphy et al. (2004), Okada et al. (2004) and Boore (2006).
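The logic behind the insertion-based results above can be sketched with a small presence/absence matrix. The example below is purely illustrative: the taxa, loci and scores are hypothetical and are not data from any of the cited studies. It assumes that absence of an insertion is the ancestral state (as scored in the out-group) and applies a strict criterion, counting a locus as support for a clade only when the insertion is present in every member and absent everywhere else.

```python
# Hypothetical matrix: 1 = insertion present at the locus, 0 = absent.
MATRIX = {
    "outgroup": {"locus1": 0, "locus2": 0, "locus3": 0},
    "taxonA": {"locus1": 1, "locus2": 1, "locus3": 0},
    "taxonB": {"locus1": 1, "locus2": 1, "locus3": 0},
    "taxonC": {"locus1": 0, "locus2": 1, "locus3": 1},
}


def supporting_loci(matrix, clade):
    """List loci whose insertion is carried by exactly the taxa in `clade`.

    Assumes absence is ancestral, so an insertion shared by all clade
    members and by no other taxon is read as a shared derived state.
    """
    clade = set(clade)
    loci = sorted(next(iter(matrix.values())))
    support = []
    for locus in loci:
        carriers = {taxon for taxon, row in matrix.items() if row[locus] == 1}
        if carriers == clade:
            support.append(locus)
    return support


if __name__ == "__main__":
    print(supporting_loci(MATRIX, {"taxonA", "taxonB"}))            # ['locus1']
    print(supporting_loci(MATRIX, {"taxonA", "taxonB", "taxonC"}))  # ['locus2']
```

In this toy matrix locus3 is found in a single taxon and so supports no grouping; real analyses must also allow for missing data, incomplete lineage sorting and the other caveats discussed above.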

4. What are the advantages of using these genome-level characters?

In general, these types of features would be expected to change in a saltatory, non-clocklike manner. This may seem, at first, to be wrong-headed, since great effort has been expended for many studies to identify clocklike characters, to enable accurate molecular clock estimates of time of divergence. But it is this aspect that makes these genome-level characters especially useful for addressing the most difficult branch points, those with a short time of shared history followed by a long period of divergence, as mentioned above. It is for resolving these relationships that clocklike behaviour guarantees failure, since the ratio of signal to noise will closely match the ratio of the two time periods. Rather it is the least clocklike characters that are expected to prevail, where an occasional and abrupt change may have occurred and then remained (figure 2). Admittedly, the concomitant disadvantage is that, typically, many such characters must be examined in order to find those that happened to have changed during the period of shared ancestry and so mark the relationship (see Boore (2006) for further analysis and discussion).
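The signal-to-noise argument can be illustrated with simple arithmetic. The sketch below is a deliberately crude model introduced here for illustration: it assumes that a strictly clocklike character accumulates changes in direct proportion to elapsed time, so that changes during the shared internode count as signal and changes during the later divergence count as noise. The durations and rates are hypothetical.

```python
def expected_changes(rate: float, duration: float) -> float:
    """Expected number of changes for a strictly clocklike character."""
    return rate * duration


def signal_to_noise(rate: float, shared_time: float, divergence_time: float) -> float:
    """Ratio of changes on the shared internode to changes after the split."""
    signal = expected_changes(rate, shared_time)
    noise = expected_changes(rate, divergence_time)
    return signal / noise


if __name__ == "__main__":
    shared, divergence = 5.0, 500.0  # hypothetical durations, arbitrary units
    for rate in (0.001, 0.1, 10.0):
        # The rate cancels: each line reports a ratio of about 0.01.
        print(rate, signal_to_noise(rate, shared, divergence))
```

Under this assumption, choosing a faster or slower clocklike character does not help, because the ratio depends only on the two time periods. A character that changes rarely and abruptly is not bound by this proportionality, which is the point made above.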

Figure 2
Illustration of why clocklike characters (a) may be less informative than non-clocklike characters (b) when the internode between the subsequent lineage splits is short. Each of the four shapes is meant to be a character with states indicated by patterning. ...

5. What about clades without representative genome sequences?

This enormous dataset provides a new class of characters that could lead to definitive resolution of some branches of the tree of life, not only for these taxa but also for others where targeted study for identified characters could be fruitful. As shown in figure 1, whole-genome sampling will include many major lineages, but not all. It seems unlikely that there will soon be available a whole-genome sequence of a gastrotrich or a loriciferan, for example. Fortunately, we can use the genomes in hand to identify sets of genome-level characters that can be diagnostic for the relationships of related groups without genome projects. One could, for example, then determine the gene order using Southern hybridization or probe a large DNA insert library (i.e. in BAC or fosmid vectors) to find a clone to sequence for the region of interest of the genome. Gene rearrangements, losses and duplications can also be identified using comparative genomic hybridization (CGH) chips with tiled large-insert clones, as has been done for a sampling of diverse human populations (Sharp et al. 2005) and more broadly across the great apes (Locke et al. 2003) or using the arrays of oligonucleotides (representational oligonucleotide microarray analysis, ROMA; Sebat et al. 2004).

6. What are the main challenges that are before us?

First, we must increase the representation of the understudied groups of animals for large-scale genomic sequencing. There is no reason to believe that taxa that have been traditionally studied intensively, i.e. those with higher species richness, greater breadth of niche occupation, more important roles in pathogenesis or amenability to laboratory experimentation, will be more informative towards the goals of understanding broad patterns of the evolution of animals and their genomes. Second, we need to have a codification of nomenclature for the genes, which is based on the assessment of orthology (Dehal & Boore 2006). The renaming of genes to indicate orthology is not feasible because it would render large bodies of literature difficult to interpret and because scientists who study the model organisms, and who have largely done the naming, are invested in their parochial nomenclature. Thus, the solution must be a lexicon superimposed on these names already in place. Third, a system must be devised for codifying the genome-level characters themselves for entry into the databases and matrices for broad comparisons. Finally, we need for the community to devise the standards of interpretation and analysis, such as the use of cladistic reasoning rather than associating taxa by similarity alone (Boore 2006). Then, it seems probable that the genome-level characters will provide the best dataset for convincingly reconstructing relationships for some of the most hotly contended nodes in the tree of life and establishing a framework for all organismal relationships.
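The difference between cladistic reasoning and grouping by similarity alone can be shown with a toy character matrix. Everything in the sketch below is hypothetical: three in-group taxa are scored for five binary characters, with the out-group taken to show the ancestral state of each. Taxon C is given many unique changes, as a rapidly changing lineage might have.

```python
from itertools import combinations

OUTGROUP = [0, 0, 0, 0, 0]  # assumed ancestral state of every character
INGROUP = {
    "A": [0, 0, 0, 0, 0],
    "B": [1, 0, 0, 0, 0],
    "C": [1, 1, 1, 1, 1],
}


def overall_similarity(x, y):
    """Count characters with matching states, ancestral or derived."""
    return sum(a == b for a, b in zip(x, y))


def shared_derived(x, y, outgroup):
    """Count characters where both taxa share a state unlike the out-group."""
    return sum(a == b != o for a, b, o in zip(x, y, outgroup))


def best_pair(score):
    """Return the in-group pair with the highest score."""
    pairs = combinations(sorted(INGROUP), 2)
    return max(pairs, key=lambda p: score(INGROUP[p[0]], INGROUP[p[1]]))


if __name__ == "__main__":
    print(best_pair(overall_similarity))                              # ('A', 'B')
    print(best_pair(lambda x, y: shared_derived(x, y, OUTGROUP)))     # ('B', 'C')
```

Similarity alone pairs A with B because they share retained ancestral states, whereas counting only shared derived states pairs B with C on the strength of the single character they changed together.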


One contribution of 17 to a Discussion Meeting Issue ‘Evolution of the animals: a Linnean tercentenary celebration’.


  • Adoutte A, Balavoine G, Lartillot N, de Rosa R. Animal evolution—the end of the intermediate taxon? Trends Genet. 1999;15:104–108. doi:10.1016/S0168-9525(98)01671-0 [PubMed]
  • Boore J.L. Animal mitochondrial genomes. Nucleic Acids Res. 1999;27:1767–1780. doi:10.1093/nar/27.8.1767 [PMC free article] [PubMed]
  • Boore J.L. The use of genome-level characters for phylogenetic reconstruction. Trends Ecol. Evol. 2006;21:439–446. doi:10.1016/j.tree.2006.05.009 [PubMed]
  • Boore J.L, Brown W.M. Big trees from little genomes: mitochondrial gene order as a phylogenetic tool. Curr. Opin. Genet. Dev. 1998;8:668–674. doi:10.1016/S0959-437X(98)80035-X [PubMed]
  • Boore J.L, Brown W.M. Mitochondrial genomes of Galathealinum, Helobdella, and Platynereis: sequence and gene arrangement comparisons indicate that Pogonophora is not a phylum and Annelida and Arthropoda are not sister taxa. Mol. Biol. Evol. 2000;17:87–106. [PubMed]
  • Boore J.L, Staton J. The mitochondrial genome of the sipunculid Phascolopsis gouldii supports its association with Annelida rather than Mollusca. Mol. Biol. Evol. 2002;19:127–137. [PubMed]
  • Boore J.L, Collins T.M, Stanton D, Daehler L.L, Brown W.M. Deducing arthropod phylogeny from mitochondrial DNA rearrangements. Nature. 1995;376:163–165. doi:10.1038/376163a0 [PubMed]
  • Boore J.L, Lavrov D, Brown W.M. Gene translocation links insects and crustaceans. Nature. 1998;392:667–668. doi:10.1038/33577 [PubMed]
  • Clayton D.A. Transcription and replication of animal mitochondrial DNAs. Int. Rev. Cytol. 1992;141:217–232. [PubMed]
  • Dehal P, Boore J.L. A phylogenomic gene cluster resource: the phylogenetically inferred groups (PhIGs) database. BMC Bioinform. 2006;7:201. doi:10.1186/1471-2105-7-201 [PMC free article] [PubMed]
  • de Rosa R, Grenier J.K, Andreeva T, Cook C.E, Adoutte A, Akam M, Carroll S.B, Balavoine G. Hox genes in brachiopods and priapulids and protostome evolution. Nature. 1999;399:772–776. doi:10.1038/21631 [PubMed]
  • Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 1978;27:401–410. doi:10.2307/2412923
  • Fitz-Gibbon S.T, House C.H. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 1999;27:4218–4222. doi:10.1093/nar/27.21.4218 [PMC free article] [PubMed]
  • Frieder D, Larijani M, Tang E, Parsa J.Y, Basit W, Martin A. Antibody diversification: mutational mechanisms and oncogenesis. Immunol. Res. 2006;35:75–88. doi:10.1385/IR:35:1:75 [PubMed]
  • Fukuda M, Fukuda M, Wakasugi S, Tsuzuki T, Nomiyama H, Shimada K, Miyata T. Mitochondrial DNA-like sequences in the human nuclear genome: characterization and implications in the evolution of mitochondrial DNA. J. Mol. Biol. 1985;186:257–266. doi:10.1016/0022-2836(85)90102-0 [PubMed]
  • Helfenbein K.G, Boore J.L. The mitochondrial genome of Phoronis architecta—comparisons demonstrate that phoronids are lophotrochozoan protostomes. Mol. Biol. Evol. 2004;21:153–157. doi:10.1093/molbev/msh011 [PubMed]
  • House C.H, Fitz-Gibbon S.T. Using homolog groups to create a whole-genomic tree of free-living organisms: an update. J. Mol. Evol. 2002;54:539–547. doi:10.1007/s00239-001-0054-5 [PubMed]
  • Huson D.H, Steel M. Phylogenetic trees based on gene content. Bioinformatics. 2004;20:2044–2049. doi:10.1093/bioinformatics/bth198 [PubMed]
  • Lavrov D, Brown W.M, Boore J.L. Phylogenetic position of the Pentastomida and (pan)crustacean relationships. Proc. R. Soc. B. 2004;271:537–544. doi:10.1098/rspb.2003.2631 [PMC free article] [PubMed]
  • Locke D.P, Segraves R, Carbone L, Archidiacono N, Albertson D.G, Pinkel D, Eichler E.E. Large-scale variation among human and great ape genomes determined by array comparative genomic hybridization. Genome Res. 2003;13:347–357. doi:10.1101/gr.1003303 [PMC free article] [PubMed]
  • Murphy W.J, Pevzner P.A, O'Brien S.J. Mammalian phylogenomics comes of age. Trends Genet. 2004;20:631–639. doi:10.1016/j.tig.2004.09.005 [PubMed]
  • Murrell A, Campbell N.J, Barker S.C. The value of idiosyncratic markers and changes to conserved tRNA sequences from the mitochondrial genome of hard ticks (Acari: Ixodida: Ixodidae) for phylogenetic inference. Syst. Biol. 2003;52:296–310. [PubMed]
  • Naylor G.J.P, Brown W.M. Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences. Syst. Biol. 1998;47:61–76. doi:10.1080/106351598261030 [PubMed]
  • Nikaido M, Rooney A.P, Okada N. Phylogenetic relationships among cetartiodactyls based on insertions of short and long interspersed elements: hippopotamuses are the closest extant relatives of whales. Proc. Natl Acad. Sci. USA. 1999;96:10 261–10 266. doi:10.1073/pnas.96.18.10261 [PMC free article] [PubMed]
  • Nikaido M, et al. Retroposon analysis of major cetacean lineages: the monophyly of toothed whales and the paraphyly of river dolphins. Proc. Natl Acad. Sci. USA. 2001;98:7384–7389. doi:10.1073/pnas.121139198 [PMC free article] [PubMed]
  • Nugent J.M, Palmer J.D. RNA-mediated transfer of the gene coxII from the mitochondrion to the nucleus during flowering plant evolution. Cell. 1991;66:473–481. doi:10.1016/0092-8674(81)90011-8 [PubMed]
  • Okada N, Shedlock A.M, Nikaido M. Retroposon mapping in molecular systematics. Methods Mol. Biol. 2004;260:189–226. [PubMed]
  • Philippe H, Laurent J. How good are deep phylogenetic trees? Curr. Biol. 1998;8:616–623. doi:10.1016/S0960-9822(98)70390-2 [PubMed]
  • Phillips M.J, Delsuc F, Penny D. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. 2004;21:1455–1458. doi:10.1093/molbev/msh137 [PubMed]
  • Qiu Y.-L, Cho Y, Cox J.C, Palmer J.D. The gain of three mitochondrial introns identifies liverworts as the earliest land plants. Nature. 1998;394:671–674. doi:10.1038/29286 [PubMed]
  • Richly E, Leister D. NUMTs in sequenced eukaryotic genomes. Mol. Biol. Evol. 2004;21:1081–1084. doi:10.1093/molbev/msh110 [PubMed]
  • Salem A.-H, et al. Alu elements and hominid phylogenetics. Proc. Natl Acad. Sci. USA. 2003;100:12 787–12 791. doi:10.1073/pnas.2133766100 [PMC free article] [PubMed]
  • Santos M.A.S, Moura G, Massey S.E, Tuite M.F. Driving change: the evolution of alternative genetic codes. Trends Genet. 2004;20:95–102. doi:10.1016/j.tig.2003.12.009 [PubMed]
  • Schmitz J, Ohme M, Zischler H. SINE insertions in cladistic analyses and the phylogenetic affiliations of Tarsius bancanus to other primates. Genetics. 2001;157:777–784. [PMC free article] [PubMed]
  • Sebat J, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305:525–528. doi:10.1126/science.1098918 [PubMed]
  • Sharp A.J, et al. Segmental duplications and copy-number variation in the human genome. Am. J. Hum. Genet. 2005;77:78–88. doi:10.1086/431652 [PMC free article] [PubMed]
  • Snel B, Bork P, Huynen M.A. Genome phylogeny based on gene content. Nat. Genet. 1999;21:108–110. doi:10.1038/5052 [PubMed]
  • Snel B, Huynen M.A, Dutilh B.E. Genome trees and the nature of genome evolution. Annu. Rev. Microbiol. 2005;59:191–209. doi:10.1146/annurev.micro.59.030804.121233 [PubMed]
  • Telford M.J, Herniou E.A, Russell R.B, Littlewood D.T.J. Changes in mitochondrial genetic codes as phylogenetic characters: two examples from the flatworms. Proc. Natl Acad. Sci. USA. 2000;97:11 359–11 364. doi:10.1073/pnas.97.21.11359 [PMC free article] [PubMed]
  • Venkatesh B, Ning Y, Brenner S. Late changes in spliceosomal introns define clades in vertebrate evolution. Proc. Natl Acad. Sci. USA. 1999;96:10 267–10 271. doi:10.1073/pnas.96.18.10267 [PMC free article] [PubMed]
  • von Nickisch-Rosenegk M, Brown W.M, Boore J.L. Sequence and structure of the mitochondrial genome of the tapeworm Hymenolepis diminuta: gene arrangement indicates that platyhelminths are derived eutrochozoans. Mol. Biol. Evol. 2001;18:721–730. [PubMed]

Articles from Philosophical Transactions of the Royal Society B: Biological Sciences are provided here courtesy of The Royal Society