Proc Natl Acad Sci U S A. Sep 11, 2001; 98(19): 10751–10756.
Published online Aug 28, 2001. doi:  10.1073/pnas.191248498
PMCID: PMC58547
Evolution

Incomplete taxon sampling is not a problem for phylogenetic inference

Abstract

A major issue in all data collection for molecular phylogenetics is taxon sampling, which refers to the use of data from only a small representative set of species for inferring higher-level evolutionary history. Insufficient taxon sampling is often cited as a significant source of error in phylogenetic studies, and consequently, acquisition of large data sets is advocated. To test this assertion, we have conducted computer simulation studies by using natural collections of evolutionary parameters (rates of evolution, species sampling, and gene lengths) determined from data available in genomic databases. A comparison of the true tree with trees constructed by using taxa subsamples and trees constructed by using all taxa shows that the amount of phylogenetic error per internal branch is similar, a result that holds true for the neighbor-joining, minimum evolution, maximum parsimony, and maximum likelihood methods. Furthermore, our results show that even though trees inferred by using progressively larger taxa subsamples of a real data set become increasingly similar to trees inferred by using the full sample, all inferred trees are equidistant from the true tree in terms of phylogenetic error per internal branch. Our results suggest that longer sequences, rather than extensive sampling, will better improve the accuracy of phylogenetic inference.

Taxon sampling refers to the process of selecting representative taxa for a phylogenetic analysis. Nonexhaustive taxon sampling occurs for a number of reasons. Data may not be available from every extant species because of constraints of time, money, or rarity. In most cases, the number of potential species increases quickly if one is interested in phylogenetic relationships above the level of genus or family. Therefore, it is impractical, if not impossible, to sample every species from clades of interest. Rather, representative species from each clade are chosen and the reconstructed phylogenetic relationships of these species are taken to represent the evolutionary history of their respective clades.

Insufficient taxon sampling is often cited as a major source of error in phylogenetic analysis (e.g., refs. 1–10). However, as expected, the value of increasing the number of sequences (species) in a data set depends on the scope of sampling (11–14). Sampling within a fully framed monophyletic group may improve phylogenetic accuracy, but sampling outside of the group pushes the most recent common ancestor of the new set of taxa back in time and may decrease accuracy (13). Random sampling of additional taxa is thought to decrease, rather than increase, phylogenetic accuracy (12–14).

One reason why increased taxon sampling is thought to improve phylogenetic resolution is that it may counteract the “long branch attraction” problem, where long, unrelated branches may group together erroneously (15, 16). Increased taxon sampling may break long branches and help reduce the average branch length throughout the tree (13, 17–19). However, computer simulation results have been equivocal about the benefit of increased taxon sampling for reducing the long branch problem (11, 12, 19–21). The importance of extensive taxon sampling is already well established for estimating evolutionary parameters (4, 22, 23) and in independent contrasts (24).

There have also been a number of empirical studies on the value of taxon sampling for phylogenetic inference (2–10). These studies typically begin with a large number of species and then examine the results of analyzing subsamples; most have concluded that phylogenetic trees reconstructed with more taxa are more accurate than those inferred from fewer taxa. These conclusions assume that the phylogeny inferred by using the largest data set available is closest to the true tree, an assumption that is not well established, because the “true tree” is not known in empirical studies. At present these studies appear to have simply demonstrated that topologies reconstructed by using larger subsamples show higher congruence with the full tree. Therefore, this problem is most readily studied by computer simulation because the “true tree” is known. However, previous simulation and theoretical studies (11, 19–21) were often not conducted by subsampling from a large tree, as mentioned above, but rather began with a small number of species and progressively added additional species to long branches in the starting cluster, keeping the subsample tree fixed.

We conducted a simulation study motivated by issues an evolutionary biologist would encounter with real data. We began with a large predetermined phylogeny (as is the case with all empirical studies, the true tree of life having been fixed via evolution) and generated data sets consisting of sampled taxa from the “known” full phylogeny. In our simulations, we examined the problem of taxon sampling by using evolutionary rates, species representations, and gene length parameters for DNA and amino acid sequences derived from molecular sequence databases. In addition, we used model trees based on actual trees published in the literature, rather than an artificial tree created from a theoretical branching process or an artificial clustering scheme, in order to make our simulations an accurate representation of the topologies and distributions of branch lengths found in real data.

Materials and Methods

We used two different simulation schemes. For the first case, we chose the 66-taxa tree representing the phylogenetic relationships among Eutherian mammals from Murphy et al. (ref. 1; Fig. 1). The branch lengths represent the number of substitutions per site. This tree was chosen because it revises many well-established beliefs about mammalian evolution (see also ref. 25). For instance, lagomorphs had previously been found to be distantly related to sciurognath rodents in analyses of large numbers of genes for a few taxa (e.g., ref. 26), and rodents were thought to be an outgroup to artiodactyls and primates (27). Murphy et al. (1) place lagomorphs in a monophyletic assemblage with rodents and identify the rodents as a sister group of primates to the exclusion of artiodactyls. Therefore, we derived the topology of the tree in Fig. 1 by using their mammalian phylogeny. Note that Murphy et al.'s tree is based on a larger taxonomic sample than other studies, but has a fraction of the genes when compared with other studies that have smaller numbers of representative taxa (26, 27).

Figure 1
Model tree used in the DNA simulations based on the Eutherian mammal tree from ref. 1. Branch lengths indicate the number of substitutions per site.

We simulated DNA evolution for 50 hypothetical genes for Fig. 1, each with independent evolutionary properties, using the Jukes–Cantor (28) model of nucleotide substitution [we also conducted analyses under the Hasegawa–Kishino–Yano model (29) and obtained similar results (data not shown)]. The sequence length and substitution rate were determined randomly for each gene. The sequence length was picked from a uniform distribution ranging from 500 to 3,000 (the range of sequence lengths commonly found in the literature). Because the branch lengths of the model tree (Fig. 1) already represent substitutions per site, the substitution rate for each gene represented a random multiplier of these branch lengths, picked from a gamma distribution with a gamma parameter of 1 (as observed from a data set of homologous human and mouse genes at only first and second codon positions). After simulating evolution across the full tree, a random subsample of taxa was chosen. The size of the subsample was randomly selected from a uniform distribution of 5 to 50, and the specific sampled taxa were selected randomly from the full complement of 66 taxa.
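The per-gene parameter draws described above can be sketched as follows. This is a minimal illustration, not the authors' simulation program: the function names are hypothetical, the gamma scale of 1 (giving a mean multiplier of 1) is an assumption because the text states only the shape parameter, and the Jukes–Cantor formula shown is the standard expression for the probability that a site differs between the two ends of a branch.

```python
import math
import random


def draw_gene_parameters(rng, n_taxa_total=66):
    """Draw the settings for one hypothetical gene (names are illustrative)."""
    length = rng.randint(500, 3000)        # sequence length, uniform on 500..3000
    rate = rng.gammavariate(1.0, 1.0)      # shape 1 from the text; scale 1 is assumed
    subsample_size = rng.randint(5, 50)    # subsample size, uniform on 5..50
    taxa = sorted(rng.sample(range(n_taxa_total), subsample_size))
    return {"length": length, "rate": rate, "taxa": taxa}


def scaled_branch_length(model_branch_length, rate):
    """Apply the gene's rate multiplier to a model-tree branch length."""
    return model_branch_length * rate


def jc_difference_probability(branch_length):
    """Jukes-Cantor probability that a site differs across a branch.

    branch_length is the expected number of substitutions per site.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))


if __name__ == "__main__":
    gene = draw_gene_parameters(random.Random(2001))
    print(gene["length"], round(gene["rate"], 3), len(gene["taxa"]))
```

Under this scheme the expected subsample size is 27.5 taxa, in line with the average of 28 reported in the results.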

The model tree for the second set of simulations was based on an 18-taxa phylogeny of vertebrates (refs. 30–32; Fig. 2). These 18 taxa represent the most commonly found taxa in the genetic databases. There were 1,167 genes in our orthology database derived from HOVERGEN (33) with sequences for at least four of the taxa. The observed amino acid substitution rate and sequence length (100 to 2,696 sites) for each gene were used as the basis for simulating amino acid sequences on the tree by using the Poisson substitution model. Empirical substitution rates were estimated by using least-squares regression through the origin of pairwise sequence divergences and divergence times (34) for species in Fig. 2. Taxon sampling for each gene was determined by the availability of taxa for that gene in GenBank. Because some species are much more common in the database than others, this sampling is biased toward certain taxa and allows us to explore the effect of biased sampling as would be experienced by practicing biologists today. There were 100 simulation replicates for each gene for both the DNA and amino acid simulations. All simulations were conducted by using programs written by the authors.
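Least-squares regression through the origin has a closed-form slope, which the sketch below computes for pairs of divergence time and pairwise sequence divergence. The function name and the toy numbers are hypothetical, and the text does not say whether the slope was further converted (for example, halved to give a per-lineage rate), so the sketch returns the raw slope only.

```python
def slope_through_origin(times, divergences):
    """Least-squares slope of divergence on time, forced through the origin.

    Minimizing sum((d - b * t) ** 2) over b gives b = sum(t * d) / sum(t * t).
    """
    if len(times) != len(divergences):
        raise ValueError("times and divergences must have the same length")
    denominator = sum(t * t for t in times)
    if denominator == 0:
        raise ValueError("at least one divergence time must be non-zero")
    return sum(t * d for t, d in zip(times, divergences)) / denominator


if __name__ == "__main__":
    # Toy values only: divergence times in millions of years and
    # pairwise amino acid divergences per site.
    times = [100.0, 200.0, 300.0]
    divergences = [0.1, 0.2, 0.3]
    print(slope_through_origin(times, divergences))
```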

Figure 2
Model tree used in the amino acid simulations based on a known vertebrate phylogeny. Branch lengths indicate time in millions of years.

All analyses were performed by using PAUP* Version 4.0b4a for Windows (35). Basic phylogeny reconstruction for both sets of simulations was performed by using neighbor joining (NJ), minimum evolution (ME), and maximum parsimony (MP) methods, as well as maximum likelihood (ML) for the DNA simulations only. Distances for NJ and ME were calculated under the JC model. In ME, MP, and ML a single heuristic search was performed with Nearest-Neighbor-Interchange branch swapping. For ME and ML, the NJ tree was used as the starting tree for the heuristic search; for MP, a stepwise addition procedure was used. A more exhaustive, time-consuming search is not necessary because it is clear that it does not improve phylogenetic accuracy (36–39). The maximum number of trees that could be saved during the heuristic search procedures was set to 10,000 (most of the searches never came close to reaching this limit). When multiple trees were found under the ME, MP, and ML procedures, a majority rule consensus tree (retaining all compatible clades even under 50% frequency of occurrence with the LE50 option in PAUP*) was used to create a single resultant tree for each analysis. The resultant tree was then compared with the true (model) tree and the topological distance, dT (40, 41), was recorded. This distance is twice the number of interior branches at which the two trees being compared differ. For subsample analyses, the topological distance between the true tree and the inferred subsample tree was computed by using the pruned true tree that contained only the subsampled taxa. Tree distance is not directly comparable among trees with different numbers of taxa, because dT directly depends on the number of taxa (two unrooted four-taxon trees have a maximum dT of 2, whereas two 66-taxa trees have a maximum dT of 126). Therefore, we normalize the dT value and define the phylogenetic error per internal branch, E = dT/[2(n − 3)], where n is the number of taxa and n − 3 is the number of internal branches in an unrooted bifurcating tree. E ranges from 0 to 1 and gives the proportion of internal branches inferred incorrectly. Other scaling metrics (e.g., scaling by the number of taxa) led to similar conclusions.
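The distance and its normalization can be illustrated with a small sketch. It assumes trees are written as nested tuples of leaf labels (a hypothetical representation chosen for brevity, not the PAUP* format) and counts the internal-branch bipartitions found in one tree but not the other; for two bifurcating trees on the same taxa this count equals dT as defined above.

```python
def leaf_set(tree):
    """Collect the leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions (internal branches) of a tree.

    Each bipartition is stored as the side that excludes a fixed reference
    leaf, so the result does not depend on where the tuple is rooted.
    """
    leaves = leaf_set(tree)
    reference = min(leaves)
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = leaves - below if reference in below else below
        if 2 <= len(side) <= len(leaves) - 2:
            found.add(side)
        return below

    walk(tree)
    return found


def topological_distance(tree_a, tree_b):
    """dT: number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def error_per_internal_branch(true_tree, inferred_tree):
    """E = dT / [2(n - 3)] for two trees on the same n >= 4 taxa."""
    taxa = leaf_set(true_tree)
    if leaf_set(inferred_tree) != taxa:
        raise ValueError("both trees must contain the same taxa")
    if len(taxa) < 4:
        raise ValueError("at least four taxa are required")
    return topological_distance(true_tree, inferred_tree) / (2 * (len(taxa) - 3))


if __name__ == "__main__":
    true_tree = (("A", "B"), "C", ("D", "E"))
    inferred = (("A", "C"), "B", ("D", "E"))
    print(error_per_internal_branch(true_tree, inferred))  # 0.5
```

In the toy example one of the two internal branches is wrong, so E = 0.5. If the inferred tree is a consensus with unresolved nodes, the symmetric difference is still defined, but it is no longer exactly twice the number of differing branches.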

Phylogeny reconstruction was performed under a number of scenarios for each simulation. First, all of the sequences for all genes were concatenated into a single data set; the error between these inferred trees and the true tree is designated Econcat. Second, each gene was analyzed individually (gene-by-gene analysis) with all taxa included. In this case, we compute phylogenetic error (EG) by directly comparing the true tree with the inferred full tree for the given gene. Third, each gene was analyzed individually with only the subsampled taxa for that gene included. In this case, the phylogenetic error (ES) was computed by comparing the inferred subsample tree and true tree pruned to contain only the taxa in the subsample (Fig. 3).
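Pruning a tree to the subsampled taxa, as needed for the ES comparison, can be sketched as below. The nested-tuple tree representation and the function name are assumptions made for illustration; the sketch drops unsampled leaves and collapses any node left with a single child, so that only branching events among the retained taxa remain.

```python
def prune(tree, keep):
    """Return the tree restricted to the leaves in `keep`, or None if empty.

    Trees are nested tuples of leaf labels; nodes reduced to one child are
    collapsed so the pruned tree contains no unary nodes.
    """
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    children = []
    for child in tree:
        pruned_child = prune(child, keep)
        if pruned_child is not None:
            children.append(pruned_child)
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


if __name__ == "__main__":
    full_tree = (("A", "B"), ("C", ("D", "E")))
    print(prune(full_tree, {"A", "C", "D", "E"}))
    # ('A', ('C', ('D', 'E')))
```

The pruned true tree is then the reference against which the tree inferred from the subsample alone is scored.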

Figure 3
Diagrammatic representation of the relationships among the true and inferred trees and error statistics. EG is the phylogenetic error between the true tree and the full sample inferred tree; ES is the phylogenetic error between the true tree and the subsample ...

Results and Discussion

DNA Sequence Evolution.

In the DNA simulations, the 50 simulated genes consisted of a total of 79,410 sites (an average of 1,588 sites per gene). The average number of taxa subsampled was 28. The results for ME, NJ, MP, and ML were quite similar. Table 1 lists the number of sampled taxa, number of sites, substitution rate, and results of the ME analysis for each gene; the overall results for all methods are summarized in Table 2. In this set of analyses it is clear that the ML method reconstructs the trees most accurately, followed by MP, then NJ and ME.

Table 1
Results of minimum evolution phylogenetic analysis for the DNA simulations
Table 2
Summary of the results from the DNA simulations

For the concatenated data set, the mean phylogenetic error (Econcat) was 2–3%, with the true tree inferred in about 17% of the replicates. In any case, the inferred tree contained only one or two incorrect partitions. For individual genes, the error (EG) varied tremendously by gene, from 0.04 to 0.72 with a mean of 0.17 for the ME method. In general the true tree was never recovered by using individual genes. The variation among genes is due to a combination of both mutation rate and number of sites. Although both were important factors, the number of sites seems to have been more critical. For ME, the correlation between number of sites and EG was −0.552, whereas the correlation of substitution rate and EG was −0.326. As would be expected, there is no correlation (r = 0.023) between number of sites and substitution rate. For trees inferred with a subsample of taxa, the mean error (ES) was 0.19. This value is very similar to that obtained from trees inferred by using the full set of taxa (cf. EG = 0.17, Table 1). The correlation between the number of taxa and ES (r = −0.055) was 10-fold lower than the correlation with sequence length.

To construct a direct comparison between the phylogenetic relationships of the subsampled taxa as obtained by using all taxa and only the subsample taxa, we pruned the inferred trees obtained by using all of the taxa to contain just the subsampled taxa and determined the phylogenetic error between this pruned inferred tree and the true tree (EP). The mean phylogenetic error for ME was 0.16, which is again quite similar to EG and ES (Table 2). On average, more than doubling the number of taxa increased the percent of correct branches by only 2–3% (with an average subsample of 28, this increase represents less than a single branch). Note that even though ES is greater than EG and EP for very small subsamples (<10 taxa), the difference in phylogenetic error is usually much smaller than one branch per tree. Therefore, use of only a fraction of taxa provides practically indistinguishable results. The similarities among EG, ES, and EP persist in simulations using more complex models of nucleotide substitution, e.g., the Hasegawa–Kishino–Yano (29) model (results not shown), and are thus not a function of using a simple substitution model.
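As a worked check of the "less than a single branch" statement, assuming an unrooted bifurcating tree on the average subsample of n = 28 taxa, a 2–3% change in the proportion of correct internal branches corresponds to:

```latex
n - 3 = 28 - 3 = 25 \ \text{internal branches}, \qquad
0.02 \times 25 = 0.5, \qquad
0.03 \times 25 = 0.75
```

Both values fall below one branch per tree.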

Amino Acid Sequence Evolution.

In the amino acid simulations, the 1,167 simulated genes consisted of a total of 464,990 sites (an average of 398.5 sites per gene). The average number of taxa subsampled was 4.7, following the presence of species available for different genes in GenBank. ME, MP, and NJ gave similar results, which are summarized in Table 3. In these analyses, MP was the most accurate, followed by NJ and ME.

Table 3
Summary of the results from the amino acid simulations

For the concatenated data set, we were always able to recover the correct tree with all of the data. For ME, the mean error for individual genes (EG) ranged from 0.02 to 0.84, with a mean of 0.18. Again, the number of sites seems to have been more critical than the substitution rate. The correlation between number of sites and EG was −0.459, whereas the correlation of substitution rate and EG was −0.356.

The mean ES was 0.10, which is lower than the 0.18 for the full complement of taxa (EG). At first glance, it seems to imply that the subsample trees (those containing fewer taxa) were more accurate than the trees containing all of the taxa! Clearly, this observation is unexpected and contrary to the often assumed benefit of increased taxon sampling. As was done for DNA simulations, we test the validity of this result by comparing the amount of phylogenetic error in inferring phylogenetic relationships of the subsampled taxa as obtained by using all taxa and only the subsample taxa (EP, Fig. 3). EP was almost identical to ES. The difference between ES and EG is explained by the fact that most of the errors in the trees inferred by using all taxa were those in the most basal three or four taxa (Fig. 2), which were almost always absent from the taxon-sampled trees (because of the nature of the current gene sequence databases). Therefore, the similarity of the ES and EP values makes it clear that the subsample and full taxon set analyses were able to reconstruct trees with equivalent accuracy even though the average number of taxa in the subsample was less than five.

Taxon Sampling Versus Phylogenetic Signal.

In general, the above results indicate that incomplete taxon sampling has a much smaller effect on the accuracy of a phylogeny than do the number of sites and the substitution rate. It is therefore unwarranted to simply dismiss undesired or unexpected phylogenetic relationships obtained by using small numbers of taxa as the result of poor taxon sampling. Poor character sampling with weak phylogenetic signal is more likely to be the cause. In our study, taxon sampling had similar effects on phylogeny reconstruction for all of the major reconstruction methods. When the signal was strong, all of the methods reproduced the correct tree; when the signal was weak, none of them did. The one major difference among the different reconstruction methods in this study is their relationship with substitution rate and sequence length. All methods showed a stronger correlation between reconstruction accuracy and the number of sites than between accuracy and the substitution rate (see also ref. 37). This effect was strongest in MP, which consistently showed both a higher effect of number of sites and a lower effect of rate than any other method (Tables 2 and 3).

Empirical Versus Simulation Studies.

Our simulation results appear to conflict with empirical studies that have reported improved performance with increased taxon sampling. We examined these patterns empirically with the raw data from the study that produced the Eutherian mammal tree (1). For each of the 17 genes (Table 4), we inferred the phylogeny for all taxa by using NJ. We then created 500 random subsamples consisting of 15, 30, and 45 taxa each. Each subsample was constrained to contain at least one species from each of 13 mammalian orders present in the tree (assuming data were available for the order). These subsamples were analyzed with NJ, and the phylogenetic difference per internal branch was determined between the results of the subsampled taxa and the pruned results from the full taxa (DE, Fig. 3).
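One way to draw a subsample that contains at least one species from every order is sketched below. The text does not state how the constraint was enforced, so the two-stage scheme here (one random representative per order, then random fill from the remaining taxa) is an assumption, as are the function name and the toy taxon labels; note that it does not sample uniformly over all valid subsets, whereas rejection sampling would.

```python
import random


def constrained_subsample(taxon_to_order, size, rng):
    """Pick `size` taxa with at least one taxon from every order present."""
    by_order = {}
    for taxon, order in taxon_to_order.items():
        by_order.setdefault(order, []).append(taxon)
    if size < len(by_order):
        raise ValueError("size is smaller than the number of orders")
    if size > len(taxon_to_order):
        raise ValueError("size exceeds the number of available taxa")

    # Stage 1: one randomly chosen representative per order.
    chosen = {rng.choice(sorted(members)) for members in by_order.values()}
    # Stage 2: fill the remaining slots at random from the unused taxa.
    remaining = sorted(set(taxon_to_order) - chosen)
    chosen.update(rng.sample(remaining, size - len(chosen)))
    return sorted(chosen)


if __name__ == "__main__":
    # Toy mapping with hypothetical labels, not the 66-taxon data set.
    toy = {
        "taxon_a1": "order_a", "taxon_a2": "order_a",
        "taxon_b1": "order_b", "taxon_b2": "order_b",
        "taxon_c1": "order_c", "taxon_c2": "order_c",
    }
    print(constrained_subsample(toy, 4, random.Random(0)))
```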

Table 4
Summary of the empirical study of Eutherian mammal genes

DE declines as the sample size increases (Table 4). However, this does not indicate that trees with larger numbers of taxa are more accurate with respect to the true tree than those with fewer taxa, because this comparison does not involve the true tree (Fig. 3). The results in Table 4 merely show that trees based on similar numbers of taxa (e.g., 66 and 45) tend to be more similar than those based on dissimilar numbers of taxa (e.g., 66 and 15).

Because we know the true tree in computer simulations, we previously calculated the error between the true tree and the reconstructed trees (ES and EP). These errors were found to be largely independent of the size of the taxon sample. When we calculate DE for our simulations, we find a negative correlation between it and subsample size (Table 5); i.e., as the subsample size gets larger, the topological differences between the full and subsample inferred trees get smaller. This is the identical pattern found for the empirical results in Table 4. In addition, DE is only slightly smaller than ES and EP (Table 5). Therefore, there is as much topological difference between full and subsample trees as is observed between these trees and the true tree. This indicates that the similarity of ES and EP, regardless of the number of taxa, is not due to identical phylogenetic inference. The use of different sample sizes (number of taxa) may lead to different phylogenetic inferences; however, the error associated with these estimates is largely independent of the sample size.

Table 5
Comparison of subsampled data in the DNA simulations

This result has interesting implications for the difference in phylogenetic position of rabbits, rodents, and primates in the Eutherian phylogeny as obtained in studies based on small numbers of taxa and large numbers of genes versus those with more extensive taxon sampling but much fewer genes (1, 25, 26, 34, 42). If the model tree in Fig. 2 is indeed true, then our study indicates that taxon sampling does not explain the discrepancy. Otherwise, the correct phylogenetic relationships of these three groups are yet to be determined. Increasing the number of genes for the large taxon sample of Murphy et al. (1) and Madsen et al. (25) is likely to resolve this issue. In general, our results do not provide evidence in favor of adding taxa to problematic phylogenies; instead, using more genes with longer sequences would be a better use of time and resources.

Acknowledgments

We thank Sudhindra Gadagkar, Mark Miller, Tom Dowling, Mike Douglas, Koichiro Tamura, and two anonymous reviewers for providing useful comments on earlier versions of the manuscript. This research was supported by grants from the National Science Foundation (DBI-9983133), the National Institutes of Health (HG-02096), and the Burroughs Wellcome Fund (BWI 1001311) (to S.K.), and National Science Foundation Grant IBN-9977063 (to James L. Collins).

Abbreviations

MP
maximum parsimony
ME
minimum evolution
NJ
neighbor joining
ML
maximum likelihood

Footnotes

This paper was submitted directly (Track II) to the PNAS office.

References

1. Murphy W J, Eizirik E, Johnson W E, Zhang Y P, Ryder O A, O'Brien S J. Nature (London) 2001;409:614–618.
2. Omland K E, Lanyon S M, Fritz S J. Mol Phylogenet Evol. 1999;12:224–239.
3. Yoder A D, Irwin J A. Cladistics. 1999;15:351–361.
4. Saunders M A, Edwards S V. J Mol Evol. 2000;51:97–109.
5. van Tuinen M, Sibley C G, Hedges S B. Mol Biol Evol. 2000;17:451–457.
6. De Rijk P, Van de Peer Y, Van den Broeck I, De Wachter R. J Mol Evol. 1995;41:366–375.
7. Lecointre G, Philippe H, L, Le Guyader H. Mol Phylogenet Evol. 1993;2:205–224.
8. Poe S. Syst Biol. 1998;47:18–31.
9. Soltis P S, Soltis D E, Wolf P G, Nickrent D L, Chaw S-M, Chapman R L. Mol Biol Evol. 1999;16:1774–1784.
10. Johnson K P. Syst Biol. 2001;50:128–136.
11. Kim J. Syst Biol. 1996;45:363–374.
12. Kim J. Syst Biol. 1998;47:43–60.
13. Rannala B, Huelsenbeck J P, Yang Z, Nielsen R. Syst Biol. 1998;47:702–710.
14. Hillis D M. Syst Biol. 1998;47:3–8.
15. Felsenstein J. Syst Zool. 1978;27:401–410.
16. Nei M. Annu Rev Genet. 1996;30:371–403.
17. Swofford D L, Olsen G J, Waddell P J, Hillis D M. In: Molecular Systematics. Hillis D M, Moritz C, Mable B K, editors. Sunderland, MA: Sinauer; 1996. pp. 407–514.
18. Hillis D M. Nature (London) 1996;383:130–131.
19. Hendy M D, Penny D. Syst Zool. 1989;38:297–309.
20. Graybeal A. Syst Biol. 1998;47:9–17.
21. Poe S, Swofford D L. Nature (London) 1999;398:299–300.
22. Robinson M, Gouy M, Gautier C, Mouchiroud D. Mol Biol Evol. 1998;15:1091–1098.
23. Sullivan J, Swofford D L, Naylor G J P. Mol Biol Evol. 1999;16:1347–1356.
24. Ackerly D D. Evolution. 2000;54:1480–1492.
25. Madsen O, Scally M, Douady C J, Kao D J, DeBry R W, Adkins R, Amrine H M, Stanhope M J, De Jong W W, Springer M S. Nature (London) 2001;409:610–614.
26. Graur D, Duret L, Gouy M. Nature (London) 1996;379:333–335.
27. Easteal S, Collet C, Betty D. The Mammalian Molecular Clock. Austin, TX: Landes; 1995.
28. Jukes T H, Cantor C R. In: Mammalian Protein Metabolism. Munro H N, editor. New York: Academic; 1969. pp. 21–132.
29. Hasegawa M, Kishino H, Yano T. J Mol Evol. 1985;22:160–174.
30. Tudge C. The Variety of Life. New York: Oxford Univ. Press; 2000.
31. Hedges S B. In: Major Events in Early Vertebrate Evolution: Palaeontology, Phylogeny, Genetics and Development. Ahlberg P E, editor. London: Taylor and Francis; 2001. pp. 119–134.
32. Hedges S B, Kumar S. Science. 1999;285:2031a.
33. Duret L, Mouchiroud D, Gouy M. Nucleic Acids Res. 1994;22:2360–2365.
34. Kumar S, Hedges S B. Nature (London) 1998;392:917–920.
35. Swofford D L. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA: Sinauer; 2000.
36. Nei M, Kumar S, Takahashi K. Proc Natl Acad Sci USA. 1998;95:12390–12397.
37. Kumar S, Gadagkar S R. J Mol Evol. 2000;51:544–553.
38. Rosenberg M S, Kumar S. Mol Biol Evol. 2001;18:1823–1827.
39. Kumar S. Mol Biol Evol. 1996;13:584–593.
40. Robinson D F, Foulds L R. Math Biosci. 1981;53:131–147.
41. Penny D, Hendy M D. Syst Zool. 1985;34:75–82.
42. Hedges S B, Parker P H, Sibley C G, Kumar S. Nature (London) 1996;381:226–229.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences