• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of pnasPNASInfo for AuthorsSubscriptionsAboutThis Article
Proc Natl Acad Sci U S A. Mar 22, 2005; 102(12): 4436–4441.
Published online Mar 11, 2005. doi:  10.1073/pnas.0408313102
PMCID: PMC555482
Evolution

Incorporating gene-specific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data

Abstract

Because of the increase of genomic data, multiple genes are often available for the inference of phylogenetic relationships. The simple approach for combining multiple genes from the same taxon is to concatenate the sequences and then ignore the fact that different positions in the concatenated sequence came from different genes. Here, we discuss two criteria for inferring the optimal tree topology from data sets with multiple genes. These criteria are designed for multigene data sets where gene-specific evolutionary features are too important to ignore. One criterion is conventional and is obtained by taking the sum of log-likelihoods over all genes. The other criterion is obtained by dividing the log-likelihood for a gene by its sequence length and then taking the arithmetic mean over genes of these ratios. A similar strategy could be adopted with parsimony scores. The optimal tree is then declared to be the one for which the sum or the arithmetic mean is maximized. These criteria are justified within a two-stage hierarchical framework. The first level of the hierarchy represents gene-specific evolutionary features, and the second represents site-specific features for given genes. For testing significance of the optimal topology, we suggest a two-stage bootstrap procedure that involves resampling genes and then resampling alignment columns within resampled genes. An advantage of this procedure over concatenation is that it can effectively account for gene-specific evolutionary features. We discuss the applicability of the two-stage bootstrap idea to the Kishino–Hasegawa test and the Shimodaira–Hasegawa test.

Phylogenetic relationships can be estimated by a frequentist, Bayesian, or parsimony framework (for a general overview, see ref. 1). Within the frequentist framework, available procedures include maximum likelihood and distance methods. The bootstrap procedure (e.g., refs. 24) is widely employed to calculate the significance of a topology that is obtained with the frequentist or parsimony approaches. This procedure assumes that all columns of sequence data are samples from an independent and identical distribution, which we will refer to as the “iid assumption.” Although site dependency seems to be a more realistic assumption for biological sequence evolution (e.g., refs. 58), the iid assumption is widely employed because it saves computation time and convenient asymptotic theories of statistics can then be applied.

Because of the increase of genomic data, phylogeny estimation and testing can be based on multiple genes (e.g., refs. 912). A simple approach is to concatenate multiple genes, estimate the true tree, and then apply a bootstrap procedure to test whether the estimated topology is significantly better than alternatives. Concatenation may violate the iid assumption. Because different genes may have been subject to different evolutionary pressures, it may be appropriate to describe each gene by its own parameter set. The set may include parameters for nucleotide frequencies, branch lengths, etc. This sort of separate analysis of genes may be preferable to concatenation (refs. 911 and 1315, but see ref. 16).

At the extreme, different genes may even support different tree topologies. Permutation tests (1719) are available to detect departures among genes in “homogeneity” of tree support. However, these tests simply examine whether genes support congruent trees and whether we can ignore gene-specific effects via concatenation. These tests fail to give further direction about combining the separate information when multiple genes support different trees.

To determine the optimal topology even when different genes support different topologies, we should consider what topology selection criterion is most appropriate. The usual sum of parsimony or log-likelihood scores over all columns of separately analyzed multiple genes can be the basis for selecting an optimal tree (e.g., refs. 911), and we will refer to this as the “sum criterion.” The sum criterion is superior to sequence concatenation because it is affected by gene-specific evolutionary features. When the sum criterion is used to select optimal tree topologies, it is critical to determine how to test the significance of the selected trees. A simple application of the bootstrap procedure merges across genes the pool of sitewise log-likelihood or parsimony scores and then samples sitewise scores from this merged pool with replacement. The simple one-stage procedure may be ill-advised because the variation among genes is not properly considered and because the iid assumption among scores within the merged pool is no longer valid. Here, we apply a two-stage bootstrap procedure (20) to avoid problems associated with the one-stage version.

Another criterion to select tree topologies is the “average criterion.” If different sequences have different lengths, the sum of scores might fail to be a good criterion because long sequences may have a big impact on the sum of scores and the resulting optimal tree will reflect mainly the tree supported by long sequences and their gene-specific features. This sum of scores criterion therefore might be sensitive to long genes that have experienced unusual evolutionary processes. To remove the effect of sequence length, one can assign each gene an average score per site and then average these averages. In a similar way to the sum criterion, the two-stage bootstrap procedure can then be employed.

For the justification of our two-stage bootstrap procedure, we assume a hierarchical structure of multigene data set generation. First, evolutionary characteristics of each gene are determined. These evolutionary features are summarized by parameters representing nucleotide frequencies, branch lengths, etc. The parameter sets for different genes are independent samples from some common distribution of parameter sets. According to this hierarchical framework, each gene can be visualized as being determined by a common mechanism with perturbation. Because of the perturbation, genes do not all have identical properties. Instead, their properties follow an unknown true distribution. Second, once the properties of each gene are determined, they are then expressed through the sequence columns. These columns are distributed around the evolutionary features specific to a particular gene. The degree of dispersion of sequence columns around the features of a gene can vary among different genes.

Although it is possible that the evolutionary features of a gene and its length are correlated, we assume that this is not the case. Therefore, our two-stage iid assumption leads to a two-stage bootstrap procedure (20) for testing the significance of a tree. In this context, the first stage of the two-stage bootstrap procedure is to sample genes. The second stage samples columns of sequence data with replacement within resampled genes. Although sequence lengths are generally long, the number of available genes may be relatively small. Thus, the expected value of the estimated variance from the two-stage bootstrap procedure may be seriously different from the true variance. This bias should be considered when testing hypotheses. Under the two-stage iid assumption, it is possible to analytically calculate the variance of bootstrap resamples. We show how to do this for testing the significance of a topology.

The two-stage hierarchical structure is similar to a random effects model (21). One main purpose of the conventional random effects model is to test for a difference between groups. In our case, groups correspond to genes. A normality assumption is needed for ease of testing. The purpose of our procedure is to test whether the mean of the random effects model is equal to zero. We note that this test can be done without an assumption about the type of distribution by employing a nonparametric bootstrap procedure.

Methods

Here, we show how the hierarchial structure of genes and sequence columns can be used in determining an optimal tree topology and then testing its significance in a likelihood context. Kullback–Leibler distance (KLD) (22) is a distance measure between two models. In our case, these models correspond to tree topologies. We show that our hierarchical assumption of genes and columns can be directly extended to the KLD measure.

KLD. For the ith gene, KLDi between the true but unknown data-generating mechanism f(·) and the model g(·|θ) is:

equation M1
[1]

where the xij values are sampled sequence columns from f(·) and ni is the length of gene i. Each term of {log f(xij) – log g(xij|θ)} is a sitewise KLDi. The estimate of the minimum KLDi can be obtained with the maximum of equation M2 , which is denoted mKLDi here. That is, we obtain mKLDi at maximum likelihood estimates (equation M3) under the model g(·|θ),

equation M4

The mKLDi is used to choose the model (topology) which is closest to the unknown true model. Because we do not know the true data-generating mechanism f(·), we cannot directly calculate mKLDi. However, the log f(xij) terms are common to all competing models and mKLDi differences can be used in model comparison. Suppose two models (topologies) g1 and g2 are compared, then

equation M5
[2]

where equation M6 and equation M7 are maximum likelihood estimates under the model g1(·|θ1) and g2(·|θ2). The equation M8 term is a sitewise mKLDi difference and is denoted equation M9. If the quantity of Eq. 2 is significantly larger (smaller) than zero, model g2 (g1) is termed closer to the true model. The variance of equation M10 is considered in testing significance. The log equation M11 and log equation M12 terms are not exactly independent among different sites j (j = 1,..., ni) because equation M13 and equation M14 are functions of all xij values. However, for large ni, equation M15 and equation M16 are close to the true values θ1 and θ2, and these true values minimize KLD between f(·) and g1(·|θ1), and between f(·) and g2(·|θ2). Thus, it is almost correct to regard log equation M17 and log equation M18 as functions of solely xij. Because xij has a hierarchical iid structure, functions of xij also have a hierarchical iid structure.

Testing Phylogeny in a Likelihood Context. Suppose we want to know whether tree topology a or b is closer to the truth. If the estimated difference of mKLDi between the two tree topologies deviates significantly from zero, it means that one of the two trees is significantly closer to the unknown true distribution than the other [Kishino–Hasegawa (KH) test, ref. 3]. If one of a and b is the optimal tree as estimated by maximum likelihood or parsimony rather than a tree that was of interest apriori, the uncertainty of the estimate of the optimal tree should be considered in obtaining a confidence set of trees [Shimodaira–Hasegawa (SH) test, ref. 4]. Here, we show how to apply the hierarchical structure of equation M19 to the KH and SH tests.

For the maximum likelihood estimates of parameters, let the log-likelihood values calculated at the jth column of the ith gene under trees a and b be la,ij and lb,ij respectively. Let the sitewise log-likelihood difference be equation M20, which is the sitewise difference of mKLD.

We consider the distribution of sitewise differences of mKLD. That is, for K genes, we assume that yab,ij(i = 1,..., K; j = 1,..., ni) is an observation of random variable Yab,i with the properties equation M21 and equation M22. Except for a finite mean and variance, little is assumed about the type of the distribution. The equation M23 values need not be the same among genes i, but we do require independence between wab,i and equation M24. The expected value of equation M25 is denoted equation M26. We consider the hierarchical structure where wab,i is an observation of random variable Wab with the properties equation M27. We assume that gene lengths ni are random variables that are independent of Wab,i and equation M28 with E(ni) = n (0 < n <∞) and Var(ni) = σn < ∞. The hierarchical structure in our approach is similar to a random effects model. Although a conventional goal with the random effects model would be to test whether equation M29 exceeds zero, we are more interested here in testing whether μab is zero. For our application, normality assumptions for Yab,i and Wab are not required.

KH test for multiple genes. Results pertaining to the KH test for multiple genes are below and are justified in the supporting information, which is published on the PNAS web site. We consider the testing problem, H0: μab = 0 versus H1: μab ≠ 0. If μab is greater (less) than zero, then tree a (tree b) can be regarded as more reliable. Define Sab as the sum of sitewise log-likelihood differences between tree a and b over all columns of all genes,

equation M30
[3]

This means

equation M31
[4]

If Sab is far from zero, then H0 is rejected. In general, n2 and σ2n are big. Because σ2ab,W is multiplied by K(n2 + σ2n), the among-gene variation (σ2ab,W) can have a big impact on the variance of Sab.

To approximate the sampling distribution of Sab under H0,weuse two-stage bootstrap resampling. First, we resample genes (denoted by *) and second, we resample columns of the resampled genes (denoted by [large star]). Because it is computationally intensive to maximize likelihoods for resampled data, we employ the idea of the RELL method (3). That is, we consider two-stage resampling from the set of likelihood values that are already calculated instead of resampling gene and sequence columns. Let the resampled log-likelihood value at the jth column of the ith gene under trees a and b be equation M32 and equation M33. Define

equation M34

where equation M35 and equation M36 is the length of resampled ith gene. We have

equation M37
[5]

and

equation M38
[6]

where yab,i is (yab,i1,..., yab,ini)T. The expected values of Eq. 5 and 6 are

equation M39

and

equation M40
[7]

Eq. 7 means that we should consider the bias in the estimation of Var(Sab) by equation M41. This bias is represented by the second term of Eq. 7. Also in testing H0, this bias should be considered.

Following two guidelines of the bootstrap method by Hall and Wilson (23), we approximate the distribution of equation M42 with the distribution of equation M43, where equation M44 and equation M45 are the square roots of the unbiased estimators of Var(Sab) and equation M46. We have

equation M47
[8]

equation M48
[9]

If equation M49 is outside the 95% interval of equation M50, then we reject the null hypothesis H0: μab = 0. When gene number is large and sequences are long, equation M51 asymptotically follows a standard normal distribution when H0 is true and the test can be performed without a bootstrap procedure. SH test for multiple genes. In many cases, we want to test significance between an optimal and another tree, not between two trees that are both selected prior to the analysis. The SH test is designed for this situation because it considers the uncertainty of choosing the optimal tree (ref. 4; see also ref. 24). Often, the optimal tree is selected from a set of candidates, and the goal is to test whether the optimal tree is significantly better than the others. The candidates that are not significantly worse than the optimal tree are used to construct the confidence set of trees.

The definition of Sab in Eq. 3 can be rewritten

equation M52
[10]

Eq. 10 suggests the sum criterion to determine the optimal tree from multiple genes. Suppose there are p candidate trees, one of which is the optimal tree equation M53. In our method,

equation M54
[11]

To apply the two-stage bootstrap to the SH test, let Hab and equation M55 be equation M56, where a, b = 1,..., p. If a = b, Hab and equation M57 are zero. Under the least favorable configuration (4) in which the expected log-likelihood sum is the same over all trees, the hypothesized value of μab is equal to zero, and this makes equation M58. Following the steps below, we can construct a confidence set of trees.

  1. For each tree τ(τ [set membership] {1,..., p}), calculate the test statistic equation M59.
  2. Generate q sets of two-stage resampled log-likelihood values equation M60.
  3. Calculate equation M61 from equation M62 and equation M63, for all a, b = 1,..., p.
  4. For each tree τ with each resampled data set r (r = 1,..., q), calculate equation M64, where equation M65 of equation M66 is estimated with resampled data as follows
    equation M67
    [12]
  5. For each τ (τ [set membership] {1,..., p}), if Tτ is not in the critical region of the distribution of equation M68 with level α, τ is included in the confidence set of trees.

The Average Criterion to Select Tree Topology. Above, we considered the sum criterion to select the tree topology. As noted, the sum of log-likelihood values over all genes can be sensitive to relatively long genes that happened to experience atypical evolutionary processes. To remove the effect of sequence length, we consider an alternative to the sum criterion that we refer to as the average criterion.

If we define equation M69 and equation M70 as

equation M71

then equation M72 and equation M73. Letting n[minus sign in circle] and n[multiply sign in circle] be the expected values of 1/ni and 1/n2i, the variance of equation M74 and equation M75 are

equation M76

and the expectation of equation M77 is

equation M78

In a similar way to Eq. 7, we should consider the bias in the estimation of equation M79 when using equation M80. The unbiased estimators of equation M81 and equation M82 are, respectively,

equation M83

We can approximate the distribution of equation M84 with the distribution of equation M85. The extension of the KH and SH tests for the average criterion is straightforward. With this criterion, optimal trees from original and resampled data, equation M86 and equation M87, are

equation M88
[13]

The tree inferred with the average criterion is not necessarily the same as the tree obtained by the sum criterion (see below for examples).

Results

Data. As an example, we analyzed data from Cao et al. (10). Sitewise log-likelihood values (la,ij's) of mitochondrial genes were obtained from the authors. The purpose of Cao et al.'s work was to find the position of turtle within the amniotes group (for the 15 tree topologies considered by Cao et al., see Table 1). Rather than placing our focus on the turtle position, here we concentrate on our two-stage resampling procedure and the difference between our method and the weighted SH test after sequence concatenation or after merging separately calculated log-likelihoods. Among 14 mitochondrial genes, we excluded 12S and 16S rRNA data. These two rRNA genes were analyzed by Cao et al. (10) with a nucleotide substitution model, whereas the other 12 genes are protein-coding genes, and Cao et al. (10) used an amino acid replacement model to analyze them. With our hierarchical approach, it may be unreasonable to assume that the former two genes are observations from the same distribution as the 12 protein-coding genes. Tree support varies among the 12 protein-coding genes. When the 12 protein-coding genes were separately analyzed by Cao et al. (10), they found that topologies 2, 3, 4, and 9 were the maximum likelihood topologies for 5, 5, 1, and 1 genes, respectively.

Table 1.
P value and bootstrap probability (BP) of each tree calculated with concatenation of sequences, merging separately calculated log-likelihoods, sum criterion, and average criterion

Variability Among Genes. We compared the distribution of sitewise log-likelihood (la,ij) values between gene pairs and tested whether those distributions are significantly different. To do this, we performed the nonparametric Kolmogorov–Smirnov test (25). Many pairwise comparisons show a significant difference between genes (see supporting information). We obtained qualitatively similar results for all 15 topologies. This finding implies that ignoring gene-specific effects may be unwarranted and that neither sequence concatenation nor a conventional one-stage bootstrap procedure are recommended. In a similar way, we considered topology pairs a and b and tested the among-gene variability of sitewise differences of log-likelihood (yab,ij) values. We observed high variability among genes for most tree pairs that were examined (data not shown).

Comparison with SH Test. Because the second term of Eq. 7 is not negligible in general, we need proper “weighting” by equation M89 and equation M90 as described in Methods to approximate the distribution of (SabKnμab) with the distribution of equation M91. The weighting scheme also can be adapted to the simple SH test. We refer to this adaptation as the weighted SH (WSH) test (4). Because it considers weighting in its test statistic, our method corresponds more to the WSH test than the unweighted SH-test. We obtained optimal tree topologies with the sum and average criteria and we applied the two-stage resampling idea to the SH test. We compared our two stage procedures to the WSH test after sequence concatenation and the WSH test after merging separately calculated log-likelihoods (Table 1).

We applied the WSH test by using the consel software package (26) to sample with replacement from the sitewise log-likelihoods obtained in the analysis of Cao et al. (10). The bootstrap probabilities (BPs) of concatenation and merging procedures were calculated with the consel software. The BPs of our two-stage procedure were calculated in similar ways to Eqs. 12 and 13 without centering.

In our two-stage procedures, topologies 2 and 3 are obtained by the sum and average criteria. Topology 2 is also obtained from the procedure of sequence concatenation and merging log-likelihoods. The confidence sets of topologies in Table 1 contain the same trees (trees 2–4) with a 5% significance level, but the P values of the topologies vary among sets.

Discussion

In the Methods, we explained our method in a likelihood context. If we redefine la,ij as the parsimony score of the jth site of the ith gene under tree a, then getting and testing the optimal tree by parsimony could be done as described in Methods. The idea can also be straightforwardly applied to distance matrix methods. Because the hierarchical structure of gene and sequence columns can be extended to the hierarchical structure of mKLD, it could be also extended to a hierarchical structure of evolutionary distance. From the multiple distance matrices obtained in analyses of separate genes, an average distance matrix could be calculated. The optimal tree would then be reconstructed with this average distance matrix. To assign bootstrap support to subclades, a two-stage bootstrap procedure could be adopted. The idea of an average distance matrix corresponds to the average criterion. The sum criterion also can be extended to the distance method. We can multiply each distance matrix with the length of the gene. The tree can be reconstructed with the sum of these multiple matrices. A two-stage bootstrap procedure can be employed to get the bootstrap support. For long branches on an evolutionary tree of a gene, the estimated distance can be extremely large and have poor precision. Therefore, instead of using the average or sum of distances, it might be better to use the median of multiple estimated distances.

It is possible to estimate equation M92. For gene i, equation M93 can be obtained with samples of yab,ij (j = 1,..., ni). Using the fact that equation M94 of Eq. 8 is an unbiased estimator of Var(Sab) of Eq. 4 together with estimated equation M95, equation M96, equation M97 and equation M98, we can obtain equation M99. To calculate the precision of equation M100, we should specify the distributions of yab,ij and ni, or at least know their third and fourth moments. Because we make minimal assumptions about the first two moments of yab,ij and ni, we do not try to quantify equation M101 .If yab,ij and wab,i followed normal distributions with all equation M102 values equal, the conventional random effects approach (21) could test whether equation M103. Instead, we try to make our method less dependent on the model type and try to make our method less parametric. For this reason, we investigate the heterogeneity of the distribution of yab,ij with the Kolmogorov–Smirnov test instead of directly quantifying equation M104. Even when equation M105, the distribution of yab,ij can be heterogeneous because the equation M106 terms can vary among genes. When all genes have the same equation M107, there still may be variation among genes in impact on the sum criterion because of different sequence lengths ni and because equation M108 may not be zero. This means that concatenation of sequences or merging sitewise log-likelihoods followed by the iid assumption are not reasonable approaches.

From Eq. 4, we see that the sum of sitewise log-likelihood values over all columns of all genes is expected to have smaller variance when equation M109 is small and that variances of the log-likelihoods of genes are significantly affected by the variance of sequence length. On the other hand, the average over all genes of the average log-likelihood per site removes the effect of sequence length and reflects the average structure among genes robustly. The sum and average criteria are two extremes among many possible weighting schemes. Intermediate schemes exist, but it is unclear which intermediate would be best. More experience with genomic data and familiarity with gene-specific variation would help in determining how to select an optimal scheme.

From Cao et al.'s data (10), we excluded the two of 14 mitochondrial genes that are not protein-coding. We did this in order not to violate the hierarchical structure of the mKLD differences, yab,ij. We showed that hierarchical structure of genes and sites within genes are straightforwardly extended to the hierarchical structure of mKLD difference. With different amino acid (or nucleotide) substitution models in different genes, or if genes are seriously heterogeneous for reasons such as horizontal gene transfer, the hierarchical iid assumption of mKLD is violated. If topological difference has a big impact on mKLD but choice of evolutionary model does not, then different models for different genes or heterogeneous genes would not be a problem.

The 12 protein-coding genes from Cao et al. (10) are variable in terms of tree support. Topologies 2, 3, 4, and 9 are the maximum likelihood trees for 5, 5, 1, and 1 individually analyzed genes, respectively. The total concatenated sequence length is 3,235 aa. The sum of lengths of genes that support topology 2 is 1,651 aa, and the sum supporting topology 3 is 1,118 aa. This finding is consistent with the fact that the optimal tree is topology 2 under the concatenation and merging procedures with the sum criterion. The length of the COB gene, which supports topology 9, is 365 aa. This relatively short gene has little effect in the concatenation and merging approaches with the sum criterion. With our two-stage resampling procedure, COB can be resampled more than once and can have a bigger effect. Because the average criterion removes the effect of sequence length, it produces higher P values and bootstrap probabilities than the sum criterion for topology 9. Other analyses that we have performed also indicate that the effect of gene sampling on phylogenetic inference can be substantial (e.g., see supporting information).

In our approach, sequence columns are assumed to be iid samples from a common distribution. However there may exist heterogeneous partitions within a gene (e.g., ref. 14). For example, the three codon positions of protein-coding genes show significant heterogeneity in evolutionary dynamics (e.g., ref. 28). For protein-coding sequences, partitioning could also be done according to structural environment of the position (e.g., buried or exposed to solvent, α-helix or β-strand or coil). Here, we do not intensively investigate the sometimes difficult issue of choosing data partitions. More understanding of the evolutionary process would surely assist in defining data partitions.

Although categorizing by codon position or structural environment may lead to substantial variation in evolutionary process among partitions, these categorization schemes do not fit well into our hierarchical framework. We have treated gene-specific perturbations from a common evolutionary mechanism as random effects. Variation among partitions defined by codon position or protein structure would probably be better described with fixed effects. It is not easy to envision randomly sampling partitions defined by codon or protein structure.

Moreover, if each codon position or structural partition has an appropriate evolutionary model, the ratio of lengths between different branches on a tree might not be expected to vary among partitions. There is no obvious reason, for example, that we would expect all α-helix positions in a genome to evolve quickly relative to β-strand positions on one branch of a tree but slowly relative to β-strand positions on another branch of the tree. In contrast, genes are units on which phenotypic selection acts. Natural selection might induce gene-specific effects on branch length that could vary among branches and would yield ratios of branch lengths that vary among genes.

There have been proposals to take into account the variability of evolutionary process among genes (e.g., refs. 28 and 29). One treatment assumes all genes share a topology and the branch lengths for different genes vary only by gene-specific proportionality constants (28). Another assumes all genes evolved via the same topology, but there is no correlation when branch lengths of different genes are examined. Comparisons between the proportional branch lengths, the uncorrelated branch lengths, and the concatenated treatment have been made (30). Recently, the proportionality treatment was adapted to a Bayesian framework (31 and 32), and the uncorrelated treatment could be similarly adapted. In our procedure, all genes (or data partitions) are assumed to be independent samples from the same distribution. However, partitions that seriously violate this assumption (e.g., morphological versus sequence data) can be simultaneously analyzed in a Bayesian framework (refs. 31 and 32, see also ref. 33). Suchard et al. (34) recently introduced a promising hierarchical Bayesian approach that can pool information among genes without requiring either uncorrelated or strictly proportional branch lengths.

We have referred to the gene-specific properties that are our focus here as being due to “perturbations” from a “common mechanism.” We have intentionally been somewhat vague as to the possible biological sources of these perturbations. The most obvious explanation for gene-specific perturbations is mutation or natural selection. Natural selection operating on a trait coded by the gene could simultaneously affect the branch lengths or other parameters at all gene positions.

Although natural selection and mutation are likely to be important sources of gene-specific perturbations in evolutionary parameters, they should not generate different topologies for different genes. Even if all genes evolved according to the same topology, gene-specific perturbations may be partially responsible for variation among genes in the level of support for the shared topology. Our methods are designed to consider this variation.

A possibility that could result in different topological histories among genes is differential lineage sorting. Rannala and Yang (27) explicitly considered lineage sorting by ascribing variability among gene trees to ancestral polymorphism and stochasticity. There are many advantages to the direct and explicit treatment of Rannala and Yang (27). A possible disadvantage is that highly detailed models may lack robustness when assumptions are violated. Our less detailed treatment of gene-specific properties assumes there is a central tree of interest and that random gene-specific perturbations are unbiased relative to this tree. The fact that we do not specify the biological source of these perturbations can be viewed as either an advantage or disadvantage of our procedure.

Regarding the source of gene-specific perturbations, another possibility is they arise from model misspecification. It may be that all genes share a common history but that the model employed to analyze them is not equally appropriate for all genes. If the effects of this model misspecification are unbiased among genes in that the average inference among genes is expected to be the truth, our hierarchical treatment would be appropriate. Undoubtedly, a more desirable solution would be to remedy the misspecification by improving the model.

Here, we use the sum and average criteria to infer the optimal tree from multigene data. We also introduce a two-stage bootstrap procedure to test the significance of the optimal tree. Our method has an advantage over sequence concatenation in that it can effectively consider gene-specific features. It is justified by the assumption of a hierarchical structure of genes and sites within genes. The method has the advantage of being able to combine the information from multiple genes that might support different trees. We expect that the rapid growth of genomic data will lead to a better understanding of the effects of gene sampling on evolutionary inference. For the time being, only a small number of genes are typically available. With these small data sets, it is especially important to consider effects of gene sampling on the bias and power of phylogenetic hypothesis testing.

Supplementary Material

Supporting Information:

Acknowledgments

We thank Y. Cao for providing data. This work was supported by the Institute for Bioinformatics Research and Development of the Japanese Science and Technology Corporation, Japanese Society for the Promotion of Science Grant 16300086, and National Science Foundation Grants INT-99-0934 and DEB-0120635.

Notes

Author contributions: T.-K.S., H.K., and J.L.T. designed research; T.-K.S., H.K., and J.L.T. performed research; T.-K.S. analyzed data; and T.-K.S., H.K., and J.L.T. wrote the paper.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: KLD, Kullback–Leibler distance; KH, Kishino–Hasegawa; SH, Shimodaira–Hasegawa; WSH, weighted SH.

References

1. Felsenstein, J. (2004) Inferring Phylogenies (Sinauer, Sunderland, MA).
2. Felsenstein, J. (1985) Evolution (Lawrence, Kans.) 39, 783–791.
3. Kishino, H. & Hasegawa, M. (1989) J. Mol. Evol. 29, 170–179. [PubMed]
4. Shimodaira, H. & Hasegawa, M. (1999) Mol. Biol. Evol. 16, 1114–1116.
5. Pollock, D. D., Taylor, W. R. & Goldman, N. (1999) J. Mol. Biol. 287, 187–198. [PubMed]
6. Jensen, J. L. & Pedersen, A. K. (2000) Adv. Appl. Prob. 32, 499–517.
7. Pedersen, A. K. & Jensen, J. L. (2001) Mol. Biol. Evol. 18: 763–776. [PubMed]
8. Robinson, D. M., Jones, D. T., Kishino, H., Goldman, N. & Thorne, J. L. (2003) Mol. Biol. Evol. 20, 1692–1704. [PubMed]
9. Adachi, J., Waddell, P., Martin, W. & Hasegawa, M. (2000) J. Mol. Evol. 50, 348–358. [PubMed]
10. Cao, Y., Sorenson, M. D., Kumazawa, Y., Mindell, D. P. & Hasegawa, M. (2000) Gene 259, 139–148. [PubMed]
11. Cao, Y., Fujiwara, M., Nikaido, M., Okada, N. & Hasegawa, M. (2000) Gene 259, 149–158. [PubMed]
12. Nikaido, M., Cao, Y., Harada, M., Okada, N. & Hasegawa, M. (2003) Mol. Phylogenet. Evol. 28, 276–284. [PubMed]
13. Bull, J. J., Huelsenbeck, J. P., Cunningham, C. W., Swofford, D. L. & Waddell, P. J. (1993) Syst. Biol. 42, 384–397.
14. Debry, R. W. (1999) Syst. Biol. 48, 286–299. [PubMed]
15. Debry, R. W. (2003) Syst. Biol. 52, 604–617. [PubMed]
16. Gontcharov, A. A., Marin, B. & Melkonian, M. (2004) Mol. Biol. Evol. 21, 612–624. [PubMed]
17. Farris, J. S., Källersjö, M., Kluge, A. G. & Bult, C. (1994) Cladistics 10, 315–319.
18. Swofford, D. L. (1995) paup*: Phylogenetic Analysis Using Parsimony (* and Other Methods) (Sinauer, Sunderland, MA).
19. Waddell, P. J., Kishino, H. & Ota, R. (2000) Mol. Biol. Evol. 17, 1988–1992. [PubMed]
20. Rao, J. N. K. & Wu, C. F. J. (1988) J. Am. Stat. Assoc. 83, 231–241.
21. Scheffé, H. (1959) in The Analysis of Variance (Wiley, New York), pp. 221–260.
22. Kullback, S. & Leibler, R. A. (1951) Ann. Math. Stat. 22, 79–86.
23. Hall, P. & Wilson, S. R. (1991) Biometrics 47, 757–762.
24. Goldman, N., Anderson, J. P. & Rodrigo, A. G. (2000) Syst. Biol. 49, 652–670. [PubMed]
25. Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992) in Numerical Recipes in C (Cambridge Univ. Press, Cambridge, U.K.), 2nd Ed., pp. 623–628.
26. Shimodaira, H. & Hasegawa, M. (2001) Bioinformatics 17, 1246–1247. [PubMed]
27. Rannala, B. & Yang, Z. (2003) Genetics 164, 1645–1656. [PMC free article] [PubMed]
28. Yang, Z. (1996) J. Mol. Evol. 42, 587–596. [PubMed]
29. Yoder, A. D. & Yang, Z. (2000) Mol. Biol. Evol. 17, 1081–1090. [PubMed]
30. Pupko, T., Huchon, D., Cao, Y., Okada, N. & Hasegawa, M. (2002) Mol. Biol. Evol. 19, 2294–2307. [PubMed]
31. Ronquist, F. & Huelsenbeck, J. P. (2003) Bioinformatics 19, 1572–1574. [PubMed]
32. Nylander, J. A. A., Ronquist, R., Huelsenbeck, J. P. & Nieves-Aldrey, J. L. (2004) Syst. Biol. 53, 47–67. [PubMed]
33. Lewis, P. O. (2001) Syst. Biol. 50, 913–925. [PubMed]
34. Suchard, M. A., Kitchen, C. M. R., Sinsheimer, J. S. & Weiss, R. E. (2003) Syst. Biol. 52, 649–664. [PubMed]

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

  • PubMed
    PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...