• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of pnasPNASInfo for AuthorsSubscriptionsAboutThis Article
Proc Natl Acad Sci U S A. Feb 20, 2007; 104(8): 2785–2790.
Published online Feb 14, 2007. doi:  10.1073/pnas.0611164104
PMCID: PMC1815259

Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics


In 1988, Felsenstein described a framework for assessing the likelihood of a genetic data set in which all of the possible genealogical histories of the data are considered, each in proportion to their probability. Although not analytically solvable, several approaches, including Markov chain Monte Carlo methods, have been developed to find approximate solutions. Here, we describe an approach in which Markov chain Monte Carlo simulations are used to integrate over the space of genealogies, whereas other parameters are integrated out analytically. The result is an approximation to the full joint posterior density of the model parameters. For many purposes, this function can be treated as a likelihood, thereby permitting likelihood-based analyses, including likelihood ratio tests of nested models. Several examples, including an application to the divergence of chimpanzee subspecies, are provided.

Keywords: speciation, population structure, divergence

Population genetic and phylogenetic models that take a genealogical (i.e., gene tree) approach suffer two nested levels of ambiguity. First, the uncertainty of an estimate of a genealogy can be large and difficult to quantify, and second, it can be difficult to interpret a genealogy estimate explicitly in terms of an evolutionary or population genetics model. In his 1988 review, Felsenstein (1) conceptualized a way thru these uncertainties by positioning the genealogy as a nuisance variable in the definition of the likelihood of the parameters given the data (proportional to the sampling probability of the data):

equation image

where X is the data, G is a genealogy, ψ is the set of all possible genealogies, and Θ is the vector of model parameters to be estimated. The basic idea of considering all of the possible genealogies in proportion to their probability is also contained explicitly in the recursion approach of Griffiths (2) and is suggested in much other work on genealogical models. Although Felsenstein used the notation for summation, integral forms have often since been used, reflecting the fact that genealogies are complex entities with both discrete components (branching topology) and continuous components (branch lengths).

Felsenstein's equation does not have a general closed-form solution, and numerical evaluation is difficult because of the very large number of possible tree topologies for even small data sets. It is possible to approximate Eq. 1 by simulating k independent genealogies from p(G|Θ), G1, …, Gk, in which case, a simulation-consistent estimator of the likelihood can be obtained as

equation image

However, this is usually far too inefficient, because the variance in Pr(X|G) will be very large for randomly generated genealogies. Efficient stochastic evaluation of Eq. 1 requires the availability of methods for sampling G with some consideration of the data. For a given parameter value, Θ, the distribution of G that minimizes the simulation variance is

equation image

However, direct sampling from this distribution is not possible because it requires that the likelihood function can be calculated analytically (3).

Kuhner, Yamato, and Felsenstein (1995).

A solution to the question of how to sample genealogies was described by Kuhner et al. (4), who devised a Markov chain Monte Carlo (MCMC) simulation approach. In the simulation, updates of G are accepted with probability given by the Metropolis–Hastings (5, 6) criterion

equation image

where q(GG*) is the probability that G* is proposed as an update from G. At stationarity, the residence time in the Markov chain will be proportional to the posterior density of that genealogy (i.e., as given by Eq. 3), and trees sampled successively from the Markov chain are correlated draws from the posterior density of genealogies.

The approach devised by Kuhner et al. and used thereafter for a variety of models (710) is to use the genealogies that have been sampled by using one parameter value, Θ0, to estimate the relative likelihood for other values. The likelihood surface for Θ is obtained by running a Markov chain at a fixed value Θ0 close to the mode of the likelihood function while evaluating the likelihood for multiple values of Θ by using importance sampling (11). Let p(G0) be p(G|Θ) evaluated at the point Θ = Θ0, and assume p(G0) > 0 if p(G|Θ) > 0 for all Θ and G. Then from Eqs. 1 and 3, we see that

equation image

where w(Θ, Θ0, G) = p(G|Θ)/p(G0). A single set of values of G are drawn from p(G|X, Θ0) (i.e., by sampling from the Markov chain) and used to estimate the relative likelihoods for other values of Θ:

equation image

where wi(Θ, Θ0, G) and G(i) are the value of wi(Θ, Θ0, G) and G, respectively, in the ith sampled step of the chain.

The method of Kuhner et al. (4) was the first true MCMC method in population genetics, and it showed that likelihood inference of population genetics parameters is possible for complex mutational models. It lead to the development of related methods (1214), and it preceded the use of closely related MCMC methods for phylogenetic inference (1517). However, it suffers the significant shortcoming that the distribution of wi(Θ, Θ0, G) will be very skewed when Θ differs from Θ0, causing the variance of the estimate of the likelihood to be very large and difficult to estimate when |Θ − Θ0| is large (14, 18). Because of the skewed distribution of wi(Θ, Θ0, G), the method will tend to underestimate the likelihood when Θ differs from Θ0 and thus bias the estimator toward values close to Θ0. Kuhner et al. (4) address the problem of large variance when |Θ − Θ0| is large by running multiple chains and updating Θ0 each time the chain is restarted (19).

An alternative to MCMC sampling of genealogies is the sequential importance sampling method of Griffiths and Tavaré (20, 21). Stephens and Donnelly (22) suggested a modification of the approach of Griffiths and Tavaré that samples more efficiently from an approximation to Eq. 3.

Bayesian MCMC.

One way to extend the MCMC approach to generating likelihood surfaces is to explicitly consider a prior distribution of Θ, p(Θ), and to simulate a Markov chain with stationary measure given by the joint posterior density of G and Θ,

equation image

(12, 14, 23). This approach, of running a Markov chain over a state space of genealogies and model parameters, has been extended to multilocus applications for a variety of models (2427). Apart from having a large state space and associated MCMC mixing challenges, the main shortcomings stem from the essential form of the result, which is not a function estimate but merely a record of parameter values. Density estimates can be obtained by binning or by kernel estimators, but the nature of the results effectively precludes estimates of the joint posterior density for models with more than a small number of parameters. In such cases the volume of the parameter space is so large that the number of recorded values that will fall in any portion of it may be low, even for very long runs and even for portions of the parameter space associated with high posterior densities. Because of this “curse of dimensionality,” the number of samples needed increases exponentially with the dimension (28). This means that applications have mostly been limited to the generation of estimates of the marginal posterior densities for each of the model parameters. It also means that it has been difficult to estimate likelihood ratios for models involving several parameters. Here, we propose a method that eliminates the need for a driving value (Θ0) and that generates estimates of the entire posterior probability density function, suitable for optimization and likelihood-ratio tests of nested models.


This approach relies on the analytical calculation of the prior probability of G by integration of p(G|Θ) over the prior distribution of Θ. This makes it possible to draw samples by MCMC directly from the marginal posterior probability of genealogies, p(G|X). Then, using a sample of these genealogies, one can construct an estimate of the posterior density function, p(Θ|X).

As we will show, the calculation of the marginal prior density of genealogies,

equation image

can be done analytically and easily when p(Θ) has a uniform distribution. Access to the prior for G permits an MCMC simulation that has a marginal posterior density for G given by

equation image

(contrast with Eq. 7). Then, the posterior density for Θ is given by

equation image

(contrast with Eq. 1). This can be proved by noting that under the proper scaling, p(X|G, Θ) = Pr(X|G) (4). Then p(Θ|G, X) = Pr(X|G, Θ)p(Θ|G)/Pr(X|G) = p(Θ|G), and Eq. 10 follows by the law of total probability.

The posterior density of Θ can then be consistently estimated as

equation image

where Gi, i = 1, 2, …, k, are the samples from p(G|X) that are generated by the MCMC simulation. Inferences can then be based on p(Θ|X) or on the likelihood function deduced as L(Θ) [proportional, variant] p(Θ|X)/p(Θ). If p(Θ) is a constant, then the posterior probability is directly proportional to the likelihood over the prior range of Θ. In effect, a Bayesian sampling strategy is being used to generate an estimate of the relative likelihood, which can be used in turn to find a maximum likelihood estimate of Θ and to conduct other likelihood-based analyses. It is also useful to note that Pr(X|G) is not part of the final calculation in Eq. 11. As in the method of Kuhner et al., (4) the data are used to determine the probability density from which the genealogies are sampled and thereafter are not required (29).

A Single-Population Model.

Consider a model in which Θ includes just one parameter, θ = 4Nu, and a sample of n gene copies, for a locus with neutral mutation rate u, drawn from a population with effective chromosomal population size 2N and evolving according to Kingman's coalescent (30). Letting the coalescent times in the genealogy be τ = {τ2, …, τn}, where τi is the time interval in G in which there are i ancestors of the sample, then

equation image


equation image

is the total coalescent rate measured over the genealogy (30, 31). If we consider a uniform prior distribution for θ over the interval {0, θmax}, then placing Eq. 12 into Eq. 8 yields

equation image

where Γ(a, b) is the incomplete Gamma function with parameters a and b. Similarly, we find

equation image

Generation of the estimate of the posterior density function, which is a sum of functions in the form of Eq. 14 (see ref. 11) requires only that fn be recorded at intervals from the Markov chain simulation.

Multipopulation Models.

Now consider a family of models (so-called “island models”) in which multiple populations, each of constant size, have been exchanging genes at constant rates for sufficiently long that the probability of a genealogy is solely a function of population sizes and migration rates (32, 33). Here, we develop the case for two populations with a pair of population size parameters (θ1, θ2) and two scaled migration rate parameters (m1 and m2) (8), but the approach can be extended to any number of populations.

For a sample of n1 and n2 gene copies, from each population respectively, G will include n1 + n2 − 1 coalescent events as well as a variable number of migration events. Let c1 and c2 be the number of coalescents in populations 1 and 2, respectively; and let w1 and w2 be the number of migration events out of population 1 and 2, respectively. When the coalescent and migration events are ordered in time, there are a total of a = n1 + n2 + w1 + w2 − 1 time intervals. The probability density of the genealogy, as a function of the parameter set Θ = {θ1, θ2, m1, m2}, is

equation image

where the f and g terms refer to the total coalescent and migration rates, respectively, over the corresponding portions of G, such that

equation image

where n1,i and n2,i are the number of gene copies in populations 1 and 2 during interval i. Then integration over each of the four elements in Θ yields the prior probability

equation image

The result of this integration is a product of four terms, including two that take the same form as Eq. 12 for the scaled population size parameters, as well as two migration terms, each of which takes the form

equation image

where Γ(a, 0, b) is the lower incomplete Gamma function.

Finally, recall that

equation image

where, for this model, the numerator is the product of Eq. 15 and the prior distribution, and the denominator is given by Eq. 17. Then, as with the case of a single-parameter model, p(Θ|G) can be used in Eq. 11 for each of a set of sampled genealogies.

The estimate of p(Θ|X) obtained by using Eq. 11 has some desirable properties. First, the integration over Θ will necessarily equal 1, because it is equivalent to integrating each of the k components of the sum, the result of each of which will necessarily equal 1 (see Eq. 19). Second, because each of the component functions that are summed in Eq. 11 are calculable and differentiable over the prior of Θ, so is the overall function. This means that the function can, in principal, be maximized for all, or any subset, of the parameters in Θ.

Models with Population Splitting and Multiple Loci.

Conventional island models assume an equilibrium between migration and genetic drift and cannot well represent histories that include recent population-splitting events. Such splitting events are a typical component of the speciation process, and they underlie the hierarchical structure of the phylogenetic history of life on earth. By incorporating population-splitting events into multipopulation genetic models it becomes possible to conjoin phylogenetic models with population genetics ones.

Described in supporting information (SI) Text is the two-population “isolation with migration model,” in which there are six parameters including three for population sizes (θ1, θ2, and θa, where θa is the value of θ in the ancestral population); the scaled time at which the ancestral population gave rise to the two descendant populations, t; and the two scaled migration rates, m1 and m2 (23, 25). In this context, G is partly a function of the splitting time (23) and so it is not clearly feasible to develop a prior for G by analytically integrating over t (unlike the case with only population size and migration parameters). However, we can calculate analytically the joint prior, p(G, t), and we can sample pairs of values of G and t, from a Markov chain simulation. The result is an estimate of the posterior density function for all of the parameters apart from t,

equation image

where Θ includes all parameters except t. Although t is not integrated over analytically, the simulations do reveal an estimate of the marginal posterior density for t. Also described in SI Text is a method for considering data from multiple loci that vary in their neutral mutation rates.

Implementation and Examples.

A computer program was written that implements a Markov chain simulation for generating samples from p(G|X) for models with one or two populations, as well as for a two-population isolation with migration model (i.e., with a population-splitting time parameter, t). The state space of the Markov chain includes the prior distribution of G (and t if population splitting is in the model), with a general Metropolis–Hastings update criterion

equation image

The update of G to G* is done by using branch sliding (14) in which a randomly selected branch is moved a random distance in the tree. The migration events originally on the branch are removed, and a random number of new migration events is drawn from a Poisson distribution, conditioned on there being an even or odd number of migration events (depending on whether the starting and ending populations of the branch are the same). The Poisson parameter is taken to be the expected number of migrations over the span of the new branch length, given the current number of migration events that occur over the total length of the tree.

If the model includes t, then it is also necessary to do joint updates of G and t. For these updates, we follow the method of Rannala and Yang (34), in which the new value, t*, is drawn from a uniform distribution over the interval {0, tmax}, and the times of all migration and coalescent events in G before t are multiplied by t*/t, and the times of events after t are summed with (t* − t). At evenly spaced intervals, records are made of t, p(G, t), and of those quantities from G that are needed to calculate p(G|Θ, t). For the case of multiple loci, the updates of the mutation rate scalars are done as in Hey and Nielsen (25).

In general, it is expected that each genealogy will make its greatest contribution to the overall probability over some limited range of Θ. By including a large number of genealogies, sampled from a long-running, well-mixing Markov chain that has reached stationarity, it should be possible to obtain good estimates of p(Θ|X) for any value of Θ. Optimization of the estimate function, under full or nested models, requires some care because the surface may be multimodal over broad and fine scales, either because of the data or because of the particular genealogies that happened to end up in the sample. After trying a number of approaches, we settled on the simulated annealing algorithm that is implemented in the AMEBSA code of Press et al. (35).

Fig. 1 shows an example for the simple case of a data set simulated under a single population model (one parameter, θ = 4Nu). Ten likelihood functions, each based on a single genealogy, are shown together with their average as well as the average for 100 samples drawn from the same simulation.

Fig. 1.
A single population model was simulated with a sample of 20 gene copies drawn from a population with a true population mutation rate of θ = 20. Functions based on 10 individual sampled genealogies are shown along with their mean (gray line) as ...

Nested models and likelihood-ratio tests.

In addition to an estimate of the posterior density, p′(Θ|X), the method can also be used to study nested submodels, e.g., a model with parameter space Θr, where Θr contains a subset of the parameters in Θ, and the remaining parameters take on fixed values. By using Eq. 11, the functions p′(Θ|X) and p′r|X) can be maximized to find the highest probabilities and the associated parameter values, Θ and Θr. Because the posterior probability density of Θ is uniformly proportional to the likelihood, p(Θ|X) = cL(Θ|X) and pr|X) = cLr|X), where c = p(Θ)/p(X) Thus, the posterior density ratio equals the likelihood ratio. If Λ is the log of the ratio of the highest likelihoods found under each model, then this can be estimated from the ratios of the two functions, each at its maximal value, [Lambda with circumflex] = 1n(p′([Theta with circumflex]r|X)/p′([Theta with circumflex]|X)). If the two density functions are good estimates of the true densities, and if the data set X consists of a large number of independent observations, then this ratio can be used in a conventional likelihood-ratio test. If Θr is the true model, then, for unbounded parameters and under certain regularity conditions, we expect that −2[Lambda with circumflex] asymptotically will follow a χ2 distribution with k degrees of freedom, where k is the difference in the number of dimensions (parameters) between Θr and Θ.

To examine the actual distribution of −2[Lambda with circumflex], data sets were simulated under a particular model, Θr. For each data set, a Markov chain simulation was run to generate an estimate of the posterior density function under the full model, p′(Θ|X). This function was maximized over all parameters to generate p′([Theta with circumflex]|X) and then maximized over just those parameters that were free to vary in Θr to generate p′([Theta with circumflex]r X), and −2[Lambda with circumflex] was calculated. Fig. 2 shows the resulting cumulative distributions for three different models, each of which is consistent with the corresponding χ2 distribution, showing that the asymptotic result holds approximately for these moderately sized simulated data sets and that the added simulation variance introduced by the method does not invalidate the use of the classical likelihood-ratio tests. Additionally, the good fit of the χ2 distribution suggests that the estimation and optimization of the likelihood surface is reasonable accurate. Other simulations with small data sets do show that, as expected with less data, that the distribution of −2[Lambda with circumflex] will have a variance larger than that for the corresponding χ2 distribution.

Fig. 2.
Cumulative distributions of −2[Lambda with circumflex] calculated from Markov chain simulations run on 100 simulated data sets. (A) The true model has two populations of identical size, with simulated data sets sampled following: θ1 = θ ...

Chimpanzee case study.

To demonstrate the approach for a model in which an ancestral populations splits into two, we considered the case of two chimpanzee subspecies, Pan troglodytes troglodytes (the Central African Chimpanzee) and Pan troglodytes verus (the Western African Chimpanzee). This divergence has previously been studied by using a Markov chain simulation in which the state space includes both genealogies and model parameters for a data set of 48 genes drawn from the literature (36).

Fig. 3 shows the marginal posterior density estimates from the original method (36), which generates histogram-based estimates, and the new method. As expected, both sets of marginal density estimates are very similar. Fig. 4 shows examples of contour plots of marginal posterior density estimates for pairs of parameters.

Fig. 3.
Posterior probability estimates for population size and migration parameters for P. t. troglodytes and P. t. verus. Histograms generated by a Markov chain with state space that includes genealogies and parameters (36) are shown as dots (one for each of ...
Fig. 4.
Two dimensional contour plots of marginal posterior density estimates for P. t. troglodytes and P. t. verus. (A) θ1 (for P. t. troglodytes) and θA. (B) m1 and m2.

Table 1 shows the likelihood ratio statistic for a series of nested models applied to the chimpanzee data. All of the ratio statistics were calculated as the difference between the highest posterior probability for the full model and the highest posterior probability for the nested model. Only two models were not rejected: the model in which the two migration rates are equal to each other and the one in which m2 is equal to 0. If we were to correct for multiple tests, then other models would also not be rejected.

Table 1.
Tests of nested models for P. t. troglodytes and P. t. verus


Felsenstein's equation has become a centerpiece of modern population genetics and phylogenetic analysis as computational approaches have been developed for faster and improved approximate solutions. Here, we describe an approach that provides greatly improved access to a broad family of population genetics models, i.e., those that can be described with one or more population size and migration parameters. Relying on a Markov chain simulation, the state space is limited to just the posterior density of genealogies, thereby avoiding those MCMC mixing problems that arise because of correlations between G and Θ, when both are part of the state space (25). In addition, the method provides a convenient approach for estimating likelihood ratios.

The finding that the estimate of the likelihood ratio, from nested models, closely approximates the χ2 distribution that is expected under asymptotic assumptions is strong affirmation of the validity of the approach, and it means that the method can be used for many questions that involve a contrast of different demographic models. Model selection and testing of demographic hypotheses based on the full-likelihood function have often been neglected in the fields of molecular ecology and population genetics because appropriate tools for calculating likelihood ratios have not be available. The methods described here should greatly alleviate this problem by providing a powerful computational framework for estimating likelihood functions and likelihood ratios.

Supplementary Material

Supporting Text:


We thank Yong Wang, David Ruppert, and Naomi Altman for helpful discussions. This work was supported in part by a National Science Foundation grant (to J.H.) and by grants from Danmarks Grundsforskningsfond and the Danish Forskningsrådet for Natur og Univers (to R.N.).


Markov chain Monte Carlo.


The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0611164104/DC1.


1. Felsenstein J. Annu Rev Genet. 1988;22:521–565. [PubMed]
2. Griffiths RC. J Math Biol. 1989;27:667–680. [PubMed]
3. Stephens M. In: Handbook of Statistical Genetics. Balding DJ, Bishop M, Cannings C, editors. West Sussex, UK: Wiley; 2001.
4. Kuhner MK, Yamato J, Felsenstein J. Genetics. 1995;140:1421–1430. [PMC free article] [PubMed]
5. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. J Chem Phys. 1953;21:1087–1091.
6. Hastings WK. Biometrika. 1970;57:97–109.
7. Kuhner MK, Yamato J, Felsenstein J. Genetics. 1998;149:429–434. [PMC free article] [PubMed]
8. Beerli P, Felsenstein J. Genetics. 1999;152:763–773. [PMC free article] [PubMed]
9. Kuhner MK, Yamato J, Felsenstein J. Genetics. 2000;156:1393–1401. [PMC free article] [PubMed]
10. Beerli P, Felsenstein J. Proc Natl Acad Sci USA. 2001;98:4563–4568. [PMC free article] [PubMed]
11. Thompson EA, Guo SW. IMA J Math App Med Biol. 1991;8:149–169. [PubMed]
12. Wilson IJ, Balding DJ. Genetics. 1998;150:499–510. [PMC free article] [PubMed]
13. Beaumont MA. Genetics. 1999;153:2013–2029. [PMC free article] [PubMed]
14. Nielsen R. Genetics. 2000;154:931–942. [PMC free article] [PubMed]
15. Yang Z, Rannala B. Mol Biol Evol. 1997;14:717–724. [PubMed]
16. Larget B, Simon DL. Mol Biol Evol. 1999;16:750–759.
17. Mau B, Newton MA, Larget B. Biometrics. 1999;55:1–12. [PubMed]
18. Stephens M. Bulletin of the 52nd Session of the International Statistics Institute Book. 1999. pp. 273–276.
19. Geyer CJ, Thompson EA. J R Stat Soc B. 1992;54:567–699.
20. Griffiths RC, Tavaré S. Theor Popul Biol. 1994;46:131–159.
21. Griffiths RC, Tavaré S. Stat Sci. 1994;9:307–319.
22. Stephens M, Donnelly P. J R Stat Soc B. 2000;62:605–655.
23. Nielsen R, Wakeley J. Genetics. 2001;158:885–896. [PMC free article] [PubMed]
24. Hey J, Won Y-J, Sivasundar A, Nielsen R, Markert JA. Mol Ecol. 2004;13:909–919. [PubMed]
25. Hey J, Nielsen R. Genetics. 2004;167:747–760. [PMC free article] [PubMed]
26. Palsbøll PJ, Berube M, Aguilar A, Notarbartolo-Di-Sciara G, Nielsen R. Evol Int J Org Evol. 2004;58:670–675.
27. Hey J. PLoS Biol. 2005;3:0965–0975.
28. Scott DW. Multivariate Density Estimation: Theory, Practice, and Visualization. New York: Wiley; 1992.
29. Felsenstein J, Kuhner MK, Yamato J, Beerli P. In: Statistics in Genetics and Molecular Biology. Seillier-Moiseiwitsch F, editor. Vol 33. Hayward, CA: Inst Math Stat and Am Math Soc; 1999. pp. 163–185.
30. Kingman JFC. Stochastic Processes App. 1982;13:235–248.
31. Felsenstein J. Genet Res (Cambridge, UK) 1992;60:209–220. [PubMed]
32. Maruyama T. Theor Popul Biol. 1970;1:273–306. [PubMed]
33. Wright S. Genetics. 1931;16:97–159. [PMC free article] [PubMed]
34. Rannala B, Yang Z. Genetics. 2003;164:1645–1656. [PMC free article] [PubMed]
35. Press WH. Numerical Recipes in C: The Art of Scientific Computing. Cambridge, UK: Cambridge Univ Press; 1992.
36. Won YJ, Hey J. Mol Biol Evol. 2005;22:297–307. [PubMed]
37. Chernoff H. Ann Math Stat. 1954;25:573–578.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences


Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...


Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...