Genetics. Jan 2010; 184(1): 243–252.
PMCID: PMC2815920

Bayesian Computation and Model Selection Without Likelihoods

Abstract

Until recently, the use of Bayesian inference was limited to a few cases because for many realistic probability models the likelihood function cannot be calculated analytically. The situation changed with the advent of likelihood-free inference algorithms, often subsumed under the term approximate Bayesian computation (ABC). A key innovation was the use of a postsampling regression adjustment, allowing larger tolerance values and thus shifting computation time to realistic orders of magnitude. Here we propose a reformulation of the regression adjustment in terms of a general linear model (GLM). This allows the integration into the sound theoretical framework of Bayesian statistics and the use of its methods, including model selection via Bayes factors. We then apply the proposed methodology to the question of population subdivision among western chimpanzees, Pan troglodytes verus.

WITH the advent of ever more powerful computers and the refinement of algorithms like MCMC or Gibbs sampling, Bayesian statistics have become an important tool for scientific inference during the past two decades. Consider a model M creating data D (DNA sequence data, for example) determined by parameters θ = (θ_1, …, θ_m) from some (bounded) parameter space Π ⊂ R^m whose joint prior density we denote by π(θ). The quantity of interest is the posterior distribution of the parameters, which can be calculated by Bayes rule as

\[
\pi(\theta \mid D) = c^{-1} f_M(D \mid \theta)\,\pi(\theta),
\]

where f_M(D | θ) is the likelihood of the data and c is a normalizing constant. Direct use of this formula, however, is often prevented by the fact that the likelihood function cannot be calculated analytically for many realistic probability models. In these cases one is obliged to use stochastic simulation. Tavaré et al. (1997) propose a rejection sampling method for simulating a posterior random sample where the full data D are replaced by a summary statistic s (like the number of segregating sites in their setting). Even if the statistic does not capture the full information contained in the data D, rejection sampling allows for the simulation of approximate posterior distributions of the parameters in question (the scaled mutation rate in their model). This approach was extended to multiple-parameter models with multivariate summary statistics s = (s_1, …, s_n) by Weiss and von Haeseler (1998). In their setting a candidate vector θ of parameters is simulated from a prior distribution and is accepted if its corresponding vector of summary statistics is sufficiently close to the observed summary statistics s_obs with respect to some metric in the space of s, i.e., if dist(s, s_obs) < ε for a fixed tolerance ε. We suppose that the likelihood f_M(s | θ) of the full model is continuous and nonzero around s_obs. In practice the summary statistics are often discrete, but the range of values is large enough for them to be approximated by real numbers. The likelihood of the truncated model M_ε(s_obs) obtained by this acceptance–rejection process is given by

\[
f_{M_\epsilon}(s \mid \theta) = \frac{f_M(s \mid \theta)\,\mathrm{Ind}_{B_\epsilon}(s)}{\int_{B_\epsilon} f_M(s' \mid \theta)\,ds'}, \tag{1}
\]

where B_ε = B_ε(s_obs) is the ε-ball in the space of summary statistics and Ind(·) is the indicator function. Observe that f_{M_ε}(s | θ) degenerates to a (Dirac) point measure centered at s_obs as ε → 0. If the parameters are generated from a prior π(θ), then the distribution of the parameters retained after the rejection process outlined above is given by

\[
\pi_\epsilon(\theta) = \frac{\pi(\theta)\int_{B_\epsilon} f_M(s \mid \theta)\,ds}{\int_\Pi \pi(\theta')\int_{B_\epsilon} f_M(s \mid \theta')\,ds\,d\theta'}. \tag{2}
\]

We call this density the truncated prior. Combining (1) and (2) we get

\[
f_{M_\epsilon}(s_\mathrm{obs} \mid \theta)\,\pi_\epsilon(\theta) \propto f_M(s_\mathrm{obs} \mid \theta)\,\pi(\theta). \tag{3}
\]

Thus the posterior distribution of the parameters under the model M for s = s_obs given the prior π(θ) is exactly equal to the posterior distribution under the truncated model M_ε(s_obs) given the truncated prior π_ε(θ). If we can estimate the truncated prior and make an educated guess for a parametric statistical model of M_ε(s_obs), we arrive at a reasonable approximation of the posterior π(θ | s_obs) even if the likelihood of the full model M is unknown. It is to be expected that, due to the localization process, the truncated model will exhibit a simpler structure than the full model M and thus be easier to estimate.

Estimating π_ε(θ) is straightforward, at least when the summary statistics can be sampled from M in a reasonable amount of time: sample the parameters from the prior π(θ), create their respective statistics s from M, and save those parameters whose statistics lie in B_ε in a list P = {θ_1, …, θ_N}. The empirical distribution of these retained parameters yields an estimate of π_ε(θ). If the tolerance ε is small, then one can assume that f_M(s | θ) is close to some (unknown) constant over the whole range of B_ε. Under that assumption, Equation 3 shows that π(θ | s_obs) ≈ π_ε(θ). However, when the dimension n of the summary statistics is high (and for more complex models dimensions like n = 50 are not unusual), the “curse of dimensionality” implies that the tolerance must be chosen rather large or else the acceptance rate becomes prohibitively low. This, however, distorts the approximation of the posterior distribution by the truncated prior (see Wegmann et al. 2009). The situation can be partially alleviated by speeding up the sampling process; such methods are subsumed under the term approximate Bayesian computation (ABC). Marjoram et al. (2003) developed a variant of the classical Metropolis–Hastings algorithm (termed ABC–MCMC in Sisson et al. 2007), which allows them to sample directly from the truncated prior π_ε(θ). In Sisson et al. (2007) a sequential Monte Carlo sampler is proposed, requiring substantially fewer iterations than ABC–MCMC. But even when such methods are applied, the assumption that f_M(s | θ) is constant over the ε-ball is a very rough one, indeed.
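As an illustration, a minimal sketch of this rejection scheme in Python follows; prior_sampler and simulate_stats are hypothetical stand-ins for drawing from the prior π(θ) and from the model M, and the Euclidean metric is assumed for dist. The acceptance rate is recorded as well, since it reappears in the model selection below.

```python
import numpy as np

def abc_rejection(prior_sampler, simulate_stats, s_obs, eps, n_sims, rng):
    """Plain ABC rejection: keep every parameter whose statistics fall in
    the eps-ball around s_obs (Euclidean metric assumed)."""
    kept_params, kept_stats = [], []
    for _ in range(n_sims):
        theta = prior_sampler(rng)           # draw theta from the prior pi(theta)
        s = simulate_stats(theta, rng)       # draw s from the model M given theta
        if np.linalg.norm(s - s_obs) < eps:  # dist(s, s_obs) < eps
            kept_params.append(theta)
            kept_stats.append(s)
    P, S = np.array(kept_params), np.array(kept_stats)  # (N, m) and (N, n)
    return P, S, len(P) / n_sims             # sample, statistics, acceptance rate
```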

To take into account the variation of f_M(s | θ) within the ε-ball, a postsampling regression adjustment (termed ABC–REG in the following) of the sample P of retained parameters was introduced in the important article by Beaumont et al. (2002). Basically, they postulate a (locally) linear dependence between the parameters θ and their associated summary statistics s. More precisely, the (local) model they implicitly assume is of the form θ = Ms + m_0 + ε_0, where M is a matrix of regression coefficients, m_0 a constant vector, and ε_0 a random vector of zero mean. Computer simulations suggest that for many population models ABC–REG yields posterior marginal densities that have narrower highest posterior density (HPD) regions and are more closely centered around the true parameter values than the empirical posterior densities directly produced by ABC samplers (Wegmann et al. 2009). An attractive feature of ABC–REG is that the adjustment is performed directly on the simulated parameters, which makes estimation of the marginal posteriors of individual parameters particularly easy. The method can also be extended to more complex, nonlinear models as demonstrated, e.g., in Blum and Francois (2009). In extreme situations, however, ABC–REG may yield posteriors that are nonzero in parameter regions where the priors actually vanish (see Figure 1B for an illustration of this phenomenon). Moreover, it is not clear how ABC–REG could yield an estimate of the marginal density of model M at s_obs, information that is useful for model comparison.
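For illustration, here is a minimal sketch of such an adjustment with an unweighted OLS fit; the original method of Beaumont et al. (2002) additionally weights the retained points by an Epanechnikov kernel in dist(s, s_obs), which is omitted here for brevity, so this is not a faithful reimplementation.

```python
import numpy as np

def abc_reg_adjust(P, S, s_obs):
    """Linear regression adjustment: regress the retained parameters P on
    their summary statistics S and project each point onto s = s_obs."""
    N = S.shape[0]
    X = np.column_stack([np.ones(N), S - s_obs])  # intercept + centered statistics
    beta, *_ = np.linalg.lstsq(X, P, rcond=None)  # OLS coefficients, (n+1) x m
    return P - (S - s_obs) @ beta[1:]             # theta* = theta - M^t (s - s_obs)
```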

Figure 1.
Comparison of rejection (A and D), ABC–REG (B and E), and ABC–GLM (C and F) posteriors with those obtained from analytical likelihood calculations. We estimated the population–mutation parameter θ = 4Nμ ...

In contrast to ABC–REG we treat the parameters θ as exogenous and the summary statistics s as endogenous variables, and we stipulate for M_ε(s_obs) a general linear model (GLM in the literature—not to be confused with the generalized linear models that unfortunately share the same abbreviation). To be precise, we assume the summary statistics s created by the truncated model's likelihood f_{M_ε}(s | θ) to satisfy

\[
s = C\theta + c_0 + \epsilon, \tag{4}
\]

where C is an n × m matrix of constants, c_0 an n × 1 vector, and ε a random vector with a multivariate normal distribution of zero mean and covariance matrix Σ_s:

\[
\epsilon \sim \mathcal{N}_n(0,\ \Sigma_s).
\]

A GLM has the advantage of taking into account not only the (local) linearity, but also the strong correlation normally present between the components of the summary statistics. Of course, the model assumption (4) can never represent the full truth, since its statistics are in principle unbounded whereas the likelihood f_{M_ε}(s | θ) is supported on the ε-ball around s_obs. But since the multivariate Gaussians fall off rapidly in practice and do not reach far beyond the boundary of B_ε, this is a disadvantage we can live with. In particular, the ordinary least-squares (OLS) estimate outlined below implies that for ε → 0 the constant c_0 tends to s_obs whereas the design matrix C and the covariance matrix Σ_s both vanish. This means that in the limit of zero tolerance ε = 0 our model assumption yields the true posterior distribution of M.

THEORY

In this section we describe the above methodology—referred to as ABC–GLM in the following—in more detail. The basic two-step procedure of ABC–GLM may be summarized as follows.

GLM1:

Given a model M creating summary statistics s and given a value of observed summary statistics s_obs, create a sample P = {θ_1, …, θ_N} of retained parameters with the aid of some ABC sampler (rejection sampling, ABC–MCMC, or ABC–PRC) based on a prior distribution π(θ) and some choice of the tolerance ε > 0.

GLM2:

Estimate the truncated model M_ε(s_obs) as a general linear model and determine, on the basis of the sample P from the truncated prior π_ε(θ), an approximation to the posterior π(θ | s_obs) according to Equation 3.

Let us look more closely at these two steps.

GLM1: ABC sampling:

We refer the reader to Marjoram et al. (2003) and Sisson et al. (2007) for details concerning ABC algorithms and to Marjoram and Tavaré (2006) for a comprehensive review of computational methods for genetic data analysis. In practice, the dimension of the summary statistics is often reduced by a principal components analysis (PCA). PCA also has a certain decorrelation effect. A more sophisticated method of reducing the dimension of summary statistics, based on partial least squares (PLS), is described in Wegmann et al. (2009). In a recent preprint, Vogl et al. (C. Vogl, C. Futschik and C. Schloetterer, unpublished data) propose a Box–Cox-type transformation of the summary statistics that makes the likelihood close to multivariate Gaussian. This transformation might be especially efficient in our context as we assume normality of the error terms in our model assumption.

To fix the notation, let P = {θ_1, …, θ_N} be a sample of vector-valued parameters created by some ABC algorithm simulating from some prior π(θ) and S = {s_1, …, s_N} be the sample of associated summary statistics produced by the model M. Each parameter θ_j is an m-dimensional column vector (θ_{1j}, …, θ_{mj})^t and each summary statistic s_j an n-dimensional column vector (s_{1j}, …, s_{nj})^t. The samples P and S can thus be viewed as m × N and n × N matrices P and S, respectively.

The empirical estimate of the truncated prior π_ε(θ) is given by the discrete distribution that puts a point mass of 1/N on each value θ_j. We smooth out this empirical distribution by placing a sharp Gaussian peak over each parameter value θ_j. More precisely, we set

\[
\hat\pi_\epsilon(\theta) = \frac{1}{N}\sum_{j=1}^{N} \varphi(\theta - \theta_j), \tag{5}
\]

where

\[
\varphi(\theta) = |2\pi\Sigma_\theta|^{-1/2}\exp\left(-\tfrac{1}{2}\,\theta^t\Sigma_\theta^{-1}\theta\right)
\]

and

\[
\Sigma_\theta = \mathrm{diag}(\sigma_1^2, \dots, \sigma_m^2)
\]

is the covariance matrix of φ that determines the width of the Gaussian peaks. The larger the number N of sampled parameter values, the sharper the peaks can be chosen while still obtaining a rather smooth π_ε. If the parameter domain Π is normalized to [0, 1]^m, say, then a reasonable choice is σ_k = 1/N. Otherwise, σ_k should be adapted to the range of the parameter component θ_k. Too small values of σ_k will result in wiggly posterior curves, and too large values might unduly smear out the curves. The best advice is to run the calculations with several choices for Σ_θ. If π_ε induces a correlation between parameters, a nondiagonal Σ_θ might be beneficial. In practice, however, the posterior estimates are most sensitive to the diagonal values of Σ_θ.
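A direct transcription of estimate (5) for a diagonal Σ_θ might look as follows, with P the N × m array of retained parameters and sigma the vector (σ_1, …, σ_m); both names are ours, not the article's.

```python
import numpy as np

def truncated_prior_density(theta, P, sigma):
    """Smoothed truncated prior (5): one sharp Gaussian peak of covariance
    Sigma_theta = diag(sigma**2) per retained parameter value theta_j."""
    z = (theta - P) / sigma                       # standardized offsets, (N, m)
    norm = np.prod(np.sqrt(2.0 * np.pi) * sigma)  # |2 pi Sigma_theta|^(1/2)
    return np.exp(-0.5 * (z ** 2).sum(axis=1)).sum() / (len(P) * norm)
```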

GLM2: general linear model:

As explained in the Introduction, we assume the truncated model M_ε(s_obs) to be normal linear; i.e., the random vectors s satisfy (4). The covariance matrix Σ_s encapsulates the strong correlations normally present between the components of the summary statistics. C, c_0, and Σ_s can be estimated by standard multivariate regression analysis (OLS) from the samples P and S created in step GLM1. [Strictly speaking, one must redo an ABC sample from uniform priors over Π to get an unbiased estimate of the GLM if the prior π(θ) is not already uniform. On the other hand, ordinary least-squares estimators are quite insensitive to the prior's influence. In practice, one can just as well use the sample P to do the estimate. We applied both estimation methods to the toy models presented in the EXAMPLES FROM POPULATION GENETICS section and found no significant difference between the estimated posteriors. The same holds true for the so-called feasible generalized least-squares (FGLS) estimator; see Greene (2003). In this two-stage algorithm the covariance matrix is first estimated as in our setting, but in a second round the design matrix C is estimated anew. When we applied FGLS to our toy models, we found a difference in the estimated matrices only after the eighth significant decimal. FGLS is a more efficient estimator only when sample sizes are relatively small, as is often the case in economic data sets but not in ABC situations. In theory, both OLS and FGLS are consistent estimators, but FGLS is more efficient.] To be specific, set X = (1 | P^t), where 1 is an N × 1 vector of 1's. C and c_0 are determined by the usual least-squares estimator

\[
(\hat c_0 \mid \hat C)^t = (X^t X)^{-1} X^t S^t,
\]

and for Σ_s we have the estimate

\[
\hat\Sigma_s = \frac{1}{N - m - 1}\sum_{j=1}^{N} r_j\, r_j^t, \tag{6}
\]

where r_j = s_j − ĉ_0 − Ĉθ_j are the residuals. The likelihood for this model—dropping the hats on the matrices to unburden the notation—is given by

\[
f_{M_\epsilon}(s \mid \theta) = |2\pi\Sigma_s|^{-1/2}\exp\left(-\tfrac{1}{2}(s - C\theta - c_0)^t\,\Sigma_s^{-1}(s - C\theta - c_0)\right). \tag{7}
\]

An exhaustive treatment of linear models in a Bayesian (econometric) context is given in Zellner's book (Zellner 1971).
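In matrix form this estimation is a few lines in any statistical environment. A minimal NumPy sketch, assuming the samples are stored row-wise as an N × m array P and an N × n array S, could read:

```python
import numpy as np

def fit_glm(P, S):
    """OLS estimates of C, c0 and Sigma_s in the general linear model (4)."""
    N, m = P.shape
    X = np.column_stack([np.ones(N), P])       # design matrix X = (1 | P^t)
    B, *_ = np.linalg.lstsq(X, S, rcond=None)  # (c0 | C)^t, an (m+1) x n array
    c0, C = B[0], B[1:].T                      # c0: (n,), C: n x m
    R = S - X @ B                              # residuals r_j as rows
    Sigma_s = R.T @ R / (N - m - 1)            # estimate (6)
    return C, c0, Sigma_s
```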

Recall from (3) that for a prior π(θ) and an observed summary statistic s_obs, the parameter's posterior distribution for our full model M is given by

\[
\pi(\theta \mid s_\mathrm{obs}) \propto f_{M_\epsilon}(s_\mathrm{obs} \mid \theta)\,\hat\pi_\epsilon(\theta), \tag{8}
\]

where f_{M_ε}(s_obs | θ) is the likelihood of the truncated model M_ε(s_obs) given by (7) and π̂_ε(θ) is the estimated (and smoothed) truncated prior given by (5).

Performing some matrix algebra (see APPENDIX A), one can show that the posterior (8) is—up to a multiplicative constant—of the form

\[
\sum_{j=1}^{N} c_j \exp\left(-\tfrac{1}{2}(\theta - t_j)^t\, T^{-1}(\theta - t_j)\right).
\]

Here T and the t_j are given by

\[
T = \left(C^t\Sigma_s^{-1}C + \Sigma_\theta^{-1}\right)^{-1} \tag{9}
\]

and t_j = T v_j, where

\[
v_j = C^t\Sigma_s^{-1}(s_\mathrm{obs} - c_0) + \Sigma_\theta^{-1}\theta_j. \tag{10}
\]

From this we get

\[
\pi(\theta \mid s_\mathrm{obs}) \approx a \sum_{j=1}^{N} c_j \exp\left(-\tfrac{1}{2}(\theta - t_j)^t\, T^{-1}(\theta - t_j)\right), \tag{11}
\]

where a is a normalizing constant and

\[
c_j = \exp\left(\tfrac{1}{2}\, t_j^t\, T^{-1} t_j - \tfrac{1}{2}\,\theta_j^t\,\Sigma_\theta^{-1}\theta_j\right). \tag{12}
\]
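Formulas (9)–(12) translate directly into code; the sketch below computes T, the centers t_j, and the weights c_j. The weights are rescaled by their maximum for numerical stability, which is harmless since any constant factor is absorbed into a.

```python
import numpy as np

def glm_posterior_mixture(C, c0, Sigma_s, P, sigma, s_obs):
    """Components of the Gaussian mixture posterior (11)."""
    Ss_inv = np.linalg.inv(Sigma_s)
    St_inv = np.diag(1.0 / sigma ** 2)            # Sigma_theta^{-1}
    T = np.linalg.inv(C.T @ Ss_inv @ C + St_inv)  # common covariance (9)
    v = C.T @ Ss_inv @ (s_obs - c0) + P @ St_inv  # rows v_j of (10)
    t = v @ T                                     # t_j = T v_j (T is symmetric)
    # log c_j = (t_j^t T^{-1} t_j - theta_j^t Sigma_theta^{-1} theta_j) / 2, cf. (12);
    # note that t_j^t T^{-1} t_j = v_j^t T v_j = t_j . v_j
    log_c = 0.5 * ((t * v).sum(axis=1) - (P * (P @ St_inv)).sum(axis=1))
    return T, t, np.exp(log_c - log_c.max())
```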

When the number of parameters exceeds two, graphical visualization of the posterior distribution becomes impractical and marginal distributions must be calculated. The marginal posterior density of the parameter θ_k is defined by

\[
\pi(\theta_k \mid s_\mathrm{obs}) = \int \pi(\theta \mid s_\mathrm{obs})\,\prod_{l \ne k} d\theta_l,
\]

where integration is performed along all parameters except θ_k.

Recall that the marginal distribution of a multivariate normal N(t, T) with respect to the kth component is the univariate normal density N(t_k, τ_{k,k}). Using this fact, it is not hard to show that the marginal posterior of parameter θ_k is given by

\[
\pi(\theta_k \mid s_\mathrm{obs}) \approx a \sum_{j=1}^{N} c_j \exp\left(-\frac{(\theta_k - t_{kj})^2}{2\tau_{k,k}}\right), \tag{13}
\]

where τ_{k,k} is the kth diagonal element of the matrix T, t_{kj} is the kth component of the vector t_j, and c_j is still determined according to (12). The normalizing constant a could, in principle, be determined analytically but is in practice more easily recovered by numerical integration. Strictly speaking, the integration should be done only over the bounded parameter domain Π and not over the whole of R^m, but this would no longer allow for an analytic form of the marginal posterior distribution. For large values of N the diagonal elements of the matrix Σ_θ can be chosen so small that the error is in any case negligible.
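The marginal posterior (13) and its numerical normalization then take only a few more lines; theta_grid is assumed to be a uniform grid covering the support of θ_k, and T, t, c are the outputs of the mixture sketch above.

```python
import numpy as np

def marginal_posterior(theta_grid, k, T, t, c):
    """Marginal posterior (13) of theta_k, normalized on a uniform grid."""
    tau_kk = T[k, k]                            # kth diagonal element of T
    diff = theta_grid[:, None] - t[None, :, k]  # theta_k - t_kj
    dens = (c * np.exp(-0.5 * diff ** 2 / tau_kk)).sum(axis=1)
    h = theta_grid[1] - theta_grid[0]
    return dens / (dens.sum() * h)              # recover a numerically
```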

Model selection:

The principal difficulty of model selection in nonparametric settings is that it is nearly impossible to estimate the likelihood of M at s_obs due to the high dimension of the summary statistics (curse of dimensionality); see Beaumont (2007) for an approach based on multinomial logit. Parametric models, on the other hand, lend themselves readily to model selection via Bayes factors. Given the model M, one must determine the marginal density

\[
f_M(s_\mathrm{obs}) = \int_\Pi f_M(s_\mathrm{obs} \mid \theta)\,\pi(\theta)\,d\theta.
\]

It is easy to check from (1) and (2) that

\[
f_M(s_\mathrm{obs}) = A_\epsilon\, f_{M_\epsilon}(s_\mathrm{obs}).
\]

Here

\[
A_\epsilon = \int_\Pi \pi(\theta)\int_{B_\epsilon} f_M(s \mid \theta)\,ds\,d\theta \tag{14}
\]

is the acceptance rate p of the rejection process. It can easily be estimated with the aid of ABC–REJ: sample parameters θ from the prior π(θ), create the corresponding statistics s from M, and count what fraction of the statistics fall into the ε-ball B_ε(s_obs) centered at s_obs.

If we assume the underlying model of M_ε(s_obs) to be our GLM, then the marginal density of the truncated model at s_obs can be estimated as

\[
\hat f_{M_\epsilon}(s_\mathrm{obs}) = \frac{1}{N}\sum_{j=1}^{N} |2\pi\Sigma_m|^{-1/2}\exp\left(-\tfrac{1}{2}(s_\mathrm{obs} - m_j)^t\,\Sigma_m^{-1}(s_\mathrm{obs} - m_j)\right), \tag{15}
\]

where the sum runs over the parameter sample P = {θ_1, …, θ_N},

\[
m_j = C\theta_j + c_0,
\]

and

\[
\Sigma_m = \Sigma_s + C\Sigma_\theta C^t.
\]

For two models M_A and M_B with prior probabilities π_A and π_B = 1 − π_A, the Bayes factor B_AB in favor of model M_A over model M_B is

\[
B_{AB} = \frac{f_{M_A}(s_\mathrm{obs})}{f_{M_B}(s_\mathrm{obs})}, \tag{16}
\]

where the marginal densities f_{M_A}(s_obs) and f_{M_B}(s_obs) are calculated from the respective acceptance rates (14) and estimates (15). The posterior probability of model M_A is

\[
P(M_A \mid s_\mathrm{obs}) = \frac{\pi_A B_{AB}}{\pi_A B_{AB} + \pi_B}.
\]
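Once the GLM has been fitted, estimate (15) is cheap to evaluate. Below is a sketch under the same array conventions as before; to obtain the full marginal density f_M(s_obs), multiply the result by the acceptance rate (14) estimated in the rejection step, and form (16) as the ratio of these products for the two models.

```python
import numpy as np

def glm_marginal_density(C, c0, Sigma_s, P, sigma, s_obs):
    """Estimate (15): equally weighted mixture of N(m_j, Sigma_m) densities
    with m_j = C theta_j + c0 and Sigma_m = Sigma_s + C Sigma_theta C^t."""
    Sigma_m = Sigma_s + C @ np.diag(sigma ** 2) @ C.T
    Sm_inv = np.linalg.inv(Sigma_m)
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * Sigma_m))
    d = s_obs - c0 - P @ C.T                       # rows: s_obs - m_j
    quad = np.einsum('ij,jk,ik->i', d, Sm_inv, d)  # Mahalanobis-type quadratic forms
    return np.exp(-0.5 * quad).sum() / (len(P) * norm)

# Hypothetical usage for two fitted models A and B (args_A, args_B are the
# tuples (C, c0, Sigma_s, P, sigma) of each model):
# B_AB = (acc_A * glm_marginal_density(*args_A, s_obs)) / \
#        (acc_B * glm_marginal_density(*args_B, s_obs))
```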

EXAMPLES FROM POPULATION GENETICS

Toy models:

In Figure 1 we present the comparison of posteriors obtained with rejection sampling, ABC–REG, and ABC–GLM with those determined analytically (“true posteriors”). As a toy model we inferred the population–mutation parameter θ = 4Nμ of a panmictic population from the number of segregating sites S of a sample of sequences of 10,000 bp for different observed values and tolerance levels. Estimations are always based on 5000 simulations with dist(S, S_obs) < ε, and we report the average of 25 independent replications per data point. The estimation bias of the different approaches was assessed by computing the total variation distance between the inferred posterior and the true one obtained from analytical calculations using the likelihood function introduced by Watterson (1975). Recall that the L1-distance of two densities f(θ) and g(θ) is given by

\[
\|f - g\|_{L_1} = \tfrac{1}{2}\int |f(\theta) - g(\theta)|\,d\theta.
\]

It is equal to 1 when f and g have disjoint supports, and it vanishes when the functions are identical.
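On a common uniform grid of θ-values this distance reduces to a sum; a two-line helper:

```python
import numpy as np

def total_variation(f, g, h):
    """L1 / total variation distance of two densities tabulated on a uniform
    grid with spacing h: 0 for identical curves, 1 for disjoint supports."""
    return 0.5 * np.abs(f - g).sum() * h
```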

When we used a uniform prior θ ~ Unif([0.005, 10]) (Figure 1, A–C), both ABC–REG and ABC–GLM give comparable results and improve the posterior estimation compared to the simple rejection algorithm, except for very low tolerance values ε, where the rejection algorithm is expected to be very close to the true posterior. The average total variation distances over all observed data sets and tolerance values ε are 0.236, 0.130, and 0.091 for the rejection algorithm, ABC–REG, and ABC–GLM, respectively. Note that perfect matches between the approximate and the true posteriors are difficult to obtain because all approximate posteriors depend on a smoothing step that may not give accurate results close to the borders of their supports. However, when we used a discontinuous prior θ ~ Unif([0.005, 3] ∪ [6, 10]) with an admittedly extremely artificial “gap” in the middle, we observed a quite distinct pattern (Figure 1, D–F). One clearly recognizes that posteriors inferred with ABC–REG are frequently misplaced and often even farther away from the true posterior (in total variation distance) than the prior, especially in cases where the likelihood of the observed data is maximal within the gap. The reason for this is that in the regression step of ABC–REG parameter values may easily be shifted outside the prior support. This behavior of ABC–REG has been observed earlier (Beaumont et al. 2002; Estoup et al. 2004; Tallmon et al. 2004), and as an ad hoc solution Hamilton et al. (2005) proposed to transform the parameter values prior to the regression step by a transformation of the form y = ln((x − a)/(b − x)), where a and b are the lower and upper borders of the prior support interval. For more complex priors—like the discontinuous prior used here—this transformation may not work. ABC–GLM is much less affected by the gap prior than ABC–REG. The average total variation distances over all observed data sets and tolerance values ε are 0.221, 0.246, and 0.094 for the rejection algorithm, ABC–REG, and ABC–GLM, respectively. Example posteriors with S_obs = 16 based on 5000 simulations with dist(S, S_obs) < 10 are shown in Figure 2.

Figure 2.
Example posteriors for uniform (A) and discontinuous (B) priors. The model is the same as in Figure 1. Posterior estimates using ABC–GLM and ABC–REG for Sobs = 16 were based on 5000 simulations with dist(S, Sobs) < 10. ...

The success of ABC–GLM depends on how well a general linear model fits the truncated model M_ε(s_obs). Under the null hypothesis that the fit is perfect, the estimated residuals r_j (see Equation 6) are independent multivariate normally distributed random vectors. Hence the Mahalanobis distances

\[
d_j = r_j^t\,\hat\Sigma_s^{-1}\, r_j \tag{17}
\]

follow a χ²-distribution with n degrees of freedom. As a quantification of model assessment we propose to report the Kolmogorov–Smirnov test statistic between the empirical distribution of the d_j and the reference χ²-distribution. (Reporting P-values is of little use in practice, since the null hypothesis never holds exactly and hence the P-values become very small due to the large sample size.)
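A minimal version of this diagnostic using scipy; P, S and the fitted C, c0, Sigma_s follow the array conventions of the sketches above.

```python
import numpy as np
from scipy import stats

def glm_fit_statistic(P, S, C, c0, Sigma_s):
    """Kolmogorov-Smirnov statistic comparing the Mahalanobis distances (17)
    of the GLM residuals with their chi-square(n) reference distribution."""
    n = S.shape[1]
    R = S - P @ C.T - c0                          # residuals r_j as rows
    d = np.einsum('ij,jk,ik->i', R, np.linalg.inv(Sigma_s), R)
    return stats.kstest(d, stats.chi2(df=n).cdf).statistic
```

Values of this statistic above roughly 0.10 would, by the rule of thumb proposed below, flag adjustments that should not be trusted without further validation.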

When the summary statistics are created from a general linear model, the fit should be optimal. This is indeed the case, as the simulation results in Table 1 show. We performed 200 simulations of randomly created general linear models with m = 3 parameters, n = 4 summary statistics, and a multivariate normal prior. The observed statistics were also created from the respective models. For each simulated observed statistic and different acceptance rates p = 1.00, 0.50, 0.10, 0.05, and 0.01 we calculated the approximate posterior distributions π_ε, π_REG, and π_GLM for the rejection algorithm, ABC–REG, and ABC–GLM, respectively. As the prior is multivariate normal, the true posterior π_0 can be determined analytically. Table 1 contains the means and standard deviations over the 200 simulations of the total variation distances of the approximate posteriors to the true posterior π_0, as well as the means and standard deviations of the Kolmogorov–Smirnov test statistics for the GLM model fit. As expected, the model fit is perfect [i.e., the Kolmogorov–Smirnov (KS) statistic is close to 0] for acceptance rate p = 1. As the acceptance rate becomes lower, the model fit deteriorates, since the truncated model of a GLM is no longer exactly a general linear model. The total variation distance to the true posterior increases slightly as p gets smaller, but the improved rejection posterior π_ε mostly outweighs the poorer model fit. As expected in this ideal situation, ABC–GLM and ABC–REG substantially improve the posterior estimation over pure rejection.

TABLE 1
Mean and standard deviation of the L1 distance between inferred and expected posteriors for randomly generated GLMs with N_P = 3 parameters and N_S = 4 summary statistics [prior N(0, 0.2²), 200 simulations]

To test the other extreme we performed 200 simulations for a nonlinear one-parameter model with uniformly rather than normally distributed error terms; the prior was again a normal distribution. (The details of this toy model are described in APPENDIX B.) As Table 2 shows, the GLM model fit is already poor for an acceptance rate of p = 1.00 (KS statistic ≈ 0.10) and deteriorates further as p decreases. Note that the approximate posteriors π_REG and π_GLM are on average closer to the true posterior than π_ε and that both adjustment methods perform similarly. As expected, the accuracy of the posteriors increases with smaller acceptance rates, despite the fact that the model fit within the ε-ball decreases. This suggests that the rejection step contributes substantially to the estimation accuracy, especially when the true model is nonlinear. We should mention that in ~30% of the simulations both ABC–GLM and ABC–REG actually increased the distance to the true posterior in comparison to the rejection posterior π_ε. As a rule of thumb we suggest that posterior adjustments obtained by ABC–GLM or ABC–REG should not be trusted without further validation if the Kolmogorov–Smirnov statistic for the GLM model fit exceeds a value of, say, 0.10. In that case linear models are not sufficiently flexible to account for effects like nonlinearity in the parameters or nonnormality and heteroscedasticity in the error terms. In the setting of ABC–REG a wider class of models is introduced in Blum and Francois (2009), where machine-learning algorithms are applied for the parameter estimation. Whether these extensions can be applied in our context remains to be seen. The advantage of the general linear model is that estimation can be done with ordinary least squares and important quantities like marginal posteriors and marginal likelihoods can be obtained analytically. For more complex models these quantities will probably be accessible only via numerical integration, Monte Carlo methods, etc.

TABLE 2
Mean and standard deviation of the L1 distance between inferred and expected posteriors for the uniform-errors model (see APPENDIX B) with N_P = 1 parameter and N_S = 5 summary statistics {prior N(0, 2²), error Unif[−10, 10], 200 simulations}

Application to chimpanzees:

In standard taxonomies, chimpanzees, the closest living relatives of humans, are classified into two species: the common chimpanzee (Pan troglodytes) and the bonobo (P. paniscus). Both species are restricted to Africa and diverged ~0.9 MYA (Won and Hey 2005; Becquet and Przeworski 2007). The common chimpanzees are further subdivided into three large populations or subspecies on the basis of their separation by geographic barriers. Among them, the western chimpanzees (P. troglodytes verus) form the most remote group. Interestingly, recent multilocus studies found consistent levels of gene flow between the western and the central (P. t. troglodytes) chimpanzees (Won and Hey 2005; Becquet and Przeworski 2007). Nonetheless, a recent study of 310 microsatellites in 84 common chimpanzees supports a clear distinction between the previously labeled populations (Becquet et al. 2007). The same study also found, using a PCA, indications of substructure within the western chimpanzees.

To demonstrate the applicability of the model selection procedure given in the THEORY section, we contrast two different models of the western chimpanzee population with this data set: a model of a single panmictic population with constant size and a finite island model of constant size and constant migration among demes. While we estimated θ = 2N_eμ, priors were set on N_e and μ separately, with log_10(N_e) ~ Unif([3, 5]) and μ ~ N(5 × 10^{-4}, 2 × 10^{-4}) truncated to μ ∈ [10^{-4}, 10^{-3}]. In the case of the finite island model, we had an additional prior n_pop ~ Unif([10, 100]) on the number of islands, and individuals were attributed randomly to the different islands.

We obtained genotypes for all 50 individuals reported to be of western chimpanzee origin from the study of Becquet et al. (2007), excluding captive-born hybrids. We checked the proposed (Becquet et al. 2007) mutation pattern for each individual locus, and all alleles not matching the assumed stepwise mutation model were set as missing data. A total of 265 loci were used, after removing the loci on the X and Y chromosomes as well as those monomorphic among the western chimpanzees. All simulations were performed using the software SIMCOAL2 (Laval and Excoffier 2004), and we reproduced the pattern of missing data observed in the data set. Using the software package Arlequin 3.0 (Excoffier et al. 2005), we calculated two summary statistics on the data set: the average number of alleles per locus, K, and F_IS, the fixation index within the western chimpanzees. We performed a total of 100,000 simulations per model.

In Figure 3 we report the Bayes factor of the island model according to (16) for different acceptance rates A_ε; see (14). While there is large variation for very small acceptance rates, the Bayes factor stabilizes for A_ε ≥ 0.005. Note that A_ε ≤ 0.005 corresponds to fewer than 500 accepted simulations and that the ABC–GLM approach, based on a model estimation and a smoothing step, is expected to produce poor results there since the estimation of the model parameters is unreliable for such small sample sizes. The good news is that the Bayes factor is stable over a large range of tolerance values. We may therefore safely reject the panmictic population model in favor of population subdivision among western chimpanzees with a Bayes factor of B ≈ 10^5.

Figure 3.
Bayes factor for the island relative to the panmictic population model for different acceptance rates (logarithmic scale). For very low acceptance rates we observe large fluctuations whereas the Bayes factor is quite stable for larger values. Note that ...

DISCUSSION

Due to ever-increasing computational power it is nowadays possible to tackle estimation problems in a Bayesian framework for which the likelihood cannot be calculated analytically. In such cases, approximate Bayesian computation is often the method of choice. A key innovation in speeding up such algorithms was the use of a regression adjustment, termed ABC–REG in this article, which exploits the frequently present linear relationship between the generated summary statistics s and the parameters θ of the model in a neighborhood of the observed summary statistics s_obs (Beaumont et al. 2002). The main advantage is that larger tolerance values ε still allow us to extract reasonable information about the posterior distribution π(θ | s_obs), and hence fewer simulations are required to estimate the posterior density.

Here we present a new approach to estimating approximate posterior distributions, termed ABC–GLM, similar in spirit to ABC–REG but with two major advantages: First, by using a GLM to estimate the likelihood function, ABC–GLM is always consistent with the prior distribution. Second, while we do not find the ABC–GLM approach to substantially outperform ABC–REG in standard situations, it is naturally embedded into a standard Bayesian framework, which in turn allows the application of well-known Bayesian methodologies such as model averaging or model selection via Bayes factors. Our simulations show that the rejection step is especially beneficial for both ABC approaches if the true model is nonlinear. ABC–GLM is further compatible with any type of ABC sampler, including likelihood-free MCMC (Marjoram et al. 2003) or population Monte Carlo (Beaumont et al. 2009). Also, more complicated regression regimes taking nonlinearity or heteroscedasticity into account may be envisioned when the GLM is replaced by some more complex model. A great advantage of the current GLM setting is its simplicity, which renders implementation in standard statistical packages feasible.

We showed the applicability of the model selection procedure via Bayes factors by contrasting two different models of population structure among the western chimpanzees, P. troglodytes verus. Our analysis strongly suggests population substructure within the western chimpanzees, since an island model is significantly favored over a model of a panmictic population. While neither of our simple models is thought to mimic the real setting exactly, we still believe that they capture the main characteristics of the demographic history influencing our summary statistics, namely the number of alleles K and the fixation index F_IS. While the observed F_IS of 2.6% has previously been attributed to inbreeding (Becquet et al. 2007), we propose that such values may easily arise if diploid individuals are sampled in a randomly scattered way over a large, substructured population. Indeed, while it was almost impossible to simulate the value F_IS = 2.6% under the model of a panmictic population, this value easily falls within the range of values obtained from an island model.

Acknowledgments

We are grateful to Laurent Excoffier, David J. Balding, Christian P. Robert, and the anonymous referees for their useful comments on a first draft of this manuscript. This work has been supported by grant no. 3100A0-112072 from the Swiss National Foundation to Laurent Excoffier.

APPENDIX A: PROOFS OF THE MAIN FORMULAS

To keep this article self-contained, we present a proof of formulas (11) and (15). The argument is an adaptation of the proof of Lemma 1 in Lindley and Smith (1972). By linearity it clearly suffices to show the formulas for one fixed sampled parameter θ_j. The results then follow.

Theorem. Suppose that, given the parameter vector θ, the distribution of the statistics vector s is multivariate normal,

\[
s \sim \mathcal{N}(C\theta + c_0,\ \Sigma_s),
\]

and, given the fixed parameter vector θ_j, the distribution of the parameter θ is

\[
\theta \sim \mathcal{N}(\theta_j,\ \Sigma_\theta).
\]

Then:

1. The distribution of θ given s is

\[
\theta \mid s \sim \mathcal{N}(t,\ T),
\]

where T = (C^t Σ_s^{-1} C + Σ_θ^{-1})^{-1} and t = T(C^t Σ_s^{-1}(s − c_0) + Σ_θ^{-1} θ_j).

2. The marginal distribution of s is

\[
s \sim \mathcal{N}(m_j,\ \Sigma_m),
\]

where m_j = Cθ_j + c_0 and Σ_m = Σ_s + CΣ_θC^t.

Proof. By Bayes' theorem,

\[
\pi(\theta \mid s) \propto f(s \mid \theta)\,\varphi(\theta - \theta_j).
\]

The product on the right-hand side is of the form exp(−½ Q(θ)), where

\[
\begin{aligned}
Q(\theta) &= (s - C\theta - c_0)^t\,\Sigma_s^{-1}(s - C\theta - c_0) + (\theta - \theta_j)^t\,\Sigma_\theta^{-1}(\theta - \theta_j)\\
&= \theta^t\left(C^t\Sigma_s^{-1}C + \Sigma_\theta^{-1}\right)\theta - 2\,\theta^t\left(C^t\Sigma_s^{-1}(s - c_0) + \Sigma_\theta^{-1}\theta_j\right) + \mathrm{const}\\
&= (\theta - t)^t\, T^{-1}(\theta - t) + \mathrm{const}.
\end{aligned}
\]

In the last step we completed the square with respect to θ and used the fact that T is symmetric. Up to a constant that does not depend on θ we hence get

\[
\pi(\theta \mid s) \propto \exp\left(-\tfrac{1}{2}(\theta - t)^t\, T^{-1}(\theta - t)\right),
\]

where t = T(C^t Σ_s^{-1}(s − c_0) + Σ_θ^{-1} θ_j). This proves the first part of the theorem and—by linear superposition—the validity of Equation 11.

To prove the second part of the theorem, observe that s = Cθ + c_0 + ε_s with ε_s ~ N(0, Σ_s) and θ = θ_j + ε_θ with ε_θ ~ N(0, Σ_θ). Putting these equalities together, we get

\[
s = C\theta_j + c_0 + C\epsilon_\theta + \epsilon_s.
\]

This, being a linear combination of independent multivariate normal variables, is still multivariate normal with mean m_j = Cθ_j + c_0, and its covariance matrix is given by

\[
\Sigma_m = \mathrm{Cov}(C\epsilon_\theta + \epsilon_s) = C\Sigma_\theta C^t + \Sigma_s.
\]

This proves the second part of the theorem as well as formula (15). ∎

APPENDIX B: NONLINEAR TOY MODELS

In this section we describe a class of toy models that are nonlinear in the parameter θ ∈ R and have nonnormal, possibly heteroscedastic error terms. Still, their likelihoods are easy to calculate analytically. We set

\[
s_i = f_i(\theta) + \epsilon_i(\theta), \qquad i = 1, \dots, n.
\]

Here the f_i(θ) are monotonically increasing continuous functions of θ and the ε_i(θ) are independent, uniformly distributed error terms on the interval [−u_i(θ), u_i(θ)] ⊂ R, where the u_i(θ) are nondecreasing, continuous functions:

\[
f(s_i \mid \theta) = \frac{1}{2u_i(\theta)}\,\mathrm{Ind}_{[-u_i(\theta),\,u_i(\theta)]}\bigl(s_i - f_i(\theta)\bigr).
\]

It is straightforward to check that for a prior π(θ) the posterior distribution of θ given s_obs = (s_1, …, s_n) is (up to a normalizing constant)

\[
\pi(\theta \mid s_\mathrm{obs}) \propto \pi(\theta)\left(\prod_{i=1}^{n} 2u_i(\theta)\right)^{-1}\mathrm{Ind}_{[\theta_-,\,\theta_+]}(\theta),
\]

where

\[
\theta_- = \max_{1 \le i \le n}\ \inf\{\theta : f_i(\theta) + u_i(\theta) \ge s_i\}
\]

and

\[
\theta_+ = \min_{1 \le i \le n}\ \sup\{\theta : f_i(\theta) - u_i(\theta) \le s_i\}.
\]

For the simulations in Table 2 we chose n = 5 summary statistics with constant error bands u_i(θ) ≡ 10, so that the error terms are uniform on [−10, 10], together with monotonically increasing nonlinear functions f_i. The prior was θ ~ N(0, 2²).
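A simulator for one such model takes only a few lines; the particular increasing functions f_i below are our assumed illustration (any increasing continuous choice fits the framework), combined with the constant error bands u_i ≡ 10 of Table 2.

```python
import numpy as np

def nonlinear_toy_model(theta, rng, n=5):
    """One draw s = (s_1, ..., s_n) with s_i = f_i(theta) + eps_i, where the
    f_i are increasing (hypothetical choice) and eps_i ~ Unif[-10, 10]."""
    f = np.array([theta ** 3 + i * theta for i in range(1, n + 1)])
    return f + rng.uniform(-10.0, 10.0, size=n)
```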

References

  • Beaumont, M., 2007. Simulations, Genetics, and Human Prehistory—A Focus on Islands. McDonald Institute Monographs, University of Cambridge, Cambridge, UK.
  • Beaumont, M., W. Zhang and D. Balding, 2002. Approximate Bayesian computation in population genetics. Genetics 162: 2025–2035.
  • Beaumont, M., J.-M. Cornuet, J.-M. Marin and C. P. Robert, 2009. Adaptive approximate Bayesian computation. Biometrika (in press).
  • Becquet, C., and M. Przeworski, 2007. A new approach to estimate parameters of speciation models with application to apes. Genome Res. 17: 1505–1519.
  • Becquet, C., N. Patterson, A. Stone, M. Przeworski and D. Reich, 2007. Genetic structure of chimpanzee populations. PLoS Genet. 3: e66.
  • Blum, M., and O. Francois, 2009. Non-linear regression models for approximate Bayesian computation. Stat. Comput. (in press).
  • Estoup, A., M. Beaumont, F. Sennedot, C. Moritz and J. M. Cornuet, 2004. Genetic analysis of complex demographic scenarios: spatially expanding populations of the cane toad, Bufo marinus. Evolution 58: 2021–2036.
  • Excoffier, L., G. Laval and S. Schneider, 2005. Arlequin (version 3.0): an integrated software package for population genetics data analysis. Evol. Bioinform. Online 1: 47–50.
  • Greene, W., 2003. Econometric Analysis, Ed. 5. Pearson Education, Upper Saddle River, NJ.
  • Hamilton, G., M. Stoneking and L. Excoffier, 2005. Molecular analysis reveals tighter social regulation of immigration in patrilocal populations than in matrilocal populations. Proc. Natl. Acad. Sci. USA 102: 7476–7480.
  • Laval, G., and L. Excoffier, 2004. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics 20: 2485–2487.
  • Lindley, D., and A. Smith, 1972. Bayes estimates for the linear model. J. R. Stat. Soc. B 34: 1–41.
  • Marjoram, P., and S. Tavaré, 2006. Modern computational approaches for analysing molecular genetic variation data. Nat. Rev. Genet. 7: 759–770.
  • Marjoram, P., J. Molitor, V. Plagnol and S. Tavaré, 2003. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 100: 15324–15328.
  • Sisson, S., Y. Fan and M. Tanaka, 2007. Sequential Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 104: 1760–1765.
  • Tallmon, D. A., G. Luikart and M. A. Beaumont, 2004. Comparative evaluation of a new effective population size estimator based on approximate Bayesian computation. Genetics 167: 977–988.
  • Tavaré, S., D. Balding, R. Griffiths and P. Donnelly, 1997. Inferring coalescence times from DNA sequence data. Genetics 145: 505–518.
  • Watterson, G., 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276.
  • Wegmann, D., C. Leuenberger and L. Excoffier, 2009. Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics 182: 1207–1218.
  • Weiss, G., and A. von Haeseler, 1998. Inference of population history using a likelihood approach. Genetics 149: 1539–1546.
  • Won, Y., and J. Hey, 2005. Divergence population genetics of chimpanzees. Mol. Biol. Evol. 22: 297–307.
  • Zellner, A., 1971. An Introduction to Bayesian Inference in Econometrics. Wiley, New York.
