- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

- Journal List
- Genetics
- v.169(3); Mar 2005
- PMC1449551

# Bayesian Analysis of an Admixture Model With Mutations and Arbitrarily Linked Markers

^{*}Institut National de la Recherche Agronomique, Centre de Biologie et de Gestion des Populations (CBGP), Campus International de Baillarguet, 34988 Montferrier-sur-Lez Cedex, France

^{†}Computational and Molecular Population Genetics Lab (CMPG), Zoological Institute, University of Bern, 3012 Bern, Switzerland

^{1}

*Corresponding author:*Computational and Molecular Population Genetics Lab, Zoological Institute, University of Bern, Baltzerstrasse 6, 3012 Bern, Switzerland. E-mail: hc.ebinu.ooz@reiffocxe.tnerual

## Abstract

We introduce here a Bayesian analysis of a classical admixture model in which all parameters are simultaneously estimated. Our approach follows the approximate Bayesian computation (ABC) framework, relying on massive simulations and a rejection-regression algorithm. Although computationally intensive, this approach can easily deal with complex mutation models and partially linked loci, and it can be thoroughly validated without much additional computation cost. Compared to a recent maximum-likelihood (ML) method, the ABC approach leads to similarly accurate estimates of admixture proportions in the case of recent admixture events, but it is found superior when the admixture is more ancient. All other parameters of the admixture model such as the divergence time between parental populations, the admixture time, and the population sizes are also well estimated, unlike the ML method. The use of partially linked markers does not introduce any particular bias in the estimation of admixture, but ML confidence intervals are found too narrow if linkage is not specifically accounted for. The application of our method to an artificially admixed domestic bee population from northwest Italy suggests that the admixture occurred in the last 10–40 generations and that the parental *Apis mellifera* and *A. ligustica* populations were completely separated since the last glacial maximum.

HYBRID populations have been central to theories on adaptation and speciation (Barton 2001), and their study has encountered a new interest since it was shown that they could be ideal in detecting disease genes (Chakraborty and Weiss 1988). The assessment of the degree of admixture of a given population has traditionally relied on the comparison of allele frequencies between two potential parental populations and a putative hybrid population (Roberts and Hiorns 1965; Chakraborty and Weiss 1988; Long 1991). Recently, these methods have been improved by incorporating information on the molecular diversity present in the admixed and in parental populations (Bertorelle and Excoffier 1998; Dupanloup and Bertorelle 2001) or by explicitly taking into account the genetic drift of allele frequencies since the admixture event (Chikhi *et al.* 2001; Wang 2003). However, the accuracy of the estimation of the contribution of the parental populations to the hybrid depends highly on the extent of differentiation between parental populations (Bertorelle and Excoffier 1998) and the time elapsed since the admixture event (Chikhi *et al.* 2001; Choisy *et al.* 2004). No single method was found to date superior to others in all circumstances (Choisy *et al.* 2004). Recent likelihood-based methods, including Bayesian (Chikhi *et al.* 2001) and maximum-likelihood (Wang 2003) approaches, are computationally intensive but have been shown to produce estimates with smaller variances across independent replicates or simulations, especially when the estimate was based on a small number of loci (Wang 2003; Choisy *et al.* 2004). A promising alternative to these methods has been the development of an approach using nongenetic information to more precisely define the contribution of sampled populations to the hybrid (Gaggiotti *et al.* 2002, 2004). Finally, recognizing that a major drawback of all these former approaches is to require an explicit definition of the source populations, some recent methods have attempted to identify admixed individuals without requiring the source parental populations to be defined (Pritchard *et al.* 2000; Dawson and Belkhir 2001; Anderson and Thompson 2002; Falush *et al.* 2003), but their statistical power remains to be assessed.

As stated previously, a common problem with most of the previous methods is their inability to explicitly handle mutations (but see Bertorelle and Excoffier 1998; Dupanloup *et al.* 2004), whereas this is likely to be particularly important when the admixture event is ancient. While ML methods have the potential to provide accurate estimations of demographic and mutational parameters, the calculation of likelihoods under models of nonrecent hybridization events for which mutations have to be taken into account at both independent and partially linked markers remains problematic. A powerful Bayesian alternative to likelihood computation for parameter estimation has been introduced recently (Fu and Li 1997; Tavaré *et al.* 1997; Pritchard *et al.* 1999; Estoup *et al.* 2001), dubbed as approximate Bayesian computation (ABC; Beaumont *et al.* 2002; Marjoram *et al.* 2003). This approach does not require the computation of likelihoods, but simply relies on the comparison of summary statistics computed on observed data with those computed on data simulated under a model for which the parameters of interest are known (Beaumont *et al.* 2002; Marjoram *et al.* 2003). Although the ABC method relies on summary statistics and thus does not use all available data, it has been shown to provide very accurate results in the analysis of relatively simple evolutionary scenarios where full maximum-likelihood methods were available (Beaumont *et al.* 2002; Marjoram *et al.* 2003). Hence, by construction, ABC methods have the potential to consider models of any complexity, provided only that data can be simulated under the model. Recent applications of the latest developments of ABC methods (Beaumont *et al.* 2002) illustrate their potential for the analysis of complex demographic scenarios (Estoup and Clegg 2003; Estoup *et al.* 2004). Recent coalescent-based packages (*e.g.*, Hudson 2002; Laval and Excoffier 2004) provide an efficient tool for simulating genetic data under complex scenarios (including introgression or hybridization scenarios) and have the potential to generate data for independent or partially linked markers. Such versatile simulation packages make it possible, even for biologists unfamiliar with simulation algorithms, to perform parameter estimation under the ABC framework and consider various evolutionary scenarios.

In this article, we apply the ABC method to the estimation of all the parameters of an explicit admixture model (Figure 1) defined previously (Bertorelle and Excoffier 1998; Wang 2003) and described in methods. We use the SIMCOAL2 coalescent simulation program (Laval and Excoffier 2004) to generate a large number of microsatellite data sets for random values of the admixture model parameters, on which several summary statistics are evaluated. These simulated summary statistics are used for parameter estimation in a series of test data sets, which allows us to validate our approach and to compare its performance with a previously published maximum-likelihood (ML) method (Wang 2003). The method is then applied to the case of an admixed population of honeybees from northwestern Italy.

## METHODS

### The demographic model:

To compare the behavior and performances of our approach with previous methods, we used a classical admixture scenario described in Figure 1 and similar to that used in previous studies (*e.g.*, Long 1991; Bertorelle and Excoffier 1998; Wang 2003; Choisy *et al.* 2004).

### The genetic model:

Unlike almost all methods considering that gene frequencies evolve only through genetic drift, our approach also takes mutations into account (as in Bertorelle and Excoffier 1998). This involves the choice of a mutation model and of its parameters. We restricted our study to microsatellite markers for which we used a multistep mutation model, sometimes called generalized stepwise mutation (GSM) model (Zhivotovsky *et al.* 1997; Estoup *et al.* 2002), requiring two parameters per locus: the mutation rate (μ* _{i}*) and the coefficient (

*P*) of the geometric distribution of the length by which a new mutant allele differs from its ancestor. However, these two series of parameters are considered as nuisance parameters, and we will pay attention only to their average values across loci: and .

_{i}Data thus consist here of multilocus genotypes of *n* individuals sampled from each of the three populations.

### The ABC approach:

The rationale and the full description of the ABC method are given in Beaumont *et al.* (2002). In short, the approach involves three successive steps detailed in Figure 2. The first step (simulation step) consists of simulating many (typically 1 million) multilocus data sets with characteristics similar to the observed data set (same number of samples, same number of individuals per sample, same number of loci), using parameter values randomly drawn from some prior distributions (as defined in Table 1). The second step consists of comparing the simulated data set to an observed data set, by mean of a series of summary statistics, retaining the simulations that are arbitrarily close to the observations, and rejecting the other simulations. Finally, the third step is the estimation of the parameters by performing a multiple and locally weighted linear regression on the summary statistics associated with the retained simulations. The set of simulations retained for parameter estimation was selected by strictly following Beaumont *et al.* (2002), by computing a Euclidean distance (δ) between simulated and observed summary statistics and retaining the 1000 simulations having the smallest δ distance (being closest) to the test data set.

**...**

The SIMCOAL2 program (Laval and Excoffier 2004), freely available on http://cmpg.unibe.ch/software/simcoal2, has been used to generate microsatellite data sets in the first step, and a new program (abcEst) has been developed for parameter estimation (step 3 in Figure 2). The program abcEst (Windows or Linux version) is available from L. Excoffier upon request. Compared to the published version of the SIMCOAL2 program, two enhancements were added: the implementation of the generalized stepwise mutation model and the possibility of having different mutation rates at different loci. Microsatellite allele size constraints were included in our simulations by imposing reflecting boundaries at the edge of an allele size range of 30 continuous allelic states (Feldman *et al.* 1997; Pollock *et al.* 1998). This range is consistent with empirical data on repeat numbers at microsatellites in various species (*e.g.*, Garza *et al.* 1995; Goldstein and Pollock 1997; Estoup *et al.* 2000).

Regarding mutation modeling, we draw for each simulation an average mutation rate across loci from a log Uniform distribution, and individual locus mutation rates are then drawn from a Gamma distribution with mean equal to . A similar procedure is also used for the average and individual locus coefficients of the geometric distribution of step lengths and *P _{i}* (see Table 1 for details). Note that we have chosen to implement this hierarchy of parameters and did not draw locus-specific parameters μ

*and*

_{i}*P*from unique distributions, since the average parameters and would have been virtually identical across simulations of a large number of loci and equal to the mean of the priors. Their estimation would thus have been meaningless. Note also that we have chosen a relatively broad prior for compared to previous studies (

_{i}*e.g.*, Wilson and Balding 1998), such as to cover a wide range of possible mutation rates (see Table 1).

In addition to the 9 basic parameters of the admixture model (the admixture proportion λ, the four effective population sizes, the time of divergence *t*_{DIV}, the time of admixture *t*_{ADM} counted in generations, and the mutational parameters and ), 11 composite parameters were computed and recorded. They correspond, respectively, to the times of divergence and admixture scaled by the population sizes (*t*/*N _{i}*, with

*t*=

*t*

_{ADM}or

*t*

_{DIV}, and

*i*= 0, 1, 2, or

*A*), to the population sizes scaled by the mutation rate (θ

*= 2*

_{i}*N*μ, with

_{i}*i*= 0, 1, 2, and

*A*), and to the times of divergence and admixture scaled by the mutation rate (τ = 2

*t*μ, with

*t*=

*t*

_{ADM}or

*t*

_{DIV}). The estimation procedure was thus carried out separately on the 9 basic parameters as well as on the 11 composite parameters.

### Summary statistics:

The following 15 summary statistics were computed on all the simulated microsatellite data sets: the average number of alleles over loci for each of the two parental and the admixed population samples, the average heterozygosity over loci and average modified *M* statistics (Garza and Williamson 2001) over loci for the same three samples, the (δμ)^{2} genetic distance (Goldstein *et al.* 1995) between the two parental population samples, the measure of differentiation *F*_{ST} (Weir and Cockerham 1984) between all three pairs of population samples, the average extent of linkage disequilibrium *D*′ between independent markers in the admixed population, and the *m _{Y}* admixture coefficient estimator (Bertorelle and Excoffier 1998). The formula of the modified

*M*statistics is , where

*k*is the number of alleles at the

_{l}*l*th locus,

*r*is the difference in number of repeats between the largest and the smallest allele at locus

_{l}*l*(

*i.e.*, the range of allele sizes), and

*L*is the number of loci. Compared to its original definition (Garza and Williamson 2001), it just avoids a division by zero when a gene sample is fixed for a single allele. Note that the summary statistics were chosen such as to capture different features of the data, both at the within- and at the between-population level. This choice is partially arbitrary, since there is currently no objective way to define an optimal set of statistics (Beaumont

*et al.*2002), but we have tried to use statistics thought to be informative for some of the parameters of our model. For instance, one would expect heterozygosity to be informative for the estimation of population size, but it should also depend on the admixture proportion in the hybrid population. Also, pairwise

*F*

_{ST}'s are expected to bring information about divergence times between parental populations and about admixture proportions. The

*m*admixture coefficient should obviously bring information on admixture proportion, while

_{Y}*D*′ in the admixed population should decay with admixture time, but also depend on the absolute sizes of the populations (drift). However, we did not attempt here to define an optimal set of statistics or to study the effect of removing or adding summary statistics, which could be the subject of a later study.

### Simulated data sets:

A first series of 10^{6} data sets was simulated and consisted of 50 diploid individuals (100 genes) typed at 50 independent microsatellite loci. This large data set was fractioned into subsets to study the effect of sample size and number of loci on parameter estimation, and thus data sets consisting of 5, 10, 20, and 50 loci studied in samples of 20 and 100 genes were obtained. A second series of 10^{6} data sets, consisting of 50 diploid individuals typed at a mixture of 20 independent and partially linked loci, was simulated. The 20 loci consisted of two unlinked groups of 10 partially linked loci. Each group of 10 partially linked loci was itself divided into two subsets of 5 completely linked loci (genetic distance of 0 cM), 1 cM distant from each other. The 190 pairs of loci thus fell into three linkage categories: unlinked (100 pairs of loci), partially linked at 1 cM (50), and totally linked (40). The coefficient of linkage disequilibrium *D*′ was computed separately in the three categories of markers, thus adding two summary statistics to these simulated data sets with recombination. Note that our choice of three categories of linkage is somewhat arbitrary. While the “completely linked” and independent sets of markers are easy to justify and are commonly found in many data sets, the spacing of 1 cM was chosen such as to have a different amount of loss of potential disequilibrium created by the admixture process over the time periods studied below. Indeed, one would expect that markers 1 cM apart would lose ~5, 63.4, and 98.2% of the original disequilibrium caused by the admixture after 5, 100, and 400 generations, respectively, thus allowing one to potentially use linkage disequilibrium (LD) to estimate admixture time.

### Performance evaluation and test data sets:

The performances of our ABC approach were evaluated in a series of samples having fixed values of the admixture model. For each combination of parameters, the SIMCOAL2 program was used to generate 100 data sets, on which summary statistics were computed and then used as pseudo-observed summary statistics. The same data set was also used as input to a recent ML method (Wang 2003) denoted hereafter WANG03. The latter method has been chosen for a comparison with our approach, because it has been shown to produce good estimates of admixture coefficients, and because it estimates other parameters of the admixture model that can be also compared with those of our ABC method. Moreover, compared to the method of Chikhi *et al.* (2001), Wang's ML method was notably faster, allowing us to get 100 estimates for fixed simulated parameter values in a reasonable amount of time.

It is worth noting that while the simulation of 1 million data sets and the computation of their associated summary statistics for our ABC approach is time consuming (~12 hr on 15 computer nodes), the ABC estimation of the parameters on a given test data set takes only seconds to minutes, so that the evaluation of the performance of our estimation procedure can be easily achieved without much additional computing cost. This evaluation was thus performed in seven situations. Due to the huge amount of computations needed for the comparisons presented here, a few parameters were fixed across the simulations. The population sizes (numbers of genes) were set to 300, the average mutation rate to 0.0005 (reviewed in Ellegren 2004), and the geometric coefficients *P* to 0.3 (*e.g.*, Estoup *et al.* 2002). The first situation modeled a recent admixture (*t*_{ADM} = 5 generations, *i.e.*, *t*_{ADM}/*N*_{e} = 0.0167), an ancient divergence (*t*_{DIV} = 5000 generations, *i.e.*, *t*_{DIV}/*N _{i}* = 16.7), and an admixture proportion of 0.3. This situation was used to evaluate the effects of different numbers of loci and of different sample sizes (Table 2). The other six situations were chosen to evaluate the effects of increasing the time of admixture for two different admixture proportions and of having partially linked markers. The performance of our ABC method and of WANG03 was characterized by the

*relative bias*(average difference between the estimate and the true value divided by the true value), the

*relative root mean square error*(RMSE—square root of the mean square error divided by the true value), the

*95% coverage*(proportion of times in which the true value is within the equal-tailed 95% confidence or credible interval around the estimate), and the

*factor 2*(proportion of times in which the estimated value is in an interval bounded by values equal to 50 and 200% that of the true value). All measurements of bias, RMSE, and factor 2 were computed by taking the mode of the posterior distribution as a point estimate. The factor 2 parameter is intuitively appealing and brings qualitatively different information than the 95% coverage. It indeed tells users how often the estimator is arbitrarily close (factor 2 here) to the true value, while the inclusion of the true value within a confidence interval does not imply that the estimated parameter is “close” to its true value, as this depends on the width of this interval. All measures of performance were estimated over 100 simulated test data sets. Note that 100 replicates may not be enough to get very accurate estimates of relative RMSEs, so that the numbers for this measure should be considered as indicative only.

## RESULTS

### Recent admixture events:

The performance of the ABC method on the recovery of admixture proportions λ for different numbers of loci and different samples sizes is reported in Table 2 and compared to the ML method of Wang (2003). This comparison is based on a scenario that can be considered as advantageous for admixture estimation, because it involves a small admixture time (5 generations) and a long divergence time (5000 generations) relative to the population size (300 genes). In that case, when ABC estimation is performed on 1 million simulated samples, its performance is very similar to Wang's ML method, as attested by the relative RMSE, especially when the number of loci is high (20 or more). As expected, estimations obtained with 1 million simulations are more accurate than those obtained with 100,000 simulations. However, the latter are already quite good with virtually identical negative relative bias and only slightly larger relative RMSE. Note, however, that the same trend is visible in Figure 3, where we report the posterior distributions obtained from the analysis of a single (randomly chosen) case from 10^{6} or 10^{5} simulations. While the modes of the distributions (taken as a point estimate) obtained from the analysis of 10^{6} or 10^{5} simulations are very similar, the distributions obtained from 10^{6} simulations are usually narrower and would lead to smaller credible intervals. We note here that the ABC method generally produces a small negative bias consisting of underestimating the contribution of the source population contributing the least to the admixed population, but that this bias becomes negligible with a larger number of loci.

**...**

The ABC and Wang's ML methods are found consistent as their accuracy increases with larger samples sizes and larger numbers of loci. They both produce estimates that are almost always closer than a factor 2 from the true value. The only notable difference between the two methods is in the coverage of the 95% confidence intervals around the estimated values: the ABC method tends to produce conservative (too broad) intervals, while Wang's ML method gives too narrow intervals with larger samples where the true value is found only in <90% of the cases.

### Old admixture events:

In Table 3, we report the effect of older admixture times on the estimation of the admixture rate for 20 independent or 20 partially linked loci. While the ABC and Wang's ML methods have very similar performance for short admixture time, the ABC method produces more accurate results when the admixture event occurred >100 generations ago, as shown by much smaller relative RMSE values, higher factor 2 scores, and much better coverage properties for the ABC than for the ML method. For both unlinked and partially linked loci, it is important to note that the coverage of the ABC 95% confidence intervals is always very good. On the other hand, confidence intervals provided by the ML method become poorer with longer admixture time for unlinked loci and are already much too low in the case of a recent admixture studied with partially linked loci. The latter effect is certainly due to the fact that the ML method assumes that the loci are unlinked. As a consequence, loci that are correlated provide similar information and tend to generate thinner distributions because they overestimate the amount of information in the data. This is not the case for the ABC method since we explicitly model the correlation between partially linked markers in our simulations.

**Effect of admixture time and partial linkage on the estimation of the admixture rate λ by the ABC and Wang's (2003) methods**

While 20 independent loci provide accurate estimation of admixture rates, there is a serious drop in the quality of the ABC estimates based on partially linked markers, especially for very unequal contribution of the parental population to the admixed population (*i.e.*, λ = 0.1). The decrease in ABC accuracy between linked and unlinked loci is especially marked for older admixture events. Curiously, the ML method is less affected than the ABC method by partial linkage, in the sense that its performance evaluated by the relative bias and RMSE does not degrade much when partially linked markers are used instead of independent markers. However, although the ABC method somewhat suffers from the use of nonindependent loci, its relative RMSE remains two to three times lower than that obtained from the ML method for the oldest admixture times (400 generations).

### Estimation of divergence and admixture times:

Wang's ML method provides estimates of composite parameters such as divergence and admixture times scaled by population sizes; we report in Table 4 the corresponding parameters obtained from the ABC method. Because this ML method assumes that no mutation occurred since the divergence of the two parental populations, and thus that genetic differences between populations are due to a pure drift process, it leads to grossly underestimated divergence and admixture times and presents poor coverage property, even for recent admixtures. By contrast, the divergence time scaled by parental population size *N*_{2} (*t*_{DIV}/*N*_{2}) is only slightly overestimated with the ABC method from both linked and unlinked markers, with good coverage and factor 2 scores. The admixture time scaled by parental population size *N*_{2} (*t*_{ADM}/*N*_{2}) is very well estimated by the ABC method when it is relatively ancient and is underestimated only by 12 and 48% on average when it is recent (five generations) for unlinked and linked markers, respectively. This parameter is also, to a lesser extent, well estimated by the ML method when admixture is recent. However, it is increasingly underestimated for older admixture times, resulting in a virtual absence of coverage by the ML confidence intervals for admixture times ≥100 generations. Finally, the admixture time scaled by the admixed population size *N*_{A} (*t*_{ADM}/*N*_{A}) is only relatively well estimated by the ML method for recent admixtures. Its estimation follows a more complex pattern for the ABC method. The bias is large and negative for recent admixture events, and it becomes positive and associated with a large RMSE for *t*_{ADM} = 100; for older admixture times (*t*_{ADM} = 400), the bias becomes very low and the relative RMSE drops considerably. This pattern is probably due to the poor estimation of the admixed population size *N*_{A} for short admixture times, since small or large population sizes will not create very contrasting patterns of diversity in a few generations, while they should lead to more contrasted patterns for longer evolutionary periods such as a few hundred generations.

### ABC estimation of mutation-scaled parameters:

In Table 5, we present results on the estimation of composite parameters depending on mutations. These parameters are computed only in the ABC method so that comparison with Wang's ML method is not possible. The scaled divergence time τ_{DIV} is relatively well estimated for short admixture time (17% of positive bias) and its relative RMSE is only slightly increased with longer admixture times, resulting in a small drop (96–89%) for the factor 2 score. The scaled admixture time τ_{ADM} is increasingly better estimated with older admixture events, in keeping with results obtained for the scale parameter *t*_{ADM}/*N*_{2}. The relatively poor recovery of this parameter for recent admixture is also visible in Figure 3, where the posterior distribution of τ_{ADM} is not centered at all around the true value in that case. The scaled population sizes θ_{A} and θ_{2} (θ_{1} is not shown in Table 5, but follows the same pattern as θ_{2}) are very well estimated even for old admixture times, while the scaled size of the admixed population (θ_{A}) is better estimated with increasing admixture times. For *t*_{ADM} = 400, θ_{A} estimation shows virtually no relative bias (−0.4%), a relative RMSE (31%) becoming very similar to that of θ_{2} (26%), and an excellent factor 2 score (98%). The relatively flat posterior distribution of θ_{A} for recent admixtures (five generations) underlines the absence of information in the data for such recent events (Figure 3). On the other hand, the mean parameter of the geometric distribution of the GSM model is well estimated with 20 loci and does not seem much affected by the age of admixture. Finally, we note that the coverage of the 95% confidence intervals is very good for all parameters and tends to be too conservative except for .

### Application to a honeybee data set:

This honeybee data set has been previously described and analyzed in Choisy *et al.* (2004). The population under study is located in Courmayeur at the extreme north of the Aosta valley (northwestern Italy) and represented by a sample of 33 worker bees (one per colony). It is considered an artificially admixed population between two different subspecies of *Apis mellifera*, the West-European black honeybee (*A. m. mellifera*) and the Italian yellow honeybee (*A. m. ligustica*). The two parental populations are represented in the analysis by a sample of *A. m. mellifera* from the sanctuary of Ouessant (French Brittany, *n* = 49) and a sample of *A. m. ligustica* from Forli (Emilia-Romania, *n* = 19), an area of intensive queen rearing for exportation. All sampled honeybees were characterized at eight microsatellite loci, and the admixture coefficient of the Courmayeur sample has already been estimated by six different methods (see Choisy *et al.* 2004 for more details). Such estimates of the proportion of *A. m. mellifera* genes in the Courmayeur genetic pool ranged from 0.195 to 0.371 (Choisy *et al.* 2004). Table 6 shows that our ABC estimate (0.259) is well within this range, as is Wang's ML estimate (0.287). These two methods also agree in their estimates of the time of admixture, which is ~0.01–0.02 in units of *N*. Considering that effective population sizes (in number of gene copies) in European honeybee subspecies are of the order of 1000–2000 [Estoup *et al.*'s (1995) Table 4], this implies a rather recent admixture of 10–40 generations, corresponding to 20–80 years (using an average generation time of 2 years for the queens). This is in good agreement with the development of the Italian queen selling industry in Europe in the middle of the twentieth century. As expected from our previous simulations (Table 4), the two methods provide very different estimates of the time of divergence of the two parental populations scaled by effective population sizes. Wang's ML estimates are ~0.15–0.25, whereas the ABC estimates reach 7.2–8.3. *A. m. ligustica* and *A. m. mellifera* have long been considered as two very distinct subspecies of honeybees. At the end of the 1980s (*e.g.*, Ruttner 1988), the current theory based on paleogeography and morphometry was that the Quaternary ice ages were responsible for the separation of the two subspecies, so the divergence time was estimated at ~50,000 years before present (BP). However, mitochondrial studies showed that these two subspecies belonged to two highly divergent lineages having probably diverged ~1 million years ago (Garnery *et al.* 1992). Quite recently, Franck *et al.* (2000) showed that the subspecies *ligustica* had actually a hybrid origin using a much larger sample of colonies, and that its genetic pool was a mixture of two lineages: the M lineage constituted mainly by the *mellifera* subspecies and the C lineage encompassing the South-European subspecies *carnica* and *cecropia*, as well as the Asian *caucasica*. According to Franck *et al.* (2000), the admixture might have taken place any time after the Riss period (in the last 130,000 years), and it is probably rather ancient. The estimated divergence time could thus not correspond to the separation of the C and M lineages, but rather to the time when the admixed *ligustica* and the *mellifera* subspecies were last separated. If we admit the timing given by Franck *et al.* (2000), a sensible estimate would be some time during the last ice age (which at maximum occurred 22,000–14,000 years BP), when honeybee populations were restricted to southern Mediterranean refuges (namely the Iberian and Italian peninsulas, respectively). Taking population sizes as above, we get divergence time estimate intervals of 150–500 years with Wang's ML estimates and 14,400–33,200 years with our ABC approach. Wang's ML estimates for the time of divergence of the two subspecies hence appear clearly underestimated, while the ABC method gives estimates much more compatible with our current knowledge of the evolutionary history of European honeybee populations.

The ABC approach also allows the estimation of several other parameters not estimated by Wang's ML method (Table 6), such as the mutation scaled population sizes (θ's) and the times of divergence τ_{DIV} or admixture τ_{ADM}. Using the mode of the posterior distribution of the average mutation rate (1.85 × 10^{−4}), we obtain an estimate of 23,665 generations (47,330 years) for the divergence time and 26 generations (52 years) for the time of admixture. Both values are in excellent agreement with those mentioned above and with other studies (Ruttner 1988; Franck *et al* 2000). The average geometric coefficient of the GSM mutation model is very high (0.446) and very close to the upper bound of our prior distribution (Table 1). This extreme value implies a surprisingly large proportion of mutations leading to non-single-step mutations (precisely 0.446; Estoup *et al.* 2002). This probably results from the fact that this data set does not fit well to the modeled scenario. More specifically, the potential hybrid nature of one parental population (*ligustica*) may have widened the distribution of allele lengths in the corresponding sample, forcing the analysis to increase the average length of the mutation steps to cope with this widened allelic distribution.

## DISCUSSION

This study shows that the ABC framework allows a fine analysis of an admixture model, providing very satisfactory estimates of admixture rate (λ), mutation-scaled parental population sizes (θ_{1} and θ_{2}), and divergence time τ_{DIV}, as well as those of the mutation model. Estimates of scaled ancestral population size (θ_{0}) are usually poor, and those of the admixed population size (θ* _{A}*) are good only when the admixture time is large. The mutation-scaled admixture time (τ

_{ADM}) is itself very well estimated when the admixture event is relatively old (100 or more generations), while it leads to reasonable point estimates but large credible intervals when it is very recent. Unscaled parameters, such as raw population sizes and raw divergence and admixture times, were usually not estimated as well as the scaled parameters (results not shown), as they do not have independent and contrasting effects on genetic diversity. However, it is worth noting that the size of the admixed population

*N*

_{A}was very well estimated in the case of old admixture events (

*i.e.*, 400 generations). As shown in Table 5, the relative bias on

*N*

_{A}is indeed <10% when the admixture time is 400 generations, while it was ~560% for an admixture event only 5 generations old. This result suggests that the absolute size of old admixed populations could be well estimated under our framework. This is probably because our method implicitly attempts to reconstruct the genetic composition of the admixed population at the time of admixture, which puts us into a framework very similar to a temporal spacing of samples, which is the ideal situation for estimating population sizes independently from mutation rates (

*e.g.*, Williamson and Slatkin 1999; Anderson

*et al.*2000; Berthier

*et al.*2002).

Compared to Wang's ML method, our ABC approach shows comparable performance for the estimation of the admixture coefficient when admixture is recent, but leads to increasingly better relative results when the admixture time is older. We attribute this better performance to the specific handling of mutations, which cannot be neglected when admixture time is ancient. However, to estimate admixture coefficients, methods based on a pure drift process are not handicapped by mutations having occurred before the admixture, as they merely result in larger diversity in parental populations. Drift-based (like current likelihood-based) methods seem also to better deal with short divergence time between parental populations (*e.g.*, 200 generations instead of 5000) than does our ABC procedure when the admixture is recent (results not shown). However, this advantage is valid only for recent admixtures (*e.g.*, <50 generations). Another advantage of the present ABC approach is its ability to correctly estimate other parameters of the admixture model, such as divergence and admixture times. These parameters are often as important as the admixture coefficient itself. The better performance of our approach is probably linked to the fact that we are using information not specifically handled by Wang's ML method, such as information on patterns of LD and mutations, as well as range of allele size. Moreover, our ABC approach allows us to explicitly include information on partial linkage between markers, so that, in contrast to Wang's ML method, accurate confidence intervals are also obtained in this case.

While the admixture model analyzed here (with a hybrid population and two isolated parental populations at mutation-drift equilibrium) corresponds to the standard model assumed by most methods of estimation of admixture coefficients (*e.g*., Long 1991; Bertorelle and Excoffier 1998; Wang 2003; Choisy *et al.* 2004), real models of admixture may be much more complex. They may indeed involve: (i) more than two source populations (Dupanloup and Bertorelle 2001); (ii) some regular (and thus not instantaneous) admixture events over relatively long periods (reviewed in Chakraborty 1986); (iii) subdivided source populations, so that the actual parental population is only partially sampled; and (iv) parental population(s) that are not at mutation-drift equilibrium, due to population size fluctuations or introgression event(s) in a more or less recent past. The ABC approach has the potential to assess the effect of such deviations from the standard admixture model on parameter estimations since the ratio of acceptance under two alternative models approximates the Bayes factor (*e.g.*, Estoup *et al.* 2004; Pritchard *et al.* 1999). Such quantitative model comparisons could be particularly useful in the present evolutionary context to assess the likelihood of different deviations from the standard admixture model and hence learn more about the admixture process that produced the observed data set and potentially consider more realistic models for parameter estimation.

A general feature of the ABC methods that should be underlined here is their ability to assess their performance at almost no extra computation cost. Other estimation methods generally require a validation step, which includes the time-consuming analysis of independently produced simulated data sets (*e.g.*, Choisy *et al.* 2004), whereas this is intrinsic in the ABC approach (*cf.* Figure 2). As a matter of fact, the same ABC process used to build the reference table can be derived to produce test data sets with known values of parameters. The same rejection and regression steps can then be applied to these data sets to produce estimates of parameters that can be compared to their known true values. It is therefore relatively quick and easy to evaluate the performance of the method for any subset of the parameter space under a given model. The applicability of the ABC method to particular cases should, however, depend on available computer power, as a few days of computing time are necessary to obtain a large number of simulated summary statistics from which the estimation procedure proceeds (*e.g.*, 10^{6} iterations). However, reasonable point estimates can be obtained using much fewer simulations and hence shorter computation times (*e.g.*, 10^{5} iterations). It seems reasonable to anticipate that progress in simulation algorithms and higher computing power will be available in future years, promoting the ABC method as the method of choice for analyzing complex evolutionary scenarios and, more specifically in the context of the present study, for old admixture models in which mutation cannot be neglected or when nonindependent markers are available.

## Acknowledgments

We are grateful to Lounès Chikhi and Mark Beaumont for their comments on the manuscript. L.E. was supported by Swiss National Science Foundation grant 3100A0-100800, as well as a grant from the Institut de la Recherche Agronomique during his 2004 sabbatical visit at the Centre de Biologie et de Gestion des Populations. This study was also partially supported by a grant from the French Bureau des Ressources Génétiques.

## References

- Anderson, E. C., and E. A. Thompson, 2002. A model-based method for identifying species hybrids using multilocus genetic data. Genetics 160
**:**1217–1229. [PMC free article] [PubMed] - Anderson, E. C., E. G. Williamson and E. A. Thompson, 2000. Monte Carlo evaluation of the likelihood for
*N*_{e}from temporally spaced samples. Genetics 156**:**2109–2118. [PMC free article] [PubMed] - Barton, N. H., 2001. The role of hybridization in evolution. Mol. Ecol. 10
**:**551–568. [PubMed] - Beaumont, M. A., W. Zhang and D. J. Balding, 2002. Approximate Bayesian computation in population genetics. Genetics 162
**:**2025–2035. [PMC free article] [PubMed] - Berthier, P., M. A. Beaumont, J. M. Cornuet and G. Luikart, 2002. Likelihood-based estimation of the effective population size using temporal changes in allele frequencies: a genealogical approach. Genetics 160
**:**741–751. [PMC free article] [PubMed] - Bertorelle, G., and L. Excoffier, 1998. Inferring admixture proportions from molecular data. Mol. Biol. Evol. 15
**:**1298–1311. [PubMed] - Chakraborty, R., 1986. Gene admixture in human populations: models and predictions. Yearb. Phys. Anthropol. 29
**:**1–43. - Chakraborty, R., and K. M. Weiss, 1988. Admixture as a tool for finding linked genes and detecting that difference from allelic association between loci. Proc. Natl. Acad. Sci. USA 85
**:**9119–9123. [PMC free article] [PubMed] - Chikhi, L., M. W. Bruford and M. A. Beaumont, 2001. Estimation of admixture proportions: a likelihood-based approach using Markov chain Monte Carlo. Genetics 158
**:**1347–1362. [PMC free article] [PubMed] - Choisy, M., P. Franck and J. M. Cornuet, 2004. Estimating admixture proportions with microsatellites: comparison of methods based on simulated data. Mol. Ecol. 13
**:**955–968. [PubMed] - Dawson, K. J., and K. Belkhir, 2001. A Bayesian approach to the identification of panmictic populations and the assignment of individuals. Genet. Res. 78
**:**59–77. [PubMed] - Dupanloup, I., and G. Bertorelle, 2001. Inferring admixture proportions from molecular data: extension to any number of parental populations. Mol. Biol. Evol. 18
**:**672–675. [PubMed] - Dupanloup, I., G. Bertorelle, L. Chikhi and G. Barbujani, 2004. Estimating the impact of prehistoric admixture on the genome of Europeans. Mol. Biol. Evol. 21
**:**1361–1372. [PubMed] - Ellegren, H., 2004. Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. 5
**:**435–445. [PubMed] - Estoup, A., M. Beaumont, F. Sennedot, C. Moritz and J.-M. Cornuet, 2004. Genetic analysis of complex demographic scenarios: spatially expanding populations of the cane toad, Bufo marinus. Evolution 58
**:**2021–2036. [PubMed] - Estoup, A., and S. M. Clegg, 2003. Bayesian inferences on the recent island colonization history by the bird Zosterops lateralis lateralis. Mol. Ecol. 12
**:**657–674. [PubMed] - Estoup, A., L. Garnery, M. Solignac and J. M. Cornuet, 1995. Microsatellite variation in honey bee (Apis mellifera L.) populations: hierarchical genetic structure and test of the infinite allele and stepwise mutation models. Genetics 140
**:**679–695. [PMC free article] [PubMed] - Estoup, A., C. R. Largiader, J. M. Cornuet, K. Gharbi, P. Presa
*et al.*, 2000. Juxtaposed microsatellite systems as diagnostic markers for admixture: an empirical evaluation with brown trout (Salmo trutta) as model organism. Mol. Ecol. 9**:**1873–1886. [PubMed] - Estoup, A., I. J. Wilson, C. Sullivan, J. M. Cornuet and C. Moritz, 2001. Inferring population history from microsatellite and enzyme data in serially introduced cane toads,
*Bufo marinus*. Genetics 159**:**1671–1687. [PMC free article] [PubMed] - Estoup, A., P. Jarne and J. M. Cornuet, 2002. Homoplasy and mutation model at microsatellite loci and their consequences for population genetics analysis. Mol. Ecol. 11
**:**1591–1604. [PubMed] - Falush, D., M. Stephens and J. K. Pritchard, 2003. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164
**:**1567–1587. [PMC free article] [PubMed] - Feldman, M. W., A. Bergman, D. D. Pollock and D. B. Goldstein, 1997. Microsatellite genetic distances with range constraints: analytic description and problems of estimation. Genetics 145
**:**207–216. [PMC free article] [PubMed] - Franck, P., L. Garnery, G. Celebrano, M. Solignac and J. M. Cornuet, 2000. Hybrid origins of honeybees from Italy (Apis mellifera ligustica) and Sicily (A. m. sicula). Mol. Ecol. 9
**:**907–921. [PubMed] - Fu, Y. X., and W. H. Li, 1997. Estimating the age of the common ancestor of a sample of DNA sequences. Mol. Biol. Evol. 14
**:**195–199. [PubMed] - Gaggiotti, O. E., F. Jones, W. M. Lee, W. Amos, J. Harwood
*et al.*, 2002. Patterns of colonization in a metapopulation of grey seals. Nature 416**:**424–427. [PubMed] - Gaggiotti, O. E., S. P. Brooks, W. Amos and J. Harwood, 2004. Combining demographic, environmental and genetic data to test hypotheses about colonization events in metapopulations. Mol. Ecol. 13
**:**811–825. [PubMed] - Garnery, L., J. M. Cornuet and M. Solignac, 1992. Evolutionary history of the honey bee Apis mellifera inferred from mitochondrial DNA analysis. Mol. Ecol. 1
**:**145–154. [PubMed] - Garza, J. C., and E. G. Williamson, 2001. Detection of reduction in population size using data from microsatellite loci. Mol. Ecol. 10
**:**305–318. [PubMed] - Garza, J. C., M. Slatkin and N. B. Freimer, 1995. Microsatellite allele frequencies in humans and chimpanzees, with implications for constraints on allele size. Mol. Biol. Evol. 12
**:**594–603. [PubMed] - Goldstein, D. B., and D. D. Pollock, 1997. Launching microsatellites: a review of mutation processes and method of phylogenetic inference. J. Hered. 88
**:**335–342. [PubMed] - Goldstein, D. B., A. Ruiz Linares, L. L. Cavalli-Sforza and M. W. Feldman, 1995. Genetic absolute dating based on microsatellites and the origin of modern humans. Proc. Natl. Acad. Sci. USA 92
**:**6723–6727. [PMC free article] [PubMed] - Hudson, R. R., 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18
**:**337–338. [PubMed] - Laval, G., and L. Excoffier, 2004. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics 20
**:**2485–2487. [PubMed] - Long, J. C., 1991. The genetic structure of admixed populations. Genetics 127
**:**417–428. [PMC free article] [PubMed] - Marjoram, P., J. Molitor, V. Plagnol and S. Tavare, 2003. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 100
**:**15324–15328. [PMC free article] [PubMed] - Pollock, D. D., A. Bergman, M. W. Feldman and D. B. Goldstein, 1998. Microsatellite behavior with range constraints: parameter estimation and improved distances for use in phylogenetic reconstruction. Theor. Popul. Biol. 53
**:**256–271. [PubMed] - Pritchard, J., M. Seielstad, A. Perez-Lezaun and M. Feldman, 1999. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16
**:**1791–1798. [PubMed] - Pritchard, J. K., M. Stephens and P. Donnelly, 2000. Inference of population structure using multilocus genotype data. Genetics 155
**:**945–959. [PMC free article] [PubMed] - Roberts, D., and R. Hiorns, 1965. Methods of analysis of the genetic composition of a hybrid population. Hum. Biol. 37
**:**38–43. [PubMed] - Ruttner, F., 1988
*Biogeography and Taxonomy of Honeybees*. Springer-Verlag, Berlin. - Tavaré, S., D. Balding, R. C. Griffiths and P. Donnely, 1997. Inferring coalescence times from DNA sequence data. Genetics 145
**:**505–518. [PMC free article] [PubMed] - Wang, J., 2003. Maximum-likelihood estimation of admixture proportions from genetic data. Genetics 164
**:**747–765. [PMC free article] [PubMed] - Weir, B. S., and C. C. Cockerham, 1984. Estimating F-statistics for the analysis of population structure. Evolution 38
**:**1358–1370. - Williamson, E. G., and M. Slatkin, 1999. Using maximum likelihood to estimate population size from temporal changes in allele frequencies. Genetics 152
**:**755–761. [PMC free article] [PubMed] - Wilson, I. J., and D. J. Balding, 1998. Genealogical inference from microsatellite data. Genetics 150
**:**499–510. [PMC free article] [PubMed] - Zhivotovsky, L. A., M. W. Feldman and S. A. Grishechkin, 1997. Biased mutations and microsatellite variation. Mol. Biol. Evol. 14
**:**926–933. [PubMed]

**Genetics Society of America**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (161K)

- Bayesian analyses of admixture in wild and domestic cats (Felis silvestris) using linked microsatellite loci.[Mol Ecol. 2006]
*Lecis R, Pierpaoli M, Birò ZS, Szemethy L, Ragni B, Vercillo F, Randi E.**Mol Ecol. 2006 Jan; 15(1):119-31.* - Maximum-likelihood estimation of admixture proportions from genetic data.[Genetics. 2003]
*Wang J.**Genetics. 2003 Jun; 164(2):747-65.* - Bayesian association mapping for quantitative traits in a mixture of two populations.[Genet Epidemiol. 2001]
*Sillanpää MJ, Kilpikari R, Ripatti S, Onkamo P, Uimari P.**Genet Epidemiol. 2001; 21 Suppl 1:S692-9.* - Using linked markers to infer the age of a mutation.[Hum Mutat. 2001]
*Rannala B, Bertorelle G.**Hum Mutat. 2001 Aug; 18(2):87-100.* - ABC as a flexible framework to estimate demography over space and time: some cons, many pros.[Mol Ecol. 2010]
*Bertorelle G, Benazzo A, Mona S.**Mol Ecol. 2010 Jul; 19(13):2609-25. Epub 2010 Jun 18.*

- Exploring Germplasm Diversity to Understand the Domestication Process in Cicer spp. Using SNP and DArT Markers[PLoS ONE. ]
*Roorkiwal M, von Wettberg EJ, Upadhyaya HD, Warschefsky E, Rathore A, Varshney RK.**PLoS ONE. 9(7)e102016* - Detecting Deep Divergence in Seventeen Populations of Tea Geometrid (Ectropis obliqua Prout) in China by COI mtDNA and Cross-Breeding[PLoS ONE. ]
*Zhang GH, Yuan ZJ, Zhang CX, Yin KS, Tang MJ, Guo HW, Fu JY, Xiao Q.**PLoS ONE. 9(6)e99373* - Interethnic admixture and the evolution of Latin American populations[Genetics and Molecular Biology. 2014]
*Salzano FM, Sans M.**Genetics and Molecular Biology. 2014 Mar; 37(1 Suppl)151-170* - Genetic Structure and Diversity of the Endangered Fir Tree of Lebanon (Abies cilicica Carr.): Implications for Conservation[PLoS ONE. ]
*Awad L, Fady B, Khater C, Roig A, Cheddadi R.**PLoS ONE. 9(2)e90086* - Laboratory Colonisation and Genetic Bottlenecks in the Tsetse Fly Glossina pallidipes[PLoS Neglected Tropical Diseases. ]
*Ciosi M, Masiga DK, Turner CM.**PLoS Neglected Tropical Diseases. 8(2)e2697*

- MedGenMedGenRelated information in MedGen
- PubMedPubMedPubMed citations for these articles
- TaxonomyTaxonomyRelated taxonomy entry
- Taxonomy TreeTaxonomy Tree