Genetics. Jun 2007; 176(2): 1363–1366.
PMCID: PMC1894599

Fast and Accurate Estimation of the Population-Scaled Mutation Rate, θ, From Microsatellite Genotype Data

Abstract

We present a new approach for estimation of the population-scaled mutation rate, θ, from microsatellite genotype data, using the recently introduced “product of approximate conditionals” framework. Comparisons with other methods on simulated data demonstrate that this new approach is attractive in terms of both accuracy and speed of computation. Our simulation experiments also demonstrate that, despite the theoretical advantages of full-likelihood-based methods, methods based on certain summary statistics (specifically, the sample homozygosity) can perform very competitively in practice.

PATTERNS of genetic variation in population samples contain important information on both the biological mechanisms (e.g., mutation, recombination, gene conversion, selection) and aspects of population demographic history (e.g., population expansions, bottlenecks, and migration rates). However, extracting this information is often tricky. The simplest methods are based on matching summaries of the data (e.g., expected heterozygosity or average pairwise distances between alleles) to their expected values. Although these methods are attractive in their simplicity, summarizing the genotype data with a single number in this way risks losing information. More complex methods that use sophisticated computations to approximate the full likelihood of the data (Griffiths and Tavaré 1994a,b; Kuhner et al. 1995; Iorio et al. 2005) are more efficient in principle, but typically are difficult to implement, and may take impractical amounts of time to produce reliable results (Stephens and Donnelly 2000; Fearnhead and Donnelly 2001). This has limited their usefulness in practice. Indeed, in some settings the computational complexities of full-likelihood-based approaches are so daunting that many researchers have turned to approximate methods (e.g., Hudson 2001; McVean et al. 2002; Fearnhead and Donnelly 2002; Li and Stephens 2003), often with considerable success (e.g., Crawford et al. 2004; McVean et al. 2004). Thus far, applications of these approximate methods have been to data on single-nucleotide polymorphisms (SNPs). Here we extend one of these methods, the PAC likelihood approach of Li and Stephens (2003), to estimate the scaled mutation parameter θ (= 2Nμ, where N is the effective haploid population size and μ is the mutation probability per meiosis) from microsatellite data. Simulation results suggest that this method is as accurate as full-likelihood-based approaches and considerably faster.

Models and methods:

We consider a simple scenario, where we genotype a single microsatellite locus in n haploid individuals, or n/2 diploid individuals, sampled from a random-mating population that has been evolving neutrally with constant (haploid) size N according to a Wright–Fisher model. Let $a_1, a_2, \ldots, a_n$ denote the observed alleles (number of repeats of the microsatellite motif). We assume that the locus evolves according to a symmetric stepwise mutation mechanism: if a mutation occurs in a transmission, then the offspring's allele length increases or decreases (with equal probability) by one relative to the progenitor allele. Although this model is simplistic, it is widely used and is the basis for all the methods of estimating θ that we consider here. However, our approach could easily be modified to deal with other mutation models (e.g., those described in Calabrese and Durrett 2003).
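As a rough illustration of the model just described, the following Python sketch draws a sample of n alleles from the standard coalescent with symmetric stepwise mutation. It is not the simulation code used in this study. The function name `simulate_smm_sample`, the arbitrary ancestral repeat count, and the use of the jump chain (event order only, no event times) are illustrative assumptions. Time is scaled so that each pair of lineages coalesces at rate 1 and each lineage mutates at rate θ/2, which corresponds to θ = 2Nμ.

```python
import random

def simulate_smm_sample(n, theta, rng, ancestral=20):
    """Draw n allele lengths under the coalescent with symmetric stepwise mutation.

    Hypothetical helper for illustration only. `ancestral` is an arbitrary
    repeat count for the common ancestor (the model is translation invariant,
    and no lower bound on allele length is imposed).
    """
    # Each lineage is stored as the list of sample indices descending from it.
    lineages = [[i] for i in range(n)]
    shift = [0] * n  # net change in repeat number for each sampled allele
    while len(lineages) > 1:
        k = len(lineages)
        # Looking back in time, the next event is a mutation with probability
        # theta / (k - 1 + theta) and a coalescence otherwise.
        if rng.random() < theta / (k - 1 + theta):
            step = rng.choice((-1, 1))
            for i in rng.choice(lineages):
                shift[i] += step  # a mutation moves every descendant by one repeat
        else:
            a, b = rng.sample(range(k), 2)
            lineages[a] = lineages[a] + lineages[b]
            del lineages[b]
    return [ancestral + s for s in shift]
```

For example, `simulate_smm_sample(10, 2.0, random.Random(1))` returns a list of 10 integer repeat counts; with θ = 0 every sampled allele equals the ancestral value.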

There exist two broad categories of approach for estimating θ in this context. The first consists of moment estimators based on summary statistics. Kimmel et al. (1998) include two such estimators (their Equations 14 and 15). The first one, the homozygosity estimator, is given by

$$\hat{\theta}_H = \frac{1}{2}\left(\frac{1}{\hat{H}^2} - 1\right), \tag{1}$$

where $\hat{H}$ is an unbiased estimate of the population homozygosity,

$$\hat{H} = \frac{n}{n-1}\sum_{i=1}^{r} p_i^2 - \frac{1}{n-1}, \tag{2}$$

where r is the number of different alleles found in the sample, and $p_i$ is the sample frequency of the ith allele. The second estimator is

$$\hat{\theta}_V = \frac{2}{n-1}\sum_{i=1}^{n}\left(a_i - \bar{a}\right)^2, \tag{3}$$

where $\bar{a}$ is the mean of the $a_i$'s. The estimator $\hat{\theta}_H$ is based on the limiting expected homozygosity in a continuous-time Wright–Fisher model, whereas $\hat{\theta}_V$ is based on the limiting expected value of the within-population component of genetic variance in the same model (Kimmel et al. 1998).
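The two moment estimators are simple to compute. The Python sketch below assumes that the homozygosity estimator has the form θ = (1/Ĥ² − 1)/2, with Ĥ = (n Σ pᵢ² − 1)/(n − 1), and that the variance-based estimator is twice the sample variance of the repeat counts. These forms follow from the stepwise mutation model with θ = 2Nμ. The function names are illustrative and do not come from any published software.

```python
from collections import Counter

def unbiased_homozygosity(alleles):
    """Unbiased estimate of population homozygosity from a list of repeat counts."""
    n = len(alleles)
    sum_p2 = sum((count / n) ** 2 for count in Counter(alleles).values())
    return (n * sum_p2 - 1) / (n - 1)

def theta_homozygosity(alleles):
    """Homozygosity-based moment estimator of theta (assumed form, see lead-in)."""
    h = unbiased_homozygosity(alleles)
    if h <= 0:
        # All sampled alleles are distinct: the estimator is unbounded.
        return float("inf")
    return 0.5 * (1.0 / h ** 2 - 1.0)

def theta_variance(alleles):
    """Variance-based moment estimator: twice the sample variance of repeat counts."""
    n = len(alleles)
    mean = sum(alleles) / n
    return 2.0 * sum((a - mean) ** 2 for a in alleles) / (n - 1)
```

For the sample `[10, 10, 11, 12]`, the unbiased homozygosity is 1/6, so the homozygosity estimator is 17.5, while the variance-based estimator is 11/6.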

The second category is full-likelihood-based approaches, including maximum-likelihood and Bayesian approaches, which base inference on the likelihood

$$L(\theta) = \Pr(a_1, a_2, \ldots, a_n \mid \theta). \tag{4}$$

In principle full-likelihood-based approaches are more efficient than moment estimators based on summary statistics. However, they are considerably harder to implement because the likelihood (4) cannot be computed directly. Instead, the likelihood can be approximated using computational methods such as Markov chain Monte Carlo (MCMC) or importance sampling. Wilson and Balding (1998) and Beerli and Felsenstein (2001) describe two such approaches. Wilson and Balding (1998) take a Bayesian approach, specifying prior distributions for N and μ, and use an MCMC scheme to draw samples from the posterior distribution of θ. This method is implemented in the software MICSAT, which we downloaded from http://www.maths.abdn.ac.uk/~ijw/downloads/download.htm. Beerli and Felsenstein (2001) also use a (different) MCMC scheme; but instead of performing a Bayesian analysis, they use it to compute a likelihood surface for θ (and also, in the case of samples from multiple populations, a set of migration rates among populations; however, here we deal with a sample from a single random-mating population, and so their approach can be used to estimate θ alone). This method is implemented by the program Migrate (version 1.7.3), which we downloaded from http://evolution.genetics.washington.edu/lamarc/migrate.html.

In this article we take a different approach, following Li and Stephens (2003) who suggest approximating the likelihood (4) by exploiting the identity

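$$L(\theta) = \Pr(a_1 \mid \theta)\,\Pr(a_2 \mid a_1, \theta)\cdots\Pr(a_n \mid a_1, \ldots, a_{n-1}, \theta). \tag{5}$$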

Although the conditional distributions on the right-hand side of this equation are unknown for most models of interest, they are amenable to approximation (e.g., Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; Li and Stephens 2003). Substituting such an approximation, $\hat{\pi}$ say, into the right-hand side yields an approximate likelihood, which Li and Stephens (2003) term the “product of approximate conditionals” (PAC) likelihood:

$$L_{\mathrm{PAC}}(\theta) = \hat{\pi}(a_1 \mid \theta)\,\hat{\pi}(a_2 \mid a_1, \theta)\cdots\hat{\pi}(a_n \mid a_1, \ldots, a_{n-1}, \theta). \tag{6}$$

Li and Stephens (2003) applied this idea to estimate recombination rates (but not mutation rates) from SNP data and showed the resulting estimates to be competitive with the best available methods for that problem.

Here we show that an analogous approach also works for estimating θ from microsatellite data. For the conditional distributions $\hat{\pi}(a_{k+1} \mid a_1, \ldots, a_k, \theta)$ on the right-hand side of (6) we use the approximation suggested by Stephens and Donnelly (2000). This approximation is based on the idea that the next sampled allele, $a_{k+1}$, will differ by a random number of mutations (typically a small number, and quite possibly 0) from a randomly chosen existing allele (one of $a_1, \ldots, a_k$). Stephens and Donnelly (2000, p. 616) assume that the number of mutations, m, has a geometric distribution, with Pr(m = 0) = k/(k + θ). The assumption of a geometric distribution is motivated by the fact that the resulting approximation is exact for the case k = 1; and the assumption on Pr(m = 0) is motivated by the fact that the resulting approximation is exact [and results in the well-known Ewens sampling formula (Ewens 1972)] for so-called “parent-independent mutation” (PIM) models, where the type of a mutant offspring is independent of the type of the progenitor allele. Of course, stepwise mutation is not PIM, so the approximation is not exact in our setting. Part of our aim here is to show that the approximation is good enough to provide accurate estimates for θ.

Mathematically, the approximation suggested by Stephens and Donnelly (2000) is

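$$\hat{\pi}(a_{k+1} \mid a_1, \ldots, a_k, \theta) = \sum_{i=1}^{k}\frac{1}{k}\sum_{m=0}^{\infty} q_k^{m}\,(1 - q_k)\,\left(P^{m}\right)_{a_i a_{k+1}}, \tag{7}$$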

where qk = θ/(k + θ) and P is a mutation matrix, whose (i, j)th element is the probability that the type of an offspring is of type j, given that the progenitor is of type i and a mutation occurs. To ease comparison with other approaches, we assume a symmetric stepwise mutation mechanism, so that

$$P_{ij} = \begin{cases} 1/2 & \text{if } |i - j| = 1, \\ 0 & \text{otherwise.} \end{cases}$$

We note that, unlike in Stephens and Donnelly (2000), we do not impose any reflecting boundaries on the mutation process, although this would be straightforward to do. (Thus, the matrix P has infinitely many rows and columns.) It would also be straightforward to incorporate nonstepwise moves (e.g., Nielsen 1997) or indeed any other desired form for P.

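This choice of P has the convenient, although not essential, property that the approximation (7) simplifies to

$$\hat{\pi}(a_{k+1} \mid a_1, \ldots, a_k, \theta) = \frac{1}{k}\sum_{i=1}^{k}\frac{1 - q_k}{\sqrt{1 - q_k^2}}\left(\frac{1 - \sqrt{1 - q_k^2}}{q_k}\right)^{|a_{k+1} - a_i|}. \tag{8}$$

This follows from rewriting (7), using the geometric series $\sum_{m=0}^{\infty} q_k^{m} P^{m} = (I - q_k P)^{-1}$, as

$$\hat{\pi}(a_{k+1} \mid a_1, \ldots, a_k, \theta) = \sum_{i=1}^{k}\frac{1}{k}\,(1 - q_k)\left[(I - q_k P)^{-1}\right]_{a_i a_{k+1}}, \tag{9}$$

and noting that the matrix with (i, j)th element

$$\frac{1}{\sqrt{1 - q_k^2}}\left(\frac{1 - \sqrt{1 - q_k^2}}{q_k}\right)^{|i - j|} \tag{10}$$

is the inverse of $(I - q_k P)$. Equation 10 can be verified by straightforward algebra, multiplying a row of $(I - q_k P)$ by a column of $(I - q_k P)^{-1}$ defined by (10).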
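The row-by-column check mentioned above can also be carried out numerically. The Python sketch below assumes that the (i, j)th element in Equation 10 is λ^|i−j| / √(1 − q²) with λ = (1 − √(1 − q²))/q, which is the closed form that solves the required recursion for the symmetric stepwise P. The function names are illustrative.

```python
import math

def inverse_element(i, j, q):
    """Assumed closed form for the (i, j)th element of the inverse of (I - qP)."""
    s = math.sqrt(1.0 - q * q)
    lam = (1.0 - s) / q
    return lam ** abs(i - j) / s

def row_times_column(i, j, q):
    """Product of row i of (I - qP) with column j of the candidate inverse.

    Row i of (I - qP) has 1 on the diagonal and -q/2 at positions i - 1 and
    i + 1, so only three terms contribute even though P is infinite.
    """
    return inverse_element(i, j, q) - 0.5 * q * (
        inverse_element(i - 1, j, q) + inverse_element(i + 1, j, q)
    )
```

If the closed form is right, `row_times_column(i, j, q)` should be 1 when i = j and 0 otherwise, up to floating-point error, for any 0 < q < 1.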

Substituting (8) into (6) for $\hat{\pi}(a_{k+1} \mid a_1, \ldots, a_k, \theta)$ gives a PAC likelihood for this problem. Note that, as in Li and Stephens (2003), the resulting PAC likelihood is not invariant to the ordering of the sampled alleles $a_1, a_2, \ldots, a_n$. To deal with this, we take the same approach as Li and Stephens (2003): we average (6) over 10 random permutations of $a_1, a_2, \ldots, a_n$. [Results (not shown) obtained using a single random permutation were similar in accuracy.] We use $\hat{\theta}_{\mathrm{PAC}}$ to denote the value of θ that maximizes this function [found numerically by computing $L_{\mathrm{PAC}}(\theta)$ on a dense grid of values for θ].
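The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the conditional is taken to be the closed form obtained from the geometric number of mutations with q = θ/(k + θ) and the symmetric stepwise P; the factor for the first allele is treated as constant in θ and dropped; and all function names, the grid, and the random seed are hypothetical choices.

```python
import math
import random

def log_conditional(new, previous, theta):
    """Log of the approximate conditional of `new` given the k alleles in `previous`."""
    k = len(previous)
    root = math.sqrt(k * (k + 2.0 * theta))
    lam = (k + theta - root) / theta  # (1 - sqrt(1 - q^2)) / q with q = theta / (k + theta)
    weight = k / root                 # (1 - q) / sqrt(1 - q^2)
    return math.log(weight * sum(lam ** abs(new - a) for a in previous) / k)

def pac_log_likelihood(alleles, theta):
    """Log PAC likelihood for one ordering (first-allele factor omitted)."""
    return sum(log_conditional(alleles[k], alleles[:k], theta)
               for k in range(1, len(alleles)))

def averaged_pac_log_likelihood(alleles, theta, orders):
    """Log of the PAC likelihood averaged over the given orderings (log-sum-exp)."""
    logs = [pac_log_likelihood([alleles[i] for i in order], theta) for order in orders]
    top = max(logs)
    return top + math.log(sum(math.exp(x - top) for x in logs) / len(logs))

def theta_pac(alleles, grid, n_perm=10, seed=0):
    """Grid value of theta (all values must be > 0) maximizing the averaged PAC likelihood."""
    rng = random.Random(seed)
    indices = list(range(len(alleles)))
    orders = [rng.sample(indices, len(indices)) for _ in range(n_perm)]
    return max(grid, key=lambda t: averaged_pac_log_likelihood(alleles, t, orders))
```

A call such as `theta_pac([10, 12, 9, 15, 11, 10, 13, 8], [0.1 * i for i in range(1, 500)])` returns the maximizing grid point. The same set of orderings is reused for every θ so that the averaged likelihood is a smooth function of θ on the grid.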

Comparisons:

We compared the properties of our PAC-based estimator $\hat{\theta}_{\mathrm{PAC}}$ with the other available methods described above: the moment-based estimators $\hat{\theta}_H$ (Equation 1) and $\hat{\theta}_V$ (Equation 3) and the full-likelihood-based estimators $\hat{\theta}_{\mathrm{MICSAT}}$ and $\hat{\theta}_{\mathrm{Migrate}}$. To be precise, $\hat{\theta}_{\mathrm{MICSAT}}$ is the mean of 10,000 draws from the posterior distribution for θ obtained using the program MICSAT with default parameter values, and $\hat{\theta}_{\mathrm{Migrate}}$ is the value of θ that maximizes the approximate likelihood computed using Migrate, again with default parameter values.

Figure 1 compares “bias” (or, more accurately, median error) and “accuracy” (median absolute error) of the resulting estimates, on a log scale. Making comparisons on the log scale means that, for example, underestimating θ by a factor of 2 is considered equally good—or bad—as overestimating by a factor of 2. We use medians rather than means because the means are infinite, due to the fact that there is a small finite probability of each estimator being 0 (and therefore giving a log of −∞); see also Li and Stephens (2003).
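The two log-scale summaries can be computed as in the following Python sketch. The function name is illustrative, and the natural logarithm is an assumption, since the base of the logarithm is not stated in the text. An estimate of exactly 0 is mapped to −∞, which is why medians rather than means are used.

```python
import math
import statistics

def log_scale_summary(estimates, true_theta):
    """Return (median log error, median absolute log error) for a set of estimates."""
    log_errors = [
        math.log(est / true_theta) if est > 0 else -math.inf
        for est in estimates
    ]
    bias = statistics.median(log_errors)
    accuracy = statistics.median(abs(err) for err in log_errors)
    return bias, accuracy
```

With a true value of 2, the estimates `[1, 2, 4]` give a median log error of 0 and a median absolute log error of log 2, reflecting the symmetric treatment of twofold under- and overestimation.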

Figure 1.
Comparison of the “bias” (a–c) and “accuracy” (d–f) of different estimators. Each section has five curves, one for each estimator: ○, equation M47; [open triangle], equation M48; x, equation M49; #, equation M50; and □, equation M51. In a–c ...

For the scenarios we consider, equation M29, equation M30, equation M31, and equation M32 are consistently better (smaller bias and smaller mean absolute error) than equation M33 and equation M34. If anything the results for equation M35 seem very slightly better than the other three, especially for small values of θ (according to a paired Wilcoxon signed-rank test, the improvement in accuracy over equation M36 is significant at P < 0.05 for all values of n considered at θ = 2 and for n = 10, 20, 80 at θ = 8; the improvement over equation M37 is significant at P < 0.05 for all values of n considered at θ = 2, for n = 20, 80 at θ = 8, and for n = 10, 20, 40 at θ = 32). However, the differences may be too small to be practically important, and in some sense a direct comparison with equation M38 is inappropriate, since it is based on a particular prior distribution for θ.

One additional notable finding from our simulations is that, between the summary statistic estimators, the homozygosity estimator $\hat{\theta}_H$ performs considerably better than the variance-based estimator $\hat{\theta}_V$. Indeed, the finding that $\hat{\theta}_H$ performs competitively with the likelihood-based methods is, as far as we are aware, novel. While we have no intuitive explanation for this good performance, the poor performance of $\hat{\theta}_V$ might perhaps have been expected, for the following reason. Equation 3 for $\hat{\theta}_V$ can be rewritten as $\hat{\theta}_V = \binom{n}{2}^{-1}\sum_{i<j}(a_i - a_j)^2$. Thus $\hat{\theta}_V$ is the mean squared pairwise difference between sampled microsatellite repeats. In the context of sequence data, the corresponding estimate for θ (per base pair) is the mean pairwise distance (per base pair) between sampled haplotypes, also known as the nucleotide diversity, and this is known to be an inconsistent estimator for θ in that context (e.g., Donnelly and Tavaré 1995).
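The rewriting of Equation 3 as a mean squared pairwise difference can be checked in a few lines. The derivation below assumes that Equation 3 is twice the sample variance of the repeat counts, that is, 2/(n − 1) times the sum of squared deviations from the mean.

```latex
\begin{align*}
\sum_{i<j}(a_i - a_j)^2
  &= \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
     \bigl[(a_i - \bar{a}) - (a_j - \bar{a})\bigr]^2 \\
  &= \frac{1}{2}\Bigl[\,2n\sum_{i=1}^{n}(a_i - \bar{a})^2
     - 2\sum_{i=1}^{n}(a_i - \bar{a})\sum_{j=1}^{n}(a_j - \bar{a})\Bigr] \\
  &= n\sum_{i=1}^{n}(a_i - \bar{a})^2 ,
     \qquad\text{since } \sum_{i=1}^{n}(a_i - \bar{a}) = 0. \\[4pt]
\binom{n}{2}^{-1}\sum_{i<j}(a_i - a_j)^2
  &= \frac{2}{n(n-1)}\cdot n\sum_{i=1}^{n}(a_i - \bar{a})^2
   = \frac{2}{n-1}\sum_{i=1}^{n}(a_i - \bar{a})^2 .
\end{align*}
```

As a small numerical check, the sample 10, 10, 11, 12 has squared pairwise differences 0, 1, 4, 1, 4, 1, whose mean is 11/6; twice the sample variance is 2 × 2.75/3 = 11/6 as well.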

We interpret the poorer performance of the Migrate-based estimator $\hat{\theta}_{\mathrm{Migrate}}$ as indicating that, even in this relatively simple setting, with only a single parameter to be estimated and no migration, the default run lengths we used were insufficient to provide an accurate approximation to the maximum-likelihood estimates. In more complex settings, involving migration, for example, obtaining an accurate estimate of the likelihood surface, and the location of its maximum, seems likely to be still more challenging. Although some work would be necessary to extend our PAC-likelihood method to these settings, our results here, and in Li and Stephens (2003), suggest that this effort may be worthwhile.

Acknowledgments

We thank two anonymous referees for helpful comments on the submitted version of this manuscript. This work was supported by National Institutes of Health grant HG/LM02585 to M.S.

References

  • Beerli, P., and J. Felsenstein, 2001. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl. Acad. Sci. USA 98(8): 4563–4568.
  • Calabrese, P., and R. Durrett, 2003. Dinucleotide repeats in the Drosophila and human genomes have complex, length-dependent mutation processes. Mol. Biol. Evol. 20: 715–725.
  • Crawford, D., T. Bhangale, N. Li, G. Hellenthal, M. Rieder et al., 2004. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 36: 700–706.
  • Donnelly, P., and S. Tavaré, 1995. Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29: 401–421.
  • Ewens, W. J., 1972. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87–112.
  • Fearnhead, P. N., and P. Donnelly, 2001. Estimating recombination rates from population genetic data. Genetics 159: 1299–1318.
  • Fearnhead, P. N., and P. Donnelly, 2002. Approximate likelihood methods for estimating local recombination rates. J. R. Stat. Soc. Ser. B 64: 657–680.
  • Griffiths, R. C., and S. Tavaré, 1994a. Ancestral inference in population genetics. Stat. Sci. 9: 307–319.
  • Griffiths, R. C., and S. Tavaré, 1994b. Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46: 131–159.
  • Hudson, R. R., 2001. Two-locus sampling distribution and their application. Genetics 159: 1805–1817.
  • Iorio, M. D., R. C. Griffiths, R. Leblois and F. Rousset, 2005. Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models. Theor. Popul. Biol. 68: 41–53.
  • Kimmel, M., R. Chakraborty, J. P. King, M. Bamshad, W. S. Watkins et al., 1998. Signatures of population expansion in microsatellite repeat data. Genetics 148: 1921–1930.
  • Kuhner, M. K., J. Yamato and J. Felsenstein, 1995. Estimating effective population size and mutation rate from sequence data using Metropolis–Hastings sampling. Genetics 140: 1421–1430.
  • Li, N., and M. Stephens, 2003. Modeling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics 165: 2213–2233.
  • McVean, G., P. Awadalla and P. Fearnhead, 2002. A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 1231–1241.
  • McVean, G. A. T., S. R. Myers, S. Hunt, P. Deloukas, D. R. Bentley et al., 2004. The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.
  • Nielsen, R., 1997. A likelihood approach to population samples of microsatellite alleles. Genetics 146: 711–716.
  • Stephens, M., and P. Donnelly, 2000. Inference in molecular population genetics. J. R. Stat. Soc. Ser. B 62: 605–655.
  • Wilson, I. J., and D. J. Balding, 1998. Genealogical inference from microsatellite data. Genetics 150: 499–510.

Articles from Genetics are provided here courtesy of Genetics Society of America