NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Theor Popul Biol. Author manuscript; available in PMC Jun 1, 2010.
Published in final edited form as:
PMCID: PMC2736640
NIHMSID: NIHMS111984

Site Frequency Spectra from Genomic SNP Surveys

Abstract

Genomic survey data now permit an unprecedented level of sensitivity in the detection of departures from canonical evolutionary models, including expansions in population size and selective sweeps. Here, we examine the effects of seemingly subtle differences among sampling distributions on goodness of fit analyses of site frequency spectra constructed from single nucleotide polymorphisms. Conditioning on the observation of exactly two alleles in a random sample results in a site frequency spectrum that is independent of the scaled rate of neutral substitution (θ). Other sampling distributions, including conditioning on a single mutational event in the sample genealogy or randomly selecting a single mutation from a genealogy with multiple mutations, have distinct site frequency spectra that show highly significant departures from the predictions of the biallelic model. Some aspects of data filtering may contribute to significant departures of site frequency spectra from expectation, apart from any violation of the standard neutral model.

Keywords: site frequency spectrum, single nucleotide polymorphism, Ewens sampling formula, infinite-sites model, standard neutral model

1 Introduction

1.1 Site frequency spectra

Site frequency spectra (SFSs) are widely used to summarize patterns of genome-wide variation at the single nucleotide polymorphisms (SNPs) that abound in virtually all organisms. Fundamental population genetic analyses (Ewens 1972; Tajima 1989; Fu 1995; Griffiths and Tavaré 1998; Stephens 2000) have characterized patterns of genetic variation expected under the infinite-alleles and infinite-sites models of neutral substitution. A scaled version of those single-locus predictions now serves as the point of departure for the analysis of SFSs comprising hundreds of thousands of independent SNP loci. Because the relative expected multiplicities depend only on sample size, departures from this expectation have been used to identify candidates for targets of selection or other locus-specific processes (for example, Kim et al. 2007). Numerical simulation studies (Braverman et al. 1995; Simonsen et al. 1995) have established that various phenomena, including hitchhiking and expansions in population size, affect spectrum shape, and analytical predictions now exist for a number of forms of departure from the standard neutral model (Marth et al. 2004; Keightley and Eyre-Walker 2007; Živković and Wiehe 2008).

Few spectra constructed from actual genomic SNP surveys conform to expectation under the standard neutral model. For SNPs identified by direct sequencing through the NIEHS Environmental Genome Project, for example, Hernandez et al. (2007) noted a general excess of derived alleles in low and high multiplicities and a corresponding deficiency of alleles in intermediate frequencies. Ascertainment of SNPs through a small panel of individuals (Nielsen et al. 2004) introduces a different bias, toward an excess of SNPs in intermediate frequencies.

1.2 Fitting to an incorrect model

The sheer volume of information available from genomic databases confers unprecedented power to detect departures from models serving as the basis for interpretation of the data. Significant p-values may reflect departures from any aspect of a model, with some aspects fundamental to key inferences and others merely incidental.

Bishop et al. (1975) have presented a lucid treatment of the effect on the Pearson chi-square statistic of fitting data to an incorrect model. In a goodness of fit analysis of counts in $k$ cells, the sample $X^2$ corresponds to

$$X^2 = \sum_{i=1}^{k} \frac{(n_i - n p_i)^2}{n p_i},$$

for $n_i$ the observed count in cell $i$, $n$ ($= \sum_i n_i$) the total number of counts, and $p_i$ the expected proportion in cell $i$. If the true proportions ($p_i^{*}$) of the multinomial distribution from which the observations ($n_i$) are sampled differ from those used to determine the expected counts ($p_i$), then the expectation of $X^2$ corresponds to

$$E[X^2] = k - 1 + \sum_{i=1}^{k} \frac{p_i^{*} - p_i}{p_i} + (n - 1) \sum_{i=1}^{k} \frac{(p_i^{*} - p_i)^2}{p_i} \tag{1}$$

(Bishop et al. 1975, Section 9.6). For the number of counts ($n$) very large relative to the number of cells ($k$), as is the case for the analysis of genomic SNP data, even small departures between the true and fitted models (squared differences $(p_i^{*} - p_i)^2$ of order $\varepsilon$) can cause the expected $X^2$ to exceed considerably the degrees of freedom (df = $k - 1$).
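To make the scale of this effect concrete, the following minimal Python sketch evaluates (1) for a hypothetical four-cell example. The function name `expected_pearson_x2` and the cell proportions are illustrative assumptions, not values taken from the analyses reported here.

```python
def expected_pearson_x2(n, true_p, fitted_p):
    """Expected Pearson X^2 from Eq. (1): counts are multinomial(n, true_p),
    but expected counts are computed from fitted_p."""
    k = len(fitted_p)
    linear = sum((t - f) / f for t, f in zip(true_p, fitted_p))
    quadratic = sum((t - f) ** 2 / f for t, f in zip(true_p, fitted_p))
    return k - 1 + linear + (n - 1) * quadratic


# Hypothetical proportions: the fitted model is off by 0.01 in two cells.
fitted = [0.25, 0.25, 0.25, 0.25]
true = [0.26, 0.24, 0.25, 0.25]

for n in (100, 10_000, 1_000_000):
    # The expectation stays near df = 3 for small n and grows linearly in n.
    print(n, expected_pearson_x2(n, true, fitted))
```

When the fitted and true proportions coincide, both sums vanish and the expectation reduces to the degrees of freedom, $k - 1$.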

1.3 Comparison of sampling distributions

Here, we address the effect on the shape of expected site frequency spectra of SNPs of closely related models for their sampling distribution. Restriction of consideration to sample genealogies that contain a single mutational event induces a dependence of the SFS on the scaled mutation rate ($\theta = \lim 2Nu$ as $N \to \infty$ and $u \to 0$, for $N$ the effective number of genes and $u$ the rate of neutral substitution). This dependence on θ implies that the SFS can provide a basis for the estimation of this fundamental parameter. Of particular significance for the detection of departures from the standard neutral model is that we expect classes of SNPs with distinct rates of neutral substitution to show distinct spectra, even in the absence of class-specific processes, including selection.

We characterize the expected shape of site frequency spectra constructed from sample genealogies that contain exactly one segregating site (neutral SNP model). A folded version of the model widely used to represent the standard neutral model (scaled multiplicity model) follows directly from the Ewens sampling formula (ESF, Ewens 1972) conditioned on the observation of exactly two alleles in the sample.

We used ms (Hudson 2002) to simulate $12 \times 10^6$ data sets from a non-recombining region for a range of values of θ (0.5 to 6.0). For each data set, we determined the number of segregating sites, the multiplicity of each mutation in the sample, the number of distinct haplotypes (alleles), the length of each branch, and the size of each branch (number of descendant tips). An R file with code for determining maximum likelihood estimates of θ and their confidence intervals is provided as Supplementary Data.

We conducted a series of goodness of fit analyses to explore the influence of various means of filtering large data sets to isolate single nucleotide polymorphisms. Our results indicate that seemingly subtle differences between the theoretical and actual sampling distributions can generate very highly significant $X^2$ values in tests involving the large number of observations typical of genomic SNP data, quite apart from departures from the standard neutral model.

2 Expected patterns of variation

We summarize basic descriptors of variation.

2.1 Number of segregating sites

On level l of the genealogy of a sample of genes (the segment comprising l lineages), the probability of the occurrence of a mutation more recently than a coalescence is

$$\frac{lu}{\binom{l}{2}/N + lu} = \frac{\theta}{l - 1 + \theta}$$

(see, for example, Ethier and Griffiths 1987). Watterson (1975) observed that the number of mutations accumulated on level $l$ has a geometric distribution with this parameter, and gave the probability generating function (pgf) for the total number of segregating sites ($S$) in a sample of size $m$:

$$g_S(z) = \prod_{l=2}^{m} \frac{l - 1}{l - 1 + \theta(1 - z)}. \tag{2}$$

In particular, the expected number of segregating sites corresponds to

$$E[S] = g_S'(1) = \sum_{l=2}^{m} \frac{\theta}{l - 1}. \tag{3}$$

Tavaré (1984) has derived a simple expression for the probability mass function of S:

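$$P(S = i \mid \theta) = \frac{m - 1}{\theta} \sum_{l=2}^{m} (-1)^{l-2} \binom{m-2}{l-2} \left( \frac{\theta}{l - 1 + \theta} \right)^{i+1}. \tag{4}$$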
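As a numerical check on (2) through (4), the sketch below evaluates the Tavaré mass function and compares its total mass, its mean, and its value at $S = 0$ with the corresponding quantities read from the pgf. The function names are illustrative, and the truncation of the support at 200 mutations is an assumption that is adequate for the modest θ used here.

```python
from math import comb, prod


def expected_segregating_sites(m, theta):
    """E[S] from Eq. (3) for a sample of m genes."""
    return sum(theta / (l - 1) for l in range(2, m + 1))


def prob_segregating_sites(i, m, theta):
    """P(S = i | theta) from Eq. (4)."""
    total = sum(
        (-1) ** (l - 2) * comb(m - 2, l - 2) * (theta / (l - 1 + theta)) ** (i + 1)
        for l in range(2, m + 1)
    )
    return (m - 1) / theta * total


def prob_no_mutation(m, theta):
    """P(S = 0), read directly from the pgf (2) evaluated at z = 0."""
    return prod((l - 1) / (l - 1 + theta) for l in range(2, m + 1))


m, theta = 19, 1.0
pmf = [prob_segregating_sites(i, m, theta) for i in range(200)]  # truncated support
mean = sum(i * p for i, p in enumerate(pmf))

print("total mass:", sum(pmf))
print("mean:", mean, "vs Eq. (3):", expected_segregating_sites(m, theta))
print("P(S=0):", pmf[0], "vs pgf:", prob_no_mutation(m, theta))
```

Because (4) is an alternating sum with binomial coefficients, it loses precision for much larger samples; for $m = 19$ the round-off is negligible.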

2.2 Number of mutations with a given multiplicity

For the infinite-sites model, under which all mutations are detectable and distinguishable, Fu (1995) noted that the number of genes in a sample that bear a given mutation corresponds to the number of tips that descend from the branch of the sample gene genealogy on which the mutation arose. Fu (1995) derived the mean and variance of the number of mutations in a sample of size m that have multiplicity i (ξi):

$$E[\xi_i] = \theta / i, \tag{5a}$$

$$\mathrm{Var}[\xi_i] = \theta / i + \sigma_{ii}\,\theta^2, \tag{5b}$$

in which

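$$\sigma_{ii} = \begin{cases} \beta_m(i + 1) & \text{for } i < m/2 \\ 2(a_m - a_i)/(m - i) - 1/i^2 & \text{for } i = m/2 \\ \beta_m(i) - 1/i^2 & \text{for } i > m/2 \end{cases}$$

with

$$\beta_m(i) = \frac{2m}{(m - i + 1)(m - i)}\,(a_{m+1} - a_i) - \frac{2}{m - i}$$

and $a_m = \sum_{j=1}^{m-1} 1/j$, following Fu (1995).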

The expected number of mutations present in the sample in multiplicity i (5a) scaled to the total number of mutations (3),

$$f_s(i \mid m) = \frac{1/i}{\sum_{j=1}^{m-1} 1/j}, \tag{6}$$

is widely used to describe the expected SFS for a genomic sample of SNPs, each assumed to correspond to a mutation on an independent gene genealogy.
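The expressions (5) and (6) are simple to evaluate directly. The following sketch is one possible implementation; it assumes Fu's (1995) harmonic sum $a_n = \sum_{j=1}^{n-1} 1/j$ (with $a_1 = 0$), and the function names are illustrative.

```python
def a(n):
    """Harmonic sum a_n = sum_{j=1}^{n-1} 1/j, with a_1 = 0 (Fu 1995)."""
    return sum(1.0 / j for j in range(1, n))


def beta(i, m):
    """beta_m(i) as used in the variance expression (5b)."""
    return 2.0 * m / ((m - i + 1) * (m - i)) * (a(m + 1) - a(i)) - 2.0 / (m - i)


def sigma_ii(i, m):
    """Coefficient of theta^2 in Var[xi_i], for 1 <= i <= m - 1."""
    if 2 * i < m:
        return beta(i + 1, m)
    if 2 * i == m:
        return 2.0 * (a(m) - a(i)) / (m - i) - 1.0 / i**2
    return beta(i, m) - 1.0 / i**2


def mean_var_xi(i, m, theta):
    """Mean (5a) and variance (5b) of the number of mutations of multiplicity i."""
    return theta / i, theta / i + sigma_ii(i, m) * theta**2


def scaled_multiplicity(m):
    """Scaled multiplicity model (6): proportions for i = 1, ..., m - 1."""
    h = a(m)
    return [(1.0 / i) / h for i in range(1, m)]


m, theta = 19, 1.0
for i in (1, 5, 18):
    print(i, mean_var_xi(i, m, theta))
print(scaled_multiplicity(m)[:3])
```

For a sample of two genes, the only multiplicity class is $i = 1$ and the variance reduces to $\theta + \theta^2$, the variance of the geometric number of mutations on the single level of the genealogy.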

Expressions closely related or identical to (5) and (6) have been obtained in a variety of contexts (see especially Watterson 1974; Griffiths and Tavaré 1998). In particular, the frequency of a mutation in the sample provides information about its age (Kimura and Ohta 1973), and a number of elegant coalescence-based studies have elucidated the genealogical basis of this relationship (Griffiths and Tavaré 1998, 2003; Wiuf and Donnelly 1999; Stephens 2000; Hobolth and Wiuf 2009). Tajima (1983, 1989) and Griffiths and Tavaré (1998, 2003) showed that (6) corresponds to the expected proportion (rather than number) of mutations that occur in multiplicity i in a sample of size m for low rates of mutation (θ → 0).

2.3 Number of alleles

For the infinite alleles model of mutation, the Ewens sampling formula provides the joint probability of the numbers in which distinct alleles appear in a sample of m genes:

$$p(\mathbf{a}) = \frac{m!}{\prod_{l=1}^{m} (\theta + l - 1)} \prod_{i=1}^{m} \left( \frac{\theta}{i} \right)^{a_i} \frac{1}{a_i!}, \tag{7}$$

for $\mathbf{a} = (a_1, a_2, \ldots, a_m)$, with $a_i$ the number of alleles observed exactly $i$ times. Explicit reference to the genealogy of the sample under the standard neutral model has yielded elegant combinatorial derivations of the ESF (Kingman 1978; Donnelly 1986; Griffiths and Lessard 2005).

Ewens (1972) derived the probability mass function for the number of distinct alleles (K) observed in a sample of size m,

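$$P(K = i \mid \theta, m) = \frac{l_i\,\theta^i}{L_m(\theta)}, \tag{8}$$

for $L_m(\theta)$ the polynomial whose coefficients ($l_i$) are the unsigned Stirling numbers of the first kind:

$$L_m(\theta) = \theta(\theta + 1) \cdots (\theta + m - 1) = l_1\theta + l_2\theta^2 + \cdots + l_m\theta^m.$$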

This distribution (8) has pgf

$$g_K(z) = \frac{L_m(\theta z)}{L_m(\theta)} = \prod_{l=1}^{m} \frac{\theta z + l - 1}{\theta + l - 1}.$$

Ewens (1972) gave the expectation and variance of the number of alleles:

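$$E[K] = \sum_{l=1}^{m} \frac{\theta}{\theta + l - 1} \tag{9a}$$

$$\mathrm{Var}[K] = \sum_{l=1}^{m} \frac{\theta}{\theta + l - 1} - \sum_{l=1}^{m} \left( \frac{\theta}{\theta + l - 1} \right)^2. \tag{9b}$$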
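The product form of the pgf suggests a simple way to compute (8) without tabulating Stirling numbers: each factor is the pgf of an independent Bernoulli variable with success probability $\theta/(\theta + l - 1)$, so the distribution of $K$ follows by repeated convolution. The sketch below takes that route; it is an illustrative implementation with hypothetical function names, not code from the Supplementary Data.

```python
def allele_number_pmf(m, theta):
    """Return [P(K=1), ..., P(K=m)] by expanding the product form of g_K(z)."""
    coeffs = [1.0]  # coeffs[i] is the coefficient of z^i in the partial product
    for l in range(1, m + 1):
        p = theta / (theta + l - 1)  # factor l contributes p*z + (1 - p)
        new = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            new[i] += c * (1.0 - p)
            new[i + 1] += c * p
        coeffs = new
    return coeffs[1:]  # the coefficient of z^0 is zero because the l = 1 factor is z


def allele_number_moments(m, theta):
    """Mean (9a) and variance (9b) of the number of alleles."""
    probs = [theta / (theta + l - 1) for l in range(1, m + 1)]
    return sum(probs), sum(probs) - sum(p * p for p in probs)


m, theta = 19, 1.5
pmf = allele_number_pmf(m, theta)
mean = sum((k + 1) * p for k, p in enumerate(pmf))
var = sum((k + 1) ** 2 * p for k, p in enumerate(pmf)) - mean**2
print(mean, var, allele_number_moments(m, theta))
```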

From the ESF (7), conditioned on the observation of a biallelic sample, we obtain a folded version of the scaled multiplicity model (6), in which the ancestral and derived alleles are not distinguished. A random sample of m genes contains exactly two haplotypes with probability

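$$P(K = 2) = \frac{l_2\,\theta^2}{L_m(\theta)} = \frac{g_K''(0)}{2} = \sum_{l=2}^{m} \frac{\theta}{l - 1} \prod_{j=2}^{m} \frac{j - 1}{j - 1 + \theta}. \tag{10}$$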

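Conditioning on this event, we obtain from the ESF (7) the probability of a sample containing two alleles in multiplicities $i$ and $m - i$:

$$\begin{aligned} P(a_i = 1, a_{m-i} = 1 \mid K = 2) &= \frac{1/i + 1/(m - i)}{\sum_{j=1}^{m-1} 1/j} && \text{for } i \neq m - i \\ P(a_{m/2} = 2 \mid K = 2) &= \frac{2/m}{\sum_{j=1}^{m-1} 1/j} && \text{for } i = m - i. \end{aligned} \tag{11}$$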

That θ does not appear in these expressions reflects Ewens’s (1972) finding that the observed number of alleles K provides a sufficient statistic for the estimation of θ: the joint distribution of allele multiplicities (7) conditional on K is independent of θ.
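The correspondence between (11) and the folded form of (6) can be verified numerically. The following sketch, with illustrative function names, computes both for an odd and an even sample size and confirms that neither depends on θ.

```python
def scaled_multiplicity(m):
    """Unfolded scaled multiplicity model (6), as a dict keyed by multiplicity."""
    h = sum(1.0 / j for j in range(1, m))
    return {i: (1.0 / i) / h for i in range(1, m)}


def folded_biallelic(m):
    """Biallelic model (11): probability that the rarer allele has multiplicity i."""
    h = sum(1.0 / j for j in range(1, m))
    spectrum = {}
    for i in range(1, m // 2 + 1):
        if i != m - i:
            spectrum[i] = (1.0 / i + 1.0 / (m - i)) / h
        else:
            spectrum[i] = (2.0 / m) / h  # both alleles at multiplicity m/2
    return spectrum


def fold(unfolded, m):
    """Fold an unfolded spectrum by pooling multiplicities i and m - i."""
    folded = {}
    for i, p in unfolded.items():
        key = min(i, m - i)
        folded[key] = folded.get(key, 0.0) + p
    return folded


for m in (19, 20):
    direct = folded_biallelic(m)
    via_fold = fold(scaled_multiplicity(m), m)
    print(m, max(abs(direct[i] - via_fold[i]) for i in direct))
```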

2.4 Conditioning on a single segregating site

2.4.1 Neutral SNP model

Here, we use the term SNP to describe a non-recombining locus at which a single mutational event has occurred in the genealogy of a sample of genes. We describe sites at which two forms segregate in the sample as biallelic, recognizing SNPs as a subset of this group (see Table 1 for an example).

Table 1
Genetic variation in samples of size 19

The probability that the genealogy contains a single mutation and that it lies on level l is

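$$P(\mathrm{SNP}, \delta_l = 1) = \frac{\theta}{l - 1 + \theta} \prod_{j=2}^{m} \frac{j - 1}{j - 1 + \theta},$$

for $\delta_l$ an indicator variable that takes the value 1 only if the mutation occurs on level $l$. Summing over levels, we confirm that the probability of a SNP is

$$P(\mathrm{SNP}) = g_S'(0) = \sum_{l=2}^{m} \frac{\theta}{l - 1 + \theta} \prod_{j=2}^{m} \frac{j - 1}{j - 1 + \theta}, \tag{12}$$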

for gS(·) the Watterson pgf (2) of the number of segregating sites. Comparison of (10) and (12) illustrates the close relationship between conditioning on a single segregating site and conditioning on two segregating alleles: SNPs represent a subset of biallelic polymorphisms.

Conditional on a genealogy containing a single mutation, the mutation arose on level l with probability

$$P(\delta_l = 1 \mid \mathrm{SNP}) = \frac{1/(l - 1 + \theta)}{\sum_{j=2}^{m} 1/(j - 1 + \theta)}.$$

Under our neutral SNP model, a SNP-defining mutation occurs in exactly i of the m sampled genes with probability

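$$f_n(i \mid m, \theta) = \frac{\sum_{l=2}^{m-i+1} \frac{1}{\theta + l - 1} \binom{m-i-1}{l-2} \Big/ \binom{m-1}{l-1}}{\sum_{j=2}^{m} \frac{1}{\theta + j - 1}} = \frac{1}{i} \cdot \frac{\sum_{l=2}^{m-i+1} \frac{l - 1}{\theta + l - 1} \binom{m-l}{i-1} \Big/ \binom{m-1}{i}}{\sum_{j=2}^{m} \frac{1}{\theta + j - 1}}, \tag{13}$$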

using Eq. (14) of Fu (1995).

In contrast with (6) and (11), this expression depends on θ. In the limit as the rate of neutral substitution becomes small (θ → 0), (13) converges to (6). Otherwise, we expect classes of SNPs that differ with respect to the rate of neutral substitution to show different site frequency spectra, even under the standard neutral model.
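The dependence of (13) on θ, and its convergence to (6) as θ becomes small, can be checked with a few lines of Python. This is a minimal sketch using the first form of (13); the function names are illustrative.

```python
from math import comb


def scaled_multiplicity(m):
    """Scaled multiplicity model (6) for i = 1, ..., m - 1."""
    h = sum(1.0 / j for j in range(1, m))
    return [(1.0 / i) / h for i in range(1, m)]


def neutral_snp_spectrum(m, theta):
    """Neutral SNP model (13): P(derived allele has multiplicity i), i = 1..m-1."""
    denom = sum(1.0 / (theta + j - 1) for j in range(2, m + 1))
    spectrum = []
    for i in range(1, m):
        # Weight of level l times the probability that a level-l branch has i tips.
        num = sum(
            comb(m - i - 1, l - 2) / comb(m - 1, l - 1) / (theta + l - 1)
            for l in range(2, m - i + 2)
        )
        spectrum.append(num / denom)
    return spectrum


m = 19
for theta in (0.01, 1.0, 10.0):
    spec = neutral_snp_spectrum(m, theta)
    print(theta, "singletons:", spec[0], "multiplicity 18:", spec[-1])
print("Eq. (6) singletons:", scaled_multiplicity(m)[0])
```

The printed singleton proportions increase with θ while the proportion in the highest multiplicity class declines, in keeping with the comparison described for a sample of 19 genes.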

Figure 1 illustrates, for a sample of 19 genes, that the site frequency spectra expected under (6) and (13) show close correspondence for low rates of neutral substitution (θ = 0.01) but that the neutral SNP model (13) predicts more rare and fewer common derived alleles for large mutation rates (θ = 10). For samples of size m = 19 and the range of θ values in our simulations, the numbers of singletons and doubletons expected under the neutral SNP model (13) increase monotonically with θ and multiplicities 4 through 18 decrease monotonically, with the expectation for multiplicity 3 varying non-monotonically.

Figure 1
Histogram (bars) of the multiplicities of derived SNP alleles expected in trees with a single segregating site (13) compared to the expectation (curve) under (6), for low (left, θ = 0.01) and high (right, θ = 10) scaled rates of neutral ...

Like (5) and (6), the expressions shown here have been obtained previously in various contexts (e.g., Griffiths and Tavaré 1998, 2003; Stephens 2000; Hobolth et al. 2008). The relationship between the frequency in the sample of a mutation and the level of the genealogy on which it arose has been exploited to address the distribution of the age of a mutation (Griffiths and Tavaré 1998; Wiuf and Donnelly 1999; Stephens 2000), and (13) can be obtained by rearranging equation (28) of Stephens (2000). Griffiths and Tavaré (2003) showed that the generalization of (13) to accommodate variable population size reduces to (6) under constant population size and low mutation rate (θ → 0).

2.4.2 Estimating θ

Construction of the spectrum expected under the neutral SNP model (13) requires an estimate of θ. In our goodness of fit analyses to the expected counts under the neutral SNP model (13), we substituted the maximum-likelihood estimate (MLE) of θ and reduced the degrees of freedom by one.

For n the number of SNP loci observed, D the observed spectrum of derived allele multiplicities, and T the total number of nucleotide sites, the likelihood of θ corresponds to

$$P(D, n \mid T, \theta) = P(D \mid n, T, \theta)\,P(n \mid T, \theta). \tag{14}$$

We model each derived allele count in Table 2 as the realization of an independent Poisson random variable, which implies that the total number of counts $n$ ($= \sum_i n_i$) also has a Poisson distribution:

$$P(n = k \mid T, \theta) = \frac{\lambda^k e^{-\lambda}}{k!},$$

for $\lambda$ the expected number of SNPs observed,

$$\lambda = T\,P(\mathrm{SNP}),$$

with P(SNP) given by (12). Conditional on $n$ (sum of counts in a given row of Table 2), the joint distribution of multiplicities $P(D \mid n, T, \theta)$ is multinomial (see, for example, Bishop et al. 1975, Chap. 13).

Table 2
Derived allele counts in sample genealogies containing a single mutation
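The likelihood (14) can be sketched in Python as the sum of a Poisson log-probability for the number of SNPs and a multinomial log-probability for their multiplicities. The R code in the Supplementary Data is not reproduced here; the grid search, the function names, and the input counts below are assumptions for illustration. In particular, the "data" are the exact expected counts at a hypothetical θ = 2, not the simulated counts of Table 2.

```python
from math import comb, lgamma, log, prod


def prob_snp(m, theta):
    """P(SNP) from Eq. (12)."""
    no_mut = prod((j - 1) / (j - 1 + theta) for j in range(2, m + 1))
    return no_mut * sum(theta / (l - 1 + theta) for l in range(2, m + 1))


def neutral_snp_spectrum(m, theta):
    """Neutral SNP model (13) for multiplicities i = 1, ..., m - 1."""
    denom = sum(1.0 / (theta + j - 1) for j in range(2, m + 1))
    return [
        sum(comb(m - i - 1, l - 2) / comb(m - 1, l - 1) / (theta + l - 1)
            for l in range(2, m - i + 2)) / denom
        for i in range(1, m)
    ]


def log_likelihood(theta, counts, T):
    """log P(D, n | T, theta) from Eq. (14): Poisson term plus multinomial term."""
    m = len(counts) + 1
    n = sum(counts)
    lam = T * prob_snp(m, theta)
    log_poisson = n * log(lam) - lam - lgamma(n + 1)
    spectrum = neutral_snp_spectrum(m, theta)
    log_multinomial = (
        lgamma(n + 1)
        - sum(lgamma(c + 1) for c in counts)
        + sum(c * log(p) for c, p in zip(counts, spectrum))
    )
    return log_poisson + log_multinomial


def grid_mle(counts, T, grid):
    """Grid point with the highest log-likelihood (a crude stand-in for an optimizer)."""
    return max(grid, key=lambda th: log_likelihood(th, counts, T))


# Hypothetical input: expected (non-integer) counts under theta = 2 with T = 10^6 loci.
m, T, theta0 = 19, 10**6, 2.0
counts = [T * prob_snp(m, theta0) * p for p in neutral_snp_spectrum(m, theta0)]
grid = [0.05 * k for k in range(1, 201)]
print(grid_mle(counts, T, grid))
```

An approximate confidence interval of the kind reported below could be read off the same grid as the set of θ values whose log-likelihood lies within a fixed number of units of the maximum.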

3 Patterns of variation in simulated data

We used ms (Hudson 2002) to simulate $10^6$ data sets of size m = 19 for each of 12 assignments of θ (0.5 to 6.0, in increments of 0.5).
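For readers without access to ms, the quantities tabulated below can be approximated with a small stand-alone simulation. The sketch that follows is not ms and does not reproduce its output format. It relies on the level-by-level description of Section 2.1: on level $l$, each successive event is a mutation with probability $\theta/(l - 1 + \theta)$ and otherwise a coalescence, and each mutation falls on one of the $l$ lineages chosen uniformly at random. The function names, seed, and number of replicates are illustrative assumptions.

```python
import random


def simulate_multiplicities(m, theta, rng):
    """Simulate one sample genealogy; return the multiplicity of every mutation."""
    sizes = [1] * m  # number of descendant tips of each current lineage
    mults = []
    for l in range(m, 1, -1):
        p_mut = theta / (l - 1 + theta)
        # Mutations accumulate on level l until a coalescence ends the level.
        while rng.random() < p_mut:
            mults.append(rng.choice(sizes))  # uniformly chosen lineage
        i, j = rng.sample(range(l), 2)  # the pair of lineages that coalesces
        merged = sizes[i] + sizes[j]
        sizes = [s for k, s in enumerate(sizes) if k not in (i, j)]
        sizes.append(merged)
    return mults


def tabulate(m, theta, reps, seed=1):
    """Count trees with zero, one, or more mutations, and the SNP spectrum."""
    rng = random.Random(seed)
    zero = one = more = total_mutations = 0
    snp_counts = [0] * (m - 1)
    for _ in range(reps):
        mults = simulate_multiplicities(m, theta, rng)
        total_mutations += len(mults)
        if not mults:
            zero += 1
        elif len(mults) == 1:
            one += 1
            snp_counts[mults[0] - 1] += 1
        else:
            more += 1
    return zero, one, more, snp_counts, total_mutations


m, theta, reps = 19, 1.0, 20_000
zero, one, more, snp_counts, total_mutations = tabulate(m, theta, reps)
print("zero / one / more:", zero, one, more)
print("mean S:", total_mutations / reps)
print("SNP spectrum counts:", snp_counts)
```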

3.1 Magnitude of segregating variation

Table 1 presents the number among the $10^6$ 19-gene sample genealogies, simulated under the indicated value of θ, that contained zero, exactly one, or more than one mutation. Also shown are the numbers of samples comprising exactly two alleles and the proportion of those that contained more than one mutation. As the rate of neutral substitution increases, the proportion of samples with a single mutation declines and the percentage of two-haplotype samples that contain more than one mutation tends to increase.

Table 2 shows the multiplicities of mutations at loci for which the sample genealogy contained a single mutational event (mutation number equal to 1 in Table 1).

3.2 Number of mutations having a given multiplicity

Fu’s (1995) analysis addressed the number (rather than proportion) of mutations that have a given multiplicity in a random sample genealogy. For each of the $10^6$ sample genealogies simulated under a given assignment of θ, we distinguished among all mutations, in accordance with the infinite sites model assumed by Fu (1995), and determined the total number that occurred in each multiplicity. Table 3 indicates an excellent fit to the analytical mean and variance (5) of data simulated under θ = 1.0, and other values of θ gave similar results.

Table 3
Mean and variance of the number of mutations with the indicated multiplicity under the infinite-sites model

Figure 2 shows the number of trees simulated under θ = 6.0 which contained the number of mutations indicated on the abscissa in multiplicity 5. This distribution matches the analytical expressions (5) in mean and variance, but has a pronounced right skew. Fu (1996) developed a test based on a Hotelling-like statistic, with critical values determined by simulation, as a means of using the frequency spectrum to detect departures from the standard neutral model.

Figure 2
Histogram of the number, among $10^6$ samples of size m = 19 simulated under θ = 6.0, that contain the number of mutations on the abscissa in multiplicity 5.

3.3 Estimates of θ

We compare estimates of θ inferred from the number of segregating sites (Watterson 1975), the number of alleles (Ewens 1972), and the site frequency spectrum (14).

3.3.1 Number of segregating sites

We found excellent agreement between the Watterson distribution (4) and the total number of segregating sites (S) observed among the $10^6$ trees simulated under each value of θ (analyses not shown).

In a Bayesian context, we examined the posterior distribution of θ based on the number of segregating sites:

$$P(\theta \mid S) = \frac{P(S \mid \theta)\,P(\theta)}{P(S)}.$$

Assuming a prior P(θ) taking a uniform distribution over [0.01, 100] and zero probability elsewhere, we rescaled the likelihood function (4) implied for each of the $10^6$ sample genealogies generated under a given assignment of θ to obtain a posterior distribution of θ and determined its 95% credible interval. Table 4 compares the posterior mode and percentage of credible intervals that contained the actual value of θ. For values of θ greater than 0.5, the observed number of segregating sites appears to provide lower than expected coverage probabilities and overestimates of θ.
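Under a uniform prior, the posterior is the likelihood (4) rescaled to unit mass over [0.01, 100]. The sketch below does this on a discrete grid for a single hypothetical observation ($S = 3$ in a sample of 19 genes). The grid spacing and the use of an equal-tailed interval are assumptions; the text does not specify how the credible intervals were constructed.

```python
from math import comb


def prob_segregating_sites(i, m, theta):
    """P(S = i | theta) from Eq. (4), clipped at zero to absorb round-off
    from the alternating sum when the true probability is negligible."""
    total = sum(
        (-1) ** (l - 2) * comb(m - 2, l - 2) * (theta / (l - 1 + theta)) ** (i + 1)
        for l in range(2, m + 1)
    )
    return max((m - 1) / theta * total, 0.0)


def posterior_on_grid(s_obs, m, grid):
    """Posterior mass on each grid point under a uniform prior over the grid."""
    like = [prob_segregating_sites(s_obs, m, th) for th in grid]
    total = sum(like)
    return [v / total for v in like]


def mode_and_interval(grid, post, mass=0.95):
    """Posterior mode and an equal-tailed credible interval (an assumed convention)."""
    mode = grid[max(range(len(grid)), key=post.__getitem__)]
    tail = (1.0 - mass) / 2.0
    cum, lower, upper = 0.0, None, None
    for th, p in zip(grid, post):
        cum += p
        if lower is None and cum >= tail:
            lower = th
        if cum >= 1.0 - tail:
            upper = th
            break
    return mode, lower, upper


grid = [0.01 * k for k in range(1, 10_001)]  # 0.01, 0.02, ..., 100
post = posterior_on_grid(3, 19, grid)        # hypothetical observation S = 3
mode, lower, upper = mode_and_interval(grid, post)
print(mode, lower, upper)
```

Repeating this calculation for each simulated genealogy and recording whether the interval covers the generating value of θ gives a coverage probability of the kind reported in Table 4.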

Table 4
Bayesian estimates of θ and coverage probabilities

3.3.2 Number of alleles

Table 5 indicates excellent corroboration of the expressions (9) given by Ewens (1972) for the mean and variance of the number of distinct alleles (K). Figure 3 shows a close match of the entire simulated distribution of K to expectation (8). Ewens (1972) showed that the third and fourth moments of the distribution approach zero for large sample size (m), and Fig. 3 suggests a Gaussian-like shape even for m = 19 under the larger values of θ.

Figure 3
Histograms (bars) of the observed number of trees that contain the number of alleles indicated on the abscissa compared to the analytical distribution (curves) from (8).
Table 5
Observed and expected moments of allele number

Table 4 gives the posterior mode and the proportion of credible intervals that contained the actual value of θ inferred from the number of alleles (8), assuming again a uniform prior distribution over [0.01, 100] for θ. Both aspects appear to show trends similar to those for estimates based on the number of segregating sites, with the overestimation of θ more noticeable for large values of θ.

3.3.3 Site frequency spectra from sites with single mutations

For each row in Table 2, we used (14), with n equal to the row sum (number of SNP loci) and $T = 10^6$, to obtain a maximum likelihood estimate (MLE) of θ. Table 6 presents the MLEs, their approximate 95% confidence intervals (2 log-likelihood units around the mode), and the number of trees on which the estimates are based (from the single mutation column of Table 1). The higher uncertainty of estimates for data sets generated for values of θ equal to 4.5 or greater likely reflects inadequacy of the Yates continuity correction for the low number of loci.

Table 6
Maximum likelihood estimate of θ and approximate confidence interval

To examine the coverage probabilities of Bayesian credible intervals based on the site frequency spectrum, we partitioned the $10^6$ trees simulated under a given value of θ into 100 groups of $10^4$ trees and conducted a separate analysis on each group, assuming as before a uniform prior distribution over [0.01, 100]. Table 4 presents the posterior modes, the proportion of credible intervals that included the actual value of θ, and the numbers of trees on which the estimates were based. We did not explore values of θ greater than 3.0 due to the low number of trees containing a single segregating site. As for the MLEs in Table 6, the width of the average credible interval increased with θ and the number of trees used in the estimation declined. For θ up to 3.0, basing the estimation on the site frequency spectrum appears to improve accuracy over using only the number of segregating sites (4) or only the number of alleles (8). For higher values of θ, the low number of trees per group appeared to have compromised the estimation of both the mode and the coverage probability.

4 Goodness of fit of site frequency spectra

4.1 Data restricted to sites with a single mutational event

For the subset of simulated sample genealogies that contained a single segregating site (Table 2), we computed the Pearson chi-square statistic under the scaled multiplicity (6) and neutral SNP (13) models, using the MLEs for θ given in Table 6 for the latter. To avoid large departures of the counts from approximate continuity (see Section 3.3.3), we excluded from consideration data sets generated for values of θ greater than 4.0, for which the expected counts in cells representing the highest multiplicities fell below 5.
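A goodness of fit analysis of this kind can be sketched with the Python standard library alone. The code below computes the Pearson statistic, enforces the minimum expected count of 5, and obtains the upper-tail probability of the chi-square distribution from the recurrence $Q(\nu + 2) = Q(\nu) + (x/2)^{\nu/2} e^{-x/2} / \Gamma(\nu/2 + 1)$. The function names and the input counts are illustrative assumptions: the "observed" counts are rounded expectations under (6), not simulated data.

```python
from math import erfc, exp, gamma, sqrt


def pearson_x2(observed, expected_props):
    """Pearson X^2 for observed counts against expected proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_props))


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2:
        q, nu = erfc(sqrt(y)), 1  # closed form for df = 1
    else:
        q, nu = exp(-y), 2        # closed form for df = 2
    while nu < df:
        q += y ** (nu / 2.0) * exp(-y) / gamma(nu / 2.0 + 1.0)
        nu += 2
    return q


def goodness_of_fit(observed, expected_props, n_estimated=0, min_expected=5.0):
    """Return (X^2, df, p-value); df is reduced by the number of fitted parameters."""
    n = sum(observed)
    if min(n * p for p in expected_props) < min_expected:
        raise ValueError("an expected count falls below the minimum")
    x2 = pearson_x2(observed, expected_props)
    df = len(observed) - 1 - n_estimated
    return x2, df, chi2_sf(x2, df)


# Hypothetical input: rounded expected counts under the scaled multiplicity model (6).
m = 19
h = sum(1.0 / j for j in range(1, m))
model = [(1.0 / i) / h for i in range(1, m)]
observed = [round(10_000 * p) for p in model]
print(goodness_of_fit(observed, model))
```

For a fit to the neutral SNP model (13) with θ estimated from the data, `n_estimated=1` gives the reduced degrees of freedom described in Section 2.4.2.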

Figure 4 indicates highly significant departures from the scaled multiplicity model (6). As expected for a fit to an incorrect model (1), the $X^2$ values tend to increase with number (n) of loci contributing to the SFS. In contrast, Figure 5, showing the fit to the correct model (single mutational event), indicates no obvious relationship between the total number of counts and the $X^2$ values obtained under the neutral SNP model (13). While the low $X^2$ values for most of the range under θ = 2.5 and high $X^2$ values under θ = 3.5 in Fig. 5 appear unusual, the simulated SNP data appear to lend much greater support to the correct neutral SNP model (13) than to the scaled multiplicity model (6).

Figure 4
Sample Pearson chi-square values for the fit to the scaled multiplicity model (6) of SFSs generated from simulated SNP data (single mutational event) for samples of m = 19 genes. Shown on the abscissa are the numbers of sets of 10,000 loci in the group ...
Figure 5
Sample Pearson chi-square values for the fit to the neutral SNP model (13) of SFSs generated from simulated SNP data (single mutational event) for samples of m = 19 genes. The dashed horizontal line indicates the expectation (df=16, after estimation of ...

Figure 6 shows the empirical cumulative distribution of p-values from goodness of fit tests to the neutral SNP model (13) for θ up to 2.0. For each value of θ, we partitioned the $10^6$ simulated trees into 100 groups of $10^4$ simulated trees. Elimination of non-SNP sites (trees showing a number of mutations different from 1) reduced the number of loci to the values indicated in the rightmost column of Table 4. For each partition, we determined the p-value associated with the spectrum constructed from those single-mutation trees. A perfect sample would show equality between the cumulative distribution of p-values and the nominal significance level (diagonal line). Figure 6 suggests a tendency toward false positives (Type I error) for θ = 0.5.

Figure 6
Empirical cumulative distribution of p-values obtained for 100 site frequency spectra for the neutral SNP model (13) applied to SNP data (single segregating site) simulated under the four indicated values of θ. The diagonal line represents the ...

4.2 Data restricted to biallelic sites

As noted in Section 2.3, the ESF (7) links the allele frequency spectrum to the expected number of mutations that occur in a sample in a given multiplicity (5a): the ESF, conditioned on the observation of exactly two alleles in the sample (11), corresponds to the folded version of the scaled multiplicity model (6).

Figure 7 shows, for assignments of θ up to 2.5, the $X^2$ values obtained in goodness of fit tests to the folded scaled multiplicity model (11) as a function of the number of sample genealogies considered (n). We refrained from analyzing data simulated under larger values of θ, for which the expected number of counts in at least one cell fell below 5. These plots suggest no obvious increase in the $X^2$ values with total count number, as would be expected under fitting to an incorrect model (1). Of possible concern in a number of cases is an unusually low $X^2$ value, indicating too good a fit.

Figure 7
Sample Pearson chi-square values for the fit to the folded scaled multiplicity model (11) of biallelic frequency spectra generated under the indicated θ value. The solid horizontal line represents the average $X^2$ value and the dashed line the expectation ...

Application of goodness of fit tests to the neutral SNP model (13), incorrectly regarding the two alleles segregating in each sample as defined by a single mutational event, underestimates θ and generates highly significant $X^2$ values (Table 7). As Fig. 1 would suggest, the neutral SNP model predicts too many samples containing a rare allele and a corresponding deficiency of samples with the two alleles in comparable frequencies. Although the expectations under the neutral SNP model (13) converge to those under the scaled multiplicity model (6) as θ becomes small, Table 1 indicates that even for θ = 0.5, a substantial fraction (nearly 17%) of the simulated biallelic data sets contained more than a single mutation.

Table 7
Biallelic data fit to the neutral SNP model

4.3 Other data sets showing a poor fit to the scaled multiplicity model

We also explored the effects of other kinds of data filtering, including construction of the spectrum of all mutations (no filtering) and restriction of consideration to a single random mutation in each tree or to a single random branch in each tree. These data partitions all showed a significantly poor fit to the scaled multiplicity model (6).

4.3.1 Random mutation

For each simulated sample genealogy that contained at least one mutation, we determined the multiplicity in the sample of a mutation chosen uniformly at random, without weighting by frequency in the sample. This subset contains all trees in Table 1 except those with zero segregating sites. We found increasing sample $X^2$ values with the number of trees, as in Fig. 4, and a marked departure between the nominal significance level and the cumulative distribution of p-values. Both aspects indicate a very poor fit to the scaled multiplicity model (6).

4.3.2 Size of a random branch

For a given simulated genealogy, we sampled a branch at random, weighting by length relative to the total length of the tree, and determined the number of tips descendant from that branch. While the number of descendants is not generally observable, this experiment permits us to examine branch size apart from the additional stochastic process of mutation. As the scaled neutral substitution rate θ does not affect the size of a branch, our ms output provides a total of $12 \times 10^6$ simulated samples suitable for this analysis. A very highly significant departure of simulated spectra from the spectrum expected under the scaled multiplicity model (6) is evident even from subsets of the data. For example, Table 8 compares the observed branch sizes and the expectations under the scaled multiplicity model (6) for a subsample of $10^6$ trees. Relative to predictions from the scaled multiplicity model (6), the observed trees showed an excess of branches of size 1 through 4 and a deficiency of branches of all larger sizes. A fit to the scaled multiplicity model would indicate a particularly large excess (18,371) of singletons (terminal branches), with this multiplicity class contributing almost 35% of the total $X^2$ value (3412, with df = 17), even in the absence of population expansions or selective sweeps.

Table 8
Observed and expected branch size distributions

4.3.3 All segregating sites

Although the relative numbers of alleles or mutations in a given sample genealogy are of course correlated (7), we constructed the spectrum of all mutations observed in many sample trees generated under a given value of θ, expecting correlations between mutations on the same tree to be dwarfed by the large number of independent trees. We partitioned the $10^6$ samples simulated under each value of θ into 100 groups of $10^4$ trees and determined the p-value associated with a fit to the scaled multiplicity model (6). The empirical cumulative distribution of p-values for the 1,200 tests using all $12 \times 10^6$ simulated trees indicated a very poor fit, with 80% of the tests giving p-values less than 0.04.

5 Discussion

5.1 Dependence of SNP site frequency spectra on θ

Fundamental to the interpretation of site frequency spectra constructed from surveys of genomic variation are the analyses of Ewens (1972) and Fu (1995) for the infinite-alleles and infinite-sites model of mutation, respectively. We note that the folded version of the scaled multiplicity model (6) can be obtained directly from the ESF (7), conditioned on biallelic samples (11). A striking property of the ESF (7) is that the allele frequency spectrum conditioned on the number of observed alleles (K) is independent of θ (Ewens 1972). This property is shared by the scaled multiplicity model (6), which is widely used to represent the expected SFS under the standard neutral model for genome-wide SNP data.

In contrast, the observed SFS for sample genealogies that contain a single mutational event (13) does in fact provide information about θ, implying that site frequency spectra can provide a basis for the estimation of this fundamental parameter. Tables 4 and 6 indicate high accuracy of both Bayesian and maximum likelihood estimates of θ for data sets comprising $10^3$ or more SNPs. Liu et al. (2009) have recently developed a generalized least squares method for estimating θ using Fu’s (1995) expressions for the means and covariances (5) of the counts (rather than proportions) of mutations across multiplicities.

Our analysis suggests that differences between the sampling distributions imposed by restriction to biallelic variation (11) and by restriction to single mutational events (13) are readily detectable in analyses incorporating the volume of data typical of genomic SNP surveys. SNPs constitute a subset of biallelic polymorphisms, and the probability of a single segregating site (12) is nearly identical to the probability of a biallelic polymorphism (10). However, our simulated data (Table 1) illustrate that for large θ a substantial proportion of biallelic polymorphisms may comprise multiple segregating mutations in the genealogy of the sample. Incorrect application of the neutral SNP model (13), which assumes a single segregating site, results in substantial underestimation of θ and highly significant $X^2$ values in goodness of fit analyses (Table 7). Similarly, data restricted to sample genealogies containing a single segregating site show highly significant departures (Fig. 4) from the scaled multiplicity model (6) or the biallelic model (11), while giving strong support to the correct neutral SNP model (Figs. 5 and 6).

As has been shown in various contexts (Griffiths and Tavaré 1998, 2003; Stephens 2000), the neutral SNP model (13) reduces to the scaled multiplicity model (6) in the limit of low rates of neutral substitution (θ → 0). As θ increases, samples conditioned on a single segregating site show more rare mutations and fewer common mutations (Fig. 1). Because an excess of rare variants is an iconic signature of selective sweeps or expansions in effective population size (Braverman et al. 1995; Simonsen et al. 1995), our results suggest that careful consideration of the sampling distribution of genomic variation may help to avoid unwarranted inferences about the operation of locus-specific evolutionary processes.
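The shift toward rare variants under conditioning on a single segregating site can be explored with a small Monte Carlo sketch. It is not the closed-form neutral SNP model (13). It assumes a standard Kingman coalescent with mutations falling on branches as a Poisson process at rate θ/2 per unit length, and weights each simulated tree by the probability that exactly one mutation occurs and lands on a branch subtending i tips. All function names are hypothetical.

```python
import math
import random

def branch_length_by_size(n, rng):
    """Simulate one coalescent tree for n tips; L[i] is the total length
    of branches that subtend exactly i tips."""
    sizes = [1] * n                               # tips below each current lineage
    L = [0.0] * (n + 1)
    k = n
    while k > 1:
        t = rng.expovariate(k * (k - 1) / 2.0)    # waiting time with k lineages
        for s in sizes:
            L[s] += t
        i, j = rng.sample(range(k), 2)            # random pair coalesces
        merged = sizes[i] + sizes[j]
        sizes = [s for idx, s in enumerate(sizes) if idx not in (i, j)]
        sizes.append(merged)
        k -= 1
    return L

def single_site_sfs(n, theta, reps=20000, seed=1):
    """Estimate P(multiplicity = i | exactly one mutation in the genealogy)."""
    rng = random.Random(seed)
    w = [0.0] * (n + 1)
    for _ in range(reps):
        L = branch_length_by_size(n, rng)
        # P(no mutation anywhere else) given total tree length
        damp = math.exp(-theta * sum(L) / 2.0)
        for i in range(1, n):
            w[i] += (theta / 2.0) * L[i] * damp   # one mutation, on a size-i branch
    z = sum(w[1:n])
    return [w[i] / z for i in range(1, n)]

low = single_site_sfs(6, 0.01)    # should be close to the 1/i proportions
high = single_site_sfs(6, 10.0)   # singleton class expected to gain weight
```

Weighting each tree by the single-mutation probability, rather than simulating mutations and discarding trees, avoids losing most replicates when θ is large.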

5.2 Departures from the scaled multiplicity model

We found that sampling distributions that show only subtle numerical and conceptual departures from canonical models can be very strongly rejected on the basis of sufficiently large numbers of observations. Describing kinds of neutral variation that show poor fits to the scaled multiplicity model (6) may be as valuable as confirming the fit to expectation of data generated under the correct model.
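The sensitivity of goodness of fit tests to sample size follows from the form of the Pearson statistic: for fixed true and hypothesized proportions, its value at the expected counts grows linearly with the number of observations. The sketch below uses made-up two-class proportions purely for illustration; they are not taken from the tables of this paper.

```python
def pearson_x2(observed, expected_props):
    """Pearson X^2 = sum (O - E)^2 / E, with E = N * hypothesized proportion."""
    n_total = sum(observed)
    return sum((o - n_total * q) ** 2 / (n_total * q)
               for o, q in zip(observed, expected_props))

def x2_at_expected_counts(true_props, null_props, n_total):
    """X^2 evaluated at the counts expected under the true proportions."""
    counts = [n_total * p for p in true_props]
    return pearson_x2(counts, null_props)

# Hypothetical proportions: the truth differs from the null by 0.02 per class.
true_props = [0.52, 0.48]
null_props = [0.50, 0.50]

small = x2_at_expected_counts(true_props, null_props, 1_000)
large = x2_at_expected_counts(true_props, null_props, 100_000)
# With 1 degree of freedom the 5% critical value is 3.841: the same
# discrepancy is undetectable at N = 10^3 and overwhelming at N = 10^5.
```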

Single nucleotide polymorphisms may be considered similar to polymorphisms due to a single mutational event in a sample genealogy, regardless of the total number of events in the tree. However, we observed very highly significant deviations from the scaled multiplicity model (6) of site frequency spectra generated by extracting a single random mutation from simulated trees. Similarly, the distributions of the size (number of descendent tips) of a randomly chosen branch and the multiplicities of all segregating sites do not fit the scaled multiplicity model (Section 4.3).

5.3 Modelling actual SNP data

We found that the allele frequency spectra generated by restricting simulated data to biallelic samples fit the folded scaled multiplicity model (11) very well (Fig. 7) and the neutral SNP model (13) very poorly (Table 7). Conversely, simulated data restricted to sample genealogies that contained a single mutational event gave strong support to the neutral SNP model and strongly rejected the scaled multiplicity model (Section 4.1).

We suggest that neither model describes actual SNP data. Segregation of exactly two nucleotide bases in a sample may reflect multiple independent substitutions of the same derived base for the ancestral base or back mutations, in violation of both the infinite-sites (5) and infinite-alleles models (7) as well as the neutral SNP model (13). Of the standard models of mutation in population genetics, SNPs may conform most closely to a finite-sites, K-allele model, in which new mutations assume one of four states (A, C, G, or T). Whether actual SNPs show site frequency spectra similar to those expected for the standard neutral model under this mutation process awaits further analytical and statistical development.

Acknowledgments

In a lifetime of work, Sam Karlin transformed several entire fields. It continues to be an honor to learn from him, to draw upon the part of his work that extended into evolutionary biology, and to contribute to this memorial volume. We are grateful to the anonymous reviewers for valuable comments and references to key works, Asger Hobolth for important insights, and Benjamin D. Redelings for questioning the interpretation of SNP spectra. Support from the National Evolutionary Synthesis Center (NESCent), Durham, NC, for a NESCent Postdoctoral Fellowship to GG and for the Genomic Introgression working group is gratefully acknowledged. Public Health Service grant GM 37841 (MKU) provided partial support for this research.

References

  • Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: Theory and practice. The MIT Press; 1975.
  • Braverman JM, Hudson RR, Kaplan NL, Langley CH, Stephan W. The hitchhiking effect on the site frequency spectrum of DNA polymorphism. Genetics. 1995;140:783–796.
  • Donnelly P. Partition structures, Polya urns, the Ewens sampling formula, and the ages of alleles. Theor Pop Biol. 1986;30:271–288.
  • Ethier SN, Griffiths RC. The infinitely-many-sites model as a measure-valued diffusion. Ann Probab. 1987;15:515–545.
  • Ewens WJ. The sampling theory of selectively neutral alleles. Theor Pop Biol. 1972;3:87–112.
  • Fu YX. Statistical properties of segregating sites. Theor Pop Biol. 1995;48:172–197.
  • Fu YX. New statistical tests of neutrality for DNA samples from a population. Genetics. 1996;143:557–570.
  • Griffiths RC, Lessard S. Ewens’ sampling formula and related formulae: combinatorial proofs, extensions to variable population size and applications to ages of alleles. Theor Pop Biol. 2005;68:167–177.
  • Griffiths RC, Tavaré S. The age of a mutation in a general coalescent tree. Commun Statist – Stochastic Models. 1998;14:273–295.
  • Griffiths RC, Tavaré S. The genealogy of a neutral mutation. In: Green PJ, Hjort NL, Richardson S, editors. Highly structured stochastic systems. Oxford Univ. Press; Oxford: 2003. pp. 393–412. chapter 13.
  • Hernandez RD, Williamson S, Bustamante CD. Context dependence, ancestral misidentification, and spurious signatures of natural selection. Mol Biol Evol. 2007;24:1792–1800.
  • Hobolth A, Uyenoyama MK, Wiuf C. Importance sampling for the infinite sites model. Statistical Applications in Genetics and Molecular Biology. 2008;7 Article 32.
  • Hobolth A, Wiuf C. The genealogy, site frequency spectrum and ages of two nested mutant alleles. Theor Pop Biol. 2009 this volume.
  • Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338.
  • Keightley PD, Eyre-Walker A. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics. 2007;177:2251–2261.
  • Kim S, Plagnol V, Hu TT, Toomajian C, Clark RM, Ossowski S, Ecker JR, Weigel D, Nordborg M. Recombination and linkage disequilibrium in Arabidopsis thaliana. Nat Genet. 2007;39:1151–1155.
  • Kimura M, Ohta T. The age of a neutral mutant persisting in a finite population. Genetics. 1973;75:199–212.
  • Kingman JFC. Random partitions in population genetics. Proc R Soc Lond A. 1978;361:1–20.
  • Liu X, Maxwell TJ, Boerwinkle E, Fu YX. Inferring population mutation rate and sequencing error rate using the SNP frequency spectrum in a sample of DNA sequences. Mol Biol Evol. 2009 in press.
  • Marth GT, Czabarka E, Murvai J, Sherry ST. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics. 2004;166:351–372.
  • Nielsen R, Hubisz MJ, Clark AG. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics. 2004;168:2372–2382.
  • Simonsen KL, Churchill GA, Aquadro CF. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics. 1995;141:413–429.
  • Stephens M. Times on trees, and the age of an allele. Theor Pop Biol. 2000;57:109–119.
  • Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105:437–460.
  • Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123:585–595.
  • Tavaré S. Line-of-descent and genealogical processes, and their applications in population genetics models. Theor Pop Biol. 1984;26:119–164.
  • Živković D, Wiehe T. Second-order moments of segregating sites under variable population size. Genetics. 2008;180:341–357.
  • Watterson GA. The sampling theory of selectively neutral alleles. Adv Appl Prob. 1974;6:463–488.
  • Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Pop Biol. 1975;7:256–276.
  • Wiuf C, Donnelly P. Conditional genealogies and the age of a neutral mutant. Theor Pop Biol. 1999;56:183–201.