Genetics. Sep 2009; 183(1): 249–258.
PMCID: PMC2746149

Frequency Spectrum Neutrality Tests: One for All and All for One

Abstract

Neutrality tests based on the frequency spectrum (e.g., Tajima's D or Fu and Li's F) are commonly used by population geneticists as routine tests to assess the goodness-of-fit of the standard neutral model on their data sets. Here, I show that these neutrality tests are specific instances of a general model that encompasses them all. I illustrate how this general framework can be taken advantage of to devise new more powerful tests that better detect deviations from the standard model. Finally, I exemplify the usefulness of the framework on SNP data by showing how it supports the selection hypothesis in the lactase human gene by overcoming the ascertainment bias. The framework presented here paves the way for constructing novel tests optimized for specific violations of the standard model that ultimately will help to unravel scenarios of evolution.

THE standard models of population genetics (i.e., the Wright–Fisher model and related ones) constitute null models for which an amazing amount of theory has been developed. Population geneticists have used some aspect of the theory (e.g., summary statistics) to test the goodness-of-fit of the standard model on a given data set. Rejection of the standard model typically suggests that alternative hypotheses, such as selection or demographic history, have to be accounted for. Although they test for more than neutrality, tests that compute the goodness-of-fit of the standard model have been referred to as “neutrality tests.” Since different neutrality tests have varying sensitivity to different violations of the standard model, one typically uses a plethora of tests on the data set of interest. One then hopes that the evolutionary processes that generated the data set will be, at least partially, uncovered by the tests. Although neutrality tests based on population samples exhibit important diversity, they can be assigned to families such as “haplotype tests” (e.g., Fu 1997; Depaulis and Veuille 1998) that use the distribution of haplotypes, “tree shape tests” that try to capture specific tree deformations (e.g., Ramos-Onsins and Rozas 2002), and “frequency spectrum tests” that are based on the frequency spectrum (e.g., Tajima 1989; Fu and Li 1993b; Fay and Wu 2000; Achaz 2008).

In this study, I investigate neutrality tests based on the frequency spectrum (hereafter referred to simply as neutrality tests) and show that they are all specific instances of a general framework. Neutrality tests compare two estimators of the population mutation parameter θ that characterizes the mutation–drift equilibrium. It is defined as θ = 2pNeμ, where p is the ploidy (1 for haploids and 2 for diploids), Ne is the effective population size, and μ is the locus neutral mutation rate. When the standard model is true, the expectations of the several unbiased estimators of θ are equal.

Typical estimators of θ, in a sample of n sequences, are equation M1, where S is the number of polymorphic sites and equation M2 (Watterson 1975), and equation M3, where π is the average pairwise difference between all sequences in the sample (Tajima 1983). If an outgroup is available, mutations at frequency i/n can be distinguished from mutations at frequency 1 − i/n. Following the notation of Fu (1995), ξ is a vector that represents the unfolded frequency spectrum composed of ξi, the number of polymorphic sites at frequency i/n in the sample (i ∈ [1, n − 1]). When no outgroup is available, the frequency spectrum is folded and is given by a vector η, composed of ηi, the number of polymorphic sites at both frequencies i/n and 1 − i/n. Accordingly, it has been shown that θ can be estimated from equation M4, with ξ1 the number of derived singletons (Fu and Li 1993b), from equation M5, with η1 the total number of singletons (derived and ancestral) (Fu and Li 1993b), and from equation M6 (Fay and Wu 2000). Recently, it has been suggested that singletons should be ignored when θ is estimated in samples with sequencing errors; this leads to estimators such as equation M7, and equation M8 (Achaz 2008). Other estimators of θ, such as equation M9 and equation M10, were designed to minimize their variance (Fu 1994b), although they can be computed using recursions only for a given value of θ.
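As an illustration, the classical estimators listed above can be computed directly from an unfolded spectrum. The following Python sketch is not the code used in this study. It assumes the standard textbook forms of the estimators (S/a_n for Watterson, π written as a sum over the spectrum, ξ1, η1(n − 1)/n, and the Fay and Wu estimator written as a sum of i²ξi), and the function names are hypothetical. Applied to the expected neutral spectrum E[ξi] = θ/i, every estimator should return θ.

```python
from math import comb


def harmonic(n):
    """a_n = sum_{i=1}^{n-1} 1/i (Watterson 1975)."""
    return sum(1.0 / i for i in range(1, n))


def classical_estimators(xi):
    """Classical theta estimators from an unfolded spectrum.

    xi[i-1] is the number of sites whose derived allele is carried by
    i of the n sampled sequences (i = 1..n-1); requires n > 2.
    """
    n = len(xi) + 1
    pairs = comb(n, 2)
    return {
        "theta_S": sum(xi) / harmonic(n),
        "theta_pi": sum(i * (n - i) * x for i, x in enumerate(xi, start=1)) / pairs,
        "theta_xi1": xi[0],
        "theta_eta1": (xi[0] + xi[-1]) * (n - 1) / n,
        "theta_H": sum(i * i * x for i, x in enumerate(xi, start=1)) / pairs,
    }


# Expected spectrum under the standard model: E[xi_i] = theta / i.
n, theta = 30, 10.0
neutral = [theta / i for i in range(1, n)]
estimates = classical_estimators(neutral)
```

On real data the estimators differ from one another, and those differences are what the neutrality tests described next exploit.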

Neutrality tests compute the goodness-of-fit of a statistic T, which is the difference between two estimators of θ, normalized by its standard deviation:

$$T = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{\mathrm{Var}[\hat{\theta}_1 - \hat{\theta}_2]}}$$
(1)

For a given θ, under the standard model, T has a mean of E[T] = 0 and a variance of Var[T] = 1. Lowercase letters (e.g., t) denote the absolute difference (i.e., the numerator only) and uppercase letters (e.g., T) denote the normalized difference (Equation 1) throughout this work. Interestingly, the variance in the denominator is a function of both θ and θ². Because θ is unknown, the denominator cannot be computed as such. In practice, unbiased estimators of θ and θ² must be used instead. Because the variance of equation M12 vanishes asymptotically in a very large sample (equation M13), θ and θ² are, in practice, substituted by estimators based on S (Tajima 1989), which changes the mean and the variance of T to E[T] ≈ 0 and Var[T] ≈ 1.

Tajima's D (Tajima 1989) is defined by equation M14; the statistics proposed by Fu and Li (1993b) are equation M15, equation M16, equation M17, and equation M18. Another classical statistic is equation M19 (Fay and Wu 2000), even though its variance was not given by the authors. Finally, two other related neutrality tests that are, a priori, immune to sequencing errors were proposed: equation M20 and equation M21 (Achaz 2008). Other tests based on θξ and θη (which are optimized for a given θ-value) as well as the difference between the observed and the expected values of the frequency spectrum were also proposed (Fu 1996).
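For concreteness, Tajima's D can be computed from n, S, and π alone. The sketch below assumes the usual constants of Tajima (1989) (a1, a2, b1, b2, c1, c2, e1, e2), with θ and θ² replaced by estimators based on S. The helper name `tajimas_d` is hypothetical and this is not the implementation distributed with this study.

```python
from math import sqrt


def tajimas_d(n, s, pi):
    """Tajima's D from sample size n, segregating sites s, and mean pairwise difference pi."""
    if n < 4 or s < 1:
        # For n = 2 or 3 the variance term is zero; without segregating sites D is undefined.
        raise ValueError("Tajima's D needs n >= 4 and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Numerator: theta_pi - theta_S; denominator: its estimated standard deviation.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Illustrative input: 10 sequences, 16 segregating sites, pi = 3.888889.
d_value = tajimas_d(10, 16, 3.888889)  # roughly -1.45
```

A negative value such as this one indicates that π is smaller than S/a_n, that is, an excess of low-frequency variants relative to the standard model.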

Here, I show that when using a general weighted linear combination of equation M22 (or equation M23 when no outgroup is available), any estimators of θ [i.e., equation M24] and consequently any neutrality tests can be derived. Nawa and Tajima (2008) recently advocated the use of the equation M25 spectrum, which is expected to be uniform under the standard model, as a visual test for neutrality instead of the classical frequency spectrum. This last proposal is in complete agreement with the current work. Importantly, it has been previously reported that some θ-estimators and neutrality tests could be expressed as specific linear combinations of ξi or ηi (Tajima 1997; Wakeley 2009). Furthermore, Fu (1997) shows that several θ-estimators can be expressed as specific linear combinations of equation M26 (equation M27) or in a related framework that uses equation M28 instead of equation M29. equation M30 was subsequently designed as equation M31 (Fay and Wu 2000). However, some estimators (like equation M32, equation M33, or equation M34) cannot be expressed using the Fu (1997) framework. To the best of my knowledge, no previous study has explicitly derived the framework presented here. No work has yet highlighted the striking simplicity of θ-estimators and related tests, when expressed in this framework. I further show how the use of such a simple framework greatly facilitates the study of previous θ-estimators and their related neutrality tests and how it opens the door for constructing yet undiscovered interesting θ-estimators and neutrality tests with enhanced power.

MODEL

With an outgroup:

According to Fu (1995), we know that

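$$E[\xi_i] = \frac{\theta}{i}$$
(2)
$$\mathrm{Var}[\xi_i] = \frac{\theta}{i} + \sigma_{ii}\,\theta^2$$
(3)
$$\mathrm{Cov}[\xi_i, \xi_j] = \sigma_{ij}\,\theta^2 \quad (i \neq j)$$
(4)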

where σii and σij depend only on n and are given in Equation 2 of Fu (1995). This shows that E[iξi] = θ and therefore that any ξi can be used to construct an unbiased estimator of θ:

$$\hat{\theta}_i = i\,\xi_i$$
(5)

Consequently, a linear combination equation M39 of the equation M40's (in which the weights sum to 1) is also an unbiased estimator of θ. Mathematically, it is expressed as

equation M41
(6)

where ωi is the weight of each equation M42 in the combined estimator. Therefore, any estimator based on the frequency spectrum can be solely described by an equation M43-vector. Importantly, it should be mentioned that Fu (1997) also proposed a linear combination of iξi, but in which only a subset of the weight vectors was used. Namely, the proposed weight vectors were restricted to ωi = i^x.
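The weighted estimator can be sketched in a few lines. The code below assumes that Equation 6 has the form Σ ωi·iξi / Σ ωi, that is, a weight vector normalized to sum to 1 and applied to the iξi. The example spectrum is made up, and the weight vectors shown (1/i for S, n − i for π, a single 1 for ξ1, and i for the Fay and Wu estimator) are the ones that reproduce the textbook forms of these estimators; they are offered as an illustration rather than as a transcription of Table 1.

```python
def theta_omega(xi, omega):
    """Generic estimator: sum_i omega_i * (i * xi_i) / sum_i omega_i."""
    if len(xi) != len(omega):
        raise ValueError("xi and omega must both have n - 1 entries")
    total = sum(omega)
    return sum(w * i * x for i, (w, x) in enumerate(zip(omega, xi), start=1)) / total


n = 8
xi = [9, 4, 2, 3, 1, 0, 1]  # hypothetical unfolded spectrum xi_1..xi_7

weights = {
    "S": [1.0 / i for i in range(1, n)],       # Watterson: omega_i = 1/i
    "pi": [float(n - i) for i in range(1, n)],  # Tajima: omega_i = n - i
    "xi1": [1.0] + [0.0] * (n - 2),             # derived singletons only
    "H": [float(i) for i in range(1, n)],       # Fay and Wu: omega_i = i
}
estimates = {name: theta_omega(xi, w) for name, w in weights.items()}
```

Each classical estimator is recovered by changing only the weight vector, which is the sense in which an estimator is fully described by its ω-vector.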

Using Equations 3 and 4 the variance of equation M44 can be shown to be

equation M45
(7)
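The variance can be evaluated numerically for any weight vector. The sketch below assumes Var[ξi] = θ/i + σii θ² and Cov[ξi, ξj] = σij θ², so that for weights normalized to 1 the variance is θ Σ i ωi² + θ² Σi Σj i j ωi ωj σij. The σij expressions are transcribed from memory of Fu (1995) and should be checked against that paper before any serious use; all function names are hypothetical. As a consistency check, the result is compared with the classical variances of S/a_n and π.

```python
def a(n):
    """a_n = sum_{k=1}^{n-1} 1/k (so a_1 = 0)."""
    return sum(1.0 / k for k in range(1, n))


def beta(n, i):
    """beta_n(i) as transcribed from Fu (1995); defined for 1 <= i < n."""
    return 2.0 * n / ((n - i + 1) * (n - i)) * (a(n + 1) - a(i)) - 2.0 / (n - i)


def sigma(n, i, j):
    """Coefficient of theta^2 in Var[xi_i] (i == j) or Cov[xi_i, xi_j] (i != j)."""
    if i < j:
        i, j = j, i
    if i == j:
        if 2 * i < n:
            return beta(n, i + 1)
        if 2 * i == n:
            return 2.0 * (a(n) - a(i)) / (n - i) - 1.0 / i**2
        return beta(n, i) - 1.0 / i**2
    if i + j < n:
        return (beta(n, i + 1) - beta(n, i)) / 2.0
    if i + j == n:
        return ((a(n) - a(i)) / (n - i) + (a(n) - a(j)) / (n - j)
                - (beta(n, i) + beta(n, j + 1)) / 2.0 - 1.0 / (i * j))
    return (beta(n, j) - beta(n, j + 1)) / 2.0 - 1.0 / (i * j)


def var_theta_omega(omega, theta):
    """Variance of the weighted estimator for a given theta (weights are normalized here)."""
    n = len(omega) + 1
    total = sum(omega)
    w = [x / total for x in omega]
    linear = sum(i * w[i - 1] ** 2 for i in range(1, n))
    quadratic = sum(i * j * w[i - 1] * w[j - 1] * sigma(n, i, j)
                    for i in range(1, n) for j in range(1, n))
    return theta * linear + theta**2 * quadratic


n, theta = 30, 10.0
a1 = a(n)
a2 = sum(1.0 / k**2 for k in range(1, n))
var_s = var_theta_omega([1.0 / i for i in range(1, n)], theta)
var_pi = var_theta_omega([float(n - i) for i in range(1, n)], theta)
# Classical results: Var[S]/a_n^2 (Watterson 1975) and Var[pi] (Tajima 1983).
expected_s = theta / a1 + a2 * theta**2 / a1**2
expected_pi = ((n + 1) / (3.0 * (n - 1)) * theta
               + 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1)) * theta**2)
```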

Following Tajima (1989), using Equation 1, one can compute a normalized statistic that is, in the general framework,

equation M46
(8)

which can be expressed as a function of an Ω-vector,

equation M47
(9)

with

equation M48
equation M49
equation M50

The Ω-vector results from the difference between two weight vectors normalized to 1. As a consequence, (1) all elements of the Ω-vector sum to 0 and (2) the sum of all positive values cannot be >1 and the sum of all negative values cannot be < −1. Any vector that fits these two constraints can be considered, along with Equation 9, as a neutrality test.
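The construction of an Ω-vector and its two constraints can be checked numerically. In the sketch below, the helper names and the example spectrum are hypothetical. The D-like and H-like vectors are built from the weight vectors n − i (for π), 1/i (for S), and i (for the Fay and Wu estimator), which is an assumption based on the textbook forms of these estimators. Only the numerator t of the test is computed; the normalization by the standard deviation is left out.

```python
def normalize(weights):
    """Scale a weight vector so that it sums to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def omega_vector(w1, w2):
    """Omega_i = w1_i / sum(w1) - w2_i / sum(w2)."""
    return [p - q for p, q in zip(normalize(w1), normalize(w2))]


def numerator(xi, big_omega):
    """t = sum_i Omega_i * i * xi_i, the raw difference between two estimators."""
    return sum(o * i * x for i, (o, x) in enumerate(zip(big_omega, xi), start=1))


n = 8
xi = [9, 4, 2, 3, 1, 0, 1]  # hypothetical unfolded spectrum

w_pi = [float(n - i) for i in range(1, n)]
w_s = [1.0 / i for i in range(1, n)]
w_h = [float(i) for i in range(1, n)]

omega_d = omega_vector(w_pi, w_s)  # D-like: theta_pi - theta_S
omega_h = omega_vector(w_pi, w_h)  # H-like: theta_pi - theta_H
t_d = numerator(xi, omega_d)
t_h = numerator(xi, omega_h)
```

Because each Ω-vector is a difference of two normalized vectors, its entries sum to 0 and its positive part cannot exceed 1, whatever pair of weight vectors is chosen.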

Without an outgroup:

If no adequate outgroup is available, the unfolded frequency spectrum, and consequently the equation M51 spectrum, cannot be computed. This implies that one has to use the equation M52 folded frequency spectrum. Following Fu (1995), we define equation M53 and therefore we have

equation M54
(10)
equation M55
(11)
equation M56
(12)

where δi,n−i is a Kronecker delta (δi,j = 1 if i = j, and 0 otherwise) and where

equation M57
equation M58
equation M59

Although we cannot compute the equation M60 spectrum (as defined above), we can compute a folded equation M61 spectrum defined as

equation M62
(13)

This folded equation M63 spectrum is the visual neutrality test proposed by Nawa and Tajima (2008). Using a similar reasoning to that above, a linear combination of equation M64 leads to a generic unbiased estimator of θ defined as

equation M65
(14)

whose variance is given by

equation M66
(15)

Consequently, the corresponding neutrality test equation M67 is

equation M68
(16)

with

equation M69
equation M70
equation M71

It is important to mention that Tajima (1997) previously showed that D, F*, and equation M72 could be expressed as a linear combination of ηi. More precisely, the vectors used then correspond in the present framework to equation M73. This vector definition emphasizes the weight on each ηi rather than on each equation M74.
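The folding step can be made concrete with a short sketch. It assumes that ηi = (ξi + ξn−i)/(1 + δi,n−i), and it derives the per-class estimator from E[ξi] = θ/i: since E[ηi] = θ n / (i (n − i)(1 + δi,n−i)), multiplying ηi by i (n − i)(1 + δi,n−i)/n gives an unbiased estimator of θ. This is a reconstruction from the surrounding definitions, not a transcription of Equation 13, and the function names are hypothetical.

```python
def fold(xi):
    """Folded spectrum eta_1..eta_floor(n/2) from an unfolded spectrum xi_1..xi_{n-1}."""
    n = len(xi) + 1
    eta = []
    for i in range(1, n // 2 + 1):
        if i == n - i:
            # Middle class of an even sample: xi_i is pooled with itself, then halved.
            eta.append(xi[i - 1])
        else:
            eta.append(xi[i - 1] + xi[n - i - 1])
    return eta


def folded_theta_spectrum(eta, n):
    """One estimate of theta per folded frequency class."""
    spectrum = []
    for i, count in enumerate(eta, start=1):
        delta = 1 if i == n - i else 0
        spectrum.append(count * i * (n - i) * (1 + delta) / n)
    return spectrum


theta = 10.0
results = {}
for n in (30, 31):  # one even and one odd sample size
    neutral = [theta / i for i in range(1, n)]
    results[n] = folded_theta_spectrum(fold(neutral), n)
```

Under the expected neutral spectrum every folded class returns θ, so the folded θ spectrum is flat, as it is in the unfolded case.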

With or without an outgroup:

Using both definitions of equation M75 (Equation 5) and equation M76 (Equation 13), it is easy to show that we have

equation M77
(17)

As a consequence, the use of an equation M78-vector along with the equation M79 folded frequency spectrum is equivalent to the use of an equation M80-vector with the equation M81 unfolded frequency spectrum only when we have

equation M82
(18)

This makes clear that there is an equivalent equation M83-vector for any equation M84-vector that adheres to the following constraint:

equation M85
(19)

To fold the frequency spectrum, the weight iωi associated with ξi (and not with equation M86) has to be the same as the weight (n − i)ωn−i associated with ξn−i. This translates into an iωi vector that is symmetric around n/2. Furthermore, when the constraint (expressed in Equation 19) is fulfilled, we can write, for any 0 ≤ f ≤ 1,

equation M87

which leads interestingly for f = (n − i)/n to

equation M88
(20)

The weights on equation M89 simply result from the sums of the weights on equation M90 and on equation M91 that are pooled when the spectrum is folded. In that respect, any equation M92-vector complying with Equation 19 can be used without the help of an outgroup. The equation M93-vectors are then a subset of all possible values of the equation M94-vectors. The former can be computed from the latter by using Equation 18 or 20.
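Whether a given weight vector can be used without an outgroup can be tested mechanically. The sketch below checks the symmetry stated above, i·ωi = (n − i)·ωn−i, and pools the weights of classes i and n − i as (ωi + ωn−i)/(1 + δi,n−i). This pooled form is derived here from the symmetry constraint and is an assumption about the exact content of Equations 18 to 20; the function names are hypothetical.

```python
def is_foldable(omega, tol=1e-9):
    """True if i * omega_i equals (n - i) * omega_{n-i} for every i."""
    n = len(omega) + 1
    return all(abs(i * omega[i - 1] - (n - i) * omega[n - i - 1]) <= tol
               for i in range(1, n))


def fold_weights(omega):
    """Pooled weights for the folded classes i = 1..floor(n/2)."""
    n = len(omega) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        delta = 1 if i == n - i else 0
        folded.append((omega[i - 1] + omega[n - i - 1]) / (1 + delta))
    return folded


n = 30
w_s = [1.0 / i for i in range(1, n)]         # S-based estimator
w_pi = [float(n - i) for i in range(1, n)]    # pi-based estimator
w_h = [float(i) for i in range(1, n)]         # Fay and Wu estimator
w_xi1 = [1.0] + [0.0] * (n - 2)               # derived singletons

foldable = {name: is_foldable(w)
            for name, w in {"S": w_s, "pi": w_pi, "H": w_h, "xi1": w_xi1}.items()}
folded_pi = fold_weights(w_pi)
```

With these weight vectors, the S-based and π-based estimators pass the check while the estimators that need the orientation of mutations (derived singletons, high-frequency derived variants) do not.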

Because equation M95 is the difference between two normalized equation M96-vectors, all relationships between equation M97 and equation M98 expressed above also hold for equation M99 and equation M100.

RESULTS

The model described above shows that all estimators of θ based on the frequency spectrum are linear combinations of equation M101, weighted by a specific vector equation M102. When no outgroup is available, one can use a linear combination of equation M103, weighted by a vector equation M104. Consequently, neutrality tests can be expressed as a linear combination of equation M105 (or equation M106) weighted by a vector equation M107 (or equation M108), for which a variance can be computed easily. Three applications of the model are developed below. First, I reinvestigate the previous estimators of θ and their corresponding neutrality tests and frame their intrinsic properties in terms of the equation M109 (equation M110) spectrum. Then, since previous tests are only specific instances of the framework, I show how the model can be used to build new tests that are more powerful than previous ones. Finally, I exemplify the benefit of the framework on real data that are known to be subject to an ascertainment bias.

Previous θ-estimators and neutrality tests:

Using Equation 6, all previously reported θ-estimators are given by an equation M111-vector (Table 1). When defined, the corresponding equation M112-vectors are also provided (Table 1). A graphical representation of four estimators of θ is shown in Figure 1. Figure 1 highlights that both equation M113 and equation M114 emphasize the low-frequency polymorphic sites in their estimation of θ (although not as much as equation M115, which is solely based on derived singletons) and that, on the contrary, equation M116 gives more weight to ancestral polymorphisms. Framed in the folded spectrum, equation M117 still weights more low plus high frequencies whereas equation M118 has a uniform weight. Potentially, using other weight vectors, one could express any undiscovered estimator of θ based on the frequency spectrum.

Figure 1.
Estimators of θ. A graphical view of the weight vectors of four typical estimators of θ (for n = 30). All values of the normalized vector sum to 1. In the top four panels, the equation M119-vectors that are defined for the unfolded frequency ...
TABLE 1
Basic characteristics of previous estimators of θ

The numerical variances of the previous estimators of θ are reported in Table 1 (for n = 30 and θ = 1, 10, 100). They can be computed either by their original derivations or by Equation 7. This clearly shows that, among previous estimators of θ, the variance of equation M152 is the smallest and that of equation M153 is the largest. This can be explained by the fact that the variance of equation M154 increases with i. As a consequence equation M155, which puts more weight on ancestral alleles, shows a larger variance. Interestingly, estimators without singletons have relatively small variances.

Previous neutrality tests are given in Table 2. A graphical representation of the equation M156-vectors (and equation M157 when defined) used in four previous tests is reported in Figure 2. Figure 2 shows that the sensitivity of the different tests differs although they share some common features. For example, D and F* both are negatively sensitive to both low and high frequencies (although more sensitive to low frequencies). D shows opposite sensitivity between medium frequencies and low/high frequency, whereas F* shows poor sensitivity to medium-frequency polymorphisms. F and F* have opposite effects on doubletons and singletons. Thus, deviations that enhance both will have opposite effects. Finally, H is oppositely skewed by low and high frequencies.

Figure 2.
Neutrality tests. A graphical view of the weight vectors of four typical neutrality tests (for n = 30). Because the equation M158-vectors used for neutrality tests are computed as a difference between two normalized vectors, all values of equation M159 sum to 0. In the ...
TABLE 2
Basic characteristics of neutrality tests

One crucial aspect of neutrality tests is their large variance under the neutral model. This variance induces a large confidence interval and therefore decreases their power to detect a deviation. It has been argued that this variance is a consequence of the tree shape variance and that neutrality tests based on the frequency spectrum are doomed to exhibit low power (Felsenstein 1992b).

As a consequence, an ideal neutrality test should minimize its variance under the standard model. The variances of the denominator of previous neutrality tests are given in Table 2 (for n = 30 and θ = 1, 10, 100). It is also important to mention that previous derivations of the f, f*, y, and y* variances give different values. Simulations show that the new derivations are the correct ones (supporting information, Table S1). First, it should be noted that the original D test has a very low variance when compared to all other tests. This is connected to the low variance of both equation M186 and equation M187. Second, the Y and Y* tests also have a small variance, although they ignore an important fraction of the data (i.e., singletons). All other tests have a similar variance.

This predicts that D typically will be sensitive to low, medium, and high frequencies and should be more powerful because it has a relatively low variance under neutrality. Therefore, it has the potential to be an excellent neutrality test and it appears that it is often one of the most powerful tests (Simonsen et al. 1995; Fu 1997). H is sensitive to either low or high frequencies; however, its larger variance predicts that it will be useful only when the distortion in the θ-spectrum is very strong. In practice, it is powerful only when there is a large excess of high-frequency polymorphisms. The singleton tests appear to be good candidates to capture an excess of singletons, although they neglect other deviations in the spectrum. The Y and Y* tests have low variance, although ignoring singletons can lead to low power especially when they are in excess (Achaz 2008).

Building new tests:

To design new neutrality tests using this framework I started by analyzing the deviation of the average equation M188 spectrum, which is expected to be uniform under the standard models. Furthermore, because Fu (1995) showed that the covariance between ξi's is weak when compared to their variance, visual inspection of the variance of equation M189 provides a first approximation to the expected variance of equation M190 and therefore of their related Tω tests. I studied two deviations from the standard model: a severe bottleneck and isolated populations with migration.
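The expectation that the average iξi spectrum is flat under the standard model can be checked with a small coalescent simulation. The following sketch is not the simulator used in this study. It assumes a single non-recombining locus, the infinite-sites model, time scaled so that each pair of lineages coalesces at rate 1, and mutations falling on each branch at rate θ/2; the function names, the sample size, and the seed are arbitrary choices.

```python
import random


def poisson(lam, rng):
    """Poisson draw obtained by counting unit-rate arrivals in [0, lam)."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_sfs(n, theta, rng):
    """One unfolded spectrum xi_1..xi_{n-1} under the standard neutral coalescent."""
    lineages = [1] * n  # each lineage stores how many sampled sequences descend from it
    xi = [0] * (n - 1)
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2.0)  # time until the next coalescence
        for size in lineages:
            # A mutation on a lineage with `size` descendants has frequency size/n.
            xi[size - 1] += poisson(theta * wait / 2.0, rng)
        first, second = rng.sample(range(k), 2)
        merged = lineages[first] + lineages[second]
        lineages = [s for idx, s in enumerate(lineages) if idx not in (first, second)]
        lineages.append(merged)
    return xi


def mean_theta_spectrum(n, theta, replicates, seed=1):
    """Average of i * xi_i over independent replicates."""
    rng = random.Random(seed)
    sums = [0.0] * (n - 1)
    for _ in range(replicates):
        for i, count in enumerate(simulate_sfs(n, theta, rng), start=1):
            sums[i - 1] += i * count
    return [s / replicates for s in sums]


spectrum = mean_theta_spectrum(n=10, theta=5.0, replicates=4000)
```

Each entry of `spectrum` should be close to θ = 5, with more noise at high frequencies, where the variance of iξi is larger. Replacing the constant coalescence rate with a time-dependent one would be the natural way to explore demographic scenarios such as a bottleneck.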

The severe bottleneck was simulated as a sudden change of size from N chromosomes to N/100 that lasts for a time Tl = 0.1 (in N generations). Accordingly, the coalescent rates within the bottleneck are accelerated 100-fold (the inverse of the size ratio, 0.01) and the simulations were performed as in Simonsen et al. (1995). Sampling was performed after a time Tb has elapsed after the bottleneck. The mean and the standard deviation of equation M191 are given in Figure 3a for two times, Tb = 0.03 and Tb = 0.3. Figure 3 shows that most of the deviation comes from the sites with low frequency. Therefore, I designed a new test that captures the deviations within low frequencies. In this test, I used a first vector of ω1i = e^(−αi), with α = 0.9, and a second uniform vector ω2i = 1. This results in an exponentially decreasing weight for low-frequency mutations (Figure 3) that is positive for frequency i/n ≤ 0.13. The choice of α = 0.9 was mostly empirical, although using α = 0.8 or α = 1 leads to similar results (data not shown). As stressed in the discussion, this study aims at illustrating how easy it is to create new tests with enhanced power; power optimization deserves an entire new study. A graphical view of the equation M192-vector associated with this new TΩ test is given in Figure 3 and its variance is reported in Table 2. Most of the weight of this test is given to low frequencies and its variance is comparable to those of other neutrality tests. The power of this new test and of D, F, and H is reported in Figure 3. Results show that the new test outperforms the previous tests by 20% and is able to detect the deviation for a longer time.
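The weight vector of this test is easy to reproduce. The sketch below builds the difference between the normalized exponential vector (α = 0.9) and the normalized uniform vector for n = 30 and lists the frequency classes that receive a positive weight. The variable names are arbitrary, and the vector is assumed to be the difference of the two vectors after each is normalized to sum to 1.

```python
from math import exp

n, alpha = 30, 0.9

w1 = [exp(-alpha * i) for i in range(1, n)]  # exponentially decreasing weights
w2 = [1.0] * (n - 1)                         # uniform weights
sum1, sum2 = sum(w1), sum(w2)

# Difference of the two normalized vectors: positive where low frequencies are up-weighted.
big_omega = [x / sum1 - y / sum2 for x, y in zip(w1, w2)]

positive_classes = [i for i, value in enumerate(big_omega, start=1) if value > 0]
max_positive_frequency = max(positive_classes) / n
```

With these settings the positive weights fall on i = 1 to 4, that is, on frequencies up to 4/30 ≈ 0.13, in agreement with the threshold quoted in the text.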

Figure 3.
Example of a severe bottleneck. (a) The mean and the standard deviation of the equation M193 spectrum that is observed in simulations (n = 30, 10^4 replicates) of a standard model or of a recent severe bottleneck (reduction of f = 1/100 for a time ...

The 95% confidence intervals were built using coalescent simulations under the standard model, using a fixed number of segregating sites (Hudson 1993; Depaulis and Veuille 1998). Although there has been much debate on how confidence intervals should be set (Depaulis et al. 2001; Markovtsova et al. 2001; Wall and Hudson 2001), it has been clearly shown that the choice of a particular method does not alter the results in standard models (Ramos-Onsins et al. 2007) and therefore is not discussed here.

In the second scenario, I compared the power of neutrality tests in detecting a case of isolation with migration (e.g., Nielsen and Wakeley 2001). In the simulations, the isolation event happened at time Ti = 3 and both populations were sampled equally (na = nb = 15). The migration rate between the two populations is variable. Similar to the analysis of the bottleneck, I first report the mean and the standard deviation of the equation M196 spectrum. Figure 4 shows that most of the deviation comes from the sites at frequency 15/30. Additionally, for a small enough migration rate (M = 0.1), there are almost no polymorphisms with frequency >0.5. Although the standard deviations are large, the coefficients of variation (variance/mean) are relatively small. To design a new test, I used for the first equation M197-vector the probabilities given by a binomial law, equation M198 with p = 0.5 and n = 30, and a uniform vector ω2i = 1 as a second vector.

Figure 4.
Isolation with migration. (a) The mean and the standard deviation of the equation M199 spectrum that is observed in simulations (n = 30, 10^4 replicates) of a standard model or of an isolation with migration model (two populations equally sampled, na = ...

This was motivated by the idea of designing a test that specifically captures an excess of medium-frequency polymorphisms. A graphical view of the resulting equation M203-vector is given in Figure 4 and its variance is given in Table 2. Almost all the weight of this test is given to the 13 < i < 17 sites. The variance of this new test is large, and this is related to the large variance of equation M204 in the sample with even n. Despite this large variance, the test clearly outperforms all previous tests (Figure 4).

Overcoming the ascertainment bias:

As an example of the power of designing new neutrality tests, I analyzed SNP data (from HapMap) around the Lactase gene (LCT), which has been shown to exhibit a footprint of a recent strong selective sweep in European populations (Bersaglieri et al. 2004) as well as in eastern African populations (Tishkoff et al. 2007). This pattern of recent selection is one of the strongest in the human genome (Nielsen et al. 2005). Indeed, it has been advanced that the lactase-persistence phenotype (the ability to digest milk as an adult) has been advantageous in European populations of farmers (especially in Northern European ones). The SNPs that are tightly associated with the selective sweep in Europeans are located at 13–22 kb upstream of the gene start (Bersaglieri et al. 2004). From HapMap (release 27, February 2009) I gathered all SNPs in a window of 100 kb centered at the start of the lactase gene. This includes 50 kb upstream and the entire gene. I considered only SNPs whose sample size was at least 85 chromosomes. Because the sample size of all SNPs was not identical, I used the observed frequencies to generate a folded frequency spectrum of 85 chromosomes for the following populations: Utah residents with northern and western European ancestry from the CEPH collection (CEU); Han Chinese in Beijing, China (CHB); Japanese in Tokyo, Japan (JPT); and Yoruba in Ibadan, Nigeria (YRI).

According to the literature, one expects to find a trace of an ongoing selective event in the CEU population only. Without the help of an outgroup, this would translate into an excess of low-frequency polymorphism in the folded frequency spectrum (typically, negative D, F*, and equation M205). Computation of the standard neutrality tests shows a deficit of low-frequency polymorphism rather than an excess. This deficit is often even significant (Table 3). This is clearly caused by the ascertainment bias in the data set. Because the polymorphisms were first screened in a small group and further genotyped in larger groups, rare variants are underrepresented (e.g., Kuhner et al. 2000; Clark et al. 2005). This ascertainment bias has been the subject of various corrections (e.g., Wakeley et al. 2001; Nielsen et al. 2004). To avoid any correction, I computed a equation M206 test where the weights of both equation M207 and equation M208 vectors were set to 0 for i < 8. The remaining two vectors were computed using equation M209 and equation M210. As a consequence, this test is D-like in that it considers only polymorphisms with frequencies in the range [0.09, 0.91]. This is reminiscent of ignoring singletons in data sets where sequencing errors are suspected (Achaz 2008). Results (Table 3) show that this test significantly deviates from the standard model for the CEU population. Ignoring fewer polymorphisms (e.g., only the 5% that are of low frequency) or changing the minimum sample size leads to similar results (data not shown).
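The truncation described above can be sketched as follows. Because the two weight vectors of the test are not recoverable from this text, the sketch assumes a π-like vector (n − i) and an S-like vector (1/i), which is one way to obtain a D-like test; this choice, the symmetric cutoff on the unfolded scale, and all names are assumptions made for illustration only.

```python
def truncate(weights, n, cutoff):
    """Zero the weights of classes with fewer than `cutoff` copies of either allele."""
    return [w if cutoff <= i <= n - cutoff else 0.0
            for i, w in enumerate(weights, start=1)]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


n, cutoff = 85, 8

w_pi = truncate([float(n - i) for i in range(1, n)], n, cutoff)  # assumed pi-like vector
w_s = truncate([1.0 / i for i in range(1, n)], n, cutoff)        # assumed S-like vector

big_omega = [p - q for p, q in zip(normalize(w_pi), normalize(w_s))]

kept = [i for i in range(1, n) if cutoff <= i <= n - cutoff]
frequency_range = (min(kept) / n, max(kept) / n)

# The vector can be used on a folded spectrum if i * Omega_i is symmetric around n/2.
symmetric = all(abs(i * big_omega[i - 1] - (n - i) * big_omega[n - i - 1]) < 1e-12
                for i in range(1, n))
```

With n = 85 and a cutoff of 8, the retained classes span frequencies from 8/85 to 77/85, which rounds to the range [0.09, 0.91] quoted in the text.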

TABLE 3
Neutrality tests in the lactase region

DISCUSSION

Here I developed a unifying framework for θ-estimators on the basis of the frequency spectrum. Namely, all known estimators of θ are linear combinations of equation M212 (or equation M213). Because neutrality tests based on the frequency spectrum are simple functions of these θ-estimators, the framework can be used to derive them. All tests (of this family) proposed so far are embedded in the framework. Using the model, I have shown that estimators of θ based on a folded spectrum always have an unfolded equivalent. The reciprocal, however, is not true.

Besides its unifying appeal, the model developed here can be used in several ways. First, I showed how it can be used to compute the variance of all estimators of θ and consequently of statistics such as equation M214. All variances of all estimators can be computed either using this framework or from their previous derived analytical formula. The same should be true for all t. Importantly, the computation of the variances of f, f*, y, and y* revealed differences between both methods. Simulations demonstrate that the previous formulas were not correct while the new ones are. Besides a minor error in the f and f* variance (corrected in Simonsen et al. 1995), it appears that the Cov[π, ξ1] that was derived by Fu and Li (1993b) is inexact. Therefore the variances of f and f* (Fu and Li 1993b) as well as the variances of y and y* (Achaz 2008) that were using this covariance carried along the error. Framed within the model presented here, all variances are correct. Finally, it can be used to compute the variance of h that was not given by the authors (Fay and Wu 2000).

One potentially interesting development is to find an ω-vector that minimizes the variance of the associated estimator of θ. This problem was previously addressed thoroughly (Felsenstein 1992a,b; Fu and Li 1993a; Fu 1994a,b). Indeed, it was shown that phylogenetic estimates have lower variance than estimators based on summary statistics (Felsenstein 1992b; Fu and Li 1993a; Fu 1994b). Moreover, Fu (1994a,b) proposed a general method to find weight vectors that minimize the variance of the estimators and showed that the best vector actually depends on the value of θ itself. Nonetheless, it remains true that some estimators have less variance than others (i.e., equation M215 vs. equation M216), whatever the value of θ. This latter observation suggests that re-exploring this question of minimizing the variance may be of interest.

Nawa and Tajima (2008) recently proposed to use the equation M217 spectrum instead of the classical frequency spectrum as a visual test for neutrality. This can be extended to the unfolded equation M218 spectrum if an outgroup is available. The study presented here fully supports this idea. The visual inspection of the equation M219 spectrum indicates why some tests will reject neutrality. Contrary to what intuition may suggest, when one is interested in θ-estimation, the appropriate representation for weight vectors is the equation M220-vector as defined above rather than weights on the ξi themselves (or on the ηi as in Tajima 1997).

When an outgroup is used to unfold the spectrum, the choice of the appropriate outgroup is of critical importance. If the outgroup is not adequate (too distant or too close), misoriented sites will have a disastrous effect on θ-estimations and therefore on related neutrality tests (Baudry and Depaulis 2003). This adds to the difficulty of using tests based on the full ξ-spectrum. However, when low and high frequencies can be sorted apart, much power is gained in terms of choosing the adequate evolutionary scenario. For example, no high frequencies are overrepresented under recent growth or severe bottlenecks.

Specific problems that concern only some area of the spectrum can be handled easily by setting to 0 all weights in the suspicious area. For example, the sequencing errors can be avoided when the singletons are ignored (Achaz 2008). With the current framework, by ignoring the low-frequency polymorphisms, the ascertainment bias can be overcome and the pattern expected from selection at the lactase gene appears. This strategy has endless extensions as long as we have some prior knowledge of the suspicious area.

Finally, I think that this framework opens the door for new estimations of θ and the related neutrality tests. Using simple examples, I show how the power of neutrality tests can easily be improved to detect deviations from the standard model. To optimize the power of the future new tests, one could (1) minimize their variance under the standard model, (2) select their area of sensitivity on the basis of prior knowledge of the impact of specific deviations, and (3) use recombination estimates to compute smaller confidence intervals (Wall 1999) (because recombination results in quasi-independent replicates that lower the variance of the θ-estimators). By building specific tests that will be sensitive to specific deviations, one could envision how several selected tests will be able to help the population geneticist to choose between different possible scenarios for a given data set. Another interesting alternative would be to use the different θ-estimators as summary statistics to infer the best parameters for a given evolutionary scenario (e.g., using ABC analysis).

The source code for this study was designed as a C++ library for the simulations and a C library for sequence analysis and is available upon request. A dedicated web version of the tests is available at http://wwwabi.snv.jussieu.fr/achaz/neutralitytest.html. Furthermore, the tests will be incorporated in a future release of DNAsp.

Acknowledgments

I thank F. Tajima, E. P. C. Rocha, J. Wakeley, P. Nicolas, and D. Higuet for their interesting comments on the manuscript and T. Treangen for English language improvement. I also thank two anonymous reviewers for their constructive comments. This work was supported by grant 07-GMGE-004-04 from the Agence Nationale de la Recherche.

Notes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.104042/DC1.

References

  • Achaz, G., 2008. Testing for neutrality in samples with sequencing errors. Genetics 179: 1409–1424. [PMC free article] [PubMed]
  • Baudry, E., and F. Depaulis, 2003. Effect of misoriented sites on neutrality tests with outgroup. Genetics 165: 1619–1622. [PMC free article] [PubMed]
  • Bersaglieri, T., P. C. Sabeti, N. Patterson, T. Vanderploeg, S. F. Schaffner et al., 2004. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74: 1111–1120. [PMC free article] [PubMed]
  • Clark, A. G., M. J. Hubisz, C. D. Bustamante, S. H. Williamson and R. Nielsen, 2005. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 15: 1496–1502. [PMC free article] [PubMed]
  • Depaulis, F., and M. Veuille, 1998. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15: 1788–1790. [PubMed]
  • Depaulis, F., S. Mousset and M. Veuille, 2001. Haplotype tests using coalescent simulations conditional on the number of segregating sites. Mol. Biol. Evol. 18: 1136–1138. [PubMed]
  • Fay, J. C., and C. I. Wu, 2000. Hitchhiking under positive Darwinian selection. Genetics 155: 1405–1413. [PMC free article] [PubMed]
  • Felsenstein, J., 1992a. Estimating effective population size from samples of sequences: a bootstrap Monte Carlo integration method. Genet. Res. 60: 209–220. [PubMed]
  • Felsenstein, J., 1992b. Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genet. Res. 59: 139–147. [PubMed]
  • Fu, Y. X., 1994a. Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics 138: 1375–1386. [PMC free article] [PubMed]
  • Fu, Y. X., 1994b. A phylogenetic estimator of effective population size or mutation rate. Genetics 136: 685–692. [PMC free article] [PubMed]
  • Fu, Y. X., 1995. Statistical properties of segregating sites. Theor. Popul. Biol. 48: 172–197. [PubMed]
  • Fu, Y. X., 1996. New statistical tests of neutrality for DNA samples from a population. Genetics 143: 557–570. [PMC free article] [PubMed]
  • Fu, Y. X., 1997. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147: 915–925. [PMC free article] [PubMed]
  • Fu, Y. X., and W. H. Li, 1993a. Maximum likelihood estimation of population parameters. Genetics 134: 1261–1270. [PMC free article] [PubMed]
  • Fu, Y. X., and W. H. Li, 1993b. Statistical tests of neutrality of mutations. Genetics 133: 693–709. [PMC free article] [PubMed]
  • Hudson, R. R., 1993. The how and why of generating gene genealogies, pp. 23–36 in Mechanism of Molecular Evolution. Sinauer Associates, Sunderland, MA.
  • Kuhner, M. K., P. Beerli, J. Yamato and J. Felsenstein, 2000. Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics 156: 439–447. [PMC free article] [PubMed]
  • Markovtsova, L., P. Marjoram and S. Tavaré, 2001. On a test of Depaulis and Veuille. Mol. Biol. Evol. 18: 1132–1133. [PubMed]
  • Nawa, N., and F. Tajima, 2008. Simple method for analyzing the pattern of DNA polymorphism and its application to SNP data of human. Genes Genet. Syst. 83: 353–360. [PubMed]
  • Nielsen, R., and J. Wakeley, 2001. Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158: 885–896. [PMC free article] [PubMed]
  • Nielsen, R., M. J. Hubisz and A. G. Clark, 2004. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics 168: 2373–2382. [PMC free article] [PubMed]
  • Nielsen, R., S. Williamson, Y. Kim, M. J. Hubisz, A. G. Clark et al., 2005. Genomic scans for selective sweeps using SNP data. Genome Res. 15: 1566–1575. [PMC free article] [PubMed]
  • Ramos-Onsins, S. E., and J. Rozas, 2002. Statistical properties of new neutrality tests against population growth. Mol. Biol. Evol. 19: 2092–2100. [PubMed]
  • Ramos-Onsins, S. E., S. Mousset, T. Mitchell-Olds and W. Stephan, 2007. Population genetic inference using a fixed number of segregating sites: a reassessment. Genet. Res. 89: 231–244. [PubMed]
  • Simonsen, K. L., G. A. Churchill and C. F. Aquadro, 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141: 413–429. [PMC free article] [PubMed]
  • Tajima, F., 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460. [PMC free article] [PubMed]
  • Tajima, F., 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595. [PMC free article] [PubMed]
  • Tajima, F., 1997. Estimation of the amount of DNA polymorphism and statistical tests of the neutral mutation hypothesis based on DNA polymorphism, pp. 149–164 in Progress in Population Genetics and Human Evolution. Springer-Verlag, Berlin/Heidelberg, Germany/New York.
  • Tishkoff, S. A., F. A. Reed, A. Ranciaro, B. F. Voight, C. C. Babbitt et al., 2007. Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39: 31–40. [PMC free article] [PubMed]
  • Wakeley, J., 2009. Coalescent Theory, an Introduction. Roberts and Company. Greenwood Village, Colorado.
  • Wakeley, J., R. Nielsen, S. N. Liu-Cordero and K. Ardlie, 2001. The discovery of single-nucleotide polymorphisms–and inferences about human demographic history. Am. J. Hum. Genet. 69: 1332–1347. [PMC free article] [PubMed]
  • Wall, J. D., 1999. Recombination and the power of statistical tests of neutrality. Genet. Res. 74: 65–79.
  • Wall, J. D., and R. R. Hudson, 2001. Coalescent simulations and statistical tests of neutrality. Mol. Biol. Evol. 18: 1134–1135. [PubMed]
  • Watterson, G. A., 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276. [PubMed]

Articles from Genetics are provided here courtesy of Genetics Society of America