![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 Knudsen and Miyamoto; licensee BioMed Central Ltd. Accurate and fast methods to estimate the population mutation rate from error prone sequences 1CLC bio, 8200 Århus N, Denmark 2Department of Biology, Box 118525, University of Florida, Gainesville, Florida 32611-8525, USA Corresponding author.Bjarne Knudsen: bknudsen/at/clcbio.com; Michael M Miyamoto: miyamoto/at/ufl.edu Received February 25, 2009; Accepted August 11, 2009. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background The population mutation rate (θ) remains one of the most fundamental parameters in genetics, ecology, and evolutionary biology. However, its accurate estimation can be seriously compromised when working with error prone data such as expressed sequence tags, low coverage draft sequences, and other such unfinished products. This study is premised on the simple idea that a random sequence error due to a chance accident during data collection or recording will be distributed within a population dataset as a singleton (i.e., as a polymorphic site where one sampled sequence exhibits a unique base relative to the common nucleotide of the others). Thus, one can avoid these random errors by ignoring the singletons within a dataset. Results This strategy is implemented under an infinite sites model that focuses on only the internal branches of the sample genealogy where a shared polymorphism can arise (i.e., a variable site where each alternative base is represented by at least two sequences). This approach is first used to derive independently the same new Watterson and Tajima estimators of θ, as recently reported by Achaz [1] for error prone sequences. It is then used to modify the recent, full, maximum-likelihood model of Knudsen and Miyamoto [2], which incorporates various factors for experimental error and design with those for coalescence and mutation. These new methods are all accurate and fast according to evolutionary simulations and analyses of a real complex population dataset for the California seahare. Conclusion In light of these results, we recommend the use of these three new methods for the determination of θ from error prone sequences. In particular, we advocate the new maximum likelihood model as a starting point for the further development of more complex coalescent/mutation models that also account for experimental error and design. Background The population mutation rate (θ) remains one of the most fundamental parameters in genetics, ecology, and evolutionary biology [3-5]. This interest in θ derives from the fact that this parameter measures the effective size (Ne) and whole-locus mutation rate (μ) of a population, which are of great importance in understanding its demography and history. Specifically, θ is a compound parameter that is calculated as the product of 2pNeμ (with p = 1 or 2 for haploids and diploids, respectively). Correspondingly, a number of alternative methods are available to estimate θ from a population sample of allelic sequences [6-8]. These alternative methods range from relatively simple summary statistics (moment methods) to full coalescent/mutation models. Indeed, the estimation of θ remains central to even the most complex coalescent/mutation models that are otherwise concerned with the determination of other population genetic parameters (e.g., for growth, migration, and recombination). A population sample of sequences is obtained from interbreeding or potentially interbreeding individuals and is therefore usually associated with a small number of mutations [9-12]. Thus, when estimating θ from a population sample, sequence errors can pose a real problem, since their numbers can begin to approach or even surpass those for the mutations [13-19]. This problem becomes particularly acute when working with error prone data such as expressed sequence tags (EST), low coverage draft sequences, and other such unfinished products [20,21]. For example, an error rate of one mistake per every 500 nucleotides (e.g., as for an EST dataset obtained from single sequencing passes) will make a significant contribution to the observed variation among sequences that differ because of mutations by <1 to 2%. If uncorrected, such errors can lead to an inflated estimate of θ and even erroneous conclusions about the biology of their population [1,2,18,22]. Many sequence errors arise as random accidents that occur during the nucleic acid isolation, cloning/amplification, sequencing, and recording phases of a DNA sequencing study [11,23,24]. As chance events that are rare (even for error prone data), each of these random mistakes will most likely be limited to a single sequence, rather than repeated among two or more different ones within the population sample [1,15]. Thus, these random errors will most likely inflate the number of singletons within the dataset (i.e., polymorphic positions where one sampled sequence exhibits a unique base relative to the shared nucleotide of the others) (Figure (Figure1).1
This study relies on the simple premise that the random mistakes of error prone sequences can be avoided by ignoring the singletons within their population sample. This strategy is first used to obtain independently the same new Watterson [25] and Tajima [26] estimators of θ, which were recently reported by Achaz [1] for error prone sequences. This approach is then implemented in the recent maximum likelihood (ML) model of Knudsen and Miyamoto [2], which incorporates various factors of experimental error and design with those for coalescence and mutation. By relying on only shared polymorphisms, these three new approaches allow for more accurate estimates of θ. However, this greater accuracy comes with a cost as singletons due to actual mutations are ignored along with those due to random errors. To assess this tradeoff, the three new methods are tested against each other and their original predecessors that count singletons with evolutionary simulations and/or analyses of a real population dataset for the California seahare (Aplysia californica). These tests document that these new approaches offer reliable and fast alternatives for the determination of θ from error prone sequences. Results and discussion Three new methods for estimating θ from error prone sequences Infinite sites model As in their original versions, the new Watterson, Tajima, and Knudsen/Miyamoto methods rely on the infinite sites model to accommodate a neutral mutation process [27,28]. The infinite sites model assumes that only a single mutation can occur at any homologous position of the population sample. Thus, each variable site will be represented by only two bases that subdivide the sampled sequences into two non-overlapping subsets consisting of those with the first nucleotide versus the remainder with the second base. Correspondingly, each mutation will map to a specific branch within the sample genealogy, which thereby partitions the sequences into their two non-overlapping subsets. For example, the shared polymorphism at position 3 in Figure Figure11 As random errors are treated as rare chance events, they are also modeled in this study along with the mutations by the infinite sites process. Thus, only a single random error or mutation is allowed at any site of the sampled sequences. In turn, each random error is limited to a single sequence (and therefore to a particular singleton) in contrast to a mutation that can also result in a shared polymorphism (Figure (Figure1).1 New Watterson estimator (θ'W) Define Ti as the length of time (as scaled by Ne generations) during which there are exactly i ancestors for n sampled sequences. Standard coalescent theory tells us that:
and
[4,29,30]. The expected total branch length of the genealogy for n sequences (as measured in units of scaled coalescent time) can now be calculated as:
Let ns be the observed number of segregating (polymorphic) sites in the dataset. Under the infinite sites model, ns also counts the number of mutations, since each observed variable site is attributable to a single mutation. The expected number of mutations per locus per unit of branch length is θ/2. Thus, an estimate of θ can be obtained as:
[25]. A new Watterson estimator that avoids singletons (and thereby random sequence errors, θ'W) can now be derived from Equation (4) by adjusting both its numerator and denominator. The numerator is easily adjusted by counting only the shared polymorphic sites in the original dataset (n's). Although more complicated, the denominator can also be readily adjusted by including in its calculation only the lengths of the internal branches where shared polymorphisms can arise (see below). Let us again consider a point in the genealogy where there are exactly i ancestors for the n sampled sequences. Looking forward in time, the probability that a particular branch of the genealogy is not chosen for the next split (leading to i + 1 lineages) is [(i - 1)/i]. Thus, the probability that this branch remains unbroken to the present is:
By combining Equations (3) and (5), the total length of the external branches where singletons can occur can now be calculated as:
Our Equation (6) is equivalent to equation (10) of Fu and Li [31], apart from our use of different symbols and terms and of time as scaled by Ne generations (rather than generations alone). Thus, as previously noted by them, the total length of the external branches where a singleton can occur is independent of the original number of sampled sequences. Fu and Li [31] also obtained the variance for the total length of the external branches as their equation (14). In an asymmetrical genealogy with a basal split of one versus (n - 1) sampled sequences, a mutation in the internal branch leading to the common ancestor of the (n - 1) group will result in a singleton within the dataset (i.e., a variable site where the single nonmember sequence exhibits the unique base) (Figure (Figure2).2
By combining Equations (4), (6), and (7), θ'W can now be obtained as:
In words, θ'W ignores the singletons (and thereby random errors) in the original dataset, while restricting the total branch length calculation to only those internal internodes where a shared polymorphism can arise. New Tajima estimator (θ'T) Let πij denote the number of observed pairwise differences between sampled sequences i and j (for i ≠ j). Standard coalescent theory tells us that the expected waiting time for these two sequences to coalesce is ~ Ne generations. Thus, the expected value of each πij is θ, whereas that for their sum is:
Rearranging Equation (9) leads to:
[26]. The Tajima and Watterson estimators have the same expected value of θ, even though the former is based on the total differences between each sampled sequence pair whereas the latter is obtained from the total number of segregating sites within the sample. Their different summaries of the observed variation are the basis of Tajima's D that tests for departures from the standard neutral model [32] (see also [31,33,34]). To avoid random errors, Equation (10) can be adjusted in a manner similar to that used to modify Equation (4) of θW. First, let π'ij count the number of observed pairwise differences between sequences i and j after the removal of all singletons from their dataset. Then, the expected number of singletons can be calculated from Equations (6) and (7) as the total length of the external branches and (n - 1) basal branch (as weighted by its probability of occurrence) multiplied by θ/2 {i.e., [1 + (1/n - 1)] θ}. The removal of each singleton from the original dataset reduces the expected sum in Equation (9) by (n - 1). Combining these results, we obtain:
Rearranging Equation (11) then leads to:
Our Equation (12) for θ'T is equivalent to equation (15) of Achaz [1], except for our use of different symbols and total π'ij (rather than average π'ij) for the population sample. Achaz [1] also derived his equation B23 for the calculation of the associated variance for average π'ij [i.e., his Var[π-η1]). New Knudsen/Miyamoto model (θ'KM) Knudsen and Miyamoto ([2], hereafter referred to as "KM") developed a full coalescent/mutation model that accounts for three specific factors of experimental error and design: (a) For random sequence errors, (b) For unobserved polymorphisms due to missing data, and (c) For the uncertain assignment of the multiple sequencing reads for a diploid or polyploid individual to its two or more homologues. Their KM model uses recursion to calculate an exact probability of the population sample under the standard Fisher [35] and Wright [36] model for reproduction (hereafter, referred to as "FW") and the infinite sites process for both mutations and random sequence errors [37,38]. Their model relies on ML to estimate θ and the expected number of errors per full length sequence (ε). At the heart of the KM model is equation (1) of Knudsen and Miyamoto [2], which is reproduced here as:
The parameters and factors of this equation are defined in Table 1. As one works backwards in time, the probability that the next observed event is a specific coalescence is given by the first term in the top line before the double sum. As indicated by this double sum, if indeed a coalescent event occurs, then it will happen between two sampled and/or ancestral alleles with compatible (if not identical) sequences. Such sequences are referred to as combinable (si ~ sj).
The first term in the bottom line before the single sum is then for the probability that the next observed event is instead a mutation. As indicated by this sum, if indeed a mutation occurs, then it will happen to a sampled or ancestral sequence with at least one "singleton" (σi > 0). Here, the definition of a "singleton" is expanded to include the derived mutations of the common ancestors for the different groups of related sampled sequences as well as those for the singletons (in the strict sense) of the original dataset (Table 1). This expanded use of the term acknowledges that a shared polymorphism under the infinite sites model is due to a unique mutation within the common ancestor of those sampled sequences sharing the derived base (Figure (Figure3).3
In the KM model, the available region(s) of each sampled sequence is summarized as a closed interval(s) that is scored over the range of (0:1). The |si| factor then quantifies the total amount of sequence available for this sampled allele. For example, the (0.2:1.0) score for the leftmost (first) sampled sequence in Figure Figure4a4a (0.0:1.0) = (0.0:1.0), with |si| = 1.0.
The purpose of Σi |si| in Equation (13) is to track the total available length of all sampled and/or ancestral sequences for the detection of mutations as one works backwards in time. The corresponding use of |si| in the bottom line of Equation (13) allows for the adjustment of σi due to unobserved polymorphisms resulting from missing data. The KM model uses only Equation (13) when working with error free sequences. In turn, this model also relies on equation (4) of Knudsen and Miyamoto [2] when dealing with error prone sequences. This additional equation of the KM model assumes that the random errors are uniformly distributed along the sampled sequences and that their total number is Poisson distributed with an intensity of λ = ε Σi |si|. The inclusion of Σi |si| in the calculation of λ allows for the adjustment of this error rate to account for incomplete sampled sequences. In contrast to its predecessor, the KM' model uses only Equation (13) when working with either error prone or error free sequences. As for the new Watterson and Tajima estimators, the KM' model operates by counting only those internal branches of the genealogy where a mutation will result in a shared polymorphism (Figure (Figure4b).4b ), as is the sequence of the common ancestor for the (n - 1) basal group of each asymmetrical genealogy (Figure (Figure2).2For sampled sequences without missing data, the previous explanation is complete as to how the KM' model corrects for the branches of the genealogy where a mutation will result in a singleton. However, for incomplete sampled sequences, the "above and below" test of the KM' model also becomes necessary to correct for the regions of each common ancestral sequence where a mutation in its internal branch will lead to a singleton, rather than to a shared polymorphism, because of this missing information (Figure (Figure4b).4b The above and below test checks whether comparable information for a region is known for at least two, descendant, sampled sequences below and at least two, more distantly related, sampled sequences above an internal branch in the genealogy. If not, then a mutation within this region of the ancestral sequence that corresponds to this internal branch will result in a false singleton within the original dataset. For example, a mutation within the first 20% of the leftmost common ancestor in Figure Figure44 Evolutionary simulations and A. californica dataset Evolutionary simulations To test the three new procedures, evolutionary simulations were conducted according to standard methods [4]. Two hundred datasets apiece were simulated for eight or 16 sequences of length 500 from a single population under the baseline conditions of the standard FW and infinite sites models with θ = 1, 2, 4, or 8. In addition to these baseline conditions, sequence errors were introduced as four or eight randomly placed changes among the eight and 16 sequences, respectively, for an expected ε of 0.5. Estimates of θ were then obtained for the 200 datasets of each tested combination with the FW model and the original and new Watterson estimators, Tajima estimators, and KM and KM' models. As expected, the FW model and original Watterson and Tajima estimators overestimate θ when the datasets contain random errors (Table 2). In these cases, the random errors are counted as mutations, thereby inflating their estimates of θ. Furthermore, these overestimations are greater for the Watterson estimator than for the Tajima estimator. The reason is that each singleton makes the same contribution as a shared polymorphism to the number of segregating sites (ns) in Equation (4) of θW, but a smaller one to the sum of the pairwise differences in Equation (10) of θT. Specifically, each singleton is limited in Equation (10) to the (n - 1) pairwise comparisons of the unique sequence to the (n - 1) other sequences. Thus, the Tajima estimator is less vulnerable than the Watterson estimator to random errors even though its vulnerability is still significant.
In contrast, the KM model underestimates θ when the sequences are errorless (Table 2). Furthermore, this tendency to underestimate θ is also evident (although not significant) when the sequences contain errors (simulations A2, A4, B2, and B4; see also [2] for additional cases). In these situations, mutations are sometimes counted as random errors, thereby deflating their estimates of θ. In cases where few to no random errors are expected (i.e., finished sequences), one can first perform a likelihood ratio test to determine if ε = 0 [39]. If this null hypothesis cannot be rejected, then the KM analysis should be restricted to only Equation (13). However, if ε > 0 according to this likelihood ratio test, then equation (4) of Knudsen and Miyamoto [2] is also needed and the user of the KM model must remain aware that her/his estimate of θ most likely includes some slight downward bias. Unlike their original versions, the new Watterson estimator, Tajima estimator, and KM' model all consistently recover the true θ in the simulations both with and without errors (Table 2). Furthermore, these three methods also avoid the tendency of the KM model to underestimate θ due to unnecessary or "greedy" parameters for random errors. In particular, the reliance of the KM' model on one less equation [(4)] and fewer parameters (e.g., λ and ε) makes it much simpler and less prone to over-parameterizations than its predecessor. The mean estimates of θ for the new Watterson estimator, Tajima estimator, and KM' model are also generally associated with greater standard deviations than those for their original versions (Table 2). The proximal reason for these increased standard deviations is that the elimination of singletons results in the loss of some actual mutations along with the random errors. However, this cost of the three new methods appears to be relatively small, given that their standard deviations are never twice as great as those for their original versions (with these discrepancies usually being much smaller). Indeed, the standard deviations for the three new methods are less than or equal to those of their counterparts in four cases (simulations A2, A4, and B4 for θ'W and A4 for θ'T in Table 2). Perhaps more surprising is that the standard deviations for the new Watterson and Tajima estimators are comparable to those for the KM' model, particularly when θ = 1 or 2. These similarities in their standard deviations are surprising given that the KM' model is a full ML model. As summary statistics, the original and new Watterson and Tajima estimators are all computationally fast, requiring much less than 1 CPU second on a 2.4 GHz Pentium 4 CPU to analyze each of the simulated datasets. More importantly, the KM' model is much faster than the FW and KM models as documented by its completed analyses of the more complex datasets (simulations C1 to C4 and D1 to D4 in Table 2). In contrast, the FW and KM models fail to complete their analyses of these more complex datasets due to time and memory constraints. The KM' model is much faster than the FW and KM models, because it relies on only shared polymorphic sites. Less variable positions results in faster coalescences and fewer choices as one works back through the coalescent/mutation recursion of Equation (13). Aplysia dataset The A. californica dataset was the original motivating force behind the development of the KM model [2]. Thus, this real dataset was also analyzed with the KM' model to test further its performance against that of its predecessor. This dataset consists of 18 sequencing reads for six diploid individuals from a laboratory population of the California seahare at the Laboratory for Marine Bioscience, University of Florida (LL Moroz and AB Kohn, unpublished data). Three cloned inserts were sequenced from each individual as a pair of single sequencing passes starting from both ends of an internal segment of 1731 base pairs for the protein-coding region of the nuclear FMRF gene. These pairs of passes overlap in the middle for nine sequences, but at most by only 58 bases. Thus, these 18 essentially single-pass sequences contain many random errors and some missing data and their assignments to the two homologues of each diploid remain uncertain (even though their individual sources are known). Correspondingly, the KM analysis of this dataset with 44 singletons and 10 shared polymorphisms required the full use of its factors for experimental error and design (i.e., for random errors, missing data, and uncertain homologue assignments). A full ML analysis of this complex dataset by the KM model proved too time and memory consuming and a two step procedure was therefore adopted instead whereby the number of errors was first ML estimated followed by the determination of θ for this ML value [2]. This heuristic approach resulted in an estimate of θKM = 6.32 for this sample, which still took more than 12 hours to complete on the same CPU as for the simulations (see above). In contrast, a full ML analysis of this dataset by the KM' model using the same factors for experimental design is much faster as it took less than 1 hour on this CPU to obtain a similar estimate of θ'KM = 7.17. Correspondingly, nucleotide diversity (π) for this sample is calculated as (7.17/1731 positions) = 0.0041 mutations per site. To summarize, the KM and KM' models both support similar estimates of θ for this real dataset, but the latter (as in the evolutionary simulations in Table 2) is much faster as confirmed by its completed, full, ML analysis. Conclusion Expressed sequence tags, low coverage draft sequences, and other such unfinished products are known to contain random errors due to chance accidents during their data collection and recording [1,11,21,23]. However, such sequences often constitute the only available allelic information for a population and methods to deal with their random mistakes are therefore needed to obtain accurate estimates of θ from these error prone data. This study is based on the simple premise that random sequence errors are distributed as singletons. Thus, one can avoid the random mistakes of error prone sequences by ignoring their singletons in favor of their shared polymorphisms (Figures (Figures22 This strategy is implemented in the new Watterson estimator, Tajima estimator, and KM' model. These new methods are all accurate and fast according to their evolutionary simulations and analysis of the real complex dataset for A. californica (Table 2). These methods come with the cost of increased standard deviations, but this price appears small or even negligible compared to their advantages of significantly improved accuracy and/or computational speed. Obviously, additional evolutionary simulations and applications to real datasets are now needed to evaluate more fully under what conditions the removal of singletons is warranted in light of this tradeoff. Nevertheless, the current successes with the new Watterson estimator, Tajima estimator, and KM' model support our recommendation that these three methods be given serious consideration when estimating θ from error prone sequences. Unlike the Watterson and Tajima estimators that represent summary statistics, the KM and KM' models both constitute full ML models that offer the framework for the further incorporation of other experimental and population genetic factors [2]. In particular, such further developments are encouraged for the KM' model in light of its current successes in the evolutionary simulations and A. californica analysis (Table 2). For example, biased errors within a sample due to the systematic misreading of specific bases during DNA sequencing, the postmortem biochemical degradation of ancient DNA, and/or other such sources of error can be accommodated by a finite sites process that allows for repeated mistakes as well as mutations at the same sequence positions [14-16,40]. Likewise, additional factors to account for other population genetic processes such as recombination (which is most likely the most important parameter overlooked in this study of the nuclear FMRF gene for A. californica) can be accommodated by the use of ancestral recombination graphs [41]. Inevitably, these more complex versions of the KM' model will require the use of sampling based procedures for their implementation (e.g., Markov chain Monte Carlo approximations), since the current use of direct ML evaluation will remain practical for only the smaller datasets and simpler models [14,42-44]. Authors' contributions BK derived the equations, wrote the computer program for the estimation of θ, and conducted the evolutionary simulations. MMM provided biological interpretations about the results. Both authors contributed to the design and main ideas of this study, wrote and edited the manuscript, and read and approved its final draft. Acknowledgements We thank Leonid L Moroz and Andrea B Kohn for the use of their unpublished A. californica dataset; William Farmerie and Michele R Tennant for their suggestions about this research; the Carlsberg Foundation for its fellowship to BK; and the University of Florida for its support. A copy of our computer program to estimate θ'W, θ'T, and θ'KM is available upon request from BK. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||||||||||||||
Genetics. 2008 Jul; 179(3):1409-24.
[Genetics. 2008]Genetics. 2007 Aug; 176(4):2335-42.
[Genetics. 2007]Theor Popul Biol. 1995 Oct; 48(2):172-97.
[Theor Popul Biol. 1995]Genetics. 1991 Oct; 129(2):513-23.
[Genetics. 1991]Genetics. 2008 Jan; 178(1):351-61.
[Genetics. 2008]Mol Biol Evol. 2005 Jul; 22(7):1561-8.
[Mol Biol Evol. 2005]Mol Biol Evol. 2008 Nov; 25(11):2409-19.
[Mol Biol Evol. 2008]Genetics. 2007 May; 176(1):711-4.
[Genetics. 2007]Mol Biol Evol. 2001 Nov; 18(11):2092-101.
[Mol Biol Evol. 2001]DNA Seq. 2004 Oct-Dec; 15(5-6):362-4.
[DNA Seq. 2004]Genetics. 2008 Jul; 179(3):1409-24.
[Genetics. 2008]Genome Res. 2006 Oct; 16(10):1320-7.
[Genome Res. 2006]Theor Popul Biol. 1975 Apr; 7(2):256-76.
[Theor Popul Biol. 1975]Genetics. 1983 Oct; 105(2):437-60.
[Genetics. 1983]Genetics. 2008 Jul; 179(3):1409-24.
[Genetics. 2008]Genetics. 2007 Aug; 176(4):2335-42.
[Genetics. 2007]Genetics. 1969 Apr; 61(4):893-903.
[Genetics. 1969]Annu Rev Genet. 1995; 29():401-21.
[Annu Rev Genet. 1995]Theor Popul Biol. 1975 Apr; 7(2):256-76.
[Theor Popul Biol. 1975]Genetics. 1993 Mar; 133(3):693-709.
[Genetics. 1993]Genetics. 2008 Jul; 179(3):1409-24.
[Genetics. 2008]Genetics. 1983 Oct; 105(2):437-60.
[Genetics. 1983]Genetics. 1989 Nov; 123(3):585-95.
[Genetics. 1989]Genetics. 1993 Mar; 133(3):693-709.
[Genetics. 1993]Genetics. 1995 Sep; 141(1):413-29.
[Genetics. 1995]Genetics. 1997 Oct; 147(2):915-25.
[Genetics. 1997]Genetics. 2008 Jul; 179(3):1409-24.
[Genetics. 2008]Genetics. 2007 Aug; 176(4):2335-42.
[Genetics. 2007]Genetics. 1931 Mar; 16(2):97-159.
[Genetics. 1931]J Math Biol. 1989; 27(6):667-80.
[J Math Biol. 1989]Math Biosci. 1995 May; 127(1):77-98.
[Math Biosci. 1995]Genetics. 2007 Aug; 176(4):2335-42.
[Genetics. 2007]Genetics. 2007 Aug; 176(4):2335-42.
[Genetics. 2007]Genetics. 2007 Aug; 176(4):2335-42.
[Genetics. 2007]Science. 1997 Apr 11; 276(5310):227-32.
[Science. 1997]Genetics. 2007 Aug; 176(4):2335-42.
[Genetics. 2007]Genetics. 2007 Aug; 176(4):2335-42.
[Genetics. 2007]Genetics. 2008 Jul; 179(3):1409-24.
[Genetics. 2008]Mol Biol Evol. 2001 Nov; 18(11):2092-101.
[Mol Biol Evol. 2001]Genetics. 2007 May; 176(1):711-4.
[Genetics. 2007]DNA Seq. 2004 Oct-Dec; 15(5-6):362-4.
[DNA Seq. 2004]Genetics. 2007 Aug; 176(4):2335-42.
[Genetics. 2007]Mol Biol Evol. 2007 Jun; 24(6):1416-22.
[Mol Biol Evol. 2007]Mol Biol Evol. 2008 Jan; 25(1):199-206.
[Mol Biol Evol. 2008]Genome Biol. 2007; 8(7):R143.
[Genome Biol. 2007]Genet Epidemiol. 2000; 19 Suppl 1():S15-21.
[Genet Epidemiol. 2000]