• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of plosgenPLoS GeneticsSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)View this Article
PLoS Genet. Oct 2009; 5(10): e1000674.
Published online Oct 2, 2009. doi:  10.1371/journal.pgen.1000674
PMCID: PMC2745702

Inference of Microbial Recombination Rates from Metagenomic Data

Bruce Walsh, Editor

Abstract

Metagenomic sequencing projects from environments dominated by a small number of species produce genome-wide population samples. We present a two-site composite likelihood estimator of the scaled recombination rate, ρ = 2Nec, that operates on metagenomic assemblies in which each sequenced fragment derives from a different individual. This new estimator properly accounts for sequencing error, as quantified by per-base quality scores, and missing data, as inferred from the placement of reads in a metagenomic assembly. We apply our estimator to data from a sludge metagenome project to demonstrate how this method will elucidate the rates of exchange of genetic material in natural microbial populations. Surprisingly, for a fixed amount of sequencing, this estimator has lower variance than similar methods that operate on more traditional population genetic samples of comparable size. In addition, we can infer variation in recombination rate across the genome because metagenomic projects sample genetic diversity genome-wide, not just at particular loci. The method itself makes no assumption specific to microbial populations, opening the door for application to any mixed population sample where the number of individuals sampled is much greater than the number of fragments sequenced.

Author Summary

At a broad scale, the exchange of genetic material through homologous recombination (i.e. what happens in animals during sex) increases the potential rate of adaptation. Bacteria often reproduce clonally, without recombination, by making exact copies of their genomes, but they also have mechanisms analogous to sex that allow them to recombine sporadically. Despite microbes' critical role at the base of our world's ecosystem, microbiologists know surprisingly little about how microbes grow and evolve outside the laboratory. Metagenomic sequencing projects provide a means to sample the genetic diversity of natural microbial populations and have the potential to reveal much about the ecology and evolution of these populations. Here we present a novel method to estimate the recombination rate from metagenomic data, while explicitly allowing for imperfections such as sequencing error and missing data.

Introduction

Microbial populations exchange homologous genetic material at different rates, dramatically affecting the evolutionary potential of the population. While basal mutation rates can be estimated via long-term within-laboratory evolution experiments [1], recombination rates are more difficult to infer because they require identification of multiple alleles at multiple loci in multiple individuals. Further, biogeographic barriers and interspecies interactions may lead to qualitatively different effects than growth in axenic laboratory culture, making determination of recombination rates in an organism's natural environment critical to accurate interpretation [2]. For the purpose of this study, we ignore the mechanism behind homologous recombination (i.e. transformation, transduction, or conjugation) and focus on its effect on genetic diversity.

Much research has investigated human recombination hotspots [3], yet almost nothing is known about variation in microbial recombination rates within a genome. In specific instances, however, studies have experimentally identified sequence motifs associated with recombination hotspots in some species of bacteria and yeast [4]. Mounting evidence suggests that regions known as CRISPR (Clusters of Regularly Interspaced Short Palindromic Repeats) form the basis of a bacterial immune system against phage in which chunks of the phage genome are inserted into the CRISPR region [5]. Thus a reasonable hypothesis would be that these regions or other regions with similar effect might recombine with greater frequency than the rest of the genome.

Inference of a genome-wide, fine-scale recombination map requires both extensive genome-wide sampling of the genetic diversity within the population of interest as well as an appropriate population genetic model, neither of which has been previously available for microbial populations. Microbial population surveys have primarily sequenced a small number of loci (“multi-locus sequence typing”) [6], which yield no information about variation in local recombination rate. Current methods tailored to microbial populations rely on low-power summary statistics [7],[8], heuristics instead of explicitly modeling the source of the recombining fragments [9], or parsimony based on manual inspection [10]. A few studies (e.g. [2],[11]) applied a more rigorous likelihood-based approach using a population genetic model ([12]; discussed more below), but these were still able to estimate only a genome-wide average rate of recombination.

Recently, large-scale metagenomic sequencing projects have begun to generate genome-wide population samples by sequencing random reads from a pool of DNA extracted from all microorganisms in a given environment. Projects that sample environments dominated by only a few microbial “species” are able to assemble near-complete genomes [13],[14], in which the constituent reads contain information about the genetic diversity in the population. Considering the large number of individuals in the sampled community relative to the number of reads sequenced, each read derives almost certainly from a different individual microorganism. With average read depths as high as ten [14], the resulting data hold rich potential for population genetic analysis [15],[16].

Given these data, we can make inferences about parameters such as mutation rate and recombination rate. In population genetic theory, the per-generation mutation rate, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e001.jpg, and per-generation recombination rate, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e002.jpg, almost always appear in conjunction with the effective population size, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e003.jpg, as the parameters An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e004.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e005.jpg. In our microbial context, we assume a single recombination event leads to the replacement of a short tract of sequence, creating two recombination breakpoints. A full likelihood method would yield maximal power by calculating the probability of observing the entire pattern of polymorphism across all samples, given the parameters An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e006.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e007.jpg. In practice, however, this approach is extremely computationally intensive [17], and even a recent breakthrough using a Markov chain Monte Carlo technique only extends full-likelihood to input data containing fewer than An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e008.jpg SNPs [18]. Instead, we follow the lead of previous researchers who sacrificed power for greater practicality by using a composite likelihood method [12],[19],[20] that approximates the true likelihood, as detailed in the Methods section.

However, metagenomic population samples differ from traditional population samples and, as a result, provide new challenges to estimating recombination. First, the sample size varies according to the read depth at a given location instead of being fixed across all loci. Second, the quality of each base call varies along each read, and the random nature of the metagenomic method prevents independent replication of the sampling and sequencing steps to confirm observed polymorphisms. Finally, linkage information is greatly reduced in that instead of the traditional approach of sampling the same individual at all loci, each fragment of DNA derives from a different individual. Depending on the sequencing technology and whether reads were sequenced in pairs, these data will reveal, at most, linkage within two reads of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e009.jpg nucleotides that are separated by a distance generally less than 40 kilobases.

As high-throughput sequencing becomes ever cheaper, the number of projects producing this sort of data will only increase. The Human Microbiome Project (http://www.hmpdacc.org/) plans to perform metagenomic sequencing of microbes found at five sites around the body. A particularly intriguing future application will be to sequence mixtures of pathogens sampled from within a single infected human. These data, combined with the methods presented here, will allow inferences about the interplay between the immune response and recombination within pathogens.

Methods

We start by deriving our two-locus composite likelihood estimator based on the idea of Hudson [19] and the estimator of McVean et al. [12] but now allowing for realistic amounts of missing data and sequencing error. Sequencing error probabilities are taken as given in the form of per-base quality scores. The resulting likelihood calculation becomes computationally infeasible on metagenomic-scale data, so we further describe several numerical approximations that make our implementation tractable. Finally we define a statistic to quantify the amount of missing data. This statistic will aid analysis and discussion of our estimator of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e010.jpg.

Composite-likelihood estimator

Our input data consist of a metagenomic assembly (i.e. alignment of reads to a scaffold), untrimmed FASTA sequences for the reads, quality scores for each base in each read and, if applicable, information about read pairs. We explicitly do not consider any uncertainty in either the assembly or in the quality scores for the practical reason that current assembly algorithms and base callers do not generate this information; however, in principle, our method could be extended to incorporate these sources of uncertainty. Given these data, we wish to estimate two population genetic parameters: An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e011.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e012.jpg.

Following [12], we assume that each site in the assembly has at most two different nucleotides and arbitrary label these as zero and one. In the rare event that more than two distinct nucleotides are observed, then we again arbitrarily label them zero and one after first grouping the nucleotides into two categories: the most common nucleotide and everything else. In the case of a tie for the most common nucleotide, we pick one at random. Given this labeling, we can represent the state of a read at a given position by 0, 1 or ?, where the question mark represents missing data. Analogously, we represent the state of a single chromosome at two positions simultaneously: 00, 01, 10, 11, 0?, 1?, ?0, ?1 (ignoring ??, since this conveys no information). An example is given in Figure 1 and described below. Note that, in a metagenomic context, “a single chromosome” means that both nucleotides are either on the same read or on two paired reads. We assume that the total number of sequenced reads is much less than the total number of cells in the sampled environment such that the probability of two independent (unpaired) reads deriving from the same original cell/chromosome is essentially zero.

Figure 1
Cartoon metagenomic assembly.

First we outline our notation more formally. The assembly, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e013.jpg, extends from position 1 to position An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e014.jpg and contains information about both the content of the reads and their position. The set of quality scores, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e015.jpg, contains one quality score for each base in each read in the assembly. We assume Phred-calibrated quality scores [21], so any particular quality score, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e016.jpg, can be converted into an error probability, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e017.jpg, by means of the formula An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e018.jpg. The configuration for a pair of sites, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e019.jpg (An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e020.jpg), is a vector of eight numbers corresponding to the number of chromosomes observed in each of the eight states (00, 01, etc.). For example, in Figure 1 the configuration of the leftmost pair of polymorphic sites is {An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e021.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e022.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e023.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e024.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e025.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e026.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e027.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e028.jpg}. In addition to the configuration at pair An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e029.jpg, we also have the set of quality scores, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e030.jpg (An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e031.jpg).

We wish to calculate the likelihood of the observed data, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e032.jpg, given the quality scores, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e033.jpg, and the population genetic parameters of interest, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e034.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e035.jpg. We approximate the true likelihood with the composite likelihood:

equation image
(1)

in which the two-locus configurations are treated as though they were independent among pairs of sites. We take the mutation rate (and thereby An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e037.jpg) to be constant and independent across all sites in the assembly, conditional on the genealogy. However, the recombination rate between two sites An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e038.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e039.jpg depends on their distance apart, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e040.jpg, as measured by the number of nucleotides separating them. We model recombination in microbial populations as occurring via gene conversion with recombination tract lengths drawn from an exponential distribution [12],[22],[23]:

equation image
(2)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e042.jpg is the average length of the recombination tract. Theoretically An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e043.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e044.jpg might be identifiable, but in practice our data are insufficient to separate them. Instead we fix An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e045.jpg and estimate An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e046.jpg, similar to the approach taken by McVean et al. [12]. Minor misspecification of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e047.jpg will simply rescale An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e048.jpg, although major misspecification of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e049.jpg will also change the right-hand side of (2).

Now we turn to the likelihood of a single two-locus configuration. We first account for sequencing error by summing over all possibilities for the truth, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e050.jpg:

equation image
(3)

where the sum iterates over all An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e052.jpg elements of the set of possible two-locus configurations, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e053.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e054.jpg is the average number of reads at each site. The first term inside the sum is the error probability, while the second term is the two-locus likelihood without any error. We assume that sequencing errors cause a switch from 0 to 1 and vice versa:

equation image
(4)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e056.jpg and the subscript An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e057.jpg indexes the same position in the same read in the quality scores, the observation, and the truth. In other words, all mismatches between the truth and observed must be the result of an error, while all matches between the truth and the observed cannot have been caused by an error.

Next we account for missing data by summing over all possibilities for the unknown nucleotides in the complete configuration, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e058.jpg:

equation image
(5)

where the sum iterates over all elements of the set of configurations compatible with the observed data, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e060.jpg (i.e. those that satisfy the constraints An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e061.jpg, etc.). The first term inside the sum accounts for missing data, while the second term is the pure two-locus likelihood. If we treat the configurations An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e062.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e063.jpg as a specific ordering of chromosomes, then this first term has a binary value of 1 for all configurations An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e064.jpg that match An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e065.jpg at non-missing positions and 0 otherwise. As a result of our definition for the set An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e066.jpg, all configurations An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e067.jpg will match An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e068.jpg at non-missing positions, so the first term is always 1. We describe calculation of the second term in the next section below.

We arrive at the final composite likelihood equation by taking (1) and substituting in (3), (4) and (5), which leaves us with four nested products and sums of significant size as discussed below.

Now we wish to find maximum likelihood estimates to our parameters. Joint maximization of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e069.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e070.jpg is computationally impractical. Instead, we perform a two-step estimation procedure in which we first estimate An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e071.jpg from single sites using a previously-developed method that correctly handles sequencing error [15] and then estimate An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e072.jpg from pairs of sites by numerically maximizing (1) while holding An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e073.jpg.

Two-locus complete likelihoods without error

We pre-calculate and store the two-locus likelihoods for all possible complete two-locus configurations without error (i.e. the second term in (5)) for a single sample size, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e074.jpg, across a range of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e075.jpg values and a single fixed An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e076.jpg value. We generate this table of likelihoods by running a slightly modified version of the complete program from the LDhat package [12], which assumes a finite sites Jukes-Cantor style biallelic mutation model and uses the neutral coalescent-with-recombination importance sampling method of Fearnhead and Donnelly [24]. The original complete program computed likelihoods only for configurations in which both sites were observed to be polymorphic; our modification enables the calculation of likelihoods for configurations with one polymorphic site and one fixed site. We deduce the final probability of both sites being fixed by subtracting all other probabilities from 1.

Given this table for a fixed sample size An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e077.jpg and fixed An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e078.jpg, we can exactly infer an analogous table for smaller sample sizes and approximately infer a table for different values of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e079.jpg.

A smaller sample size table can be directly generated for an arbitrary new sample size, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e080.jpg; however, in the interests of clarity, we will describe how to generate a table when An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e081.jpg, which can be iterated. Let the vector An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e082.jpg denote a configuration of sample size An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e083.jpg. Assuming probabilities for ordered configurations (as generated by complete by default), the probability of this new configuration is the sum of the probabilities of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e084.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e085.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e086.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e087.jpg.

Adjusting the table for a different An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e088.jpg poses a greater challenge. One option would be to run complete many times to generate tables for different values of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e089.jpg, but this would be extremely time-consuming. Our alternative solution takes advantage of the fact that, while An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e090.jpg strongly affects the relative probabilities among the three broad categories of (both-sites-polymorphic, one-site-fixed, both-sites-fixed), An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e091.jpg only mildly affects the relative probabilities of different configurations within these categories. The approximate probability of a site being polymorphic under the finite sites mutation model in a sample of size An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e092.jpg is An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e093.jpg (approximate in the sense that this ignores the slight possibility of a site being polymorphic but having back mutations erase all traces of that polymorphism). If two sites are independent An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e094.jpg, then the probabilities corresponding to these three categories of pairs are An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e095.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e096.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e097.jpg. Now we assume that the ratio between the probabilities of these categories is independent of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e098.jpg and approximate the probabilities of configurations under some new An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e099.jpg by multiplying by An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e100.jpg (if both sites are polymorphic) or An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e101.jpg (if one site is fixed). If both sites are fixed, then we again deduce the probability by subtracting all other probabilities from 1.

Given these tabulated (or calculated) values, we use linear interpolation to arrive at the final probability for a given An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e102.jpg. Linear interpolation as well as our numerical maximization algorithm require that the likelihood surface be reasonably smooth. The importance sampling algorithm leaves a small amount of error in its estimate of the likelihood, which can lead to small wiggles in the likelihood surface. We solve this problem by smoothing the tabulated values where necessary via cubic splines. Also, for configurations with a single fixed site, the importance sampling algorithm did not reduce the variance in the likelihood below the very low level of the slope across An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e103.jpg, leading to numerical difficulties performing maximization on a non-smooth likelihood surface. We avoid this problem by making the likelihoods for these configurations constant across An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e104.jpg by setting them equal to their average value.

Complexity and approximations

As alluded to earlier, a brute force implementation of the four nested loops in the composite likelihood function would take An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e105.jpg time where An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e106.jpg is the length of the assembly (or region of interest), An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e107.jpg is the read depth and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e108.jpg is the average number of missing nucleotides at each site, assuming a constant read depth. Real metagenomic data have variable read depth, which makes the situation even worse with the sequencing error component (An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e109.jpg) dominating the complexity at high-depth sites (i.e. where An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e110.jpg). Instead we make several approximations:

  1. Reduce amount of low quality data. We allow no more than five bases with quality below An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e111.jpg (1 in 100 chance of error) in any pair of sites. For an average read depth of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e112.jpg and a quality distribution from Sanger sequencing, this cutoff eliminates ~3% of our lowest-information-content data for a significant speed increase.
  2. Skip nearby pairs of sites. We consider only those pairs separated by at least 10 bases (in (1), change the second product to start at An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e113.jpg) and we only make pairs for every 5th site (in (1), change the first product to take values An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e114.jpg). Any given pair of adjacent sites is highly unlikely to have had a recombination breakpoint between them. If the sites are separated by a greater distance, the chance of a recombination breakpoint between them increases. Thus this approximation sacrifices a small amount of information to reduce the overall number of pairs of sites. Empirically, simulations suggest this approximation does not greatly increase the variance of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e115.jpg.
  3. Only use pairs of sites spanned by at least one chromosome (i.e. using the statistic defined in the next section, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e116.jpg). Pairs of sites not meeting this criteria tend to be far apart and contain relatively little information.
  4. When accounting for error, only consider plausible true configurations, instead of all possible configurations. For a given pair of sites, we first sort the quality scores in ascending order (An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e117.jpg). Then we iterate over truths in decreasing order of probability (for one error: An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e118.jpg, then An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e119.jpg, etc.; for two errors: An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e120.jpg, then An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e121.jpg, etc.) until the probability is less than An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e122.jpg times as likely as the most probable configuration.

Given these approximations, a standard desktop computer can perform this estimation for 10 kb of sequence, average depth of 10 and a realistic error distribution in less than one hour.

Ρs statistic

Before we discuss our results, we need to quantify the amount of missing data between a given pair of sites. Define An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e123.jpg to be the proportion of chromosomes that span a particular pair of sites: An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e124.jpg, where An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e125.jpg is the number of chromosomes spanning both sites (i.e. both sites are covered either by the same read or by paired reads) and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e126.jpg is the average number of chromosomes covering each site separately (An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e127.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e128.jpg, respectively).

The average value of this statistic together with the average sample size provide an indirect measure for the amount of information about recombination captured by pairs of sites within a given dataset.

Sludge data

We applied our technique to the first 500 kb of the assembly of Candidatus Accumulibacter phosphatis from a recent metagenomic sequencing project of activated sludge from a wastewater treatment plant [13]. The sludge we analyzed came from a laboratory bioreactor in Madison, Wisconsin that had been seeded from a local wastewater treatment plant. We received the data (P. Hugenholtz, personal communication) in the form of a finished assembly consisting of ACE and PhD files covering a An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e129.jpg megabase scaffold of average depth An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e130.jpg. Equivalent data in a different form are also available directly from the Joint Genome Institute via the IMG/M system [25] and the NCBI Trace Archive (genome project id 17657).

Results

We first investigate the information content of a single pair of sites as a function of the amount of missing data. This information sets an upper bound on our estimator's performance since we use the composite likelihood instead of the true likelihood. In particular, the Fisher information, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e131.jpg, for a single pair of sites with depth An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e132.jpg decreases with An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e133.jpg, although the information only falls off dramatically for An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e134.jpg (Figure 2). We find these results encouraging since the average An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e135.jpg of pairs in the actual sludge metagenome falls just above this threshold at 0.21. Note that the Fisher information holds little meaning on an absolute scale since we calculate the information for a single pair of sites rather than for our actual data with many dependent pairs. Instead, the values in Figure 2 should be interpreted on a relative scale. For instance, for An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e136.jpg, approximately ten independent pairs with An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e137.jpg would contain the same information about An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e138.jpg as a single pair with An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e139.jpg.

Figure 2
Information about An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e140.jpg as a function of missing data.

The bulk of our analyses rely on simulated data where we know the truth and can evaluate the performance of our estimator. We use the program ms [26] in combination with seq-gen [27] to generate sequences across a 10 kb region under a finite-sites model of mutation (An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e144.jpg unless specified otherwise) and the coalescent with recombination. We simulate recombination as gene conversion with mean tract length fixed at An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e145.jpg (see equation 2). The sample size (i.e. number of simulated chromosomes) is An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e146.jpg where An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e147.jpg is the average read depth and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e148.jpg is the length of each read in a read-pair. We transform these sequences into metagenomic-style data by randomly distributing read starts uniformly across the simulated region and trimming each simulated sequence to only be present for the length of three segments: one read, the gap between read pairs, and one read. Our simulation assumes no variation in read length or distance between read pairs. Note that a gap of zero produces the same effect as unpaired reads with double the read length. For results with sequencing error, we assign quality scores from the true Sanger sequencing quality score distribution as determined from the sludge data. A “sequencing error” causes a switch from the true nucleotide to each of the other three with probability 1/3. Given that we are simulating relatively small datasets with low information content, we occasionally generate an assembly with a maximum likelihood at An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e149.jpg. We exclude these values from all further analyses, but, for each parameter set, we report the proportion of replicates that yielded infinite parameter estimates either in Table 1 or in the text below.

Table 1
Proportion of simulation replicates with An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e150.jpg for each parameter set.

We analyzed the performance of our estimator in the presence of sequencing error across a range of plausible values of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e159.jpg (0.002 to 0.04), read lengths roughly corresponding to current Illumina, 454 and Sanger sequencing technologies (75, 500, 1000) and gaps between read-pairs (0, 100, 500) by calculating the root mean squared error (RMSE) relative to the true value (An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e160.jpg; Figure 3). Note that while RMSE conveniently summarizes our estimator's sampling distribution, it obscures the inherent asymmetry of the distribution caused by the constraint An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e161.jpg. A clear trend emerges with lower relative RMSE accompanying increased recombination. The estimator has little bias (results not shown) and, for An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e162.jpg, we are able to reliably estimate within a factor of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e163.jpg of the true value. For most parameters, increasing the read length reduces the variance by virtue of increasing An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e164.jpg, but for larger An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e165.jpg the results for 1 kb reads appear slightly worse than for 0.5 kb reads. Increasing the gap between the paired-end reads increases the variance for all except the very smallest An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e166.jpg. Intuitively, this makes sense: if all pairs of sites are very close together with low An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e167.jpg then a recombination event will only rarely occur between them; however, if all pairs are far apart with high An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e168.jpg then recombination events will saturate between the pairs of sites.

Figure 3
Performance of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e169.jpg.

With the above results suggesting that longer read lengths do not always yield a better estimate, we decided to directly compare a metagenomic-style sample to a “standard” population genetic sample in which the same individuals are sequenced at all loci. The fair comparison keeps the total number of sequenced bases constant, so we simulate a 10 kb region with either 100 reads of 1 kb each or 10 reads of 10 kb each (Figure 4). For simplicity, we do not simulate sequencing error. As analyzed in the Discussion, despite the average depth being identical between the two sets of simulations, the metagenomic sample (on the left) exhibits less bias and much lower variance than the standard sample (on the right).

Figure 4
Metagenomic versus standard population sampling.

Next we tested our approximation that adjusts the two-site likelihoods for different values of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e176.jpg (see Methods subsection “Two-locus complete likelihoods without error”) by fixing An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e177.jpg and simulating across An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e178.jpg ranging from 0.002 to 0.025 while estimating An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e179.jpg using a two-site likelihood table generated for An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e180.jpg (Figure 5). Again we do not simulate sequencing error to focus on the effects of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e181.jpg. Here we see that the correction (on the right in Figure 5) works quite well for An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e182.jpg above the likelihood table's driving value (i.e. An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e183.jpg) and somewhat less well for lower An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e184.jpg, with 3% of the simulations for An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e185.jpg giving infinite estimates. However, the uncorrected estimator (on the left) is strongly biased, with 98% of simulations for An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e186.jpg resulting in infinite (unplotted) estimates and 26% of those for An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e187.jpg. No other parameter values yielded any infinite estimates. The low An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e188.jpg results are exacerbated by the correlation of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e189.jpg with the number of polymorphic sites. Lower An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e190.jpg means fewer polymorphic sites; since the majority of information about recombination rate comes from polymorphic sites, we see a larger variance in our estimate of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e191.jpg for low An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e192.jpg.

Figure 5
Likelihood correction for different θs.

Finally we apply our estimator to the sludge metagenomics project by sliding a 50 kb window in 25 kb steps across the first 500 kb of the assembly and independently estimating the recombination rate within each window (Figure 6). All windows produced finite estimates with An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e199.jpg.

Figure 6
Parameter estimates from sludge data.

Discussion

The two-site composite likelihood estimator appears to be better suited for metagenomic samples (i.e. the purpose of this paper) than for standard population genetic samples (i.e. the purposes of [12],[19]) as seen from Figure 4. We believe this results from the balance of two opposing factors: greater linkage (less missing data) pushes the advantage toward the standard sample, while a larger genealogy with more independence pushes the advantage back toward the metagenomic sample. For the parameter ranges investigated here, the latter force wins and we see that the estimates for metagenomic samples have both less bias and lower variance for a fixed amount of sequencing. This result makes sense given the nature of the composite likelihood technique in which we treat each pair of sites as though it were independent of every other pair. The more chromosomes that are sampled, the more closely this independence assumption matches reality. An intriguing open question is how the composite likelihood estimator on metagenomic data compares to a full likelihood estimator on standard data, but we do not pursue this topic here.

The bias in the standard sample estimates (Figure 4) surprised us given theoretical results that assert consistency for the composite likelihood estimator [28]. However, consistency is an asymptotic feature and does not necessarily hold for finite samples. Indeed, further simulations of standard samples with greater sample depth reduced the bias to essentially zero with depth An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e200.jpg (results not shown). Given that metagenomic samples appear nearly unbiased with depth An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e201.jpg, the added independence of the metagenomic sample must allow the estimator to converge faster toward the asymptotic results.

Further, in contrast to Hudson's and McVean's programs (maxhap and LDhat, respectively), our method makes use of all pairs of sites, including sites observed to be fixed. We include these sites primarily as a byproduct of properly accounting for sequencing error, but these additional data also help reduce our variance. As a bonus, using all sites automatically makes our pairwise likelihoods true likelihoods, thus fulfilling one of the requirements for Fearnhead's [28] results proving the consistency of the composite likelihood estimator. If fixed sites were not included, then the pairwise likelihoods would need to be made conditional on only using pairs of segregating sites, which becomes computationally challenging when dealing with missing data. In fact, while maxhap and LDhat allow missing entries in their input data, this feature is not described in the accompanying papers [12],[19], and these implementations do not properly condition their likelihoods to account for the fact that they only use segregating sites. The only disadvantage of using all pairs of sites is that the likelihood calculation scales linearly with the number of pairs and thus using all pairs takes longer; however, our implementation still runs in a reasonable amount of time on realistic amounts of data (see “Complexity and approximations” subsection in Methods).

Real data include sequencing errors, which have the potential to bias population genetic inference and increase the variance of estimators [29]. Trimming the data based on quality scores will help reduce these problems, but the remaining error must still be taken into account. We do not have analytic theory quantifying the amount of bias introduced by sequencing error, but simulations show that unaccounted-for errors produce estimates biased toward a specific finite value of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e202.jpg that depends on the read length and gap size (results not shown). Intuitively, sequencing error primarily produces singletons, which yield different configurations depending on the distance separating the two sites with errors. If the two sites are close together, then errors will tend to generate 01 and 10 states. If the two sites are far apart, then errors will tend to generate 1? and ?1 states. The first group of states (01, 10) provides evidence for higher recombination since, if both mutations originally fell on the same chromosome (state 11), then recombination would have been necessary to break them up to be (01, 10). The second group of states (1?, ?1) provides evidence for lower recombination since this pattern of missing data is more likely to have arisen from (11, 11) states, which is suggestive of no recombination, then (01, 10) states. Thus sequencing error introduces a highly artificial pattern of configurations, with a combination of evidence for high recombination between close pairs of sites and low recombination between distant pairs of sites leading to a maximum likelihood at an intermediate value. For paired-end reads of 500 bases separated by a gap of 0, errors drive toward An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e203.jpg.

The striking inverse correlation between the estimates of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e204.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e205.jpg from the sludge data (Figure 6) could either be the result of an unknown artifact or a biological reality stemming from a dependence between recombination efficiency and sequence divergence. One possibility for an artifact would be sequencing error not accounted-for in the quality scores (e.g. a PCR error before sequencing). Such errors would certainly lead to increased estimates of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e206.jpg, but, on the basis of our simulations, seem unlikely to drive An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e207.jpg down to 0. Also, such errors would have to occur non-uniformly across the genome at a granularity of 50 kb, which seems implausible. Another potential source for an artifact is our two-step estimation procedure in which we first estimate An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e208.jpg without regard to recombination and then estimate An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e209.jpg conditional on An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e210.jpg. Again, however, simulations reveal that, while An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e211.jpg affects the variance of An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e212.jpg, the estimator is unbiased across all tested An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e213.jpg and shows no correlation between An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e214.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e215.jpg (results not shown). Without a clear artefactual explanation, we turn toward biology. Laboratory experiments have shown a negative log-linear dependence between sequence divergence and transformation efficiency [30], and an analysis of a different metagenomic dataset found a similar dependence between divergence and parsimoniously-inferred recombination events [10]. Our data suggest that this pattern holds at a finer resolution with subtle increases in diversity, as quantified by An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e216.jpg, leading to lower rates of recombination in a log-linear manner, with the exception of regions in which recombination appears nonexistent (Figure 7).

Figure 7
Log-linear relationship between An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e217.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e218.jpg.

On an absolute scale, these estimates from the sludge data fall into a plausible range for bacterial populations. For instance, in Campylobacter jejuni An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e220.jpg [31] and in Neisseria meningitidis An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e221.jpg ranges from 0.00270 to 0.034 [11]. However, previous estimates of microbial recombination rates have been based on much smaller amounts of data (in these examples, An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e222.jpg bases) relative to the sludge windows of 50 kilobases. In addition, C. jejuni and N. meningitidis are both pathogens, which makes for a quite different ecological and evolutionary environment than that of the nonpathogenic sludge bacterium A. phosphatis. When the sludge estimates of mutation and recombination are viewed relative to each other, we see that mutation events generally occur more frequently than recombination events (An external file that holds a picture, illustration, etc.
Object name is pgen.1000674.e223.jpg), which places A. phosphatis more toward the clonal end of the bacterial spectrum [32].

Overall, our new estimator produces surprisingly accurate estimates of recombination rate, particularly considering the amount of missing data. The real power of the estimator derives from the greater independence of the genealogies underlying the sample; sequencing error and missing data present hurdles to accessing this information but our estimator has surmounted them. Despite our motivation from microbial populations, our method itself makes no assumptions inherent to microbial populations. For our purpose, a “metagenomic” sample simply means sampling a mixture of a large number of individuals from a single species, in which each read (or pair of reads) can be safely assumed to have originated from a different individual. Given the results from the comparison to a standard sample, the metagenomic approach should always be followed to obtain maximal information about recombination for a fixed amount of sequencing.

An implementation of our Population genetic Inference In Metagenomics (PIIM) method is freely available for download from http://ib.berkeley.edu/labs/slatkin/software.html.

Acknowledgments

We thank Phil Hugenholtz at JGI for sharing the assembly of the sludge data.

Footnotes

The authors have declared that no competing interests exist.

This research was supported by National Institutes of Health grant R01-GM40282 to MS and by a Chang-Lin Tien Fellowship to PLFJ. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Lenski RE, Winkworth CL, Riley MA. Rates of DNA sequence evolution in experimental populations of Escherichia coli during 20,000 generations. J Mol Evol. 2003;56:498–508. [PubMed]
2. Whitaker RJ, Grogan DW, Taylor JW. Recombination shapes the natural population structure of the hyperthermophilic archaeon Sulfolobus islandicus. Mol Biol Evol. 2005;22:2354–61. [PubMed]
3. Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. A fine-scale map of recombination rates and hotspots across the human genome. Science. 2005;310:321–4. [PubMed]
4. Smith GR. Hotspots of homologous recombination. Experientia. 1994;50:234–41. [PubMed]
5. Brouns SJ, Jore MM, Lundgren M, Westra ER, Slijkhuis RJ, et al. Small CRISPR RNAs guide antiviral defense in prokaryotes. Science. 2008;321:960–4. [PubMed]
6. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, et al. Opinion: Re-evaluating prokaryotic species. Nat Rev Microbiol. 2005;3:733–9. [PubMed]
7. Fraser C, Hanage WP, Spratt BG. Neutral microepidemic evolution of bacterial pathogens. Proc Natl Acad Sci U S A. 2005;102:1968–73. [PMC free article] [PubMed]
8. Smith JM, Smith NH, O'Rourke M, Spratt BG. How clonal are bacteria? Proc Natl Acad Sci U S A. 1993;90:4384–8. [PMC free article] [PubMed]
9. Didelot X, Falush D. Inference of bacterial microevolution using multilocus sequence data. Genetics. 2007;175:1251–66. [PMC free article] [PubMed]
10. Eppley JM, Tyson GW, Getz WM, Banfield JF. Genetic exchange across a species boundary in the archaeal genus ferroplasma. Genetics. 2007;177:407–16. [PMC free article] [PubMed]
11. Jolley KA, Wilson DJ, Kriz P, McVean G, Maiden MC. The influence of mutation, recombination, population history, and selection on patterns of genetic diversity in Neisseria meningitidis. Mol Biol Evol. 2005;22:562–9. [PubMed]
12. McVean G, Awadalla P, Fearnhead P. A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics. 2002;160:1231–41. [PMC free article] [PubMed]
13. Martn HG, Ivanova N, Kunin V, Warnecke F, Barry KW, et al. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol. 2006;24:1263–9. [PubMed]
14. Tyson G, Chapman J, Hugenholtz P, Allen E, Ram R, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. [PubMed]
15. Johnson PLF, Slatkin M. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res. 2006;16:1320–7. [PMC free article] [PubMed]
16. Simmons SL, Dibartolo G, Denef VJ, Goltsman DS, Thelen MP, et al. Population genomic analysis of strain variation in Leptospirillum group II bacteria involved in acid mine drainage formation. PLoS Biol. 2008;6:e177. doi: 10.1371/journal.pbio.0060177. [PMC free article] [PubMed]
17. Stumpf MP, McVean GA. Estimating recombination rates from population-genetic data. Nat Rev Genet. 2003;4:959–68. [PubMed]
18. Wang Y, Rannala B. Population genomic inference of recombination rates and hotspots. Proc Natl Acad Sci U S A. 2009;106:6215–9. [PMC free article] [PubMed]
19. Hudson RR. Two-locus sampling distributions and their application. Genetics. 2001;159:1805–17. [PMC free article] [PubMed]
20. Wall JD. Estimating recombination rates using three-site likelihoods. Genetics. 2004;167:1461–73. [PMC free article] [PubMed]
21. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998;8:186–94. [PubMed]
22. Frisse L, Hudson RR, Bartoszewicz A, Wall JD, Donfack J, et al. Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am J Hum Genet. 2001;69:831–43. [PMC free article] [PubMed]
23. Langley CH, Lazzaro BP, Phillips W, Heikkinen E, Braverman JM. Linkage disequilibria and the site frequency spectra in the su(s) and su(w(a)) regions of the Drosophila melanogaster X chromosome. Genetics. 2000;156:1837–52. [PMC free article] [PubMed]
24. Fearnhead P, Donnelly P. Estimating recombination rates from population genetic data. Genetics. 2001;159:1299–318. [PMC free article] [PubMed]
25. Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008;36:D534–8. [PMC free article] [PubMed]
26. Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–8. [PubMed]
27. Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997;13:235–8. [PubMed]
28. Fearnhead P. Consistency of estimators of the population-scaled recombination rate. Theor Popul Biol. 2003;64:67–79. [PubMed]
29. Johnson PLF, Slatkin M. Accounting for bias from sequencing error in population genetic estimates. Mol Biol Evol. 2008;25:199–206. [PubMed]
30. Roberts MS, Cohan FM. The effect of DNA-sequence divergence on sexual isolation in Bacillus. Genetics. 1993;134:402–8. [PMC free article] [PubMed]
31. Wilson DJ, Gabriel E, Leatherbarrow AJ, Cheesbrough J, Gee S, et al. Rapid evolution and the importance of recombination to the gastroenteric pathogen Campylobacter jejuni. Mol Biol Evol. 2009;26:385–97. [PMC free article] [PubMed]
32. Hanage WP, Fraser C, Spratt BG. The impact of homologous recombination on the generation of diversity in bacteria. J Theor Biol. 2006;239:210–9. [PubMed]

Articles from PLoS Genetics are provided here courtesy of Public Library of Science

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...