![]() | ![]() |
Formats:
|
|||||||||||||||||||||||||||||||||||
Copyright : © 2008 Simmons et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Population Genomic Analysis of Strain Variation in Leptospirillum Group II Bacteria Involved in Acid Mine Drainage Formation 1 Department of Earth and Planetary Science, University of California, Berkeley, Berkeley, California, United States of America 2 Chemistry Directorate, Lawrence Livermore National Laboratory, Livermore, California, United States of America Jonathan A Eisen, Academic Editor University of California, Davis, United States of America #Contributed equally. * To whom correspondence should be addressed. E-mail: jbanfield/at/berkeley.edu Received February 19, 2008; Accepted June 12, 2008. This article has been cited by other articles in PMC.Abstract Deeply sampled community genomic (metagenomic) datasets enable comprehensive analysis of heterogeneity in natural microbial populations. In this study, we used sequence data obtained from the dominant member of a low-diversity natural chemoautotrophic microbial community to determine how coexisting closely related individuals differ from each other in terms of gene sequence and gene content, and to uncover evidence of evolutionary processes that occur over short timescales. DNA sequence obtained from an acid mine drainage biofilm was reconstructed, taking into account the effects of strain variation, to generate a nearly complete genome tiling path for a Leptospirillum group II species closely related to L. ferriphilum (sampling depth ~20×). The population is dominated by one sequence type, yet we detected evidence for relatively abundant variants (>99.5% sequence identity to the dominant type) at multiple loci, and a few rare variants. Blocks of other Leptospirillum group II types (~94% sequence identity) have recombined into one or more variants. Variant blocks of both types are more numerous near the origin of replication. Heterogeneity in genetic potential within the population arises from localized variation in gene content, typically focused in integrated plasmid/phage-like regions. Some laterally transferred gene blocks encode physiologically important genes, including quorum-sensing genes of the LuxIR system. Overall, results suggest inter- and intrapopulation genetic exchange involving distinct parental genome types and implicate gain and loss of phage and plasmid genes in recent evolution of this Leptospirillum group II population. Population genetic analyses of single nucleotide polymorphisms indicate variation between closely related strains is not maintained by positive selection, suggesting that these regions do not represent adaptive differences between strains. Thus, the most likely explanation for the observed patterns of polymorphism is divergence of ancestral strains due to geographic isolation, followed by mixing and subsequent recombination. Author Summary Communities of microbes in nature consist of a large number of distinct individuals. The variation in DNA sequence between these individuals contains a record of the evolutionary processes that have shaped each community. In most environments, however, the high number of distinct species makes obtaining information about the nature of this variation difficult or impossible. We obtained large amounts of sequence data for a natural community in an acid mine drainage system consisting of only a few species. This enabled us to reconstruct the genome of the dominant bacterium (Leptospirillum group II) and obtain detailed information about sequence variation between individuals, including differences in both gene content and gene sequence. Our analysis shows extensive recombination between closely related populations, as well as fewer instances of recombination between more distantly related individuals. Additionally, viruses and plasmids account for high variability in gene content between individuals. We conclude that sequence-level variation in this population is maintained through neutral processes (migration, recombination, and genetic drift) rather than natural selection. This suggests that closely related strains of the Leptospirillum group II population may not be ecologically distinct. Introduction A characteristic of natural populations is that they are comprised of individuals that are, in the majority of cases, not genomically identical to each other. Heterogeneity present at any time reflects the outcome of the interplay between processes that create variation (e.g., mutation and lateral gene transfer) and those that remove it (e.g., selection and genetic drift). In addition, genetic variation can be introduced into a population via migration and subsequent recombination. Variation between individuals appears both as divergence at the single nucleotide level and the presence of hypervariable gene “islands” within a more stable set of genes shared by multiple isolates [1–4]. The potential adaptive value of this variation is an important and controversial question in microbial ecology [5,6]. Through analyses of natural populations that explicitly consider genetic variability, it is possible to evaluate the basis for potential metabolic differences and infer aspects of the evolutionary processes that occur over relatively short timescales within natural communities. Initial deductions of the processes that shape microbial lineages were based on genomic studies of single isolated strains. More recently, genomic comparisons amongst multiple isolated strains of the same species have been possible [2,7–13]. These studies defined the concept of a pangenome for a species, reflecting the observation that the gene content of a population exceeds that of any single individual member [14]. Comparative genome analyses revealed extensive gene transfer via mechanisms that include insertion of phage or plasmid DNA. However, there has been only limited analysis of the extent of heterogeneity in gene content within coexisting cells that comprise natural populations [15–17]. Such studies benefit from relatively deep genomic sampling so that orthologous regions can be compared. Despite the rapid increase in number of metagenomic studies of natural communities [18–22] and environments [16,17,23–27] over the past few years, few datasets provide the amount of sequence required for near-complete genome reconstruction and analysis of population-level heterogeneity. Relatively low-diversity microbial communities are especially tractable for this type of analysis, especially those dominated by one organism type with a tightly defined population structure. Acid mine drainage (AMD) biofilms have proven particularly well suited to community genomic analyses because the majority of the biofilm community consists of four to six organism types, one of which is typically dominant. This organism can be relatively deeply genomically sampled by shotgun sequencing [21]. Tyson et al. [21] reported extensive reconstruction of genome fragments from the five species in one AMD biofilm using approximately 76 Mb of genomic sequence data. Most of the composite genome for the dominant bacterium, a member of Leptospirillum group II that is closely related to Leptospirillum ferriphilum, was sampled to a depth of approximately 12×. Binning of assembled fragments was achieved primarily based on GC content (to separate out fragments deriving from coexisting archaea) and sequence coverage (sampling depth, to separate out fragments from a Leptospirillum group III population). Initial studies using this dataset revealed evidence for transposon proliferation [15], genetic exchange between relatively closely related archaeal organisms via homologous recombination [21,28], and gain and loss of genes from proviral DNA [15]. Simultaneous analysis of DNA sequences derived from natural communities presents a number of challenges, the first of which is to correctly assign genome fragments to the appropriate organism. The most robust approach for binning is genome reconstruction [18,21], but this task is complicated by fragmentation of assemblies due to genomic heterogeneity [29], as confirmed at a limited number of loci in a metagenomic dataset of complex environmental systems [16]. In the current study, we resolved much of this fragmentation via manual curation of an assembly of an expanded genomic dataset (130 Mb) to capture the form and distribution of genetic variation genome-wide in a natural population. The coexistence of multiple closely related strains is a commonly observed phenomenon in microbial communities, and is apparent from studies of cultured isolates [2,11,30], marker gene surveys [31–33], and metagenomic data [3,15–17]. Despite the existence of a robust theoretical apparatus in population genetics to determine the strength and direction of selection using polymorphism and divergence data (reviewed in [34]), these approaches have not been previously applied to the question of whether genome-wide nucleotide divergence within microbial populations reveals adaptive differences between closely related, coexisting strains. In this paper, we address the question of whether adaptive differences between closely related, coexisting strains of Leptospirillum group II can be detected using population genetic analysis of single nucleotide polymorphisms (SNPs). We attempt to distinguish in situ differentiation and selection from migration and recombination as the origin of the observed patterns of nucleotide variation. Through comprehensive analysis of both variation in gene sequence and gene content, we uncover details of the recent natural history of Leptospirillum group II. Results Core Genome Assembly and Annotation Leptospirillum group II scaffolds, their functional annotations, and alignments to the earlier assembly [21] are reported in Table S1. The near-complete genome, spread over 79 contigs, representing 2.72 Mb of sequence, as well as the 2,588 ORFs with manually curated start positions and functional annotations, is deposited at the National Center for Biotechnology Information (NCBI), along with the trace files for the 54 Mb of new sequence (accession number AADL00000000). The total number of assembled reads is 42,701, indicating that each read likely derived from a single cell (there were ~108 cells in the sequenced sample [21]). Metabolic analyses and ecological profiles of the Leptospirillum group II species will be reported separately (D. S. A. Goltsman, V. J. Denef, S. W. Singer, N. C. VerBerkmoes, M. Lefsrud, et al., unpublished data). The core genome of the 5-way CG Leptospirillum group II population described here is largely syntenous with that of the UBA Leptospirillum group II population reported previously [18]. (“5-way CG” and “UBA” indicate different sites within the Richmond Mine from which samples were collected for genomic sequencing. The dominant Leptospirillum group II species derived from each of these assemblies were labeled to indicate their geographic origin.) Where lack of sequence information precluded ordering of scaffolds, scaffolds were ordered based on the UBA assembly. These junctions are evident in Table S1 as points where scaffold numbers change. Locations with uncertain ordering in the UBA assembly are circled on the outer ring of Figure 1
Genomic Variation Although extensive reconstruction of a core genome following the dominant path during assembly was possible, variation in gene content, sequence composition, and recombination events contribute to population-level heterogeneity. We mapped the sources of variation for a 700-kb fragment (~1/4 of the Leptospirillum group II genome) that includes the origin of replication (Figure 2
Most of these strain variant paths consist of one or a few reads (Figure 2 A common cause of genome assembly fragmentation is the insertion of transposases into the genomes of some individuals and not others. The genome path for the manually curated assembly (Table S1) was reconstructed across many of these insertion points. Inserted blocks predominantly encode hypothetical proteins (e.g., Figure 3
Genome-Wide Patterns in Sequence Composition We identified and classified every polymorphic site in the assembly. Different Phred quality score cutoffs were used to examine the effect of sequencing error on SNP identification (Table S2), because SNPs appearing in only one read (“singletons”) are more likely to result from sequencing error than are replicated SNPs, yet the number of true singletons is important for subsequent population genetic analysis. Raising cutoff scores reduced the number of singleton polymorphic sites identified, particularly indels. Notably, the ratio of nonsynonymous to synonymous SNPs remained relatively constant for quality cutoffs of 25 and higher (range 0.84–0.88), as did the overall SNP density (range 0.06%–0.09%), suggesting that sequencing error affected these site classes in an unbiased fashion. The proportion of all SNPs found in two or more reads (replicated) ranged from 32% for a cutoff score of 20 to 47% for a cutoff score of 50. We used a quality cutoff of 25, corresponding to a per-base error probability of 3.2 × 10−3, based on an analysis of the effect of increasing quality scores on the outcome of population genetic tests (Table S4 and Text S1). For this quality cutoff, the overall SNP density was 0.09%, and 38% of SNPs were replicated. We mapped the density of replicated and nonreplicated SNPs around the entire genome using sliding windows of length 1 kb and displacement 50 bp (Figure 1 Regions with both high SNP density and high minor allele frequency are indicative of the presence of blocks of variant sequence. We grouped closely related SNP patterns into strains genome-wide with Strainer [26] (Figure 5
Variation Due to Recombination We identified a number of locations where a subset of individuals with a replicated SNP pattern transitioned into another replicated pattern within or between mate-paired reads, indicating homologous recombination (Figure 6
Recombination events also involve a distinct sequence type with approximately 94% sequence identity to the composite sequence. Some reads were approximately 100% identical to the UBA Leptospirillum group II sequence type (Figures 6
One of the deeply sampled CG-UBA recombinant regions begins very close to the origin of replication in contig 11389 (Figure 6 The three deeply sampled recombinant regions between divergent Leptospirillum group II and CG-type reads were also analyzed with the McDonald-Kreitman (MK) test for selection. The MK test uses a 2 × 2 table to test the independence of the ratio of nonsynonymous to synonymous polymorphisms and the ratio of nonsynonymous to synonymous fixed differences [36]. The UBA-CG recombinant region near the origin of replication in contig 11389 (Figure 7 Population Genetic Tests of Strain Variation Fourteen Phrap contigs (1.8 Mbp), totaling 66% of the 5-way CG Leptospirillum group II composite genome, had substrains of sufficient depth (average read depth 3.3) for analysis of polymorphism and divergence in comparison with the dominant strain group. These contigs are primarily located near the origin of replication (Figure 1 Overall, the average density of polymorphisms is low in both the dominant strain and all substrains. Within regions of the dominant strain overlapping variant blocks, pN/pS = 0.43. Within all substrains, pN/pS = 0.33. dN between the dominant strain and all substrains was 0.05%, whereas for synonymous sites, dS = 0.26%, giving dN/dS = 0.18. The average SNP density within the dominant strain was calculated separately for regions overlapping defined substrains and those between substrains (“interstrain regions,” Figure 5 We calculated Tajima′s D [39], an estimator of whether observed polymorphism frequencies deviate from the neutral expectation (see Materials and Methods), for all main strains, substrains, and interstrain regions. No individual values of D were significantly different from 0. The distribution of D is suggestive, however. Within main strains, 43 out of 51 values of D were negative (binomial probability 9 × 10−8, probability of success 0.5) and in interstrain regions 38 of 51 values of D were negative (binomial probability 1 × 10−4, probability of success 0.5). These results suggest an excess of rare mutations in most of the genome relative to the number of segregating sites. Within substrains, there was a trend towards positive values of D (D > 0 for 16 of 31 strains with defined D), but this was not significant (binomial probability 0.14). The numbers of nonsynonymous and synonymous substitutions within and between strains for a quality score cutoff of 25 are shown in Table 1. The contingency table for the MK test yielded χ2 = 20.932 (p < 0.0001), indicating that the ratios of nonsynonymous to synonymous mutations within (1.12) and between populations (0.52) are significantly divergent. The relative number of fixed nonsynonymous substitutions is smaller than the relative number of polymorphic nonsynonymous substitutions, suggesting negative (purifying) selection. Contingency tables for individual substrains with fewer than a total of ten substitutions were tested using the Fisher exact test rather than the χ2 test. After a sequential Bonferroni correction for multiple tests, no strain had a statistically significant difference at p < 0.05 between polymorphisms and fixed differences.
The Poisson random field refinement of the MK test (MKPRF) test [40] also did not find evidence that selection maintains substrains. Because this test uses segregating sites to calculate maximum likelihood estimates of shared parameters, it cannot be applied to substrains with very few segregating substitutions. Only 11 strains could be analyzed with the MKPRF test, and none of these showed statistically significant evidence for selection. We also examined the effect of Phred quality score cutoffs and read depth on the outcome of the MK test (Table S4 and Text S1). The conclusion that purifying selection is acting on strain variant regions is robust to the effect of sequence error, read depth, and polymorphism frequency. Overall, the results indicate that genome-wide evidence of positive selection is not being masked by the presence of invalid single-copy SNPs caused by sequencing error. Discussion Comprehensive analysis of diversity in complex systems has been limited by the inability to deeply sequence any of the constituent populations. The low species richness of the AMD microbial communities has allowed for nearly complete genome reconstruction for the dominant populations. In particular, Leptospirillum group II was sampled an average of 20 times at each genomic locus. Whereas the vast majority of sequencing reads originating from this population have been placed into one genomic path through manual curation of the assembly, we uncovered evidence for considerable population-level diversity in the form of strain variation, in part mediated by integrated elements of putative phage, plasmid, or transposon origin. Although these phenomena have been previously reported for complex environments [3,11,16], here we quantified the distribution of the variability in gene sequence and content across large regions of the genome. The high resolution of our data allowed us to test hypotheses regarding the processes that shaped this natural population. Recombination and Mutation There are two possible sources of new intrapopulation sequence variation: mutation and/or immigration of new variant types, followed by recombination with members of the preexisting population. The variation may have adaptive value, may be deleterious, or may be neutral. These alternative hypotheses predict distinct patterns of variation within the Leptospirillum group II population. Recombination has been widely documented in natural microbial populations [18,21,28,41–44]. We observed the signatures of recombination events among two or more closely related 5-way CG Leptospirillum group II strains, as well as recombination between the 5-way CG population and more distantly related Leptospirillum group II genome types (Figure 6 The similarity between the read depths of the observed strains (Figure S1) raises the question of how many distinct haplotypes are present in the population. It is tempting to conclude that because strains in different regions of the genome are present at similar frequencies, they must therefore belong to a single haplotype present at that frequency in the population. Short read lengths and low overall SNP density prevent us from directly linking distinct strains within the assembly, however. It will be necessary to examine whether the frequencies of these strains remain correlated in multiple samples across space and time to determine the number of distinct haplotypes present in the population. Notably, the value of the per-site population mutation parameter θ in the 5-way Leptospirillum group II population is smaller than most observations of θ obtained from multi locus sequence typing (MLST) data from bacterial pathogens [45,46]. Per-site MLST estimates of θ averaged over multiple loci ranged from a low of 5 × 10−3 for Neisseria gonorrhoeae to 0.1 for Haemophilus influenzae, probably reflecting the more diverse populations from which alleles for MLST were drawn and the limited sampling of individual loci compared to our genome-wide estimate. The value of θ that we observe using an identical formula (2 × 10−4) is indicative of the overall low levels of variation in the Leptospirillum group II population, perhaps suggesting a relatively recent population expansion or purifying selection. We also calculated an estimate of θ using a model for metagenomic data that explicitly accounts for sequencing error in estimation, and tested whether the data fit a model of constant population size or exponential growth [37]; the fit of these models was not significantly different. Contig-based estimates of θ were consistent for a model of constant population size (0.05 ± 0.02). The only other available estimate of θ for microbial metagenomic data is for a population from an activated sludge bioreactor [37], where θ* was estimated to be 0.015 under a model of population growth. The activated sludge population was nearly clonal and had undergone recent population expansion prior to sampling. The slightly higher value of θ we observe in Leptospirillum group II using this method is indicative of a larger amount of standing variation than was present in the sludge population. Selection on Closely Related Strains Given the clear evidence that multiple closely related strains coexist in the Leptospirillum group II population, the key question is whether this variation results in differential fitness between strains. If strains are maintained because each is maximally fit in a different ecological niche, we expect to see evidence of positive or balancing selection in variant regions. This expectation underlies the ecotype or periodic selection concept, which posits that distinct sequence clusters in microbial populations correspond to ecologically separate units [47]. The “Stable Ecotype Model” [5] suggests that diversity within separate clusters is periodically purged by selection, and distinct clusters persist due to the low frequency of intercluster recombination. Permanent divergence only arises if a mutation allows an individual of one ecotype to colonize a new niche [5]. This model suggests that selection maintains coexisting clusters, and that within-cluster diversity is low. Under this scenario, we expect to see an excess of fixed nonsynonymous differences between strains, as these amino acid changes would indicate niche-specific adaptations. Alternatively, if sequence variation is not maintained by positive selection, we expect to find evidence for neutral or negative selection in these regions. We tested these alternative hypotheses for the coexistence of closely related strains using the MK test for polymorphism and divergence [36]. This test posits that under a model of neutral evolution, the ratio of nonsynonymous (replacement) to synonymous substitutions within a population is the same as the ratio of replacement to synonymous fixed differences between populations. An excess of replacement fixed differences indicates positive selection, whereas a dearth indicates negative selection. Our results from MK tests on regions of strain variation indicate strongly that positive selection on amino acid substitutions does not separate coexisting strains. We observe instead a significant deficiency of fixed replacements relative to polymorphic replacements, indicating that negative (purifying) selection is acting to remove deleterious nonsynonymous mutations differentiating closely related strains. This result is robust to the effect of sequencing error in SNP estimation and the use of low-frequency (nonreplicated) polymorphisms (Tables S2 and S4, and Text S1). These additional analyses clearly demonstrate that an excess of low-frequency polymorphisms due to sequencing error is not masking evidence of genome-wide positive selection. Recent theoretical results indicate that an abundance of slightly deleterious polymorphisms can substantially mask the presence of adaptive evolution [48]. However, in that case, one would expect the average allele frequency of nonsynonymous polymorphisms to be significantly lower than that of synonymous polymorphisms [48], which we do not observe (average synonymous site frequency in dominant strain, 0.14; average nonsynonymous site frequency, 0.16). Our results therefore indicate that this population does not fit the predictions of the stable ecotype model, and suggests that strains of 5-way CG Leptospirillum group II detected by sequence variation may occupy a single niche within the biofilm environment. These results are also consistent with the strong purifying selection acting on most genes in the coexisting archaeal genus Ferroplasma, despite a far more extensive mosaic genome structure than that of Leptospirillum group II [15]. Our results also rule out an alternative to the stable ecotype model, in which high recombination rates prevent periodic selective sweeps from purging diversity within strains. In this scenario, strain variant regions contain beneficial mutations decoupled from the remainder of the genome. This hypothesis—that recombination speeds adaptation by spreading beneficial variants throughout the population—again posits that these variant regions are under positive selection. Thus, it predicts a detectable signature of excess fixed amino acid replacements in recombinant regions and distinct patterns of polymorphism frequency between regions containing adaptive variants and the remainder of the genome. Results from MK tests rule out positive selection on strain variant regions. The observed patterns of polymorphism density also do not match the second prediction of the high recombination model. We do not observe a significant difference in polymorphism density between reads within substrain blocks, reads within the main strain, and reads in interstrain regions, as would be expected if the action of selection on these regions was decoupled due to high recombination rates. Although the substrains within the 5-way CG population do not appear to be under selection themselves, it is plausible that that linkage between them is absent due to selective sweeps eliminating variation in intersubstrain regions. This hypothesis predicts a lower polymorphism density in intersubstrain regions than within substrains, since a selective sweep would carry hitchhiking neutral variants to fixation. Again, the observation that polymorphism density within substrain and interstrain regions is not significantly different is strong evidence against this hypothesis. The interblock selective sweep scenario therefore requires a higher mutation rate in intersubstrain regions, which would have allowed polymorphisms to accumulate after a selective sweep to the same levels observed in adjacent substrain regions. Such extreme spatial variation in genomic mutation rate is unlikely, and hence, we can also rule out the “interstrain” sweep model. An alternative explanation for the observed patterns of polymorphism and divergence is that a few sites within each substrain have experienced positive or balancing selection, while all other neutral polymorphisms in the substrain are hitchhiking to higher frequency along with the selected sites. Such a signal might not be detectable using MK polymorphism-divergence tests. This scenario requires that recombination has not yet separated the adaptive site from linked neutral sites. If the mutation occurred in the distant past, the recombination rate would have to have been extremely low relative to the mutation rate to preserve the well-defined substrain blocks observed here. Because there is evidence for significant recombination, this scenario seems unlikely. If the positively selected mutation occurred recently, it would be in the process of sweeping to fixation along with linked neutral polymorphisms. Under this selective sweep scenario, we expect low levels of polymorphism in the region associated with the adaptive mutation relative to the intersubstrain region (Figure 5 Of the three deeply sampled recombinant regions between CG-type and other Leptospirillum group II types (Figure 1 The most likely explanation for the existence of blocks of closely related variant sequence is geographic separation of two or more ancestral strains, followed by subsequent mixing and recombination. The mosaic genome structure of Ferroplasma is also consistent with this model [15]. The high read depth and connectivity of the dominant strain compared to the smaller fragments of alternate sequence type (Figure 5 Assuming that the observed divergence between strains is typical of ancestral haplotypes, and a mutation rate of 10−9 per site per generation [49], the ancestral strains had to be separated for at least 8 × 105 generations to accumulate the observed divergence. This translates into a separation time of at least 1,400 y, given a minimum generation time of 15 h [50]. The Richmond ore body was exposed to weathering for 780,000 y prior to the onset of mining operations around 150 y ago [51]. A physical separation of the strains of order 103 y is therefore possible. Assuming that the two strains were reunited 150 years ago (at most about 100,000 generations), the observed level of variation would be maintained, provided that the effective population size (Ne) is sufficiently large that the original variation would not all drift to fixation in this time (Ne at least the same order of magnitude as the number of generations). If we postulate that the effective population size of Leptospirillum group II is at least of this order of magnitude (plausible, given 108 cells per cm3 in the biofilm), it is certainly possible that the two strains could have comingled as the result of increased connectivity within the mountain due to mining. In summary, the combination of assumptions required to fit alternative explanations to the data makes the simpler explanation of geographic separation and subsequent remixing more likely. Genome-Wide Spatial Patterning of Recombination Substrains are distributed unevenly around the genome in a mosaic of 1–10-kb fragments (Figure 1 The mechanism responsible for this symmetry is not known, but would seem to indicate a higher probability of recombination events associated with newly initiated replication forks proceeding bidirectionally from the origin of replication. It is well known that recombinational intermediates can be converted into functional replication forks (reviewed in [52]). Replication forks resulting from recombination of introduced fragments could interact with replication forks initiated at oriC, resulting in the duplication of a chromosome containing the introduced fragment, but this process is not well understood [52]. Another example of genomic symmetry across the replication axis involves large- and small-scale inversions in closely related bacterial species [53,54]. The reason for this is thought to be selection against disruption of polarized motifs clustered symmetrically near the terminus of replication and important for chromosome replication and segregation [55]. Perhaps recombination events closer to the terminus of replication are counterselected due to interference with these motifs, resulting in an apparent clustering of events around the origin. A breakdown in synteny between related strains further from the origin is evoked as an explanation for symmetry around the origin of replication in Ferroplasma [28], but the lack of major genome rearrangement events observed in our assembly makes it less likely as an explanation for the 5-way CG Leptospirillum group II population. The presence of a large integrated plasmid within the Leptospirillum group II population suggests conjugation as a possible mechanism of genetic transfer between individuals. This plasmid region contains genes coding for proteins in the tra and trb loci, which are known to be involved in formation of a bridge between two cells through which chromosomal DNA transfer can occur. In Escherichia coli Hfr strains, transfer is initiated at the oriT site and proceeds in the 5′ to 3′ direction. Marker genes closer to the oriT site are transferred more frequently due to the increased probability of breakage of conjugation bridges with time. Therefore, if conjugation were the primary mechanism of transfer, we would expect to see more incorporation of novel fragments in 5-way CG Leptospirillum group II close to, and on one side of, the putative oriT (sequence and polarity not known). Despite uncertainty about how the block containing the origin of replication is linked to the block containing the oriT, we can evaluate both of these considerations to some extent. The incidence of recombined substrain blocks is higher clockwise from the integrated plasmid region than counterclockwise from it, except immediately adjacent to the oriT. Selection against recombined regions carrying variants of conserved proteins (e.g., a large region encoding ribosomal proteins) may have obscured this pattern to some extent. The short recombinant fragments observed both in this study and in Ferroplasma [28] are strikingly similar to those observed among strains of E. coli. Milkman and colleagues [56–58] showed that E. coli chromosomes typically contain distinct sequence blocks (size range around 1 kb) embedded in larger regions that have overall high similarity to other strains. A subsequent statistical comparison of six whole E coli genome sequences also supports a mosaic structure, albeit with shorter fragment size (50% < 1 kb, 80% < 2 kb [59]). These segments are an order of magnitude less than the average length observed in individual conjugation (48–105 kb) or transduction (10–32 kb) events [57]. The degree of fragmentation of donor DNA was determined by differences in donor–recipient restriction-modification (RM) systems [57,60]. The clustered pattern of small segments that we observe is consistent with the idea that large pieces of DNA are transferred by conjugation or transduction between closely related strains, fragmented, and then incorporated as a series of small segments [56]. The fragmentation likely occurred through the action of restriction enzymes. It is interesting that most genes annotated as part of RM pathways in Leptospirillum group II are found in strain variant regions, suggesting that restriction specificity varies among closely related organisms (Table S1). A similar observation was made for an environmental Ferroplasma population [15]. Thus, the diversity of RM systems within the 5-way CG population could explain the generation of small DNA fragments from DNA introduced from closely related strains. Gene Content Variation In addition to variation in sequence composition, variation in gene content could lead to an increase in fitness, be neutral, or be deleterious. As shown from between-strain genomic comparisons of isolates [2,7–13] and population analyses [15], a high fraction of individuals in a population contain unique regions due to phage, plasmid, or transposon insertion. Community genomics identifies the laterally transferred regions that are present population-wide in addition to those present at low frequency at the time and conditions of sampling, and also provides information about the degree of gene content variation in these regions. Our data suggest that the majority of the genes on strain variant paths are present in low copy number in the population. Most observed alternate genome paths are short (a single read length, or about 1 kb; Figure 2 The high estimated number of coexisting genome types is fairly consistent with estimates based on genotyping of isolates from a complex system [11]. Importantly, however, the majority of the cells in the population share the majority of their genomes, which is what makes reconstruction of a composite genome possible. Based on their low frequency, the functional significance of these highly variable regions is unclear. Tracking of strain population structure over time may help to resolve the extent to which different genotypes represent distinct adaptations. It is notable that many of the strain variant regions present at higher abundance in the population (reconstructed alternative genome paths listed near the end of Table S1) encode genes typical of plasmids (e.g., trwB, traA, traB, and mobD) or viruses (e.g., integrases). Of those integrated elements present at high frequency, some encode genes of potential functional significance. A case in point involves a region where most individuals carry one or more genes of the LuxIR system, involved in quorum sensing [61], whereas only a minor part of the population encodes the complete pathway (Figure 4 High-frequency variants, which are often associated with marked variation in gene content (such as the LuxIR region), could be the result of multiple processes: (1) a mobile element encoding multiple genes inserts into one genome, and that type increases in frequency; (2) the variable loss of genes from mobile elements results in different individuals with differences in gene content; (3) a mobile element integrates into different individuals at the same site, localized by some genomic feature (e.g., tRNAs); and (4) a mobile element integrates at a single site in one individual, followed by an increase in frequency of that genome type, followed by recombination of related but distinct elements in or near the insertion site. For example, closely related viruses derived from a diverse viral population could use the initial virus insertion as a locus for homologous recombination. This could result in a high diversity of genome types at the altered locus. High-frequency variants also result from recombination between related individuals followed by an increase in frequency of individuals carrying the recombinant block. A notable example of functionally significant genes on a recombinant block is the block of UBA genes containing a CRISPR system, including CRISPR-associated (cas) genes (Figure 7 Results of the current study suggest that a block containing CRISPR genes recombined from the UBA population into the 5-way CG population in a single event (Figure 7 Conclusions Deep sequencing of the 5-way CG Leptospirillum group II population has given us both an unprecedented catalog of within-population variation and the opportunity to test hypotheses relating to the origin and maintenance of this variation. This opens a route for incorporating population genomic evidence into reconstruction of the natural history of the Iron Mountain microbial ecosystem. Very few metagenomic studies of microbial communities have applied the tools of population genetics to understand the origin and maintenance of sequence-level variation (e.g., [15,37]). In the present study, we translate community genomic data into a form suitable for one such test [36] by separation and comparison of genomic regions varying in their SNP composition. We show that, contrary to many models of microbial population structure, extensive strain variation in Leptospirillum group II is not maintained by positive selection for adaptive variants. Instead, we detected strong evidence for purifying selection in strain variant regions. The data support a model of geographic isolation, mixing, and extensive recombination, a scenario consistent with the geological history of Iron Mountain. We believe that the application of population genetics to metagenomic data, still a nascent field, has the potential for many more such insights into the processes governing microbial population structure. In addition to variation at the nucleotide level, our data present a snapshot of ongoing genetic exchange in a natural population, including recombination and the rapid uptake and loss of plasmid and phage-derived genes. Most gene-content variants are present at low frequency, consistent with a very large number of coexisting genotypes. This suggests a dynamic process involving the continual generation and loss of gene-content variants. Very few variant regions have reached high frequency in the population, but two examples documented here involve regions of potential functional importance (LuxIR, involved in quorum sensing, and CRISPR, involved in viral resistance). Despite the high number of variants, the genome structure of the majority of population members is sufficiently similar to allow for assembly of composite genome paths. The functional significance of this extensive gene-content variation remains unclear. To provide an answer to this question, it will be important to examine the expression levels of gene variants across space and time, as well as examine signatures of selection in genes present in regions with high gene-content variation (such as the LuxIR region discussed above). Due to the effect of strain variation on downstream postgenomic approaches such as proteomics [64], a thorough documentation of the population-level diversity in community genomic datasets is of great importance. Current technologies and methods have progressed to strain-level resolution, allowing the discrimination of closely related protein variants [18]. Databases that include population-level strain variant sequences will allow us to track the presence and activity of these variants over time and as a function of geochemical conditions. These types of studies will allow us to determine the functional importance of variation within natural populations. Materials and Methods Biofilm samples. Genomic sequence was obtained from DNA extracted from a pink biofilm growing on a pH 0.83, 42 °C acid mine drainage stream and sampled from the 5-way location within the Richmond Mine, California, in March 2002. Library construction, sequencing, and the draft assembly methods used to analyze the first 76 Mb of data were reported previously [21]. We obtained an additional 54 Mb of sequence from a second 3-kb library constructed from the same sample and reassembled the data for the analyses reported here. Population genomic assembly and functional annotation. The 130 Mb of DNA sequence (GenBank accession number AADL00000000) was assembled using phredPhrap (P. Green, http://www.phrap.org) with parameters chosen to maximize assembly extent (e.g., to allow coassembly of sequences with some SNPs) while minimizing assembly errors. Parameters used were “-minmatch 50 -minscore 50 -penalty −15 -revise_greedy.” Typically, all reads deriving from the Leptospirillum group II population were incorporated at a single genomic locus, except in the presence of strain-specific gene insertions or deletions. However, as noted previously, the dataset contains a very small number of reads with very high sequence identity (~100%) to another Leptospirillum group II type (reconstructed from a sample from the UBA location in the Richmond Mine [18]). Reads from this organism (distantly related strain or closely related species) share approximately 94% average nucleotide-level sequence identity, and were brought into assemblies occasionally. All contigs were manually curated using Consed [65] to eliminate assembly errors and to resolve regions of strain heterogeneity. The set of contigs assigned to Leptospirillum group II, initially based on GC content and depth, was augmented with new contigs that could be confidently linked into the assembly through mate pairs anchored into unique sequence. In a subset of cases, alternative paths resulting from insertion or deletion of genes or sequence divergence result in fragmentation of assemblies. However, strain-specific fragments can be linked into the main genome path via mate pairs from one or both ends of the strain-specific contig. The outcome of reassembly, new data, and manual analysis is much larger genome fragments than achieved via the JAZZ assembly [21] and some differences in gene order. The assembly order is shown in Table S1. ORF positions were determined and refined based on previously reported gene calls [21] as well as the manually curated annotation of the closely related Leptospirillum group II organism assembled from the UBA community genomic data [18]. Functional annotation relied upon on the manually curated annotation of Leptospirillum group II from the UBA genomic dataset (D. S. Aliaga Goltsman, V. J. Denef, S. W. Singer, N. C. VerBerkmoes, M. Lefsrud, unpublished data). Contigs generated in the assembly are a composite sequence, and thus do not show SNPs present at a subset of loci. SNP identification and strain construction. For analysis of sequence variation, the final Phrap contigs and aligned reads were imported from Consed into the program Strainer [66]. Substrains consisting of linked polymorphisms were defined for all large Phrap contigs. A substrain was defined heuristically to consist of two or more linked polymorphisms with Phred scores greater than 20. These were considered to be separate subpopulations from the dominant strain due to the improbability of multiple linked polymorphisms arising simultaneously due to mutation alone. For strain grouping analyses, the composite sequence was corrected to reflect the dominant polymorphism pattern so long as these changes retained SNP grouping defined by reads and their mate pairs. Sequences were grouped based on the presence of SNPs in more than one read (considering only high-quality base calls, with Phred scores > 20). Additionally, custom Perl scripts were developed to identify and classify every polymorphic site in the assembly as synonymous, nonsynonymous, intergenic, or indel. SNPs were classified as replicated or nonreplicated. These counts are summarized in Table S2 and were used to generate Figure 1 Population genetic analysis. For each substrain consisting of at least four reads, we counted the synonymous and nonsynonymous (replacement) substitutions both within the substrain and between the substrain and the dominant composite sequence (Figure 5 Additional custom Perl scripts were used to calculate synonymous and nonsynonymous polymorphisms within each strain and substrain. Substitutions occurring immediately adjacent to masked bases were not counted due to the high probability of alignment or sequencing errors in these regions. The consensus sequence determined by Phrap was used to determine synonymous and nonsynonymous fixed differences between each substrain and the main strain. The number of synonymous and nonsynonymous sites in each strain was calculated using the program codeml in the PAML package [67]. The numbers of polymorphisms and fixed substitutions were used to test the hypothesis that substrains are maintained in the population through selection for adaptively significant variants. Two polymorphism-divergence tests were used: the MK test [36] and the MKPRF test [40]. For the purpose of these tests, each substrain is considered to be a distinct population (note that this differs from the terminology used elsewhere in this paper). The rationale behind the MK test is that in the absence of selection, the number of synonymous and nonsynonymous fixed differences between two populations, and the number of synonymous and nonsynonymous polymorphisms within each population, depend only on the mutation rate and time since divergence. The ratio of nonsynonymous to synonymous polymorphisms within populations should therefore be the same as the ratio of nonsynonymous to synonymous fixed differences between populations. Positive selection for nonsynonymous mutations will produce an excess of nonsynonymous fixed substitutions relative to nonsynonymous polymorphisms. The MKPRF test refines this idea, using a population genetic model to explain the distribution of within-population variation via maximum likelihood estimation of mutation rate, divergence time, and a selection parameter. Polymorphism-divergence data were combined for all genes within a strain for the MK and MKPRF tests, whereas individual genes were also analyzed using the MKPRF test. The MKPRF tests were carried out using an online implementation (Computational Biology Service Unit, Cornell University). In addition to the MK test, we also calculated several summary statistics for each strain and substrain as well as the entire assembly. Because of variable coverage depth across the assembly, it was necessary to weight these parameters by the fraction of total sites at a given depth [68]. The parameter θ = 2Neu, an estimator of the product of population size and mutation rate for a neutrally evolving population, was calculated with Watterson′s infinite sites estimator [69] and a finite sites modification [38]. Estimates for each coverage class were combined to yield a single value. The program “piim” [37], which corrects for sequencing error in the estimation of population genetic parameters from metagenomic data, was also used to calculate θ for the entire assembly. Pairwise heterozygosity (π) was calculated as described for a variable coverage assembly of Drosophila simulans [68] using custom scripts. Under a neutral model of evolution for a Wright-Fisher population, π and θ are expected to be equal. Tajima′s D measures the discrepancy between these statistics: DT = (π – θ)/C, where C is a normalizing constant calculated from the data [39]. The statistical significance of D depends on the sample size, but roughly, for sample sizes greater than 7, D is considered significant at the 95% level if it is smaller than −2 or larger than 2 [39]. DT for strains, substrains, and interstrain regions was calculated with BioPerl population genetics modules [70]. Figure S1: Frequencies of Strains Observed in the 5-Way CG Leptospirillum Group II Population Frequencies were calculated from the average allele frequency for each polymorphism occurring in a strain relative to the dominant population. Strains are sorted in ascending order of allele frequency. Error bars show ± 1 standard deviation. Each point represents one strain. (158 KB DOC) Click here for additional data file.(17K, pdf) Figure S2: Diagram Showing Low Strain Variation in Two Contigs in Different Regions of the Genome Images were derived from the Strainer program [66]. White bars represent sequencing reads, linked by thin white lines to their mate-paired reads. Dark shading on reads indicates the presence of SNPs. Colored boxes indicate reads that have been grouped together based on sequence similarity using SNPs that are replicated in more than one sequencing read. Numbers indicate the number of SNPs used to distinguish different variants (grouped reads surrounded by colored boxes) for cases in which variant types are defined by more than one SNP. (A) Diagram showing strain variation over contig 11067 (54 kb). Variant types (colored boxes surrounding reads) were defined based on SNPs replicated in more than one read. This region contains 15 replicated substitutions (mustard-colored variant: ten, green: two, blue: two, and olive: one). (B) Contig 10961: an area distant from the origin of replication. Only three strain groups were defined, and each contains only one replicated SNP. (153 KB DOC) Click here for additional data file.(124K, pdf) Table S1: Leptospirillum Group II 5-Way CG Genome Reconstruction The reconstruction includes the gene order and functional annotation, and shows relationship of the genome fragments to the previously reported draft assembly. (37 KB PDF) Click here for additional data file.(311K, pdf) Table S2: Classification of All SNPs Observed in the 5-Way CG Leptospirillum Group II Assembly Increasing Phred quality score cutoffs were used to examine the effect of sequencing error on the distribution of SNPs between site classes. (11 KB PDF) Click here for additional data file.(12K, pdf) Table S3 Click here for additional data file.(56K, pdf) Table S4: The Results of MK Tests for Phred Quality Score Cutoffs MK tests using quality score cutoffs of 20 and higher give nearly identical results, as indicated here by the similarity in ratios of segregating nonsynonymous to silent sites and fixed nonsynonymous to fixed silent sites for cutoff scores higher than 20. The use of replicated SNPs only or sites at higher coverage levels only also does not affect the outcome. An asterisk (*) indicates the expected ratio of segregating nonsynonymous to segregating sites if two-thirds of masked sites are nonsynonymous and one-third are silent. The effect of increasing the minimum required coverage depth for a quality score cutoff of 25 is shown in the bottom three rows. (36 KB DOC) Click here for additional data file.(8.5K, pdf) Text S1: Supplementary Results (1) The effect of sequencing quality and read depth on SNP identification and analysis and (2) Frameshift variation in Leptospirillum group II. (37 KB DOC) Click here for additional data file.(18K, pdf) Acknowledgments We thank E. Allen and G. Tyson for comments, B. Thomas for assistance with bioinformatics, J. Plotkin, P. Johnson, and M. Desai for discussions about population genetic analyses, and Mr. T. W. Arman, President, Iron Mountain Mines, Mr. R. Carver, and Dr. R. Sugarek for site access and on-site assistance. Part of this work was carried out using the resources of the Computational Biology Service Unit from Cornell University, which is partially funded by Microsoft Corporation. DNA sequencing was carried out at the Department of Energy Joint Genome Institute. P. Richardson is thanked for assistance. Contig sequences, trace files, and annotations are deposited under NCBI accession number AADL00000000. Abbreviations
Footnotes Author contributions. SLS and JFB conceived and designed the experiments. SLS, GD, VJD, and JFB performed the experiments. SLS, GD, VJD, DSAG, and JFB analyzed the data. MPT contributed reagents/materials/analysis tools. SLS, VJD, and JFB wrote the paper. Funding. Funding was provided by the Department of Energy Genomics:GTL project grant # DE-FG02-05ER64134 (Office of Science), National Science Foundation Biocomplexity Program grant # DEB-0221768, the National Aeronautics and Space Administration Astrobiology Institute, and the Laboratory Directed Research and Development (LDRD) Program, Lawrence Livermore National Laboratory. Competing interests. The authors have declared that no competing interests exist. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
||||||||||||||||||||||||||||||||||
Appl Environ Microbiol. 2007 Jun; 73(11):3695-704.
[Appl Environ Microbiol. 2007]Biol Direct. 2007 Nov 7; 2():27.
[Biol Direct. 2007]Philos Trans R Soc Lond B Biol Sci. 2006 Nov 29; 361(1475):1985-96.
[Philos Trans R Soc Lond B Biol Sci. 2006]Environ Microbiol. 2008 Jan; 10(1):251-64.
[Environ Microbiol. 2008]PLoS Genet. 2007 Dec; 3(12):e231.
[PLoS Genet. 2007]Proc Natl Acad Sci U S A. 2001 Aug 14; 98(17):9853-8.
[Proc Natl Acad Sci U S A. 2001]Proc Natl Acad Sci U S A. 2003 Jul 8; 100(14):8484-9.
[Proc Natl Acad Sci U S A. 2003]Proc Natl Acad Sci U S A. 2005 Sep 27; 102(39):13950-5.
[Proc Natl Acad Sci U S A. 2005]Proc Natl Acad Sci U S A. 2007 Feb 6; 104(6):1883-8.
[Proc Natl Acad Sci U S A. 2007]Nature. 2004 Mar 4; 428(6978):37-43.
[Nature. 2004]Proc Natl Acad Sci U S A. 2007 Feb 6; 104(6):1883-8.
[Proc Natl Acad Sci U S A. 2007]Genetics. 2007 Sep; 177(1):407-16.
[Genetics. 2007]Nature. 2007 Mar 29; 446(7135):537-41.
[Nature. 2007]Nature. 2004 Mar 4; 428(6978):37-43.
[Nature. 2004]PLoS Biol. 2007 Mar; 5(3):e77.
[PLoS Biol. 2007]PLoS Genet. 2007 Dec; 3(12):e231.
[PLoS Genet. 2007]Science. 2005 Feb 25; 307(5713):1311-3.
[Science. 2005]Appl Environ Microbiol. 2008 May; 74(9):2915-20.
[Appl Environ Microbiol. 2008]Environ Microbiol. 2004 Jul; 6(7):686-98.
[Environ Microbiol. 2004]Nature. 2004 Jul 29; 430(6999):551-4.
[Nature. 2004]Nature. 2004 Mar 4; 428(6978):37-43.
[Nature. 2004]Nature. 2007 Mar 29; 446(7135):537-41.
[Nature. 2007]Science. 2006 Jan 27; 311(5760):496-503.
[Science. 2006]Science. 2007 Mar 23; 315(5819):1709-12.
[Science. 2007]Nature. 1991 Jun 20; 351(6328):652-4.
[Nature. 1991]Genome Res. 2006 Oct; 16(10):1320-7.
[Genome Res. 2006]Genome Res. 2007 Aug; 17(8):1219-27.
[Genome Res. 2007]Genetics. 1989 Nov; 123(3):585-95.
[Genetics. 1989]Nature. 2002 Apr 4; 416(6880):531-4.
[Nature. 2002]Science. 2006 Mar 24; 311(5768):1768-70.
[Science. 2006]Science. 2005 Feb 25; 307(5713):1311-3.
[Science. 2005]PLoS Biol. 2007 Mar; 5(3):e77.
[PLoS Biol. 2007]Nature. 2007 Mar 29; 446(7135):537-41.
[Nature. 2007]Nature. 2004 Mar 4; 428(6978):37-43.
[Nature. 2004]Genetics. 2007 Sep; 177(1):407-16.
[Genetics. 2007]Genetics. 2006 Feb; 172(2):759-69.
[Genetics. 2006]Environ Microbiol. 2007 Oct; 9(10):2430-40.
[Environ Microbiol. 2007]Infect Genet Evol. 2006 Mar; 6(2):97-112.
[Infect Genet Evol. 2006]Nat Rev Genet. 2003 Jan; 4(1):50-60.
[Nat Rev Genet. 2003]Genome Res. 2006 Oct; 16(10):1320-7.
[Genome Res. 2006]Curr Biol. 2007 May 15; 17(10):R373-86.
[Curr Biol. 2007]Philos Trans R Soc Lond B Biol Sci. 2006 Nov 29; 361(1475):1985-96.
[Philos Trans R Soc Lond B Biol Sci. 2006]Nature. 1991 Jun 20; 351(6328):652-4.
[Nature. 1991]Mol Biol Evol. 2008 Jun; 25(6):1007-15.
[Mol Biol Evol. 2008]Proc Natl Acad Sci U S A. 2007 Feb 6; 104(6):1883-8.
[Proc Natl Acad Sci U S A. 2007]Proc Natl Acad Sci U S A. 2007 Feb 6; 104(6):1883-8.
[Proc Natl Acad Sci U S A. 2007]Proc Natl Acad Sci U S A. 1991 Aug 15; 88(16):7160-4.
[Proc Natl Acad Sci U S A. 1991]Appl Environ Microbiol. 2002 Feb; 68(2):838-45.
[Appl Environ Microbiol. 2002]Genetics. 2007 Sep; 177(1):407-16.
[Genetics. 2007]Annu Rev Microbiol. 2005; 59():43-67.
[Annu Rev Microbiol. 2005]Nat Genet. 2000 Oct; 26(2):195-7.
[Nat Genet. 2000]Genome Biol. 2000; 1(6):RESEARCH0011.
[Genome Biol. 2000]J Mol Evol. 2006 May; 62(5):615-29.
[J Mol Evol. 2006]Genetics. 2007 Sep; 177(1):407-16.
[Genetics. 2007]Genetics. 2007 Sep; 177(1):407-16.
[Genetics. 2007]Genetics. 1993 Mar; 133(3):455-68.
[Genetics. 1993]Genetics. 1997 Jul; 146(3):745-50.
[Genetics. 1997]Genome Biol. 2006; 7(5):R44.
[Genome Biol. 2006]Genetics. 1999 Oct; 153(2):539-54.
[Genetics. 1999]Genetics. 1993 Mar; 133(3):455-68.
[Genetics. 1993]Proc Natl Acad Sci U S A. 2007 Feb 6; 104(6):1883-8.
[Proc Natl Acad Sci U S A. 2007]PLoS Genet. 2007 Dec; 3(12):e231.
[PLoS Genet. 2007]Proc Natl Acad Sci U S A. 2001 Aug 14; 98(17):9853-8.
[Proc Natl Acad Sci U S A. 2001]Proc Natl Acad Sci U S A. 2003 Jul 8; 100(14):8484-9.
[Proc Natl Acad Sci U S A. 2003]Proc Natl Acad Sci U S A. 2007 Feb 6; 104(6):1883-8.
[Proc Natl Acad Sci U S A. 2007]Nature. 2004 Mar 4; 428(6978):37-43.
[Nature. 2004]Science. 2005 Feb 25; 307(5713):1311-3.
[Science. 2005]Annu Rev Microbiol. 2001; 55():165-99.
[Annu Rev Microbiol. 2001]Science. 2007 Mar 23; 315(5819):1709-12.
[Science. 2007]Nat Rev Microbiol. 2008 Mar; 6(3):181-6.
[Nat Rev Microbiol. 2008]Environ Microbiol. 2008 Jan; 10(1):200-7.
[Environ Microbiol. 2008]Proc Natl Acad Sci U S A. 2007 Feb 6; 104(6):1883-8.
[Proc Natl Acad Sci U S A. 2007]Genome Res. 2006 Oct; 16(10):1320-7.
[Genome Res. 2006]Nature. 1991 Jun 20; 351(6328):652-4.
[Nature. 1991]J Proteome Res. 2007 Aug; 6(8):3152-61.
[J Proteome Res. 2007]Nature. 2007 Mar 29; 446(7135):537-41.
[Nature. 2007]Nature. 2004 Mar 4; 428(6978):37-43.
[Nature. 2004]Nature. 2007 Mar 29; 446(7135):537-41.
[Nature. 2007]Genome Res. 1998 Mar; 8(3):195-202.
[Genome Res. 1998]Nature. 2004 Mar 4; 428(6978):37-43.
[Nature. 2004]Nature. 2004 Mar 4; 428(6978):37-43.
[Nature. 2004]Nature. 2007 Mar 29; 446(7135):537-41.
[Nature. 2007]BMC Bioinformatics. 2007 Oct 17; 8():398.
[BMC Bioinformatics. 2007]Genome Res. 2006 Oct; 16(10):1320-7.
[Genome Res. 2006]Mol Biol Evol. 2007 Aug; 24(8):1586-91.
[Mol Biol Evol. 2007]Nature. 1991 Jun 20; 351(6328):652-4.
[Nature. 1991]Nature. 2002 Apr 4; 416(6880):531-4.
[Nature. 2002]PLoS Biol. 2007 Nov 6; 5(11):e310.
[PLoS Biol. 2007]Genome Res. 2007 Aug; 17(8):1219-27.
[Genome Res. 2007]Genome Res. 2006 Oct; 16(10):1320-7.
[Genome Res. 2006]Genetics. 1989 Nov; 123(3):585-95.
[Genetics. 1989]Mol Biol Evol. 2005 Jan; 22(1):63-73.
[Mol Biol Evol. 2005]BMC Bioinformatics. 2007 Oct 17; 8():398.
[BMC Bioinformatics. 2007]