Copyright © 2007 Wilhelm et al; licensee BioMed Central Ltd.

Natural variation in SAR11 marine bacterioplankton genomes inferred from metagenomic data

1 Department of Microbiology, Oregon State University, Corvallis, OR, 97331, USA
2 Center for Genome Research and Bioinformatics, Oregon State University, Corvallis, OR, 97331, USA

Corresponding author.
Larry J Wilhelm: wilhelml/at/science.oregonstate.edu; H James Tripp: tripph/at/onid.orst.edu; Scott A Givan: givans/at/cgrb.oregonstate.edu; Daniel P Smith: dansmith/at/orst.edu; Stephen J Giovannoni: steve.giovannoni/at/oregonstate.edu

Received October 2, 2007; Accepted November 7, 2007.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background
One objective of metagenomics is to reconstruct information about specific uncultured organisms from fragmentary environmental DNA sequences. We used the genome of an isolate of the marine alphaproteobacterium SAR11 ('Candidatus Pelagibacter ubique'; strain HTCC1062), obtained from the cold, productive Oregon coast, as a query sequence to study variation in SAR11 metagenome sequence data from the Sargasso Sea, a warm, oligotrophic ocean gyre.

Results
The average amino acid identity of SAR11 genes encoded by the metagenomic data to the query genome was only 71%, indicating significant evolutionary divergence between the coastal isolates and Sargasso Sea populations. However, an analysis of gene neighbors indicated that SAR11 genes in the Sargasso Sea metagenomic data match the gene order of the HTCC1062 genome in 96% of cases (> 85,000 observations), and that rearrangements are most frequent at predicted operon boundaries.
We found no conserved examples of genes with known functions that were present in the coastal isolates but absent from the Sargasso Sea metagenomic data, or vice versa, suggesting that core regions of these diverse SAR11 genomes are relatively conserved in gene content. However, four hypervariable regions were observed, which may encode properties associated with variation in SAR11 ecotypes. The largest of these, HVR2, is a 48 kb region flanked by the sole 5S and 23S genes in the HTCC1062 genome, and mainly encodes genes that determine cell surface properties. A comparison of two closely related 'Candidatus Pelagibacter' genomes (HTCC1062 and HTCC1002) revealed a number of "gene indels" in core regions. Most of these were found to be polymorphic in the metagenomic data and showed evidence of purifying selection, suggesting that the same "polymorphic gene indels" are maintained in physically isolated SAR11 populations.

Conclusion
These findings suggest that natural selection has conserved many core features of SAR11 genomes across broad oceanic scales, but significant variation was found associated with four hypervariable genome regions. The data also led to the hypothesis that some gene insertions and deletions might be polymorphisms, similar to allelic polymorphisms.

Open peer review
Reviewed by Eugene Koonin, Igor B. Jouline (Zhulin) and Peer Bork. For the full reviews, please go to the Reviewers' comments section.

Background
A particularly vexing aspect of microbial genomics is the common observation of high genome variability among strains of a species [1-3]. Such observations have raised significant questions about the validity of the microbial species concept, and the value of single genome sequences for comparisons between taxa [4]. To reconcile this dilemma, it has been suggested that bacterial species have a "core-genome" consisting of genes that are always present, and a "pan-genome" of genes that are variably present [3].
Metagenomics, the study of genome sequence retrieved from mixed assemblages of organisms collected from nature, is providing high coverage of genome sequence variation from natural microbial communities [5], which can be employed to study the conservation of genome features and illustrate patterns of natural variation.

The Sargasso Sea metagenomic data
The Sargasso Sea is an oligotrophic subtropical gyre where average surface temperatures are about 23°C, and rarely drop below 20°C [6]. The Sargasso Sea metagenomic data consists of 1.6 Gbp of unique environmental genomic DNA shotgun sequence. The SAR11 clade accounts for 380 of the 1,412 SSU rRNA genes in the Sargasso Sea data (27%), suggesting that it includes enough SAR11 genome sequence data to encode the equivalent of about 775 SAR11 strain HTCC1062 genomes [7]. Despite the abundance of SAR11 genome sequences in the Sargasso Sea data, the assembly of SAR11 genomes failed when traditional DNA assembly methods were applied [7]. The largest SAR11 rRNA-anchored scaffold reconstructed with the Celera Assembler was relatively small (ca. 21,000 bp), and assembly depth-of-coverage was low (0.94 – 2.2 fold) [7].

Genome streamlining
The genome streamlining theory was invoked to explain the small genomes of 'Candidatus Pelagibacter' and Prochlorococcus [8,9]. The essence of the genome streamlining theory is that selection is most efficient in microbial populations that have large effective population sizes, and therefore the elimination of unnecessary DNA from genomes will be most pronounced in organisms, like bacterioplankton, that meet this criterion. In particular, "genome streamlining" usually refers to the elimination of functionless DNA from genomes, because of the cost of replication. However, in principle the same concept should apply to other features of genome evolution.
If the genome streamlining theory is correct, then, in large bacterioplankton populations, selection should be unusually efficient at preserving all genome features that have a positive fitness associated with them, and eliminating features that confer a negative fitness.

Ecologically significant variation in SAR11 populations
This observation of high sequence diversity led to speculation that the SAR11 clade might be a diverse assemblage of perhaps hundreds or thousands of species, each with low coverage in the shotgun sequence library [7]. However, ecological data suggest that the SAR11 clade consists of a few ecotypes, which can be differentiated either phylogenetically [10], or by their appearance in the environment at different depths and seasons [11]. A phylogenetic analysis of 16S rRNAs revealed the presence of two SAR11 ecotypes in the Sargasso Sea metagenomic data [12]. Rusch et al. [13] reported evidence of nine genetically divergent SAR11 populations in metagenomic data from different ocean surface sites. The interpretation most consistent with these data is that the large population sizes of SAR11, and the age of these clades, allow them to accumulate very extensive neutral sequence variation that renders assembly difficult, but that there are relatively few, perhaps fewer than a dozen, ecotypes that are important to the ecology of the oceans [2].

Throughout this paper we refer to genes found in the metagenomic data as "SAR11 genes" if they were found on fragments of DNA sequence that harbored at least one gene identified as a 'Candidatus Pelagibacter' gene, according to the tests described below. We use the term 'Pelagibacter' to refer to genes from the two sequenced genomes of 'Candidatus Pelagibacter ubique', because these isolates are genetically distinct from Sargasso Sea SAR11 populations.
Pelagibacter strains HTCC1062 and HTCC1002 were isolated from the cold, nutrient-rich Oregon coast [14,15], have an average 16S rRNA sequence similarity to SAR11 16S rRNA genes from the Sargasso Sea metagenomic data of 96%, and belong to a 23S-ITS-16S phylogenetic cluster that is distinct from Sargasso Sea populations [16]. In addition to ocean currents and geography, the large temperature difference between the Sargasso Sea and the Oregon coast is most likely a significant barrier that isolates their respective SAR11 populations. A SAR11 strain obtained very recently from the Sargasso Sea, HTCC7211, shows a higher temperature optimum than the Oregon coast isolates, as predicted from the significant difference in temperature between these environments (Stingl, unpublished data).

Autecology approaches to the study of metagenomic data
Autecology is the study of the ecology of species. A variety of approaches have been employed to infer information about microbial genome variation for specific taxa from metagenomic data, with the intent of understanding the roles of organisms in ocean surface ecology. Hallam et al. [17] used a composite genome sequence of Cenarchaeum symbiosum to study the diversity of Marine Group I archaeal genomes in metagenomic databases, using BLAST score ratios [18] to identify conserved genes, and a similar approach was employed by Coleman to study genome variation in Prochlorococcus [19]. Recently, Rusch et al. [13] published a study that used marine microbial genomes as templates to identify environmental fragments from ocean metagenomic data. They used BLASTN to identify homologs to SAR11 genes, but did not employ further tests to exclude homologs originating from non-SAR11 organisms.

Objectives
The objective of this study was to identify genomic features of SAR11 that are conserved between two Pelagibacter isolates and their counterparts in the Sargasso Sea.
The predicted proteins of the HTCC1062 genome were used as TBLASTN queries to identify homologs to SAR11 proteins in the Sargasso Sea metagenomic DNA, and only genes that retrieved Pelagibacter homologs as their top scores when the NCBInr database was searched with BLASTP were classified as SAR11 genes. We quantified gene-to-gene boundaries to assess gene insertions, deletions and rearrangements, and the occurrence of non-orthologous genes adjacent to Pelagibacter orthologs. We measured synteny and displayed the relationship between synteny and amino acid identity in a novel way that allows observation of small-scale gene insertions (<5 genes). Collectively, these tools enabled us to ask questions about the presence of genes in conserved gene order, gene rearrangements, and the juxtaposition of Pelagibacter homologs with genes that are not found in Pelagibacter, which might reveal genes unique to the Sargasso Sea population.

To our surprise, with the exception of the hypervariable regions of the genome, we found that the genomes of the divergent population we studied were remarkably similar to the query genome in gene content and gene order. The results suggest that extraordinarily high allelic variation and genome rearrangements mask the conservation of many genome properties in native SAR11 populations.

Results
The genome sequence of any individual microbial cell is a sub-sample of natural variation. Metagenomics is now providing extraordinary datasets that deeply sample natural variation, raising the question, how should this variation be represented? We have taken a comparative approach to learn more about SAR11 genomes, but we have also searched for evolutionary models that will help us understand and graphically represent natural genome variation. Our comparative approach stems from a larger goal of understanding the nature of species and ecotypes in SAR11 populations.
A full description of a species or ecotype might include: 1) a list of core genes that confer a relatively conserved phenotype, 2) a list of ancillary genes that may be found, along with the probabilities that each gene is present, and 3) information about patterns in gene composition and gene order that reflect evolutionary processes, such as the divergence of species or ecotypes. Phylogenetic relationships between the query genome and SAR11 genomes represented in the metagenomic data are shown in Fig. 1.
Strategy
The bioinformatic strategy we employed is outlined in Fig. 2.
An expect score of 1 × 10⁻¹⁰ was employed in the initial homolog detection step (Fig. 3A).
Synteny
We define "syntenic fragments" as homologous fragments that passed the best-hit and synteny tests (Fig. 2B,C).
To visualize variation among the syntenic fragments, the data were plotted as a function of gene position in the HTCC1062 genome and amino acid identity (Fig. 3D). The Sargasso Sea metagenome data includes assemblies that are known to contain some errors resulting from unrelated fragments being joined incorrectly [20]. However, an analysis of syntenic fragments that did not include assemblies (Additional file 6) yielded similar plots that supported the same conclusions. Syntenic fragments cover most of the HTCC1062 genome (Fig. 3D).
Genome rearrangements were not random, but were concentrated at boundaries between operons. We compared the evidence for gene rearrangements to the distribution of predicted operon boundaries [21]. The average number of rearrangements detected per gene boundary was 2.58, but the average at boundaries between operons was 3.21, and the average within operons was 1.94. An analysis of variance indicated that these differences are highly significant (see methods). Perhaps not surprisingly, this finding suggests that selection allows rearrangements between operons more frequently than rearrangements within operons.

The number of Sargasso Sea fragments in SAR11 syntenic fragment plots declines sharply at sequence similarities above 90%, indicating that the coastal isolates are genetically distinct from their counterparts in the Sargasso Sea. Sequence identity between the syntenic genes and the HTCC1062 genome ranged from high (98%) to low (30%) and averaged 71%. SSU rRNA variation supported a similar conclusion: the average sequence identity of SAR11 16S rRNA fragments from the Sargasso Sea metagenomic data to HTCC1062 is 96.0% ± 2.8% (sd), with a median of 96.9% (n = 379). For comparison, the amino acid identity between E. coli and Salmonella for GroEL is 98% and among Burkholderia species is 75%, whereas the lowest SAR11 GroEL syntenic fragments have an amino acid identity of 77% (avg = 87%). For RecA the numbers are 96% similarity for E. coli and Salmonella, 92% within the genus Burkholderia, and 63% (avg = 81%) for the lowest syntenic fragments. Hallam et al. observed a similar average amino acid identity (65%) among 4,000 Sargasso Sea fragments related to C. symbiosum, and conservation of gene order as well [17]. Coleman et al. [19] observed a protein sequence identity of 80% among 1,574 genes between two strains of the abundant marine cyanobacterium Prochlorococcus that are 99.2% similar at the 16S rRNA locus.
Syntenic fragment analysis predicts a very similar average sequence identity in the Prochlorococcus metagenome (79%, Additional file 4). The high sequence divergence observed in the metagenomic data cannot be explained by assuming weak selection. The ratio of non-synonymous to synonymous substitution rates for a selection of 19 genes from the syntenic fragment data ranges from 0.04 to 0.23; values well below 1 indicate purifying selection (Table 4). The implication of this observation is that the divergence of amino acid sequences in the Sargasso Sea SAR11 populations is occurring in proteins that serve important functions.
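As an illustration of how the substitution-rate ratios above are read, the following minimal sketch classifies a gene from its non-synonymous (dN) and synonymous (dS) rates. It assumes the reported values are the conventional dN/dS ratio; the function name `selection_regime` and the `tolerance` band around 1 are hypothetical choices made for this example and are not part of the original analysis.

```python
def selection_regime(dn, ds, tolerance=0.05):
    """Return (omega, label) for non-synonymous (dn) and synonymous (ds) rates.

    omega = dn / ds. Values well below 1 are conventionally read as purifying
    selection, values well above 1 as positive selection. The tolerance band
    around 1 is an illustrative assumption, not a published threshold.
    """
    if ds <= 0:
        raise ValueError("ds must be positive to form the ratio")
    omega = dn / ds
    if omega < 1 - tolerance:
        label = "purifying"
    elif omega > 1 + tolerance:
        label = "positive"
    else:
        label = "near-neutral"
    return omega, label
```

Under this reading, both ends of the reported range (0.04 and 0.23) fall in the purifying category.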
Scrutinized in detail, the syntenic fragment plots reveal many conserved differences between the HTCC strain genomes and the SAR11 metagenome. For example, the syntenic fragment plots illustrate that the Sargasso Sea SAR11 genomes include proteorhodopsin genes (Fig. 5A,B).
Nowhere in this analysis did we see clear evidence of the Sargasso Sea SAR11 ecotypes reported previously from this dataset [12], and suggested by Fig. 1.

Hypervariable regions
Although the distributions of homologous fragments and syntenic fragments suggest a relatively conserved SAR11 "core" genome, they also revealed four distinct hypervariable regions, HVR1 – 4 (Fig. 3A,C).
The largest of the SAR11 hypervariable regions, HVR2, is a 48 kb region flanked by the sole 5S and 23S rRNA genes in the HTCC1062 genome. HVR2 mainly contains proposed lipopolysaccharide (LPS) biosynthesis genes (Table 5), and appears to be analogous to previously observed regions encoding cell surface properties. Based on current annotations, all but one of the enzymes involved in the biosynthetic pathways for the inner and outer core of LPS are present in HVR2, while the enzymes involved in synthesis of the unexposed regions of the LPS are found elsewhere in the genome (Table 5).
The existence of a hypervariable region similar to HVR2 in the Sargasso Sea SAR11 population was inferred from the metagenomic data. Only two fragments containing SAR11 16S rRNA genes are found among the 349,742 homologous fragments, but SAR11 5S rRNA genes were found on 36 fragments. Nineteen of these fragments carried homologs to HTCC1062 CDS 570 upstream of the 5S rRNA gene, as found in the HTCC1062 genome, suggesting that the Sargasso Sea SAR11 cells, like the Pelagibacter isolates, have split ribosomal RNA operons. The remaining SAR11 HVRs appear to be related to transport and secretion (HVR1 and HVR4) or unknowns (HVR3) (Tables 6, 7, 8, 9, and 10). The transport and secretion functionality associated with HVRs 1 and 4 is consistent with the assertion of Coleman [19] that these islands may play a role in niche adaptation by supporting differential nutrient acquisition capabilities.
Comparison of the genomes of HTCC1062 and HTCC1002, which were isolated from the same seawater sample [14], provided further support for the conclusion that some HVRs are hotspots for the acquisition of foreign DNA by horizontal gene transfer (HGT). The genome of HTCC1002 is 12,298 nucleotides larger than the genome of HTCC1062. Most of the length difference is due to 31 genes inserted in HVR3 of HTCC1002. The 16S rRNAs of these strains differ by one nucleotide, and in protein coding regions they are 97.4% similar in nucleotide sequence. Although HGT is clearly a source of variation in the HVRs, gene duplications were also observed in the comparison of HTCC1002 and HTCC1062, suggesting that other mechanisms of genome evolution might be contributing to the high variability observed in these genome regions (Fig. 7).
Gene indels in core regions of the HTCC1002 and HTCC1062 genomes: could they be polymorphic sites?
Given the high conservation of core regions of the genome suggested by the plots of syntenic fragments, we were at first surprised to find, in comparisons of the HTCC1062 and HTCC1002 genomes, 62 gene indels (gene insertions in one genome relative to the other) in core regions. Various data support the model that cells normally harbor a transient pool of neutral genes that are continually culled by a deletion bias in DNA replication [27-29]. The prediction is that random gene insertions will be "neutral" in the sense that they are under no (or low) evolutionary pressure. Syntenic fragment plots provide a test of this model, since the prediction is that there will be no coverage in the column above genes that are randomly inserted into the query genome. Of 62 indels between the two genomes, 44 are deletions in HTCC1002 relative to HTCC1062, and are thus testable with the HTCC1062 syntenic fragment plot (Fig. 8).
Previous studies of E. coli in culture reported a balanced polymorphism that arose spontaneously and was maintained by selection [30]. The resulting haplotypes diverged phylogenetically and met Cohan's definition for ecotypes [31]. Thus, there is precedent that anticipates balanced polymorphisms in natural microbial populations. We speculate that the gene indels we observed could be balanced polymorphisms, and that this phenomenon might explain part of the natural variation in gene content found in microbial populations.

Gene complement
We were surprised to find no indication of conserved differences in gene complement that would suggest significant physiological differences between the coastal strains and Sargasso Sea SAR11 populations. This is not to say that there is no physiological variation within and between the Sargasso Sea and coastal Oregon SAR11 populations. Rather, we observed no genes thought to control significant phenotypic attributes in HTCC1002 and HTCC1062 that were not also well-represented among the syntenic fragments, and no conserved occurrences of functionally important genes in the SAR11 metagenomic data that did not have orthologs in the HTCC1062 genome. We should have seen such genes if they were highly conserved, because our analysis included an average of 118 fragments that covered each end of each gene and provided evidence for the identity of the adjacent gene. Only 19 genes from the HTCC1062 genome are not represented in the SAR11 metagenome (syntenic fragments or homologous fragments that passed the best BLAST hit test). Of these, nine are from the hypervariable regions (Table 11), and six returned no hits to NCBI databases with expect scores less than 1 × 10⁻¹⁰. Two members of this group (CDSs 542 and 555) are suspected to be involved in outer membrane biosynthesis and CDS 1217 is a bacterial-like globin (Table 11). With only one exception, we found closer homologs to these genes in the Sargasso Sea dataset than in NCBI databases.
Conclusion
The reconstruction of microbial genomes from metagenomic data is a challenge for microbial ecologists, particularly when the genomes originate from large plankton populations that exhibit inherently high natural variability. Venter attributed the failure to assemble SAR11 genomes from the Sargasso Sea metagenomic data, despite apparently high coverage, to high species diversity. An alternative explanation, however, is that some genomes are inherently diverse in neutral characteristics because of large effective population sizes. In such cases, it seems likely that a linear genome, the objective of assembly, is an overly simplistic model for representing natural genome variation. Therefore, we sought a strategy that would reveal conserved and variable elements of genome structure. This strategy was predicated on the use of a related query genome, and designed to be useful for predicting genome properties for studies of ecology and evolution.

The amino acid sequence divergence between the query genome we used and the Sargasso Sea SAR11 populations exceeds the divergence between some microbial genera, suggesting that genomic properties have had ample time to diverge in response to selection. We were surprised to find that there was heavy coverage of most of the genome by syntenic fragments, suggesting a relatively conserved gene order in core regions of the genome. Previous reports have shown that the conservation of gene order between prokaryotic genomes dissipates faster than protein sequence identity or gene complement [32,33]. Synteny is regarded as a rapidly evolving property of genomes, second only to DNA with regulatory functions [34]. Huynen and coworkers compared orthologs from an evolutionarily diverse set of 9 genomes to show that gene order becomes nearly random before protein identity decays below 50% [34].
The seemingly incongruous observation of the preservation of synteny in the presence of 30% divergence in average amino acid sequence can be explained by noting that Huynen and coworkers studied species that are highly divergent in functional properties, whereas the SAR11 population appears relatively uniform, despite the accumulation of considerable neutral sequence variation. This explanation implies that selection is acting to preserve gene order in SAR11 populations. We speculate that this may be another example of streamlining selection, in this case acting to preserve gene order [9]. Rusch et al. [13] reported that synteny was conserved in all abundant marine prokaryotes, but their method of observing synteny was qualitative and therefore they did not offer a quantitative estimate that could be used for comparative purposes. We provide a quantitative estimate of synteny and also show that genome rearrangements are not random, but are concentrated at particular sites, which are often operon boundaries. This observation has implications for the "extreme assembly" employed by Rusch, which has an unproven ability to detect and report the frequency of genome rearrangements. The high coverage provided by this dataset enabled us to examine core regions of the genome for conserved gene rearrangements, including insertions. Although many conserved gene rearrangements were observed, none involved genes acquired by HGT. The approaches we describe are robust for some conclusions; for example, the identification of genomic regions that are missing or highly diverged from the query sequence, and regions, such as the proteorhodopsin gene locus, where genes and gene order vary in conserved patterns. There are also caveats. For example, it is likely that only the termini of large cassettes of inserted genes in the target genomes can be observed, and then, only where they abut regions of conserved gene order. 
Thus, the absence of any observations suggesting conserved insertions of novel genes in the SAR11 metagenomic data suggests that the Sargasso Sea SAR11 variants are very similar to their coastal counterparts, in core regions of the genome, but does not rule out the involvement of HVRs in evolution by gene acquisition. The concept of the "core" and "pan" genomes is gathering support as genome sequencing reveals more examples of conserved core genome regions and hypervariable regions, or islands of genome variability, as they are sometimes known [3,19]. While the mechanisms that generate variability in these regions remain poorly understood, it is increasingly apparent that they often encode niche-specific proteins that are important to competitive success. Our observations are consistent with the interpretation that natural selection has concentrated genes that encode cell surface properties into HVR2, and that this region is subject to unusually rapid rates of sequence divergence and re-arrangements of gene order. It has been postulated that similar variability in the C. jejuni LPS cassette is an adaptive response to selection pressure to evade host acquired immune responses [35]. Viral predation on microbial cells is intense in the ocean water column and is likely to provide a keen source of selective pressure that favors microbial populations with diverse, rapidly evolving surface properties. An analogous variable genome region containing genes for cell surface components (LPS cassettes) has been observed in Prochlorococcus sp. [26,36], and is evident in our syntenic fragment plot for this organism (arrow in Additional file 4B). We propose that the structural RNA genes flanking the LPS cassettes provide zones of conserved DNA sequence that promote horizontal exchange of the cassettes by homologous recombination. Multi-locus sequence typing has shown that rates of intraspecific recombination are high within the coastal SAR11 population [37]. 
Alternatively, this variable genome region could be explained by horizontal gene transfer from another species, a hypothesis that is consistent with the observation that the AT content of HVR2 is anomalously high (79%). To explain the apparent conservation of core genome regions, and reconcile it with the apparent variability of SAR11 genomes in nature, we hypothesize that the large population sizes of SAR11, and the age of these clades that is inferred from phylogenies, have allowed them to accumulate very extensive variation in genome properties that are selectively neutral [2]. However, it is clear that our study offers only a glimpse of the evolution of the SAR11 clade. In particular, the study of genome organization, as inferred from plots of syntenic fragments and other methods, revealed less about the evolution of SAR11 ecotypes than did phylogenetic analyses. In part, we attribute this to a relatively high level of conserved organization in SAR11 genomes, and the role of hypervariable regions, which by definition are not amenable to studies based on the identification of conserved genome features. An intriguing finding is the correspondence between "gene indels" in core regions of the two Pelagibacter genomes, and patterns in the synteny plots from another ocean. This finding needs further investigation. In part, it is consistent with the established idea that genes are randomly inserted into genomes at a low frequency, and lost if not preserved by positive selection [27-29]. But, our findings suggest that some gene indels might have another explanation. They could be polymorphisms that are maintained in populations by processes similar to those that maintain allelic polymorphisms [30]. The most appealing approach to resolving this interesting question would be the sequencing of many genomes from closely related isolates. 
Our findings indicate that SAR11 genomes from different oceanic provinces share many conserved features despite dynamic processes of genome change that are at work in nature. The Sargasso Sea SAR11 populations are conserved in local gene order, and gene complement, with respect to populations that live in richer, colder coastal water, but diverge dramatically in amino acid sequence similarity. A broad implication is that large microbial populations such as bacterioplankton accumulate high diversity in some genome properties, while remaining constrained in others [2]. Protein evolution provides an analogy. Protein families can encompass wide variation in amino acid sequences while retaining the key elements of three-dimensional structure that confer function [38]. Similarly, in old clades that comprise large populations, microbial genomes may wander over sequence space, giving an illusion of variability, while remaining highly constrained in features that govern cellular structure and function. An important challenge going forward will be in establishing quantitative measures that reveal these properties of genome conservation from metagenomic data.

Materials and methods

Phylogenetic analysis

Homolog search
Fragments carrying genes with high similarity to HTCC1062 genes are identified at the protein level with TBLASTN [41], using the amino acid sequence as input, a 1 × 10⁻¹⁰ expect score cutoff, and complexity filtering off. The results are limited to the first 3,000 hits. Command line:

    blastall -i sar11_proteins.fa -d venter_nt -p tblastn -e '1e-10' -F F -v 3000 -b 3000

where sar11_proteins.fa is a FASTA file of HTCC1062 proteins and venter_nt is the Sargasso Sea fragment data in BLAST format. Default values were used for all unnamed parameters; the blastall version was 2.2.12. For convenience, the set of fragments identified in this fashion is hereafter referred to as "homologous fragments".
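The selection of homologous fragments can be pictured with a short sketch that applies the stated expect-score cutoff and the 3,000-hit limit to hits that have already been parsed. The input format, a sequence of (query gene, fragment id, expect score) tuples, and the function name `homologous_fragments` are assumptions made for illustration; they do not describe the output format of blastall or the scripts used in this study.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-10      # expect-score cutoff stated in the text
MAX_HITS_PER_QUERY = 3000   # mirrors the -v 3000 -b 3000 limits

def homologous_fragments(hits):
    """Group fragment ids by query gene, keeping hits at or below the cutoff.

    hits: iterable of (query_gene, fragment_id, e_value) tuples. This input
    format is hypothetical; parsing of the BLAST report is not shown.
    """
    per_query = defaultdict(list)
    for gene, fragment, e_value in hits:
        if e_value <= E_VALUE_CUTOFF:
            per_query[gene].append((e_value, fragment))
    result = {}
    for gene, scored in per_query.items():
        scored.sort()  # best (lowest) expect score first
        result[gene] = [frag for _, frag in scored[:MAX_HITS_PER_QUERY]]
    return result
```

The per-gene fragment lists produced this way are the kind of input that the later synteny step compares between adjacent genes.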
Syntenic fragment detection
A subset of the homologous fragments that shared synteny with the HTCC1062 genome was identified by finding fragments that were common to the lists from adjacent HTCC1062 genes, and verifying that the genes are arranged in tandem on the fragment. Each gene on these syntenic fragments was subjected to the reciprocal best-hit test (Fig. 2).

Syntenic fragment plots with different query genomes
To assess the selectivity of syntenic fragments we studied a set of organisms of varying relevance to ocean surface ecology. Of these, Prochlorococcus marinus MED4 provides an example from a clade that is relatively abundant in the Sargasso Sea but forms a shallow cluster by 16S rRNA gene sequence analysis [42]. Escherichia coli was chosen as an organism that is unlikely to appear often in the Sargasso Sea. Five additional cultured marine strains were also used as query genomes. They are listed in Table 2. We found that the syntenic fragment plots are characteristic for each group, and the numbers of syntenic fragments in the plots correlate with the abundance of each organism's 16S rRNA genes in the metagenomic data (Additional file 4).

For each selected organism, the relative abundance of genomic DNA within the Sargasso Sea metagenome is estimated by the number of 16S rRNA fragments satisfying identity thresholds of 97%, 93%, and 90%. These values are shown as insets in Additional file 4. The number of syntenic fragments recovered for each organism correlates with the number of similar 16S rRNA genes. The most abundant organism, SAR11, produced by far the most syntenic fragments (71,696) while P. marinus returned an intermediate number (8,398). Other organisms (e.g. C. atlanticus in Table 2 and Additional file 4) that are virtually undetectable by 16S rRNA analysis returned a number of syntenic fragments (1,258) similar to our negative control E. coli (406). Additional file 3 shows the distribution of syntenic fragments recovered for E. coli.
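The syntenic fragment detection step described in this section (fragments shared between the hit lists of adjacent query genes, with the two genes arranged in tandem on the fragment) might be sketched as follows. The data structures and the function name `syntenic_fragments` are hypothetical. The sketch also assumes that either orientation of the gene pair counts as tandem and that the query genome is treated as linear, neither of which is stated in the text, and it omits the reciprocal best-hit test.

```python
def syntenic_fragments(gene_order, hits_by_gene, genes_on_fragment):
    """Return ids of fragments that carry two adjacent query genes in tandem.

    gene_order: query genes listed in genome order (treated as linear here).
    hits_by_gene: dict mapping gene -> set of fragment ids from the homolog search.
    genes_on_fragment: dict mapping fragment id -> list of query genes matched
        along the fragment, in fragment order.
    """
    syntenic = set()
    for left, right in zip(gene_order, gene_order[1:]):
        # Fragments common to the hit lists of two neighbouring query genes.
        shared = hits_by_gene.get(left, set()) & hits_by_gene.get(right, set())
        for frag in shared:
            order = genes_on_fragment.get(frag, [])
            # Tandem check: the two genes must be immediate neighbours on the
            # fragment; either orientation is accepted (an assumption).
            if any({a, b} == {left, right} for a, b in zip(order, order[1:])):
                syntenic.add(frag)
    return syntenic
```

A fragment that matches both genes but has another gene between them is not counted, which is what separates syntenic fragments from the rearranged fragments analyzed separately.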
If the syntenic fragment detection process described here accurately recovers environmental fragments arising from a given template organism, then the list of syntenic fragments should be unique for each organism. Thus, to estimate the selectivity of our method we compared the syntenic fragment lists of all organisms tested. The percentage of syntenic fragments unique to each organism (Table 2) is 98.8% for HTCC1062 and 99.0% for P. marinus. Less-abundant organisms have as few as 30% unique syntenic fragments, but not in all cases. In terms of fragment counts, the low-abundance organisms are not distinguishable from each other or from E. coli, though other marine microbes show a narrower, more distinct syntenic fragment pattern than E. coli, similar to Additional file 4C (data not shown).

Genome rearrangements and operon prediction

To find fragments containing HTCC1062 genes in altered gene orders, the genes on the homologous fragments that did not show synteny to HTCC1062 were subjected to the best-hit BLAST analysis described above for the syntenic fragments. The number of occurrences of a given best-hitting HTCC1062 gene adjacent to a non-syntenic best-hitting HTCC1062 gene was determined, scoring the number of times any unique pair of genes occurred together. To determine whether genome rearrangements are concentrated at operon boundaries, a statistical analysis was performed comparing the number of gene pairs found that violate operon boundaries (disallowed pairs; n = 1301) to the number found that preserve operon boundaries (allowed pairs; n = 2131). A maximum likelihood estimator was used to calculate the probability that this observation could be due to chance, assuming the rearrangements at gene boundaries follow a normal distribution. The list of predicted operons for P. ubique HTCC1062 was obtained via the Entrez Genome Project [21].
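The tally of allowed and disallowed gene pairs described above can be sketched as follows. The text does not spell out the exact rule used to decide whether a pair violates an operon boundary, so the rule here is an assumption: a non-syntenic pair is counted as disallowed when either gene lies in the interior of a predicted operon (neither its first nor its last gene), because giving such a gene a new neighbour must cut that operon. The function name and input structures are hypothetical, and the statistical test itself is not reproduced.

```python
def classify_pairs(pair_counts, operons):
    """Tally non-syntenic gene pairs as (allowed, disallowed).

    pair_counts -- hypothetical dict: (gene_a, gene_b) -> number of times the
                   two query genes were seen adjacent although they are not
                   neighbours in the reference genome
    operons     -- list of predicted operons, each a list of genes in order

    Assumed rule: a pair is disallowed if either gene is interior to an
    operon, since a new neighbour for an interior gene splits that operon.
    """
    interior = set()
    for operon in operons:
        # Genes that are neither first nor last in their operon.
        interior.update(operon[1:-1])

    allowed = disallowed = 0
    for (gene_a, gene_b), count in pair_counts.items():
        if gene_a in interior or gene_b in interior:
            disallowed += count
        else:
            allowed += count
    return allowed, disallowed
```

The two totals returned are the kind of counts that feed the comparison reported in the text (2131 allowed versus 1301 disallowed pairs); a stricter, orientation-aware rule would change which edge genes count as allowed.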
Hypervariable regions

Site-specific recombination mediated by integrases has been shown to cause rapid change in some islands of genomic variability [24]. However, SAR11, as with most examples thus analyzed [43], failed to display clear signatures of the integron model: attC sites were not found associated with any of the HVRs. HVR2 includes an integrase gene, and HVR4 is flanked by tRNA genes, which have been shown to serve as site-specific recombination sites for temperate phages and transmissible plasmids [44]. Comparison of the genomes of strains HTCC1062 and HTCC1002 provides evidence that the 23S and 5S rRNA genes flanking HVR2 are sites of homologous recombination that allow novel variations of the LPS region to spread rapidly within populations. The HVR2 regions in these two genomes are 99.96% similar in nucleotide sequence, compared to 97.4% similarity for the genomes overall. In addition to a few point mutations, the two HVR2 sequences differ by a deletion of 13 nucleotides that removes one from a set of four tandem repeats within a hypothetical gene. In contrast, the HVR1 regions of HTCC1062 and HTCC1002 reveal the loss and gain of divergent, tandemly duplicated genes. One gene is deleted from a set of four tandem, divergent gene duplications of Type V secretion proteins in strain HTCC1062 (Fig. 7).

Tests for selection

Sequences were analyzed for synonymous and non-synonymous substitution rates using the software program SWAAP 1.0.2 [46], set to the Li method (1993) with a window size of 90 and step size of 18. The values reported in Table 2 were created from alignments that include those portions of the HTCC1062 gene and syntenic fragment sequence defined by the HSP start and end positions taken from the TBLASTN results (see homolog detection step). Translated sequence was used to guide the alignments when necessary.
Nucleotide divergence values were calculated with the software program DnaSP 4.0 using the 'synonymous non-synonymous substitution' option under the analysis menu with default parameters [47].

Searching for genes conserved in Sargasso Sea SAR11 populations and not found in the genomes of the coastal isolates

Non-homologous genes found alongside genes that had best hits to HTCC1062 genes were regarded as candidates for genes specific to the Sargasso Sea SAR11 populations. Open reading frames on the environmental fragments were taken from the conserved domain feature tags of the NCBI GenBank record. The amino acid sequence of each CDS on every homologous fragment long enough to include non-HTCC1062 homologs was used as a BLAST query sequence against the NCBI non-redundant proteins database (BLASTP, expect score cutoff 1 × 10⁻⁶) and the NCBI Conserved Domain Database (CDD) [48]. We examined the data for the frequency of any specific non-HTCC1062 gene occurring next to a given HTCC1062 gene, using the gene descriptions from the NCBI database as well as the protein family identifier from the CDD as search strings for the identification of common genes.

Calculating a synteny index

We define a synteny index to be the fraction of best-hitting HTCC1062 homologs found adjacent to a best-hitting and syntenic HTCC1062 homolog. The amino acid sequence of each CDS on all homologous fragments carrying at least two genes (238,663 of 349,742 total, Table 1) was queried against the NCBI non-redundant proteins database. For every best-hitting gene on the fragment, if at least one best-hitting neighbor was present and syntenic, it was counted as a syntenic observance. To calculate the synteny index, the total of all syntenic observances is divided by the total observances of best-hitting genes, syntenic and non-syntenic.
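The synteny index defined above can be written out as a short sketch. The input structures and function names are hypothetical. Two readings of the definition are assumptions here: a best-hitting gene counts as an observance only when it has at least one best-hitting neighbor on the fragment, and two genes are treated as syntenic when they are immediate neighbors on the circular reference genome.

```python
def _adjacent(gene_a, gene_b, position, n_genes):
    """True if the two genes are immediate neighbours on a circular genome."""
    gap = (position[gene_a] - position[gene_b]) % n_genes
    return gap in (1, n_genes - 1)


def synteny_index(fragments, position):
    """Fraction of best-hitting genes that sit next to a syntenic best hit.

    fragments -- hypothetical iterable of lists; each list gives, per CDS
                 along one fragment, the best-hit query gene or None
    position  -- hypothetical dict: query gene -> index in genome order
    """
    n_genes = len(position)
    syntenic = total = 0
    for layout in fragments:
        for i, gene in enumerate(layout):
            if gene is None:
                continue
            # Best-hitting genes immediately to the left and right.
            neighbors = [layout[j] for j in (i - 1, i + 1)
                         if 0 <= j < len(layout) and layout[j] is not None]
            if not neighbors:
                continue  # assumed uninformative: no best-hitting neighbor
            total += 1
            if any(_adjacent(gene, nb, position, n_genes) for nb in neighbors):
                syntenic += 1
    return syntenic / total if total else float("nan")
```

For example, with four reference genes in order g1 to g4, the fragments `[["g1", "g2"], ["g1", "g3"]]` give two syntenic and two non-syntenic observances, so the index is 0.5.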
Genes conserved in the coastal isolates but not found in the Sargasso Sea SAR11 populations

HTCC1062 genes that were not represented among the set of homologous fragments identified in the Sargasso Sea dataset by TBLASTN were considered as candidates for genes specific to the coastal ecotypes (Pelagibacter). The translation products of these CDS were used as queries in a TBLASTN search of the Sargasso Sea dataset and the NCBInr database, using an expect score cutoff of 10. The highest scores found for these 19 genes are listed in Table 11. Three CDS (414, 788 and 875) failed to hit the NCBInr database with an expect score of less than 1 × 10⁻⁴ and are classified as hypothetical proteins in the HTCC1062 genome.

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

LJW carried out the programming and drafted the manuscript. HJT contributed the comparisons of the HTCC1002 and HTCC1062 genomes. SAG operates the annotation pipeline. DPS carried out the analysis of genome rearrangements. SJG conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.

Reviewers' comments

Reviewer's report 1

Eugene Koonin, National Center for Biotechnology Information, Bethesda, MD

This is quite an exciting study that makes an extensive use of metagenomic data to assess the natural variability of Pelagibacter (SAR11), apparently, the most abundant marine bacterium. Pelagibacter seems to break the trend that, previously, seemed consistent in all compared prokaryotic genomes, namely, that gene order evolves faster than sequence (at least, coding sequence). It is reasonably proposed that the high effective population size of Pelagibacter leads to rapid accumulation of neutral substitutions. It is further proposed that purifying selection acts to maintain the gene order in Pelagibacter and that this is somehow related to genome streamlining.
The connection seems pretty mysterious but, at face value, it is hard to deny that, at the time scale when many substitutions accumulate, genome rearrangements do not. An explicit population-genetic analysis of this discrepancy would be interesting but one needs a good model for genome rearrangements, of course. There seems to be a striking analogy between the hypervariable regions in Pelagibacter discovered here and pathogenicity islands in bacterial pathogens, e.g., E. coli. I think it is worth more emphasis. It will be very revealing to see whether these findings smoothly extend to other marine bacteria such as Prochlorococcus, and whether the population-genetic explanation holds compared to the available abundance data. I actually suspect that the data is there for such a comparison to be done right now but, certainly, that can be done in a new paper; the present one stands on its own.

Reviewer's report 2

Igor B. Jouline (Zhulin), Associate Professor and Senior R&D Staff Member, Joint Institute for Computational Sciences, The University of Tennessee – Oak Ridge National Laboratory

This study aims at evaluating metagenomics data on the most abundant marine proteobacteria, SAR11, by comparing it with available whole genome data of the SAR11 strains. The overarching goal is to understand the nature of species and ecotypes in SAR11 populations. A more specific aim was to assess trends in the SAR11 genome evolution by determining the core genome and variable regions. One of the main results of this study is a quantitative estimate of synteny and a demonstration that genome rearrangements in SAR11 are not random, but are concentrated at particular sites, which are often operon boundaries. On a broader scale, this work presents an approach that is general enough to be applied to other metagenomic data sets in order to study evolutionary trends in natural microbial populations. I don't have any major concerns regarding this paper.
It reads well and I found it quite interesting for a non-specialist in microbial ecology and metagenomics. I would definitely recommend it to anyone who is seriously thinking about deriving information from metagenomics data sets.

Reviewer's report 3

Peer Bork, senior scientist (bioinformatics)/group leader at the EMBL (Heidelberg); joint coordinator of the EMBL Structural and Computational Biology program; visiting group leader at the MDC (Berlin-Buch)

I think this is a great case study that wisely uses metagenomics data to understand more about the natural variation and evolution of Pelagibacter/SAR11, probably the most abundant marine bacterium. A number of trends could be observed; some are similar to studies in other phyla, some are different, so the paper also adds in general terms to our understanding of bacterial populations and the impact of habitats on their evolution. Interestingly, it is proposed that gene order seems, surprisingly, to evolve more slowly than sequence, which is explained partially by the large population size. I assume that for the more detailed quantification of this observation more data will be needed, but the approach demonstrates the power of integrating metagenomics data in genome analysis. The quantification of the non-random accumulation of mutations along the genome is another interesting aspect worth noting, as is a surprisingly high number of core genes (or relatively few indels) given the many individuals from which sequence fragments were available. Taken together, the paper implies a substantial amount of work in an important research area, it is based on a new integration approach, and it addresses a number of important points that are all worth publishing.

Additional file 1

Distribution of expect scores for Pelagibacter syntenic fragments.
The values in the plot are taken from the TBLASTN search in which predicted proteins from the HTCC1062 genome were the query genes and the Sargasso Sea metagenomic database was searched (Fig. 2A).

Additional file 2

GC content of Pelagibacter syntenic fragments. Histogram of the GC content of metagenomic fragments from the syntenic fragments bin, compared to the mean for the HTCC1062 genome.

Additional file 3

Sargasso Sea syntenic fragment plot for Escherichia coli. A syntenic fragment plot using Escherichia coli as the query genome. The data provide a rough measure of the false-positive rate in the syntenic fragment plots.

Additional file 4

Syntenic fragment plots of three representative organisms. The bar chart in the upper right corner indicates the number of fragments containing the query organism's 16S rRNA, at the indicated degree of similarity. The horizontal line indicates the average syntenic fragment score.

Additional file 5

Syntenic fragments carrying at least three genes. A large portion of the syntenic fragments are not of sufficient length to carry more than three genes. These data show that the general trends of genome coverage and range of amino-acid level identity shown in Fig. 3D

Additional file 6

Analysis of assemblies versus unassembled reads. Pelagibacter syntenic fragment analysis performed on unassembled reads (A), and on sequence data containing assemblies as well as unassembled reads (B). The plots are essentially similar.

Additional file 7

Enlargement of HVR3 and HVR4. HTCC1062 syntenic fragment plot showing detail in the region of HVR3 and HVR4.

Additional file 8

Enlargement of HVR1. HTCC1062 syntenic fragment plot showing detail in the region of HVR1.
Acknowledgements

This work was supported by research grant MCB-0237713 from the NSF Microbial Observatories Program, and a Marine Microbiology Initiative investigator award from the Gordon and Betty Moore Foundation. The authors thank Saul Kravits, Steve Ferriera, Justin Johnson, Robert Friedman, Yu-Hui Rogers, J. Craig Venter and staff for genome sequencing of HTCC1002, and Kevin Vergin, Dee Denver, Robert Burton, and Ulrich Stingl for critical advice.

Accession numbers of strains used in this study. HTCC1062: NC_007205; HTCC1002: AAPV00000000.

Tool release. The source code for the identification and display of syntenic fragments is available at the Giovannoni laboratory web site [49].

References