![]() | ![]() |
Formats:
|
||||||||||||||||||
Copyright © 2008 The Author(s) Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers 1Department of Chemistry and Biochemistry, UCB 215, University of Colorado at Boulder, Boulder, CO 80309-0215 and 2Center for Environmental Biotechnology, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Mail Stop 70A-3317, Berkeley, CA 94720, USA *To whom correspondence should be addressed. Tel: Phone: +1 303 492 1984; Fax: +1 303 492 7744; Email: rob.knight/at/colorado.edu Received June 16, 2008; Revised July 15, 2008; Accepted July 16, 2008. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract The recent introduction of massively parallel pyrosequencers allows rapid, inexpensive analysis of microbial community composition using 16S ribosomal RNA (rRNA) sequences. However, a major challenge is to design a workflow so that taxonomic information can be accurately and rapidly assigned to each read, so that the composition of each community can be linked back to likely ecological roles played by members of each species, genus, family or phylum. Here, we use three large 16S rRNA datasets to test whether taxonomic information based on the full-length sequences can be recaptured by short reads that simulate the pyrosequencer outputs. We find that different taxonomic assignment methods vary radically in their ability to recapture the taxonomic information in full-length 16S rRNA sequences: most methods are sensitive to the region of the 16S rRNA gene that is targeted for sequencing, but many combinations of methods and rRNA regions produce consistent and accurate results. To process large datasets of partial 16S rRNA sequences obtained from surveys of various microbial communities, including those from human body habitats, we recommend the use of Greengenes or RDP classifier with fragments of at least 250 bases, starting from one of the primers R357, R534, R798, F343 or F517. INTRODUCTION Pyrosequencing (1) has the potential to revolutionize our ability to understand the microbial world. Because the vast majority of microbes have not been cultured, and perhaps cannot be cultured using existing techniques (2), culture-independent techniques that analyze small subunit ribosomal RNA sequences (16S rRNA in the case of bacteria and archaea), amplified by PCR directly from environmental samples, are critical for understanding the composition and dynamics of complex microbial communities (3). Pyrosequencing greatly increases the rate at which we can characterize microbial communities for several reasons. Because pyrosequencing is a single-molecule technique, the heterogeneous 16S rDNA PCR products can be characterized directly without cloning, thus saving an immense amount of labor. The cost is also much less than with traditional methods such as capillary (Sanger) sequencing. Finally, linking a unique sequence barcode to a 16S rDNA-directed PCR primers allows dozens to hundreds of uniquely tagged samples (4), e.g. 16S rRNA samples from different microbial communities (5,6), to be analyzed in a single run (multiplex barcoded pyrosequencing). One limitation of pyrosequencing is the short read lengths: ~100 bases for the original GS 20 instrument, ~250 bases for the current GS FLX platform and an anticipated ~400 bases for the next-generation GS XLR instrument. We previously demonstrated that these short read lengths could be used for phylogenetic-based comparisons of communities based on UniFrac (a metric based on the extent of branch length that any two communities share on a phylogenetic tree constructed from all reads from all communities that are being analyzed) (7–10). However, these analyses rely solely on phylogenies, which are necessarily approximate on datasets of this scale, and taxonomic information is lost. Taxonomic information is critical for relating findings about each community to what is known about the lifestyles of each kind of microbe at varying levels of resolution. For example, at the species level, Bacillus anthracis is a pathogen, whereas B. cereus is not; at the phylum level, cyanobacteria tend to be photosynthetic primary producers whereas bacteroidetes are heterotrophs, and an increased firmicute:bacteroidetes ratio is associated with obesity (11–13). Consequently, testing whether the short read lengths that are produced by pyrosequencing could be used for taxonomic assignment is an important task. Several different methods for taxonomic assignment have been proposed. The vast datasets produced by pyrosequencing prevent the application of standard phylogenetic techniques such as likelihood- or parsimony-based tree reconstruction because there are too many sequences and too many possible trees to search, and thus fast but approximate techniques are required. Similarity searching (e.g. BLAST) can be used to find the closest matches to each sequence. The speed of this technique can be improved by examining reduced representations of the sequence (such as the abundance of each subsequence of a defined length), albeit with some cost in accuracy. Tree-based methods, in which a tree of reference sequences is constructed, and internal nodes are assigned to taxa based on known tip taxon assignments, might be expected to give more accurate assignments (14,15). In this study, we compare several methods: BLAST (16), the online Greengenes (17) and RDP (18) classifiers, and two tree-based methods. For each of the tree-based methods, we used several algorithms to assign the sequences using the tree, thereby allowing us to compare a broad range of taxonomic assignment methods against one another. The results have implications for designing, executing and interpreting large-scale surveys, such as those that will emanate from the recently launched International Human Microbiome Project (19). MATERIALS AND METHODS Datasets The reference taxonomy and sequences were taken from Bergey's manual release 7.8 from RDP. These sequences were used for ‘leave-one-out’ evaluation of each method, in which each sequence is excluded from the dataset and classification is performed using the remaining sequences. The three bacterial 16S rRNA datasets (only near full-length sequences) used for the analysis of unclassified sequences from community samples were the same as those used in ref. (7) and included: (i) sequences from the distal gut (ceca) of 19 C57Bl/6J lean and obese (ob/ob) mice (total of 3732 sequences, 3453 unique, with an average length of 1161 bases) (11); (ii) 16S rRNA sequences from microbial communities occupying five locations along the length of the colon, plus stool, from three healthy, unrelated adult humans (total of 11 864 sequences, 7761 unique, with an average length of 1349 bases) (20) and (iii) sequences obtained from communities positioned at 10 different depths (0–60 mm) in the hypersaline Guerrero Negro microbial mat (11 738 sequences, 11 164 unique, with an average length of 1233 bases; Harris,J.K., Walker,J.J. and Pace,N.R., unpublished data; see ref. (21) for details about this mat located in Baja, Mexico). Clipping Unique near full-length 16S rRNA sequences were aligned using the NAST web tools (22) and the Greengenes Core Set (17) (12 January 2008 release). Parameters used were: minimum identity, 75%; minimum length, 50% of the length of the sequence (either full length or clipped). Clipped sequences were generated by extending forward from each primer (relative to the orientation of the primer) 100, 250 or 400 bases on the original, ungapped sequences. Primers used in this study were the same as those described in ref. (7). Sequences that had unknown bases at the primer starting point or were not long enough to clip to the desired size were excluded from the analysis. Sequences that were unique as full-length sequences but identical when clipped were consolidated into a single record, associated with the number of times they corresponded to full-length sequences. Calculating lineage recovery and coverage Each taxonomy assignment method produces lineage assignments at the levels of domain, phylum, class, order, family and genus, both for the original sequences and for the clipped sequences. Because the true taxonomy is unknown for most environmental samples, taxonomy assignments for the clipped sequences were compared to those obtained using the same method for the original near full-length sequences in order to assess accuracy. The procedure thus tests reproducibility of each method on the clipped sequences. If five different full-length sequences all produced the same clipped sequence, and that sequence was incorrectly assigned at the family level, we would count this as five incorrectly assigned sequences rather than one. For each taxonomy rank, coverage was defined as the number of taxa assigned to both the original and clipped sequences, divided by the number of taxa assigned to the original sequences (in some cases, taxa at a given level were assigned to clipped but not full-length versions of the same sequence). Recovery was defined as the fraction of the covered clipped sequences assigned to the same taxon as the corresponding full-length sequences. Assigning lineages to sequences We used five different methods to assign taxonomy to each sequence (Figure 1
All five methods required a tree representing the taxonomy in order to map an original lineage assignment to the levels of Domain, Phylum, Class, Order, Family and Genus. To build this taxonomic tree, we obtained ranks from Bergey's Manual release 7.8. In the tree, taxa at each level of the taxonomy were considered to branch from the level above as a polytomy; e.g. all the classes within the firmicutes, such as the bacilli and the clostridia, branched from the firmicutes. The taxonomic tree contained only the levels described above; intermediate levels such as suborder were excluded. Method 1—BLAST In this procedure, we used BLAST (NCBI's blastall version 2.2.15) (16) to search each sequence against the set of sequences in version 7.8 of Bergey's manual as downloaded from RDP (note that this list only includes cultured microbes). First, we built a BLAST database from all the original full-length sequences. Then we BLASTed each clipped or full-length sequence against the database (using an E-value of 1e-50 for ‘leave-one-out’ evaluation, in which we excluded one sequence from the dataset and asked how well it could be classified based on the remaining data, and 1e-20 for real data). From the BLAST results, we determined the ‘nearest neighbors’ by highest bit score: ‘more neighbors’ were determined using a maximum permitted difference of 10% from the maximum bit score. The query sequence itself was excluded from the hit list for leave-one-out evaluation. Finally, from the selected hits, we identified the relevant lineages, and chose the ‘common lineage’ by starting at the highest level classification and working down until the classifications disagreed. Alternatively, we chose the ‘major lineage’ where two-thirds of the classifications agreed. Method 2—online RDP classifier In this procedure, we submitted the query sequences to the online RDP classifier at http://rdp.cme.msu.edu/classifier/classifier.jsp (18). We parsed the detailed classification results, using only taxon assignments with ≥50% bootstrap support: allowing less bootstrap support than this resulted in very poor recovery (data not shown). We then converted each assignment to the standard levels in the taxonomic tree. Method 3—online Greengenes classifier based on NAST alignment In this procedure, we submitted a NAST alignment of the query sequences to the online Greengenes classifier at http://greengenes.lbl.gov/Classify (17). At the time we performed the comparison, the last database update was 1 December 2008, and the database contained 188 073 aligned 16S rRNA records of at least 1250 bases. We parsed the taxonomy assignments in the Greengenes output using the RDP taxonomy, and converted each assignment to domain, phylum, class, order, family and genus using the taxonomic tree and the thresholds suggested in ref. (23). Method 4—phylogenetic tree-based method: building tree from the NAST alignment followed by a tree-based algorithm In this procedure, we aligned the query sequences using NAST (75% identity over ≥50% of the sequence length). We then applied the mask LaneMaskPH (24) to the alignment, and used Clearcut 1.0.7's relaxed neighbor joining method (25) to build the tree. Taxonomy was assigned using the nearest ancestral node in the tree that had a defined taxonomy assignment. We also performed a test of this procedure using the ARB parsimony insertion procedure (26) to build the phylogeny, which is a method traditionally used when performing these analyses manually. For this test, we inserted the fragments into the Greengenes core set tree (2006 release), mapped the sequences back to ancestral nodes in the Hugenholtz taxonomy, and converted the names to the RDP taxonomy for comparison. Because ARB cannot be easily automated, we focused on Clearcut for large-scale comparisons. Method 5—multimer clustering tree-based method: building trees from multimer counts followed by a tree-based algorithm Multimer clustering has been widely used for taxonomic assignment of metagenomic sequences: its great attraction is that it does not require a multiple sequence alignment, so can potentially avoid the computational cost of the alignment step (27,28). In this procedure, we first calculated the frequencies of each of the overlapping multimers (in this case, 5-mers) for all sequences. Any multimer containing an ‘N’ or other ambiguous base was excluded from the analysis. We then built a Bray–Curtis distance matrix (where distance is defined as the sum of the k-mer frequency differences divided by the sum of all k-mer frequencies) from the matrix of multimer counts. We then built a relaxed neighbor joining (NJ) tree from the distance matrix using Clearcut, and assigned the taxonomy based on the nearest ancestral node in the tree that had a defined taxonomy assignment. The tree-building algorithm was adapted from that described in ref. (27). Three algorithms for inferring the lineage from the tree topology For each of the tree-aware methods, we evaluated three algorithms for inferring the lineage using the tree. The input to each algorithm consisted of (i) a tree (either a phylogenetic tree built from the alignment using Clearcut or a UPGMA clustering of a distance matrix based on multimer counts) containing both the query sequences and the database sequences, (ii) a taxonomic tree containing all the taxon names as nodes and (iii) a mapping of each database sequence to a node in the taxonomic tree. Algorithm 1—last common ancestor In this method, we first assigned taxonomic information to each tip in the tree that represented a sequence in Bergey's taxonomic database. We then performed a postorder tree traversal (i.e. visiting each node after visiting all the nodes descending from that node), and assigned the taxonomic information for each internal node to the node in the taxonomic tree that was the last common ancestor (LCA) of all taxonomic tips that descend from that internal node. For instance, an internal node that had descendants from the taxonomic database from both the classes Bacilli and Clostridia within the phylum firmicutes would be assigned only to the phylum level (firmicutes), since that would be the LCA of these two sequences in the taxonomic tree. For each query sequence with unassigned taxonomy, we traced back to the nearest ancestor with a valid node assignment in the taxonomic tree (assigned from at least two children with taxonomic nodes assigned). We then assigned the lineage from this taxonomic node. Algorithm 2—Fitch parsimony We performed a postorder traversal of the phylogenetic tree, calculating the states of internal nodes using the Fitch parsimony algorithm (29). To assign states to query sequences, which were stored as leaves in the tree without assigned taxonomies, we traced back to the nearest ancestor that had a valid taxonomy assignment (assigned from at least two children). We then assigned taxonomic categories from phylum to genus, stopping at the level of detail at which the category became ambiguous. Algorithm 3—Fitch parsimony with back-propagation The approach was the same as Fitch above, except that after the postorder Fitch assignment, we performed a preorder traversal to set ambiguous status (if the parent state was unambiguous, and the state of one of the children is the same as the parent, we assigned the unambiguous status to the current node). When ties occurred (i.e. two assignments were equally good), we chose one of the assignments at random. This method allowed more nodes to be assigned unambiguously at the expense of some inaccurate assignments. Analyses were implemented using the PyCogent toolkit (30). RESULTS The overall strategy of our analysis was as follows. First, we performed leave-one-out analysis using full-length sequences where the taxonomy is known to ask how well each method predicts the taxonomy of one sequence from the remaining sequences in the database. Second, we used large, empirical datasets to check the internal consistency of each method in assigning taxonomy to full-length and clipped sequences. Third, we examined the ability of each method to consistently determine the proportion of taxa of each type in a given sample using clipped sequences (i.e. to check whether the errors in assignment cancel out, so that the number of sequences of a given type is correct even when there are errors in which sequence is placed in which class). Finally, we collected information about the relative speed of the methods. In general, different regions of the 16S rRNA gene differed greatly in their ability to recapture the taxonomic information inferred from full-length gene sequences. Moreover, different methods varied greatly in stability and reliability. Because gold-standard taxonomic information is difficult to obtain, we primarily relied on internal consistency of the methods between clipped and full-length sequences. Figure 2
Of the BLAST-based approaches, the ‘common lineage from nearest neighbors’ (blue line) approach always returned a result, but was always less accurate (lower recovery rate) than the other methods. The ‘common lineage from more neighbors’ (black line) approach always had a higher accuracy, but lower coverage (at all levels) than the ‘major lineage from more neighbors’ (green line) approach (see Methods section for descriptions of these algorithms). However, none of the BLAST-based approaches performed as well as the tree-based approaches (i.e. the black line on the right panel). The error rates of these methods are comparable to those previously reported for the RDP classifier, suggesting that the RDP classifier does not perform uniquely well. In the following analyses, single-sequence genera were always included. Figure 3
Recovery and coverage of six methods is shown using three different datasets at the genus level (Figure 4
The phylogenetic tree-based methods (‘Fitch’, ‘FitchAndBack’ and ‘LCA’), which are based on trees built from NAST alignments and represented by the black lines in Figure 4 Thus far, we have discussed the consistency of taxonomic assignments for full-length versus clipped sequences. An equally important question is how consistent the overall proportion of each phylogenetic group is after applying each of the procedures. Figure 5
In contrast, the three tree-based approaches are much more sensitive to the region of sequence examined. Extra caution is necessary when these methods were used for taxonomy assignment. For example, with one of the worst regions, reading forward 250 bases from F1099, the reported frequency of euryarcheota can be up to 39% (considerably more than is reasonable from current estimates of the representation of this group). However, certain regions, such as F343 and R357 with 250-base reads, recapture the results of the more stable methods almost exactly. One common procedure, often manually applied, is to import the alignments into ARB (20), put the sequences into a standard tree using ARB parsimony insertion and then use either the group names in the standard tree (based on the Hugenholtz taxonomy) or the Fitch parsimony status from known tip taxa to infer the taxonomy. Figure 6
Finally, we compared the speed of the different methods: all methods were run locally on the same hardware except RDP, NAST alignment and Greengenes, which were run at the respective websites. Figure 7
DISCUSSION Overall, our analyses show that taxonomic assignment is highly sensitive to the region of the 16S rRNA gene sequenced, to the assignment method used, and in some cases, to the length of the region sequenced. However, the results are encouraging: most regions, no matter how short, provide stable estimates of the abundance of each phylum in the dataset, which is all that is reported in many 16S rRNA sequence-based analyses. Thus, even results collected using earlier generation pyrosequencers (GS20 with average read lengths of 80–100 bases) should be stable to future re-analyses provided that the Greengenes or RDP classifiers are used. Similarly, many regions of the 16S rRNA gene, e.g. 100-base reads forward from positions 343, 517, 784, or backwards from positions 357, 534, 798 or 926 relative to the Escherichia coli sequence, allow excellent coverage and recovery even at the genus level with relatively short reads. Moreover, reads produced by the current GS FLX apparatus (average of 250 bases) will for the most part hold up to re-sequencing using the newer XLR instrument, which produces longer reads. In general, the Greengenes and RDP classifiers produce highly stable and accurate results even on very disparate datasets, so we can consider taxonomy assignment tools mature in that sense. One important cautionary note about the leave-one-out results using Bergey's manual is that different groups of bacteria have been studied at different levels of effort (e.g. human-associated commensals and pathogens, and their close relatives, have been far more extensively studied than other taxa), and only organisms that can be cultured are included. Both of these factors substantially bias the phylogenetic representation, and thus Bergey's provides an extremely biased test set. As Figure 2 Our analysis suggests that the V6 region is not optimal for pyrosequencing analyses that are directed at taxonomic assignment, as opposed to measuring levels of diversity. These results are consistent with our previous analysis of pyrosequencing and community clustering (7), which showed that regions overlapping V6 (R1114, 100 bases) were much less suitable for community clustering than other regions, such as R357. Therefore, we recommend using primers R357 and F8 to generate 250-base reads starting at R357: this region which spans V2 and V3 performs well for both community clustering and taxonomic assignments in a wide range of datasets (e.g. the mouse and human gut, and the Guerrero Negro microbial mat). Automated methods that can be incorporated into high-throughput workflows perform at least as well as manual analyses with ARB, suggesting that these manual analyses are no longer required except perhaps for independent confirmation of suspect taxa. We therefore suggest that the Greengenes and/or RDP classifications be treated as sufficient evidence to support taxonomic classifications: although they are not perfect, they achieve high accuracy under a wide range of conditions. However, we recommend that future development should focus on improving the run-time performance, especially of the Greengenes classifier. Finally, one important consideration with this and all other taxonomic analyses is that the taxonomy assignment for each read is only as good as the underlying taxonomy and the phylogeny on which it is based. New taxa by definition cannot be identified using these types of techniques, although sequences that remain unclassified are often fertile grounds for new lineage discovery. De novo tree-building is still required for identifying new lineages, and thus support for improvement of taxonomies and iterative tree-building approaches are essential for maximizing the utility of rapid classification methods for pyrosequencer datasets—datasets that will undoubtedly grow in number and size given the impending explosion in comparative metagenomic surveys of human body as well as terrestrial and oceanic habitats. ACKNOWLEDGEMENTS We thank Catherine Lozupone, Phil Hugenholtz, Frederic Bushman, Peter Turnbaugh and Jeffrey I. Gordon for critical discussion of the article. We thank J. K. Harris, J. J. Walker and N. R. Pace for sharing unpublished data. This work was supported by NIH grant P01DK078669 and the Crohn's and Colitis Foundation of America. Analyses were performed using the W.M. Keck Bioinformatics Facility at CU Boulder. Additional support was provided by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Genomics: GTL Program, through contract DEAC02-05CH11231 (T.Z.D., G.L.A.). FUNDING Funding to pay the Open Access publication charges for this article was provided by NIH P01DK078669. Conflict of interest statement. None declared. REFERENCES 1. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. [PubMed] 2. Rappe MS, Giovannoni SJ. The uncultured microbial majority. Annu. Rev. Microbiol. 2003;57:369–394. [PubMed] 3. Pace NR. A molecular view of microbial diversity and the biosphere. Science. 1997;276:734–740. [PubMed] 4. Binladen J, Gilbert MT, Bollback JP, Panitz F, Bendixen C, Nielsen R, Willerslev E. The use of coded PCR primers enables high-throughput sequencing of multiple homolog amplification products by 454 parallel sequencing. PLoS ONE. 2007;2:e197. [PubMed] 5. Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R. Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat. Methods. 2008;5:235–237. [PubMed] 6. McKenna P, Hoffmann C, Minkah N, Aye PP, Lackner A, Liu Z, Lozupone CA, Hamady M, Knight R, Bushman FD. The macaque gut microbiome in health, lentiviral infection, and chronic enterocolitis. PLoS Pathog. 2008;4:e20. [PubMed] 7. Liu Z, Lozupone C, Hamady M, Bushman FD, Knight R. Short pyrosequencing reads suffice for accurate microbial community analysis. Nucleic Acids Res. 2007;35:e120. [PubMed] 8. Lozupone C, Hamady M, Knight R. UniFrac–an online tool for comparing microbial community diversity in a phylogenetic context. BMC Bioinformatics. 2006;7:371. [PubMed] 9. Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl. Environ. Microbiol. 2005;71:8228–8235. [PubMed] 10. Lozupone CA, Hamady M, Kelley ST, Knight R. Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Appl. Environ. Microbiol. 2007;73:1576–1585. [PubMed] 11. Ley RE, Backhed F, Turnbaugh P, Lozupone CA, Knight RD, Gordon JI. Obesity alters gut microbial ecology. Proc. Natl Acad. Sci. USA. 2005;102:11070–11075. [PubMed] 12. Ley RE, Turnbaugh PJ, Klein S, Gordon JI. Microbial ecology: human gut microbes associated with obesity. Nature. 2006;444:1022–1023. [PubMed] 13. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444:1027–1031. [PubMed] 14. Barker D, Pagel M. Predicting functional gene links from phylogenetic-statistical analyses of whole genomes. PLoS Comput. Biol. 2005;1:e3. [PubMed] 15. Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Edwards RA, Stoye J. Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res. 2008;36:2230–2239. [PubMed] 16. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. [PubMed] 17. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 2006;72:5069–5072. [PubMed] 18. Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, McGarrell DM, Bandela AM, Cardenas E, Garrity GM, Tiedje JM. The ribosomal database project (RDP-II): introducing myRDP space and quality controlled public data. Nucleic Acids Res. 2007;35:D169–D172. [PubMed] 19. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, Gordon JI. The human microbiome project. Nature. 2007;449:804–810. [PubMed] 20. Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA. Diversity of the human intestinal microbial flora. Science. 2005;308:1635–1638. [PubMed] 21. Ley RE, Harris JK, Wilcox J, Spear JR, Miller SR, Bebout BM, Maresca JA, Bryant DA, Sogin ML, Pace NR. Unexpected diversity and complexity of the Guerrero Negro hypersaline microbial mat. Appl. Environ. Microbiol. 2006;72:3685–3695. [PubMed] 22. DeSantis TZ, Jr, Hugenholtz P, Keller K, Brodie EL, Larsen N, Piceno YM, Phan R, Andersen GL. NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res. 2006;34:W394–W399. [PubMed] 23. DeSantis TZ, Brodie EL, Moberg JP, Zubieta IX, Piceno YM, Andersen GL. High-density universal 16S rRNA microarray analysis reveals broader diversity than typical clone library when sampling the environment. Microb. Ecol. 2007;53:371–383. [PubMed] 24. Hugenholtz P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002;3 REVIEWS0003. 25. Sheneman L, Evans J, Foster JA. Clearcut: a fast implementation of relaxed neighbor joining. Bioinformatics. 2006;22:2823–2824. [PubMed] 26. Ludwig W, Strunk O, Westram R, Richter L, Meier H, Yadhukumar, Buchner A, Lai T, Steppi S, Jobb G, et al. ARB: a software environment for sequence data. Nucleic Acids Res. 2004;32:1363–1371. [PubMed] 27. Hao B, Qi J. Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance. J. Bioinform. Comput. Biol. 2004;2:1–19. [PubMed] 28. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods. 2007;4:63–72. [PubMed] 29. Fitch WM. Distinguishing homologous from analogous proteins. Syst. Zool. 1970;19:83–92. 30. Knight R, Maxwell P, Birmingham A, Carnes J, Caporaso JG, Easton BC, Eaton M, Hamady M, Lindsay H, Liu Z, et al. PyCogent: a toolkit for making sense from sequence. Genome Biol. 2007;8:R171. [PubMed] 31. Huber JA, Mark Welch DB, Morrison HG, Huse SM, Neal PR, Butterfield DA, Sogin ML. Microbial population structures in the deep marine biosphere. Science. 2007;318:97–100. [PubMed] |
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||
Nature. 2005 Sep 15; 437(7057):376-80.
[Nature. 2005]Annu Rev Microbiol. 2003; 57():369-94.
[Annu Rev Microbiol. 2003]Science. 1997 May 2; 276(5313):734-40.
[Science. 1997]PLoS One. 2007 Feb 14; 2(2):e197.
[PLoS One. 2007]Nat Methods. 2008 Mar; 5(3):235-7.
[Nat Methods. 2008]Nucleic Acids Res. 2007; 35(18):e120.
[Nucleic Acids Res. 2007]BMC Bioinformatics. 2006 Aug 7; 7():371.
[BMC Bioinformatics. 2006]Appl Environ Microbiol. 2005 Dec; 71(12):8228-35.
[Appl Environ Microbiol. 2005]Appl Environ Microbiol. 2007 Mar; 73(5):1576-85.
[Appl Environ Microbiol. 2007]Proc Natl Acad Sci U S A. 2005 Aug 2; 102(31):11070-5.
[Proc Natl Acad Sci U S A. 2005]PLoS Comput Biol. 2005 Jun; 1(1):e3.
[PLoS Comput Biol. 2005]Nucleic Acids Res. 2008 Apr; 36(7):2230-9.
[Nucleic Acids Res. 2008]J Mol Biol. 1990 Oct 5; 215(3):403-10.
[J Mol Biol. 1990]Appl Environ Microbiol. 2006 Jul; 72(7):5069-72.
[Appl Environ Microbiol. 2006]Nucleic Acids Res. 2007 Jan; 35(Database issue):D169-72.
[Nucleic Acids Res. 2007]Nucleic Acids Res. 2007; 35(18):e120.
[Nucleic Acids Res. 2007]Proc Natl Acad Sci U S A. 2005 Aug 2; 102(31):11070-5.
[Proc Natl Acad Sci U S A. 2005]Science. 2005 Jun 10; 308(5728):1635-8.
[Science. 2005]Appl Environ Microbiol. 2006 May; 72(5):3685-95.
[Appl Environ Microbiol. 2006]Nucleic Acids Res. 2006 Jul 1; 34(Web Server issue):W394-9.
[Nucleic Acids Res. 2006]Appl Environ Microbiol. 2006 Jul; 72(7):5069-72.
[Appl Environ Microbiol. 2006]Nucleic Acids Res. 2007; 35(18):e120.
[Nucleic Acids Res. 2007]J Mol Biol. 1990 Oct 5; 215(3):403-10.
[J Mol Biol. 1990]Nucleic Acids Res. 2007 Jan; 35(Database issue):D169-72.
[Nucleic Acids Res. 2007]Appl Environ Microbiol. 2006 Jul; 72(7):5069-72.
[Appl Environ Microbiol. 2006]Microb Ecol. 2007 Apr; 53(3):371-83.
[Microb Ecol. 2007]Bioinformatics. 2006 Nov 15; 22(22):2823-4.
[Bioinformatics. 2006]Nucleic Acids Res. 2004; 32(4):1363-71.
[Nucleic Acids Res. 2004]J Bioinform Comput Biol. 2004 Mar; 2(1):1-19.
[J Bioinform Comput Biol. 2004]Nat Methods. 2007 Jan; 4(1):63-72.
[Nat Methods. 2007]Genome Biol. 2007; 8(8):R171.
[Genome Biol. 2007]Science. 2007 Oct 5; 318(5847):97-100.
[Science. 2007]PLoS Comput Biol. 2005 Jun; 1(1):e3.
[PLoS Comput Biol. 2005]Science. 2005 Jun 10; 308(5728):1635-8.
[Science. 2005]Nucleic Acids Res. 2007; 35(18):e120.
[Nucleic Acids Res. 2007]