Proc Natl Acad Sci U S A. Jul 3, 2007; 104(27): 11436–11440.
Published online Jun 25, 2007. doi:  10.1073/pnas.0611525104
PMCID: PMC2040916

Global patterns in bacterial diversity


Microbes are difficult to culture. Consequently, the primary source of information about a fundamental evolutionary topic, life's diversity, is the environmental distribution of gene sequences. We report the most comprehensive analysis of the environmental distribution of bacteria to date, based on 21,752 16S rRNA sequences compiled from 111 studies of diverse physical environments. We clustered the samples based on similarities in the phylogenetic lineages that they contain and found that, surprisingly, the major environmental determinant of microbial community composition is salinity rather than extremes of temperature, pH, or other physical and chemical factors represented in our samples. We find that sediments are more phylogenetically diverse than any other environment type. Surprisingly, soil, which has high species-level diversity, has below-average phylogenetic diversity. This work provides a framework for understanding the impact of environmental factors on bacterial evolution and for the direction of future sequencing efforts to discover new lineages.

Keywords: environmental distribution, microbial ecology, phylogenetic diversity, UniFrac

A global picture of microbial diversity has remained elusive, yet it is critical to understanding microbial adaptation to different environments and their function in those environments. Sequencing of 16S rRNA genes from environmental samples has revolutionized our understanding of microbial systematics and diversity, revealing how far we are from cataloguing the vast diversity of microorganisms on Earth (1–4). Integrating information from these environmental surveys, however, has thus far been a formidable obstacle to a global understanding of microbial ecology. Determining physical and chemical factors, such as temperature, pH, or geography, that correlate with differences between diverse microbial communities will reveal how easily microbes tolerate different kinds of environmental change and will increase our understanding of microbial ecology and evolution. In addition, determining the environment types that contain the most phylogenetic diversity will reveal where new sequencing efforts to catalog global bacterial diversity will be most efficient at uncovering deep-branching lineages. Because of inconsistencies in how diversity is measured in individual studies, e.g., how operational taxonomic units (OTUs) are selected or which region of the rRNA gene is sequenced, it is only by integrating information from these studies into a single phylogenetic context that these important questions can be addressed.


Toward a Global Survey of Natural Environments.

We created an environmentally annotated tree of the bacteria including 21,752 sequences from 202 environmental samples compiled from 111 studies of diverse, globally distributed natural environments. We chose published studies that sequenced the most 16S rRNA clones, surveyed natural environments, and used primers sufficiently general to amplify all bacteria. The samples represent a vast diversity of environments, ranging from “normal” environments such as soil, seawater, and sediments to environments at the extremes of temperature (hot springs, hydrothermal vents, marine ice), salinity (hypersaline basins, lakes and mats), pH (acidic springs and rocks, alkaline lakes), and nutrient availability (oligotrophic caves) [Table 1 and supporting information (SI) Data Set 1]. To normalize sampling effort across studies that used different techniques [e.g., by using restriction fragment length polymorphism (RFLP) patterns to screen for unique clones], we chose OTUs from each sample using a 97% identity threshold (5), including one sequence from each OTU in the analysis (see Materials and Methods).

Table 1.
Summary of the 15 groups into which we binned the 202 samples

Salinity Is the Major Factor Relating Microbial Communities.

We clustered the environmental samples by the phylogenetic lineages that they contain by applying principal coordinates analysis (PCoA) (6, 7) (Figs. 1 and 2) and hierarchical clustering (8, 9) (and see SI Fig. 4) to a matrix of UniFrac distances by using the UniFrac web interface (10). UniFrac measures the distance between two communities as the fraction of branch length in a phylogenetic tree that leads to descendants of members of either community but not both (11). It thus captures the amount of environment-specific evolution in a single phylogenetic tree. Surprisingly, the major division is by salinity (Fig. 1 and SI Fig. 4). Almost all nonsaline environments (Fig. 1, pink circles), even those with extreme temperature and pH such as hot springs and acidic endolithic communities, cluster to the left of the diverse saline environments (Fig. 1, green triangles) along principal coordinate (PC) 1. Samples where saline and nonsaline water mix (blue squares) have intermediate values. The saline environments include marine samples, lakes, and springs: note that determinations of salinity in this study are qualitative and based on the habitat descriptions rather than on direct measurements of salt concentration. Remarkably few samples deviate from this trend, and those that do are illustrative. Two nonsaline samples cluster with the saline group: one is a microbial mat from a chemautolithotrophic cave community involved in mineral deposition, which may be locally saline (12); the other is from an anoxic rice paddy soil (13), where salinization is a common agricultural problem. One saline sample clusters with nonsaline: this is a coastal ocean sample from a study that also sampled the adjacent river and estuary (14), raising the possibility of contamination.
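The UniFrac definition given above can be illustrated with a minimal Python sketch. It is not the code behind the UniFrac web interface; the function name, the toy tree, and the representation of a tree as a list of (branch length, descendant leaves) pairs are all assumptions made for this example.

```python
def unifrac(branches, leaf_env, env_a, env_b):
    """Unweighted UniFrac distance between two communities (illustrative only).

    branches: list of (branch_length, set_of_descendant_leaf_names)
    leaf_env: dict mapping each leaf name to the community it came from
    """
    if env_a == env_b:
        return 0.0
    unique = 0.0  # branch length leading to only one of the two communities
    total = 0.0   # branch length leading to either community
    for length, leaves in branches:
        envs = {leaf_env[leaf] for leaf in leaves} & {env_a, env_b}
        if not envs:
            continue  # branch leads only to other communities; ignore it
        total += length
        if len(envs) == 1:
            unique += length
    return unique / total if total else 0.0


# Hypothetical toy tree ((a:1,b:1):1,(c:1,d:1):1), one entry per branch.
branches = [
    (1.0, {"a"}), (1.0, {"b"}), (1.0, {"c"}), (1.0, {"d"}),
    (1.0, {"a", "b"}), (1.0, {"c", "d"}),
]
separate = {"a": "saline", "b": "saline", "c": "nonsaline", "d": "nonsaline"}
mixed = {"a": "saline", "b": "nonsaline", "c": "saline", "d": "saline"}

print(unifrac(branches, separate, "saline", "nonsaline"))  # no shared branches
print(unifrac(branches, mixed, "saline", "nonsaline"))     # one shared branch
```

When the two communities occupy separate clades, every branch is unique to one of them and the distance is 1; when they interleave, shared internal branches lower the distance.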

Fig. 1.
Results of PCoA colored by salinity. Results of PCoA with a UniFrac distance matrix comparing the 202 samples summarized in Table 1 and SI Data Set 1. The scatterplot is of principal coordinate 1 (PC1) vs. principal coordinate 2 (PC2). The symbols are ...
Fig. 2.
Results of PCoA colored by environment type. A scatterplot of PC1 vs. PC2 (A) and PC3 vs. PC2 (B). The symbols represent the 202 samples and are as described in Table 1. A file of this scatterplot in which pop-up windows indicate which point corresponds ...

Environments of the same type also cluster together, in both the hierarchical cluster (SI Fig. 4) and PCoA plots (Fig. 2), even though each type includes diverse environments (Table 1). For example, nonsaline water samples (blue pentagons, Fig. 2) have high PC2 values, and surface soils (Fig. 2, purple inverted triangles) and sediments (Fig. 2, yellow sidewise triangles) have low PC2 values, indicating that substrate type (water vs. sediment) is the second most important factor for explaining community differences. Soils and sediments cluster separately, and submerged soils and aquifers (Fig. 2, gray diamonds) generally cluster with sediments. Interestingly, even hot springs (Fig. 2, cyan sidewise triangles) partition by substrate type along PC2. Hot spring sediments (Nsp_1, Nsp_7: see SI Data Set 1 for label descriptions) cluster with nonsaline sediments along PC2, and communities that colonized glass slides placed in the microbial mats (Nsp_93, Nsp_94) cluster near nonsaline water.

As we showed previously in marine environments (11), cultured samples from different environments (Fig. 2, pink circles and hexagons) generally cluster together rather than with their environment types. Cultured samples separate by salinity, however, both in the hierarchical cluster (SI Fig. 4) and along PC1 (Fig. 1). Although cultured samples do not separate from other water samples when PC1 and PC2 alone are used, PC3 clearly separates these groups (Fig. 2B). A few samples still do not separate from the cultured isolates when the first three principal coordinates are used. These samples include both uncultured marine ice samples (Fig. 2B, green circles), about half of the endolithic communities (Fig. 2B, green triangles), and a small proportion of the other environment types. We have previously noted the similarity between uncultured marine ice communities and cultured isolates (11) and related it to the observation that most bacteria in marine ice can be cultured (15). The results suggest that the same may be true for many endolithic communities.

The saline environments separated along PC2 according to the same properties as the nonsaline environments, although clustering within each saline environment was looser. Hierarchical clustering (SI Fig. 4) and PCoA (Fig. 2) divided saline water samples into three subgroups: surface water, mostly in coastal regions (Fig. 2, blue inverted triangles); subsurface water, mostly in the open ocean (Fig. 2, gray sidewise triangles); and anoxic water from many locations (Fig. 2, cyan triangles; Table 1). The saline sediments (Fig. 2, purple circles) clustered together but overlapped other saline environments, including hypersaline mats, stromatolites, hydrothermal vent colonizers (Table 1, Saline–misc; Fig. 2, yellow squares), and anoxic saline water samples. Like nonsaline water and cultured isolates, surface/coastal water and cultures from saline environments separated from saline sediments along PC2. These results reinforce the suggestion that substrate type (water vs. sediment) is the second most important property for structuring diversity, perhaps because of differences in lineages adapted to planktonic vs. sessile lifestyles. However, because anoxic water samples cluster with sediments, oxygenation may also be important. For instance, clades of obligate anaerobes, such as the Clostridia, and clades with many planktonic representatives, such as filamentous α-proteobacteria, probably account for some of these community differences.

Environment Types Differ Substantially in Phylogenetic Diversity (PD).

We also determined the PD of each sample, which is the branch length that remains when all other sequences are removed from the tree (16), and the PD gain (G), which is the branch length a sample adds to a tree containing sequences from all other samples (16). For example, if a new sample contained only sequences already found in other studies, adding that sample's sequences to the tree would add no new branch length, and the G value would be 0. Environments with high G values are promising sites for discovering new, diverse microbial lineages. Samples with high PD and low G values have many phylogenetic lineages that are also found in other environments.
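As a rough illustration of these two quantities, the sketch below computes PD and G on a toy tree. The function name, the sample labels, and the branch-list representation are hypothetical, and the sketch assumes that PD counts every branch on the path from the root of the full tree to a sample's sequences.

```python
def pd_and_gain(branches, leaf_sample, sample):
    """Return (PD, G) for one sample on a tree given as a branch list.

    branches: list of (branch_length, set_of_descendant_leaf_names)
    leaf_sample: dict mapping each leaf name to its sample label
    """
    pd = 0.0    # branch length leading to at least one sequence of the sample
    gain = 0.0  # branch length leading only to sequences of the sample
    for length, leaves in branches:
        samples_below = {leaf_sample[leaf] for leaf in leaves}
        if sample in samples_below:
            pd += length
            if samples_below == {sample}:
                gain += length
    return pd, gain


# Hypothetical toy tree ((a:1,b:1):1,(c:1,d:1):1), one entry per branch.
branches = [
    (1.0, {"a"}), (1.0, {"b"}), (1.0, {"c"}), (1.0, {"d"}),
    (1.0, {"a", "b"}), (1.0, {"c", "d"}),
]
leaf_sample = {"a": "S1", "b": "S2", "c": "S1", "d": "S1"}

for name in ("S1", "S2"):
    pd, gain = pd_and_gain(branches, leaf_sample, name)
    print(name, "PD =", pd, "G =", gain)
```

In this toy case sample S2 has a small G because the internal branch above its single sequence is shared with S1, which is the situation described above for samples whose lineages are also found elsewhere.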

Because sequencing effort influences diversity estimates, we regressed both G (Fig. 3) and PD (SI Fig. 5) values on the number of OTUs in each sample. The relationships between sequencing effort and both PD and G are approximately linear (R² of 0.76 and 0.91, respectively), suggesting that deep sequencing of one environment uncovers as much new diversity as shallow sequencing of many related environments. Regressions for individual environment types indicated substantial differences in their contributions to known diversity (Fig. 3). We quantified these differences by calculating the residual of each sample from the regression of all samples (Fig. 3, blue line). Highly positive or negative residuals indicate high or low diversity, respectively (Table 1; see SI Data Set 1 for individual sample results).
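The residual correction can be sketched with an ordinary least-squares fit, as below. The function name and the four data points are invented for illustration and are not values from this study.

```python
def regression_residuals(x, y):
    """Fit y = intercept + slope * x by least squares; return fit and residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A positive residual means more diversity than expected for that OTU count.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return slope, intercept, residuals


# Made-up example: OTU counts and G values for four hypothetical samples.
otu_counts = [20, 40, 60, 80]
gain_values = [1.0, 2.5, 2.5, 4.0]

slope, intercept, residuals = regression_residuals(otu_counts, gain_values)
for count, res in zip(otu_counts, residuals):
    print(count, round(res, 3))
```

Here the second sample lies above the fitted line and the third below it, even though both have the same raw G value, which is why residuals rather than raw values are compared across samples of different sizes.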

Fig. 3.
Unique diversity (G) regression analysis. Plot of the amount of branch length that is added to the phylogenetic tree (G value) by each of the 202 samples, vs. the number of OTUs that represents each sample. The main regression line is shown in blue. Bacteria ...

Soils Are Less Diverse Than Expected and Sediments and Hypersaline Mats Are More Diverse.

Surprisingly, surface soils had significantly lower G values than other environments and negative average PD residuals (Table 1), even though soil is often described as one of the most diverse environment types on earth (17, 18). High estimates for soil diversity are based on the number of OTUs found in each sample (18) and reassociation kinetics (19) and not phylogenetic diversity. The high species diversity in soil may result from more closely related species persisting in the same sample, perhaps adapting to different niches by horizontal gene transfer (which would not affect phylogenetic relatedness measured by 16S rRNA).

The nonsaline cultured group also had significantly lower average PD and G residual values than the other environments (Table 1). This result is consistent with the observation that few lineages in these environments can be cultured (2). The saline-cultured environments also had negative average residuals for both total PD and G.

Saline sediment and saline-misc (Table 1) have significantly higher G values than other environment types. Nonsaline sediments and springs resembled saline sediments, but the sample sizes were too low for statistical significance (Table 1). Saline and nonsaline sediments also had high average PD residuals (Table 1). High diversity in sediments is consistent with previous observations and may stem from their highly stratified nature and chemical gradients (17). Nonsaline sediments are less thoroughly sampled than saline sediments and are thus especially good targets for future sequencing efforts. Interestingly, the miscellaneous saline and nonsaline spring groups had high G and low PD values, indicating that they, on average, contain relatively few, but highly divergent, lineages.

Some environment types clustered poorly, suggesting that they may not form natural groups. Residuals for individual samples are thus of interest (see SI Data Set 1 for values). The sample with the lowest G residual (Sws_M_163; −3.64 standard deviations from the mean) was from the Sargasso Sea (20), an environment known to have low diversity because of nutrient limitation and little spatial heterogeneity. The samples with the highest G residuals (So_Mm+_166 and So_Mm+_168; 5.08 and 3.71 standard deviations from the mean, respectively) were from different layers of the Guerrero Negro hypersaline mat, the molecular analysis of which introduced 15 previously unidentified candidate phyla, an unprecedented number for a single environment (21).


The comprehensive analysis of the environmental distribution of bacteria has provided insights that were not apparent in the original studies. Because the analysis relies on a phylogeny of 16S rRNA sequences, the clear grouping of samples by environment type indicates a direct relationship between 16S rRNA lineages and environmental distribution. Thus, although processes such as horizontal gene transfer can be important factors for adaptation to new environments, they cannot obscure the overall evolutionary pattern, suggesting that bacteria make genomic trade-offs that prevent major changes in lifestyle simply through new gene acquisition. Some factors, such as salinity, seem especially to encourage such lineage-specific adaptations.

The results also add an interesting perspective to the study of extreme environments. Although organisms in environments at the extremes of temperature and pH are presumably under strong selective pressures, they still cluster by salinity and substrate type, indicating that the general properties of these environments still primarily determine which lineages can survive there.

The ability of comparisons of 16S rRNA data to reveal the effects of specific chemical and physical factors on microbial communities depends on the quality of information that has been measured for the source environments and the accessibility of this information in the public databases. Although we found clear patterns of variation between environment types, such as the split between saline and nonsaline environments, whether this split stems from ionic strength, osmolarity, availability of sulfate for reduction, or other factors remains unresolved, in part because detailed measurements were not available for many of the environmental samples. Another limitation is that, because the records do not include information on how many times each sequence was observed in each sample, it is not possible to compare samples by using quantitative measures of β diversity such as weighted UniFrac (22). Information about relative abundances is also required for almost all measurements of α diversity (total diversity of a sample) including Chao1, ACE, rarefaction analysis, and the Shannon and Simpson indices (reviewed in ref. 23). Thus, improved availability of environment information within structured, machine-readable fields in the database is a key requirement for future large-scale analyses of the factors influencing microbial diversity.

The overview that this analysis provides is useful for evaluating where to direct new sequencing efforts. The environmental clustering patterns allow us, at least in some cases, to define environment types based on the occurrence of similar bacterial lineages rather than arbitrary criteria. For instance, nonsaline lakes and rivers behave as a cohesive group but saline water does not. Evaluation of these environment types, as well as of individual environments, allows us to identify optimal targets for finding new diversity.

Materials and Methods

Selecting Relevant Environmental Samples.

We extracted GenBank records from the April 15, 2006 release and identified small subunit (SSU) rRNA sequences and their associated publication titles. We identified SSU rRNA sequences as records that contained at least one of the terms “SSU,” “16S,” “18S,” or “small subunit” together with at least one of the terms “rRNA,” “rDNA,” or “ribosomal RNA” (the search was case-insensitive). We extracted reference information for each record using a custom parser and grouped the sequences that had the same title. The Excel spreadsheet that summarizes these large SSU rRNA surveys is available; see SI Text. The 267,731 putative SSU rRNA sequences were associated with 17,836 unique titles. Of these, 1,032 titles were associated with at least 50 sequences. Surprisingly, fewer than half of these studies were associated with any publication (485 of the 1,032 studies associated with at least 50 16S rRNA sequences). This underscores the importance of generating a standardized form for annotating sequences in the public databases with detailed information on the environments from which the sequences came.
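The keyword screen and grouping by title could look roughly like the following sketch. It is not the custom parser used in this study: the record layout (accession, description, title), the function names, and the choice to search a single description string are assumptions.

```python
from collections import defaultdict

# One term from each group must be present (case-insensitive).
MOLECULE_TERMS = ("ssu", "16s", "18s", "small subunit")
RNA_TERMS = ("rrna", "rdna", "ribosomal rna")


def looks_like_ssu_rrna(description):
    """True if the text contains a molecule term and an rRNA term."""
    text = description.lower()
    return (any(term in text for term in MOLECULE_TERMS)
            and any(term in text for term in RNA_TERMS))


def group_by_title(records, min_sequences=50):
    """Group matching records by reference title; keep well-sampled titles.

    records: iterable of (accession, description, title) tuples (hypothetical).
    """
    by_title = defaultdict(list)
    for accession, description, title in records:
        if looks_like_ssu_rrna(description):
            by_title[title].append(accession)
    return {t: accs for t, accs in by_title.items() if len(accs) >= min_sequences}


# Invented records for illustration only.
records = [
    ("X1", "Uncultured bacterium 16S ribosomal RNA gene", "Survey A"),
    ("X2", "Uncultured bacterium 16S rRNA gene, partial", "Survey A"),
    ("X3", "Bacterium 23S rRNA gene", "Survey A"),
    ("X4", "Small Subunit rDNA clone", "Survey B"),
]
print(group_by_title(records, min_sequences=2))
```

A plain substring match like this is permissive (for example, "ssu" can occur inside unrelated words), so the matches are best regarded as putative SSU rRNA records.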

Making the Phylogenetic Tree.

We used NAST (24) to add sequences from the 111 selected studies to the standard Arb alignment (25). We then added the 21,752 sequences from the studies to a guide tree with >110,000 sequences using the Arb parsimony insertion tool. The guide tree was initially described in ref. 26 but was subsequently enhanced by the Pace lab (J. K. Harris and N. R. Pace, personal communication). We used a lanemask (“lanemaskPH”) that is provided with the Hugenholtz Arb database (27), available at the Ribosomal Database Project II (28), to exclude hypervariable regions from consideration while generating the tree. We chose a parsimony insertion algorithm rather than a de novo method such as neighbor joining (NJ) because it can relate sequences from different parts of the 16S rRNA molecule. This is essential because there is very little overlap in sequenced 16S rRNA regions when comparing all of the studies. For instance, only 6,552 of the 21,752 sequences (30%) were complete between positions homologous to 300 and 700 in Escherichia coli 16S rRNA and only 7,102 (33%) were complete for the region between E. coli positions 700 and 1,100. To test whether the Arb parsimony insertion tree gave similar results to a tree built de novo, we performed PCoA clustering on NJ trees of sequences from the 82 and 90 environments that had >15 sequences in the 300–700 region and the 700–1,100 region, respectively. The NJ trees were also made in Arb, by using the Jukes–Cantor model of nucleotide substitution. We compared the results to those from Arb parsimony insertion trees with the same set of sequences. For both regions, the results of PCoA clustering with the parsimony insertion and NJ trees were almost identical (data not shown). Clustering by using only the portion of the data that could be incorporated into the NJ trees recovered the saline/nonsaline split as the most important division in the data for both regions, although the coordinate axes were rotated slightly.

Selecting OTUs and Annotating the Tree with Environment Information.

We divided the sequences into 225 environmental samples using annotations from the associated publications. By excluding 23 samples with <15 OTUs each, we produced a tree with 12,984 OTUs representing 202 samples. For each environmental sample, we chose OTUs with a 97% identity threshold using our Divergent Set software (5). We decided to dereplicate the sequence data for several reasons. First, dereplication of the data has little effect on clustering with UniFrac, because inclusion of near-identical sequences will not change the amount of unique branch length in the tree. Removing near-identical sequences thus produces a smaller tree that is more easily manipulated, without affecting the results. Second, because the inclusion of very small samples in a UniFrac analysis can produce spurious results, we wanted to exclude small environmental samples. Because some studies deposit near-identical sequences in GenBank, and others deposit sequences only after choosing OTUs, we needed to remove near-identical sequences from all studies to evaluate our sampling effort fairly. Finally, when we corrected the raw PD and G values for sampling effort, it was again essential to ensure that the results would be robust to the methodology used to choose OTUs in the original studies. We chose the 97% threshold because this is the most common threshold used for dereplication at the species level. Repetition of the analysis with all available sequences, i.e., without choosing OTUs at all, provided almost identical UniFrac clustering results (data not shown).
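To make the idea of dereplication at a 97% threshold concrete, the sketch below uses a simple greedy procedure on pre-aligned, equal-length sequences. This is a stand-in chosen for brevity and is not the algorithm implemented in the Divergent Set software; the function names and toy sequences are hypothetical.

```python
def identity(seq_a, seq_b):
    """Fraction of identical positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / len(seq_a)


def pick_otus(sequences, threshold=0.97):
    """Greedy dereplication: keep a sequence only if it is less than
    `threshold` identical to every representative kept so far.

    sequences: list of (name, aligned_sequence) tuples.
    Returns the names of the chosen representatives, one per OTU.
    """
    representatives = []
    for name, seq in sequences:
        if all(identity(seq, rep) < threshold for _, rep in representatives):
            representatives.append((name, seq))
    return [name for name, _ in representatives]


# Invented 100-position sequences: s2 is 98% identical to s1, s3 is 90%.
sequences = [
    ("s1", "A" * 100),
    ("s2", "A" * 98 + "CC"),
    ("s3", "A" * 90 + "C" * 10),
]
otus = pick_otus(sequences)
print(otus)
```

With a rule like this, a sample would then be retained only if the number of representatives reaches the minimum stated above (15 OTUs). Greedy results can depend on input order, which is one reason a dedicated tool is preferable in practice.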

Statistical Analyses.

We performed PCoA and hierarchical clustering in the UniFrac web interface (10), using the Arb tree and a file mapping sequence labels to environmental samples as input. PCoA is similar to principal components analysis (PCA), except that the starting point is a matrix of distances between samples rather than a matrix of observations about each sample. We used the unweighted pair group method with arithmetic mean (UPGMA) hierarchical clustering algorithm, which produces clusters by finding the nearest pair of neighbors at each step, finding the midpoint between these neighbors, and adding a cluster consisting of the neighbors to a growing tree.
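The UPGMA procedure can be sketched in a few lines of Python. This is an illustrative implementation, not the one in the UniFrac web interface; the function name, the sample labels, and the distances in the toy matrix are invented.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Cluster samples with UPGMA and return the tree as nested tuples."""
    # Active clusters: key -> (subtree, number of samples in the cluster)
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(labels)), 2)}
    next_key = len(labels)
    while len(clusters) > 1:
        # Join the closest pair of active clusters.
        pair = min(dist, key=dist.get)
        i, j = sorted(pair)
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        del dist[pair]
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the distances from its two parts.
        for k in clusters:
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            dist[frozenset((next_key, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_key] = ((tree_i, tree_j), n_i + n_j)
        next_key += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Invented UniFrac-like distances among four hypothetical samples.
labels = ["soil", "sediment", "lake", "ocean"]
matrix = [
    [0.0, 0.3, 0.6, 0.9],
    [0.3, 0.0, 0.6, 0.9],
    [0.6, 0.6, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
tree = upgma(labels, matrix)
print(tree)
```

In this toy matrix the saline sample is equally distant from all others and therefore joins last, mirroring a primary split by salinity.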

We also used the Arb tree for diversity analyses. We calculated PD for each sample by removing all sequences not from the sample from the tree and summing the remaining branch length. We determined G by removing only the sequences from that sample from the tree and calculating how much the total branch length decreased. We corrected each PD and G value for sampling effort by calculating the residual from the regression of PD and G vs. OTU count for all of the samples. We determined whether the average G and PD residuals for each environment type were significantly different from samples not in that environment type with a two-tailed Student's t test. These statistical analyses were performed by using custom code written in the Python language.
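The comparison of residuals between one environment type and all other samples can be sketched as a pooled-variance (Student's) two-sample t statistic. The function name and the residual values below are invented; converting the statistic to a two-tailed P value additionally requires the t distribution, which is available in statistical libraries such as SciPy.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, df


# Invented residuals: one environment type vs. all remaining samples.
in_type = [1.0, 2.0, 3.0]
others = [-1.0, 0.0, 1.0]

t_stat, dof = student_t(in_type, others)
print(round(t_stat, 4), dof)
```

A large positive statistic would indicate that the environment type has higher residuals, on average, than the remaining samples.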



We thank Noah Fierer, Norman Pace, Jeffrey Gordon, Michael Yarus, Jesse Zaneveld, Kirk Harris, Jeff Walker, and Ruth Ley for valuable feedback on drafts of the manuscript. C.L. was supported by National Institutes of Health Predoctoral Training Grant T32 GM08759. This work was performed by using the Keck RNA Bioinformatics facility (University of Colorado, Boulder, CO).


Abbreviations

OTU, operational taxonomic unit
PCoA, principal coordinates analysis
PD, phylogenetic diversity
G, gain in phylogenetic diversity
SSU, small subunit


The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0611525104/DC1.


1. Hugenholtz P, Goebel BM, Pace NR. J Bacteriol. 1998;180:4765–4774.
2. Rappe MS, Giovannoni SJ. Annu Rev Microbiol. 2003;57:369–394.
3. Schloss PD, Handelsman J. Microbiol Mol Biol Rev. 2004;68:686–691.
4. Pace NR. Science. 1997;276:734–740.
5. Widmann J, Hamady M, Knight R. Mol Cell Proteomics. 2006;5:1520–1532.
6. Gower JC. Biometrika. 1966;53:325–338.
7. Krzanowski WJ. Principles of Multivariate Analysis. A User's perspective. Oxford: Oxford Univ Press; 2000.
8. Felsenstein J. Inferring Phylogenies. Sunderland, MA: Sinauer; 2004.
9. Sokal RR, Michener CD. Univ Kansas Sci Bull. 1958;38:1409–1438.
10. Lozupone C, Hamady M, Knight R. BMC Bioinformatics. 2006;7:371–385.
11. Lozupone C, Knight R. Appl Environ Microbiol. 2005;71:8228–8235.
12. Engel AS, Lee N, Porter ML, Stern LA, Bennett PC, Wagner M. Appl Environ Microbiol. 2003;69:5503–5511.
13. Hengstmann U, Chin KJ, Janssen PH, Liesack W. Appl Environ Microbiol. 1999;65:5050–5058.
14. Crump BC, Armbrust EV, Baross JA. Appl Environ Microbiol. 1999;65:3192–3204.
15. Brinkmeyer R, Knittel K, Jurgens J, Weyland H, Amann R, Helmke E. Appl Environ Microbiol. 2003;69:6610–6619.
16. Faith DP. Biol Conservation. 1992;61:1–10.
17. Torsvik V, Ovreas L, Thingstad TF. Science. 2002;296:1064–1066.
18. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, et al. Science. 2005;308:554–557.
19. Gans J, Wolinsky M, Dunbar J. Science. 2005;309:1387–1390.
20. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, et al. Science. 2004;304:66–74.
21. Ley RE, Harris JK, Wilcox J, Spear JR, Miller SR, Bebout BM, Maresca JA, Bryant DA, Sogin ML, Pace NR. Appl Environ Microbiol. 2006;72:3685–3695.
22. Lozupone CA, Hamady M, Kelley ST, Knight R. Appl Environ Microbiol. 2007;73:1576–1585.
23. Magurran AE. Measuring Biological Diversity. Oxford: Blackwell; 2004.
24. DeSantis TZ, Hugenholtz P, Keller K, Brodie EL, Larsen N, Piceno YM, Phan R, Andersen GL. Nucleic Acids Res. 2006;34:394–399.
25. Ludwig W, Strunk O, Westram R, Richter L, Meier H, Yadhukumar, Buchner A, Lai T, Steppi S, Jobb G, et al. Nucleic Acids Res. 2004;32:1363–1371.
26. Hugenholtz P, Huber T. Int J Syst Evol Microbiol. 2003;53:289–293.
27. Hugenholtz P. Genome Biol. 2002;3:1–8.
28. Maidak BL, Cole JR, Lilburn TG, Parker CT Jr, Saxman PR, Farris RJ, Garrity GM, Olsen GJ, Schmidt TM, Tiedje JM. Nucleic Acids Res. 2001;29:173–174.
