NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Dean L, McEntyre J, editors. Coffee Break: Tutorials for NCBI Tools [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 1999-.

Microbial diversity

let's tell it how it is

, PhD.

Last Update: March 26, 2004.

An impressive number of bacteria—about 30,000 species—are represented in GenBank. However, our view of the microbial world is both scant and skewed. A recent estimate suggests that the sea may support as many as 2 million different bacteria, and a ton of soil might contain 4 million (1). Less than half of the bacteria represented in GenBank—about 13,000—have been formally described, and almost all of these (90%) lie within 4 of the 40 bacterial divisions (2). Similar or greater paucity of knowledge also exists for archaea and viruses (3).

Sampling "wild" microorganisms leads to the discovery of new species and novel metabolisms, which may be important from both a basic science and a practical perspective (for example, see Refs 4,5). If we characterized the community in the human gut, for instance, it would be easier to spot non-native organisms in food poisoning outbreaks. Pathogens that may underlie neurological syndromes that present with features of infection would stand out against the background flora (1). Engineered communities of microorganisms might also be able to assist in the cleanup of environmental disasters or create sustainable energy sources.

Exploring bacterial diversity is typically done by amplifying rRNA genes, in particular 16S rRNA genes, from DNA samples isolated from a habitat. The sequences are then compared to each other and to the 16S rRNA sequences from known species. If no close match to an existing 16S rRNA gene sequence is found, then the test sequence is thought to represent a new bacterium and is listed in GenBank as "uncultured bacterium". Even in well-studied, discrete places like the human mouth, new groups of uncultured bacteria continue to be discovered all the time. A newly identified organism has to be isolated and cultured in the lab to be described further; but many bugs are just not amenable to monoculture—they have adapted to living in a specific environment and may need to be part of a complex community to survive (1-3).

16S rRNA genes are considered the standard because they are thought to be conserved across vast taxonomic distances (they are critical for protein translation), yet show some sequence variation between closely related species. However, one problem with using rRNA genes is that they are often present in multiple copies per genome; therefore, other representative genes may be used for sampling specific populations.
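The comparison step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any study cited here: it assumes the sequences have already been aligned to equal length (gaps written as "-"), and the function names and the 97% cutoff are assumptions introduced for the example. The cutoff is a commonly quoted rule of thumb for 16S rRNA comparisons rather than a value taken from this article.

```python
# Sketch: flag a 16S rRNA sequence as a candidate "uncultured bacterium"
# when its best match among known sequences falls below a cutoff.
# Assumes pre-aligned, equal-length sequences; all names are hypothetical.

NOVELTY_CUTOFF = 97.0  # percent identity; an assumed rule-of-thumb value


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def best_match(query: str, references: dict) -> tuple:
    """Return (name, identity) for the closest reference sequence."""
    scored = ((name, percent_identity(query, seq))
              for name, seq in references.items())
    return max(scored, key=lambda pair: pair[1])


def looks_novel(query: str, references: dict,
                cutoff: float = NOVELTY_CUTOFF) -> bool:
    """True when no reference reaches the identity cutoff."""
    _, identity = best_match(query, references)
    return identity < cutoff
```

In practice the alignment itself and the choice of reference set matter far more than this final comparison, and a real analysis would use an established alignment tool rather than pre-aligned strings.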

Whole Genome Shotgun Sequencing of Environmental Samples

New approaches to environmental sampling are emerging (6-9). One of these used a microarray to discover and assist in the isolation of new viruses (6); another used a shotgun cloning and sequencing method to explore marine viral communities (9). Two others have used whole genome shotgun (WGS) sequencing on a population of bacteria, obviating the need to isolate each organism before sequencing can begin (7,8). These methods, used in combination with existing methods, may provide shortcuts to the discovery of new genes and give a holistic perspective on microbial populations.

One recent study used a WGS approach to explore a sample from an acid mine drainage biofilm (7; AADL00000000). These investigators report that near-complete genomes for Leptospirillum Group II and Ferroplasma Type II were assembled, along with more fragmentary assemblies for Leptospirillum Group III, Thermoplasmatales archaeon gpl, and Ferroplasma acidarmanus Type I. Analysis of the results provided some insight into how such organisms survive in an extreme environment.

In another test case of the WGS method, Venter et al. (8) sampled water from the Sargasso Sea, one of the best-characterized regions of ocean in the world. The major set of samples produced 1.66 million short sequences, some of which could be grouped together into larger genomic pieces. About 400,000 paired-end reads and singleton reads remained that could not be grouped in this way.

Finding the Data

Using a WGS method to sequence an undefined population as opposed to a single organism adds significant complexity to the assembly process and to the identification of genes. About 25% of the assembled data from the Sargasso Sea had 3X coverage or greater; these well-sampled portions were used to cluster the sequence by “organism”.
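To make the idea of "3X coverage or greater" concrete, the sketch below computes per-base depth for one contig from read placements and reports the fraction of positions covered at least three times. It is an illustration only: the input format (half-open, 0-based start and end coordinates) and the function names are assumptions made for this example, not part of the Sargasso Sea analysis.

```python
# Sketch: per-position read depth on a contig and the fraction of the
# contig at or above a depth threshold. Input format is hypothetical.

def depth_per_position(contig_length: int, reads) -> list:
    """Depth at each base, given reads as (start, end) half-open intervals."""
    delta = [0] * (contig_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, contig_length)
        if start < end:
            delta[start] += 1   # read begins covering here
            delta[end] -= 1     # read stops covering here
    depth = []
    running = 0
    for change in delta[:contig_length]:
        running += change
        depth.append(running)
    return depth


def fraction_at_or_above(depth: list, min_depth: int = 3) -> float:
    """Fraction of positions whose depth is at least min_depth."""
    if not depth:
        return 0.0
    return sum(1 for d in depth if d >= min_depth) / len(depth)


# Three overlapping reads on a 10-base contig: only positions 4 and 5
# are covered by all three, so 20% of the contig has 3X coverage.
example_depth = depth_per_position(10, [(0, 6), (2, 8), (4, 10)])
example_fraction = fraction_at_or_above(example_depth, 3)
```

In a mixed population, depth also reflects how abundant each organism was in the sample, which is why well-covered regions are the ones that can be grouped with some confidence.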

The assembled sequences have been deposited in the WGS division of GenBank, with the project Accession number AACY01000000; thus, there are 811,372 WGS contigs in GenBank with the Accession numbers AACY01000001–AACY01811372. Of the WGS contigs, 498,641 are assembled into 232,442 scaffolds; the rest remain “singleton” WGS contigs. All but 10,685 of the scaffolds are made up of only two contigs. For the organism genomes listed in Table 1, 301 of the total scaffolds plus 36 singleton WGS contigs were used; the remainder have not been associated with any particular organism.
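The numbers in this paragraph can be checked with a little arithmetic. The sketch below derives the counts that the text leaves implicit and shows how a contig Accession number is formed from the project prefix and a six-digit index. The constant and function names are hypothetical; only the figures themselves come from the paragraph above.

```python
# Sketch: bookkeeping for the Sargasso Sea WGS contigs, using only the
# figures quoted in the text. Names are illustrative.

PROJECT_PREFIX = "AACY01"          # project Accession is AACY01000000
TOTAL_CONTIGS = 811_372
CONTIGS_IN_SCAFFOLDS = 498_641
TOTAL_SCAFFOLDS = 232_442
SCAFFOLDS_WITH_MORE_THAN_TWO = 10_685


def contig_accession(index: int) -> str:
    """Accession number of the index-th WGS contig (1-based)."""
    if not 1 <= index <= TOTAL_CONTIGS:
        raise ValueError("index outside the deposited contig range")
    return f"{PROJECT_PREFIX}{index:06d}"


# Contigs that were not placed in any scaffold.
singleton_contigs = TOTAL_CONTIGS - CONTIGS_IN_SCAFFOLDS            # 312731

# Scaffolds made up of exactly two contigs.
two_contig_scaffolds = TOTAL_SCAFFOLDS - SCAFFOLDS_WITH_MORE_THAN_TWO  # 221757

# Contigs left over for the larger scaffolds, and their average size.
contigs_in_larger = CONTIGS_IN_SCAFFOLDS - 2 * two_contig_scaffolds    # 55127
mean_contigs_per_larger = contigs_in_larger / SCAFFOLDS_WITH_MORE_THAN_TWO
```

So about 312,731 contigs remain singletons, and the 10,685 larger scaffolds hold a little over five contigs each on average.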

Table 1. The organism bins assembled from the Sargasso Sea WGS environmental sample dataset (8).

All of the short sequence reads, including those that were not included in the assembly, can be found in the Trace Archive.

The assemblies were then further clustered into 30 tentative organism “bins” based on depth of coverage, oligonucleotide frequencies and similarities to previously sequenced genomes. Of these, 12 are of sufficient size to be considered a genome assembly, while the remaining 16 are relatively small single scaffolds (Table 1). All organism bins have been assigned a taxonomy ID, and have been placed in the taxonomic tree. Figure 1 shows the graphical representation of the cf. Shewanella SAR-1 “genome” sequence.
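Of the three binning criteria mentioned above, oligonucleotide frequency is the easiest to illustrate. The sketch below turns a sequence into a vector of k-mer frequencies (k = 4 by default) and measures the distance between two such vectors; scaffolds from the same genome tend to have similar vectors. This is a generic illustration with assumed names and an assumed distance measure, and it does not reproduce the actual method of Venter et al. (8).

```python
# Sketch: oligonucleotide (k-mer) frequency profiles for comparing
# scaffolds. The choice of k = 4 and Euclidean distance are assumptions.

from collections import Counter
from itertools import product
import math


def kmer_frequencies(sequence: str, k: int = 4) -> list:
    """Frequencies of all 4**k DNA k-mers, in lexicographic order."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):   # ignore windows with N or other codes
            counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in all_kmers]


def euclidean_distance(u: list, v: list) -> float:
    """Straight-line distance between two equal-length frequency vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

A real binning pipeline would also normalize for sequence length and strand, and would combine this signal with depth of coverage and similarity to known genomes, as the text describes.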

Figure 1. (a) Genome view of cf. Shewanella SAR-1, constructed from the whole genome shotgun sequence derived from Sargasso Sea environmental samples (8). Genes have been classified according to the COG functional categories of the protein products, and color-coded ...

A variety of approaches suggested that there are at least 1000 species represented in the Sargasso Sea samples (8). Species of Burkholderia, a genus that includes human and plant pathogens and some environmentally important bacteria, were represented in high proportion, as were two distinct species closely related to Shewanella oneidensis. Both of these genera require a more nutrient-rich environment than the open ocean can offer, suggesting that they originated from microhabitats such as marine snow. The cyanobacterium Prochlorococcus was also relatively abundant in some samples.

Although the primary focus of this study was on bacterial populations, WGS environmental sampling may be an equally valid approach for exploring plasmids (Table 2), phage, viruses, and eukaryotic microbes.

Table 2. The plasmid bins assembled from the Sargasso Sea WGS environmental sample dataset (8).


References

1. Curtis T P, Sloan W T, Scannell J W. Estimating prokaryotic diversity and its limits. Proc Natl Acad Sci USA. 2002;99:10494–10499. [PMC free article: PMC124953] [PubMed: 12097644]
2. DeLong E F. Microbial seascapes revisited. Curr Opin Microbiol. 2001;4:290–295. [PubMed: 11378481]
3. Roossinck M J. Plant RNA virus evolution. Curr Opin Microbiol. 2003;6:406–409. [PubMed: 12941413]
4. Kazor C E, Mitchell P M, Lee A M, Stokes L N, Loesche W J, Dewhirst F E, Paster B J. Diversity of bacterial populations on the tongue dorsa of patients with halitosis and healthy patients. J Clin Microbiol. 2003;41:558–563. [PMC free article: PMC149706] [PubMed: 12574246]
5. Béjà O, Aravind L, Koonin E V. et al. Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science. 2000;289:1902–1906. [PubMed: 10988064]
6. Wang D, Urisman A, Liu Y T. et al. Viral discovery and sequence recovery using DNA microarrays. PLoS Biol. 2003;1 [PMC free article: PMC261870] [PubMed: 14624234]
7. Tyson G W, Chapman J, Hugenholtz P, Allen E E, Ram R J, Richardson P M, Solovyev V V, Rubin E M, Rokhsar D S, Banfield J F. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:25–26. [PubMed: 14961025]
8. Venter J C, Remington K, Heidelberg J. et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004 [Epub ahead of print] [PubMed: 15001713]
9. Breitbart M, Salamon P, Andresen B. et al. Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci USA. 2002;99:14250–14255. [PMC free article: PMC137870] [PubMed: 12384570]