NCBI News








The Growth in Number and Diversity of Whole Genome Shotgun Sequencing Projects at the NCBI

The number of Whole Genome Shotgun (WGS) sequencing projects with data in GenBank continues to grow at a rapid pace. Projects include not only single genomes from individual organisms but also metagenomes, which are whole genome shotgun sequences from biological communities. As of January 24, 2007, more than 400 projects are listed in the WGS directory on the GenBank FTP site.

This article highlights some of the recent genome sequences and metagenomes and shows how to access these important data using the Entrez system and the NCBI BLAST services.

WGS sequences in both GenBank and the Entrez system are organized by project, and each project is assigned a master accession that begins with a unique four-letter prefix. All sequences belonging to the same project have accessions that share the same four-letter prefix. Examples of WGS master accessions can be seen in Tables 1 and 2 below. As described in the Summer/Fall 2004 issue of the NCBI News, the Whole Genome Shotgun Projects page, which provides a list of WGS projects and accessions, is available on the NCBI Website.

Table 1

Table 1. Whole genome shotgun assemblies of single organism genomes with selected recent examples.

Table 2

Table 2. Environmental sequencing projects (metagenomes) with data in GenBank.
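The shared-prefix convention described above lends itself to a small illustration. The following is a minimal sketch, not NCBI code: the function names are hypothetical, and the accession layout beyond the four-letter prefix (a two-digit version followed by six or more digits) is an assumption. Only AASZ01000485 is taken from this article; the other accessions are made up for the example.

```python
import re
from collections import defaultdict

# Assumed WGS accession layout: four-letter project prefix, two-digit
# assembly version, then six or more digits numbering the sequence.
# Only the four-letter prefix is stated in the article.
WGS_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6,})$")


def project_prefix(accession):
    """Return the four-letter project prefix of a WGS-style accession."""
    match = WGS_ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS-style accession: {accession!r}")
    return match.group(1)


def group_by_project(accessions):
    """Group accessions by the project prefix they share."""
    groups = defaultdict(list)
    for accession in accessions:
        groups[project_prefix(accession)].append(accession)
    return dict(groups)


# AASZ01000485 appears in this article; the other two are hypothetical.
example = ["AASZ01000485", "AASZ01000001", "ZZZZ01000002"]
for prefix, members in group_by_project(example).items():
    print(prefix, len(members))
```

An ordinary GenBank accession such as AY125948 does not fit the assumed pattern, so `project_prefix` rejects it with a `ValueError` rather than guessing a project.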

The Entrez Genome Projects Database

The Entrez Genome Projects Database provides convenient access to all WGS genomes.

The following query in the Genome Projects page retrieves more than 400 summaries of projects with WGS data.

has wgs[Properties]
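The same query could, in principle, be issued programmatically through the NCBI E-utilities ESearch service. The sketch below only builds the request URL and sends nothing. The helper name `esearch_url` is hypothetical, and the database identifier `genomeprj` is an assumption that may no longer be served, since the Genome Projects database has since been succeeded by BioProject. The query string is copied verbatim from the article.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL for an Entrez query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# "genomeprj" is an assumed database identifier, not taken from the article.
url = esearch_url("genomeprj", "has wgs[Properties]", retmax=500)
print(url)
```

Building the URL separately from fetching it keeps the example free of network access and makes the encoded query easy to inspect before use.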

Each of these Project Summaries links to a Project Overview page that provides links to the sequencing center involved and explains the motivation for the project. Figure 1 shows the Project Overview page for the European rabbit. The data are linked through the "Project Data" menu. Genome-specific resources, including in some cases a genome-specific BLAST service, are linked under "Resource Links".

Figure 1


Figure 1. The Genome Project summary page for the European rabbit. The page provides access to the WGS data through the "Project Data" menu and links to genome BLAST pages under "Resource Links".

Individual Genome Sequences

WGS projects include over 250 bacteria and more than 120 eukaryotes from a wide range of taxonomic groups, as shown in Table 1. In addition to the rhesus macaque, reported in the last issue of the NCBI News, notable additions among the animals include the African elephant, the rabbit, the guinea pig, the shrew, the hedgehog, and the domestic cat. There are also a number of first genomes from an interesting array of taxonomic groups: a beetle, Tribolium castaneum; a marsupial, the short-tailed opossum; a monotreme, the duck-billed platypus; and the first tree species, the black cottonwood.

Low-coverage sequences of large genomes like that of the elephant may contain nearly a million individual sequences in GenBank. Master GenBank records, which collect all of the WGS sequences for a particular project, are useful for retrieving these projects using the Entrez system. The Genome Project Overview pages described above provide access to the WGS data through these master records, and the master records can also be searched directly in the Entrez Nucleotide database. For example, the following query retrieves the master records for all mammalian WGS projects from the Nucleotide database.

wgs_master[Properties] AND mammals[Organism]
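Queries of this form pair a search term with a bracketed field name and join the clauses with AND. As a rough illustration, the sketch below composes such strings. The helper names `tagged` and `all_of` are hypothetical, and the substitution of `bacteria` for `mammals` is an assumed variation rather than a query taken from the article.

```python
def tagged(term, field):
    """Attach an Entrez field tag to a term, e.g. mammals[Organism]."""
    return f"{term}[{field}]"


def all_of(*clauses):
    """Join clauses with the Entrez Boolean operator AND."""
    return " AND ".join(clauses)


# Reproduces the query shown in the article.
mammal_masters = all_of(
    tagged("wgs_master", "Properties"),
    tagged("mammals", "Organism"),
)

# Hypothetical variation: the same filter restricted to another taxon.
bacterial_masters = all_of(
    tagged("wgs_master", "Properties"),
    tagged("bacteria", "Organism"),
)

print(mammal_masters)
print(bacterial_masters)
```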

Searching Biological Features in WGS Genomes

The prokaryotic genomes and several WGS projects for higher eukaryotes are available as annotated genomes with reference sequences, gene records and, for the eukaryotes, sequence maps in the Map Viewer. The rhesus macaque, Tribolium castaneum, and Populus trichocarpa are three recent examples of organisms with fully annotated WGS-based genomes. Features of these annotated genomes may be searched effectively using the Entrez Gene, Protein and Nucleotide databases. The assemblies, Reference Sequence mRNAs and proteins, and other sequence collections for annotated genomes are available for BLAST searches through the genomic BLAST pages linked to the Map Viewer homepage.

Many of the other WGS projects for eukaryotes are not currently available with annotated genes or proteins. Genes and other features in these genomes can instead be located by sequence similarity using the NCBI BLAST services. In some cases genome-specific BLAST pages are linked to WGS projects (Figure 1). In all cases, WGS data are available as the "wgs" nucleotide database in the pull-down list on nucleotide-nucleotide BLAST forms linked to the BLAST homepage.

Figure 2 shows the result of a BLAST search with the platypus beta-2-microglobulin mRNA (AY125948) against the platypus WGS sequences. The search identifies two WGS records containing the four exons of the platypus beta-2-microglobulin gene. This genomic sequence is not available in any other form at NCBI.

Figure 2


Figure 2. A nucleotide-nucleotide BLAST search against the wgs database using the platypus beta-2-microglobulin mRNA as a query. The database was limited with the Entrez query platypus[Organism]. The four hits to the two WGS sequences identify the four exons of the platypus beta-2-microglobulin gene. Use the following RID to retrieve live results: 1163023234-11232-134418568822.BLASTQ4.
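A search like the one in Figure 2 might also be described as a set of parameters for the NCBI BLAST URL API. The following is a sketch under stated assumptions and submits nothing. The helper name `blast_put_params` is hypothetical, and whether the service still accepts `wgs` as a database name, or resolves an accession given as the query, is an assumption based on the web form described in this article.

```python
from urllib.parse import urlencode

# Address of the NCBI BLAST service.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blast_put_params(query, database, program, entrez_query=None):
    """Collect the parameters of a BLAST submission without sending it."""
    params = {
        "CMD": "Put",          # request a new search
        "PROGRAM": program,    # blastn for nucleotide-nucleotide searches
        "DATABASE": database,  # assumed to match the form's "wgs" entry
        "QUERY": query,        # accession or sequence used as the query
    }
    if entrez_query:
        # Limits the database to records matching an Entrez query.
        params["ENTREZ_QUERY"] = entrez_query
    return params


params = blast_put_params(
    query="AY125948",
    database="wgs",
    program="blastn",
    entrez_query="platypus[Organism]",
)
encoded = urlencode(params)
print(BLAST_URL + "?" + encoded)
```

Keeping the Entrez limit as an optional argument mirrors the web form, where the organism restriction is an extra field rather than part of the query sequence.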

Environmental Sequences: Metagenomics

An interesting application of WGS techniques involves obtaining sequence information from entire biological communities rather than individual species. Acquiring and analyzing sequences obtained from biological communities without isolation of individual clones has been termed "metagenomics". Whole Genome Shotgun metagenomic studies, or metagenomes, are important for assessing microbial diversity in all ecosystems because the majority of microbial species are unknown and probably unculturable. Communities from unusual or extreme environments seem particularly likely to be rich sources of unknown organisms that may have evolved interesting or useful adaptations. Metagenomes may provide clues to the genetic and biochemical adaptations of these organisms. Perhaps more importantly, in the same way that single organism genomes can provide insights into the specific metabolic pathways in the organism, metagenomes may provide important insights into community metabolism.

Metagenome projects have added sequences to GenBank from unusual environments and communities, including streams impacted by acid mine drainage, open water and deep sea ocean communities, communities associated with whale falls in the deep ocean, and the chemoautotrophic symbiotic community associated with an annelid worm lacking a digestive or excretory system. There are also data from less exotic though still largely unknown communities such as those in the human gut and farm soil. In addition, metagenomic techniques have been applied to well-preserved remains to obtain sequences of extinct organisms, including the woolly mammoth and Neanderthal man.

Environmental Genome Projects: Metagenomes

The simplest access to the metagenomes is through the Environmental Projects link on the right-hand side of the Entrez Genome Projects homepage. More than 30 projects are available at the time of this writing. The Environmental Projects can also be retrieved with the following query from the Genome Projects homepage or from the search box on the NCBI homepage:

type_environmental[Properties]

These metagenome projects have varying amounts and types of data: some projects listed are in progress and have no data yet at NCBI, some have only Trace Archive sequences available, and several have WGS data available. Table 2 shows the environmental projects with WGS sequence at NCBI. As described above, all Project Overview pages in the Genome Projects database provide access to the data and other linked resources. In addition, each of the Environmental Projects has links to two specialized BLAST services: one that can search the nucleotide sequences and, in some cases, proteins for the specific metagenome, and one that allows searches against all or subsets of the metagenomes at once.

Example: The Gutless Worm Metagenome

The sediment-dwelling marine annelid, Olavius, entirely lacks a digestive system and has a highly reduced excretory system. This worm depends on a consortium of at least four bacterial symbionts to provide its nutritional and excretory needs. The metagenome for this consortium has provided important insights into relationships among the organisms in this symbiotic community. The consortium contains two sulfur-oxidizing gamma proteobacteria (γ1 and γ3) and two sulfate-reducing delta proteobacteria (δ1 and δ4). The complementary metabolisms of the two sets of bacteria provide each other with appropriate electron donors and acceptors and the worm with organic carbon and other nutrients while processing the worm's nitrogenous waste (1).

The genome-specific BLAST page allows searches against the metagenome of the chemoautotrophic bacterial symbionts. Translating BLAST searches quickly confirm the presence of the two sets of bacterial symbionts. Figure 3 shows the results of a translating BLAST search with a sulfite reductase (YP_387022) from Desulfovibrio, a delta proteobacterium. The sulfite reductase finds matches to all four of the symbionts. The best match, AASZ01000485, is a section of the partial assembly of the delta 1 symbiont (DS021230).

Figure 3


Figure 3. Results of a translating BLAST search (tblastn) against the Olavius metagenome using a Desulfovibrio sulfite reductase protein sequence (YP_387022) as a query. The results contain hits from all four of the symbionts, the two delta and two gamma proteobacteria. The most similar sequence is from the delta 1 symbiont. Use the following RID to retrieve live results: 1162499811-7887-105366644997.BLASTQ2.

Continued Growth

The rapid influx of WGS data in the form of organism and community genomes will continue for the foreseeable future. This new kind of data poses challenges for analysis and opens completely new perspectives as the scope widens beyond individual organisms to community genomes and pathways, even those of extinct communities. This vast and growing amount of largely unannotated sequence will continue to be most effectively searched through the NCBI BLAST services. The Entrez Genome Projects database will continue to provide the most convenient mechanism to access the WGS and other genomes at NCBI.

1. Woyke T, et al. 2006. Symbiosis insights through metagenomic analysis of a microbial consortium. Nature, 443(7114):950-5. PMID: 16980956
