New Entrez Genome Released on November 9, 2011
Historically, the Entrez Genome data model was designed for complete genomes of microorganisms (Archaea, Eubacteria, and Viruses) and a few eukaryotic genomes such as human, yeast, worm, fly and thale cress (Arabidopsis thaliana). It also included individual complete genomes of organelles and plasmids. Despite the name, each Entrez Genome database record has represented a chromosome (or organelle or plasmid) rather than a genome.
The new Genome resource uses a new data model where a single record provides information about the organism (usually a species), its genome structure, available assemblies and annotations, and related genome-scale projects such as transcriptome sequencing, epigenetic studies and variation analysis. As before, the Genome resource represents genomes from all major taxonomic groups: Archaea, Bacteria, Eukaryota, and Viruses. The old Genome database represented only RefSeq genomes, while the new resource extends this scope to all genomes either provided by primary submitters (INSDC genomes) or curated by NCBI staff (RefSeq genomes).
The new Genome database shares a close relationship with the recently redesigned BioProject database (formerly Genome Project). Primary information about genome sequencing projects in the new Genome database is stored in the BioProject database. BioProject records of type "Organism Overview" have become Genome records with a Genome ID that maps uniquely to a BioProject ID. The new Genome database also includes all "genome sequencing" records in BioProject.
What are the differences between the old and new Genome database?
- Single genome records now represent an organism and not a genome for one isolate. A record can contain multiple genomes of different strains/isolates or multiple assemblies of the same isolate/individual.
Examples:
- Escherichia coli is only one record in the new Genome database, as opposed to hundreds of chromosomes and plasmids for different strains.
http://www.ncbi.nlm.nih.gov/genome/167
- Mus musculus is only one record in the new Genome database, but it points to six different assemblies.
http://www.ncbi.nlm.nih.gov/genome/52
- Organelles and plasmids that are not part of the whole genome (chromosomes) are no longer indexed and therefore cannot be found by using Genome Search terms. They can be found only by an organism search.
- The scope of Genome has expanded from only RefSeq genomes to all genomes (both INSDC and RefSeq).
- The E-utilities (Entrez API) interface to Genome (db=genome) has changed to reflect the new data model. Please see the E-utility documentation for details:
http://www.ncbi.nlm.nih.gov/books/NBK25500/
For example, the Genome DocSum (Document Summary) has new fields:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=genome&id=176
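As a rough illustration of the db=genome interface, the sketch below builds an ESummary request URL like the one above and collects the name/value pairs from a returned DocSum. The helper names `build_esummary_url` and `docsum_fields` are hypothetical, and the parser assumes the `<DocSum><Id/><Item Name="..."/></DocSum>` XML layout; consult the E-utility documentation for the fields actually returned for Genome records.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL taken from the example in this announcement.
ESUMMARY_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def build_esummary_url(ids, db="genome"):
    """Return an ESummary URL for one or more record IDs (hypothetical helper)."""
    id_list = ",".join(str(i) for i in ids)
    # safe="," keeps the comma-separated ID list unescaped in the query string.
    return ESUMMARY_BASE + "?" + urlencode({"db": db, "id": id_list}, safe=",")


def docsum_fields(xml_text):
    """Map each record ID to a dict of its top-level DocSum item names and values.

    Assumes the <DocSum><Id/><Item Name="..."/></DocSum> layout.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for docsum in root.iter("DocSum"):
        record_id = docsum.findtext("Id")
        result[record_id] = {
            item.get("Name"): (item.text or "").strip()
            for item in docsum.findall("Item")
        }
    return result


if __name__ == "__main__":
    # Prints the request URL only; fetching it is left to the caller.
    print(build_esummary_url([176]))
```

Passing the response body of that request to `docsum_fields` gives a quick way to list which fields the new Genome DocSum carries, without hard-coding any field names.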
These differences are summarized in the following table.
| | Old Genome | New Genome |
|---|---|---|
| Unit | Single replicon | Genome information for an organism |
| Organism | Individual/isolate | Species (multi-isolate) |
| Scope | Genome, metagenome | Organism top level project with at least one genome sequencing project |
| Data scope | RefSeq only | RefSeq and INSDC |
| Data types | Sequenced genomes | Assembly, SRA, projects |
| Sequence types | Chromosomes, organelles, plasmids | Chromosomes, organelles, plasmids, scaffolds, contigs |
| Display types | Single | Multiple |
| Relations | Genome Project | BioProject, Assembly |
| Total count | 14,007 (6,613 taxids) | 6,218 (species level) |
How do I find data that used to be in the old Genome database?
- The new Genome IDs cannot be directly mapped to the old Genome IDs because the data types are very different. Each old Genome ID represented a single sequence that can still be found in Entrez Nucleotide using standard Entrez searches or the E-utilities. We recommend that you convert old Genome IDs to Nucleotide GI numbers using the following remapping file available on the NCBI FTP site:
ftp://ftp.ncbi.nih.gov/genomes/old_genomeID2nucGI
- The old tabular format has been changed. The new tables are available here:
http://www.ncbi.nlm.nih.gov/genome/browse/
Text versions of the tables can be downloaded from the NCBI FTP site:
ftp://ftp.ncbi.nih.gov/genomes/GENOME_REPORTS/
Previously developed tables (lproks.cgi, leuks.cgi) will be supported for the next two months, and are available on the NCBI FTP site:
ftp://ftp.ncbi.nih.gov/genomes/genomeprj/
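One way to apply the old-Genome-ID-to-GI remapping file after downloading it is sketched below. The file layout is an assumption: the sketch expects two whitespace-separated numeric columns per line (old Genome ID, then Nucleotide GI) and skips anything else, so inspect the actual file and adjust before relying on it. The function name `load_genome_to_gi` is hypothetical.

```python
def load_genome_to_gi(lines):
    """Build {old Genome ID: Nucleotide GI} from an iterable of text lines.

    Assumed layout: "<old_genome_id> <nucleotide_gi>" per line.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and rows without two numeric columns.
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        mapping[int(fields[0])] = int(fields[1])
    return mapping


if __name__ == "__main__":
    # Made-up rows for illustration only; these are not real IDs.
    sample = ["# old_genome_id nuc_gi", "1 1001", "", "2 1002"]
    print(load_genome_to_gi(sample))
```

With a local copy of the file, `load_genome_to_gi(open("old_genomeID2nucGI"))` would return the lookup table, and the resulting GI numbers can then be used in Entrez Nucleotide searches or E-utility calls.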
For further information about the new Entrez Genome database, contact NCBI's Help Desk.