| General tips for obtaining Entrez database statistics |
 |
You can determine the number of records in a given Entrez database by viewing the index of the
Filter field. Each database has the term "all" in its
Filter field. The number in parentheses beside that term is the number of records
currently present in the database.
For example, to see the number of records in the PubMed
database, follow these steps (the links will open in a separate window).
Similar steps can be used to see the number of records in PubMed Central, in the MMDB
Structure database, etc.
- From the Entrez
home page, follow the link for the PubMed database
- On the PubMed database page, select Preview/Index from
the grey area under the search box
There are two search boxes on the Preview/Index page: (a) the search box near
the top of the page shows the active query; (b) the search box near the bottom of
the page is like a "worksheet" that allows you to browse the index of a search
field of interest and/or to select one or more terms from the index for addition
to your active query
- Select the Filter field from the pop-up menu of searchable fields that
is shown beside the lower search box.
- Enter "all" (without quotes) as the search term and press the
Index button
A window will appear at the bottom of the page that allows you to see your term
in the index of the search field, and to browse up and down the index. (Tip: If
no term is entered in the search box before pressing the "Index" button, the
system will automatically take you to the first term in the index. Entering a
search term simply forces the system to jump to a specific part of the
index.)
- The number in parentheses beside the term "all" is the number of
records currently in the PubMed database.
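The same count can, in principle, be retrieved programmatically. The sketch below is not part of the procedure described above: it assumes the NCBI E-utilities ESearch service, the `all[filter]` query syntax, and the `rettype=count` parameter, none of which are mentioned on this page. The function names (`build_count_url`, `parse_count`, `fetch_record_count`) are hypothetical helpers introduced only for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the E-utilities ESearch endpoint (not named in the original page).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(db: str) -> str:
    """Build an ESearch URL asking only for the hit count of the term 'all' in the Filter field."""
    params = {"db": db, "term": "all[filter]", "rettype": "count"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(xml_text: str) -> int:
    """Read the integer held in the top-level <Count> element of an ESearch XML response."""
    root = ET.fromstring(xml_text)
    count = root.findtext("Count")
    if count is None:
        raise ValueError("no <Count> element found in the response")
    return int(count)


def fetch_record_count(db: str) -> int:
    """Query the service over the network and return the record count for one database."""
    with urlopen(build_count_url(db), timeout=30) as response:
        return parse_count(response.read().decode("utf-8"))
```

Calling `fetch_record_count("pubmed")` would perform a live request; the result should correspond to the number shown in parentheses beside "all" in the Filter index, assuming the service behaves as described here.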
Additional statistics for some Entrez databases are presented on
special web pages accessible through the links given in the sections below.
|
| Additional statistics web page for specific databases |
 |
Consensus CDS (CCDS) Database Statistics
The Consensus CDS (CCDS)
Database home page includes a link to statistics.
As noted on the home page, the Consensus CDS (coding sequence) project is a
collaborative effort to identify a core set of human protein coding regions that
are consistently annotated and of high quality. The long-term goal is to support
convergence towards a standard set of gene annotations on the human genome.
|
dbSNP Statistics
The dbSNP Summary
page displays statistics for the current dbSNP release, and also provides the
ability to view summary information for previous releases.
|
GenBank Growth Statistics
- Graph
- Numbers - Current GenBank Release Notes
- Numbers - Past GenBank Release Notes
The top of each GenBank Release Notes file shows the number of sequence records and
bases in that release.
Specific sections of the Release Notes include additional statistics:
- 2.2.6 (per-division statistics): current release only
- 2.2.7 (per-organism statistics): current release only
- 2.2.8 (growth of GenBank): from December 1982 through the present
To plot the growth of data for specific GenBank divisions or organisms,
compare the statistics in section 2.2.6 or 2.2.7, respectively,
from current and past Release Notes.
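Once the per-division or per-organism figures have been copied by hand from several Release Notes files, the comparison itself is simple arithmetic. The following is a minimal sketch under that assumption; `growth_between_releases` is a hypothetical helper, and the numbers in the usage note are placeholders, not real GenBank figures.

```python
def growth_between_releases(stats_by_release):
    """Return (release, value, change_from_previous_release) rows in release order.

    stats_by_release maps a release number to one statistic taken from that
    release's notes (for example, the base count for a single division).
    """
    rows = []
    previous = None
    for release in sorted(stats_by_release):
        value = stats_by_release[release]
        # The earliest release has nothing to compare against.
        change = None if previous is None else value - previous
        rows.append((release, value, change))
        previous = value
    return rows
```

For example, with the placeholder input `{1: 100, 2: 150, 3: 225}` the function returns `[(1, 100, None), (2, 150, 50), (3, 225, 75)]`; the resulting rows can then be passed to any plotting tool.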
|
Entrez Gene database statistics
The blue sidebar of the Entrez Gene
database homepage provides a link to statistics
summarizing the number of organisms represented in Entrez Gene from major taxonomic groups
such as Archaea, Bacteria, Eukaryota, Viroids, and Viruses, and the number of gene records currently
available for each of the organisms. Following the link for an individual organism name, such as
Homo sapiens,
displays a table showing the current and previous numbers of gene records for that organism.
|
Gene Expression Omnibus (GEO) Statistics
The upper right corner of the GEO home
page provides statistics summarizing the number of platforms, samples, and
series currently available in the database.
|
OMIM Statistics
The blue sidebar of the Online Mendelian
Inheritance in Man (OMIM) home page includes a link to OMIM statistics. The statistics page shows the total number of
records in the database, as well as a breakdown of the number of records in
categories that correspond to the MIM number prefixes:
| * | Genes with known sequence |
| + | Genes with known sequence and phenotype |
| # | Phenotype description, molecular basis known |
| % | Mendelian phenotype or locus, molecular basis unknown |
| no prefix | Other, mainly phenotypes with suspected Mendelian basis |
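The prefix scheme listed above can be expressed as a small lookup table. This is an illustrative sketch only: `MIM_PREFIX_CATEGORIES` and `mim_category` are hypothetical names, the category text is taken from the list above, and the entry strings in the usage note are made-up placeholders rather than real MIM numbers.

```python
# Category descriptions keyed by MIM number prefix; "" stands for "no prefix".
MIM_PREFIX_CATEGORIES = {
    "*": "Genes with known sequence",
    "+": "Genes with known sequence and phenotype",
    "#": "Phenotype description, molecular basis known",
    "%": "Mendelian phenotype or locus, molecular basis unknown",
    "": "Other, mainly phenotypes with suspected Mendelian basis",
}


def mim_category(entry: str) -> str:
    """Return the category for an entry written as an optional prefix plus a number."""
    entry = entry.strip()
    # Anything that does not start with one of the four symbols counts as "no prefix".
    prefix = entry[0] if entry and entry[0] in "*+#%" else ""
    return MIM_PREFIX_CATEGORIES[prefix]
```

For instance, `mim_category("#000000")` returns the "Phenotype description, molecular basis known" category, and `mim_category("000000")` falls into the "no prefix" category.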
|
Taxonomy Statistics
The NCBI
Taxonomy home page includes a link to taxonomy statistics.
By default, the cumulative, current statistics are shown for the number of higher
taxa, genera, species, and lower taxa represented in NCBI's taxonomy database.
The number
of taxa that were added in any particular year can be viewed by following the link
for the year of interest.
As noted in the Taxonomy database summary
description in the Resource Guide, the NCBI Taxonomy Database contains the
names and lineages of living and extinct organisms that are represented in the
genetic databases with at least one nucleotide or protein sequence. New organisms
are added to the database as sequence data are deposited for them. The purpose of
the taxonomy project at NCBI is to build a consistent phylogenetic taxonomy for
the sequence databases.
|
| Genome Statistics |
 |
Entrez Genome Database Statistics
The number of records available in the Entrez Genome
database can be determined using the approach described above under "General tips for obtaining Entrez database statistics" (by searching for "all"
in the Filter field). Note that an organism can have multiple records in the
Entrez Genome database (for example, one for each chromosome and one for each
plasmid or organellar genome).
The number of organisms represented in Entrez Genome from each main domain
of life -- archaea, bacteria, and eukaryota -- can be viewed by clicking on
the domain of interest in the blue sidebar of the page. The top of the resulting
page will show the total number of organisms in the group that have records in
Entrez Genome, followed by a list of the organism names. The blue sidebar of the
Entrez Genome home page provides similar links for viruses, viroids, and
organelles.
Whenever possible, statistics for individual genomes are provided as well, such as
the size of the genome (and/or individual chromosomes) in base pairs, the number
of various features annotated on a genome, etc. The method by which these
statistics can be accessed depends upon the software used to view a particular
genome. Entrez Genome employs two graphical viewing software programs: (1) an
original, basic viewer that is used to show smaller genomes, such as those of bacteria,
viruses, and organelles, and (2) a more powerful Map Viewer for larger and more
complex genomes, such as those of eukaryotes. Additional information is provided
in the next two sections on statistics for individual genomes.
|
Statistics for Individual Prokaryotic and Viral Genomes
For many organisms in Entrez Genome,
the number of bases in the genome is shown on the organism's
overview page. For example, the Escherichia
coli K12, complete genome page shows that there are 4639675 bp in the genome. The
page also includes a table listing statistics for the features that were annotated
on the genome (e.g., all genes, protein
coding genes, structural RNAs, pseudo genes, and other features).
|
Statistics for Individual Eukaryotic Genomes
The Map Viewer is a software
program that provides special searching, browsing, and viewing capabilities for a
growing subset of organisms in Entrez Genome. Each organism name shown on the Map
Viewer home page leads to a "genome view" for that organism. Whenever possible,
an organism's genome view page provides a link to the statistics for that
organism's current genome build.
For example, the top of the human genome
view page includes a link to the current build statistics (under the header
"Homo_sapiens genome view", follow the link for "Build XX.X statistics").
The build statistics summarize the types and quantities of data used in the genome
build, and the types and quantities of objects (e.g., genes, markers, ESTs,
phenotypes) placed on different types of maps (e.g., sequence, genetic,
cytogenetic). The chromosome lengths are displayed in the detailed graphical
views of individual chromosomes (also known as "map views"). The "View Summary"
at the bottom of each map view shows the number of objects on each map in the
display. More maps and objects can be displayed using the "Maps&Options" dialog
box.
There is also an umbrella
page that provides easy access to the build statistics for every organism
represented in Map Viewer for which statistics are available.
|
| Usage Statistics |
 |
PubMed Usage Statistics
PubMed
usage statistics show the number of searches from January 1997 through the
present. (The section "General tips for obtaining Entrez database statistics", above, explains how to
determine the number of records in PubMed.)
|
|