For a given virus, determine which isolates are represented in
the sequence databases
Sample User Question
How many different isolates of human immunodeficiency virus type 1 are
represented in the sequence databases? How many of these have a complete genome
sequence record in Entrez?
Comments / Analysis
Searching an Entrez database for a specific virus (using the Organism search
field) will retrieve all the records containing data from that source organism.
For example, a search of Entrez Protein for human immunodeficiency virus type
1[orgn] will retrieve almost 200,000 sequence records. However, it is not
possible to see at a glance which isolates are represented, and to then obtain
and compare data from the isolates of interest. This is because the
sequence records are not grouped by isolate, but are instead listed in reverse
chronological order based on modification date.
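The organism-restricted search described above can also be expressed as a query URL. The sketch below is a minimal illustration, assuming the NCBI E-utilities ESearch service is used instead of the Entrez web page (that service is not mentioned in this exercise); the function name organism_search_url is hypothetical. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint stands in for the
# Entrez web search form described in the text.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def organism_search_url(organism, database="protein"):
    """Build an ESearch URL that limits the query to the Organism field."""
    params = {
        "db": database,
        "term": f"{organism}[orgn]",  # [orgn] restricts the term to Organism
        "rettype": "count",           # ask only for the number of matches
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = organism_search_url("human immunodeficiency virus type 1")
print(url)
```

As with the web search, the result is a flat set of records for the organism; nothing in the query groups them by isolate.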
In contrast, the NCBI Taxonomy Database organizes all data in Entrez by organism.
Whenever submitters of sequence data include
information about lower taxonomic nodes (subspecies of animals, varieties of
plants, strains of bacteria, isolates of viruses, etc.), that information is used
to organize the data more finely in the Taxonomy Browser. The Taxonomy Browser
is therefore the best place to go for an organism-specific (or other taxon-specific) view
of the available data.
(We are using the word "organism" loosely in this exercise because viruses are
not true organisms. See also the note about the scope of the NCBI Taxonomy Database.)
Step By Step Guide
- Taxonomy Browser - retrieve the entry for human
immunodeficiency virus type 1 and view all the taxonomic nodes that fall beneath it
- enter human immunodeficiency virus type 1 as the query and leave
the search mode set to complete name
- the search results page will list that virus and all the
isolates for which we have sequence data
- simply count the number of isolates to answer the user's first question
(at the time of this writing, there are 38 isolates)
- to answer the user's second question and see at a glance which isolates,
if any, have a complete genome sequence in the database, check the box
beside Genome near the top of the page and press the Display button.
(At the time of this writing, only the general entry for human immunodeficiency
virus type 1 is associated with a complete genome record, not any of the
isolates.)
- to see at a glance the types and quantities of data available for the
isolates, check the box beside each data type of interest (e.g.,
nucleotide, protein, structure) near the top of the page and press the
Display button. (At the time of this writing, most of the isolates have
one to several protein sequence records, and a few of them have 3-D structure
records.)
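The counting and filtering done by eye in the steps above can be sketched in a few lines. This is only an illustration: the RESULTS table is a hypothetical stand-in for what the results page displays, the record counts are made-up placeholders (not real database figures), and "hypothetical isolate X" is not a real node name.

```python
# Hypothetical snapshot of a results page: node name -> records per database.
# All counts are placeholders chosen for illustration only.
PARENT_NODE = "Human immunodeficiency virus type 1"
RESULTS = {
    PARENT_NODE: {"Nucleotide": 5, "Protein": 9, "Genome": 1},
    "Human immunodeficiency virus type 1 (BH10 ISOLATE)": {"Protein": 3, "Structure": 1},
    "hypothetical isolate X": {"Protein": 2},
}


def isolates(results, parent):
    """Every listed node other than the general entry counts as an isolate."""
    return [name for name in results if name != parent]


def nodes_with(results, database):
    """Nodes that have at least one record in the given database."""
    return [name for name, counts in results.items() if counts.get(database, 0) > 0]


print(len(isolates(RESULTS, PARENT_NODE)))  # number of isolates listed
print(nodes_with(RESULTS, "Genome"))        # nodes with a genome record
```

With these placeholder values, only the general entry appears in the Genome list, which mirrors the situation described in the steps above.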
Submitters of sequence data provide various levels of detail about the source
organism they used. Some submitters, for example, only provide general
information (e.g., "Human immunodeficiency virus") in their GenBank
submission, while others indicate a type (e.g., "Human immunodeficiency
virus type 1", "Human immunodeficiency virus type 2"), and yet others indicate an
isolate (e.g., "Human immunodeficiency virus type 1 (BH10 ISOLATE)"). Data
from a source organism is organized under the most specific node that was
indicated by the submitter. For example, browse up the taxonomic tree to display
the Primate lentivirus group. Near the top of the display is the general
header "Human immunodeficiency virus". If submitters did not specify type, their
sequence records are grouped under that general category.
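The grouping rule described above (each record sits at the most specific node its submitter named, and broader nodes include everything beneath them) can be sketched as follows. The parent links in PARENT are an assumption made for illustration, loosely following the names quoted above; they are not copied from the Taxonomy Database, and the function names are hypothetical.

```python
from collections import Counter

# Assumed parent links (child -> parent), for illustration only.
PARENT = {
    "Human immunodeficiency virus": "Primate lentivirus group",
    "Human immunodeficiency virus type 1": "Human immunodeficiency virus",
    "Human immunodeficiency virus type 2": "Human immunodeficiency virus",
    "Human immunodeficiency virus type 1 (BH10 ISOLATE)":
        "Human immunodeficiency virus type 1",
}


def direct_counts(submitted_names):
    """Count records at exactly the node each submitter indicated."""
    return Counter(submitted_names)


def subtree_count(node, direct):
    """Count records at a node plus those at every node beneath it."""
    total = direct.get(node, 0)
    for child, parent in PARENT.items():
        if parent == node:
            total += subtree_count(child, direct)
    return total


# Four hypothetical submissions with different levels of detail.
submitted = [
    "Human immunodeficiency virus",
    "Human immunodeficiency virus type 1",
    "Human immunodeficiency virus type 1 (BH10 ISOLATE)",
    "Human immunodeficiency virus type 1 (BH10 ISOLATE)",
]
direct = direct_counts(submitted)
print(direct["Human immunodeficiency virus type 1"])                 # 1
print(subtree_count("Human immunodeficiency virus type 1", direct))  # 3
print(subtree_count("Primate lentivirus group", direct))             # 4
```

The record submitted without a type is counted only at the general "Human immunodeficiency virus" node and above, never under type 1 or type 2.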
Scope of NCBI Taxonomy Database
The NCBI Taxonomy Database contains the names and lineages of >130,000
organisms, both living and extinct, that are represented in the molecular biology
databases with at least one nucleotide or protein sequence. New organisms are
added to the database as sequence data are deposited for them. The purpose of the
taxonomy project at NCBI is to build a consistent phylogenetic taxonomy for the
sequence databases.
Taxonomy Browser Search Modes (complete name, token set, etc.)
The Taxonomy Browser offers several search modes, which can be selected
from the pop-up menu beside the text box:
- complete name looks for a complete common or scientific name of an
organism, or the complete name of any other taxonomic node, e.g., dog, Canis
familiaris, Canidae. However, a partial term such as Canid
will not retrieve any records because it is not a complete name.
- wild card allows searching with an asterisk as a wild card anywhere
in the string, e.g., Canid*, *anid*, or Ca*id
- token set searches for a term anywhere in a name, whether at the beginning, in
the middle, or at the end, e.g., dog. However, "dog" must appear as a complete word, and not
as a word stem, in the retrieved records. For example, the search will retrieve dog
hookworm and black-tailed prairie dog, but it will not retrieve dogfish sharks.
- phonetic name searches for names that sound like the query, e.g., a
search for "drosofila" will retrieve Drosophila.
- taxonomy id searches by TaxID, which is a unique identifier assigned
to every node in the taxonomic hierarchy. Some examples of TaxIDs: human (9606),
Mammalia (40674), Canidae (9608), and Canis familiaris (9615).
The "lock" check box preserves your selected search mode for subsequent
searches. If the box is not checked, the Taxonomy Browser will return to
the default search mode of complete name after each search.
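The difference between the complete name, wild card, and token set modes can be sketched with a small in-memory list of the names used as examples above. This is a rough approximation of the matching rules as described here, not the Taxonomy Browser's actual implementation; the function names are hypothetical, and Python's fnmatch also gives special meaning to ? and [ ], which the browser may not.

```python
import re
from fnmatch import fnmatchcase

# Example names taken from the descriptions above.
NAMES = ["dog", "Canis familiaris", "Canidae", "dog hookworm",
         "black-tailed prairie dog", "dogfish sharks"]


def complete_name(query):
    """Match only when the whole name equals the query (case-insensitive)."""
    return [n for n in NAMES if n.lower() == query.lower()]


def wild_card(pattern):
    """Match names against a pattern in which * stands for any characters."""
    return [n for n in NAMES if fnmatchcase(n.lower(), pattern.lower())]


def words(text):
    """Split text into lowercase whole words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_set(query):
    """Match names that contain every query word as a complete word."""
    wanted = words(query)
    return [n for n in NAMES if wanted <= words(n)]


print(complete_name("Canid"))  # []
print(wild_card("Canid*"))     # ['Canidae']
print(token_set("dog"))        # ['dog', 'dog hookworm', 'black-tailed prairie dog']
```

Note that token_set("dog") skips "dogfish sharks" because "dog" is only a word stem there, while wild_card("dog*") would match it.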