Information Hubs

Viral isolates:
For a given virus, determine which isolates are represented in the sequence databases


Sample User Question

How many different isolates of human immunodeficiency virus type 1 are represented in the sequence databases? How many of these have a complete genome sequence record in Entrez?

Comments / Analysis

Searching an Entrez database for a specific virus (using the Organism search field) will retrieve all the records containing data from that source organism. For example, a search of Entrez Protein for human immunodeficiency virus type 1[orgn] will retrieve almost 200,000 sequence records. However, it is not possible to see at a glance which isolates are represented, and to then obtain and compare data from the isolates of interest. This is because the sequence records are not grouped by isolate, but are instead listed in reverse chronological order based on modification date.
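
The organism-restricted query above can also be written as a request URL. The following is a minimal sketch, assuming NCBI's E-utilities esearch endpoint and its db, term, and retmax parameters, none of which are covered in this exercise. It only builds the URL and does not contact NCBI. The helper name build_esearch_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint; it is not part of this exercise.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, organism, retmax=0):
    """Return an esearch URL restricted to one source organism via [orgn]."""
    params = {
        "db": database,
        "term": f"{organism}[orgn]",  # same Organism field tag as in the text
        "retmax": retmax,  # assumption: 0 asks for the hit count without IDs
    }
    return f"{ESEARCH_BASE}?{urlencode(params)}"


url = build_esearch_url("protein", "human immunodeficiency virus type 1")
print(url)
```

As with the interactive search, the result of such a query is a flat list of records; it says nothing about which isolates those records come from.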

In contrast, the NCBI Taxonomy Database organizes all data in Entrez by organism. Whenever submitters of sequence data include information about lower taxonomic nodes (subspecies of animals, varieties of plants, strains of bacteria, isolates of viruses, etc.), that information is used to organize the data more finely in the Taxonomy Browser. The Taxonomy Browser is therefore the best place to go for organism-specific (or other taxon-specific) views of the available data.

(We are using the word "organism" loosely in this exercise because viruses are not true organisms. See also the note on the scope of the NCBI Taxonomy Database under Additional Tips.)

Step By Step Guide

  • Taxonomy Browser - retrieve the entry for human immunodeficiency virus type 1 and view all the taxonomic nodes that fall beneath it

    • enter human immunodeficiency virus type 1 as the query and leave the search mode set to complete name
    • the search results page will list that virus and all the isolates for which we have sequence data
    • simply count the number of isolates to answer the user's first question (at the time of this writing, there are 38 isolates)
    • to answer the user's second question and see at a glance which isolates, if any, have a complete genome sequence in the database, check the box beside Genome near the top of the page and press the Display button. (At the time of this writing, only the general entry for human immunodeficiency virus type 1 is associated with a complete genome record, not any of the isolates.)
    • to see at a glance the types and quantities of data available for the isolates, check the box beside each data type of interest (e.g., nucleotide, protein, structure) near the top of the page and press the Display button. (At the time of this writing, most of the isolates have one to several protein sequence records, and a few of them have 3-D structure records.)

Additional Tips

Submitters of sequence data provide various levels of detail about source organism

Submitters of sequence data provide various levels of detail about the source organism they used. Some submitters, for example, only provide general information (e.g., "Human immunodeficiency virus") in their GenBank submission, while others indicate a type (e.g., "Human immunodeficiency virus type 1", "Human immunodeficiency virus type 2"), and yet others indicate an isolate (e.g., "Human immunodeficiency virus type 1 (BH10 ISOLATE)"). Data from a source organism is organized under the most specific node that was indicated by the submitter. For example, browse up the taxonomic tree to display the Primate lentivirus group. Near the top of the display is the general header "Human immunodeficiency virus". If submitters did not specify type, their sequence records are grouped under that general category.
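
The grouping rule described above can be illustrated with a small sketch. The record identifiers (rec1 to rec4) and the helper name group_by_node are hypothetical; only the organism names come from the text. This is an illustration of the rule, not a description of how the Taxonomy Database is implemented.

```python
from collections import defaultdict

# Hypothetical records: (record id, source organism as given by the submitter).
records = [
    ("rec1", "Human immunodeficiency virus"),
    ("rec2", "Human immunodeficiency virus type 1"),
    ("rec3", "Human immunodeficiency virus type 1 (BH10 ISOLATE)"),
    ("rec4", "Human immunodeficiency virus type 1"),
]


def group_by_node(records):
    """File each record under the most specific node its submitter named."""
    groups = defaultdict(list)
    for record_id, source in records:
        groups[source].append(record_id)
    return dict(groups)


groups = group_by_node(records)
for node, ids in groups.items():
    print(f"{node}: {ids}")
```

In this toy data, rec1 stays under the general node because its submitter did not specify a type, while rec3 is filed under the BH10 isolate.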

Scope of NCBI Taxonomy Database

The NCBI Taxonomy Database contains the names and lineages of >130,000 organisms, both living and extinct, that are represented in the molecular biology databases with at least one nucleotide or protein sequence. New organisms are added to the database as sequence data are deposited for them. The purpose of the taxonomy project at NCBI is to build a consistent phylogenetic taxonomy for the sequence databases.

Taxonomy Browser Search Modes (complete name, token set, etc.)

The Taxonomy Browser offers several search modes, which can be selected from the pop-up menu beside the text box:

  • complete name looks for a complete common or scientific name of an organism, or the complete name of any other taxonomic node, e.g., dog, Canis familiaris, Canidae. However, a partial term such as Canid will not retrieve any records because it is not a complete name.
  • wild card allows searching with an asterisk as a wild card anywhere in the string, e.g., Canid*, *anid*, or Ca*id.
  • token set searches for a complete word (token) anywhere in a name, whether at the beginning, in the middle, or at the end, e.g., dog. The query must appear as a complete word, not as a word stem, in the retrieved records. For example, the search will retrieve dog hookworm and black-tailed prairie dog, but it will not retrieve dogfish sharks or Doguera baboon.
  • phonetic name searches for names that sound like the query, e.g., a search for "drosofila" will retrieve Drosophila.
  • taxonomy id searches by TaxID, which is a unique identifier assigned to every node in the taxonomic hierarchy. Some examples of TaxIDs: human (9606), Mammalia (40674), Canidae (9608), and Canis familiaris (9615).

The "lock" check box preserves your selected search mode for subsequent searches. If the box is not checked, the Taxonomy Browser will return to the default search mode of complete name after each search.
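
The differences between the search modes can be sketched on a toy list of names. The following is a minimal illustration, assuming case-insensitive matching and shell-style wild cards from Python's fnmatch module; it is not the Taxonomy Browser's actual implementation, and all function names are hypothetical. The phonetic mode is omitted because the text does not say which phonetic algorithm is used.

```python
import re
from fnmatch import fnmatchcase

# Names and TaxIDs taken from the examples above; the lists are illustrative.
NAMES = [
    "dog", "Canis familiaris", "Canidae", "dog hookworm",
    "black-tailed prairie dog", "dogfish sharks", "Doguera baboon",
]
TAXIDS = {9606: "human", 40674: "Mammalia", 9608: "Canidae",
          9615: "Canis familiaris"}


def complete_name(query):
    """Match only when the whole name equals the query."""
    return [n for n in NAMES if n.lower() == query.lower()]


def wild_card(pattern):
    """Match the whole name against a pattern where * stands for any text."""
    return [n for n in NAMES if fnmatchcase(n.lower(), pattern.lower())]


def token_set(query):
    """Match when every query word occurs as a complete word in the name."""
    wanted = set(re.split(r"[^a-z0-9]+", query.lower()))
    return [n for n in NAMES
            if wanted <= set(re.split(r"[^a-z0-9]+", n.lower()))]


def taxonomy_id(taxid):
    """Look up a node by its numeric TaxID; None if it is not in the toy table."""
    return TAXIDS.get(taxid)


print(complete_name("Canid"))   # partial name: no match
print(wild_card("Canid*"))
print(token_set("dog"))
print(taxonomy_id(9608))
```

Here the token set search for dog skips dogfish sharks and Doguera baboon because "dog" is only a word stem in those names, which mirrors the behavior described in the list above.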

Revised 09/07/2006