NCBI News





Entrez Nucleotide Split Facilitates Focused Searches

The Entrez Nucleotide database is now partitioned into three components: two specialized components containing Expressed Sequence Tag (EST) and Genome Survey Sequence (GSS) records, respectively, and a third component, called 'CoreNucleotide', containing the rest of the nucleotide records (November 2005 issue of the NCBI News). About half of the 80 million Entrez Nucleotide records fall into the new EST component, with the remainder falling into the GSS and CoreNucleotide components as shown in the chart of Fig. 1.


Figure 1. A. Result counts for searches of Entrez Nucleotide are given for each of the three component databases, with links to Document Summaries. B. Preview/Index pages are now maintained separately for each of the component databases. Clicking on the 'CoreNucleotide' link will display its 'Preview/Index' form, allowing the existing query of 'helicase' to be refined using advanced query-building tools.

The split facilitates searches using specialized field limitations in each of the component databases. The component databases are displayed on the database selection pull-down list on any of the NCBI Entrez pages, and searches can still be performed against the combined nucleotide database through the global query or through the 'nucleotide' option. When a search is performed in the Nucleotide database, the counts for matching records in each component database are shown in a statistics line and linked to Document Summary displays (Fig. 1A). This display allows the search to be narrowed immediately to a specific component database. Within the EST and GSS component databases, the dbEST and dbGSS formats, respectively, are the current default display formats.

As a consequence of the database split, the 'Preview/Index' tab on the 'Nucleotide' page now displays an intermediate page, shown in Fig. 1B, through which separate 'Preview/Index' forms for the CoreNucleotide, GSS, and EST component databases may be selected. The 'Preview/Index' pages for the EST and GSS databases allow searches using fields that correspond to sections and identifiers present in the dbEST and dbGSS formats; these fields were not available prior to the division of the database.

The new fields for these components include Citation Title, Clone ID, Library Name, GSS/EST Name, GSS/EST ID, and Submitter Name. The GSS component also has a separate Library Class field that allows the selection of different types of genomic DNA libraries. Using these component-specific fields to limit the search can result in more precise retrieval than was possible before the division of the nucleotide system.

For example, one can retrieve all sequences from a specific cDNA library using the Library Name field limitation. Searching with the phrase Atlantic salmon spleen in the EST component database retrieves over 9,000 records. These are from several different cDNA libraries; all contain the words Atlantic salmon, or the taxonomic translation Salmo salar, and 'spleen' in one or more of the indexed fields. Finding all of these records may be desirable. However, the Library Name field can be used to retrieve sequences only from specific libraries. Library names are unique, and only those records with the exact phrase in the Library Name field of the native dbEST format are retrieved. The following query retrieves 5,551 records from just two libraries:

atlantic salmon spleen[Library Name] OR atlantic salmon spleen cdna library[Library Name]
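
A field-limited query like the one above can also be assembled programmatically, which may help when the list of library names is long. The following is a minimal Python sketch that only builds the query string and does not contact Entrez; the helper names `field_term` and `any_of` are hypothetical and are not part of any NCBI tool.

```python
from typing import Iterable


def field_term(phrase: str, field: str) -> str:
    """Return a term limited to one indexed field, e.g. 'x[Library Name]'."""
    return f"{phrase.strip()}[{field}]"


def any_of(terms: Iterable[str]) -> str:
    """Join terms with the Boolean OR operator."""
    return " OR ".join(terms)


# The two library names used in the example query in the text.
libraries = ["atlantic salmon spleen", "atlantic salmon spleen cdna library"]
query = any_of(field_term(name, "Library Name") for name in libraries)
print(query)
```

The resulting string can be pasted into the search box of the EST component database.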

Another useful feature of the separate index is the ability to search for clone identifiers. This is especially helpful for retrieving EST clones from the Integrated Molecular Analysis of Genomes and their Expression (IMAGE) repository:

The query 'IMAGE 8635484[Clone ID]' finds both the 3' and 5' EST reads from that clone. Likewise, matching end sequences from BAC clones can easily be retrieved from the GSS component with a similar query, for example CH252-49B15[Clone ID].

As is true for other Entrez databases, each nucleotide component database, as well as the combined nucleotide database, has advanced search options and functions available through Limits, History, and Clipboard. These are accessible from the corresponding tabs below the search box. Because the indexed fields differ for each of the component databases, the advanced features available through the Preview/Index and Details tabs can be used only after selecting a component database. Clicking on either of these tabs from a combined database search produces an information screen, as shown in Fig. 1B, with options to select the CoreNucleotide, EST, or GSS component database to access these features.

However, searches that include a field limiter valid in only one of the component databases can be run without error in the combined nucleotide database. For example, the following search in the umbrella Nucleotide database, which includes the GSS-indexed field limit [Library Class], returns BAC end sequences from the GSS database:

bac ends[Library Class] AND bos taurus[organism]
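
For scripted access, field-limited searches such as the Clone ID and Library Class examples could in principle be sent to NCBI's E-utilities instead of being typed into the Web form. The sketch below only constructs request URLs and does not send them. The `esearch.fcgi` endpoint and the database names `nucgss` and `nucest` are assumptions that do not come from this article, and they should be checked against the current E-utilities documentation before use; the helper `esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the E-utilities search service (not given in the article).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build, but do not send, a search URL for one component database."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-encodes the square brackets and turns spaces into '+'.
    return f"{ESEARCH_BASE}?{urlencode(params)}"


# 'nucgss' and 'nucest' are assumed database names for the GSS and EST components.
gss_url = esearch_url("nucgss", "bac ends[Library Class] AND bos taurus[organism]")
est_url = esearch_url("nucest", "IMAGE 8635484[Clone ID]")
print(gss_url)
print(est_url)
```

Keeping URL construction separate from any network call makes it easy to inspect the encoded query before running it.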

The standard options for displaying and saving records in various formats are available through the 'Send to' pull-down list in each of the component databases and on the combined results page. Batch retrieval through Batch-Entrez works as before, and identifiers valid in any of the three component databases can be uploaded for retrieval. If the list contains identifiers from EST, GSS, and CoreNucleotide, the split result counts for each of the component databases will be shown, as with standard Web Entrez searches. The entire results set, or the results for a specific component database, can be formatted and saved to a file. Search tips and help for the Nucleotide split database are found at:

—MR


NCBI News | Fall/Winter 2002 NCBI News: Spring 2003