NCBI News

BLAST Lab


How to Search Huge Local Databases


Public sequence data is growing exponentially and is likely to keep doing so for the foreseeable future. This growth makes transferring, formatting, and searching gigabase-scale databases a practical problem. Given current data transfer rates, CPU speeds, and memory configurations, NCBI has implemented a “divide and conquer” approach in the current version of standalone BLAST. Features of standalone BLAST (blastall) and formatdb allow one to create and search arrays of smaller databases rather than a single huge database. This permits efficient searches of databases whose effective size far exceeds the RAM available on most small computer systems.



Standalone BLAST is able to search several databases sequentially with a single query using a syntax such as

blastall -i infile -d "part1 part2 part3" -p blastn -o out


In this case, the databases “part1”, “part2”, and “part3” have been created in the usual manner using formatdb with a syntax such as

formatdb -i part1 -o T -p F


The ability to name multiple databases in the blastall command line gives the user the flexibility to search an arbitrary group of databases that may be derived either from the division of a single huge source database or from several separate source databases. However, since each database must be formatted in a separate step, this process may become cumbersome if many databases are to be created.
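When many pieces need formatting, the separate formatdb steps can be wrapped in a shell loop. The following is a minimal sketch, not something shipped with BLAST: the function name format_all and the FORMATDB variable (which lets the command be swapped out, for example for echo to preview what would run) are assumptions made for illustration. The formatdb options are the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper: format each named source file as a nucleotide database.
# FORMATDB defaults to "formatdb"; set FORMATDB=echo to preview the arguments.
format_all() {
    for db in "$@"; do
        # Same options as the single-database example: -o T -p F
        "${FORMATDB:-formatdb}" -i "$db" -o T -p F || return 1
    done
}

# Example (preview only):
#   FORMATDB=echo
#   format_all part1 part2 part3
```

The loop stops at the first failure so that a partially formatted set is noticed before a search is attempted.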


A recent feature of formatdb streamlines the formatting process by creating several smaller database “volumes” automatically from a single huge source file. Furthermore, searches of these volumes are performed without explicitly naming each volume on the blastall command line.

To create a set of database volumes from a single source file, with a filename of “huge”, use formatdb with a syntax such as

formatdb -i huge -o T -p F -v 1000000000


This command line will create a number of database “volumes” from the source database file, each containing one billion base pairs or fewer, as specified by the “-v” option. Each volume name consists of the root database name followed by a two-digit volume extension (for example, “huge.00”), followed by the usual BLAST database extensions. These smaller databases can be searched as if they were a single entity using

blastall -i infile -d huge -p blastn -o out
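Before choosing a value for “-v”, it can help to know roughly how many volumes to expect. The sketch below counts the residues in a FASTA source file and reports a lower bound on the number of volumes for a given limit; the actual number that formatdb creates may be higher. The function name min_volumes is hypothetical, and the script assumes a plain FASTA file in which definition lines begin with “>”.

```shell
#!/bin/sh
# Hypothetical helper: count residues in a FASTA file and report the smallest
# number of volumes that a given per-volume limit could produce.
min_volumes() {
    fasta=$1
    limit=$2
    awk -v limit="$limit" '
        /^>/ { next }                          # skip definition lines
        { gsub(/[ \t\r]/, ""); total += length($0) }
        END {
            n = int(total / limit)
            if (n * limit < total) n++         # round up
            printf "%.0f bases, at least %d volume(s)\n", total, n
        }' "$fasta"
}

# Example: min_volumes huge 1000000000
```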


In this case, BLAST recognizes that the database “huge” has been partitioned into several volumes because it detects a file with the name of the root database followed by an extension of “nal” (for protein databases, the extension is “pal”). This file lists the databases to be searched when the root database name is given to BLAST. BLAST searches each database listed in this “nal” file in turn and generates output that is indistinguishable from that of a single database search. A sample “nal” file, resulting from formatting the data file “huge” into three volumes, is given below. The “DBLIST” line can also be edited to specify additional databases to be searched.

#
# Alias file created Tue Jan 18 13:12:24 2000
#
#
TITLE huge
#
DBLIST huge.00 huge.01 huge.02
#
#GILIST
#
#OIDLIST
#
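To check which databases an alias file will cause BLAST to search, the “DBLIST” line can be read back with a short script. This is only a sketch based on the layout of the sample above; the function name list_volumes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each database named on the DBLIST line of an
# alias file, one name per line.
list_volumes() {
    awk '$1 == "DBLIST" { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Example: list_volumes huge.nal
```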


The “nal” and “pal” files can also be used to simplify searches of multiple databases created separately as in the first example. For instance, a file called “multi.nal” containing the following lines could be created from scratch using a text editor.

#
# Alias file created Tue Jan 18 13:12:24 2000
#
#
TITLE multi
#
DBLIST part1 part2 part3
#
#GILIST
#
#OIDLIST
#


The “multi.nal” file would allow the three databases, “part1”, “part2”, and “part3”, to be searched by specifying a single database name, “multi”, on the blastall command line as follows:

blastall -i infile -d multi -p blastn -o out
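Instead of typing the alias file by hand, it could be generated by a small script. The sketch below mirrors the layout of the sample “multi.nal” file; the function name make_nal is hypothetical, and whether BLAST requires the comment lines is not stated here, so they are reproduced as in the sample.

```shell
#!/bin/sh
# Hypothetical helper: write NAME.nal listing the given nucleotide databases,
# following the layout of the sample alias file.
make_nal() {
    name=$1
    shift
    {
        echo "#"
        echo "# Alias file created $(date)"
        echo "#"
        echo "#"
        echo "TITLE $name"
        echo "#"
        echo "DBLIST $*"
        echo "#"
        echo "#GILIST"
        echo "#"
        echo "#OIDLIST"
        echo "#"
    } > "$name.nal"
}

# Example: make_nal multi part1 part2 part3
```

For protein databases the same approach would presumably apply with a “pal” extension, as described earlier.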


The BLAST Lab feature is intended to provide detailed technical information on some of the more specialized uses of the BLAST family of programs. Topics are selected from the range of questions received by the BLAST Help Group.



NCBI News | Winter 2000