NCBI News










Using the Advanced Features of Formatdb

NCBI provides commonly used BLAST databases in preformatted form on the BLAST ftp site. Other databases are provided in FASTA or Abstract Syntax Notation (ASN.1) format and must be prepared for BLAST with the formatdb program, which is included in the standalone BLAST package. This BLAST Lab describes some advanced features of formatdb that allow considerable flexibility in the manipulation and use of local BLAST databases.

Database subsets - one master database with many virtual aliases

For a given NCBI-provided database, one can create a virtual database subset using a GI list and database aliases. To create a human-specific protein database from the protein nr database, first, get the current formatted protein nr database from the BLAST ftp site at:

Then, retrieve the human-specific GI list from the Entrez/Protein page by searching with “human[orgn]”, displaying the result as “GI List”, and saving the list using the “Send to file” button.

Convert the GI list into binary format using:

formatdb -F input_GI_list -B output_GI_list
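Because the GI list is saved from a web page, it may be worth checking the file before running the conversion above. The following is a minimal sketch and not part of the BLAST package: the function name “clean_gi_list” and the sample values are hypothetical, and it assumes the saved file is plain text with one GI per line.

```shell
#!/bin/sh
# clean_gi_list (hypothetical helper): strip carriage returns, keep only
# lines that consist entirely of digits, and drop duplicate GIs.
# $1 = text GI list as saved from the browser.
clean_gi_list() {
    tr -d '\r' < "$1" | grep -E '^[0-9]+$' | sort -n -u
}

# Demo with a small made-up list containing a blank line, a duplicate,
# and a non-numeric line.
sample=$(mktemp)
printf '555\n\n1000\n555\nnot_a_gi\n' > "$sample"
clean_gi_list "$sample"   # prints 555 and 1000, one per line
rm -f "$sample"
```

The cleaned output could be redirected to a new file and that file given to the “-F” switch of the conversion command.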

Finally, create the database alias using:

formatdb -i nr -p T -F output_GI_list -L nr_human -t nr_human_subset

This procedure will create a database alias file named “nr_human.pal”, which specifies a virtual database containing the human subset of the nr database that can be searched using a BLAST command line such as:

blastall -i query -p blastp -d nr_human

Note that the database name used with the “-d” switch above lacks the “.pal” extension even though the alias file created by formatdb bears the extension.
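Since one master database can back many aliases, the two formatdb steps are easy to repeat for several subsets. The sketch below only prints the commands rather than running them, so it can be reviewed first. The helper name “alias_commands”, the “.bin” suffix, and the GI list file names are hypothetical; the formatdb switches are the ones used in the text.

```shell
#!/bin/sh
# alias_commands (hypothetical helper): print, without running, the two
# formatdb commands from the text for one subset of the nr database.
# $1 = text GI list file, $2 = alias name.
alias_commands() {
    gi_text=$1
    alias_name=$2
    # Step 1: convert the text GI list to binary form.
    printf 'formatdb -F %s -B %s.bin\n' "$gi_text" "$gi_text"
    # Step 2: create the alias that restricts nr to that GI list.
    printf 'formatdb -i nr -p T -F %s.bin -L %s -t %s_subset\n' \
        "$gi_text" "$alias_name" "$alias_name"
}

# Hypothetical GI list files, one per organism subset.
for org in human mouse; do
    alias_commands "${org}_gi.txt" "nr_${org}"
done
```

Once the printed commands look right, they could be piped to a shell in the directory that holds the formatted nr database.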

Formatting nucleotide and protein databases from a single file using ASN.1 source files

NCBI database files are provided in both FASTA and ASN.1 formats. ASN.1-formatted database files offer two advantages: 1) they are often smaller than the FASTA-formatted versions because the sequence data are compressed, and 2) they can be used to generate both a nucleotide and a protein BLAST database from annotated records, since the protein sequences from coding region annotations are integral parts of the ASN.1 sequence record. As an example, to create a nucleotide database from the completed E. coli O157:H7 genome, accession number NC_002655, from an ASN.1 source file called “NC_002655.asn”, use:

formatdb -i NC_002655.asn -p F -a T -b F -e T -o T -n E.coli.O157_nuc

To create a database from the protein sequences in the record, use:

formatdb -i NC_002655.asn -p T -a T -b F -e T -o T -n E.coli.O157_prot

The “-p” option in the command lines above indicates the type of database, as either protein (T) or nucleotide (F). The “-a T” option informs formatdb that the input file is in ASN.1 format, “-b F” indicates that the input file is not a binary file, and “-e T” indicates that the input file is a ‘seq-entry’ type ASN.1 file. The “-o T” option tells formatdb to parse the sequence identifiers and create indexes for them. We use “-n” to name the output database.
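After formatting, a quick check that the database files were written can save a failed search later. The sketch below assumes the usual formatdb naming, in which a nucleotide database has “.nhr”, “.nin” and “.nsq” files and a protein database has “.phr”, “.pin” and “.psq” files. The helper name “db_files_present” is hypothetical, and the demo uses empty stand-in files rather than real formatdb output.

```shell
#!/bin/sh
# db_files_present (hypothetical helper): succeed only if the three core
# files of a formatted database exist.
# $1 = database name, $2 = n (nucleotide) or p (protein).
db_files_present() {
    for ext in hr in sq; do
        [ -f "$1.$2$ext" ] || return 1
    done
    return 0
}

# Demo in a scratch directory: create stand-ins for the nucleotide
# database only, then check both names used in the text.
dir=$(mktemp -d)
touch "$dir/E.coli.O157_nuc.nhr" "$dir/E.coli.O157_nuc.nin" "$dir/E.coli.O157_nuc.nsq"
if db_files_present "$dir/E.coli.O157_nuc" n; then
    echo "nucleotide database files found"
fi
if ! db_files_present "$dir/E.coli.O157_prot" p; then
    echo "protein database files missing"
fi
rm -rf "$dir"
```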

Exporting FASTA-formatted sequences from a BLAST database

Finally, while formatdb is designed to begin with FASTA-formatted sequences and produce a BLAST database, a related program, “fastacmd”, can be used in the reverse sense to produce FASTA-formatted sequences from a BLAST database. For example, to extract all the sequences in a database named “blast_db” in FASTA format, set the “-D”, or “dump”, command line option to “T” and specify the database name using the “-d” switch as given below:

fastacmd -d blast_db -D T
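One simple way to check a dump is to count its FASTA definition lines. The sketch below assumes the dumped sequences have already been saved to a file; the helper name “count_fasta_records” and the sample records are hypothetical.

```shell
#!/bin/sh
# count_fasta_records (hypothetical helper): count definition lines,
# i.e. lines that start with ">", in a FASTA file.
# $1 = FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Demo with two made-up records.
dump=$(mktemp)
printf '>seq1 made-up record\nACGT\n>seq2 made-up record\nGGCC\nTTAA\n' > "$dump"
count_fasta_records "$dump"   # prints 2
rm -f "$dump"
```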

The fastacmd “-T” option can also be used to retrieve taxonomic information for sequences in preformatted NCBI databases, for example:

fastacmd -d nt -s 555 -T

The output of this command is:

NCBI sequence id: gi|555|emb|X65215.1|BTMISATN
NCBI taxonomy id: 9913
Common name: cow
Scientific name: Bos taurus
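Because each line of this report has the form “label: value”, a single field is easy to pull out in a script. The sketch below reads a report in the format shown above from standard input and prints the taxonomy id. The helper name “taxid_from_report” is hypothetical, and the demo feeds in the sample output from the text rather than calling fastacmd.

```shell
#!/bin/sh
# taxid_from_report (hypothetical helper): read a "label: value" report
# on standard input and print the value of the taxonomy id line.
taxid_from_report() {
    awk -F': ' '$1 == "NCBI taxonomy id" { print $2 }'
}

# Demo using the sample output shown in the text.
taxid_from_report <<'EOF'
NCBI sequence id: gi|555|emb|X65215.1|BTMISATN
NCBI taxonomy id: 9913
Common name: cow
Scientific name: Bos taurus
EOF
# prints 9913
```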

Other options, such as the “-I” option to retrieve database statistics, are also available. To see the full list of options, run fastacmd with a single dash and no parameters: “fastacmd -”.

The program “fastacmd” is also available within the standalone BLAST package on the BLAST ftp site at:

—TT



NCBI News: Spring 2003