Gene Expression Omnibus (GEO)
Major Histocompatibility Complex database (dbMHC)
RefSeq Release 1 Ready for Download
GenBank Release 137
New Microbial Genomes in GenBank
Sequence Revision History Page Offers New Comparison Function
Using the Advanced Features of Formatdb
NCBI provides commonly used BLAST databases in preformatted form on the BLAST ftp site. Other databases are provided in FASTA or Abstract Syntax Notation (ASN.1) format and must be formatted for BLAST before use. This is done with the formatdb program, which is included in the standalone BLAST package. This BLASTLab describes some advanced features of formatdb that allow considerable flexibility in the manipulation and use of local BLAST databases.
Database subsets - one master database with many virtual aliases
For a given NCBI-provided database, one can create a virtual database subset using a GI list and database aliases. To create a human-specific protein database from the protein nr database, first, get the current formatted protein nr database from the BLAST ftp site at:
Then, retrieve the human-specific GI list from the Entrez/Protein page by searching with human[orgn], displaying the results as a GI List, and saving the list using the Send to file button.
Convert the GI list into binary format using:
Finally, create the database alias using:
This procedure will create a database alias file named nr_human.pal, which specifies a virtual database containing the human subset of the nr database that can be searched using a BLAST command line such as:
Note that the database name used with the -d switch above lacks the .pal extension even though the alias file created by formatdb bears the extension.
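The three steps above can be sketched as argument lists assembled in Python. This is a minimal illustration, not the article's original command lines: it assumes the legacy formatdb flags -F (GI list file), -B (binary GI list to write), -L (alias name), -i (formatted master database) and -t (title), and the blastall flags -p, -d, -i and -o. The file names human_protein.gi.txt, human_protein.gi, query.fasta and query.out are hypothetical.

```python
import shlex

# Hypothetical file names, chosen only for this illustration.
GI_TEXT = "human_protein.gi.txt"   # GI list saved from Entrez, one GI per line
GI_BINARY = "human_protein.gi"     # binary GI list written by formatdb


def gi_to_binary_cmd(text_gi, binary_gi):
    """Convert a text GI list to binary form (-F input list, -B binary output)."""
    return ["formatdb", "-F", text_gi, "-B", binary_gi]


def alias_cmd(master_db, alias_name, gi_list, title):
    """Create <alias_name>.pal, restricting a formatted protein database to a GI list."""
    return ["formatdb", "-i", master_db, "-p", "T",
            "-L", alias_name, "-F", gi_list, "-t", title]


def blastp_cmd(db_name, query, output):
    """Search the alias; the name given to -d carries no .pal extension."""
    return ["blastall", "-p", "blastp", "-d", db_name, "-i", query, "-o", output]


steps = [
    gi_to_binary_cmd(GI_TEXT, GI_BINARY),
    alias_cmd("nr", "nr_human", GI_BINARY, "human subset of nr"),
    blastp_cmd("nr_human", "query.fasta", "query.out"),
]

for step in steps:
    # Print each command for review. To execute it, pass the list to
    # subprocess.run(step, check=True) once the BLAST binaries are installed.
    print(shlex.join(step))
```

Building the commands as lists keeps file names containing spaces intact and lets the same helpers create aliases for other organisms by swapping in a different GI list.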
Formatting nucleotide and protein databases from a single ASN.1 source file
NCBI database files are provided in both FASTA and ASN.1 formats. ASN.1-formatted database files offer two advantages: 1) they are often smaller than the FASTA-formatted versions because the sequence data are compressed, and 2) they can be used to generate both a nucleotide and a protein BLAST database from annotated records, since the protein sequences from coding region annotations are integral parts of the ASN.1 sequence record. As an example, to create a nucleotide database from the completed E. coli O157:H7 genome, accession number NC_002655, from an ASN.1 source file called NC_002655.asn, use:
To create a database from the protein sequences in the record, use:
The -p option in the command lines above indicates the type of database, as either protein (T) or nucleotide (F). The -a T option informs formatdb that the input file is in ASN.1 format, -b F indicates that the input file is not a binary file, and -e T indicates that the input file is a seq-entry type ASN.1 file. We use -n to name the output database.
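As a sketch of how the options described above fit together, the helper below builds the two formatdb invocations from the same ASN.1 file. It assumes that -i names the input file, a flag not mentioned in the text above. The output database names ecoli_nt and ecoli_aa are hypothetical.

```python
import shlex


def formatdb_asn1_cmd(asn_file, protein, db_name):
    """Build a formatdb command for a text ASN.1 Seq-entry input file."""
    return [
        "formatdb",
        "-i", asn_file,                 # input file (assumed flag)
        "-p", "T" if protein else "F",  # T = protein database, F = nucleotide
        "-a", "T",                      # input is ASN.1
        "-b", "F",                      # ASN.1 is text, not binary
        "-e", "T",                      # input is a Seq-entry
        "-n", db_name,                  # name of the output database
    ]


# Hypothetical output names; both databases come from the one source file.
nucleotide_cmd = formatdb_asn1_cmd("NC_002655.asn", False, "ecoli_nt")
protein_cmd = formatdb_asn1_cmd("NC_002655.asn", True, "ecoli_aa")

for cmd in (nucleotide_cmd, protein_cmd):
    print(shlex.join(cmd))
```

Only the -p value and the -n name differ between the two commands, which is what allows a single annotated record to yield both databases.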
Exporting FASTA-formatted sequences from a BLAST database
Finally, while formatdb is designed to begin with FASTA-formatted sequences and produce a BLAST database, a related program, fastacmd, can be used in the reverse direction to produce FASTA-formatted sequences from a BLAST database. For example, to extract all the sequences in a database named blast_db in FASTA format, set the -D, or dump, command line option to T and specify the database name using the -d switch as given below:
The fastacmd -T option can also be used to retrieve taxonomic information for sequences in preformatted NCBI databases, e.g.:
The output of this command is:
NCBI sequence id: gi|555|emb|X65215.1|BTMISATN
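The sequence id in this output is a pipe-delimited string. A minimal sketch of splitting such an id into its parts follows; the field names (gi, db, accession, version, locus) are labels chosen for this illustration, and the parser handles only the five-field "gi|...|db|accession.version|locus" layout shown above.

```python
from typing import NamedTuple


class SeqId(NamedTuple):
    gi: int          # GI number
    db: str          # source database tag, e.g. "emb" for EMBL
    accession: str   # accession without the version suffix
    version: int     # 0 when the id carries no version
    locus: str       # locus name; may be empty


def parse_seqid(text):
    """Split an id such as gi|555|emb|X65215.1|BTMISATN into its fields."""
    fields = text.strip().split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("unsupported sequence id layout: %r" % text)
    _, gi, db, acc_ver, locus = fields
    accession, _, version = acc_ver.partition(".")
    return SeqId(int(gi), db, accession, int(version) if version else 0, locus)


seqid = parse_seqid("gi|555|emb|X65215.1|BTMISATN")
print(seqid.gi, seqid.db, seqid.accession, seqid.version, seqid.locus)
```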
Other options, such as the -I option to retrieve database statistics, are also available. To see the full list of options, run fastacmd with a single dash and no other parameters: fastacmd -
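The fastacmd uses described in this section can be sketched as argument lists. This is an illustration under stated assumptions rather than the original command lines: it assumes -s selects a sequence by GI or accession, that -T and -I take T as their value, and that the GI 555 record is looked up in the nt database. The value accepted by -D may differ between releases, so check the option list for the installed version.

```python
import shlex


def dump_fasta_cmd(db_name):
    """Dump every sequence in a BLAST database as FASTA (-D T, as in the text)."""
    return ["fastacmd", "-d", db_name, "-D", "T"]


def taxonomy_cmd(db_name, identifier):
    """Report taxonomic information for one sequence (-s is an assumed selector)."""
    return ["fastacmd", "-d", db_name, "-s", str(identifier), "-T", "T"]


def db_info_cmd(db_name):
    """Print database statistics instead of sequences."""
    return ["fastacmd", "-d", db_name, "-I", "T"]


commands = [
    dump_fasta_cmd("blast_db"),
    taxonomy_cmd("nt", 555),   # database name "nt" is an assumption
    db_info_cmd("nt"),
]

for cmd in commands:
    # fastacmd writes to standard output; redirect it to keep the result.
    print(shlex.join(cmd))
```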
The program fastacmd is also available within the standalone BLAST package on the BLAST ftp site at: