Program Parameters for formatdb and fastacmd
- Two BLAST Database Related Tools
Tao Tao, Ph.D.
User Service
NCBI, NLM, NIH

1. Introduction

When discussing biological sequences, we generally refer to text strings in which each letter represents a nucleotide or an amino acid residue in the actual biological sequence. There are many ways to display a biological sequence, with FASTA being the most widely used.

The FASTA format was first adopted by the authors of the FASTA sequence alignment program [1]. In this format, a definition line (defline for short) beginning with ">" precedes the actual sequence. This defline contains a brief description of the sequence. NCBI further expanded the defline by using its first string as the seqID and breaking this string into pipe (|) separated fields to encode additional information.

In addition to FASTA, NCBI also provides sequences in GenBank (GenPept for protein), ASN.1, and XML formats. However, only FASTA formatted sequences can be used as input queries with command line standalone BLAST or client BLAST (blastcl3).

We cannot use sequences in the above formats directly as BLAST databases. Rather, we need to convert them into BLASTable format using formatdb, which takes only sequences in FASTA or ASN.1 format as input. Once we identify a hit of interest, we often need to get the entries out of a formatted BLAST database in human readable FASTA format. We can use the fastacmd program to accomplish this task.

In this document, we will go over the technical details of FASTA sequence deflines, the two database related programs, formatdb and fastacmd, their program parameters, and their practical usages.

We would like to point out that NCBI provides the common set of BLAST databases in preformatted form, generated directly from our relational databases. We break large databases into smaller, easy to handle volumes. These databases are readily blastable after inflation and extraction. Use these preformatted databases whenever you can. Users can quickly regenerate the FASTA sequences from these preformatted databases if they are needed. See this page for more information on available BLAST databases:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdb.html

    1.1 Biological sequences in FASTA format

A FASTA formatted sequence consists of a single comment line called the defline, which is marked by a ">" sign at the beginning followed by the description of the sequence. The defline terminates with a new line character and is followed by one or more lines of actual sequence, each terminated with a new line character. Deflines for FASTA sequences from NCBI follow a distinctive structure, which has several pipe (|) separated fields. Details are given below.

Table 1.1 Sequence ID (seqID) Fields in the FASTA Deflines of Sequences from NCBI

Database Name                       Identifier Syntax and Examples
GenBank (1)                         >gi|digits|gb|accession|locus
EMBL Data Library (1)               >gi|digits|emb|accession|locus
DNA Database of Japan               >gi|digits|dbj|accession|locus
NBRF PIR (2)                        >gi|digits|pir||entry
Protein Research Foundation (2)     >gi|digits|prf||name
SWISS-PROT (2)                      >gi|digits|sp|accession|entry name
Protein Data Bank (2)               >gi|digits|pdb|entry|chain
Patents (2)                         >gi|digits|pat|country|number
GenInfo Backbone Id                 >gi|digits|bbs|number
NCBI Reference Sequence (1, 2)      >gi|digits|ref|accession|
General database identifier (3)     >gnl|database|identifier
Local Sequence identifier (3)       >lcl|identifier

NOTE:
(1) Nucleotide defline examples:
>gi|304804|gb|L17338.1|DROZENA Drosophila pseudoobscura zen gene
>gi|293667|gb|L13590.1|MUSIMSWAL Mus musculus DNA sequence
>gi|18917|emb|X52321.1|HVBAMYL Barley mRNA for beta-amylase
>gi|15042013|dbj|AB055093.1| Bacillus sp. KSM-KP43 gene for 16S rRNA
(2) Protein defline examples:
>gi|68510037|ref|NP_766538.2| lipin 1 isoform a [Mus musculus]
>gi|24430466|gb|AAN61186.1| maturase [Perilla frutescens]
>gi|46090801|dbj|BAD13538.1| cytochrome b [Tanakia lanceolata]
>gi|223168|prf||0602187A protein,Leu,Ile,Val binding
>gi|50552|emb|CAA46200.1| protein kinase [Mus musculus]
>gi|51315829|sp|P84108|FLL1_ACEDI Flagellin-like protein
>gi|230326|pdb|1SGT| Trypsin (SGT) (E.C.3.4.21.4)
>gi|1082471|pir||S52920 disintegrin (EC 3.4.24.-) - human (fragment)
(3) For records that are not included in the NCBI Entrez database.
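
The defline layout in Table 1.1 can be illustrated with a short Python sketch. This is not part of the BLAST package; the function name parse_defline and the keys of the returned dictionary are our own choices for illustration. The sketch only splits the first string on pipe symbols and does not validate the database tag.

```python
def parse_defline(defline):
    """Split an NCBI-style FASTA defline into seqID fields and description.

    Hypothetical helper for illustration; it follows the layout in Table 1.1.
    """
    if not defline.startswith(">"):
        raise ValueError("a FASTA defline must start with '>'")
    # The seqID is the first whitespace-delimited string after '>'.
    parts = defline[1:].split(None, 1)
    seqid = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    fields = seqid.split("|")
    parsed = {"seqid": seqid, "description": description, "gi": None}
    if fields[0] == "gi" and len(fields) >= 3:
        parsed["gi"] = int(fields[1])
        fields = fields[2:]  # continue with the database-specific part
    parsed["db"] = fields[0]    # e.g. gb, emb, dbj, ref, gnl, lcl
    parsed["ids"] = fields[1:]  # remaining fields; empty ones are kept as ''
    return parsed


record = parse_defline(
    ">gi|304804|gb|L17338.1|DROZENA Drosophila pseudoobscura zen gene"
)
print(record["gi"], record["db"], record["ids"])
```

Empty fields, such as the one between the two pipes in "prf||0602187A", are kept as empty strings so that field positions stay aligned with the table.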

    1.2 Conversion to blastable format

Text files with sequences in FASTA or ASN.1 format cannot be used as BLAST databases directly during a BLAST search. To make them recognizable by BLAST, we need to format them using formatdb. When formatting a database with the "-o F" setting, formatdb generates 3 files for the input sequence file, all of which are required by BLAST programs.

If the sequence file is from NCBI, or its deflines conform to NCBI convention, formatdb is capable of parsing the seqIDs from the deflines to generate additional indexing files with the "-o T" setting. For sequence files with custom deflines, the exact number of files generated with "-o T" setting will depend on the actual format of the deflines.

For large databases, with size larger than one GB, formatdb automatically splits up the file and generates multiple volumes, each no larger than one GB (default setting: -v 1000). Each volume has its own set of database files. An alias file, generated automatically, ties the individual volumes together to form a large virtual database.
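
As a rough illustration of the volume arithmetic, the Python sketch below estimates how many volumes a database would need. It assumes the limit counts letters in millions, as Table 3.1.10 describes for -v, with 0 standing for the default of 1000. The function name is hypothetical, and the actual count produced by formatdb may differ, since this sketch ignores how sequences fall across volume boundaries.

```python
import math


def estimate_volume_count(total_letters, v=0):
    """Estimate the number of volumes for a database of total_letters.

    v mirrors the -v parameter: the volume limit in MILLIONS of letters,
    where 0 is taken to mean the default of 1000 (an assumption based on
    Table 3.1.10). This is an estimate only.
    """
    limit = (v if v > 0 else 1000) * 1_000_000
    return max(1, math.ceil(total_letters / limit))


# A 2.5 gigabase database under the default limit.
print(estimate_volume_count(2_500_000_000))
```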

    1.3 Sequence retrieval from formatted BLAST databases

We often need to retrieve the FASTA sequence for a specific entry within a BLAST database for visual inspection or other analyses. We can do so if the database was formatted with the "-o T" setting. What we need is fastacmd, another database related tool from the standalone BLAST package, in combination with the seqID for the entry. When using preformatted databases from NCBI, we can also obtain taxonomic information. For details, see Section 4.2.

2. Installation and configuration

No specific settings are needed for formatdb or fastacmd if the BLAST package was installed properly. For more information on the installation of BLAST packages, see pc_setup.html or unix_setup.html.

3. Program parameters

We will list the detailed program parameters for formatdb and fastacmd separately below.

    3.1 Command line parameters for formatdb

Command line parameters for formatdb are discussed here with each parameter listed in its own table.

Table 3.1.1
Parameter: -i
Function: Specifies the input file(s) to be formatted
Default: N/A
Input format: [File In]
Example: To format an input FASTA file my_seq.txt, use: -i my_seq.txt
Note:
This parameter is mandatory. It requires the full file name with extension. The input file should have sequences in FASTA or ASN.1 format, except when converting a gi list to binary form. To format multiple input files, quote the input file names as in -i "db1 db2". The FASTA output from other programs can be piped to this option using "-i stdin". Renaming the database with -n is mandatory when formatting multiple input files and recommended when reading from stdin. See Table 3.1.9.

Table 3.1.2
Parameter: -p
Function: The input type is protein
Default: T
Input format: [T/F]
Example: To format a nucleotide database, use: -p F
Note:
T: true, input is protein
F: false, input is nucleotide.

Table 3.1.3
Parameter: -o
Function: Parses deflines and indexes seqIDs
Default: F
Input format: [T/F]
Example: To enable seqID parsing and indexing, use: -o T
Note:
T: Parse SeqID and create indexes
F: Do not parse SeqID and do not create indexes.
For an input FASTA sequence file with NCBI styled deflines, use "-o T". Otherwise, use "-o F".

Table 3.1.4
Parameter: -t
Function: Adds a custom title to the database
Default: N/A
Input format: [String]
Example: To add the title "combined nt, est, and htgs", use: -t "combined nt, est, and htgs"
Note:
This adds a more descriptive title to the database, which is displayed in the header section of the BLAST output.

Table 3.1.5
Parameter: -l
Function: Specifies the logfile name
Default: formatdb.log
Input format: [File Out]
Example: None
Note:
The default setting is usually sufficient. We recommend that users check this log file after each formatdb run to make sure there is no obvious error.

Table 3.1.6
Parameter: -a
Function: The input file is in ASN.1 format
Default: F
Input format: [T/F]
Example: To format an ASN.1 file, use: -a T
Note:
The default is to expect sequences in FASTA format. Currently, multiple sequences in ASN.1 format downloaded from Entrez do NOT work properly with formatdb - only the first entry will be formatted.

Table 3.1.7
Parameter: -b
Function: The input ASN.1 file is in binary form
Default: F
Input format: [T/F]
Example: To format a binary ASN.1 file, use: -b T
Note:
T: binary ASN.1 file
F: text ASN.1 file expected
Use this with the -a T option.

Table 3.1.8
Parameter: -e
Function: The input ASN.1 is a seq-entry file
Default: F
Input format: [T/F]
Example: To set this to true, use: -e T
Note:
To format a sequence in ASN.1 form downloaded from Entrez, use: -e T

Table 3.1.9
Parameter: -n
Function: Renames the resulting database
Default: N/A
Input format: [String]
Example: To rename the formatted database to combined_nt, use: -n combined_nt
Note:
This parameter renames the formatted database to a name different from the input file, which is recommended when formatting input sequences piped from stdin. It is mandatory when formatting multiple input files. Do NOT combine -n with -L.

Table 3.1.10
Parameter: -v
Function: Sets the upper limit of database volume size, input in MILLIONS of letters
Default: 0
Input format: [Integer]
Example: To break the formatted database into 100 megabase volumes, use: -v 100
Note:
Zero invokes the default of 1000, which is one gigabase or 10^9 letters. If an input database is broken into multiple volumes, formatdb will automatically create an alias file with the db_name.nal extension, which ties all the volumes together. The complete database can be called using "-d db_name". See Table 3.1.13 and Section 4.1.6.

Table 3.1.11
Parameter: -s
Function: Creates sparse indexes - limited only to accessions
Default: F
Input format: [T/F]
Example: To activate this option, use: -s T
Note:
Activation of this parameter will reduce the size of the database indexing files.

Table 3.1.12
Parameter: -V
Function: Activates verbose mode and checks for non-unique IDs
Default: F
Input format: [T/F]
Example: To activate this warning, use: -V T
Note:
This prints warnings on screen if duplicate IDs are found.

Table 3.1.13
Parameter: -L
Function: Creates an alias file with this name
Default: N/A
Input format: [File Out]
Example: To create a nucleotide database alias named mouse_subset for nt from a gi list named mouse.gil, use: formatdb -i all -p F -F mouse.gil -L mouse_subset
Note:
It will use the GI file argument from -F to calculate the database size. Do NOT combine -L with -n.

Table 3.1.14
Parameter: -F
Function: Specifies an input GI file
Default: N/A
Input format: [File In]
Example: To input a GI file named mouse_gi, use: -F mouse_gi
Note:
It takes a text file with a list of GIs from Entrez or Eutils.

Table 3.1.15
Parameter: -B
Function: Generates a binary GI file from the text GI file specified in -F
Default: N/A
Input format: [File Out]
Example: To generate the binary GI file worm.gil from the input text GI list worm, use: formatdb -F worm -B worm.gil
Note:
This converts the -F input to a more efficient binary format. The resulting file can be used in database aliases or by the -l parameter of BLAST programs while searching against a preformatted database from NCBI.

Table 3.1.16
Parameter: -T
Function: Reads in taxonomic information and writes the group bit to the ASN.1 defline
Default: Optional
Input format: [File In]
Example: To read in gi_taxid_prot.dmp, use: -T gi_taxid_prot.dmp
Note:
This parameter allows formatdb to read in a file with gi/taxid information and write the taxid information to the ASN.1 defline. The inputs are gi_taxid*.dmp files from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

    3.2 Command Line Parameters for fastacmd

The program fastacmd works on a formatted BLAST database generated by formatdb. Its program parameters are discussed below in their individual tables.

Table 3.2.1
Parameter: -d
Function: Specifies the input database
Default: nr
Input format: [String]
Example: To use the ecoli database, use: -d ecoli
Note:
fastacmd will NOT be able to retrieve specific entries from databases formatted with "-o F". To work with multiple databases simultaneously, use -d "db1 db2". Databases in quotes must be of the same type.

Table 3.2.2
Parameter: -p
Function: Specifies the database type
Default: G
Input format: [G/T/F]
Example: To work with a protein database, use: -p T
Note:
G: guess mode, looking for protein first, then nucleotide
T: protein
F: nucleotide

Table 3.2.3
Parameter: -s
Function: Specifies the seqID or text strings
Default: N/A
Input format: [String]
Example: To retrieve sequences/information for gi 5556, use: -s 5556
Note:
Use a GI, accession, or text string (for a custom database). Multiple entries need to be comma-delimited as in "-s AF123456,U12345". Specific sequence retrieval from a custom database with deflines conforming to NCBI format will need the first text string. See Section 4.2.2.

Table 3.2.4
Parameter: -i
Function: Specifies the input file containing GIs, accessions, or text strings for batch retrieval
Default: N/A
Input format: [String]
Example: To batch retrieve sequences specified by gi_list, use: -i gi_list
Note:
The input file must be a text file with one entry per line. The complete file name with extension should be used.

Table 3.2.5
Parameter: -a
Function: Retrieves duplicate accessions
Default: F
Input format: [T/F]
Example: None
Note:
The "-a T" setting retrieves all entries with deflines containing the input text string. For example, it is quite common for pdb entries of different chains to have the same record id, but with different chain numbers. The command line "fastacmd -d pdb -s 2hhd -a T" retrieves all the entries, while "fastacmd -d pdb -s 2hhd" retrieves only some of them.

Table 3.2.6
Parameter: -l
Function: Specifies the line length of the returned sequences
Default: 80
Input format: [Integer]
Example: To set the line length to 70, use: -l 70
Note:
Changing the default is not recommended.

Table 3.2.7
Parameter: -t
Function: Requires the defline to contain the target GI
Default: F
Input format: [T/F]
Example: To make sure "-s 511603" retrieves only the entry with this gi in the defline, add: -t T
Note:
It only affects compound entries from non-redundant databases such as protein nr.

Table 3.2.8
Parameter: -o
Function: Specifies the output file
Default: stdout, print to screen
Input format: [String]
Example: To save the output to a file called my_hit, use: -o my_hit
Note:
Redirection using "|" or ">" also works.

Table 3.2.9
Parameter: -c
Function: Uses Ctrl-A's as non-redundant defline separator
Default: F
Input format: [T/F]
Example: To keep the Ctrl-A in the defline, use: -c T
Note:
It is only for non-redundant databases, such as protein nr.

Table 3.2.10
Parameter: -D
Function: Dumps the entire database
Default: 0
Input format: [Integer]
Example: To dump the database in FASTA format, use: -D 1
Note:
It overrides all other options except -I, and accepts the following values:
0: No dump; 1: FASTA; 2: GI list; 3: Accession.version

Table 3.2.11
Parameter: -L
Function: Specifies the subsequence range
Default: 0,0
Input format: [String]
Example: To get the subsequence from 10 to 200, use: -L 10,200
Note:
In the default setting, 0 in 'start' refers to the beginning of the sequence and 0 in 'stop' refers to the end of the sequence. Double quote the input if it contains white space(s). fastacmd will apply the input to all retrieved records in a batch retrieval.

Table 3.2.12
Parameter: -S
Function: Specifies the strand of nucleotide sequence to retrieve
Default: 1
Input format: [Integer]
Example: To retrieve the reverse complement, use: -S 2
Note:
fastacmd will apply the input to all retrieved entries during batch retrieval. Values and functions are:
1: The entry itself; 2: The reverse complement of the entry
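
For readers unfamiliar with the term, the Python sketch below shows what a reverse complement is. It is an independent illustration, not the fastacmd implementation. The function name is hypothetical, and the handling of letters other than A, C, G, T, U, and N (left unchanged here) is an assumption of this sketch.

```python
# Complement table for the common nucleotide letters. Other IUPAC ambiguity
# codes are left unchanged, which is a simplification of this sketch.
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence string."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```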

Table 3.2.13
Parameter: -T
Function: Prints taxonomic information for requested sequence(s)
Default: F
Input format: [T/F]
Example: To print taxonomic information, use: -T T
Note:
This works only for preformatted databases provided by NCBI and requires the installation of the taxdb.tar.gz archive. See Section 4.2.3 for more information.

Table 3.2.14
Parameter: -I
Function: Prints database information only
Default: F
Input format: [T/F]
Example: To get the database information, use: -I T
Note:
Prints title, database type, total length, and number of sequences for the target database specified by -d. Overrides all other parameters.

Table 3.2.15
Parameter: -P
Function: Retrieves sequences with PIG ID
Default: N/A
Input format: [Integer]
Example: To retrieve PIG 234 from nr, use: -d nr -P 234
Note:
PIG stands for "Protein Identification Group". Each PIG contains one or more protein entries with the exact same sequence. A PIG number list or table is NOT available to the public at this time.

4. Practical usage

In this section, we will present some additional information on these two programs and discuss their practical usages.

    4.1 Using formatdb

The function of this program is to convert sequence files to blastable format, index the entries, and encode additional information to the ASN.1 defline to make the resulting databases more useful.

        4.1.1 Number of formatdb output files: 3, 5, 7 and 9

The program formatdb processes the input sequence file and generates a varying number of output files. The exact number of files generated depends on whether the "-o T" option is used, whether the sequence deflines conform to NCBI format, whether they are from the NCBI Entrez database, and whether they are protein or nucleotide.

Table 4.1.1 Formatdb Output List

Nucleotide db     Protein db        Content            Format
file extension    file extension
.nhr              .phr              Deflines           binary
.nin              .pin              Indices            binary
.nsq              .psq              Sequence data      binary
Additional files generated using the "-o T" setting *
.nnd              .pnd              GI data            binary
.nni              .pni              GI indices         binary
.nsd              .psd              Non-GI data        binary
.nsi              .psi              Non-GI indices     binary
-                 .ppd              PIG data           binary
-                 .ppi              PIG indices        binary
NOTE:
* The number of files produced with "-o T" depends on the defline format.
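
The file counts in the section title (3, 5, 7 and 9) follow from Table 4.1.1. The Python sketch below lists the extensions a database may have; the function name is hypothetical. For "-o T" it returns the largest possible set, since the actual files produced depend on the defline format.

```python
CORE = ("hr", "in", "sq")             # always generated
PARSED = ("nd", "ni", "sd", "si")     # GI and non-GI data/indices with -o T
PIG = ("pd", "pi")                    # PIG data/indices, protein only


def possible_extensions(protein, parse_seqids):
    """List the file extensions from Table 4.1.1 that may be generated.

    With parse_seqids=True this is an upper bound, not a guarantee.
    """
    prefix = "p" if protein else "n"
    suffixes = list(CORE)
    if parse_seqids:
        suffixes += PARSED
        if protein:
            suffixes += PIG
    return ["." + prefix + suffix for suffix in suffixes]


print(possible_extensions(protein=False, parse_seqids=False))
print(len(possible_extensions(protein=True, parse_seqids=True)))
```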

We recommend that users use NCBI preformatted BLAST databases whenever possible. These preformatted BLAST databases are split into smaller, easy to handle volumes. They also contain added Linkouts and taxonomic information. Users can fully exploit the linkout information under the wwwblast setup [2].

        4.1.2 Format a custom database with entries from NCBI Entrez

Sequences obtained from the NCBI Entrez database or E-utilities [3] should be formatted with the "-o T" option. For batch sequences, only FASTA format should be used at this time. For an annotated genome or genomic segment, it is advantageous to use the ASN.1 format as input to formatdb, since we can use it as a nucleotide or a protein input. When such a record is used as a protein input to formatdb, we will format the annotated CDS, or the protein products, into a BLASTable protein database.

If the preformatted database containing the target entries of interest is available from NCBI's BLAST db ftp directory, users can instead create a database alias using a GI list, without formatting a separate database. We will discuss the details in Section 4.1.3.

The "-T" option was added to formatdb 2.2.12, which allows the incorporation of taxonomic information into the ASN.1 defline using the gi_taxid_nucl.dmp or gi_taxid_prot.dmp from the taxonomy ftp site at:

ftp.ncbi.nlm.nih.gov/pub/taxonomy/ 

The following command line formats the bacterial_protein FASTA file and adds the taxid information to the ASN.1 deflines:

formatdb -i bacterial_protein -p T -o T -T gi_taxid_prot.dmp

        4.1.3 Use a GI list to create an alias file for a master database

All preformatted BLAST databases from NCBI can be used in conjunction with a GI list fed to the parameter -l to restrict a given BLAST search to a subset of entries delimited by that list. A more efficient and informative way to use GI list, however, is to generate a binary GI list based database alias using formatdb.

We can readily generate a GI list by searching in the Entrez Nucleotide or Protein database. Correct usage of GI list BLAST requires a good understanding of the sequence partition among the available BLAST databases. We also need to know that BLAST programs will not report an error or warning if a sequence specified by a GI is missing from the target database.

For example, a GI list representing human mRNAs, obtained from Entrez Nucleotide using "human[orgn] AND biomol_mrna[prop]", contains GIs for ESTs as well. GIs representing these ESTs will have no corresponding entries in nt. When this GI list is used as input to -l to limit a BLAST search against nt, BLAST will not report errors on the missing entries.

formatdb, on the other hand, will check the GI list during alias construction to verify the presence of the entries specified by the GI list. This function helps avoid confusion further downstream, e.g. at the BLAST result analysis stage. For reference, we list the Entrez query approximation for the available BLAST databases below.

Table 4.1.3 Entrez approximation for preformatted databases (1)

Database Name               Entrez Query Approximation
Protein (2)
nr                          all[filter] NOT environmental sample[filter] NOT gbdiv_pat[prop]
swissprot                   srcdb_swiss_prot[prop]
pdb                         srcdb_pdb[prop]
refseq_protein              srcdb_refseq[prop]
env_nr (3)                  environmental sample[filter]
pat                         gbdiv_pat[prop]
Nucleotide
nt                          all[filter] NOT (gbdiv_est[prop] OR gbdiv_gss[prop] OR gbdiv_sts[prop] OR gbdiv_pat[prop] OR gbdiv_htg[prop] OR (srcdb_refseq[prop] AND biomol_genomic[prop]) OR environmental sample[filter] OR wgs[prop])
refseq_rna                  srcdb_refseq[prop] AND biomol_rna[prop]
refseq_genomic (4)          N/A
est                         gbdiv_est[prop]
est_human                   gbdiv_est[prop] AND human[orgn]
est_mouse                   gbdiv_est[prop] AND mouse[orgn]
est_others                  gbdiv_est[prop] NOT (mouse[orgn] OR human[orgn])
human_genomic_transcript    N/A
mouse_genomic_transcript    N/A
htgs                        gbdiv_htg[prop]
gss                         gbdiv_gss[prop]
sts                         gbdiv_sts[prop]
wgs                         wgs[prop]
pat                         gbdiv_pat[prop]
pdb                         gbdiv_pdb[prop]
env_nt (3)                  environmental sample[filter]
human_genomic               NC_000001:NC_000024[accn] OR AC_000044:AC_000068[accn]
other_genomic (3)           N/A
NOTE:
(1) The query is only an approximation, provided for use in combination with other terms to get the GIs for a subset of sequences, which can be used to limit a BLAST search to that subset through the -l option. Due to the size of the databases and the time needed to update them, the content of BLAST databases will LAG behind Entrez.
(2) Protein BLAST databases are non-redundant, while the Entrez approximation is NOT.
(3) Some entries in the Entrez approximation are in protein nr or nucleotide nt.
(4) Currently, there is no Entrez approximation for these two databases.

We can combine additional Entrez query with the approximation, using boolean operator AND or NOT, to retrieve a subset for that database. We can save the GIs of the retrieved records by first displaying them as "GI list" followed by using the "Send to" file button to save the GI. For more information, see Entrez Help.
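
The combination of an approximation with extra terms can be sketched in Python as plain string assembly. The function name and its parameters are hypothetical, and the sketch does not contact Entrez; it only builds the query text, wrapping each part in parentheses so that the grouping does not depend on operator evaluation order.

```python
def restrict_query(approximation, must_have=(), must_not=()):
    """Combine a database approximation from Table 4.1.3 with extra terms.

    Terms in must_have are joined with AND, terms in must_not with NOT.
    """
    query = "(" + approximation + ")"
    for term in must_have:
        query += " AND (" + term + ")"
    for term in must_not:
        query += " NOT (" + term + ")"
    return query


# Mouse entries within the est approximation.
print(restrict_query("gbdiv_est[prop]", must_have=["mouse[orgn]"]))
```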

To turn a text GI list file, mouse.n.gi, into a more efficient binary form, we can use the formatdb command line below.

formatdb -F mouse.n.gi -B mouse.n.gil

To generate a database alias using the resulting binary mouse.n.gil file, we can use this formatdb command line.

formatdb -i parent_db -p F -F mouse.n.gil -L mouse.n.subset

Here parent_db is the name of the preformatted parent database, and the alias generated by the command line is named mouse.n.subset. A search against this alias database will be against the subset of sequences specified by the GI list.

A database alias also allows one to give a database subset a more meaningful name, use a shorter command line in actual searching, keep fewer sets of actual databases, and reduce the maintenance needed in a group environment.

        4.1.4 Format multiple input files

We can use formatdb to format multiple input files into a single BLAST database. To do so, we need to quote the input files to be formatted and provide them to the -i parameter. It is mandatory to use the -n option to name the resulting database, since formatdb cannot use the multiple input file names to name it.

formatdb -i "db1.fa db2.fa" -n all.db -t "combined db1 db2"

The -t parameter in the above command line creates a descriptive title for the resulting database, which will appear in the header of the search result and help us track the BLAST result.
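
When formatdb is driven from a script, the rule about -n can be enforced before the program is called. The Python sketch below builds an argument list for the command line above; the function name and its parameters are hypothetical, and it uses only the formatdb options described in Section 3.1. It passes -p and -o explicitly rather than relying on their defaults.

```python
def formatdb_args(inputs, protein=True, parse_seqids=False, name=None, title=None):
    """Build a formatdb argument list for one or more input files.

    Raises ValueError if several inputs are given without a database name,
    since -n is mandatory in that case.
    """
    if not inputs:
        raise ValueError("at least one input file is required")
    if len(inputs) > 1 and name is None:
        raise ValueError("-n is mandatory when formatting multiple input files")
    args = [
        "formatdb",
        "-i", " ".join(inputs),  # one argument, like the quoted "db1 db2"
        "-p", "T" if protein else "F",
        "-o", "T" if parse_seqids else "F",
    ]
    if name:
        args += ["-n", name]
    if title:
        args += ["-t", title]
    return args


print(formatdb_args(["db1.fa", "db2.fa"], name="all.db", title="combined db1 db2"))
```

The resulting list could be handed to subprocess.run(args, check=True), assuming formatdb is installed and on the search path.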

        4.1.5 Format custom database

The term "custom databases" here refers to sequences from users or other third party sources. Most of those databases should be formatted with "-o F" setting, unless their deflines follow NCBI convention.

Formatting a custom database with deflines in NCBI convention and the "-o T" setting requires that each defline start with a unique first string, since this is the field formatdb indexes to generate the additional indexing files. The additional indexing allows specific retrieval of sequences using their unique first string and fastacmd.
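
Before formatting with "-o T", the uniqueness requirement can be checked with a few lines of Python. This is an independent sketch and not a replacement for the -V option of formatdb; the function name is hypothetical, and it compares the first strings exactly as they appear in the deflines.

```python
from collections import Counter


def duplicate_first_strings(fasta_lines):
    """Return the sorted first strings that occur in more than one defline.

    fasta_lines is any iterable of text lines, such as an open file handle.
    """
    counts = Counter(
        line[1:].split(None, 1)[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(seqid for seqid, n in counts.items() if n > 1)


example = [
    ">lcl|seq1 first entry\n", "ACGT\n",
    ">lcl|seq2 second entry\n", "GGCC\n",
    ">lcl|seq1 repeated identifier\n", "TTAA\n",
]
print(duplicate_first_strings(example))
```

For a file on disk, the same function can be applied to the open file handle, for example with open("my_seq.txt") as handle.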

Sometimes, we may encounter problems when searching a custom BLAST database formatted with the "-o T" setting. To resolve the issue, we need to go back to the FASTA sequences and reformat them using the "-o F" setting. If the FASTA sequences are not available, we can dump them out of the formatted database using fastacmd.

        4.1.6 Alias file structure

BLAST database aliases are text files with database configuration information. They can be created automatically by formatdb or manually. The alias file name follows database.##.*** convention, where the .## are optional volume numbers and the .*** are .pal or .nal file extension, representing protein and nucleotide aliases, respectively.

An alias file can tie multiple databases together to form a larger virtual database. It can also specify a subset of sequences within a large master database to form a smaller virtual database. Information on the number of sequences and their total length in the virtual database can be included in an alias file. BLAST will use them for the Expect value calculation. The alias below, named zebrafish.pal, specifies a virtual database for zebrafish entries found in the nr protein database.

#
# Alias file created Thu Jul  5 15:04:29 2001
#
TITLE My zebrafish database
#
DBLIST nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724
#

The alias content below is from est_others.nal, which was generated automatically by formatdb. It ties up the individual est_others.## volumes into one complete database.

#
# Alias file created Tue Jan 30 18:04:04 2007
#
TITLE GenBank non-mouse and non-human EST entries
#
DBLIST est_others.00 est_others.01 est_others.02 est_others.03 est_others.04 
#
#GILIST
#
#OIDLIST
#
Note:
# marks a commented line. All other lines should contain no line break.
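
The alias layout shown above is simple enough to read with a short Python sketch. The function name and the choice of return types are our own: DBLIST becomes a list of database names, and NSEQ and LENGTH become integers. Only the keywords that appear in the two examples are given special handling.

```python
def parse_alias(text):
    """Parse the text of a BLAST database alias file into a dictionary.

    Lines starting with '#' are comments and are skipped, so a commented
    keyword such as '#OIDLIST' does not appear in the result.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if "DBLIST" in settings:
        settings["DBLIST"] = settings["DBLIST"].split()
    for key in ("NSEQ", "LENGTH"):
        if key in settings:
            settings[key] = int(settings[key])
    return settings


zebrafish = """#
TITLE My zebrafish database
#
DBLIST nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724
#
"""
print(parse_alias(zebrafish))
```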

    4.2 On fastacmd

The database tool fastacmd allows us to work with a formatted BLAST database for purposes other than sequence alignment. These include dumping FASTA sequences, getting summary information, retrieving a specific sequence or subsequence, and extracting the taxonomic information for specific entries. Dealing with specific entries requires a database formatted with "-o T".

        4.2.1 Database information and database to FASTA sequence conversion

To get a brief summary of a BLAST database, we can use the -I parameter of fastacmd. This parameter overrides all others in the command line. The output given below is for an old version of the refseq_protein database.

C:\blast2210>fastacmd -d refseq_protein -I T

Database: NCBI Protein Reference Sequences
           902,672 sequences; 324,856,552 total letters

File name:

C:\blast2210\blast2210p\db\refseq_protein
   Date: May 12, 2005  8:14 PM    Version: 4    Longest sequence: 37,777 res

To convert a formatted BLAST database back to its FASTA form, we use the "-D 1" setting. For databases from NCBI, preformatted or formatted locally from sequences downloaded from Entrez, we can also selectively dump out the GIs using "-D 2". The first example command line below dumps out the FASTA sequences from refseq_protein and saves the output to a file called refp.fasta. The second command line dumps out the GIs and saves the output to refp.gi.

C:\blast2210>fastacmd -d refseq_protein -D 1 -o refp.fasta

C:\blast2210>fastacmd -d refseq_protein -D 2 -o refp.gi

        4.2.2 Specific sequence and subsequence retrieval

Specific sequence retrieval requires that the target BLAST database be formatted with "-o T". To use this setting, the first strings in the FASTA deflines must be unique for indexing purposes. Preferably, the deflines should conform to NCBI format (Table 1.1). In addition, we need to have the seqID or first text string from the defline of the target sequence. For NCBI provided databases, the ids can be GI or accession numbers.

The following example command lines demonstrate the retrieval of a full sequence and a subsequence (with -L 100,160) for NP_112245 from the refseq_protein database.

C:\blast2210p>fastacmd -d refseq_protein -s NP_112245
>gi|14195630|ref|NP_112245.1| microtubule-associated protein 4 [Homo sapiens]
MADLSLADALTEPSPDIEGEIKRDFIATLEAEAFDDVVGETVGKTDYIPLLDVDEKTGNSES
KKKPCSETSQIEDTPSSK

C:\blast2210p>fastacmd -d refseq_protein -s NP_112245 -L 100,160
>gi|14195630:100-160 microtubule-associated protein 4 [Homo sapiens]
PTEFLEEKMAYQEYPNSQNWPEDTNFCFQPEQVVDPIQTDPFKMYHDDDLADLVFPSSATA
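
The range convention of -L can be mirrored in Python for downstream scripts. The sketch below assumes 1-based, inclusive coordinates, which is inferred from the example above, where "-L 100,160" returns 61 residues. As described in Table 3.2.11, 0 stands for the beginning or the end of the sequence. The function name is hypothetical, and how fastacmd itself treats out-of-range values is not modeled.

```python
def subsequence(sequence, start=0, stop=0):
    """Return sequence[start..stop] using 1-based, inclusive coordinates.

    A start of 0 means the beginning and a stop of 0 means the end.
    """
    if start < 0 or stop < 0:
        raise ValueError("start and stop must not be negative")
    first = start - 1 if start > 0 else 0
    last = stop if stop > 0 else len(sequence)
    return sequence[first:last]


print(subsequence("ABCDEFGHIJ", 3, 5))
```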

We can batch retrieve multiple sequences by using a list of comma-separated ids, as in "-s NP_000240,NP_024931". Alternatively, we can generate an id list and provide it to fastacmd's -i parameter, as in "-i input_file". Here input_file is the name of a text file containing ids, one record per line.

For custom databases, the specific retrieval is a bit different from the preformatted databases. For example, entries from NCBI Trace databases do have NCBI styled deflines, but since they are not part of Entrez Nucleotide database, they do not have GI or accession. We can format them with "-o T", but specific retrieval will require the quoted first string.

In the example below, we need to use the first string of the defline, gnl|ti|127084115, in quoted form to retrieve that record. We need to quote the id due to the presence of pipe symbols.

>gnl|ti|127084115 name:avt02g01.x1 AC110665 mate:127084142 mate_name:avt02g01.y1
 
C:\blast2210p>fastacmd -d dog_trace.nt -s "gnl|ti|127084115"

>gnl|ti|127084115 name:avt02g01.x1 AC110665 mate:127084142 […]
ACCTGGGTGATCTGATCCCATCGTCCTGTGGTGGAATTCTTCCCATTCTGAGAGTGAATAATAATTCACT
CACTCTGAATAATTATTCACTCTCAGAATCCATCCTTCGAATTTCTGTTCAATTTTTCTGCTCCTCTTCA
TCAAAATTTTCTTCAGTGTTATCTAGAGTTGCTGCCTTTACTTTTTCTTTTCTTTTTTTTTTTTAAGATT
TTATTTATTTATTCATGAGAGACAGAGAGAGAGAGAGAGCCGCCNNCCCATAGGCAGAGCCTGAGGCCCC
GGAAGAAGCAGGCTCCATGCAGGGAGCCCGAGGAGGGAC

        4.2.3 Taxonomic information

In Version 4 of the BLAST database, we adopted ASN.1 formatted deflines. The extra space available was used to encode the taxonomic id and other Linkout group bits. Entries from NCBI preformatted BLAST databases will have the information embedded in their deflines. When the taxdb.tar.gz archive is installed, fastacmd will be able to provide the taxonomic information for specific entries. Sample command lines for taxonomic information retrieval are given below.

C:\blast2210p>fastacmd -d refseq_protein -s NP_000240 -TT
NCBI sequence id: gi|4557757|ref|NP_000240.1|
NCBI taxonomy id: 9606
Common name: human
Scientific name: Homo sapiens

C:\blast2210p>fastacmd2210p -d refseq_protein -D 2 -o accession 
C:\blast2210p>fastacmd2210p -d refseq_protein -i accession -T T -o tax_info

Currently, it is not possible to dump out this piece of information for the complete database in one step. This function will not work for custom databases with non-NCBI entries or with NCBI entries but formatted without the -T setting.
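
Output such as the tax_info file above can be turned into records with a small Python sketch. It assumes the "key: value" layout of the sample output shown earlier and, for batch output, that each record begins with an "NCBI sequence id" line; the latter is an assumption that we have not verified against every fastacmd version. The function name is hypothetical.

```python
def parse_tax_report(text):
    """Parse fastacmd -T T style output into a list of dictionaries.

    A new record is started at every 'NCBI sequence id' line. Lines
    without a colon are ignored.
    """
    records, current = [], {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "NCBI sequence id" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


sample = """NCBI sequence id: gi|4557757|ref|NP_000240.1|
NCBI taxonomy id: 9606
Common name: human
Scientific name: Homo sapiens
"""
print(parse_tax_report(sample))
```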

5. Feedback

For questions and comments on this document and BLAST in general, please send them to:

blast-help@ncbi.nlm.nih.gov

Questions and comments on other NCBI resources should be addressed to:

info@ncbi.nlm.nih.gov

References

[1] Pearson and Lipman. "Improved tools for biological sequence comparison", 1988. PNAS 85(8): 2444-2448

[2] wwwblast: Setup and Usage: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/wwwblast/

[3] Entrez Utilities Help Document: http://www.ncbi.nlm.nih.gov/entrez/eutils/