1. Introduction
When discussing biological sequences, we generally refer to text strings in which each letter
represents a nucleotide or an amino acid residue in the actual biological sequence. There are many ways
to display a biological sequence, with FASTA being the most widely used one.
The FASTA format was first adopted by the authors of the FASTA sequence alignment program [1]. In this format,
a definition line (defline for short) beginning with ">" precedes the actual sequence. The defline contains a
brief description of the sequence. NCBI further expanded the defline by using the first string in the FASTA
defline as the seqID and breaking this string into pipe (|) separated fields to encode additional information.
In addition to FASTA, NCBI also provides sequences in GenBank (GenPept for protein), ASN.1, and
XML formats. However, only FASTA formatted sequences can be used as input queries with command line standalone BLAST
or client BLAST (blastcl3).
We cannot use sequence files in any of these formats directly as BLAST databases. Rather, we need to convert them into
BLASTable format using formatdb, which takes only sequences in FASTA or ASN.1 format as input. Once we identify a hit of
interest, we often need to get the entries out of a formatted BLAST database in human readable FASTA format. We can use
the fastacmd program to accomplish this task.
In this document, we will go over the technical details of FASTA sequence deflines, the two database related programs,
formatdb and fastacmd, their program parameters, and their practical usage.
We would like to point out that NCBI provides the common set of BLAST databases in preformatted form, generated
directly from our relational databases. We break large databases into smaller, easier to handle volumes. These databases
are readily blastable after inflation and extraction. Use these preformatted databases whenever you can. Users can quickly
regenerate the FASTA sequences from these preformatted databases if they are needed. See this page for more information on
available BLAST databases:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdb.html
1.1 Biological sequences in FASTA format
A FASTA formatted sequence begins with a single comment line called the defline, which is marked by a ">" sign at
the beginning followed by the description of the sequence. The defline terminates with a new line character and
is followed by one or more lines of actual sequence, each terminated with a new line character. Deflines for FASTA
sequences from NCBI follow a distinctive structure, which has several pipe (|) separated fields. Details are given below.
| Table 1.1 Sequence ID (seqID) Fields in the FASTA Deflines of Sequences from NCBI |
| Database Name | Identifier Syntax and Examples |
| GenBank 1 | >gi|digits|gb|accession|locus |
| EMBL Data Library 1 | >gi|digits|emb|accession|locus |
| DNA Database of Japan | >gi|digits|dbj|accession|locus |
| NBRF PIR 2 | >gi|digits|pir||entry |
| Protein Research Foundation 2 | >gi|digits|prf||name |
| SWISS-PROT 2 | >gi|digits|sp|accession|entry name |
| Protein Data Bank 2 | >gi|digits|pdb|entry|chain |
| Patents 2 | >gi|digits|pat|country|number |
| GenInfo Backbone Id | >gi|digits|bbs|number |
| NCBI Reference Sequence 1, 2 | >gi|digits|ref|accession| |
| General database identifier 3 | >gnl|database|identifier |
| Local Sequence identifier 3 | >lcl|identifier |
NOTE:
1 Nucleotide Defline Examples:
>gi|304804|gb|L17338.1|DROZENA Drosophila pseudoobscura zen gene
>gi|293667|gb|L13590.1|MUSIMSWAL Mus musculus DNA sequence
>gi|18917|emb|X52321.1|HVBAMYL Barley mRNA for beta-amylase
>gi|15042013|dbj|AB055093.1| Bacillus sp. KSM-KP43 gene for 16S rRNA
2 Protein Defline Examples:
>gi|68510037|ref|NP_766538.2| lipin 1 isoform a [Mus musculus]
>gi|24430466|gb|AAN61186.1| maturase [Perilla frutescens]
>gi|46090801|dbj|BAD13538.1| cytochrome b [Tanakia lanceolata]
>gi|223168|prf||0602187A protein,Leu,Ile,Val binding
>gi|50552|emb|CAA46200.1| protein kinase [Mus musculus]
>gi|51315829|sp|P84108|FLL1_ACEDI Flagellin-like protein
>gi|230326|pdb|1SGT| Trypsin (SGT) (E.C.3.4.21.4)
>gi|1082471|pir||S52920 disintegrin (EC 3.4.24.-) - human (fragment)
3 For records that are not included in the NCBI Entrez database.
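The defline layout in Table 1.1 can be illustrated with a short Python sketch. The function name parse_defline is hypothetical and not part of the BLAST package; the sketch only assumes that the seqID is the first whitespace-delimited string after ">" and that its fields are separated by pipes.

```python
def parse_defline(defline):
    """Split a FASTA defline into its seqID fields and free-text description."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA defline must start with '>'")
    # The seqID is the first whitespace-delimited string after '>'.
    seqid, _, description = defline[1:].strip().partition(" ")
    # Pipe-separated fields: database tags alternate with identifiers.
    fields = seqid.split("|")
    return {"seqid": seqid, "fields": fields, "description": description.strip()}


record = parse_defline(
    ">gi|304804|gb|L17338.1|DROZENA Drosophila pseudoobscura zen gene"
)
print(record["fields"])       # ['gi', '304804', 'gb', 'L17338.1', 'DROZENA']
print(record["description"])  # Drosophila pseudoobscura zen gene
```

An empty trailing field, as in the RefSeq deflines above, shows up as an empty string at the end of the field list.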
1.2 Conversion to blastable format
Text files with sequences in FASTA or ASN.1 format cannot be used as BLAST databases
directly during a BLAST search. To make them recognizable by BLAST, we need to
format them using formatdb. When formatting a database with the "-o F" setting, formatdb generates
3 files for the input sequence file, all of which are required by BLAST programs.
If the sequence file is from NCBI, or its deflines conform to NCBI convention, formatdb is
capable of parsing the seqIDs from the deflines to generate additional indexing files with the "-o T" setting.
For sequence files with custom deflines, the exact number of files generated with "-o T" setting will depend
on the actual format of the deflines.
For large databases, with size larger than one GB, formatdb automatically splits up the file and
generates multiple volumes, each no larger than one GB (default setting: -v 1000). Each volume
has its own specific set of database files. An alias file, generated automatically, ties the individual volumes together
to form a large virtual database.
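As a rough illustration of the -v arithmetic, the Python sketch below estimates how many volumes a database of a given size would produce. The function name is hypothetical, and the estimate assumes every volume is filled up to the limit; the actual split made by formatdb may differ, since this document does not describe exactly where volumes are cut.

```python
import math


def estimate_volume_count(total_letters, v_millions=1000):
    """Estimate the number of volumes for a given -v setting (millions of letters)."""
    if v_millions == 0:
        # Per Table 3.1.10, zero invokes the default of 1000.
        v_millions = 1000
    letters_per_volume = v_millions * 1_000_000
    return max(1, math.ceil(total_letters / letters_per_volume))


print(estimate_volume_count(2_500_000_000))     # 3
print(estimate_volume_count(250_000_000, 100))  # 3
```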
1.3 Sequence retrieval from formatted BLAST databases
We often need to retrieve the FASTA sequence for a specific entry within a BLAST database for visual inspection or other
analyses. We can do so if the database was formatted with the "-o T" setting. What we need is fastacmd, another database
related tool from the standalone BLAST package, in combination with the seqID for the entry. When using preformatted
databases from NCBI, we can also obtain taxonomic information. For details, see Section 4.2.
2. Installation and configuration
No specific setting is needed for formatdb or fastacmd if the BLAST package was installed properly. For more information
on the installation of the BLAST package, see pc_setup.html or unix_setup.html.
3. Program parameters
We will list the detailed program parameters for formatdb and fastacmd separately below.
3.1 Command line parameters for formatdb
Command line parameters for formatdb are discussed here with each parameter listed in its own table.
| Table 3.1.1 |
| Parameter | -i |
| Function | Specifies the input file(s) to be formatted |
| Default | N/A |
| Input format | [File In] |
| Example | To format an input FASTA file my_seq.txt, use: -i my_seq.txt |
Note
This parameter is mandatory. It requires the full file name with extension. The input file should have sequences in FASTA or
ASN.1 format, except when converting a gi list to binary form. To format multiple input files, quote the input file names as in
-i "db1 db2". The FASTA output from other programs can be piped to this option using "-i stdin". Renaming the database with -n
is mandatory when formatting multiple input files and recommended when reading from stdin.
See Table 3.1.9.
| Table 3.1.2 |
| Parameter | -p |
| Function | The input type is protein |
| Default | T |
| Input format | [T/F] |
| Example | To format nucleotide database, use: -p F |
Note
T: true, input is protein
F: false, input is nucleotide.
| Table 3.1.3 |
| Parameter | -o |
| Function | Parses deflines and indexes seqIDs |
| Default | F |
| Input format | [T/F] |
| Example | To enable seqID parsing and indexing, use: -o T |
Note
T: Parse SeqID and create indexes
F: Do not parse SeqID and do not create indexes.
For input FASTA sequence file with NCBI styled deflines, use "-o T". Otherwise, use "-o F".
| Table 3.1.4 |
| Parameter | -t |
| Function | Adds custom title to the database |
| Default | N/A |
| Input format | [String] |
| Example | To add the title "combined nt, est, and htgs", use: -t "combined nt, est, and htgs" |
Note
This adds a more descriptive title to the database, which is displayed in the header section of the BLAST output.
| Table 3.1.5 |
| Parameter | -l |
| Function | Specifies the logfile name |
| Default | formatdb.log |
| Input format | [File Out] |
| Example | None |
Note
The default setting is usually sufficient. We recommend users check this log file after each formatdb run to
make sure there is no obvious error.
| Table 3.1.6 |
| Parameter | -a |
| Function | The input file is in ASN.1 format |
| Default | F |
| Input format | [T/F] |
| Example | To format ASN.1 file, use: -a T |
Note
The default is to expect sequences in FASTA format. Currently, multiple sequences in ASN.1 format downloaded from Entrez
do NOT work properly with formatdb: only the first entry will be formatted.
| Table 3.1.7 |
| Parameter | -b |
| Function | The input ASN.1 file is in binary form |
| Default | F |
| Input format | [T/F] |
| Example | To format binary ASN.1 file, use -b T |
Note
T: binary ASN.1 file
F: text ASN.1 file expected
Use this with -a T option.
| Table 3.1.8 |
| Parameter | -e |
| Function | The input ASN.1 is a seq-entry file |
| Default | F |
| Input format | [T/F] |
| Example | To set this to true, use: -e T |
Note
To format sequence in ASN.1 form downloaded from Entrez, use: -e T
| Table 3.1.9 |
| Parameter | -n |
| Function | Renames the resulting database |
| Default | N/A |
| Input format | [String] |
| Example | To rename the formatted database to combined_nt, use: -n combined_nt |
Note
This parameter renames the formatted database to a name different from the input file, which is recommended when formatting
input sequences piped from stdin. Mandatory when formatting multiple input files. Do NOT combine -n with -L.
| Table 3.1.10 |
| Parameter | -v |
| Function | Sets the upper limit of database volume size, input in MILLIONS of letters |
| Default | 0 |
| Input format | [Integer] |
| Example | To break the formatted database into 100 megabase volumes, use: -v 100 |
Note
Zero invokes the default of 1000, which is one gigabase or 10^9 letters. If an input database is broken into multiple
volumes, formatdb will automatically create an alias file with the db_name.nal extension, which ties all the volumes
together. The complete database can be called using "-d db_name". See Table 3.1.13 and section 3.1.
| Table 3.1.11 |
| Parameter | -s |
| Function | Creates sparse indexes - limited only to accessions |
| Default | F |
| Input format | [T/F] |
| Example | To activate this option, use: -s T |
Note
Activation of this parameter will reduce the size of the database indexing files.
| Table 3.1.12 |
| Parameter | -V |
| Function | Activates verbose mode and checks for non-unique IDs |
| Default | F |
| Input format | [T/F] |
| Example | To activate this warning, use: -V T |
Note
This prints warnings on screen if duplicate IDs are found.
| Table 3.1.13 |
| Parameter | -L |
| Function | Creates an alias file with this name |
| Default | N/A |
| Input format | [File Out] |
| Example | To create a nucleotide database alias named mouse_subset for nt from a gi list named mouse.gil, use: formatdb -i all -p F -F mouse.gil -L mouse_subset |
Note
It will use the GI file argument from -F to calculate the database size. Do NOT combine -L with -n.
| Table 3.1.14 |
| Parameter | -F |
| Function | Specifies an input GI file |
| Default | N/A |
| Input format | [File In] |
| Example | To input a GI file named mouse_gi, use: -F mouse_gi |
Note
It takes a text file with a list of GIs from Entrez or Eutils.
| Table 3.1.15 |
| Parameter | -B |
| Function | Generates binary GI file from the text GI file specified in -F |
| Default | N/A |
| Input format | [File Out] |
| Example | To generate binary GI file worm.gil from input text GI list worm, use: formatdb -F worm -B worm.gil |
Note
This converts the -F input to a more efficient binary format. The resulting file can be used in database aliases or by the -l
parameter of BLAST programs while searching against a preformatted database from NCBI.
| Table 3.1.16 |
| Parameter | -T |
| Function | Reads in taxonomic information and writes the group bit to the ASN.1 defline |
| Default | Optional |
| Input format | [File in] |
| Example | To read in gi_taxid_prot.dmp, use: -T gi_taxid_prot.dmp |
Note
This parameter allows formatdb to read in a file with gi/taxid information and write the taxid information to the ASN.1
defline. The inputs are gi_taxid*.dmp files from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
3.2 Command Line Parameters for fastacmd
The program fastacmd works on a formatted BLAST database generated by formatdb. Its program parameters are discussed below in
their individual tables.
| Table 3.2.1 |
| Parameter | -d |
| Function | Specifies the input database |
| Default | nr |
| Input format | [String] |
| Example | To use ecoli database, use: -d ecoli |
Note
fastacmd will NOT be able to retrieve specific entries from databases formatted with "-o F". To work with multiple databases
simultaneously, use -d "db1 db2".
Databases in quotes must be of the same type.
| Table 3.2.2 |
| Parameter | -p |
| Function | Specifies the database type |
| Default | G |
| Input format | [G/T/F] |
| Example | To work with a protein database, use: -p T |
Note
G: guess mode, looking for protein first, then nucleotide; T: protein; F: nucleotide.
| Table 3.2.3 |
| Parameter | -s |
| Function | Specifies the seqID or text strings |
| Default | N/A |
| Input format | [String] |
| Example | To retrieve sequences/information for gi 5556, use: -s 5556 |
Note
Use a GI, accession, or text string (for a custom database). Multiple entries need to be comma-delimited as in
"-s AF123456,U12345".
Specific sequence retrieval from a custom database with deflines conforming to NCBI format will need the first text string.
See Section 4.2.2.
| Table 3.2.4 |
| Parameter | -i |
| Function | Specifies the input file containing GIs, accessions, or text strings for batch retrieval |
| Default | N/A |
| Input format | [String] |
| Example | To batch retrieve sequences specified by gi_list, use: -i gi_list |
Note The input file must be a text file with one entry per line. The complete file name with extension should be used.
| Table 3.2.5 |
| Parameter | -a |
| Function | Retrieves duplicate accessions |
| Default | F |
| Input format | [T/F] |
| Example | None |
Note
The "-a T" setting retrieves all entries with deflines containing the input text string. For example, it is quite common for
pdb entries of different chains to have the same record id, but with different chain numbers. The command line
"fastacmd -d pdb -s 2hhd -a T" retrieves all the entries, while "fastacmd -d pdb -s 2hhd" retrieves only some of them.
| Table 3.2.6 |
| Parameter | -l |
| Function | Specifies the line length of the returned sequences |
| Default | 80 |
| Input format | [Integer] |
| Example | To set line length to 70, use: -l 70 |
Note
Changing the default is not recommended.
| Table 3.2.7 |
| Parameter | -t |
| Function | Requires defline to contain the target GI |
| Default | F |
| Input format | [T/F] |
| Example | To make sure "-s 511603" retrieves only the entry with this gi in the defline, add: -t T |
Note It only affects compound entries from non-redundant databases such as protein nr.
| Table 3.2.8 |
| Parameter | -o |
| Function | Specifies the output file |
| Default | stdout, print to screen |
| Input format | [String] |
| Example | To save the output to file called my_hit, use: -o my_hit |
Note Redirection using "|" or ">" also works.
| Table 3.2.9 |
| Parameter | -c |
| Function | Uses Ctrl-A's as non-redundant defline separator |
| Default | F |
| Input format | [T/F] |
| Example | To keep the Ctrl-A in the defline, use: -c T |
Note It is only for non-redundant databases, such as protein nr.
| Table 3.2.10 |
| Parameter | -D |
| Function | Dumps the entire database |
| Default | 0 |
| Input format | Integer |
| Example | To dump the database in FASTA format, use: -D 1 |
Note It overrides all other options except -I, and accepts the following values:
0: No dump; 1: FASTA; 2: GI list; 3: Accession.version
| Table 3.2.11 |
| Parameter | -L |
| Function | Specifies the subsequence range |
| Default | 0,0 |
| Input format | [String] |
| Example | To get the subsequence from 10 to 200, use: -L 10,200 |
Note
In the default setting, 0 in 'start' refers to the beginning of the sequence and 0 in 'stop' refers to the end of the sequence.
Double quote the input if it contains white space(s). fastacmd will apply the input to all retrieved records in a batch
retrieval.
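The start,stop convention can be mimicked in a few lines of Python. This is a sketch, not fastacmd's own code: it assumes positions are 1-based and inclusive, which matches the "-L 100,160" example in Section 4.2.2 where 61 residues are returned. The function name is hypothetical.

```python
def subsequence(sequence, start=0, stop=0):
    """Return sequence[start..stop], 1-based and inclusive; 0 means 'from/to the end'."""
    first = start if start > 0 else 1              # 0 in 'start' = beginning
    last = stop if stop > 0 else len(sequence)     # 0 in 'stop' = end
    return sequence[first - 1:last]


print(subsequence("ABCDEFGHIJ", 3, 5))  # CDE
print(subsequence("ABCDEFGHIJ", 8, 0))  # HIJ
print(subsequence("ABCDEFGHIJ"))        # ABCDEFGHIJ
```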
| Table 3.2.12 |
| Parameter | -S |
| Function | Specifies the strand of nucleotide sequence to retrieve |
| Default | 1 |
| Input format | [Integer] |
| Example | To retrieve the reverse complement, use: -S 2 |
Note fastacmd will apply the input to all retrieved entries during batch retrieval. Values and functions are:
1: The entry itself; 2: The reverse complement of the entry
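For readers unfamiliar with the operation behind "-S 2", the Python sketch below computes a reverse complement. It is an illustration only: it handles A, C, G, T and N in either case and leaves any other character unchanged, which is an assumption and not a description of how fastacmd treats ambiguity codes.

```python
# Translation table for the unambiguous bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement each base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ACCTG"))  # CAGGT
```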
| Table 3.2.13 |
| Parameter | -T |
| Function | Prints taxonomic information for requested sequence(s) |
| Default | F |
| Input format | [T/F] |
| Example | To print taxonomic information, use: -T T |
Note
This works only for preformatted databases provided by NCBI and requires the installation of the taxdb.tar.gz archive.
See Section 4.2.3 for more information.
| Table 3.2.14 |
| Parameter | -I |
| Function | Prints database information only |
| Default | F |
| Input format | [T/F] |
| Example | To get the database information, use -I T |
Note Prints title, database type, total length, and number of sequences for the target database specified by -d.
Overrides all other parameters.
| Table 3.2.15 |
| Parameter | -P |
| Function | Retrieves sequences with PIG ID |
| Default | N/A |
| Input format | [Integer] |
| Example | To retrieve PIG 234 from nr, use: -d nr -P 234 |
Note PIG stands for "Protein Identification Group". Each PIG contains one or more protein entries with the
exact same sequence. PIG number list or table is NOT available to the public at this time.
4. Practical usage
In this section, we will present some additional information on these two programs and discuss their practical usages.
4.1 Using formatdb
The function of this program is to convert sequence files to blastable format, index the entries, and encode additional
information to the ASN.1 defline to make the resulting databases more useful.
4.1.1 Number of formatdb output files: 3, 5, 7 and 9
The program formatdb processes the input sequence file and generates a varying number of files. The exact number of files
generated depends on whether the "-o T" option is used, whether the sequence deflines conform to NCBI format, whether they
are from the NCBI Entrez database, and whether they are protein or nucleotide.
| Table 4.1.1 Formatdb Output List |
| Nucleotide db file extension | Protein db file extension | Content | Format |
| .nhr | .phr | Deflines | binary |
| .nin | .pin | Indices | binary |
| .nsq | .psq | sequence data | binary |
| Additional files generated using the "-o T" * |
| .nnd | .pnd | GI data | binary |
| .nni | .pni | GI indices | binary |
| .nsd | .psd | non-GI data | binary |
| .nsi | .psi | non-GI indices | binary |
| - | .ppd | PIG data | binary |
| - | .ppi | PIG indices | binary |
NOTE:
* Number of files produced with "-o T" depends on the defline format.
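A quick way to confirm that a formatdb run produced the three core files from Table 4.1.1 is sketched below in Python. The function names are hypothetical, and the sketch checks only the three files that are always generated; the additional "-o T" index files are left out because their number depends on the defline format.

```python
import os


def core_database_files(db_name, protein=True):
    """Return the names of the three files BLAST always needs for a database."""
    prefix = "p" if protein else "n"
    return [f"{db_name}.{prefix}{suffix}" for suffix in ("hr", "in", "sq")]


def missing_core_files(db_name, protein=True, directory="."):
    """List the core files that are not present in the given directory."""
    return [
        name
        for name in core_database_files(db_name, protein)
        if not os.path.exists(os.path.join(directory, name))
    ]


print(core_database_files("ecoli", protein=False))
# ['ecoli.nhr', 'ecoli.nin', 'ecoli.nsq']
```

An empty list from missing_core_files suggests the basic formatting step completed; the formatdb log file should still be checked for errors.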
We recommend that users use NCBI preformatted BLAST databases whenever possible. These preformatted BLAST databases are
split into smaller volumes that are easy to handle. They also contain added Linkout and taxonomic information. Users can
fully exploit the Linkout information under the wwwblast setup [2].
4.1.2 Format a custom database with entries from NCBI Entrez
Sequences obtained from the NCBI Entrez database or Eutilities [3] should be formatted with the "-o T" option. For
batch sequences, only FASTA format should be used at this time. For an annotated genome or genomic segment, it is
advantageous to use the ASN.1 format as input to formatdb, since we can use it as a nucleotide or a protein input. When such
a record is used as a protein input to formatdb, the annotated CDSs, or protein products, are formatted into a BLASTable
protein database.
If a preformatted database containing the target entries of interest is available from NCBI's BLAST db ftp directory, users
can instead create a database alias from a GI list without formatting a separate database.
We will discuss the details in Section 4.1.3.
The "-T" option was added in formatdb 2.2.12. It allows the incorporation of taxonomic information into the ASN.1
defline using the gi_taxid_nucl.dmp or gi_taxid_prot.dmp from the taxonomy ftp site at:
ftp.ncbi.nlm.nih.gov/pub/taxonomy/
The following command line formats the bacterial_protein FASTA file and adds the taxid information to the ASN.1 deflines:
formatdb -i bacterial_protein -p T -o T -T gi_taxid_prot.dmp
4.1.3 Use a GI list to create an alias file for a master database
All preformatted BLAST databases from NCBI can be used in conjunction with a GI list fed to the parameter -l to restrict a
given BLAST search to the subset of entries delimited by that list. A more efficient and informative way to use a GI list,
however, is to generate a database alias based on a binary GI list using formatdb.
We can readily generate a GI list by searching in the Entrez Nucleotide or Protein database. Correct usage of a GI list with
BLAST requires a good understanding of how sequences are partitioned among the available BLAST databases. We also need to know
that BLAST programs will not report an error or warning if a sequence specified by a GI is missing from the target database.
For example, a GI list representing human mRNAs, obtained from Entrez Nucleotide using "human[orgn] AND biomol_mrna[prop]",
contains GIs for ESTs as well. Because nt excludes ESTs, the GIs representing these ESTs have no corresponding entries in nt.
When this GI list is used as input to -l to limit a BLAST search against nt, BLAST will not report errors on the missing
entries.
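Since BLAST stays silent about missing entries, it can help to compare a GI list against the database contents by hand. The Python sketch below does this with a set difference. It assumes both files are plain text with one GI per line, and that the database GI list was produced beforehand with the "-D 2" dump of fastacmd (Table 3.2.10); the function and file names are hypothetical.

```python
def read_gi_file(path):
    """Read a text GI list, one GI per line, ignoring blank lines."""
    with open(path) as handle:
        return {int(line) for line in handle if line.strip()}


def missing_gis(requested, available):
    """Return the requested GIs that are absent from the database, sorted."""
    return sorted(set(requested) - set(available))


# Hypothetical usage, with nt.gi dumped by: fastacmd -d nt -D 2 -o nt.gi
#   absent = missing_gis(read_gi_file("human_mrna.gi"), read_gi_file("nt.gi"))
print(missing_gis([5, 7, 9], [7]))  # [5, 9]
```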
formatdb, on the other hand, will check the GI list during alias construction to verify the presence of the entries
specified by the GI list. This function helps avoid confusion further downstream, e.g. at the BLAST result analysis stage.
For reference, we list the Entrez query approximations for the available BLAST databases below.
| Table 4.1.3 Entrez approximation for preformatted databases 1 |
| Database Name | Entrez Query Approximation |
| Protein 2 |
| nr | all[filter] NOT environmental sample[filter] NOT gbdiv_pat[prop] |
| swissprot | srcdb_swiss_prot[prop] |
| pdb | srcdb_pdb[prop] |
| refseq_protein | srcdb_refseq[prop] |
| env_nr 3 | environmental sample[filter] |
| pat | gbdiv_pat[prop] |
| Nucleotide |
| nt | all[filter] NOT (gbdiv_est[prop] OR gbdiv_gss[prop] OR gbdiv_sts[prop] OR gbdiv_pat[prop] OR gbdiv_htg[prop] OR (srcdb_refseq[prop] AND biomol_genomic[prop]) OR environmental sample[filter] OR wgs[prop]) |
| refseq_rna | srcdb_refseq[prop] AND biomol_rna[prop] |
| refseq_genomic 4 | N/A |
| est | gbdiv_est[prop] |
| est_human | gbdiv_est[prop] AND human[orgn] |
| est_mouse | gbdiv_est[prop] AND mouse[orgn] |
| est_others | gbdiv_est[prop] NOT (mouse[orgn] OR human[orgn]) |
| human_genomic_transcript | N/A |
| mouse_genomic_transcript | N/A |
| htgs | gbdiv_htg[prop] |
| gss | gbdiv_gss[prop] |
| sts | gbdiv_sts[prop] |
| wgs | wgs[prop] |
| pat | gbdiv_pat[prop] |
| pdb | gbdiv_pdb[prop] |
| env_nt 3 | environmental sample[filter] |
| human_genomic | NC_000001:NC_000024[accn] OR AC_000044:AC_000068[accn] |
| other_genomic 3 | N/A |
NOTE:
1 The query is only an approximation, provided for use in combination with other terms to get the GIs
for a subset of sequences, which can be used to limit a BLAST search to that subset through the -l option. Due
to the size of the databases and the time needed to update them, the content of BLAST databases will LAG behind
Entrez.
2 Protein BLAST databases are non-redundant, while the Entrez approximation is NOT.
3 Some entries in the Entrez approximation are in protein nr or nucleotide nt.
4 Currently, there is no Entrez approximation for these two databases.
We can combine an additional Entrez query with the approximation, using the boolean operator AND or NOT, to retrieve
a subset of that database. We can save the GIs of the retrieved records by first displaying them as "GI list"
and then using the "Send to" file button to save the GIs. For more information, see
Entrez Help.
To turn a text GI list file, mouse.n.gi, into a more efficient binary form, we can use the formatdb command line below.
formatdb -F mouse.n.gi -B mouse.n.gil
To generate a database alias using the resulting binary mouse.n.gil file, we can use this formatdb command line.
formatdb -i parent_db -p F -F mouse.n.gil -L mouse.n.subset
Here parent_db is the name of the preformatted parent database, and the alias generated by the command line is
named mouse.n.subset. A search against this alias database will be against the subset of sequences specified by the GI list.
A database alias also allows one to give a database subset a more meaningful name, use a shorter command line in actual
searching, keep fewer sets of actual databases, and reduce the maintenance needed in a group environment.
4.1.4 Format multiple input files
We can use formatdb to format multiple input files into a single BLAST database. To do so, we need to quote the input files
to be formatted and provide them to the -i parameter. It is mandatory to use the -n option to name the resulting database,
since formatdb cannot use the multiple file names in the input to name the resulting database.
formatdb -i "db1.fa db2.fa" -n all.db -t "combined db1 db2"
The -t parameter in the above command line creates a descriptive title for the resulting database, which will
appear in the header of the search result and help us track the BLAST result.
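When formatdb is driven from a script, the quoting rule for multiple inputs is easy to get wrong. The Python sketch below assembles the argument list for such a call using only the parameters documented in Section 3.1. The function name is hypothetical, and the list is meant to be passed to something like subprocess.run, where the joined file names stay a single argument just as the shell quotes would keep them.

```python
def formatdb_args(inputs, protein=True, parse_seqids=False, name=None, title=None):
    """Build a formatdb argument list; -n is enforced for multiple input files."""
    if len(inputs) > 1 and name is None:
        raise ValueError("-n is mandatory when formatting multiple input files")
    args = [
        "formatdb",
        "-i", " ".join(inputs),            # one argument, like -i "db1 db2"
        "-p", "T" if protein else "F",
        "-o", "T" if parse_seqids else "F",
    ]
    if name is not None:
        args += ["-n", name]
    if title is not None:
        args += ["-t", title]
    return args


print(formatdb_args(["db1.fa", "db2.fa"], name="all.db", title="combined db1 db2"))
```

The sketch spells out -p and -o explicitly even when they match the defaults, so the resulting command documents itself in logs.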
4.1.5 Format custom database
The term "custom databases" here refers to sequences from users or other third party sources. Most of those databases should
be formatted with "-o F" setting, unless their deflines follow NCBI convention.
Formatting a custom database that has NCBI-convention deflines with the "-o T" setting requires that each defline
start with a unique first string, since this is the field formatdb indexes to generate the additional
indexing files. The additional indexing allows specific retrieval of sequences with fastacmd using their unique
first strings.
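Before running formatdb with "-o T" on a custom file, the uniqueness requirement can be checked with a short Python sketch like the one below. It treats the first whitespace-delimited string after ">" as the indexed field, as described above; the function name is hypothetical. The "-V T" option of formatdb (Table 3.1.12) offers a built-in check for non-unique IDs as well.

```python
from collections import Counter


def duplicate_first_strings(fasta_lines):
    """Return the first defline strings that occur more than once, sorted."""
    counts = Counter(
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(seqid for seqid, count in counts.items() if count > 1)


lines = [">lcl|a first copy", "ACGT", ">lcl|b", "GGCC", ">lcl|a second copy", "TTAA"]
print(duplicate_first_strings(lines))  # ['lcl|a']
```

For a file on disk, the open file handle can be passed directly in place of the list of lines.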
Sometimes, we may encounter problems when searching a custom BLAST database formatted with the "-o T" setting.
To resolve the issue, we need to go back to the FASTA sequences and reformat them using the "-o F" setting.
If the FASTA sequences are not available, we can dump them out of the formatted database using fastacmd.
4.1.6 Alias file structure
BLAST database aliases are text files with database configuration information. They can be created automatically
by formatdb or manually. The alias file name follows database.##.*** convention, where the .## are optional volume numbers
and the .*** are .pal or .nal file extension, representing protein and nucleotide aliases, respectively.
An alias file can tie multiple databases together to form a larger virtual database. It can also specify a subset of sequences
within a large master database to form a smaller virtual database. Information on the number of sequences and their
total length in the virtual database can be included in an alias file. BLAST will use them for the Expect value calculation.
The alias below, named zebrafish.pal, specifies a virtual database for zebrafish entries found in the nr protein database.
#
# Alias file created Thu Jul 5 15:04:29 2001
#
TITLE My zebrafish database
#
DBLIST nr
#
GILIST zebrafish.gi
#
#OIDLIST
#
NSEQ 1836
LENGTH 640724
#
The alias content below is from est_others.nal, which was generated automatically by formatdb. It ties up the individual
est_others.## volumes into one complete database.
#
# Alias file created Tue Jan 30 18:04:04 2007
#
TITLE GenBank non-mouse and non-human EST entries
#
DBLIST est_others.00 est_others.01 est_others.02 est_others.03 est_others.04
#
#GILIST
#
#OIDLIST
#
Note:
# marks a commented line. All other lines should contain no line break.
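The alias layout shown above is simple enough to generate from a script. The Python sketch below writes the same keywords (TITLE, DBLIST, GILIST, NSEQ, LENGTH) in the same order as the zebrafish example. The function name is hypothetical, and the sketch covers only the fields that appear in the two examples in this section.

```python
def alias_file_text(title, dblist, gilist=None, nseq=None, length=None):
    """Compose the text of a BLAST alias file from its main fields."""
    values = [title, *dblist] + ([gilist] if gilist else [])
    if any("\n" in value for value in values):
        # Non-comment lines must not contain line breaks.
        raise ValueError("alias fields must not contain line breaks")
    lines = ["#", f"TITLE {title}", "#", "DBLIST " + " ".join(dblist), "#"]
    lines.append(f"GILIST {gilist}" if gilist else "#GILIST")
    lines += ["#", "#OIDLIST", "#"]
    if nseq is not None:
        lines.append(f"NSEQ {nseq}")
    if length is not None:
        lines.append(f"LENGTH {length}")
    return "\n".join(lines) + "\n"


print(alias_file_text("My zebrafish database", ["nr"], "zebrafish.gi", 1836, 640724))
```

The returned text would be saved under a name ending in .pal or .nal, depending on the database type.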
4.2 On fastacmd
The database tool fastacmd allows us to work with a formatted BLAST database for purposes other than sequence alignment.
These include dumping FASTA sequences, getting summary information, retrieving a specific sequence or subsequence, and
extracting the taxonomic information for specific entries. Dealing with specific entries requires a database
formatted with "-o T".
4.2.1 Database information and database to FASTA sequence conversion
To get a brief summary of a BLAST database, we can use the -I parameter of fastacmd.
This parameter overrides all others in the command line. The output given below is for
an old version of the refseq_protein database.
C:\blast2210>fastacmd -d refseq_protein -I T
Database: NCBI Protein Reference Sequences
902,672 sequences; 324,856,552 total letters
File name:
C:\blast2210\blast2210p\db\refseq_protein
Date: May 12, 2005 8:14 PM Version: 4 Longest sequence: 37,777 res
To convert a formatted BLAST database back to its FASTA form, we use the "-D 1" setting.
For databases from NCBI, preformatted or formatted locally from sequences downloaded from Entrez, we can also selectively
dump out the GIs using "-D 2". The first example command line below dumps out the FASTA sequences from
refseq_protein and saves the output to a file called refp.fasta. The second command line dumps out the GIs and saves the
output to refp.gi.
C:\blast2210>fastacmd -d refseq_protein -D 1 -o refp.fasta
C:\blast2210>fastacmd -d refseq_protein -D 2 -o refp.gi
4.2.2 Specific sequence and subsequence retrieval
Specific sequence retrieval requires that the target BLAST database be formatted with "-o T". To use this setting, the first
strings in the FASTA deflines must be unique for indexing purposes. Preferably, the deflines should conform to NCBI format
(Table 1.1). In addition, we need to have the seqID or first text string from the defline of the target sequence. For NCBI
provided databases, the ids can be GI or accession numbers.
The following example command lines demonstrate the retrieval of a full sequence
and a subsequence (with -L 100,160) for NP_112245 from the refseq_protein database.
C:\blast2210p>fastacmd -d refseq_protein -s NP_112245
>gi|14195630|ref|NP_112245.1| microtubule-associated protein 4 [Homo sapiens]
MADLSLADALTEPSPDIEGEIKRDFIATLEAEAFDDVVGETVGKTDYIPLLDVDEKTGNSES
KKKPCSETSQIEDTPSSK
C:\blast2210p>fastacmd -d refseq_protein -s NP_112245 -L 100,160
>gi|14195630:100-160 microtubule-associated protein 4 [Homo sapiens]
PTEFLEEKMAYQEYPNSQNWPEDTNFCFQPEQVVDPIQTDPFKMYHDDDLADLVFPSSATA
We can batch retrieve multiple sequences by using a list of comma-separated ids, as in "-s NP_000240,NP_024931".
Alternatively, we can generate an id list and provide the list to fastacmd's -i parameter, as in "-i input_file".
Here input_file is the name of a text file containing ids, one record per line.
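The two batch retrieval styles can be prepared from a script as sketched below in Python. Both function names are hypothetical; the sketch only writes the id file and assembles the fastacmd argument list, using the -d, -s and -i parameters described in Section 3.2, and leaves running the command to the caller.

```python
def write_id_file(ids, path):
    """Write one id per line, the layout expected by fastacmd -i."""
    with open(path, "w") as handle:
        for seq_id in ids:
            handle.write(f"{seq_id}\n")


def fastacmd_batch_args(database, ids=None, id_file=None):
    """Build a fastacmd argument list for comma-separated or file-based retrieval."""
    args = ["fastacmd", "-d", database]
    if ids:
        args += ["-s", ",".join(ids)]      # e.g. -s NP_000240,NP_024931
    elif id_file:
        args += ["-i", id_file]            # text file, one id per line
    else:
        raise ValueError("provide either ids or id_file")
    return args


print(fastacmd_batch_args("refseq_protein", ids=["NP_000240", "NP_024931"]))
```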
For custom databases, specific retrieval is a bit different from that for the preformatted databases. For example, entries
from NCBI Trace databases do have NCBI styled deflines, but since they are not part of the Entrez Nucleotide database,
they do not have a GI or accession. We can format them with "-o T", but specific retrieval will require the quoted first string.
In the example below, we need to use the first string of the defline, gnl|ti|127084115, in quoted form to retrieve that
record. We need to quote the id due to the presence of pipe symbols.
>gnl|ti|127084115 name:avt02g01.x1 AC110665 mate:127084142 mate_name:avt02g01.y1
C:\blast2210p>fastacmd -d dog_trace.nt -s "gnl|ti|127084115"
>gnl|ti|127084115 name:avt02g01.x1 AC110665 mate:127084142 […]
ACCTGGGTGATCTGATCCCATCGTCCTGTGGTGGAATTCTTCCCATTCTGAGAGTGAATAATAATTCACT
CACTCTGAATAATTATTCACTCTCAGAATCCATCCTTCGAATTTCTGTTCAATTTTTCTGCTCCTCTTCA
TCAAAATTTTCTTCAGTGTTATCTAGAGTTGCTGCCTTTACTTTTTCTTTTCTTTTTTTTTTTTAAGATT
TTATTTATTTATTCATGAGAGACAGAGAGAGAGAGAGAGCCGCCNNCCCATAGGCAGAGCCTGAGGCCCC
GGAAGAAGCAGGCTCCATGCAGGGAGCCCGAGGAGGGAC
4.2.3 Taxonomic information
In Version 4 of the BLAST database, we adopted the ASN.1 formatted defline. The extra space available was used
to encode the taxonomic id and other Linkout group bits. Entries from NCBI preformatted BLAST databases have the information
embedded in their deflines. When the taxdb.tar.gz archive is installed, fastacmd is able to provide the taxonomic
information for specific entries.
Sample command lines for taxonomic information retrieval are given below.
C:\blast2210p>fastacmd -d refseq_protein -s NP_000240 -TT
NCBI sequence id: gi|4557757|ref|NP_000240.1|
NCBI taxonomy id: 9606
Common name: human
Scientific name: Homo sapiens
C:\blast2210p>fastacmd2210p -d refseq_protein -D 2 -o accession
C:\blast2210p>fastacmd2210p -d refseq_protein -i accession -T T -o tax_info
Currently, it is not possible to dump out this piece of information for the complete database in one step. This function will
not work for custom databases with non-NCBI entries, or with NCBI entries formatted without the -T setting.
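The taxonomy report saved by the batch command above can be turned into records with a small parser. The Python sketch below assumes the report keeps the "label: value" layout of the sample output in this section and that each record begins with an "NCBI sequence id" line; the function name is hypothetical.

```python
def parse_tax_report(text):
    """Parse 'label: value' lines into one dictionary per sequence."""
    records, current = [], {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if not separator:
            continue                      # skip lines without a label
        key, value = key.strip(), value.strip()
        if key == "NCBI sequence id" and current:
            records.append(current)       # a new record starts here
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


sample = (
    "NCBI sequence id: gi|4557757|ref|NP_000240.1|\n"
    "NCBI taxonomy id: 9606\n"
    "Common name: human\n"
    "Scientific name: Homo sapiens\n"
)
print(parse_tax_report(sample)[0]["Scientific name"])  # Homo sapiens
```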
5. Feedback
Please send questions and comments on this document and BLAST in general to:
blast-help@ncbi.nlm.nih.gov
Questions and comments on other NCBI resources should be addressed to:
info@ncbi.nlm.nih.gov