1. Introduction
This document describes the "BLAST" databases available on the NCBI FTP site under:
ftp.ncbi.nih.gov/blast/db/
NCBI BLAST home pages (www.ncbi.nih.gov/BLAST/) use a standard set of BLAST databases for Nucleotide,
Protein, and Translated BLAST searches. These preformatted databases are made available as compressed
archives at ftp.ncbi.nih.gov/blast/db/. The FASTA databases reside under the FASTA subdirectory.
The pre-formatted databases offer the following advantages:
- The pre-formatted databases break large databases into smaller, more manageable volumes, which are easier to download;
- The smaller size will help avoid file size limitations users may encounter on certain platforms;
- Sequences in FASTA format can be generated from the pre-formatted databases by the fastacmd utility;
- A convenient script (update_blastdb.pl) is available to download the pre-formatted databases from the NCBI ftp site;
- Pre-formatting removes the need to run formatdb;
- Taxonomy ids are available for each database entry.
Pre-formatted databases must be downloaded in binary mode using an FTP client, a web browser, or the
update_blastdb.pl script provided by NCBI. Documentation for the update_blastdb.pl script can be obtained
by running the script without any arguments (Perl is required).
We strongly recommend that users use the preformatted databases whenever possible. In addition
to the advantages listed above, a lesser-known logistical reason is that our BLAST databases are generated
directly from our backend relational databases in preformatted form and can be loaded to our ftp
site directly after tar and gzip. To generate the FASTA files for ftp, we must first dump the FASTA
sequences from the preformatted databases using fastacmd before we can gzip and load
them to our ftp server. Users who need the FASTA sequences can easily get them
from the preformatted databases using the "-D 1" option in fastacmd.
More information on fastacmd and formatdb is available in
"Program Parameters for formatdb and fastacmd".
The compressed database files must be inflated with gzip or another compatible tool. The BLAST database
files can then be extracted from the resulting tar file using the tar program on Unix/Linux, or WinZip and
StuffIt Expander on Windows and Macintosh platforms, respectively.
Large databases are formatted in multiple one-gigabyte volumes, which are named using the database.##.tar.gz
convention, with ## representing the volume number. All relevant volumes are required to reconstitute the
database. An alias file, with a .nal or .pal extension, is included in the 00 volume to tie all volumes together.
The database can be called using the alias name without the extension. For example, to call the nt database, simply
use "-d nt" on the command line, without the quotes.
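Before extracting a multi-volume database it can help to confirm that no volume was skipped during download. The following is a minimal Python sketch of such a check, not an NCBI tool. The function names group_volumes and missing_volumes are hypothetical, and the sketch assumes that volume numbers are two digits starting at 00. It can only detect gaps below the highest volume that was downloaded.

```python
import re
from collections import defaultdict

# Matches "nt.00.tar.gz" (one volume) or "swissprot.tar.gz" (single archive).
ARCHIVE_RE = re.compile(r"^(?P<db>[A-Za-z0-9_]+?)(?:\.(?P<vol>\d{2}))?\.tar\.gz$")


def group_volumes(filenames):
    """Map each database name to the sorted list of volume numbers found."""
    volumes = defaultdict(list)
    for name in filenames:
        m = ARCHIVE_RE.match(name)
        if m is None:
            continue  # not a BLAST database archive name
        if m.group("vol") is not None:
            volumes[m.group("db")].append(int(m.group("vol")))
        else:
            volumes.setdefault(m.group("db"), [])  # archive without volumes
    return {db: sorted(v) for db, v in volumes.items()}


def missing_volumes(found):
    """Return the volume numbers absent between 00 and the highest one found."""
    if not found:
        return []
    return [n for n in range(max(found) + 1) if n not in found]


if __name__ == "__main__":
    downloaded = ["nt.00.tar.gz", "nt.02.tar.gz", "swissprot.tar.gz"]
    for db, vols in group_volumes(downloaded).items():
        print(db, "volumes:", vols, "missing:", missing_volumes(vols))
```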
Certain databases are subsets of a larger parental database. For those databases, alias and mask files, rather
than actual databases, are provided. The mask file requires the parent database, generated on the same day, to
function properly. For example, to use the pre-formatted swissprot.tar.gz database, one will need to get all volumes
of nr (nr.##.tar.gz).
Additional BLAST databases not provided in pre-formatted formats are available in the FASTA subdirectory. For
genomic BLAST databases, please check the genomes ftp directory at:
ftp.ncbi.nih.gov/genomes/
2. Contents of the /blast/db Directory
The pre-formatted BLAST databases are archived in this directory. The names of these databases and their
contents are listed below.
| Table 2. File Contents of the /blast/db/ Directory |
| File Name | Content Description |
| /FASTA | Subdirectory for sequences in FASTA format |
| blastdb.html | Readme for this subdirectory (this file) |
| env_nr.*tar.gz | Environmental protein sequences |
| env_nt.*tar.gz | Environmental nucleotide sequences |
| est.*tar.gz ¹ | Alias file for preformatted est databases, requires all volumes of est_human, est_mouse, and est_others. |
| est_human.*tar.gz | Human subset of the est database |
| est_mouse.*tar.gz | Mouse subset of the est database |
| est_others.*tar.gz | Non-human and non-mouse subset of the est database |
| gss.*tar.gz | Volumes of the formatted gss database from the GSS division of GenBank, EMBL, and DDBJ |
| htgs.*tar.gz | Volumes of htgs database with entries from HTG division of GenBank, EMBL, and DDBJ |
| human_genomic.*tar.gz | Human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs |
| human_genomic_transcript.tar.gz | Combined database for human genome and refseq transcripts |
| mouse_genomic_transcript.tar.gz | Combined database for mouse genome and refseq transcripts |
| nr.*tar.gz | Non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq. NOTE that nr does NOT contain sequences found in pataa and env_nr databases. |
| nt.*tar.gz | Nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ excluding bulk divisions (gss, sts, pat, est, and htg) as well as wgs entries. Minimally non-redundant. See Section 6. |
| other_genomic.*tar.gz | RefSeq chromosome records (NC_######) for organisms other than human |
| pataa.*tar.gz ² | Patent protein sequence database |
| patnt.*tar.gz ² | Patent nucleotide sequence database. |
| pdbaa.*tar.gz ³ | Protein sequences from pdb protein structures |
| pdbnt.*tar.gz | Nucleotide sequences from pdb nucleic acid structures; its parent database is nt. They are NOT the protein coding sequences for the corresponding pdbaa entries. |
| refseq_genomic.*tar.gz | NCBI genomic reference sequences |
| refseq_protein.*tar.gz | NCBI protein reference sequences |
| refseq_rna.*tar.gz | NCBI Transcript reference sequences |
| sts.*tar.gz | Sequences from the STS division of GenBank, EMBL, and DDBJ |
| swissprot.tar.gz ³ | swiss-prot sequence databases (last major update) |
| taxdb.tar.gz | Additional taxonomy information for the formatted database (contains common and scientific names) |
| wgs.*tar.gz | Volumes for whole genome shotgun sequence assemblies for different organisms |
NOTE:
- ¹: This alias requires all volumes of est_human, est_mouse, and est_others to function properly.
- ²: Both patent databases are directly from USPTO or from EU/Japan Patent Agencies through collaboration with EMBL/DDBJ.
- ³: These are aliases and mask files, which need all volumes of nr to function properly.
- *: Represents the volume number, if present; all volumes are needed to reconstitute that database. Not all databases are in volumes.
3. Contents of the /blast/db/FASTA Directory
This directory contains FASTA formatted sequence files. The file names and database contents are listed
below. These files are archived in .gz format and must be processed through formatdb after inflation
before they can be used with different BLAST programs.
| Table 3. File Contents of the /blast/db/FASTA Directory |
| File Name | Content Description |
| alu.a.gz 1 | Translation of alu.n repeats |
| alu.n.gz 1 | Alu repeat elements |
| drosoph.aa.gz 1 | CDS translations from drosoph.nt |
| drosoph.nt.gz 1 | Genomic sequences for Drosophila |
| ecoli.aa.gz 1 | CDS translations from ecoli.nt |
| ecoli.nt.gz 1 | Escherichia coli K-12 genomic sequences |
| env_nr.gz * | Environmental protein sequences |
| env_nt.gz * | Environmental nucleotide sequences |
| est_human.gz * 2 | Human subset of the est database |
| est_mouse.gz * 2 | Mouse subset of the est database |
| est_others.gz * 2 | Non-human and non-mouse subset of the est database |
| gss.gz * | Entries from GSS division of GenBank, EMBL, and DDBJ |
| htg.gz * | Entries from HTG division of GenBank, EMBL, and DDBJ |
| human_genomic.gz * | Human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs |
| igSeqNt.gz | Nucleotide sequences for human and mouse immunoglobulin variable region |
| igSeqProt.gz | Protein sequences for human and mouse immunoglobulin variable region |
| mito.aa.gz 1 | CDS translations of complete mitochondrial genomes |
| mito.nt.gz 1 | Complete mitochondrial genomes |
| month.aa.gz 3 | Newly released/updated protein sequences |
| month.est_human.gz 3 | Newly released/updated human est sequences |
| month.est_mouse.gz 3 | Newly released/updated mouse est sequences |
| month.est_others.gz 3 | Newly released/updated est other than human/mouse |
| month.gss.gz 3 | Newly released/updated gss sequences |
| month.htgs.gz 3 | Newly released/updated htgs sequences |
| month.nt.gz 3 | Newly released/updated sequences for the nt database |
| nr.gz * | Non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq. NOTE that nr does NOT contain sequences found in pataa and env_nr databases. |
| nt.gz* | Nucleotide sequences from GenBank, EMBL, and DDBJ (excluding gss, sts, pat, est, htg, and wgs) as well as refseq RNA entries. Partially non-redundant. |
| other_genomic.gz * | RefSeq chromosome records (NC_######) for organisms other than human. |
| pataa.gz * 4 | Patent protein sequence database |
| patnt.gz * 4 | Patent nucleotide sequence database |
| pdbaa.gz * | Protein sequences from pdb protein structures |
| pdbnt.gz * | Nucleotide sequences from pdb nucleic acid structures. They are NOT the protein coding sequences for the corresponding pdbaa entries. |
| sts.gz * | Sequence Tag Site entries |
| swissprot.gz * | Swiss-prot database (last major release) |
| vector.gz 1 | Vector sequence database |
| wgs.gz * | Whole genome shotgun genome assemblies |
| yeast.aa.gz 1 | Protein translations from yeast genome |
| yeast.nt.gz 1 | Yeast genome |
NOTE:
- 1: Relatively old databases with no regular update.
- 2: We do not provide the complete est database in FASTA format. One needs to get est_human, est_mouse, and est_others to obtain the complete est FASTA sequences.
- 3: The month.### files contain the sequences released or updated within the last 30 days for that database.
- 4: Both patent databases are directly from USPTO or from EU/Japan Patent Agencies through EMBL/DDBJ.
- *: Preformatted counterparts are available for the databases marked with *.
4. Installation of Preformatted BLAST Databases
Preformatted databases do not require formatting with formatdb. Steps for installing the preformatted nt nucleotide
database and pdb protein database are given below.
Installation of nt nucleotide database:
- Download all volumes of nt.##.tar.gz, through browser, manual ftp session, or update_blastdb.pl;
- If you have a database-specific directory, move the downloaded volumes to that directory;
- Inflate the compressed archive using gunzip, WinZip, or StuffIt;
- Extract the tar files using tar, WinZip, or StuffIt;
- This will produce several individual files sharing the same name (nt.##.) with different .nxx extensions for each
volume, plus an nt.nal alias file (from volume .00);
- Test search the nt installation using "-d nt" (without quotes). If the database directory is not configured in
the .ncbirc (ncbi.ini for PC), prefix the path to nt.
On a PC with WinZip, one can right click the downloaded archive and select "WinZip", then "Extract to here ..." from
the popup menu. Click on "No" in the prompt to inflate the archive. Right click on the resulting file with the .tar extension
and select "WinZip", then "Extract to here ..." to extract the actual database files.
On a Linux or Unix machine, the following command lines can be used:
gunzip <db.##.tar.gz>
tar xvpf <db.##.tar>
Replace <db.##.tar.gz> and <db.##.tar> with the actual archive names. The z option is not passed to tar because
gunzip has already inflated the archive.
As mentioned in Section 2, all volumes of the same database are needed to reconstitute the nt database.
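Where gunzip and tar are not available, the inflate and extract steps can also be done with Python's standard tarfile module. The following is a minimal sketch under that assumption; the function names extract_archive and extract_all_volumes and the directory names in the usage example are hypothetical and are not part of the BLAST package.

```python
import tarfile
from pathlib import Path


def extract_archive(archive, dest_dir):
    """Inflate and unpack one .tar.gz archive into dest_dir; return member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" handles the gzip layer and the tar layer in a single pass.
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def extract_all_volumes(download_dir, db_name, dest_dir):
    """Unpack every archive matching db_name.*tar.gz, for example nt.00.tar.gz."""
    extracted = []
    for archive in sorted(Path(download_dir).glob(f"{db_name}.*tar.gz")):
        extracted.extend(extract_archive(archive, dest_dir))
    return extracted


if __name__ == "__main__":
    # Hypothetical directories; adjust to the local setup.
    for name in extract_all_volumes("downloads", "nt", "blastdb"):
        print("extracted", name)
```

Only archives obtained from a trusted source such as the NCBI ftp site should be unpacked this way, since extractall writes whatever paths the archive contains.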
Installation of pdb protein database:
- Since preformatted pdbaa is a mask file of nr protein database, we will need to download all volumes of nr.##.tar.gz along with pdbaa.tar.gz;
- If a database specific directory is present, move the downloaded files to that directory;
- Inflate and extract the archives as described above for nt;
- For nr, this will result in 9 files sharing the same file name with different .pxx extensions for each volume, plus an nr.pal alias file (from volume .00), while only two files will be generated for pdbaa;
- Test search the pdbaa installation using "-d pdbaa" (without quotes). Provide the path prefix if the database directory is not configured or the database files are not under the working directory.
We can use fastacmd to see if database files can be read properly:
fastacmd -d nt -I
Under Linux, if the database files are placed under the "/home/johndoe/blast-x.y.z/blastdb" directory, which is not specified in .ncbirc,
we can specify the path within the command line:
fastacmd -d /home/johndoe/blast-x.y.z/blastdb/nt -I
On PC, if the database files are placed under "E:\users\johndoe\blast-x.y.z\blastdb" directory, which is not specified in ncbi.ini,
we can use the following command line to access the database:
fastacmd -d E:\users\johndoe\blast-x.y.z\blastdb\nt -I
One can do the same for the pdbaa database, whose mask file tells the BLAST programs which subset of the parent nr database to use.
For more information on setting up BLAST, see "How to setup BLAST on Windows PC" or
"How to setup BLAST under Unix, Linux, MacOSX".
5. Database Updates
NCBI updates the BLAST databases as frequently as possible. Because of the increase in database size, the update
process takes a significant amount of time, so daily updates of the BLAST databases on our FTP site are,
in many cases, simply not possible.
Updating existing databases by merging new records from the month database is not supported because of the difficulties
in removing outdated records from the old database. We do not have an established incremental update
scheme at this time. Our recommendation is to download the databases regularly using the update_blastdb.pl
script we provide to keep their content current.
6. Non-redundant Defline Syntax
The true non-redundant databases are the protein databases nr, swissprot, pdbaa, and pataa. In them, identical sequences are
merged into a single entry. To be merged, two sequences must have identical lengths and every residue at every position
must be the same. The FASTA deflines for the different entries that belong to one nr record are separated by control-A
characters, which are invisible in most text editors. In the example below, the entries gi|3023276, gi|1469284, and gi|1477453
have the same sequence in every respect, so they were collapsed into a single record. The compound defline indicates that it
actually represents three separate sequence records:
>gi|3023276|sp|Q57293|AFUC_ACTPL Ferric transport ATP-binding protein afuC ^A>gi|1469284|gb|AAB05030.1| afuC gene product ^A>gi|1477453|gb|AAB17216.1| afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE
The control-A character (octal 001) can be found using regular expression matching in Perl or other programming tools.
The following code parses the FASTA deflines of nr to count the number of records merged into each sequence entry:
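# Read FASTA lines from STDIN or from files named on the command line.
while (<>) {
    # Only deflines that begin with a gi identifier are examined.
    if (/^>gi\|(\d+)\|/) {
        my $id = $1;
        my $c  = 1;    # the first record in the defline
        # Each control-A (\001) separator marks one more merged record.
        while (/\001/g) {
            $c++;
        }
        print STDERR "gi $id contains $c sequences\n";
    }
}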
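The same idea can be taken one step further by splitting the compound defline into its individual records. The following is a minimal Python sketch; the function names split_defline and gi_numbers are hypothetical, and the sketch assumes the separator is the real control-A byte (\x01), not the two visible characters "^A" used in the printed example above.

```python
def split_defline(defline):
    """Split a compound nr defline into its individual record deflines."""
    # Records are joined by control-A (\x01). A leading ">" on any record
    # is dropped so that every record starts with its identifier.
    parts = defline.rstrip("\n").split("\x01")
    return [p.strip().lstrip(">").strip() for p in parts]


def gi_numbers(defline):
    """Return the gi number of every record that starts with a gi identifier."""
    gis = []
    for record in split_defline(defline):
        fields = record.split("|")
        if len(fields) >= 2 and fields[0] == "gi" and fields[1].isdigit():
            gis.append(int(fields[1]))
    return gis


if __name__ == "__main__":
    example = (
        ">gi|3023276|sp|Q57293|AFUC_ACTPL Ferric transport ATP-binding protein afuC "
        "\x01>gi|1469284|gb|AAB05030.1| afuC gene product "
        "\x01>gi|1477453|gb|AAB17216.1| afuC [Actinobacillus pleuropneumoniae]"
    )
    print(len(split_defline(example)), "records:", gi_numbers(example))
```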
Recently, NCBI re-introduced non-redundancy to the nucleotide nt database. This applies to RefSeq records
and their corresponding GenBank records: GenBank records explicitly cited in the ASN.1 of RefSeq
entries are combined with the RefSeq records to form non-redundant entries. This affects only a
small fraction (less than 2%) of the records in nt.
The syntax of sequence header lines used by the NCBI BLAST server depends on the database from which each sequence
was obtained. The table below lists the identifiers for the databases from which the sequences were derived.
| Table 5. Defline Identifier Syntax for Different Databases |
| Source Database | Defline Identifier Syntax |
| GenBank | gb|accession|locus |
| EMBL Data Library | emb|accession|locus |
| DDBJ, DNA Database of Japan | dbj|accession|locus |
| NBRF PIR | pir||entry |
| Protein Research Foundation | prf||name |
| SWISS-PROT | sp|accession|entry name |
| Brookhaven Protein Data Bank | pdb|entry|chain |
| Patents | pat|country|number |
| GenInfo Backbone Id 1 | bbs|number |
| General database identifier 2 | gnl|database|identifier |
| NCBI Reference Sequence | ref|accession|locus |
| Local Sequence identifier 2 | lcl|identifier |
NOTE:
- 1: Old entries manually created from journal-scan.
- 2: Generally used for local custom databases.
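The syntax in the table can be turned into a small parser. The following is a minimal Python sketch, assuming the function receives only the identifier part of a defline (the text before the first space). The name parse_identifiers and the FIELD_COUNTS table are hypothetical. The field counts are read off the table above, and the "gi" tag, with one numeric field, is added because it prefixes the deflines shown earlier in this section.

```python
# Number of "|"-separated fields that follow each tag. Values are read off
# the table above; "gi" is an addition based on the nr defline example.
FIELD_COUNTS = {
    "gi": 1, "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "prf": 2, "sp": 2,
    "pdb": 2, "pat": 2, "bbs": 1, "gnl": 2, "ref": 2, "lcl": 1,
}


def parse_identifiers(id_string):
    """Split e.g. 'gi|3023276|sp|Q57293|AFUC_ACTPL' into (tag, fields) pairs."""
    tokens = id_string.lstrip(">").split("|")
    result = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag not in FIELD_COUNTS:
            break  # unknown tag or trailing empty token: stop parsing
        n = FIELD_COUNTS[tag]
        result.append((tag, tokens[i + 1:i + 1 + n]))
        i += 1 + n
    return result


if __name__ == "__main__":
    for tag, fields in parse_identifiers("gi|3023276|sp|Q57293|AFUC_ACTPL"):
        print(tag, fields)
```

Empty fields are kept as empty strings, so "pir||entry" yields an empty first field followed by the entry name.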
GenInfo identifiers ("gi") are assigned by NCBI for all sequence records found in NCBI Entrez sequence
databases. The gi identifier provides a uniform and stable naming convention whereby a specific sequence
is assigned its unique gi identifier. If a nucleotide or protein sequence changes, a new gi identifier is
assigned, even if the accession number of the record remains unchanged. Thus gi identifiers provide a
mechanism for identifying the exact sequence that was used or retrieved in a given search. This sequence
change can also be tracked through the version # field added to the Accession number. For example, in
NP_003864.3, the ".3" indicates that the sequence of this record has changed twice since it first appeared.
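The version suffix can be read programmatically. The following is a minimal Python sketch; the function names split_accession and sequence_changes are hypothetical, and the change count assumes that version numbering starts at 1, as the NP_003864.3 example implies.

```python
def split_accession(accession_version):
    """Split 'NP_003864.3' into ('NP_003864', 3); version is None if absent."""
    accession, sep, version = accession_version.partition(".")
    if not sep:
        return accession, None
    return accession, int(version)


def sequence_changes(accession_version):
    """Number of sequence changes since the record first appeared (version 1)."""
    _, version = split_accession(accession_version)
    return None if version is None else version - 1


if __name__ == "__main__":
    print(split_accession("NP_003864.3"), sequence_changes("NP_003864.3"))
```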
We recommend using the "-I T" option when running local BLAST searches:
-I Show GI's in deflines [T/F]
default = F
For databases whose entries are not from official NCBI sequence databases, such as Trace and ##.seq.uniq files
from UniGene, the "gnl|" convention is used. For a custom database, this convention should be followed, and the id
for each sequence must be unique, if one would like to take advantage of an indexed database, which enables
specific sequence retrieval with fastacmd, a database utility from the standalone BLAST package.
7. Formatting FASTA Sequence Files
BLAST programs cannot recognize and use a FASTA sequence file as an input database. FASTA sequence files need to be
formatted with formatdb, another database utility from the standalone BLAST package, before they can be used in a local
BLAST search. For FASTA files from NCBI, users should use the following formatdb command lines:
For nucleotide: formatdb -i input_db -p F -o T
For protein: formatdb -i input_db -p T -o T
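When many FASTA files need formatting, the two command lines above can be generated from a script. The following is a minimal Python sketch; the function names formatdb_command and run_formatdb are hypothetical, only the -i, -p, and -o options shown above are used, and formatdb is assumed to be on the PATH.

```python
import subprocess


def formatdb_command(fasta_path, is_protein):
    """Build the formatdb argument list shown above for one FASTA file."""
    return [
        "formatdb",
        "-i", fasta_path,                   # input FASTA file
        "-p", "T" if is_protein else "F",   # T for protein, F for nucleotide
        "-o", "T",                          # same setting as the command lines above
    ]


def run_formatdb(fasta_path, is_protein):
    """Run formatdb; raises CalledProcessError if it exits with an error."""
    subprocess.run(formatdb_command(fasta_path, is_protein), check=True)


if __name__ == "__main__":
    # Print the commands without running them.
    print(" ".join(formatdb_command("nt", False)))
    print(" ".join(formatdb_command("nr", True)))
```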
The -A option introduced in 2.2.3 is now built into the formatdb program and has not been in the list of configurable
options since 2.2.8. This function enables formatdb to properly handle large sequences (longer than 16 million bases).
Databases prepared using formatdb version 2.2.8 and later will not be backward compatible with BLAST programs older
than version 2.2.3. Please refer to formatdb.html for more information:
ftp.ncbi.nlm.nih.gov/blast/documents/formatdb.html
8. Technical Support
Questions and comments on this document, and other NCBI BLAST related questions, should be sent to the blast-help group at:
blast-help@ncbi.nlm.nih.gov
To help us minimize the time spent gathering the necessary information, please provide detailed information when
reporting a BLAST related problem to us. This includes the chip/OS combination, BLAST version, database
and query, formatdb command line and log file, as well as error messages, the RID, etc. This will help expedite
the resolution of the problem.
For information about other NCBI resources and services, please send email to NCBI User Service at:
info@ncbi.nlm.nih.gov
9. Appendix
The script "update_blastdb.pl" simplifies the BLAST database update process. To use this script, users will need to
have Perl installed. The following command will check and update the nr protein database automatically:
update_blastdb.pl nr
For users not familiar with the available preformatted databases on NCBI’s BLAST ftp site, we recommend
they execute the following command first to obtain the list:
update_blastdb.pl --showall
For reference, the program options for update_blastdb.pl are given below.
NAME
update_blastdb.pl - Download pre-formatted BLAST databases from NCBI
SYNOPSIS
update_blastdb.pl [options] blastdb ...
OPTIONS
--showall
Show all available pre-formatted BLAST databases (default: false).
The output of this option lists the database names which should be used
when requesting downloads or updates using this script.
--passive
Use passive FTP, useful when behind a firewall (default: false).
--timeout
Timeout on connection to NCBI (default: 120 seconds).
--force
Force download even if there is a archive already on local directory
(default: false).
--verbose
Increment verbosity level (default: 1). Repeat this
option multiple times to increase the verbosity level (maximum 2).
--quiet
Produce no output (default: false). Overrides the --verbose option.
DESCRIPTION
This script will download the pre-formatted BLAST
databases requested in the command line from the NCBI ftp site.
EXIT CODES
This script returns 0 on success and a non-zero value on errors.
BUGS
Please report them to
COPYRIGHT
See PUBLIC DOMAIN NOTICE included at the top of this script.