FTP Downloadable BLAST Databases From NCBI
Tao Tao, PhD
User Service
NCBI, NLM, NIH
 
1. Introduction

This document describes the "BLAST" databases available on the NCBI FTP site under:
ftp.ncbi.nih.gov/blast/db/

NCBI BLAST home pages (www.ncbi.nih.gov/BLAST/) use a standard set of BLAST databases for Nucleotide, Protein, and Translated BLAST searches. These preformatted databases are made available as compressed archives at ftp.ncbi.nih.gov/blast/db/. The FASTA databases reside under the FASTA subdirectory.

    The pre-formatted databases offer the following advantages:
  • Pre-formatting breaks large databases into smaller, more manageable volumes, which are easier to download;
  • The smaller size will help avoid file size limitations users may encounter on certain platforms;
  • Sequences in FASTA format can be generated from the pre-formatted databases by the fastacmd utility;
  • A convenient script (update_blastdb.pl) is available to download the pre-formatted databases from the NCBI ftp site;
  • Pre-formatting removes the need to run formatdb;
  • Taxonomy ids are available for each database entry.

Pre-formatted databases must be downloaded in binary mode using an ftp client, a web browser, or the update_blastdb.pl script provided by NCBI. Documentation for the update_blastdb.pl script can be obtained by running the script without any arguments (perl is required).
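As a hedged illustration of a binary-mode transfer, the sketch below uses Python's standard ftplib module. The helper names fetch_binary and fetch_from_ncbi are assumptions made for this example only, and update_blastdb.pl remains the script NCBI provides for this task.

```python
from ftplib import FTP


def fetch_binary(ftp, remote_name, local_path):
    """Download remote_name over an open FTP connection in binary mode."""
    with open(local_path, "wb") as handle:
        # retrbinary transfers the bytes unchanged (no newline translation).
        ftp.retrbinary("RETR " + remote_name, handle.write)
    return local_path


def fetch_from_ncbi(archive_name):
    """Hypothetical wrapper: anonymous login, then fetch from /blast/db/."""
    with FTP("ftp.ncbi.nih.gov") as ftp:
        ftp.login()            # no arguments means anonymous login
        ftp.cwd("/blast/db")
        return fetch_binary(ftp, archive_name, archive_name)
```

For example, fetch_from_ncbi("taxdb.tar.gz") would save the archive under the same name in the current working directory, assuming the server is reachable.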

We strongly recommend that users use the preformatted databases whenever possible. In addition to the advantages listed above, there is a less well-known logistical reason: our BLAST databases are generated directly from our backend relational databases in preformatted form and can be loaded to our ftp site directly after tar and gzip. To generate the FASTA files for ftp, we must first dump the FASTA sequences from the preformatted databases using fastacmd before we can gzip them and load them to our ftp server. Users who do need the FASTA sequences can easily get them from the preformatted databases using the "-D 1" option in fastacmd.

More information on fastacmd and formatdb is available in "Program Parameters for formatdb and fastacmd".

The compressed database files must be inflated with gzip or another compatible tool. The BLAST database files can then be extracted from the resulting tar file using the tar program on Unix/Linux, or WinZip and StuffIt Expander on Windows and Macintosh platforms, respectively.
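Where none of these tools is at hand, the same two steps can be sketched with Python's standard tarfile module. This is a minimal illustration, not an NCBI-provided tool, and the function name extract_archive is an assumption for the example.

```python
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Inflate and unpack a .tar.gz archive in one step; return member names."""
    # Mode "r:gz" makes tarfile remove the gzip layer while reading the tar file.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names
```

Calling extract_archive("nt.00.tar.gz", "blastdb") would place the files of that volume under a blastdb directory; the directory name here is only a placeholder.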

Large databases are formatted in multiple one-gigabyte volumes, which are named using the database.##.tar.gz convention, with ## representing the volume number. All relevant volumes are required to reconstitute the database. An alias file, with a .nal or .pal extension, is included in the 00 volume to tie all volumes together. The database can be called using the alias name without the extension. For example, to call the nt database, simply use "-d nt" on the command line, without the quotes.
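Because a missing volume leaves the database incomplete, it can help to check a download directory before extraction. The sketch below is one possible check; the function name missing_volumes is hypothetical, and the code assumes that volumes are numbered consecutively starting at 00, which this document does not state explicitly.

```python
import re


def missing_volumes(db_name, file_names):
    """Return the volume numbers absent from a database.##.tar.gz series.

    Assumption: volumes are numbered consecutively from 00.
    """
    pattern = re.compile(re.escape(db_name) + r"\.(\d+)\.tar\.gz$")
    found = set()
    for name in file_names:
        match = pattern.match(name)   # anchored at the start of the name
        if match:
            found.add(int(match.group(1)))
    if not found:
        return []
    # Only gaps below the highest volume seen can be detected.
    return [n for n in range(max(found) + 1) if n not in found]
```

Note that this can only report gaps below the highest volume number present; it cannot tell whether trailing volumes are missing.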

Certain databases are subsets of a larger parent database. For those databases, alias and mask files, rather than actual databases, are provided. The mask file requires the parent database, generated on the same day, to function properly. For example, to use the pre-formatted swissprot.tar.gz database, one will need to get all volumes of nr (nr.##.tar.gz).

Additional BLAST databases that are not provided in pre-formatted form are available in the FASTA subdirectory. For genomic BLAST databases, please check the genomes ftp directory at:
ftp.ncbi.nih.gov/genomes/

2. Contents of the /blast/db Directory

The pre-formatted BLAST databases are archived in this directory. The names of these databases and their contents are listed below.

Table 2. File Contents of the /blast/db/ Directory
File Name                           Content Description
/FASTA                              Subdirectory for sequences in FASTA format
blastdb.html                        Readme for this subdirectory (this file)
env_nr.*tar.gz                      Environmental protein sequences
env_nt.*tar.gz                      Environmental nucleotide sequences
est.*tar.gz (1)                     Alias file for the preformatted est databases; requires all volumes of est_human, est_mouse, and est_others
est_human.*tar.gz                   Human subset of the est database
est_mouse.*tar.gz                   Mouse subset of the est database
est_others.*tar.gz                  Non-human and non-mouse subset of the est database
gss.*tar.gz                         Volumes of the formatted gss database from the GSS division of GenBank, EMBL, and DDBJ
htgs.*tar.gz                        Volumes of the htgs database with entries from the HTG division of GenBank, EMBL, and DDBJ
human_genomic.*tar.gz               Human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs
human_genomic_transcript.tar.gz     Combined database for human genome and refseq transcripts
mouse_genomic_transcript.tar.gz     Combined database for mouse genome and refseq transcripts
nr.*tar.gz                          Non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq. NOTE that nr does NOT contain sequences found in the pataa and env_nr databases.
nt.*tar.gz                          Nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ, excluding bulk divisions (gss, sts, pat, est, and htg) as well as wgs entries. Minimally non-redundant. See Section 6
other_genomic.*tar.gz               RefSeq chromosome records (NC_######) for organisms other than human
pataa.*tar.gz (2)                   Patent protein sequence database
patnt.*tar.gz (2)                   Patent nucleotide sequence database
pdbaa.*tar.gz (3)                   Protein sequences from pdb protein structures
pdbnt.*tar.gz                       Nucleotide sequences from pdb nucleic acid structures; its parent database is nt. They are NOT the protein coding sequences for the corresponding pdbaa entries.
refseq_genomic.*tar.gz              NCBI genomic reference sequences
refseq_protein.*tar.gz              NCBI protein reference sequences
refseq_rna.*tar.gz                  NCBI transcript reference sequences
sts.*tar.gz                         Sequences from the STS division of GenBank, EMBL, and DDBJ
swissprot.tar.gz (3)                Swiss-Prot sequence database (last major update)
taxdb.tar.gz                        Additional taxonomy information for the formatted databases (contains common and scientific names)
wgs.*tar.gz                         Volumes of whole genome shotgun sequence assemblies for different organisms
    NOTE:
  1. This alias requires all volumes of est_human, est_mouse, and est_others to function properly.
  2. Both patent databases are directly from USPTO or from EU/Japan Patent Agencies through collaboration with EMBL/DDBJ.
  3. These are alias and mask files, which need all volumes of nr to function properly.
  4. * represents the volume numbers, if present; all volumes are needed to reconstitute that database. Not all databases are in volumes.

3. Contents of the /blast/db/FASTA Directory

This directory contains FASTA formatted sequence files. The file names and database contents are listed below. These files are archived in .gz format and must be processed through formatdb after inflation before they can be used with different BLAST programs.

Table 3. File Contents of the /blast/db/FASTA/ Directory
File Name                 Content Description
alu.a.gz (1)              Translation of alu.n repeats
alu.n.gz (1)              Alu repeat elements
drosoph.aa.gz (1)         CDS translations from drosoph.nt
drosoph.nt.gz (1)         Genomic sequences for drosophila
ecoli.aa.gz (1)           CDS translations from ecoli.nt
ecoli.nt.gz (1)           Escherichia coli K-12 genomic sequences
env_nr.gz *               Environmental protein sequences
env_nt.gz *               Environmental nucleotide sequences
est_human.gz * (2)        Human subset of the est database
est_mouse.gz * (2)        Mouse subset of the est database
est_others.gz * (2)       Non-human and non-mouse subset of the est database
gss.gz *                  Entries from the GSS division of GenBank, EMBL, and DDBJ
htg.gz *                  Entries from the HTG division of GenBank, EMBL, and DDBJ
human_genomic.gz *        Human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs
igSeqNt.gz                Nucleotide sequences for human and mouse immunoglobulin variable regions
igSeqProt.gz              Protein sequences for human and mouse immunoglobulin variable regions
mito.aa.gz (1)            CDS translations of complete mitochondrial genomes
mito.nt.gz (1)            Complete mitochondrial genomes
month.aa.gz (3)           Newly released/updated protein sequences
month.est_human.gz (3)    Newly released/updated human est sequences
month.est_mouse.gz (3)    Newly released/updated mouse est sequences
month.est_others.gz (3)   Newly released/updated est other than human/mouse
month.gss.gz (3)          Newly released/updated gss sequences
month.htgs.gz (3)         Newly released/updated htgs sequences
month.nt.gz (3)           Newly released/updated sequences for the nt database
nr.gz *                   Non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq. NOTE that nr does NOT contain sequences found in the pataa and env_nr databases.
nt.gz *                   Nucleotide sequences from GenBank, EMBL, and DDBJ (excluding gss, sts, pat, est, htg, and wgs) as well as refseq RNA entries. Partially non-redundant.
other_genomic.gz *        RefSeq chromosome records (NC_######) for organisms other than human
pataa.gz * (4)            Patent protein sequence database
patnt.gz * (4)            Patent nucleotide sequence database
pdbaa.gz *                Protein sequences from pdb protein structures
pdbnt.gz *                Nucleotide sequences from pdb nucleic acid structures. They are NOT the protein coding sequences for the corresponding pdbaa entries.
sts.gz *                  Sequence Tag Site entries
swissprot.gz *            Swiss-Prot database (last major release)
vector.gz (1)             Vector sequence database
wgs.gz *                  Whole genome shotgun genome assemblies
yeast.aa.gz (1)           Protein translations from yeast genome
yeast.nt.gz (1)           Yeast genome
    NOTE:
  1. Relatively old databases with no regular update.
  2. We do not provide the complete est database in FASTA format. One needs to get est_human, est_mouse, and est_others to obtain the complete est FASTA sequences.
  3. month.### are the sequences released or updated within the last 30 days for that database.
  4. Both patent databases are directly from USPTO or from EU/Japan Patent Agencies through EMBL/DDBJ.
  5. Preformatted counterparts are available for the databases marked with *.

4. Installation of Preformatted BLAST Databases

Preformatted databases do not require formatting with formatdb. Steps for installing the preformatted nt nucleotide database and pdb protein database are given below.

    Installation of nt nucleotide database:
  • Download all volumes of nt.##.tar.gz, through browser, manual ftp session, or update_blastdb.pl;
  • If you have a database-specific directory, move the downloaded volumes to that directory;
  • Inflate the compressed archives using gunzip, WinZip, or StuffIt;
  • Extract the tar files using tar, WinZip, or StuffIt;
  • This will produce several individual files sharing the same name (nt.##.) with different .nxx extensions for each volume, plus an nt.nal alias file (from volume .00);
  • Test the nt installation with a search using "-d nt" (without quotes). If the database directory is not configured in .ncbirc (ncbi.ini for PC), prefix the path to nt.
On a PC with WinZip, right-click the downloaded archive and select "WinZip", then "Extract to here ..." from the popup menu. Click "No" in the prompt to inflate the archive. Right-click the resulting file with the .tar extension and select "WinZip", then "Extract to here ..." to extract the actual database files.

On a Linux or Unix machine, the following command lines can be used:

gunzip <db.##.tar.gz>
tar xvpf <db.##.tar>
Replace <db.##.tar.gz> and <db.##.tar> with the actual archive names. Because gunzip has already removed the compression, tar is called without the z option.

As mentioned in Section 2, all volumes of the same database are needed to reconstitute the nt database.

    Installation of pdb protein database:
  • Since preformatted pdbaa is a mask file of nr protein database, we will need to download all volumes of nr.##.tar.gz along with pdbaa.tar.gz;
  • If a database specific directory is present, move the downloaded files to that directory;
  • Inflate and extract the archives as described above for nt;
  • For nr, this will result in 9 files sharing the same file name with different .pxx extensions for each volume, plus an nr.pal alias file (from volume .00), while only two files will be generated for pdbaa;
  • Test the pdbaa installation with a search using "-d pdbaa" (without quotes). Provide a path prefix if the database directory is not configured or the database files are not under the working directory.

We can use fastacmd to see if database files can be read properly:

fastacmd -d nt -I

Under Linux, if the database files are placed under the "/home/johndoe/blast-x.y.z/blastdb" directory, which is not specified in .ncbirc, we can specify the path within the command line:

fastacmd -d /home/johndoe/blast-x.y.z/blastdb/nt -I

On PC, if the database files are placed under "E:\users\johndoe\blast-x.y.z\blastdb" directory, which is not specified in ncbi.ini, we can use the following command line to access the database:

fastacmd -d E:\users\johndoe\blast-x.y.z\blastdb\nt -I

One can do the same for the pdbaa database, whose mask file tells the BLAST programs which subset of the parent nr database to use.

For more information on setting up BLAST, see "How to setup BLAST on Windows PC" or "How to setup BLAST under Unix, Linux, MacOSX".

5. Database Updates

NCBI updates the BLAST databases as frequently as possible. Because of the growth in database size, the update process takes a significant amount of time, so daily updates of the BLAST databases on our FTP site are, in many cases, simply not possible.

Updating existing databases by merging new records from the month databases is not supported because of the difficulty of removing outdated records from the old database. We do not have an established incremental update scheme at this time. Our recommendation is to download the databases regularly using the update_blastdb.pl script we provide to keep their content current.

6. Non-redundant Defline Syntax

The true non-redundant databases are the protein databases nr, swissprot, pdbaa, and pataa. In them, identical sequences are merged into a single entry. To be merged, two sequences must have identical lengths and the same residue at every position. The FASTA deflines for the different entries that belong to one nr record are separated by control-A characters, which are invisible in most text editors. In the example below, entries gi|1469284 and gi|1477453 have exactly the same sequence as gi|3023276, so all of them were collapsed into a single record. The compound defline indicates that the record actually represents three separate sequence records (control-A is shown as ^A):

>gi|3023276|sp|Q57293|AFUC_ACTPL Ferric transport ATP-binding protein afuC ^A>gi|1469284|gb|AAB05030.1| afuC gene product ^A>gi|1477453|gb|AAB17216.1| afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE

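This control-A character can be found using regular expression matching in Perl or other programming tools. The following code reads FASTA input, parses each nr defline, and counts the number of merged entries in each record:

use strict;
use warnings;

# Read FASTA lines from standard input or from files named on the command line.
while (<>) {
  # Only deflines that begin with a gi identifier are examined.
  if (/^>gi\|(\d+)\|/) {
    my $id = $1;
    my $c = 1;          # the first entry precedes any control-A
    while (/\001/g) {   # each control-A (\001) introduces one more entry
      $c++;
    }
    print STDERR "gi $id contains $c sequences\n";
  }
}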
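The same idea can be sketched in Python for readers who prefer it. This is an illustrative helper only: the name split_defline is an assumption, and the code accepts entries with or without a ">" after each control-A, since only the form shown in the example above is documented here.

```python
def split_defline(defline):
    """Split a compound nr defline into its individual deflines."""
    # Entries after the first are introduced by control-A (\x01).
    parts = defline.lstrip(">").split("\x01")
    # Drop an optional leading ">" and surrounding whitespace from each entry.
    return [part.lstrip(">").strip() for part in parts]
```

The length of the returned list is the number of sequence records merged into the entry.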

Recently, NCBI re-introduced non-redundancy to the nucleotide nt database. This applies to RefSeq records and their corresponding GenBank records, i.e. GenBank records explicitly stated in the ASN.1 of RefSeq entries are combined with the RefSeq records to form the non-redundant entries. This affects only a small fraction (less than 2%) of the records in nt.

The syntax of sequence header lines used by the NCBI BLAST server depends on the database from which each sequence was obtained. The table below lists the identifiers for the databases from which the sequences were derived.

Table 5. Defline Identifier Syntax for Different Databases
Source Database                   Defline Identifier Syntax
GenBank                           gb|accession|locus
EMBL Data Library                 emb|accession|locus
DDBJ, DNA Database of Japan       dbj|accession|locus
NBRF PIR                          pir||entry
Protein Research Foundation       prf||name
SWISS-PROT                        sp|accession|entry name
Brookhaven Protein Data Bank      pdb|entry|chain
Patents                           pat|country|number
GenInfo Backbone Id (1)           bbs|number
General database identifier (2)   gnl|database|identifier
NCBI Reference Sequence           ref|accession|locus
Local Sequence identifier (2)     lcl|identifier
    NOTE:
  1. Old entries manually created from journal-scan.
  2. Generally used for local custom databases.
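To make the table concrete, the following sketch splits the identifier portion of a defline into (tag, fields) pairs. It is an illustration under stated assumptions: the names FIELD_COUNTS and parse_identifiers are hypothetical, the field counts are read off the syntax column above, and the "gi" tag is added because it prefixes the deflines shown in this document.

```python
# Number of "|"-separated fields that follow each tag, read from Table 5.
# "gi" (one field) is added because it prefixes the deflines in this document.
FIELD_COUNTS = {
    "gi": 1, "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "prf": 2, "sp": 2,
    "pdb": 2, "pat": 2, "bbs": 1, "gnl": 2, "ref": 2, "lcl": 1,
}


def parse_identifiers(seq_id):
    """Split an identifier string such as 'gi|1|gb|X|Y' into (tag, fields) pairs."""
    tokens = seq_id.split("|")
    pairs = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag not in FIELD_COUNTS:
            break  # unknown tag or trailing text: stop rather than guess
        count = FIELD_COUNTS[tag]
        pairs.append((tag, tokens[i + 1:i + 1 + count]))
        i += 1 + count
    return pairs
```

An empty field, as in pir||entry or a GenBank identifier with no locus, is kept as an empty string so that field positions stay aligned with the table.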

GenInfo identifiers ("gi") are assigned by NCBI to all sequence records found in the NCBI Entrez sequence databases. The gi identifier provides a uniform and stable naming convention whereby a specific sequence is assigned its own unique gi identifier. If a nucleotide or protein sequence changes, a new gi identifier is assigned, even if the accession number of the record remains unchanged. Thus gi identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search. A sequence change can also be tracked through the version number appended to the accession number. For example, in NP_003864.3, the ".3" indicates that the sequence of this record has changed twice since it first appeared.
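The accession.version form described above can be separated with a few lines of Python. This is a small sketch; the function name split_accession is an assumption for the example.

```python
def split_accession(versioned):
    """Split 'NP_003864.3' into ('NP_003864', 3); version is None if absent."""
    accession, dot, version = versioned.rpartition(".")
    if not dot or not version.isdigit():
        return versioned, None
    return accession, int(version)
```

For NP_003864.3 this yields the accession NP_003864 and version 3, which corresponds to the two sequence changes mentioned in the text.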

We recommend using the "-I T" option when doing local BLAST searches:

-I Show GI's in deflines [T/F]
      default = F

For databases whose entries are not from official NCBI sequence databases, such as Trace and the ##.seq.uniq files from UniGene, the "gnl|" convention is used. Custom databases should follow this convention, and the id for each sequence must be unique, in order to take advantage of database indexing, which enables retrieval of specific sequences with the fastacmd utility from the standalone BLAST package.

7. Formatting FASTA Sequence Files

BLAST programs cannot use a FASTA sequence file directly as an input database. FASTA sequence files need to be formatted with formatdb, another database utility from the standalone BLAST package, before they can be used in local BLAST searches. For FASTA files from NCBI, users should use the following formatdb command lines:

For nucleotide:   formatdb -i input_db -p F -o T
For protein:      formatdb -i input_db -p T -o T
The -A option introduced in 2.2.3 is now built into the formatdb program and has been removed from the list of configurable options since 2.2.8. This function enables formatdb to properly handle large sequence files (longer than 16 million bases). Databases prepared using formatdb version 2.2.8 and later are not backward compatible with BLAST programs older than version 2.2.3. Please refer to formatdb.html for more information:

ftp.ncbi.nlm.nih.gov/blast/documents/formatdb.html
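When formatting is scripted, the two formatdb command lines given in this section differ only in the -p value. The sketch below builds the argument list for either case; the function name formatdb_command is hypothetical, and only the options shown in this section (-i, -p, -o) are used.

```python
def formatdb_command(fasta_path, is_protein):
    """Build the formatdb argument list for a nucleotide or protein FASTA file."""
    # -p T marks protein input, -p F nucleotide input; -o T matches this section.
    molecule_flag = "T" if is_protein else "F"
    return ["formatdb", "-i", fasta_path, "-p", molecule_flag, "-o", "T"]
```

The resulting list can be passed to subprocess.run on a machine where formatdb is installed and on the search path.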

8. Technical Support

Questions and comments on this document, as well as other NCBI BLAST related questions, should be sent to the blast-help group at:
blast-help@ncbi.nlm.nih.gov

To help us minimize the time spent gathering the necessary information, please provide detailed information when reporting a BLAST related problem to us. This includes the chip/OS combination, BLAST version, database and query, formatdb command line and log file, as well as any error messages and the RID. This will help expedite the resolution of the problem.

For information about other NCBI resources and services, please send email to NCBI User Service at: info@ncbi.nlm.nih.gov

9. Appendix

The script "update_blastdb.pl" simplifies the BLAST database update process. To use this script, users will need to have perl installed. The following command will check and update the nr protein database automatically:

update_blastdb.pl nr

For users not familiar with the available preformatted databases on NCBI’s BLAST ftp site, we recommend they execute the following command first to obtain the list:

update_blastdb.pl --showall

For reference purposes, the program options for update_blastdb.pl are given below.

NAME
       update_blastdb.pl - Download pre-formatted BLAST databases from NCBI

SYNOPSIS
       update_blastdb.pl [options] blastdb ...

OPTIONS
       --showall
         Show all available pre-formatted BLAST databases (default: false). 
         The output of this option lists the database names which should be used 
         when requesting downloads or updates using this script.

       --passive
         Use passive FTP, useful when behind a firewall (default: false).

       --timeout
         Timeout on connection to NCBI (default: 120 seconds).

       --force
         Force download even if there is a archive already on local directory
         (default: false).

       --verbose
         Increment verbosity level (default: 1). Repeat this
         option multiple times to increase the verbosity level (maximum 2).

       --quiet
         Produce no output (default: false). Overrides the --verbose option.

DESCRIPTION
       This script will download the pre-formatted BLAST
       databases requested in the command line from the NCBI ftp site.

EXIT CODES
       This script returns 0 on success and a non-zero value on errors.

BUGS
       Please report them to 

COPYRIGHT
       See PUBLIC DOMAIN NOTICE included at the top of this script.