Latest news: New BLAST design to be released on April 16, 2007
MEGABLAST is the tool of choice to identify a sequence. 

The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information. All of the nucleotide-nucleotide BLAST programs can be used to accomplish this goal. However, MEGABLAST is specifically designed to efficiently find long alignments between very similar sequences and thus is the best tool to use to find the identical match to your query sequence. In addition to the expect value significance cut-off, MEGABLAST also provides an adjustable percent identity cut-off that overrides the significance threshold.

NOTE: Web MEGABLAST can also accept batch queries; see the section on batch queries below.

Standard nucleotide BLAST is better at finding sequences similar, but not identical, to your query. 

The BLAST nucleotide algorithm finds similar sequences by generating an indexed table or dictionary of short subsequences called words for both the query and the database. The program can then rapidly find initial exact matches to the query words by simply looking up a particular word in the database dictionary. These initial matches serve as starting points for longer alignments that are generated in several steps, ending with a final gapped alignment.
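The word lookup described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not the BLAST implementation: the function names `build_word_index` and `find_seeds` are hypothetical, only the subject sequence is indexed, and the extension and scoring steps are omitted.

```python
from collections import defaultdict

def build_word_index(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index

def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs for exact word matches."""
    subject_index = build_word_index(subject, word_size)
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        # Dictionary lookup: no scan of the subject is needed per word.
        for s_pos in subject_index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds

subject = "TTACGTACGTTAGCAA"            # made-up sequences for illustration
exact_query = "ACGTACGTTAGC"
mismatch_query = "ACGTACGTAAGC"         # one substitution near the 3' end

print(find_seeds(exact_query, subject, word_size=11))
print(find_seeds(mismatch_query, subject, word_size=11))
print(find_seeds(mismatch_query, subject, word_size=7))
```

In this toy case the query with one substitution yields no seeds at a word size of 11 but still yields seeds at a word size of 7, which is the sense in which a shorter word makes a search more sensitive.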

One of the important parameters governing the sensitivity of BLAST searches is the length of the initial words (word size). The most important reason that blastn is more sensitive than MEGABLAST is that it uses a shorter default word size. Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms, since the initial exact match can be shorter. The word size is adjustable in blastn and can be reduced from the default value of 11 to a minimum of 7 to increase sensitivity. The word size can also be increased to speed up the search and limit the number of database hits. Alternatively, one can use MEGABLAST with a relaxed percent identity cutoff (the default is 99%).

Nucleotide-nucleotide searches are not the recommended way to find homologous protein coding regions in other organisms. It is better to perform searches at the protein level, either with translations of the nucleotide sequences or by direct protein-protein BLAST. This is because of the degeneracy of the genetic code, the greater information available in amino acid sequence, and the more sophisticated algorithm in protein-protein BLAST.

"Search for short and near exact matches" under Nucleotide BLAST is useful for primer or short nucleotide motif searches.  

Short sequences (less than 20 bases) will often not find any significant matches to the database entries under the standard nucleotide-nucleotide BLAST settings. The usual reasons for this are that the significance threshold governed by the expect value parameter is set too stringently and the default word size parameter is set too high.

You can adjust both the word size and the expect value on the standard BLAST pages to work with short sequences. However, we do provide a BLAST page with these values preset to give optimum results with short sequences. This page ("Search for short and nearly exact matches") is linked under the nucleotide BLAST section of the main BLAST page. The adjustments are described in the table below.

Program                              Word Size  Filter Setting  Expect Value
Standard Nucleotide BLAST            11         On (DUST)       10
Search for short/near exact matches  7          Off             1000

A common use of this page is to check the specificity of primers used in the polymerase chain reaction (PCR) or hybridization. A useful way to check a pair of PCR primers is to concatenate them and search them as one sequence. The forward primer and the reverse primer can simply be pasted together with a string of ten or more N's between the two sequences. Since BLAST looks for local alignments and searches both strands, there is no need to reverse complement one of the primers before doing the concatenation or the search.
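A minimal sketch of the concatenation step follows. The helper name `concatenate_primers` and both primer sequences are made up for illustration; the only rule taken from the text is the spacer of ten or more N's.

```python
def concatenate_primers(forward, reverse, spacer_length=10):
    """Join two primers with a run of N's so they can be searched as one query."""
    if spacer_length < 10:
        raise ValueError("use a spacer of ten or more N's")
    # No reverse complement is taken: the search covers both strands.
    return forward.upper() + "N" * spacer_length + reverse.upper()

forward_primer = "AGACAGATCACTTCAGTCGC"   # made-up example sequence
reverse_primer = "ATAATTGCCTTATGGCGACT"   # made-up example sequence

print(">primer_pair")
print(concatenate_primers(forward_primer, reverse_primer))
```

The printed two-line FASTA record can be pasted into the search box as a single query.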

NOTE: The query sequence should contain no ambiguous bases. Consensus motifs with degenerate bases will not work for this type of search.

Standard protein BLAST is designed for protein searches 

Standard protein-protein BLAST (blastp) is used for both identifying a query amino acid sequence and for finding similar sequences in protein databases. Like other BLAST programs, blastp is designed to find local regions of similarity. However, when sequence similarity spans the whole sequence, blastp will report a global alignment, which is the preferred result for protein identification purposes.

Unlike nucleotide BLAST, there is no comparable MEGABLAST for protein searches.

PSI-BLAST is designed for more sensitive protein-protein similarity searches. 

Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins. Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as "hypothetical protein" or "similar to..."

The first round of PSI-BLAST is a standard protein-protein BLAST search. The program builds a position-specific scoring matrix (PSSM or profile) from an alignment of the sequences returned with Expect values better (lower) than the inclusion threshold (0.005 by default). In the second iteration the PSSM becomes the query in the search. Any new database hits below the inclusion threshold are included in a new PSSM. The PSI-BLAST search is said to have converged when no more new database sequences are added in subsequent iterations.
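The iteration logic can be sketched as a simple loop. This is an illustration of the convergence rule only, under stated assumptions: `run_search` is a hypothetical callable standing in for one search round, `toy_search` returns invented Expect values, and no PSSM is actually built.

```python
def iterate_until_converged(run_search, query, inclusion_threshold=0.005,
                            max_iterations=10):
    """Repeat search rounds until no new sequence passes the inclusion threshold.

    run_search(query, included_ids) is a hypothetical stand-in for one round;
    it must return a dict of {sequence_id: expect_value}.
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        hits = run_search(query, frozenset(included))
        passing = {seq_id for seq_id, e_value in hits.items()
                   if e_value < inclusion_threshold}
        new_ids = passing - included
        if not new_ids:
            return included, iteration, True    # converged
        included |= new_ids                     # these would feed the next PSSM
    return included, max_iterations, False      # stopped without converging

def toy_search(query, included_ids):
    """Invented results: a richer profile scores distant relatives better."""
    rounds = [
        {"seqA": 1e-30, "seqB": 0.5},
        {"seqA": 1e-40, "seqB": 1e-4, "seqC": 2.0},
        {"seqA": 1e-40, "seqB": 1e-6, "seqC": 0.8},
    ]
    return rounds[min(len(included_ids), len(rounds) - 1)]

included, iterations, converged = iterate_until_converged(toy_search, "MKT")
print(sorted(included), iterations, converged)
```

In the toy data, seqB only drops below the threshold once seqA has been included, and seqC never does, so the loop stops when a round adds nothing new.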

You can add database hits that fall outside the inclusion threshold to your PSSM for the next round by checking the box next to the hit.

You can also save a PSSM created during a PSI-BLAST search of one database and use it to search a different database. To do this, change "Alignment" to "PSSM" in a pulldown menu in the Format section of a "formatting BLAST" page (at any iteration after the first). Then format the search, copy the resulting PSSM and paste it into the Options section of a new PSI-BLAST search page.

PHI-BLAST can do a restricted protein pattern search. 

Pattern-Hit Initiated (PHI) BLAST is designed to search for proteins that contain a pattern specified by the user, AND are similar to the query sequence in the vicinity of the pattern. This dual requirement is intended to reduce the number of database hits that contain the pattern, but are likely to have no true homology to the query.

To run PHI-BLAST, enter your query (which contains one or more instances of the pattern) into the "Search" box, and enter your pattern into the "PHI pattern" box in the "Options" section. Patterns must follow the syntax conventions of PROSITE. The documentation on Pattern Syntax is at: http://www.ncbi.nlm.nih.gov/blast/html/PHIsyntax.html.
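Before submitting a search it can help to confirm that the query really contains the pattern. The sketch below translates the common PROSITE notation into a Python regular expression; it is an assumption-laden helper, not part of PHI-BLAST. The names `prosite_to_regex` and `pattern_positions` are hypothetical, the example pattern and sequence are made up, and only the basic elements (`x`, `[...]`, `{...}`, repeat counts, and `<`/`>` anchors outside brackets) are handled.

```python
import re

# Character-level translation from PROSITE notation to regex notation.
_PROSITE_TO_REGEX = {
    "-": "",     # element separator
    "x": ".",    # any residue
    "{": "[^",   # start of "none of these residues"
    "}": "]",
    "(": "{",    # repeat count, e.g. x(2) or x(2,4)
    ")": "}",
    "<": "^",    # pattern anchored at the N-terminus
    ">": "$",    # pattern anchored at the C-terminus
}

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string to a regular expression string."""
    pattern = pattern.strip().rstrip(".")
    return "".join(_PROSITE_TO_REGEX.get(ch, ch) for ch in pattern)

def pattern_positions(pattern, protein):
    """Return 0-based start positions of non-overlapping pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [match.start() for match in regex.finditer(protein.upper())]

example_pattern = "[LIVM]-x(2)-{P}-G"   # made-up pattern
example_query = "AAALQRSGAA"            # made-up peptide

print(prosite_to_regex(example_pattern))
print(pattern_positions(example_pattern, example_query))
```

An empty list from `pattern_positions` would indicate that the query does not contain the pattern under these simplified rules.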

The protein version of "Search for short nearly exact matches" is optimized to find matches to a short peptide sequence.  

A short peptide (10-15mer or less) often will not find any significant matches to the database under the standard protein-protein BLAST settings. The usual reasons for this are that the significance threshold governed by the expect value parameter is set too stringently and the default word size parameter is set too high.

To use a short peptide sequence as a query, you could adjust both the word size and the expect value on the standard BLAST pages to make it work with short sequences. However, we provide a separate BLAST page with these values preset to optimize blastp searches with short query sequences. This page, "Search for short nearly exact matches", is available via a link under the Protein BLAST section of the BLAST home page. In addition to changing the Expect value cutoff and word size, the more stringent PAM30 scoring matrix replaces the BLOSUM62 matrix. This page also turns off the composition-based statistics feature in standard blastp, which takes the amino acid composition of the query sequence into account when calculating the score and significance of the alignments. NOTE: Composition-based statistics can have a large effect on searches using queries with a biased amino acid composition. By definition, short peptides have a biased composition and should not be used with composition-based statistics.

Due to the requirement that the query needs to be at least twice the word size, a query shorter than 5 residues is not recommended even though it can be as short as 4 residues when the word size is set to 2. In addition, since ambiguous residues break the query sequence, there should be no ambiguities in the query to ensure that the entire sequence can be used as seeds for initial search. You can also modify the settings on the "Protein query - Translated db [tblastn]" pages to find nucleotide matches for a short peptide. A summary of the settings for short peptide searches is given below:

Program                                Word Size  Filter    E Value  Composition based Statistics  Score Matrix
Standard protein BLAST                 3          On (SEG)  10       On                            BLOSUM62
Search for short/nearly exact matches  2          Off       20000    Off                           PAM30


The "Nucleotide query - Protein db [blastx]" is useful for finding similar proteins to those encoded by a nucleotide query.  

Translated BLAST services are useful when trying to find homologous proteins to a nucleotide coding region. Blastx compares the translation of the nucleotide query sequence to a protein database. Because blastx translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames, it is particularly useful when the reading frame of the query sequence is unknown or it contains errors that may lead to frame shifts or other coding errors. Thus blastx search is often the first analysis performed with a read from a newly derived sequence and is used extensively in analyzing EST sequences.
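To make the six-frame idea concrete, here is a small self-contained sketch using the standard genetic code. It is an illustration only: the function names are hypothetical, the input sequence is made up, codons containing ambiguous bases are reported as X, and the combined statistics that blastx computes are not modeled.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate codon by codon; incomplete trailing codons are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, usable, 3))

def six_frame_translations(dna):
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames

for frame, peptide in six_frame_translations("ATGGCCTAA").items():
    print(frame, peptide)
```

Each of the six peptides would be compared to the protein database, which is why a query with an unknown or shifted reading frame can still produce a hit.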

The "Protein query - Translated db [tblastn]" search is useful for finding protein homologs in unannotated nucleotide data.  

A tblastn search allows you to compare a protein sequence to the six-frame translations of a nucleotide database. It can be a very productive way of finding homologous protein coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (HTG), located in BLAST databases est and htgs, respectively.

ESTs are short, single-read cDNA sequences. These comprise the largest pool of sequence data for many organisms and contain portions of transcripts from many uncharacterized genes. Since ESTs have no annotated coding sequences, there are no corresponding protein translations in the BLAST protein databases. Hence a tblastn search is the only way to search for these potential coding regions at the protein level. The HTG sequences, draft sequences from various genome projects or large genomic clones, are another large source of unannotated coding regions.

Like all translating searches, the tblastn search is especially suited to working with error prone data like ESTs and draft genomic sequences from HTG because it combines BLAST statistics for hits to multiple reading frames and thus is robust to frame shifts introduced by sequencing error.

The "Nucleotide query - Translated db [tblastx]" is useful for identifying novel genes in error prone query sequence.  

tblastx takes a nucleotide query sequence, translates it in all six frames, and compares those translations to the database sequences dynamically translated in all six frames. This effectively performs a more sensitive blastp search without doing the manual translation.

tblastx gets around the potential frame shifts and ambiguities that may prevent certain open reading frames from being detected. This is very useful in identifying potential proteins encoded by single pass read ESTs. In addition, it would be a good tool for identifying novel genes.

NOTE: This type of search is computationally intensive and searches with large genomic queries are not recommended. The best way to do this is to install standalone blast and perform the search locally. For more information on standalone blast, please read the document for standalone BLAST and formatdb.

The Conserved Domain Database (CDD) search service uses RPS-BLAST to identify conserved protein domains.  

Reverse Position Specific BLAST (RPS-BLAST) is a more sensitive way of identifying conserved domains in proteins than standard BLAST searching. It compares a protein sequence against a database of position specific scoring matrices (PSSMs). The PSSMs used in CDD search capture the substitution frequencies at each position in the multiple sequence alignments of recognized conserved domains. These conserved domain alignments are from three protein domain databases: SMART, PFAM, and LOAD. For additional information, go to CDD help.

The Conserved Domain Architecture Retrieval Tool (CDART) explores the domain architectures of proteins.  

CDART allows you to examine the domain structure of all proteins in the default BLAST protein database. The CDART tool first searches a query sequence for the presence of conserved domains using RPS-BLAST. It then allows you to retrieve proteins that share one or more protein domains in common with your query. Because CDART relies on RPS-BLAST, these searches are more sensitive than ordinary BLAST searches.

NOTE: If the query does not contain any conserved domains, CDART will not report any result.

"BLAST 2 Sequences" is designed for direct comparison of two sequences.  

This program takes two input sequences and compares them directly. Unlike the other BLAST programs, there is no need to format the database sequence in any special way. Please note that "BLAST 2 Sequences" regards the second sequence as the database. If the database sequence or second query is present in NCBI databases, using GI/Accession instead of the FASTA sequence would allow the program to incorporate the translation and other sequence features, found in that record, into the final result to make it more informative.

Since translated BLAST programs are incorporated in this program, the second sequence can be of a different type as long as an appropriate BLAST program is selected. Appropriate Query/Program combinations are given in the table below.

First Sequence  Second Sequence  Program to Use
Nucleotide      Nucleotide       blastn or tblastx
Nucleotide      Protein          blastx
Protein         Nucleotide       tblastn
Protein         Protein          blastp
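The combinations in the table can be expressed as a small lookup. This is only a convenience sketch; the names `PROGRAMS` and `programs_for` are hypothetical and not part of any BLAST tool.

```python
# Keys are (first sequence type, second sequence type), as in the table above.
PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
    ("protein", "protein"): ["blastp"],
}

def programs_for(first_type, second_type):
    """Return the programs that fit a pair of sequence types."""
    key = (first_type.strip().lower(), second_type.strip().lower())
    if key not in PROGRAMS:
        raise ValueError(f"unknown sequence types: {key}")
    return PROGRAMS[key]

print(programs_for("Protein", "Nucleotide"))
```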


The Human Genome BLAST page is for comparing a query against NCBI's assembly of the human genome, its derivatives, and/or other related databases.  

Like other BLAST search pages in this section, this page provides a centralized page to access specialized databases. In this case, the databases are the current NCBI human genome build and those derived from or related to it.

All flavors of BLAST, except tblastx, are available with MEGABLAST set as default. Default filters are DUST and Human Repeat. The BLAST output links directly to the Human Genome MapViewer, where database hits can be analyzed in a genomic context, such as their relationship to other map elements like ESTs, SNPs, and other predicted genes. The complete list of databases available for searching is given below.

Human Genome Blast DataBases
Database           Content
genome (default)   human genomic contig sequences with NT_#### accessions
mrna               human RefSeq mrna with NM_#### or XM_#### accessions
protein            human RefSeq proteins with NP_#### or XP_#### accessions
gscan_mrna         predicted mRNA sequences generated by running GenomeScan program on human genomic contigs
gscan_protein      CDS translations from gscan mrna set
BAC end sequences  BAC ends from GSS (?)
HTGS               Human entries from GenBank htg division
ESTs               Human subset from GenBank est division
EST Traces         Human ests from Trace Archive
Other Traces       Other human entries found in Trace Archive


Use the Mouse Genome BLAST page to search preliminary assemblies as well as other mouse sequence databases. 

The organization of this page is similar to that of the Human Genome BLAST page. Note that the double translated BLAST program, tblastx, is not available on this page due to its high computational intensity. MEGABLAST is the default algorithm and both low complexity filtering (DUST) and rodent repeat filtering are on by default.

The default database "curated NT contigs" is analogous to the human genome database "genome". However, much less of the mouse genome has been assembled into contigs. The databases available for searching are given on this page.

The Microbial Genome BLAST page provides centralized access to complete and unfinished bacterial/archaeal genomes. 

This page provides access to many complete and some unfinished bacterial/archaeal genomes. For a complete list of genomes in this page, please follow this link.

The primary dataset is the DNA (the genomes), with Protein as the derivative dataset. Due to the lack of annotation, the protein dataset may not be available (selectable but with empty database) for unfinished genomes. One can choose to search against all the genomes or selected subsets of them, and all flavors of BLAST programs are available.

NOTE: BLAST hits to an unfinished genome do not contain links to GenBank entries since they are not deposited to GenBank.

The Other eukaryotes BLAST page provides access to genomic sequences of other eukaryotic organisms.  

In addition to the human, mouse, and microbial genomes mentioned above, genomic sequences for many other organisms are also available. The prominent and high impact genomes are listed separately, and others not listed separately are grouped under this link. The exact sequences available vary depending on the stage of the sequencing projects.

For a list of the organisms represented in the blast database, please check this page.

Use the Rat Genome BLAST page to search preliminary assemblies as well as other rat sequence databases.  

This page provides access to blast databases specific for rat. Compared with human and mouse, only limited genomic sequences are available and there are no assembled contigs. The contents are explained below.

Content of Rat Genome Blast Databases
Database            Content
HTGS                Rat phase 0, phase 1, phase 2 or phase 3 sequence. These are the original BAC sequences as submitted by the sequencing centers.
Traces              All of the raw rat WGS and BAC traces
BAC ends            The end sequences of BACs from CHORI-230. Sequenced at TIGR.
Reference mRNAs     Collection of reference mRNAs generated by the NCBI RefSeq project.
Reference Proteins  Collection of reference proteins generated by the NCBI RefSeq project.
ESTs                Single pass sequence reads from numerous rat cDNA libraries


Use the Fugu genome BLAST page to search against the draft Fugu rubripes (Puffer fish) genome.   

This page provides access to the draft genome and the protein translation of Fugu rubripes (Japanese Puffer fish). This genome assembly is provided by the DOE's Joint Genome Institute. For details on the databases and their release policy, please go to JGI's Fugu site. Similar BLAST searches against this genome assembly can also be done there.

Use the Zebrafish Genome BLAST page to search against Zebrafish specific sequences.   

Currently there are no finished genomic contigs for this organism. The content of the available databases is explained below.

Content of Zebrafish Genome Blast Databases
Database            Content
mRNAs               Zebrafish mRNAs in GenBank.
ESTs                Single pass sequence reads from numerous Zebrafish cDNA libraries.
HTGS                Zebrafish phase 0, phase 1, phase 2 or phase 3 sequence. These are the original BAC sequences as submitted by the sequencing centers.
Traces              All of the raw Zebrafish WGS and BAC and EST Traces.
Reference mRNAs     Collection of reference mRNAs generated by the NCBI RefSeq project.
Reference Proteins  Collection of reference proteins generated by the NCBI RefSeq project.

Use the Arabidopsis thaliana genome BLAST page to search against the Arabidopsis genome.  

This page provides access to the sequenced chromosome clones of Arabidopsis thaliana, mRNA sequences predicted from them, and the translations of those mRNAs. Links to the genome mapviewer are also provided for the identified hits. Direct searches with text terms can be done in the Arabidopsis thaliana genome mapviewer page.

Use the Oryza sativa genome BLAST page to search against the rice genome.  

This page provides access to the super contig assemblies of rice. The data available are from a publicly funded Chinese rice genome project and the sequence is from the Oryza sativa L. ssp. indica strain. For more details, please refer to the Rice Genome MapViewer page.

Use the Anopheles gambiae genome BLAST page to search against the mosquito genome.  

This page provides access to the genome scaffold of Anopheles gambiae. The data available are from a NIAID publicly funded project. The sequencing and assembly were done by Celera. For more details, please refer to the Anopheles gambiae Genome MapViewer page.

The VecScreen page is for identifying vector sequence contamination in a query sequence. 

VecScreen is a rapid screening tool that checks the query sequence against a non-redundant vector database, UniVec, which contains one copy of every unique sequence segment from a large number of vectors. In addition, UniVec contains sequences for adapters, linkers and primers that are commonly used in the cloning of cDNA or genomic DNA. Detailed information on UniVec is at: http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/.

This page is generally used to screen a sequence for vector contamination before it is submitted to a public sequence database.

Use the Trace Archive BLAST page to search raw, unassembled and unannotated primary sequence trace files. 

Trace data files are a rich source of information, especially for organisms lacking a significant amount of assembled genomic sequence. The sequences come from a variety of projects and sequencing strategies, including Whole Genome Shotgun (WGS), BAC end sequencing, and EST sequencing. The trace data are single pass sequencing reads not trimmed for quality or vector sequences. Their average lengths are between 500 and 700 bp.

A search from the Trace Archive BLAST page uses MEGABLAST exclusively and offers the same user-selected options as on the MEGABLAST web page. Information on the Trace data is available from this page.

Web MEGABLAST can accept batch queries.  

MEGABLAST is the only BLAST web service that can accept multiple queries. There are two ways to enter batch queries in MEGABLAST. If the query sequences are not present in the NCBI Entrez system, those sequences need to be pasted in the search box in FASTA format, one after another with no blank lines in between sequences. The FASTA definition line (or title) of each sequence should be on a single line all by itself. Alternatively, if those sequences are already in a text file in proper format, the file can be uploaded using the "Browse" button. An example query file with multiple sequences is given below.

>Sequence_1
AGACAGATCACTTCAGTCGCCACAATTAGCCATGGATAAGATACACCATTGCCATC
>Sequence_2
AGACAACTTCAGTCGCCGATCACTCGCCACAATTTCAGTCGCCATAAGGCAATTAT
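A batch query in this layout can be assembled programmatically. The sketch below is one way to do it, with a hypothetical helper name `format_batch_query`; it enforces the two rules stated above (each definition line on its own single line, no blank lines between sequences).

```python
def format_batch_query(records):
    """Build a multi-sequence FASTA string with no blank lines between records.

    records is an iterable of (title, sequence) pairs.
    """
    lines = []
    for title, sequence in records:
        title = " ".join(title.split())        # force the title onto one line
        sequence = "".join(sequence.split())   # drop spaces and line breaks
        if not title or not sequence:
            raise ValueError("each record needs a title and a sequence")
        lines.append(">" + title)
        lines.append(sequence)
    return "\n".join(lines) + "\n"

batch = format_batch_query([
    ("Sequence_1", "AGACAGATCACTTCAGTCGCCACAATTAGCCATGGATAAGATACACCATTGCCATC"),
    ("Sequence_2", "AGACAACTTCAGTCGCCGATCACTCGCCACAATTTCAGTCGCCATAAGGCAATTAT"),
])
print(batch, end="")
```

The resulting text can be pasted into the search box or saved to a file for upload.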

If the query sequences are already present in Entrez, their GI or Accession numbers can be pasted in the search box, one identifier per line.

U12345
F12564
BH023812

A text file containing those numbers in this format can be uploaded through the "Browse" button rather than copy/paste.


Degenerate bases and ambiguity codes are treated as mismatches by BLAST.  

Uncertainties in a nucleotide sequence can be represented by a standard set of single-letter ambiguity codes given in the table below.

Code  Meaning (Base)       Code  Meaning (Base)
A     adenosine (A)        M     amino (A or C)
C     cytidine (C)         S     strong (G or C)
G     guanine (G)          W     weak (A or T)
T     thymidine (T)        B     not A (G or T or C)
U     uridine (U)          D     not C (G or A or T)
R     purine (G or A)      H     not G (A or C or T)
Y     pyrimidine (T or C)  V     not T (G or C or A)
K     keto (G or T)        N     any base (A or G or C or T)
-     gap(s) (none)

These are often used to represent degenerate bases in the third position of codons in degenerate oligonucleotide primers, or in a less conserved region of a sequence motif. Although this alphabet is accepted by BLAST, the BLAST program treats such ambiguities as mismatches in alignment. In short queries, such as primer sequences, these ambiguous bases may prevent BLAST from finding any matches in the database that are as large as the word size. Another side effect of too many ambiguities is that blastn may interpret your query sequence as protein and give an error message. NOTE: dashes (-) in the query are not accepted. Web blast programs will strip them before submitting the search. If gaps are desired, use N's instead of dashes.
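A quick pre-check can show whether ambiguity codes leave any stretch long enough to seed a search. The sketch below is a simplification under stated assumptions: the function names are hypothetical, the primer is made up, and "can seed" here only means that the query contains an unambiguous run at least as long as the word size.

```python
# Ambiguity codes from the table above, mapped to the bases they stand for.
IUPAC_AMBIGUITY = {
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}

def ambiguous_positions(query):
    """Return (0-based position, code) pairs for every ambiguity code."""
    return [(i, base) for i, base in enumerate(query.upper())
            if base in IUPAC_AMBIGUITY]

def longest_unambiguous_run(query):
    """Length of the longest stretch made only of A, C, G, T or U."""
    longest = current = 0
    for base in query.upper():
        current = current + 1 if base in "ACGTU" else 0
        longest = max(longest, current)
    return longest

def can_seed(query, word_size):
    """True if some unambiguous stretch is at least word_size long."""
    return longest_unambiguous_run(query) >= word_size

primer = "ACGTRACGTYACG"   # made-up degenerate primer
print(ambiguous_positions(primer))
print(longest_unambiguous_run(primer), can_seed(primer, word_size=7))
```

For this made-up primer the longest unambiguous run is shorter than a word size of 7, so no exact word match could start an alignment.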

For those programs that use amino acid query sequences (BLASTP and TBLASTN), the IUPAC based amino acid codes are given in the table below.

Code  Residue                  Code  Residue
A     alanine                  P     proline
B     aspartate or asparagine  Q     glutamine
C     cysteine                 R     arginine
D     aspartate                S     serine
E     glutamate                T     threonine
F     phenylalanine            U     selenocysteine
G     glycine                  V     valine
H     histidine                W     tryptophan
I     isoleucine               Y     tyrosine
K     lysine                   Z     glutamate or glutamine
L     leucine                  X     any residue
M     methionine               *     translation stop
N     asparagine               -     gap of indeterminate length

Blastp treats the non-standard codes as mismatches in alignment. Web BLAST programs regard dashes (-) as illegal characters and will remove them before starting the search. Any U present in the query is replaced with X before the query is submitted. NOTE: If the presence of gaps is desired, use a string of X's instead of dashes.

Peptide sequence database content
 

The content of the peptide sequence databases available for BLAST searches is described below.

Database           Content
nr                 All non-redundant GenBank CDS translations +PDB+SwissProt+PIR+PRF.
swissprot          Last major release of the SWISS-PROT protein sequence database (no incremental updates).
pat                Proteins from the Patent division of GenBank.
Yeast              Saccharomyces cerevisiae genomic CDS translations
ecoli              Escherichia coli genomic CDS translations
pdb                Sequences derived from the 3-dimensional structures from the Brookhaven Protein Data Bank
Drosophila genome  Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP).
month              All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released in the last 30 days


Nucleotide sequence database content 

The content of the nucleotide sequence databases available for BLAST searches is described below.

Nucleotide Sequence Databases 
Database           Content
nr                 All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
est                Database of GenBank+EMBL+DDBJ sequences from EST division.
est_human          Human subset of GenBank+EMBL+DDBJ sequences from EST division.
est_mouse          Mouse subset of GenBank+EMBL+DDBJ sequences from EST division.
est_others         Non-Mouse, non-Human sequences of GenBank+EMBL+DDBJ sequences from EST Division.
gss                Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
htgs               Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2. Finished, phase 3 HTG sequences are in nr.
pat                Nucleotides from the Patent division of GenBank
yeast              Saccharomyces cerevisiae genomic nucleotide sequences
mito               Database of mitochondrial sequences
vector             Vector subset of GenBank(R), NCBI, in ftp://ftp.ncbi.nlm.nih.gov/blast/db/
ecoli              Escherichia coli genomic nucleotide sequences
pdb                Sequences derived from the 3-dimensional structures from the Brookhaven Protein Data Bank.
Drosophila genome  Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP)
month              All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
alu                Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by FTP from ftp://ftp.ncbi.nlm.nih.gov/blast/db/alu.n.Z. See "Alu alert" by Claverie and Makalowski, Nature 371: 752 (1994).
dbsts              Database of GenBank+EMBL+DDBJ sequences from STS division.
chromosome         Searches Complete Genomes, Complete Chromosome, or contigs from the NCBI Reference Sequence project.
wgs_anopheles      Anopheles gambiae (mosquito) whole genome shotgun sequences
