A region of the query sequence can be selected for BLAST searching.
Enter the range, in nucleotides or protein residues, in the "From"
and "To" boxes provided under "Set Subsequence". For example, to limit matches
to the region from nucleotide 24 to nucleotide 200 of a query sequence,
enter From=24 To=200. If one of the limits you enter is out of range,
the intersection of the [From,To] and [1,length] intervals will be searched,
where length is the length of the whole query sequence.
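The interval rule above can be sketched in a few lines of Python. This is only an illustration of the intersection described in the text; the function names `effective_range` and `subsequence` are hypothetical and are not part of any BLAST interface.

```python
def effective_range(start, stop, length):
    """Intersect the 1-based, inclusive interval [start, stop] with [1, length].

    Returns (lo, hi), or None when the two intervals do not overlap.
    """
    lo = max(start, 1)
    hi = min(stop, length)
    if lo > hi:
        return None
    return lo, hi


def subsequence(seq, start, stop):
    """Return the part of seq that would be searched, or '' if nothing remains."""
    rng = effective_range(start, stop, len(seq))
    if rng is None:
        return ""
    lo, hi = rng
    # Convert the 1-based inclusive range to a 0-based Python slice.
    return seq[lo - 1:hi]
```

For a 150-residue query, From=24 To=200 would be reduced to the range 24 to 150 under this rule.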
Databases available for BLAST search
The BLAST pages offer several different databases for searching.
Some of these, like SwissProt, PDB and Kabat, are compiled outside of NCBI.
Others, like ecoli, dbEST and month, are subsets of the NCBI databases. Other
"virtual databases" can be created using the Limit
by Entrez Query option.
Peptide Sequence Databases
nr
All non-redundant GenBank CDS translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF
swissprot
Last major release of the SWISS-PROT protein sequence database (no
updates)
pat
Proteins from the Patent division of GenPept.
Yeast
yeast (Saccharomyces cerevisiae) genomic CDS translations
ecoli
Escherichia coli genomic CDS translations
pdb
Sequences derived from the 3-dimensional structures in the Brookhaven
Protein Data Bank
Drosophila genome
Drosophila genome proteins provided by Celera and Berkeley Drosophila
Genome Project (BDGP).
month
All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released
in the last 30 days.
Nucleotide Sequence Databases
nr
All GenBank+RefSeq Nucleotides+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase
0, 1 or 2 HTGS sequences). No longer "non-redundant".
est
Database of GenBank+EMBL+DDBJ sequences from EST
Divisions
est_human
Human subset of GenBank+EMBL+DDBJ sequences from EST
Divisions
est_mouse
Mouse subset of GenBank+EMBL+DDBJ sequences from EST
Divisions
est_others
Non-Mouse, non-Human sequences of GenBank+EMBL+DDBJ sequences from EST
Divisions
gss
Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences,
and Alu PCR sequences.
htgs
Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences
are in nr)
pat
Nucleotides from the Patent division of GenBank.
yeast
Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
mito
Database of mitochondrial sequences
vector
Vector subset of GenBank(R), NCBI, in ftp://ftp.ncbi.nih.gov/blast/db/
E. coli
Escherichia coli genomic nucleotide sequences
pdb
Sequences derived from the 3-dimensional structures in the Brookhaven
Protein Data Bank
Drosophila genome
Drosophila genome provided by Celera and Berkeley Drosophila
Genome Project (BDGP).
month
All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the
last 30 days.
alu
Select Alu repeats from REPBASE, suitable for masking Alu repeats from
query sequences. It is available by anonymous FTP from ftp.ncbi.nih.gov
(under the /pub/jmc/alu directory). See "Alu alert" by Claverie and
Makalowski, Nature vol. 371, page 752 (1994).
dbsts
Database of GenBank+EMBL+DDBJ sequences from STS Divisions.
chromosome
Searches Complete Genomes, Complete Chromosomes, or contigs from the NCBI Reference Sequence project.
Human Genome BLAST Databases
genome
human genomic contig sequences with NT_#### accessions.
mrna
human RefSeq mrna with NM_#### or XM_#### accessions
protein
human RefSeq proteins with NP_#### or XP_#### accessions
gscan mrna
predicted mRNA sequences generated by running GenomeScan program on human genomic contigs
gscan protein
CDS translations from gscan mrna set.
CDD Search
Compares protein sequences to the Conserved Domain Database. The
CDD is a database containing a collection of functional and/or structural
domains derived from two popular collections, Smart
and Pfam, plus contributions from colleagues
at NCBI. For more information please see the CDD
homepage.
BLAST Search main parameters
Limit by Entrez Query
BLAST searches can be limited to the results of an Entrez query against
the database chosen. This can be used to limit searches to subsets of the
BLAST databases. Any terms can be entered that would normally be
allowed in an Entrez search session. For example:
protease NOT hiv1[Organism]
This will limit a BLAST search to all proteases, except those in HIV 1.
This can also be used to limit searches to a particular molecule type:
biomol_mrna[PROP] AND brain
To limit the search to a specific organism, you can either select it from the pull-down
menu, which lists the most common organisms in the databases, or
enter the name of the organism in the Entrez Query field with the [Organism]
qualifier. For example:
Mus musculus[Organism]
For help in constructing Entrez queries, please see the "Writing
Advanced Search Statements" section of the Entrez Help document.
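As a small illustration, the example queries in this section can be assembled programmatically. The sketch below only builds the query strings shown above; the helper names `field` and `limit_query` are hypothetical, and no request is sent to Entrez or BLAST.

```python
def field(term, qualifier):
    """Attach an Entrez field qualifier, e.g. field('hiv1', 'Organism')."""
    return f"{term}[{qualifier}]"


def limit_query(include, exclude=()):
    """AND the included terms together, then NOT each excluded term."""
    if not include:
        raise ValueError("at least one term to include is required")
    query = " AND ".join(include)
    for term in exclude:
        query += f" NOT {term}"
    return query


# The three examples from the text:
proteases = limit_query(["protease"], [field("hiv1", "Organism")])
brain_mrna = limit_query([field("biomol_mrna", "PROP"), "brain"])
mouse = limit_query([field("Mus musculus", "Organism")])
```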
Filter (Low-complexity)
Mask off segments of the query sequence that have low compositional
complexity, as determined by the SEG
program of Wootton & Federhen (Computers and Chemistry, 1993) or, for
BLASTN, by the DUST
program of Tatusov and Lipman (in preparation). Filtering can eliminate
statistically significant but biologically uninteresting reports from the
blast output (e.g., hits against common acidic-, basic- or proline-rich
regions), leaving the more biologically interesting regions of the query
sequence available for specific matching against database sequences.
Filtering is only applied to the query sequence (or its translation products),
not to database sequences. Default filtering is DUST for BLASTN, SEG for
other programs.
It is not unusual for nothing at all to be masked by SEG, when applied
to sequences in SWISS-PROT, so filtering should not be expected to
always yield an effect. Furthermore, in some cases, sequences are masked
in their entirety, indicating that the statistical significance of any
matches reported against the unfiltered query sequence should be suspect.
Filter (Human repeats)
This option masks human repeats (LINEs and SINEs) and is especially
useful for human sequences that may contain these repeats. Filtering for
repeats can increase the speed of a search, especially with very long sequences
(>100 kb) and against databases that contain a large number of repeats (htgs).
For more information please see "Why
does my search timeout on the BLAST servers?" in the BLAST Frequently
Asked Questions. Human Repeat Filtering is still experimental and under
development, so it may change in the near future.
Filter (Mask for lookup table only)
This option masks only for purposes of constructing the lookup table used
by BLAST. BLAST searches consist of two phases, finding hits based upon
a lookup table and then extending them. The option to "Mask for lookup
table only" masks only for the lookup table so that no hits are found based
upon low-complexity sequence. The BLAST extensions are performed without
masking and so they can be extended through low-complexity sequence. This
option is still experimental and may change in the near future.
Mask Lower Case
With this option selected, you can cut and paste a FASTA sequence in upper case
characters and denote the areas you would like filtered with lower case. This allows
you to customize what is filtered from the sequence during the comparison to the
BLAST databases.
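To make the convention concrete, the following sketch finds the lower-case regions of a pasted sequence and shows one way they could be masked. It is an illustration only: the function names are hypothetical, and replacing masked letters with "N" (or "X" for proteins) is an assumption about how a mask might be represented, not a description of what the BLAST server does internally.

```python
def lower_case_intervals(seq):
    """Return 1-based, inclusive (start, stop) intervals of lower-case letters."""
    intervals = []
    start = None
    for pos, ch in enumerate(seq, start=1):
        if ch.islower():
            if start is None:
                start = pos
        elif start is not None:
            intervals.append((start, pos - 1))
            start = None
    if start is not None:
        intervals.append((start, len(seq)))
    return intervals


def mask_lower_case(seq, mask_char="N"):
    """Replace every lower-case letter with mask_char (assumed representation)."""
    return "".join(mask_char if ch.islower() else ch for ch in seq)
```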
Expect
The statistical significance threshold for reporting matches against database
sequences; the default value is 10, meaning that 10 matches are expected
to be found merely by chance, according to the stochastic model of Karlin
and Altschul (1990). If the expect value ascribed to a match
is greater than the EXPECT threshold, the match will not be reported. Lower
EXPECT thresholds are more stringent, leading to fewer chance matches being
reported. Raising the threshold allows less significant matches to be reported. Fractional
values are acceptable.
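The thresholding can be sketched as follows. The conversion from a normalized bit score S' to an expect value uses the commonly quoted relation E = m * n * 2^(-S'), where m is the query length and n the database length. Using raw rather than effective lengths is a simplifying assumption here, and the function names and the (name, e_value) hit tuples are hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value for a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def reportable(hits, threshold=10.0):
    """Keep only (name, e_value) hits whose E-value does not exceed the threshold."""
    return [hit for hit in hits if hit[1] <= threshold]


hits = [("hit_a", 0.001), ("hit_b", 10.0), ("hit_c", 25.0)]
kept_default = reportable(hits)           # default threshold of 10
kept_strict = reportable(hits, 0.01)      # a more stringent, fractional threshold
```

With the default threshold, `hit_c` is dropped; with a threshold of 0.01 only `hit_a` remains.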
Inclusion Threshold
The statistical significance threshold for including a sequence in the
model used by PSI-BLAST to create the PSSM on the next iteration.
Query Genetic Code
Genetic code to be used in blastx translation of the query. (See List
of Genetic Codes)
Number of hits
The search can be sped up by specifying the maximum number of hits
to be computed.
AutoFormat
If AutoFormat is disabled (unchecked), a "Status = Ready" message and a change
of background color to blue indicate that the search is complete, but the
results are not formatted automatically. Formatting can then be performed by
pressing the 'Format' button on a previous page.
When the AutoFormat option is enabled (checked), clicking the Format
button will show the status and time stamps and then automatically
format the BLAST results when they are ready.
Send Results by E-mail
If you enter an e-mail address in the "Send Results by E-mail" field, the BLAST
server will send a copy of your BLAST results to that address. The
default format of these results is HTML; however, you can have plain text results
sent by e-mail by changing the "Format" pull-down menu from "HTML" to "Plain text" on the main
BLAST search page. The BLAST graphic is not available through the e-mail service.
Graphical Overview
An overview of the database sequences aligned to the query sequence is
shown. The score of each alignment is indicated by one of five different
colors, which divides the range of scores into five groups. Multiple alignments
on the same database sequence are connected by a striped line. Mousing
over a hit sequence causes the definition and score to be shown in the
window at the top; clicking on a hit sequence takes the user to the associated
alignments.
NCBI-gi
Causes NCBI gi identifiers to be shown in the output, in addition to the
accession and/or locus name.
Descriptions
Restricts the number of short descriptions of matching sequences reported
to the number specified; default limit is 100 descriptions. See also EXPECT.
Alignments
Restricts the number of database sequences for which high-scoring
segment pairs (HSPs) are reported to the number specified; the default limit is 100. If more database
sequences than this happen to satisfy the statistical significance threshold
for reporting (see EXPECT), only the matches ascribed the greatest
statistical significance are reported.
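A minimal sketch of this truncation, assuming each hit is a hypothetical (name, e_value) tuple and that a smaller E-value means greater statistical significance:

```python
def most_significant(hits, limit=100):
    """Return at most `limit` hits, most significant (lowest E-value) first."""
    return sorted(hits, key=lambda hit: hit[1])[:limit]


top_two = most_significant(
    [("hit_a", 2.0), ("hit_b", 1e-30), ("hit_c", 0.5), ("hit_d", 1e-5)],
    limit=2,
)
```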
Database LinkOuts
Enabling this option provides cross-reference links from the BLAST results to other specialized NCBI databases. If a database sequence matches your query and is also found in LocusLink or UniGene (more databases are to be included in the future), there will be links from the BLAST
search results to these resources.
Alignment Views
pairwise
Standard BLAST alignment in pairs of query sequence and database match.
Query-anchored with identities
The database alignments are anchored to (shown in relation to) the query
sequence. Identities are displayed as dashes, with mismatches displayed
as single letter nucleotide abbreviations.
Query-anchored without identities
Identities are shown as single letter nucleotide abbreviations.
Flat Query-anchored with identities
The 'flat' display shows inserts as deletions on the query.
Identities are displayed as dashes, with mismatches displayed as single
letter nucleotide abbreviations.
Flat Query-anchored without identities
The 'flat' display shows inserts as deletions on the query. Identities
are shown as single letter nucleotide abbreviations.
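The difference between the "with identities" and "without identities" views can be illustrated with a toy renderer. It follows the description above (identities as dashes, mismatches as letters) for a single ungapped row; the function names are hypothetical, and real BLAST output also has to handle gaps, coordinates and line wrapping, which are omitted here.

```python
def anchored_with_identities(query, subject):
    """Render one subject row: '-' where it matches the query, else its own letter."""
    if len(query) != len(subject):
        raise ValueError("query and subject rows must have the same length")
    return "".join("-" if q == s else s for q, s in zip(query, subject))


def anchored_without_identities(query, subject):
    """Render one subject row with every letter spelled out."""
    if len(query) != len(subject):
        raise ValueError("query and subject rows must have the same length")
    return subject


query_row = "ACGTAC"
subject_row = "ACCTAG"
with_ids = anchored_with_identities(query_row, subject_row)
without_ids = anchored_without_identities(query_row, subject_row)
```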
Get ASN.1 for SeqAnnot
SeqAnnot format for importation into NCBI Toolkit programs.
Get ASN.1 for the BLAST Object
Object format for NCBI toolkit programs.
The translations include:
blastx
compares a nucleotide query sequence translated in all reading frames against
a protein sequence database
tblastn
compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames.
tblastx
compares the six-frame translations of a nucleotide query sequence against
the six-frame translations of a nucleotide sequence database. Please note
that tblastx program cannot be used with the nr database on the BLAST Web
page.
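For readers unfamiliar with six-frame translation, the following sketch translates a nucleotide sequence in all three forward and all three reverse-complement frames using the standard genetic code. It is an illustration only: the function names are hypothetical, codons containing characters other than A, C, G and T are rendered as "X" by assumption, and other genetic codes (see Query Genetic Code) are not handled.

```python
from itertools import product

# Standard genetic code, laid out in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame number (+1, +2, +3, -1, -2, -3) to its translation."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Here frame +1 of `ATGGCCTAA` reads `MA*`, where `*` marks a stop codon.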
Matrix
A key element in evaluating the quality of a pairwise sequence alignment
is the "substitution matrix", which assigns a score for aligning any possible
pair of residues. The matrix used in a BLAST search can be changed
depending on the type of sequences you are searching with (see the BLAST
Frequently Asked Questions).
More information on BLAST
substitution matrices
Gap Cost and Lambda Ratio
The pull-down menu shows the Gap Costs (penalty to open a gap and penalty
to extend a gap) and the Lambda ratio settings for the matrix chosen. Only
a limited number of combinations is available for these parameters. Increasing
the Gap Costs and Lambda ratio will result in alignments with fewer
gaps.
PSSM
PSI-BLAST can save the Position Specific Score Matrix to be used in other
protein searches. The PSSM can be stored in a text file and cut and pasted
into the PSSM field.
To save a PSSM file:
- Run a protein BLAST search.
- Check the PSI-BLAST box on the formatting page.
- Click the "Format" button.
- On the PSI-BLAST results page, click the "Run PSI-BLAST Iteration 2" button.
- Now, on the Format page, select "PSSM" from the "Show" pull-down menu.
- Click "Format".
- This will display text output with the ASCII-encoded PSSM. The "Save
as..." option of the browser can be used to save this to a plain text file
on your hard drive.
From the protein BLAST page, choose any database, and paste the contents
of the PSSM text file into the "PSSM" field. If the database is the same
as when the PSSM was stored, you'll reproduce the iteration on which you
saved the PSSM; a different database will yield a different hit list.
Composition-based statistics
BLAST and PSI-BLAST now permit calculated E-values to take into account
the amino acid composition of the individual database sequences involved
in reported alignments. This improves E-value accuracy, thereby reducing
the number of false positive results.
The improved statistics are achieved with a scaling procedure [1,2]
which in effect employs a slightly different scoring system for each database
sequence. As a result, raw BLAST alignment scores in general will not correspond
precisely to those implied by any standard substitution matrix. Furthermore,
identical alignments can receive different scores, based upon the compositions
of the sequences they involve. The improved statistics are now used by
default for all rounds of searching on the PSI-BLAST page, but not on the
BLAST page. Therefore, if one uses default settings, the results of the
first round of searching will be different on the BLAST and PSI-BLAST pages.
In addition, adjustments have been made to two PSI-BLAST parameters: the
pseudocount constant default has been changed from 10 to 7, and the E-value
threshold for including matches in the PSI-BLAST model has been changed
from 0.001 to 0.002.
[1] Altschul, S.F. et al. (1997) Nucl. Acids Res. 25:3389-3402.
[2] Schäffer, A.A. et al. (1999) Bioinformatics 15:1000-1011.
NCBI BLAST Advanced Options
Program Advanced Options
-G Cost to open gap [Integer]
default = 5 for nucleotides, 11 for proteins
-E Cost to extend gap [Integer]
default = 2 for nucleotides, 1 for proteins
-q Penalty for nucleotide mismatch [Integer]
default = -3
-r Reward for nucleotide match [Integer]
default = 1
-e Expect value [Real]
default = 10
-W Word size [Integer]
default = 11 for nucleotides, 3 for proteins
-y Dropoff (X) for BLAST extensions in bits (default if zero)
default = 20 for blastn, 7 for other programs
-X X dropoff value for gapped alignment (in bits)
default = 15 for all programs except blastn, to which it does not apply
-Z Final X dropoff value for gapped alignment (in bits)
default = 50 for blastn, 25 for other programs
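To show how the match reward, mismatch penalty and gap costs combine, here is a sketch that scores an already-aligned pair of nucleotide strings using the nucleotide defaults listed above. It assumes that a gap of length k costs the opening cost plus k times the extension cost. The function name `score_alignment` and the use of "-" as the gap character are illustrative choices, and no search or extension logic is included.

```python
def score_alignment(query, subject, reward=1, penalty=-3, gap_open=5, gap_extend=2):
    """Raw score of two equal-length aligned strings; '-' marks a gap position.

    Assumes a gap of length k costs gap_open + k * gap_extend.
    """
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    score = 0
    current_gap = None  # which string the open gap is in: 'q', 's', or None
    for q, s in zip(query, subject):
        if q == "-" and s == "-":
            raise ValueError("a column cannot be a gap in both strings")
        if q == "-" or s == "-":
            side = "q" if q == "-" else "s"
            if side != current_gap:
                score -= gap_open  # a new gap starts here
                current_gap = side
            score -= gap_extend
        else:
            score += reward if q == s else penalty
            current_gap = None
    return score


ungapped = score_alignment("ACGTACGT", "ACGAACGT")  # 7 matches, 1 mismatch
gapped = score_alignment("ACGTACGT", "ACG--CGT")    # 6 matches, one gap of length 2
```

Under these assumptions the single mismatch costs 3, while the two-base gap costs 5 + 2 * 2 = 9.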
Limited values for gap existence and extension are supported for these
three programs. Some supported and suggested values are:
Existence  Extension
10         1
10         2
11         1
8          2
9          2
PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines
matching of regular expressions
with local alignments surrounding the match. Given a protein sequence
S and a regular expression pattern P
occurring in S, PHI-BLAST helps answer the question: What other protein
sequences both contain an occurrence of P
and are homologous to S in the vicinity of the pattern occurrences?
PHI-BLAST may be preferable to just searching for pattern occurrences because
it filters out those cases where the pattern occurrence is probably random
and not indicative of homology. Please see the Rules
for Pattern Syntax.
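The first half of what PHI-BLAST does, locating occurrences of a pattern, can be sketched with Python's `re` module. This is not PHI-BLAST itself: the pattern below is written directly as a Python regular expression rather than in PHI-BLAST's own pattern syntax, the sequences are invented toy data, and the local alignment around each occurrence is not performed.

```python
import re


def pattern_occurrences(pattern, sequence):
    """Return 1-based, inclusive (start, stop) spans of non-overlapping matches."""
    return [(m.start() + 1, m.end()) for m in re.finditer(pattern, sequence)]


def sequences_with_pattern(pattern, sequences):
    """Map each sequence name to its occurrences, skipping sequences without any."""
    found = {}
    for name, seq in sequences.items():
        spans = pattern_occurrences(pattern, seq)
        if spans:
            found[name] = spans
    return found


# Toy data: a cysteine, any two residues, then another cysteine.
toy_pattern = r"C..C"
toy_sequences = {"seq1": "AACAACAA", "seq2": "AAAAAAAA", "seq3": "CGGCAACTTC"}
matches = sequences_with_pattern(toy_pattern, toy_sequences)
```

A sequence that merely contains the pattern, such as `seq1` here, would still need to show similarity to the query around the match before PHI-BLAST would report it.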
The Position-Specific Iterated BLAST, or PSI-BLAST, program performs
an iterative search in which sequences found in one round of searching
are used to build a score model for the next round of searching. In PSI-BLAST
the algorithm is not tied to a specific score matrix. Traditionally, the
algorithm has been implemented using an AxA substitution matrix, where A is the alphabet
size. PSI-BLAST instead uses a QxA matrix, where Q is the length of the
query sequence; at each position the cost of a letter depends on the position
with respect to the query and the letter in the subject sequence.
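The contrast between an AxA substitution matrix and a QxA position-specific matrix can be sketched as follows. The scores are invented toy values over a two-letter alphabet, the function names are hypothetical, and only ungapped scoring of an equal-length segment is shown; how PSI-BLAST derives the matrix from earlier rounds is not covered.

```python
def score_with_substitution_matrix(matrix, query, segment):
    """AxA scoring: each score depends on the (query letter, subject letter) pair."""
    if len(query) != len(segment):
        raise ValueError("query and segment must have the same length")
    return sum(matrix[(q, s)] for q, s in zip(query, segment))


def score_with_pssm(pssm, segment):
    """QxA scoring: each score depends on the query position and the subject letter."""
    if len(pssm) != len(segment):
        raise ValueError("segment must have one letter per PSSM row")
    return sum(row[letter] for row, letter in zip(pssm, segment))


# Toy AxA matrix over the alphabet {A, C}: the same score wherever a pair occurs.
toy_matrix = {("A", "A"): 2, ("A", "C"): -1, ("C", "A"): -1, ("C", "C"): 3}

# Toy QxA matrix for a query of length 2: one row of letter scores per position.
toy_pssm = [
    {"A": 2, "C": -1},  # position 1 favours A
    {"A": -1, "C": 3},  # position 2 favours C
]

axa_score = score_with_substitution_matrix(toy_matrix, "AC", "AC")
pssm_score = score_with_pssm(toy_pssm, "AC")
swapped_score = score_with_pssm(toy_pssm, "CA")
```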
Revised January 21, 2000