NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

BLAST® Command Line Applications User Manual [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2008-.

Appendices

Last Update: May 18, 2016.

Conversion from C toolkit applications

The functionality offered by the BLAST+ applications has been organized by program type. The following graph depicts a correspondence between the NCBI C Toolkit BLAST command line applications and the BLAST+ applications:

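[Figure appendices-Image001.jpg: correspondence between the NCBI C Toolkit BLAST command line applications and the BLAST+ applications]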

The easiest way to get started with the BLAST+ command line applications is the legacy_blast.pl Perl script, which is bundled with the BLAST+ applications. To use it, prefix the script name to the invocation of the C Toolkit BLAST command line application and append the --path option pointing to the installation directory of the BLAST+ applications. For example, instead of using

    blastall -i query -d nr -o blast.out 

use

    legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin

The purpose of the legacy_blast.pl Perl script is to help users make the transition from the C Toolkit BLAST command line applications to the BLAST+ applications. Invoking the script without any arguments prints its documentation.

The legacy_blast.pl script supports two modes of operation, one in which the C Toolkit BLAST command line invocation is converted and executed on behalf of the user and another which solely displays the BLAST+ application equivalent to what was provided, without executing the command.

The first mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation. If the installation path for the BLAST+ applications differs from the default (shown when the script is invoked without arguments), append the --path argument after the command line to convert. See the example in the first section of the Quick start.

The second mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation and appending the --print_only command line option as follows:

    $ ./legacy_blast.pl megablast -i query.fsa -d nt -o mb.out --print_only
    /opt/ncbi/blast/bin/blastn -query query.fsa -db "nt" -out mb.out
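
When many legacy invocations need converting, the wrapper command can be assembled programmatically. The following is a minimal Python sketch of the two modes described above. The helper name `wrap_legacy` is hypothetical, and the sketch assumes that legacy_blast.pl is reachable under that name; it only builds the argument list and does not run anything.

```python
from typing import List, Optional


def wrap_legacy(legacy_cmd: List[str],
                blast_path: Optional[str] = None,
                print_only: bool = False,
                script: str = "legacy_blast.pl") -> List[str]:
    """Prefix a C Toolkit BLAST command line with legacy_blast.pl.

    Hypothetical helper: it returns the argument list only.
    """
    # The script name goes in front of the unchanged legacy invocation.
    argv = [script] + list(legacy_cmd)
    if print_only:
        # Second mode: display the BLAST+ equivalent without executing it.
        argv.append("--print_only")
    if blast_path is not None:
        # Installation directory of the BLAST+ applications.
        argv += ["--path", blast_path]
    return argv
```

The returned list could be passed to `subprocess.run` once the script location has been confirmed on the local system.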

Exit codes

All BLAST+ applications have consistent exit codes to signify the exit status of the application. The possible exit codes along with their meaning are detailed in the table below:

| Exit Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error in query sequence(s) or BLAST options |
| 2 | Error in BLAST database |
| 3 | Error in BLAST engine |
| 4 | Out of memory |
| 5 | Network error connecting to NCBI to fetch sequence data |
| 6 | Error creating output files |
| 255 | Unknown error |

In the case of BLAST+ database applications, the possible exit codes are 0 (indicating success) and 1 (indicating failure).
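
A calling script can translate these exit codes into readable messages. The sketch below is in Python and uses illustrative names (`BLAST_EXIT_CODES`, `describe_exit_code`) that are not part of BLAST+. It covers the search applications only, since the database applications report just 0 or 1.

```python
# Exit codes of the BLAST+ search applications, as listed in the table above.
BLAST_EXIT_CODES = {
    0: "Success",
    1: "Error in query sequence(s) or BLAST options",
    2: "Error in BLAST database",
    3: "Error in BLAST engine",
    4: "Out of memory",
    5: "Network error connecting to NCBI to fetch sequence data",
    6: "Error creating output files",
    255: "Unknown error",
}


def describe_exit_code(code: int) -> str:
    """Return the documented meaning of a BLAST+ exit code."""
    # Codes outside the table are reported as undocumented, not guessed.
    return BLAST_EXIT_CODES.get(code, f"Undocumented exit code {code}")
```

A typical use would be to pass the `returncode` attribute of a `subprocess.run` result to `describe_exit_code` and log the message when the code is non-zero.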

Options for the command-line applications

This appendix consists of several tables that list option names, types, default values, and a short description of the option. These tables were first published as an appendix to an article in BMC Bioinformatics (BLAST+: architecture and applications). They have been updated for this manual.

Table C1:

Options common to all BLAST+ search applications. An option of type “flag” takes no argument, but if present is true. Some options are valid only for a local search (“remote” option not used), others are valid only for a remote search (“remote” option used).

| option | type | default value | description and notes |
|---|---|---|---|
| db | string | none | BLAST database name. |
| query | string | stdin | Query file name. |
| query_loc | string | none | Location on the query sequence (Format: start-stop). |
| out | string | stdout | Output file name. |
| evalue | real | 10.0 | Expect value (E) for saving hits. |
| subject | string | none | File with subject sequence(s) to search. |
| subject_loc | string | none | Location on the subject sequence (Format: start-stop). |
| show_gis | flag | N/A | Show NCBI GIs in report. |
| num_descriptions | integer | 500 | Show one-line descriptions for this number of database sequences. |
| num_alignments | integer | 250 | Show alignments for this number of database sequences. |
| max_target_seqs | integer | 500 | Number of aligned sequences to keep. Use with report formats that do not have separate definition line and alignment sections such as tabular (all outfmt > 4). Not compatible with num_descriptions or num_alignments. |
| max_hsps | integer | none | Maximum number of HSPs (alignments) to keep for any single query-subject pair. The HSPs shown will be the best as judged by expect value. This number should be an integer that is one or greater. If this option is not set, BLAST shows all HSPs meeting the expect value criteria. Setting it to one will show only the best HSP for every query-subject pair. |
| html | flag | N/A | Produce HTML output. |
| gilist | string | none | Restrict search of database to GIs listed in this file. Local searches only. |
| negative_gilist | string | none | Restrict search of database to everything except the GIs listed in this file. Local searches only. |
| entrez_query | string | none | Restrict search with the given Entrez query. Remote searches only. |
| culling_limit | integer | none | Delete a hit that is enveloped by at least this many higher-scoring hits. |
| best_hit_overhang | real | none | Best Hit algorithm overhang value (recommended value: 0.1). |
| best_hit_score_edge | real | none | Best Hit algorithm score edge value (recommended value: 0.1). |
| dbsize | integer | none | Effective size of the database. |
| searchsp | integer | none | Effective length of the search space. |
| import_search_strategy | string | none | Search strategy file to read. |
| export_search_strategy | string | none | Record search strategy to this file. |
| parse_deflines | flag | N/A | Parse query and subject bar delimited sequence identifiers (e.g., gi\|129295). |
| num_threads | integer | 1 | Number of threads (CPUs) to use in blast search. |
| remote | flag | N/A | Execute search on NCBI servers? |
| outfmt | string | 0 | Alignment view options, listed after this table. |

0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accession
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
sallacc means All subject accessions
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';' (in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';' (in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject (for all HSPs)
qcovhsp means Query Coverage Per HSP
qcovus is a measure of Query Coverage that counts a position in a subject sequence for this measure only once. The second time the position is aligned to the query is not counted towards this measure.
When not provided, the default value is:
'qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore', which is equivalent to the keyword 'std'
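
Because the tabular formats emit one hit per line with the fields in the order of the format specifiers, the output is straightforward to post-process. The Python sketch below parses one line of `-outfmt 6` output into a dictionary, defaulting to the 'std' column set given above. The function name and the sample values used to exercise it are illustrative only, and the sketch assumes the tab-delimited layout of format 6.

```python
# Column order of the default ('std') tabular format.
STD_COLUMNS = ("qseqid sseqid pident length mismatch gapopen "
               "qstart qend sstart send evalue bitscore").split()

# Typing of the std fields; any other specifier is kept as a string.
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_tabular_line(line, columns=STD_COLUMNS):
    """Parse one tab-delimited hit line into a dict keyed by specifier."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(columns):
        raise ValueError(
            f"expected {len(columns)} fields, found {len(values)}")
    record = {}
    for name, value in zip(columns, values):
        if name in INT_FIELDS:
            record[name] = int(value)
        elif name in FLOAT_FIELDS:
            record[name] = float(value)
        else:
            record[name] = value
    return record
```

For a custom format such as `-outfmt "6 qseqid sseqid evalue"`, pass the same specifier list as `columns` so that the field names match the order requested on the command line.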

Table C2:

blastn application options. The blastn application searches a nucleotide query against nucleotide subject sequences or a nucleotide database. An option of type “flag” takes no arguments, but if present the argument is true. Four different tasks are supported: 1.) “megablast”, for very similar sequences (e.g., sequencing errors), 2.) “dc-megablast”, typically used for inter-species comparisons, 3.) “blastn”, the traditional program used for inter-species comparisons, 4.) “blastn-short”, optimized for sequences less than 30 nucleotides.

| option | task(s) | type | default value | description and notes |
|---|---|---|---|---|
| word_size | megablast | integer | 28 | Length of initial exact match. |
| word_size | dc-megablast | integer | 11 | Number of matching nucleotides in initial match. dc-megablast allows non-consecutive letters to match. |
| word_size | blastn | integer | 11 | Length of initial exact match. |
| word_size | blastn-short | integer | 7 | Length of initial exact match. |
| gapopen | megablast | integer | 0 | Cost to open a gap. See appendix “BLASTN reward/penalty values”. |
| gapextend | megablast | integer | none | Cost to extend a gap. This default is a function of reward/penalty value. See appendix “BLASTN reward/penalty values”. |
| gapopen | blastn, blastn-short, dc-megablast | integer | 5 | Cost to open a gap. See appendix “BLASTN reward/penalty values”. |
| gapextend | blastn, blastn-short, dc-megablast | integer | 2 | Cost to extend a gap. See appendix “BLASTN reward/penalty values”. |
| reward | megablast | integer | 1 | Reward for a nucleotide match. |
| penalty | megablast | integer | -2 | Penalty for a nucleotide mismatch. |
| reward | blastn, dc-megablast | integer | 2 | Reward for a nucleotide match. |
| penalty | blastn, dc-megablast | integer | -3 | Penalty for a nucleotide mismatch. |
| reward | blastn-short | integer | 1 | Reward for a nucleotide match. |
| penalty | blastn-short | integer | -3 | Penalty for a nucleotide mismatch. |
| strand | all | string | both | Query strand(s) to search against database/subject. Choice of both, minus, or plus. |
| dust | all | string | 20 64 1 | Filter query sequence with dust. |
| filtering_db | all | string | none | Mask query using the sequences in this database. |
| window_masker_taxid | all | integer | none | Enable WindowMasker filtering using a Taxonomic ID. |
| window_masker_db | all | string | none | Enable WindowMasker filtering using this file. |
| soft_masking | all | boolean | true | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | all | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| perc_identity | all | integer | 0 | Percent identity cutoff. |
| template_type | dc-megablast | string | coding | Discontiguous MegaBLAST template type. Allowed values are coding, optimal and coding_and_optimal. |
| template_length | dc-megablast | integer | 18 | Discontiguous MegaBLAST template length. |
| use_index | megablast | boolean | false | Use MegaBLAST database index. Indices may be created with the makembindex application. |
| index_name | megablast | string | none | MegaBLAST database index name. |
| xdrop_ungap | all | real | 20 | Heuristic value (in bits) for ungapped extensions. |
| xdrop_gap | all | real | 30 | Heuristic value (in bits) for preliminary gapped extensions. |
| xdrop_gap_final | all | real | 100 | Heuristic value (in bits) for final gapped alignment. |
| no_greedy | megablast | flag | N/A | Use non-greedy dynamic programming extension. |
| min_raw_gapped_score | all | integer | none | Minimum raw gapped score to keep an alignment in the preliminary gapped and trace-back stages. Normally set based upon expect value. |
| ungapped | all | flag | N/A | Perform ungapped alignment. |
| window_size | dc-megablast | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |

Table C3:

blastp application options. The blastp application searches a protein sequence against protein subject sequences or a protein database. An option of type “flag” takes no arguments, but if present the argument is true. Three different tasks are supported: 1.) “blastp”, for standard protein-protein comparisons, 2.) “blastp-short”, optimized for query sequences shorter than 30 residues, and 3.) “blastp-fast”, a faster version that uses a larger word-size per https://www.ncbi.nlm.nih.gov/pubmed/17921491. This table reflects the 2.2.27 BLAST+ release.

| option | task | type | default value | description and notes |
|---|---|---|---|---|
| word_size | blastp | integer | 3 | Word size of initial match. Valid word sizes are 2-7. |
| word_size | blastp-short | integer | 2 | Word size of initial match. |
| word_size | blastp-fast | integer | 6 | Word size of initial match. |
| gapopen | blastp and blastp-fast | integer | 11 | Cost to open a gap. |
| gapextend | blastp and blastp-fast | integer | 1 | Cost to extend a gap. |
| gapopen | blastp-short | integer | 9 | Cost to open a gap. |
| gapextend | blastp-short | integer | 1 | Cost to extend a gap. |
| matrix | blastp and blastp-fast | string | BLOSUM62 | Scoring matrix name. |
| matrix | blastp-short | string | PAM30 | Scoring matrix name. |
| threshold | blastp | integer | 11 | Minimum score to add a word to the BLAST lookup table. |
| threshold | blastp-short | integer | 16 | Minimum score to add a word to the BLAST lookup table. |
| threshold | blastp-fast | integer | 21 | Minimum score to add a word to the BLAST lookup table. |
| comp_based_stats | blastp and blastp-fast | string | 2 | Use composition-based statistics: D or d: default (equivalent to 2); 0 or F or f: no composition-based statistics; 1: composition-based statistics as in NAR 29:2994-3005, 2001; 2 or T or t: composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties; 3: composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally. |
| comp_based_stats | blastp-short | string | 0 | Use composition-based statistics: D or d: default (equivalent to 2); 0 or F or f: no composition-based statistics; 1: composition-based statistics as in NAR 29:2994-3005, 2001; 2 or T or t: composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties; 3: composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally. |
| seg | all | string | no | Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or 'no' to disable). |
| soft_masking | all | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | all | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| xdrop_gap_final | all | real | 25 | Heuristic value (in bits) for final gapped alignment. |
| window_size | blastp and blastp-fast | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| window_size | blastp-short | integer | 15 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| use_sw_tback | all | flag | N/A | Compute locally optimal Smith-Waterman alignments? |

Table C4:

blastx application options. The blastx application translates a nucleotide query and searches it against protein subject sequences or a protein database. Two different tasks are supported: 1.) “blastx” for standard translated nucleotide-protein comparison and 2.) “blastx-fast”, a faster version that uses a larger word-size based on https://www.ncbi.nlm.nih.gov/pubmed/17921491.

| option | task | type | default value | description and notes |
|---|---|---|---|---|
| word_size | blastx | integer | 3 | Word size for initial match. Valid word sizes are 2-7. |
| word_size | blastx-fast | integer | 6 | Word size for initial match. |
| gapopen | all | integer | 11 | Cost to open a gap. |
| gapextend | all | integer | 1 | Cost to extend a gap. |
| matrix | all | string | BLOSUM62 | Scoring matrix name. |
| threshold | blastx | integer | 12 | Minimum score to add a word to the BLAST lookup table. |
| threshold | blastx-fast | integer | 21 | Minimum score to add a word to the BLAST lookup table. |
| seg | all | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or 'no' to disable). |
| soft_masking | all | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | all | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| xdrop_gap_final | all | real | 25 | Heuristic value (in bits) for final gapped alignment. |
| window_size | all | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| strand | all | string | both | Query strand(s) to search against database/subject. Choice of both, minus, or plus. |
| query_genetic_code | all | integer | 1 | Genetic code to translate query, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
| max_intron_length | all | integer | 0 | Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking). |
| comp_based_stats | all | string | 2 | Use composition-based statistics for blastx: D or d: default (equivalent to 2); 0 or F or f: no composition-based statistics; 1: composition-based statistics as in NAR 29:2994-3005, 2001; 2 or T or t: composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties; 3: composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally. Default = 2. |

Table C5:

tblastn application options. The tblastn application searches a protein query against nucleotide subject sequences or a nucleotide database translated at search time. Two different tasks are supported: 1.) “tblastn” for a standard protein-translated nucleotide comparison and 2.) “tblastn-fast” for a faster version with a larger word-size based on https://www.ncbi.nlm.nih.gov/pubmed/17921491.

| option | task | type | default value | description and notes |
|---|---|---|---|---|
| word_size | tblastn | integer | 3 | Word size for initial match. Valid word sizes are 2-7. |
| word_size | tblastn-fast | integer | 6 | Word size for initial match. |
| gapopen | all | integer | 11 | Cost to open a gap. |
| gapextend | all | integer | 1 | Cost to extend a gap. |
| matrix | all | string | BLOSUM62 | Scoring matrix name. |
| threshold | tblastn | integer | 13 | Minimum score to add a word to the BLAST lookup table. |
| threshold | tblastn-fast | integer | 21 | Minimum score to add a word to the BLAST lookup table. |
| seg | all | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or 'no' to disable). |
| soft_masking | all | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | all | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | all | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| xdrop_gap_final | all | real | 25 | Heuristic value (in bits) for final gapped alignment. |
| window_size | all | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| db_gen_code | all | integer | 1 | Genetic code to translate subject sequences, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
| max_intron_length | all | integer | 0 | Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking). |
| comp_based_stats | all | string | 2 | Use composition-based statistics for tblastn: D or d: default (equivalent to 2); 0 or F or f: no composition-based statistics; 1: composition-based statistics as in NAR 29:2994-3005, 2001; 2 or T or t: composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties; 3: composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally. Default = 2. |

Table C6:

tblastx application options. The tblastx application searches a translated nucleotide query against translated nucleotide subject sequences or a translated nucleotide database. An option of type “flag” takes no arguments, but if present the argument is true. This table reflects the 2.2.27 BLAST+ release. Only ungapped searches are supported for tblastx.

| option | type | default value | description and notes |
|---|---|---|---|
| word_size | integer | 3 | Word size for initial match. |
| matrix | string | BLOSUM62 | Scoring matrix name. |
| threshold | integer | 13 | Minimum word score to add the word to the BLAST lookup table. |
| seg | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or 'no' to disable). |
| soft_masking | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
| lcase_masking | flag | N/A | Use lower case filtering in query and subject sequence(s). |
| db_soft_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
| db_hard_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
| strand | string | both | Query strand(s) to search against database subject sequences. Choice of both, minus, or plus. |
| query_genetic_code | integer | 1 | Genetic code to translate query, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
| db_gen_code | integer | 1 | Genetic code to translate subject sequences, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
| max_intron_length | integer | 0 | Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking). |

Table C7:

rpsblast application options. The rpsblast application searches a protein query against the conserved domain database (CDD), which is a set of protein profiles. Many of the common options such as matrix or word threshold are set when the CDD is built and cannot be changed by the rpsblast application. A search ready CDD can be downloaded from ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/

| option | type | default value | description and notes |
|---|---|---|---|
| window_size | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
| xdrop_ungap | real | 15 | Heuristic value (in bits) for ungapped extensions. |
| xdrop_gap | real | 25 | Heuristic value (in bits) for preliminary gapped extensions. |
| xdrop_gap_final | real | 40 | Heuristic value (in bits) for final gapped alignment. |
| seg | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or 'no' to disable). |
| soft_masking | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |

Table C8:

Makeblastdb application options. This application builds a BLAST database. An option of type “flag” takes no arguments, but if present the argument is true.

| option | type | default value | description and notes |
|---|---|---|---|
| in | string | stdin | Input file/database name. |
| input_type | string | fasta | Input file type, one of: fasta (FASTA file(s)); blastdb (BLAST database(s)); asn1_txt (Seq-entries in text ASN.1 format); asn1_bin (Seq-entries in binary ASN.1 format). |
| dbtype | string | prot | Molecule type of input, values can be nucl or prot. |
| title | string | none | Title for BLAST database. If not set, the input file name will be used. |
| parse_seqids | flag | N/A | Parse bar delimited sequence identifiers (e.g., gi\|129295) in FASTA input. |
| hash_index | flag | N/A | Create index of sequence hash values. |
| mask_data | string | none | Comma-separated list of input files containing masking data as produced by NCBI masking applications (e.g. dustmasker, segmasker, windowmasker). |
| out | string | input file name | Name of BLAST database to be created. Input file name is used if none provided. This field is required if input consists of multiple files. |
| max_file_size | string | 1GB | Maximum file size to use for BLAST database. |
| taxid | integer | none | Taxonomy ID to assign to all sequences. |
| taxid_map | string | none | File with two columns mapping sequence ID to the taxonomy ID. The first column is the sequence ID represented as one of: (1) fasta with accessions (e.g., emb\|X17276.1\|); (2) fasta with GI (e.g., gi\|4); (3) GI as a bare number (e.g., 4); (4) a local ID, which must be prefixed with "lcl" (e.g., lcl\|4). The second column should be the NCBI taxonomy ID (e.g., 9606 for human). |
| logfile | string | none | Program log file (default is stderr). |
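
The options in this table can be combined into a makeblastdb invocation from a script. The Python sketch below only assembles the argument list and the lines of a taxid_map file; it does not run makeblastdb. The helper names are hypothetical, and the single space between the two taxid_map columns is an assumption, since the table says only that the file has two columns.

```python
from typing import Iterable, List, Optional, Tuple


def makeblastdb_args(in_file: str,
                     dbtype: str = "prot",
                     out: Optional[str] = None,
                     title: Optional[str] = None,
                     parse_seqids: bool = False,
                     taxid_map: Optional[str] = None) -> List[str]:
    """Build a makeblastdb argument list from options in the table above."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    argv = ["makeblastdb", "-in", in_file, "-dbtype", dbtype]
    if out is not None:
        argv += ["-out", out]
    if title is not None:
        argv += ["-title", title]
    if parse_seqids:
        # Flag option: present means true, and it takes no argument.
        argv.append("-parse_seqids")
    if taxid_map is not None:
        argv += ["-taxid_map", taxid_map]
    return argv


def taxid_map_lines(pairs: Iterable[Tuple[str, int]]) -> List[str]:
    """Format (sequence ID, taxonomy ID) pairs as two-column lines.

    Assumption: the two columns are separated by a single space.
    """
    return [f"{seq_id} {taxid}" for seq_id, taxid in pairs]
```

For example, `taxid_map_lines([("lcl|4", 9606)])` produces the line for a local ID assigned to human, using the identifier forms listed under taxid_map.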

Table C9:

Makeprofiledb application options. This application builds an RPS-BLAST database. An option of type “flag” takes no arguments, but if present the argument is true. COBALT (a multiple sequence alignment program) and DELTA-BLAST both use RPS-BLAST searches as part of their processing, but use specialized versions of the database. This application can build databases for COBALT, DELTA-BLAST, and a standard RPS-BLAST search. The “dbtype” option (see entry in table) determines which flavor of the database is built.

| option | type | default value | description and notes |
|---|---|---|---|
| in | string | stdin | Input file that contains a list of scoremat files (delimited by space, tab, or newline). |
| binary | flag | N/A | The scoremat files are binary ASN.1. |
| title | string | none | Title for RPS-BLAST database. If not set, the input file name will be used. |
| threshold | real | 9.82 | Threshold for RPSBLAST lookup table. |
| out | string | input file name | Name of BLAST database to be created. Input file name is used if none provided. |
| max_file_size | string | 1GB | Maximum file size to use for BLAST database. |
| dbtype | string | rps | Specifies use for RPSBLAST db. One of rps, cobalt, or delta. |
| index | flag | N/A | Creates index files. |
| gapopen | integer | none | Cost to open a gap. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
| gapextend | integer | none | Cost to extend a gap by one residue. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
| scale | real | 100 | PSSM scale factor. |
| matrix | string | BLOSUM62 | Matrix to use in constructing PSSM. One of BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM250, PAM30 or PAM70. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
| obsr_threshold | real | 6 | Exclude domains with maximum number of independent observations below this value (for use in DELTA-BLAST searches). |
| exclude_invalid | boolean | true | Exclude domains that do not pass validation test (for use in DELTA-BLAST searches). |
| logfile | string | none | Program log file (default is stderr). |

Table C10:

Blastdbcmd application options. This application reads a BLAST database and produces reports.

| option | type | default value | description and notes |
|---|---|---|---|
| db | string | nr | BLAST database name. |
| dbtype | string | guess | Molecule type stored in BLAST database, one of nucl, prot, or guess. |
| entry | string | none | Comma-delimited search string(s) of sequence identifiers, e.g., 555, AC147927, 'gnl\|dbname\|tag', or 'all' to select all sequences in the database. |
| entry_batch | string | none | Input file for batch processing. The format requires one entry per line; each line should begin with the sequence ID followed by any of the following optional specifiers (in any order): range (format: 'from-to', inclusive in 1-offsets), strand ('plus' or 'minus'), or masking algorithm ID (integer value representing the available masking algorithm). Omitting the ending range (e.g., '10-') is supported, but there should not be any spaces around the '-'. |
| pig | integer | none | PIG (protein identity group) to retrieve. |
| info | flag | N/A | Print BLAST database information. |
| range | string | none | Range of sequence to extract (Format: start-stop). |
| strand | string | plus | Strand of nucleotide sequence to extract. Choice of plus or minus. |
| mask_sequence_with | string | none | Produce lower-case masked FASTA using the algorithm IDs specified. |
| out | string | stdout | Output file name. |
| outfmt | string | %f | Output format, where the available format specifiers are: %f (sequence in FASTA format); %s (sequence data, without defline); %a (accession); %g (gi); %o (ordinal id, OID); %t (sequence title); %l (sequence length); %T (taxid); %L (common taxonomic name); %S (scientific name); %P (PIG); %mX (sequence masking data, where X is an optional comma-separated list of integers to specify the algorithm ID(s) to display, or all masks if absent or invalid specification). Masking data will be displayed as a series of 'N-M' values separated by ';' or the word 'none' if none are available. For every format except '%f', each line of output will correspond to a sequence. |
| target_only | flag | N/A | Definition line should contain target GI only. |
| get_dups | flag | N/A | Retrieve duplicate accessions. |
| line_length | integer | 80 | Line length for output. |
| ctrl_a | flag | N/A | Use Ctrl-A as the non-redundant definition line separator. |
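
The entry_batch file format described in the table can be generated from a script. The Python sketch below formats one line per entry. The function name is hypothetical, and the single space between fields is an assumption, since the table specifies only the order-independent specifiers and the rule that no spaces surround the '-' in a range.

```python
from typing import Optional


def entry_batch_line(seq_id: str,
                     start: Optional[int] = None,
                     stop: Optional[int] = None,
                     strand: Optional[str] = None,
                     mask_algo_id: Optional[int] = None) -> str:
    """Format one entry_batch line: sequence ID plus optional specifiers."""
    parts = [seq_id]
    if start is not None:
        # 1-offset inclusive range; an omitted end such as '10-' is allowed,
        # and there must be no spaces around the '-'.
        end = "" if stop is None else str(stop)
        parts.append(f"{start}-{end}")
    if strand is not None:
        if strand not in ("plus", "minus"):
            raise ValueError("strand must be 'plus' or 'minus'")
        parts.append(strand)
    if mask_algo_id is not None:
        parts.append(str(mask_algo_id))
    # Assumption: fields are separated by single spaces.
    return " ".join(parts)
```

Writing one such line per sequence to a file gives an input suitable for the entry_batch option, provided the separator assumption holds for the installed version.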

Table C11:

Makembindex application options. The indexed databases created by makembindex are used by production MegaBLAST software and by a new srsearch utility designed to quickly search for nearly exact matches (up to one mismatch) of short queries against a genomic database. When a FASTA formatted file is used as the input, then masking by lower case letters is incorporated in the index. Makembindex can currently build two types of indices, called “old style” and “new style” indexing. The NCBI offers full support for the new style and has deprecated the old style. A MegaBLAST search with a new style index requires that both the index and the corresponding BLAST database be present. The index structure is described in PMID:18567917. Please cite this paper in any publication that uses makembindex.

| option | type | default value | description and notes |
|---|---|---|---|
| input | string | stdin | Input file name or BLAST database name, depending on the value of the iformat parameter. For FASTA formatted input, this parameter is optional and defaults to the program's standard input stream. |
| output | string | none | The resulting index name. The index itself can consist of multiple files, called volumes, named <index_name>.00.idx, <index_name>.01.idx, ... This option should not be used with new style indices. |
| iformat | string | fasta | The input format selector. Possible values are 'fasta' and 'blastdb'. |
| old_style_index | boolean | false | The old_style_index is no longer supported. If set to 'false' the new style index is created. New style indices require a BLAST database as input (use -iformat blastdb), which can be downloaded from the NCBI FTP site or created with makeblastdb. The option -output is ignored for a new style index. New style indices are always created at the same location as the corresponding BLAST database. |
| db_mask | integer | none | Exclude masked regions of BLAST db from the index. Use makeblastdb to discover the algorithm ID to be used as input for this argument. |
| legacy | boolean | true | This is a compatibility feature to support current production MegaBLAST. If true, then -stride, -nmer, and -ws_hint are ignored. The legacy format must be used for BLAST. |
| nmer | integer | 12 | N-mer size to use. Ignored if -legacy is specified. |
| ws_hint | integer | 28 | This is an optimization hint for makembindex that indicates an expected minimum match size in searches that use the index. If n is the value of the -nmer parameter and s is the value of the -stride parameter, then the value of -ws_hint must be at least n + s - 1. |
| stride | integer | 5 | makembindex will index every stride-th N-mer of the database. |
| volsize | integer | 1536 | Target index volume size in megabytes. |
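
The constraint on -ws_hint stated in the table (at least n + s - 1, where n is the -nmer value and s is the -stride value) can be checked before an index is built. The following Python sketch uses hypothetical function names and the defaults from the table.

```python
def min_ws_hint(nmer: int = 12, stride: int = 5) -> int:
    """Smallest permitted -ws_hint for the given -nmer and -stride values."""
    return nmer + stride - 1


def ws_hint_is_valid(nmer: int = 12, stride: int = 5, ws_hint: int = 28) -> bool:
    """Return True if ws_hint satisfies ws_hint >= nmer + stride - 1."""
    return ws_hint >= min_ws_hint(nmer, stride)
```

With the defaults (nmer 12, stride 5) the minimum is 16, so the default ws_hint of 28 satisfies the constraint.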

BLASTN reward/penalty values

BLASTN uses a simple approach to score alignments, with identically matching bases assigned a reward and mismatching bases assigned a penalty. It is important to choose reward/penalty values appropriate to the sequences being aligned, with the (absolute) reward/penalty ratio increasing for more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved [2].

For each reward/penalty pair, a number of different gap costs are supported. A gap cost includes a value to open the gap and a value to extend the gap by a base. Following the convention of the command-line applications, these costs are listed as positive numbers here. MegaBLAST uses a specialized algorithm, described in PMID:10890397, to calculate the default gap costs for a reward/penalty pair. Briefly, the default megaBLAST cost to open a gap is zero, and the cost to extend a gap by two letters is given by the absolute value of two mismatches minus one match. For example, given a reward of 1 and penalty of -5, extending a gap by two letters costs |2 × (-5) - 1| = 11, so the cost to extend a gap by one letter is 5.5. The default gap costs for the other tasks supported by the blastn application are 5 to open a gap and 2 to extend it by one base.
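
The default megaBLAST gap costs can be computed directly from the rule stated above. The Python sketch below is one reading of that rule, with a hypothetical function name. It reproduces the worked example (reward 1, penalty -5, extension cost 5.5) and should not be taken as the implementation used by BLAST+ itself.

```python
from typing import Tuple


def megablast_default_gap_costs(reward: int, penalty: int) -> Tuple[int, float]:
    """Return (open, extend) default megaBLAST gap costs as positive numbers.

    The open cost is zero. Extending a gap by two letters costs the
    absolute value of two mismatches minus one match, so the per-letter
    extension cost is half of that.
    """
    two_letter_cost = abs(2 * penalty - reward)
    return 0, two_letter_cost / 2
```

For instance, a reward of 2 and a penalty of -7 give a two-letter cost of 16 and therefore a per-letter extension cost of 8.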

Table D1 presents the supported reward/penalty values and gap costs.

Table D1:

Supported reward/penalty values and gap costs for the blastn application. The left-most column presents the supported reward/penalty values. The middle column presents pairs of numbers for the cost to open and extend a gap for each reward/penalty value. Blastn also supports gap costs more stringent than those listed (e.g., for reward/penalty of 1/-3 gap costs of 5/2 or 500/2 are supported). The reward/penalty values are ordered from most to least stringent, with the more stringent values better suited for alignments with high sequence identity. The default megaBLAST gap costs are shown in the right-most column. Accurate statistics for these default megaBLAST gap costs can only be calculated for the most stringent reward/penalty values, but the values listed in the middle column can always be used.

reward/penalty  gap costs (open/extend)                   default MegaBLAST gap costs (open/extend)
1/-5            3/3                                       0/5.5
1/-4            1/2, 0/2, 2/1, 1/1                        0/4.5
2/-7            2/4, 0/4, 4/2, 2/2                        0/8
1/-3            2/2, 1/2, 0/2, 2/1, 1/1                   0/3.5
2/-5            2/4, 0/4, 4/2, 2/2                        0/6
1/-2            2/2, 1/2, 0/2, 3/1, 2/1, 1/1              0/2.5
2/-3            4/4, 2/4, 0/4, 3/3, 6/2, 5/2, 4/2, 2/2    0/4
3/-4            6/3, 5/3, 4/3, 6/2, 5/2, 4/2              N/A
4/-5            6/5, 5/5, 4/5, 3/5                        N/A
1/-1            3/2, 2/2, 1/2, 0/2, 4/1, 3/1, 2/1         N/A
3/-2            5/5                                       N/A
5/-4            10/6, 8/6                                 N/A
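
As an illustration of how Table D1 might be used, the sketch below transcribes the table into a Python dictionary and builds the scoring arguments for a blastn invocation. The function name and the "listed" flag are hypothetical. Because blastn also accepts gap costs more stringent than those listed, a pair that is not listed is only reported, not rejected. The options -reward, -penalty, -gapopen, and -gapextend are those of the blastn application.

```python
from typing import Dict, List, Tuple

# (reward, penalty) -> listed (open, extend) gap costs, transcribed from Table D1.
LISTED_GAP_COSTS: Dict[Tuple[int, int], List[Tuple[int, int]]] = {
    (1, -5): [(3, 3)],
    (1, -4): [(1, 2), (0, 2), (2, 1), (1, 1)],
    (2, -7): [(2, 4), (0, 4), (4, 2), (2, 2)],
    (1, -3): [(2, 2), (1, 2), (0, 2), (2, 1), (1, 1)],
    (2, -5): [(2, 4), (0, 4), (4, 2), (2, 2)],
    (1, -2): [(2, 2), (1, 2), (0, 2), (3, 1), (2, 1), (1, 1)],
    (2, -3): [(4, 4), (2, 4), (0, 4), (3, 3), (6, 2), (5, 2), (4, 2), (2, 2)],
    (3, -4): [(6, 3), (5, 3), (4, 3), (6, 2), (5, 2), (4, 2)],
    (4, -5): [(6, 5), (5, 5), (4, 5), (3, 5)],
    (1, -1): [(3, 2), (2, 2), (1, 2), (0, 2), (4, 1), (3, 1), (2, 1)],
    (3, -2): [(5, 5)],
    (5, -4): [(10, 6), (8, 6)],
}


def blastn_scoring_args(reward: int, penalty: int,
                        gapopen: int, gapextend: int) -> Tuple[List[str], bool]:
    """Return (argument list, whether the gap costs appear in Table D1)."""
    if (reward, penalty) not in LISTED_GAP_COSTS:
        raise ValueError(f"reward/penalty {reward}/{penalty} is not in Table D1")
    listed = (gapopen, gapextend) in LISTED_GAP_COSTS[(reward, penalty)]
    args = ["-reward", str(reward), "-penalty", str(penalty),
            "-gapopen", str(gapopen), "-gapextend", str(gapextend)]
    return args, listed


if __name__ == "__main__":
    # 5/2 is more stringent than the listed costs for 1/-3, so "listed" is False.
    print(blastn_scoring_args(1, -3, 5, 2))
```

The returned list can be appended to a blastn command line; the search itself is not run by this sketch.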

BLAST Substitution Matrices

BLAST uses a substitution matrix for any program that aligns residues. The program may align residues because both the query and database consist of proteins (e.g. BLASTP) or the program may align DNA translated to protein with protein (e.g. BLASTX). A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The theory of amino acid substitution matrices is described in [1], and applied to DNA sequence comparison in [2].

In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically.

Short alignments need to be relatively strong (i.e. have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead. For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is:

Query Length  Substitution Matrix  Gap Costs
<35           PAM-30               (9, 1)
35-50         PAM-70               (10, 1)
50-85         BLOSUM-80            (10, 1)
>85           BLOSUM-62            (11, 1)
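
The provisional table above can be expressed as a small lookup. This is a minimal sketch in Python with a hypothetical function name. The table's ranges share their end points (50 and 85), so the sketch assumes that a boundary length falls into the shorter-query row; that choice is an assumption, not something the table specifies.

```python
from typing import Tuple


def recommend_matrix(query_length: int) -> Tuple[str, Tuple[int, int]]:
    """Return (substitution matrix, (gap open, gap extend)) for a protein query.

    Boundary lengths 50 and 85 are assigned to the shorter-query row
    (an assumption of this sketch).
    """
    if query_length < 1:
        raise ValueError("query length must be positive")
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:
        return "PAM-70", (10, 1)
    if query_length <= 85:
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (11, 1)


if __name__ == "__main__":
    for length in (20, 40, 70, 300):
        print(length, recommend_matrix(length))
```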

Gap Costs

The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
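
The affine gap cost formula above can be written out directly. The following is a minimal sketch in Python; the function names and the way the alignment is represented (a list of residue-pair scores and a list of gap lengths) are assumptions made for illustration.

```python
from typing import Iterable


def affine_gap_score(a: int, b: int, k: int) -> int:
    """Score of a gap of k residues: -(a + b*k), with a and b given as positive costs."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(a + b * k)


def raw_score(pair_scores: Iterable[int], gap_lengths: Iterable[int],
              a: int, b: int) -> int:
    """Raw alignment score: sum of residue-pair scores plus the scores for gaps."""
    return sum(pair_scores) + sum(affine_gap_score(a, b, k) for k in gap_lengths)


if __name__ == "__main__":
    # With a = 11 and b = 1, a gap of length 1 scores -12 and a gap of length 3 scores -14.
    print(affine_gap_score(11, 1, 1), affine_gap_score(11, 1, 3))
```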

Lambda Ratio

To convert a raw score S into a normalized score S' expressed in bits, one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are parameters dependent upon the scoring system (substitution matrix and gap costs) employed [7-9]. For determining S', the more important of these parameters is lambda. The "lambda ratio" quoted here is the ratio of the lambda for the given scoring system to that for one using the same substitution scores, but with infinite gap costs [8]. This ratio indicates what proportion of information in an ungapped alignment must be sacrificed in the hope of improving its score through extension using gaps. We have found empirically that the most effective gap costs tend to be those with lambda ratios in the range 0.8 to 0.9.
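
The conversion to bits and the lambda ratio can be sketched as follows in Python. The function names are hypothetical. The numeric values in the example (lambda = 0.267 and K = 0.041 for a gapped scoring system, lambda = 0.3176 for the same substitution scores without gaps) are illustrative values of the kind reported for BLOSUM-62 with gap costs (11, 1); they are not taken from this appendix and should be treated as assumptions.

```python
import math


def bit_score(raw: float, lam: float, k: float) -> float:
    """Normalized score in bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def lambda_ratio(lam_gapped: float, lam_ungapped: float) -> float:
    """Ratio of the gapped lambda to the lambda with infinite gap costs."""
    return lam_gapped / lam_ungapped


if __name__ == "__main__":
    # Illustrative parameter values (assumptions, see the text above).
    print(round(bit_score(100, 0.267, 0.041), 1))
    print(round(lambda_ratio(0.267, 0.3176), 2))
```

With these illustrative values the lambda ratio is about 0.84, which falls inside the 0.8 to 0.9 range mentioned above.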

References

1. Altschul S.F. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 1991;219:555–565. [PubMed: 2051488]
2. States D.J., Gish W., Altschul S.F. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods. 1991;3:66–70.
3. Altschul S.F. A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol. 1993;36:290–300. [PubMed: 8483166]
4. Henikoff S., Henikoff J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA. 1992;89:10915–10919. [PMC free article: PMC50453] [PubMed: 1438297]
5. Dayhoff M.O., Schwartz R.M., Orcutt B.C. A model of evolutionary change in proteins. In: Dayhoff M.O., editor. Atlas of Protein Sequence and Structure, vol. 5, suppl. 3. Washington, DC: Natl. Biomed. Res. Found.; 1978. pp. 345–352.
6. Schwartz R.M., Dayhoff M.O. Matrices for detecting distant relationships. In: Dayhoff M.O., editor. Atlas of Protein Sequence and Structure, vol. 5, suppl. 3. Washington, DC: Natl. Biomed. Res. Found.; 1978. pp. 353–358.
7. Karlin S., Altschul S.F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA. 1990;87:2264–2268. [PMC free article: PMC53667] [PubMed: 2315319]
8. Altschul S.F., Gish W. Local alignment statistics. Meth. Enzymol. 1996;266:460–480. [PubMed: 8743700]
9. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article: PMC146917] [PubMed: 9254694]

BLAST is a Registered Trademark of the National Library of Medicine

Bookshelf ID: NBK279684
