Direct Comparison of Two FASTA Sequences Using bl2seq
Tao Tao, PhD User Services, NCBI, NLM, NIH
1. Introduction
The program bl2seq is designed to compare two input FASTA sequences directly, bypassing the formatdb
step. This allows users to quickly assess the similarity between two input sequences. The program
can perform six types of BLAST searches: blastn, blastp, blastx, tblastn, tblastx, and
megablast (in place of blastn).
The input sequences can be provided as FASTA sequence file names or, under the '-A T' setting, in
accession.version format, provided the client-server function can connect to NCBI.
2. Installation and setup
bl2seq does not require program-specific setup if the client-server function is not used. For users behind firewalls,
using the client-server function requires additional setup. Specifically, the following lines need to be added to
.ncbirc or ncbi.ini.
[CONN]
FIREWALL=TRUE
[NET_SERV]
SRV_CONN_MODE=SERVICE
This also requires that the corresponding ports be open for connection to certain NCBI IP addresses.
Refer to Searching Against NCBI Databases With blastcl3 for more information
on firewall and .ncbirc (ncbi.ini) configuration.
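As a minimal sketch, the settings above could be merged into existing configuration text with Python's standard configparser module. The function name ensure_firewall_settings is hypothetical, and the sketch assumes the file is plain INI-style text; it does not check that the required ports are open.

```python
import configparser
import io

# Settings listed above for client-server use behind a firewall.
REQUIRED = {
    "CONN": {"FIREWALL": "TRUE"},
    "NET_SERV": {"SRV_CONN_MODE": "SERVICE"},
}

def ensure_firewall_settings(ini_text):
    """Return ini_text with the required sections and keys added or updated."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case; the default lowercases keys
    parser.read_string(ini_text)
    for section, pairs in REQUIRED.items():
        if not parser.has_section(section):
            parser.add_section(section)
        for key, value in pairs.items():
            parser.set(section, key, value)
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()

# Starting from an empty file yields just the two required sections.
print(ensure_firewall_settings(""))
```

The returned text can then be written back to .ncbirc or ncbi.ini by whatever means suits the local setup.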
3. Common usage examples
bl2seq is a convenient tool for direct comparison of two local sequences in all possible combinations
of sequence types. Here we will discuss a few common comparisons using this tool.
One search bl2seq can perform is the direct comparison of two nucleotide sequences to find the differences
between the two sequences. The two sequences can be a mutated clone and its parent, a field isolate
and the reference strain, or an mRNA and its genomic counterpart.
bl2seq -i mutant.seq -j parent.seq -p blastn -F F -o mut_parent.out
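When such comparisons are scripted, the command line can be assembled programmatically. The following is a minimal Python sketch; the helper name bl2seq_command and its keyword arguments are hypothetical, and it covers only the options used in this section (-i, -j, -p, -F, -o, -A).

```python
import shlex

def bl2seq_command(first, second, program, filter_query=None,
                   output=None, use_accessions=False):
    """Build a bl2seq argument list from the options used in this document."""
    args = ["bl2seq", "-i", first, "-j", second, "-p", program]
    if filter_query is not None:
        args += ["-F", filter_query]
    if output is not None:
        args += ["-o", output]
    if use_accessions:
        args += ["-A", "T"]
    return args

cmd = bl2seq_command("mutant.seq", "parent.seq", "blastn",
                     filter_query="F", output="mut_parent.out")
print(shlex.join(cmd))
# To actually run it (bl2seq must be installed and on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with values such as the filter string "m L".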
We can also use bl2seq to directly compare PCR primers against their target sequence to identify
the annealing positions and amplicon size. To search two primers together, we join them into a single
query in the following format, separated by a run of Ns. The reverse primer does not have to be
converted to its reverse complement, since bl2seq searches both strands of the query by default (see -S).
>Human MLH1 primer pair
TGCACTGTGGGATGTGTTCT
NNNNNNNNNNNNNNNNNNNN
AATCAATCCACTGTGTATAAAGGAA
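A record in this layout can be generated with a few lines of Python. This is a sketch only: the function name primer_pair_fasta is hypothetical, and the default spacer of 20 Ns simply mirrors the record shown above.

```python
def primer_pair_fasta(title, forward, reverse, spacer_len=20):
    """Return a FASTA record: forward primer, a run of N, then reverse primer."""
    lines = [">" + title, forward, "N" * spacer_len, reverse]
    return "\n".join(lines) + "\n"

record = primer_pair_fasta("Human MLH1 primer pair",
                           "TGCACTGTGGGATGTGTTCT",
                           "AATCAATCCACTGTGTATAAAGGAA")
print(record, end="")
```

The reverse primer is written as given, without reverse complementing it.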
If the above primer pair is saved in MLH_primer.seq and the target sequence is in MLH.seq, we can
directly compare the two using the first command line below. The result is saved in amplicon.out. Since
the RefSeq mRNA for human MLH1 is NM_000249, we can use the second command line instead if the client-server
connection is set up correctly.
bl2seq -i MLH_primer.seq -j MLH.seq -p blastn -F F -o amplicon.out
bl2seq -i MLH_primer.seq -j NM_000249.2 -p blastn -F F -o amplicon.out -A T
If both primers map to the target well, the amplicon size can be calculated from the largest and
smallest subject coordinates given in the alignment output, as shown below.
Reverse Primer Match
Score = 50.1 bits (25), Expect = 1e-010
Identities = 25/25 (100%)
Strand = Plus / Minus
Query: 51   aatcaatccactgtgtataaaggaa 75
            |||||||||||||||||||||||||
Sbjct: 2496 aatcaatccactgtgtataaaggaa 2472
Forward Primer Match
Score = 40.1 bits (20), Expect = 1e-007
Identities = 20/20 (100%)
Strand = Plus / Plus
Query: 1    tgcactgtgggatgtgttct 20
            ||||||||||||||||||||
Sbjct: 2345 tgcactgtgggatgtgttct 2364
Amplicon size is the difference between the largest and smallest subject (Sbjct) coordinates plus 1, i.e. (2496 – 2345) + 1 = 152 bp.
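The same arithmetic can be written as a small Python helper. This is a sketch; the function name amplicon_size is hypothetical, and the coordinates are the Sbjct start and end values from the two alignments above.

```python
def amplicon_size(hsp_ranges):
    """hsp_ranges: (sbjct_start, sbjct_end) pairs, one per primer hit."""
    coords = [c for pair in hsp_ranges for c in pair]
    # Outermost subject coordinates span the amplicon, inclusive of both ends.
    return max(coords) - min(coords) + 1

hits = [(2496, 2472), (2345, 2364)]  # reverse primer hit, forward primer hit
print(amplicon_size(hits))  # 152
```

Taking the maximum and minimum over all four coordinates gives the same result regardless of which strand each primer matched.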
We can also use bl2seq to identify the region in a BAC clone to which a given mRNA sequence matches.
bl2seq -p blastn -i my_mRNA -j my_BAC -F "m L" -m T -o mRNA_BAC.out
bl2seq -p blastn -i NM_000249.2 -j AC006583.31 -F "m L" -m T -o hs_MLH1.out -A T
In the first command line above, we use megablast and the low-complexity filter to reduce spurious hits. For cleaner
results, we also add "m" to the filter string to allow extension through the filter-masked region. In the second
command line, we use the client-server function to retrieve the human MLH1 mRNA and a BAC clone containing this
gene before matching them. For precise mapping of the exon/intron junctions, however, we should use splign or spidey,
two specialized tools designed for this purpose. More information is available in spideydoc.html and the
splign document.
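If the comparison is rerun with tabular output (-D 1, described in the appendix), the matched regions on the BAC can be collected with a short script. The sketch below assumes the tab-separated 12-column layout of legacy BLAST tabular output (subject start and end in columns 9 and 10) with comment lines starting with "#"; that layout is an assumption here, not something this document specifies, and the sample rows are invented for illustration.

```python
def subject_intervals(tabular_text):
    """Collect (start, end) intervals on the second sequence, sorted by start."""
    intervals = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        s_start, s_end = int(fields[8]), int(fields[9])
        # Minus-strand hits list the larger coordinate first; normalize them.
        intervals.append((min(s_start, s_end), max(s_start, s_end)))
    return sorted(intervals)

# Hypothetical rows, for illustration only.
sample = (
    "# hypothetical rows, for illustration only\n"
    "my_mRNA\tmy_BAC\t100.00\t120\t0\t0\t1\t120\t5001\t5120\t1e-60\t238\n"
    "my_mRNA\tmy_BAC\t100.00\t90\t0\t0\t121\t210\t9300\t9211\t1e-42\t178\n"
)
print(subject_intervals(sample))
```

The resulting intervals only approximate exon locations; the specialized tools named above remain the better choice for exact junctions.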
Sometimes two nucleotide sequences may not have similarity detectable by bl2seq's blastn function. However, if both of
them encode protein products, we can use the tblastx program function to detect potential similarity shared
by these products.
For example, the following two sequences have no detectable nucleotide similarity
when compared using bl2seq's blastn function.
>Hypera postica cysteine proteinase, partial sequence
GGGAGCCAAATTCCAGGCCTTCAAGTTGGAGCATGGCAAGACCTACTTAAACCAAGCTGAGGAAAGCAAG
CGCTTTAACATCTTTACTGACAACGTACGCGCTATTGAAGCACACAATGCCCTCTACGAGCAAGGAAAAG
TATCCTACAAAAAAGGTATCAATAAATTCACTGACATGTCTCAAGAAGAGTTCAAGACAATGCTCACTCT
CAGCGCATCTAGAAAACCAACTTTGGAAACTACTTCATACGTAAAAACCGGTGTTGAAATCCCATCATCT
GTTGACTGGAGAAAAGAAGGTCGAGTAACTGGAGTCAAAGATCAAGGCGATTGTGGATCATGCTGGGCAT
TCTCTATCACTGGATCAACCGAAGGCGCCTACGCCCGTAAATCTGGGAAACTTGTTTCTCTTTCTGAACA
ACAATTGATAGACTGCTGCACTGATACAAGTGCAGGATGTGATGGTGGATCACTAGACGACAATTTTAAA
TACGTCATGAAGGATGGTCTTCAGTCTGAAGAAAGCTACACCTACAAGGGTGAGGATGGAGCATGCAAAT
ACAACGTTGCAAGTGTTGTAACTAAAGTCAGCAAATACACTTCCATTCCAGCAGAAGACGAAGATGCTCT
TCTTGAGGCTGTAGCTACTGTAGGACCAGTATCTGTTGGCATGGATGCTAGCTACCT
>Boophilus microplus cathepsin L-like proteinase, partial sequences
ACTGTTGCTGCAAGCTCTCAAGAAATCCTACGCACCCAATGGGAGGCATTTAAAACTACCCACAAAAAAT
CCTACCAGTCACACATGGAGGAGCTCCTGAGGTTCAAGATTTTCACGGAGAACAGCCTAATCATTGCCAA
GCACAACGCTAAGTACGCCAAGGGTCTCGTTTCTTACAAGCTCGGAATGAACCAGTTCGGCGATCTGCTG
GCACACGAATTTGCCAGGATCTTCAACGGTCACCACGGAACCCGCAAAACCGGTGGATCGACCTTCCTGC
CACCAGCAAACGTCAATGACAGCAGCCTGCCAAAAGTTGTCGACTGGCGCAAAAAAGGAGCTGTCACACC
TGTCAAGGACCAGGGACAGTGCGGGTCTTGCTGGGCCTTCAGTGCAACTGGATCTCTGGAGGGACAGCAT
TTTCTGAAGAACGGGGAGCTCGTTTCACTCAGTGAACAAAACTTGGTCGACTGTTCTCAGTCCTTCGGCA
ACAATGGTTGTGAAGGTGGTCTCATGGAGGACGCCTTTAAGTACATCAAGGCAAACGATGGTATCGACAC
GGAAAAAAGCTACCCATATGAGGCTGTGGATGGCGAGTGTCGTTTCAAGAAGGAAGATGTTGGAGCAACC
GACACCGGCTATGTGGAAATCAAGGCGGGTTCTGAGGTTGACCTGAAGAAGGCCGTCGCTACGGTCGGCC
CCATCTCTGTGGCTATTGACGCTAGTCACTCATCATTCCAGCTGTATTCCGAAGGAGTGTACGATGAGCC
CGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCGG
However, the similarity at the protein product level is readily detectable, if we use the
translated search function tblastx. The command line to use would be the following if we save
them as seq_1.txt and seq_2.txt, respectively.
bl2seq -i seq_1.txt -j seq_2.txt -p tblastx -F F
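To see why a translated comparison can succeed where a nucleotide comparison does not, consider one 24-base fragment from each of the two sequences above (the first spans a line break in the listing). The Python sketch below translates both with the standard genetic code; the fragments and the single reading frame were picked by hand for illustration, whereas tblastx itself compares all six reading frames of both sequences using a scoring matrix.

```python
# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate DNA in frame 1; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

frag_1 = "TGTGGATCATGCTGGGCATTCTCT"  # from the Hypera postica sequence
frag_2 = "TGCGGGTCTTGCTGGGCCTTCAGT"  # from the Boophilus microplus sequence
identical = sum(x == y for x, y in zip(frag_1, frag_2))
print(translate(frag_1), translate(frag_2))
print(identical, "of", len(frag_1), "bases identical")
```

Both fragments translate to the same eight residues even though a quarter of their bases differ, mostly at third codon positions.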
A common misuse of bl2seq is to try to identify patterns or sequence motifs that might be
present in the target sequence. Such attempts often fail since bl2seq does not handle
ambiguity codes. A more suitable, yet little-known, tool from the standalone BLAST package is
seedtop. Please refer to Pattern Search With seedtop for more information.
4. Feedback
Please send questions and comments on this document and BLAST in general to:
blast-help@ncbi.nlm.nih.gov
Questions and comments on other NCBI resources should be addressed to:
info@ncbi.nlm.nih.gov
5. Appendix: program parameters of bl2seq
We control the behavior of a given BLAST program through command-line parameter and value pairs. The convention is
-A value, where the dash plus a single letter marks the parameter and the value follows it.
In this section, we explain the available bl2seq parameters individually, one per table.
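The parameter and value convention is regular enough to be split mechanically. The Python sketch below (the function name parse_options is hypothetical) breaks one of the command lines from Section 3 into option letters and values, using shlex so that a quoted value such as "m L" stays in one piece.

```python
import shlex

def parse_options(command_line):
    """Split a bl2seq command line into {option letter: value}."""
    tokens = shlex.split(command_line)[1:]  # drop the program name
    if len(tokens) % 2:
        raise ValueError("every option needs a value")
    options = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if len(flag) != 2 or not flag.startswith("-"):
            raise ValueError("not a single-letter option: " + flag)
        options[flag[1]] = value
    return options

opts = parse_options(
    'bl2seq -p blastn -i my_mRNA -j my_BAC -F "m L" -m T -o mRNA_BAC.out'
)
print(opts)
```

Because tokens are consumed strictly in pairs, a value that itself begins with a dash is still read as a value.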
| Table 5.1 |
| Option | -i |
| Function | First Input query sequence |
| Default | - |
| Example | To search with my_primer as the first query, use: -i my_primer |
Note
This parameter must be set. Since bl2seq treats the second input query as the database,
it is recommended that the shorter query be provided to this option. When -A T is used,
-i takes accession.version, such as in "-i NM_000249.1".
| Table 5.2 |
| Option | -j |
| Function | Second Input query sequence |
| Default | - |
| Example | To search with my_target as the second query, use: -j my_target |
Note
This parameter must be set. The larger query should be provided to this parameter since bl2seq treats it
as the database.
| Table 5.3 |
| Option | -p |
| Function | Program function name, must be set |
| Default | - |
| Example | To search with tblastn, use: -p tblastn |
Note Program and query type combinations (NT = nucleotide, AA = protein)
| Program | blastn | blastp | blastx | tblastn | tblastx |
| Query (-i) | NT | AA | NT | AA | NT |
| 2nd Query (-j) | NT | AA | AA | NT | NT |
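The combinations in this table can be encoded as a lookup so that a script can reject a mismatched program before running bl2seq. This is a minimal Python sketch; the names SEQUENCE_TYPES and check_inputs are hypothetical, and the sequence types are assumed to be known to the caller rather than detected from the files.

```python
# (first query type, second query type) per program, from the table above.
SEQUENCE_TYPES = {
    "blastn": ("NT", "NT"),
    "blastp": ("AA", "AA"),
    "blastx": ("NT", "AA"),
    "tblastn": ("AA", "NT"),
    "tblastx": ("NT", "NT"),
}

def check_inputs(program, first_type, second_type):
    """Return True if the -i/-j sequence types suit the -p program."""
    if program not in SEQUENCE_TYPES:
        raise ValueError("unknown program: " + program)
    return SEQUENCE_TYPES[program] == (first_type, second_type)

print(check_inputs("blastx", "NT", "AA"))   # nucleotide query vs protein
print(check_inputs("tblastn", "NT", "AA"))  # types given in the wrong order
```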
| Table 5.4 |
| Option | -g |
| Function | Perform gapped alignment |
| Default | T |
| Example | To disable gapping, use: -g F |
Note Default is to do gapped alignment.
| Table 5.5 |
| Option | -o |
| Function | To save alignment to specified output file |
| Default | stdout |
| Example | To save search result to my_output, use: -o my_output |
Note
Default is to print the output to the screen, which can be redirected to file or piped to other downstream process.
| Table 5.6 |
| Option | -d |
| Function | Theoretical database size |
| Default | 0 |
| Example | To use a theoretical database size of 2000000, use: -d 2000000 |
Note Default is to use actual size of the second query. We can use this parameter to provide the actual size
of a real database such as protein nr to get a more realistic Expect value for the returned protein alignment.
| Table 5.7 |
| Option | -a |
| Function | SeqAnnot output file |
| Default | Optional |
| Example | To save this to my_seqalign, use: -a my_seqalign |
Note The output is an ASN.1 file.
| Table 5.8 |
| Option | -G |
| Function | Cost to open a gap |
| Default | Varies |
| Example | To set the gap opening cost to 5, use: -G 5 |
Note Defaults for various -p settings
| Program | blastn | blastp | blastx | tblastn | tblastx | megablast |
| Value | 5 | 11 | 11 | 11 | 11 | 0 |
| Table 5.9 |
| Option | -E |
| Function | Cost to extend a gap |
| Default | -1 |
| Example | To set the gap extension cost to 2, use: -E 2 |
Note Defaults for various -p settings
| Program | blastn | blastp | blastx | tblastn | tblastx | megablast |
| Value | 2 | 1 | 1 | 1 | 1 | 0 |
| Table 5.10 |
| Option | -X |
| Function | X dropoff value for gapped alignment (in bits) |
| Default | 0 |
| Example | To increase this X dropoff to 50, use: -X 50 |
Note Zero invokes the following defaults
| Program | blastn | megablast | tblastx | all others |
| Value | 30 | 20 | 0 | 15 |
| Table 5.11 |
| Option | -W |
| Function | Word size |
| Default | 0 |
| Example | To decrease the nucleotide search word size to 8, use: -W 8 |
Note Zero invokes the following defaults
| Program | blastn | megablast | all others |
| Value | 11 | 28 | 3 |
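The rule that zero invokes a program-dependent default can be expressed as a small lookup. The Python sketch below uses only the values from the table above; the function name effective_word_size and its arguments are hypothetical, with the megablast argument standing for the -m T setting.

```python
WORD_SIZE_DEFAULTS = {"blastn": 11, "megablast": 28}
OTHER_PROGRAMS_DEFAULT = 3

def effective_word_size(program, word_size=0, megablast=False):
    """Resolve the word size used when -W is left at 0."""
    if word_size != 0:
        return word_size  # an explicit -W value takes precedence
    key = "megablast" if program == "blastn" and megablast else program
    return WORD_SIZE_DEFAULTS.get(key, OTHER_PROGRAMS_DEFAULT)

print(effective_word_size("blastn"))
print(effective_word_size("blastn", megablast=True))
print(effective_word_size("blastp"))
```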
| Table 5.12 |
| Option | -M |
| Function | Score matrix to use |
| Default | BLOSUM62 |
| Example | To use PAM30 for short peptide search, use: -M PAM30 |
Note bl2seq only supports BLOSUM45, BLOSUM62, BLOSUM80, PAM30, and PAM70.
| Table 5.13 |
| Option | -q |
| Function | Penalty for a nucleotide mismatch |
| Default | -3 |
| Example | To decrease this to -2, use: -q -2 |
Note For blastn only.
| Table 5.14 |
| Option | -r |
| Function | Reward for a nucleotide match |
| Default | 1 |
| Example | To increase this to 2, use: -r 2 |
Note For blastn only.
| Table 5.15 |
| Option | -F |
| Function | Filter query sequence |
| Default | T |
| Example | To turn off filter, use: -F F |
Note Accepted strings: T, F, D, L, R, V, S, C, and m. See Section 6.4 of the "BLAST URLAPI" document for details.
| Table 5.16 |
| Option | -e |
| Function | Expect value |
| Default | 10.0 |
| Example | To increase this for short primer search, use: -e 1000 |
Note This controls the search stringency. To increase the stringency, reduce the value.
To reduce the stringency, do the reverse.
| Table 5.17 |
| Option | -S |
| Function | Query strands to use in the search (1st query) |
| Default | 3 |
| Example | To search with only the input strand, use: -S 1 |
Note For nucleotide search with blastn only: 3 is both, 1 is top, 2 is bottom.
| Table 5.18 |
| Option | -T |
| Function | Produce HTML output |
| Default | F |
| Example | To produce HTML output viewable through browser, use: -T T |
Note Hits to NCBI sequences will be hyperlinked to their Entrez records if -A T and accession.version are used.
| Table 5.19 |
| Option | -m |
| Function | Use megablast for search |
| Default | F |
| Example | To trigger megablast algorithm, use: -m T |
Note For blastn only. Search will be faster, but less sensitive.
| Table 5.20 |
| Option | -Y |
| Function | Effective length of the search space |
| Default | 0 |
| Example | To use an effective search space of 1000000, use: -Y 1000000 |
Note The default is to use the actual search space defined by the input queries.
| Table 5.21 |
| Option | -t |
| Function | Length of the largest intron allowed in tblastn for linking HSPs |
| Default | 0 |
| Example | To link HSPs up to 2000 bp apart, use: -t 2000 |
Note Zero disables HSP linking.
| Table 5.22 |
| Option | -I (upper case i) |
| Function | Location on first query |
| Default | Optional |
| Example | To search subsequence 100-200 of the first query, use: -I 100,200 |
Note N/A.
| Table 5.23 |
| Option | -J |
| Function | Location on second query |
| Default | Optional |
| Example | To search subsequence 1000-2000 of the second query, use: -J 1000,2000 |
Note This is the only program that allows users to specify a subsequence in the "database" entry.
| Table 5.24 |
| Option | -D |
| Function | Output format |
| Default | 0 |
| Example | To see the tabular output, use: -D 1 |
Note An input of 0 generates pairwise display and an input of 1 generates tabular display.
Currently bl2seq does not support XML.
| Table 5.25 |
| Option | -U |
| Function | Use lower case filtering for the query sequences |
| Default | F |
| Example | To enable lowercase filtering, use: -U T |
Note This is for the first query. Make sure there is an uppercase region available in that sequence.
| Table 5.26 |
| Option | -A |
| Function | Input sequences in the form of accession.version |
| Default | F |
| Example | To enable this parsing and retrieving, use: -A T |
Note The input to -i and/or -j should be accession.version, such as in -i NM_000249.2.
| Table 5.27 |
| Option | -V |
| Function | Force use of the legacy BLAST engine |
| Default | F |
| Example | To enable this, use: -V T |
Note Not recommended.