Direct Comparison of Two FASTA Sequences Using bl2seq

Tao Tao, PhD
User Services, NCBI, NLM, NIH

1. Introduction

The program bl2seq compares two input FASTA sequences directly, bypassing the formatdb step. This allows users to quickly assess the similarity between the two sequences. The program can perform six types of BLAST searches: blastn, blastp, blastx, tblastn, tblastx, and megablast (used in place of blastn).

The input sequences can be provided as the names of FASTA sequence files or, under the '-A T' setting, in accession.version format, provided the client server function can connect to NCBI.

2. Installation and setup

bl2seq does not require program-specific setup if the client server function is not used. For users behind firewalls, using the client server function requires additional setup. Specifically, the following lines need to be added to .ncbirc or ncbi.ini.

[CONN]
FIREWALL=TRUE

[NET_SERV]
SRV_CONN_MODE=SERVICE

This also requires that the corresponding ports be open for connection to certain NCBI IP addresses. Refer to Searching Against NCBI Databases With blastcl3 for more information on firewall and .ncbirc (ncbi.ini) configuration.

3. Common usage examples

bl2seq is a convenient tool for direct comparison of two local sequences in all possible combinations of sequence types. Here we discuss a few common comparisons using this tool.

One search bl2seq can perform is the direct comparison of two nucleotide sequences to find the differences between the two sequences. The two sequences can be a mutated clone and its parent, a field isolate and the reference strain, or an mRNA and its genomic counterpart.

bl2seq -i mutant.seq -j parent.seq -p blastn -F F -o mut_parent.out

We can also use bl2seq to directly compare PCR primers against their target sequence to identify the annealing positions and the amplicon size. To search two primers together, we need to combine them in the following format. The reverse primer does not have to be converted to its reverse complement since bl2seq searches both strands automatically.

>Human MLH1 primer pair 
TGCACTGTGGGATGTGTTCT
NNNNNNNNNNNNNNNNNNNN
AATCAATCCACTGTGTATAAAGGAA

If the above primer pair is saved in MLH_primer.seq, and the target sequence is in MLH.seq, we can directly compare the two using the first command line below. The result is saved in amplicon.out. Since the RefSeq mRNA for human MLH1 is NM_000249, we can use the second command line instead if the client server connection is set up correctly.

bl2seq -i MLH_primer.seq -j MLH.seq -p blastn -F F -o amplicon.out

bl2seq -i MLH_primer.seq -j NM_000249.2 -p blastn -F F -o amplicon.out -A T
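
The primer file used in the commands above could be assembled with a small shell helper. This is only a sketch: the function name make_primer_fasta is hypothetical, and the spacer of 20 N's simply copies the example record shown earlier.

```shell
# Hypothetical helper: join a forward and a reverse primer into one FASTA
# record, separated by a line of N's, in the layout of the MLH1 example.
make_primer_fasta() {
    title=$1
    forward=$2
    reverse=$3
    spacer=NNNNNNNNNNNNNNNNNNNN   # 20 N's, copied from the example record
    printf '>%s\n%s\n%s\n%s\n' "$title" "$forward" "$spacer" "$reverse"
}

# Write the MLH1 primer pair to the file name used in the commands above.
make_primer_fasta "Human MLH1 primer pair" \
    TGCACTGTGGGATGTGTTCT \
    AATCAATCCACTGTGTATAAAGGAA > MLH_primer.seq
```

The reverse primer is written as ordered, without reverse complementing it.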

If both primers map to the target well, the amplicon size can be calculated from the largest and smallest subject coordinates given in the alignment output, as shown below.

Reverse Primer Match
Score = 50.1 bits (25), Expect = 1e-010
 Identities = 25/25 (100%)
 Strand = Plus / Minus

Query: 51   aatcaatccactgtgtataaaggaa 75
            |||||||||||||||||||||||||
Sbjct: 2496 aatcaatccactgtgtataaaggaa 2472

Forward Primer Match
Score = 40.1 bits (20), Expect = 1e-007
 Identities = 20/20 (100%)
 Strand = Plus / Plus

Query: 1    tgcactgtgggatgtgttct 20
            ||||||||||||||||||||
Sbjct: 2345 tgcactgtgggatgtgttct 2364

Amplicon size is the difference between the largest and smallest subject (Sbjct) coordinates plus 1, i.e. (2496 - 2345) + 1 = 152 bp.
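
The same arithmetic can be scripted. The following is a minimal sketch, assuming the hits are available in tabular form (for example from the -D 1 output format) as tab-separated lines with the subject start and end in columns 9 and 10. That column layout, the function name amplicon_size, and the sequence IDs in the sample rows are assumptions made for illustration.

```shell
# Sketch: read tabular hits on stdin and print
# (largest subject coordinate - smallest subject coordinate) + 1.
# Assumes tab-separated fields with subject start/end in columns 9 and 10;
# lines starting with '#' and lines with fewer than 10 fields are skipped.
amplicon_size() {
    awk -F '\t' '
        /^#/ || NF < 10 { next }
        {
            s = $9 + 0; e = $10 + 0
            if (s > e) { t = s; s = e; e = t }   # minus-strand hits run high to low
            if (!seen || s < min) min = s
            if (!seen || e > max) max = e
            seen = 1
        }
        END { if (seen) print max - min + 1 }
    '
}

# Sample rows mirroring the two primer hits shown above (IDs are made up).
{
    printf 'primers\tMLH1\t100.00\t25\t0\t0\t51\t75\t2496\t2472\t1e-10\t50.1\n'
    printf 'primers\tMLH1\t100.00\t20\t0\t0\t1\t20\t2345\t2364\t1e-07\t40.1\n'
} | amplicon_size   # prints 152
```

The result is only meaningful when each primer produces a single hit on the target; additional hits would widen the reported range.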

We can also use bl2seq to identify the region in a BAC clone to which a given mRNA sequence matches.

bl2seq -p blastn -i my_mRNA -j my_BAC -F "m L" -m T -o mRNA_BAC.out

bl2seq -p blastn -i NM_000249.2 -j AC006583.31 -F "m L" -m T -A T -o hs_MLH1.out

In the first command line above, we use megablast and the low complexity filter to reduce spurious hits. For a cleaner result, we also add "m" to the filter string to allow extension through the filter-masked region. In the second command line, we use the client server function to retrieve the human MLH1 mRNA and a BAC clone containing this gene before matching them. For precise mapping of the exon/intron junctions, however, we should use splign or spidey, two specialized tools designed for this purpose. More information is available in the spidey documentation (spideydoc.html) and the splign documentation.

Sometimes two nucleotide sequences may not have similarity detectable by bl2seq's blastn function. However, if both of them encode protein products, we can use the tblastx program function to detect the potential similarity shared by these products.

For example, the following two sequences have no detectable nucleotide similarity when compared using bl2seq's blastn function.

>Hypera postica cysteine proteinase, partial sequence
GGGAGCCAAATTCCAGGCCTTCAAGTTGGAGCATGGCAAGACCTACTTAAACCAAGCTGAGGAAAGCAAG
CGCTTTAACATCTTTACTGACAACGTACGCGCTATTGAAGCACACAATGCCCTCTACGAGCAAGGAAAAG
TATCCTACAAAAAAGGTATCAATAAATTCACTGACATGTCTCAAGAAGAGTTCAAGACAATGCTCACTCT
CAGCGCATCTAGAAAACCAACTTTGGAAACTACTTCATACGTAAAAACCGGTGTTGAAATCCCATCATCT
GTTGACTGGAGAAAAGAAGGTCGAGTAACTGGAGTCAAAGATCAAGGCGATTGTGGATCATGCTGGGCAT
TCTCTATCACTGGATCAACCGAAGGCGCCTACGCCCGTAAATCTGGGAAACTTGTTTCTCTTTCTGAACA
ACAATTGATAGACTGCTGCACTGATACAAGTGCAGGATGTGATGGTGGATCACTAGACGACAATTTTAAA
TACGTCATGAAGGATGGTCTTCAGTCTGAAGAAAGCTACACCTACAAGGGTGAGGATGGAGCATGCAAAT
ACAACGTTGCAAGTGTTGTAACTAAAGTCAGCAAATACACTTCCATTCCAGCAGAAGACGAAGATGCTCT
TCTTGAGGCTGTAGCTACTGTAGGACCAGTATCTGTTGGCATGGATGCTAGCTACCT
>Boophilus microplus cathepsin L-like proteinase, partial sequences
ACTGTTGCTGCAAGCTCTCAAGAAATCCTACGCACCCAATGGGAGGCATTTAAAACTACCCACAAAAAAT
CCTACCAGTCACACATGGAGGAGCTCCTGAGGTTCAAGATTTTCACGGAGAACAGCCTAATCATTGCCAA
GCACAACGCTAAGTACGCCAAGGGTCTCGTTTCTTACAAGCTCGGAATGAACCAGTTCGGCGATCTGCTG
GCACACGAATTTGCCAGGATCTTCAACGGTCACCACGGAACCCGCAAAACCGGTGGATCGACCTTCCTGC
CACCAGCAAACGTCAATGACAGCAGCCTGCCAAAAGTTGTCGACTGGCGCAAAAAAGGAGCTGTCACACC
TGTCAAGGACCAGGGACAGTGCGGGTCTTGCTGGGCCTTCAGTGCAACTGGATCTCTGGAGGGACAGCAT
TTTCTGAAGAACGGGGAGCTCGTTTCACTCAGTGAACAAAACTTGGTCGACTGTTCTCAGTCCTTCGGCA
ACAATGGTTGTGAAGGTGGTCTCATGGAGGACGCCTTTAAGTACATCAAGGCAAACGATGGTATCGACAC
GGAAAAAAGCTACCCATATGAGGCTGTGGATGGCGAGTGTCGTTTCAAGAAGGAAGATGTTGGAGCAACC
GACACCGGCTATGTGGAAATCAAGGCGGGTTCTGAGGTTGACCTGAAGAAGGCCGTCGCTACGGTCGGCC
CCATCTCTGTGGCTATTGACGCTAGTCACTCATCATTCCAGCTGTATTCCGAAGGAGTGTACGATGAGCC
CGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCGG

However, the similarity at the protein product level is readily detectable if we use the translated search function tblastx. If we save the two sequences as seq_1.txt and seq_2.txt, respectively, the command line to use would be the following.

bl2seq -i seq_1.txt -j seq_2.txt -p tblastx -F F

A common misuse of bl2seq is to try to identify patterns or sequence motifs that might be present in the target sequence. Such attempts often fail since bl2seq does not handle ambiguity codes. A more suitable, yet little known, tool from the standalone BLAST package is seedtop. Please refer to Pattern Search With seedtop for more information.

4. Feedback

Please send questions and comments on this document and on BLAST in general to:

blast-help@ncbi.nlm.nih.gov

Questions and comments on other NCBI resources should be addressed to:

info@ncbi.nlm.nih.gov

5. Appendix: program parameters of bl2seq

We control behavior of a given BLAST program through command line parameter and value pairs. The convention is -A value, where the dash plus single letter marks the parameter with value following it. In this section, we will explain the available bl2seq parameters individually, one per table.

Table 5.1
Option: -i
Function: First input query sequence
Default: -
Example: To search with my_primer as the first query, use: -i my_primer
Note: This parameter must be set. Since bl2seq treats the second input query as the database, it is recommended that the shorter query be provided to this option. When -A T is used, -i takes an accession.version, as in "-i NM_000249.1".

Table 5.2
Option: -j
Function: Second input query sequence
Default: -
Example: To search with my_target as the second query, use: -j my_target
Note: This parameter must be set. The larger query should be provided to this parameter since bl2seq treats it as the database.

Table 5.3
Option: -p
Function: Program function name, must be set
Default: -
Example: To search with tblastn, use: -p tblastn
Note: Program and query type combinations
Program         blastn    blastp    blastx   tblastn   tblastx
Query (-i)        NT        AA        NT        AA        NT
2nd Query (-j)    NT        AA        AA        NT        NT
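
The combinations in the table above can be encoded as a small lookup, for example to check inputs in a wrapper script before calling bl2seq. This is a sketch only: the function name expected_types is hypothetical, and NT and AA stand for nucleotide and protein as in the table.

```shell
# Hypothetical helper: print the sequence types expected by -i and -j
# (in that order) for a given -p value, following the table above.
expected_types() {
    case $1 in
        blastn)  echo "NT NT" ;;
        blastp)  echo "AA AA" ;;
        blastx)  echo "NT AA" ;;
        tblastn) echo "AA NT" ;;
        tblastx) echo "NT NT" ;;
        *)       echo "unknown program: $1" >&2; return 1 ;;
    esac
}

expected_types tblastn   # prints: AA NT
```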

Table 5.4
Option: -g
Function: Perform gapped alignment
Default: T
Example: To disable gapping, use: -g F
Note: Default is to do gapped alignment.

Table 5.5
Option: -o
Function: Save alignment to the specified output file
Default: stdout
Example: To save the search result to my_output, use: -o my_output
Note: Default is to print the output to the screen, which can be redirected to a file or piped to a downstream process.

Table 5.6
Option: -d
Function: Theoretical database size
Default: 0
Example: To use a theoretical database size of 2000000, use: -d 2000000
Note: Default is to use the actual size of the second query. We can use this parameter to provide the actual size of a real database, such as protein nr, to get a more realistic Expect value for the returned protein alignment.

Table 5.7
Option: -a
Function: SeqAnnot output file
Default: Optional
Example: To save this to my_seqalign, use: -a my_seqalign
Note: The output is an ASN.1 file.

Table 5.8
Option: -G
Function: Cost to open a gap
Default: Varies
Example: To set the gap open penalty to -5 (a cost of 5), use: -G 5
Note: Defaults for various -p settings
Program    blastn    blastp    blastx    tblastn    tblastx    megablast
Value        5         11        11         11         11          0

Table 5.9
Option: -E
Function: Cost to extend a gap
Default: -1
Example: To increase the gap extension penalty to -2, use: -E 2
Note: Defaults for various -p settings
Program    blastn    blastp    blastx    tblastn    tblastx    megablast
Value        2         1         1          1          1           0

Table 5.10
Option: -X
Function: X dropoff value for gapped alignment (in bits)
Default: 0
Example: To increase this X dropoff to 50, use: -X 50
Note: Zero invokes the following defaults
Program    blastn    megablast    tblastx    all others
Value        30         20           0           15

Table 5.11
Option: -W
Function: Word size
Default: 0
Example: To decrease the nucleotide search word size to 8, use: -W 8
Note: Zero invokes the following defaults
Program    blastn    megablast    all others
Value        11         28            3

Table 5.12
Option: -M
Function: Score matrix to use
Default: BLOSUM62
Example: To use PAM30 for short peptide search, use: -M PAM30
Note: bl2seq only supports BLOSUM45, BLOSUM62, BLOSUM80, PAM30, and PAM70.

Table 5.13
Option: -q
Function: Penalty for a nucleotide mismatch
Default: -3
Example: To reduce this penalty to -2, use: -q -2
Note: For blastn only.

Table 5.14
Option: -r
Function: Reward for a nucleotide match
Default: 1
Example: To increase this to 2, use: -r 2
Note: For blastn only.

Table 5.15
Option: -F
Function: Filter query sequence
Default: T
Example: To turn off the filter, use: -F F
Note: Accepted strings: T, F, D, L, R, V, S, C, and m. See Section 6.4 of the "BLAST URLAPI" document for details.

Table 5.16
Option: -e
Function: Expect value
Default: 10.0
Example: To increase this for short primer search, use: -e 1000
Note: This controls the search stringency. To increase the stringency, reduce the value. To reduce the stringency, increase the value.

Table 5.17
Option: -S
Function: Query strands to use in the search (1st query)
Default: 3
Example: To search with only the input strand, use: -S 1
Note: For nucleotide search with blastn only: 3 is both, 1 is top, 2 is bottom.

Table 5.18
Option: -T
Function: Produce HTML output
Default: F
Example: To produce HTML output viewable through a browser, use: -T T
Note: Hits to NCBI sequences will be hyperlinked to their Entrez records if -A T and accession.version are used.

Table 5.19
Option: -m
Function: Use megablast for search
Default: F
Example: To trigger the megablast algorithm, use: -m T
Note: For blastn only. The search will be faster, but less sensitive.

Table 5.20
Option: -Y
Function: Effective length of the search space
Default: 0
Example: To use an effective search space of 1000000, use: -Y 1000000
Note: The default is to use the actual search space defined by the input queries.

Table 5.21
Option: -t
Function: Length of the largest intron allowed in tblastn for linking HSPs
Default: 0
Example: To link HSPs 2000 bp apart, use: -t 2000
Note: Zero disables HSP linking.

Table 5.22
Option: -I (upper case i)
Function: Location on first query
Default: Optional
Example: To search subsequence 100-200 of the first query, use: -I 100,200
Note: N/A.

Table 5.23
Option: -J
Function: Location on second query
Default: Optional
Example: To search subsequence 1000-2000 of the second query, use: -J 1000,2000
Note: This is the only program that allows users to specify a subsequence in the "database" entry.

Table 5.24
Option: -D
Function: Output format
Default: 0
Example: To see the tabular output, use: -D 1
Note: An input of 0 generates pairwise display and an input of 1 generates tabular display. Currently bl2seq does not support XML.

Table 5.25
Option: -U
Function: Use lower case filtering for the query sequences
Default: F
Example: To enable lowercase filtering, use: -U T
Note: This is for the first query. Make sure there is an uppercase region available in that sequence.

Table 5.26
Option: -A
Function: Input sequences in the form of accession.version
Default: F
Example: To enable this parsing and retrieving, use: -A T
Note: The input to -i and/or -j should be accession.version, such as in -i NM_000249.2.

Table 5.27
Option: -V
Function: Force use of the legacy BLAST engine
Default: F
Example: To enable this, use: -V T
Note: Not recommended.