Search for Sequence Patterns Using seedtop
Tao Tao, Ph.D. User Services NCBI, NLM, NIH
1. Introduction
seedtop is a little-known program in the NCBI standalone blast package whose main function
is to search for patterns in an input sequence or database. It has four modes of usage, which are
referred to as "subprograms": two that only search for patterns, in an input query or in a database, and
two for pattern-initiated sequence alignment. The following table lists these subprograms, their
functions, and required inputs.
| Table 1.1 Subprogram Functions of seedtop |
| Program Call ¹ | Functions | Required Inputs |
| -p patmatch | Search for patterns in an input sequence | Pattern (-k) and sequence (-i) |
| -p pattern | Search for patterns in an input database | Pattern (-k) and database (-d) |
| -p patseed | Search for patterns in the query and align the query against a database | Pattern (-k), input sequence (-i), and target database (-d) |
| -p seed | Search for a specific pattern occurrence in the query and align the query against a database | Same as for patseed ² |
NOTE:
¹ The program strings listed are for nucleotide searches. For protein searches, append a lowercase p to the program name (for example, patmatchp).
² The pattern file needs an extra line beginning with HI to specify the position in the input sequence at which the
pattern occurrence of interest starts.
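The required inputs in Table 1.1 can be captured in a small helper that assembles a seedtop argument list and rejects incomplete calls. This is only a sketch: `build_seedtop_args` and `REQUIRED_FLAGS` are hypothetical names, not part of the blast package, and the `pattern` subprogram is assumed to need -k and -d, as in the example command line in Section 3.3.

```python
# Required flags per subprogram, following Table 1.1 and the Section 3 examples.
# Assumption: "pattern" needs a pattern file (-k) and a database (-d).
REQUIRED_FLAGS = {
    "patmatch": ("-k", "-i"),
    "pattern": ("-k", "-d"),
    "patseed": ("-k", "-i", "-d"),
    "seed": ("-k", "-i", "-d"),
}


def build_seedtop_args(subprogram, protein=True, **inputs):
    """Return an argument list for seedtop (hypothetical helper).

    inputs maps flag letters to values, e.g. k="pattern.txt", i="query.aa".
    """
    if subprogram not in REQUIRED_FLAGS:
        raise ValueError(f"unknown subprogram: {subprogram}")
    given = {f"-{name}": value for name, value in inputs.items()}
    missing = [flag for flag in REQUIRED_FLAGS[subprogram] if flag not in given]
    if missing:
        raise ValueError(f"{subprogram} requires {', '.join(missing)}")
    # Protein variants carry a trailing lowercase p (patmatchp, patternp, ...).
    args = ["seedtop", "-p", subprogram + ("p" if protein else "")]
    for flag, value in given.items():
        args += [flag, str(value)]
    return args


if __name__ == "__main__":
    print(" ".join(build_seedtop_args("patmatch", k="pattern.txt", i="query.aa")))
```

The resulting list could be handed to `subprocess.run` once seedtop is on the search path; the helper itself only builds the arguments and does not run anything.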
2. Setup
Installation of the standalone blast archive is fairly easy. Once the archive is placed in a desired
directory and extracted, the whole package will be installed in a newly created subdirectory called
blast-#.#.#, where #.#.# is the version number. All the programs, including seedtop, will be in the
blast-#.#.#/bin/ subdirectory (blast-#.#.#\bin\ for PC).
Proper setup requires creating a .ncbirc configuration file, which blast programs (including
seedtop) read upon startup to locate the files they need. In .ncbirc, we can specify the
location of the DATA directory and the BLASTDB directory using the following lines:
[NCBI]
DATA=/path/data
[BLAST]
BLASTDB=/path/db
The [NCBI] section is used by most of the NCBI programs to locate the data directory and retrieve specific
files needed (MATRIX file for example). The [BLAST] section specifies the path to the directory where
databases are stored.
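For scripted installations, the four configuration lines can be generated rather than typed by hand. The sketch below only renders the file text; `render_ncbirc` is a hypothetical helper, and where the file is written is left to the caller.

```python
def render_ncbirc(data_dir, db_dir):
    """Return the text of a .ncbirc file pointing at the given directories."""
    lines = [
        "[NCBI]",
        f"DATA={data_dir}",
        "[BLAST]",
        f"BLASTDB={db_dir}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder paths, as in the example lines above.
    print(render_ncbirc("/path/data", "/path/db"), end="")
```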
The db directory does not come with the NCBI setup, so we need to create it after installation. For simplicity,
we suggest creating it under blast-#.#.#, at the same level as the data directory. If we
place the directory elsewhere, we need to change the path specification in .ncbirc correspondingly.
To be able to call the programs in the blast-#.#.#/bin/ subdirectory (blast-#.#.#\bin\ for PC) from any directory, we need to modify the PATH
environment variable to include that subdirectory in the search path. Advanced users can also try specifying the
DATA and BLASTDB locations using environment variables.
For more details, see:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/pc_setup.html
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/unix_setup.html
3. Execution and Practical Usage
Since there is no GUI for this family of programs, we invoke seedtop through commands issued in a terminal window
and control the search through parameter/value pairs on the command line. The following representative command lines
can be used for testing purposes.
On PC, we can cd to the blast-#.#.#\bin directory, where #.#.# represents the version number, and type this:
On Linux or Unix platforms, including Mac OS X, we can cd to the bin directory and type this:
Both should display the list of parameters on the screen.
The most useful functions of seedtop are patmatchp and patternp. Since patseedp does not generate
the actual alignment and its function is already incorporated in blastpgp, we will cover it only briefly.
Because nucleotide searches work much like protein searches, we describe them only briefly as well.
3.1 Pattern specification
The pattern input file format is specific to seedtop. Each pattern contains one line beginning with ID for
pattern identification, and one or more lines beginning with PA for the actual pattern, specified using
ProSite syntax.
Pattern lines should be fewer than 100 characters long. Longer patterns can be split across multiple PA lines,
as in the example below. Here is a pattern input file with a single pattern containing two PA lines. For testing purposes,
we can use it with RefSeq protein records such as YP_471346.1, YP_575330.1, or YP_564843.1.
ID ATP synthase delta (OSCP) subunit signature.
PA [LIVM]-x-[LIVMFYT]-x(3)-[LIVMT]-[DENQK]-x-{G}-[LIVM]-x-[GSA]-G>
PA [LIVMFYGA]-{S}-[LIVM]-[KRHENQ]-x-[GSEN].
A pattern input file can contain multiple patterns as long as they are separated
by a line with a single forward slash (/). Below is a pattern input file with multiple
patterns specified, each with a single PA line:
ID Cyclic nucleotide-binding domain signature 2.
PA [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
/
ID Cyclic nucleotide-binding domain signature 1
PA [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G
Other general pattern rules are:
| Symbol | Meaning |
| [ ] | marks a single position; any one of the residues in the brackets is acceptable |
| (x,y) | gives a range for the number of repeats of the position before it; any count within the range is acceptable |
| (x,) | gives a range with no upper limit for the position before it |
| (x) | gives the exact number of repeats of the position before it |
| {} | marks a single position; the residues in the braces are excluded |
| - | separates the individual positions in the pattern |
| . | at the end of a PA line, marks the end of a pattern |
| > | at the end of a PA line, marks an incomplete pattern that continues on the next PA line (optional) |
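To make these rules concrete, the sketch below translates a single-line pattern into a Python regular expression, which is handy for checking a pattern against a test sequence before running seedtop. It is an illustration of the syntax, not seedtop's own matching code: `pattern_to_regex` is a hypothetical helper, it treats a lowercase x as "any residue" (standard ProSite usage), and it does not join patterns that span several PA lines.

```python
import re

# One pattern element: x, a single residue, [allowed] or {excluded},
# optionally followed by a repeat count (n), (n,m) or (n,).
_ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(,(\d*))?\))?$")


def pattern_to_regex(pattern):
    """Translate a single-line ProSite-style pattern into a regex string."""
    # Drop the trailing end marker; a trailing '>' is dropped too, but the
    # continuation on a following PA line is NOT handled here.
    body = pattern.strip().rstrip(".>")
    parts = []
    for element in body.split("-"):
        m = _ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        base, low, has_comma, high = m.groups()
        if base == "x":
            token = "."                       # any residue
        elif base.startswith("{"):
            token = "[^" + base[1:-1] + "]"   # excluded residues
        else:
            token = base                      # single residue or [set]
        if low is not None:
            token += "{" + low + ("," + high if has_comma else "") + "}"
        parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    sig1 = "[LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G"
    regex = pattern_to_regex(sig1)
    # Made-up test sequence, constructed only to contain one match.
    hit = re.search(regex, "MKLVAAGDAGAALLLLAAG")
    print(regex, hit.start() + 1)
```

The printed position is 1-based, like the positions in the patmatchp sample output in Section 3.2.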
3.2 patmatchp
This function takes the patterns in an input pattern file and identifies their occurrences in an input protein
sequence.
The sample command line below takes an input pattern file named pattern.txt, searches the query sequence in query.aa, and
displays the output on the screen.
seedtop -k pattern.txt -i query.aa -p patmatchp
Name Cyclic nucleotide-binding domain signature 2.
Pattern [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
At position 521 of query sequence
Name Cyclic nucleotide-binding domain signature 1
Pattern [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G
At position 483 of query sequence
The result lists only the pattern starting positions in the query sequence. seedtop
searches only the first sequence in the input file.
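Because the patmatchp report is plain text, the positions are easy to extract for downstream use. The parser below is a minimal sketch that assumes the Name / Pattern / "At position" layout of the sample output above; `parse_patmatch_output` is a hypothetical helper, and other seedtop versions may format the report differently.

```python
import re

_POSITION = re.compile(r"At position (\d+) of query sequence")


def parse_patmatch_output(text):
    """Return one dict per reported match: name, pattern, position."""
    hits = []
    name = pattern = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Name "):
            name = line[len("Name "):].strip()
        elif line.startswith("Pattern "):
            pattern = line[len("Pattern "):].strip()
        else:
            m = _POSITION.match(line)
            if m:
                # A pattern may be reported at several positions; each
                # "At position" line becomes its own entry.
                hits.append({"name": name, "pattern": pattern,
                             "position": int(m.group(1))})
    return hits
```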
3.3 patternp
This function matches patterns from the input pattern file against a formatted BLAST database
and reports the database entries containing one or more of the input patterns, as
well as the pattern locations. The sample command line below takes an input pattern file named pattern.txt,
searches the refseq_protein database, and saves the entries with pattern matches to db.out.
Partial output is given below the command line.
seedtop -k pattern.txt -d refseq_protein -p patternp -o db.out
seqno=892602 gi|33859524|ref|NP_034048.1|
ID Cyclic nucleotide-binding domain signature 1
PA [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G
HI (449 450) (452 454) (456 457) (459 462) (465 465)
seqno=892873 gi|51470807|ref|XP_290552.4|
ID Cyclic nucleotide-binding domain signature 1
PA [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G
HI (374 375) (377 379) (381 382) (384 387) (390 390)
Each hit is described by the following lines:
- seqno: sequence number and seqid of the database sequence with pattern matches
- ID: Pattern ID, repeated from the pattern input
- PA: Pattern, repeated from the pattern input
- HI: Hit position on the db sequence, given as (start end) regions broken up at the x positions of the pattern
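The record layout described above can be read back into a simple data structure. The following sketch assumes the seqno / ID / PA / HI layout of the partial output shown in this section; `parse_pattern_output` is a hypothetical helper, and it keeps every HI line as a separate occurrence in case a pattern is reported more than once for a sequence.

```python
import re

_REGION = re.compile(r"\((\d+) (\d+)\)")


def parse_pattern_output(text):
    """Group patternp output into one record per database sequence."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("seqno="):
            number, _, seqid = line[len("seqno="):].partition(" ")
            records.append({"seqno": int(number), "seqid": seqid.strip(),
                            "hits": []})
        elif not records:
            continue  # ignore anything before the first seqno line
        elif line.startswith("ID "):
            records[-1]["hits"].append(
                {"id": line[3:].strip(), "pa": [], "occurrences": []})
        elif line.startswith("PA ") and records[-1]["hits"]:
            records[-1]["hits"][-1]["pa"].append(line[3:].strip())
        elif line.startswith("HI ") and records[-1]["hits"]:
            # Each "(start end)" pair is one region of the match.
            regions = [(int(a), int(b)) for a, b in _REGION.findall(line)]
            records[-1]["hits"][-1]["occurrences"].append(regions)
    return records
```

For the first sample record, the regions run from 449 to 465 with gaps at 451, 455, 458, and 463-464, which line up with the x positions of the pattern.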
3.4 patseedp
This function takes three inputs: an input pattern, a query protein sequence
containing the pattern, and a protein sequence database. It identifies the pattern in the
query and aligns the query against the database entries that contain the same pattern.
It reports the pattern position in the query, the total number of pattern occurrences
in the database, and the actual database entries with the pattern and an alignment to the
input query. Specifically, it reports the seqid of the database entry, the E-value of the alignment
between the query and this database entry, the alignment score, and the pattern position.
seedtop -k pat.txt -d refseq_protein -p patseedp -o pat.out -i query_aa.txt
1 occurrence(s) of pattern in query
Name Cyclic nucleotide-binding domain signature 2.
Pattern [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
At position 488 of query sequence
effective database length=3.1e+008
pattern probability=3.4e-008
lengthXprobability=1.0e+001
Number of occurrences of pattern in the database is 265
892602 gi|33859524|ref|NP_034048.1|
0 Total Score 3279 Outside Pattern Score 3162 Match start in db seq 488
Extent in query seq 1 631 Extent in db seq 1 631
1 occurrence(s) of pattern in query
Name Cyclic nucleotide-binding domain signature 1
Pattern [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G
At position 450 of query sequence
effective database length=3.1e+008
pattern probability=7.0e-008
lengthXprobability=2.2e+001
Number of occurrences of pattern in the database is 247
892602 gi|33859524|ref|NP_034048.1|
0 Total Score 3279 Outside Pattern Score 3188 Match start in db seq 450
Extent in query seq 1 631 Extent in db seq 1 631
The input pat.txt file contains two patterns, so the result contains two
sections, one for each pattern. Only one database match is shown.
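Judging by its name and the sample values, the lengthXprobability figure in each section appears to be the product of the effective database length and the pattern probability printed just above it. That reading is an assumption, not something the output states. The short check below multiplies the printed values; because they are rounded to two significant digits, the products (about 10.5 and 21.7) agree with the reported 1.0e+001 and 2.2e+001 only approximately.

```python
import math


def length_times_probability(effective_db_length, pattern_probability):
    """Product of the two values printed above lengthXprobability."""
    return effective_db_length * pattern_probability


# (effective database length, pattern probability, reported lengthXprobability)
# taken from the two sections of the sample output.
SAMPLES = [
    (3.1e8, 3.4e-8, 1.0e1),
    (3.1e8, 7.0e-8, 2.2e1),
]


def consistent(sample, rel_tol=0.1):
    """True if the product matches the reported value within rounding slack."""
    length, prob, reported = sample
    product = length_times_probability(length, prob)
    return math.isclose(product, reported, rel_tol=rel_tol)


if __name__ == "__main__":
    for sample in SAMPLES:
        print(length_times_probability(sample[0], sample[1]), consistent(sample))
```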
3.5 seedp
This function is similar to patseedp. The only difference is that the pattern
file must also specify the pattern position in the input query sequence. The output from
patternp can be used for this purpose. This specifies which pattern occurrence is to be used
during the search.
An actual command line and pattern file are listed below. The output is omitted since
it is essentially the same as that from patseedp given above.
seedtop -p seedp -k pat2.txt -d refseq_protein -i q_aa.txt -o seed.out
ID Cyclic nucleotide-binding domain signature 1
PA [LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G
HI (450 451) (453 455) (457 458) (460 463) (466 466)
The seedp functionality has been incorporated into standalone blastpgp and the PSI/PHI-BLAST
web service. The term PHI stands for Pattern-Hit-Initiated. The differences are that blastpgp
does not report the total number of pattern occurrences in the database and that blastpgp generates
actual sequence alignments. The implementation in blastpgp provides more functionality
in that the results of the (first) round of a PHI-BLAST search can be
used seamlessly as the starting point of a PSI-BLAST iterated search.
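The seedp pattern file shown earlier in this section has the same ID / PA / HI layout as a patternp hit, so it can be assembled from one. The sketch below does that; `format_seed_pattern` and its `offset` parameter are hypothetical. The offset is included because the sample HI line for NP_034048.1 in Section 3.3 and the HI line in the seedp pattern file differ by one at every coordinate; the source does not say why, so check the coordinates against your own query before relying on any offset.

```python
def format_seed_pattern(pattern_id, pa_lines, regions, offset=0):
    """Render ID, PA and HI lines for a seedp pattern file.

    regions is a list of (start, end) pairs; offset is added to every
    coordinate (an assumption, see the note above).
    """
    hi = " ".join(f"({start + offset} {end + offset})" for start, end in regions)
    lines = [f"ID {pattern_id}"]
    lines += [f"PA {pa}" for pa in pa_lines]
    lines.append(f"HI {hi}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_seed_pattern(
        "Cyclic nucleotide-binding domain signature 1",
        ["[LIVM]-[VIC]-x-{H}-G-[DENQTA]-x-[GAC]-{L}-x-[LIVMFY](4)-x(2)-G"],
        [(449, 450), (452, 454), (456, 457), (459, 462), (465, 465)],
        offset=1,
    ), end="")
```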
3.6 Nucleotide searches
Searching for nucleotide patterns using seedtop is very similar to what is described
above for protein queries. The function names for nucleotide searches have no terminating p.
Note that PHI/PSI-BLAST and blastpgp are not available for nucleotide searches.
4. Technical Support
For additional questions and comments, please write to:
blast-help@ncbi.nlm.nih.gov
Questions and inquiries on other NCBI resources should be sent to:
info@ncbi.nlm.nih.gov
5. Appendix
Here we list all the program parameters for seedtop and their accepted input values.
Individual parameters are listed in their own tables.
| Table 5.1 |
| Parameter | -d |
| Function | Specifies the target database to search |
| Default | nr |
| Input format | Takes database formatted by formatdb, use name without extension |
| Example | To search against est_human, use: -d est_human |
Note
Searching for patterns in a single input sequence does not require this parameter.
| Table 5.2 |
| Parameter | -i |
| Function | Specifies the input query file |
| Default | stdin |
| Input format | [File In] |
| Example | To take my_pept.txt as input query, use: -i my_pept.txt |
Note
Use the complete file name with extension. To use stdin as input, either redirect or pipe the input:
seedtop -k pat -p patmatchp < input_file
more input_file | seedtop -k pat -p patmatchp
| Table 5.3 |
| Parameter | -k |
| Function | Specifies the input pattern (Hit File) |
| Default | hit_file |
| Input format | [File in] |
| Example | If the pattern file is named my_pat.txt, use: -k my_pat.txt |
Note
Use complete file name with extension. See Section 3.1 above for details.
| Table 5.4 |
| Parameter | -o |
| Function | Specifies the output file name |
| Default | stdout |
| Input format | file name with or without extension |
| Example | To save result in my_output, use: -o my_output |
Note Redirection or piping also works.
| Table 5.5 |
| Parameter | -G |
| Function | Specifies the cost to open a gap |
| Default | 11 |
| Input format | [Integer] |
| Example | To change this to 12, use: -G 12 |
Note
The choice of the -M option determines the values accepted for this option as well as for the -E option.
Only a selected set of combinations is supported. A detailed list is in the blastall document.
| Table 5.6 |
| Parameter | -E |
| Function | Specifies the cost to extend a gap |
| Default | 1 |
| Input format | [Integer] |
| Example | To change this to 2, use: -E 2 |
Note See Table 5.5 for more information.
| Table 5.7 |
| Parameter | -D |
| Function | Specifies the cost to decline alignment |
| Default | 99999 |
| Input format | [Integer] |
| Example | N/A |
Note Functions similarly to the -L option in blastpgp. If enabled, it would implement Dr. Altschul's 3-parameter gap model
for scoring.
| Table 5.8 |
| Parameter | -X |
| Function | Specifies X dropoff value for gapped alignment (in bits) |
| Default | 15 |
| Input format | [Integer] |
| Example | To increase this dropoff value to 20, use: -X 20 |
Note Increasing this value may enable one to see a longer alignment.
| Table 5.9 |
| Parameter | -S |
| Function | Specifies cutoff cost |
| Default | 30 |
| Input format | [Integer] |
| Example | N/A |
Note Currently it is overridden in pseed3.c. It could allow the user to control the score threshold applied
to the part of the alignment that does not include the pattern in deciding which alignment(s) to report.
| Table 5.10 |
| Parameter | -C |
| Function | Score only or not |
| Default | 1 |
| Input format | [Integer] |
| Example | N/A |
Note This is relevant only to searches with -p seed(p) or -p patseed(p). NOT implemented yet.
| Table 5.11 |
| Parameter | -I |
| Function | Shows GI numbers in deflines |
| Default | F |
| Input format | [T/F] |
| Example | To display GI in the deflines, use: -I T |
Note Relevant only to searches with -p seed(p) or -p patseed(p).
| Table 5.12 |
| Parameter | -e |
| Function | Specifies the expectation value (E) cutoff |
| Default | 10.0 |
| Input format | [Real] |
| Example | To set this to 0.001, use: -e 0.001 |
Note Relevant only to searches with -p seed(p) or -p patseed(p). Scientific notation acceptable: -e 1e-3.
| Table 5.13 |
| Parameter | -J |
| Function | Believes the query defline |
| Default | F |
| Input format | [T/F] |
| Example | To set this to true, use: -J T |
Note Saving the SeqAlign object requires the -J T setting.
| Table 5.14 |
| Parameter | -O |
| Function | Specifies the output file for SeqAlign object |
| Default | Optional |
| Input format | [File Out] |
| Example | N/A |
Note Relevant only to searches with -p seed(p) or -p patseed(p). NOT implemented yet.
| Table 5.15 |
| Parameter | -M |
| Function | Specifies which matrix file to use |
| Default | BLOSUM62 |
| Input format | [String] |
| Example | To set matrix to PAM30, use: -M PAM30 |
Note Relevant to seedp/patseedp searches, only a limited set is supported.
| Table 5.16 |
| Parameter | -p |
| Function | Specifies which subprogram to run |
| Default | patmatchp |
| Input format | [String] |
| Example | To find protein patterns in a database, use: -p patternp |
Note
Choices for nucleotide searches: patmatch, pattern, seed, and patseed
Choices for protein searches: patmatchp, patternp, seedp, and patseedp
| Table 5.17 |
| Parameter | -r |
| Function | Specifies the reward for a match |
| Default | 10 |
| Input format | [Integer] |
| Example | To increase the reward to 20, use: -r 20 |
Note Relevant only to nucleotide searches with -p seed or -p patseed.
| Table 5.18 |
| Parameter | -q |
| Function | Specifies the cost for a mismatch |
| Default | -10 |
| Input format | [Integer] |
| Example | To increase the penalty to -15, use: -q -15 |
Note For nucleotide search with seed/patseed only.
| Table 5.19 |
| Parameter | -F |
| Function | Whether to filter query sequence with SEG |
| Default | F |
| Input format | [T/F] |
| Example | To activate filter, use: -F T |
Note Relevant to seedp/patseedp searches only.
| Table 5.20 |
| Parameter | -f |
| Function | Force searching for patterns even if they are too likely |
| Default | F |
| Input format | [T/F] |
| Example | To activate this, use: -f T |
Note Activating this parameter forces seedtop to search with patterns that occur with high frequency.