Bioinformatics. 2007 Nov 1;23(21):2949-51. Epub 2007 Oct 6.

Improved BLAST searches using longer words for protein seeding.

Author information

1. Department of Health and Human Services, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, USA.

Abstract

MOTIVATION:

The blastp and tblastn modules of BLAST are widely used methods for searching protein queries against protein and nucleotide databases, respectively. One heuristic used in BLAST is to consider only database sequences that contain a high-scoring match of length at most 5 to the query. We implemented the capability to use words of length 6 or 7. We demonstrate an improved trade-off between running time and retrieval accuracy, controlled by the score threshold used for short word matches. For example, the running time can be reduced by 20-30% while achieving ROC (receiver operator characteristic) scores similar to those obtained with current default parameters.
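To make the seeding heuristic concrete, the following is a minimal sketch of threshold-based word matching, not the NCBI implementation. It assumes a toy substitution score (+4 for identity, -1 otherwise) in place of a real matrix such as BLOSUM62, and it compares words by brute force, whereas BLAST uses a precomputed lookup table. The names `toy_score`, `word_score`, and `seed_hits` are hypothetical.

```python
from typing import Callable, List, Tuple


def toy_score(a: str, b: str) -> int:
    """Hypothetical substitution score (an assumption, not BLOSUM62)."""
    return 4 if a == b else -1


def word_score(u: str, v: str,
               score: Callable[[str, str], int] = toy_score) -> int:
    """Sum of per-residue scores for two words of equal length."""
    return sum(score(a, b) for a, b in zip(u, v))


def seed_hits(query: str, subject: str, w: int, threshold: int,
              score: Callable[[str, str], int] = toy_score
              ) -> List[Tuple[int, int, int]]:
    """Return (query_pos, subject_pos, score) for every pair of
    length-w words whose score reaches the threshold.

    Brute-force comparison for clarity only; a real search would
    index the neighborhood words of the query instead.
    """
    hits = []
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for j in range(len(subject) - w + 1):
            s = word_score(qword, subject[j:j + w], score)
            if s >= threshold:
                hits.append((i, j, s))
    return hits


if __name__ == "__main__":
    q, s = "MKVLAAGIV", "TTKVLAAGTT"
    # Short words with a strict threshold versus longer words with
    # a looser threshold that tolerates one mismatch.
    print(seed_hits(q, s, w=3, threshold=12))
    print(seed_hits(q, s, w=6, threshold=19))
```

In this sketch, raising the word length shrinks the number of candidate seeds, while lowering the threshold admits inexact word matches again; that interplay is the trade-off between running time and retrieval accuracy that the abstract describes.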

AVAILABILITY:

The option to use long words is in the NCBI C and C++ toolkit code for BLAST, starting with version 2.2.16 of blastall. A Linux executable used to produce the results herein is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/protein_longwords

PMID: 17921491
DOI: 10.1093/bioinformatics/btm479
[Indexed for MEDLINE]
