How to BLAST Using Very Large Nucleotide Queries
As an era of prolific genome sequencing begins, the need arises to run BLAST searches using very large query sequences that often reach hundreds of kilobases in length. Because of its speed and flexibility, the BLAST algorithm can rise to the occasion and perform these searches quickly.
One of the contributors to the speed and flexibility of BLAST is the fact that BLAST matches multi-character words between the query and database sequences rather than single characters. By default, these matches need not be perfect, although a scoring threshold for reporting matches can be adjusted. This search strategy offers a tradeoff between speed and sensitivity; smaller word-sizes result in greater sensitivity at the expense of speed while larger word-sizes optimize BLAST for speed.
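The tradeoff between word-size and sensitivity can be seen in a toy calculation. The sketch below is not BLAST's implementation. It assumes exact word matches only, and the function name shared_words and the two short sequences are made up for illustration. It counts how many query positions start a word that also occurs somewhere in the subject.

```python
def shared_words(query: str, subject: str, word_size: int) -> int:
    """Count query positions whose word of length word_size occurs exactly in subject."""
    # Index every word of the requested length found in the subject sequence.
    subject_words = {
        subject[i:i + word_size]
        for i in range(len(subject) - word_size + 1)
    }
    # A query position is a candidate seed if its word is present in that index.
    return sum(
        1
        for i in range(len(query) - word_size + 1)
        if query[i:i + word_size] in subject_words
    )


if __name__ == "__main__":
    # Illustrative sequences only; real queries are far longer.
    query = "ACGTACGTTAGCACGT"
    subject = "TTACGTACGATAGCAC"
    for w in (3, 5, 7):
        print(f"word-size {w}: {shared_words(query, subject, w)} seed positions")
```

The count can never grow as the word-size increases, because every longer shared word contains a shorter shared word at the same position. Fewer seeds mean less work for the search, but also fewer chances to detect a weak similarity.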

Figure 1: Portion of the Advanced BLAST page demonstrating the use of the Advanced Options box and filtering options.
The default word-size for BLASTn searches is 11, which allows untranslated nucleotide searches to proceed at a rapid pace with a degree of sensitivity appropriate for queries in the range of 1 to 50 kb. A word-size of 11, however, is too small to permit rapid searches with queries of 100 kb or more. Fortunately, BLAST is flexible enough to allow the word-size to be adjusted upward indefinitely to accommodate ever larger query lengths. For a query of 100 kb, a word-size of 30 to 50 will allow a Web-BLAST search to run to completion and return informative database matches.
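A rough back-of-the-envelope estimate suggests why a larger word-size helps so much. The sketch below assumes a simplified model in which all four bases are equally likely and independent, so two random words of length W match exactly with probability 1/4^W. The function name and the database size of one billion bases are hypothetical, chosen only for illustration. Real genomic sequence is not random, so the figures indicate the trend rather than predict actual run times.

```python
def expected_chance_word_hits(query_len: int, db_len: int, word_size: int) -> float:
    """Expected number of exact word matches arising by chance alone.

    Assumes uniformly random, independent bases (a simplification).
    """
    # Number of word start positions in each sequence.
    query_positions = max(query_len - word_size + 1, 0)
    db_positions = max(db_len - word_size + 1, 0)
    # Each pair of positions matches with probability (1/4) ** word_size.
    return query_positions * db_positions / 4.0 ** word_size


if __name__ == "__main__":
    query_len = 100_000      # a 100 kb query, as discussed in the text
    db_len = 1_000_000_000   # hypothetical database size, for illustration only
    for w in (11, 30, 50):
        hits = expected_chance_word_hits(query_len, db_len, w)
        print(f"word-size {w}: about {hits:.3g} chance word hits")
```

Under this model each additional base in the word cuts the number of chance hits by a factor of four, which is why moving from a word-size of 11 to 30 or more removes nearly all of the random seeds that a long query would otherwise generate.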
Filtering parameters may also be adjusted to facilitate BLAST searches with large queries. Because repetitive sequence within the query leads to a repetitive and relatively uninformative series of database hits, BLAST masks simple, repetitive sequence by default. Repetitious hits waste computational time, and the larger the query sequence, the larger the potential problem, so repeat filtering should always be used with large query sequences. In the case of human sequences, the human repeat filtering option of Advanced BLAST should be used to mask the more complex varieties of repeat found in human sequences. Both simple repeat and human repeat filtering are activated using check boxes on the Advanced BLAST Web page. To run a BLASTn search using a word-size of 50 and filtering the query for human repeats, type -W 50 into the Advanced BLAST Advanced Options box and check the appropriate filtering options as shown in Figure 1.
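The idea behind masking can be illustrated with a deliberately crude example. The sketch below is not the filter that BLAST applies. It handles only the simplest kind of repeat, a run of one base, and the function name mask_homopolymers and the default run length of 8 are assumptions made for illustration. Masked positions are replaced with N so that they cannot seed word matches.

```python
import re


def mask_homopolymers(sequence: str, min_run: int = 8) -> str:
    """Replace runs of a single base at least min_run long with N characters."""
    if min_run < 2:
        raise ValueError("min_run must be at least 2")
    # One base followed by at least (min_run - 1) copies of the same base.
    pattern = re.compile(rf"([ACGT])\1{{{min_run - 1},}}")
    # Keep the sequence length unchanged so coordinates still line up.
    return pattern.sub(lambda match: "N" * len(match.group(0)), sequence)


if __name__ == "__main__":
    # Illustrative sequence with a poly-A stretch in the middle.
    example = "GATTACAGC" + "A" * 12 + "CGTTGCA"
    print(example)
    print(mask_homopolymers(example))
```

Because the masked sequence keeps its original length, any alignments reported against it can still be read in the coordinates of the unmasked query.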
As an example, a BLASTn search of the default nr nucleotide database with a 400 kb contig from human chromosome 22 will time out using Advanced BLAST if the default parameters are used. However, if the word-size is changed to 200 and human repeat filtering is activated, the search is completed within 15 minutes!
Searches with very large sequences may also be performed using BLAST2Sequences if the word-size is increased sufficiently. Using the default word-size of 11, alignments with query sequences on the order of 150 kb in length will not generally be completed. However, by increasing the word-size to 50, two sequences as large as 250 kb apiece may be aligned. The word-size may be changed on the BLAST2Sequences page using the input box provided.
The BLAST Lab feature is intended to provide detailed technical information on some of the more specialized uses of the BLAST family of programs. Topics are selected from the range of questions received by the BLAST Help Group.