


Powerful Tools for Identifying Sequence Similarities

The recent pace of whole genome sequencing projects has resulted in an increase, both in volume and in complexity, of available molecular sequence data. Patterns shared by multiple protein and nucleic acid sequences provide valuable insights into genomic organization, molecular structure, and biological function, as well as into unexpected links among diverse biological systems. These previously unknown connections not only speed research progress, but often open new areas for scientific inquiry.

When studying a new gene or protein sequence, researchers often conduct database searches in order to identify similar genes or proteins. This is because the fastest method for identifying the function of a gene or protein is to find a related gene or protein—or an entire family—whose function is already known. The recognition of subtle residue patterns among genes or proteins sometimes relies upon aligning many sequences, a procedure that continues to present a complex and multifaceted problem for research.

In 1990, NCBI researchers Altschul, Gish, and Lipman, in collaboration with colleagues Miller and Myers from Penn State and the University of Arizona, developed and released BLAST, the Basic Local Alignment Search Tool [1]. The BLAST programs implement a set of sequence comparison algorithms that search a database for optimal local alignments to a query sequence. A local alignment represents a possible homology, or similarity by descent, between segments from two nucleic acid or protein sequences. The BLAST programs were substantially faster than existing database similarity search programs, and of comparable sensitivity to distant relationships. Of equal importance, using a statistical theory reported the same year by Karlin of Stanford University and Altschul [2], the BLAST programs gave researchers, for the first time, rigorous guidance for determining which alignments were statistically significant, and therefore worthy of further examination. The ideas underlying BLAST are simple and robust, and can be applied in a variety of contexts, including DNA and protein database searches, gene identification searches, and most recently, sequence motif or profile searches.
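As a rough illustration of what "statistically significant" means here, the sketch below uses the standard Karlin-Altschul form, in which the expected number of chance alignments scoring at least S is E = K·m·n·e^(−λS). The article does not state this formula; the function names and the default values of `lam` and `k` are illustrative assumptions, not the values any particular BLAST search would use.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance alignments with at least this raw score.

    lam and k are illustrative placeholders; in practice they depend on
    the scoring system and the residue composition.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def bit_score(score, lam=0.318, k=0.134):
    """Normalize a raw score so that E = query_len * db_len * 2**(-bits)."""
    return (lam * score - math.log(k)) / math.log(2)
```

A small expected count suggests that an alignment is unlikely to have arisen by chance. Because the count grows with the size of the search space, the same raw score is less convincing against a larger database.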

Comparison, whether of morphology or protein sequences, lies at the heart of biology.


BLAST works by breaking the query sequence into short fragments, or “words”, and initially seeking very close matches between these words and words from database sequences. Any aligned word pair scoring above a specified threshold is called a “hit”. Each hit is then extended in both directions in an attempt to generate a local alignment representing statistically significant sequence similarity. The quality of each alignment is represented by a score, defined most simply as the sum of scores for aligning pairs of nucleotides (for DNA) or pairs of amino acids (for proteins).
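The word, hit, and extension steps can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it assumes exact word matches, a flat match/mismatch score in place of a substitution matrix, and a hypothetical `x_drop` cutoff that stops an extension once the running score falls too far below the best seen so far. All function and parameter names are invented for the example.

```python
def word_hits(query, subject, w=3):
    """Yield (query_pos, subject_pos) for every exact w-letter word match."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            yield i, j


def extend_hit(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=3):
    """Extend a hit in both directions without gaps.

    Returns (best_score, query_start, query_end); the score is the sum of
    the scores of the aligned letter pairs.
    """
    def pair(a, b):
        return match if a == b else mismatch

    run = sum(pair(query[i + k], subject[j + k]) for k in range(w))
    best, q_start, q_end = run, i, i + w

    # Extend to the right of the word.
    qi, sj = i + w, j + w
    while qi < len(query) and sj < len(subject):
        run += pair(query[qi], subject[sj])
        qi += 1
        sj += 1
        if run > best:
            best, q_end = run, qi
        elif best - run > x_drop:
            break

    # Extend to the left, starting from the best right-hand extension.
    run = best
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0:
        run += pair(query[qi], subject[sj])
        if run > best:
            best, q_start = run, qi
        elif best - run > x_drop:
            break
        qi -= 1
        sj -= 1
    return best, q_start, q_end
```

For the query "GGACGTTT" and the subject "CCACGTAA", the two overlapping word hits both extend to the same four-letter segment, "ACGT", with a score of 4.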

Because nucleotides or amino acids may be inserted or deleted within a particular sequence during the course of evolution, alignment programs generally allow for the existence of gaps, or spaces introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. Gaps contribute negatively to the overall score of an alignment. The original BLAST programs did not explicitly include gaps within alignments, but rather treated them implicitly by calculating combined statistical assessments of multiple ungapped alignments produced by a single pair of sequences [3].
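To see how a gap penalty enters an alignment score, consider the minimal sketch below. It uses full Smith-Waterman dynamic programming with a linear gap penalty, which is a textbook method rather than the heuristic BLAST itself applies, and the scoring values are arbitrary assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two letters, or open a gap in one sequence;
            # a local alignment may also restart from zero.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

With "ACGTACGT" and "ACGTTACGT", which differ by one inserted T, the default penalty allows all eight letters to align around a single gap for a score of 14. Making gaps prohibitively expensive (for example `gap=-100`) leaves only the best ungapped segment, which scores 10.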

In 1997, a team of NCBI researchers, including Altschul, Madden, Schäffer, Zhang, and Lipman, in collaboration with Zhang and Miller from Penn State, released a set of “gapped BLAST” programs. These new programs not only generated gapped alignments, but also ran several times faster than the original BLAST programs [4]. This improvement was achieved by incorporating two algorithmic refinements. The first refinement required two hits within a set distance of one another, rather than one, before triggering a search for an ungapped local alignment, or high-scoring segment pair (HSP). The second refinement invoked a gapped extension step whenever an HSP of sufficiently high score was found. Previously, missing a single HSP implicitly involved in a significant alignment could jeopardize the discovery of the result. With an algorithm for generating gapped alignments, it became necessary to find only one HSP, rather than all ungapped alignments subsumed in a significant result. As a result, careful choice of algorithmic parameters led to increased program sensitivity to distant sequence relationships as well as to increased speed.
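The two-hit requirement can be sketched as a filter over word hits. This is an illustrative simplification under stated assumptions: hits are (query_pos, subject_pos) pairs, two hits count only if they lie on the same diagonal and do not overlap, and the `word_size` and `window` defaults are placeholders rather than the program's actual settings.

```python
def two_hit_triggers(hits, word_size=3, window=40):
    """Return (first_query_pos, second_query_pos, diagonal) for each pair of
    non-overlapping hits on the same diagonal within `window` letters."""
    last_on_diagonal = {}
    triggers = []
    for q, s in sorted(hits):
        diagonal = s - q
        prev = last_on_diagonal.get(diagonal)
        if prev is not None:
            distance = q - prev
            if distance < word_size:
                continue  # overlaps the earlier hit, so it is ignored
            if distance <= window:
                triggers.append((prev, q, diagonal))
        last_on_diagonal[diagonal] = q
    return triggers
```

Given the hits `[(0, 5), (1, 6), (10, 15), (60, 65), (20, 3)]`, only the pair at query positions 0 and 10 on diagonal 5 triggers an extension: the hit at position 1 overlaps the first hit, the hit at 60 is too far from the one at 10, and the hit at (20, 3) is alone on its diagonal. Requiring two hits means far fewer extensions are attempted, which is where much of the speed gain would come from.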

The introduction of BLAST and then gapped BLAST made it substantially easier for scientists to scan large sequence databases rapidly for relatively weak sequence similarities, and to statistically evaluate the resulting matches. Today, these BLAST programs are widely used tools for searching both protein and nucleic acid databases for sequence similarities, and may compare protein or DNA queries with protein or DNA databases in any combination. However, some of the most interesting similarities are quite subtle and do not rise to statistical significance during a standard BLAST search. Protein database searches using strategies that employ the construction of position-specific score matrices are often better able to detect weak relationships between sequences than are searches using a single sequence as the query. Yet, employing these methods has not always been simple, frequently involving the use of multiple computer programs as well as a fair amount of scientific expertise.

The original BLAST paper was the most highly cited paper published in the 1990s and is being supplanted only by the 1997 paper describing the original version of PSI-BLAST.


To overcome this obstacle, the team that developed gapped BLAST incorporated the use of position-specific score matrices into the BLAST protein database search program, extending its capacity to detect weak yet significant sequence similarities. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program features a method for automatically constructing a position-specific score matrix, or profile, from the multiple alignment implicit in the highest scoring matches from an initial BLAST search [4]. Scores are defined for aligning the various amino acids to each profile position. Highly conserved positions yield large positive or negative scores, while weakly conserved positions yield scores near zero. The profile is then used to perform a subsequent BLAST search, and the procedure may be iterated, or repeated, further refining the profile. PSI-BLAST runs at approximately the same speed per iteration as gapped BLAST, but in most cases, is far more sensitive to weak yet biologically significant sequence similarities.
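A position-specific score matrix of this kind can be illustrated with a minimal sketch. It assumes an ungapped multiple alignment, a fixed pseudocount, and log-odds scores against background frequencies; PSI-BLAST's own construction, including how it weights sequences and estimates frequencies, is more elaborate, and every name below is hypothetical. The toy example uses a four-letter alphabet for brevity.

```python
import math
from collections import Counter


def build_profile(aligned, background, pseudocount=1.0):
    """Build one log-odds score table per alignment column.

    aligned: equal-length, ungapped sequences.
    background: dict mapping each letter to its background frequency.
    """
    alphabet = sorted(background)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            letter: math.log2(
                ((counts[letter] + pseudocount) / total) / background[letter]
            )
            for letter in alphabet
        })
    return profile


def score_against_profile(profile, segment):
    """Sum the position-specific scores for a segment of the profile's length."""
    return sum(column[letter] for column, letter in zip(profile, segment))


# Toy data: two conserved columns followed by one fully variable column.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
profile = build_profile(["ACA", "ACC", "ACG", "ACT"], uniform)
```

In this toy profile the conserved first column scores A positively and every other letter at -1, while the variable third column scores every letter at zero, mirroring the behavior described above.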

A limitation of using PSI-BLAST for large-scale protein analysis has been that on a small percentage of queries, false positives (segments having no direct relationship to the query) enter the list of matches during one iteration and corrupt the profile for subsequent iterations. To mitigate this problem, a team of NCBI investigators headed by Altschul recently improved PSI-BLAST accuracy by incorporating the use of composition-based statistics [5]. Here, the evaluation of an alignment’s significance is tuned to a specific profile and the amino acid composition of the sequence to which it is locally aligned. Composition-based statistics have largely suppressed the problem of profile corruption.

Altschul and his team have also investigated at least a dozen other potential modifications to the methods used in PSI-BLAST, with the goal of improving overall accuracy in finding true positive matches. Their evaluation resulted in the implementation of a number of refinements to the PSI-BLAST program. Refinements include: the use of more accurately estimated statistical parameters; the filtering of database sequences, as opposed to query sequences, in order to prevent segments with highly restricted or biased amino acid composition from participating in the construction of profiles; and improved treatment of gaps within alignments when estimating position-specific amino acid frequencies [5]. Altschul and his collaborators have many more ideas they would like to implement and evaluate, always striving to provide the biomedical research community with readily accessible and powerful tools for conducting state-of-the-art molecular biology research. -CB

1
Altschul, SF, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local alignment search tool. J Mol Biol 215(3):403-10, 1990.

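2
Karlin, S and SF Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87(6):2264-8, 1990.
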
3
Karlin, S and SF Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci USA 90(12):5873-7, 1993.

4
Altschul, SF, TL Madden, AA Schäffer, J Zhang, Z Zhang, W Miller, and DJ Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search. Nucleic Acids Res 25(17):3389-3402, 1997.

5
Schäffer, AA, L Aravind, TL Madden, S Shavirin, JL Spouge, YI Wolf, EV Koonin, and SF Altschul. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29(14):2994-3005, 2001.

6
Sternberg, MJE, PA Bates, LA Kelley, and RM MacCallum. Progress in protein structure prediction: assessment of CASP3. Curr Opin Struct Biol 9(3):368-73, 1999.


The BLAST programs have been widely adopted as standard research tools by the international biomedical community. The advances described above not only improve the accuracy of BLAST searches, but provide scientists worldwide with more powerful methods for characterizing proteins by inferring function from sequence similarity. Using the various versions of BLAST, researchers have assigned many proteins to previously described families, and sometimes have uncovered completely novel families. PSI-BLAST has found relationships that had previously been detectable only with the aid of information about protein three-dimensional structure [6]. This research continues to reveal interesting, and, at times, unexpected truths about evolution.





NCBI News | Winter 2001
