Powerful Tools for Identifying Sequence Similarities
The recent pace of whole genome sequencing projects has resulted in an increase,
both in volume and in complexity, of available molecular sequence data.
Patterns shared by multiple protein and nucleic acid sequences provide
valuable insights into genomic organization, molecular structure, and
biological function, as well as into unexpected links among diverse biological
systems. These previously unknown connections not only speed research
progress, but often open new areas for scientific inquiry.
When studying a new gene or protein sequence, researchers often conduct
database searches in order to identify similar genes or proteins. This
is because the fastest method for identifying the function of a gene or
protein is to find a related gene or protein, or an entire family, whose
function is already known. The recognition of subtle residue patterns
among genes or proteins sometimes relies upon aligning many sequences,
a procedure that continues to present a complex and multifaceted problem
for research.
In 1990, NCBI researchers Altschul, Gish, and Lipman, in
collaboration with colleagues Miller and Myers from Penn
State and the University of Arizona, developed and released BLAST, the
Basic Local Alignment Search Tool [1].
The BLAST programs implement a set of sequence comparison algorithms that
search a database for optimal local alignments to a query sequence. A
local alignment represents a possible homology, or similarity by
descent, between segments from two nucleic acid or protein sequences.
The BLAST programs were substantially faster than existing database similarity
search programs, and of comparable sensitivity to distant relationships.
Of equal importance, using a statistical theory reported the same year
by Karlin of Stanford University and Altschul [2],
the BLAST programs first provided researchers with rigorous guidance for
determining which alignments were statistically significant, and therefore
worthy of further examination. The ideas underlying BLAST are simple and
robust, and can be applied in a variety of contexts, including DNA and
protein database searches, gene identification searches, and most recently,
sequence motif or profile searches.
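
The statistical guidance described above is commonly summarized by the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m and n are the query and database lengths, and lambda and K depend on the scoring system. The following Python sketch illustrates that relation only; the function names and the numeric values used for lambda and K are illustrative assumptions, not values taken from this article.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score (the E-value)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalize a raw score so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def chance_probability(e_value):
    """Probability of seeing at least one such alignment by chance: 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Placeholder parameters chosen for illustration (assumption, not from the article).
    lam, k = 0.318, 0.134
    e = expected_hits(raw_score=50, query_len=250, db_len=1_000_000, lam=lam, k=k)
    print(f"E-value: {e:.3g}, bit score: {bit_score(50, lam, k):.1f}")
```

Because E falls exponentially as the score rises, a modest increase in alignment score can move a match from background noise to clear significance.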
| Comparison,
whether of morphology or protein sequences, lies at the heart of biology. |
BLAST works by breaking the query sequence into short fragments, or words,
and initially seeking very close matches between these words and words
from database sequences. Any aligned word pair scoring above a specified
threshold is called a hit. Each hit is then extended
in both directions in an attempt to generate a local alignment representing
statistically significant sequence similarity. The quality of each alignment
is represented by a score, defined most simply as the sum of scores for
aligning pairs of nucleotides (for DNA) or pairs of amino acids (for proteins).
Because nucleotides or amino acids may be inserted or deleted within a
particular sequence during the course of evolution, alignment programs
generally allow for the existence of gaps, or spaces introduced into an
alignment to compensate for insertions and deletions in one sequence relative
to another. Gaps contribute negatively to the overall score of an alignment.
The original BLAST programs did not explicitly include gaps within alignments,
but rather treated them implicitly by calculating combined statistical
assessments of multiple ungapped alignments produced by a single pair
of sequences [3].
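
The word-hit and extension steps described above can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it assumes DNA sequences, exact word matches rather than thresholded near matches, a match/mismatch scoring scheme, and a drop-off rule (`x_drop`) for stopping an extension. All function names and parameter values are hypothetical.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map every word of length w in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def pair_score(a, b, match=1, mismatch=-2):
    """Score for aligning two residues (illustrative values)."""
    return match if a == b else mismatch


def extend_one_way(query, subject, q, s, step, x_drop):
    """Extend without gaps from (q, s) in one direction.

    Returns the best score gain and the number of residues included to reach it.
    The walk stops when the running score falls more than x_drop below the best.
    """
    best = running = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += pair_score(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, w=4, x_drop=5):
    """Return the best ungapped alignment as (score, query_start, subject_start, length)."""
    index = index_words(subject, w)
    best_hsp = None
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            seed = sum(pair_score(query[q + k], subject[s + k]) for k in range(w))
            right, r_len = extend_one_way(query, subject, q + w, s + w, +1, x_drop)
            left, l_len = extend_one_way(query, subject, q - 1, s - 1, -1, x_drop)
            hsp = (seed + left + right, q - l_len, s - l_len, w + l_len + r_len)
            if best_hsp is None or hsp[0] > best_hsp[0]:
                best_hsp = hsp
    return best_hsp
```

For example, `seed_and_extend("ACGTTACGT", "ACGTAACGT")` seeds on the shared word `ACGT` and extends across the single mismatch, because the matches that follow more than repay its penalty.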
In 1997, a team of NCBI researchers, including Altschul, Madden,
Schäffer, Zhang, and Lipman, in collaboration
with Zhang and Miller from Penn State, released a set of
gapped BLAST programs. These new programs not only generated
gapped alignments, but also ran several fold faster than the original
BLAST programs [4]. This improvement was
achieved by incorporating two algorithmic refinements. The first refinement
required two hits within a set distance of one another, rather than one,
before triggering a search for an ungapped local alignment, or high-scoring
segment pair (HSP). The second refinement invoked a gapped extension step
whenever an HSP of sufficiently high score was found. Previously, missing
a single HSP implicitly involved in a significant alignment could
jeopardize the discovery of the result. With an algorithm for generating
gapped alignments, it is now necessary to find only one
HSP, rather than all ungapped alignments subsumed in a significant result.
Therefore, careful choice of algorithmic parameters led to increased program
sensitivity to distant sequence relationships as well as to increased
speed.
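
The first refinement, the two-hit requirement, can be illustrated with a short Python sketch. It assumes that hits are paired only when they lie on the same diagonal (the same offset between subject and query positions) and do not overlap, which is how the two-hit method is usually described; the function name, the input format, and the window value in the usage example are hypothetical.

```python
def two_hit_triggers(hits, w, window):
    """Return the hits that would trigger an ungapped extension.

    hits   : iterable of (query_pos, subject_pos) word-hit start positions
    w      : word length, used to reject overlapping hits
    window : maximum distance allowed between two hits on one diagonal
    """
    last_on_diagonal = {}  # diagonal -> query position of the most recent hit
    triggers = []
    for q, s in sorted(hits):
        diagonal = s - q
        previous_q = last_on_diagonal.get(diagonal)
        if previous_q is not None:
            distance = q - previous_q
            if distance < w:
                continue  # overlaps the earlier hit, so it is ignored
            if distance <= window:
                triggers.append((q, s))  # second hit close enough: extend here
        last_on_diagonal[diagonal] = q
    return triggers


if __name__ == "__main__":
    # Illustrative word hits; the window value is an assumption for the example.
    example_hits = [(0, 10), (1, 11), (5, 15), (60, 70), (3, 40)]
    print(two_hit_triggers(example_hits, w=3, window=40))
```

Requiring a second nearby hit means far fewer extensions are attempted, which is where much of the speed gain comes from; the lost sensitivity can be recovered by using a more permissive word threshold.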
The introduction of BLAST and then gapped BLAST made it substantially
easier for scientists to scan large sequence databases rapidly for relatively
weak sequence similarities, and to statistically evaluate the resulting
matches. Today, these BLAST programs are widely used tools for searching
both protein and nucleic acid databases for sequence similarities, and
may compare protein or DNA queries with protein or DNA databases in any
combination. However, some of the most interesting similarities are quite
subtle and do not rise to statistical significance during a standard BLAST
search. Protein database searches using strategies that employ the construction
of position-specific score matrices are often better able to detect weak
relationships between sequences than are searches using a simple sequence
as the query. Yet, employing these methods has not always been simple,
frequently involving the use of multiple computer programs as well as
a fair amount of scientific expertise.
| The original BLAST paper was the
most highly cited paper published in the 1990s and is being supplanted
only by the 1997 paper describing the original version of PSI-BLAST. |
To overcome
this obstacle, the team that developed gapped BLAST incorporated the use
of position-specific score matrices into the BLAST protein database search
program, extending its capacity to detect weak yet significant sequence
similarities. The resulting Position-Specific Iterated BLAST (PSI-BLAST)
program features a method for automatically constructing a position-specific
score matrix, or profile, from the multiple alignment implicit in the
highest scoring matches from an initial BLAST search [4].
Scores are defined for aligning the various amino acids to each profile
position. Highly conserved positions yield large positive or negative
scores while weakly conserved positions yield scores near zero. The profile
is then used to perform a subsequent BLAST search, and the procedure may
be iterated, or repeated, further refining the profile. PSI-BLAST runs
at approximately the same speed per iteration as gapped BLAST, but in
most cases, is far more sensitive to weak yet biologically significant
sequence similarities.
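
To make the idea of a position-specific score matrix concrete, the following Python sketch builds log-odds scores from the columns of a small multiple alignment. It is a deliberately simplified illustration: it assumes a uniform background distribution, a fixed pseudocount, and no sequence weighting, none of which is claimed to match what PSI-BLAST does. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs, background=None, pseudocount=1.0):
    """Build one dict of log-odds scores per alignment column.

    aligned_seqs : equal-length strings; characters outside the amino acid
                   alphabet (such as '-') are skipped when counting.
    background   : residue -> expected frequency (uniform if omitted, an assumption).
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    profile = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(seq[col] for seq in aligned_seqs if seq[col] in background)
        total = sum(counts.values()) + pseudocount
        column = {}
        for aa in AMINO_ACIDS:
            # Observed frequency, smoothed toward the background by the pseudocount.
            freq = (counts[aa] + pseudocount * background[aa]) / total
            column[aa] = math.log2(freq / background[aa])
        profile.append(column)
    return profile


def score_segment(profile, segment):
    """Sum the position-specific scores for aligning segment to the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


if __name__ == "__main__":
    alignment = ["ACD", "ACE", "ACF", "AWG"]  # toy alignment for illustration
    profile = build_profile(alignment)
    print(score_segment(profile, "ACD"), score_segment(profile, "WWW"))
```

In this toy alignment the fully conserved first column gives a strongly positive score to its conserved residue and negative scores to all others, while a column with no observations scores every residue at zero.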
A limitation to using PSI-BLAST for large-scale protein analysis has been
that on a small percentage of queries, false positives (segments having
no direct relationship to the query) enter the list of matches during
one iteration and corrupt the profile for subsequent iterations. To mitigate
this problem, a team of NCBI investigators headed by Altschul recently
improved PSI-BLAST accuracy by incorporating the use of composition-based
statistics [5]. Here, the evaluation of
an alignment's significance is tuned to a specific profile and the
amino acid composition of the sequence to which it is locally aligned.
Composition-based statistics have largely suppressed the problem of profile
corruption.
Altschul and his team have also investigated at least a dozen other potential
modifications to the methods used in PSI-BLAST, with the goal of improving
overall accuracy in finding true positive matches. Their evaluation resulted
in the implementation of a number of refinements to the PSI-BLAST program.
Refinements include: the use of more accurately estimated statistical
parameters; the filtering of database sequences, as opposed to query sequences,
in order to prevent segments with highly restricted or biased amino acid
composition from participating in the construction of profiles; and improved
treatment of gaps within alignments when estimating position-specific
amino acid frequencies [5]. Altschul and
his collaborators have many more ideas they would like to implement and
evaluate, always striving to provide the biomedical research community
with readily accessible and powerful tools for conducting state-of-the-art
molecular biology research. CB
| The
BLAST programs have been widely adopted as standard research tools
by the international biomedical community. The advances described
above not only improve the accuracy of BLAST searches, but provide
scientists worldwide with more powerful methods for characterizing
proteins by inferring function from sequence similarity. Using the
various versions of BLAST, researchers have assigned many proteins
to previously described families, and sometimes have uncovered completely
novel families. PSI-BLAST has found relationships that had previously
been detectable only with the aid of information about protein three-dimensional
structure [6].
This research continues to reveal interesting, and, at times, unexpected
truths about evolution. |
