Journal of Computational Biology
J Comput Biol. Nov 2008; 15(9): 1187–1194.
PMCID: PMC2737730
NIHMSID: NIHMS78877

Significance of Gapped Sequence Alignments

Abstract

Measurement of the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To precisely approximate alignment score significance, we draw random samples directly from a well chosen, importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is $(3.4 \pm 0.3) \times 10^{-1314}$. Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.

Key words: alignment, dynamic programming, genomics, phylogenetic trees, regulatory regions

1. Introduction

The alignment of a pair of sequences is a fundamental task in genomics and proteomics, but methods for precisely gauging the significance of an alignment are not efficient (Mitrophanov and Borodovsky, 2006). Here we present an efficient method for precisely approximating the significance (also termed p-value) of an alignment score. We apply the approach to the alignment of sequences of length 1000 to precisely approximate p-values nearly as low as $10^{-4000}$. Furthermore, we observe that, for local sequence alignments, the log(significance) values approximately scale with the input sequence lengths; thus, the results of our method for selected sequence lengths can be scaled to arbitrary sequence lengths.

1.1. Notation and problem statement

Let (x, y) be a pair of sequences, composed of amino acid residues or nucleotides, with lengths (m, n), and let A be an alignment of the sequences. We choose to allow an alignment to have an insertion into y immediately following an insertion into x, but not vice versa; however, alternative approaches should demonstrate similar results. We choose to focus on local sequence alignments, where initial and final unaligned regions are permitted without penalty, rather than on global or semi-local sequence alignments, although the general approach presented here should also work well for such alignments.

Let s(x, y, A) be the score associated with x and y under the alignment A. It will be the sum of contributions from pairing scores, insertion start costs, and insertion extension costs. For instance, with SSEARCH (Pearson, 1991), the default parameters for sequences of nucleotides are: a pairing receives a score of +5 if the paired nucleotides match and a score of −4 if they are a mismatch; the first nucleotide of an insertion into a sequence incurs a score of −16; and extension of an insertion by a nucleotide incurs a score of −4.
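The scoring rule just described can be sketched in a few lines of Python. This is a minimal illustration, not code from SSEARCH: the function name `alignment_score` and the representation of an alignment as two equal-length rows with `-` marking a gap are assumptions made for the example, and it scores the rows globally rather than choosing a local region.

```python
def alignment_score(row_x, row_y, match=5, mismatch=-4, gap_open=-16, gap_ext=-4):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap)."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    score = 0
    prev_gap = None  # which row held the gap in the previous column, if any
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            gap_row = "x" if a == "-" else "y"
            # The first residue of an insertion pays gap_open; each further
            # residue of the same insertion pays gap_ext.
            score += gap_ext if prev_gap == gap_row else gap_open
            prev_gap = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


print(alignment_score("ACGT", "ACGT"))      # four matches
print(alignment_score("AC-GT", "ACTGT"))    # four matches, one insertion start
print(alignment_score("AC--GT", "ACTTGT"))  # four matches, start plus one extension
```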

We choose to define the alignment score for a pair of sequences (x, y) as

$$s_{\max}(x, y) = \max_{A}\, s(x, y, A) \tag{1}$$

the maximum achievable alignment score for the sequence pair, computable via the Viterbi algorithm (Viterbi, 1967) of Needleman and Wunsch (1970) or Smith and Waterman (1981), or a variant of either algorithm, depending upon the alignment type of interest. The general approach presented here should also work well with alternative choices for s(x, y), such as the “forward sum” alignment score, its logarithm the “free” alignment score, or the “average” alignment score,

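$$s_{\mathrm{sum}}(x, y) = \sum_{A} e^{s(x, y, A)/T} \tag{2}$$

$$s_{\mathrm{free}}(x, y) = T \ln \sum_{A} e^{s(x, y, A)/T} \tag{3}$$

$$s_{\mathrm{avg}}(x, y) = \frac{\sum_{A} s(x, y, A)\, e^{s(x, y, A)/T}}{\sum_{A} e^{s(x, y, A)/T}} \tag{4}$$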

for some T > 0.
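To make the relationship among these score definitions concrete, here is a small sketch that evaluates them on an explicit list of alignment scores. It assumes that the free score is T times the logarithm of the forward sum and that the average score is the Boltzmann-weighted mean; the function name `score_summaries` and the example scores are illustrative only. In practice the set of alignments is far too large to enumerate, and dynamic programming is used instead.

```python
import math


def score_summaries(scores, T):
    """Return (s_max, log of forward sum, free score, average score).

    `scores` lists s(x, y, A) for every alignment A of one sequence pair.
    """
    s_max = max(scores)
    # Shift by s_max before exponentiating so that no term overflows.
    weights = [math.exp((s - s_max) / T) for s in scores]
    total = sum(weights)
    log_sum = s_max / T + math.log(total)  # ln of sum_A exp(s / T)
    s_free = T * log_sum                   # assumed form: T * ln(forward sum)
    s_avg = sum(s * w for s, w in zip(scores, weights)) / total
    return s_max, log_sum, s_free, s_avg


scores = [20, 4, 0]
for T in (0.1, 1.0, 10.0):
    print(T, score_summaries(scores, T))
```

As T shrinks, both the free score and the average score approach the maximum score; as T grows, the average score approaches the unweighted mean.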

The definition of the statistical significance of a score s0 is

$$p(s_0) = \sum_{x, y} \Pr(x, y)\, \Theta\big(s(x, y) \ge s_0\big) \tag{5}$$

where the sum is over all sequence pairs (x, y) with lengths (m, n), where Pr(x, y) is the null-model probability of the sequence pair (x, y), as determined by GC-content or database residue frequency, and where Θ( ) is a function that has value 1 if its argument is true and has value 0 otherwise. Because no one knows how to efficiently perform this sum exactly for x and y of moderate or large size, it is approximated via importance sampling (Hartmann, 2002; Wolfsheimer et al., 2007).

1.2. Importance sampling

We could sample N sequence pairs (x, y) directly from the probability distribution Pr(x, y), and then compute s(x, y) for each sampled sequence pair. The fraction of the sampled sequence pairs for which s(x, y) ≥ s0 would thus be an estimate $\hat{p}(s_0)$ of the statistical significance of a score s0. However, the error in the estimate is

$$\sigma = \sqrt{\frac{p(s_0)\,[1 - p(s_0)]}{N}} \tag{6}$$

which will be greater than p(s0) itself unless N ≥ 1/p(s0). Because it is not unusual for p(s0) to be $10^{-10}$ or smaller, the cost of using N ≥ 1/p(s0) samples is often prohibitive.
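The arithmetic behind this cost can be checked with a short sketch. It uses the standard binomial standard error for the fraction of direct samples that exceed the threshold; the function names are hypothetical.

```python
import math


def naive_standard_error(p, n_samples):
    """Standard error of the hit fraction when sampling directly from Pr(x, y)."""
    return math.sqrt(p * (1.0 - p) / n_samples)


def samples_for_relative_error(p, rel_err):
    """Smallest N whose standard error is at most rel_err * p."""
    return math.ceil((1.0 - p) / (p * rel_err ** 2))


p = 1e-10
# With a million samples the error is still about 100 times the p-value.
print(naive_standard_error(p, 10**6) / p)
# A 10% relative error needs on the order of 1e12 samples.
print(samples_for_relative_error(p, 0.10))
```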

However, we can make the approximation more efficient by using importance sampling (Liu, 2001). We observe that

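\begin{align}
p(s_0) &= \sum_{x, y} \Pr(x, y)\, \Theta\big(s(x, y) \ge s_0\big) \tag{7} \\
&= \sum_{x, y} \Pr{}^*(x, y)\, \frac{\Pr(x, y)\, \Theta\big(s(x, y) \ge s_0\big)}{\Pr{}^*(x, y)} \tag{8} \\
&= \sum_{x, y} \Pr{}^*(x, y)\, f^*(x, y) \tag{9}
\end{align}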

for any probability distribution Pr*(x, y), where the last equality defines f*(x, y). Thus, if we draw samples according to Pr*(x, y), we can estimate p(s0) as the average value over these samples of the function f*(x, y). This estimator $\hat{p}^*(s_0)$ has an uncertainty of

$$\sigma^* = \sqrt{\frac{1}{N}\left[\sum_{x, y} \Pr{}^*(x, y)\, f^*(x, y)^2 - p(s_0)^2\right]} \tag{10}$$

which, we shall soon show, is significantly better than Equation (6) when Pr*(x, y) is well chosen.
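A toy case shows the gain from importance sampling. The sketch below is not the gapped local alignment model of this paper: it assumes a single fixed, ungapped alignment of two length-n nucleotide sequences with uniform composition, so that each column is a match with probability 1/4 and the score depends only on the number of matches. Under an exponentially tilted distribution proportional to Pr(x, y) exp(s/T), each column is sampled independently, the normalizer is a simple power, and the exact p-value is a binomial tail that can be compared with the estimate. All names are hypothetical.

```python
import math
import random


def tilted_estimate(n, s0, T, n_samples, seed=0,
                    match=5.0, mismatch=-4.0, q_match=0.25):
    """Importance-sampling estimate of Pr(score >= s0) and its standard error."""
    rng = random.Random(seed)
    w_match = q_match * math.exp(match / T)
    w_mis = (1.0 - q_match) * math.exp(mismatch / T)
    w_bar = w_match + w_mis            # per-column normalizer
    p_match_tilted = w_match / w_bar   # match probability under the tilted law
    log_Z = n * math.log(w_bar)        # ln of the full normalizer
    values = []
    for _ in range(n_samples):
        k = sum(rng.random() < p_match_tilted for _ in range(n))
        s = match * k + mismatch * (n - k)
        # f* = (Pr / Pr*) * Theta(s >= s0) = Z * exp(-s / T) * Theta(s >= s0)
        values.append(math.exp(log_Z - s / T) if s >= s0 else 0.0)
    p_hat = sum(values) / n_samples
    var = sum(v * v for v in values) / n_samples - p_hat ** 2
    return p_hat, math.sqrt(max(var, 0.0) / n_samples)


def exact_tail(n, s0, match=5.0, mismatch=-4.0, q_match=0.25):
    """Exact Pr(score >= s0) as a binomial tail over the number of matches."""
    return sum(math.comb(n, k) * q_match ** k * (1.0 - q_match) ** (n - k)
               for k in range(n + 1)
               if match * k + mismatch * (n - k) >= s0)


T = 9.0 / math.log(12.0)  # makes the tilted mean score equal s0 = 320 for n = 100
p_hat, sigma = tilted_estimate(100, 320.0, T, n_samples=2000)
print(p_hat, sigma, exact_tail(100, 320.0))
```

A few thousand tilted samples resolve a p-value near $10^{-30}$ here, whereas direct sampling would need on the order of $10^{30}$ samples to see a single hit.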

1.3. Previous work

Hartmann (2002) and Wolfsheimer et al. (2007) use $\Pr^*(x, y) \propto \Pr(x, y)\, e^{s_{\max}(x, y)/T}$ for some value T. To draw samples from this probability distribution, they employ a Metropolis-coupled Markov chain Monte Carlo technique (MCMCMC), wherein an (x, y) sequence pair at a given iteration is modified at a single sequence position, giving a sequence pair (x′, y′); the new sequence pair is then accepted with probability $\min\left[1, \Pr^*(x', y') / \Pr^*(x, y)\right]$, or is otherwise rejected. A collection of N nearly independent samples is derived from this process, by allowing sufficiently many steps to occur between consecutive samples, thus allowing proper mixing to occur. For computing p-values as extreme as $10^{-70}$, they use this technique to approximate the significance of scores from the pairwise local alignment of two amino acid sequences, using the BLOSUM62 scoring matrix, SwissProt residue frequencies, and sequences of length up to 800.

Instead, we choose

$$\Pr{}^*(x, y) \propto \Pr(x, y)\, e^{s_{\mathrm{free}}(x, y)/T} \tag{11}$$

defining Pr* (x, y) with the free alignment score of a sequence pair rather than the maximum alignment score of the pair. To proceed, we must choose a value for T; we must demonstrate how to efficiently sample from this probability distribution; and we must demonstrate that the resulting estimate is statistically efficient.

2. Methods

We explain the mathematics of the approach here, leaving some of the implementation details until Section 6 in the Supplementary Materials. (See online supplementary material at www.liebertonline.com.)

2.1. Efficient sampling

We define forward sum functions (also termed partition functions):

$$Z(x, y, T) = \sum_{A} e^{s(x, y, A)/T} \tag{12}$$

$$Z(T) = \sum_{x, y} \Pr(x, y)\, Z(x, y, T) \tag{13}$$

The value of Z(x, y, T) is computed via a forward-sum adaptation of the relevant Viterbi algorithm for finding a maximum score alignment. In short, wherever a score s would be added to a value, instead a factor $e^{s/T}$ multiplies the value; whenever a maximum of alternative values would be chosen, instead the sum of these values is taken. For a discussion on forward algorithms for pair-HMM sequence alignment, see Durbin et al. (2006), or for a more general discussion on forward algorithms for hidden Markov models, see Rabiner and Juang (1986).
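The substitution described above (products for sums of scores, sums for maxima) can be sketched for local alignment with affine insertion costs. This is an illustrative sketch under stated assumptions, not the implementation used for the results: it assumes that a non-empty local alignment begins and ends with a paired column, that the empty alignment contributes a weight of 1, and that an unpaired residue of y may directly follow an unpaired residue of x but not the reverse. The name `forward_sum` and the plain floating-point arithmetic are choices made for the example; long sequences would need log-space or scaled arithmetic.

```python
import math


def forward_sum(x, y, T, match=5.0, mismatch=-4.0, gap_open=-16.0, gap_ext=-4.0):
    """Sum of exp(score / T) over all local alignments of x and y."""
    m, n = len(x), len(y)
    w_open = math.exp(gap_open / T)
    w_ext = math.exp(gap_ext / T)
    # M[i][j]: total weight of alignments whose last column pairs x[i-1], y[j-1].
    # X[i][j]: last column leaves x[i-1] unpaired, with y consumed through j.
    # Y[i][j]: last column leaves y[j-1] unpaired, with x consumed through i.
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    X = [[0.0] * (n + 1) for _ in range(m + 1)]
    Y = [[0.0] * (n + 1) for _ in range(m + 1)]
    total = 1.0  # the empty alignment
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w_pair = math.exp((match if x[i - 1] == y[j - 1] else mismatch) / T)
            # The leading 1.0 starts a new local alignment at this pair.
            M[i][j] = w_pair * (1.0 + M[i - 1][j - 1] + X[i - 1][j - 1] + Y[i - 1][j - 1])
            # An x-insertion may follow a pair or extend an x-insertion.
            X[i][j] = w_open * M[i - 1][j] + w_ext * X[i - 1][j]
            # A y-insertion may follow a pair or an x-insertion, or extend itself.
            Y[i][j] = w_open * (M[i][j - 1] + X[i][j - 1]) + w_ext * Y[i][j - 1]
            total += M[i][j]  # an alignment may end at any paired column
    return total


print(forward_sum("ACGT", "ACGT", T=2.0))
```

Because each alignment corresponds to exactly one path through the three tables, every alignment is counted once.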

The value of Z (T) is computed by use of the single mean value

$$\sum_{a, b} q_a\, q_b\, e^{s(a, b)/T} \tag{14}$$

in the forward-sum alignment algorithm in lieu of any individual value, $e^{s(a,b)/T}$, for a given nucleotide/residue pair (a, b). Here $q_a$ is the null-model probability of the nucleotide/residue a, as determined by GC-content or database residue frequency, and similarly for $q_b$.
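The reason a single mean pair weight suffices is that, for a fixed alignment, summing over all residues weighted by their null probabilities turns every paired column into the mean weight and every unpaired residue into a factor of 1. The sketch below checks this numerically on very short sequences by comparing the mean-weight recursion with a brute-force sum over all sequence pairs. It relies on the same assumed local alignment conventions as any three-table affine recursion (alignments start and end on a pair, the empty alignment has weight 1, an unpaired y residue may follow an unpaired x residue but not the reverse); the function names and the residue frequencies are illustrative.

```python
import itertools
import math


def forward_sum_generic(m, n, pair_weight, w_open, w_ext):
    """Forward sum over local alignments; pair_weight(i, j) weights pair (i, j)."""
    M = [[0.0] * (n + 1) for _ in range(m + 1)]  # last column is a pair
    X = [[0.0] * (n + 1) for _ in range(m + 1)]  # last column: x residue unpaired
    Y = [[0.0] * (n + 1) for _ in range(m + 1)]  # last column: y residue unpaired
    total = 1.0  # empty alignment
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = pair_weight(i - 1, j - 1) * (
                1.0 + M[i - 1][j - 1] + X[i - 1][j - 1] + Y[i - 1][j - 1])
            X[i][j] = w_open * M[i - 1][j] + w_ext * X[i - 1][j]
            Y[i][j] = w_open * (M[i][j - 1] + X[i][j - 1]) + w_ext * Y[i][j - 1]
            total += M[i][j]
    return total


def z_mean_weight(m, n, T, q, score, gap_open=-16.0, gap_ext=-4.0):
    """Z(T) computed with the single mean pair weight."""
    w_bar = sum(q[a] * q[b] * math.exp(score(a, b) / T) for a in q for b in q)
    return forward_sum_generic(m, n, lambda i, j: w_bar,
                               math.exp(gap_open / T), math.exp(gap_ext / T))


def z_brute_force(m, n, T, q, score, gap_open=-16.0, gap_ext=-4.0):
    """Z(T) as the explicit null-weighted sum of Z(x, y, T) over all pairs."""
    w_open, w_ext = math.exp(gap_open / T), math.exp(gap_ext / T)
    total = 0.0
    for x in itertools.product(q, repeat=m):
        for y in itertools.product(q, repeat=n):
            prob = math.prod(q[a] for a in x + y)
            total += prob * forward_sum_generic(
                m, n, lambda i, j: math.exp(score(x[i], y[j]) / T), w_open, w_ext)
    return total


q = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # illustrative frequencies


def nucleotide_score(a, b):
    return 5.0 if a == b else -4.0


print(z_mean_weight(3, 2, 2.0, q, nucleotide_score))
print(z_brute_force(3, 2, 2.0, q, nucleotide_score))
```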

We proceed by sampling an (x, y, A) triplet from the probability distribution

$$\Pr{}_T(x, y, A) = \frac{\Pr(x, y)\, e^{s(x, y, A)/T}}{Z(T)} \tag{15}$$

via a backtrace through the forward-sum alignment algorithm that computes Z(T). In that backtrace, we sample the alignment (i.e., the locations of pairings and insertion starts and extensions) in the usual way. See Durbin et al. (2006) for a discussion on probabilistic sampling for sequence alignment, and see Newberg (2008) for the memory efficient backtrace procedure that we employed. During the backtrace, we also sample the paired and unpaired nucleotides/residues that comprise the sequences x and y. An unpaired nucleotide/residue a is sampled with probability $q_a$, whereas a pair of nucleotides/residues (a, b) is sampled with probability

$$\frac{q_a\, q_b\, e^{s(a, b)/T}}{\sum_{a', b'} q_{a'}\, q_{b'}\, e^{s(a', b')/T}} \tag{16}$$
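The pair-sampling step can be sketched as follows, assuming the pair probability is the null probability of the pair times exp(s(a, b)/T), normalized over all pairs. The function names, the uniform nucleotide frequencies, and the chosen T are illustrative; sampling the alignment itself by backtrace is not shown.

```python
import math
import random


def tilted_pair_distribution(q, score, T):
    """Probability of each residue pair (a, b) in a paired column."""
    weights = {(a, b): q[a] * q[b] * math.exp(score(a, b) / T)
               for a in q for b in q}
    norm = sum(weights.values())  # the mean pair weight
    return {pair: w / norm for pair, w in weights.items()}


def sample_pair(dist, rng):
    """Draw one (a, b) pair from the tilted distribution."""
    pairs = list(dist)
    return rng.choices(pairs, weights=[dist[p] for p in pairs], k=1)[0]


q = {a: 0.25 for a in "ACGT"}  # illustrative uniform frequencies
dist = tilted_pair_distribution(q, lambda a, b: 5.0 if a == b else -4.0,
                                T=9.0 / math.log(12.0))
rng = random.Random(0)
print(sum(p for (a, b), p in dist.items() if a == b))  # tilted match probability
print([sample_pair(dist, rng) for _ in range(5)])
```

Lower temperatures push more probability onto high-scoring pairs, which is how the sampler is steered toward high-scoring sequence pairs.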

Then, by disregarding the sampled alignment A, we are, in effect, sampling a sequence pair (x, y) from the marginal probability distribution

$$\Pr{}_T(x, y) = \sum_{A} \Pr{}_T(x, y, A) = \frac{\Pr(x, y)\, Z(x, y, T)}{Z(T)} \tag{17}$$

the promised importance sampling probability distribution of Equation (11). Thus, we estimate the significance of a score s0 as indicated by Equation (8):

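\begin{align}
\hat{p}_T(s_0) &= \frac{1}{N} \sum_{(x, y)} f_T(x, y) \tag{18} \\
&= \frac{1}{N} \sum_{(x, y)} \frac{Z(T)\, \Theta\big(s(x, y) \ge s_0\big)}{Z(x, y, T)} \tag{19}
\end{align}

where the sums are over the sequence pairs sampled from the $\Pr_T(x, y)$ probability distribution.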

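We estimate the uncertainty in the estimate $\hat{p}_T(s_0)$ via the samples themselves:

$$\hat{\sigma}_T = \sqrt{\frac{1}{N}\left[\frac{1}{N} \sum_{(x, y)} f_T(x, y)^2 - \hat{p}_T(s_0)^2\right]} \tag{20}$$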

2.2. Choice of T

The choice for T depends upon the value of s0 for which we wish to compute the significance. We find that, for s0 sufficiently high, a good choice for T is one for which s0 is near the average score for an (x, y, A) triplet drawn from the $\Pr_T(x, y, A)$ probability distribution. That is, we use T such that

$$\langle s \rangle_T \equiv \sum_{x, y, A} \Pr{}_T(x, y, A)\, s(x, y, A) = s_0 \tag{21}$$

We find a value for T by using the Newton method for zero-finding on the function $\langle s \rangle_T - s_0$, where the derivative is approximated through differencing of Z(T) for two nearby values of T. This works well for high scores, which are usually the most interesting; unfortunately, when s0 is less extreme, for example, when p(s0) ≥ $10^{-6}$, the above procedure produces a value of T that is too low. For a description of how we handle this situation, see Section 6 in the Supplementary Materials. (See online supplementary material at www.liebertonline.com.)
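One way to set up such a Newton iteration is sketched below. It uses the identity that the mean score under the tilted distribution equals $-T^2\, \mathrm{d} \ln Z(T) / \mathrm{d}T$, which follows from differentiating a partition function of the form $\sum \Pr(x, y)\, e^{s/T}$, and approximates both that derivative and the Newton slope by central differences. The step sizes, the positivity safeguard, and the toy `log_Z` (a fixed ungapped alignment of 100 uniformly distributed nucleotide columns) are assumptions of this example rather than details taken from the paper.

```python
import math


def mean_score(log_Z, T, rel_step=1e-4):
    """<s>_T = -T^2 * d ln Z / dT, with a central difference for the derivative."""
    h = rel_step * T
    return -T * T * (log_Z(T + h) - log_Z(T - h)) / (2.0 * h)


def solve_temperature(log_Z, s0, T_init, tol=1e-8, max_iter=50):
    """Newton iteration for the T at which the mean score equals s0."""
    T = T_init
    for _ in range(max_iter):
        f = mean_score(log_Z, T) - s0
        dT = 1e-3 * T
        slope = (mean_score(log_Z, T + dT) - mean_score(log_Z, T - dT)) / (2.0 * dT)
        step = f / slope
        # Safeguard added for this sketch: damp the step so that T stays positive.
        while T - step <= 0.0:
            step /= 2.0
        T -= step
        if abs(step) < tol * T:
            break
    return T


def toy_log_Z(T, n=100):
    """ln Z(T) for n independent columns: match (+5) w.p. 1/4, mismatch (-4) w.p. 3/4."""
    return n * math.log(0.25 * math.exp(5.0 / T) + 0.75 * math.exp(-4.0 / T))


T_star = solve_temperature(toy_log_Z, s0=320.0, T_init=5.0)
print(T_star, mean_score(toy_log_Z, T_star))
```

For this toy model the answer is known in closed form, T = 9 / ln 12, which gives a simple check on the iteration.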

We also have had some success with choosing the temperature so that the threshold s0 is near the median $s_{\max}$ score of 11 sampled sequence pairs.

3. Results

Figure 1 graphs the results of our significance calculations for five problem instances. We performed calculations on amino acid sequences using BLOSUM62 scores, SwissProt residue frequencies, an insertion start penalty of 12, and an insertion extension penalty of 1 for sequences of length equation M28. We also performed calculations on nucleotide sequences using the default SSEARCH parameters (a score of +5 for a pairing of matching nucleotides, a score of −4 for a pairing mismatch, a score of −16 for an insertion start, and a score of −4 for an insertion extension) for sequences of length equation M29.

FIG. 1.
Significance of sequence alignment scores. (Left) The panels give log10 (p-value) as a function of alignment score. For instance, a score of 6000 for the alignment of two amino acid sequences of length 1000 residues is (3.4 ± 0.3) × 10 ...

For example, the results for the (m, n) = (1000, 1000) problem for amino acid sequences were computed to a tight precision using N = 10,000 samples for each of 76 temperatures (see Section 6 in the Supplementary Materials; see online supplementary material at www.liebertonline.com), with each temperature requiring 13–14 hours of run time. We were able to make these calculations in parallel, one Friday night, with 76 of the CPUs in our cluster. Typical relative error is approximately σ = 10% of the computed p-value; thus, the absolute error for log10 (p-value) is approximately 0.05. The running time is approximately equation M30 for each temperature.

4. Discussion

For the alignment of a pair of amino acid sequences of length 1000, we are able to precisely approximate the significance of alignment scores, even though the p-values can be nearly as low as $10^{-4000}$ (Fig. 1). Pre-computation of p-values for a variety of interesting alignment types and problem sizes is feasible; thus, lookup tables could fruitfully be employed by current alignment packages to report p-values to their users.

Furthermore, the similarity among the curves within each panel in the right half of Figure 1 indicates that significance scales nicely with the input sequence lengths (m, n). We can approximate the significance of a score s0 for input sequence lengths (m, n) from the significance values for input sequence lengths (equation M31) as

equation M32
(22)

where M is the maximum achievable alignment score for input sequence lengths (m, n), and equation M33 is the maximum achievable alignment score for input sequence lengths (equation M34), and where these maximum achievable scores and their p-values are calculated exactly by inspection.

In the limit, as the length of the sequences being aligned tends to infinity, the maximum scores for local ungapped alignments are expected to follow a Gumbel distribution

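$$\Pr(s_{\max} \ge s_0) = 1 - \exp\left(-\alpha\, e^{-\beta s_0}\right) \tag{23}$$

for some positive values α and β (Karlin and Altschul, 1990; Eddy, 2008). When s0 is large,

$$p(s_0) \approx \alpha\, e^{-\beta s_0} \tag{24}$$

$$\log p(s_0) \approx \log \alpha - \beta s_0 \tag{25}$$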

Accordingly, the curves in Figure 1 might be expected to be straight lines, at least for high values of s0. For the pairwise alignment of amino acid sequences of length up to 800 and p-values down to $10^{-70}$, Wolfsheimer et al. (2007) observe that these curves are not straight lines, and we now extend that observation to p-values nearly as extreme as $10^{-4000}$.

Wolfsheimer et al. (2007) hypothesize that the tails of these curves are parabolic. We find that the best-fit parabola of the log10 (p-value) data for the BLOSUM62 tail scores 5,500–11,000 is equation M38. The best-fit parabola has a root mean square error of approximately 14 with respect to our estimates. That is fairly good, although well in excess of the typical 0.05 error estimated from our simulations. In comparison, for these BLOSUM62 tail scores, the root mean square error for the best-fit line is 84; the error for the best-fit cubic is 6.7; and the error for the best-fit quartic is 3.6.

Because the run time of the algorithm, for fixed sequence lengths m and n, is approximately equation M39, the calculation can be orders of magnitude faster if good precision is not required. For instance, the example given in Figure 1 of a score of 6000 for amino acid sequences of length 1000 can be computed as log10 (p-value) = −1314 ± 1 with N = 100 samples in 39 minutes using one of our processors, and likely even faster once our prototype software attains production quality. Using the reduced number of samples, N = 100, an alignment of two sequences of length 10,000 requires approximately 13–14 hours to compute the value log10 (p-value) ≈ −13,100 for s0 = 60,000. This p-value might also have been estimated much more quickly, via Equation (22), from the −1314 value just mentioned.

We have focused on local sequence alignments with certain parameters, but the technique is easily applied to other sets of parameters and other alignment models. For instance, we conjecture that this technique will work well with profile-HMMs (Durbin et al., 2006), which permit the alignment model parameters (e.g., the nucleotide/residue frequencies and the indel penalties) to vary by position within a sequence.

Further, we conjecture that this technique will also be efficient when used to estimate extreme statistical significance values for a much broader set of applications, such as many hidden Markov models that emit a sequence of data (Durbin et al., 2006; Rabiner and Juang, 1986). For a given sequence of data, often it is the case that we wish to know a maximum-probability/Viterbi path through the model. The statistical significance of the log-probability of such a path can be estimated via our technique, as can the statistical significance of the sum of the probabilities over all paths emitting the given data sequence. We conjecture that, in many cases, such estimates will be quite efficient. Likewise, we conjecture efficiency for many of the even broader class of finite state automata, which are much like hidden Markov models, except that the scores are not necessarily derived from log probabilities; for instance, often scores are derived from (or interpreted as) log odds, the logarithm of the ratio of the probabilities from given alternative and null models (Karlin and Altschul, 1990).

5. Conclusion

We have presented a method for efficiently and accurately approximating the p-value for a pairwise alignment score. The approach can be used for a variety of definitions of local or global alignment and for a variety of definitions of alignment score. Via the pre-computation of p-values for a selection of interesting problem instances, algorithms that make pairwise alignments can employ scaling to quickly approximate the p-value for any given problem instance. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.

Acknowledgments

We thank the Computational Molecular Biology and Statistics Core Facility at the Wadsworth Center for the computing resources to make these calculations. We thank Charles E. Lawrence for helpful ideas and suggestions. This research was supported by the National Institutes of Health/National Human Genome Research Institute (grant K25 HG003291 to L.A.N.).

Disclosure Statement

No competing financial interests exist.

References

  • Durbin R. Eddy S. Krogh A., et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press; Cambridge, UK: 2006.
  • Eddy S.R. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput. Biol. 2008;4:e1000069.
  • Hartmann A.K. Sampling rare events: statistics of local sequence alignments. Phys. Rev. E. 2002;65:056102.
  • Karlin S. Altschul S.F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA. 1990;87:2264–2268.
  • Liu J.S. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag; New York: 2001.
  • Mitrophanov A.Y. Borodovsky M. Statistical significance in biological sequence analysis. Brief Bioinform. 2006;7:2–24.
  • Needleman S.B. Wunsch C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970;48:443–453.
  • Newberg L.A. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. Bioinformatics. 2008;24:1772–1778.
  • Pearson W.R. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics. 1991;11:635–650.
  • Rabiner L.R. Juang B.-H. An introduction to hidden Markov models. IEEE ASSP Mag. 1986;3:4–16.
  • Smith T.F. Waterman M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197.
  • Viterbi A.J. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory. 1967;13:260–269.
  • Wolfsheimer S. Burghardt B. Hartmann A.K. Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail. Algorithms Mol. Biol. 2007;2:9.

Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.