
# Significance of Gapped Sequence Alignments

*Corresponding author: Dr. Lee A. Newberg, Wadsworth Center, NYS Department of Health, P.O. Box 509, Albany, NY 12201-0509. E-mail:* lee.newberg@wadsworth.org

## Abstract

Measurement of the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To precisely approximate alignment score significance, we draw random samples directly from a well chosen, importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is (3.4 ± 0.3) × 10^{−1314}. Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at *http://bayesweb.wadsworth.org/alignmentSignificanceV1/*.

**Key words:** alignment, dynamic programming, genomics, phylogenetic trees, regulatory regions

## 1. Introduction

The alignment of a pair of sequences is a fundamental task in genomics and proteomics, but methods for precisely gauging the significance of an alignment are not efficient (Mitrophanov and Borodovsky, 2006). Here we present an efficient method for precisely approximating the significance (also termed *p*-value) of an alignment score. We apply the approach to the alignment of sequences of length 1000 to precisely approximate *p*-values nearly as low as 10^{−4000}. Furthermore, we observe that, for local sequence alignments, the log(significance) values approximately scale with the input sequence lengths; thus, the results of our method for select sequence length sizes can be scaled to arbitrary sequence lengths.

### 1.1. Notation and problem statement

Let (*x, y*) be a pair of sequences, composed of amino acid residues or nucleotides, with lengths (*m, n*), and let *A* be an alignment of the sequences. We choose to allow an alignment to have an insertion into *y* immediately following an insertion into *x*, but not vice versa; however, alternative approaches should demonstrate similar results. We choose to focus on local sequence alignments, where initial and final unaligned regions are permitted without penalty, rather than on global or semi-local sequence alignments, although the general approach presented here should also work well for such alignments.

Let *s*(*x, y, A*) be the score associated with *x* and *y* under the alignment *A*. It will be the sum of contributions from pairing scores, insertion start costs, and insertion extension costs. For instance, with SSEARCH (Pearson, 1991), the default parameters for sequences of nucleotides are: a pairing receives a score of +5 if the paired nucleotides match and a score of −4 if they are a mismatch; the first nucleotide of an insertion into a sequence incurs a score of −16; and extension of an insertion by a nucleotide incurs a score of −4.
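
As a minimal sketch of how such a score is assembled, the following function scores one given alignment with the SSEARCH nucleotide defaults quoted above. The function name `score_alignment` and the two-row gapped-string representation of an alignment are assumptions made for illustration, not part of the original method.

```python
def score_alignment(row_x, row_y, match=5, mismatch=-4, gap_start=-16, gap_extend=-4):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap)."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    total = 0
    prev_gap = None  # which row carried a gap in the previous column, if any
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a == "-" or b == "-":
            gap_row = "x" if a == "-" else "y"
            # First residue of an insertion pays the start cost, later ones the extension cost.
            total += gap_extend if gap_row == prev_gap else gap_start
            prev_gap = gap_row
        else:
            total += match if a == b else mismatch
            prev_gap = None
    return total


# Three matches (+15) and one insertion of length two (-16 - 4).
print(score_alignment("ACGTT", "AC--T"))  # -5
```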

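We choose to define the alignment score for a pair of sequences (*x*, *y*) as

$$s(x, y) = s_{\max}(x, y) = \max_{A}\, s(x, y, A),$$

the maximum achievable alignment score for the sequence pair, computable via the Viterbi algorithm (Viterbi, 1967) of Needleman and Wunsch (1970) or Smith and Waterman (1981), or a variant of either algorithm, depending upon the alignment type of interest. The general approach presented here should also work well with alternative choices for *s*(*x*, *y*), such as the “forward sum” alignment score, its logarithm the “free” alignment score, or the “average” alignment score,

$$s_{\mathrm{sum}}(x, y) = \sum_{A} e^{s(x, y, A)/T}, \qquad s_{\mathrm{free}}(x, y) = T \ln s_{\mathrm{sum}}(x, y), \qquad s_{\mathrm{avg}}(x, y) = \frac{\sum_{A} s(x, y, A)\, e^{s(x, y, A)/T}}{\sum_{A} e^{s(x, y, A)/T}},$$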

for some *T* > 0.

The definition of the statistical significance of a score *s*_{0} is

$$p(s_0) = \sum_{(x, y)} \Pr(x, y)\, \Theta\big(s(x, y) \ge s_0\big),$$

where the sum is over all sequence pairs (*x, y*) with lengths (*m, n*), where Pr(*x*, *y*) is the null-model probability of the sequence pair (*x, y*), as determined by GC-content or database residue frequency, and where Θ( ) is a function that has value 1 if its argument is true and has value 0 otherwise. Because no one knows how to efficiently perform this sum exactly for *x* and *y* of moderate or large size, it is approximated via importance sampling (Hartmann, 2002; Wolfsheimer et al., 2007).
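
To make the cost of the exact sum concrete, here is a brute-force sketch for very short nucleotide sequences. The helper names (`local_score`, `exact_p_value`), the uniform residue frequencies, and the linear gap cost are simplifying assumptions for illustration; the number of terms grows as 4^{m+n}, which is why this route is closed for realistic lengths.

```python
from itertools import product


def local_score(x, y, match=5, mismatch=-4, gap=-16):
    """Smith-Waterman maximum local score (linear gap cost is a simplification)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def sequence_probability(seq, q):
    prob = 1.0
    for a in seq:
        prob *= q[a]
    return prob


def exact_p_value(m, n, s0, q):
    """Sum Pr(x, y) over every pair of lengths (m, n) whose score is at least s0."""
    letters = sorted(q)
    total = 0.0
    for x in product(letters, repeat=m):
        px = sequence_probability(x, q)
        for y in product(letters, repeat=n):
            if local_score(x, y) >= s0:
                total += px * sequence_probability(y, q)
    return total


q_uniform = {c: 0.25 for c in "ACGT"}
# 4**3 * 4**3 = 4096 pairs are enumerated; a score of 15 requires x == y.
print(exact_p_value(3, 3, 15, q_uniform))
```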

### 1.2. Importance sampling

We could sample *N* sequence pairs (*x, y*) directly from the probability distribution Pr(*x, y*), and then compute *s*(*x, y*) for each sampled sequence pair. The fraction of the sampled sequence pairs for which *s*(*x, y*) ≥ *s*_{0} would thus be an estimate of the statistical significance of a score *s*_{0}. However, the error in the estimate is

$$\sqrt{\frac{p(s_0)\,\big[1 - p(s_0)\big]}{N}}, \tag{6}$$

which will be greater than *p*(*s*_{0}) itself unless *N* ≥ 1/*p*(*s*_{0}). Because it is not unusual for *p*(*s*_{0}) to be 10^{−10} or smaller, the cost of using *N* ≥ 1/*p*(*s*_{0}) samples is often prohibitive.

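However, we can make the approximation more efficient by using importance sampling (Liu, 2001). We observe that

$$p(s_0) = \sum_{(x, y)} \Pr(x, y)\, \Theta\big(s(x, y) \ge s_0\big) = \sum_{(x, y)} \Pr{}_{*}(x, y)\, \frac{\Pr(x, y)\, \Theta\big(s(x, y) \ge s_0\big)}{\Pr{}_{*}(x, y)} = \sum_{(x, y)} \Pr{}_{*}(x, y)\, f_{*}(x, y) \tag{8}$$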

for any probability distribution Pr_{*}(*x*, *y*), where the last equality defines *f*_{*}(*x*, *y*). Thus, if we draw samples according to Pr_{*}(*x*, *y*), we can estimate *p*(*s*_{0}) as the average value over these samples of the function *f*_{*}(*x*, *y*). This estimator has an uncertainty of

$$\sqrt{\frac{1}{N}\left[\sum_{(x, y)} \Pr{}_{*}(x, y)\, f_{*}(x, y)^{2} - p(s_0)^{2}\right]},$$

which, we shall soon show, is significantly better than Equation (6) when Pr_{*}(*x*, *y*) is well chosen.
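
The identity above can be checked on a toy problem whose exact answer is known. The sketch below is an assumption-laden stand-in, not the paper's alignment sampler: the "score" is driven by the number of matching positions among *n* ungapped positions, each matching with null probability 0.25, so the exact tail is binomial. The tilted match probability `q_star = 0.8` and all function names are hypothetical choices for illustration.

```python
import math
import random


def exact_tail(n, k0, q=0.25):
    """Exact Pr(K >= k0) for K ~ Binomial(n, q)."""
    return sum(math.comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(k0, n + 1))


def importance_estimate(n, k0, q_star, N, rng, q=0.25):
    """Average f_* = Theta * Pr / Pr_* over N draws from the tilted distribution."""
    f_values = []
    for _ in range(N):
        k = sum(rng.random() < q_star for _ in range(n))  # matches under Pr_*
        if k >= k0:
            f = (q / q_star) ** k * ((1 - q) / (1 - q_star)) ** (n - k)
        else:
            f = 0.0
        f_values.append(f)
    p_hat = sum(f_values) / N
    variance = sum((f - p_hat) ** 2 for f in f_values) / (N - 1)
    return p_hat, math.sqrt(variance / N)


rng = random.Random(0)
p_hat, std_err = importance_estimate(n=50, k0=40, q_star=0.8, N=20000, rng=rng)
print(p_hat, std_err, exact_tail(50, 40))
```

Direct sampling from the null model would need on the order of 1/*p* draws to see even one qualifying event, whereas here roughly half of the tilted draws qualify and the weights correct for the change of distribution.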

### 1.3. Previous work

Hartmann (2002) and Wolfsheimer et al. (2007) use $\Pr_{*}(x, y) \propto \Pr(x, y)\, e^{s_{\max}(x, y)/T}$ for some value *T*. To draw samples from this probability distribution, they employ a Metropolis-coupled Markov chain Monte Carlo technique (MCMCMC), wherein an (*x*, *y*) sequence pair at a given iteration is modified at a single sequence position, giving a sequence pair (*x*^{′}, *y*^{′}); the new sequence pair is then accepted with probability $\min\{1, \Pr_{*}(x', y') / \Pr_{*}(x, y)\}$, or is otherwise rejected. A collection of *N* nearly independent samples is derived from this process by allowing sufficiently many steps to occur between consecutive samples, thus allowing proper mixing to occur. They use this technique to approximate the significance of scores from the pairwise local alignment of two amino acid sequences, using the BLOSUM62 scoring matrix, SwissProt residue frequencies, and sequences of length up to 800, computing *p*-values as extreme as 10^{−70}.
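
A single-chain Metropolis sampler of this kind can be sketched as follows. This is not the cited authors' implementation: the ungapped, position-by-position `toy_score` stands in for the maximum alignment score, the null model is uniform, only one chain is run (no coupling between temperatures), and all names are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"


def toy_score(x, y, match=5, mismatch=-4):
    """Stand-in for s_max: ungapped, position-by-position score (an assumption)."""
    return sum(match if a == b else mismatch for a, b in zip(x, y))


def metropolis_chain(n, T, steps, rng):
    """Visit (x, y) with probability proportional to Pr(x, y) * exp(score / T).

    With a uniform null model and a uniform single-site proposal, the
    acceptance probability reduces to min(1, exp((s_new - s_old) / T)).
    """
    x = [rng.choice(ALPHABET) for _ in range(n)]
    y = [rng.choice(ALPHABET) for _ in range(n)]
    s = toy_score(x, y)
    scores = []
    for _ in range(steps):
        seq = x if rng.random() < 0.5 else y
        pos = rng.randrange(n)
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)          # propose a single-site change
        s_new = toy_score(x, y)
        if rng.random() < math.exp(min(0.0, (s_new - s) / T)):
            s = s_new                            # accept
        else:
            seq[pos] = old                       # reject and restore
        scores.append(s)
    return scores


rng = random.Random(1)
hot = metropolis_chain(n=20, T=1e9, steps=4000, rng=rng)   # effectively the null model
cold = metropolis_chain(n=20, T=2.0, steps=4000, rng=rng)  # biased toward high scores
print(sum(hot[2000:]) / 2000, sum(cold[2000:]) / 2000)
```

Consecutive states of such a chain are correlated, which is why many steps must separate the samples that are kept.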

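Instead, we choose

$$\Pr{}_{*}(x, y) \propto \Pr(x, y)\, e^{s_{\mathrm{free}}(x, y)/T} = \Pr(x, y) \sum_{A} e^{s(x, y, A)/T}, \tag{11}$$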

defining Pr_{*} (*x, y*) with the free alignment score of a sequence pair rather than the maximum alignment score of the pair. To proceed, we must choose a value for *T*; we must demonstrate how to efficiently sample from this probability distribution; and we must demonstrate that the resulting estimate is statistically efficient.

## 2. Methods

We explain the mathematics of the approach here, leaving some of the implementation details until Section 6 in the Supplementary Materials. (See online supplementary material at *www.liebertonline.com*.)

### 2.1. Efficient sampling

We define forward sum functions (also termed partition functions):

$$Z(x, y, T) = \sum_{A} e^{s(x, y, A)/T}, \qquad Z(T) = \sum_{(x, y)} \Pr(x, y)\, Z(x, y, T).$$

The value of *Z* (*x, y, T*) is computed via a forward-sum adaptation of the relevant Viterbi algorithm for finding a maximum score alignment. In short, wherever a score *s* would be added to a value, instead a factor *e ^{s/T}* multiplies the value; whenever a maximum of alternative values would be chosen, instead the sum of these values is taken. For a discussion on forward algorithms for pair-HMM sequence alignment, see Durbin et al. (2006), or for a more general discussion on forward algorithms for hidden Markov models, see Rabiner and Juang (1986).
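
The substitution rule just described can be sketched side by side with the maximum-score recursion. The code below is an illustration under stated assumptions: it uses a linear gap cost rather than separate start and extension costs, it does not enforce the insertion-ordering convention of Section 1.1 (so some alignments may be counted more than once), and it works with logarithms to avoid overflow at low *T*. The function names are hypothetical.

```python
import math


def pair_score(a, b, match=5.0, mismatch=-4.0):
    return match if a == b else mismatch


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def viterbi_local(x, y, gap=-16.0):
    """Maximum local alignment score (Smith-Waterman recursion)."""
    H = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


def forward_local_log(x, y, T, gap=-16.0):
    """ln of the forward sum: adding s becomes multiplying by e^(s/T), max becomes sum."""
    F = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]  # ln(1) = 0 on the borders
    cells = []
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = log_sum_exp([0.0,
                                   F[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]) / T,
                                   F[i - 1][j] + gap / T,
                                   F[i][j - 1] + gap / T])
            cells.append(F[i][j])
    return log_sum_exp(cells)


x, y = "ACGTACGT", "TTACGTAA"
s_max = viterbi_local(x, y)
T = 0.01
print(s_max, T * forward_local_log(x, y, T))
```

As *T* tends to zero, the free score *T* ln *Z* approaches the maximum score from above, because the best alignment dominates the sum.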

The value of *Z* (*T*) is computed by use of the single mean value

$$\sum_{a, b} q_a\, q_b\, e^{s(a, b)/T}$$

in the forward-sum alignment algorithm in lieu of any individual value, *e*^{*s*(*a, b*)/*T*}, for a given nucleotide/residue pair (*a, b*). Here *q*_{a} is the null-model probability of the nucleotide/residue *a*, as determined by GC-content or database residue frequency, and similarly for *q*_{b}.

We proceed by sampling an (*x, y, A*) triplet from the probability distribution

$$\Pr{}_{T}(x, y, A) = \frac{\Pr(x, y)\, e^{s(x, y, A)/T}}{Z(T)}$$

via a backtrace through the forward-sum alignment algorithm that computes *Z* (*T*). In that backtrace, we sample the alignment (i.e., the locations of pairings and insertion starts and extensions) in the usual way. See Durbin et al. (2006) for a discussion on probabilistic sampling for sequence alignment, and see Newberg (2008) for the memory efficient backtrace procedure that we employed. During the backtrace, we also sample the paired and unpaired nucleotides/residues that comprise the sequences *x* and *y*. An unpaired nucleotide/residue *a* is sampled with probability *q*_{a}, whereas a pair of nucleotides/residues (*a, b*) is sampled with probability

$$\frac{q_a\, q_b\, e^{s(a, b)/T}}{\sum_{a', b'} q_{a'}\, q_{b'}\, e^{s(a', b')/T}}.$$

Then, by disregarding the sampled alignment *A*, we are, in effect, sampling a sequence pair (*x, y*) from the marginal probability distribution

$$\Pr{}_{T}(x, y) = \sum_{A} \Pr{}_{T}(x, y, A) = \frac{\Pr(x, y)\, Z(x, y, T)}{Z(T)},$$

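the promised importance sampling probability distribution of Equation (11). Thus, we estimate the significance of a score *s*_{0} as indicated by Equation (8):

$$p(s_0) \approx \frac{1}{N} \sum_{(x, y)} f_T(x, y) = \frac{Z(T)}{N} \sum_{(x, y)} \frac{\Theta\big(s(x, y) \ge s_0\big)}{Z(x, y, T)},$$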

where the sums are over the sequence pairs sampled from the Pr_{T} (*x, y*) probability distribution.
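
The residue-sampling step of the backtrace can be illustrated in isolation. The sketch below builds the temperature-dependent distribution over paired residues and draws from it; the uniform nucleotide frequencies, the +5/−4 pairing scores, and the function names are assumptions for illustration. The normalizing constant returned alongside the distribution is the single mean value used when computing *Z*(*T*).

```python
import math
import random


def tilted_pair_distribution(q, score, T):
    """Probabilities proportional to q_a * q_b * exp(s(a, b) / T), plus their normalizer."""
    weights = {(a, b): q[a] * q[b] * math.exp(score(a, b) / T) for a in q for b in q}
    mean_value = sum(weights.values())
    return {pair: w / mean_value for pair, w in weights.items()}, mean_value


def sample_pair(dist, rng):
    """Draw one (a, b) pair from the tilted distribution."""
    pairs = list(dist)
    return rng.choices(pairs, weights=[dist[p] for p in pairs], k=1)[0]


def nucleotide_score(a, b):
    return 5.0 if a == b else -4.0


q = {c: 0.25 for c in "ACGT"}
T = 9.0 / math.log(12.0)  # chosen so that matched pairs receive total probability 0.8
dist, mean_value = tilted_pair_distribution(q, nucleotide_score, T)
print(sum(p for (a, b), p in dist.items() if a == b))
print(sample_pair(dist, random.Random(0)))
```

Lowering *T* shifts probability toward high-scoring pairs, which is how the sampler produces sequence pairs with extreme alignment scores.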

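We estimate the uncertainty in the estimate via the samples themselves:

$$\sigma \approx \sqrt{\frac{1}{N(N - 1)} \sum_{(x, y)} \big[f_T(x, y) - \hat{p}(s_0)\big]^{2}},$$

where $f_T(x, y) = Z(T)\, \Theta\big(s(x, y) \ge s_0\big) / Z(x, y, T)$ and $\hat{p}(s_0)$ is the sample average of $f_T$.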
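
Because *p*-values such as 10^{−1314} lie far below the smallest representable double-precision number, an implementation presumably has to accumulate these sums in logarithmic form. The following sketch shows one way to do that, assuming each sample supplies ln *f*_{T} = ln *Z*(*T*) − ln *Z*(*x, y, T*) when the score reaches *s*_{0} and −∞ otherwise; the function names are hypothetical, and the standard error of the mean is used as the uncertainty.

```python
import math


def log_sum_exp(values):
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_space_estimate(log_f):
    """Return (log10 of the estimate, relative standard error) from ln f_T values."""
    N = len(log_f)
    log_sum = log_sum_exp(log_f)
    if log_sum == -math.inf:
        return -math.inf, math.inf            # no sample reached the threshold
    log_sum_sq = log_sum_exp([2.0 * v for v in log_f])
    r = math.exp(log_sum_sq - 2.0 * log_sum)  # sum(f^2) / sum(f)^2
    relative_variance = (N * r - 1.0) / (N - 1.0)
    log10_p = (log_sum - math.log(N)) / math.log(10.0)
    return log10_p, math.sqrt(max(relative_variance, 0.0))


# Weights 0, 2, 4, 6 scaled by e^-3000: far too small for ordinary floats.
shift = -3000.0
samples = [-math.inf, math.log(2.0) + shift, math.log(4.0) + shift, math.log(6.0) + shift]
print(log_space_estimate(samples))
```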

### 2.2. Choice of T

The choice for *T* depends upon the value of *s*_{0} for which we wish to compute the significance. We find that, for *s*_{0} sufficiently high, a good choice for *T* is one for which *s*_{0} is near the average score for an (*x, y, A*) triplet drawn from the Pr_{T} (*x, y, A*) probability distribution. That is, we use *T* such that

$$s_0 = \sum_{(x, y, A)} \Pr{}_{T}(x, y, A)\, s(x, y, A) = -T^{2}\, \frac{d}{dT} \ln Z(T).$$

We find a value for *T* by using the Newton method for zero-finding on the function $g(T) = -T^{2}\, \frac{d}{dT} \ln Z(T) - s_0$, where the derivative is approximated through differencing of *Z*(*T*) for two nearby values of *T*. This works well for high scores, which are usually the most interesting; unfortunately, when *s*_{0} is less extreme, for example, when *p*(*s*_{0}) ≥ 10^{−6}, the above procedure produces a value of *T* that is too low. For a description of how we handle this situation, see Section 6 in the Supplementary Materials. (See online supplementary material at *www.liebertonline.com*.)
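
The temperature search can be sketched on a toy partition function whose answer is known in closed form. Here `log_Z` is an assumption standing in for the forward-sum computation: it describes *n* ungapped positions with +5/−4 pairing scores and match probability 0.25, for which the mean score equals *s*_{0} = 400 exactly at *T* = 9/ln 24. The step sizes, starting temperature, and function names are hypothetical.

```python
import math


def log_Z(T, n=100, match=5.0, mismatch=-4.0, q_match=0.25):
    """Toy ln Z(T) for n ungapped positions (a stand-in for the forward sum)."""
    a, b = match / T, mismatch / T
    top = max(a, b)
    inner = q_match * math.exp(a - top) + (1.0 - q_match) * math.exp(b - top)
    return n * (top + math.log(inner))


def mean_score(T, rel_step=1e-4):
    """E_T[s] = -T^2 d ln Z / dT, with the derivative taken by central differencing."""
    h = rel_step * T
    return -T * T * (log_Z(T + h) - log_Z(T - h)) / (2.0 * h)


def solve_temperature(s0, T=3.0, tol=1e-8, max_iter=50):
    """Newton iteration on g(T) = E_T[s] - s0, using a differenced slope."""
    for _ in range(max_iter):
        g = mean_score(T) - s0
        d = 1e-3 * T
        slope = (mean_score(T + d) - mean_score(T)) / d
        T_new = T - g / slope
        if T_new <= 0.0:
            T_new = T / 2.0          # keep the temperature positive
        if abs(T_new - T) < tol * T:
            return T_new
        T = T_new
    return T


T_star = solve_temperature(400.0)
print(T_star, mean_score(T_star))
```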

We also have had some success with choosing the temperature so that the threshold *s*_{0} is near the median *s*_{max} score of 11 sampled sequence pairs.

## 3. Results

Figure 1 graphs the results of our significance calculations for five problem instances. We performed calculations on amino acid sequences using BLOSUM62 scores, SwissProt residue frequencies, an insertion start penalty of 12, and an insertion extension penalty of 1 for sequences of length . We also performed calculations on nucleotide sequences using the default SSEARCH parameters (a score of +5 for a pairing of matching nucleotides, a score of −4 for a pairing mismatch, a score of −16 for an insertion start, and a score of −4 for an insertion extension) for sequences of length .

**Figure 1.** (**Left**) The panels give log_{10}(*p*-value) as a function of alignment score. For instance, the *p*-value of a score of 6000 for the alignment of two amino acid sequences of length 1000 residues is (3.4 ± 0.3) × 10^{−1314}. **...**

For example, the results for the (*m, n*) = (1000, 1000) problem for amino acid sequences were computed to a tight precision using *N* = 10,000 samples for each of 76 temperatures (see Section 6 in the Supplementary Materials; see online supplementary material at *www.liebertonline.com*), with each temperature requiring 13–14 hours of run time. We were able to make these calculations in parallel, one Friday night, with 76 of the CPUs in our cluster. Typical relative error is approximately σ = 10% of the computed *p*-value; thus, the absolute error for log_{10} (*p*-value) is approximately 0.05. The running time is approximately for each temperature.

## 4. Discussion

For the alignment of a pair of amino acid sequences of length 1000, we are able to precisely approximate the significance of alignment scores, even though the *p*-values can be nearly as low as 10^{−4000} (Fig. 1). Pre-computation of *p*-values for a variety of interesting alignment types and problem sizes is feasible; thus, lookup tables could fruitfully be employed by current alignment packages to report *p*-values to their users.

Furthermore, the similarity among the curves within each panel in the right half of Figure 1 indicates that significance scales nicely with the input sequence lengths (*m, n*). We can approximate the significance of a score *s*_{0} for input sequence lengths (*m, n*) from the significance values for input sequence lengths ($\bar{m}, \bar{n}$) as

$$\log p_{m, n}(s_0) \approx \frac{\log p_{m, n}(M)}{\log p_{\bar{m}, \bar{n}}(\bar{M})}\, \log p_{\bar{m}, \bar{n}}\!\left(s_0\, \frac{\bar{M}}{M}\right), \tag{22}$$

where *M* is the maximum achievable alignment score for input sequence lengths (*m, n*), and $\bar{M}$ is the maximum achievable alignment score for input sequence lengths ($\bar{m}, \bar{n}$), and where these maximum achievable scores and their *p*-values are calculated exactly by inspection.
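
A lookup-and-rescale routine along these lines might look as follows. The form of the scaling rule used here (rescale the score axis by the ratio of maximum scores and the log *p*-value axis by the ratio of the log *p*-values of those maximum scores) is an assumption consistent with the description above. The reference value −1314 at score 6000 comes from the text; the maximum score of 11 per residue assumes the BLOSUM62 W/W entry, and the tryptophan frequency `Q_W` is a hypothetical placeholder.

```python
import math


def scaled_log10_p(s0, M, log10_p_M, M_ref, log10_p_M_ref, ref_curve):
    """Rescale a reference log10(p) curve to other sequence lengths (assumed rule)."""
    s_ref = s0 * M_ref / M                       # map the score onto the reference axis
    return (log10_p_M / log10_p_M_ref) * ref_curve(s_ref)


REF_TABLE = {6000: -1314.0}                      # log10 p at lengths (1000, 1000), from the text


def ref_curve(score):
    return REF_TABLE[round(score)]


Q_W = 0.011                                      # hypothetical residue frequency
log10_p_M_ref = (1000 + 1000) * math.log10(Q_W)  # both sequences entirely W, lengths 1000
log10_p_M = (10000 + 10000) * math.log10(Q_W)    # same, lengths 10,000

estimate = scaled_log10_p(60000, 11 * 10000, log10_p_M, 11 * 1000, log10_p_M_ref, ref_curve)
print(round(estimate))  # -13140
```

Under these assumptions the rescaled value is close to the log_{10} (*p*-value) ≈ −13,100 that the text reports from direct simulation for *s*_{0} = 60,000.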

In the limit, as the length of the sequences being aligned tends to infinity, the maximum scores for local *ungapped* alignments are expected to follow a Gumbel distribution

$$p(s_0) = 1 - \exp\!\big(-\alpha\, e^{-\beta s_0}\big)$$

for some positive values α and β (Karlin and Altschul, 1990; Eddy, 2008). When *s*_{0} is large,

$$p(s_0) \approx \alpha\, e^{-\beta s_0}, \qquad \text{so that} \qquad \log p(s_0) \approx \log \alpha - \beta s_0.$$

Accordingly, the curves in Figure 1 might be expected to be straight lines, at least for high values of *s*_{0}. For the pairwise alignment of amino acid sequences of length up to 800 and *p*-values down to 10^{−70}, Wolfsheimer et al. (2007) observe that these curves are not straight lines, and we now extend that observation to *p*-values nearly as extreme as 10^{−4000}.
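
The straight-line expectation for a Gumbel tail can be checked numerically. In this sketch the parameter values α = 100 and β = 0.3 are arbitrary illustrations, and the function names are hypothetical; `math.expm1` is used so that the tail probability does not round to zero when it is tiny.

```python
import math


def gumbel_p_value(s0, alpha, beta):
    """p(s0) = 1 - exp(-alpha * exp(-beta * s0)), computed stably with expm1."""
    return -math.expm1(-alpha * math.exp(-beta * s0))


def log10_tail_line(s0, alpha, beta):
    """Large-s0 approximation: log10 p is linear in s0."""
    return (math.log(alpha) - beta * s0) / math.log(10.0)


alpha, beta = 100.0, 0.3
for s0 in (0.0, 20.0, 200.0):
    print(s0, math.log10(gumbel_p_value(s0, alpha, beta)), log10_tail_line(s0, alpha, beta))
```

The two columns agree closely once *s*_{0} is large and differ at small *s*_{0}, where the *p*-value saturates near 1.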

Wolfsheimer et al. (2007) hypothesize that the tails of these curves are parabolic. We find that the best-fit parabola of the log_{10} (*p*-value) data for the BLOSUM62 tail scores 5,500–11,000 is . The best-fit parabola has a root mean square error of approximately 14 with respect to our estimates. That is fairly good, although well in excess of the typical 0.05 error estimated from our simulations. In comparison, for these BLOSUM62 tail scores, the root mean square error for the best-fit line is 84; the error for the best-fit cubic is 6.7; and the error for the best-fit quartic is 3.6.

Because the run time of the algorithm, for fixed sequence lengths *m* and *n*, is approximately , the calculation can be orders of magnitude faster if good precision is not required. For instance, the example of a score of 6000 for amino acid sequences of length 1000 that is given in Figure 1, can be computed as log_{10} (*p*-value) = −1314 ± 1 with *N* = 100 samples in 39 minutes using one of our processors, and likely even faster once our prototype software attains production quality. Using the *N* = 100 reduced number of samples, an alignment of two sequences of length 10,000 requires approximately 13–14 hours to compute the value log_{10} (*p*-value) ≈ −13,100 for *s*_{0} = 60,000. This *p*-value might also have been estimated much more quickly, via Equation (22), from the −1314 value just mentioned.

We have focused on local sequence alignments with certain parameters, but the technique is easily applied to other sets of parameters and other alignment models. For instance, we conjecture that this technique will work well with profile-HMMs (Durbin et al., 2006), which permit the alignment model parameters (e.g., the nucleotide/residue frequencies and the indel penalties) to vary by position within a sequence.

Further, we conjecture that this technique will also be efficient when used to estimate extreme statistical significance values for a much broader set of applications, such as many hidden Markov models that emit a sequence of data (Durbin et al., 2006; Rabiner and Juang, 1986). For a given sequence of data, often it is the case that we wish to know a maximum-probability/Viterbi path through the model. The statistical significance of the log-probability of such a path can be estimated via our technique, as can the statistical significance of the sum of the probabilities over all paths emitting the given data sequence. We conjecture that, in many cases, such estimates will be quite efficient. Likewise, we conjecture efficiency for many of the even broader class of finite state automata, which are much like hidden Markov models, except that the scores are not necessarily derived from log probabilities; for instance, often scores are derived from (or interpreted as) log odds, the logarithm of the ratio of the probabilities from given alternative and null models (Karlin and Altschul, 1990).

## 5. Conclusion

We have presented a method for efficiently and accurately approximating the *p*-value for a pairwise alignment score. The approach can be used for a variety of definitions of local or global alignment and for a variety of definitions of alignment score. Via the pre-computation of *p*-values for a selection of interesting problem instances, algorithms that make pairwise alignments can employ scaling to quickly approximate the *p*-value for any given problem instance. A web server for this application is available at *http://bayesweb.wadsworth.org/alignmentSignificanceV1/*.

## Acknowledgments

We thank the Computational Molecular Biology and Statistics Core Facility at the Wadsworth Center for the computing resources to make these calculations. We thank Charles E. Lawrence for helpful ideas and suggestions. This research was supported by the National Institutes of Health/National Human Genome Research Institute (grant K25 HG003291 to L.A.N.).

## Disclosure Statement

No competing financial interests exist.

## References

- Durbin R. Eddy S. Krogh A., et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press; Cambridge, UK: 2006.
- Eddy S.R. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput. Biol. 2008;4:e1000069.
- Hartmann A.K. Sampling rare events: statistics of local sequence alignments. Phys. Rev. E. 2002;65:056102.
- Karlin S. Altschul S.F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA. 1990;87:2264–2268.
- Liu J.S. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag; New York: 2001.
- Mitrophanov A.Y. Borodovsky M. Statistical significance in biological sequence analysis. Brief Bioinform. 2006;7:2–24.
- Needleman S.B. Wunsch C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970;48:443–453.
- Newberg L.A. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment. Bioinformatics. 2008;24:1772–1778.
- Pearson W.R. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics. 1991;11:635–650.
- Rabiner L.R. Juang B.-H. An introduction to hidden Markov models. IEEE ASSP Mag. 1986;3:4–16.
- Smith T.F. Waterman M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197.
- Viterbi A.J. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory. 1967;13:260–269.
- Wolfsheimer S. Burghardt B. Hartmann A.K. Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail. Algorithms Mol. Biol. 2007;2:9.

**Mary Ann Liebert, Inc.**

Significance of Gapped Sequence Alignments. Journal of Computational Biology. Nov 2008; 15(9):1187.
