Nucleic Acids Res. 2005; 33(15): 4987–4994.
Published online Sep 6, 2005. doi:  10.1093/nar/gki800
PMCID: PMC1199557

The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment

Abstract

The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster, if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243–260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.

INTRODUCTION

Local sequence alignment is an indispensable computational tool in modern molecular biology. It is frequently used to infer the functional, structural and evolutionary relationships of a novel protein or DNA sequence by finding similar sequences of known function in a database. Arguably, the most important sequence database search program available is BLAST (the Basic Local Alignment Search Tool) (1,2). Using a heuristic algorithm, BLAST implicitly performs a local alignment of a protein or DNA query against sequences in the corresponding database. The BLAST output then ranks each potential database match according to an E-value, which is derived from the corresponding local maximum score, given in bits. For each local maximum score y, the corresponding E-value Ey gives (under a random model) the expected number of false positives with a lower rank in the output. Thus, a small E-value indicates that the corresponding alignment is unlikely to occur by chance alone, whereas a large E-value indicates an unremarkable alignment. Without doubt, BLAST's E-values contribute substantially to its popularity.

Let us discuss the BLAST E-value Ey further here. (The Materials and Methods section also continues the discussion.) BLAST assumes a random model in which each unrelated pair of sequences A[1, m] = A1 ··· Am and B[1, n] = B1 ··· Bn consists of random letters chosen independently from a background distribution. BLASTP (BLAST for proteins), e.g. assumes that random proteins are composed of amino acids chosen independently from the Robinson and Robinson frequency distribution (3). BLAST also requires as input a matrix s(Ai, Bj) for scoring matches between the letters Ai and Bj. BLASTP, e.g. uses the BLOSUM62 scoring matrix (4) as its default, offering as alternatives a few other PAM (5) and BLOSUM matrices. BLAST also enhances its detection of remote sequence similarities by using gapped sequence alignment. The cost of introducing a gap into an alignment is given by the ‘gap penalty’ Δ(g), where g is the gap length. Practical gap penalties Δ are usually super-additive, i.e. Δ(g) + Δ(h) ≥ Δ(g + h), so the concatenation of optimal subsequence alignments has a score no less than the sum of their scores. (However, our theory is not restricted to super-additive gap penalties.) Affine gap penalties Δ(g) = a + bg are typical in database searches. We refer to the letter distribution, the scoring matrix and the gap penalty collectively as ‘BLAST parameters’.

Throughout the paper, we assume a ‘logarithmic regime’ (6) where the alignment scores of long random sequences have a negative expectation. In the logarithmic regime, the BLAST E-value Ey is approximately

$$E_y \approx k\,m\,n\,e^{-\lambda y} \tag{1}$$

for large y. Under a Poisson approximation (7) for large y, the E-value Ey yields the P-value Py = 1−exp(−Ey). Because of Equation 1, the tail probability Py corresponds to a Gumbel distribution with ‘scale parameter’ λ and ‘pre-factor’ k.
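
As a purely illustrative sketch (not part of the original software), Equation 1 and the Poisson approximation can be evaluated as follows. The function names `e_value` and `p_value` are hypothetical, and the numerical value λ = 0.267 in the usage lines is an assumed figure chosen only for illustration; k = 0.041 is the approximation quoted later for the BLASTP defaults.

```python
import math


def e_value(y, m, n, lam, k):
    """Equation 1: E_y is approximately k * m * n * exp(-lam * y) for large y."""
    return k * m * n * math.exp(-lam * y)


def p_value(e):
    """Poisson approximation: P_y = 1 - exp(-E_y).

    expm1 keeps precision when E_y is very small.
    """
    return -math.expm1(-e)


# Illustrative usage with assumed parameter values (lam is an assumption).
e = e_value(y=50, m=300, n=300, lam=0.267, k=0.041)
p = p_value(e)
```

For small E-values the two quantities nearly coincide, which is why the E-value is often read directly as a tail probability.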

For ungapped local alignment (i.e. the special case Δ(g) = ∞, which disallows gaps in the optimal local alignment), a rigorous theory furnishes analytic formulas for the Gumbel parameters λ and k (7,8). For gapped local alignment, analytic results are scarce and usually come at a price: they depend on approximations whose accuracy in general is unknown (9–12). In the absence of a rigorous theory for gapped local alignment, computer simulations have confirmed the validity of Equation 1 (13–16), and in the absence of formulas, they also have provided estimates of λ and k (16–19).

Because of the exponentiation in Equation 1, errors in λ have a greater practical impact than errors in k. Thus, for use in BLAST, λ must be known to within 1–4% relative error; k, to within 10% (20). Therefore, in statements about computational speed, the following implicitly assumes that the estimation of λ and k is carried out to these accuracies, unless stated otherwise.

Presently, the BLAST program precomputes λ and k offline, using the so-called ‘island method’ (15,20). Because of the precomputation, users are given a narrow choice indeed of BLAST parameters. The choice of BLAST parameters would be much less restricted, if λ and k could be computed online (in, say, less than 1 s) before searching a database with arbitrary BLAST parameters. Accordingly, much recent research has been directed toward speeding estimation of λ and k.

With the ultimate aim of estimating λ and k online, Bundschuh gave some interesting conjectures about λ (21,22). He then applied them in global alignment simulations that estimated λ as much as five times faster than the island method. Later, we extended his conjectures, reducing the sequence length required to estimate λ by almost a factor of 10 (23).

Despite their obvious promise, and even with further improvements in speed, global alignment simulations will remain impractical for online estimation in BLAST, unless they can be made to estimate k as well. To remedy the problem, we relate k to global alignment and then exploit the relationship in simulations that estimate both λ and k.

MATERIALS AND METHODS

Notation for global sequence alignment

We denote the non-negative integers by Z+ = {0, 1, 2, 3, …}. Throughout the paper, the letters g, h, i, j, m, n and y denote integers.

Consider a pair A = A1A2… and B = B1B2… of infinite sequences. The corresponding global alignment graph Γ is a directed and weighted lattice graph in two dimensions, as follows. The vertices of Γ are $v = (i, j) \in \mathbb{Z}_+^2$, the non-negative two-dimensional integer lattice. Three sets of directed edges e come out of each vertex v = (i, j): northward, northeastward and eastward. One northeastward edge goes into (i + 1, j + 1) with weight s(Ai+1, Bj+1). For each g > 0, one eastward edge goes into (i + g, j) and one northward edge goes into (i, j + g); both are assigned the same weight −Δ(g) < 0. For simplicity, we assume s(Ai, Bj) and Δ(g) are always integers, with greatest common divisor 1.

A directed path π = (v0, e1, v1, e2, …, eh, vh) in Γ is a finite, alternating sequence of vertices and edges that starts and ends with a vertex. We say that the path π starts at v0 and ends at vh. For instance, each gapped alignment of the subsequences A[i + 1, m] = Ai+1 ··· Am and B[j + 1, n] = Bj+1 ··· Bn corresponds to exactly one directed path that starts at v0 = (i, j) and ends at vh = (m, n). The alignment's score is the ‘path weight’ $W_\pi = \sum_{i=1}^{h} W(e_i)$, the sum of the weights W(ei) of the edges ei. By convention, any trivial path π = (v0) consisting of a single vertex has weight Wπ = 0.

Let Πij be the set of all paths π starting at v0 = (0, 0) and ending at vh = (i, j). Define the ‘global score’ Sij = max{Wπ: π ∈ Πij}. The paths π starting at v0 and ending at vh with weight Wπ = Sij are ‘optimal global paths’ and correspond to ‘optimal global alignments’ between A[1, i] and B[1, j]. The Needleman–Wunsch algorithm computes the global scores Sij (24).

Let $\Pi = \bigcup_{(i,j) \in \mathbb{Z}_+^2} \Pi_{ij}$ be the set of all paths π starting at v0 = (0, 0). Define the ‘global maximum’ M = max{Wπ: π ∈ Π}, which is also the maximum $M = \max\{S_{ij} : (i,j) \in \mathbb{Z}_+^2\}$ of all global scores. Let $N(y) = \#\{(i,j) \in \mathbb{Z}_+^2 : S_{ij} = y\}$ denote the number of vertices with global score y.

Define the lattice rectangle [0, n] = {0, 1, …, n}. Our simulations involved a square subset $[0, n]^2$ of $\mathbb{Z}_+^2$. In particular, single subscripts connote quantities for the square: $M_n = \max\{S_{ij} : (i, j) \in [0, n]^2\}$, the square's global maximum; $E_n = \max\{\max_{0 \le i \le n} S_{in}, \max_{0 \le j \le n} S_{nj}\}$, its edge maximum; and $N_n(y) = \#\{(i, j) \in [0, n]^2 : S_{ij} = y\}$, the number of its vertices with global score y.
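
The notation above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the code used for the simulations: it applies the edge definitions of Γ directly with a general gap penalty (cubic time), whereas the simulations used affine-gap algorithms. The names `global_scores` and `square_summaries`, and the toy scoring scheme in the usage lines, are hypothetical.

```python
from collections import Counter


def global_scores(a, b, score, gap):
    """Return S with S[i][j] = global score of a[:i] versus b[:j].

    score(x, y) is the substitution score; gap(g) is the penalty Delta(g) > 0.
    Follows the graph definition: one diagonal edge plus gap edges of every length.
    """
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue  # the trivial path at the origin has weight 0
            best = float("-inf")
            if i > 0 and j > 0:
                best = S[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            for g in range(1, i + 1):  # gap edges arriving along the first axis
                best = max(best, S[i - g][j] - gap(g))
            for g in range(1, j + 1):  # gap edges arriving along the second axis
                best = max(best, S[i][j - g] - gap(g))
            S[i][j] = best
    return S


def square_summaries(S):
    """Return (M_n, E_n, N_n) for a square score table S of size (n+1) x (n+1)."""
    n = len(S) - 1
    M_n = max(max(row) for row in S)
    E_n = max(max(S[i][n] for i in range(n + 1)),
              max(S[n][j] for j in range(n + 1)))
    N_n = Counter(v for row in S for v in row)  # N_n[y] = number of vertices with score y
    return M_n, E_n, N_n


# Toy usage: match +1, mismatch -1, affine gap Delta(g) = 2 + g (all assumed values).
S = global_scores("AB", "AB", lambda x, y: 1 if x == y else -1, lambda g: 2 + g)
M_n, E_n, N_n = square_summaries(S)
```

For affine penalties, a three-state recursion would reduce the cost from cubic to quadratic time; the direct form is shown only because it mirrors the definitions.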

The formula for k from global alignment

We can show heuristically that $k = \lim_{y \to \infty} k_y$, where

$$k_y = \frac{e^{\lambda y}}{1 - e^{-\lambda}} \cdot \frac{\mathbb{P}(M = y)^2}{\mathbb{E}N(y)} \tag{2}$$

(see our Appendix, online). Ultimately, the heuristics behind Equation 2 are based on two observations about random sequence matches. First, the two ends of a strong local alignment match are the mirrors of each other. Second, the right end of a strong alignment match looks the same for both local and global alignment.

Equation 2 computes ky from three components: the scale parameter λ, the probability P(M = y) of a global maximum y, and the expected number EN(y) of vertices with global score Sij = y. We now describe how our simulations determined the three components.
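
To illustrate how the three components combine, the following sketch evaluates Equation 2 directly. The function name `k_y` and its argument names are hypothetical; the inputs are assumed to be estimates already obtained by simulation.

```python
import math


def k_y(y, lam, p_max_eq_y, mean_count_y):
    """Equation 2: k_y = e^{lam*y} / (1 - e^{-lam}) * P(M = y)^2 / E N(y)."""
    prefactor = math.exp(lam * y) / (1.0 - math.exp(-lam))
    return prefactor * p_max_eq_y ** 2 / mean_count_y
```

As a consistency check on the formula, if P(M = y) and EN(y) both decay exactly like a constant times exp(−λy), then k_y no longer depends on y.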

Numerical scheme for λ

First, we estimated λ from random global alignments (23). All simulations used affine gap penalties Δ(g) = a + bg and the corresponding global alignment algorithms for computing Sij (25).

Recall the edge maximum En (defined at the end of the notation for global sequence alignment). As shown elsewhere (23), its cumulant generating function satisfies

$$\ln\left[\mathbb{E}\exp(\lambda E_n)\right] = \beta_0 + \beta_1(\lambda)\,n + O(\delta^n), \tag{3}$$

where 0 ≤ δ < 1. The root $\lambda = \hat{\lambda}$ of β1(λ) = 0 is our estimate for λ.

To estimate Eexp(λEn) efficiently, we used Bundschuh's importance sampling methods (21), which apply if the gap penalty is affine. Briefly, importance sampling is a variance-reduction technique for simulating rare events. In global alignment simulations, e.g. a large edge maximum is a rare event. By simulating optimal subsequence pairs in ‘hybrid alignment’ (a type of optimized Bayesian local alignment) (26), we ensured that our realizations frequently generated a large edge maximum En. Accordingly, we simulated a pair of sequences of some ‘base length’ n = l. After correcting for biases induced by the importance sampling distribution, we estimated Eexp(λEl).

Equation 3 corresponds to an asymptotic equality with two free parameters, β0 and β1(λ), which we estimated with robust regression. Robust regression was originally developed as an antidote to outliers (27), which badly skew least-squares regression (28–31). As noted elsewhere (23), however, robust regression is also remarkably suited for extracting asymptotic parameters like β0 and β1(λ).

Robust regression requires the specification of an influence function, to quantify the influence of potential outliers on the regression result. Many influence functions exist (27), but the Andrews function with a = 1.339 [(27), p. 388; (29)] works well in asymptotic regression, because it ignores points that obviously lie outside the asymptotic regime (23).

Accordingly, we applied robust regression to Equation 3. To solve β1(λ) = 0, let λu be the scale parameter for ungapped local alignment, which can be determined analytically. Because 0 ≤ λ ≤ λu, repeated bisection of the interval [0, λu] yielded an estimate $\hat{\lambda}$ for the root of the equation β1(λ) = 0. In practice, multiple roots did not occur.
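
The bisection step can be sketched generically as follows. This is an illustration only: `bisect_root` is a hypothetical helper, and `toy_beta1` is a made-up stand-in for the regression estimate of β1(λ), which in practice would be recomputed from the simulation data at each trial value of λ. Because the toy function also vanishes at 0, the bracket in the usage lines starts slightly above 0 (our choice for the example).

```python
def bisect_root(f, lo, hi, tol=1e-6, max_iter=200):
    """Find a root of f in [lo, hi] by repeated bisection.

    Requires f(lo) and f(hi) to have opposite signs (or one of them to be zero).
    """
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must bracket a root")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0 or hi - lo < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid                # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid   # root lies in the upper half
    return 0.5 * (lo + hi)


def toy_beta1(lam):
    """Made-up stand-in for beta_1(lambda), with a nontrivial root at 0.27."""
    return lam * (lam - 0.27)


lam_hat = bisect_root(toy_beta1, 0.01, 0.32)
```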

Numerical scheme for k

Next, we estimated P(M = y) and EN(y). Importance sampling has already generated sequence-pairs of base length l for estimating λ. The bias in importance sampling tends to yield large global scores Sij, ascending toward the global maximum M. To determine N(y), we needed to simulate and count all vertices with global scores Sij = y. Therefore, we extended the sequence pair beyond the base length l using random letters with the unbiased Robinson and Robinson frequencies. The global scores Sij beyond the base length l became progressively smaller, thereby permitting determination of N(y).

Given ε > 0, we simulated a random number $\bar{L}$ of unbiased letters in each sequence, until we found some total length $L = l + \bar{L}$ such that

$$(2L + 1)\exp\{-\lambda(M_L - E_L)\} \le \varepsilon. \tag{4}$$

The edge maximum EL is a maximum over 2L + 1 vertices. Therefore, for small enough stringencies ε > 0, if the edge maximum EL of the contributing 2L + 1 vertices satisfies Equation 4, it is probable that M = ML, because elongating the sequences is unlikely to increase the estimate of M. Similarly, the elongation does not increase the estimate of EN(y) much. After appropriate averaging, our simulations therefore yielded estimates $\hat{\mathbb{P}}(M = y) \approx \mathbb{P}(M_L = y)$ and $\hat{\mathbb{E}}N(y) \approx \mathbb{E}N_L(y)$ for P(M = y) and EN(y).
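
The stopping rule of Equation 4 amounts to a single test applied after each elongation. The sketch below is illustrative only; the function name `elongation_can_stop` and the numbers in the usage lines are assumptions.

```python
import math


def elongation_can_stop(L, M_L, E_L, lam, eps):
    """Equation 4: stop once (2L + 1) * exp(-lam * (M_L - E_L)) <= eps."""
    return (2 * L + 1) * math.exp(-lam * (M_L - E_L)) <= eps


# With the edge maximum far below the global maximum, elongation may stop.
done = elongation_can_stop(L=50, M_L=60, E_L=10, lam=0.27, eps=1e-2)
# With the edge maximum equal to the global maximum, it may not.
not_done = elongation_can_stop(L=50, M_L=60, E_L=60, lam=0.27, eps=1e-2)
```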

With the simulation estimates $\hat{\lambda}$, $\hat{\mathbb{P}}(M = y)$ and $\hat{\mathbb{E}}N(y)$ in hand, we found that errors in $\hat{\lambda}$ were negligible in practice. In contrast, the sample standard deviations (32) of $\hat{\mathbb{P}}(M = y)$ and $\hat{\mathbb{E}}N(y)$, denoted by sM and sN, were not.

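We calculated an estimate $\hat{k}_y$ for ky by substituting $\hat{\lambda}$, $\hat{\mathbb{P}}(M = y)$ and $\hat{\mathbb{E}}N(y)$ into Equation 2. We estimated the error $s(\hat{k}_y)$ in $\hat{k}_y$ from the equation

$$s(\hat{k}_y) = \max\left|\frac{e^{\hat{\lambda} y}}{1 - e^{-\hat{\lambda}}} \cdot \frac{[\hat{\mathbb{P}}(M = y) \pm s_M]^2}{\hat{\mathbb{E}}[N(y)] \pm s_N} - \hat{k}_y\right|. \tag{5}$$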

Note that Equation 5 explicitly neglects the error in the estimate $\hat{\lambda}$.
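
One possible reading of Equation 5 is sketched below, assuming that the maximum runs over the four sign combinations of ±sM and ±sN; that interpretation, the function name `k_y_error` and the argument names are our assumptions. The sketch also assumes the estimate of EN(y) exceeds sN, so that no denominator vanishes.

```python
import itertools
import math


def k_y_error(y, lam_hat, p_hat, s_M, n_hat, s_N):
    """Largest deviation of k_y when P(M = y) and E N(y) shift by one standard deviation."""
    prefactor = math.exp(lam_hat * y) / (1.0 - math.exp(-lam_hat))
    k_hat = prefactor * p_hat ** 2 / n_hat
    deviations = (
        abs(prefactor * (p_hat + dp) ** 2 / (n_hat + dn) - k_hat)
        for dp, dn in itertools.product((-s_M, s_M), (-s_N, s_N))
    )
    return max(deviations)
```

The error in the estimate of λ does not enter, in keeping with the remark above.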

Finally, we used robust regression to extract a summary estimate $\hat{k}$ from the estimates $\hat{k}_y \pm s(\hat{k}_y)$ for individual y. To begin with, consider a constant regression model η = 1α + e, where η is a column vector consisting of the values $\hat{k}_y$, 1 is a column vector whose elements are all 1, the constant α is the summary estimate $\hat{k}$, and e is the column vector consisting of the errors $s(\hat{k}_y)$.

Our ultimate aim is to compute $\hat{k}$ rapidly, with as few realizations as possible. Unfortunately, for small numbers of realizations, the errors sM and sN are correlated with the corresponding estimates $\hat{\mathbb{P}}(M = y)$ and $\hat{\mathbb{E}}N(y)$. The correlations propagate to $s(\hat{k}_y)$, noticeably biasing the summary estimate $\hat{k}$, with $\mathbb{E}\hat{k} < k$ (see Figure 1).

Figure 1
Plot of estimates for $\hat{k}_y$ against the global score y for the BLOSUM62 scoring matrix with an affine gap cost of 11 + g for a gap of length g, with random sequences whose letters are chosen according to the empirical Robinson and Robinson amino acid frequencies ...

To avoid the bias, we applied the constant regression model η′ = 1α′ + e′ to the errors $s(\hat{k}_y)$ themselves. The elements of the column vector η′ were the errors $s(\hat{k}_y)$, with the error in each $s(\hat{k}_y)$ taken to be a constant s derived through a standard formula [(27), p. 387], e′ = 1s. Robust regression thus gave a constant estimate $\alpha' = \hat{s}(\hat{k})$ of the errors $s(\hat{k}_y)$. We substituted the constant error estimate $e = 1\alpha' = 1\hat{s}(\hat{k})$ back into the constant regression η = 1α + e of $\hat{k}_y$ to derive a robust regression estimate $\hat{k}$ for k. Although somewhat ad hoc, the constant regression of the errors successfully reduced biases (see Figure 3).
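
A constant regression model with a robust influence function can be sketched as an iteratively reweighted mean. The sketch is an assumption-laden illustration, not the regression code used here: it takes the Andrews function in its common sine form, ψ(r) = a·sin(r/a) for |r| < aπ and 0 otherwise, with the tuning constant a = 1.339 quoted above, and it scales residuals by a single constant error `sigma`. The names `andrews_weight` and `robust_constant`, and the numbers in the usage lines, are hypothetical.

```python
import math
from statistics import median

ANDREWS_A = 1.339  # tuning constant quoted in the text


def andrews_weight(r, a=ANDREWS_A):
    """Weight psi(r)/r for the Andrews sine function; zero beyond |r| >= a*pi."""
    u = r / a
    if abs(u) >= math.pi:
        return 0.0  # points far outside the bulk are ignored entirely
    return 1.0 if u == 0 else math.sin(u) / u


def robust_constant(values, sigma, n_iter=50):
    """Robust estimate of alpha in the constant model eta = 1*alpha + e."""
    alpha = median(values)  # robust starting point
    for _ in range(n_iter):
        weights = [andrews_weight((v - alpha) / sigma) for v in values]
        total = sum(weights)
        if total == 0:
            break  # every point was rejected; keep the current estimate
        alpha = sum(w * v for w, v in zip(weights, values)) / total
    return alpha


# Made-up values: four estimates near 0.041 and one gross outlier.
k_hat = robust_constant([0.040, 0.041, 0.042, 0.041, 0.100], sigma=0.002)
```

In this toy example the outlier receives zero weight, so the estimate stays with the bulk of the values.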

Figure 3
Plot of relative errors of estimate k obtained via robust regression using $\hat{k}_y \pm \hat{s}(\hat{k})$ and $\hat{k}_y \pm s(\hat{k}_y)$ against different numbers of simulations. Each bar represents an average over 20 absolute relative errors. The previous best estimate ...

Even for large simulations (e.g. $10^6$ realizations), however, sampling of the event [M = y] was inadequate for many large y, with P(M = y) likely being underestimated. Although the corresponding average was unbiased (in theory, at least), we suspect that it had a distribution whose skewing increased with y. Consequently, for large y, $\hat{k}_y$ often slightly underestimated the true k, with improbable but substantial overestimations maintaining a correct expectation $\mathbb{E}\hat{k}_y = k$ (see Figure 2). The putative skewing also made the anticipated relation $\mathbb{P}(M = y) \approx e^{\lambda}\,\mathbb{P}(M = y + 1)$ fail for large y. To avoid skewing, we therefore restricted robust regression of $\hat{k}_y$ to the range [a, b] of y that minimized the function

$$f(a, b) = \frac{1}{b - a + 1} \sum_{y=a}^{b} \left| \frac{\mathbb{P}(M = y)}{\mathbb{P}(M \ge y)} - (1 - e^{-\lambda}) \right|. \tag{6}$$

Figure 2
Plot of estimates for $\hat{k}_y$ against the global score y for $10^6$ realizations. The simulation conditions were the same as in Figure 1. The error bars showing $s(\hat{k}_y)$ for the under-sampled asymptotic regime y ∈ [41, 100] are large and are omitted. ...
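
The range selection based on Equation 6 can be sketched as an exhaustive search over windows [a, b]. This is an illustration under assumptions: the text does not say how very short windows were excluded, so the parameter `min_width` is our own addition, and the function name `best_window` and the synthetic data in the usage lines are hypothetical. The tail probability P(M ≥ y) is taken from the finite table of estimates, so the ratio is distorted near the largest tabulated score.

```python
import math


def best_window(p_eq, lam, min_width=5):
    """Return (f, a, b) minimizing Equation 6 over windows of at least min_width scores.

    p_eq[y] is an estimate of P(M = y) for y = 0, ..., len(p_eq) - 1.
    """
    top = len(p_eq) - 1
    tail = [0.0] * (top + 2)  # tail[y] approximates P(M >= y)
    for y in range(top, -1, -1):
        tail[y] = tail[y + 1] + p_eq[y]
    target = 1.0 - math.exp(-lam)
    dev = [abs(p_eq[y] / tail[y] - target) if tail[y] > 0 else float("inf")
           for y in range(top + 1)]
    best = None
    for a in range(top + 1):
        running = 0.0
        for b in range(a, top + 1):
            running += dev[b]
            width = b - a + 1
            if width >= min_width and (best is None or running / width < best[0]):
                best = (running / width, a, b)
    return best


# Synthetic check: a geometric tail with the first three scores distorted.
lam = 0.3
q = math.exp(-lam)
p_eq = [(1 - q) * q ** y for y in range(61)]
for y in range(3):
    p_eq[y] *= 5.0
f_min, a, b = best_window(p_eq, lam)
```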

Software and Hardware

Computer code was written in C++ and compiled with the Microsoft® Visual C++® 6.0 compiler. The computer had a single Intel® Pentium® 4 2.8 GHz processor with 0.5 GB RAM and employed the Microsoft® Windows® 2000 operating system.

RESULTS

Tables 1 and 2 give estimates of the Gumbel parameters λ and k for all online options of the BLASTP parameters. They therefore confirm that our simulations and our formulas for k produced correct results. Other figures show results for the BLASTP default parameters, namely, the Robinson and Robinson amino acid frequencies (3), the BLOSUM62 scoring matrix and the gap cost Δ(g) = 11 + g. Other BLAST parameters tested gave comparable results, unless indicated otherwise (data not shown).

Table 1
Estimates of λ for all online options of the BLASTP parameters
Table 2
Estimates of k for all online options of the BLASTP parameters

Empirically, simulations using BLASTP default parameters needed a base length of l = 50 and a stringency ε = $10^{-2}$ to attain the required accuracies (λ, 1%; k, 10%). For scoring matrices with more dominant diagonals than BLOSUM62, shorter base lengths sufficed (e.g. for PAM30, l = 15).

Figure 1 plots the estimates $\hat{k}_y$ with their standard error bars $s(\hat{k}_y)$ against global score y, up to y = 25. Each point represents 30 000 realizations. The horizontal thick line represents the previous best estimate k ≈ 0.041 and the dotted line, the biased summary estimate $\hat{k} = 0.036$ due to the positive correlation between $\hat{k}_y$ and $s(\hat{k}_y)$. Figure 1 therefore motivated us to regress the errors in $\hat{k}_y$, to produce a constant error estimate $\hat{s}(\hat{k})$, as described in the Materials and Methods.

Figure 2 plots the estimates $\hat{k}_y$ against global score y, up to y = 100. Each point represents $10^6$ realizations. We obtained the estimate $\hat{\lambda}$ and used it to estimate $\hat{k}_y$. The range y ∈ [0, 3] is not asymptotic, so the $\hat{k}_y$ do not approximate the true k very well. The range y ∈ [4, 40] is asymptotic, and it is adequately sampled, so the $\hat{k}_y$ fluctuate randomly around the true k. The range y > 40 is also asymptotic, but it is not adequately sampled, so the $\hat{k}_y$ usually underestimate the true k. Figure 2 motivated us to regress only in the range [a, b] minimizing Equation 6, as described in the Materials and Methods.

Figure 3 plots the relative errors of the summary estimate $\hat{k}$ using $\hat{k}_y \pm s(\hat{k}_y)$ (with skewed error estimates $s(\hat{k}_y)$) and those using $\hat{k}_y \pm \hat{s}(\hat{k})$ (with constant error estimate $\hat{s}(\hat{k})$) against different numbers of realizations. All errors in $\hat{k}$ were computed relative to the approximation k ≈ 0.041. Each error plotted is the average of the absolute relative error for 20 independent simulations, each using the indicated number of realizations. White bars show the results for $\hat{k}_y \pm \hat{s}(\hat{k})$; black bars, for $\hat{k}_y \pm s(\hat{k}_y)$. For 10 000 realizations, the constant error estimate $\hat{s}(\hat{k})$ reduces the relative errors dramatically. As the number of realizations increases, the difference in efficiency of estimation between $\hat{k}_y \pm s(\hat{k}_y)$ and $\hat{k}_y \pm \hat{s}(\hat{k})$ decreases. Figure 3 shows that 10 000 realizations estimated k with less than 10% relative error. The same 10 000 realizations also estimated $\hat{\lambda}$ with less than 0.8% relative error (data not shown).

The simulations of Figure 3 estimated $\hat{k}$ from 10 000 realizations, in less than 30 s. For comparison, the same simulations could have estimated $\hat{\lambda}$ in less than 7 s. For the PAM30 matrix with Δ(g) = 9 + g, they estimated λ and k in less than 4 s.

DISCUSSION

BLAST programs (BLASTP, PSI-BLAST, etc.) are restricted to specific scoring schemes, because time-consuming local alignment simulations for estimating the corresponding Gumbel parameters must be done offline. However, simulations of global alignment can estimate the Gumbel scale parameter λ for local alignment (6). Some global alignment methods are as much as five times faster than the best local alignment methods (21,23), so global alignment has considerable potential for online estimation of the Gumbel parameter λ.

This paper surmounts an obstacle to online estimation by demonstrating that simulations of global alignment can determine the Gumbel pre-factor k. Table 2 displays the results of global alignment simulations over a wide range of BLAST parameters, all of which gave correct estimates of the corresponding k and supported the validity of our methods for computing k.

Global alignment simulation therefore appears a feasible method for estimating both Gumbel parameters, λ and k. (The BLASTP default parameters provide a standard for quantifying speed, so the following results apply to the BLASTP defaults, unless stated otherwise.) With local alignment, estimates of λ required 40 000 sequence-pairs of minimum length 600 (21); with our methods, 5000 sequence-pairs of maximum length 50 (23). In fact, our methods attained 1.3% accuracies in λ with only 1000 sequence-pairs of maximum length 50. In our hands, k was more difficult to estimate than λ, with 10% relative errors requiring 10 000 sequence-pairs of average length 140. In summary, the methods presented here for estimating the Gumbel parameters λ and k represent at least a 3-fold improvement in speed over local alignments.

Online computation of the BLAST P-value requires more than the Gumbel parameters. It also requires an estimate of the ‘finite-size effect’ (10,13,33,34). Global alignment (or some variant of it) can indeed produce the required estimate (manuscript in preparation). Without the finite-size estimate in hand, however, we were not strongly motivated to incorporate technical improvements or heuristics into our methods. Bundschuh, e.g. implemented a diagonal-cutting heuristic to remove irrelevant off-diagonal elements in the global alignment matrix (21); we did not. The heuristic could probably speed our computation by a further factor of at least three.

Online BLAST estimation of the Gumbel parameters is likely just a few years away.

Acknowledgments

We would like to acknowledge helpful discussions with Dr Ralf Bundschuh and Dr Stephen Altschul. This work was supported in whole by the Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS. Funding to pay the Open Access publication charges for this article was provided by National Library of Medicine at National Institutes of Health/DHHS.

Conflict of interest statement. None declared.

APPENDIX

In the Appendix, we give a heuristic derivation of Equation 2.

Notation for local sequence alignment

For local alignment, consider a pair $\hat{A} = \cdots \hat{A}_{-1}\hat{A}_0\hat{A}_1 \cdots$ and $\hat{B} = \cdots \hat{B}_{-1}\hat{B}_0\hat{B}_1 \cdots$ of doubly-infinite sequences. Their local alignment graph $\hat{\Gamma}$ is a directed, weighted lattice graph in two dimensions, as follows. The vertices v of $\hat{\Gamma}$ are $v = (i, j) \in \mathbb{Z}^2$, the entire two-dimensional integer lattice. In other respects, particularly with respect to the edges between its vertices, $\hat{\Gamma}$ has the same structure as the global alignment graph Γ.

We base the graph $\hat{\Gamma}$ on the entire two-dimensional integer lattice $\mathbb{Z}^2$ because of our interest in the Gumbel distribution. In intuitive terms, the BLAST E-value Ey follows the Gumbel distribution, only if the local alignment does not ‘see’ the ends of the sequences, so finite-size effects can be neglected (13,33).

Let $\hat{\Pi}_{ij}$ be the set of all paths π ending at vh = (i, j), regardless of their starting vertex. Define the ‘local score’ $\hat{S}_{ij} = \max\{W_\pi : \pi \in \hat{\Pi}_{ij}\}$. The paths π ending at vh = (i, j) with local score $W_\pi = \hat{S}_{ij}$ are ‘optimal local paths’ corresponding to ‘optimal local alignments’ matching subsequences of $\hat{A}$ and $\hat{B}$ up to and including the letters $\hat{A}_i$ and $\hat{B}_j$.

Unlike the singly-infinite sequences A and B, the doubly-infinite sequences $\hat{A}$ and $\hat{B}$ correspond to the entire lattice $\mathbb{Z}^2$. The lattice $\mathbb{Z}^2$ is invariant under translation (i.e. it appears the same from each of its vertices). Thus, if $\hat{A}$ and $\hat{B}$ are sequences with independent random letters, the corresponding local scores $\hat{S}_{ij}$ are ‘stationary’ (i.e. their joint distribution is invariant under translation). Stationary scores carry a prime elsewhere (i.e. $\hat{S}'_{ij}$) (35), which we drop here for brevity. For many purposes, translation invariance renders all vertices in $\mathbb{Z}^2$ equivalent, so it usually suffices to define quantities below solely at the origin, (0, 0). The definition at other vertices is usually left implicit.

If the sequences $\hat{A}$ and $\hat{B}$ were singly-infinite, the Smith–Waterman algorithm could compute the corresponding local scores $\hat{S}_{ij}$ (36). Although the algorithm is unable to compute $\hat{S}_{ij}$ for $\hat{A}$ and $\hat{B}$, a rigorous treatment shows that doubly-infinite sequences pose no essential difficulties in the logarithmic regime (35).

For efficiency, many simulations of random local alignments partition the vertices in $\mathbb{Z}^2$ into ‘islands’ (described below). To avoid technical nuisances, each vertex must belong to exactly one island, so we define the following strict total order on $\mathbb{Z}^2$: (i′, j′) ≺ (i, j), if and only if either i′ + j′ < i + j or else, i′ + j′ = i + j and j′ < j.
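
The strict total order is a lexicographic comparison on the pair (i + j, j), as the following minimal sketch shows. The names `order_key` and `precedes` are hypothetical.

```python
def order_key(vertex):
    """Sort key realizing the total order: first by i + j, then by j."""
    i, j = vertex
    return (i + j, j)


def precedes(u, v):
    """True if vertex u strictly precedes vertex v in the total order."""
    return order_key(u) < order_key(v)


# Vertices on the same anti-diagonal are ordered by their second coordinate.
ordered = sorted([(0, 1), (1, 0), (0, 0), (2, -1)], key=order_key)
```

Because distinct vertices always have distinct keys, any two vertices are comparable, which is what makes island ownership unambiguous.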

Let us say that a vertex $(i, j) \in \mathbb{Z}_+^2$ ‘belongs to’ the origin if (0, 0) is the greatest vertex v0 = (i′, j′) (under the total order ≺) such that $\hat{S}_{ij} = W_\pi$, for some path π starting at v0 = (i′, j′) and ending at vh = (i, j). The ‘island’ belonging to (0, 0) is the set $\mathbb{B}_{00} \subseteq \mathbb{Z}_+^2$ of all vertices (i, j) belonging to (0, 0), and we say that (0, 0) ‘owns’ the island. [Equation 12 below uses the translate $\mathbb{B}_{-i,-j}$ of the set $\mathbb{B}_{00}$, where $\mathbb{B}_{-i,-j}$ is the set of all vertices belonging to (−i, −j).]

By the following reasoning, $\mathbb{B}_{00}$ is empty if and only if $\hat{S}_{00} > 0$. First, if $\hat{S}_{00} > 0$, there is some path π′ ending at (0, 0) with a positive score. If (0, 0) owned any vertex (i, j), there would be a path π starting at (0, 0) and ending at (i, j) with $\hat{S}_{ij} = W_\pi$. Then, the path concatenating π and π′ would have a weight exceeding $\hat{S}_{ij} = W_\pi$, contrary to the definition of $\hat{S}_{ij}$. Thus, if (0, 0) owns some vertex, $\hat{S}_{00} = 0$. Conversely, if $\hat{S}_{00} = 0$, then by deliberate construction, the definition of the total order ≺ implies that (0, 0) owns itself [because the weight of the trivial path containing only (0, 0) is 0].

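Accordingly, define the ‘local maximum’ [implicitly, on the island $\mathbb{B}_{00}$ belonging to (0, 0)] as $\hat{M} = \max\{\hat{S}_{ij} : (i, j) \in \mathbb{B}_{00}\}$, with the default $\hat{M} = -\infty$, if $\mathbb{B}_{00}$ is empty (i.e. if $\hat{S}_{00} > 0$). Let $\hat{N}(y) = \#\{(i, j) \in \mathbb{B}_{00} : \hat{S}_{ij} = y\}$ denote the number of island vertices with local score y.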

To connect our quantities explicitly to the Gumbel parameters, define $\hat{M}_{mn} = \max\{\hat{S}_{ij} : 0 \le i \le m, 0 \le j \le n\}$, the maximum local score in the lattice rectangle [0, m] × [0, n]. Let ρy be the density of islands yielding a local score $\hat{S}_{ij} \ge y$, or equivalently, the density of their owners in $\mathbb{Z}^2$. Under certain conditions in the logarithmic regime, $\mathbb{P}(\hat{M}_{mn} \ge y) = P_y \approx 1 - \exp(-E_y)$, where as m, n → ∞,

$$E_y = \rho_y\,mn \approx k\,mn\,e^{-\lambda y}. \tag{7}$$

Simulations indicate that to a good approximation, islands yielding a large local score $\hat{S}_{ij}$ occur independently of each other (15). Therefore, Equation 7 asserts that $\rho_y \approx k e^{-\lambda y}$. In a Poisson approximation, ρy represents the intensity of the Poisson process on $\mathbb{Z}^2$ that generates the owners of islands yielding a local score $\hat{S}_{ij} \ge y$.

Because of translation invariance, the density ρy equals the probability that any particular vertex in $\mathbb{Z}^2$ [e.g. (0, 0)] owns an island yielding a local score $\hat{S}_{ij} \ge y$. In other words, $\mathbb{P}(\hat{M} \ge y) = \rho_y \approx k e^{-\lambda y}$. Thus, the limit

$$k = \lim_{y \to \infty} e^{\lambda y}\,\mathbb{P}(\hat{M} \ge y) \tag{8}$$

exists and equals the pre-factor k.

Path reversal identity

To determine k from global alignments, we first relate the global maximum M to the local scores $\hat{S}_{ij}$ with a path reversal identity. Recall the global maximum M = max{Wπ: π ∈ Π}, where Π is the set of all paths π in $\mathbb{Z}_+^2$ starting at v0 = (0, 0). Recall also the local score $\hat{S}_{ij} = \max\{W_\pi : \pi \in \hat{\Pi}_{ij}\}$, where $\hat{\Pi}_{ij}$ is the set of all paths π in $\mathbb{Z}^2$ ending at vh = (i, j).

It is believable that for any fixed $(i, j) \in \mathbb{Z}^2$, each path in $\hat{\Pi}_{ij}$ with random edge-weights corresponds to a reversal of a path in Π with the same random edge-weights. Thus, for every $(i, j) \in \mathbb{Z}^2$, $\mathbb{P}(\hat{S}_{ij} = y) = \mathbb{P}(M = y)$, i.e. the local score and the global maximum have the same distribution. (Note: the equality is solely distributional. In any particular random instance, the local score $\hat{S}_{ij}$ and global maximum score M are unlikely to be related.)

Because the distributional equality holds for every $(i, j) \in \mathbb{Z}^2$, we drop the subscript ij on $\hat{S}_{ij}$ and write

$$\mathbb{P}(\hat{S} = y) = \mathbb{P}(M = y). \tag{9}$$

A formal proof of Equation 9 can be found elsewhere (35).

The Poisson clumping heuristic

Consider the Poisson clumping heuristic (37)

$$\mathbb{P}(\hat{S} = y) = \mathbb{P}(\hat{M} \ge y)\,\mathbb{E}[\hat{N}(y) \mid \hat{M} \ge y]. \tag{10}$$

Equation 10 states that at any fixed vertex $(i, j) \in \mathbb{Z}^2$, the probability that $\hat{S}_{ij} = y$ is the density of vertices with a local score y. This density equals $\rho_y = \mathbb{P}(\hat{M} \ge y)$, the density of islands where some local score is at least y, multiplied by $\mathbb{E}[\hat{N}(y) \mid \hat{M} \ge y]$, the expected number $\hat{N}(y)$ of island vertices (i, j) where the local score $\hat{S}_{ij} = y$, given $\hat{M} \ge y$.

Equation 10 can be demonstrated as follows. First,

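$$\mathbb{E}\hat{N}(y) = \mathbb{P}(\hat{M} \ge y)\,\mathbb{E}[\hat{N}(y) \mid \hat{M} \ge y], \tag{11}$$

because if $\hat{M} < y$, then $\hat{N}(y) = 0$. Equation 11 follows, because the event $[\hat{M} < y]$ contributes nothing to the expectation on the left.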

Next, define the indicator IA = 1 if the event A occurs and IA = 0 otherwise. Then,

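$$\mathbb{E}\hat{N}(y) = \mathbb{E}\sum_{(i,j) \in \mathbb{Z}_+^2} \mathbb{I}[\hat{S}_{ij} = y \text{ and } (i, j) \in \mathbb{B}_{00}] = \mathbb{E}\sum_{(i,j) \in \mathbb{Z}_+^2} \mathbb{I}[\hat{S}_{00} = y \text{ and } (0, 0) \in \mathbb{B}_{-i,-j}] = \mathbb{P}(\hat{S}_{00} = y). \tag{12}$$

The first equality is essentially the definition of $\hat{N}(y)$, which counts the number of vertices belonging to (0, 0) with local score $\hat{S}_{ij} = y$. The second equality exploits the translation invariance of the probabilities associated with $\hat{S}_{ij}$. The third equality merely notes that in the logarithmic regime, (0, 0) must belong to some vertex (35). Equation 10 follows.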

Our speculations

Based on the success of our simulation results, we speculate. First,

$$\lim_{y \to \infty} \frac{\mathbb{E}[\hat{N}(y) \mid \hat{M} \ge y]}{\mathbb{E}[N(y) \mid M \ge y]} = 1. \tag{13}$$

In fact, $\lim_{y \to \infty} \mathbb{E}[\hat{N}(y) \mid \hat{M} \ge y]$ and $\lim_{y \to \infty} \mathbb{E}[N(y) \mid M \ge y]$ are likely to exist as a common finite limit, but Equation 13 suffices for present purposes.

Equation 13 can be justified intuitively, as follows. As y → ∞, any vertices satisfying Sij = y become likely to cluster on a single island that has a large maximum local score. Thus, given M ≥ y, the vertices with Sij = y have a comparable structure to vertices with $\hat{S}_{ij} = y$ on the island belonging to (0, 0), given that the island satisfies $\hat{M} \ge y$. In particular, given M ≥ y, the number N(y) of vertices with Sij = y has a similar random behaviour to the number $\hat{N}(y)$ of vertices with $\hat{S}_{ij} = y$, given $\hat{M} \ge y$. Thus, the expectations approximate each other: $\mathbb{E}[\hat{N}(y) \mid \hat{M} \ge y] \approx \mathbb{E}[N(y) \mid M \ge y]$.

Though hardly a ‘speculation’, we assume that $c = \lim_{y \to \infty} e^{\lambda y}\,\mathbb{P}(M \ge y)$ exists. Unfortunately, there is still no rigorous proof of the limit's existence.

The formula for k from global alignment

Equation 11 has an analog for global alignment, with a similar demonstration:

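$$\mathbb{E}N(y) = \mathbb{E}[N(y) \mid M \ge y]\,\mathbb{P}(M \ge y). \tag{14}$$

Together, Equations 8–10, 13 and 14 yield

$$k = \lim_{y \to \infty} e^{\lambda y}\,\mathbb{P}(\hat{M} \ge y) = \lim_{y \to \infty} e^{\lambda y}\,\frac{\mathbb{P}(M = y)}{\mathbb{E}[\hat{N}(y) \mid \hat{M} \ge y]} = \lim_{y \to \infty} e^{\lambda y}\,\frac{\mathbb{P}(M = y)}{\mathbb{E}[N(y) \mid M \ge y]} = \lim_{y \to \infty} e^{\lambda y}\,\frac{\mathbb{P}(M = y)\,\mathbb{P}(M \ge y)}{\mathbb{E}N(y)}. \tag{15}$$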

Recall our assumption that s(Ai,Bj) and Δ(g) are always integers:

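$$\lim_{y \to \infty} \frac{\mathbb{P}(M = y)}{\mathbb{P}(M \ge y)} = \lim_{y \to \infty} \frac{\mathbb{P}(M \ge y) - \mathbb{P}(M \ge y + 1)}{\mathbb{P}(M \ge y)} = 1 - e^{-\lambda}. \tag{16}$$

Let $k_y = e^{\lambda y}\,\mathbb{P}(\hat{M} \ge y)$. From Equations 15 and 16, $k = \lim_{y \to \infty} k_y$, where ky is given by Equation 2.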
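
For convenience, the last step can be written out explicitly. The following is only a restatement of how Equations 15 and 16 combine, assuming (as above) that the limits involved exist so that the limit of the product is the product of the limits.

```latex
\begin{align*}
k &= \lim_{y \to \infty} e^{\lambda y}\,
     \frac{\mathbb{P}(M = y)\,\mathbb{P}(M \ge y)}{\mathbb{E}N(y)}
     && \text{(Equation 15)} \\
  &= \lim_{y \to \infty} e^{\lambda y}\,
     \frac{\mathbb{P}(M = y)^2}{\mathbb{E}N(y)} \cdot
     \frac{\mathbb{P}(M \ge y)}{\mathbb{P}(M = y)} \\
  &= \lim_{y \to \infty} \frac{e^{\lambda y}}{1 - e^{-\lambda}} \cdot
     \frac{\mathbb{P}(M = y)^2}{\mathbb{E}N(y)}
     && \text{(Equation 16)}
\end{align*}
```

The final expression under the limit is the right-hand side of Equation 2.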

REFERENCES

1. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410.
2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
3. Robinson A.B., Robinson L.R. Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc. Natl Acad. Sci. USA. 1991;88:8880–8884.
4. Henikoff S., Henikoff J.G. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA. 1992;89:10915–10919.
5. Dayhoff M.O., Schwartz R.M., Orcutt B.C. Atlas of Protein Sequence and Structure. Vol. 3. Silver Spring, MD: National Biomedical Research Foundation; 1978. pp. 345–352.
6. Arratia R., Waterman M.S. A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Probab. 1994;4:200–225.
7. Dembo A., Karlin S., Zeitouni O. Limit distributions of maximal non-aligned two-sequence segmental score. Ann. Probab. 1994;22:2022–2039.
8. Karlin S., Altschul S.F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA. 1990;87:2264–2268.
9. Mott R. Local sequence alignments with monotonic gap penalties. Bioinformatics. 1999;15:455–462.
10. Mott R. Accurate formula for P-values of gapped local sequence and profile alignments. J. Mol. Biol. 2000;300:649–659.
11. Storey J.D., Siegmund D. Approximate p-values for local sequence alignments: numerical studies. J. Comput. Biol. 2001;8:549–556.
12. Siegmund D., Yakir B. Approximate p-values for local sequence alignments. Ann. Stat. 2000;28:657–680.
13. Altschul S.F., Gish W. Local alignment statistics. Methods Enzymol. 1996;266:460–480.
14. Waterman M.S., Vingron M. Rapid and accurate estimates of statistical significance for sequence data base searches. Proc. Natl Acad. Sci. USA. 1994;91:4625–4628.
15. Olsen R., Bundschuh R., Hwa T. Rapid assessment of extremal statistics for gapped local alignment. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1999:211–222.
16. Mott R. Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol. 1992;54:59–75.
17. Smith T.F., Waterman M.S., Burks C. The statistical distribution of nucleic acid similarities. Nucleic Acids Res. 1985;13:645–656.
18. Collins J.F., Coulson A.F., Lyall A. The significance of protein sequence similarities. Comput. Appl. Biosci. 1988;4:67–71.
19. Mott R., Tribe R. Approximate statistics of gapped alignments. J. Comput. Biol. 1999;6:91–112.
20. Altschul S.F., Bundschuh R., Olsen R., Hwa T. The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res. 2001;29:351–361.
21. Bundschuh R. Rapid significance estimation in local sequence alignment with gaps. J. Comput. Biol. 2002;9:243–260.
22. Grossmann S., Yakir B. Large deviations for global maxima of independent superadditive processes with negative drift and an application to optimal sequence alignments. Bernoulli. 2004;10:829–845.
23. Park Y., Sheetlin S., Spouge J.L. Accelerated convergence and robust asymptotic regression of the Gumbel scale parameter for gapped sequence alignment. J. Phys. A: Math. Gen. 2005;38:97–108.
24. Needleman S.B., Wunsch C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970;48:443–453.
25. Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol. 1982;162:705–708.
26. Yu Y.K., Hwa T. Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J. Comput. Biol. 2001;8:249–282.
27. Montgomery D.C., Peck E.A., Vining G.G. Introduction to Linear Regression Analysis. NY: John Wiley & Sons, Inc.; 2001.
28. Andrews D.F., Bickel P.J., Hampel F.R., Huber P.J., Rogers W.H., Tukey J.W. Robust Estimates of Location: Survey and Advances. Princeton, NJ: Princeton University Press; 1972.
29. Andrews D.F. A robust method for multiple linear regression. Technometrics. 1974;16:523–531.
30. Huber P.J. Robust estimation of a location parameter. Ann. Math. Statist. 1964;35:73–101.
31. Huber P.J. Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Stat. 1973;1:799–821.
32. Dwass M. Probability and Statistics. NY: W.A. Benjamin; 1970.
33. Spouge J.L. Finite-size corrections to Poisson approximations of rare events in renewal processes. J. Appl. Probab. 2001;38:554–569.
34. Spouge J.L. Finite-size corrections to Poisson approximations in general renewal-success processes. J. Math. Anal. Appl. 2005;301:401–418.
35. Spouge J.L. Path reversal and islands in the gapped alignment of random sequences. J. Appl. Probab. 2004;41:975–983.
36. Smith T.F., Waterman M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197.
37. Aldous D. Probability approximations via the Poisson clumping heuristic. 1st edn. NY: Springer-Verlag; 1989.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press