# The Gumbel pre-factor *k* for gapped local alignment can be estimated from simulations of global alignment

^{*}To whom correspondence should be addressed. Tel: +301 402 9310; Fax: +301 480 2288; Email: spouge@ncbi.nlm.nih.gov

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

## Abstract

The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor *k*. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster, if they use global instead of local alignment [R. Bundschuh (2002) *J. Comput. Biol*., 9, 243–260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor *k*, because *k* has no known mathematical relationship to global alignment. This paper relates *k* to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and *k* within the errors required (λ, 0.8%; *k*, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.

## INTRODUCTION

Local sequence alignment is an indispensable computational tool in modern molecular biology. It is frequently used to infer the functional, structural and evolutionary relationships of a novel protein or DNA sequence by finding similar sequences of known function in a database. Arguably, the most important sequence database search program available is BLAST (the Basic Local Alignment Search Tool) (1,2). Using a heuristic algorithm, BLAST implicitly performs a local alignment of a protein or DNA query against sequences in the corresponding database. The BLAST output then ranks each potential database match according to an *E*-value, which is derived from the corresponding local maximum score, given in bits. For each local maximum score *y*, the corresponding *E*-value *E*_{y} gives (under a random model) the expected number of false positives with a lower rank in the output. Thus, a small *E*-value indicates that the corresponding alignment is unlikely to occur by chance alone, whereas a large *E*-value indicates an unremarkable alignment. Without doubt, BLAST's *E*-values contribute substantially to its popularity.

Let us discuss the BLAST *E*-value *E*_{y} further here. (The Materials and Methods section also continues the discussion.) BLAST assumes a random model in which each unrelated pair of sequences **A**[1, *m*] = *A*_{1}···*A*_{m} and **B**[1, *n*] = *B*_{1}···*B*_{n} consists of random letters chosen independently from a background distribution. BLASTP (BLAST for proteins), e.g. assumes that random proteins are composed of amino acids chosen independently from the Robinson and Robinson frequency distribution (3). BLAST also requires an input, a matrix *s*(*A*_{i}, *B*_{j}) for scoring matches between the letters *A*_{i} and *B*_{j}. BLASTP, e.g. uses the BLOSUM62 scoring matrix (4) as its default, offering as alternatives a few other PAM (5) and BLOSUM matrices. BLAST also enhances its detection of remote sequence similarities by using gapped sequence alignment. The cost of introducing a gap into an alignment is given by the ‘gap penalty’ Δ(*g*), where *g* is the gap length. Practical gap penalties Δ are usually super-additive, i.e. Δ(*g*) + Δ(*h*) ≥ Δ(*g* + *h*), so the concatenation of optimal subsequence alignments has a score no less than the sum of their scores. (However, our theory is not restricted to super-additive gap penalties.) Affine gap penalties Δ(*g*) = *a* + *bg* are typical in database searches. We refer to the letter distribution, the scoring matrix, and gap penalty collectively as ‘BLAST parameters’.

Throughout the paper, we assume a ‘logarithmic regime’ (6) where the alignment scores of long random sequences have a negative expectation. In the logarithmic regime, the BLAST *E*-value *E*_{y} is approximately

$$E_y \approx kmn\,e^{-\lambda y} \qquad (1)$$

for large *y*. Under a Poisson approximation (7) for large *y*, the *E*-value *E*_{y} yields the *P*-value *P*_{y} = 1−exp(−*E*_{y}). Because of Equation 1, the tail probability *P*_{y} corresponds to a Gumbel distribution with ‘scale parameter’ λ and ‘pre-factor’ *k*.
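The relation between score, *E*-value and *P*-value can be sketched numerically. The snippet below is a minimal illustration, assuming that Equation 1 has the standard BLAST form *E*_{y} ≈ *kmn* exp(−λ*y*); the function names, the value of λ and the sequence lengths are assumptions chosen for the example, not values taken from this paper.

```python
import math


def e_value(y, lam, k, m, n):
    """Expected number of chance alignments scoring at least y (assumed Gumbel form)."""
    return k * m * n * math.exp(-lam * y)


def p_value(y, lam, k, m, n):
    """Poisson approximation: P_y = 1 - exp(-E_y)."""
    # -expm1(-E) equals 1 - exp(-E) but keeps precision when E is tiny.
    return -math.expm1(-e_value(y, lam, k, m, n))


# Illustrative values only: lam, m and n are assumptions made for this example;
# k = 0.041 is the approximate pre-factor quoted in the Results.
LAM, K = 0.267, 0.041
M_LEN, N_LEN = 250, 1_000_000

for score in (60, 80, 100):
    e = e_value(score, LAM, K, M_LEN, N_LEN)
    p = p_value(score, LAM, K, M_LEN, N_LEN)
    print(f"y={score:3d}  E={e:.3e}  P={p:.3e}")
```

For small *E*-values the two quantities nearly coincide, which is why the *E*-value alone is usually reported.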

For ungapped local alignment (i.e. the special case Δ(*g*) = ∞, which disallows gaps in the optimal local alignment), a rigorous theory furnishes analytic formulas for the Gumbel parameters λ and *k* (7,8). For gapped local alignment, analytic results are scarce and usually come at a price: they depend on approximations whose accuracy in general is unknown (9–12). In the absence of a rigorous theory for gapped local alignment, computer simulations have confirmed the validity of Equation 1 (13–16), and in the absence of formulas, they also have provided estimates of λ and *k* (16–19).

Because of the exponentiation in Equation 1, errors in λ have a greater practical impact than errors in *k*. Thus, for use in BLAST, λ must be known to within 1–4% relative error; *k*, to within 10% (20). Therefore, in statements about computational speed, the following implicitly assumes that the estimation of λ and *k* is carried out to these accuracies, unless stated otherwise.
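The different tolerances for λ and *k* can be made concrete with a short calculation. The sketch below again assumes the form *E*_{y} ≈ *kmn* exp(−λ*y*) for Equation 1; the function name and the numerical values of λ and *y* are illustrative assumptions.

```python
import math


def e_value_ratio(y, lam, rel_err_lam=0.0, rel_err_k=0.0):
    """Ratio of the perturbed E-value to the exact one.

    Assumes E_y = k*m*n*exp(-lam*y). The product m*n cancels, so only the
    relative errors and the product lam*y matter.
    """
    return (1.0 + rel_err_k) * math.exp(-lam * rel_err_lam * y)


# Illustrative assumption: lam = 0.267 and a raw score y = 100.
print("10% error in k  :", e_value_ratio(100, 0.267, rel_err_k=0.10))
print("1% error in lam :", e_value_ratio(100, 0.267, rel_err_lam=0.01))
```

An error in *k* changes the *E*-value by a fixed factor, whereas the effect of an error in λ grows exponentially with the score, which is consistent with the tighter tolerance on λ.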

Presently, the BLAST program precomputes λ and *k* offline, using the so-called ‘island method’ (15,20). Because of the precomputation, users are given a narrow choice indeed of BLAST parameters. The choice of BLAST parameters would be much less restricted, if λ and *k* could be computed online (in, say, less than 1 s) before searching a database with arbitrary BLAST parameters. Accordingly, much recent research has been directed toward speeding estimation of λ and *k*.

With the ultimate aim of estimating λ and *k* online, Bundschuh gave some interesting conjectures about λ (21,22). He then applied them in global alignment simulations that estimated λ as much as five times faster than the island method. Later, we extended his conjectures, reducing the sequence length required to estimate λ by almost a factor of 10 (23).

Despite their obvious promise, even with further improvements in speed, global alignment simulations will remain impractical for online estimation in BLAST, unless they can be made to estimate *k* as well. To remedy the problem, we relate *k* to global alignment and then exploit the relationship in simulations that estimate both λ and *k*.

## MATERIALS AND METHODS

### Notation for global sequence alignment

We denote the non-negative integers by ℤ_{+} = {0, 1, 2, 3,…}. Throughout the paper, the letters *g*, *h*, *i*, *j*, *m*, *n* and *y* denote integers.

Consider a pair **A** = *A*_{1}*A*_{2}… and **B** = *B*_{1}*B*_{2}… of infinite sequences. The corresponding global alignment graph Γ is a directed and weighted lattice graph in two dimensions, as follows. The vertices of Γ are $v=(i,j)\in \mathbb{Z}_{+}^{2}$, the non-negative two-dimensional integer lattice. Three sets of directed edges *e* come out of each vertex *v* = (*i*, *j*): northward, northeastward and eastward. One northeastward edge goes into (*i* + 1, *j* + 1) with weight *s*(*A*_{i+1}, *B*_{j+1}). For each *g* > 0, one eastward edge goes into (*i* + *g*, *j*) and one northward edge goes into (*i*, *j* + *g*); both are assigned the same weight −Δ(*g*) < 0. For simplicity, we assume *s*(*A*_{i}, *B*_{j}) and Δ(*g*) are always integers, with greatest common divisor 1.

A directed path π = (*v*_{0}, *e*_{1}, *v*_{1}, *e*_{2}, …, *e*_{h}, *v*_{h}) in Γ is a finite, alternating sequence of vertices and edges that starts and ends with a vertex. We say that the path π starts at *v*_{0} and ends at *v*_{h}. For instance, each gapped alignment of the subsequences **A**[*i* + 1, *m*] = *A*_{i+1}…*A*_{m} and **B**[*j* + 1, *n*] = *B*_{j+1}…*B*_{n} corresponds to exactly one directed path that starts at *v*_{0} = (*i*, *j*) and ends at *v*_{h} = (*m*, *n*). The alignment's score is the ‘path weight’ $W_{\pi}=\sum_{i=1}^{h}W(e_{i})$, the sum of the weights *W*(*e*_{i}) of the edges *e*_{i}. By convention, any trivial path π = (*v*_{0}) consisting of a single vertex has weight *W*_{π} = 0.

Let Π_{ij} be the set of all paths π starting at *v*_{0} = (0, 0) and ending at *v*_{h} = (*i*, *j*). Define the ‘global score’ *S*_{ij} = max{*W*_{π}: π ∈ Π_{ij}}. The paths π starting at *v*_{0} and ending at *v*_{h} with weight *W*_{π} = *S*_{ij} are ‘optimal global paths’ and correspond to ‘optimal global alignments’ between **A**[1, *i*] and **B**[1, *j*]. The Needleman–Wunsch algorithm computes the global scores *S*_{ij} (24).

Let $\Pi =\cup_{(i,j)\in \mathbb{Z}_{+}^{2}}\Pi_{ij}$ be the set of all paths π starting at *v*_{0} = (0,0). Define the ‘global maximum’ *M* = max{*W*_{π}: π ∈ Π}, which is also the maximum $M=\max\{S_{ij}:(i,j)\in \mathbb{Z}_{+}^{2}\}$ of all global scores. Let $N(y)=\#\{(i,j)\in \mathbb{Z}_{+}^{2}:S_{ij}=y\}$ denote the number of vertices with global score *y*.

Define the lattice rectangle [0, *n*] = {0,1,…,*n*}. Our simulations involved a square subset [0,*n*]^{2} of $\mathbb{Z}_{+}^{2}$. In particular, single subscripts connote quantities for the square: *M*_{n} = max{*S*_{ij}: (*i*, *j*) ∈ [0, *n*]^{2}}, the square's global maximum; *E*_{n} = max{max_{0≤i≤n} *S*_{in}, max_{0≤j≤n} *S*_{nj}}, its edge maximum; and *N*_{n}(*y*) = #{(*i*, *j*) ∈ [0, *n*]^{2}: *S*_{ij} = *y*}, the number of its vertices with global score *y*.
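The definitions of *S*_{ij}, *M*_{n}, *E*_{n} and *N*_{n}(*y*) translate directly into a small dynamic program. The sketch below follows the graph definition edge by edge rather than any optimized algorithm; it is cubic in the sequence length and is meant only as an illustration. The toy scoring function, the gap penalty Δ(*g*) = 3 + *g* and all function names are assumptions made for the example.

```python
from collections import Counter


def global_scores(a, b, score, gap):
    """Global scores S[i][j] for the prefixes a[:i] and b[:j].

    score(x, y) is the substitution score; gap(g) is the penalty Delta(g) > 0.
    Every edge entering (i, j) is examined, exactly as in the graph definition.
    """
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue  # the trivial path at the origin has weight 0
            best = float("-inf")
            if i > 0 and j > 0:  # northeastward edge: align a[i-1] with b[j-1]
                best = S[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            for g in range(1, i + 1):  # eastward edges of length g
                best = max(best, S[i - g][j] - gap(g))
            for g in range(1, j + 1):  # northward edges of length g
                best = max(best, S[i][j - g] - gap(g))
            S[i][j] = best
    return S


def square_summaries(S):
    """Return (M_n, E_n, N_n) for a square score table S of side n + 1."""
    n = len(S) - 1
    global_max = max(max(row) for row in S)
    edge_max = max(max(S[i][n] for i in range(n + 1)),
                   max(S[n][j] for j in range(n + 1)))
    counts = Counter(value for row in S for value in row)
    return global_max, edge_max, counts


# Toy parameters (assumptions): +2 for a match, -1 for a mismatch, Delta(g) = 3 + g.
toy_score = lambda x, y: 2 if x == y else -1
toy_gap = lambda g: 3 + g

S = global_scores("AC", "AC", toy_score, toy_gap)
M_n, E_n, N_n = square_summaries(S)
print(S, M_n, E_n, N_n[2])
```

For affine gap penalties, the inner loops over *g* can be replaced by the usual two auxiliary tables, which reduces the cost to quadratic time.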

### The formula for *k* from global alignment

We can show heuristically that *k* = lim_{y→∞} *k*_{y}, where *k*_{y} is given by Equation 2 (see our Appendix, online). Ultimately, the heuristics behind Equation 2 are based on two observations about random sequence matches. First, the two ends of a strong local alignment match are the mirrors of each other. Second, the right end of a strong alignment match looks the same for both local and global alignment.

Equation 2 computes *k*_{y} from three components: the scale parameter λ, the probability ℙ(*M* = *y*) of a global maximum *y*, and the expected number 𝔼*N*(*y*) of vertices with global score *S*_{ij} = *y*. We now describe how our simulations determined the three components.

### Numerical scheme for λ

First, we estimated λ from random global alignments (23). All simulations used affine gap penalties Δ(*g*) = *a* + *bg* and the corresponding global alignment algorithms for computing *S*_{ij} (25).

Recall the edge maximum *E*_{n} (defined at the end of the notation for global sequence alignment). As shown elsewhere (23), its cumulant generating function satisfies Equation 3, where 0 ≤ δ < 1. The root $\lambda =\widehat{\lambda}$ of β_{1}(λ) = 0 is our estimate for λ.

To estimate 𝔼exp(λ*E*_{n}) efficiently, we used Bundschuh's importance sampling methods (21), which apply if the gap penalty is affine. Briefly, importance sampling is a variance-reduction technique for simulating rare events. In global alignment simulations, e.g. a large edge maximum is a rare event. By simulating optimal subsequence pairs in ‘hybrid alignment’ (a type of optimized Bayesian local alignment) (26), we ensured that our realizations frequently generated a large edge maximum *E*_{n}. Accordingly, we simulated a pair of sequences of some ‘base length’ *n* = l. After correcting for biases induced by the importance sampling distribution, we estimated 𝔼exp(λ*E*_{l}).
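The general idea of importance sampling, namely drawing from a biased distribution that makes the rare event common and then reweighting each draw by a likelihood ratio, can be shown on a textbook example. The sketch below estimates a Gaussian tail probability; it does not implement the hybrid alignment scheme used in the simulations, and the function name, threshold and sample size are assumptions for illustration.

```python
import math
import random


def tail_prob_importance(threshold, n_samples, seed=0):
    """Estimate P(Z > threshold) for standard normal Z by sampling from N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # biased draw: the rare event is now common
        if x > threshold:
            # Likelihood ratio phi(x) / phi(x - threshold) corrects for the bias.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


estimate = tail_prob_importance(4.0, 20_000)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
print(f"importance sampling: {estimate:.3e}   exact: {exact:.3e}")
```

Sampling directly from the standard normal would need millions of draws to see the event even a few times, whereas the biased sampler observes it in about half of its draws.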

Equation 3 corresponds to an asymptotic equality with two free parameters, β_{0} and β_{1}(λ), which we estimated with robust regression. Robust regression was originally developed as an antidote to outliers (27), which badly skew least-squares regression (28–31). As noted elsewhere (23), however, robust regression is also remarkably suited for extracting asymptotic parameters like β_{0} and β_{1}(λ).

Robust regression requires the specification of an influence function, to quantify the influence of potential outliers on the regression result. Many influence functions exist (27), but the Andrews function with *a* = 1.339 [(27), p. 388; (29)] works well in asymptotic regression, because it ignores points that obviously lie outside the asymptotic regime (23).

Accordingly, we applied robust regression to Equation 3. To solve β_{1}(λ) = 0, let λ_{u} be the scale parameter for ungapped local alignment, which can be determined analytically. Because 0 ≤ λ ≤ λ_{u}, repeated bisection of the interval [0, λ_{u}] yielded an estimate $\widehat{\lambda}$ for the root of the equation β_{1}(λ) = 0. In practice, multiple roots did not occur.
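Repeated bisection itself is straightforward to sketch. In the simulations, each evaluation of β_{1}(λ) would come from a robust regression; here a toy function stands in for it. The function names, the toy function and both bracket endpoints are assumptions for illustration (the lower endpoint sits slightly above 0 only because the toy function vanishes at 0).

```python
def bisect_root(f, lo, hi, tol=1e-10, max_iter=200):
    """Find a root of f in [lo, hi] by repeated bisection; f must change sign."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0.0:
        raise ValueError("root is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0 or hi - lo < tol:
            return mid
        if f_lo * f_mid < 0.0:
            hi = mid  # sign change lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # sign change lies in the upper half
    return 0.5 * (lo + hi)


# Hypothetical stand-in for the regression estimate of beta_1(lambda).
def toy_beta1(lam):
    return lam * (lam - 0.25)


LAMBDA_U = 0.3  # assumed ungapped scale parameter, for illustration only
lam_hat = bisect_root(toy_beta1, 0.01, LAMBDA_U)
print(lam_hat)
```

Each halving of the interval costs one evaluation of the function, so the number of regressions needed grows only logarithmically with the required precision.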

### Numerical scheme for *k*

Next, we estimated ℙ(*M* = *y*) and 𝔼*N*(*y*). Importance sampling has already generated sequence-pairs of base length l for estimating λ. The bias in importance sampling tends to yield large global scores *S*_{ij}, ascending toward the global maximum *M*. To determine *N*(*y*), we needed to simulate and count all vertices with global scores *S*_{ij} = *y*. Therefore, we extended the sequence pair beyond the base length l using random letters with the unbiased Robinson and Robinson frequencies. The global scores *S*_{ij} beyond the base length l became progressively smaller, thereby permitting determination of *N*(*y*).

Given ɛ > 0, we simulated a random number $\overline{L}$ of unbiased letters in each sequence, until we found some total length $L=l+\overline{L}$ such that Equation 4 held.

The edge maximum *E*_{L} is a maximum over 2*L* + 1 vertices. Therefore, for small enough stringencies ɛ > 0, if the edge maximum *E*_{L} of the contributing 2*L* + 1 vertices satisfies Equation 4, it is probable that *M* = *M*_{L}, because elongating the sequences is unlikely to increase the estimate of *M*. Similarly, the elongation does not increase the estimate of 𝔼*N*(*y*) much. After appropriate averaging, our simulations therefore yielded estimates $\widehat{\mathbb{P}}(M=y)\approx \mathbb{P}(M_{L}=y)$ and $\widehat{\mathbb{E}}N(y)\approx \mathbb{E}N_{L}(y)$ for ℙ(*M* = *y*) and 𝔼*N*(*y*).

With the simulation estimates $\widehat{\lambda}$, $\widehat{\mathbb{P}}(M=y)$ and $\widehat{\mathbb{E}}N(y)$ in hand, we found that errors in $\widehat{\lambda}$ were negligible in practice. In contrast, the sample standard deviations (32) of $\widehat{\mathbb{P}}(M=y)$ and $\widehat{\mathbb{E}}N(y)$, denoted by *s*_{M} and *s*_{N}, were not.

We calculated an estimate $\widehat{k}_{y}$ for *k*_{y} by substituting $\widehat{\lambda}$, $\widehat{\mathbb{P}}(M=y)$, and $\widehat{\mathbb{E}}N(y)$ into Equation 2. We estimated the error $s(\widehat{k}_{y})$ in $\widehat{k}_{y}$ from Equation 5.

Note that Equation 5 explicitly neglects the error in the estimate $\widehat{\lambda}$.

Finally, we used robust regression to extract a summary estimate $\widehat{k}$ from the estimates ${\widehat{k}}_{y}\pm s\left({\widehat{k}}_{y}\right)$ for individual *y*. To begin with, consider a constant regression model **η** = **1**α + **e**, where **η** is a column vector consisting of the values ${\widehat{k}}_{y}$, **1** is a column vector whose elements are all 1, the constant α is the summary estimate $\widehat{k}$, and **e** is the column vector consisting of the errors $s\left({\widehat{k}}_{y}\right)$.
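A robust fit of the constant model can be sketched with iteratively reweighted least squares and the Andrews sine influence function. This is a minimal illustration of the technique, not a reproduction of the regression code used for the results; the function names, the median starting value and the sample data are assumptions.

```python
import math
import statistics

ANDREWS_A = 1.339  # tuning constant quoted in the text


def andrews_weight(r, a=ANDREWS_A):
    """Reweighting factor psi(r)/r for the Andrews influence function psi(r) = a*sin(r/a)."""
    if abs(r) >= a * math.pi:
        return 0.0  # points far outside the bulk are ignored entirely
    if r == 0.0:
        return 1.0
    u = r / a
    return math.sin(u) / u


def robust_constant(values, errors, max_iter=100, tol=1e-12):
    """Fit eta = alpha + e, with known errors, by iteratively reweighted least squares."""
    alpha = statistics.median(values)  # outlier-resistant starting point
    for _ in range(max_iter):
        weights = [andrews_weight((v - alpha) / s) / s ** 2
                   for v, s in zip(values, errors)]
        total = sum(weights)
        if total == 0.0:
            break  # every point was rejected; keep the current value
        new_alpha = sum(w * v for w, v in zip(weights, values)) / total
        if abs(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha


# Hypothetical per-score estimates with one obvious outlier.
k_hat_y = [0.040, 0.042, 0.041, 0.039, 0.043, 0.020]
s_k_hat_y = [0.002] * len(k_hat_y)
print(robust_constant(k_hat_y, s_k_hat_y))
```

Because the weight drops to zero beyond |r| = *a*π, the outlying value has no effect on the fit, whereas an ordinary weighted mean would be pulled toward it.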

Our ultimate aim is to compute $\widehat{k}$ rapidly, with as few realizations as possible. Unfortunately, for small numbers of realizations, the errors *s*_{M} and *s*_{N} are correlated with the corresponding estimates $\widehat{\mathbb{P}}(M=y)$ and $\widehat{\mathbb{E}}N(y)$. The correlations propagate to $s(\widehat{k}_{y})$, noticeably biasing the summary estimate $\widehat{k}$, with $\mathbb{E}\widehat{k}<k$ (see Figure 1).

[Figure 1 caption, truncated: … *y* for the BLOSUM62 scoring matrix with an affine gap cost of 11 + *g* for a gap of length *g*, with random sequences whose letters are chosen according to the empirical Robinson and Robinson amino acid frequencies …]

To avoid the bias, we applied the constant regression model **η**′ = **1**α′ + **e**′ to the errors $s(\widehat{k}_{y})$ themselves. The elements of the column vector **η**′ were the errors $s(\widehat{k}_{y})$, with the error in each $s(\widehat{k}_{y})$ taken to be a constant *s* derived through a standard formula [(27), p. 387], **e**′ = **1***s*. Robust regression thus gave a constant estimate $\alpha^{\prime}=\widehat{s}(\widehat{k})$ of the errors $s(\widehat{k}_{y})$. We substituted the constant error estimate $\mathbf{e}=\mathbf{1}\alpha^{\prime}=\mathbf{1}\widehat{s}(\widehat{k})$ back into the constant regression **η** = **1**α + **e** of $\widehat{k}_{y}$ to derive a robust regression estimate $\widehat{k}$ for *k*. Although somewhat *ad hoc*, the constant regression of the errors successfully reduced biases (see Figure 3).

[Figure 3 caption, truncated: … *k* obtained via robust regression using $\widehat{k}_{y}\pm \widehat{s}(\widehat{k})$ and $\widehat{k}_{y}\pm s(\widehat{k}_{y})$ against different numbers of simulations. Each bar represents an average over 20 absolute relative errors. The previous best estimate …]

Even for large simulations (e.g. 10^{6} realizations), however, sampling of the event [*M* = *y*] was inadequate for many large *y*, with ℙ(*M* = *y*) likely being underestimated. Although the corresponding average was unbiased (in theory, at least), we suspect that it had a distribution whose skewing increased with *y*. Consequently, for large *y*, $\widehat{k}_{y}$ often slightly underestimated the true *k*, with improbable but substantial overestimations maintaining a correct expectation $\mathbb{E}\widehat{k}_{y}=k$ (see Figure 2). The putative skewing also made the anticipated relation ℙ(*M* = *y*) ≈ *e*^{λ} ℙ(*M* = *y* + 1) fail for large *y*. To avoid skewing, we therefore restricted robust regression of $\widehat{k}_{y}$ to the range [*a*, *b*] of *y* that minimized the function given in Equation 6.

### Software and Hardware

Computer code was written in C++ and compiled with the Microsoft® Visual C++® 6.0 compiler. The computer had a single Intel® Pentium® 4 2.8 GHz processor with 0.5 GB RAM and employed the Microsoft® Windows® 2000 operating system.

## RESULTS

Tables 1 and 2 give estimates of the Gumbel parameters λ and *k* for all online options of the BLASTP parameters. They therefore confirm that our simulations and our formulas for *k* produced correct results. Other figures show results for the BLASTP default parameters, namely, the Robinson and Robinson amino acid frequencies (3), the BLOSUM62 scoring matrix and the gap cost Δ(*g*) = 11 + *g*. Other BLAST parameters tested gave comparable results, unless indicated otherwise (data not shown).

Empirically, simulations using BLASTP default parameters needed a base length of l = 50 and a stringency ɛ = 10^{−2} to attain the required accuracies (λ, 1%; *k*, 10%). For scoring matrices with more dominant diagonals than BLOSUM62, shorter base lengths sufficed (e.g. l = 15 for PAM30).

Figure 1 plots the estimates ${\widehat{k}}_{y}$ with their standard error bars $s\left({\widehat{k}}_{y}\right)$ against global score *y*, up to *y* = 25. Each point represents 30 000 realizations. The horizontal thick line represents the previous best estimate *k* ≈ 0.041 and the dotted line, the biased summary estimate $\widehat{k}=0.036$ due to the positive correlation between ${\widehat{k}}_{y}$ and $s\left({\widehat{k}}_{y}\right)$. Therefore Figure 1 motivated us to regress the errors in ${\widehat{k}}_{y}$, to produce a constant error estimate $\widehat{s}\left(\widehat{k}\right)$, as described in the Materials and Methods.

Figure 2 plots the estimates ${\widehat{k}}_{y}$ against global score *y*, up to *y* = 100. Each point represents 10^{6} realizations. We obtained the estimate $\widehat{\lambda}$ and used it to estimate ${\widehat{k}}_{y}$. The range *y* ∈ [0, 3] is not asymptotic, so the ${\widehat{k}}_{y}$ do not approximate the true *k* very well. The range *y* ∈ [4, 40] is asymptotic, and it is adequately sampled, so the ${\widehat{k}}_{y}$ fluctuate randomly around the true *k*. The range *y* > 40 is also asymptotic, but it is not adequately sampled, so the ${\widehat{k}}_{y}$ usually underestimate the true *k*. Figure 2 motivated us to regress only in the range [*a*, *b*] minimizing Equation 6, as described in the Materials and Methods.

Figure 3 plots the relative errors of the summary estimate $\widehat{k}$ using $\widehat{k}_{y}\pm s(\widehat{k}_{y})$ (with skewed error estimates $s(\widehat{k}_{y})$) and those using $\widehat{k}_{y}\pm \widehat{s}(\widehat{k})$ (with constant error estimate $\widehat{s}(\widehat{k})$) against different numbers of realizations. All errors in $\widehat{k}$ were computed relative to the approximation *k* ≈ 0.041. Each error plotted is the average of the absolute relative error for 20 independent simulations, each using the indicated number of realizations. White bars show the results for $\widehat{k}_{y}\pm \widehat{s}(\widehat{k})$; black bars, for $\widehat{k}_{y}\pm s(\widehat{k}_{y})$. For 10 000 realizations, the constant error estimate $\widehat{s}(\widehat{k})$ reduces the relative errors dramatically. As the number of realizations increases, the difference in efficiency of estimation between $\widehat{k}_{y}\pm s(\widehat{k}_{y})$ and $\widehat{k}_{y}\pm \widehat{s}(\widehat{k})$ decreases. Figure 3 shows that 10 000 realizations estimated *k* with less than 10% relative error. The same 10 000 realizations also estimated $\widehat{\lambda}$ with less than 0.8% relative error (data not shown).

The simulations of Figure 3 estimated $\widehat{k}$ from 10 000 realizations, in less than 30 s. For comparison, the same simulations could have estimated $\widehat{\lambda}$ in less than 7 s. For the PAM 30 matrix with Δ(*g*) = 9 + *g*, they estimated λ and *k* in less than 4 s.

## DISCUSSION

BLAST programs (BLASTP, PSI-BLAST, etc.) are restricted to specific scoring schemes, because time-consuming local alignment simulations for estimating the corresponding Gumbel parameters must be done offline. However, simulations of *global* alignment can estimate the Gumbel scale parameter λ for *local* alignment (6). Some global alignment methods are as much as five times faster than the best local alignment methods (21,23), so global alignment has considerable potential for online estimation of the Gumbel parameter λ.

This paper surmounts an obstacle to online estimation by demonstrating that simulations of global alignment can determine the Gumbel pre-factor *k*. Table 2 displays the results of global alignment simulations over a wide range of BLAST parameters, all of which gave correct estimates of the corresponding *k* and supported the validity of our methods for computing *k*.

Global alignment simulation therefore appears a feasible method for estimating both Gumbel parameters, λ and *k*. (The BLASTP default parameters provide a standard for quantifying speed, so the following results apply to the BLASTP defaults, unless stated otherwise.) With local alignment, estimates of λ required 40 000 sequence-pairs of minimum length 600 (21); with our methods, 5000 sequence-pairs of maximum length 50 (23). In fact, our methods attained 1.3% accuracies in λ with only 1000 sequence-pairs of maximum length 50. In our hands, *k* was more difficult to estimate than λ, with 10% relative errors requiring 10 000 sequence-pairs of average length 140. In summary, the methods presented here for estimating the Gumbel parameters λ and *k* represent at least a 3-fold improvement in speed over local alignments.

Online computation of the BLAST *P*-value requires more than the Gumbel parameters. It also requires an estimate of the ‘finite-size effect’ (10,13,33,34). Global alignment (or some variant of it) can indeed produce the required estimate (manuscript in preparation). Without the finite-size estimate in hand, however, we were not strongly motivated to incorporate technical improvements or heuristics into our methods. Bundschuh, e.g. implemented a diagonal-cutting heuristic to remove irrelevant off-diagonal elements in the global alignment matrix (21); we did not. The heuristic could probably speed our computation by a further factor of at least three.

Online BLAST estimation of the Gumbel parameters is likely just a few years away.

## Acknowledgments

We would like to acknowledge helpful discussions with Dr Ralf Bundschuh and Dr Stephen Altschul. This work was supported in whole by the Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS. Funding to pay the Open Access publication charges for this article was provided by National Library of Medicine at National Institutes of Health/DHHS.

*Conflict of interest statement*. None declared.

#### APPENDIX

In the Appendix, we give a heuristic derivation of Equation 2.

##### Notation for local sequence alignment

For local alignment, consider a pair $\widehat{\text{A}}=\mathrm{\dots}{\widehat{A}}_{-1}{\widehat{A}}_{0}{\widehat{A}}_{1}\mathrm{\dots}$ and $\widehat{\text{B}}=\mathrm{\dots}{\widehat{B}}_{-1}{\widehat{B}}_{0}{\widehat{B}}_{1}\mathrm{\dots}$ of doubly-infinite sequences. Their local alignment graph $\widehat{\Gamma}$ is a directed, weighted lattice graph in two dimensions, as follows. The vertices *v* of $\widehat{\Gamma}$ are *v* = (*i*, *j*) ∈ ℤ^{2}, the *entire* two-dimensional integer lattice. In other respects, particularly with respect to the edges between its vertices, $\widehat{\Gamma}$ has the same structure as the global alignment graph Γ.

We base the graph $\widehat{\Gamma}$ on the entire two-dimensional integer lattice ℤ^{2} because of our interest in the Gumbel distribution. In intuitive terms, the BLAST *E*-value *E _{y}* follows the Gumbel distribution, only if the local alignment does not ‘see’ the ends of the sequences, so finite-size effects can be neglected (13,33).

Let $\widehat{\Pi}_{ij}$ be the set of all paths π ending at *v*_{h} = (*i*, *j*), regardless of their starting vertex. Define the ‘local score’ $\widehat{S}_{ij}=\max\{W_{\pi}:\pi \in \widehat{\Pi}_{ij}\}$. The paths π ending at *v*_{h} = (*i*, *j*) with local score $W_{\pi}=\widehat{S}_{ij}$ are ‘optimal local paths’ corresponding to ‘optimal local alignments’ matching subsequences of $\widehat{\text{A}}$ and $\widehat{\text{B}}$ up to and including the letters $\widehat{A}_{i}$ and $\widehat{B}_{j}$.

Unlike the singly-infinite sequences **A** and **B**, the doubly-infinite sequences $\widehat{\text{A}}$ and $\widehat{\text{B}}$ correspond to the entire lattice ℤ^{2}. The lattice ℤ^{2} is invariant under translation (i.e. it appears the same from each of its vertices). Thus, if $\widehat{\text{A}}$ and $\widehat{\text{B}}$ are sequences with independent random letters, the corresponding local scores ${\widehat{S}}_{ij}$ are ‘stationary’ (i.e. their joint distribution is invariant under translation). Stationary scores carry a prime elsewhere (i.e. ${{\widehat{S}}^{\prime}}_{ij}$) (35), which we drop here for brevity. For many purposes, translation invariance renders all vertices in ℤ^{2} equivalent, so it usually suffices to define quantities below solely at the origin, (0,0). The definition at other vertices is usually left implicit.

If the sequences $\widehat{\text{A}}$ and $\widehat{\text{B}}$ were singly-infinite, the Smith–Waterman algorithm could compute the corresponding local scores ${\widehat{S}}_{ij}$ (36). Although the algorithm is unable to compute ${\widehat{S}}_{ij}$ for $\widehat{\text{A}}$ and $\widehat{\text{B}}$, a rigorous treatment shows that doubly-infinite sequences pose no essential difficulties in the logarithmic regime (35).

For efficiency, many simulations of random local alignments partition the vertices in ℤ^{2} into ‘islands’ (described below). To avoid technical nuisances, each vertex must belong to exactly one island, so we define the following strict total order on ℤ^{2}: (*i*′, *j*′) ≺ (*i*, *j*), if and only if either *i*′ + *j*′ < *i* + *j* or else, *i*′ + *j*′ = *i* + *j* and *j*′ < *j*.
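The strict total order ≺ is a lexicographic comparison of the pair (*i* + *j*, *j*), which makes it easy to implement as a sort key. The function names below are assumptions made for this sketch.

```python
def order_key(vertex):
    """Sort key realizing the total order: compare i + j first, then j."""
    i, j = vertex
    return (i + j, j)


def precedes(u, v):
    """True if vertex u strictly precedes vertex v in the total order."""
    return order_key(u) < order_key(v)


vertices = [(0, 1), (1, 0), (0, 0), (2, -1), (-1, 1)]
print(sorted(vertices, key=order_key))
print(max(vertices, key=order_key))  # the greatest vertex under the order
```

Because distinct vertices always have distinct keys, ties cannot occur, which is what guarantees that each vertex belongs to exactly one island.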

Let us say that a vertex $(i,j)\in \mathbb{Z}_{+}^{2}$ ‘belongs to’ the origin if (0,0) is the greatest vertex *v*_{0} = (*i*′,*j*′) (under the total order ≺) such that $\widehat{S}_{ij}=W_{\pi}$, for some path π starting at *v*_{0} = (*i*′,*j*′) and ending at *v*_{h} = (*i*, *j*). The ‘island’ belonging to (0,0) is the set $\mathbb{B}_{00}\subseteq \mathbb{Z}_{+}^{2}$ of all vertices (*i*, *j*) belonging to (0, 0), and we say that (0, 0) ‘owns’ the island. [Equation 12 below uses the translate 𝔹_{−i,−j} of the set 𝔹_{00}, where 𝔹_{−i,−j} is the set of all vertices belonging to (−*i*, −*j*).]

By the following reasoning, 𝔹_{00} is empty if and only if ${\widehat{S}}_{00}>0$. First, if ${\widehat{S}}_{00}>0$, there is some path π′ ending at (0, 0) with a positive score. If (0, 0) owned any vertex (*i*, *j*), there would be a path π starting at (0, 0) and ending at (*i*, *j*) with ${\widehat{S}}_{ij}={W}_{\pi}$. Then, the path concatenating π and π′ would have a weight exceeding ${\widehat{S}}_{ij}={W}_{\pi}$, contrary to the definition of ${\widehat{S}}_{ij}$. Thus, if (0,0) owns some vertex, ${\widehat{S}}_{00}=0$. Conversely, if ${\widehat{S}}_{00}=0$, then by deliberate construction, the definition of the total order ≺ implies that (0,0) owns itself [because the weight of the trivial path containing only (0,0) is 0].

Accordingly, define the ‘local maximum’ [implicitly, on the island 𝔹_{00} belonging to (0, 0)] as $\widehat{M}=\max\{\widehat{S}_{ij}:(i,j)\in \mathbb{B}_{00}\}$, with the default $\widehat{M}=-\infty $, if 𝔹_{00} is empty (i.e. if $\widehat{S}_{00}>0$). Let $\widehat{N}(y)=\#\{(i,j)\in \mathbb{B}_{00}:\widehat{S}_{ij}=y\}$ denote the number of island vertices with local score *y*.

To connect our quantities explicitly to the Gumbel parameters, define $\widehat{M}_{mn}=\max\{\widehat{S}_{ij}:0\le i\le m,0\le j\le n\}$, the maximum local score in the lattice rectangle [0, *m*] × [0, *n*]. Let ρ_{y} be the density of islands yielding a local score $\widehat{S}_{ij}\ge y$, or equivalently, the density of their owners in ℤ^{2}. Under certain conditions in the logarithmic regime, $\mathbb{P}(\widehat{M}_{mn}\ge y)=P_{y}\approx 1-\exp(-E_{y})$, where *E*_{y} satisfies Equation 7 as *m*, *n* → ∞.

Simulations indicate that to a good approximation, islands yielding a large local score ${\widehat{S}}_{ij}$ occur independently of each other (15). Therefore, Equation 7 asserts that ${\rho}_{y}\approx k{e}^{-\lambda y}$. In a Poisson approximation, ${\rho}_{y}$ represents the intensity of the Poisson process on ℤ^{2} that generates the owners of islands yielding a local score ${\widehat{S}}_{ij}\ge y$.
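As a numerical illustration of the Poisson approximation, the sketch below evaluates $P_y\approx 1-\exp(-E_y)$ with $E_y = mn\,\rho_y$ and $\rho_y = k e^{-\lambda y}$. The function name `gumbel_tail` is hypothetical, and the parameter values in the usage lines (λ = 0.267, *k* = 0.041, *m* = *n* = 140) are illustrative assumptions of the order usually quoted for the BLASTP defaults, not results derived in this section.

```python
import math


def gumbel_tail(y, m, n, lam, k):
    """Approximate P(maximum local score in an m-by-n rectangle >= y)."""
    rho_y = k * math.exp(-lam * y)       # island density, rho_y ~ k * exp(-lam * y)
    expected_islands = m * n * rho_y     # E_y: expected number of islands scoring >= y
    # 1 - exp(-E_y), written with expm1 so that tiny E_y does not round to 0
    return -math.expm1(-expected_islands)


if __name__ == "__main__":
    # Illustrative (assumed) parameters; see the lead-in.
    for y in (30, 40, 50, 60):
        print(y, gumbel_tail(y, m=140, n=140, lam=0.267, k=0.041))
```

For large *y* the expected island count is small, so the tail probability is close to $E_y$ itself, which is why *k* acts as a pre-factor of the exponential tail.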

Because of translation invariance, the density ${\rho}_{y}$ equals the probability that any particular vertex in ℤ^{2} [e.g. (0, 0)] owns an island yielding a local score ${\widehat{S}}_{ij}\ge y$. In other words, $\mathbb{P}(\widehat{M}\ge y)={\rho}_{y}\approx k{e}^{-\lambda y}$. Thus, the limit

$$\lim_{y\to \infty}{e}^{\lambda y}\,\mathbb{P}(\widehat{M}\ge y) \tag{8}$$

exists and equals the pre-factor *k*.

##### Path reversal identity

To determine *k* from global alignments, we first relate the global maximum *M* to the local scores ${\widehat{S}}_{ij}$ with a path reversal identity. Recall the global maximum *M* = max {*W*_{π}:π ∈ Π}, where Π is the set of all paths π in ${\mathbb{Z}}_{+}^{2}$ starting at *v*_{0} = (0, 0). Recall also the local score ${\widehat{S}}_{ij}=\max\{{W}_{\pi}:\pi \in {\widehat{\Pi}}_{ij}\}$, where ${\widehat{\Pi}}_{ij}$ is the set of all paths π in ℤ^{2} ending at *v*_{h} = (*i*, *j*).

It is believable that for any fixed (*i*, *j*) ∈ ℤ^{2}, each path in ${\widehat{\Pi}}_{ij}$ with random edge-weights corresponds to a reversal of a path in Π with the same random edge-weights. Thus, for every (*i*, *j*) ∈ ℤ^{2}, $\mathbb{P}({\widehat{S}}_{ij}=y)=\mathbb{P}(M=y)$, i.e. the local score and the global maximum have the same distribution. (Note: the equality is solely distributional. In any particular random instance, the local score ${\widehat{S}}_{ij}$ and global maximum score *M* are unlikely to be related.)

Because the distributional equality holds for every (*i*, *j*) ∈ ℤ^{2}, we drop the subscript *ij* on ${\widehat{S}}_{ij}$ and write

$$\mathbb{P}(\widehat{S}=y)=\mathbb{P}(M=y). \tag{9}$$

A formal proof of Equation 9 can be found elsewhere (35).
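The path reversal argument can be checked on a finite lattice. The sketch below is a simplified illustration, not the construction used in the formal proof: it assumes a linear gap cost (a constant `gap` per gap step, rather than a general Δ(*g*)), a toy scoring function, and a finite rectangle in place of ℤ^{2}. All function names are hypothetical. Under those assumptions, the best weight of a path ending at the corner (*m*, *n*) equals, instance by instance, the best weight of a path starting at (0, 0) for the reversed sequences. Because reversing random sequences with independent, identically distributed letters does not change their distribution, the two scores share one distribution.

```python
import random


def global_max(a, b, score, gap):
    """M: best weight over lattice paths that start at (0, 0) and end anywhere."""
    m, n = len(a), len(b)
    prev = [-gap * j for j in range(n + 1)]   # row 0 is reached by gap steps only
    best = max(prev)                          # prev[0] == 0 is the trivial path
    for i in range(1, m + 1):
        cur = [-gap * i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = max(prev[j - 1] + score(a[i - 1], b[j - 1]),
                         prev[j] - gap,
                         cur[j - 1] - gap)
        best = max(best, max(cur))
        prev = cur
    return best


def local_score_at_corner(a, b, score, gap):
    """Local score at (m, n): best weight over paths ending there, starting anywhere."""
    m, n = len(a), len(b)
    # g[i][j] = best weight of a path from (i, j) to (m, n); g[m][n] = 0 (trivial path)
    g = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m and j == n:
                continue
            options = []
            if i < m and j < n:
                options.append(g[i + 1][j + 1] + score(a[i], b[j]))
            if i < m:
                options.append(g[i + 1][j] - gap)
            if j < n:
                options.append(g[i][j + 1] - gap)
            g[i][j] = max(options)
    return max(max(row) for row in g)


def toy_score(x, y):
    """Assumed toy substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(200):
        a = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 12)))
        b = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 12)))
        assert local_score_at_corner(a, b, toy_score, 2) == global_max(a[::-1], b[::-1], toy_score, 2)
    print("reversal identity held on all sampled pairs")
```

For a fixed pair of sequences the two quantities generally differ when neither sequence is reversed, which matches the remark that the equality is solely distributional.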

##### The Poisson clumping heuristic

Consider the Poisson clumping heuristic (37):

$$\mathbb{P}(\widehat{S}=y)=\mathbb{P}(\widehat{M}\ge y)\,\mathbb{E}[\widehat{N}(y)\mid \widehat{M}\ge y]. \tag{10}$$

Equation 10 states that at any fixed vertex (*i*, *j*) ∈ ℤ^{2}, the probability that ${\widehat{S}}_{ij}=y$ is the density of vertices with a local score *y*. This density equals ${\rho}_{y}=\mathbb{P}(\widehat{M}\ge y)$, the density of islands where some local score is at least *y*, multiplied by $\mathbb{E}[\widehat{N}(y)\mid \widehat{M}\ge y]$, the expected number $\widehat{N}(y)$ of island vertices (*i*, *j*) where the local score ${\widehat{S}}_{ij}=y$, given $\widehat{M}\ge y$.

Equation 10 can be demonstrated as follows. First,

$$\mathbb{E}[\widehat{N}(y)]=\mathbb{P}(\widehat{M}\ge y)\,\mathbb{E}[\widehat{N}(y)\mid \widehat{M}\ge y], \tag{11}$$

because if $\widehat{M}<y$, then $\widehat{N}(y)=0$. Equation 11 follows, because the event $[\widehat{M}<y]$ contributes nothing to the expectation on the left.

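Next, define the indicator 𝕀*A* = 1 if the event *A* occurs and 𝕀*A* = 0 otherwise. Then,

$$\mathbb{E}[\widehat{N}(y)]=\mathbb{E}\sum_{(i,j)\in {\mathbb{Z}}^{2}}\mathbb{I}[(i,j)\in {\mathbb{B}}_{00},\ {\widehat{S}}_{ij}=y]=\sum_{(i,j)\in {\mathbb{Z}}^{2}}\mathbb{P}[(0,0)\in {\mathbb{B}}_{-i,-j},\ {\widehat{S}}_{00}=y]=\mathbb{P}({\widehat{S}}_{00}=y). \tag{12}$$

The first equality is essentially the definition of $\widehat{N}(y)$, which counts the number of vertices belonging to (0,0) with local score ${\widehat{S}}_{ij}=y$. The second equality exploits the translation invariance of the probabilities associated with ${\widehat{S}}_{ij}$. The third equality merely notes that in the logarithmic regime, (0,0) must belong to some vertex (35). Equation 10 follows from Equations 11 and 12.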

##### Our speculations

Based on the success of our simulation results, we speculate. First,

$$\lim_{y\to \infty}\frac{\mathbb{E}[\widehat{N}(y)\mid \widehat{M}\ge y]}{\mathbb{E}[N(y)\mid M\ge y]}=1. \tag{13}$$

In fact, ${\mathrm{lim}}_{y\to \infty}\mathbb{E}[\widehat{N}(y)\mid \widehat{M}\ge y]$ and ${\mathrm{lim}}_{y\to \infty}\mathbb{E}[N(y)\mid M\ge y]$ are likely to exist as a common finite limit, but Equation 13 suffices for present purposes.

Equation 13 can be justified intuitively, as follows. As *y* → ∞, any vertices satisfying *S*_{ij} = *y* become likely to cluster on a single island that has a large maximum local score. Thus, given *M* ≥ *y*, the vertices with *S*_{ij} = *y* have a comparable structure to vertices with ${\widehat{S}}_{ij}=y$ on the island belonging to (0, 0), given that the island satisfies $\widehat{M}\ge y$. In particular, given *M* ≥ *y*, the number *N*(*y*) of vertices with *S*_{ij} = *y* has a similar random behaviour to the number $\widehat{N}(y)$ of vertices with ${\widehat{S}}_{ij}=y$, given $\widehat{M}\ge y$. Thus, the expectations approximate each other: $\mathbb{E}[\widehat{N}(y)\mid \widehat{M}\ge y]\approx \mathbb{E}[N(y)\mid M\ge y]$.
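To make the global quantities concrete, the following sketch computes the global scores *S*_{ij} for one pair of sequences and returns the global maximum *M* together with the counts *N*(*y*). It is an illustration under stated assumptions only: a linear gap cost stands in for Δ(*g*), the lattice is the finite rectangle spanned by the two sequences rather than all of ${\mathbb{Z}}_{+}^{2}$, and the names `global_score_counts` and `toy_score` are hypothetical.

```python
from collections import Counter


def global_score_counts(a, b, score, gap):
    """Return (M, counts) where counts[y] = N(y), the number of vertices with S_ij == y.

    S_ij is the best weight of a path from (0, 0) to (i, j); M is the maximum of S_ij.
    """
    m, n = len(a), len(b)
    prev = [-gap * j for j in range(n + 1)]   # row 0 is reached by gap steps only
    counts = Counter(prev)
    for i in range(1, m + 1):
        cur = [-gap * i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = max(prev[j - 1] + score(a[i - 1], b[j - 1]),
                         prev[j] - gap,
                         cur[j - 1] - gap)
        counts.update(cur)                    # tally every vertex score in this row
        prev = cur
    return max(counts), counts


def toy_score(x, y):
    """Assumed toy substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


if __name__ == "__main__":
    m_max, counts = global_score_counts("AC", "AC", toy_score, 2)
    print(m_max, counts[m_max], counts[0])
```

In a simulation, each realization would contribute one pair (*M*, *N*(*y*)) for every threshold *y* of interest, and the sequences would need to be long enough that the maximum is reached well inside the rectangle.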

Though hardly a ‘speculation’, we assume that *c* = lim_{y→∞} *e*^{λy} ℙ(*M* ≥ *y*) exists. Unfortunately, there is still no rigorous proof of the limit's existence.

##### The formula for *k* from global alignment

Equation 11 has an analog for global alignment, with a similar demonstration:

$$\mathbb{E}[N(y)]=\mathbb{P}(M\ge y)\,\mathbb{E}[N(y)\mid M\ge y]. \tag{14}$$

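Together, Equations 8–10, 13 and 14 yield

$$k=\lim_{y\to \infty}{e}^{\lambda y}\,\mathbb{P}(\widehat{M}\ge y)=\lim_{y\to \infty}\frac{{e}^{\lambda y}\,\mathbb{P}(\widehat{S}=y)}{\mathbb{E}[\widehat{N}(y)\mid \widehat{M}\ge y]}=\lim_{y\to \infty}\frac{{e}^{\lambda y}\,\mathbb{P}(M=y)}{\mathbb{E}[N(y)\mid M\ge y]}=\lim_{y\to \infty}\frac{{e}^{\lambda y}\,\mathbb{P}(M=y)\,\mathbb{P}(M\ge y)}{\mathbb{E}[N(y)]}. \tag{15}$$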

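Recall our assumption that *s*(*A*_{i}, *B*_{j}) and Δ(*g*) are always integers:

$$\lim_{y\to \infty}{e}^{\lambda y}\,\mathbb{P}(M=y)=\lim_{y\to \infty}{e}^{\lambda y}\,[\mathbb{P}(M\ge y)-\mathbb{P}(M\ge y+1)]=(1-{e}^{-\lambda})\,c. \tag{16}$$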

Let ${k}_{y}={e}^{\lambda y}\mathbb{P}(\widehat{M}\ge y)$. From Equations 15 and 16, *k* = lim_{y→∞} *k*_{y}, where *k*_{y} is given by Equation 2.
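The chain of relations above suggests a plug-in estimate of *k*_{y} from simulated global alignments. The sketch below is one reading of that chain and is not claimed to reproduce Equation 2 exactly: it assumes integer scores, so that for large *y* the probability ℙ(*M* = *y*) is approximately (1 − *e*^{−λ}) ℙ(*M* ≥ *y*), and it then uses *k*_{y} ≈ (1 − *e*^{−λ}) *e*^{λy} ℙ(*M* ≥ *y*)² / 𝔼[*N*(*y*)]. The function name `estimate_k_y` and the input layout (one (*M*, *N*(*y*)) pair per realization) are hypothetical, and λ is taken as already known.

```python
import math


def estimate_k_y(lam, y, realizations):
    """Plug-in estimate of k_y from global-alignment simulations.

    realizations: iterable of (M, N_y) pairs, one per simulated sequence pair,
    where M is the global maximum and N_y the number of vertices with S_ij == y.
    """
    data = list(realizations)
    if not data:
        raise ValueError("need at least one realization")
    # Empirical P(M >= y)
    p_tail = sum(1 for m_max, _ in data if m_max >= y) / len(data)
    # Empirical E[N(y)]; realizations with M < y contribute N_y == 0
    mean_n = sum(n_y for _, n_y in data) / len(data)
    if mean_n == 0:
        return float("nan")   # no vertex reached score y in any realization
    return (1.0 - math.exp(-lam)) * math.exp(lam * y) * p_tail ** 2 / mean_n


if __name__ == "__main__":
    toy = [(5, 2), (3, 0), (4, 1), (2, 0)]     # made-up numbers, for illustration only
    print(estimate_k_y(math.log(2), 4, toy))
```

In practice the estimate would be examined over a range of *y*: too small a threshold lies outside the asymptotic regime, and too large a threshold leaves few realizations with *M* ≥ *y*.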

## REFERENCES

The Gumbel pre-factor *k* for gapped local alignment can be estimated from simulations of global alignment. Nucleic Acids Research. 2005; 33(15): 4987.
