NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Ann Stat. Author manuscript; available in PMC Feb 9, 2010.
Published in final edited form as:
Ann Stat. Dec 1, 2009; 37(6A): 3697.
doi:  10.1214/08-AOS663
PMCID: PMC2818155
NIHMSID: NIHMS145603

ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES

Abstract

The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.

Keywords: Gumbel scale parameter estimation, gapped sequence alignment, importance sampling, stopping time, Markov renewal process, Markov additive process

1. Introduction

Sequence alignment is an indispensable tool in modern molecular biology. As an example, BLAST [2, 3, 18] (the Basic Local Alignment Search Tool, http://www.ncbi.nlm.nih.gov/BLAST/), a popular sequence alignment program, receives about 2.89 submissions per second over the Internet. Currently, BLAST users can choose among only 5 standard alignment scoring systems, because BLAST p-values must be pre-computed with simulations that take about 2 days for the required p-value accuracies. Moreover, adjustments for unusual amino acid compositions are essential in protein database searches [33], and in that application, computational speed demands that the corresponding p-values be calculated with crude, relatively inaccurate approximations [3]. Accordingly, for more than a decade, much research has been directed at estimating BLAST p-values in real time (i.e., in less than 1 sec) [7, 24, 26, 29], so that BLAST might use arbitrary alignment scoring systems.

Several studies have used importance sampling to estimate the BLAST p-value [7, 9, 26]. To describe importance sampling briefly, let E denote the expectation for some “target distribution” P, let Q be any distribution, and consider the equation

$$EX := \int X(\omega)\,dP(\omega) = \int X(\omega)\,\frac{dP(\omega)}{dQ(\omega)}\,dQ(\omega).$$
(1.1)

A computer can draw samples $\omega_i$ (i = 1, . . . , r) from the “trial distribution” Q to estimate the expectation: $EX \approx r^{-1} \sum_{i=1}^{r} X(\omega_i)\,[dP(\omega_i)/dQ(\omega_i)]$. The name “importance sampling” derives from the fact that the subsets of the sample space where X is large dominate contributions to EX. By focusing sampling on the “important” subsets, judicious choice of the trial distribution Q can reduce the effort required to estimate EX. In importance sampling, the likelihood ratio $dP(\omega)/dQ(\omega)$ is often called the “importance sampling weight” (or simply, the “weight”) of the sample ω.
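As a small illustration of the estimator above, the following sketch estimates a Gaussian tail probability by importance sampling. The example is not from the article: the target P (standard normal), the trial Q (normal with mean c), the function name and the seed are all assumptions chosen only to make the weight dP/dQ explicit.

```python
import math
import random


def tail_prob_importance_sampling(c, r, rng):
    """Estimate P(Z > c) for standard normal Z, sampling from the trial N(c, 1)."""
    total = 0.0
    for _ in range(r):
        x = rng.gauss(c, 1.0)  # draw omega_i from the trial distribution Q
        if x > c:  # X(omega) is the indicator of the event {omega > c}
            # weight dP/dQ = phi(x) / phi(x - c) = exp(-c*x + c^2/2)
            total += math.exp(-c * x + 0.5 * c * c)
    return total / r


rng = random.Random(12345)  # hypothetical seed, for reproducibility only
estimate = tail_prob_importance_sampling(4.0, 100_000, rng)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # closed form for comparison
```

Because Q concentrates its samples where the indicator is nonzero, far fewer samples are needed than under crude Monte Carlo sampling from P, where only about 3 draws in 100,000 would exceed c = 4.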

A Monte Carlo technique called “sequential importance sampling” can substantially increase the statistical efficiency of importance sampling by generating samples from Q incrementally and exploiting the information gained during the increments to guide further increments. Although sequences might seem an especially natural domain for sequential sampling, most simulation studies for BLAST p-values have used sequences of fixed length. In contrast, our study involves sequences of random length.

Here, as in several other importance sampling studies [7, 9, 26, 34], hidden Markov models generate a trial distribution Q of random alignments between two sequences, where the sequences have a target distribution P. The other studies gloss over the fact that their trial and target distributions occur on different sample spaces (alignments and sequences, respectively). Because the other studies used sequences of fixed length, a relatively simple formula for the weight dP/dQ pertains there. For the sequences of random length in this paper, however, the stopping rules for sequential sampling complicate formulas for dP/dQ. Accordingly, the Appendix gives a general mapping theorem, which provides formulas for the weights dP/dQ when each sample from P corresponds to many different samples from Q. (In the present article, e.g., each pair of random sequences corresponds to many possible random alignments.) In addition to the mapping theorem, we also develop several other techniques specifically tailored to speeding the estimation of the BLAST p-value.

The organization of this article follows. Section 2 on background and notation is divided into 4 subsections containing: (1) a friendly introduction to sequence alignment and its notation; (2) a brief self-contained description of the algorithm for calculating global alignment scores; (3) a technical summary of previous research on estimating the BLAST p-value introducing our importance sampling methods; and (4) a heuristic model for random sequence alignment using Markov additive processes. Section 3 on Methods is also divided into 4 subsections containing: (1) a novel formula for the relevant Gumbel scale parameter λ; (2) a Markov chain model for simulating sequence alignments (borrowed directly from a previous study [34], but used here with a stopping time); (3) a dynamic programming algorithm for calculating the importance sampling weights in the presence of a stopping time; and (4) formulas for the simulation errors. Section 4 then gives numerical results for the estimation of λ under 5 popular alignment scoring schemes. Finally, Section 5 is our Discussion.

2. Background and notation

2.1. Sequence alignment and its notation

Let $A = A_1 A_2 \cdots$ and $B = B_1 B_2 \cdots$ be two semi-infinite sequences drawn from a finite alphabet $\mathcal{L}$, for example, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} (the amino acid alphabet) or {A, C, G, T} (the nucleotide alphabet). Let $s : \mathcal{L} \times \mathcal{L} \to \mathbb{R}$ denote a “scoring matrix.” In database applications, s(a, b) quantifies the similarity between a and b; for example, the so-called “PAM” (point accepted mutation) and “BLOSUM” (block sum) scoring matrices can quantify evolutionary similarity between two amino acids [11, 16].

The alignment graph $\Gamma_{A,B}$ of the sequence-pair (A, B) is a directed, weighted lattice graph in two dimensions, as follows. The vertices v of $\Gamma_{A,B}$ are nonnegative integer points (i, j). (Below, “:=” denotes a definition, e.g., the natural numbers are $\mathbb{N} := \{1, 2, 3, \ldots\}$. Throughout the article, i, j, k, m, n and g are integers.) Three sets of directed edges e come out of each vertex v = (i, j): northward, northeastward and eastward (see Figure 1). One northeastward edge goes into v = (i + 1, j + 1) with weight $s[e] = s(A_{i+1}, B_{j+1})$. For each g > 0, one eastward edge goes into v = (i + g, j) and one northward edge goes into v = (i, j + g); both are assigned the same weight $s[e] = -w_g < 0$. The deterministic function $w : \mathbb{N} \to (0, \infty]$ is called the “gap penalty.” (The value $w_g = \infty$ is explicitly permitted.) This article focuses on affine gap penalties $w_g = \Delta_0 + \Delta_1 g$ ($\Delta_0, \Delta_1 \ge 0$), which are typical in BLAST sequence alignments. Together, the scoring matrix s(a, b) and the gap penalty $w_g$ constitute the “alignment parameters.”

Fig. 1
Gapped global alignment scores and the corresponding directed paths for two subsequences A[1, 10] = TACTAGCGCA and B[1, 9] = ACGGTAGAT, drawn from the nucleotide alphabet {A, C, G, T}. Figure 1 uses a nucleotide scoring matrix, where s(a, b) = 5 if a ...

A (directed) path π = (v0, e1, v1, e2, . . . , ek, vk) in ΓA, B is a finite alternating sequence of vertices and edges that starts and ends with a vertex. For each i = 1, 2, . . . , k, the directed edge ei comes out of vertex vi–1 and goes into vertex vi. We say that the path π starts at v0 and ends at vk.

Denote finite subsequences of the sequence A by $A[i, m] = A_i A_{i+1} \cdots A_m$. Every gapped alignment of the subsequences A[i, m] and B[j, n] corresponds to exactly one path that starts at $v_0 = (i - 1, j - 1)$ and ends at $v_k = (m, n)$ (see Figure 1). The alignment's score is the “path weight” $S_\pi := \sum_{i=1}^{k} s[e_i]$.

Define the “global score” $S_{i,j} := \max_\pi S_\pi$, where the maximum is taken over all paths π starting at $v_0 = (0, 0)$ and ending at $v_k = (i, j)$. The paths π starting at $v_0$, ending at $v_k$, and having weight $S_\pi = S_{i,j}$ are “optimal global paths” and correspond to “optimal global alignments” between A[1, i] and B[1, j]. Define the “edge maximum” $M_n := \max\{\max_{0 \le i \le n} S_{i,n}, \max_{0 \le j \le n} S_{n,j}\}$, and the “global maximum” $M := \sup_{n \ge 0} M_n$. (The single subscript in $M_n$ indicates that the variate corresponds to a square [0, n] × [0, n], rather than a general rectangle [0, m] × [0, n].) Define the “strict ascending ladder epochs” (SALEs) in the sequence $(M_n)$: let β(0) := 0 and $\beta(k + 1) := \min\{n > \beta(k) : M_n > M_{\beta(k)}\}$, where $\min \emptyset := \infty$. We call $M_{\beta(k)}$ the “kth SALE score.”
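The definition of the SALEs can be made concrete with a short sketch. It assumes the edge maxima $M_0, M_1, \ldots$ are already available as a finite list (here, made-up numbers); the function name is hypothetical.

```python
def strict_ascending_ladder_epochs(M):
    """Return [beta(0), beta(1), ...] for a finite list M of edge maxima.

    beta(0) = 0 and beta(k+1) is the first n > beta(k) with M[n] > M[beta(k)].
    Only the epochs that occur within the list are returned.
    """
    epochs = [0]
    for n in range(1, len(M)):
        if M[n] > M[epochs[-1]]:  # strict inequality: ties are not ladder epochs
            epochs.append(n)
    return epochs


edge_maxima = [0, -3, 2, 2, 1, 5, 4]  # made-up values, with M_0 = 0
epochs = strict_ascending_ladder_epochs(edge_maxima)
sale_scores = [edge_maxima[n] for n in epochs]
```

For the list above, the epochs are 0, 2 and 5, with SALE scores 0, 2 and 5; the repeated value at n = 3 does not count, because the inequality is strict.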

Define also the “local score” $\tilde{S}_{i,j} := \max_\pi S_\pi$, where the maximum is taken over all paths π ending at $v_k = (i, j)$, regardless of their starting point. Define the “local maximum” $\tilde{M}_{m,n} := \max_{0 \le i \le m,\, 0 \le j \le n} \tilde{S}_{i,j}$. The paths π ending at $v_k = (i, j)$ with local score $S_\pi = \tilde{S}_{i,j} = \tilde{M}_{m,n}$ are “optimal local paths” corresponding to the “optimal local alignments” between subsequences of A[1, m] and B[1, n].

Now, the following “independent letters” model introduces randomness. Choose each letter in the sequences A and B randomly and independently from the alphabet $\mathcal{L}$ according to fixed probability distributions $\{p_a : a \in \mathcal{L}\}$ and $\{p'_b : b \in \mathcal{L}\}$. (Although this article permits the distributions $\{p_a\}$ and $\{p'_b\}$ to be different, in applications they are usually the same.) Throughout the paper, the probability and expectation for the independent letters model are denoted by P and E.

Let $\Gamma = \Gamma_{A,B}$ denote the random alignment graph of the sequence-pair (A, B). In the appropriate limit, if the alignment parameters are in the so-called “logarithmic phase” [6, 12] (i.e., if the optimal global alignment of long random sequences has a negative score), the random local maximum $\tilde{M}_{m,n}$ follows an approximate Gumbel extreme value distribution with “scale parameter” λ and “pre-factor” K [1, 14],

$$P(\tilde{M}_{m,n} > y) \approx 1 - \exp[-K m n \exp(-\lambda y)].$$
(2.1)
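Equation (2.1) translates directly into a p-value calculation. The sketch below is illustrative only: the function name is hypothetical, and the numerical values λ = 0.267 and K = 0.041 are assumed example parameters (values commonly quoted for gapped BLOSUM62 alignment), not results taken from this article.

```python
import math


def gumbel_pvalue(y, m, n, lam, K):
    """Approximate P(local maximum > y) from (2.1) for sequence lengths m and n."""
    expected_count = K * m * n * math.exp(-lam * y)
    # 1 - exp(-x), computed with expm1 to stay accurate when x is tiny
    return -math.expm1(-expected_count)


# Assumed, illustrative Gumbel parameters and sequence lengths.
p_moderate = gumbel_pvalue(50, 300, 300, 0.267, 0.041)
p_small = gumbel_pvalue(100, 300, 300, 0.267, 0.041)
```

Because λ multiplies the score y inside the exponential, a small relative error in λ changes the p-value of a high score far more than the same relative error in K does.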

2.2. The dynamic programming algorithm for global sequence alignment

For affine gaps wg = Δ0 + Δ1g, the global score Si, j is calculated with the recursion

$$S_{i,j} = \max\{S_{i-1,j-1},\ I_{i-1,j-1},\ D_{i-1,j-1}\} + s(A_i, B_j),$$
(2.2)

where

$$I_{i,j} = \max\{S_{i,j-1} - \Delta_0 - \Delta_1,\ I_{i,j-1} - \Delta_1,\ D_{i,j-1} - \Delta_0 - \Delta_1\},$$

$D_{i,j} = \max\{S_{i-1,j} - \Delta_0 - \Delta_1,\ D_{i-1,j} - \Delta_1\}$, and the boundary conditions are $S_{0,0} = 0$, $I_{0,0} = D_{0,0} = -\infty$, $D_{g,0} = I_{0,g} = -\Delta_0 - \Delta_1 g$, and $S_{g,0} = S_{0,g} = I_{g,0} = D_{0,g} = -\infty$ for g > 0 [15]. The three array names, S, I, and D, are mnemonics for “substitution,” “insertion” and “deletion.” If “Δ” denotes a gap character, the corresponding alignment letter-pairs (a, b), (Δ, b) and (a, Δ) correspond to the operations for editing sequence A into sequence B [30].
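The recursion (2.2) and its boundary conditions can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the match/mismatch scores (+5/−4) and the gap parameters are assumptions, and taking the maximum over the three arrays as the best path weight into (i, j) is this sketch's reading of the global score.

```python
NEG_INF = float("-inf")


def global_alignment_arrays(A, B, score, delta0, delta1):
    """Fill the arrays S, I, D of recursion (2.2) for affine gaps w_g = delta0 + delta1*g."""
    m, n = len(A), len(B)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0.0
    for g in range(1, m + 1):  # boundary: one deletion gap of length g
        D[g][0] = -delta0 - delta1 * g
    for g in range(1, n + 1):  # boundary: one insertion gap of length g
        I[0][g] = -delta0 - delta1 * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1]) \
                + score(A[i - 1], B[j - 1])
            I[i][j] = max(S[i][j - 1] - delta0 - delta1,
                          I[i][j - 1] - delta1,
                          D[i][j - 1] - delta0 - delta1)
            D[i][j] = max(S[i - 1][j] - delta0 - delta1,
                          D[i - 1][j] - delta1)
    return S, I, D


def best_score(S, I, D, i, j):
    """Best path weight into vertex (i, j) over the three final transitions."""
    return max(S[i][j], I[i][j], D[i][j])


def toy_score(a, b):
    return 5 if a == b else -4  # assumed nucleotide scores, for illustration


S, I, D = global_alignment_arrays("ACGT", "ACGT", toy_score, 5, 2)
identical = best_score(S, I, D, 4, 4)
S2, I2, D2 = global_alignment_arrays("AC", "A", toy_score, 5, 2)
one_deletion = best_score(S2, I2, D2, 2, 1)
```

In the second call, the best alignment matches A with A (score 5) and then deletes C (penalty 5 + 2), so the best score into (2, 1) is −2. Note that, following the recursion as stated, a deletion cannot directly follow an insertion.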

2.3. Previous methods for estimating the BLAST p-value

If $w_g \equiv \infty$ identically, so northward and eastward (gap) edges are disallowed in an optimal alignment path, a rigorous proof of (2.1) yields analytic formulas for the Gumbel parameters λ and K [12]. For gapped local alignment, rigorous results are sparse, although some approximate analytical studies are extant [21, 22, 27, 29]. The prevailing approach therefore estimates λ and K from simulations [4, 31]. Because λ is an exponential rate, it dominates K's contribution to the BLAST p-value. Most studies (including the present one) have therefore focused on λ. (Note, however, some recent progress on the real-time estimation of K [26].) Typically, current applications require a 1–4% relative error in λ and a 10–20% relative error in K [4]. The characteristics of the relevant sequence database determine the actual accuracies required, however, making approximations with controlled error and of arbitrary accuracy extremely desirable in practice.

Storey and Siegmund [29] approximate λ (with neither controlled errors nor arbitrary accuracy) as

$$\lambda \approx \lambda^* - 2(\mu^*)^{-1}\,\Lambda\,\frac{e^{-\lambda^* \Delta_0}}{e^{\lambda^* \Delta_1} - 1},$$
(2.3)

where $\sum_{(a,b)} p_a p'_b \exp[\lambda^* s(a, b)] = 1$ [so $\lambda^*$ is the so-called “ungapped lambda,” for $w_g \equiv \infty$] and $\mu^* := \sum_{(a,b)} s(a, b)\, p_a p'_b \exp[\lambda^* s(a, b)]$. In (2.3), Λ is an upper bound for an infinite sequence of constants defined in terms of gap lengths in a random alignment.
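The defining equation for the ungapped lambda has a unique positive root when the expected score is negative and some score is positive, so a simple bracketing search suffices. The sketch below is an assumption-laden illustration: the function name, the bisection strategy, and the toy scoring system (match +1, mismatch −1, uniform nucleotide frequencies) are not from the article.

```python
import math


def ungapped_lambda(p, p_prime, score, tol=1e-12):
    """Positive root of sum_{a,b} p[a] * p_prime[b] * exp(lam * score(a, b)) = 1."""
    def f(lam):
        return sum(pa * pb * math.exp(lam * score(a, b))
                   for a, pa in p.items() for b, pb in p_prime.items()) - 1.0

    expected = sum(pa * pb * score(a, b)
                   for a, pa in p.items() for b, pb in p_prime.items())
    if expected >= 0:
        raise ValueError("expected score must be negative")
    hi = 1.0
    while f(hi) <= 0:  # expand until the convex function f becomes positive
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:  # shrink until f is negative (f < 0 just right of 0)
        lo /= 2.0
    while hi - lo > tol:  # plain bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


uniform = {a: 0.25 for a in "ACGT"}
lam_star = ungapped_lambda(uniform, uniform, lambda a, b: 1 if a == b else -1)
```

For this toy system the equation reduces to $0.25e^{\lambda} + 0.75e^{-\lambda} = 1$, whose positive root is $\ln 3$, which gives a convenient check on the search.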

Many other studies have used local alignment simulations to estimate BLAST p-values, for example, Chan [9] used importance sampling and a mixture distribution. Some rigorous results [28] are also extant for the so-called “island method” [31, 32], which yields maximum likelihood estimates of λ and K from a Poisson process associated with local alignments exceeding a threshold score [4, 23].

Large deviations arguments [6, 35] support the common belief that global alignment can estimate λ for local alignment through the equation $\lambda = -\lim_{y \to \infty} y^{-1} \ln P\{M \ge y\}$. For a fixed error, global alignment typically requires less computational effort than local alignment. For example, one early study [34] used importance sampling based on trial distributions Q from a hidden Markov model.

The study demonstrated that the global alignment equation $E[\exp(\lambda S_{n,n})] = 1$ estimated λ with only $O(n^{-1})$ error [7]. (Recall that “E” denotes the expectation corresponding to the random letters model.) The equation $E[\exp(\lambda M_m)] = E[\exp(\lambda M_n)]$ ($m \ne n$), suggested by heuristic modeling with Markov additive processes (MAPs) [5, 10], improved the error substantially, to $O(\varepsilon^n)$ [24].

Section 3 shows how the MAP heuristic, with its renewal structure, can improve the efficiency of importance sampling even further. The next subsection gives the relevant parts of the MAP heuristic.

2.4. The Markov additive process heuristic

The rigorous theory of MAPs appears elsewhere [5, 10]. Because the MAP heuristics given below parallel a previous publication [24], we present only informal essentials.

Consider a finite Markov-chain state-space $\mathcal{J}$, containing $\#\mathcal{J}$ elements. Without loss of generality, $\mathcal{J} = \{1, \ldots, \#\mathcal{J}\}$. Until further notice, all vectors are row vectors of dimension $\#\mathcal{J}$; all matrices, of dimension $(\#\mathcal{J}) \times (\#\mathcal{J})$. A MAP can be defined in terms of a time-homogeneous Markov chain (MC) $(J_n \in \mathcal{J} : n = 0, 1, \ldots)$ and a $(\#\mathcal{J}) \times (\#\mathcal{J})$ matrix of real random variates $\|Z_{i,j}\|$. Let the MC have transition matrix $\mathbf{P} = \|p_{i,j}\|$, so $p_{i,j} = P(J_n = j \mid J_{n-1} = i)$. Let the stationary distribution of the MC be π, assumed strictly positive and satisfying both $\pi \mathbf{P} = \pi$ and $\pi \mathbf{1}^t = 1$, where $\mathbf{1}^t$ denotes the $(\#\mathcal{J}) \times 1$ column vector whose elements are all 1.

As usual, let Pγ and Eγ be the probability measure and expectation corresponding to an initial state J0 with distribution γ; Pi and Ei, to an initial state J0 = i; and Pπ and Eπ, to an initial state in the equilibrium distribution π.

Run the MC $(J_n)$, and take its succession of states as given. Consider the following sequence $(Y_n \in \mathbb{R} : n = 0, 1, \ldots)$ of random variates. Define $Y_0 := 0$. For n = 1, 2, . . . , let the $(Y_n)$ be conditionally independent, with distributions determined by the transition $J_{n-1} \to J_n$ of the Markov chain as follows. If $J_{n-1} = i$ and $J_n = j$, the value of $Y_n$ is chosen randomly from the distribution of $Z_{i,j}$. (Thus, if $J_{m-1} = J_{n-1} = i$ and $J_m = J_n = j$, $Y_m$ and $Y_n$ share the distribution of $Z_{i,j}$, although independence permits randomness to give them different values.)

The random variates of central interest are the sums $T_n = \sum_{m=0}^{n} Y_m$ (n = 0, 1, . . .) and the maximum $M := \max_{n \ge 0} T_n$. To exclude trivial distributions for M (i.e., M = 0 a.s. and M = ∞ a.s.), make two assumptions: (1) $E_\pi Y_1 < 0$; and (2) there is some m and state i such that

$$P_i\{\min\{T_k : k = 1, \ldots, m\} > 0;\ J_m = i,\ J_j \ne i \text{ for } j = 1, \ldots, m - 1\} > 0.$$
(2.4)

Consider the sequence $(T_n)$, its SALEs β(0) := 0 and $\beta(k + 1) := \min\{n > \beta(k) : T_n > T_{\beta(k)}\}$, and its SALE scores $T_{\beta(k)}$. For brevity, let β := β(1). Note that $M = T_{\beta(k)}$ for some $k \in \{0, 1, \ldots\}$. In a MAP, $(J_{\beta(k)}, T_{\beta(k)})$ forms a defective Markov renewal process.

Now, define the matrix $L_\theta := \|E_i[\exp(\theta T_\beta);\ J_\beta = j,\ \beta < \infty]\|$. The Perron–Frobenius theorem [5], page 25, shows that $L_\theta$ has a strictly dominant eigenvalue ρ(θ) > 0 [i.e., ρ(θ) is the unique eigenvalue of greatest absolute value]. Moreover, ρ(θ) is a convex function [19], and because $L_0$ is substochastic, ρ(0) < 1. The two assumptions above (2.4) ensure that $M := \max_{n \ge 0} T_n$ has a nontrivial distribution and that ρ(λ) = 1 for some unique λ > 0.

The notation intentionally suggests a heuristic analogy between MAPs and global alignment. Identify the Markov chain states Jn in the MAP with the rectangle [0, n] × [0, n] of ΓA, B, and identify the sum Tn in the MAP with the edge maximum Mn in global alignment. In the following, therefore, the identification leads to Mn replacing Tn in the MAP formulas. In particular, the MAP heuristic identifies the Gumbel scale parameter in (2.1) with the root λ > 0 of the equation ρ(λ) = 1. Although the heuristic analogy between MAPs and global alignment is in no way precise or rigorous, it has produced useful results [24].

The details of why the MAP heuristic works so well are presently obscure, although some additional motivation appears in an heuristic calculation related to λ [8]. The calculation takes the limit of nested successively wider semi-infinite strips, each strip having constant width and propagating itself northeastward in the alignment graph ΓA, B. The successive northeast boundaries of the propagation are states in an ergodic MC. MAPs therefore might rigorously justify the heuristic calculation.

3. Methods

3.1. A novel equation for λ

From the definition of Lθ in a MAP, if the Markov chain {Jn} starts in a state J0 with distribution γ (with Mn replacing Tn in the MAP formulas), matrix algebra applied to the concatenation of SALEs in a MAP yields

$$E_\gamma[\exp(\theta M_{\beta(k)});\ \beta(k) < \infty] = \gamma\,(L_\theta)^k\,\mathbf{1}^t.$$
(3.1)

For a MAP, equation (3.1) is exact; but for global alignment, it has no literal meaning. Equation (3.1) has some consequences for the limit k → ∞, and we speculate that the consequences hold, even for global alignment. [Note: although the sequence (β(k)) is a.s. finite, the limits k → ∞ below involve no contradiction or approximation, because they are not a.s. limits.]

Define $K_k(\theta) := \ln\{E_\gamma[\exp(\theta M_{\beta(k)});\ \beta(k) < \infty]\}$. In (3.1), a spectral (eigenvalue) decomposition of the matrix $L_\theta$ [25] shows that

$$K_k(\theta) = k \ln\{\rho(\theta)\} + c_0 + O(\varepsilon^k),$$
(3.2)

where 0 ≤ ε < 1 is determined by the magnitude of the subdominant eigenvalue of Lθ, and c0 is a constant independent of θ and k.

For k′ – k > 0 fixed, we can accelerate the convergence in (3.2) as k → ∞ by differencing

$$K_{k'}(\theta) - K_k(\theta) = (k' - k) \ln\{\rho(\theta)\} + O(\varepsilon^k).$$
(3.3)

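Let $\lambda_{k',k}$ denote the root of (3.3) after dropping the error term $O(\varepsilon^k)$. Because ρ(λ) = 1, Taylor approximation around λ yields $\ln\{\rho(\lambda_{k',k})\} \approx \rho'(\lambda)(\lambda_{k',k} - \lambda)$, so (3.3) becomes

$$(k' - k)\,\rho'(\lambda)\,(\lambda_{k',k} - \lambda) = O(\varepsilon^k),$$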
(3.4)

that is, with k′ – k fixed, λk′, k converges geometrically to λ as the SALE index k → ∞.

The initial state γ of global alignment has a deterministic distribution, namely the origin (0, 0). Equation (3.3) for θ = λ therefore becomes

$$E[\exp(\lambda M_{\beta(k')});\ \beta(k') < \infty] = E[\exp(\lambda M_{\beta(k)});\ \beta(k) < \infty]$$
(3.5)

after dropping the geometric error $O(\varepsilon^k)$. Let $\hat{\lambda}_{k',k}$ be the root of (3.5).
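A sample version of (3.5) can be solved numerically once the two expectations are replaced by weighted averages. The following sketch is an assumption, not the authors' implementation: it supposes each simulated sequence-pair supplies the SALE scores at β(k) and β(k′) together with a weight for each stopping time (a weight of 0 encoding a ladder epoch that was never reached), and it uses plain bisection. All names are hypothetical.

```python
import math


def estimate_lambda(samples, tol=1e-10):
    """Root of the empirical version of (3.5).

    samples: list of tuples (w_k, m_k, w_kp, m_kp), where m_k and m_kp are the
    SALE scores at beta(k) and beta(k'), and w_k and w_kp are the weights
    attached to the two stopping times.
    """
    def g(theta):
        return sum(w_kp * math.exp(theta * m_kp) - w_k * math.exp(theta * m_k)
                   for w_k, m_k, w_kp, m_kp in samples)

    lo, hi = 0.0, 1.0
    if g(lo) >= 0:
        raise ValueError("expected g(0) < 0, i.e. P{beta(k') finite} < P{beta(k) finite}")
    while g(hi) <= 0:  # expand the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# One synthetic sample chosen so that the root is known: 0.25*e^{3t} = 0.5*e^{2t}.
lam_hat = estimate_lambda([(0.5, 2.0, 0.25, 3.0)])
```

The synthetic sample has root $\ln 2$, which makes the sketch easy to check; real input would come from the simulations described in the following subsections.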

3.2. The trial distribution for importance sampling

In (3.5), crude Monte Carlo simulation generating random sequence-pairs with the independent letters model P is inefficient for the following reason. When practical alignment scoring systems are used, $P\{\beta(k) < \infty\} < 1$ for k ≥ 1. For example, with the BLAST defaults (scoring matrix BLOSUM62, gap penalty $w_g = 11 + g$, and Robinson–Robinson letter frequencies), $P\{\beta(4) < \infty\} \approx 0.047$, so only about 1 in 20 crude Monte Carlo simulations generates a fourth ladder point. Empirically in our importance sampling, however, Gumbel parameter estimation seemed most efficient when the stopping time corresponded to β(4) (see below).

Importance sampling requires a trial distribution to determine $\hat{\lambda}_{k',k}$ from (3.5). By editing one sequence into another, a Markov chain model borrowed directly from a previous study [34] generates random sequence alignments, as follows.

Consider a Markov state space consisting of the set of alignment letter-pairs $\bar{\mathcal{L}}^2$, where $\bar{\mathcal{L}} := \mathcal{L} \cup \{\Delta\}$, “Δ” being a character representing gaps. The ordered pair (Δ, Δ) has probability 0, so a succession of Markov states corresponds to a global sequence alignment (see Figure 1), that is, to a path in the alignment graph $\Gamma_{A,B}$. Ordered pairs other than (Δ, Δ) fall into three sets, corresponding to edit operations following (2.2): $S := \mathcal{L} \times \mathcal{L}$ [substitution, a bioinformatics term implicitly including identical letter-pairs (a, a)], $I := \{\Delta\} \times \mathcal{L}$ (insertion), and $D := \mathcal{L} \times \{\Delta\}$ (deletion). The sets S, I and D form “atoms” of the MC [13], page 203, as follows. (By definition, each atom of a MC is a set of all states with identical outgoing transition probabilities.)

From the set S, the transition probability to (a, b) is $t_{S,S}\, q_{a,b}$; to (Δ, b), $t_{S,I}\, p'_b$; and to (a, Δ), $t_{S,D}\, p_a$. From the set I, the transition probability to (a, b) is $t_{I,S}\, q_{a,b}$; to (Δ, b), $t_{I,I}\, p'_b$; and to (a, Δ), $t_{I,D}\, p_a$. From the set D, the transition probability to (a, b) is $t_{D,S}\, q_{a,b}$; to (Δ, b), $t_{D,I}\, p'_b$; and to (a, Δ), $t_{D,D}\, p_a$. Transition probabilities sum to 1, so the following restrictions apply: $\sum_{a,b \in \mathcal{L}} q_{a,b} = 1$, $\sum_{b \in \mathcal{L}} p'_b = 1$, $\sum_{a \in \mathcal{L}} p_a = 1$, $t_{S,S} + t_{S,I} + t_{S,D} = 1$ (transit from the substitution atom), $t_{D,D} + t_{D,S} + t_{D,I} = 1$ (transit from the deletion atom) and $t_{I,I} + t_{I,S} + t_{I,D} = 1$ (transit from the insertion atom). Usually in practice, the term $t_{I,D} = 0$, to disallow a deletion immediately following an insertion, in agreement with the recursion for $D_{i,j}$ in (2.2). Our formulas retain the term, to exploit the resulting symmetry later.

In the terminology of hidden Markov models, S, I and D are hidden Markov states; $t_{i,j}$ for $i, j \in \{S, I, D\}$ are transition probabilities; and $q_{a,b}$, $p'_b$ and $p_a$ for $a, b \in \mathcal{L}$ are emission probabilities from the states S, I and D, respectively.

As described elsewhere [34], numerical values for the Markov probabilities can be determined from the scores s(a, b) and the gap penalty wg. Note that the values are selected for statistical efficiency, although many other values also yield unbiased estimates for λ in the appropriate limit.
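The Markov chain described in this subsection can be simulated with a few lines of code. The sketch below is illustrative only: the function name, the gap symbol "-", the nucleotide alphabet, and every numerical probability are assumptions made for the example, not the statistically efficient values determined in [34].

```python
import random

GAP = "-"  # stands in for the gap character written as Delta in the text


def sample_alignment(t, q, p, p_prime, steps, rng):
    """Draw `steps` alignment letter-pairs from the three-atom Markov chain.

    t[(X, Y)] is the transition probability from atom X to atom Y;
    q, p_prime and p are the emission distributions of atoms S, I and D.
    The chain starts at the atom S.
    """
    atom = "S"
    pairs = []
    for _ in range(steps):
        atom = rng.choices(["S", "I", "D"],
                           weights=[t[atom, "S"], t[atom, "I"], t[atom, "D"]])[0]
        if atom == "S":  # substitution emits a letter-pair (a, b)
            a, b = rng.choices(list(q), weights=list(q.values()))[0]
        elif atom == "I":  # insertion emits (gap, b)
            a, b = GAP, rng.choices(list(p_prime), weights=list(p_prime.values()))[0]
        else:  # deletion emits (a, gap)
            a, b = rng.choices(list(p), weights=list(p.values()))[0], GAP
        pairs.append((a, b))
    return pairs


letters = "ACGT"
p = {a: 0.25 for a in letters}
p_prime = dict(p)
q = {(a, b): (0.19 if a == b else 0.02) for a in letters for b in letters}
t = {("S", "S"): 0.9, ("S", "I"): 0.05, ("S", "D"): 0.05,
     ("I", "S"): 0.6, ("I", "I"): 0.4, ("I", "D"): 0.0,
     ("D", "S"): 0.6, ("D", "I"): 0.1, ("D", "D"): 0.3}
alignment = sample_alignment(t, q, p, p_prime, 20, random.Random(7))
```

Reading the first components of the emitted pairs (skipping gaps) gives a prefix of sequence A, and the second components give a prefix of sequence B; many different sampled alignments therefore map to the same sequence-pair, which is the situation the mapping theorem addresses.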

3.3. Importance sampling weights and stopping times

To establish notation, and to make connections to the Appendix and its mapping theorem, note that the MC above can be supported on a probability space (Ω,F,Q), where each ω = (π, A, B) [set membership] Ω is an ordered triple. Here, π is an infinite path starting at the origin in the alignment graph ΓA, B; F is the set generated by cylinder sets in Ω (here, cylinder sets essentially consist of some finite path and the corresponding pair of subsequences); and Q is the MC probability distribution described above, started at the atom S, with expectation operator EQ.

Let N be any stopping time for the sequence $(M_n : n = 0, 1, \ldots)$ of edge maxima for $\Gamma_{A,B}$ (i.e., the sequence $\{M_0, \ldots, M_n\}$ determines whether $N \le n$ or not). Because $M_n$ is determined by (A[1, n], B[1, n]), N is also a stopping time for the sequence {(A[1, n], B[1, n]) : n = 0, 1, . . .}. The stopping time of main interest here is N = β(k), the kth ladder index of $(M_n)$, where k ≥ 1 is arbitrary. (As further motivation for the mapping theorem in the Appendix, other stopping times of possible interest include, for example, N = n, a fixed epoch [7], and $N = \beta(K_y)$, where $\beta(K_y) = \inf\{n : M_n \ge y\}$ is the index of the first ladder-score outside the interval (0, y).)

To use the mapping theorem, introduce the probability space $(\Omega'', \mathcal{F}'', P'')$, where each $\omega'' = (A, B) \in \Omega''$ is an ordered pair. Here, A and B are sequences, $\mathcal{F}''$ is the set generated by all cylinder sets in Ω″ (i.e., sets corresponding to pairs of finite subsequences) and $P''(A'') = \prod_{k=1}^{i} p_{A_k} \prod_{k=1}^{j} p'_{B_k}$, if the cylinder set A″ corresponds to the subsequence pair (A[1, i], B[1, j]). Given N, the theory of stopping times [5], page 414, can be used to construct a discrete probability space $(\Omega', \mathcal{F}', P')$, where each outcome $\omega' \in \Omega'$ is a finite-sequence pair ω′ = (A[1, N], B[1, N]), $\mathcal{F}'$ is the set of all subsets of Ω′ and $P'(\omega') = \prod_{k=1}^{N(\omega')} p_{A_k} \prod_{k=1}^{N(\omega')} p'_{B_k}$.

Let $I_{m,n} := \{(i, j) : i = m,\ j \ge n\}$ and $D_{m,n} := \{(i, j) : i \ge m,\ j = n\}$. Define the function $f : \omega \mapsto \omega'$, where ω = (π, A, B) and $\omega' = (\omega'_A, \omega'_B) := (A[1, N], B[1, N])$. Then, $\omega \in f^{-1}(\omega')$, if and only if the path π hits the set $I_{N,N} \cup D_{N,N}$ at (i, j), so that $A[1, N] = \omega'_A$ and $B[1, N] = \omega'_B$ (see Figure 2).

Fig. 2
Two examples of alignment path π generated by a Markov chain. As in Figure 1, the shading and the double lines indicate squares where a vertex (surrounded by double lines) generated an SALE. The SALEs determine the stopping time N = β ...

Empirically, our simulations satisfied $Q\{\beta(k) < \infty\} = 1$, and we speculate that our application therefore satisfies the hypothesis QH = 1 of the Appendix. According to the Appendix, the reciprocal importance sampling weight $1/W(\omega) = \sum_{\omega_0 \in f^{-1}\{f(\omega)\}} Q(\omega_0) \big/ P'f(\omega)$ depends on the sum over all possible Markov chain realizations $\omega_0 \in f^{-1}(\omega')$. Dynamic programming computes the sum efficiently, as follows.

Let the “transition” T represent any element of {S, I, D} [substitution $(a_i, b_j)$, insertion $(\Delta, b_j)$, or deletion $(a_i, \Delta)$]. Fix any particular pair (A, B) of infinite sequences, which fixes N = β(k). To set up a recursion for dynamic programming, consider the following set of events $E^T_{i,j}$, defined for $T \in \{S, I, D\}$ and min{i, j} ≤ N, and illustrated in Figure 2. Let $E^T_{i,j}$ be the event consisting of all ω yielding a path π whose final transition is T and which corresponds to the subsequences: (1) A[1, i] and B[1, j] for $0 \le i, j \le N$; (2) A[1, i] and B[1, N] for $0 \le i \le N < j$; and (3) A[1, N] and B[1, j] for $0 \le j \le N < i$. Define $Q^T_{i,j} := Q(E^T_{i,j})$ and $Q_{i,j} := Q^S_{i,j} + Q^I_{i,j} + Q^D_{i,j}$. (Note: in the following, $T \in \{S, I, D\}$ is always a superscript, never an exponent.)

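For brevity, let $\tilde{q}_{i,j} = q_{A_i, B_j}$ for $0 \le i, j \le N$; $\tilde{q}_{i,j} = \sum_{a \in \mathcal{L}} q_{a, B_j}$ for $0 \le j \le N < i$; $\tilde{q}_{i,j} = \sum_{b \in \mathcal{L}} q_{A_i, b}$ for $0 \le i \le N < j$; and $\tilde{q}_{i,j} = 1$ otherwise. Let $\tilde{p}'_j = p'_{B_j}$ for $0 \le j \le N$; and 1 otherwise. Finally, let $\tilde{p}_i = p_{A_i}$ for $0 \le i \le N$; and 1 otherwise. Because every path into the vertex (i, j) comes from one of three vertices, each corresponding to a different transition $T \in \{S, I, D\}$,

$$\begin{aligned}
Q^S_{i,j} &= \tilde{q}_{i,j}\,(t_{S,S}\, Q^S_{i-1,j-1} + t_{I,S}\, Q^I_{i-1,j-1} + t_{D,S}\, Q^D_{i-1,j-1}),\\
Q^I_{i,j} &= \tilde{p}'_j\,(t_{S,I}\, Q^S_{i,j-1} + t_{I,I}\, Q^I_{i,j-1} + t_{D,I}\, Q^D_{i,j-1}),\\
Q^D_{i,j} &= \tilde{p}_i\,(t_{S,D}\, Q^S_{i-1,j} + t_{I,D}\, Q^I_{i-1,j} + t_{D,D}\, Q^D_{i-1,j})
\end{aligned}$$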
(3.6)

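with boundary conditions $Q^S_{0,0} = 1$, $Q^I_{0,0} = Q^D_{0,0} = 0$, $Q^S_{g,0} = Q^S_{0,g} = Q^I_{g,0} = Q^D_{0,g} = 0$, $Q^I_{0,g} = p'_{B_1} \cdots p'_{B_g}\, t_{S,I}\, (t_{I,I})^{g-1}$ and $Q^D_{g,0} = p_{A_1} \cdots p_{A_g}\, t_{S,D}\, (t_{D,D})^{g-1}$ (g > 0).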

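Recall that $\omega = (\pi, A, B) \in f^{-1}(\omega')$, if and only if the path π hits the set $I_{N,N} \cup D_{N,N}$ at (i, j), so that $A[1, N] = \omega'_A$ and $B[1, N] = \omega'_B$. Thus (the term $-Q^S_{N,N}$ correcting for the vertex (N, N), which both sums include),

$$\sum_{\omega \in f^{-1}(\omega')} Q(\omega) = -Q^S_{N,N} + \sum_{j=N}^{\infty} (Q^S_{N,j} + Q^D_{N,j}) + \sum_{i=N}^{\infty} (Q^S_{i,N} + Q^I_{i,N}).$$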
(3.7)

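To turn (3.6) into a recursion for importance sampling weights, define $P_i := p_{A_1} \cdots p_{A_{\min\{i,N\}}} = \tilde{p}_1 \cdots \tilde{p}_i$ and $P'_j := p'_{B_1} \cdots p'_{B_{\min\{j,N\}}} = \tilde{p}'_1 \cdots \tilde{p}'_j$, and let $W^T_{i,j} := Q^T_{i,j} / (P_i P'_j)$ ($T \in \{S, I, D\}$). Let $r_{i,j} = \tilde{q}_{i,j} / (\tilde{p}_i \tilde{p}'_j)$. For future reference, define $r_{\bullet,j} := r_{i,j}$ for $0 \le j \le N < i$ and $r_{i,\bullet} := r_{i,j}$ for $0 \le i \le N < j$. Note that $r_{\bullet,j}$ is independent of i, and $r_{i,\bullet}$ is independent of j. Equation (3.6) yields

$$\begin{aligned}
W^S_{i,j} &= r_{i,j}\,(t_{S,S}\, W^S_{i-1,j-1} + t_{I,S}\, W^I_{i-1,j-1} + t_{D,S}\, W^D_{i-1,j-1}),\\
W^I_{i,j} &= t_{S,I}\, W^S_{i,j-1} + t_{I,I}\, W^I_{i,j-1} + t_{D,I}\, W^D_{i,j-1},\\
W^D_{i,j} &= t_{S,D}\, W^S_{i-1,j} + t_{I,D}\, W^I_{i-1,j} + t_{D,D}\, W^D_{i-1,j}
\end{aligned}$$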
(3.8)

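with boundary conditions $W^S_{0,0} = 1$, $W^I_{0,0} = W^D_{0,0} = 0$, $W^S_{g,0} = W^S_{0,g} = W^I_{g,0} = W^D_{0,g} = 0$, $W^I_{0,g} = t_{S,I}\, (t_{I,I})^{g-1}$ and $W^D_{g,0} = t_{S,D}\, (t_{D,D})^{g-1}$ (g > 0). Because of (3.7), the importance sampling weight W := W(ω) satisfies

$$\frac{1}{W} = \frac{\sum_{\omega_0 \in f^{-1}\{f(\omega)\}} Q(\omega_0)}{P'f(\omega)} = -W^S_{N,N} + \sum_{j=N}^{\infty} (W^S_{N,j} + W^D_{N,j}) + \sum_{i=N}^{\infty} (W^S_{i,N} + W^I_{i,N}).$$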
(3.9)
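The recursion (3.8) and its boundary conditions can be sketched on the square 0 ≤ i, j ≤ N as follows. This is a minimal sketch under stated assumptions, not the authors' code: it covers only the finite square (not the infinite sums in (3.9)), and the function name, the toy sequences, and all numerical probabilities are hypothetical.

```python
def weight_arrays(N, r, t):
    """Fill W[T][i][j] for T in {S, I, D} and 0 <= i, j <= N using recursion (3.8).

    r(i, j) returns the ratio r_{i,j}; t[(X, Y)] is the transition probability
    from atom X to atom Y.
    """
    W = {T: [[0.0] * (N + 1) for _ in range(N + 1)] for T in "SID"}
    W["S"][0][0] = 1.0
    for g in range(1, N + 1):  # boundary conditions for a single leading gap
        W["I"][0][g] = t["S", "I"] * t["I", "I"] ** (g - 1)
        W["D"][g][0] = t["S", "D"] * t["D", "D"] ** (g - 1)
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            W["S"][i][j] = r(i, j) * sum(t[T, "S"] * W[T][i - 1][j - 1] for T in "SID")
            W["I"][i][j] = sum(t[T, "I"] * W[T][i][j - 1] for T in "SID")
            W["D"][i][j] = sum(t[T, "D"] * W[T][i - 1][j] for T in "SID")
    return W


# Hypothetical toy input: uniform nucleotide frequencies, so p_a * p'_b = 1/16.
A, B = "ACG", "ACT"


def r(i, j):
    q_ab = 0.19 if A[i - 1] == B[j - 1] else 0.02  # assumed emission q_{a,b}
    return q_ab / (0.25 * 0.25)


t = {("S", "S"): 0.9, ("S", "I"): 0.05, ("S", "D"): 0.05,
     ("I", "S"): 0.6, ("I", "I"): 0.4, ("I", "D"): 0.0,
     ("D", "S"): 0.6, ("D", "I"): 0.1, ("D", "D"): 0.3}
W = weight_arrays(3, r, t)
```

Dividing out the letter probabilities, as in the definition of $W^T_{i,j}$, means the insertion and deletion rows of the recursion involve only transition probabilities, which is why `r` appears only in the substitution update.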

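Because $r_{i,j} = r_{\bullet,j}$ ($0 \le j \le N < i$) and $r_{i,j} = r_{i,\bullet}$ ($0 \le i \le N < j$), only a finite number of recursions are needed to compute the infinite sums in (3.9), as follows. For $T \in \{S, I, D\}$, define $\tilde{U}^T_i := U^T_{i,N}$, where $U^T_{m,n} := \sum_{j=n}^{\infty} W^T_{m,j}$. Likewise, define $\tilde{V}^T_j := V^T_{N,j}$, where $V^T_{m,n} := \sum_{i=m}^{\infty} W^T_{i,n}$. Equation (3.9) becomes

$$\frac{1}{W} = -W^S_{N,N} + \tilde{U}^S_N + \tilde{U}^D_N + \tilde{V}^S_N + \tilde{V}^I_N.$$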
(3.10)

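Note that $U^T_{i,j-1} - U^T_{i,j} = W^T_{i,j-1}$. To determine $\tilde{U}^T_N$, summation of (3.8) for $0 \le i \le N < j$ yields

$$\begin{aligned}
U^S_{i,j} &= r_{i,\bullet}\,(t_{S,S}\, U^S_{i-1,j-1} + t_{I,S}\, U^I_{i-1,j-1} + t_{D,S}\, U^D_{i-1,j-1}) = U^S_{i,j-1} - W^S_{i,j-1},\\
U^I_{i,j} &= t_{S,I}\, U^S_{i,j-1} + t_{I,I}\, U^I_{i,j-1} + t_{D,I}\, U^D_{i,j-1} = U^I_{i,j-1} - W^I_{i,j-1},\\
U^D_{i,j} &= t_{S,D}\, U^S_{i-1,j} + t_{I,D}\, U^I_{i-1,j} + t_{D,D}\, U^D_{i-1,j}.
\end{aligned}$$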
(3.11)

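Elimination of $U^T_{i,j}$ for j = N + 1 and i = 1, . . . , N in the first two equations yields

$$\begin{aligned}
U^S_{i,N} &= r_{i,\bullet}\,(t_{S,S}\, U^S_{i-1,N} + t_{I,S}\, U^I_{i-1,N} + t_{D,S}\, U^D_{i-1,N}) + W^S_{i,N},\\
U^I_{i,N} &= t_{S,I}\, U^S_{i,N} + t_{I,I}\, U^I_{i,N} + t_{D,I}\, U^D_{i,N} + W^I_{i,N},\\
U^D_{i,N} &= t_{S,D}\, U^S_{i-1,N} + t_{I,D}\, U^I_{i-1,N} + t_{D,D}\, U^D_{i-1,N},
\end{aligned}$$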
(3.12)

that is,

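$$\begin{aligned}
\tilde{U}^S_i &= r_{i,\bullet}\,(t_{S,S}\, \tilde{U}^S_{i-1} + t_{I,S}\, \tilde{U}^I_{i-1} + t_{D,S}\, \tilde{U}^D_{i-1}) + W^S_{i,N},\\
\tilde{U}^I_i &= (1 - t_{I,I})^{-1}\,(t_{S,I}\, \tilde{U}^S_i + t_{D,I}\, \tilde{U}^D_i + W^I_{i,N}),\\
\tilde{U}^D_i &= t_{S,D}\, \tilde{U}^S_{i-1} + t_{I,D}\, \tilde{U}^I_{i-1} + t_{D,D}\, \tilde{U}^D_{i-1}
\end{aligned}$$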
(3.13)

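with initial values $\tilde{U}^S_0 = \tilde{U}^D_0 = 0$ and $\tilde{U}^I_0 = (1 - t_{I,I})^{-1}\, W^I_{0,N} = (1 - t_{I,I})^{-1}\, t_{S,I}\, (t_{I,I})^{N-1}$. Compute (3.13) recursively for i = 1, . . . , N. Similarly, reflect through i = j to derive

$$\begin{aligned}
\tilde{V}^S_j &= r_{\bullet,j}\,(t_{S,S}\, \tilde{V}^S_{j-1} + t_{D,S}\, \tilde{V}^D_{j-1} + t_{I,S}\, \tilde{V}^I_{j-1}) + W^S_{N,j},\\
\tilde{V}^I_j &= t_{S,I}\, \tilde{V}^S_{j-1} + t_{D,I}\, \tilde{V}^D_{j-1} + t_{I,I}\, \tilde{V}^I_{j-1},\\
\tilde{V}^D_j &= (1 - t_{D,D})^{-1}\,(t_{S,D}\, \tilde{V}^S_j + t_{I,D}\, \tilde{V}^I_j + W^D_{N,j})
\end{aligned}$$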
(3.14)

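with initial values $\tilde{V}^S_0 = \tilde{V}^I_0 = 0$ and $\tilde{V}^D_0 = (1 - t_{D,D})^{-1}\, W^D_{N,0} = (1 - t_{D,D})^{-1}\, t_{S,D}\, (t_{D,D})^{N-1}$. Iterate (3.14) for j = 1, . . . , N. Substitute the results for $\tilde{U}^S_N$, $\tilde{U}^D_N$, $\tilde{V}^S_N$ and $\tilde{V}^I_N$ into (3.10) to compute W.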

3.4. Error estimates for $\hat{\lambda}_{k',k}$

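Denote the indicator of an event A by $\mathbb{I}_A$, that is, $\mathbb{I}_A = 1$ if A occurs and 0 otherwise. For a realization ω in the simulation, define

$$h_{k',k}(\theta) := h_{k',k}(\theta; \omega) := \exp(\theta M_{\beta(k')})\, \mathbb{I}[\beta(k') < \infty] - \exp(\theta M_{\beta(k)})\, \mathbb{I}[\beta(k) < \infty]$$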
(3.15)

and let $h'_{k',k}$ be its derivative with respect to θ.

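Given samples $\omega_i$ (i = 1, . . . , r) from the trial distribution Q, let $W(\omega_i)$ denote the corresponding importance sampling weights. Because $\hat{\lambda}_{k',k}$ is the M-estimator [17] of the root $\lambda_{k',k}$ of $E h_{k',k}(\lambda_{k',k}) = 0$, as r → ∞, $\sqrt{r}\,(\hat{\lambda}_{k',k} - \lambda_{k',k})$ converges in distribution to the normal distribution with mean 0 and variance [17]

$$\frac{E_Q[h(\lambda_{k',k})\, W]^2}{\{E_Q[h'(\lambda_{k',k})\, W]\}^2} \approx \frac{r^{-1} \sum_{i=1}^{r} [h(\omega_i; \hat{\lambda}_{k',k})\, W(\omega_i)]^2}{\{r^{-1} \sum_{i=1}^{r} [h'(\omega_i; \hat{\lambda}_{k',k})\, W(\omega_i)]\}^2}.$$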
(3.16)
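The plug-in estimate on the right of (3.16) is straightforward to compute from per-sample values. The sketch below assumes the values $h(\omega_i; \hat{\lambda})$, $h'(\omega_i; \hat{\lambda})$ and $W(\omega_i)$ have already been evaluated; the function name and the numbers in the example are hypothetical.

```python
import math


def lambda_standard_error(h, h_prime, W):
    """Standard error of the root estimate implied by the plug-in form of (3.16).

    h[i], h_prime[i] and W[i] are h, its theta-derivative and the importance
    sampling weight for sample i, all evaluated at the estimated root.
    """
    r = len(W)
    numerator = sum((hi * wi) ** 2 for hi, wi in zip(h, W)) / r
    slope = sum(di * wi for di, wi in zip(h_prime, W)) / r
    asymptotic_variance = numerator / slope ** 2  # variance of sqrt(r)*(estimate - root)
    return math.sqrt(asymptotic_variance / r)


se = lambda_standard_error([1.0, -1.0], [2.0, 2.0], [1.0, 1.0])
```

With the made-up inputs above, the asymptotic variance is 1/4 and r = 2, so the standard error is $\sqrt{1/8}$.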

4. Numerical study for Gumbel scale parameter

Table 1 gives our “best estimate” of the Gumbel scale parameter λ from (3.5) for each of the 5 options BLASTP gives users for the alignment scoring scheme. For every scheme, estimates $\hat{\lambda}$ derived from the first to fourth SALEs indicated that $\hat{\lambda}$ generally is biased above the true value λ, but that $\hat{\lambda}$ converged adequately by the fourth SALE. The best estimate (shown in Table 1) is the average of 200 independent estimates $\hat{\lambda}$, each computed within 1 sec from sequence-pairs simulated up to their fourth SALE. For BLOSUM62 and gap penalty $w_g = 11 + g$, the average computation produced 1441 sequence-pairs up to their fourth SALE within 1 second. (For results relevant to the other publicly available scoring schemes, see Table 1.) The best estimates derived from (3.5) were within the error of the BLASTP values for λ.

Table 1
Best estimates of λ for the 5 BLASTP alignment scoring schemes. For each scheme, we generated 200 estimates $\hat{\lambda}$, each within a one-second computation time. The third column gives present estimates of λ used on the BLAST web ...

Despite having the variance formula in (3.16) in hand, we elected to estimate the standard error $\hat{s}_\lambda$ directly from the 200 independent estimates $\hat{\lambda}$. Figure 3 plots the relative error $\hat{s}_\lambda / \lambda$ in each individual $\hat{\lambda}$ against the computation time, where $\hat{s}_\lambda$ is the standard error of $\hat{\lambda}$. It shows that for all 5 BLASTP online options, (3.5) easily computed $\hat{\lambda}$ to 1–4% accuracy within about 0.5 seconds.

Fig. 3
Plot of relative errors against computation time (sec). Both axes are in logarithmic scale. Computation time was measured on a 2.99 GHz Pentium® D CPU. Relative errors for BLOSUM45 with $w_g = 14 + 2g$ are shown by ■; BLOSUM62 ...

5. Discussion

This article indicates that the scale parameter λ of the Gumbel distribution for local alignment of random sequences satisfies (3.5), an equation involving the strict ascending ladder-points (SALEs) from global alignment, at least approximately. For standard protein scoring systems, in fact, simulation error could account for most (if not all) of the observed differences between values of λ calculated from (3.5) and values calculated from extensive crude Monte Carlo simulations. (The values of λ from crude simulation have a standard error of about ±1%.) In SALE simulations, (3.5) estimated λ to 1–4% accuracy within 0.5 second, as required by BLAST database searches over the Web. The present study did not tune simulations much; it relied instead on methods specific to sequence alignment to improve estimation. Many general strategies for sequential importance sampling therefore remain available to speed simulation. Preliminary investigations estimating the other Gumbel parameter (the pre-factor K) with SALEs are encouraging, so online estimation of the entire Gumbel distribution for arbitrary scoring schemes appears imminent, and preliminary computer code is already in place.

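Acknowledgments

The authors Y. Park and S. Sheetlin contributed equally to the article. All authors would like to acknowledge helpful discussions with Dr. Nak-Kyeong Kim. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.

APPENDIX: A GENERAL MAPPING THEOREM FOR IMPORTANCE SAMPLING

The following theorem describes an unusual type of Rao-Blackwellization [20]. Consider two probability spaces $(\Omega, \mathcal{F}, Q)$ and $(\Omega', \mathcal{F}', P)$, and an $\mathcal{F}/\mathcal{F}'$-measurable function $f: \Omega \to \Omega'$ (i.e., $f^{-1}G' \in \mathcal{F}$ for every $G' \in \mathcal{F}'$). Note: $f$ is explicitly permitted to be many-to-one. Let $P \ll Qf^{-1}$ on some set $H'$ (i.e., $Qf^{-1}G' = 0 \Rightarrow PG' = 0$ for any set $G' \subseteq H'$), so the Radon–Nikodym derivative in the second line of (A.1) below exists. Let $H := f^{-1}H'$, so for every random variate $X'$ on $(\Omega', \mathcal{F}')$,

\[
\begin{aligned}
E[X'; H'] &:= \int_{\omega' \in H'} X'(\omega')\, dP(\omega') \\
&= \int_{\omega' \in H'} X'(\omega')\, \frac{dP(\omega') \int_{\omega \in f^{-1}(\omega')} dQ(\omega)}{\int_{\omega_0 \in f^{-1}(\omega')} dQ(\omega_0)} \\
&= \int_{\omega' \in H'} \int_{\omega \in f^{-1}(\omega')} X' \circ f(\omega)\, \frac{dP \circ f(\omega)}{\int_{\omega_0 \in f^{-1}\{f(\omega)\}} dQ(\omega_0)}\, dQ(\omega) \\
&= \int_{\omega \in H} X' \circ f(\omega)\, \frac{dP \circ f(\omega)}{\int_{\omega_0 \in f^{-1}\{f(\omega)\}} dQ(\omega_0)}\, dQ(\omega).
\end{aligned}
\tag{A.1}
\]

Consider the application of (A.1) to importance sampling with target distribution $P$ and trial distribution $Q$. Assume $QH = 1$, so $H$ supports $Q$. In our application to global alignment, $H = [\beta(k) < \infty] \subset \Omega$ ("$\subset$" being strict inclusion), but we speculate $QH = 1$.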

In Monte Carlo applications, a discrete sample space H is usually available. Accordingly, the following theorem replaces the integrals in (A.1) by sums.

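The mapping theorem for importance sampling. Let

\[
\frac{1}{W(\omega)} := \frac{\sum_{\omega_0 \in f^{-1}\{f(\omega)\}} Q(\omega_0)}{P \circ f(\omega)}. \tag{A.2}
\]

Under the above conditions, $r^{-1} \sum_{i=1}^{r} [X' \circ f(\omega_i)\, W(\omega_i)] \to E[X'; H']$ with probability 1 and in mean (with respect to $Q$), as the number of realizations $r \to \infty$.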

The mapping theorem is an easy application of the law of large numbers to (A.1).
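The theorem can be illustrated on a toy discrete example. The sketch below is not the alignment simulation of this article: the six-point trial space, the three-point target space, the many-to-one map f(ω) = ⌊ω/2⌋, the variate X′ and all probabilities are hypothetical choices made only to show how the weight W(ω) divides the target probability of f(ω) by the total trial probability of the fibre f⁻¹{f(ω)}.

```python
import random
from collections import defaultdict


def mapping_weights(q, p, f):
    """W(omega) = P(f(omega)) / sum of Q over the fibre f^{-1}{f(omega)}."""
    fibre_mass = defaultdict(float)
    for omega, q_omega in q.items():
        fibre_mass[f(omega)] += q_omega
    return {omega: p.get(f(omega), 0.0) / fibre_mass[f(omega)] for omega in q}


def mapped_estimate(x_prime, q, p, f, r, rng):
    """Average of X'(f(omega_i)) * W(omega_i) over r draws omega_i from Q."""
    weights = mapping_weights(q, p, f)
    outcomes = list(q)
    draws = rng.choices(outcomes, weights=[q[o] for o in outcomes], k=r)
    return sum(x_prime(f(o)) * weights[o] for o in draws) / r


def f(omega):
    # Many-to-one map from the trial space {0..5} onto the target space {0, 1, 2}.
    return omega // 2


def x_prime(omega_prime):
    # Hypothetical random variate on the target space.
    return omega_prime ** 2


# Hypothetical trial distribution Q and target distribution P.
q = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.1, 4: 0.2, 5: 0.1}
p = {0: 0.5, 1: 0.25, 2: 0.25}

exact = sum(p[w] * x_prime(w) for w in p)
estimate = mapped_estimate(x_prime, q, p, f, r=200_000, rng=random.Random(0))
print(f"exact = {exact}, estimate = {estimate:.4f}")
```

Because every point of a fibre receives the same weight, the weights depend on ω only through f(ω). In this toy case the fibre sums are computed by direct enumeration, which is feasible only because the sample space is tiny.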

REFERENCES

1. Aldous D. Probability Approximations via the Poisson Clumping Heuristic. 1st ed. Springer; New York: 1989. MR0969362.
2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Molecular Biology. 1990;215:403–410.
3. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
4. Altschul SF, Bundschuh R, Olsen R, Hwa T. The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res. 2001;29:351–361.
5. Asmussen S. Applied Probability and Queues. Springer; New York: 2003. MR1978607.
6. Arratia R, Waterman MS. A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Probab. 1994;4:200–225. MR1258181.
7. Bundschuh R. Rapid significance estimation in local sequence alignment with gaps. J. Comput. Biology. 2002;9:243–260.
8. Bundschuh R. Asymmetric exclusion process and extremal statistics of random sequences. Phys. Rev. E. 2002;65:031911.
9. Chan HP. Upper bounds and importance sampling of p-values for DNA and protein sequence alignments. Bernoulli. 2003;9:183–199. MR1997026.
10. Cinlar E. Introduction to Stochastic Processes. Prentice Hall; Upper Saddle River, NJ: 1975. MR0380912.
11. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure. National Biomedical Research Foundation; Silver Spring, MD: 1978. pp. 345–352.
12. Dembo A, Karlin S, Zeitouni O. Limit distributions of maximal nonaligned two-sequence segmental score. Ann. Probab. 1994;22:2022–2039. MR1331214.
13. Djellout H, Guillin A. Moderate deviations for Markov chains with atom. Stochastic Process. Appl. 2001;95:203–217. MR1854025.
14. Galambos J. The Asymptotic Theory of Extreme Order Statistics. 1st ed. Wiley and Sons; New York: 1978.
15. Gotoh O. An improved algorithm for matching biological sequences. J. Molecular Biology. 1982;162:705–708.
16. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA. 1992;89:10915–10919.
17. Huber PJ. Robust estimation of a location parameter. Ann. Math. Statist. 1964;35:73–101. MR0161415.
18. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA. 1990;87:2264–2268.
19. Kingman JFC. A convexity property of positive matrices. Quart. J. Math. Oxford. 1961;12:283–284. MR0138632.
20. Liu JS. Monte Carlo Strategies in Scientific Computing. Springer; New York: 2001. MR1842342.
21. Mott R. Local sequence alignments with monotonic gap penalties. Bioinformatics. 1999;15:455–462.
22. Mott R. Accurate formula for p-values of gapped local sequence and profile alignments. J. Molecular Biology. 2000;300:649–659.
23. Olsen R, Bundschuh R, Hwa T. Rapid assessment of extremal statistics for gapped local alignment. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology; AAAI Press, Menlo Park, CA. 1999. pp. 211–222.
24. Park Y, Sheetlin S, Spouge JL. Accelerated convergence and robust asymptotic regression of the Gumbel scale parameter for gapped sequence alignment. J. Phys. A: Mathematical and General. 2005;38:97–108.
25. Seneta E. Non-negative Matrices and Markov Chains. Springer; New York: 1981. MR0719544.
26. Sheetlin S, Park Y, Spouge JL. The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment. Nucleic Acids Res. 2005;33:4987–4994.
27. Siegmund D, Yakir B. Approximate p-values for local sequence alignments. Ann. Statist. 2000;28:657–680. MR1792782.
28. Spouge JL. Path reversal, islands, and the gapped alignment of random sequences. J. Appl. Probab. 2004;41:975–983. MR2122473.
29. Storey JD, Siegmund D. Approximate p-values for local sequence alignments: Numerical studies. J. Comput. Biology. 2001;8:549–556.
30. Waterman MS, Smith TF, Beyer WA. Some biological sequence metrics. Adv. in Math. 1976;20:367–387. MR0408876.
31. Waterman MS, Vingron M. Rapid and accurate estimates of statistical significance for sequence data base searches. Proc. Natl. Acad. Sci. USA. 1994;91:4625–4628.
32. Waterman MS, Vingron M. Sequence comparison significance and Poisson approximation. Statist. Sci. 1994;9:367–381. MR1325433.
33. Yu YK, Altschul SF. The construction of amino acid substitution matrices for the comparison of proteins with nonstandard compositions. Bioinformatics. 2005;21:902–911.
34. Yu YK, Hwa T. Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J. Comput. Biology. 2001;8:249–282.
35. Zhang Y. A limit theorem for matching random sequences allowing deletions. Ann. Appl. Probab. 1995;5:1236–1240. MR1384373.
