

# ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES

## Abstract

The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.

**Keywords:** Gumbel scale parameter estimation, gapped sequence alignment, importance sampling, stopping time, Markov renewal process, Markov additive process

## 1. Introduction

Sequence alignment is an indispensable tool in modern molecular biology. As an example, BLAST [2, 3, 18] (the Basic Local Alignment Search Tool, http://www.ncbi.nlm.nih.gov/BLAST/), a popular sequence alignment program, receives about 2.89 submissions per second over the Internet. Currently, BLAST users can choose among only 5 standard alignment scoring systems, because BLAST *p*-values must be pre-computed with simulations that take about 2 days for the required *p*-value accuracies. Moreover, adjustments for unusual amino acid compositions are essential in protein database searches [33], and in that application, computational speed demands that the corresponding *p*-values be calculated with crude, relatively inaccurate approximations [3]. Accordingly, for more than a decade, much research has been directed at estimating BLAST *p*-values in real time (i.e., in less than 1 sec) [7, 24, 26, 29], so that BLAST might use arbitrary alignment scoring systems.

Several studies have used importance sampling to estimate the BLAST *p*-value [7, 9, 26]. To describe importance sampling briefly, let $\mathbb{E}$ denote the expectation for some “target distribution” $\mathbb{P}$, let $\mathbb{Q}$ be any distribution, and consider the equation

$$\mathbb{E}X={\mathbb{E}}_{\mathbb{Q}}\left[X\frac{d\mathbb{P}}{d\mathbb{Q}}\right].$$

A computer can draw samples *ω _{i}* (*i* = 1, . . . , *r*) from the “trial distribution” $\mathbb{Q}$ to estimate the expectation: $\mathbb{E}X\approx {r}^{-1}{\sum}_{i=1}^{r}X\left({\omega}_{i}\right)[d\mathbb{P}\left({\omega}_{i}\right)\u2215d\mathbb{Q}\left({\omega}_{i}\right)]$. The name “importance sampling” derives from the fact that the subsets of the sample space where *X* is large dominate contributions to $\mathbb{E}X$. By focusing sampling on the “important” subsets, judicious choice of the trial distribution $\mathbb{Q}$ can reduce the effort required to estimate $\mathbb{E}X$. In importance sampling, the likelihood ratio $d\mathbb{P}\left(\omega \right)\u2215d\mathbb{Q}\left(\omega \right)$ is often called the “importance sampling weight” (or simply, the “weight”) of the sample *ω*.
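To make the identity concrete, the toy sketch below (ours, not from the article) estimates the small tail probability $\mathbb{P}\{S\ge 18\}$ for $S\sim \mathrm{Binomial}(20,0.5)$ by sampling from the tilted trial distribution Binomial(20, 0.9) and reweighting each sample by the likelihood ratio; the choice of 0.9 is an arbitrary illustration of a trial distribution that makes the rare event common.

```python
import math
import random

def estimate_tail(n=20, k=18, p=0.5, q=0.9, reps=20000, seed=1):
    """Importance-sampling estimate of P{Binomial(n, p) >= k}.

    Trial distribution Q: Binomial(n, q) with q > p, which makes the
    rare event {S >= k} common; each sample contributes its weight dP/dQ."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        s = sum(rng.random() < q for _ in range(n))
        if s >= k:  # X(omega) = indicator of the event
            # likelihood ratio dP/dQ for a sample with s successes
            total += (p / q) ** s * ((1 - p) / (1 - q)) ** (n - s)
    return total / reps

# exact target value, for comparison
exact = sum(math.comb(20, s) * 0.5 ** 20 for s in range(18, 21))
```

With 20,000 trial samples the relative error is typically well under a few percent, whereas crude Monte Carlo sampling from $\mathbb{P}$ would rarely see the event at all: its probability is about 2 × 10^{–4}.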

A Monte Carlo technique called “sequential importance sampling” can substantially increase the statistical efficiency of importance sampling by generating samples from $\mathbb{Q}$ incrementally and exploiting the information gained during the increments to guide further increments. Although sequences might seem an especially natural domain for sequential sampling, most simulation studies for BLAST *p*-values have used sequences of fixed length. In contrast, our study involves sequences of random length.

Here, as in several other importance sampling studies [7, 9, 26, 34], hidden Markov models generate a trial distribution $\mathbb{Q}$ of random *alignments* between two sequences, where the *sequences* have a target distribution $\mathbb{P}$. The other studies gloss over the fact that their trial and target distributions occur on different sample spaces, such as alignments and sequences. Those studies used sequences of fixed lengths, however, for which a relatively simple formula for the weight $d\mathbb{P}\u2215d\mathbb{Q}$ pertains. For the sequences of random length in this paper, the stopping rules for sequential sampling complicate formulas for $d\mathbb{P}\u2215d\mathbb{Q}$. Accordingly, the Appendix gives a general mapping theorem giving formulas for the weights $d\mathbb{P}\u2215d\mathbb{Q}$ when each sample from $\mathbb{P}$ corresponds to many different samples from $\mathbb{Q}$. (In the present article, e.g., each pair of random sequences corresponds to many possible random alignments.) In addition to the mapping theorem, we also develop several other techniques specifically tailored to speeding the estimation of the BLAST *p*-value.

The organization of this article follows. Section 2 on background and notation is divided into 4 subsections containing: (1) a friendly introduction to sequence alignment and its notation; (2) a brief self-contained description of the algorithm for calculating global alignment scores; (3) a technical summary of previous research on estimating the BLAST *p*-value introducing our importance sampling methods; and (4) a heuristic model for random sequence alignment using Markov additive processes. Section 3 on Methods is also divided into 4 subsections containing: (1) a novel formula for the relevant Gumbel scale parameter *λ*; (2) a Markov chain model for simulating sequence alignments (borrowed directly from a previous study [34], but used here with a stopping time); (3) a dynamic programming algorithm for calculating the importance sampling weights in the presence of a stopping time; and (4) formulas for the simulation errors. Section 4 then gives numerical results for the estimation of *λ* under 5 popular alignment scoring schemes. Finally, Section 5 is our Discussion.

## 2. Background and notation

### 2.1. Sequence alignment and its notation

Let **A** = *A*_{1}*A*_{2} ··· and **B** = *B*_{1}*B*_{2} ··· be two semi-infinite sequences drawn from a finite alphabet $\mathfrak{L}$, for example, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} (the amino acid alphabet) or {A, C, G, T} (the nucleotide alphabet). Let $s:\mathfrak{L}\times \mathfrak{L}\mapsto \mathbb{R}$ denote a “scoring matrix.” In database applications, *s*(*a, b*) quantifies the similarity between *a* and *b*, for example, the so-called “PAM” (point accepted mutation) and “BLOSUM” (block sum) scoring matrices can quantify evolutionary similarity between two amino acids [11, 16].

The alignment graph Γ_{A, B} of the sequence-pair (**A, B**) is a directed, weighted lattice graph in two dimensions, as follows. The vertices *v* of Γ_{A, B} are nonnegative integer points (*i, j*). (Below, “:=” denotes a definition, e.g., the natural numbers are $\mathbb{N}\u2254\{1,2,3,\dots \}$. Throughout the article, *i, j, k, m, n* and *g* are integers.) Three sets of directed edges *e* come out of each vertex *v* = (*i, j*): northward, northeastward and eastward (see Figure 1). One northeastward edge goes into *v* = (*i* + 1, *j* + 1) with weight *s*[*e*] = *s*(*A*_{i+1}, *B*_{j+1}). For each *g* > 0, one eastward edge goes into *v* = (*i* + *g, j*) and one northward edge goes into *v* = (*i, j + g*); both are assigned the same weight *s*[*e*] = –*w _{g}* < 0. The deterministic function $w:\mathbb{N}\mapsto (0,\infty ]$ is called the “gap penalty.” (The value *w _{g}* = ∞ is explicitly permitted.) This article focuses on affine gap penalties *w _{g}* = Δ_{0} + Δ_{1}*g* (Δ_{0}, Δ_{1} ≥ 0), which are typical in BLAST sequence alignments. Together, the scoring matrix *s*(*a, b*) and the gap penalty *w _{g}* constitute the “alignment parameters.”

*Figure 1 (caption fragment):* **A**[1, 10] = TACTAGCGCA *and* **B**[1, 9] = ACGGTAGAT, *drawn from the nucleotide alphabet* {A, C, G, T}. *Figure 1 uses a nucleotide scoring matrix, where s*(*a, b*) = 5 *if a* **...**

A (directed) path *π* = (*v*_{0}, *e*_{1}, *v*_{1}, *e*_{2}, . . . , *e _{k}*, *v _{k}*) in Γ_{A, B} is a finite alternating sequence of vertices and edges that starts and ends with a vertex. For each *i* = 1, 2, . . . , *k*, the directed edge *e _{i}* comes out of vertex *v*_{i–1} and goes into vertex *v _{i}*. We say that the path *π* starts at *v*_{0} and ends at *v _{k}*.

Denote finite subsequences of the sequence **A** by **A**[*i, m*] = *A*_{i}*A*_{i+1} ··· *A _{m}*. Every gapped alignment of the subsequences **A**[*i, m*] and **B**[*j, n*] corresponds to exactly one path that starts at *v*_{0} = (*i* – 1, *j* – 1) and ends at *v _{k}* = (*m, n*) (see Figure 1). The alignment's score is the “path weight” ${S}_{\pi}\u2254{\sum}_{i=1}^{k}s\left[{e}_{i}\right]$.

Define the “global score” *S _{i, j}* := max_{π} *S _{π}*, where the maximum is taken over all paths *π* starting at *v*_{0} = (0, 0) and ending at *v _{k}* = (*i, j*). The paths *π* starting at *v*_{0}, ending at *v _{k}*, and having weight *S _{π}* = *S _{i, j}* are “optimal global paths” and correspond to “optimal global alignments” between **A**[1, *i*] and **B**[1, *j*]. Define the “edge maximum” *M _{n}* := max{max_{0≤i≤n} *S _{i, n}*, max_{0≤j≤n} *S _{n, j}*}, and the “global maximum” *M* := sup_{n≥0} *M _{n}*. (The single subscript in *M _{n}* indicates that the variate corresponds to a square [0, *n*] × [0, *n*], rather than a general rectangle [0, *m*] × [0, *n*].) Define the “strict ascending ladder epochs” (SALEs) in the sequence (*M _{n}*): let *β*(0) := 0 and *β*(*k* + 1) := min{*n* > *β*(*k*) : *M _{n}* > *M*_{β(k)}}, where min $\varnothing \u2254\infty $. We call *M*_{β(k)} the “*k*th SALE score.”
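The SALE definition translates directly into a few lines of code. The helper below (our illustration, not part of the article) scans a realized sequence (*M _{n}*) for strict new maxima and returns the first few ladder epochs:

```python
def sale_epochs(m, k_max):
    """First k_max strict ascending ladder epochs of m = (M_0, M_1, ...):
    epoch n qualifies when M_n strictly exceeds the value at the previous
    ladder epoch (beta(0) = 0, so the initial record is M_0)."""
    epochs, record = [], m[0]
    for n in range(1, len(m)):
        if m[n] > record:  # strict new maximum => ladder epoch
            epochs.append(n)
            record = m[n]
            if len(epochs) == k_max:
                break
    return epochs
```

For example, `sale_epochs([0, -2, 1, 1, 3, 2, 5], 10)` returns `[2, 4, 6]`, with SALE scores 1, 3 and 5; the tie at epoch 3 is not a ladder epoch because the inequality is strict.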

Define also the “local score” ${\stackrel{~}{S}}_{i,j}\u2254{\mathrm{max}}_{\pi}{S}_{\pi}$, where the maximum is taken over all paths *π* ending at *v _{k}* = (*i, j*), regardless of their starting point. Define the “local maximum” ${\stackrel{~}{M}}_{m,n}\u2254{\mathrm{max}}_{0\le i\le m,0\le j\le n}{\stackrel{~}{S}}_{i,j}$. The paths *π* ending at *v _{k}* = (*i, j*) with local score ${S}_{\pi}={\stackrel{~}{S}}_{i,j}={\stackrel{~}{M}}_{m,n}$ are “optimal local paths” corresponding to the “optimal local alignments” between subsequences of **A**[1, *m*] and **B**[1, *n*].

Now, the following “independent letters” model introduces randomness. Choose each letter in the sequence **A** and **B** randomly and independently from the alphabet $\mathfrak{L}$ according to fixed probability distributions $\{{p}_{a}:a\in \mathfrak{L}\}$ and $\{{p}_{b}^{\prime}:b\in \mathfrak{L}\}$. (Although this article permits the distributions {*p _{a}*} and $\left\{{p}_{b}^{\prime}\right\}$ to be different, in applications they are usually the same.) Throughout the paper, the probability and expectation for the independent letters model are denoted by $\mathbb{P}$ and $\mathbb{E}$.

Let Γ = Γ_{A, B} denote the random alignment graph of the sequence-pair (**A, B**). In the appropriate limit, if the alignment parameters are in the so-called “logarithmic phase” [6, 12] (i.e., if the optimal global alignment score of long random sequences is negative), the random local maximum ${\stackrel{~}{M}}_{m,n}$ follows an approximate Gumbel extreme value distribution with “scale parameter” *λ* and “pre-factor” *K* [1, 14],

$$\mathbb{P}\{{\stackrel{~}{M}}_{m,n}\le y\}\approx \mathrm{exp}\left(-Kmn{e}^{-\lambda y}\right).\quad (2.1)$$

### 2.2. The dynamic programming algorithm for global sequence alignment

For affine gaps *w _{g}* = Δ_{0} + Δ_{1}*g*, the global score *S _{i, j}* is calculated with the recursion

$${S}_{i,j}=\mathrm{max}\{{S}_{i-1,j-1}+s\left({A}_{i},{B}_{j}\right),{I}_{i,j},{D}_{i,j}\},\phantom{\rule{1em}{0ex}}{I}_{i,j}=\mathrm{max}\{{S}_{i,j-1}-{\Delta}_{0}-{\Delta}_{1},{I}_{i,j-1}-{\Delta}_{1}\},\quad (2.2)$$

where *D _{i, j}* = max{*S*_{i–1, j} – Δ_{0} – Δ_{1}, *D*_{i–1, j} – Δ_{1}} and boundary conditions *S*_{0, 0} = 0, *I*_{0, 0} = *D*_{0, 0} = –∞, *D*_{g, 0} = *I*_{0, g} = –Δ_{0} – Δ_{1}*g*, *S*_{g, 0} = *S*_{0, g} = *I*_{g, 0} = *D*_{0, g} = –∞ for *g* > 0 [15]. The three array names, *S, I*, and *D*, are mnemonics for “substitution,” “insertion” and “deletion.” If “Δ” denotes a gap character, the corresponding alignment letter-pairs (*a, b*), (Δ, *b*) and (*a*, Δ) correspond to the operations for editing sequence **A** into sequence **B** [30].
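The recursion and boundary conditions above transcribe directly into code. The sketch below is our illustration; the match/mismatch scores and gap costs in the example calls are arbitrary choices, not BLAST defaults.

```python
NEG = float("-inf")

def global_score(A, B, score, d0, d1):
    """Affine-gap global alignment score via the S/I/D recursion:
    a gap of length g costs d0 + d1*g (opening pays d0 + d1, each
    extension pays d1).  S, I, D hold alignments ending in a
    substitution, insertion or deletion, respectively."""
    m, n = len(A), len(B)
    S = [[NEG] * (n + 1) for _ in range(m + 1)]
    I = [[NEG] * (n + 1) for _ in range(m + 1)]
    D = [[NEG] * (n + 1) for _ in range(m + 1)]
    S[0][0] = 0.0
    for g in range(1, m + 1):
        D[g][0] = -d0 - d1 * g          # leading deletions
    for g in range(1, n + 1):
        I[0][g] = -d0 - d1 * g          # leading insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            I[i][j] = max(S[i][j - 1] - d0 - d1, I[i][j - 1] - d1)
            D[i][j] = max(S[i - 1][j] - d0 - d1, D[i - 1][j] - d1)
            S[i][j] = max(S[i - 1][j - 1] + score(A[i - 1], B[j - 1]),
                          I[i][j], D[i][j])
    return S[m][n]

# a simple nucleotide scoring matrix: +5 match, -4 mismatch
match = lambda a, b: 5 if a == b else -4
```

For instance, with Δ_{0} = 3 and Δ_{1} = 1, aligning ACGT against itself scores 4 × 5 = 20, while aligning ACGT against AGT scores 5 + 5 + 5 – (3 + 1) = 11 (one deletion of length 1).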

### 2.3. Previous methods for estimating the BLAST p-value

If *w _{g}* ≡ ∞ identically, so northward and eastward (gap) edges are disallowed in an optimal alignment path, a rigorous proof of (2.1) yields analytic formulas for the Gumbel parameters *λ* and *K* [12]. For gapped local alignment, rigorous results are sparse, although some approximate analytical studies are extant [21, 22, 27, 29]. The prevailing approach therefore estimates *λ* and *K* from simulations [4, 31]. Because *λ* is an exponential rate, it dominates *K*'s contribution to the BLAST *p*-value. Most studies (including the present one) have therefore focused on *λ*. (Note, however, some recent progress on the real-time estimation of *K* [26].) Typically, current applications require a 1–4% relative error in *λ*; 10–20%, in *K* [4]. The characteristics of the relevant sequence database determine the actual accuracies required, however, making approximations with controlled error and of arbitrary accuracy extremely desirable in practice.

Storey and Siegmund [29] approximate *λ* (with neither controlled errors nor arbitrary accuracy) as

where ${\sum}_{(a,b)}{p}_{a}{p}_{b}^{\prime}\mathrm{exp}\left[{\lambda}^{\ast}s(a,b)\right]=1$ [so *λ** is the so-called “ungapped lambda,” for Δ(*g*) ≡ ∞] and ${\mu}^{\ast}\u2254{\sum}_{(a,b)}s(a,b){p}_{a}{p}_{b}^{\prime}\phantom{\rule{thinmathspace}{0ex}}\mathrm{exp}\left[{\lambda}^{\ast}s(a,b)\right]$. In (2.3), Λ is an upper bound for an infinite sequence of constants defined in terms of gap lengths in a random alignment.

Many other studies have used local alignment simulations to estimate BLAST *p*-values, for example, Chan [9] used importance sampling and a mixture distribution. Some rigorous results [28] are also extant for the so-called “island method” [31, 32], which yields maximum likelihood estimates of *λ* and *K* from a Poisson process associated with local alignments exceeding a threshold score [4, 23].

Large deviations arguments [6, 35] support the common belief that global alignment can estimate *λ* for local alignment through the equation $\lambda =-{\mathrm{lim}}_{y\to \infty}{y}^{-1}\phantom{\rule{thinmathspace}{0ex}}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}\mathbb{P}\{M\ge y\}$. For a fixed error, global alignment typically requires less computational effort than local alignment. For example, one early study [34] used importance sampling based on trial distributions $\mathbb{Q}$ from a hidden Markov model.

The study demonstrated that the global alignment equation $\mathbb{E}\left[\mathrm{exp}\left(\lambda {S}_{n,n}\right)\right]=1$ estimated *λ* with only *O*(*n*^{–1}) error [7]. (Recall that “$\mathbb{E}$” denotes the expectation corresponding to the random letters model.) The equation $\mathbb{E}\left[\mathrm{exp}\left(\lambda {M}_{m}\right)\right]=\mathbb{E}\left[\mathrm{exp}\left(\lambda {M}_{n}\right)\right]\phantom{\rule{thickmathspace}{0ex}}(m\ne n)$, suggested by heuristic modeling with Markov additive processes (MAPs) [5, 10], improved the error substantially, to *O*(*ε ^{n}*) [24].

The next subsection gives the relevant parts of the MAP heuristic, whose renewal structure can improve the efficiency of importance sampling even further.

### 2.4. The Markov additive process heuristic

The rigorous theory of MAPs appears elsewhere [5, 10]. Because the MAP heuristics given below parallel a previous publication [24], we present only informal essentials.

Consider a finite Markov-chain state-space $\mathfrak{J}$, containing $\#\mathfrak{J}$ elements. Without loss of generality, $\mathfrak{J}=\{1,\dots ,\#\mathfrak{J}\}$. Until further notice, all vectors are row vectors of dimension $\#\mathfrak{J}$; all matrices, of dimension $\left(\#\mathfrak{J}\right)\times \left(\#\mathfrak{J}\right)$. A MAP can be defined in terms of a time-homogeneous Markov chain (MC) $({J}_{n}\in \mathfrak{J}:n=0,1,\dots )$ and a $\left(\#\mathfrak{J}\right)\times \left(\#\mathfrak{J}\right)$ matrix of real random variates ||*Z _{i, j}*||. Let the MC have transition matrix $\mathbf{P}=\Vert {p}_{i,j}\Vert ,\phantom{\rule{thickmathspace}{0ex}}\mathrm{so}\phantom{\rule{thickmathspace}{0ex}}{p}_{i,j}=\mathbb{P}({J}_{n}=j\mid {J}_{n-1}=i)$. Let the stationary distribution of the MC be **π**, assumed strictly positive and satisfying both **π** = **πP** and **π1**^{t} = 1, where **1**^{t} denotes the $\left(\#\mathfrak{J}\right)\times 1$ column vector whose elements are all 1.

As usual, let ${\mathbb{P}}_{\gamma}$ and ${\mathbb{E}}_{\gamma}$ be the probability measure and expectation corresponding to an initial state *J*_{0} with distribution **γ**; ${\mathbb{P}}_{i}$ and ${\mathbb{E}}_{i}$, to an initial state *J*_{0} = *i*; and ${\mathbb{P}}_{\pi}$ and ${\mathbb{E}}_{\pi}$, to an initial state in the equilibrium distribution **π**.

Run the MC (*J _{n}*), and take its succession of states as given. Consider the following sequence $({Y}_{n}\in \mathbb{R}:n=0,1,\dots )$ of random variates. Define *Y*_{0} := 0. For *n* = 1, 2, . . . , let the (*Y _{n}*) be conditionally independent, with distributions determined by the transition *J*_{n–1} → *J _{n}* of the Markov chain as follows. If *J*_{n–1} = *i* and *J _{n}* = *j*, the value of *Y _{n}* is chosen randomly from the distribution of *Z _{i, j}*. (Thus, if *J*_{m–1} = *J*_{n–1} = *i* and *J _{m}* = *J _{n}* = *j*, *Y _{m}* and *Y _{n}* share the distribution of *Z _{i, j}*, although independence permits randomness to give them different values.)

The random variates of central interest are the sums ${T}_{n}={\sum}_{m=0}^{n}{Y}_{m}$ (*n* = 0, 1, . . .) and the maximum *M* := max_{n≥0} *T _{n}*. To exclude trivial distributions for *M* (i.e., *M* = 0 a.s. and *M* = ∞ a.s.), make two assumptions: (1) ${\mathbb{E}}_{\pi}{Y}_{1}<0$; and (2) there is some *m* and state *i* such that

$${\mathbb{P}}_{i}\left\{{T}_{m}>0,{J}_{m}=i\right\}>0.\quad (2.4)$$

Consider the sequence (*T _{n}*), its SALEs *β*(0) := 0 and *β*(*k* + 1) := min{*n* > *β*(*k*) : *T _{n}* > *T*_{β(k)}}, and its SALE scores *T*_{β(k)}. For brevity, let *β* := *β*(1). Note that *M* = *T*_{β(k)} for some *k* ∈ {0, 1, . . .}. In a MAP, (*J*_{β(k)}, *T*_{β(k)}) forms a defective Markov renewal process.

Now, define the matrix ${\mathbf{L}}_{\theta}\u2254\Vert {\mathbb{E}}_{i}[\mathrm{exp}\left(\theta {T}_{\beta}\right);{J}_{\beta}=j,\beta <\infty ]\Vert $. The Perron–Frobenius theorem [5], page 25, shows that **L**_{θ} has a strictly dominant eigenvalue *ρ*(*θ*) > 0 [i.e., *ρ*(*θ*) is the unique eigenvalue of greatest absolute value]. Moreover, *ρ*(*θ*) is a convex function [19], and because **L**_{0} is substochastic, *ρ*(0) < 1. The two assumptions above (2.4) ensure that *M* := max_{n≥0} *T _{n}* has a nontrivial distribution and that *ρ*(*λ*) = 1 for some unique *λ* > 0.

The notation intentionally suggests a heuristic analogy between MAPs and global alignment. Identify the Markov chain states *J _{n}* in the MAP with the rectangle [0, *n*] × [0, *n*] of Γ_{A, B}, and identify the sum *T _{n}* in the MAP with the edge maximum *M _{n}* in global alignment. In the following, therefore, the identification leads to *M _{n}* replacing *T _{n}* in the MAP formulas. In particular, the MAP heuristic identifies the Gumbel scale parameter in (2.1) with the root *λ* > 0 of the equation *ρ*(*λ*) = 1. Although the heuristic analogy between MAPs and global alignment is in no way precise or rigorous, it has produced useful results [24].

The details of why the MAP heuristic works so well are presently obscure, although some additional motivation appears in an heuristic calculation related to *λ* [8]. The calculation takes the limit of nested successively wider semi-infinite strips, each strip having constant width and propagating itself northeastward in the alignment graph Γ_{A, B}. The successive northeast boundaries of the propagation are states in an ergodic MC. MAPs therefore might rigorously justify the heuristic calculation.
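To make the role of *ρ*(*θ*) concrete, the sketch below builds a hypothetical positive 2 × 2 matrix function **L**_{θ} (a toy of ours, not derived from any alignment model), computes *ρ*(*θ*) by power iteration, and bisects for the root of *ρ*(*λ*) = 1. The bisection relies on the properties quoted above: *ρ*(0) < 1, convexity of *ρ*, and *ρ*(*θ*) → ∞.

```python
import math

def rho(theta, L):
    """Dominant (Perron) eigenvalue of the positive matrix L(theta),
    approximated by power iteration with max-normalization."""
    a = L(theta)
    x = [1.0] * len(a)
    ev = 1.0
    for _ in range(200):
        y = [sum(row[j] * x[j] for j in range(len(x))) for row in a]
        ev = max(y)                 # Collatz-Wielandt estimate
        x = [v / ev for v in y]
    return ev

# hypothetical 2x2 "ladder transform": entries grow without bound in theta,
# and rho(0) < 1 (the matrix at theta = 0 is substochastic)
L = lambda t: [[0.30 * math.exp(t),     0.20 * math.exp(-t)],
               [0.10 * math.exp(2 * t), 0.25 * math.exp(-0.5 * t)]]

def solve_lambda(L, lo=0.0, hi=5.0, iters=80):
    """Bisect for the unique lambda > 0 with rho(lambda) = 1, assuming
    rho(lo) < 1 < rho(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rho(mid, L) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For this particular toy matrix the root lies between 1.0 and 1.2; convexity of *ρ* together with *ρ*(0) < 1 guarantees the root is unique on (0, ∞).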

## 3. Methods

### 3.1. A novel equation for λ

From the definition of **L**_{θ} in a MAP, if the Markov chain {*J _{n}*} starts in a state *J*_{0} with distribution **γ** (with *M _{n}* replacing *T _{n}* in the MAP formulas), matrix algebra applied to the concatenation of SALEs in a MAP yields

$${\mathbb{E}}_{\gamma}[\mathrm{exp}\left(\theta {M}_{\beta \left(k\right)}\right);\beta \left(k\right)<\infty ]=\gamma {\mathbf{L}}_{\theta}^{k}{\mathbf{1}}^{t}.\quad (3.1)$$

For a MAP, equation (3.1) is exact; but for global alignment, it has no literal meaning. Equation (3.1) has some consequences for the limit *k* → ∞, and we speculate that the consequences hold, even for global alignment. [Note: although the sequence (*β*(*k*)) is a.s. finite, the limits *k* → ∞ below involve no contradiction or approximation, because they are not a.s. limits.]

Define ${K}_{k}\left(\theta \right)\u2254\mathrm{ln}\left\{{\mathbb{E}}_{\gamma}[\mathrm{exp}\left(\theta {M}_{\beta \left(k\right)}\right);\beta \left(k\right)<\infty ]\right\}$. In (3.1), a spectral (eigenvalue) decomposition of the matrix **L**_{θ} [25] shows that

$${K}_{k}\left(\theta \right)=k\phantom{\rule{thinmathspace}{0ex}}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}\rho \left(\theta \right)+{c}_{0}+O\left({\epsilon}^{k}\right),\quad (3.2)$$

where 0 ≤ *ε* < 1 is determined by the magnitude of the subdominant eigenvalue of **L**_{θ}, and *c*_{0} is a constant independent of *θ* and *k*.

For *k*′ – *k* > 0 fixed, we can accelerate the convergence in (3.2) as *k* → ∞ by differencing:

$${K}_{{k}^{\prime}}\left(\theta \right)-{K}_{k}\left(\theta \right)=\left({k}^{\prime}-k\right)\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}\rho \left(\theta \right)+O\left({\epsilon}^{k}\right).\quad (3.3)$$

Let *λ _{k′, k}* denote the root of (3.3) after dropping the error term *O*(*ε ^{k}*). Because *ρ*(*λ*) = 1, Taylor approximation around *λ* yields ln{*ρ*(*λ _{k′, k}*)} ≈ *ρ*′(*λ*)(*λ _{k′, k}* – *λ*), so (3.3) becomes

$$\left({k}^{\prime}-k\right){\rho}^{\prime}\left(\lambda \right)\left({\lambda}_{{k}^{\prime},k}-\lambda \right)=O\left({\epsilon}^{k}\right),\quad (3.4)$$

that is, with *k*′ – *k* fixed, *λ _{k′, k}* converges geometrically to *λ* as the SALE index *k* → ∞.

The initial state **γ** of global alignment has a deterministic distribution, namely the origin (0, 0). Equation (3.3) for *θ* = *λ* therefore becomes

$$\mathbb{E}[\mathrm{exp}\left(\lambda {M}_{\beta \left({k}^{\prime}\right)}\right);\beta \left({k}^{\prime}\right)<\infty ]=\mathbb{E}[\mathrm{exp}\left(\lambda {M}_{\beta \left(k\right)}\right);\beta \left(k\right)<\infty ]\quad (3.5)$$

after dropping the geometric error *O*(*ε ^{k}*). Let ${\widehat{\lambda}}_{{k}^{\prime},k}$ be the root of (3.5).
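As a sanity check of this estimating-equation idea outside the alignment setting, the sequence (*M _{n}*) can be replaced by the running maximum of a simple random walk with steps +1 (probability *p*) and –1 (probability *q* = 1 – *p*), *p* < *q*, for which *λ* = ln(*q*/*p*) exactly and every SALE score satisfies *M*_{β(k)} = *k*. The sketch below (our toy; *p* = 0.3, the horizon, and the repetition count are arbitrary) simulates the walk and bisects for the root of the empirical analogue of (3.5) with *k* = 1, *k*′ = 2.

```python
import math
import random

def ladder_count(rng, p, horizon, k_max=2):
    """Count SALEs (strict new maxima of the running sum) for a walk with
    steps +1 (prob p) and -1 (prob 1 - p), truncated at `horizon` steps;
    truncation is essentially harmless because the drift is negative."""
    pos, record, count = 0, 0, 0
    for _ in range(horizon):
        pos += 1 if rng.random() < p else -1
        if pos > record:
            record, count = pos, count + 1
            if count == k_max:
                break
    return count

def lambda_hat(p=0.3, reps=20000, horizon=150, seed=7):
    """Root of the empirical analogue of (3.5) with k = 1, k' = 2.
    Here M_{beta(k)} = k, so the equation reads
    p2*exp(2*lam) = p1*exp(lam), where pk estimates P{beta(k) < inf}."""
    rng = random.Random(seed)
    counts = [ladder_count(rng, p, horizon) for _ in range(reps)]
    p1 = sum(c >= 1 for c in counts) / reps
    p2 = sum(c >= 2 for c in counts) / reps
    f = lambda lam: p2 * math.exp(2 * lam) - p1 * math.exp(lam)
    lo, hi = 0.1, 3.0
    for _ in range(60):   # bisect: f < 0 below the root, f > 0 above it
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)
```

For *p* = 0.3 the estimate should fall close to the exact value ln(0.7/0.3) ≈ 0.847, since $\mathbb{P}\{\beta(k)<\infty\}=(p/q)^k$ for this walk.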

### 3.2. The trial distribution for importance sampling

In (3.5), crude Monte Carlo simulation generating random sequence-pairs with the independent letters model $\mathbb{P}$ is inefficient for the following reason. When practical alignment scoring systems are used, $\mathbb{P}\{\beta \left(k\right)<\infty \}<1$ for *k* ≥ 1. For example, with the BLAST defaults (scoring matrix BLOSUM62, gap penalty *w _{g}* = 11 + *g*, and Robinson–Robinson letter frequencies), $\mathbb{P}\{\beta \left(4\right)<\infty \}\approx 0.047$, so only about 1 in 20 crude Monte Carlo simulations generate a fourth ladder point. Empirically in our importance sampling, however, Gumbel parameter estimation seemed most efficient when the stopping time corresponded to *β*(4) (see below).

Importance sampling requires a trial distribution to determine ${\widehat{\lambda}}_{{k}^{\prime},k}$ from (3.5). By editing one sequence into another, a Markov chain model borrowed directly from a previous study [34] generates random sequence alignments, as follows.

Consider a Markov state space consisting of the set of alignment letter-pairs ${\stackrel{\u2012}{\mathfrak{L}}}^{2}$, where $\stackrel{\u2012}{\mathfrak{L}}\u2254\mathfrak{L}\cup \left\{\Delta \right\}$, “Δ” being a character representing gaps. The ordered pair (Δ, Δ) has probability 0, so a succession of Markov states corresponds to a global sequence alignment (see Figure 1), that is, to a path in the alignment graph Γ_{A, B}. Ordered pairs other than (Δ, Δ) fall into three sets, corresponding to edit operations following (2.2): $S\u2254\mathfrak{L}\times \mathfrak{L}$ [substitution, a bioinformatics term implicitly including identical letter-pairs (*a, a*)], $I\u2254\left\{\Delta \right\}\times \mathfrak{L}$ (insertion); and $D\u2254\mathfrak{L}\times \left\{\Delta \right\}$ (deletion). The sets *S, I* and *D* form “atoms” of the MC [13], page 203, as follows. (By definition, each atom of a MC is a set of all states with identical outgoing transition probabilities.)

From the set *S*, the transition probability to (*a, b*) is *t _{S, S}q _{a, b}*; to (Δ, *b*), ${t}_{S,I}\phantom{\rule{thinmathspace}{0ex}}{p}_{b}^{\prime}$; and to (*a*, Δ), *t _{S, D}p _{a}*. From the set *I*, the transition probability to (*a, b*) is *t _{I, S}q _{a, b}*; to (Δ, *b*), ${t}_{I,I}\phantom{\rule{thinmathspace}{0ex}}{p}_{b}^{\prime}$; and to (*a*, Δ), *t _{I, D}p _{a}*. From the set *D*, the transition probability to (*a, b*) is *t _{D, S}q _{a, b}*; to (Δ, *b*), ${t}_{D,I}\phantom{\rule{thinmathspace}{0ex}}{p}_{b}^{\prime}$; and to (*a*, Δ), *t _{D, D}p _{a}*. Transition probabilities sum to 1, so the following restrictions apply: ${\sum}_{a,b\in \mathfrak{L}}{q}_{a,b}=1$, ${\sum}_{b\in \mathfrak{L}}{p}_{b}^{\prime}=1$, ${\sum}_{a\in \mathfrak{L}}{p}_{a}=1$, *t _{S, S}* + *t _{S, I}* + *t _{S, D}* = 1 (transit from the substitution atom), *t _{D, D}* + *t _{D, S}* + *t _{D, I}* = 1 (transit from the deletion atom) and *t _{I, I}* + *t _{I, S}* + *t _{I, D}* = 1 (transit from the insertion atom). Usually in practice, the term *t _{I, D}* = 0, to disallow insertions following a deletion. Our formulas retain the term, to exploit the resulting symmetry later.

_{I, D}In the terminology of hidden Markov models, *S, I, D* are hidden Markov states. *t _{i, j}* for

*i, j*{

*S, I, D*} are transition probabilities and

*q*, ${p}_{b}^{\prime}$,

_{a, b}*p*for $a,b\in \mathfrak{L}$ are emission probabilities from the state

_{a}*S, I, D*, respectively.

As described elsewhere [34], numerical values for the Markov probabilities can be determined from the scores *s*(*a, b*) and the gap penalty *w _{g}*. Note that the values are selected for statistical efficiency, although many other values also yield unbiased estimates for *λ* in the appropriate limit.
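A minimal sampler for such a trial distribution is sketched below. The transition and emission values are arbitrary placeholders of ours (uniform emissions over a nucleotide alphabet), not the efficiency-tuned values of [34]:

```python
import random

LETTERS = "ACGT"
# placeholder transition probabilities (each row sums to 1);
# t[x][y] is the probability of moving from atom x to atom y
t = {"S": {"S": 0.8, "I": 0.1, "D": 0.1},
     "I": {"S": 0.7, "I": 0.3, "D": 0.0},   # t_{I,D} = 0, as usual in practice
     "D": {"S": 0.7, "I": 0.0, "D": 0.3}}

def sample_alignment(n_steps, t, rng):
    """Draw n_steps aligned letter-pairs from the Markov chain whose atoms
    are S (substitution), I (insertion) and D (deletion), started at S;
    emissions here are uniform over LETTERS, and '-' marks a gap.
    The pair (-, -) can never occur."""
    pairs, state = [], "S"
    atoms = ("S", "I", "D")
    for _ in range(n_steps):
        state = rng.choices(atoms, weights=[t[state][a] for a in atoms])[0]
        if state == "S":
            pairs.append((rng.choice(LETTERS), rng.choice(LETTERS)))
        elif state == "I":
            pairs.append(("-", rng.choice(LETTERS)))
        else:
            pairs.append((rng.choice(LETTERS), "-"))
    return pairs
```

Reading off the non-gap letters of the first and second components of the pairs recovers the two sequences, and the pairs themselves form the path in Γ_{A, B}.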

### 3.3. Importance sampling weights and stopping times

To establish notation, and to make connections to the Appendix and its mapping theorem, note that the MC above can be supported on a probability space $(\Omega ,\mathsf{F},\mathbb{Q})$, where each *ω* = (*π*, **A, B**) ∈ Ω is an ordered triple. Here, *π* is an infinite path starting at the origin in the alignment graph Γ_{A, B}; F is the σ-field generated by cylinder sets in Ω (here, cylinder sets essentially consist of some finite path and the corresponding pair of subsequences); and $\mathbb{Q}$ is the MC probability distribution described above, started at the atom *S*, with expectation operator ${\mathbb{E}}_{\mathbb{Q}}$.

Let *N* be any stopping time for the sequence (*M _{n}* : *n* = 0, 1, . . .) of edge maxima for Γ_{A, B} (i.e., the sequence {*M*_{0}, . . . , *M _{n}*} determines whether *N* ≤ *n* or not). Because *M _{n}* is determined by (**A**[1, *n*], **B**[1, *n*]), *N* is also a stopping time for the sequence {(**A**[1, *n*], **B**[1, *n*]) : *n* = 0, 1, . . .}. The stopping time of main interest here is *N* = *β*(*k*), the *k*th ladder index of (*M _{n}*), where *k* ≥ 1 is arbitrary. (As further motivation for the mapping theorem in the Appendix, other stopping times of possible interest include, for example, *N* = *n*, a fixed epoch [7], and *N* = *β*(*K _{y}*), where *β*(*K _{y}*) = inf{*n* : *M _{n}* ≥ *y*} is the index of the first ladder-score outside the interval (0, *y*).)

To use the mapping theorem, introduce the probability space $({\Omega}^{\u2033},{\mathsf{F}}^{\u2033},\mathbb{P})$, where each *ω*″ = (**A, B**) ∈ Ω″ is an ordered pair. Here, **A** and **B** are sequences, F″ is the σ-field generated by all cylinder sets in Ω″ (i.e., sets corresponding to pairs of finite subsequences) and $\mathbb{P}\left({A}^{\u2033}\right)={\prod}_{k=1}^{i}{p}_{{A}_{k}}{\prod}_{k=1}^{j}{p}_{{B}_{k}}^{\prime}$, if the cylinder set *A*″ corresponds to the subsequence pair (**A**[1, *i*], **B**[1, *j*]). Given *N*, the theory of stopping times [5], page 414, can be used to construct a discrete probability space $({\Omega}^{\prime},{\mathsf{F}}^{\prime},\mathbb{P})$, where each event *ω*′ ∈ Ω′ is a finite-sequence pair *ω*′ = (**A**[1, *N*], **B**[1, *N*]), F′ is the set of all subsets of Ω′ and $\mathbb{P}\left({\omega}^{\prime}\right)={\prod}_{k=1}^{N\left({\omega}^{\prime}\right)}{p}_{{A}_{k}}{\prod}_{k=1}^{N\left({\omega}^{\prime}\right)}{p}_{{B}_{k}}^{\prime}$.

Let I_{m, n} := {(*i, j*) : *i* = *m, j* ≥ *n*} and D_{m, n} := {(*i, j*) : *i* ≥ *m, j* = *n*}. Define the function $f:\omega \mapsto {\omega}^{\prime}$, where *ω* = (*π*, **A, B**) and ${\omega}^{\prime}=({\omega}_{\mathbf{A}}^{\prime},{\omega}_{\mathbf{B}}^{\prime})\u2254(\mathbf{A}[1,N],\mathbf{B}[1,N])$. Then, *ω* ∈ *f*^{–1}(*ω*′), if and only if the path *π* hits the set ${\mathsf{I}}_{N,N}\cup {\mathsf{D}}_{N,N}$ at (*i, j*), so that $\mathbf{A}[1,N]={\omega}_{\mathbf{A}}^{\prime}$ and $\mathbf{B}[1,N]={\omega}_{\mathbf{B}}^{\prime}$ (see Figure 2).

*Figure 2 (caption fragment): (. . . surrounded by double lines*) *generated an SALE. The SALEs determine the stopping time N =* *β* **...**

Empirically, our simulations satisfied $\mathbb{Q}\{\beta \left(k\right)<\infty \}=1$, and we speculate that our application therefore satisfies the hypothesis $\mathbb{Q}H=1$ of the Appendix. According to the Appendix, the reciprocal importance sampling weight $1\u2215W\left(\omega \right)={\sum}_{{\omega}_{0}\in {f}^{-1}\left\{f\left(\omega \right)\right\}}\mathbb{Q}\left({\omega}_{0}\right)\u2215\mathbb{P}f\left(\omega \right)$ depends on the sum over all possible Markov chain realizations *ω*_{0} *f*^{–1}(*ω*′). Dynamic programming computes the sum efficiently, as follows.

Let the “transition” *T* represent any element of {*S, I, D*} [substitution (*a _{i}, b _{j}*), insertion (Δ, *b _{j}*), or deletion (*a _{i}*, Δ)]. Fix any particular pair (**A, B**) of infinite sequences, which fixes *N* = *β*(*k*). To set up a recursion for dynamic programming, consider the following set of events ${\mathsf{E}}_{i,j}^{T}$, defined for *T* ∈ {*S, I, D*} and min{*i, j*} ≤ *N*, and illustrated in Figure 2. Let ${\mathsf{E}}_{i,j}^{T}$ be the event consisting of all *ω* yielding a path *π* whose final transition is *T* and which corresponds to the subsequences: (1) **A**[1, *i*] and **B**[1, *j*] for 0 ≤ *i, j* ≤ *N*; (2) **A**[1, *i*] and **B**[1, *N*] for 0 ≤ *N* = *j* ≤ *i*; and (3) **A**[1, *N*] and **B**[1, *j*] for 0 ≤ *N* = *i* ≤ *j*. Define ${Q}_{i,j}^{T}\u2254\mathbb{Q}\left({\mathsf{E}}_{i,j}^{T}\right)$ and ${Q}_{i,j}\u2254{Q}_{i,j}^{S}+{Q}_{i,j}^{I}+{Q}_{i,j}^{D}$. (Note: in the following, *T* ∈ {*S, I, D*} is always a superscript, never an exponent.)

For brevity, let ${\stackrel{~}{q}}_{i,j}={q}_{{A}_{i},{B}_{j}}$ for $0\le i,j\le N$; ${\stackrel{~}{q}}_{i,j}={\sum}_{(a\in \mathfrak{L})}{q}_{a,{B}_{j}}$ for $0\le j\le N<i$; ${\stackrel{~}{q}}_{i,j}={\sum}_{(b\in \mathfrak{L})}{q}_{{A}_{i},b}$ for 0 ≤ *i* ≤ *N* < *j*; and ${\stackrel{~}{q}}_{i,j}=1$ otherwise. Let ${\stackrel{~}{p}}_{j}^{\prime}={p}_{{B}_{j}}^{\prime}$ for 0 ≤ *j* ≤ *N*; and 1 otherwise. Finally, let ${\stackrel{~}{p}}_{i}={p}_{{A}_{i}}$ for 0 ≤ *i* ≤ *N*; and 1 otherwise. Because every path into the vertex (*i, j*) comes from one of three vertices, each corresponding to a different transition *T* ∈ {*S, I, D*},

$${Q}_{i,j}^{S}={\stackrel{~}{q}}_{i,j}\left({t}_{S,S}{Q}_{i-1,j-1}^{S}+{t}_{I,S}{Q}_{i-1,j-1}^{I}+{t}_{D,S}{Q}_{i-1,j-1}^{D}\right),\phantom{\rule{1em}{0ex}}{Q}_{i,j}^{I}={\stackrel{~}{p}}_{j}^{\prime}\left({t}_{S,I}{Q}_{i,j-1}^{S}+{t}_{I,I}{Q}_{i,j-1}^{I}+{t}_{D,I}{Q}_{i,j-1}^{D}\right),\phantom{\rule{1em}{0ex}}{Q}_{i,j}^{D}={\stackrel{~}{p}}_{i}\left({t}_{S,D}{Q}_{i-1,j}^{S}+{t}_{I,D}{Q}_{i-1,j}^{I}+{t}_{D,D}{Q}_{i-1,j}^{D}\right)$$

with boundary conditions ${Q}_{0,0}^{S}=1$, ${Q}_{0,0}^{I}={Q}_{0,0}^{D}=0$, ${Q}_{g,0}^{S}={Q}_{0,g}^{S}={Q}_{g,0}^{I}={Q}_{0,g}^{D}=0$, ${Q}_{0,g}^{I}={p}_{{B}_{1}}^{\prime}\cdots {p}_{{B}_{g}}^{\prime}{t}_{S,I}{\left({t}_{I,I}\right)}^{g-1}$ and ${Q}_{g,0}^{D}={p}_{{A}_{1}}\cdots {p}_{{A}_{g}}{t}_{S,D}\times {\left({t}_{D,D}\right)}^{g-1}(g>0)$.
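For the interior cells 0 ≤ *i, j* ≤ *N* (where q̃, p̃ and p̃′ reduce to the plain emission probabilities), the recursion for ${Q}_{i,j}^{T}$ can be sketched and checked against brute-force enumeration of all lattice paths. The alphabet, sequences and probabilities below are placeholder choices of ours:

```python
LETTERS = ("A", "C")
A, B = "AC", "CA"                       # fixed prefixes; interior N = 2
# placeholder parameters: uniform emissions, arbitrary transitions
q  = {(a, b): 0.25 for a in LETTERS for b in LETTERS}   # joint, sums to 1
pa = {a: 0.5 for a in LETTERS}
pb = {b: 0.5 for b in LETTERS}
t  = {"S": {"S": 0.8, "I": 0.1, "D": 0.1},
      "I": {"S": 0.7, "I": 0.2, "D": 0.1},
      "D": {"S": 0.7, "I": 0.1, "D": 0.2}}

def dp_Q(N):
    """Interior cells (0 <= i, j <= N) of the recursion for
    Q^T_{i,j} = Q(E^T_{i,j}); out-of-range terms count as 0, which
    reproduces the stated boundary conditions automatically."""
    Q = {T: [[0.0] * (N + 1) for _ in range(N + 1)] for T in "SID"}
    Q["S"][0][0] = 1.0                  # the chain starts at the atom S
    for i in range(N + 1):
        for j in range(N + 1):
            if i == j == 0:
                continue
            if i > 0 and j > 0:
                Q["S"][i][j] = q[(A[i-1], B[j-1])] * sum(
                    t[T]["S"] * Q[T][i-1][j-1] for T in "SID")
            if j > 0:
                Q["I"][i][j] = pb[B[j-1]] * sum(
                    t[T]["I"] * Q[T][i][j-1] for T in "SID")
            if i > 0:
                Q["D"][i][j] = pa[A[i-1]] * sum(
                    t[T]["D"] * Q[T][i-1][j] for T in "SID")
    return Q

def brute_Q(i, j):
    """Check: enumerate every lattice path to (i, j), multiply transition
    and emission probabilities, and sum, grouped by the final transition."""
    def paths(i, j):
        if i == 0 and j == 0:
            yield ()
        if i > 0 and j > 0:
            for p in paths(i - 1, j - 1):
                yield p + ("S",)
        if j > 0:
            for p in paths(i, j - 1):
                yield p + ("I",)
        if i > 0:
            for p in paths(i - 1, j):
                yield p + ("D",)
    out = {"S": 0.0, "I": 0.0, "D": 0.0}
    for path in paths(i, j):
        prob, state, x, y = 1.0, "S", 0, 0
        for mv in path:
            prob *= t[state][mv]
            if mv == "S":
                x, y = x + 1, y + 1
                prob *= q[(A[x-1], B[y-1])]
            elif mv == "I":
                y += 1
                prob *= pb[B[y-1]]
            else:
                x += 1
                prob *= pa[A[x-1]]
            state = mv
        if path:
            out[path[-1]] += prob
    return out
```

The dynamic program visits each cell once, whereas the number of lattice paths grows exponentially, which is the point of the recursion.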

Recall that *ω* = (*π*, **A, B**) ∈ *f*^{–1}(*ω*′), if and only if the path *π* hits the set ${\mathsf{I}}_{N,N}\cup {\mathsf{D}}_{N,N}$ at (*i, j*), so that $\mathbf{A}[1,N]={\omega}_{\mathbf{A}}^{\prime}$ and $\mathbf{B}[1,N]={\omega}_{\mathbf{B}}^{\prime}$. Thus,

To turn (3.6) into a recursion for importance sampling weights, define ${P}_{i}≔{p}_{{A}_{1}}\cdots {p}_{{A}_{\mathrm{min}\{i,N\}}}={\stackrel{~}{p}}_{1}\cdots {\stackrel{~}{p}}_{i}$ and ${P}_{j}^{\prime}≔{p}_{{B}_{1}}^{\prime}\cdots {p}_{{B}_{\mathrm{min}\{j,N\}}}^{\prime}={\stackrel{~}{p}}_{1}^{\prime}\cdots {\stackrel{~}{p}}_{j}^{\prime}$, and let ${W}_{i,j}^{T}≔{Q}_{i,j}^{T}∕\left({P}_{i}{P}_{j}^{\prime}\right)\;(T\in \{S,I,D\})$. Let ${r}_{i,j}={\stackrel{~}{q}}_{i,j}∕\left({\stackrel{~}{p}}_{i}{\stackrel{~}{p}}_{j}^{\prime}\right)$. For future reference, define *r*_{•, j} := *r*_{i, j} for 0 ≤ *j* ≤ *N* < *i* and *r*_{i, •} := *r*_{i, j} for 0 ≤ *i* ≤ *N* < *j*. Note that *r*_{•, j} is independent of *i*, and *r*_{i, •} is independent of *j*. Equation (3.6) yields

with boundary conditions ${W}_{0,0}^{S}=1$, ${W}_{0,0}^{I}={W}_{0,0}^{D}=0$, ${W}_{g,0}^{S}={W}_{0,g}^{S}={W}_{g,0}^{I}={W}_{0,g}^{D}=0$, ${W}_{0,g}^{I}={t}_{S,I}{\left({t}_{I,I}\right)}^{g-1}$ and ${W}_{g,0}^{D}={t}_{S,D}{\left({t}_{D,D}\right)}^{g-1}$ (*g* > 0). Because of (3.7), the importance sampling weight *W* := *W*(*ω*) satisfies

Because *r*_{i, j} = *r*_{•, j} (0 ≤ *j* ≤ *N* < *i*) and *r*_{i, j} = *r*_{i, •} (0 ≤ *i* ≤ *N* < *j*), only a finite number of recursions are needed to compute the infinite sums in (3.9), as follows. For *T* ∈ {*S, I, D*}, define ${\stackrel{~}{U}}_{i}^{T}≔{U}_{i,N}^{T}$, where ${U}_{m,n}^{T}≔{\sum}_{j=n}^{\infty}{W}_{m,j}^{T}$. Likewise, define ${\stackrel{~}{V}}_{j}^{T}≔{V}_{N,j}^{T}$, where ${V}_{m,n}^{T}≔{\sum}_{i=m}^{\infty}{W}_{i,n}^{T}$. Equation (3.9) becomes

Note that ${U}_{i,j-1}^{T}-{U}_{i,j}^{T}={W}_{i,j-1}^{T}$. To determine ${\stackrel{~}{U}}_{N}^{T}$, summation of (3.8) for 0 ≤ *i* ≤ *N* < *j* yields

Elimination of ${U}_{i,j}^{T}$ for *j* = *N* + 1 and *i* = 1, . . . , *N* in the first two equations yields

that is,

with initial values ${\stackrel{~}{U}}_{0}^{S}={\stackrel{~}{U}}_{0}^{D}=0$ and ${\stackrel{~}{U}}_{0}^{I}={(1-{t}_{I,I})}^{-1}{W}_{0,N}^{I}={(1-{t}_{I,I})}^{-1}{t}_{S,I}{\left({t}_{I,I}\right)}^{N-1}$. Compute (3.13) recursively for *i* = 1, . . . , *N*. Similarly, reflect through *i* = *j* to derive

with initial values ${\stackrel{~}{V}}_{0}^{S}={\stackrel{~}{V}}_{0}^{I}=0$ and ${\stackrel{~}{V}}_{0}^{D}={(1-{t}_{D,D})}^{-1}{W}_{N,0}^{D}={(1-{t}_{D,D})}^{-1}{t}_{S,D}{\left({t}_{D,D}\right)}^{N-1}$. Iterate (3.14) for *j* = 1, . . . , *N*. Substitute the results for ${\stackrel{~}{U}}_{N}^{S}$, ${\stackrel{~}{U}}_{N}^{D}$, ${\stackrel{~}{V}}_{N}^{S}$, and ${\stackrel{~}{V}}_{N}^{I}$ into (3.10) to compute *W*.
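The initial values above are just geometric tail sums of the boundary weights, which can be checked numerically. A small sketch, with arbitrary toy values for *t*_{S,I}, *t*_{I,I}, and *N*:

```python
# Toy check of the geometric tail behind the initial value
# U~^I_0 = (1 - t_II)^{-1} t_SI (t_II)^{N-1}: summing the boundary
# weights W^I_{0,j} = t_SI * t_II**(j-1) over j >= N gives the same number.
t_SI, t_II, N = 0.1, 0.3, 5
closed  = t_SI * t_II ** (N - 1) / (1.0 - t_II)
partial = sum(t_SI * t_II ** (j - 1) for j in range(N, 200))
assert abs(closed - partial) < 1e-12
```

This is why only finitely many recursions suffice: beyond the *N* × *N* square, the weights decay geometrically with known ratio, so the infinite tails collapse to closed forms.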

### 3.4. Error estimates for ${\widehat{\lambda}}_{{k}^{\prime},k}$

Denote the indicator of an event *A* by $\mathbb{I}A$, that is, $\mathbb{I}A=1$ if *A* occurs and 0 otherwise. For a realization *ω* in the simulation, define

and let ${h}_{k,{k}^{\prime}}^{\prime}$ be its derivative with respect to *θ*.

Given samples *ω*_{i} (*i* = 1, . . . , *r*) from the trial distribution $\mathbb{Q}$, let *W*_{i} = *W*(*ω*_{i}) denote the corresponding importance sampling weights. Because ${\widehat{\lambda}}_{{k}^{\prime},k}$ is the M-estimator [17] of the root *λ*_{k′, k} of $\mathbb{E}{h}_{k,{k}^{\prime}}\left({\lambda}_{{k}^{\prime},k}\right)=0$, as *r* → ∞, $\sqrt{r}({\widehat{\lambda}}_{{k}^{\prime},k}-{\lambda}_{{k}^{\prime},k})$ converges in distribution to the normal distribution with mean 0 and variance [17]
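The M-estimator machinery can be illustrated with a hedged toy example, not the paper's ${h}_{k,{k}^{\prime}}$: take the estimating function *h*(*λ*; *x*) = e^{λx} − 1 for a two-point score *X* = +1 with probability *p* and −1 otherwise, whose nonzero root is ln((1 − *p*)/*p*), and compute the usual sandwich variance *B*/*A*² for $\sqrt{r}(\widehat{\lambda}-\lambda)$, where *A* = E *h*′(*λ*) and *B* = E *h*(*λ*)².

```python
import math

# Toy M-estimation sketch (illustration only): solve E[h(lambda)] = 0
# with h(lambda; x) = exp(lambda * x) - 1 by bisection, then compute
# the sandwich variance B / A**2.  Expectations are taken exactly here.
p = 0.3
xs, ws = [1.0, -1.0], [p, 1.0 - p]       # support points and probabilities

def h_bar(lam):
    return sum(w * (math.exp(lam * x) - 1.0) for x, w in zip(xs, ws))

lo, hi = 0.0, 10.0                       # brackets the positive root
for _ in range(200):                     # bisection: h < 0 below the root
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if h_bar(mid) < 0 else (lo, mid)
lam = 0.5 * (lo + hi)
assert abs(lam - math.log((1 - p) / p)) < 1e-9

A = sum(w * x * math.exp(lam * x) for x, w in zip(xs, ws))        # E[h'(lam)]
B = sum(w * (math.exp(lam * x) - 1.0) ** 2 for x, w in zip(xs, ws))  # E[h(lam)^2]
var = B / A ** 2                         # asymptotic variance of sqrt(r)*(est - root)
assert var > 0
```

In the paper's setting the expectation is estimated with importance sampling weights *W*_{i} in place of the exact probabilities above, but the root-finding and sandwich-variance structure is the same.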

## 4. Numerical study for Gumbel scale parameter

Table 1 gives our “best estimate” $\bar{\lambda}$ of the Gumbel scale parameter *λ* from (3.5) for each of the 5 options BLASTP gives users for the alignment scoring scheme. For every scheme, estimates $\widehat{\lambda}$ derived from the first to fourth SALEs indicated that $\widehat{\lambda}$ is generally biased above the true value *λ*, but that $\widehat{\lambda}$ converged adequately by the fourth SALE. The best estimate $\bar{\lambda}$ (shown in Table 1) is the average of 200 independent estimates $\widehat{\lambda}$, each computed within 1 sec from sequence-pairs simulated up to their fourth SALE. For BLOSUM 62 and gap penalty *w*_{g} = 11 + *g*, the average computation produced 1441 sequence-pairs up to their fourth SALE within 1 second. (For results relevant to the other publicly available scoring schemes, see Table 1.) The best estimates $\bar{\lambda}$ derived from (3.5) were within the error of the BLASTP values for *λ*.


Despite having the variance formula in (3.16) in hand, we elected to estimate the standard error ${\widehat{s}}_{\lambda}$ directly from the 200 independent estimates $\widehat{\lambda}$. Figure 3 plots the relative error ${\widehat{s}}_{\lambda}∕\bar{\lambda}$ of each individual $\widehat{\lambda}$ against the computation time, where ${\widehat{s}}_{\lambda}$ is the standard error of $\widehat{\lambda}$. It shows that for all 5 BLASTP online options, (3.5) easily computed $\widehat{\lambda}$ to 1–4% accuracy within about 0.5 seconds.
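Estimating a standard error directly from independent replicates amounts to taking the spread of the replicate estimates. A toy sketch, in which the center 0.267 and spread 0.005 are hypothetical illustration values, not the paper's numbers:

```python
import random, statistics

# Toy sketch: estimate the standard error s_hat of an individual
# lambda_hat from 200 independent replicates, then report the relative
# error s_hat / lambda_bar.  All numeric values here are made up.
random.seed(0)
estimates  = [0.267 + random.gauss(0.0, 0.005) for _ in range(200)]
lambda_bar = statistics.mean(estimates)      # the averaged "best estimate"
s_hat      = statistics.stdev(estimates)     # standard error of one estimate
rel_error  = s_hat / lambda_bar
assert 0.0 < rel_error < 0.1                 # a few percent relative error
```

The replicate-based estimate sidesteps evaluating (3.16) while still quantifying the accuracy of a single sub-second run.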

## 5. Discussion

This article indicates that the scale parameter *λ* of the Gumbel distribution for local alignment of random sequences satisfies, at least approximately, (3.5), an equation involving the strict ascending ladder-points (SALEs) from global alignment. For standard protein scoring systems, in fact, simulation error could account for most (if not all) of the observed differences between values of *λ* calculated from (3.5) and values calculated from extensive crude Monte Carlo simulations. (The values of *λ* from crude simulation have a standard error of about ±1%.) In SALE simulations, (3.5) estimated *λ* to 1–4% accuracy within 0.5 seconds, as required by BLAST database searches over the Web. The present study did not tune simulations much; it relied instead on methods specific to sequence alignment to improve estimation. Many general strategies for sequential importance sampling therefore remain available to speed simulation. Preliminary investigations estimating the other Gumbel parameter (the pre-factor *K*) with SALEs are encouraging, so online estimation of the entire Gumbel distribution for arbitrary scoring schemes appears imminent, and preliminary computer code is already in place.

## Acknowledgments

The authors Y. Park and S. Sheetlin contributed equally to the article. All authors would like to acknowledge helpful discussion with Dr. Nak-Kyeong Kim. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.

## APPENDIX: A GENERAL MAPPING THEOREM FOR IMPORTANCE SAMPLING

The following theorem describes an unusual type of Rao-Blackwellization [20]. Consider two probability spaces $(\Omega ,\mathsf{F},\mathbb{Q})$ and $({\Omega}^{\prime},{\mathsf{F}}^{\prime},\mathbb{P})$, and an F/F′-measurable function $f:\Omega \mapsto {\Omega}^{\prime}$ (i.e., ${f}^{-1}{\mathsf{F}}^{\prime}\subseteq \mathsf{F}$). Note: *f* is explicitly permitted to be many-to-one. Let $\mathbb{P}<<\mathbb{Q}{f}^{-1}$ on some set *H*′ (i.e., $\mathbb{Q}{f}^{-1}{G}^{\prime}=0\Rightarrow \mathbb{P}{G}^{\prime}=0$ for any set ${G}^{\prime}\subseteq {H}^{\prime}$), so the Radon–Nikodym derivative in the second line of (A.1) below exists. Let *H* := *f*^{–1}*H*′, so that (A.1) holds for every random variate *X*′ on (Ω′, F′). Consider the application of (A.1) to importance sampling with target distribution $\mathbb{P}$ and trial distribution $\mathbb{Q}$. Assume $\mathbb{Q}H=1$, so *H* supports $\mathbb{Q}$. In our application to global alignment, $H=[\beta \left(k\right)=\infty ]\subset \Omega $ (“⊂” denoting strict inclusion), but we speculate $\mathbb{Q}H=1$.

In Monte Carlo applications, a discrete sample space *H* is usually available. Accordingly, the following theorem replaces the integrals in (A.1) by sums.

**The mapping theorem for importance sampling. ***Let*

*Under the above conditions, r*^{–1} ${\sum}_{i=1}^{r}\left[{X}^{\prime}f\left({\omega}_{i}\right)W\left({\omega}_{i}\right)\right]\to \mathbb{E}[{X}^{\prime};{H}^{\prime}]$ *with probability* 1 *and in mean* (*with respect to* $\mathbb{Q}$), *as the number of realizations r* → ∞.

The mapping theorem is an easy application of the law of large numbers to (A.1).
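A discrete toy instance makes the theorem tangible (all distributions below are hypothetical, chosen only for illustration): the trial $\mathbb{Q}$ lives on a six-point Ω, the many-to-one map *f* takes parity, and the target $\mathbb{P}$ lives on Ω′ = {0, 1}. Averaging *X*′(*f*(*ω*))*W*(*ω*) under $\mathbb{Q}$ recovers the target expectation exactly:

```python
from fractions import Fraction as F

# Toy instance of the mapping theorem: E_Q[ X'(f(w)) W(w) ] = E_P[X'],
# with W = dP/d(Q f^{-1}) composed with f.  All numbers are made up.
Omega = range(6)
Q = {w: F(1, 6) for w in Omega}            # trial distribution on Omega
f = lambda w: w % 2                        # many-to-one map onto {0, 1}
P = {0: F(1, 4), 1: F(3, 4)}               # target distribution on Omega'

Qf = {wp: sum(Q[w] for w in Omega if f(w) == wp) for wp in P}   # Q f^{-1}
W  = lambda w: P[f(w)] / Qf[f(w)]          # Radon-Nikodym weight at f(w)

Xp  = {0: F(10), 1: F(2)}                  # a random variate X' on Omega'
lhs = sum(Q[w] * Xp[f(w)] * W(w) for w in Omega)   # E_Q[X'(f) W]
rhs = sum(P[wp] * Xp[wp] for wp in P)              # E_P[X']
assert lhs == rhs == F(4)
```

With exact rational arithmetic the identity holds with equality; in Monte Carlo use, the law of large numbers replaces the exact average over Ω with an average over sampled realizations.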

## REFERENCES

- *p*-values for DNA and protein sequence alignments. Bernoulli. 2003;9:183–199. MR1997026.
- *p*-values of gapped local sequence and profile alignments. J. Molecular Biology. 2000;300:649–659. [PubMed]
- *p*-values for local sequence alignments. Ann. Statist. 2000;28:657–680. MR1792782.
