# The estimation of statistical parameters for local alignment score distributions

^{1}Department of Physics, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0319, USA

^{a}To whom correspondence should be addressed. Tel: +1 301 496 2475; Fax: +1 301 480 9241; Email: vog.hin.mln.ibcn@luhcstla

## Abstract

The distribution of optimal local alignment scores of random sequences plays a vital role in evaluating the statistical significance of sequence alignments. These scores can be well described by an extreme-value distribution. The distribution’s parameters depend upon the scoring system employed and the random letter frequencies; in general they cannot be derived analytically, but must be estimated by curve fitting. For obtaining accurate parameter estimates, a form of the recently described ‘island’ method has several advantages. We describe this method in detail, and use it to investigate the functional dependence of these parameters on finite-length edge effects.

## INTRODUCTION

Local sequence alignment is perhaps the most widely used tool in computational molecular biology, with most protein and DNA database search programs (1–4) implementing heuristic versions of local alignment algorithms (5,6). These algorithms seek the highest-scoring alignment of segments from the two sequences being compared. An alignment’s score is calculated by adding substitution scores, defined for each aligned pair of letters, and gap scores for each run of letters in one segment aligned with null characters inserted into the other.

A key question is what alignment scores may be expected to occur
purely by chance. This question is generally addressed by analyzing
the distribution of optimal alignment scores from random or real
but unrelated sequences. We confine attention to random sequences,
defined as strings of independent letters chosen with fixed background
probabilities, because they are easier to control and study. Depending
upon the details of the alignment scoring system and the background
letter probabilities, the optimal score for the alignment of two
random sequences of length *n* tends to grow proportionally
either to *n* or to log(*n*) (7–10).
The linear scoring regime corresponds to optimal alignments that
tend to involve virtually the entire sequences; the logarithmic
regime, with substitution and gap scores that are on average more
negative, corresponds to optimal alignments that are relatively
short. Many alignments representing true biological relationships
involve only segments of the sequences compared, but these will
tend to be outscored by long ‘random alignments’ when
a scoring system in the linear regime is employed. Therefore, attention
has focused primarily on scoring systems in the logarithmic regime,
and we deal here exclusively with such scores.

In the asymptotic limit of long sequences, optimal local alignment
scores follow an extreme-value distribution (11), described
by two parameters λ and *K*.
For the type of scoring system in most general use, these parameters
cannot be calculated but must instead be estimated by random simulation.
Most directly, one may generate optimal alignment scores for a large number
of random sequence pairs, and fit an extreme-value distribution
to these scores. Recently, an alternative approach has been described;
it uses scores for local alignment ‘islands’ generated
by a slight modification of the Smith–Waterman algorithm
(12). We will discuss in detail
the implementation and application of the island parameter estimation
method, and compare it to the direct method in several ways. The
island method has a number of useful features. (i) It renders explicit
a tradeoff between parameter estimate bias and stochastic error, and
allows this tradeoff to be easily controlled. (ii) It estimates accurately
the tail behavior of score distributions for small-length comparisons.
(iii) It allows parameter estimates to be obtained for arbitrary
length sequence comparisons, including the infinite-length limit.
In some circumstances, the first two of these features can be transferred
advantageously to the direct method, appropriately modified. For
asymptotic parameter estimation, however, the island method has
a clear speed advantage.

## THE DIRECT ESTIMATION OF STATISTICAL PARAMETERS

An asymptotic theory for local alignment scores has been developed
for the case in which no gaps are permitted. In brief, for the comparison
of random sequences of sufficient lengths *m* and *n*, the number of distinct local alignments with
score at least *x* is approximately Poisson distributed,
with mean

$$E(x) \approx K m n\, e^{-\lambda x}, \qquad (1)$$

where λ and *K* are
easily calculated parameters (13,14). This implies that the optimal alignment
score *S*′ approximately follows
an extreme-value distribution (11),
with

$$\mathrm{Prob}(S' \ge x) \approx 1 - \exp\left(-K m n\, e^{-\lambda x}\right). \qquad (2)$$
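As a minimal sketch of how equations **1** and **2** are used in practice, the following Python function evaluates both for a given score. The function name and the parameter values in the usage line are illustrative assumptions, not values taken from the text.

```python
import math

def evalue_pvalue(x, m, n, lam, K):
    """Equation 1 (Poisson mean E) and equation 2 (P = 1 - exp(-E))."""
    e = K * m * n * math.exp(-lam * x)
    p = -math.expm1(-e)  # numerically stable form of 1 - exp(-e) for small e
    return e, p

# Hypothetical parameter values, for illustration only.
e, p = evalue_pvalue(x=60, m=300, n=300, lam=0.27, K=0.04)
```

For small *E* the two quantities nearly coincide, which is why *E*-values and *P*-values are often used interchangeably for significant alignments.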

For local alignments that allow gaps, no asymptotic score distribution
has been established analytically. However, computational experiments
strongly suggest that equations **1** and **2** apply
to this type of alignment as well (12,15–22).
The key to using equations **1** and **2** is
the accurate estimation of the statistical parameters λ and *K*. Perhaps the most direct approach to estimating
these parameters for a fixed scoring system and set of background
letter frequencies is to generate a large number of pairs of random
sequences of equal length *n*, and find the optimal
local alignment score for each pair. From these scores one may calculate
maximum-likelihood estimates $\hat{\lambda}$ and $\hat{K}$ for the statistical parameters in equation **2** (23). If *R* scores are generated, the ratio $\hat{\lambda}$/λ is approximately normally distributed, with mean 1 and standard error 0.78/√*R* (23). Note that the estimates developed by Lawless (23) assume continuous data, whereas alignment scores are almost always discrete. If the scale parameter λ times the lattice spacing of possible scores is small, the error introduced by assuming continuous scores is minor. One may, however, derive maximum-likelihood estimates of λ and *K* that explicitly assume discrete scores (Appendix).
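The direct method can be sketched as a standard maximum-likelihood fit of the extreme-value distribution of equation **2**, treating scores as continuous. The code below is an illustration under that assumption, not the implementation used for the results reported here; the function name `fit_direct` and the bisection solver are choices made for this sketch.

```python
import math

def fit_direct(scores, m, n):
    """Continuous-score ML fit of equation 2 to optimal alignment scores.

    Solves 1/lam = mean(x) - sum(x*w)/sum(w), with w = exp(-lam*x), by
    bisection, then sets K = R / (m * n * sum(exp(-lam*x))).
    """
    R = len(scores)
    x0 = min(scores)  # shift scores to avoid underflow in exp()
    if max(scores) == x0:
        raise ValueError("scores must not all be identical")
    mean = sum(scores) / R

    def weights(lam):
        return [math.exp(-lam * (x - x0)) for x in scores]

    def f(lam):  # strictly decreasing in lam, so bisection applies
        w = weights(lam)
        return 1.0 / lam - mean + sum(x * wi for x, wi in zip(scores, w)) / sum(w)

    lo, hi = 1e-9, 1.0
    while f(hi) > 0:  # expand the bracket until the root is enclosed
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    K = R / (m * n * math.exp(-lam * x0) * sum(weights(lam)))
    return lam, K
```

With *R* scores, the relative standard error of the returned λ estimate should be close to 0.78/√*R*, as stated above.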

Because λ enters equations **1** and **2** exponentially, accurate estimates of λ are
particularly important. Marginally significant alignments from current
database searches typically have a scaled score λ*x* > 25, for which even a 4% error
in λ leads to an estimated *E*-value in
error by greater than a factor of 2.7. Thus, standard errors of <2%,
or even 1%, in $\hat{\lambda}$ may be desirable.
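The factor quoted above follows directly from equation **1**. As a short worked check, write ε for the relative error in the estimate of λ (a symbol introduced here only for illustration):

$$\frac{\hat{E}(x)}{E(x)} = \frac{Kmn\,e^{-\lambda(1-\varepsilon)x}}{Kmn\,e^{-\lambda x}} = e^{\varepsilon\lambda x}, \qquad \varepsilon = 0.04,\ \lambda x = 25 \;\Rightarrow\; e^{1} \approx 2.72 .$$

Similarly, for the direct method a standard error of 1% requires 0.78/√*R* = 0.01, that is, *R* ≈ 6100 optimal scores.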

## THE ISLAND METHOD

Recently, Olsen *et al*. (12)
proposed the island method for estimating λ and *K*;
it is a variant of ideas introduced by Waterman and Vingron (18,19)
that translates into a very efficient algorithm. Rather than finding
optimal alignment scores for pairs of random sequences, they propose
generating scores for each island (as defined below) in a path graph.
To generate sufficiently many scores for accurate parameter estimation,
a single large or multiple smaller pairwise comparisons may be used.

Briefly, the Smith–Waterman algorithm generates a score for
each cell *C* in a path graph, corresponding to the
highest-scoring local alignment ending at *C *(5). This local alignment starts at a specific
anchoring cell, and an island consists of all cells with identical
anchors (Fig. 1). The score assigned to
an island is the maximum score of the cells it contains. A simple modification
of the Smith–Waterman algorithm, involving only a fixed
amount of extra computation per cell, allows one to record which
island each cell belongs to, and to keep track of each island’s
score. Note also that as one moves row by row through a path graph
with *n* columns, there can be at most *O*(*n*) islands represented on any given row. This allows
one to tabulate all island scores generated by an *m* × *n* path graph in *O*(*mn*) time, and using only *O*(*n*) space.
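The bookkeeping described above can be sketched in Python as follows. This is a simplified illustration, not the implementation used for the simulations: it assumes a linear gap penalty (each gapped position costs `gap`) rather than affine gaps, breaks ties in favor of the diagonal move, and keeps every island's peak score in a dictionary instead of retiring finished islands to stay within *O*(*n*) space. The function name and signature are hypothetical.

```python
def island_scores(a, b, score, gap):
    """Peak score of each island in the path graph of sequences a and b.

    score(x, y) is the substitution score; gap > 0 is the cost per gapped
    position. Islands are identified by their 1-based anchor cell (i, j).
    """
    n = len(b)
    prev_h = [0] * (n + 1)       # Smith-Waterman scores of the previous row
    prev_id = [None] * (n + 1)   # island (anchor) of each cell, previous row
    peaks = {}                   # anchor -> maximum score on that island
    for i, x in enumerate(a, start=1):
        cur_h = [0] * (n + 1)
        cur_id = [None] * (n + 1)
        for j, y in enumerate(b, start=1):
            # Diagonal move: extend the predecessor's island, or start a new
            # island anchored here if the predecessor has score 0.
            best = prev_h[j - 1] + score(x, y)
            best_id = prev_id[j - 1] if prev_h[j - 1] > 0 else (i, j)
            # Gap moves inherit the island of the cell they extend.
            if prev_h[j] - gap > best:
                best, best_id = prev_h[j] - gap, prev_id[j]
            if cur_h[j - 1] - gap > best:
                best, best_id = cur_h[j - 1] - gap, cur_id[j - 1]
            if best > 0:
                cur_h[j], cur_id[j] = best, best_id
                if best > peaks.get(best_id, 0):
                    peaks[best_id] = best
        prev_h, prev_id = cur_h, cur_id
    return peaks
```

Only two rows of scores and island labels are held at any time, which mirrors the row-by-row argument above; the extra work per cell is a constant number of comparisons.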

**Figure 1.** (**a**) Schematic representation of the path graph. In every cell *C* the red line recalls the choice made by the optimization procedure of the Smith–Waterman algorithm. By these lines, all the cells with ...

Island scores correspond to distinct locally optimal alignments, and
thus the number of islands with score at least *x* should
be well described by equation **1** when *x* is
sufficiently large. The island method generates maximum-likelihood
estimates of λ and *K* from equation **1**,
while the direct method generates these estimates from equation **2**.

The concept of two or more local alignments being distinct is a subtle one, and a variety of definitions have been proposed (6,12,24,25). The differences among these definitions are relevant more for the comparison of real than random sequences. Because using any reasonable definition of distinct alignments should yield equivalent statistical results, the advantage of the ‘island’ (12) over the ‘declumping’ definition (18,19,24,25) for parameter estimation is its algorithmic efficiency.

In general, equation **1** becomes increasingly accurate for larger values of *x*, so to obtain a good estimate for λ one should confine attention to islands whose score attains at least some threshold value *c*. Assume the set $I_c$ of such islands has cardinality $R_c$, and let $\bar{\Delta}_c$ be the mean score in excess of *c* of these islands:

$$\bar{\Delta}_c = \frac{1}{R_c}\sum_{i \in I_c}\left[S(i) - c\right], \qquad (3)$$

where *S*(*i*) is the score of island *i*. Then, assuming island scores are integral, with unit lattice spacing, the maximum-likelihood estimate (Appendix) for λ is

$$\hat{\lambda}_c = \ln\left(1 + \frac{1}{\bar{\Delta}_c}\right). \qquad (4)$$

The standard error of $\hat{\lambda}_c$/λ is

$$\frac{2\sinh(\lambda/2)}{\lambda\sqrt{R_c}} \approx \frac{1 + \lambda^2/24}{\sqrt{R_c}}, \qquad (5)$$

where the approximation holds to better than 0.05% for λ < 1. If the island scores were continuous, the maximum-likelihood estimate $\hat{\lambda}_c$ would instead be simply 1/$\bar{\Delta}_c$, and the standard error of $\hat{\lambda}_c$/λ would be 1/√$R_c$.
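As a hedged sketch of the estimate just described, the function below computes the mean excess over the threshold, the discrete-score maximum-likelihood estimate of λ (the logarithm of one plus the reciprocal of the mean excess), and an approximate relative standard error. The function name is hypothetical, and the standard error is evaluated at the estimate itself because the true λ is unknown.

```python
import math

def estimate_lambda(scores, c):
    """ML estimate of lambda from integer island scores that reach cutoff c."""
    excess = [s - c for s in scores if s >= c]
    r_c = len(excess)                       # number of islands with score >= c
    mean_excess = sum(excess) / r_c         # must be > 0 for a finite estimate
    lam_hat = math.log(1.0 + 1.0 / mean_excess)
    # Approximate relative standard error, using lam_hat in place of lambda.
    rel_se = (1.0 + lam_hat ** 2 / 24.0) / math.sqrt(r_c)
    return lam_hat, rel_se, r_c
```

The estimate is undefined when no island exceeds the cutoff (mean excess of zero), so in practice *c* must be low enough to leave a reasonable number of islands.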

In conjunction with $\hat{\lambda}_c$, the maximum-likelihood estimate for *K* is

$$\hat{K}_c = \frac{R_c}{A}\, e^{\hat{\lambda}_c c}, \qquad (6)$$

where *A* is the aggregate ‘area’ of
the search space from which the collection of islands were drawn.
If a single pair of sequences, of lengths *m* and *n*, were compared to generate the islands, then *A* = *mn*; if *B* such
comparisons were performed, then *A* = *Bmn*.
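A minimal sketch of the *K* estimate, obtained by setting the observed number of islands with score at least *c* equal to the mean of equation **1** over the aggregate area *A*. The function names are assumptions made for illustration.

```python
import math

def search_area(m, n, comparisons=1):
    """Aggregate area A = B * m * n for B comparisons of an m x n path graph."""
    return comparisons * m * n

def estimate_K(r_c, lam_hat, c, area):
    """Solve r_c = K * A * exp(-lambda * c) for K, given the lambda estimate."""
    return r_c * math.exp(lam_hat * c) / area
```

Because the estimate of λ enters through an exponential, any error in it is magnified in the estimate of *K*; the two estimates are therefore strongly correlated.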

The parameters λ and *K* of equations **1** and **2** properly apply only in
the limit of infinite-length sequences. If one uses either the island
or direct method to estimate λ for sequences of finite length,
one obtains estimates with an observable finite-length bias. As
will be discussed below, this bias can be explained in terms of ‘edge
effects’, for which a simple correction can be applied
to the lengths *m* and *n* in equations **1** and **2**. The resulting formulas
retain the asymptotic values of λ and *K*,
so it is desirable to avoid any finite-length bias in the estimation of
these parameters. We note here that, by eliminating edge effects,
the island method can estimate asymptotic values of λ and *K* directly. This is done by embedding a length *n* × *n* sequence
comparison within a larger (*n *+ 2*b*) × (*n *+ 2*b*) comparison, with a border of length *b* on
each side (Fig. 2). Only islands anchored
within the central *n* × *n* region are recorded. When *b* is
sufficiently large, edge effects are essentially abolished.
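The border construction amounts to a simple filter on island anchors. The sketch below assumes 1-based anchor coordinates (*i*, *j*) in the enlarged path graph; the function name is hypothetical.

```python
def anchored_in_center(anchor, n, b):
    """True if anchor (i, j) lies in the central n x n region of an
    (n + 2b) x (n + 2b) path graph with a border of width b on each side."""
    i, j = anchor
    return b < i <= b + n and b < j <= b + n
```

When borders are used, the area *A* entering the estimate of *K* is that of the central region only, since only islands anchored there are counted.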

## THE TRADEOFF OF SPEED, BIAS AND PRECISION

Because of λ’s exponential role in equations **1** and **2**, accurate estimates for λ are
far more important than those for *K*, and we shall
therefore focus on the estimation of λ. A key question
for applying the island method effectively is how to choose an appropriate
threshold parameter *c* for use in equation **4**.

While we believe that the qualitative features presented here are
truly independent of the scoring system used, we will illustrate
below the issues involved in choosing *c* using a specific
example. To obtain extremely accurate parameter estimates for this
case study, we performed a massive random simulation for a particular
local alignment scoring system. Specifically, we used a set of standard
amino acid frequencies for proteins (26)
to generate over 92 000 pairs of length-7000 ‘random amino
acid sequences’. We compared each pair using the BLOSUM-62
amino acid substitution matrix (27),
in conjunction with affine gap scores (28–31) of –(11 + *k*)
for gaps of length *k*. To suppress edge effects,
scores were tabulated only for islands anchored within the central
5000 × 5000 square of each pairwise
comparison; approximately 10^{12} total island scores were
recorded. Using equations **3**–**6**,
estimates of λ and *K* were obtained from
these data for a range of cutoff scores *c*; the
results are summarized in Table 1 and the values of $\hat{\lambda}$ are plotted in Figure 3.

**Figure 3.** Estimates of λ as a function of the cutoff score *c*. Standard errors for the estimates are shown with error bars. The plotted horizontal line indicates the best estimate of the asymptotic λ. Details of the simulation are given ...

While the estimates $\hat{\lambda}$ of Table 1 should be essentially free of edge-effect bias, there is another systematic and easily understood bias (12) evident for small values of *c*. Optimal local alignments with low score are unlikely to contain a gap, as will be discussed further below, and for low thresholds $\hat{\lambda}$ is therefore biased towards the higher λ applicable to local alignments that exclude gaps. In this example, $\hat{\lambda}$ falls monotonically for *c* ≥ 20, until it reaches the value 0.2670 at *c* = 37; thereafter, $\hat{\lambda}$ appears to fluctuate randomly about this value. Of course a yet larger simulation, yielding smaller stochastic errors, might detect systematic bias even beyond *c* = 37.

There is a tension between the bias of $\hat{\lambda}$ and its precision, for the larger the value of *c* chosen, the fewer the islands that attain score *c*, and the larger the standard error of $\hat{\lambda}$. To illustrate the point, consider a realistically sized random simulation, 10 000 times smaller than that shown in Table 1, which would require ~2 min on a modern workstation. The systematic bias in the $\hat{\lambda}$ from such a simulation should be the same as seen in Table 1, but the standard errors will be 100 times larger. Table 2 shows the resulting tradeoff between bias and precision. The best tradeoff probably occurs near *c* = 28, where the sum (~2.2%) of the bias and the standard error is minimized. As the size of the random simulation grows, the bias at a given cutoff remains fixed, whereas the standard error decreases. Thus in general the optimal tradeoff for larger simulations will tend to occur at higher values of *c*.

For a given simulation one may estimate well the standard error at any given *c*, but not the bias; if one could estimate bias, one could correct for it. The analysis of a relatively small simulation given in Table 2 is possible only because a much larger simulation has in fact been performed. In practice, one must choose the *c* at which to estimate λ without knowing with any certainty how much bias it entails. We have investigated automatic procedures for choosing *c*, and found several reasonable methods, but none for which an argument of optimality can be advanced. In outline, $\hat{\lambda}$ decreases systematically for increasing *c*, until its increasing standard error obscures any further change. It is at this point that the cutoff *c* should be chosen.
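One way to turn this outline into a procedure is sketched below. It is only a hypothetical heuristic written for illustration, with no claim of optimality and no claim that it matches any of the procedures investigated for this work: it picks the smallest cutoff whose estimate is not contradicted, at the level of `z` standard errors, by the estimate at any higher cutoff.

```python
def choose_cutoff(estimates, z=2.0):
    """Pick the smallest cutoff that no higher cutoff contradicts.

    `estimates` maps cutoff c -> (lam_hat, abs_se). A higher cutoff c2
    contradicts c if lam_hat(c) exceeds lam_hat(c2) by more than z standard
    errors of the (noisier) estimate at c2.
    """
    cutoffs = sorted(estimates)
    for k, c in enumerate(cutoffs):
        lam_c = estimates[c][0]
        if all(lam_c - estimates[c2][0] <= z * estimates[c2][1]
               for c2 in cutoffs[k + 1:]):
            return c
    raise ValueError("no estimates supplied")
```

Larger values of `z` favor lower cutoffs (more data, more bias); smaller values favor higher cutoffs.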

## EDGE EFFECTS AND THEIR CORRECTION

Independently of the type of bias in estimating λ described above, $\hat{\lambda}$ varies substantially as a function of *m* and *n* when λ is estimated from traditional borderless (i.e. *b* = 0) *m* × *n* sequence comparisons (20). One may therefore argue that one’s estimates of λ and *K* should depend upon the lengths of the real sequences to which they will be applied (22). We here take the alternative view that the length-dependence of $\hat{\lambda}$ is merely an artifact of finite-length sequence comparison edge effects, and that a correction for these effects is best applied to *m* and *n* in equations **1** and **2** rather than to λ and *K*.

The central idea of the ‘edge effect’ correction
is that high-scoring local alignments from the comparison of two
random sequences have an expected length *l*(*x*),
dependent upon their score *x*, and therefore cannot
begin arbitrarily close to the end of either sequence. Accordingly,
in place of *m* and *n* in equations **1** and **2**, the ‘effective’ lengths
of the sequences should be taken to be *m*′ = *m* – *l*(*x*)
and *n*′ = *n* – *l*(*x*)
(20).

Empirically, the mean length *l*(*x*) of high-scoring random alignments with sufficiently large score *x* depends linearly on *x*:

$$l(x) = \alpha x + \beta. \qquad (7)$$

We will discuss in a later section the interpretation of α and β, but note here that these parameters may be estimated by recording the lengths as well as the scores of optimal island alignments. The length of a gapped alignment is interpreted as the average length of the two segments it involves.
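A minimal sketch of the edge-effect correction applied to equation **1**: the sequence lengths are replaced by effective lengths reduced by the expected alignment length of equation **7**. The function name and the error handling are assumptions made for this illustration.

```python
import math

def corrected_evalue(x, m, n, lam, K, alpha, beta):
    """Equation 1 with effective lengths m' = m - l(x) and n' = n - l(x)."""
    length = alpha * x + beta            # expected alignment length, equation 7
    m_eff, n_eff = m - length, n - length
    if m_eff <= 0 or n_eff <= 0:
        # The expected alignment is as long as a sequence: the local
        # alignment theory is not applicable in this regime.
        raise ValueError("expected alignment length exceeds a sequence length")
    return K * m_eff * n_eff * math.exp(-lam * x)
```

The asymptotic λ and *K* are passed unchanged; only the lengths are corrected, in keeping with the view taken above.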

For the island method, the way that edge effects bias $\hat{\lambda}$ is easy to understand. The decay in the observed number of alignments with score at least *x* is steeper than would be estimated from equation **1** because the effective lengths *m*′ and *n*′ shrink with increasing *x*. Some simple calculus suggests the apparent λ from the comparison of sequences of sufficient lengths *m* and *n* should be given approximately by

$$\lambda(m,n) \approx \lambda + \alpha\left(\frac{1}{m} + \frac{1}{n}\right). \qquad (8)$$
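The calculus can be sketched as follows, under the assumption (made here for illustration) that the apparent decay rate is the negative logarithmic derivative of equation **1** with the effective lengths substituted, and using *l*′(*x*) = α from equation **7**:

$$-\frac{d}{dx}\ln\left[K\,\bigl(m - l(x)\bigr)\bigl(n - l(x)\bigr)\,e^{-\lambda x}\right] = \lambda + \frac{\alpha}{m - l(x)} + \frac{\alpha}{n - l(x)} \approx \lambda + \alpha\left(\frac{1}{m} + \frac{1}{n}\right) \quad \text{for } l(x) \ll m, n .$$

The final approximation is the reason the sequences must be of sufficient length: it fails once the expected alignment length is a sizeable fraction of either sequence.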

For the specific scoring system studied in the massive random simulation
above, we estimate α = 1.90 ± 0.02 (see discussion below). Therefore,
we expect the apparent λ for *n* × *n* comparisons to follow the equation

$$\lambda(n,n) \approx \lambda + \frac{2\alpha}{n} = \lambda + \frac{3.80}{n}. \qquad (9)$$

To test this theory, we used the island method to estimate λ for
the same scoring system studied in the simulation above. We generated
islands from many *n* × *n* random sequence comparisons, but with no border
for suppressing edge effects. Sufficient comparisons were performed
to yield over 10^{6} islands with a score of at least 37
for each of the 12 lengths *n* studied; as described
above, using this threshold eliminates almost all cutoff-based bias.
The resulting maximum-likelihood estimates $\hat{\lambda}(n,n)$ have a standard error of 0.1%, and are shown as open circles in Figure 4. Given our small uncertainty in λ and α, for *n* > 400 (1/*n* < 0.0025 in Fig. 4) the data fit the theory of equation **9** to within stochastic error (i.e. two standard deviations). Furthermore, $\hat{\lambda}(n,n)$ deviates from equation **9** by <0.5% for *n* > 218 (1/*n* < 0.0045), and by <1% throughout the range studied. For each *n*, we calculated a χ² goodness-of-fit test to the geometric distribution; in all 12 cases, the data fit the model with a *P*-value > 0.09.

**Figure 4.** Estimates of λ from *n* × *n* sequence comparisons by the island method as a function of 1/*n*. Approximately 1 000 000 islands with a score of at least 37 were generated to produce the estimates, which thus have a standard ...

We emphasize that we do not argue that the line plotted in Figure 4 is more accurate in describing score distribution tail behavior than the experimental $\hat{\lambda}(n,n)$ produced by the island method. Rather, the good agreement implies the correction we recommend for finite lengths *m* and *n* should be sufficiently accurate for comparing proteins of typical size. In evaluating the statistical significance of actual sequence comparisons, one may apply edge-effect corrections either to the sequence lengths, as we suggest, or to λ, but one should not combine the two corrections. We emphasize further that equation **8** does not permit one to estimate λ accurately from a ‘finite-size’ simulation that estimates λ(*m*,*n*) because such a simulation will not yield an estimate of the asymptotic value of α.

The island method with borders allows one to estimate the ‘infinite-length’ or
asymptotic parameters λ and *K* directly, and
simultaneously to estimate, as described below, the edge-effect
correction parameters α and β.
A single simulation that estimates these four parameters thus permits
the statistical evaluation of comparisons of sequences of arbitrary
length.

## COMPARISON OF THE DIRECT AND ISLAND METHODS

For estimating the asymptotic parameters λ, *K*, α and β, the island method has a distinct speed advantage over the direct method, as we will discuss below. However, it is easiest first to compare the two methods on the problem of estimating λ(*n*,*n*) studied in the previous section. In this ‘finite size’ case, the methods have contrasting advantages. To achieve a standard error σ in $\hat{\lambda}(n,n)/\lambda(n,n)$, the island method must generate approximately 1/σ² data points (see equation **5**), while the direct method need generate only about 0.61/σ² points (23). Furthermore, the algorithm for generating island scores requires more computation than that for generating maximal local alignment scores because it must keep track of the island to which each path graph cell belongs. Our implementation and timing experiments show the direct method uses only ~70% of the time per cell that the island method does. These two factors combined yield a speed advantage of ~240% (a factor of ~2.4) for the direct method. On the other hand, the island method may generate multiple data points from each *n* × *n* sequence comparison. The expected number of such points depends both upon the length of the sequences being compared and upon the threshold score *c*, as given by equation **1**. The total speed advantage of the island over the direct method is then $Kn^2e^{-\lambda c}/2.4$. In our case study we have been employing *c* = 37 and very many data points to obtain extremely accurate parameter estimates, but as stated above *c* = 28 would be appropriate for a comparison of more typical accuracy. At this threshold, the comparison of two sequences of length 340 yields about 2.4 islands on average, counterbalancing the direct method’s speed advantages. For comparisons larger than this, the island method will be faster than the direct method, and slower for smaller comparisons.
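The break-even argument can be sketched numerically as below. The function names are hypothetical, and the value of *K* used in the usage lines is an assumed, illustrative figure (the tabulated estimates are not reproduced in this text), so the resulting length is only indicative.

```python
import math

def island_speedup(n, lam, K, c, direct_advantage=2.4):
    """Island/direct speed ratio for n x n comparisons: K n^2 e^(-lam c) / 2.4."""
    return K * n * n * math.exp(-lam * c) / direct_advantage

def break_even_length(lam, K, c, direct_advantage=2.4):
    """Sequence length n at which the speed ratio above equals 1."""
    return math.sqrt(direct_advantage * math.exp(lam * c) / K)

# Assumed K = 0.04 (illustrative); lam and c as discussed in the text.
n0 = break_even_length(lam=0.267, K=0.04, c=28)
```

Raising the threshold *c* moves the break-even length upward exponentially, since fewer islands per comparison reach the cutoff.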

This analysis, however, tells only part of the story, because the biases of the direct and island methods in estimating λ(*n*,*n*) vary with *n*. To study the extent of this bias, for each length *n* considered in the previous section we generated sufficient data points for both the direct method and the island method with *c* = 28 to produce estimates $\hat{\lambda}(n,n)$ with a standard error of 0.1%. We then compared these estimates to the independent and effectively unbiased estimates (also with standard error 0.1%) shown by the points plotted in Figure 4; the resulting estimates of bias are given in Table 3. For sequence lengths *n* ≤ 343 the direct method tends to overestimate λ(*n*,*n*) by >1%. Some reflection reveals why this should be the case. For the scoring system under study, ~81% of all optimal alignments from 343 × 343 comparisons have a score less than 37, and >7% have a score less than 28. As we learned from our analysis of the island method, including low-scoring, largely ungapped, alignments introduces noticeable bias into estimates of λ. The problem is amplified for the direct method because, due to the extremely fast decay of the left-hand tail of the extreme-value distribution, the data points upon which the maximum-likelihood estimate most strongly depends are those with lowest score.

**Table 3.** Bias in the estimation of λ(*n*,*n*) of the island, direct and censored direct methods, and their relative speeds

Borrowing from our analysis of the island method, it is possible to greatly reduce the bias of the direct method by basing its maximum-likelihood estimate only on those scores that reach a minimum threshold *c* (23) (see Appendix). This refinement is achieved at a cost in speed, however, because not every *n* × *n* comparison will yield a data point, and because such ‘censoring’ increases the number of data points required to achieve a given standard error (23). For example, only 56% of 200 × 200 comparisons have a maximal alignment score of at least 28, and with this degree of censoring the number of data points required for a given error increases by 40% (23). For this size comparison, the censored direct method is thus 2.5 times slower than the unmodified method, while still 20% faster than the island method. A fuller analysis gives the speed advantage to the island over the censored method only for comparisons larger than about 280 × 280. Of course as the size of the comparisons grows, so does the island method’s relative speed advantage (Table 3), reaching a factor greater than 3 for comparisons of size 600 × 600.

The island method has a major speed advantage for comparisons of size ≥ 800 × 800, but it appears to have a corresponding disadvantage with respect to bias (Table 3). This arises because over 99.8% of the optimal alignment scores from 800 × 800 comparisons are at least 31, and at least 34 from 1200 × 1200 comparisons. Effectively, the score ‘threshold’ for the direct method increases with comparison size, while we have used a fixed score threshold of 28 for all the island comparisons in Table 3. Were this threshold raised for large comparisons, it would be possible to achieve equivalent bias to the direct method, while retaining a >3-fold speed advantage. However, for large comparisons, one has the option with the island method to choose greater speed over smaller bias.

If all the island method had to offer were a 2–3-fold
speed advantage for comparisons in the size range 500 × 500
to 800 × 800, it would
hardly constitute a significant advance. However, our main point
is that in lieu of estimating statistical parameters for various
finite-size comparisons, a single estimate of the asymptotic λ and *K* along with the edge-effect correction parameters α and β will
suffice. In this context, the island method has major advantages over the direct method. Most simply, the island method can accurately estimate the asymptotic λ by increasing the dimensions of its comparisons to such an extent that finite-size effects become negligible. Because the number of data points the island method generates grows in proportion to the area of its comparisons, there is no loss in speed. In contrast, as we have seen, the direct method pays a heavy penalty in speed as the size of its comparisons grows.

To avoid unduly increasing the comparison size, one might consider adding borders to direct method comparisons, as described above for the island method (Fig. 2). This, however, imposes substantial computational overheads. First, one must record where local alignments are ‘rooted’, to avoid counting local alignments rooted outside the central square. The extra computation per cell is similar to keeping track of which island a cell belongs to and increases run time by a factor greater than 1.4. Second, borders can greatly increase the computational area of medium-sized comparisons. For example, a border of moderate length 200 (see the next section) increases the area of a 600 × 600 comparison by a factor of 2.8. The two effects combined would slow such a comparison down by a factor close to 4. In contrast, for asymptotic parameter estimation, borders may be added to the island method comparisons at essentially no computational cost: first because the island method must record the roots of local alignments in any case; second because the comparisons’ underlying dimensions may be enlarged arbitrarily, rendering inconsequential the additional area entailed by the inclusion of borders.
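As a worked check of the overhead quoted for the bordered direct method, a border of length 200 on each side of a 600 × 600 comparison gives

$$\left(\frac{600 + 2\times 200}{600}\right)^2 = \left(\frac{5}{3}\right)^2 \approx 2.78, \qquad 1.4 \times 2.78 \approx 3.9 ,$$

which is the factor close to 4 stated above.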

In conclusion, for finite-size parameter estimation, the island method begins to have a speed advantage only for the comparison of sequences of moderate length. However, for the asymptotic parameter estimation we recommend, the island method has a speed advantage over the direct method approaching an order of magnitude.

## THE ESTIMATION OF α AND β

For optimal local alignments of a given score *x*,
the standard deviation in the distribution of alignment lengths
is large: about the same as the mean length. Nevertheless, the mean length
can be seen to grow approximately linearly with *x*,
as illustrated by data from the massive simulation above, plotted in Figure 5. The slope of this dependence does not approach its asymptotic value until *x* is sufficiently large. Therefore, as with estimates of λ, estimates of the parameters α and β in equation **7** are best calculated by confining attention to alignments with a score greater than or equal to a threshold value *c*. In Table 4 we give, for various thresholds, estimates of α and β obtained by linear regression on the lengths of the optimal island alignments. Once again, choosing a threshold that balances bias and stochastic error is to some degree arbitrary. We show in Figure 5 the line implied by the estimates $\hat{\alpha}$ = 1.90 and $\hat{\beta}$ = –30, yielded by the threshold *c* = 47. These estimates agree within stochastic error with those for all *c* ≥ 44.
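The regression step can be sketched as an ordinary least-squares fit of equation **7** to the (score, length) pairs of islands that reach the threshold. This is an illustration only: it assumes an unweighted fit over individual islands, which may differ from the regression actually performed, and the function name is hypothetical.

```python
def fit_length_line(scores, lengths, c):
    """Least-squares fit of l(x) = alpha*x + beta to islands with score >= c."""
    pts = [(x, l) for x, l in zip(scores, lengths) if x >= c]
    k = len(pts)
    mean_x = sum(x for x, _ in pts) / k
    mean_l = sum(l for _, l in pts) / k
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in pts)
    alpha = sxl / sxx                 # slope
    beta = mean_l - alpha * mean_x    # intercept
    return alpha, beta
```

Here the length of a gapped alignment would be taken as the average length of the two segments it involves, as defined earlier.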

**Figure 5.** Mean length *l*(*x*) of optimal island alignments, as a function of the alignment score *x*. Error bars, representing one standard error, grow with score primarily because the number of alignments on which the mean length estimates are based decreases. ...

While the standard error for $\hat{\alpha}$ is 1% at *c* = 47,
one is forced to settle for much larger errors in simulations of
more realistic size. However, α and β are used only to correct the lengths
of the sequences being compared, and the significance of alignment scores
depends only linearly upon these lengths. Therefore it is generally
quite acceptable to estimate α to within
10 or even 20%. The data generated to provide reasonably accurate
estimates of the far more important parameter λ easily
suffice for this purpose.

At a score of 95, the highest score achieved in this simulation, the
predicted mean length is less than 150. Therefore, even though the
standard deviation of the alignment length is approximately equal
to the mean length, the border of length 1000 used in our simulation
should be much more than sufficient for estimating the asymptotic
values of the parameters λ, *K*, α and β, corresponding to ‘infinite
length’ comparisons. For comparisons performed without
borders, or with borders of insufficient length, estimates of α and β deviate
from the asymptotic values, just as estimates of λ were
shown to deviate above.

The expected length of gapped alignments with a high score clearly
places limits on the applicability of equations **1** and **2** to the comparison of short sequences, even after
edge effects have been corrected for. Specifically, if the expected
length of an optimal alignment is longer than the shorter of the
two sequences being compared, then one has effectively entered the
realm of global sequence comparison, to which our theory no longer
applies. This is perhaps best seen as an indication that the combination of substitution and gap costs being employed is tailored for too ‘distant’ similarities, and that a scoring system with a greater relative entropy should be used instead (32).

## RELATIVE ENTROPY AND THE RELATION OF α TO β

It has recently been established under certain simplifying assumptions
that in the no-gap case, the edge-effect correction outlined above
is the proper first-order correction to equations **1** and **2** for finite-length sequences (J.L.Spouge, personal communication).
For high-scoring local alignments without gaps, it can be shown (33) that the average length of alignments with score *x* is well approximated by

$$l_u(x) \approx \frac{\lambda_u}{H_u}\, x, \qquad (10)$$

where $H_u$ is the relative entropy of the scoring system in nats (32), and the subscript *u* indicates we are speaking of ungapped alignments. It is therefore reasonable to define, and estimate, the relative entropy per amino acid pair for gapped alignments by the formula

$$H_g = \lambda_g/\alpha_g, \qquad (11)$$

where the subscript *g* indicates the gapped case.

Given this definition, we estimate $H_g$ for the scoring system studied above to be 0.141 ± 2% nats. Note that for the identical scoring system, Altschul and Gish (20) obtained the much greater estimate of 0.25 nats for $H_g$, due primarily to their assumption that β is 0 in equation **7**. This assumption yields a good estimate of $H_g$ only in the limit of very large scores *x*, a limit not nearly approached in simulations of practical size.
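The size of this bias can be illustrated numerically. The following is a minimal sketch, assuming that equation **7** has the linear form *l*(*x*) ≈ α*x* + β (the equation is not reproduced in this section) and using λ = 0.268 purely as an illustrative value chosen to be consistent with the quoted estimates of $H_g$ and $\alpha_g$. The function name is hypothetical.

```python
def apparent_entropy(lam: float, alpha: float, beta: float, score: float) -> float:
    """Entropy estimate lam * x / l(x) that results when beta is taken to be 0.

    Assumes the mean alignment length is l(x) = alpha * x + beta.
    """
    mean_length = alpha * score + beta
    if mean_length <= 0:
        raise ValueError("score too small: predicted mean length is not positive")
    return lam * score / mean_length


# Illustrative values only: alpha_g and beta_g are the estimates quoted in the
# text; lam_g is an ASSUMED value chosen so that lam_g / alpha_g is near 0.141.
lam_g, alpha_g, beta_g = 0.268, 1.90, -30.0

asymptotic = lam_g / alpha_g  # equation 11
for x in (40, 60, 100, 1000, 100000):
    print(x, round(apparent_entropy(lam_g, alpha_g, beta_g, x), 3))
print("asymptotic", round(asymptotic, 3))
```

Under these assumptions the apparent value exceeds the asymptotic one at every finite score and approaches it only slowly as *x* grows.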

Given that for ungapped alignments $\beta_u$ is near zero, as seen experimentally (see Table 5 for some examples), one may ask why $\beta_g$ should be distinctly negative. One way to understand this is to realize that for a scoring system in which a gap of length 1 has score –*G*, at each end of an optimal alignment there must be a section with score +*G* that does not include gaps. The average lengths of these sections will be described better by the ungapped than by the gapped α. This is a much stronger effect than the fact that an optimal alignment may not begin or end with a negatively scoring aligned pair of letters, which causes $\beta_u$ to be slightly negative. Together, these two effects lead to the prediction that the parameter $\beta_g$ can be approximated by the formula

$$\beta_g \approx 2G(\alpha_u - \alpha_g) + \beta_u. \tag{12}$$

For the particular scoring system and random letter frequencies we have been studying, *G* = 12, $\alpha_u$ = 0.79 and $\beta_u$ = –3.2. In conjunction with our estimate of 1.90 ± 0.02 for $\alpha_g$, this yields an estimate of –29.8 ± 0.5 for $\beta_g$, which coincides with the experimental value of –30 ± 1 within the precision of measurement. Similar agreement is found for other gap costs and scoring systems that are not too close to the log-linear transition (see Table 5).

Equation **12** suggests that, with a knowledge of the easily accessible $\alpha_u$ and $\beta_u$, the estimation of $\alpha_g$ alone is sufficient for the edge-effect correction. In practice, however, estimating $\beta_g$ requires no more work than estimating $\alpha_g$, so one might as well use the experimental value.

## DISCUSSION AND CONCLUSION

It was originally claimed that the primary advantage of the island
over the direct method for estimating statistical parameters lay
in speed (12). We have shown
here that this is substantially true only for asymptotic parameter
estimation. However, we also have argued that edge-effect parameters
allow a single estimation of asymptotic parameters to replace all
finite-size parameter estimates. Furthermore, the island method
permits simple maximum-likelihood estimation of λ that
accounts for discrete score data, and it allows for simultaneous
parameter estimation using various score thresholds *c*,
and thus the controlled tradeoff of systematic bias and stochastic
error.

The parameter λ depends not only upon the scoring system employed, but also upon the letter frequencies of the sequences being compared. In practice, λ may sometimes vary by >10% from one pair of sequences to another, due merely to variations in sequence composition. Yet, in the context of a database search, it is simply too time consuming to re-estimate λ for each pairwise comparison of potential interest: one moderately accurate estimate of λ requires as much time as searching a typical current database using standard heuristic methods (4). Thus, while one may precompute highly accurate estimates of λ for a fixed ‘standard’ composition, isn’t this accuracy vitiated by varying compositions?

Two solutions to the problem of varying background frequencies have been proposed, both of which can make use of accurate parameter estimation procedures. Altschul *et al*. (4) have suggested that for non-standard letter frequencies, the substitution scores be rescaled so as to set the calculable (13) parameter $\lambda_u$ equal to that for the original substitution scores used with standard frequencies. The conjecture is that the precalculated $\lambda_g$ will then apply to gapped alignments using the rescaled substitution scores in the context of the non-standard frequencies. This procedure has been implemented with good results (34). Alternatively, Mott (22) has used random simulations for a very large number of different scoring systems, gap costs, sequence compositions and sequence lengths to derive an empirical formula for λ, dependent upon variables calculable from the scoring system, letter frequencies and sequence lengths. Because the values of λ used in deriving this formula were calculated by the direct method, frequently with short sequences, some improvement in Mott’s formula may be obtainable using the methods described here. To be more conservative, statistical parameters may be based upon residue compositions within sequence regions containing the aligned segments of interest. In general, by improving the precision with which statistical parameters are estimated for local sequence alignment, more accurate judgments can be rendered concerning the biological relevance of protein and DNA sequence similarities.

## ACKNOWLEDGEMENTS

We thank Dr John Spouge for helpful conversations. This research is supported by the National Science Foundation through grants nos DMR-9971456 and DBI-9970199. R.B. and T.H. are grateful for the hospitality of the N.C.B.I. through its Scientific Visitors Program. In addition, R.B. acknowledges a Hochschulsonderprogramm III fellowship of the DAAD, R.O. acknowledges an LJIS fellowship by the Wellcome-Burroughs Fund and T.H. a Beckman Young Investigator Award.

## APPENDIX

#### Maximum-likelihood fitting

In this appendix we explain the maximum-likelihood fitting technique in the presence of discrete scores. In the case of the extreme-value distribution this extends the more commonly used maximum-likelihood fitting for continuous scores as presented, for example, by Lawless (23). By analogy to some analytical results on discrete extreme-value distributions (35), for small lattice spacing we expect only small deviations in the estimated parameters due to the discreteness of the scores as long as we perform uncensored fits. However, for alignment scores it is often necessary to estimate the parameters only for a subset of the observed scores. For such a censored fit, the discreteness of the scores must be taken into account, as discussed here, to obtain correct maximum-likelihood estimates of the parameters of the underlying distribution.

Throughout the appendix we will assume that sufficiently large island scores *S* follow a geometric distribution

$$\mathrm{Prob}(S = x) = D\,p^{x}, \tag{13}$$

where we write $p = \exp(-\lambda)$ in order to emphasize the discrete character of the scores *S*. In the simplest case, the distribution is of this geometric form for all $x \ge 0$, which fixes the prefactor *D* through normalization to $D = 1 - p$. Let us assume that we observed *N* islands with the scores $x_1, \dots, x_N$ and we wish to find the value of *p* (i.e. λ) which best describes these observed scores. Since the probability of observing these scores is just the product of their individual probabilities, the logarithm of the total probability (i.e. the log-likelihood) is

$$\ln \mathrm{Prob}(x_1, \dots, x_N) = N \ln(1 - p) + \ln p \sum_{i=1}^{N} x_i. \tag{14}$$

The best value $\hat p$ of *p* is the one which maximizes this expression. We can obtain it by equating the first derivative of this expression to zero. This yields after some simple algebra

$$\hat p = \frac{\sum_{i=1}^{N} x_i}{N + \sum_{i=1}^{N} x_i}. \tag{15}$$
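The 'simple algebra' can be spelled out as follows, assuming the simplest case *D* = 1 – *p* and writing $\bar x$ for the sample mean of the *N* scores (a symbol introduced here only for illustration):

```latex
\begin{align*}
\ln \mathrm{Prob}(x_1,\dots,x_N)
  &= \sum_{i=1}^{N} \ln\!\left[(1-p)\,p^{x_i}\right]
   = N\ln(1-p) + \ln p \sum_{i=1}^{N} x_i, \\
0 = \frac{\partial}{\partial p}\,\ln \mathrm{Prob}(x_1,\dots,x_N)
  &= -\frac{N}{1-p} + \frac{1}{p}\sum_{i=1}^{N} x_i
  \quad\Longrightarrow\quad
  \hat p = \frac{\bar x}{1+\bar x}, \qquad
  \hat\lambda = -\ln \hat p = \ln\!\left(1 + \frac{1}{\bar x}\right).
\end{align*}
```

The second derivative, $-N/(1-p)^2 - \sum_i x_i / p^2$, is negative for all $0 < p < 1$, so this stationary point is indeed the maximum.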

In our application, the distribution of island scores is not of the geometric form (equation **13**) for all scores *x*. It follows this form only asymptotically for large scores *x*. In this case, the prefactor *D* is no longer fixed by normalization. Rather, it depends on the shape of the distribution for small *x*. In order to get a good estimate for *p* we have to perform a censored fit, i.e. we keep only those island scores *x* with value at least *c*. The integer cutoff *c* is chosen so that the geometric form (equation **13**) is a reasonable description of the data. This is commonly called Type I censoring (23). We expect the censored scores to be distributed according to the restricted probabilities

$$\mathrm{Prob}(S = x \mid S \ge c) = \frac{D\,p^{x}}{\sum_{y \ge c} D\,p^{y}} = (1 - p)\,p^{x - c}, \tag{16}$$

which is independent of the unknown normalization factor *D*. If, out of *N* total scores, the scores $x_1, \dots, x_M$ attain a value of *c* or larger, the logarithm of the probability is

$$\ln \mathrm{Prob}(x_1, \dots, x_M \mid x_1 \ge c, \dots, x_M \ge c) = M \ln(1 - p) + \ln p \sum_{i=1}^{M} (x_i - c). \tag{17}$$

This log-likelihood function is identical to the one without censoring presented in equation **14** except for the shift of all scores by the cutoff *c*. Thus, the optimal value $\hat p$ is given by

$$\hat p = \frac{\sum_{i=1}^{M} (x_i - c)}{M + \sum_{i=1}^{M} (x_i - c)}. \tag{18}$$
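The censored fit can be sketched in a few lines. This is an illustration under the assumption that scores at or above the cutoff follow the shifted geometric form exactly; the function name and the toy data are hypothetical and not taken from the simulations reported here.

```python
import math
from typing import Iterable, Tuple


def censored_geometric_fit(scores: Iterable[int], cutoff: int) -> Tuple[float, float, int]:
    """Return (p_hat, lambda_hat, M) for the censored geometric fit.

    Only scores >= cutoff are used; cutoff must be an integer.
    """
    if cutoff != int(cutoff):
        raise ValueError("cutoff must be integral for discrete scores")
    kept = [x for x in scores if x >= cutoff]
    m = len(kept)
    excess = sum(x - cutoff for x in kept)  # total score above the cutoff
    if m == 0 or excess == 0:
        raise ValueError("need at least one score strictly above the cutoff")
    p_hat = excess / (m + excess)
    lambda_hat = -math.log(p_hat)  # equals log(1 + m / excess)
    return p_hat, lambda_hat, m


# Toy data, not taken from the paper.
p_hat, lambda_hat, m = censored_geometric_fit([2, 3, 5, 6, 7, 10], cutoff=5)
print(p_hat, lambda_hat, m)
```

Setting `cutoff=0` recovers the uncensored estimate, since all island scores are non-negative.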

From this expression it becomes obvious why it is important to take the discreteness of the scores into account for censored fits. The maximum-likelihood estimate $\hat p$ depends explicitly on the cutoff, as *c* appears in equation **18**, but the set of scores $x_1, \dots, x_M$ remains unchanged as the cutoff *c* is varied between two adjacent integers. Therefore, it is important to demand that *c* be integral, taking the discreteness of the scores into account.

In order to get an estimate $\hat K$ for the other distribution parameter *K*, we have to employ the expected number *E*(*x*) of islands with score at least *x* given by equation **1**. If we observe $R_c$ islands with score at least *c* in *B* pairwise comparisons we get $R_c \approx B\,E(c) \approx B \hat K mn\, e^{-\hat\lambda c}$, which can be rearranged into equation **6** for the maximum-likelihood estimate $\hat K$.
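The rearrangement for *K* can be sketched as follows. The sketch assumes that equation **1** has the form *E*(*x*) = *Kmn* exp(–λ*x*) (it is not reproduced in this section) and that all *B* comparisons use the same sequence lengths *m* and *n*; the function name and the toy numbers are hypothetical.

```python
import math


def estimate_K(islands_at_or_above_cutoff: int, cutoff: int, lambda_hat: float,
               comparisons: int, m: int, n: int) -> float:
    """Solve R_c = B * K * m * n * exp(-lambda * c) for K."""
    search_space = comparisons * m * n
    return islands_at_or_above_cutoff * math.exp(lambda_hat * cutoff) / search_space


# Toy numbers, not from the paper: 100 islands with score >= 3 found in
# 10 comparisons of two length-100 sequences, with lambda_hat = ln 2.
k_hat = estimate_K(100, cutoff=3, lambda_hat=math.log(2),
                   comparisons=10, m=100, n=100)
print(k_hat)
```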

If we choose the direct rather than the island method to estimate λ, we are interested in the distribution

$$\mathrm{Prob}(S' = x') = \mathrm{Prob}(S' \le x') - \mathrm{Prob}(S' \le x' - 1) \tag{19}$$

of the optimal local alignment scores. The optimal local alignment score *S*′ is the maximum of all the approximately ρ*mn* island scores of the two sequences compared, where ρ describes the typical island density. Thus, the two probabilities on the right-hand side of equation **19** can be expressed by the distribution of island scores *S*, and *S*′ is distributed according to

$$\mathrm{Prob}(S' = x') = \mathrm{Prob}(S \le x')^{\rho mn} - \mathrm{Prob}(S \le x' - 1)^{\rho mn}. \tag{20}$$

Since each island score follows the geometric distribution (equation **13**) we get

$$\mathrm{Prob}(S' = x') = \left[1 - \frac{D\,p^{x'+1}}{1 - p}\right]^{\rho mn} - \left[1 - \frac{D\,p^{x'}}{1 - p}\right]^{\rho mn} \approx \exp\!\left(-K mn\,p^{x'+1}\right) - \exp\!\left(-K mn\,p^{x'}\right), \tag{21}$$

with $K = \rho D/(1 - p)$. The last approximation is justified since $D p^{x'}/(1 - p)$ is a small number for all the scores *x*′ that we are interested in. Equation **21** suggests that the optimal alignment scores are indeed extreme-value distributed. The only influence of the discreteness of the scores is that the probability Prob(*S*′ = *x*′) is given by finite differences of the extreme-value distribution instead of being a proper probability density, i.e. the derivative of the extreme-value distribution.

Let us now assume that we performed *B* comparisons. We again choose a cutoff *c*, and keep only the *M* optimal local alignment scores $x'_1, \dots, x'_M$ that are greater than or equal to *c*. (If we are interested in an uncensored fit, we can always choose *c* = 0 since all local alignment scores are non-negative.) These scores are expected to follow the distribution

$$\mathrm{Prob}(S' = x' \mid S' \ge c) = \frac{\exp\!\left(-K mn\,p^{x'+1}\right) - \exp\!\left(-K mn\,p^{x'}\right)}{1 - \exp\!\left(-K mn\,p^{c}\right)}. \tag{22}$$

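The logarithm $L(p, K) = \ln \mathrm{Prob}(x'_1, \dots, x'_M \mid x'_1 \ge c, \dots, x'_M \ge c)$ of the probability of observing the censored scores $x'_1, \dots, x'_M$ then becomes

$$L(p, K) = \sum_{i=1}^{M} \ln\!\left[\exp\!\left(-K mn\,p^{x'_i+1}\right) - \exp\!\left(-K mn\,p^{x'_i}\right)\right] - M \ln\!\left[1 - \exp\!\left(-K mn\,p^{c}\right)\right]. \tag{23}$$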

As before, the best estimates $\hat p = \exp(-\hat\lambda)$ and $\hat K$ of the two parameters *p* and *K*, given the observed data $x'_1, \dots, x'_M$, are the ones which maximize the function *L*(*p*,*K*). We could try to find this maximum by taking the derivatives of *L*(*p*,*K*) with respect to *p* and *K* and equating them to zero. However, this leads to a pair of equations that can only be solved numerically. Therefore, it is better to directly use a numerical minimization algorithm applied to the function –*L*(*p*,*K*). We used the downhill simplex method in two dimensions (36). In order to improve convergence, it can be conveniently started at the values of *p* and *K* which are obtained by the relatively simple uncensored, continuous extreme value fit to the data. Then, it converges rapidly towards the global minimum ($\hat p$, $\hat K$) of the function –*L*(*p*,*K*).
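The objective function for the direct method can be sketched as below. This is an illustration, not the code used for the results reported here: it assumes that every comparison has the same product *mn*, that the censored probabilities are finite differences of exp(–*Kmn p*^*x*) normalized over scores of at least *c*, and all function names are hypothetical.

```python
import math
from typing import Sequence


def censored_pmf(x: int, p: float, K: float, mn: float, cutoff: int) -> float:
    """Prob(S' = x | S' >= cutoff) under the discretized extreme-value law."""
    a_x = K * mn * p ** x   # expected number of islands with score >= x
    a_next = a_x * p        # same quantity for score >= x + 1
    # exp(-a_next) - exp(-a_x), written with expm1 to limit cancellation
    numerator = -math.exp(-a_next) * math.expm1(-(a_x - a_next))
    denominator = -math.expm1(-K * mn * p ** cutoff)
    return numerator / denominator


def neg_log_likelihood(params: Sequence[float], scores: Sequence[int],
                       mn: float, cutoff: int = 0) -> float:
    """-L(p, K) for the optimal scores that are >= cutoff."""
    p, K = params
    if not (0.0 < p < 1.0) or K <= 0.0:
        return math.inf     # keeps a simplex search inside the valid region
    total = 0.0
    for x in scores:
        if x < cutoff:
            continue        # censored: scores below the cutoff are ignored
        prob = censored_pmf(x, p, K, mn, cutoff)
        if prob <= 0.0:
            return math.inf  # numerically impossible observation
        total -= math.log(prob)
    return total
```

If SciPy is available, one option is to pass this function to `scipy.optimize.minimize` with `method="Nelder-Mead"`, which is a downhill simplex search, supplying `(scores, mn, cutoff)` through the `args` parameter and a starting point from a simpler fit.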

## References

*Statistics of Extremes*. Columbia University Press, New York, NY.

*Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology*. AAAI Press, Menlo Park, CA, pp. 211–222.

*Statistical Models and Methods for Lifetime Data*. Wiley, New York, NY, pp. 141–202.

*Numerical Recipes in C. The Art of Scientific Computing*, Second Edition. Cambridge University Press, New York, NY, pp. 408–412.

*Atlas of Protein Sequence and Structure*. National Biomedical Research Foundation, Washington, DC, Vol. 5, Suppl. 3, pp. 345–352.

*Atlas of Protein Sequence and Structure*. National Biomedical Research Foundation, Washington, DC, Vol. 5, Suppl. 3, pp. 353–358.

**Oxford University Press**


The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Research. 2001 Jan 15; 29(2):351.
