
###### Colloquium Paper

# A unified statistical framework for sequence comparison and structure comparison

^{*}Department of Structural Biology, Stanford University, Stanford, CA 94305; and

^{‡}Molecular Biophysics and Biochemistry Department, P.O. Box 208114, Yale University, New Haven, CT 06520-8114

^{†}To whom reprint requests should be addressed. e-mail: michael.levitt@stanford.edu.

## Abstract

We present an approach for assessing the significance of
sequence and structure comparisons by using nearly identical
statistical formalisms for both sequence and structure. Doing so
involves an all-vs.-all comparison of protein domains [taken here from
the Structural Classification of Proteins (scop) database] and then
fitting a simple distribution function to the observed scores. By using
this distribution, we can attach a statistical significance to each
comparison score in the form of a *P* value, the
probability that a better score would occur by chance. As expected, we
find that the scores for sequence matching follow an extreme-value
distribution. The agreement, moreover, between the *P*
values that we derive from this distribution and those reported by
standard programs (e.g., blast and fasta)
validates our approach. Structure comparison scores also follow an
extreme-value distribution when the statistics are expressed in terms
of a structural alignment score (essentially the sum of reciprocated
distances between aligned atoms minus gap penalties). We find that the
traditional metric of structural similarity, the rms deviation in atom
positions after fitting aligned atoms, follows a different distribution
of scores and does not perform as well as the structural alignment
score. Comparison of the sequence and structure statistics for pairs of
proteins known to be related distantly shows that structural comparison
is able to detect approximately twice as many distant relationships as
sequence comparison at the same error rate. The comparison also
indicates that there are very few pairs with significant similarity in
terms of sequence but not structure whereas many pairs have significant
similarity in terms of structure but not sequence.

**Keywords:** sequence analysis, structure analysis, fold family, database statistics, protein evolution

Comparison is a most fundamental operation in biology. Measuring the similarities between “things” enables us to group them in families, cluster them in trees, and infer common ancestors and an evolutionary progression. Biological comparisons can take place at many levels, from that of whole organisms to that of individual molecules. We are concerned here with the comparison on the latter level, specifically, with comparisons of individual protein sequences and structures. (For an example of systematic comparison applied to whole organisms, see refs. 1 and 2.)

Our overall aim is to describe these two types of comparisons in a self-consistent, unified framework. For sequence or structure comparison, each act of comparing one “entity” to another (that is, either comparing two sequences or two structures) involves two steps. First, the two objects are aligned optimally through the introduction of gaps in such a way as to maximize their residue-by-residue similarity. This operation generates some form of total similarity score for the number of residues matched—traditionally, a percent identity for sequences or an rms for structures, although we will use other measures. Second, one has to assess the significance of this score in the context of what is known about the proteins currently in the database.

In earlier papers, Gerstein and Levitt (3, 30) extended the work of
Subbiah *et al.* (4) and Laurents *et al.* (5) and
described an approach for structural alignment in an analogous fashion
to the traditional approach for sequence alignment (6–9). Like
sequence alignment, this method involves applying dynamic programming
to a matrix of similarities between individual residues to optimize
their overall correspondence through the introduction of gaps.

In this paper, we tackle the second of the two steps in protein
comparison: assessing significance. We developed a simple empirical
approach for calculating the significance of an alignment score based
on doing an all-vs.-all comparison of the database and then curve
fitting to the distribution of scores of true negatives. This allows us
to express the significance of a given alignment score in terms of a
*P* value, which is the chance that an alignment of two
randomly selected proteins would obtain this score. We applied our
approach consistently to both sequences and structures. For sequences,
we could compare our fit-based *P* values with the differently
derived statistical score from commonly used programs such as
blast and fasta (10–13). The agreement we
found validated our approach. For structure alignment, we followed a
parallel route to derive an expression for the *P* value of a
given alignment in terms of the structural alignment
score.

Our work followed on much that recently has been done assessing
the significance of sequence and structure comparison. One of the major
developments in the past few years has been the implementation of
probabilistic scoring schemes (13–16). These give the significance of
a match in terms of a *P* value rather than an absolute,
“raw” score (such as percent identity). This places scores from
very different programs in a common framework and provides an obvious
way to set a significance cutoff (that is, at *P* <
0.0001 or 0.01%). *P* values were first used in the
blast family of programs, where they are derived from an
analytic model for the chance of an arbitrary ungapped alignment (10,
17). *P* values subsequently have been implemented in other
programs, such as fasta and gapped blast by
using a somewhat different formalism (13, 18, 19).

There are currently many methods for structural alignment (20–31).
Some of these are associated with probabilistic scoring schemes. In
particular, one method (vast) computes a *P* value
for an alignment based on measuring how many secondary structure
elements are aligned as compared with the chance of aligning this many
elements randomly (28). Another method (27, 32) expresses the
significance of an alignment in terms of the number of standard
deviations it scores above the mean alignment score in an all-vs.-all
comparison (i.e., a Z-score).

#### Data Set Used for Testing.

One of the most important aspects of our analysis is that we carefully tested it against the known structural relationships. This testing allowed us to decide unambiguously whether a given comparison resulted in a true or false-positive and to decide objectively between different statistical schemes. In particular, structures were taken from the Protein Data Bank (33–34) and definitions of domains, structural classes, and structural similarities were taken from the Structural Classification of Proteins (scop) database (version 1.32; refs. 35–37). The creators of scop have clustered the domains in the Protein Data Bank on the basis of sequence identity (38, 39). At a sequence identity level of 40%, this clustering resulted in 941 unique sequences corresponding to the known structural domains. These 941 sequences were what we used as test data for both the sequence and structure comparisons. They contained 390 different superfamilies and 281 different folds. Because they had a considerably closer and more certain relationship than fold pairs, we concentrated here on superfamily pairs. These 2,107 nontrivial, pairwise relationships between the domains formed our set of true-positives.

#### Sequence Comparison Statistics.

Sequence matching was done with standard approaches: In particular, we used the ssearch implementation of the Smith–Waterman algorithm (7) [from the fasta package, version 3, (12, 40); the URL is ftp://ftp.virginia.edu/pub/fasta], with a gap-opening penalty of −12, a gap-extension penalty of −2, and the blosum50 substitution matrix [which has a maximal match score of 13 (for C to C) and an average match score of −0.36].

##### A probability–density function for sequence–comparison scores.

Each pairwise sequence comparison was best quantified by
three numbers, S_{seq}, n, and m, where
S_{seq} is the raw sequence alignment score and n
and m are the lengths of the two sequences compared. Comparing all
possible pairs of sequences allowed us to calculate an observed
probability density, ρ^{o}_{seq},
for the chance of finding a pair of sequences with particular values
for S_{seq} and ln(nm). Fig. 1*A* shows the density
for pairs between all sequences. This includes the scores for ≈300
sequence pairs that are related closely, which clearly show up as
“spots” on the right side of the plot. These high-scoring
“true-positives” are removed in Fig. 1*B*, which shows
the density for just the pairs in different structural classes (42),
i.e., the pairs that definitely are unrelated. This is the density
distribution that we aim to fit.

Fig. 1. Observed sequence pair density, ρ_{seq}^{o}, contoured against S_{seq}, the sequence alignment score (along the horizontal axis), and ln(nm), where n and m are the lengths of the pair sequences (along the vertical axis) ...

Fig. 2*A* shows the
density distribution as a function of S_{seq} for
sections at constant ln(nm). The clear linear relationship between
log(ρ_{seq}^{o}) and S_{seq} at
high values of S_{seq} is indicative of an
extreme-value distribution.

The variable “Z” was defined in terms of
S_{seq} and ln(nm) by using the “Z-score-like”
expression Z = (S_{seq} −
μ_{seq})/σ_{seq}, where
μ_{seq} = a ln(nm) + b and
σ_{seq}= a are the most likely sequence score and
width parameter for the distribution. The two adjustable parameters a
and b were obtained by fitting the calculated density
ρ_{seq}^{c}(Z) to the observed density
ρ_{seq}^{o}(Z) for all values of
S_{seq} and ln(nm). Substituting for
μ_{seq} and σ_{seq} for Z
above gave Z = (S_{seq} − a ln(nm) −
b)/a = S_{seq}/a − ln(nm) − b/a.

Fig. 2. (*A*) Plots of the logarithm of the observed sequence pair density, log(ρ_{seq}^{o}) ...

To derive specific values for the a and b parameters, we fit the above
formulas to the observed density distribution obtained by comparing
pairs in different scop classes, getting a = 5.84 and b =
−26.3. The fit was done by least-squares optimization by using the
simplex minimizer in matlab (Math Works, Natick, MA). It
has a residual of 0.084, which was calculated by using the standard
relation *r* = Σ
w_{i}(O_{i} −
C_{i})^{2}/Σ
w_{i}(O_{i})^{2},
where i indexes “bins” with particular S_{seq}
and ln(nm) values, O_{i} = log
(ρ_{seq}^{o}(Z_{i})) is the
observed density in a bin, C_{i}= log
(ρ_{seq}^{c}(Z_{i})) is the
calculated density in a bin, w_{i} =
1/N_{i} is a weighting factor,
N_{i} is the number of sequence pairs in a bin, and
the summation is over all bins, i, with ln(nm) between 5.9 and 13.5.
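The residual relation quoted above can be sketched in a few lines. This is only an illustration of the formula, not the authors' matlab fit; the function name `weighted_residual` and the toy bin values are hypothetical.

```python
def weighted_residual(observed, calculated, weights):
    """Return r = sum(w_i (O_i - C_i)^2) / sum(w_i O_i^2) over all bins."""
    numerator = sum(w * (o - c) ** 2
                    for o, c, w in zip(observed, calculated, weights))
    denominator = sum(w * o ** 2 for o, w in zip(observed, weights))
    return numerator / denominator


# Toy bins (made up): log densities and the number of pairs in each bin.
log_observed = [-2.0, -3.0, -4.0]
log_calculated = [-2.1, -2.9, -4.2]
pair_counts = [100, 50, 10]

# The sequence fit uses w_i = 1 / N_i as the weighting factor.
weights = [1.0 / n for n in pair_counts]
residual = weighted_residual(log_observed, log_calculated, weights)
```

In an actual fit this quantity would be the objective handed to a minimizer that varies a and b.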

##### A cumulative sequence distribution function, giving the P value.

To estimate the statistical significance of a particular
comparison in terms of particular S_{seq}, n, and m
values, we needed the cumulative distribution function
P_{seq}(z > Z), which is defined as the
probability that matching any two random sequences will give a z value
greater than or equal to Z. This is just the integral of
ρ_{seq}^{c}(z) = exp(−z − exp(−z)) = exp(−z)
exp(−exp(−z)), from z = Z to z = ∞, so that
P_{seq}(z > Z) = 1 − exp(−exp(−Z)).
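Writing Z in terms of S_{seq}, n, and m gives

$$P_{seq}(s > S_{seq}) = 1 - \exp\left(-\exp\left(-\frac{S_{seq}}{a} + \ln(nm) + \frac{b}{a}\right)\right),$$

where the parameters a and b are given above.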
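As a minimal sketch, the *P* value can be evaluated directly from S_{seq}, n, and m by combining Z = S_{seq}/a − ln(nm) − b/a with P = 1 − exp(−exp(−Z)) and the fitted values a = 5.84 and b = −26.3. The function names are hypothetical, and the use of `expm1` is a numerical convenience rather than anything stated in the text.

```python
import math

A_SEQ = 5.84   # fitted width parameter a (from the text)
B_SEQ = -26.3  # fitted offset parameter b (from the text)


def z_seq(score, n, m, a=A_SEQ, b=B_SEQ):
    """Z = S_seq/a - ln(nm) - b/a for a raw score and two sequence lengths."""
    return score / a - math.log(n * m) - b / a


def p_seq(score, n, m, a=A_SEQ, b=B_SEQ):
    """P = 1 - exp(-exp(-Z)); -expm1(-x) keeps precision when x is tiny."""
    z = z_seq(score, n, m, a, b)
    return -math.expm1(-math.exp(-z))


# Two hypothetical comparisons of 100-residue sequences.
p_strong = p_seq(100, 100, 100)
p_weak = p_seq(50, 100, 100)
```

For large Z the expression reduces to P ≈ exp(−Z), so each additional a ≈ 5.84 score units lowers P by roughly a factor of e.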

##### Relation to blast P value.

For sequence
comparison without gaps, Karlin and Altschul (10, 11) derived the
following cumulative distribution function:
P_{K&A}(s > S_{seq}) =
1 − exp(−exp(−λ(S_{seq} −
ln(Kmn)/λ))) = 1 − exp(−exp(−λS_{seq}
+ ln(Kmn))), where λ and K are calculated analytically based
on the sequence composition and amino acid scoring matrix. Comparison
of their analytical form with our *P* value expression shows
that λ = 1/a and K = exp(b/a). Substituting the specific
values for a and b that we calculated from the fit, we found that λ =
0.171 and K = 0.011. For the particular database sequences and
amino acid scoring matrix used here, the values for λ calculated by
Karlin and Altschul’s formula ranged from 0.217 to 0.259, all somewhat
larger than our value for λ.
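The correspondence λ = 1/a and K = exp(b/a) is easy to check numerically. The short sketch below (the helper name is hypothetical) recovers the quoted values from the fitted a and b.

```python
import math


def karlin_altschul_equivalents(a, b):
    """Map the fitted (a, b) onto the Karlin-Altschul parameters (lambda, K)."""
    lam = 1.0 / a
    k = math.exp(b / a)
    return lam, k


lam, k = karlin_altschul_equivalents(5.84, -26.3)
```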

##### Relation to fasta E value.

In the
fasta sequence comparison programs (12, 13, 18), the
significance of a given alignment score S_{fa} is
estimated by fitting an extreme-value distribution to scores resulting
from comparison of a given query sequence to each sequence in the
database. The distribution is recomputed for each new query so that,
unlike our approach, each query sequence is associated with a different
distribution function. This type of association has the advantage of
allowing for any peculiarities of the query sequence (e.g., composition
bias), but it also means that one cannot estimate the significance of a
single pairwise comparison of two sequences.

The value used by fasta in judging the significance of a
sequence similarity is known as the expectation value or *E*
value (here E_{fa}). The *P* value, defined
above, gives the statistical significance of a single comparison
whereas the *E* value is an estimate of the expected number of
false-positives (dissimilar matches with a significant score) for a
search of the entire database. With N_{db} entries
in the database, the *E* value E_{seq} is
calculated from our P_{seq}(s >
S_{seq}) as E_{seq} =
N_{db} P_{seq}. The *E*
values we obtained were very similar to those found by
fasta over a very wide range of values (Fig. 3).
When one considers that our
closed-form E_{seq} depends on only two parameters
for all pairs whereas E_{fa} is optimized separately
for each query sequence (941 × 2 = 1,882 parameters in all),
this agreement is astonishing.
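A minimal sketch of the closed-form *E* value, assuming the fitted a and b and the 941-entry database described above. The function names and the base-10 logarithm cutoff of −2 (that is, E = 0.01) are illustrative choices.

```python
import math

A_SEQ, B_SEQ = 5.84, -26.3  # fitted parameters from the text
N_DB = 941                  # number of database entries used in the paper


def p_seq(score, n, m, a=A_SEQ, b=B_SEQ):
    """P value for a single pairwise comparison."""
    z = score / a - math.log(n * m) - b / a
    return -math.expm1(-math.exp(-z))


def e_seq(score, n, m, n_db=N_DB):
    """E_seq = N_db * P_seq: expected number of chance hits per database search."""
    return n_db * p_seq(score, n, m)


def is_significant(score, n, m, log_e_cutoff=-2.0, n_db=N_DB):
    """True when log10(E) falls below the chosen cutoff."""
    return math.log10(e_seq(score, n, m, n_db)) < log_e_cutoff


log_e = math.log10(e_seq(100, 100, 100))
```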

##### Measuring coverage vs. error rate to compare different formalisms for significance-statistics.

We have presented two forms of
*E* value statistics for sequence comparison: our method,
E_{seq}, which is based on fitting a two-parameter
model to the observed distribution of alignment scores; and the
fasta method E_{fa}, which is based on
fitting different distributions for each query. Now we naturally are
led to ask whether there is an objective way to decide which formalism
performs the best on some representative test data.

The seminal work of Brenner *et al.* (39) and Brenner (43)
provides a framework for such an assessment by using the known
true-positives in the scop database and a coverage-vs.-error plot. To
compare any two significance-statistics formalisms, we proceeded as
follows for each:

(*i*) For each of the pairs in the all-vs.-all comparison
(941 × 940 pairs), we determined an *E* value and noted
whether the pair was a true-positive or true-negative (for
true-positives, both sequences must belong to protein domains in the
same superfamily in the scop classification). (*ii*) We sorted the
pairs by increasing *E* value. (*iii*) We counted
down the list from best to worst until the number of false-positives
was 1% of the total number of database entries (here, this was 9
false-positives, which is ≈1% of 941). (*iv*) We got the
threshold *E* value at this point, which ideally should be
close to 0.01, so as to correspond to the 1% error rate per query. (*v*)
Finally, we got the number of entries that were more significant than
the threshold *E* value; this number defined the coverage,
which should be as large as possible.
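The ranking procedure above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name is hypothetical, ties in *E* value are ignored, and coverage is taken to mean the number of true-positives ranked ahead of the threshold.

```python
def coverage_at_error(pairs, n_db, error_rate=0.01):
    """Return (threshold E value, coverage) at the given error rate per query.

    pairs: iterable of (e_value, is_true_positive) tuples, one per comparison.
    n_db:  number of database entries (941 in the paper, so 9 false-positives).
    """
    max_false_positives = max(1, round(error_rate * n_db))
    ranked = sorted(pairs, key=lambda pair: pair[0])  # most significant first
    false_positives = 0
    coverage = 0
    for e_value, is_true_positive in ranked:
        if is_true_positive:
            coverage += 1
            continue
        false_positives += 1
        if false_positives >= max_false_positives:
            return e_value, coverage
    return None, coverage  # the error budget was never reached


# Toy data (made up): six comparisons against a 200-entry database,
# so the error budget is round(0.01 * 200) = 2 false-positives.
toy_pairs = [(1e-10, True), (1e-8, True), (1e-5, False),
             (1e-4, True), (1e-3, False), (0.1, True)]
threshold, coverage = coverage_at_error(toy_pairs, n_db=200)
```

In the toy data the second false-positive has E = 1e-3, and three true-positives rank ahead of it.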

Here, we compared the coverage and error rate of our sequence score
statistics with those of fasta (E_{seq}
vs. E_{fa}). At the threshold *E* value,
our sequence statistics had log E_{seq} = −1.98 and
a coverage of 328, and the fasta statistics had a log
E_{fa} of −1.68 and a coverage of 379. The
fasta statistics had better coverage, but our statistics
had an almost perfect threshold value, which should be −2 for 1%
error rate.

#### Structure Comparison Statistics.

The procedure we used for
pairwise structural alignment is described in detail in Gerstein and
Levitt (3, 30) and is summarized only briefly here. Our core method was
based on iterative application of dynamic programming. As such, it was
a simple application of the Needleman–Wunsch sequence alignment (6).
It originally was derived from the align program of Cohen
(21, 31), with many subsequent refinements. One starts with two
structures in an arbitrary orientation. Then one computes all pairwise
distances between every atom in the first structure and every atom in
the second, which results in an interprotein distance matrix in which
each entry, d_{ij}, corresponds to the distance
between residue i in the first structure and residue j in the second
(interresidue distances usually are expressed between α-carbons).
This distance matrix, d_{ij}, can be converted into a
similarity matrix, S_{ij}, through the relationship
S_{ij} = M/(1 +
(d_{ij}/d_{o})^{2}),
where M = 20 and d_{o} = 5 Å.
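The conversion from distances to similarities can be sketched as below, using M = 20 and d_{o} = 5 Å from the text. The function name and the toy coordinates are hypothetical, and the coordinates are assumed to be α-carbon positions in ångströms.

```python
import math

M = 20.0   # maximum similarity, reached at zero distance
D0 = 5.0   # distance (in angstroms) at which the similarity falls to M / 2


def similarity_matrix(coords_a, coords_b, m=M, d0=D0):
    """S_ij = M / (1 + (d_ij / d0)^2) for every residue pair (i, j)."""
    return [[m / (1.0 + (math.dist(p, q) / d0) ** 2) for q in coords_b]
            for p in coords_a]


# One residue compared with three residues at 0, 5, and 10 angstroms.
sim = similarity_matrix([(0.0, 0.0, 0.0)],
                        [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)])
```

Because the similarity decays smoothly with distance and never goes negative, badly fitting pairs contribute little to the total rather than dominating it.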

One applies dynamic programming to the similarity matrix to get equivalences (using a gap opening penalty of M/2 = 10 and no gap extension penalty) and uses them to least-squares fit the first structure onto the second one (44). Then one repeats the procedure, finding all pairwise distances and doing dynamic programming to get new equivalences, until the process converges. After an alignment is determined, it can be “refined” by eliminating the worst-fitting pairs of aligned residues and then refitting to get a new rms in a similar fashion to the core-finding procedure in Gerstein and Altman (45, 46). This refinement is necessary because the dynamic programming used tries to match as many residues as possible. (It is a global, as opposed to local, method.)
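A single dynamic-programming pass over a similarity matrix might look like the sketch below. It covers only the scoring step, not the superposition or the iteration to convergence. The three-state recursion and the treatment of leading and trailing gaps as free are assumptions made for illustration; the function name is hypothetical.

```python
NEG_INF = float("-inf")


def best_alignment_score(sim, gap_open=10.0):
    """Best alignment score with a gap-opening penalty and no extension penalty."""
    n, m = len(sim), len(sim[0])
    # match[i][j]: residue i of A is aligned with residue j of B
    # skip_a[i][j]: residue i of A is left unmatched (gap in B)
    # skip_b[i][j]: residue j of B is left unmatched (gap in A)
    match = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    skip_a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    skip_b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    for i in range(1, n + 1):
        skip_a[i][0] = 0.0  # leading gap, not penalized (assumption)
    for j in range(1, m + 1):
        skip_b[0][j] = 0.0  # leading gap, not penalized (assumption)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            previous = max(match[i - 1][j - 1],
                           skip_a[i - 1][j - 1],
                           skip_b[i - 1][j - 1])
            match[i][j] = previous + sim[i - 1][j - 1]
            # Opening a gap costs gap_open; extending one costs nothing.
            skip_a[i][j] = max(match[i - 1][j] - gap_open,
                               skip_a[i - 1][j],
                               skip_b[i - 1][j] - gap_open)
            skip_b[i][j] = max(match[i][j - 1] - gap_open,
                               skip_b[i][j - 1],
                               skip_a[i][j - 1] - gap_open)

    # Trailing gaps are free: the alignment may end on any match that
    # consumes the last residue of either chain.
    last_column = max(match[i][m] for i in range(1, n + 1))
    last_row = max(match[n][j] for j in range(1, m + 1))
    return max(last_column, last_row)


# Toy matrices (made up): a perfect diagonal, and a case needing one gap.
diagonal = [[20, 1, 1], [1, 20, 1], [1, 1, 20]]
one_gap = [[20, 1, 1], [1, 1, 1], [1, 20, 1], [1, 1, 20]]
score_diagonal = best_alignment_score(diagonal)
score_one_gap = best_alignment_score(one_gap)
```

In the second toy matrix the three strong matches can only be collected by skipping one residue of the longer chain, so the best score is 60 minus one gap-opening penalty.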

##### The structural comparison score and the rms.

At the end of the
procedure, we were left with a number of scores characterizing our
final alignment. The score optimized by dynamic programming was the sum
of the similarity matrix scores S_{ij} minus the
total penalty for opening gaps. We refer to this as
“S_{str}.” To be more explicit, it was
computed from the following formula:

$$S_{str} = M\left(\sum_{ij} \frac{1}{1 + (d_{ij}/d_{o})^{2}} - \frac{N_{gap}}{2}\right),$$

where N_{gap} is the total number of gaps (not
including gaps at the end of a chain) and the summation is carried out
over all pairs, ij, of equivalenced residues. The more traditional
score is the rms deviation in α-carbon position after doing a
least-squares fit on the aligned atoms (the “rms”). rms-based
statistics were used in our earlier work (for example, refs. 3–5) and
have been used in almost all other work in structural alignment.
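To make the contrast between the two scores concrete, the sketch below computes both from a list of aligned-pair distances. It assumes S_{str} = M(Σ 1/(1 + (d_{ij}/d_{o})^{2}) − N_{gap}/2), which follows from the similarity matrix and the gap-opening penalty of M/2 described earlier. The function names and distances are hypothetical.

```python
import math


def structural_alignment_score(distances, n_gap, m=20.0, d0=5.0):
    """Sum of similarity scores over aligned pairs minus M/2 per internal gap."""
    similarity_sum = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return m * (similarity_sum - n_gap / 2.0)


def rms_deviation(distances):
    """Root-mean-square distance between aligned atoms after fitting."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


# Three aligned pairs at 0, 5, and 10 angstroms, with one internal gap.
toy_distances = [0.0, 5.0, 10.0]
s_str = structural_alignment_score(toy_distances, n_gap=1)
rms = rms_deviation(toy_distances)
```

In this toy case the 10 Å pair adds only 4 of the 34 similarity units but accounts for 80% of the squared deviation entering the rms.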

##### A probability–density function for structural alignment scores.

To derive significance-statistics for the structural
alignment score S_{str}, we proceeded exactly as we
did for sequence comparison. Structural alignment of all pairs in the
database gave us an observed probability distribution for comparison
scores ρ_{str}^{o}, which was a function of the number
of residues matched N and the comparison score
S_{str} (Fig. 4*A*). This distribution
contained the many pairs of structures that were similar, and these
pairs stood out with high S_{str} scores. Fig. 4*B*
shows data for pairs that were in different scop
structural classes and, therefore, should not have had any structural
similarity. Fig. 4*B* is much “cleaner” than Fig. 4*A*
and shows the underlying distribution expected for the
comparison of structures that are not similar.

Fig. 4. The observed structure pair density, ρ_{str}^{o}, is contoured against S_{str}, the structural alignment score (along the horizontal axis), and N, the number of aligned residues (along the vertical axis). By following ...

Fig. 2*B* shows the density distribution as a function
of S_{str} for sections at constant N. There is a
close parallel between the structural alignment score
S_{str} and the sequence alignment score,
S_{seq}, in Fig. 2*A*, and both can
be modeled by an extreme-value distribution. Thus, we fit the
calculated structure density by ρ_{str}^{c}(Z) =
exp(−Z − exp (−Z)), where the variable Z is defined in terms of
S_{str} and N by using Z =
(S_{str} −
μ_{str})/σ_{str}. The most
likely structure score μ_{str} and the width
parameter σ_{str} have a more complicated
dependence on the number of matched residues N than was the case for
sequences, with μ_{str}(N) = c ln(N)^{2} + d
ln(N) + e (if N < 120), μ_{str}(N) = a ln(N)
+ b (if N ≥ 120) and σ_{str}(N) = f ln(N) +
g (if N < 120) and σ_{str}(N) = f ln(120) +
g (if N ≥ 120).

Continuity of function values and slopes allows a and b to be written
in terms of c, d, and e. To be more specific, at N = 120, a ln(N)
+ b = c ln(N)^{2} + d ln(N) + e and a = 2c
ln(N) + d. Thus, the expressions for μ_{str}(N)
and σ_{str}(N) involve five independent
parameters: c, d, e, f, and g. We determined these five parameters via
least-squares optimization by using the simplex minimizer
in matlab, which yielded c = 18.4, d = −4.50,
e = 2.64, f = 21.4, and g = −37.5 (a = 171.8 and
b = −419.3 were derived as described above). The residual was
0.288. It was given by the same formula as was used for the residual in
the sequence statistics fit with O_{i} =
ρ_{str}^{o}(Z_{i}),
C_{i} =
ρ_{str}^{c}(Z_{i}) and
w_{i} = 1, and the summation was over bins with any
value of S_{str} and N between 30 and 170 residues.
The resulting fit of the observed and calculated distribution (Fig. 2*B*)
was good for all values of N and
S_{str}.
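The continuity conditions can be checked numerically. Matching slopes at the join gives a = 2c ln(N) + d, and matching values then gives b = e − c ln(N)^{2}. With the rounded parameters quoted in the text, the sketch below (the helper name is hypothetical) yields a slope near 171.7 and an intercept near −419.1.

```python
import math


def linear_branch(c, d, e, n_join):
    """Slope a and intercept b of the linear branch a*ln(N) + b that meets
    c*ln(N)^2 + d*ln(N) + e with equal value and slope at N = n_join."""
    log_join = math.log(n_join)
    a = 2.0 * c * log_join + d   # equal slopes with respect to ln(N)
    b = e - c * log_join ** 2    # equal values, after substituting a
    return a, b


a_str, b_str = linear_branch(c=18.4, d=-4.50, e=2.64, n_join=120)
```

The small differences from the published a and b are consistent with c, d, and e being quoted to three significant figures.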

##### A cumulative structure distribution function, giving the P value.

To estimate the statistical significance of a particular
structure comparison in terms of its S_{str} and N
values, we proceeded as we did for sequence comparison. We integrated
the score distribution to determine a cumulative distribution function
P_{str}, defined as the probability that matching
two random structures will give a z value greater than or equal to Z.
The structure score distribution has the same extreme-value form as the
sequence score distribution, so the derivation of
P_{str} follows that of P_{seq},
with P_{str}(z > Z) = 1 −
exp[−exp(−Z)], where Z is expressed in terms of
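S_{str} and N by using

$$Z = \frac{S_{str} - \mu_{str}(N)}{\sigma_{str}(N)}, \qquad \mu_{str}(N) = \begin{cases} c \ln(N)^{2} + d \ln(N) + e & (N < 120) \\ a \ln(N) + b & (N \ge 120) \end{cases}, \qquad \sigma_{str}(N) = \begin{cases} f \ln(N) + g & (N < 120) \\ f \ln(120) + g & (N \ge 120) \end{cases}$$

and the seven parameters a, b, c, d, e, f, and g are given above.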
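Putting the pieces together, a structural *P* value could be computed as in the sketch below. It uses the five fitted parameters c, d, e, f, and g from the text and derives the linear branch from the continuity conditions instead of hard-coding a and b. The function names are hypothetical.

```python
import math

C, D, E = 18.4, -4.50, 2.64   # quadratic branch of mu_str (from the text)
F, G = 21.4, -37.5            # sigma_str parameters (from the text)
N_JOIN = 120                  # N at which the two branches meet


def mu_str(n):
    """Most likely structural alignment score for n matched residues."""
    log_n = math.log(n)
    if n < N_JOIN:
        return C * log_n ** 2 + D * log_n + E
    log_join = math.log(N_JOIN)
    a = 2.0 * C * log_join + D   # slope from continuity of the derivative
    b = E - C * log_join ** 2    # intercept from continuity of the value
    return a * log_n + b


def sigma_str(n):
    """Width parameter; held constant once n reaches N_JOIN."""
    return F * math.log(min(n, N_JOIN)) + G


def p_str(score, n):
    """P = 1 - exp(-exp(-Z)) with Z = (S_str - mu_str) / sigma_str."""
    z = (score - mu_str(n)) / sigma_str(n)
    return -math.expm1(-math.exp(-z))


# Two hypothetical alignments, each with 100 matched residues.
p_high = p_str(800.0, 100)
p_low = p_str(400.0, 100)
```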

##### Structural comparison statistics based on rms.

The traditional
characterization of a structural alignment is in terms of the number of
residues matched, N, and the rms deviation from fitting these matched
residues, R. It is convenient to focus on ln(R), which ensures that
there is good separation of values for small R, where the significant
pairs occur. We calculated a probability distribution
ρ_{rms}^{o}[ln(R),N] for the observed rms values of
true-negative pairs in the same fashion as we did earlier for the
observed distribution of structural alignment scores
ρ_{str}^{o}(S_{str},N).

The fact that log (ρ_{rms}^{o}) varies very slowly
with ln(R) near the maximum (Fig. 5) led us to fit
the calculated density by using ρ_{rms}^{c}(Z) =
exp(−Z^{4}), where Z is defined in terms of ln(R)
and N as Z = (ln(R) −
μ_{rms}(N))/σ_{rms}(N),
with μ_{rms}(N) = c ln(N)^{2} +
d ln(N) + e (if N < 60), μ_{rms}(N) = a
ln(N) + b (if N ≥ 60) and σ_{rms}(N) = f
ln(N) + g (if N < 60), σ_{rms}(N) = f ln(60)
+ g (if N ≥ 60). The values of the five independent parameters c, d,
e, f, and g were determined by least-squares optimization by using the
simplex minimizer in matlab, which yielded
c = 0.155, d = −0.619, e = 1.73, f = 0.0922, and
g = 0.212. (a = 0.650 and b = −0.872 were determined as
before to ensure continuity.)

Fig. 5. The observed, log(ρ_{str}^{o}), and calculated, log(ρ_{str}^{c}), structure pair density distributions are plotted against the rms score ln(R) for different numbers of aligned residues, ...

To estimate the statistical significance of a particular comparison in
terms of its R and N values, we derived a cumulative distribution
function P_{rms}(z < Z), defined as the
probability that any z will be less than or equal to a given Z. This
was just the integral of
ρ^{c}_{rms}(z) from z = −∞
to z = Z. Because the function exp(−z^{4})
cannot be integrated analytically, we integrated it numerically for z
from −5 to Z and tabulated its value for 10,000 different Z values
from −5 to 5.
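The tabulation described above can be sketched with a trapezoid rule on 10,000 points between −5 and 5. Dividing by the total area, so that the table runs from 0 to 1, is an assumption; the text does not say how the integral was normalized. The function names and the linear-interpolation lookup are likewise illustrative.

```python
import bisect
import math


def tabulate_rms_cumulative(z_min=-5.0, z_max=5.0, n_points=10_000):
    """Cumulative integral of exp(-z^4) on a uniform grid (trapezoid rule)."""
    step = (z_max - z_min) / (n_points - 1)
    zs = [z_min + i * step for i in range(n_points)]
    density = [math.exp(-(z ** 4)) for z in zs]
    cumulative = [0.0]
    for i in range(1, n_points):
        area = 0.5 * (density[i - 1] + density[i]) * step
        cumulative.append(cumulative[-1] + area)
    total = cumulative[-1]
    # Normalizing by the total area is an assumption made for this sketch.
    return zs, [value / total for value in cumulative], total


def p_rms(z, zs, table):
    """Look up the tabulated probability that a random z is <= the given z."""
    if z <= zs[0]:
        return 0.0
    if z >= zs[-1]:
        return 1.0
    i = bisect.bisect_left(zs, z)
    fraction = (z - zs[i - 1]) / (zs[i] - zs[i - 1])
    return table[i - 1] + fraction * (table[i] - table[i - 1])


zs, table, total = tabulate_rms_cumulative()
```

The total area can be compared with the closed form 2Γ(5/4) for the integral of exp(−z^{4}) over the whole real line, which gives a quick check on the numerical integration.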

##### Comparing structure comparison statistics: Alignment score S_{str} vs. rms.

Once we had derived structure comparison
statistics based on structural alignment score
S_{str} and rms, we could compare them. The same
coverage-vs.-error scheme used above to compare the two formulae for
sequence alignment significance could be used again here. When assessed
in terms of coverage (number of true-positives found) at a given error
rate on our test data, the *E* value statistics based on
S_{str} gave a much better performance (i.e., had a
larger coverage) than those based on rms. To be more specific, we
compared the two approaches (E_{str} vs.
E_{rms}) in exactly the same way that we previously
had compared our sequence *E* value to that produced by
fasta (E_{seq} vs.
E_{fa}). We found that, at the 1% error threshold,
the rms-based statistics have log(E_{rms}) = −32.8
and a coverage of 202 whereas the structural-alignment score statistics
have log(E_{str}) = −1.58 and a coverage of 627.
Clearly, the statistics based on S_{str} perform
much better because the threshold is much more reliable (i.e., closer
to the value of −2 for an error rate of 1%) and the true-positive
coverage is >3-fold higher. The difference between
E_{str} and E_{rms} is striking
and confirms that the structure score is much better than the rms
score.

There are other reasons why the structural alignment score
S_{str} is a more reliable indicator than rms:
(*i*) S_{str} depends most strongly on the
best-fitting atoms whereas rms depends most on the worst-fitting atoms;
(*ii*) S_{str} penalizes gaps, whereas rms
does not; and (*iii*) S_{str} is formally
analogous to the score one gets from a standard sequence comparison,
S_{seq}, because both quantities are derived from a
“dynamic-programming” similarity matrix. As dynamic programming
finds a maximum score over many possible alignments, it is reasonable
that both S_{str} and S_{seq} should follow an
extreme value distribution. However, this is not a trivial result, as
the scores are not independent, random variables whose maximum must
follow such a distribution.

#### Relationship Between Sequence Comparison and Structure Comparison.

Having derived sequence and structure significance
scores by using all-vs.-all comparisons on the same database of 941
sequences and structures, we were in a position to compare directly
structure and sequence significance scores. Fig. 6
shows such a comparison for the
2,107 pairs of proteins in our data set that are considered to be
related evolutionarily according to scop (i.e., they are the
true-positives in the same superfamily). The lines at
log(E_{seq}) = −2 and at
log(E_{str}) = −2 divide the 2,107 true-positive
pairs among four quadrants, depending on whether their sequence or
structure matches are significant, as follows:

Fig. 6. The structure significance, log(E_{str}), plotted against the sequence significance, log(E_{seq}), for the 2,107 pairs of proteins judged to be homologous in the scop database (in the same ...

##### Top right (1,204 pairs; nonsignificant sequence match, nonsignificant structure match).

Over half (1,204 of 2,107) of the pairs of domains thought to be evolutionarily related by scop fall into this category of having no significant match, indicating that the combination of manual measures used in scop is more sensitive than either automatic sequence or structure comparison.

##### Lower left (244 pairs; significant sequence match, significant structure match).

These pairs are evenly distributed in the lower left quadrant, indicating that the sequence and structure significance scores are on the same scale.

##### Lower right (576 pairs; nonsignificant sequence match, significant structure match).

There are many more pairs with good structure matches but without sequence matches than the converse (sequence match but no structure match). This fact objectively shows how structure is conserved more than sequence in evolution. These 576 pairs are very good test cases for threading algorithms that match a sequence to a structure, and we currently are testing them in this way.

##### Top left (83 pairs; significant sequence match, nonsignificant structure match).

Almost all of the pairs (76 of 83) in this category involve matches with a small number of residues (N < 70). For such short matches, the structures may be deformed and may not match well. There are seven labeled pairs that are exceptions because the match is extensive (N > 70), but the pairs structurally are less similar than would be expected from the strong sequence match. These seven exceptions involve 11 coordinate sets. Three of these sets were solved by x-ray crystallography to only medium resolution (>2.9 Å, 1mys, 1scm, and 1tlk), five were solved by NMR (1prr, 1ntr, 2pld, 2pna, and 1tnm), and three are high resolution x-ray structures (better than 1.7 Å for 1osa, 3chy, and 1sha). None of the seven exceptional pairs involved two high resolution structures, and it seems likely that some of the seven exceptions would have had a more significant structural match if both structures in the pair were determined to a high resolution. Furthermore, as determined from consultation of a Database of Macromolecular Movements (ref. 47; see database at http://bioinfo.mbb.yale.edu/MolMovDB), some of the seven exceptions involved proteins that had been solved in different conformational states. In particular, 1osa, 1mys, and 1scm involved proteins with the highly flexible calmodulin fold. These are clearly examples for which one would expect sequence similarity but structural differences.
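The four-way split described above amounts to two threshold tests per pair. In the sketch below, the axis orientation (sequence significance horizontal, structure significance vertical) is inferred from the quadrant names. The function names and the strict inequality at the cutoff are assumptions.

```python
from collections import Counter


def quadrant(log_e_seq, log_e_str, cutoff=-2.0):
    """Name the quadrant of a pair; a match is significant when log E < cutoff."""
    vertical = "lower" if log_e_str < cutoff else "top"
    horizontal = "left" if log_e_seq < cutoff else "right"
    return f"{vertical} {horizontal}"


def tally_quadrants(pairs, cutoff=-2.0):
    """Count (log_e_seq, log_e_str) pairs falling in each quadrant."""
    return Counter(quadrant(seq, struct, cutoff) for seq, struct in pairs)


# Toy pairs (made up), one per quadrant.
toy_counts = tally_quadrants([(-5.0, -5.0), (-1.0, -5.0),
                              (-5.0, -1.0), (-1.0, -1.0)])

# Counts reported in the text for the 2,107 true-positive pairs.
reported = {"top right": 1204, "lower left": 244,
            "lower right": 576, "top left": 83}
```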

## DISCUSSION AND CONCLUSION

#### Summary.

We have presented an approach for assessing in a
unified statistical framework the significance of a given comparison of
proteins, whether involving sequences or structures. For either
sequence or structure we fit an extreme-value distribution to the
observed distribution obtained from the all-vs.-all comparison of the
database (i.e., between pairs of scop domains in different structural
classes). For sequence comparison, this extreme-value distribution is
as expected: We empirically observed for gapped alignments what Karlin
and Altschul (11) derived for ungapped ones. We also gave a simple
formula for the *E* value that is likely to be useful for
pairwise comparisons without involving searches of the entire database.

For structure comparison, we found that the score distribution follows
an extreme-value distribution when expressed in terms of the structural
alignment score S_{str}. By using this measure, expressions
for statistical significance can be formulated in an almost identical
way for structure as they are for sequence. It is important to realize
that, although the S_{str} is produced naturally by our
specific alignment method, it can be calculated from any arbitrary
structural alignment. Thus, by using our formulas, a significance can
be computed from the results of any structural alignment program. Using
the more traditional rms deviation as a score does not lead to as
reliable a measure of structural significance.

In connection with this, it is interesting that recent work (39, 43) indicates that the significance statistics based on optimized “sum” scores from dynamic programming (i.e., Smith–Waterman scores, which are essentially sums of blosum matrix values minus gap penalties) perform much better than those based on the traditional measure of sequence similarity, percentage identity, which parallels the poor performance of our structural alignment statistics based on the traditional rms. It is disconcerting that well-established and intuitive measures such as percentage identity or rms perform so much worse than the statistical measures based on the sequence or structure alignment scores.

Furthermore, it is surprising that over half of the relationships between distant homologues in scop were not statistically significant (at a rate of 1% error per query) using either pure sequence comparison or pure structure comparison. Almost all of the pairs found by sequence comparison were found by structure comparison, but there were many pairs found by structure comparison that were not found by sequence comparison. Overall, structural comparison was able to detect about twice as many of the scop distant homology superfamily pairs as sequence comparison (at the same rate of error).

#### Future Directions.

The approach we have used to derive
statistical significance could easily be generalized to other contexts.
In particular, it can be adapted to provide significance statistics for
threading. We have not presented a detailed examination of the
significance values for specific pairs of sequences or structures. Such
an examination could prove to be a useful endeavor in the future,
particularly if it focused on pairs of proteins with the same fold but
insignificant *E* values and those with different folds but
significant *E* values. These two classes of pairs
characterize the twilight zone for structure, which has yet to be
described fully.

## Acknowledgments

We thank S. E. Brenner for carefully reading the manuscript and S. E. Brenner and T. Hubbard for providing the pdb40d-1.32 database. M.G. acknowledges the National Science Foundation for support (Grant DBI-9723182), and M.L. acknowledges the Department of Energy (Grant DE-FG03-95ER62135).

## ABBREVIATION

- scop: Structural Classification of Proteins

## References

**A**. 1976;32:922–923.

**National Academy of Sciences**

Colloquium Paper: A unified statistical framework for sequence comparison and structure comparison. Proceedings of the National Academy of Sciences of the United States of America. May 26, 1998; 95(11):5913.
