The Statistics of PSI-BLAST Scores |

The statistics of sequence similarity scores |
The statistics of PSI-BLAST scores |
Iterated profile searches with PSI-BLAST |
BLAST Home |

Estimation of statistical parameters for local alignment scores Generalization to PSI-BLAST alignment scores References |
## Estimation of statistical parameters for local alignment scoresAs discussed previously, computation experiments [1-8] strongly suggest that the optimal gapped local alignment scores produced by the Smith-Waterman algorithm [9] and approximated by FASTA [10] or Gapped BLAST [6,7] follow an extreme value distribution [11]. Specifically, the probability that the optimal scoreS from the comparison of unrelated proteins is at least x is
given by the equationwhere K and lambda are statistical parameters dependent upon the scoring
system and the background amino acid frequences of the sequences being
compared. While FASTA estimates these parameters from the scores generated
by actual database searches, BLAST estimates them beforehand for specific
scoring schemes by comparing many random sequences generated using a standard
protein amino acid composition [12].For example, using BLOSUM-62 amino acid substitution scores [13], and affine gap costs [14-16] in which a gap of length k is assigned a score of -(10 + k),
we generated 10,000 pairs of length-1000 random protein sequences, and used
the Smith-Waterman algorithm to calculate 10,000 optimal local alignment
scores. From these scores, lambda was estimated at 0.252 and K at 0.035
by the method of maximum-likelihood [17]. In general, given M samples from
an extreme value distribution, the ratio of the maximum-likelihood estimate
of lambda to its actual value is approximately normally distributed, with mean
1.0 and standard deviation 0.78/sqrt(M) [17]. Thus the standard error for our
estimate of lambda is about 0.002, or less than 1%.A plot of the optimal local alignment scores generated in this example is shown in Figure 1, along with the maximum-likelihood extreme value curve. The chi-squared goodness-of-fit test for these data, with 34 degrees of freedom, is 25.6, which is lower than would be expected to occur by chance 87% of the time even were the theory precisely valid. To better illustrate the theory's accuracy, Figure 2 shows P-values derived from formula (1), or
by integrating the curve in Figure 1. The solid curve represents the data,
and the dotted curve the theory. It is difficult to see any space between
them.## Generalization to PSI-BLAST alignment scoresIn order for PSI-BLAST to iterate automatically, it needs to be able to generate accurate estimates of the statistical significance of the alignments it produces. Unfortunately, there is no analytic theory with which to estimate the statistical significance of a gapped local alignment of a profile and a simple sequence. We may conjecture, however, that the empirical statistics for optimal local alignments of two sequences may also apply to alignments involving a profile and a sequence.Even were this the case, we are faced with the problem of estimating the statistical parameters lambda and K for profiles we have seen for the first
time. As before, this may be done by random simulation, but sufficiently
precise estimates require on the order of half an hour on a typical current
workstation. This is at least two orders of magnitude too long for use with
PSI-BLAST.One hope is that if we construct the amino acid scores within each column of a PSI-BLAST profile to the same "scale" [18,19] (i.e. with the same ungapped lambda) as those for a standard amino acid substitution matrix,
and then use the same position-independent gap costs, the same gapped lambda
may result.To review, for ungapped local alignments, any substitution matrix takes the form where the q are the target frequencies for aligned pairs of amino acids,
the _{ij}p are background frequencies, and the subscript for _{i}lambda indicates
it is the statistical parameter for ungapped local alignments [18,19]. For a
PSI-BLAST profile [7], each column has its own unique set of amino acid target
frequencies q. Following (2), the amino acid scores for this column may
be constructed to the same "scale" by using the formula_{i}The hope is that, given a specific set of gap costs, the gapped lambda for
the PSI-BLAST profile will be the same as the gapped lambda for the standard
substitution matrix, which may be calculated in advance.To test this hypothesis, we searched SWISS-PROT [20], release 34 (59,576 sequences; 21,219,450 residues) with the length-567 influenza A virus hemagglutinin precursor [21] (SWISS-PROT accession number P03435) as query, and captured the profile constructed by PSI-BLAST from the 128 local alignments with E-value <= 0.01. This profile was constructed to have almost the same
ungapped lambda (i.e. 0.3176) as the BLOSUM-62 matrix. We then compared this
profile to 10,000 random sequences of length 567, generated using standard
background amino acid frequencies [12]. As before, a gap of length k was
assigned a score of -(10 + k). Counts of the optimal local alignment scores,
calculated using an appropriately modified version of the Smith-Waterman
algorithm, are plotted in Figure 3. Also shown is the corresponding maximum-
likelihood extreme value distribution, which has statistical parameters
lambda = 0.254 and K = 0.040. It is apparent that the distribution fits the
random trial reasonably well; a chi-squared goodness-of-fit test with 34
degrees of freedom has value 39.7, which is lower than one would expect 27%
of the time even were the theory precisely valid.This supports the idea that our statistical theory applies to PSI-BLAST alignments. Furthermore, the estimate for gapped lambda of 0.254 +- 0.002
agrees to within experimental error with the value 0.252 +- 0.002 previously
estimated for BLOSUM-62, using the same gap costs (see Table 1). Similar
agreement was obtained with a number of other protein sequences as initial
query (results not shown), and in all cases the much less important gapped K
parameter could be estimated accurately as well (Table 1). In general, values
of lambda for local profile alignments appear to differ by less than 2% from
the values for simple local sequence comparison. Using these precomputed
values for gapped lambda should thus entail an error of less one bit for
PSI-BLAST scores under 50 bits, corresponding to a factor of less than two
in the estimation of statistical significance.The accuracy of PSI-BLAST statistics was further tested by the comparison of PSI-BLAST profiles with a shuffled version of SWISS-PROT. The results from eleven queries are shown in Table 2. For each, the query was compared to the SWISS-PROT database, and a PSI-BLAST profile was constructed from the resulting significant alignments. This profile was then compared to a version of SWISS-PROT in which each sequence was shuffled. The E-values predicted by the estimated PSI-BLAST parameters are pretty
close to the mark.The accuracy of PSI-BLAST statistics might lead us to speculate that the lambdas for gapped alignments are a function only of the scale of the substitution matrix and the gap costs. Unfortunately, this is not the case, as an examination of Table 3 demonstrates. It appears that the effect of specific gap costs on lambda is a function at least of the relative entropy
of the substitution matrix, as well as its scale. The reason that PSI-BLAST
profile lambdas could be predicted so well by BLOSUM-62 is that these profiles
are constructed based upon BLOSUM-62 target frequencies, and therefore have
relative entropies roughly equivalent to that of BLOSUM-62. A more precise
theory of gapped alignment statistics, however, may be in the offing [22].## References[1] Smith, T.F., Waterman, M.S. & Burks, C. (1985) "The statistical distribution of nucleic acid similarities." Nucleic Acids Res. 13:645-656. (PubMed)[2] Collins, J.F., Coulson, A.F.W. & Lyall, A. (1988) "The significance of protein sequence similarities." Comput. Appl. Biosci. 4:67-71. (PubMed) [3] Mott, R. (1992) "Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores." Bull. Math. Biol. 54:59-75. [4] Waterman, M.S. & Vingron, M. (1994) "Rapid and accurate estimates of statistical significance for sequence database searches." Proc. Natl. Acad. Sci. USA 91:4625-4628. (PubMed) [5] Waterman, M.S. & Vingron, M. (1994) "Sequence comparison significance and Poisson approximation." Stat. Sci. 9:367-381. [6] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth. Enzymol. 266:460-480. (PubMed) [7] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402. (PubMed) [8] Pearson, W.R. (1998) "Empirical statistical estimates for sequence similarity searches." J. Mol. Biol. 276:71-84. (PubMed) [9] Smith, T.F. & Waterman, M.S. (1981) "Identification of common molecular subsequences." J. Mol. Biol. 147:195-197. (PubMed) [10] Pearson, W.R. & Lipman, D.J. (1988) "Improved tools for biological sequence comparison." Proc. Natl. Acad. Sci. USA 85:2444-2448. (PubMed) [11] Gumbel, E.J. (1958) "Statistics of extremes." Columbia University Press, New York, NY. [12] Robinson, A.B. & Robinson, L.R. (1991) "Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins." Proc. Natl. Acad. Sci. USA 88:8880-8884. (PubMed) [13] Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919. (PubMed) [14] Fitch, W.M. & Smith, T.F. (1983) "Optimal sequence alignments." Proc. Natl. Acad. Sci. USA 80:1382-1386. [15] Altschul, S.F. & Erickson, B.W. (1986) "Optimal sequence alignment using affine gap costs." Bull. Math. Biol. 48:603-616. (PubMed) [16] Myers, E.W. & Miller, W. (1988) "Optimal alignments in linear space." Comput. Appl. Biosci. 4:11-17. (PubMed) [17] Lawless, J.F. (1982) "Statistical models and methods for lifetime data." Chapter 4. John Wiley & Sons, New York, NY. [18] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. (PubMed) [19] Altschul, S.F. (1991) "Amino acid substitution matrices from an information theoretic perspective." J. Mol. Biol. 219:555-565. (PubMed) [20] Bairoch, A. & Apweiler, R. (1998) "The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998." Nucleic Acids Res. 26:38-42. (PubMed) 21] Jou, W.M., Verhoeyen, M., Devos, R., Saman, E., Fang, R., Huylebroeck, D., Fiers, W., Threlfall, G., Barber, C., Carey, N. & Emtage, S. (1980) "Complete structure of the hemagglutinin gene from the human influenza A/Victoria/3/75 (H3N2) strain as determined from cloned DNA." Cell 19:683-696. (PubMed) [22] Mott, R. & Tribe, R. "Approximate statistics for gapped alignments." J. Comp. Biol., in press. |