The Statistics of PSI-BLAST Scores
The statistics of
sequence similarity scores
The statistics of
PSI-BLAST scores
Iterated profile searches


Estimation of statistical parameters for local alignment scores

Generalization to PSI-BLAST alignment scores



Estimation of statistical parameters for local alignment scores

   As discussed previously, computation experiments [1-8] strongly suggest that the optimal gapped local alignment scores produced by the Smith-Waterman algorithm [9] and approximated by FASTA [10] or Gapped BLAST [6,7] follow an extreme value distribution [11]. Specifically, the probability that the optimal score S from the comparison of unrelated proteins is at least x is given by the equation

where K and lambda are statistical parameters dependent upon the scoring system and the background amino acid frequences of the sequences being compared. While FASTA estimates these parameters from the scores generated by actual database searches, BLAST estimates them beforehand for specific scoring schemes by comparing many random sequences generated using a standard protein amino acid composition [12].
   For example, using BLOSUM-62 amino acid substitution scores [13], and affine gap costs [14-16] in which a gap of length k is assigned a score of -(10 + k), we generated 10,000 pairs of length-1000 random protein sequences, and used the Smith-Waterman algorithm to calculate 10,000 optimal local alignment scores. From these scores, lambda was estimated at 0.252 and K at 0.035 by the method of maximum-likelihood [17]. In general, given M samples from an extreme value distribution, the ratio of the maximum-likelihood estimate of lambda to its actual value is approximately normally distributed, with mean 1.0 and standard deviation 0.78/sqrt(M) [17]. Thus the standard error for our estimate of lambda is about 0.002, or less than 1%.
   A plot of the optimal local alignment scores generated in this example is shown in Figure 1, along with the maximum-likelihood extreme value curve. The chi-squared goodness-of-fit test for these data, with 34 degrees of freedom, is 25.6, which is lower than would be expected to occur by chance 87% of the time even were the theory precisely valid. To better illustrate the theory's accuracy, Figure 2 shows P-values derived from formula (1), or by integrating the curve in Figure 1. The solid curve represents the data, and the dotted curve the theory. It is difficult to see any space between them.

Generalization to PSI-BLAST alignment scores

   In order for PSI-BLAST to iterate automatically, it needs to be able to generate accurate estimates of the statistical significance of the alignments it produces. Unfortunately, there is no analytic theory with which to estimate the statistical significance of a gapped local alignment of a profile and a simple sequence. We may conjecture, however, that the empirical statistics for optimal local alignments of two sequences may also apply to alignments involving a profile and a sequence.
   Even were this the case, we are faced with the problem of estimating the statistical parameters lambda and K for profiles we have seen for the first time. As before, this may be done by random simulation, but sufficiently precise estimates require on the order of half an hour on a typical current workstation. This is at least two orders of magnitude too long for use with PSI-BLAST.
   One hope is that if we construct the amino acid scores within each column of a PSI-BLAST profile to the same "scale" [18,19] (i.e. with the same ungapped lambda) as those for a standard amino acid substitution matrix, and then use the same position-independent gap costs, the same gapped lambda may result.
   To review, for ungapped local alignments, any substitution matrix takes the form

where the qij are the target frequencies for aligned pairs of amino acids, the pi are background frequencies, and the subscript for lambda indicates it is the statistical parameter for ungapped local alignments [18,19]. For a PSI-BLAST profile [7], each column has its own unique set of amino acid target frequencies qi. Following (2), the amino acid scores for this column may be constructed to the same "scale" by using the formula

The hope is that, given a specific set of gap costs, the gapped lambda for the PSI-BLAST profile will be the same as the gapped lambda for the standard substitution matrix, which may be calculated in advance.
   To test this hypothesis, we searched SWISS-PROT [20], release 34 (59,576 sequences; 21,219,450 residues) with the length-567 influenza A virus hemagglutinin precursor [21] (SWISS-PROT accession number P03435) as query, and captured the profile constructed by PSI-BLAST from the 128 local alignments with E-value <= 0.01. This profile was constructed to have almost the same ungapped lambda (i.e. 0.3176) as the BLOSUM-62 matrix. We then compared this profile to 10,000 random sequences of length 567, generated using standard background amino acid frequencies [12]. As before, a gap of length k was assigned a score of -(10 + k). Counts of the optimal local alignment scores, calculated using an appropriately modified version of the Smith-Waterman algorithm, are plotted in Figure 3. Also shown is the corresponding maximum- likelihood extreme value distribution, which has statistical parameters lambda = 0.254 and K = 0.040. It is apparent that the distribution fits the random trial reasonably well; a chi-squared goodness-of-fit test with 34 degrees of freedom has value 39.7, which is lower than one would expect 27% of the time even were the theory precisely valid.
   This supports the idea that our statistical theory applies to PSI-BLAST alignments. Furthermore, the estimate for gapped lambda of 0.254 +- 0.002 agrees to within experimental error with the value 0.252 +- 0.002 previously estimated for BLOSUM-62, using the same gap costs (see Table 1). Similar agreement was obtained with a number of other protein sequences as initial query (results not shown), and in all cases the much less important gapped K parameter could be estimated accurately as well (Table 1). In general, values of lambda for local profile alignments appear to differ by less than 2% from the values for simple local sequence comparison. Using these precomputed values for gapped lambda should thus entail an error of less one bit for PSI-BLAST scores under 50 bits, corresponding to a factor of less than two in the estimation of statistical significance.
   The accuracy of PSI-BLAST statistics was further tested by the comparison of PSI-BLAST profiles with a shuffled version of SWISS-PROT. The results from eleven queries are shown in Table 2. For each, the query was compared to the SWISS-PROT database, and a PSI-BLAST profile was constructed from the resulting significant alignments. This profile was then compared to a version of SWISS-PROT in which each sequence was shuffled. The E-values predicted by the estimated PSI-BLAST parameters are pretty close to the mark.
   The accuracy of PSI-BLAST statistics might lead us to speculate that the lambdas for gapped alignments are a function only of the scale of the substitution matrix and the gap costs. Unfortunately, this is not the case, as an examination of Table 3 demonstrates. It appears that the effect of specific gap costs on lambda is a function at least of the relative entropy of the substitution matrix, as well as its scale. The reason that PSI-BLAST profile lambdas could be predicted so well by BLOSUM-62 is that these profiles are constructed based upon BLOSUM-62 target frequencies, and therefore have relative entropies roughly equivalent to that of BLOSUM-62. A more precise theory of gapped alignment statistics, however, may be in the offing [22].


[1] Smith, T.F., Waterman, M.S. & Burks, C. (1985) "The statistical distribution of nucleic acid similarities." Nucleic Acids Res. 13:645-656. (PubMed)

[2] Collins, J.F., Coulson, A.F.W. & Lyall, A. (1988) "The significance of protein sequence similarities." Comput. Appl. Biosci. 4:67-71. (PubMed)

[3] Mott, R. (1992) "Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores." Bull. Math. Biol. 54:59-75.

[4] Waterman, M.S. & Vingron, M. (1994) "Rapid and accurate estimates of statistical significance for sequence database searches." Proc. Natl. Acad. Sci. USA 91:4625-4628. (PubMed)

[5] Waterman, M.S. & Vingron, M. (1994) "Sequence comparison significance and Poisson approximation." Stat. Sci. 9:367-381.

[6] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth. Enzymol. 266:460-480. (PubMed)

[7] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402. (PubMed)

[8] Pearson, W.R. (1998) "Empirical statistical estimates for sequence similarity searches." J. Mol. Biol. 276:71-84. (PubMed)

[9] Smith, T.F. & Waterman, M.S. (1981) "Identification of common molecular subsequences." J. Mol. Biol. 147:195-197. (PubMed)

[10] Pearson, W.R. & Lipman, D.J. (1988) "Improved tools for biological sequence comparison." Proc. Natl. Acad. Sci. USA 85:2444-2448. (PubMed)

[11] Gumbel, E.J. (1958) "Statistics of extremes." Columbia University Press, New York, NY.

[12] Robinson, A.B. & Robinson, L.R. (1991) "Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins." Proc. Natl. Acad. Sci. USA 88:8880-8884. (PubMed)

[13] Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919. (PubMed)

[14] Fitch, W.M. & Smith, T.F. (1983) "Optimal sequence alignments." Proc. Natl. Acad. Sci. USA 80:1382-1386.

[15] Altschul, S.F. & Erickson, B.W. (1986) "Optimal sequence alignment using affine gap costs." Bull. Math. Biol. 48:603-616. (PubMed)

[16] Myers, E.W. & Miller, W. (1988) "Optimal alignments in linear space." Comput. Appl. Biosci. 4:11-17. (PubMed)

[17] Lawless, J.F. (1982) "Statistical models and methods for lifetime data." Chapter 4. John Wiley & Sons, New York, NY.

[18] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. (PubMed)

[19] Altschul, S.F. (1991) "Amino acid substitution matrices from an information theoretic perspective." J. Mol. Biol. 219:555-565. (PubMed)

[20] Bairoch, A. & Apweiler, R. (1998) "The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998." Nucleic Acids Res. 26:38-42. (PubMed)

21] Jou, W.M., Verhoeyen, M., Devos, R., Saman, E., Fang, R., Huylebroeck, D., Fiers, W., Threlfall, G., Barber, C., Carey, N. & Emtage, S. (1980) "Complete structure of the hemagglutinin gene from the human influenza A/Victoria/3/75 (H3N2) strain as determined from cloned DNA." Cell 19:683-696. (PubMed)

[22] Mott, R. & Tribe, R. "Approximate statistics for gapped alignments." J. Comp. Biol., in press.