# Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches

^{*}To whom correspondence should be addressed. Tel: +301 435 7803; Fax: +301 480 2288; Email: altschul@ncbi.nlm.nih.gov


## Abstract

Protein sequence database search programs may be evaluated both for their retrieval accuracy—the ability to separate meaningful from chance similarities—and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.

## INTRODUCTION

Computer programs that search protein or DNA databases often are evaluated by their ability to distinguish true biological relationships from chance similarities. Such an evaluation requires a test query set and database for which the true relationships are known. Given a query sequence, a database search program returns alignments in some specific order, often ranked using an objective measure such as an alignment score, *E*-value or *P*-value. One way to evaluate search accuracy is with ROC (receiver operating characteristic) analysis (1). To produce a ROC curve, the number of true positives is plotted against the number of false positives returned as one descends the retrieval list. The ROC_{n} score, the normalized area under this curve up to the first *n* false positives, has become a popular measure of search accuracy. ROC_{n} scores may be calculated for individual queries or, if objective scores render distinct database searches comparable, the search results from many different queries may be pooled to produce a single combined ROC curve and score (2).

Although ROC_{n} scores capture one aspect of search method quality, they ignore an important issue. In practice, most database searches return results for which the true and false positives are not known. Even a perfect ordering, with all the true results preceding all the false ones, may be of limited use if a user has no effective method for drawing a line between the two classes. Most protein database search programs now provide, in addition to an ordered list, *E*- or *P*-values with which to assess whether any given result can be explained by chance. Many applications rely critically upon these values for further analyses. For example, PSI-BLAST (3) and related iterative protein profile search programs use an *E*-value threshold to automatically include alignments in further rounds of analysis; a single false positive below this threshold can corrupt all subsequent results (4). Although unreliable *E*-values may be countered by setting a very stringent threshold, this can squander efforts that have gone into improving retrieval accuracy. Accordingly, the accuracy with which a program calculates *E*-values (5) may be as important to its utility as the degree to which it separates true from chance similarities.

A central question in calculating alignment *E*-values is defining the random distribution to which they refer. The simplest approach is to calculate *E*-values with reference to a random protein model, based on standard amino acid frequencies; this is the baseline behavior of the BLAST programs (3,6). The most immediate problem arises from ‘low-complexity’ segments—protein regions with extremely restricted amino acid usage—which depart drastically from the random model. It is usually not of interest to align such segments, and they may be filtered out of consideration using a program such as SEG (7). However, even after low-complexity segments have been removed, many proteins have distinctly biased amino acid compositions. Such biases are typical of some protein families, but particular organisms also have AT- or CG-biased genomes, leading by means of the genetic code to characteristically biased proteomes (8,9). Accordingly, many authors have proposed calculating the significance of an alignment of two proteins by considering their amino acid compositions within some model for generating random sequences (2,10,11). If one does not account for compositional bias, the reported *E*-values or *P*-values may be orders of magnitude too low. As we show below, an implementation of this basic idea within BLAST greatly improves the accuracy of its statistical evaluations.

One might expect that adopting a more accurate calculation of *E*-values would yield improved retrieval accuracy as well. However, retrieval accuracy actually decreases, even when the search results from many queries are pooled (2). A possible explanation is that similar amino acid compositional biases for two proteins constitutes, in itself, some evidence of protein relationship. The challenge, then, is to produce a mathematically justified way of taking compositional similarity into account when assessing sequence relationships.

The approach we take here is to consider two distinct measures of sequence similarity: the traditional measure of alignment similarity and a new measure of compositional bias similarity. We investigate empirically the distribution of compositional similarity among unrelated proteins. This allows us to define an associated compositional *P*-value distinct from an alignment *P*-value. We find that, on average, related proteins have a greater compositional similarity than unrelated proteins. Furthermore, we find that for unrelated sequence pairs, alignment and compositional *P*-values are effectively independent. Therefore, we can combine these *P*-values into a single, unified *P*-value (12–14), which can serve as a new measure for assessing sequence similarity. We show that this measure recaptures the previously forfeited retrieval accuracy, while at the same time yielding accurate statistics. By also employing the previously described compositional score adjustment (15–17), we generate a program that substantially outperforms the baseline BLAST program on an ASTRAL test set (18,19) both in retrieval accuracy and in the accuracy of its reported *E*-values.

## THEORY

### Variants of BLAST

We will compare five variants of the gapped protein-query, protein-database BLAST program (3), which we distinguish in this paper with different prefixes: B-BLAST, S-BLAST, SU-BLAST, C-BLAST and CU-BLAST (Table 1). The baseline program B-BLAST is modified in a few minor ways from the default protein–protein BLAST program available on the web site of the National Center for Biotechnology Information (NCBI) (20). As noted below, some of these changes are for testing purposes only, in order to minimize the number of confounding factors when comparing B-BLAST to the other BLAST variants, while others may be retained in future web versions of BLAST.

The program S-BLAST scales the scores of a standard matrix, for each alignment reported, based upon the compositions of the two sequences compared. This approach, which improves statistical evaluations, was introduced by Schäffer *et al.* (2). The program SU-BLAST, which is new to this paper, combines the alignment similarity of S-BLAST with compositional similarity, described below, to produce a unified measure of sequence similarity.

The program C-BLAST conditionally adjusts the scores of a standard matrix, for each alignment reported, based upon the compositions of the two sequences compared. The approach is described in Altschul *et al.* (15), and is based on methods from Yu *et al.* (17) and Yu and Altschul (16). The program CU-BLAST, which is new to this paper, combines the alignment similarity of C-BLAST with compositional similarity, described below, to produce a unified measure of sequence similarity.

All five variants use the program SEG (7) to filter database sequences for low-complexity regions. SEG replaces certain amino acids with the character ‘X’, which is also used to signify an unknown amino acid. Past default versions of BLAST have assigned a fixed negative score to the aligned pair (α, X), where α is a standard amino acid. Here, all five variants assign to (α, X) a weighted average of the scores for α aligned to the twenty standard amino acids. The new way to score aligned letter pairs involving X may be retained in future default versions of BLAST. Here, the calculated composition of any sequence that is filtered using SEG ignores those amino acids that are replaced with ‘X’.
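The weighted-average scheme for scoring (α, X) pairs can be illustrated with a small sketch. The three-letter alphabet, the toy matrix and the choice of background frequencies as the weights are assumptions for illustration only, not the BLOSUM-62 values or weights the BLAST programs actually use.

```python
# Sketch: score for aligning residue alpha with 'X' as a weighted average
# of alpha's scores against the standard residues. The alphabet, matrix
# and background weights below are toy values, not BLAST's actual data.

def x_score(alpha, matrix, background):
    """Weighted-average score for the aligned pair (alpha, X)."""
    return sum(background[beta] * matrix[(alpha, beta)] for beta in background)

toy_matrix = {
    ('A', 'A'): 4, ('A', 'C'): 0, ('A', 'G'): 1,
    ('C', 'A'): 0, ('C', 'C'): 9, ('C', 'G'): -2,
    ('G', 'A'): 1, ('G', 'C'): -2, ('G', 'G'): 6,
}
toy_background = {'A': 0.5, 'C': 0.2, 'G': 0.3}

print(x_score('A', toy_matrix, toy_background))  # 0.5*4 + 0.2*0 + 0.3*1 = 2.3
```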

All five variants use the optional Smith-Waterman algorithm (21) to generate all local alignments actually reported. Also, in this final alignment step, all five variants use five more bits of precision for their substitution scores than are used by the standard BLOSUM-62 matrix (22). In both respects, B-BLAST differs from past default versions of BLAST.

Except for its extra precision in the final step, B-BLAST uses the standard BLOSUM-62 matrix in conjunction with scores of −11−*k* for gaps of length *k*. For its *E*-value calculations, it employs gapped statistical parameters that are estimated (23) for a set of standard amino acid frequencies (24). S-BLAST uses ‘composition-based statistics’ (2) to scale the BLOSUM-62 substitution scores for any pair of sequences, while leaving the gap scores fixed. The program SU-BLAST, introduced here, combines a measure of compositional bias similarity with S-BLAST's measure of alignment similarity to produce a unified measure of sequence similarity. C-BLAST and CU-BLAST replace the scaled BLOSUM-62 scores of S-BLAST and SU-BLAST with the conditionally compositionally adjusted scores described by Altschul *et al.* (15).

There is one additional change we made for testing purposes only. The implementation of composition-based statistics, available for many years on the web-site of the NCBI, places upper and lower bounds of 1.0 and 0.5 on the factor by which a substitution matrix can be scaled. The upper bound is imposed to improve slightly the program's retrieval accuracy and speed (2). The lower bound, rarely invoked, improves the utility of the program's output for certain applications. However, these artificial bounds confound the issues we wish to study here, and so we have removed the upper bound entirely, and reduced the lower bound to 0.05.

Finally, since the publication of the paper introducing ‘conditional compositional adjustment’ to derive a new substitution matrix (15), we have found that in addition to the three criteria therein described, a fourth is appropriate for invoking compositional adjustment. Specifically, compositional adjustment should be used for any comparison involving a protein of length at least 50 amino acids whose two most abundant residues constitute at least 40% of the protein. The implementations of C-BLAST and CU-BLAST studied here employ this additional criterion, but it has no measurable effect on the paper's results.

### BLAST heuristics

All variants of BLAST we study involve scaling or adjusting the substitution matrix for each alignment reported. BLAST is a heuristic program, not guaranteed always to find the optimal alignments, and scaling or adjusting the substitution matrix separately for each database sequence would unduly increase execution time. Accordingly, we re-evaluate only those database sequences which pass an initial screen. Specifically, the standard gap scores, BLOSUM-62 matrix, and statistical parameters for standard amino acid frequencies (24) are used to calculate preliminary *E*-values. Any database sequence producing an alignment with a preliminary *E*-value less than or equal to a set threshold, here taken to be 100, is retained for further evaluation. The substitution and gap scores are rescaled or adjusted, and an optimal local alignment is generated using the rigorous Smith-Waterman algorithm.
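The final re-evaluation step relies on the Smith-Waterman dynamic program. A minimal sketch follows, using a simple match/mismatch scoring and a linear gap penalty rather than the affine −11 − *k* gap scores and full substitution matrices of the actual programs.

```python
# Minimal Smith-Waterman local alignment score (linear gap penalty).
# Illustrative sketch only: the BLAST variants described here use affine
# gap scores (-11 - k) and a 20x20 substitution matrix instead.

def smith_waterman_score(s1, s2, match=2, mismatch=-1, gap=-2):
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # substitution
                          H[i - 1][j] + gap,      # gap in s2
                          H[i][j - 1] + gap)      # gap in s1
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("ACG", "AG"))  # 2
```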

SU-BLAST and CU-BLAST calculate compositional *P*-values to combine with alignment *P*-values in order to produce unified *P*- and *E*-values for reporting. Because alignments are produced only for database sequences that pass the initial screen, SU-BLAST and CU-BLAST calculate compositional *P*-values for only these sequences.

### Statistics

For the comparison of random sequences, an analytic, asymptotic statistical theory has been developed for the distribution of scores of ungapped local alignments (25,26). In brief, for the comparison of two random sequences of lengths *m* and *n*, the number of distinct local alignments with score at least *S* is approximately Poisson distributed, with expected value

$$E = Kmn\,e^{-\lambda S} \qquad (1)$$

where *K* and λ are calculable parameters dependent upon the scoring system and amino acid distribution. The Poisson distribution implies that the *maximum* score follows an extreme value distribution (27), with the probability of achieving a score at least *S* given by

$$P = 1 - e^{-E} = 1 - \exp(-Kmn\,e^{-\lambda S}) \qquad (2)$$

This formula may be inverted, yielding

$$E = -\ln(1 - P) \qquad (3)$$

For ungapped alignments, λ is defined only for scoring systems with negative expected score. It is the unique positive solution to the equation

$$\sum_{i,j} p_i\,p'_j\,e^{\lambda s_{ij}} = 1 \qquad (4)$$

where *s_{ij}* is the score for aligning amino acids *i* and *j*, and *p_{i}* and *p′_{j}* are the background probabilities, respectively, for amino acid *i* in the first sequence and amino acid *j* in the second (25,26). Empirically, for typical scoring regimes, optimal gapped local alignments follow the same type of distribution as do ungapped local alignments (28), although the distribution's parameters λ and *K* cannot be calculated analytically but may be estimated by random simulation (23).
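Equation 4 has no closed form in general, but λ can be found numerically. The sketch below uses bisection on a toy two-letter alphabet (uniform frequencies, match +1, mismatch −2), chosen because there the root has the closed form ln[(1 + √5)/2], against which the solver can be checked; real protein compositions and a 20×20 matrix would be substituted in practice.

```python
import math

# Solve Equation 4 numerically: find the unique positive lambda with
# sum_ij p_i p'_j exp(lambda * s_ij) = 1, assuming negative expected score.
# Toy inputs (not protein data): 2-letter alphabet, match +1, mismatch -2.

def solve_lambda(p, q, scores, tol=1e-12):
    def f(lam):
        return sum(p[i] * q[j] * math.exp(lam * scores[i][j])
                   for i in range(len(p)) for j in range(len(q))) - 1.0
    hi = 1.0
    while f(hi) < 0:          # expand until f(hi) > 0; f is convex
        hi *= 2.0
    lo = 0.0                  # f(0) = 0 and f decreases just above 0
    while hi - lo > tol:      # bisect toward the positive root
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = q = [0.5, 0.5]
scores = [[1, -2], [-2, 1]]
lam = solve_lambda(p, q, scores)
print(lam)  # ln((1 + sqrt(5))/2) ≈ 0.4812
```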

The *E*-value or *P*-value of an alignment score depends upon the lengths of the sequences compared. *E*- or *P*-values may be reported in the context of a pairwise comparison or in the context of a database search. For the database search context, the length *n* of a single sequence is replaced in formulas 1 and 2 by the aggregate length *N*, in residues, of all database sequences. By default, the BLAST programs report database *E*-values, but because we will discuss pairwise *E*- and *P*-values below, we will always use a hat symbol to indicate pairwise as opposed to database *E*- or *P*-values. The relationship between the pairwise and database *E*-values for an alignment involving a database sequence of length *n* (29) is given by the formula

$$E = \frac{N}{n}\,\widehat{E} \qquad (5)$$

### Unified *P*- and *E*-values

As described below, we will combine an alignment pairwise *P*-value ${\widehat{P}}_{a}$ with a compositional *P*-value *P_{c}* to calculate a unified pairwise *P*-value ${\widehat{P}}_{u}$. Because BLAST reports database *E*-values, it must perform the following calculations to convert an alignment database *E*-value *E_{a}* into a unified database *E*-value *E_{u}*:

1. Convert the database *E*-value *E_{a}* into a pairwise *E*-value ${\widehat{E}}_{a}$, using Equation 5.
2. Convert ${\widehat{E}}_{a}$ into a pairwise *P*-value ${\widehat{P}}_{a}$, using Equation 2.
3. Combine ${\widehat{P}}_{a}$ with the compositional *P*-value *P_{c}* to obtain the unified pairwise *P*-value ${\widehat{P}}_{u}$, using Equation 6.
4. Convert ${\widehat{P}}_{u}$ into a pairwise *E*-value ${\widehat{E}}_{u}$, using Equation 3.
5. Convert ${\widehat{E}}_{u}$ into the database *E*-value *E_{u}*, using Equation 5.

*E_{u}* is reported, and can be converted into a database *P*-value *P_{u}* using Equation 2 if desired.
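This five-step conversion can be sketched directly from Equations 2, 3, 5 and 6. The sequence length *n*, database length *N* and *P_{c}* value below are illustrative placeholders, not values from the paper's experiments.

```python
import math

# Sketch of the five-step conversion from an alignment database E-value
# E_a to a unified database E-value E_u, following Equations 2, 3, 5, 6.
# Assumes 0 < p_hat_a * p_c < 1; n, N and p_c are illustrative.

def unified_database_evalue(e_a, p_c, n, N):
    e_hat_a = e_a * n / N                    # step 1: Equation 5 (database -> pairwise)
    p_hat_a = 1.0 - math.exp(-e_hat_a)       # step 2: Equation 2
    t = p_hat_a * p_c
    p_hat_u = t * (1.0 - math.log(t))        # step 3: Equation 6
    e_hat_u = -math.log(1.0 - p_hat_u)       # step 4: Equation 3
    return e_hat_u * N / n                   # step 5: Equation 5 (pairwise -> database)

e_u = unified_database_evalue(e_a=0.01, p_c=0.05, n=300, N=3_000_000)
print(e_u)
```

Note that with *P_{c}* = 1 (no compositional evidence) the unified *E*-value is more conservative than the alignment *E*-value, consistent with the behaviour described for SU-BLAST below.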
### Program evaluation

We evaluate versions of BLAST both for the reliability of their statistics and for their retrieval accuracy. To study the reliability of the statistics produced, we compare 10 000 shuffled mouse sequences of length at least 150 with shuffled human RefSeq (20) sequences from Build 35 of the human genome. For each query, we record the lowest database *P*-value returned. We then plot the number of queries for which this *P*-value is ≤*x*.

To study retrieval accuracy, we use the ‘astral40’ data set (18,19), based upon the SCOP structural classification of proteins (30,31), for ROC analysis. Specifically, the 3586 astral40 sequences related to at least one other astral40 sequence are compared to the complete data set, and the results of all searches are pooled by *E*-value. For increasing *E*-value, the number of true positives is plotted against the number of false positives, and a ROC_{5000} score is calculated.
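The pooled ROC_{n} computation described above can be sketched as follows, assuming the hits have already been sorted by increasing *E*-value and labelled as true (1) or false (0) positives.

```python
# Sketch of a ROC_n calculation: the score is the area under the
# TP-vs-FP curve up to the first n false positives, normalized by
# n * (total number of true positives). Labels are assumed pre-sorted
# by increasing E-value.

def roc_n(labels, n, total_positives):
    tp = fp = area = 0
    for label in labels:
        if label:
            tp += 1
        else:
            fp += 1
            area += tp           # TP count accumulated at each FP
            if fp == n:
                break
    return area / (n * total_positives)

# Perfect ranking: all true positives precede all false positives.
print(roc_n([1, 1, 0, 0, 0], n=2, total_positives=2))  # 1.0
```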

## RESULTS

Here, we first study the statistical and retrieval accuracy of the previously described programs B-BLAST and S-BLAST. We then describe a measure of similarly biased amino acid compositions in two proteins and a method for combining this compositional similarity with alignment similarity to create a unified assessment of sequence similarity. We implement this unified measure in SU-BLAST, whose statistical and retrieval accuracies we study. Finally, we replace the compositional scaling employed by S-BLAST and SU-BLAST with conditional compositional matrix adjustment to create C-BLAST and CU-BLAST, and study the performance of these programs. The central idea behind such compositional matrix adjustment is to find target frequencies for aligned amino acid pairs that are consistent with the background frequencies of the sequences being compared, but as close as possible to the target frequencies implicit in a standard substitution matrix. For the comparison of sequences with different background frequencies, the resulting substitution matrix is asymmetric. It is fruitful to employ compositional adjustment only under certain conditions (15), with compositional scaling (2) used when these conditions fail.

### B-BLAST and S-BLAST

The proper statistical parameters λ and *K* for the distributions of Equations 1 and 2 depend upon the amino acid compositions of the sequences being compared. Thus the significance of alignments with the same nominal score can vary, depending upon the context in which the alignments arise. For example, using BLOSUM-62 scores (22), a high-scoring alignment is much more likely to arise by chance from the comparison of two cysteine-rich proteins than from the comparison of two proteins with more typical amino acid compositions. Numerically, this is mediated in Equation 1 primarily by the cysteine-rich compositions implying a smaller value of λ, which discounts the nominal score.

The baseline B-BLAST program evaluates all alignments using gapped statistical parameters ${\lambda}_{g}^{\ast}$ and ${K}_{g}^{\ast}$ estimated (23) for a standard amino acid composition (24). Thus, for alignments involving proteins whose compositions imply a gapped λ_{g} substantially smaller than ${\lambda}_{g}^{\ast}$, the *E*-values reported may be much smaller than justified.

BLAST estimates the number of alignments that are expected to achieve a given score by chance, i.e. from the comparison of unrelated proteins. Our test of BLAST statistics retains the compositions of real proteins, while scrambling the order of their amino acids. Figure 1 shows that B-BLAST's statistics are far from accurate. For example, in 10 000 randomized B-BLAST database searches, 639 (6.4%) yield best matches with *P*-value ≤ 10^{−4}, when only one would be expected. Furthermore, some queries can yield best matches with extremely inaccurate statistical assessments: 143 queries (1.4%) returned best matches with *P*-value ≤ 10^{−10}. Note also that when the best random match has an extremely low *P*-value or *E*-value, many other matches frequently do as well. For example, a single query whose best match had a B-BLAST *P*-value of 10^{−12} yielded 101 matches with *E*-value ≤ 10^{−4}.

This problem with BLAST statistics is understood to be due primarily to similarly biased compositions among many protein sequences and is largely mitigated by the ‘composition-based statistics’ (2) employed by S-BLAST. To estimate rapidly the statistical parameters for gapped alignments, S-BLAST multiplies the standard BLOSUM-62 matrix by a distinct constant for each pair of sequences, so that the scaled matrix has the same ungapped scale parameter λ in the new compositional context that the unscaled matrix has in the standard context. The gap costs remain fixed. When the new scoring system is employed, the optimal local alignment may change. Therefore alignments must be recalculated, as described in the Theory section above, after the substitution matrix has been scaled.

Figure 1 shows that the statistics of S-BLAST are far more accurate than those of B-BLAST, and even slightly conservative. From 10 000 database searches, only six best matches are returned with *P*-value ≤ 10^{−3}, where ten are expected. In some applications it is crucial to exclude false positives reliably, for example when constructing a PSI-BLAST position-specific score matrix for further database searches (3). In these instances, S-BLAST is strongly preferred to B-BLAST (2). However, if one ignores the accuracy of reported statistics and pays attention only to the relative abilities of the two programs to separate true from chance similarities, then B-BLAST appears better; Figure 2 shows that its ROC curve lies significantly above S-BLAST's. This seemingly paradoxical result can be understood by recognizing that similarly biased compositions can in themselves constitute evidence for sequence relatedness. This evidence is effectively discarded when one calculates the significance of an alignment given the compositions of the sequences being compared. The problem we now address is whether this evidence can be recaptured in a mathematically justified manner, thereby restoring the retrieval accuracy of B-BLAST while retaining the statistical accuracy of S-BLAST.

### Compositional similarity

To study coordinated amino acid biases among related and unrelated proteins, we first require an appropriate measure. The difference between the substitution scores used by B-BLAST and S-BLAST, a difference that caused a sizable erosion in retrieval accuracy, was multiplication by a factor proportional to the ungapped scale parameter λ. We therefore propose to use λ itself, calculated for a fixed reference set of substitution scores but using the observed amino acid frequencies of the two proteins, as a measure of coordinated amino acid bias. An analysis of Equation 4, whose solution is λ (26), shows that λ will be low for any pair of proteins with an unusually large number of amino acids having high mutual substitution scores, as defined by the reference substitution matrix.

There is no model from which one may derive an accurate distribution of λ for unrelated proteins. Accordingly, we proceed by calculating λ, based on the BLOSUM-62 substitution scores (22), for all pairs of unrelated proteins from the astral40 data set (18,19). The resulting empirical probability density function for λ is shown in Figure 3. We use this distribution to assign to any particular value of λ an empirical compositional *P*-value, *P_{c}*(λ), equal to the area under the density curve to the left of λ. Because for small λ the data supporting our empirical distribution become sparse, we set a lower bound on *P_{c}*(λ) of 10^{−6}, which is returned whenever λ ≤ 0.068.
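An empirical *P*-value of this kind is just a floored cumulative fraction over the unrelated-pair λ sample. The sketch below illustrates the idea; the λ values are invented for illustration, not drawn from astral40.

```python
# Sketch: empirical compositional P-value from a sample of lambda values
# computed for unrelated sequence pairs. P_c(lambda) is the fraction of
# unrelated pairs with lambda at most as small, floored at 1e-6 where the
# empirical data become sparse. The sample values are illustrative only.

def empirical_p_c(lam, unrelated_lambdas, floor=1e-6):
    count = sum(1 for x in unrelated_lambdas if x <= lam)
    return max(count / len(unrelated_lambdas), floor)

sample = [0.20, 0.22, 0.25, 0.26, 0.27, 0.27, 0.28, 0.29, 0.30, 0.31]
print(empirical_p_c(0.24, sample))  # 2 of 10 values are <= 0.24 -> 0.2
```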

Figure 3 also shows the empirical probability density of λ for pairs of related sequences from the astral40 data set. Our strategy is to glean from the separation between the distributions for related and unrelated sequences evidence of sequence relatedness based on compositional considerations alone.

### Combining alignment and compositional significance

How may one combine *P_{c}* for a pair of sequences with a traditional alignment-score *P*-value, *P_{a}*, calculated using composition-corrected substitution scores, to derive a valid unified *P*-value, *P_{u}*, based upon both compositional and alignment evidence? One approach is to assume that *P_{a}* and *P_{c}* are independent random variables with a uniform distribution on the interval (0,1). Define a new random variable equal to the product of *P_{a}* and *P_{c}*, which has an easily calculated probability density over (0,1), from which a unified *P*-value may be derived (12–14). The resulting *P_{u}* is given by

$$P_u = P_a P_c \left[ 1 - \ln (P_a P_c) \right] \qquad (6)$$

For this formula to be valid, *P_{a}* and *P_{c}* should be accurate and independent. If we calculate *P_{a}* using composition-based statistics (2), as described above for S-BLAST, there should be little if any dependence between *P_{a}* and *P_{c}*. To test whether this is effectively the case, for each best match of a shuffled mouse sequence to a shuffled human RefSeq sequence, we calculated *P_{a}* and *P_{c}* and plotted their joint empirical probability density in Figure 4. There was close to no evident dependence between the values of *P_{a}* and *P_{c}*, although *P_{a}* was systematically high. For values of *P_{a}* < ~0.3, the probability density shown in Figure 4 is close to uniform, although appreciably < 1.0. This implies that for such *P_{a}*, Equation 6 should yield values for *P_{u}* that are consistently conservative by a constant multiplicative factor. This is approximately what is observed in Figure 1.
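That Equation 6 yields a valid *P*-value under the independence assumption can be checked by simulation: for independent uniform *P_{a}* and *P_{c}*, the unified value should itself be close to uniform on (0,1). A quick Monte Carlo sketch:

```python
import random
import math

# Monte Carlo check of Equation 6: if P_a and P_c are independent and
# uniform on (0,1), then P_u = t * (1 - ln t) with t = P_a * P_c should
# itself be uniform on (0,1), i.e. a valid P-value.

random.seed(0)
trials = 200_000
threshold = 0.05
below = 0
for _ in range(trials):
    t = random.random() * random.random()    # product of two uniforms
    p_u = t * (1.0 - math.log(t))            # Equation 6
    if p_u <= threshold:
        below += 1

print(below / trials)  # should be close to 0.05
```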

### SU-BLAST

We have implemented a program, here called SU-BLAST, that uses both compositional and alignment evidence to assess protein sequence similarity. Before we discuss this program's retrieval and statistical accuracy, two technical details bear comment.

First, because we wish to combine compositional similarity not just with the best hit from a database search, but in principle with the best hits for all sequences, it is appropriate to apply Equation 6 with pairwise *P*-values ${\widehat{P}}_{a}$ and ${\widehat{P}}_{u}$ in place of database *P*-values *P_{a}* and *P_{u}*. Therefore, to calculate a unified database *E*-value *E_{u}* from an alignment database *E*-value *E_{a}*, we have to perform the five-step calculation described above in the Theory section. Second, as described in the Theory section above, SU-BLAST returns a result only for those sequences whose preliminary alignment *E*-value is lower than a set threshold. This heuristic is unlikely to exclude many alignments with low unified *E*-values, because alignment similarity generally carries much more information than compositional similarity.
We tested SU-BLAST for retrieval and statistical accuracy. As shown in Figure 1, SU-BLAST's reported statistics are noticeably more conservative than those of S-BLAST, but are still accurate within an order of magnitude. However, as shown in Figure 2, by using λ to measure and reward similar compositional bias, SU-BLAST recaptures the retrieval accuracy forfeited by S-BLAST. SU-BLAST and B-BLAST have very similar ROC curves and ROC_{5000} scores. For the data set used, B-BLAST is slightly better at false positive rates < 0.3/query, and SU-BLAST is slightly better at false positive rates > 0.3/query. The major difference between SU-BLAST's and B-BLAST's performance is found in the far greater accuracy of SU-BLAST's statistics.

### C-BLAST and CU-BLAST

It has been argued that standard substitution matrices such as BLOSUM-62 are not ideal for comparing sequences with non-standard compositions, and an efficient method has been proposed for adjusting standard matrices for use in arbitrary compositional contexts (15–17).

For protein database searching, Altschul *et al.* (15) have shown that conditional compositional substitution matrix adjustment yields better retrieval accuracy than does the substitution matrix scaling (2) embodied in S-BLAST. Like matrix scaling, compositional adjustment produces alignment statistics conditioned on the compositions of the sequences compared. Therefore, it is appropriate to replace the matrix scaling of S-BLAST and SU-BLAST with conditional compositional adjustment (15) to produce the programs C-BLAST and CU-BLAST. An analysis of the independence of *P_{a}* and *P_{c}* similar to that shown in Figure 4, but with S-BLAST replaced by C-BLAST, produces results nearly equivalent to those discussed above, and is omitted here.

The reliability of C-BLAST's and CU-BLAST's statistics is evaluated in Figure 1. As can be seen, the replacement of scaled by compositionally adjusted substitution matrices yields a somewhat improved agreement of statistical theory with experiment. Also, as shown in Figure 2, CU-BLAST outperforms both B-BLAST and SU-BLAST in retrieval accuracy. In summary, evaluated from the baseline provided by B-BLAST, the integration of compositionally adjusted substitution matrices with a measure of similar compositional bias yields a program that is substantially improved from the standpoints of both statistical and retrieval accuracy.

## DISCUSSION AND CONCLUSION

It has been recognized for some time that a failure to account for biased amino acid compositions can lead to exaggerated claims of protein alignment statistical significance (2,10,11). It has not been widely recognized, however, that basing alignment statistics upon sequence composition can erode retrieval accuracy. We have argued that such erosion may stem from the fact that similarly biased compositions, in themselves, constitute evidence of protein relatedness. To improve alignment statistics, earlier methods have teased apart alignment and compositional similarity, but have then discarded the latter. We have proposed rejoining these two threads of evidence in a mathematically valid manner. Some studies may involve comparing related proteins that have amino acid compositions known or suspected to be discordant. In such cases, adding to alignment similarity the compositional similarity discussed here may well be counterproductive.

We have derived an empirical distribution of ungapped λ values for unrelated sequence pairs from the astral40 data set (18,19). This set is biased towards globular proteins, and therefore our λ distribution may not be valid for more comprehensive protein sets. Unfortunately, accurate classifications of proteins into related and unrelated classes are not currently available for such larger sets, and the λ distribution may bear refinement as protein relationships become more fully understood. However, proteins with highly biased or repetitive sequences generally are heavily filtered by the SEG program, and are probably not best studied using traditional alignment methods.

Different versions of the command line executable program ‘blastpgp’ implementing the five BLAST versions described above and compiled for Linux are available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/blast_unified_statistics. Also available in this directory are a table of empirical values for *P _{c}*(λ), the shuffled human and mouse sequences used in the statistical tests above, as well as the unshuffled sequences from which they are derived.

In this paper we have been concerned primarily with comparing the newly proposed measure of sequence similarity to earlier ones, and have sought to minimize confounding factors introduced by various BLAST heuristics. Accordingly, we have used the Smith-Waterman option in the final alignment phase (2), and we have used a high preliminary *E*-value threshold of 100 for re-evaluating alignments. However, the modest improvements in search sensitivity yielded by these options come at a significant cost in execution time, and different options may be chosen as defaults for NCBI's BLAST web servers. We have no reason to believe that our results would be qualitatively different using either faster BLAST parameter settings, or using the Smith-Waterman algorithm on every database sequence instead of on only those selected by the BLAST heuristics.

The use of compositional similarity, as well as its integration with compositional matrix adjustment, will be available as an option for the protein-query, protein-database BLAST search program on the NCBI web site. The current command line *blastpgp* offers unified *P*-values as part of the composition option (-*t*). Also, the use of compositional similarity may be specified for PSI-BLAST's initial BLAST round of database search, but thereafter PSI-BLAST uses only alignment similarity to assess results.

Compositional sequence relatedness has previously been used in other ways for protein sequence analysis. For example, the PHD program for predicting protein secondary structure (32) employs global amino acid composition as one input to a neural network. Also, a somewhat ad hoc approach has been described for adjusting the reported *E*-values of alignments involving low-complexity regions by post-processing BLAST outputs (33). Some database search programs, such as FASTA (34), correct for the composition of query sequences by estimating statistical parameters from database searches. This procedure, however, takes no account of the compositions of individual database sequences, and so can at most partially correct for the manner in which compositional biases skew pairwise alignment scores.

It may be possible to improve in several ways on this paper's approach. For example, the statistical corrections for compositional bias employed by S-BLAST and C-BLAST, while quite accurate for random sequences, are compromised to varying degrees by periodicity and non-uniformity within real sequences. SEG (7), applied to database sequences in this paper, removes many low-complexity segments from consideration, but certain periodic patterns remain. These may be dealt with by additional special-purpose filters, e.g. for coiled coils (35–38), or by calculating alignment statistics based on a reversed-sequence model of randomness (10,39).

SCOP-based test sets such as astral40 are widely used for the evaluation and comparison of protein sequence database search methods, but they have potential disadvantages. SCOP is a classification of protein domains, but most comparisons performed in database searches involve complete proteins. As a result, SCOP-based evaluations may tend unduly to favor scoring systems that have more of a global than a local flavor, such as the compositional similarity measure studied here. A test set we have previously employed (2), which compares queries to full-length yeast sequences, is too small to yield statistically significant results in this study. Certainly the utility of the methods we have discussed is a function of the degree to which the compositional properties of complete proteins reflect those of the proteins' domains. Progress on the automatic parsing of protein sequences into domains should therefore be able to improve both statistical and retrieval accuracy.

Here we have studied one measure, the ungapped λ implied by a fixed substitution matrix, of two sequences' compositional similarity. Many other measures are possible, and we did investigate one, a compositional distance metric recently described by Endres and Schindelin (40). The distance distributions for related and unrelated sequence pairs showed a marked separation, similar to that of Figure 3. However, the improvement in retrieval accuracy yielded by this measure was distinctly inferior to that yielded by λ (data not shown). Nevertheless, it remains possible that theoretical considerations or further experimentation will yield a compositional similarity measure more effective than λ for our purposes.
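
For readers wishing to experiment, the Endres–Schindelin distance between two compositions *p* and *q* is, as we understand it, the square root of twice their Jensen–Shannon divergence taken with natural logarithms. A minimal sketch (the function name is ours; zero-frequency terms contribute nothing, by the usual 0·ln 0 = 0 convention):

```python
import math

def es_distance(p, q):
    """Endres-Schindelin metric: sqrt of twice the Jensen-Shannon
    divergence (natural log) between two composition vectors."""
    total = 0.0
    for pi, qi in zip(p, q):
        m = pi + qi            # 2 * mixture frequency
        if pi > 0.0:
            total += pi * math.log(2.0 * pi / m)
        if qi > 0.0:
            total += qi * math.log(2.0 * qi / m)
    return math.sqrt(total)
```

The value is 0 for identical compositions and attains its maximum, √(2 ln 2) ≈ 1.177, for compositions with disjoint support.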

An alternative approach to measuring global compositional similarity is by log–odds scores, analogous to those for alignment similarity, which make use of information derived from related as well as unrelated sequence pairs (John Spouge, personal communication). One may imagine other methods for taking advantage of the different behaviors of these two sets, and it is likely that a more sensitive measure than those we have so far studied will be found.

The idea of combining alignment similarity with independent measures of sequence relatedness, such as compositional similarity, may be applied fruitfully to database search programs other than BLAST, e.g. FASTA (34). It may also be possible to graft compositional or other similarity measures onto the alignment similarity measures used by protein profile programs such as HMMER (41), PSI-BLAST (3), SAM (10), IMPALA (42) or SALTO (43). To what extent this can improve the statistics or retrieval accuracy of these programs awaits further investigation.

## ACKNOWLEDGEMENTS

Thanks to Aleksandr Morgulis for comments on the manuscript and for technical assistance. This research was supported by the Intramural Research Program of the National Library of Medicine of the NIH/DHHS. Funding to pay the Open Access publication charges for this article was provided by the Intramural Research Program of the National Institutes of Health, NLM.

*Conflict of interest statement*. None declared.

## REFERENCES

*P*-values of gapped local sequence and profile alignments. J. Mol. Biol. 2000;300:649–659.

*p*-values. Biom. J. 1991;33:339–345.

*E*-values for hidden Markov models using reverse-sequence null models. Bioinformatics. 2005;21:4107–4115.

**Oxford University Press**

Nucleic Acids Research. 2006 Nov;34(20):5966.