From: Altschul, Stephen (NIH/NLM/NCBI)
Sent: Monday, December 06, 2004 10:36 AM
To: ncbi-seminar@ncbi.nlm.nih.gov
Subject: Tuesday seminar, Dec. 7

I will give the regular CBB Tuesday seminar at 11 AM tomorrow in the NCBI Library in Building 38A, Floor B2. A title and abstract follow.

-Stephen Altschul

The Use of Compositionally-Adjusted Amino Acid Substitution Matrices in General Purpose Protein Database Similarity Searches

Standard amino acid substitution matrices are constructed as log-odds ratios from large collections of alignments of related proteins. Any such collection has an implicit "standard" set of amino acid background frequencies. The matrices produced, however, often are used to compare proteins with quite non-standard amino acid compositions. It has been argued on theoretical grounds that this is inappropriate, and a method has been described for transforming a standard matrix into one appropriate for comparing proteins with any non-standard compositions. Furthermore, it has been shown that such compositionally-adjusted matrices yield improved results, from the twin perspectives of alignment score and alignment quality, when proteins with strongly biased compositions are compared.

Here, we study whether and to what extent such adjusted matrices are of utility for general purpose protein database searches. Using standard test platforms, we compared a standard matrix to compositionally-adjusted matrices, with relative entropy left unconstrained, or constrained in various ways. We found that constraining the relative entropy of the compositionally adjusted matrix to a fixed value in the new compositional context generally produced the best results. We also found that if the sequences compared are not known to have strong compositional biases, then it is still on average advantageous to use an adjusted matrix when the sequences satisfy certain simple length or compositional inequalities.
Applying these findings to general-purpose database searches can lead to a significant improvement in retrieval performance, with a minimal increase in execution time. Preliminary results also suggest that the use of compositionally adjusted matrices may allow more sparing use of the SEG program in general purpose database searches.
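
For readers less familiar with the two quantities the abstract relies on, here is a minimal sketch of the textbook definitions of a log-odds score and of a matrix's relative entropy. It is not the compositional adjustment method discussed in the seminar. The two-letter alphabet, the pair frequencies, the function names, and the choice of bits (log base 2) as the unit are all assumptions made for illustration.

```python
import math

def marginals(q):
    """Background frequencies p_i implied by the pair (target) frequencies q_ij."""
    p = {}
    for (a, _b), freq in q.items():
        p[a] = p.get(a, 0.0) + freq
    return p

def log_odds_scores(q, p, scale=1.0):
    """Scores s_ij = scale * log2(q_ij / (p_i * p_j)); scale=1 gives scores in bits."""
    return {
        (a, b): scale * math.log2(freq / (p[a] * p[b]))
        for (a, b), freq in q.items()
    }

def relative_entropy(q, p):
    """Relative entropy H = sum_ij q_ij * log2(q_ij / (p_i * p_j)), in bits."""
    return sum(
        freq * math.log2(freq / (p[a] * p[b]))
        for (a, b), freq in q.items()
    )

# Toy, made-up target frequencies for a two-letter alphabet (not real amino acid data).
q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
p = marginals(q)               # implicit "standard" background frequencies
scores = log_odds_scores(q, p)
H = relative_entropy(q, p)
```

In this toy case identical letters score positive and mismatches score negative, and H is the expected score per aligned pair under the target frequencies. Constraining H, as the abstract describes, would mean holding this quantity to a fixed value when the matrix is rebuilt for different background frequencies.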