From: Altschul, Stephen (NIH/NLM/NCBI) [E]
Sent: Tuesday, January 29, 2008 10:02 AM
To: NLM/NCBI List ncbi-seminar
Subject: Reminder: CBB seminar today

Time: January 29, 11:00 AM
Place: NCBI Library B2, Bldg. 38A
Speaker: Stephen Altschul

Pseudocounts and the Minimum Description Length Principle

In the context of PSI-BLAST, amino acid "target frequencies" for a given alignment position are predicted using the observed amino acid counts for that position, and data-dependent pseudocounts. The number of pseudocounts to use has been determined empirically. Here, we apply the minimum description length principle to derive analytically the number of pseudocounts expected to be optimal. We find that in the limit of a very large number n of independent observations, the number of pseudocounts should grow as the cube root of n. The available data are insufficient to confirm this prediction. However, we also find that alignment positions with greater relative entropy should use fewer pseudocounts, yielding position-specific score matrices with greater "contrast". Implementation of this prediction does yield a statistically significant improvement in PSI-BLAST's retrieval performance. A similar prediction can be derived from an alternative theoretical formulation.
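
For readers unfamiliar with pseudocounts, the idea in the abstract can be sketched roughly as follows. This is an illustration only, not PSI-BLAST's implementation or the derivation presented in the talk. It assumes a fixed prior distribution (PSI-BLAST's pseudocounts are data-dependent), and the function names and the `scale` constant are hypothetical.

```python
import math


def target_frequencies(counts, prior, num_pseudocounts):
    """Blend observed counts with pseudocounts spread according to `prior`.

    Simplifying assumption: the pseudocounts follow a fixed prior rather
    than the data-dependent distribution the abstract refers to.
    """
    n = sum(counts.values())
    total = n + num_pseudocounts
    return {
        letter: (counts.get(letter, 0) + num_pseudocounts * prior[letter]) / total
        for letter in prior
    }


def cube_root_pseudocounts(n, scale=1.0):
    """Pseudocount total that grows as the cube root of n.

    `scale` is a hypothetical constant; the abstract states only the
    asymptotic growth rate, not a specific value.
    """
    return scale * n ** (1.0 / 3.0)


def relative_entropy(p, q):
    """Relative entropy of distribution p with respect to q, in bits."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)


# Toy four-letter alphabet for brevity; real alignment columns use 20 amino acids.
toy_prior = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_counts = {"A": 8}

m = cube_root_pseudocounts(sum(toy_counts.values()))
freqs = target_frequencies(toy_counts, toy_prior, m)
print(freqs, relative_entropy(freqs, toy_prior))
```

Using fewer pseudocounts keeps the estimated frequencies closer to the observed counts, which is the sense in which the resulting scores have greater "contrast".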