Minimal-risk scoring matrices for sequence analysis

J Comput Biol. 1999 Summer;6(2):219-35. doi: 10.1089/cmb.1999.6.219.

Abstract

We introduce a minimal-risk method for estimating the frequencies of amino acids at conserved positions in a protein family. Our method, called minimal-risk estimation, finds the optimal weighting between a set of observed amino acid counts and a set of pseudofrequencies, which represent prior information about the frequencies. We compute the optimal weighting by minimizing the expected distance between the estimated frequencies and the true population frequencies, measured by either a squared-error or a relative-entropy metric. Our method accounts for the source of the pseudofrequencies, which arise either from the background distribution of amino acids or from applying a substitution matrix to the observed data. Our frequency estimates therefore depend on the size and composition of the observed data as well as the source of the pseudofrequencies. We convert our frequency estimates into minimal-risk scoring matrices for sequence analysis. A large-scale cross-validation study, involving 48 variants of seven methods, shows that the best performing method is minimal-risk estimation using the squared-error metric. Our method is implemented in the package EMATRIX, which is available on the Internet at http://motif.stanford.edu/ematrix.
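
The abstract does not give formulas, so the following is only an illustrative sketch of the general idea of weighting observed counts against fixed pseudofrequencies under a squared-error metric. It assumes the counts are multinomial draws and the pseudofrequencies are a fixed background distribution. Under those assumptions the risk of the estimate (1 - w) * observed + w * pseudo is (1 - w)^2 * V + w^2 * B, where V is the sampling variance of the observed frequencies and B is the squared distance between the pseudofrequencies and the true frequencies; the minimum lies at w = V / (V + B). Because the true frequencies are unknown, the sketch substitutes the observed frequencies for them, which is a crude plug-in and not necessarily what the paper or EMATRIX does. The function names, the four-letter toy alphabet, and the log-odds conversion to scores are likewise assumptions made for illustration.

```python
import numpy as np


def shrinkage_weight(counts, pseudo):
    """Plug-in weight on the pseudofrequencies (hypothetical helper).

    Minimizes (1 - w)^2 * V + w^2 * B, with the unknown true
    frequencies replaced by the observed frequencies (an assumption).
    """
    counts = np.asarray(counts, dtype=float)
    pseudo = np.asarray(pseudo, dtype=float)
    n = counts.sum()
    if n == 0:
        return 1.0  # no data: rely entirely on the pseudofrequencies
    observed = counts / n
    # V: total sampling variance of multinomial frequencies, sum p(1-p)/n
    variance = np.sum(observed * (1.0 - observed)) / n
    # B: squared distance from pseudofrequencies to the (plug-in) truth
    bias_sq = np.sum((pseudo - observed) ** 2)
    if variance + bias_sq == 0.0:
        return 0.0  # observed and pseudo coincide exactly, no variance
    return variance / (variance + bias_sq)


def estimate_frequencies(counts, pseudo):
    """Blend observed frequencies with pseudofrequencies (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    pseudo = np.asarray(pseudo, dtype=float)
    n = counts.sum()
    w = shrinkage_weight(counts, pseudo)
    observed = counts / n if n > 0 else np.zeros_like(pseudo)
    return (1.0 - w) * observed + w * pseudo


def log_odds_scores(freqs, background):
    """Convert estimated frequencies to log2-odds scores (assumed scoring form)."""
    return np.log2(np.asarray(freqs, dtype=float) / np.asarray(background, dtype=float))


# Toy example: one conserved position over a 4-letter alphabet
# (a real protein position would use 20 amino acids).
background = [0.25, 0.25, 0.25, 0.25]
counts = [8, 1, 1, 0]
freqs = estimate_frequencies(counts, background)
scores = log_odds_scores(freqs, background)
print(shrinkage_weight(counts, background), freqs, scores)
```

In this sketch the weight on the pseudofrequencies shrinks as the number of observed sequences grows, which mirrors the abstract's statement that the estimates depend on the size and composition of the observed data. One limitation of the plug-in step: if every observed sequence carries the same residue, the estimated variance is zero, the weight falls to zero, and unobserved residues get zero frequency.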

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Amino Acids / analysis
  • Conserved Sequence / genetics
  • Entropy
  • Likelihood Functions*
  • Markov Chains
  • Proteins / chemistry*
  • Reproducibility of Results
  • Risk
  • Sensitivity and Specificity
  • Sequence Analysis / methods*
  • Software

Substances

  • Amino Acids
  • Proteins