Minimal-risk scoring matrices for sequence analysis

T D Wu; C G Nevill-Manning; D L Brutlag

doi:10.1089/cmb.1999.6.219

J Comput Biol. 1999 Summer;6(2):219-35. doi: 10.1089/cmb.1999.6.219.

Authors

T D Wu¹, C G Nevill-Manning, D L Brutlag

Affiliation

¹ Department of Biochemistry, Stanford University School of Medicine, California 94305-5307, USA. thomas.wu@stanford.edu

PMID: 10421524
DOI: 10.1089/cmb.1999.6.219

Abstract

We introduce a minimal-risk method for estimating the frequencies of amino acids at conserved positions in a protein family. Our method, called minimal-risk estimation, finds the optimal weighting between a set of observed amino acid counts and a set of pseudofrequencies, which represent prior information about the frequencies. We compute the optimal weighting by minimizing the expected distance between the estimated frequencies and the true population frequencies, measured by either a squared-error or a relative-entropy metric. Our method accounts for the source of the pseudofrequencies, which arise either from the background distribution of amino acids or from applying a substitution matrix to the observed data. Our frequency estimates therefore depend on the size and composition of the observed data as well as the source of the pseudofrequencies. We convert our frequency estimates into minimal-risk scoring matrices for sequence analysis. A large-scale cross-validation study, involving 48 variants of seven methods, shows that the best performing method is minimal-risk estimation using the squared-error metric. Our method is implemented in the package EMATRIX, which is available on the Internet at http://motif.stanford.edu/ematrix.
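
The weighting idea in the abstract can be illustrated with a simplified sketch. It is not the EMATRIX implementation and does not reproduce the paper's derivation. It assumes a fixed pseudofrequency vector (for example, the background distribution), a multinomial sampling model, and the squared-error metric. Under those assumptions, the risk of the mixture (1 - w) * observed + w * pseudo is (1 - w)^2 * V + w^2 * B, where V is the sampling variance of the observed frequencies and B is the squared distance between the pseudofrequencies and the true frequencies, so the minimizing weight is w = V / (V + B). Because the true frequencies are unknown, the sketch plugs in the observed frequencies, which is an assumption made here for illustration. The function names, the four-letter toy alphabet, and the log-odds conversion against the background are likewise illustrative choices rather than details taken from the paper.

```python
import math


def shrinkage_weight(counts, pseudo):
    """Weight on the pseudofrequencies that minimizes a plug-in estimate
    of the expected squared error (illustrative, not the paper's formula)."""
    n = sum(counts)
    if n == 0:
        return 1.0  # no observations: rely entirely on the prior
    observed = [c / n for c in counts]
    # Plug-in multinomial variance of the observed frequencies.
    variance = (1.0 - sum(f * f for f in observed)) / n
    # Plug-in squared distance between the prior and the (unknown) truth.
    bias_sq = sum((q - f) ** 2 for q, f in zip(pseudo, observed))
    denom = variance + bias_sq
    return variance / denom if denom > 0 else 0.0


def estimate_frequencies(counts, pseudo):
    """Mix observed frequencies with pseudofrequencies using the weight above."""
    n = sum(counts)
    if n == 0:
        return list(pseudo)
    w = shrinkage_weight(counts, pseudo)
    return [(1.0 - w) * c / n + w * q for c, q in zip(counts, pseudo)]


def log_odds_column(freqs, background):
    """Convert estimated frequencies to log-odds scores in bits
    (an assumed, commonly used conversion)."""
    return [
        math.log2(p / b) if p > 0 else float("-inf")
        for p, b in zip(freqs, background)
    ]


# Toy example: one conserved position over a hypothetical 4-letter alphabet.
background = [0.25, 0.25, 0.25, 0.25]
counts = [8, 1, 1, 0]
freqs = estimate_frequencies(counts, background)
scores = log_odds_column(freqs, background)
```

In this sketch the weight on the pseudofrequencies shrinks as the number of observed sequences grows and depends on how the counts are distributed, which mirrors the abstract's statement that the estimates depend on both the size and the composition of the observed data.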

Publication types

Comparative Study
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Amino Acids / analysis
Conserved Sequence / genetics
Entropy
Likelihood Functions*
Markov Chains
Proteins / chemistry*
Reproducibility of Results
Risk
Sensitivity and Specificity
Sequence Analysis / methods*
Software

Substances

Amino Acids
Proteins

Grants and funding

LM-05716/LM/NLM NIH HHS/United States