Evaluating, comparing, and interpreting protein domain hierarchies

Andrew F Neuwald

doi:10.1089/cmb.2013.0098

J Comput Biol. 2014 Apr;21(4):287-302. doi: 10.1089/cmb.2013.0098. Epub 2014 Feb 21.

Author

Andrew F Neuwald¹

Affiliation

¹ Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, Maryland.

Abstract

Arranging protein domain sequences hierarchically into evolutionarily divergent subgroups is important for investigating evolutionary history, for speeding up web-based similarity searches, for identifying sequence determinants of protein function, and for genome annotation. However, whether or not a particular hierarchy is optimal is often unclear, and independently constructed hierarchies for the same domain can often differ significantly. This article describes methods for statistically evaluating specific aspects of a hierarchy, for probing the criteria underlying its construction and for direct comparisons between hierarchies. Information theoretical notions are used to quantify the contributions of specific hierarchical features to the underlying statistical model. Such features include subhierarchies, sequence subgroups, individual sequences, and subgroup-associated signature patterns. Underlying properties are graphically displayed in plots of each specific feature's contributions, in heat maps of pattern residue conservation, in "contrast alignments," and through cross-mapping of subgroups between hierarchies. Together, these approaches provide a deeper understanding of protein domain functional divergence, reveal uncertainties caused by inconsistent patterns of sequence conservation, and help resolve conflicts between competing hierarchies.
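As a rough illustration of how an information-theoretic measure can quantify a subgroup-associated signature pattern, the sketch below computes, for each alignment column, the relative entropy (in bits) between a subgroup's residue frequencies and those of a background set such as its parent group. This is a generic textbook measure, not the statistical model used in the article; the function names, the flat pseudocount, and the toy sequences are assumptions made only for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(sequences, column, pseudocount=1.0):
    """Residue frequencies at one alignment column.

    A flat pseudocount (an assumption of this sketch) keeps every
    frequency non-zero; gap or unknown characters are ignored.
    """
    counts = Counter(
        seq[column] for seq in sequences if seq[column] in AMINO_ACIDS
    )
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def column_divergence_bits(foreground, background, column):
    """Relative entropy D(p || q) in bits for a single column.

    p is estimated from the subgroup (foreground) and q from the
    background set. Larger values mean the subgroup's residue usage
    at this position departs more strongly from the background.
    """
    p = column_frequencies(foreground, column)
    q = column_frequencies(background, column)
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def divergence_profile(foreground, background):
    """Per-column divergence for two sets of equal-length aligned sequences."""
    length = len(foreground[0])
    return [
        column_divergence_bits(foreground, background, c) for c in range(length)
    ]


# Toy aligned sequences, invented for illustration only.
subgroup = ["ACD", "ACE", "ACD"]
parent_group = ["AGD", "AKE", "ALD", "AME"]
profile = divergence_profile(subgroup, parent_group)
```

In this toy input the second column is conserved as C in the subgroup but variable in the parent group, so it receives the largest value in the profile. Values like these could be plotted per column or rendered as a heat map, in the spirit of the displays the abstract mentions.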

Publication types

Comparative Study
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Amino Acid Sequence
Computer Simulation
Conserved Sequence
Evolution, Molecular
Models, Biological
Molecular Sequence Data
Protein Structure, Tertiary
Proteins / chemistry*
Sequence Analysis, Protein

Substances

Proteins

Grants and funding

HHSN 2630000999571/PHS HHS/United States