Manual classification strategies in the ECOD database

Proteins. 2015 Jul;83(7):1238-51. doi: 10.1002/prot.24818. Epub 2015 May 8.

Abstract

ECOD (Evolutionary Classification Of protein Domains) is a comprehensive and up-to-date protein structure classification database. The majority of new structures released from the PDB (Protein Data Bank) each week already have close homologs in the ECOD hierarchy and thus can be reliably partitioned into domains and classified by software without manual intervention. However, proteins that lack confidently detectable homologs require careful analysis by experts. Although many bioinformatics resources rely on expert curation to some degree, specific examples of how this curation occurs and in which cases it is necessary are not always described. Here, we illustrate the manual classification strategy in ECOD by example, focusing on two major issues in protein classification: domain partitioning and the relationship between homology and similarity scores. Most examples show recently released and manually classified PDB structures. We discuss multi-domain proteins, discordance between sequence and structural similarities, difficulties in assessing homology from similarity scores, and integral membrane proteins homologous to soluble proteins. Through timely assimilation of newly available structures into its hierarchy, ECOD strives to provide an accurate and up-to-date view of the protein structure world, produced by combined computational and expert-driven analysis.

Keywords: classification; database; domain; evolution; homology; protein; sequence; structure.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Amino Acid Sequence
  • Animals
  • Computational Biology / methods*
  • Databases, Protein*
  • Dimethylallyltranstransferase / chemistry
  • Dimethylallyltranstransferase / classification
  • Evolution, Molecular
  • Humans
  • Hydrogen Bonding
  • Hydrophobic and Hydrophilic Interactions
  • Models, Molecular
  • Molecular Sequence Data
  • Neuropeptides / chemistry
  • Neuropeptides / classification
  • Neurotoxins / chemistry
  • Neurotoxins / classification
  • Protein Structure, Secondary
  • Protein Structure, Tertiary
  • Sequence Alignment
  • Sequence Homology, Amino Acid
  • Software
  • Spider Venoms / chemistry
  • Spider Venoms / classification
  • Static Electricity
  • Terminology as Topic*

Substances

  • Neuropeptides
  • Neurotoxins
  • Spider Venoms
  • TaITX-1
  • molt-inhibiting hormone, crayfish
  • Dimethylallyltranstransferase