Random Forest Refinement of the KECSA2 Knowledge-Based Scoring Function for Protein Decoy Detection

J Chem Inf Model. 2019 May 28;59(5):1919-1929. doi: 10.1021/acs.jcim.8b00734. Epub 2019 Feb 20.

Abstract

Knowledge-based potentials generally perform better than physics-based scoring functions at detecting the native structure in a collection of decoy protein structures. By using a reference state, the pure interactions between atom/residue pairs can be obtained by removing the contributions of ideal-gas state potentials. However, it is a challenge for conventional knowledge-based potentials to assign different importance factors to different atom/residue pairs. In this work, using the "comparison" concept, Random Forest (RF) models were successfully generated from unbalanced data sets. These models assign different importance factors to atom pair potentials to enhance their ability to distinguish native proteins from decoy proteins. Individual and combined data sets consisting of 12 decoy sets were used to test the performance of the RF models. We find that the RF models increase the recognition of native structures without affecting their ability to identify the best decoy structures. We also created models using scrambled atom types, which produce physically unrealistic probability functions, to test whether the RF algorithm can create useful models from scrambled input probability functions. In this test, we were unable to create models of similar quality to those built from the unscrambled probability functions. Next, we created uniform probability functions in which the peak positions are the same as in the original functions but every interaction has the same peak height. Using these uniform potentials, we recovered models as good as those using the full potentials, suggesting that all that is important in these models is the experimental peak positions. The KECSA2 potential and all code used in this work are available at https://github.com/JunPei000/protein_folding-decoy-set.
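To make the reference-state idea concrete, the sketch below applies the generic inverse Boltzmann relation, E(r) = -kT ln(p_obs(r) / p_ref(r)), to a pair of distance histograms. It is an illustration only and not the KECSA2 formulation: the function name `inverse_boltzmann`, the `floor` parameter, the flat reference histogram, and the toy counts are all assumptions introduced here.

```python
import math


def inverse_boltzmann(observed, reference, kT=1.0, floor=1e-12):
    """Turn observed and reference pair-distance histograms into a potential.

    Generic inverse Boltzmann relation (an assumption for illustration,
    not the KECSA2 reference state):
        E(r_i) = -kT * ln(p_obs(r_i) / p_ref(r_i))
    Both histograms are normalized to sum to 1. `floor` is a hypothetical
    safeguard that avoids log(0) for empty bins.
    """
    if len(observed) != len(reference):
        raise ValueError("histograms must have the same number of bins")
    obs_total = sum(observed)
    ref_total = sum(reference)
    if obs_total <= 0 or ref_total <= 0:
        raise ValueError("histograms must contain at least one count")

    energies = []
    for obs_count, ref_count in zip(observed, reference):
        p_obs = max(obs_count / obs_total, floor)
        p_ref = max(ref_count / ref_total, floor)
        # Bins seen more often than the reference predicts get E < 0.
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies


# Toy counts for one atom pair type over four distance bins (made up).
observed = [5, 40, 30, 25]
# Flat reference: every bin equally likely in the absence of interactions.
reference = [25, 25, 25, 25]

potential = inverse_boltzmann(observed, reference)
# Index of the most favorable distance bin (the position of the energy well).
best_bin = min(range(len(potential)), key=potential.__getitem__)
```

In this toy case the most populated bin is the energy minimum, which is the sense in which the peak position of a probability function carries the information that the uniform-potential test in the abstract probes.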

MeSH terms

  • Algorithms
  • Knowledge Bases
  • Machine Learning
  • Models, Molecular
  • Probability
  • Protein Conformation
  • Protein Folding
  • Proteins / chemistry*
  • Thermodynamics

Substances

  • Proteins