Training based on ligand efficiency improves prediction of bioactivities of ligands and drug target proteins in a machine learning approach

J Chem Inf Model. 2013 Oct 28;53(10):2525-37. doi: 10.1021/ci400240u. Epub 2013 Sep 24.

Abstract

Machine learning based on ligand-protein interaction data in bioactivity databases is one of the current strategies for efficiently finding novel lead compounds, the first step in the drug discovery process. Although previous machine learning studies have predicted novel ligand-protein interactions with high performance, all of them have depended heavily on the direct use of raw bioactivity data, that is, ligand potencies measured as IC50, EC50, Ki, and Kd values deposited in databases. ChEMBL provides a unique opportunity to investigate whether a machine-learning-based classifier built on ligand efficiency, rather than on IC50, EC50, Ki, and Kd values, can also offer high predictive performance. Here we report that classifiers created from training data based on ligand efficiency show higher performance than those created from data based on IC50 or Ki values. Using the GPCRSARfari and KinaseSARfari databases in ChEMBL, we created IC50- or Ki-based training data and binding efficiency index (BEI)-based training data, and then constructed classifiers using support vector machines (SVMs). The SVM classifiers built from the BEI-based training data showed slightly higher area under the curve (AUC), accuracy, sensitivity, and specificity in the cross-validation tests. When the classifiers were applied to the validation data, the AUCs and specificities of the BEI-based classifiers were dramatically higher than those of the IC50- or Ki-based classifiers. The improved predictive power of the BEI-based classifiers can be attributed to (i) the more clearly separated distributions of positives and negatives, (ii) the higher diversity of negatives in the BEI-based training data in the SVM feature space, and (iii) a more balanced number of positives and negatives in the BEI-based training data. These results strongly suggest that training data based on ligand efficiency, as well as data based on the classical IC50, EC50, Kd, and Ki values, are important when creating a classifier with a machine learning approach based on bioactivity data.
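As a rough illustration of how BEI-based labels can differ from potency-based labels, the sketch below uses the commonly cited definition of BEI, the negative log10 of the molar potency divided by molecular weight in kDa. The abstract does not state the formula or any cutoffs, so the two thresholds, the record names, and the example values are all hypothetical and are not taken from the study.

```python
import math
from typing import NamedTuple


class Record(NamedTuple):
    """One ligand-target measurement (illustrative structure, not ChEMBL's schema)."""
    name: str
    potency_nm: float      # IC50 or Ki in nanomolar
    mol_weight_da: float   # molecular weight in daltons


# Hypothetical cutoffs chosen only for this example.
POTENCY_CUTOFF_NM = 100.0   # at or below this potency -> positive
BEI_CUTOFF = 15.0           # at or above this BEI -> positive


def p_activity(potency_nm: float) -> float:
    """Negative log10 of the potency expressed in molar units."""
    if potency_nm <= 0:
        raise ValueError("potency must be positive")
    return 9.0 - math.log10(potency_nm)


def bei(potency_nm: float, mol_weight_da: float) -> float:
    """Binding efficiency index: p(activity) per kDa of molecular weight."""
    if mol_weight_da <= 0:
        raise ValueError("molecular weight must be positive")
    return p_activity(potency_nm) / (mol_weight_da / 1000.0)


def label_by_potency(rec: Record) -> bool:
    """Positive if the raw potency passes the (hypothetical) cutoff."""
    return rec.potency_nm <= POTENCY_CUTOFF_NM


def label_by_bei(rec: Record) -> bool:
    """Positive if the BEI passes the (hypothetical) cutoff."""
    return bei(rec.potency_nm, rec.mol_weight_da) >= BEI_CUTOFF


# Made-up records: same potency at different sizes, plus a weak but small ligand.
records = [
    Record("small_potent", 10.0, 400.0),
    Record("large_potent", 10.0, 800.0),
    Record("small_weak", 10000.0, 250.0),
]

for rec in records:
    print(
        rec.name,
        round(bei(rec.potency_nm, rec.mol_weight_da), 2),
        label_by_potency(rec),
        label_by_bei(rec),
    )
```

Under these assumed cutoffs the two labeling schemes disagree on the large potent ligand and on the small weak one, which is the kind of shift in the positive and negative sets that the abstract describes. The resulting labels would then serve as the training targets for a classifier such as an SVM.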

MeSH terms

  • Area Under Curve
  • Artificial Intelligence*
  • Data Mining
  • Databases, Chemical
  • Databases, Pharmaceutical
  • Drug Discovery
  • Humans
  • Inhibitory Concentration 50
  • Ligands
  • Principal Component Analysis
  • Protein Kinases / chemistry*
  • Receptors, G-Protein-Coupled / agonists
  • Receptors, G-Protein-Coupled / antagonists & inhibitors
  • Receptors, G-Protein-Coupled / chemistry*
  • Sensitivity and Specificity
  • Small Molecule Libraries / chemistry*
  • Support Vector Machine*

Substances

  • Ligands
  • Receptors, G-Protein-Coupled
  • Small Molecule Libraries
  • Protein Kinases