Send to

Choose Destination
J Chem Inf Model. 2006 Jan-Feb;46(1):193-200.

Enrichment of high-throughput screening data with increasing levels of noise using support vector machines, recursive partitioning, and laplacian-modified naive bayesian classifiers.

Author information

Lead Discovery Center, Novartis Institutes for Biomedical Research Inc., Cambridge, Massachusetts 02139, USA.


High-throughput screening (HTS) plays a pivotal role in lead discovery for the pharmaceutical industry. In tandem, cheminformatics approaches are employed to increase the probability of the identification of novel biologically active compounds by mining the HTS data. HTS data is notoriously noisy, and therefore, the selection of the optimal data mining method is important for the success of such an analysis. Here, we describe a retrospective analysis of four HTS data sets using three mining approaches: Laplacian-modified naive Bayes, recursive partitioning, and support vector machine (SVM) classifiers with increasing stochastic noise in the form of false positives and false negatives. All three of the data mining methods at hand tolerated increasing levels of false positives even when the ratio of misclassified compounds to true active compounds was 5:1 in the training set. False negatives in the ratio of 1:1 were tolerated as well. SVM outperformed the other two methods in capturing active compounds and scaffolds in the top 1%. A Murcko scaffold analysis could explain the differences in enrichments among the four data sets. This study demonstrates that data mining methods can add a true value to the screen even when the data is contaminated with a high level of stochastic noise.


Supplemental Content

Full text links

Icon for American Chemical Society
Loading ...
Support Center