Biotechnol J. 2018 Jan;13(1). doi: 10.1002/biot.201700503. Epub 2017 Dec 6.

Analysing and Navigating Natural Products Space for Generating Small, Diverse, But Representative Chemical Libraries.

O'Hagan S(1,2), Kell DB(1,2,3).

Author information

(1) Dr. S. O'Hagan, Prof. D. B. Kell, School of Chemistry, The University of Manchester, 131 Princess St, Manchester M1 7DN, UK.
(2) Dr. S. O'Hagan, Prof. D. B. Kell, The Manchester Institute of Biotechnology, The University of Manchester, 131 Princess St, Manchester M1 7DN, UK.
(3) Prof. D. B. Kell, Centre for the Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131 Princess St, Manchester M1 7DN, UK.


Armed with the digital availability of two natural products libraries, amounting to some 195 885 molecular entities, we ask the question of how we can best sample from them to maximize their "representativeness" in smaller and more usable libraries of 96, 384, 1152, and 1920 molecules. The term "representativeness" is intended to include diversity, but for numerical reasons (and the likelihood of being able to perform a QSAR) it is necessary to focus on areas of chemical space that are more highly populated. Encoding chemical structures as fingerprints using the RDKit "patterned" algorithm, we first assess the granularity of the natural products space using a simple clustering algorithm, showing that there are major regions of "denseness" but also a great many very sparsely populated areas. We then apply a "hybrid" hierarchical K-means clustering algorithm to the data to produce more statistically robust clusters from which representative and appropriate numbers of samples may be chosen. There is necessarily again a trade-off between cluster size and cluster number, but within these constraints, libraries containing 384 or 1152 molecules can be found that come from clusters that represent some 18% and 30% of the whole chemical space, with cluster sizes of, respectively, 50 and 27 or above, just about sufficient to perform a QSAR. By using the online availability of molecules via the Molport system, we are also able to construct (and, for the first time, provide the contents of) a small virtual library of available molecules that provides effective coverage of the chemical space described. Consistent with this, the average molecular similarity of the contents of the libraries developed is considerably smaller than that of the original libraries. The suggested libraries may have use in molecular or phenotypic screening, including for determining possible transporter substrates.
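
The sampling idea in the abstract (cluster the fingerprints, keep only clusters large enough to be useful, take one representative from each, then compare average similarities) can be illustrated with a minimal sketch. This is not the authors' hybrid hierarchical K-means procedure: it substitutes a simple greedy "leader" clustering on Tanimoto similarity, and the fingerprints are hypothetical toy sets of on-bit indices rather than RDKit output. All function names, the threshold, and the minimum cluster size are assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(fps):
    """Average Tanimoto similarity over all distinct pairs of fingerprints."""
    pairs = list(combinations(fps, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def leader_cluster(fps, threshold):
    """Greedy leader clustering (a stand-in, not the paper's algorithm).

    Each fingerprint joins the first cluster whose leader it resembles
    at or above `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # lists of indices; the first index in each list is the leader
    for i, fp in enumerate(fps):
        for members in clusters:
            if tanimoto(fp, fps[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def medoid(fps, members):
    """Index of the member with the highest total similarity to its cluster."""
    return max(members, key=lambda i: sum(tanimoto(fps[i], fps[j]) for j in members))


def pick_library(fps, threshold, min_cluster_size):
    """Return (representative indices, fraction of molecules in the kept clusters)."""
    clusters = leader_cluster(fps, threshold)
    kept = [c for c in clusters if len(c) >= min_cluster_size]
    coverage = sum(len(c) for c in kept) / len(fps)
    return [medoid(fps, c) for c in kept], coverage


# Hypothetical toy fingerprints: two dense regions and one isolated molecule.
toy_fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 5},
    {10, 11, 12},
    {10, 11, 13},
    {20, 21},
]

library, coverage = pick_library(toy_fps, threshold=0.5, min_cluster_size=2)
library_similarity = mean_pairwise_similarity([toy_fps[i] for i in library])
full_similarity = mean_pairwise_similarity(toy_fps)
```

In this toy case the isolated molecule is excluded because its cluster falls below the minimum size, which mirrors the trade-off between cluster size and cluster number described above, and the selected representatives are less similar to one another on average than the full set is.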


cheminformatics; drug transporters; encodings; endogenites; maximum common substructure; metabolomics

[Indexed for MEDLINE]