J Chem Inf Model. 2007 Jul-Aug;47(4):1308-18. Epub 2007 Jun 30.

Counting clusters using R-NN curves.

Author information

School of Informatics, Indiana University, Bloomington, Indiana 47406, USA. rguha@indiana.edu

Abstract

Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for nonhierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally, the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study, we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722), which uses a nearest-neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the data set, which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k, as well as with similar values, to check that the correct number of clusters was obtained. In addition, we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical data sets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
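To make the idea of a per-compound nearest-neighbor curve concrete, here is a minimal sketch. It assumes that an R-NN curve records, for one point, how many other points fall within a radius R as R grows from a small fraction up to 100% of the largest pairwise distance in the data set; the abstract does not spell this out, so the definition, the function name `rnn_curves`, and the `steps` parameter are all assumptions for illustration. How the curves are analyzed to estimate k is not described in the abstract and is not attempted here.

```python
import math


def rnn_curves(points, steps=100):
    """Return one neighbor-count curve per point (illustrative sketch).

    curves[i][s] is the number of other points within radius
    R = ((s + 1) / steps) * d_max of point i, where d_max is the
    largest pairwise Euclidean distance in the data set.
    This definition is an assumption, not taken from the abstract.
    """
    n = len(points)
    # Full pairwise distance matrix; fine for small illustrative data sets.
    dist = [[math.dist(p, q) for q in points] for p in points]
    d_max = max(max(row) for row in dist)

    curves = []
    for i in range(n):
        others = [dist[i][j] for j in range(n) if j != i]
        curve = []
        for s in range(1, steps + 1):
            # s / steps equals exactly 1.0 at the last step, so the
            # final radius is exactly d_max.
            radius = d_max * (s / steps)
            curve.append(sum(1 for d in others if d <= radius))
        curves.append(curve)
    return curves
```

Under this assumed definition, a point inside a dense group sees its count rise quickly at small R, while an isolated point's count stays low until R is large. Every curve is non-decreasing and ends at n - 1, since all other points lie within the maximum pairwise distance.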

PMID: 17602604
PMCID: PMC2543137
DOI: 10.1021/ci600541f
[Indexed for MEDLINE]
Free PMC Article
