Entropy-based cluster validation and estimation of the number of clusters in gene expression data

J Bioinform Comput Biol. 2012 Oct;10(5):1250011. doi: 10.1142/S0219720012500114. Epub 2012 Jun 26.

Abstract

Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not analyze the stability of the groupings produced by a clustering algorithm. Building on approaches that assess the predictive power or stability of a partitioning, we propose a new cluster validation measure and a selection procedure for determining a suitable number of clusters. The validity measure is based on estimating the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme (consensus clustering). In the proposed selection procedure, a stable clustering result is identified with reference to the validity measure under the null hypothesis, which encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to assess clustering results on several datasets. The proposed procedure produced accurate and robust estimates of the number of clusters, which are in agreement with biological knowledge and gold standards of cluster quality.
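The procedure outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula for the validity measure, so everything below is an assumption made for illustration and not the authors' implementation: "clearness" is approximated by the mean binary entropy of the consensus matrix entries, the base clusterer is a small k-means with farthest-first seeding, and the null data set is built by shuffling each feature independently. All function names are hypothetical.

```python
import math
import random

def _dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(points, k, rng, iters=25):
    """Lloyd's k-means with farthest-first seeding; returns one label per point."""
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(_dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: _dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def consensus_matrix(points, k, n_resamples=30, frac=0.8, seed=0):
    """Fraction of resamples in which two co-sampled points share a cluster.

    Pairs that were never drawn together are reported as None.
    """
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += int(labels[a] == labels[b])
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else None
             for j in range(n)] for i in range(n)]

def consensus_entropy(matrix):
    """Assumed 'clearness' proxy: mean binary entropy (bits) of off-diagonal entries.

    0 means every pair is always together or always apart (a perfectly clear matrix).
    """
    total, count = 0.0, 0
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):
            m = matrix[i][j]
            if m is None:
                continue
            count += 1
            if 0.0 < m < 1.0:
                total -= m * math.log2(m) + (1 - m) * math.log2(1 - m)
    return total / count if count else 0.0

def permute_features(points, seed=0):
    """Assumed null model: shuffle each feature column independently."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*points)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(row) for row in zip(*columns)]

def validity_gap(points, ks, **kwargs):
    """Entropy on permuted data minus entropy on the original data, per candidate k."""
    null = permute_features(points)
    return {k: consensus_entropy(consensus_matrix(null, k, **kwargs))
               - consensus_entropy(consensus_matrix(points, k, **kwargs))
            for k in ks}
```

Under these assumptions, one would call `validity_gap(points, range(2, 8))` on a list of equal-length numeric tuples and favor values of k where the gap is large, that is, where the original data yield a much clearer consensus matrix than the permuted data. The distance between validity plots used in the paper may be defined differently.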

Publication types

  • Validation Study

MeSH terms

  • Algorithms*
  • Cluster Analysis
  • Entropy*
  • Gene Expression Profiling / methods*