An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data

Bioinformatics. 2003 Nov 1;19(16):2131-40. doi: 10.1093/bioinformatics/btg296.

Abstract

Motivation: Current Self-Organizing Maps (SOMs) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches.

Results: The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, therefore labelled as uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in data (cancerous and normal).

Availability: JAVA software of dynamic SOM tree algorithm is available upon request for academic use.

Supplementary information: A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf

Publication types

  • Comparative Study
  • Evaluation Study
  • Validation Study

MeSH terms

  • Algorithms*
  • Biomarkers, Tumor / genetics*
  • Cluster Analysis
  • Colonic Neoplasms / classification*
  • Colonic Neoplasms / genetics*
  • Gene Expression Profiling / methods*
  • Gene Expression Regulation, Neoplastic / genetics
  • Genetic Markers / genetics
  • Genetic Testing / methods
  • Humans
  • Leukemia / classification*
  • Leukemia / genetics*
  • Oligonucleotide Array Sequence Analysis / methods*
  • Reproducibility of Results
  • Sensitivity and Specificity

Substances

  • Biomarkers, Tumor
  • Genetic Markers