Estimating dataset size requirements for classifying DNA microarray data

Sayan Mukherjee; Pablo Tamayo; Simon Rogers; Ryan Rifkin; Anna Engle; Colin Campbell; Todd R Golub; Jill P Mesirov

doi:10.1089/106652703321825928

J Comput Biol. 2003;10(2):119-42. doi: 10.1089/106652703321825928.

Authors

Sayan Mukherjee¹, Pablo Tamayo, Simon Rogers, Ryan Rifkin, Anna Engle, Colin Campbell, Todd R Golub, Jill P Mesirov

Affiliation

¹ Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139, USA. sayan@genome.wi.mit.edu

PMID: 12804087
DOI: 10.1089/106652703321825928

Abstract

A statistical methodology for estimating dataset size requirements for classifying microarray data using learning curves is introduced. The goal is to use existing classification results to estimate dataset size requirements for future classification experiments and to evaluate the gain in accuracy and significance of classifiers built with additional data. The method is based on fitting inverse power-law models to construct empirical learning curves. It also includes a permutation test procedure to assess the statistical significance of classification performance for a given dataset size. This procedure is applied to several molecular classification problems representing a broad spectrum of levels of complexity.
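As an illustration of the two ingredients named in the abstract, the sketch below fits an inverse power-law learning curve and computes a permutation-test p-value. It is a minimal sketch under stated assumptions, not the authors' implementation: the curve form error(n) = a * n^(-alpha) + b, the grid scan over b, and the function names `fit_inverse_power_law`, `predict_error`, and `permutation_p_value` are all hypothetical choices made here for illustration. The input error rates are assumed to be strictly positive.

```python
import math


def fit_inverse_power_law(sizes, errors, b_steps=200):
    """Fit errors ~ a * n**(-alpha) + b (assumed curve form).

    Scans candidate asymptotic errors b in [0, min(errors)) and, for each,
    solves an ordinary least-squares line in log-log space. The candidate
    with the smallest squared error in the original space is returned.
    """
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b_limit = min(errors)
    best = None
    for k in range(b_steps):
        b = b_limit * k / b_steps  # always strictly below min(errors)
        ys = [math.log(e - b) for e in errors]
        mean_y = sum(ys) / len(ys)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        a, alpha = math.exp(intercept), -slope
        sse = sum((a * n ** (-alpha) + b - e) ** 2
                  for n, e in zip(sizes, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    return best[1], best[2], best[3]


def predict_error(n, a, alpha, b):
    """Extrapolate the fitted learning curve to a dataset of size n."""
    return a * n ** (-alpha) + b


def permutation_p_value(observed_error, permuted_errors):
    """Fraction of label-permuted runs whose error is at most the observed
    error, with the usual +1 correction so the p-value is never zero."""
    hits = sum(1 for e in permuted_errors if e <= observed_error)
    return (hits + 1) / (len(permuted_errors) + 1)


# Synthetic example: error rates generated from a known curve.
sizes = [10, 20, 40, 80, 160]
errors = [0.8 * n ** (-0.5) + 0.1 for n in sizes]
a, alpha, b = fit_inverse_power_law(sizes, errors)
projected = predict_error(1000, a, alpha, b)
```

In practice the `errors` list would hold cross-validated error rates measured on subsamples of an existing dataset, and `permuted_errors` would come from repeating the same classification with randomly shuffled class labels.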

Publication types

Comparative Study
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Computational Biology / methods
Computer Simulation
Gene Expression Profiling / classification
Gene Expression Profiling / methods*
Humans
Models, Molecular
Neoplasms / classification*
Neoplasms / genetics*
Neoplasms / metabolism
Oligonucleotide Array Sequence Analysis*