Artif Intell Med. 2004 Jun;31(2):155-67.

Multivariate selection of genetic markers in diagnostic classification.

Author information

Decision Systems Group, Division of Health Sciences and Technology, Harvard and MIT, Brigham and Women's Hospital, Thorn 310, 75 Francis Street, Boston, MA 02115, USA.

Abstract

Analysis of gene expression data obtained from microarrays presents a new set of challenges to machine learning modeling. In this domain, in which the number of variables far exceeds the number of cases, identifying relevant genes or groups of genes that are good markers for a particular classification is as important as achieving good classification performance. Although several machine learning algorithms have been proposed to address the latter, identification of gene markers has not been systematically pursued. In this article, we investigate several algorithms for selecting gene markers for classification. We test these algorithms using logistic regression, as this is a simple and efficient supervised learning algorithm. We demonstrate, using 10 different data sets, that a conditionally univariate algorithm constitutes a viable choice if a researcher is interested in quickly determining a set of gene expression levels that can serve as markers for disease. We show that the classification performance of logistic regression is not very different from that of more sophisticated algorithms that have been applied in previous studies, and that the gene selection in the logistic regression algorithm is reasonable in both cases. Furthermore, the algorithm is simple, its theoretical basis is well established, and our user-friendly implementation is now freely available on the internet, serving as a benchmarking tool for the development of new algorithms.

PMID: 15219292
DOI: 10.1016/j.artmed.2004.01.011
[Indexed for MEDLINE]
