Conf Proc IEEE Eng Med Biol Soc. 2008;2008:5704-7. doi: 10.1109/IEMBS.2008.4650509.

Toward a measure of classification complexity in gene expression signatures.

Author information

Biomedical Engineering program at the University of South Florida, Tampa, Florida, USA. Vidya.Kamath@moffitt.org

Abstract

Gene expression signatures identify important genes that predict a specified outcome. In several notable diseases, such as leukemia and breast cancer, the results have been encouraging. In these datasets, many techniques work well when discriminating particular outcomes. However, these same methods, applied to other datasets, are unable to achieve similar levels of success. Given the small sample sizes common to these studies and the large dimensionality of the data, several key issues exist when attempting to construct reliable, reproducible gene signatures. The classifiers may not be sufficient to discriminate classes, or the data itself may not be sufficient to produce effective separation. In this paper, three simple measures of classification complexity are considered to explore a limit to the predictive accuracy that can be achieved in a dataset. Two independent gene expression datasets (lung and colorectal cancer) are considered, using three different outcomes on each dataset. Four different classifiers, using the t-test for feature selection, were tested on these datasets as a representative panel of classifiers. Our results indicate that Fisher's discriminant ratio provides a good measure of the complexity of the classification problem, with a high correlation between complexity and best classification accuracy across all problems (R² = 0.78). Specifically, predicting gender is a low-complexity problem, as indicated both by the complexity measure and by the classification results. More clinically oriented endpoints are more complex and have lower classification accuracies. Therefore, classification complexity can be used to estimate the maximum attainable accuracy for a problem, reducing the need to evaluate many different classifiers.
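
The abstract names Fisher's discriminant ratio but does not give its formula. As a rough illustration only, the sketch below assumes the two-class, per-feature form commonly used in the classification-complexity literature, (mean_a - mean_b)^2 / (var_a + var_b), and summarizes a dataset by the largest ratio over all features. The function names, the use of the sample variance, and the toy data are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def fisher_discriminant_ratio(X, y):
    """Per-feature Fisher ratio (mean_a - mean_b)^2 / (var_a + var_b).

    Assumed two-class form; X is (samples x features), y holds class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("expected exactly two classes")
    a = X[y == classes[0]]
    b = X[y == classes[1]]
    numerator = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    # Sample variance (ddof=1) is an assumption; each class needs >= 2 samples.
    denominator = a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)
    # Avoid division by zero for features that are constant within both classes.
    return numerator / np.maximum(denominator, np.finfo(float).tiny)


def max_fisher_ratio(X, y):
    """Dataset-level summary: the largest per-feature ratio (higher = easier)."""
    return float(fisher_discriminant_ratio(X, y).max())


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X_toy = [[0, 0], [2, 1], [10, 0], [12, 1]]
y_toy = [0, 0, 1, 1]
print(max_fisher_ratio(X_toy, y_toy))  # prints 25.0
```

Under this convention a larger ratio means better class separation along at least one gene, so a low value would flag a problem on which no classifier in the panel is likely to do well.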

PMID: 19164012
DOI: 10.1109/IEMBS.2008.4650509
[Indexed for MEDLINE]
