Stat Methods Med Res. 2018 Feb;27(2):336-351. doi: 10.1177/0962280216628901. Epub 2016 Mar 16.

Bayesian clinical classification from high-dimensional data: Signatures versus variability.

Author information

1 Institute for Mathematical and Molecular Biomedicine, King's College London, London, UK.
2 Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, Tokyo, Japan.
3 Breakthrough Breast Cancer Research Unit, Department of Research Oncology, Guy's Hospital, London, UK.
4 NIHR Biomedical Research Centre - R&D Department, Guy's Hospital, London, UK.


When data exhibit imbalance between a large number d of covariates and a small number n of samples, clinical outcome prediction is impaired by overfitting and prohibitive computation demands. Here we study two simple Bayesian prediction protocols that can be applied to data of any dimension and any number of outcome classes. Calculating Bayesian integrals and optimal hyperparameters analytically leaves only a small number of numerical integrations, and CPU demands scale as O(nd). We compare their performance on synthetic and genomic data to the mclustDA method of Fraley and Raftery. For small d they perform as well as mclustDA or better. For d = 10,000 or more mclustDA breaks down computationally, while the Bayesian methods remain efficient. This allows us to explore phenomena typical of classification in high-dimensional spaces, such as overfitting and the reduced discriminative effectiveness of signatures compared to intra-class variability.


Bayesian classification; Discriminant analysis; curse of dimensionality; outcome prediction; overfitting

[Indexed for MEDLINE]
