Bioinformatics. 2019 Nov 8. pii: btz796. doi: 10.1093/bioinformatics/btz796. [Epub ahead of print]

Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning (AutoML).

Author information

1 Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
2 Department of Cardiology, Division Heart and Lungs, Utrecht, the Netherlands.
3 Department of Clinical Chemistry, Fimlab Laboratories, Tampere, Finland.
4 Department of Cardiology, Tampere University Hospital, Tampere, Finland.
5 Department of Cardio-Thoracic Surgery, Heart Center, Tampere University Hospital, Tampere, Finland.
6 Department of Forensic Medicine, Fimlab Laboratories, Tampere, Finland.
7 Department of Clinical Physiology, Tampere University Hospital, Tampere, Finland.
8 Health Data Research UK London, University College London, UK.
9 Institute of Cardiovascular Science, University College London, London, UK.

Abstract

MOTIVATION:

Selecting the optimal machine learning (ML) model for a given dataset is often challenging. Automated ML (AutoML) has emerged as a powerful tool for enabling the automatic selection of ML methods and parameter settings for the prediction of biomedical endpoints. Here, we apply the tree-based pipeline optimization tool (TPOT) to predict angiographic diagnoses of coronary artery disease (CAD). With TPOT, ML models are represented as expression trees, and optimal pipelines are discovered using a stochastic search method called genetic programming. We provide some guidelines for TPOT-based ML pipeline selection and optimization based on various clinical phenotypes and high-throughput metabolic profiles in the Angiography and Genes Study (ANGES).

RESULTS:

We analyzed nuclear magnetic resonance (NMR)-derived lipoprotein and metabolite profiles in the ANGES cohort with the goal of identifying the role of non-obstructive CAD patients in CAD diagnostics. We performed a comparative analysis of TPOT-generated ML pipelines and selected ML classifiers, optimized with a grid search approach, applied to two phenotypic CAD profiles. TPOT generated ML pipelines that outperformed the grid-search-optimized models across multiple performance metrics, including balanced accuracy and area under the precision-recall curve. With the selected models, we demonstrated that the phenotypic profile that distinguishes non-obstructive CAD patients from no-CAD patients is associated with higher precision, suggesting a discrepancy in the underlying processes between these phenotypes.
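
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how balanced accuracy and a step-wise estimate of the area under the precision-recall curve (average precision) can be computed for a binary classifier. It is not the authors' evaluation code. The function names, the 0.5 decision threshold, and the toy labels and scores are all illustrative assumptions, not ANGES data, and the average-precision routine assumes there are no tied scores.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = case, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    # Rank samples from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / rank  # precision at each newly recovered positive
    return total / pos


# Hypothetical toy data: 4 cases and 4 controls with made-up classifier scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
print("average precision:", average_precision(y_true, scores))
```

Balanced accuracy weights cases and controls equally regardless of class sizes, and the precision-recall summary focuses on how well the positive class is ranked, which is why both are commonly reported when classes are imbalanced.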

AVAILABILITY:

TPOT is freely available via http://epistasislab.github.io/tpot/.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.
