Bioinformatics. 2019 Nov 8. pii: btz796. doi: 10.1093/bioinformatics/btz796. [Epub ahead of print]

Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning (AutoML).

Author information

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
Department of Cardiology, Division Heart and Lungs, Utrecht, the Netherlands.
Department of Clinical Chemistry, Fimlab Laboratories, Tampere, Finland.
Department of Cardiology, Tampere University Hospital, Tampere, Finland.
Department of Cardio-Thoracic Surgery, Heart Center, Tampere University Hospital, Tampere, Finland.
Department of Forensic Medicine, Fimlab Laboratories, Tampere, Finland.
Department of Clinical Physiology, Tampere University Hospital, Tampere, Finland.
Health Data Research UK London, University College London, UK.
Institute of Cardiovascular Science, University College London, London, UK.



Selecting the optimal machine learning (ML) model for a given dataset is often challenging. Automated ML (AutoML) has emerged as a powerful tool for automatically selecting ML methods and parameter settings for the prediction of biomedical endpoints. Here, we apply the tree-based pipeline optimization tool (TPOT) to predict angiographic diagnoses of coronary artery disease (CAD). With TPOT, ML models are represented as expression trees, and optimal pipelines are discovered using a stochastic search method called genetic programming. We provide guidelines for TPOT-based ML pipeline selection and optimization based on various clinical phenotypes and high-throughput metabolic profiles in the Angiography and Genes Study (ANGES).
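The idea of representing a model as an expression tree and searching over trees stochastically can be illustrated with a toy sketch. The code below is not TPOT's implementation: the operator set, the threshold-at-zero classifier, and every function name are assumptions made for illustration, and the search uses selection and mutation only (no crossover), which is a simplification of genetic programming.

```python
import random

# Hypothetical operator set; each pipeline is an expression tree stored as a
# nested tuple (operator_name, child), with the string "input" as the leaf.
OPERATORS = {
    "square": lambda v: v * v,
    "negate": lambda v: -v,
    "shift": lambda v: v - 2.0,
    "abs": lambda v: abs(v),
}

def evaluate(tree, x):
    """Apply the tree's operators to x, innermost operator first."""
    if tree == "input":
        return x
    name, child = tree
    return OPERATORS[name](evaluate(child, x))

def predict(tree, x):
    """Classify as 1 when the transformed value is positive, else 0."""
    return 1 if evaluate(tree, x) > 0 else 0

def fitness(tree, data):
    """Fraction of (x, label) pairs that the tree classifies correctly."""
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def random_tree(rng, max_depth=3):
    """Wrap the input leaf in up to max_depth randomly chosen operators."""
    tree = "input"
    for _ in range(rng.randint(0, max_depth)):
        tree = (rng.choice(sorted(OPERATORS)), tree)
    return tree

def mutate(tree, rng):
    """Either strip the outermost operator or wrap the tree in a new one."""
    if tree != "input" and rng.random() < 0.5:
        return tree[1]
    return (rng.choice(sorted(OPERATORS)), tree)

def evolve(data, generations=20, population_size=20, seed=0):
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    # The bare "input" tree is always included as a baseline candidate.
    population = ["input"] + [random_tree(rng) for _ in range(population_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, data), reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=lambda t: fitness(t, data))

if __name__ == "__main__":
    # Toy data: the label is 1 exactly when x squared exceeds 2.
    toy = [(x, 1 if x * x > 2 else 0) for x in (-3, -2, -1, 0, 1, 2, 3)]
    best = evolve(toy)
    print(best, fitness(best, toy))
```

Because the fitter half of each generation survives unchanged, the best fitness never decreases from one generation to the next. A real pipeline search would score candidates by cross-validation rather than on the training data.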


We analyzed nuclear magnetic resonance (NMR)-derived lipoprotein and metabolite profiles in the ANGES cohort with the goal of identifying the role of non-obstructive CAD patients in CAD diagnostics. We compared TPOT-generated ML pipelines with selected ML classifiers optimized by grid search, applying both to two phenotypic CAD profiles. TPOT generated ML pipelines that outperformed the grid-search-optimized models across multiple performance metrics, including balanced accuracy and area under the precision-recall curve. With the selected models, we demonstrated that the phenotypic profile that distinguishes non-obstructive CAD patients from patients without CAD is associated with higher precision, suggesting a discrepancy in the underlying processes between these phenotypes.
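The two metrics named above can be sketched in a few lines for the binary case. This is a minimal illustration, not the study's evaluation code: the abstract does not say how the area under the precision-recall curve was computed, so the step-wise average-precision summation used here is an assumption, as are the function names. Labels are assumed to be 0 or 1, with at least one example of each class.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Sums precision at each distinct score threshold, weighted by the
    increase in recall since the previous threshold.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(y_true)
    tp = 0
    seen = 0
    prev_recall = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample that shares this score so ties count once.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            seen += 1
            i += 1
        recall = tp / total_pos
        precision = tp / seen
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area
```

Balanced accuracy weights both classes equally regardless of their size, and the precision-recall curve focuses on the positive class, which is why both are commonly reported when class frequencies are uneven.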


TPOT is freely available via


Supplementary data are available at Bioinformatics online.
