Send to

Choose Destination
Bioinformatics. 2019 Jun 4. pii: btz470. doi: 10.1093/bioinformatics/btz470. [Epub ahead of print]

Scaling tree-based automated machine learning to biomedical big data with a feature set selector.

Author information

Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA.



Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data.


We introduce two new features implemented in TPOT that helps increase the system's scalability: Feature Set Selector and Template. Feature Set Selector (FSS) provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT's efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing genetic programming to select the best subset in the final pipeline. Template enforces type constraints with strongly typed genetic programming and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified significant association with depression severity of two modules, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual.


Detailed simulation and analysis code needed to reproduce the results in this study is available at Implementation of the new TPOT operators is available at


Supplementary data are available at Bioinformatics online.

Supplemental Content

Full text links

Icon for Silverchair Information Systems
Loading ...
Support Center