Machine learning applications in microbial ecology, human microbiome studies, and environmental monitoring

Comput Struct Biotechnol J. 2021 Jan 27;19:1092-1107. doi: 10.1016/j.csbj.2021.01.028. eCollection 2021.

Abstract

Advances in nucleic acid sequencing technology have expanded our ability to profile microbial diversity. The resulting large datasets of taxonomic and functional diversity are key to better understanding microbial ecology. Machine learning has proven to be a useful approach for analyzing microbial community data and making predictions about outcomes, including human and environmental health. Machine learning applied to microbial community profiles has been used to predict disease states in human health, to predict environmental quality and the presence of contamination in the environment, and as trace evidence in forensics. Machine learning has appeal as a powerful tool that can provide deep insights into microbial communities and identify patterns in microbial community data. However, machine learning models are often used as black boxes to predict a specific outcome, with little understanding of how the models arrived at their predictions. Complex machine learning algorithms often favor higher accuracy and performance at the expense of interpretability. To leverage machine learning in more translational research related to the microbiome, and to strengthen our ability to extract meaningful biological information, it is important for models to be interpretable. Here we review current trends in machine learning applications in microbial ecology, as well as some of the important challenges and opportunities for broader application of machine learning to understanding microbial communities.

Keywords: 16S rRNA; ANN, Artificial Neural Networks; ASV, Amplicon Sequence Variant; AUC, Area Under the Curve; Forensics; GB, Gradient Boosting; ML, Machine Learning; Machine learning; Marker genes; Metagenomics; PCoA, Principal Coordinate Analysis; RF, Random Forests; ROC, Receiver Operating Characteristic; SML, Supervised Machine Learning; SVM, Support Vector Machines; USML, Unsupervised Machine Learning; tSNE, t-distributed Stochastic Neighbor Embedding.

Publication types

  • Review