STELAR: a statistically consistent coalescent-based species tree estimation method by maximizing triplet consistency

BMC Genomics. 2020 Feb 10;21(1):136. doi: 10.1186/s12864-020-6519-y.

Abstract

Background: Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, estimating a species tree from a collection of gene trees is complicated by gene tree incongruence resulting from incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent process. Maximum likelihood and Bayesian MCMC methods can produce accurate trees, but they do not scale well to large datasets.

Results: We present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new fast, highly accurate, and statistically consistent coalescent-based method for estimating species trees from a collection of gene trees. We formalized the constrained triplet consensus (CTC) problem and showed that its solution is a statistically consistent estimate of the species tree under the multi-species coalescent (MSC) model. STELAR solves the CTC problem with an efficient dynamic programming algorithm that is both accurate and scalable. We evaluated the accuracy of STELAR against SuperTriplets, an alternative fast and highly accurate triplet-based supertree method, and against MP-EST and ASTRAL, two of the most popular and accurate coalescent-based methods. Experimental results suggest that STELAR matches the accuracy of ASTRAL and improves on MP-EST and SuperTriplets.
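
To make the triplet-agreement criterion concrete, the following minimal Python sketch scores a candidate species tree by counting how many rooted triplets it shares with each gene tree. It illustrates only the quantity being maximized, not STELAR's dynamic programming algorithm or its constrained search space. The nested-tuple tree encoding, the function names (`leaves`, `triplets`, `triplet_score`), and the toy gene trees are all assumptions made for illustration; they do not come from the paper or from the STELAR software.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf labels under a tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def triplets(tree):
    """Return the rooted triplets ab|c induced by a rooted binary tree.

    Each triplet is stored as (frozenset({a, b}), c), where c is the outgroup.
    """
    if isinstance(tree, str):
        return set()
    left, right = tree  # assumes a strictly binary tree
    found = triplets(left) | triplets(right)
    # Two taxa on one side of this node and one on the other give ab|c.
    for inside, outside in ((left, right), (right, left)):
        for a, b in combinations(sorted(leaves(inside)), 2):
            for c in leaves(outside):
                found.add((frozenset((a, b)), c))
    return found


def triplet_score(species_tree, gene_trees):
    """Total number of gene tree triplets that agree with the species tree."""
    species_triplets = triplets(species_tree)
    return sum(len(species_triplets & triplets(g)) for g in gene_trees)


# Hypothetical toy input: three gene trees on four taxa.
gene_trees = [
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]
candidate_1 = (("A", "B"), ("C", "D"))
candidate_2 = (("A", "C"), ("B", "D"))

print(triplet_score(candidate_1, gene_trees))
print(triplet_score(candidate_2, gene_trees))
```

Scoring every candidate tree in this brute-force way is only feasible for a handful of taxa, because the number of tree topologies grows super-exponentially with the number of taxa. That growth is the reason an efficient dynamic programming formulation matters in practice.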

Conclusions: Theoretical and empirical results (on both simulated and real biological datasets) suggest that STELAR is a valuable technique for species tree estimation from gene tree distributions.

Keywords: Gene tree incongruence; Incomplete lineage sorting; Multi-species coalescent process; Phylogenomics.

Publication types

  • Validation Study

MeSH terms

  • Algorithms
  • Bayes Theorem
  • Computer Simulation / statistics & numerical data*
  • Genetic Speciation*
  • Phylogeny*
  • Software*