Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

Metin Balaban; Nishat Anjum Bristy; Ahnaf Faisal; Md Shamsuzzoha Bayzid; Siavash Mirarab

doi:10.1093/bioadv/vbac055

Bioinform Adv. 2022 Aug 12;2(1):vbac055. doi: 10.1093/bioadv/vbac055. eCollection 2022.

Authors

Metin Balaban¹, Nishat Anjum Bristy², Ahnaf Faisal², Md Shamsuzzoha Bayzid², Siavash Mirarab¹

Affiliations

¹ Bioinformatics and System Biology Program, University of California San Diego, San Diego, CA 92093, USA.
² Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh.

Abstract

While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite their appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor. If we can estimate the frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement that are due to random chance as opposed to homology. We show in simulations that alignment-free distances can be highly accurate when genomes evolve under the assumed models, and we study the accuracy on assembled and unassembled biological data.
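The letter-replacement idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical. It assumes one particular replacement, merging purines (A, G) into R and pyrimidines (C, T) into Y, chosen only for illustration; the abstract does not state which replacements the method uses. Because the strand is unknown, each k-mer is stored in canonical form, that is, as the lexicographically smaller of itself and its reverse complement. The sketch stops at the Jaccard indices and does not attempt to derive TK4 parameters or to correct for chance matches.

```python
# Illustrative sketch only: Jaccard index between canonical k-mer sets,
# before and after merging letters. All names here are hypothetical.

DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Assumed replacement for illustration: purines -> R, pyrimidines -> Y.
RY_MERGE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
# Complementing a purine gives a pyrimidine and vice versa.
RY_COMPLEMENT = {"R": "Y", "Y": "R"}


def reverse_complement(seq, complement):
    """Reverse-complement seq under the given complement map."""
    return "".join(complement[c] for c in reversed(seq))


def canonical_kmers(seq, k, complement):
    """Set of canonical k-mers (strand-independent representation)."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer, complement)))
    return kmers


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_letters(seq, merge):
    """Replace every letter of seq according to the merge map."""
    return "".join(merge[c] for c in seq)


# Two toy sequences that differ by a single transition (A -> G).
seq_a = "ACGTTGCAAGCTTAGC"
seq_b = "ACGTTGCGAGCTTAGC"
k = 4

j_plain = jaccard(
    canonical_kmers(seq_a, k, DNA_COMPLEMENT),
    canonical_kmers(seq_b, k, DNA_COMPLEMENT),
)
j_merged = jaccard(
    canonical_kmers(merge_letters(seq_a, RY_MERGE), k, RY_COMPLEMENT),
    canonical_kmers(merge_letters(seq_b, RY_MERGE), k, RY_COMPLEMENT),
)
print(j_plain, j_merged)
```

In this toy case the transition lowers the Jaccard index on the original alphabet but becomes invisible once A and G are merged, so comparing the two indices carries information about which kind of substitution occurred. With a two-letter alphabet and realistic genome sizes, many k-mers would also match purely by chance, which is the effect the abstract says must be accounted for on larger genomes.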

Availability and implementation: Our software is available as open source at https://github.com/nishatbristy007/NSB.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.

Grants and funding

R35 GM142725/GM/NIGMS NIH HHS/United States