Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

David Pellow; Darya Filippova; Carl Kingsford

doi:10.1089/cmb.2016.0155


J Comput Biol. 2017 Jun;24(6):547-557. doi: 10.1089/cmb.2016.0155. Epub 2016 Nov 9.

Authors

David Pellow¹, Darya Filippova², Carl Kingsford³

Affiliations

¹ The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel.
² Roche Sequencing Solutions, Pleasanton, California.
³ Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.

Abstract

Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR by up to 30× with little or no additional memory and with set containment queries that are only 1.3 to 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
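The overlap idea in the abstract can be illustrated with a small sketch. The abstract does not specify the kBF variants, so what follows is only one plausible reading, not the authors' implementation: a queried k-mer is accepted only if the Bloom filter reports it present and also reports at least one k-mer that could overlap it by k-1 bases in a read. The names `BloomFilter`, `kmers`, `neighbors`, and `overlap_query`, the SHA-256 double hashing, and the filter sizes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from one digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a nonzero step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer, alphabet="ACGT"):
    """The 8 k-mers that could precede or follow kmer with a k-1 overlap."""
    left = [c + kmer[:-1] for c in alphabet]
    right = [kmer[1:] + c for c in alphabet]
    return left + right


def overlap_query(bf, kmer):
    """Accept kmer only if it and at least one possible neighbor are in bf.

    Assumption: every stored k-mer came from a read longer than k, so each
    true k-mer has at least one true neighbor in the filter.
    """
    if kmer not in bf:
        return False
    return any(n in bf for n in neighbors(kmer))


# Hypothetical usage: index the 5-mers of one short read, then query them.
read = "ACGTTGCAAGCTTAGC"
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
for km in kmers(read, 5):
    bf.add(km)
stored_ok = all(overlap_query(bf, km) for km in kmers(read, 5))
```

Because a Bloom filter has no false negatives, every k-mer taken from the read still passes the stricter query. A false positive must now coincide with a hit on one of its eight possible neighbors, which is the intuition behind a lower FPR at the cost of extra lookups per query.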

Keywords: Bloom filters; efficient data structures; genomics; k-mers; string algorithms.

MeSH terms

Algorithms*
Computational Biology / methods*
Computer Simulation
Humans
Probability
Sequence Analysis, DNA / methods*
Software