DSK: k-mer counting with very low memory usage

Guillaume Rizk; Dominique Lavenier; Rayan Chikhi

doi:10.1093/bioinformatics/btt020

Bioinformatics. 2013 Mar 1;29(5):652-3. doi: 10.1093/bioinformatics/btt020. Epub 2013 Jan 16.

Authors

Guillaume Rizk¹, Dominique Lavenier, Rayan Chikhi

Affiliation

¹ Algorizk, 75013 Paris, France.

PMID: 23325618
DOI: 10.1093/bioinformatics/btt020

Abstract

Summary: Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state-of-the-art k-mer counting methods require that a large data structure reside in memory. Such a structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.
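
The partition-then-count idea described in the summary can be illustrated with a minimal Python sketch. This is not DSK's implementation: the function names, the CRC32-based partitioning, the one-k-mer-per-line text files and the `n_partitions` and `min_abundance` parameters are all assumptions made for illustration, and the sketch does not attempt to reproduce how DSK sizes its partitions to fit a given memory and disk budget.

```python
import os
import tempfile
import zlib
from collections import Counter


def kmers(read, k):
    """Yield every substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers_on_disk(reads, k, n_partitions=4, min_abundance=1):
    """Yield (kmer, count) pairs, holding one partition in memory at a time.

    Hypothetical helper for illustration only; parameters are assumptions.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part_{p}.txt") for p in range(n_partitions)]

        # Pass 1: stream the reads and write each k-mer to the partition
        # chosen by a hash, so identical k-mers always land in the same file.
        files = [open(path, "w") for path in paths]
        try:
            for read in reads:
                for kmer in kmers(read, k):
                    p = zlib.crc32(kmer.encode("ascii")) % n_partitions
                    files[p].write(kmer + "\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: load one partition at a time into a temporary hash table,
        # then traverse the table to report counts.
        for path in paths:
            table = Counter()
            with open(path) as f:
                for line in f:
                    table[line.rstrip("\n")] += 1
            for kmer, count in table.items():
                if count >= min_abundance:  # optional low-abundance filter
                    yield kmer, count
```

For example, `dict(count_kmers_on_disk(["ACGTACGT", "CGTA"], k=3))` gives counts of 2 for ACG, 3 for CGT, 2 for GTA and 1 for TAC. Because every occurrence of a given k-mer is routed to the same partition, each per-partition table already holds final counts and no merging step is needed; raising `n_partitions` lowers the peak table size at the cost of more files.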

Availability: http://minia.genouest.org/dsk

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Genome, Human
Humans
Sequence Analysis, DNA / methods*
Sequence Analysis, RNA / methods*
Software*