Format

Send to

Choose Destination
See comment in PubMed Commons below
Bioinformatics. 2014 Dec 15;30(24):3524-31. doi: 10.1093/bioinformatics/btu584. Epub 2014 Aug 28.

Merging of multi-string BWTs with applications.

Author information

1
Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA.

Abstract

MOTIVATION:

The throughput of genomic sequencing has increased to the point that is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.

RESULTS:

We present a novel algorithm that merges multi-string BWTs in [Formula: see text] time where LCS is the length of their longest common substring between any of the inputs, and N is the total length of all inputs combined (number of symbols) using [Formula: see text] bits where F is the number of multi-string BWTs merged. This merged multi-string BWT is also shown to have a higher compressibility compared with the input multi-string BWTs separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications.

PMID:
25172922
PMCID:
PMC4318930
DOI:
10.1093/bioinformatics/btu584
[Indexed for MEDLINE]
Free PMC Article
PubMed Commons home

PubMed Commons

0 comments
How to join PubMed Commons

    Supplemental Content

    Full text links

    Icon for Silverchair Information Systems Icon for PubMed Central
    Loading ...
    Support Center