Brief Bioinform. 2017 Mar 1;18(2):183-194. doi: 10.1093/bib/bbw011.

Effect of lossy compression of quality scores on variant calling.

Author information

1. Electrical Engineering department, 350 Serra Mall, Stanford, CA, USA.
2. Department of Electrical Engineering, Stanford University, Stanford, CA, USA.
3. Department of Medicine, Stanford University, Stanford, CA, USA.
4. Stanford Center for Inherited Cardiovascular Disease, Stanford University, Stanford, CA, USA.
5. Department of Genetics, Stanford University, Stanford, CA, USA.

Abstract

Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we analyze the accuracy of the variant calling output, both for the original data and for data whose quality scores have undergone lossy compression. We show that lossy compression can significantly reduce storage requirements while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
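
The comparison described above, variant calls from original versus lossily compressed quality scores, each scored against a gold standard, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `evaluate_calls`, the representation of a variant as a `(chrom, pos, ref, alt)` tuple, and the choice of sensitivity, precision, and F-score as metrics are all assumptions made for illustration, and the variant sets are toy data.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant representation: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Metrics(NamedTuple):
    sensitivity: float
    precision: float
    f_score: float


def evaluate_calls(called: Set[Variant], gold: Set[Variant]) -> Metrics:
    """Score a set of variant calls against a gold standard by exact match."""
    tp = len(called & gold)   # called and present in the gold standard
    fp = len(called - gold)   # called but absent from the gold standard
    fn = len(gold - called)   # in the gold standard but not called
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + precision
    f_score = 2 * sensitivity * precision / denom if denom else 0.0
    return Metrics(sensitivity, precision, f_score)


# Toy data, invented for illustration only.
gold = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
        ("chr2", 50, "G", "GA"), ("chr2", 75, "T", "C")}
calls_original = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
                  ("chr2", 50, "G", "GA"), ("chr3", 10, "A", "C")}
calls_lossy = calls_original | {("chr2", 75, "T", "C")}

print("original:", evaluate_calls(calls_original, gold))
print("lossy:   ", evaluate_calls(calls_lossy, gold))
```

Running the same scoring on both call sets makes the two outcomes in the abstract concrete: metrics that stay comparable after compression, or, as in this toy case, metrics that improve.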

KEYWORDS: Genomic data; lossy compression; quality scores; variant calling

PMID: 26966283
PMCID: PMC5862240
DOI: 10.1093/bib/bbw011
[Indexed for MEDLINE]
