Cell Syst. 2018 Aug 22;7(2):219-226.e5. doi: 10.1016/j.cels.2018.07.005.

Statistical Binning for Barcoded Reads Improves Downstream Analyses.

Author information

1. Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, MA, USA.
2. Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA.
3. Data Sciences Platform, Broad Institute, Cambridge, MA, USA; Program in Medical and Population Genetics, Broad Institute, Cambridge, MA, USA; Stanley Center for Psychiatric Research, Broad Institute, Cambridge, MA, USA; Department of Genetics, Harvard Medical School, Boston, MA, USA.
4. Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA. Electronic address: bab@mit.edu.

Abstract

Sequencing technologies are capturing longer-range genomic information at lower error rates, enabling alignment to genomic regions that are inaccessible with short reads. However, many methods are unable to align reads to much of the genome, recognized as important in disease, and thus report erroneous results in downstream analyses. We introduce EMA, a novel two-tiered statistical binning model for barcoded read alignment that first probabilistically maps reads to potentially multiple "read clouds" and then maps them within clouds by newly exploiting the non-uniform read densities characteristic of barcoded read sequencing. EMA substantially improves downstream accuracy over existing methods, including phasing and genotyping on 10x data, with fewer false variant calls in nearly half the time. EMA effectively resolves particularly challenging alignments in genomic regions that contain nearby homologous elements, uncovering variants in the pharmacogenomically important CYP2D region and in the clinically important genes C4 (schizophrenia) and AMY1A (obesity), which go undetected by existing methods. Our work provides a framework for future generation sequencing.

KEYWORDS: barcoded short-reads; linked-reads; read mapping; third-generation sequencing
