Resolving Conflicting Predictions from Multimapping Reads

J Comput Biol. 2016 Mar;23(3):203-17. doi: 10.1089/cmb.2015.0164. Epub 2016 Jan 8.

Abstract

The first step in the analysis of data produced by ultra-high-throughput next-generation sequencing technology is to map short sequence "reads" to a reference genome, if available. Sequencing errors, repeat regions, and polymorphisms may lead a read to align to multiple locations in the genome reasonably well. While ignoring such multimapping reads, or some of their alignments, will reduce the sensitivity of almost any type of downstream analysis (e.g., detecting structural variants), erroneous mappings will typically yield false positive predictions. Here we propose a framework that aims to identify true predictions among a large set of candidate predictions by selecting for each read a unique mapping that collectively imply conflict-free predictions. We formulate this problem as the maximum facility location problem, for which we propose LP-rounding heuristics. We provide a theoretic guarantee on the quality of the solution and demonstrate the utility of our algorithm in resolving conflicting deletions implied by simulated reads mapping ambiguously to Craig Venter's genome model and Illumina sequencing reads of the well-studied NA12878 individual.

Keywords: algorithms; combinatorial optimization.

MeSH terms

  • Algorithms*
  • Contig Mapping / methods*
  • Contig Mapping / standards
  • Genome, Human*
  • Human Genome Project
  • Humans
  • Polymorphism, Genetic