Bioinformatics. 2012 Sep 1;28(17):2215-22.

Improved gap size estimation for scaffolding algorithms.

Author information

Department of Computational Biology, KTH Royal Institute of Technology, Science for Life Laboratory, School of Computer Science and Communication, Solna, Sweden. ksahlin@csc.kth.se

Abstract

MOTIVATION:

One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read-pairs. Scaffolding provides estimates of the order, relative orientation and distance between contigs. We have found that contig distance estimates are generally strongly biased and based on false assumptions. Since erroneous distance estimates can mislead subsequent analysis, it is important to provide unbiased estimates of contig distance.

RESULTS:

In this article, we show that state-of-the-art scaffolding programs use an incorrect model for gap size estimation. We discuss why current maximum likelihood estimators are biased and describe the different cases of bias that arise. Furthermore, we provide a model for the distribution of reads that span a gap and derive the maximum likelihood equation for the gap length. We motivate why this estimate is sound and show empirically that it outperforms gap estimators in popular scaffolding programs. Our results have consequences for scaffolding software, for structural variation detection and for library insert-size estimation as commonly performed by read aligners.
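
The kind of bias described above can be illustrated with a small simulation. The sketch below is not the reference implementation and does not reproduce the article's equations; it rests on simplifying assumptions that are ours: insert sizes are normally distributed, fragment start positions are uniform, read length is ignored, and both contigs are much longer than any fragment. Under these assumptions, longer fragments are more likely to span the gap, so the naive estimate (library mean minus mean observed span) tends to come out too small. The function names, parameter values, and the grid-search likelihood are all hypothetical choices made for illustration.

```python
import math
import random


def simulate_observations(mu, sigma, gap, contig_len, n_draws, seed=0):
    """Draw fragments and keep those that span the gap.

    Layout (assumed): contig 1 | gap | contig 2, with equal contig lengths.
    Returns o = fragment length minus gap, i.e. the part lying on contigs.
    """
    rng = random.Random(seed)
    total = 2 * contig_len + gap
    obs = []
    for _ in range(n_draws):
        x = rng.gauss(mu, sigma)      # fragment (insert) length
        s = rng.uniform(0, total)     # fragment start position
        if x <= 0 or s + x > total:
            continue                  # fragment does not fit on the sequence
        if s < contig_len and s + x > contig_len + gap:
            obs.append(x - gap)       # starts in contig 1, ends in contig 2
    return obs


def naive_gap(obs, mu):
    """Naive estimate: assumes spanning fragments have the library mean."""
    return mu - sum(obs) / len(obs)


def ml_gap(obs, mu, sigma, max_gap):
    """Grid-search maximum likelihood under the simplified model.

    Assumption: a fragment of length x spans a gap d with probability
    proportional to max(x - d, 0), so the density of an observation o is
    proportional to o * f(o + d) / g(d), where f is the normal density and
    g(d) = E[max(X - d, 0)].
    """
    o_bar = sum(obs) / len(obs)

    def loglik(d):
        z = (mu - d) / sigma
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        g = (mu - d) * cdf + sigma * pdf  # E[max(X - d, 0)] for normal X
        # Mean log-likelihood per observation, dropping terms constant in d.
        return -((d - mu + o_bar) ** 2) / (2.0 * sigma ** 2) - math.log(g)

    return max(range(max_gap + 1), key=loglik)


# Hypothetical parameter values chosen only for illustration.
MU, SIGMA, TRUE_GAP, CONTIG_LEN = 500.0, 100.0, 300, 5000

obs = simulate_observations(MU, SIGMA, TRUE_GAP, CONTIG_LEN, n_draws=200_000)
naive = naive_gap(obs, MU)
ml = ml_gap(obs, MU, SIGMA, max_gap=800)

print(f"spanning fragments: {len(obs)}")
print(f"true gap: {TRUE_GAP}, naive estimate: {naive:.1f}, ML estimate: {ml}")
```

In this toy setting the naive estimate falls noticeably below the true gap, while the likelihood that accounts for the spanning condition recovers a value close to it. Short contigs and non-zero read lengths, which the article treats as further cases of bias, are deliberately left out of this sketch.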

AVAILABILITY:

A reference implementation is provided at https://github.com/SciLifeLab/gapest.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.

PMID:
22923455
DOI:
10.1093/bioinformatics/bts441
[Indexed for MEDLINE]
