Bioinformatics. 2015 Mar 1;31(5):720-7. doi: 10.1093/bioinformatics/btu725. Epub 2014 Oct 30.

Fast and accurate site frequency spectrum estimation from low coverage sequence data.

Author information

1
Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095, USA, Department of Human Genetics and Biomathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA and Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.
2
Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095, USA, Department of Human Genetics and Biomathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA and Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.

Abstract

MOTIVATION:

The distribution of allele frequencies across polymorphic sites, also known as the site frequency spectrum (SFS), is of primary interest in population genetics. It is a complete summary of sequence variation at unlinked sites and, more generally, its shape reflects underlying population genetic processes. One practical challenge is that inferring the SFS from low coverage sequencing data directly from genotype calls can lead to significant bias. To reduce this bias, previous studies have used a statistical method that estimates the SFS directly from sequencing data by first computing the site allele frequency (SAF) likelihood for each site (i.e. the likelihood that a site has each possible allele frequency, conditional on the observed sequence reads) using a dynamic programming (DP) algorithm. Although this method produces an accurate SFS, the time needed to compute the SAF likelihood is quadratic in the number of samples sequenced.
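To make the quadratic cost concrete, the following is a minimal Python sketch of a DP of this kind. It is not taken from the paper: it assumes diploid samples, per-individual genotype likelihoods for 0, 1 or 2 copies of the allele, and the commonly used recurrence in which heterozygotes are weighted by 2 and the result is normalised by a binomial coefficient. The function name `saf_likelihood` and its input layout are hypothetical.

```python
from math import comb


def saf_likelihood(genotype_likelihoods):
    """Likelihood of each allele count 0..2n at one site (illustrative sketch).

    genotype_likelihoods: one (g0, g1, g2) tuple per diploid individual,
    the likelihood of the reads given 0, 1 or 2 copies of the allele.
    This input format is an assumption made for the example.
    """
    n = len(genotype_likelihoods)
    h = [1.0]  # with no individuals, the only possible allele count is 0
    for g0, g1, g2 in genotype_likelihoods:
        new = [0.0] * (len(h) + 2)  # each individual adds up to 2 alleles
        for j, value in enumerate(h):
            new[j] += g0 * value
            new[j + 1] += 2.0 * g1 * value  # two orderings of a heterozygote
            new[j + 2] += g2 * value
        h = new
    # Normalise by the number of ways to place j alleles on 2n chromosomes.
    return [h[j] / comb(2 * n, j) for j in range(2 * n + 1)]
```

The inner loop visits every allele count reachable so far, so individual i costs on the order of 2i operations and the whole site costs on the order of n squared operations for n individuals.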

RESULTS:

To overcome this computational challenge, we propose the 'score-limited DP' algorithm, which computes the SAF likelihood in time linear in the number of genomes. This algorithm works because, in the lower triangular matrix that arises in the DP algorithm, all non-negligible values of the SAF likelihood are concentrated in a few cells around the best-guess allele counts. We show that our score-limited DP algorithm has accuracy comparable to that of the original DP algorithm but is faster. This speed improvement makes SFS estimation practical when using low coverage NGS data from a large number of individuals.
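The idea of restricting the DP to cells near the best-guess allele count can be illustrated with a banded variant. The sketch below is a simplification and should not be read as the authors' algorithm: it keeps a fixed-width band (the hypothetical parameter `half_width`) instead of the score-based rule, which the abstract does not specify, and it takes the best-guess count to be the running sum of each individual's most likely genotype. It also assumes diploid samples and the same heterozygote weighting and binomial normalisation as the standard recurrence.

```python
from math import comb


def banded_saf_likelihood(genotype_likelihoods, half_width=4):
    """Approximate SAF likelihood restricted to a band (illustrative sketch).

    Returns a dict mapping allele count -> likelihood for the retained cells.
    The fixed band and the best-guess rule are assumptions for this example.
    """
    n = len(genotype_likelihoods)
    cells = {0: 1.0}  # allele count -> unnormalised likelihood
    best = 0          # running best-guess allele count
    for g in genotype_likelihoods:
        g0, g1, g2 = g
        new = {}
        for j, value in cells.items():
            new[j] = new.get(j, 0.0) + g0 * value
            new[j + 1] = new.get(j + 1, 0.0) + 2.0 * g1 * value
            new[j + 2] = new.get(j + 2, 0.0) + g2 * value
        # Best guess advances by this individual's most likely genotype.
        best += max(range(3), key=lambda k: g[k])
        # Discard cells outside the band around the best guess.
        cells = {j: v for j, v in new.items() if abs(j - best) <= half_width}
    return {j: v / comb(2 * n, j) for j, v in cells.items()}
```

Because at most `2 * half_width + 1` cells survive each step, the work per individual is bounded by a constant and the total cost grows linearly with the number of individuals. With many individuals the raw products underflow, so a practical version would rescale the band at each step or work in log space.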

AVAILABILITY AND IMPLEMENTATION:

The program will be available via a link from the Novembre lab website (http://jnpopgen.org/).

PMID: 25359894
PMCID: PMC4341071
DOI: 10.1093/bioinformatics/btu725
[Indexed for MEDLINE]
Free PMC Article
