Genome Res. Jun 2011; 21(6): 952–960.
PMCID: PMC3106328

SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples

Abstract

Reductions in the cost of sequencing have enabled whole-genome sequencing to identify sequence variants segregating in a population. An efficient approach is to sequence many samples at low coverage, then to combine data across samples to detect shared variants. Here, we present methods to discover and genotype single-nucleotide polymorphism (SNP) sites from low-coverage sequencing data, making use of shared haplotype (linkage disequilibrium) information. For each population, we first collect SNP candidates based on independent sequence calls per site. We then use MARGARITA with genotype or phased haplotype data from the same samples to collect 20 ancestral recombination graphs (ARGs). We refine the posterior probability of SNP candidates by considering possible mutations at internal branches of the 40 marginal ancestral trees inferred from the 20 ARGs at the left and right flanking genotype sites. Using a population genetic prior distribution on tree-branch length and Bayesian inference, we determine a posterior probability of the SNP being real and also the most probable phased genotype call for each individual. We present experiments on both simulation data and real data from the 1000 Genomes Project to demonstrate the applicability of the methods. We also explore the tradeoff between sequencing depth and the number of sequenced samples.

Recent advances in sequencing technologies enable the sequencing of personal genomes to identify most genetic variations present in one sample (Venter et al. 2001; Levy et al. 2007; Wang et al. 2008; Wheeler et al. 2008; Kim et al. 2009). To achieve high accuracy at almost all of the accessible sites requires high average depth; for example, the average depth in Kim et al. (2009) is 27.8×. This high depth is expensive and limits the number of samples that can be sequenced. An alternative strategy to find sequence variants shared in a population was introduced in Liti et al. (2009), where 70 haploid yeast samples were sequenced with only 1–4× coverage to find sequence variants. The 1000 Genomes Project (2010) is taking a similar approach and in its low-coverage pilot has sequenced 179 samples at an average 3.7× coverage.

Several methods have been introduced to detect variants from sequencing individual genomes (Li et al. 2008; H Li et al. 2009). The standard approach is to estimate the likelihood of the sequencing data given each possible genotype, and then convert to the probability of genotypes given data using Bayes' rule with an assumption about the prior probability of heterozygous and homozygous sequence variants. These methods work well with high-coverage data, but have low power and unacceptable false-positive rates (FPRs) when applied to individual samples with low-coverage sequencing data. For example, R Li et al. (2009) reported an FPR of 0.04% per base pair for a single sample with 4× coverage data, implying that the cumulative FPR would rise to 1 − (1 − 0.0004)^100 ≈ 4% per base pair, or 40 per kilobase, when applied to 100 independent samples. The expected rate of true SNPs is approximately six per kilobase, meaning that false-positives would outnumber true SNP calls by approximately seven to one, giving ~87% false discovery rate (FDR). Consistent with this, when we use SAMtools (H Li et al. 2009) separately on 100 samples with 4× coverage as described below, we see a cumulative false-positive rate of 5% per base pair. Moreover, genotype error rates when analyzing low-coverage samples independently are, not surprisingly, high: 0.041, 0.283, and 0.030 for homozygous reference, heterozygous, and homozygous non-reference genotypes, respectively.
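The cumulative false-positive arithmetic above can be checked directly (a minimal sketch; the 0.04% per-sample rate is the figure quoted from R Li et al. 2009):

```python
# Per-sample, per-base false-positive rate when calling independently at 4x.
per_sample_fpr = 0.0004  # 0.04% per bp
n_samples = 100

# Probability that at least one of the 100 independent call sets makes a
# false-positive call at a given base.
cumulative_fpr = 1 - (1 - per_sample_fpr) ** n_samples
print(f"{cumulative_fpr:.4f} per bp, ~{cumulative_fpr * 1000:.0f} per kb")
```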

In this study, we present two new methods to discover SNPs from low-coverage sequencing data by combining data across samples, which were developed to detect SNPs in the low-coverage pilot of the 1000 Genomes Project. In the first method, nonlinkage disequilibrium analysis (NLDA), we apply a dynamic programming algorithm to estimate the posterior probability of k non-reference alleles in 2m chromosomes in O(m²) time for all values of k from 1 to 2m − 1. Having obtained these posterior probabilities, we calculate the probability of a SNP at a site as the probability that k > 0, given assumptions about variant frequency and the allele frequency distribution. This method can be applied to the whole genome of hundreds of samples in reasonable computing time.

In the second method, linkage disequilibrium analysis (LDA), we make use of shared haplotype structure to estimate posterior probabilities of SNPs and genotypes. To do this, we build a set of possible ancestral recombination graphs (ARGs) for the samples using MARGARITA (Minichiello and Durbin 2006) on genotypes or phased haplotypes at previously genotyped sites. For example, we built 20 ARGs for the samples in the low-coverage pilot data of the 1000 Genomes Project from genotypes/phased haplotypes from the HapMap 3 project (The International HapMap 3 Consortium 2010). Having built the ARGs, for each candidate SNP site we collect the marginal ancestral trees inferred at the left and right flanking genotyped sites, 40 in total. We estimate the SNP posterior probability by evaluating the likelihood of the observed sequencing data for all possible mutations in the 40 trees, assuming that any sequence variant in the m samples is caused by a single mutation. Both simulated and real data show that LDA has the same SNP discovery rate as NLDA and produces lower false-positive rates. However, the complexity of LDA, O(NA m² nt), with the number of nucleotides NA = 4 and the number of trees nt = 40, makes LDA impractical for analyzing the whole genome of hundreds of samples. Fortunately, we found that very few sites with low NLDA posterior probability have high LDA posterior probability, so we adopt a strategy in which we first collect potential SNP candidates using NLDA, with a threshold selected to ensure that the SNP candidate set is feasible for LDA. We then apply LDA to the SNP candidate set and use the LDA posterior probability to determine SNPs at a chosen threshold. We filter false-positive calls by removing sites where there are three SNP calls within 10 bp (FW10) (Li et al. 2008). We can impute genotypes and phased haplotypes of the m samples under the same LDA framework.
These methods have been used to provide one of the primary call sets for the low-coverage pilot of The 1000 Genomes Project (2010).

Results

We implemented QCALL as described in the Methods section.

NLDA and LDA comparison on simulation data

We simulated 3000 haplotypes across a 5-Mbp region of chromosome 20, as described in the Data section of the Methods. We then created five nested populations, each with 1600× total sequencing coverage: 50 samples at 32× coverage, 100 samples at 16×, 200 samples at 8×, 266 samples at 6×, and 400 samples at 4×. These simulated populations contain 24,289, 28,181, 31,675, 32,807, and 34,807 SNPs, respectively. We also simulated a population of 60 samples with the same sequencing depths (3.7× average coverage) as the 60 CEU samples from the low-coverage pilot of the 1000 Genomes Project (the CEU samples are Utah residents with Northern and Western European ancestry).

We applied both the NLDA and LDA methods (see Methods section) to the 400 samples with 4× coverage to compare the performance of the two methods in SNP calling (Fig. 1). LDA is clearly better than NLDA at detecting SNPs, as it provides a lower false-positive rate and a higher discovery rate. We applied a filter to remove sets of three or more SNP calls within 10 bases (FW10), as we found that most such calls are false-positives caused by misalignment of reads around short insertions or deletions (indels). FW10 lowers the false-positive rate while keeping almost the same power to detect true SNPs. Table 2 shows the number of false-positives for each sequencing strategy at a 0.99 posterior confidence level (Q20) with the FW10 filter. We found that false-positives in simulated data are mainly caused by indels; e.g., 929/942 false-positives for 400 samples at 4× occur within 5 bp of an indel (Table 1). A stronger filter, FW5, which removes sets of two or more SNP calls within 5 bp, reduces the false-positive count further to 510, but also removes many more true positives (an overall loss of 9.8% of true positives). An alternative way to filter false-positives around indels would be to realign reads around indels, as is possible with Dindel (Albers et al. 2010) and GATK (McKenna et al. 2010). If these removed all false-positives around indels, then we could in theory obtain a false-positive rate in simulated data of about 1/Mbp (1 × 5/24,804 ≈ 0.0002 FDR) for 50 samples at 32×, or about 2.6/Mbp (2.6 × 5/29,823 ≈ 0.0004 FDR) for 400 samples at 4×.
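The FW10 heuristic described above can be sketched as follows (an illustrative implementation; the exact windowing rule in QCALL may differ in detail):

```python
def fw10_filter(positions, window=10, min_calls=3):
    """Drop every call that falls in a cluster of >= min_calls calls
    spanning <= `window` bp (a sketch of the FW10 filter)."""
    positions = sorted(positions)
    n = len(positions)
    drop = [False] * n
    for i in range(n):
        # find how many calls lie within `window` bp of positions[i]
        j = i
        while j < n and positions[j] - positions[i] <= window:
            j += 1
        if j - i >= min_calls:          # dense cluster -> flag all members
            for k in range(i, j):
                drop[k] = True
    return [p for p, d in zip(positions, drop) if not d]

# Toy call set: the 100/105/108 cluster is removed, the rest survive.
print(fw10_filter([100, 105, 108, 500, 900, 905]))
```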

Table 1.
Distribution of false-positives as a function of base pair distance from the nearest indel
Table 2.
Discovery rates with different sequencing strategies
Figure 1.
Discovery and false-positive rates of QCALL for 400 samples with 4.0× coverage sequencing data. LDA and LDA,FW10 stand for linkage disequilibrium analysis without FW10 and with FW10. The same notation is applied to NLDA.

Number of sequenced samples versus sequencing depth

With low-coverage sequencing data, it is difficult to detect SNPs with a low non-reference allele frequency (nrAF: the total number of non-reference alleles among the 2m haplotypes of m samples), as the lower the nrAF, the smaller the chance of observing sequencing data that supports the non-reference (alternative) allele (Fig. 2). The issue is more serious for heterozygous SNPs, which need data supporting both alleles. For example, the marginal rate of detecting singleton SNPs drops from 99% with 32× coverage data to 18% with 4× coverage data. However, for a fixed sequencing budget, one can sequence more samples at low coverage than at high coverage. Below, we show that we can increase the total number of population variants found by decreasing coverage and increasing the number of sequenced samples.

Figure 2.
SNP discovery power for different sequencing strategies, all using 1600× data, plotted as a function of the number of non-reference alleles present in the sequenced samples.

Starting with the 24,289 SNPs in the first 50 samples (100 haplotypes), we detect 24,029 (98.9%) from the 32× coverage sequencing data; the marginal discovery rate (the fraction of SNPs with a particular nrAF that are discovered) is 99% for singleton SNPs (Fig. 1). We miss 240 SNPs, most of which are singletons. When we sequence more samples at lower coverage, we start to miss some SNPs from the 100 haplotypes of the first 50 samples, but we gain SNPs in the additional sequenced samples. For example, when we reduce the sequencing depth from 32× to 16×, we lose 187 SNPs from the first 100 haplotypes, but gain 3628 new SNPs from the 100 newly sequenced haplotypes. Table 2 shows how these net gains progress as we sequence more samples at lower depth. The most variants are found when we sequence 400 samples at 4× coverage.

If we look at detection power as a function of nrAF calculated from all 3000 sequences in the simulation, 400 samples at 4× also show the best power to detect SNPs with 1% nrAF, although at lower population frequencies, 266 samples at 6× give slightly higher power (Fig. 3). The strategy of 50 samples with 32× coverage shows the worst performance at low nrAF; for example, it detects ~40% of SNPs with 0.005 nrAF, while 400 samples with 4× detect about 75%. The simulation results indicate that sequencing a large number of samples at low depth (4×–6×) is better for detecting rare SNPs than sequencing a small number of samples at high depth. However, there is no difference between these strategies in detecting high-nrAF SNPs; e.g., all strategies reach 100% discovery rates for SNPs with nrAF > 5%.

Figure 3.
SNP discovery power for different sequencing strategies as a function of the non-reference allele frequency in the population. The continuous lines show empirical results from the simulation with the allele frequency estimated from all 3000 simulated ...

CEU samples of Pilot 1 in the 1000 Genomes Project

We analyzed the same 5-Mbp region on chromosome 20 (43,000,000–48,000,000) in 60 samples from the CEU population of the low-coverage pilot of the 1000 Genomes Project (see Data section in Methods). The corresponding call set on the full genome contributed to the results of the low-coverage pilot of The 1000 Genomes Project Consortium (2010). We first applied NLDA to select 61,308 SNP candidates at a 1% threshold, then used LDA to select 16,954 SNP calls at a 90% threshold (Q10). Of these calls, 31% are in HapMap 2 and 67% are in dbSNP, equivalent to 33% novel calls. The calls show a ratio between transitions (mutations between A and G, or between C and T) and transversions (mutations from A or G to C or T, or vice versa) of 2.28, which is consistent with the value of 2.30 for the final 1000 Genomes Project call set in this interval (The 1000 Genomes Project Consortium 2010), though above the genome average of ~2.1.
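The transition/transversion ratio quoted above can be computed directly from a call set of (reference, alternative) pairs; a minimal sketch with toy data:

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# everything else among A, C, G, T is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(calls):
    """calls: iterable of (ref_base, alt_base) SNP pairs."""
    ts = sum((r, a) in TRANSITIONS for r, a in calls)
    tv = len(calls) - ts
    return ts / tv

# Toy call set: two transitions, one transversion -> ratio 2.0
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```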

We applied NLDA and LDA to the 60 simulated samples that have the same depths as the CEU samples from the low-coverage pilot of the 1000 Genomes Project. On these simulation data, we detect 19,077 of the 25,268 SNPs present in the 60 samples with 3.7× coverage data (~75%). We called 456 false-positives, equivalent to a 456/(5 × 10^6) ≈ 10^−4 FPR, or a 456/19,533 ≈ 2.3% FDR. We also compare the marginal discovery rate of QCALL as a function of the non-reference allele frequency on simulation data and real data, using the 43 samples in the 1000 Genomes Project CEU panel for which HapMap 2 genotype data provide the truth for the real-data calls. The power as a function of allele frequency is remarkably similar (see Fig. 4).

Figure 4.
Marginal discovery rates as a function of non-reference allele count in 43 samples, from the CEU simulation and from 1000 Genomes Project data evaluated at HapMap 2 sites not in HapMap 3, on the 43 sequenced samples overlapping HapMap 2.

Genotype accuracy

One advantage of the LDA method is its ability to provide more accurate genotypes from low-coverage data, based on local haplotype structure. For example, the NLDA genotype estimator, which generates the posterior probability of genotypes using Bayes' rule, has an error rate of about 0.424 for heterozygous SNPs. LDA, in contrast, assigns genotypes/haplotypes for samples by averaging over sets of calls that are consistent with local haplotype structure (see Genotyping section in Methods).

Empirical experiments on the CEU population 1000 Genomes Project data, comparing with HapMap 2 genotypes at sites not in HapMap 3 (which were used to build the ARGs), give an overall genotype FDR for LDA of 2.7%, corresponding to 1.4%, 3.9%, and 4.2% FDR for homozygous reference, heterozygous, and homozygous non-reference genotypes, respectively (see Table 3). These FDRs are competitive with those of Beagle (Browning and Yu 2009), another haplotype-based approach to genotype calling from likelihood data comparable to LDA (2.8% overall; 0.8%, 5.7%, and 3.4% by genotype category).

Table 3.
Average genotype error rates according to HapMap 2 genotypes across 5 Mbp on chromosome 20 (20:43,000,000–48,000,000) for the 43 samples shared between the 60 CEU samples and the HapMap 2 samples

For simulation data, the overall genotype FDR of QCALL drops from 2.56% to 1.94% when we increase the number of sequenced samples from 50 to 400. We believe this decrease under-represents the potential of the tree-based calling approach of QCALL, and is instead limited by the ability of MARGARITA to scale effectively to large sample sets, since we have noticed that for 400 samples, MARGARITA, which implements a greedy algorithm, gets locked into incorrect structures. We are exploring other approaches to generating ARGs to avoid this problem.

Discussion

Detecting SNPs from multiple samples with low-coverage data is an efficient approach to detect low-frequency SNPs in a population. Experimental results show that QCALL with NLDA and LDA methods detects shared variants from multiple samples better than analyzing individual samples independently. In particular, the genotype accuracy is substantially improved.

The probability of detecting a SNP at a site depends on the number of non-reference alleles present in the sequencing samples and the evidence in the sequencing data for the observation. The strategy of sequencing a large number of samples with low coverage increases the expected number of non-reference alleles in the sample, but lowers confidence of the evidence for seeing them compared with the strategy of sequencing a small number of samples with high coverage. The best strategy for a particular nrAF is a tradeoff between the two factors. For example, at 0.005 nrAF the probability of there being at least one non-reference allele in 50 samples (100 haplotypes) is 0.3942, and so the resulting discovery rate cannot be higher than 0.3942, even at very high depth. However, the probability of there being at least one non-reference allele in 400 samples (800 haplotypes) is 0.9819, and it is likely that there will be more than one, so the discovery rate at 4× is ~73%. However, there is almost no difference between two strategies for high nrAF SNPs (common SNPs), as both strategies achieve near 100% power. Even when the overall power to detect variants is similar, there are circumstances in which sequencing a larger number of samples at lower depth can be preferable so as to better characterize the allele frequency of variants, or when phenotyped samples are being sequenced for an association study and increasing the number of sequenced samples increases statistical power.
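The tradeoff described above is easy to quantify: under binomial sampling, the chance that a variant at population frequency f appears at least once among n sequenced haplotypes is 1 − (1 − f)^n. A short sketch reproducing the figures quoted here:

```python
def p_at_least_one(nraf, n_haplotypes):
    """Probability that a variant at population frequency `nraf` is
    carried by at least one of n sampled haplotypes (binomial sampling)."""
    return 1 - (1 - nraf) ** n_haplotypes

# f = 0.005: 50 samples (100 haplotypes) vs. 400 samples (800 haplotypes)
print(round(p_at_least_one(0.005, 100), 4))  # ~0.3942
print(round(p_at_least_one(0.005, 800), 4))  # ~0.9819
```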

Many false-positive calls are caused by short indels, where sequencing reads are mapped wrongly to the reference, particularly when the indel occurs at the beginning or end of a read. Thus, we often found a set of false-positives around an indel. FW10 is a simple and quite efficient method to remove these false-positives, as they are often very dense around the indel. However, FW10 cannot help when there are fewer than three false-positives or when the false-positives are separated by more than 10 bp. An alternative solution is to realign reads around indels, as is possible with Dindel (Albers et al. 2010) and GATK (McKenna et al. 2010). LDA gives good-quality SNP calls, but it has two main limitations: first, it is computationally expensive; and second, it requires ARGs to have been previously built from genotyped data. The computational cost can be overcome by prescreening with NLDA to filter out sites without evidence of being SNPs. QCALL takes about 10 h for a 1-Mbp segment of 400 samples, though a proportion of that time is spent preparing the likelihoods of the sequencing data from multiple samples. To remove the requirement for genotype data, we are developing methods to add new samples into existing ARGs or to build ARGs directly from sequencing data.

Although our discussion of the method and results has been in the context of full-genome shotgun data, QCALL can also be used on targeted sequencing data, such as from exome projects (Ng et al. 2010), given that genotype data are available from which to build ARGs with MARGARITA. Furthermore, it can be used for other types of bi-allelic variant that are in local linkage disequilibrium with SNPs, such as small insertions or deletions (indels), by limiting to two possible states rather than the four bases. For these other uses it is possible to change the prior expectation of the transition to transversion ratio from 2, which is typical for human whole-genome SNPs, to, for example, 3.5, which is typical of coding regions, or 1 when encoding other variant types. QCALL was used for calling short indel genotypes for the 1000 Genomes Project pilot (The 1000 Genomes Project Consortium 2010).

Finally, the LDA approach we discuss here is related to other haplotype-sharing imputation methods such as BEAGLE (Browning and Yu 2009), mentioned above, IMPUTE (Howie et al. 2009), and MACH (Li et al. 2010). These can all be adapted for variant calling from low-coverage sequencing and, in fact, both BEAGLE and MACH have also been used in the 1000 Genomes Project, with the results combined with those from QCALL to provide final consensus calls (The 1000 Genomes Project Consortium 2010).

Methods

Data

All experimental results were obtained on a 5-Mbp region of chromosome 20 (43,000,001–48,000,000) in NCBI 36 human reference (International Human Genome Consortium 2006, Build 36, hg18).

Simulation data

We simulated 3000 haplotypes using MaCS with the population parameters provided in Chen et al. (2009). We used MAQ to simulate 51-bp paired-end reads for 800 haplotypes with error parameters estimated from one Illumina lane of NA12750 (The 1000 Genomes Project Consortium 2010). We mapped the reads to the NCBI 36 human reference using BWA (Li and Durbin 2009) and stored the output in BAM format. We built simulated "HapMap 3" sites by identifying SNPs from 10 haplotypes and selecting the same number of sites as in the HapMap 3 data, taking, for each true HapMap 3 site, the nearest site seen at least twice in the 10 simulated haplotypes. We simulated five sets of data with a total of 1600× coverage: 50 samples at 32×, 100 at 16×, 200 at 8×, 266 at 6×, and 400 at 4×. We also simulated 60 samples at 3.7× to model the data from the 60 CEU samples of the low-coverage pilot in the 1000 Genomes Project.

Real data

We used the same 5-Mbp region (chromosome 20, 43–48 Mbp) of the CEU population in the low-coverage pilot of the 1000 Genomes Project.

Nonlinkage disequilibrium analysis (NLDA)

Assume we have observed data D = (d1,…,dm) for m samples at site s, with likelihoods p(di | gi) for di given each possible genotype gi. p(di | gi) can be estimated using SAMtools (H Li et al. 2009) or GATK (McKenna et al. 2010). For example, SAMtools uses the method of Li et al. (2008), where homozygous likelihoods p(di | gi = aa) are calculated as the product of the estimated base errors for non-a bases, derived from the sequencing quality values and corrected for nonindependence of errors, and heterozygous likelihoods p(di | gi = ab) as (1/2)^n, where n is the number of observed a or b bases, times the product of the estimated base errors for non-ab bases, since there is a half chance of observing an a or a b (see SAMtools [H Li et al. 2009] and MAQ [Li et al. 2008] for more detail).
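As a rough illustration of such genotype likelihoods, here is a simplified per-base version that assumes independent sequencing errors (SAMtools/MAQ additionally correct for error dependence, so this is a sketch rather than their exact model):

```python
def genotype_likelihood(bases, quals, genotype):
    """Sketch of p(d | g) under independent base errors.
    bases: observed bases at the site; quals: Phred qualities;
    genotype: two-character string, e.g. "AA" or "AG"."""
    a, b = genotype[0], genotype[1]
    lik = 1.0
    for base, q in zip(bases, quals):
        err = 10 ** (-q / 10.0)                 # Phred -> error probability
        p_a = (1 - err) if base == a else err / 3
        p_b = (1 - err) if base == b else err / 3
        if a == b:
            lik *= p_a                          # homozygous: all reads from a
        else:
            lik *= 0.5 * p_a + 0.5 * p_b        # het: half chance per allele
    return lik

print(genotype_likelihood("AAG", [30, 30, 20], "AG"))
```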

Assume the haplotypes of the m samples at a site are bi-allelic, with alleles a and b. The posterior probability of a SNP at s given the observed data D, p(s = SNP | D), is 1 minus the probability that all 2m haplotypes equal the reference allele r at s:

p(s = SNP | D) = 1 − p(D | g = (rr,…,rr)) p(g = (rr,…,rr)) / p(D),   (1)

where a configuration g = (g1,…,gm) gives the genotypes of the m samples, and p(g) and p(D | g) are the prior probability of g and the probability of D given genotypes g. The prior probability of a configuration is taken as the prior probability of a mutation that results in g, proportional to θ/k, where θ is the population-scaled mutation rate and k is the number of mutant alleles in g. Denoting by na the number of a alleles in g, and allowing either allele to be the mutant,

p(g) ∝ θ/na + θ/(2m − na),  0 < na < 2m.

We set θ = 0.001 for standard human SNP calling. It can be set as a program parameter for other uses of QCALL.

Assuming that the sequencing data is independent between samples, the probability of D given m genotypes g = (g1,…,gm), p(D | g), is calculated as

p(D | g) = ∏i=1..m p(di | gi).

The key to calculating p(s = SNP | D) in Equation 1 is to compute the normalization factor,

p(D) = Σg p(D | g) p(g).

We have

p(D) = Σk=0..2m pk Qm,k,

where pk is the prior probability of k a alleles among the 2m haplotypes (from the θ/k mutation prior above), and

Qm,k = Σ{g : na(g) = k} p(D | g)

is the total probability of all possible genotype configurations g of m samples such that the number of a alleles in g is equal to k.

Qm,k satisfies the recursion

Qm,k = Qm−1,k p(dm | gm = bb) + Qm−1,k−1 p(dm | gm = ab) + Qm−1,k−2 p(dm | gm = aa),

where Qm−1,k denotes the total probability of all configurations of m − 1 samples such that the number of a alleles among the m − 1 samples equals k.

Using this recursion, we can calculate Qm,k from the individual genotype likelihoods p(di | gi) in O(m²) steps by dynamic programming. Having obtained Qm,k, we can easily evaluate

p(D) = Σk=0..2m pk Qm,k

in Equation 1.
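The O(m²) recursion can be sketched as a dynamic program (an illustrative implementation; function and variable names are ours, and QCALL's code may differ in detail):

```python
def nlda_qmk(likelihoods):
    """Compute Q[k], k = 0..2m: the total probability of the data over all
    genotype configurations of m samples carrying exactly k copies of
    allele a.  likelihoods[i] = (p(d_i|bb), p(d_i|ab), p(d_i|aa)),
    indexed by the number of a alleles (0, 1, 2) in sample i's genotype."""
    m = len(likelihoods)
    Q = [1.0] + [0.0] * (2 * m)        # zero samples processed: Q[0] = 1
    for i, (p0, p1, p2) in enumerate(likelihoods, start=1):
        new_q = [0.0] * (2 * m + 1)
        for k in range(2 * i + 1):
            q = 0.0
            if k <= 2 * (i - 1):               # sample i contributes 0 a's
                q += Q[k] * p0
            if 1 <= k and k - 1 <= 2 * (i - 1):  # contributes 1 a
                q += Q[k - 1] * p1
            if 2 <= k and k - 2 <= 2 * (i - 1):  # contributes 2 a's
                q += Q[k - 2] * p2
            new_q[k] = q
        Q = new_q
    return Q

# With uniform likelihoods, Q[k] counts the genotype configurations per k.
print(nlda_qmk([(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]))
```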

Linkage disequilibrium analysis

First, we give an informal description, then the technical details. We assume that each genetic variant is caused by a single mutation on a coalescent tree during evolution. Figure 5 shows an example of an ancestral tree at a site for four samples, s1, s2, s3, and s4. Assuming for the moment that this tree is correct, and that we know the ancestral base and the position of a mutation on the tree (for example, the mutation from A to C shown in Fig. 5), we can infer the base carried by each haplotype at the site and, hence, the genotypes of the individuals. Given the genotypes, we can calculate the likelihood of the sequencing data D given the tree, the root value, and the mutation. Since we do not know the root value and the mutation position, we integrate over them, weighting the mutation probabilities by the expected branch lengths under a population genetic prior, and then we average over a sample of trees to estimate the total likelihood of the data D given that there was a mutation. Conditional on there being a mutation, we marginalize over genotypes to generate genotype posteriors.
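The inference step just described (a root value plus one mutation determines every haplotype's base) can be illustrated on a toy tree encoded as child-to-parent pointers; all node names here are illustrative:

```python
def leaf_bases(parent, leaves, root_base, mut_edge_child, mut_to):
    """A leaf inherits `mut_to` iff the mutated edge (identified by its
    child node) lies on the leaf's path to the root; otherwise it keeps
    the root base."""
    out = {}
    for leaf in leaves:
        node, mutated = leaf, False
        while node is not None:
            if node == mut_edge_child:
                mutated = True
            node = parent.get(node)     # root has no parent -> None
        out[leaf] = mut_to if mutated else root_base
    return out

# Eight haplotypes (h1..h8) of four samples; a single A->C mutation on the
# edge above internal node n1 makes h1 and h2 carry C.
parent = {"h1": "n1", "h2": "n1", "h3": "n2", "h4": "n2",
          "h5": "n4", "h6": "n4", "h7": "n5", "h8": "n5",
          "n1": "n3", "n2": "n3", "n4": "n6", "n5": "n6",
          "n3": "root", "n6": "root"}
bases = leaf_bases(parent, [f"h{i}" for i in range(1, 9)], "A", "n1", "C")
print(bases)
```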

Figure 5.
An illustrative example of a coalescent tree for four samples (eight haplotypes). Given a value at the root, A in this example, and a mutation from A to C in this example, we can infer genotypes for the four samples and, hence, compute the probability ...

To find the coalescent trees, we use MARGARITA (Minichiello and Durbin 2006) to estimate ancestral recombination graphs (ARGs) from known genotypes or phased haplotypes of the samples at previously genotyped SNP sites. We prefer phased haplotypes to unphased genotypes because MARGARITA works better with phased data; however, it can work on unphased genotype data, or a mixture. For example, for the low-coverage pilot of the 1000 Genomes Project we used phased haplotypes from HapMap 3 for most samples, but genotypes for the few samples for which phased haplotypes were not available. For the simulation data we used phased haplotypes at a subset of sites selected to correspond to HapMap 3 sites, as described above.

MARGARITA (Minichiello and Durbin 2006) can only handle a limited number of SNPs (markers) in terms of running time and memory, so we cut the whole genome into 1-Mbp segments. To make the ARGs consistent at the ends of the 1-Mbp segments, we expanded each segment by 0.5 Mbp at each end, so MARGARITA was run across overlapping intervals of 2 Mbp. We kept 20 ARGs for further analysis as a compromise between QCALL's accuracy and running time; a larger number of ARGs does not improve the performance of QCALL much but increases running time linearly. For example, MARGARITA takes, on average, ~8 h to build 20 ARGs for 400 samples on one 2-Mbp segment.
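The segmentation scheme can be sketched as follows (an illustrative helper; coordinates here are 0-based half-open, unlike the 1-based coordinates used elsewhere in the paper):

```python
def overlapping_windows(chrom_len, core=1_000_000, flank=500_000):
    """1-Mbp core segments, each expanded by 0.5 Mbp on both sides so the
    ARG-building intervals overlap (2 Mbp in the interior, clipped at the
    chromosome ends)."""
    windows = []
    for start in range(0, chrom_len, core):
        lo = max(0, start - flank)
        hi = min(chrom_len, start + core + flank)
        windows.append((lo, hi))
    return windows

print(overlapping_windows(3_000_000))
```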

MARGARITA only gives trees at the sites that were used to build it, so we approximate the coalescent trees at a candidate SNP site s by the trees T at the left and right flanking genotyped sites. Let Δ and Δ̄ denote the two cases where there is one mutation and no mutation at s, respectively. We compute the probability of a mutation at s given D by Bayes' rule:

p(Δ | D) = p(D | Δ) p(Δ) / [p(D | Δ) p(Δ) + p(D | Δ̄) p(Δ̄)],   (3)

where the priors p(Δ) and p(Δ̄) are derived from standard neutral population genetics theory.

We start solving Equation 3 by estimating the probability of D given no mutation, p(D | Δ̄). To handle the situation where there are errors in the reference sequence, we set

p(D | Δ̄) = Σr p(r) p(D | Δ̄, r),

where r is the true (ancestral) unmutated reference and

p(r) = 1 − ε if r is the observed reference base; otherwise p(r) = ε/3,   (4)

where ε is the error rate in the observed reference, which we set to ε = 2 × 10^−5 based on empirical experiments in the 1000 Genomes Project. Given true base r and no mutation at s, all genotypes of the m samples must be rr, leading to

p(D | Δ̄, r) = ∏i=1..m p(di | gi = rr).

To estimate p(D | Δ,T), we scan all possible mutations on the trees of T and integrate the probabilities of D given these mutations, weighted by a prior distribution over mutations. Starting with reference r,

p(D | Δ, T) = Σr p(r) Σk=1..nt p(tk) p(D | Δ, tk, r),   (5)

where tk is a tree at a flanking site of s. We assume the trees tk are independent and have the same prior probability, p(tk) = 1/nt.

To estimate p(D | Δ, tk, r), we scan all possible mutations in tk such that the reference r exists among the m genotypes. We also consider the case where r is not represented in the m observed samples because it was caused by a mutation outside tk:

p(D | Δ, tk, r) = Σ{a≠r} μ p(a,r) p(D | gi = aa : i = 1..m) + Σ{e∈tk} Σ{a≠r} p(e | tk) [p(a,r) p(D | e a→r) + p(r,a) p(D | e r→a)],   (6)

where μ is the prior probability of an external mutation from a to r, set to that of a mutation at a leaf branch of a tree with 2m + 1 leaves; p(a,r) is the prior probability of a mutation from a to r; p(e | tk) is the prior probability of a mutation happening on edge e; and p(D | e a→r) [p(D | e r→a)] is the probability of the data given a mutation from a to r (r to a) at edge e.

The first part of Equation 6 allows for an external mutation from a to r outside tk, and the second part handles the case where a mutation happens at an edge in tk. μ is set proportional to 1/(2m + 1) and normalized such that

μ + Σ{e∈tk} p(e | tk) = 1.

The prior probability of a mutation from a to r, p(a,r), can be set to allow for an arbitrary transition-to-transversion ratio. For standard genome-wide calls we set this ratio to 2.0:

p(a,r) = 1/2 if a↔r is a transition, and 1/4 if a↔r is a transversion.
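Under the stated ratio of 2.0, the prior over the three possible alternatives to r can be sketched as follows (a hypothetical helper; QCALL exposes the ratio as a parameter):

```python
def mutation_prior(a, r, ts_tv=2.0):
    """Prior p(a,r) over the three alternatives to r, weighted so the one
    transition is ts_tv times as likely as each of the two transversions."""
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    w = ts_tv if (a, r) in transitions else 1.0
    # for any base r there is one transition and two transversions
    return w / (ts_tv + 2.0)

print(mutation_prior("A", "G"))  # transition: 1/2 with the default ratio
print(mutation_prior("A", "C"))  # transversion: 1/4
```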

p(D | gi = aa : i = 1..m) is simply estimated as

p(D | gi = aa : i = 1..m) = ∏i=1..m p(di | gi = aa).

The prior probability of a mutation happening at e is set such that more recent mutations have lower prior probability:

p(e | tk) ∝ 1 / [na N(na)],

where na is the number of a haplotypes at the leaves when the mutation a→b happens at edge e, and N(na) is the number of possible mutations in tk that result in na a haplotypes at the leaves. We normalize p(e | tk) such that

Σ{e∈tk} p(e | tk) = 1 − μ.

Let g = (g1,…,gm) be the genotypes of the m samples that result from a mutation a→r (or r→a) at edge e; then

p(D | e a→r) = ∏i=1..m p(di | gi).

Merging Equations 5 and 6, we have

p(D | Δ, T) = Σr p(r) (1/nt) Σk=1..nt [ Σ{a≠r} μ p(a,r) p(D | gi = aa : i = 1..m) + Σ{e∈tk} Σ{a≠r} p(e | tk) (p(a,r) p(D | e a→r) + p(r,a) p(D | e r→a)) ].   (7)

We note that the complexity of computing p(D | Δ, T) in Equation 7 is O(NA m² nt), where the number of nucleotides NA = 4 and the number of flanking trees nt = 40.

Genotyping

Let g = (g1,…,gm) be the genotypes of the m samples at s. Given a mutation at s, we calculate the posterior probability for gi as follows:

p(gi = ab | D, Δ, T) = Σr p(r) p(gi = ab | D, Δ, T, r),

where r is the reference allele and p(r) is the prior probability estimated as in Equation 4. p(gi = ab | D, Δ, T, r) is given by

p(gi = ab | D, Δ, T, r) = (1/nt) Σk=1..nt p(gi = ab | D, Δ, tk, r),

where

p(gi = ab | D, Δ, tk, r) = p(D, gi = ab | Δ, tk, r) / p(D | Δ, tk, r).

Let E^j_{i,k} be the set of edges in tk such that a mutation at an edge e ∈ E^j_{i,k} causes j (j = 0, 1, or 2) haplotype(s) of sample i to be mutant. p(D,gi = ab | Δ,tk,r) is estimated under the following cases:

If a = b = r, then p(D,gi = aa | r = a,tk,Δ) is the sum of the probabilities of all possible mutations from a to x on an edge e ∈ E^0_{i,k}, or from x to a on an edge e ∈ E^2_{i,k}:

equation image

If a = b ≠ r, then p(D,gi = aa | r ≠ a,tk,Δ) is the sum of the probabilities of mutations from a to r on edges outside tk or on an edge e ∈ E^0_{i,k}, or from r to a on an edge e ∈ E^2_{i,k}:

equation image

If a ≠ b, then p(D,gi = ab | r,tk,Δ) is the sum of the probabilities of a mutation from a to b or from b to a on an edge e ∈ E^1_{i,k}; r must be either a or b:

equation image
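All three cases above reduce to classifying each edge of tk by how many haplotypes of sample i lie below it (0, 1, or 2). A minimal sketch, assuming (for illustration only) that haplotypes 2i and 2i+1 (0-based) belong to sample i and that each edge is represented by the set of haplotype indices in its subtree:

```python
def classify_edges(edges, i):
    """Partition the edges of a marginal tree by j, the number of
    haplotypes of sample i (haplotypes 2i and 2i+1) below each edge.
    `edges` maps an edge id to the set of haplotype indices in its subtree."""
    classes = {0: [], 1: [], 2: []}
    for edge_id, below in edges.items():
        j = len({2 * i, 2 * i + 1} & set(below))
        classes[j].append(edge_id)
    return classes
```

A heterozygous call for sample i, for example, only needs the edges in class j = 1.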

Having obtained the posterior genotype probabilities p(gi = ab | D,Δ,T), we determine the genotype of sample i as the one with the highest posterior probability:

gi* = argmax_ab p(gi = ab | D,Δ,T)
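Selecting the reported genotype is then a simple argmax over the posterior table; a minimal sketch with a hypothetical posterior dictionary:

```python
def call_genotype(posteriors):
    """Return the genotype with the highest posterior probability.
    `posteriors` maps genotype strings (e.g. "AA", "AG") to probabilities."""
    return max(posteriors, key=posteriors.get)
```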

Haplotype phasing

Let h = (h1,…,h2m) be the 2m haplotypes of the m samples at site s. We compute the posterior probability of hi = a given a mutation Δ, the observed data D, and the marginal coalescent trees T as:

equation image

where r is the reference allele and p(r) is the prior probability estimated as in Equation 4.

p(hi = a | D,T,Δ,r) is calculated as

equation image

where

equation image

Let Ei,k denote the set of edges in tk such that a mutation at an edge e ∈ Ei,k results in hi = a.

equation image

and

equation image

Having obtained p(hi = a | D,T,Δ), we determine the allele of haplotype i as the one with the highest posterior probability:

hi* = argmax_a p(hi = a | D,T,Δ)

Issue with singletons and haplotype phasing

Singletons are a special case in which the mutation happens on a leaf branch. For each singleton there are two possible leaf-branch mutations that result in the same genotype configuration (Fig. 6), so both alleles receive equal posterior probability at the singleton and it cannot be phased. In practice, when our genotype calls indicate a singleton (all samples homozygous except one heterozygous sample), we report the genotype call for the heterozygous sample but do not attempt to phase it.
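The singleton check described above can be sketched as follows (the two-character genotype encoding is an assumption for illustration):

```python
def is_singleton(genotypes):
    """True if exactly one sample is heterozygous and all others are
    homozygous -- the configuration that cannot be phased (Fig. 6).
    Genotypes are two-character strings such as "AA" or "AG"."""
    het = sum(1 for g in genotypes if g[0] != g[1])
    return het == 1
```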

Figure 6.
Two mutations at two edges of a singleton (the edges connected to the fourth or the eighth haplotype) lead to the same genotype configuration.

Acknowledgments

We thank James Stalker, Thomas Keane, and David Craig for making the .bam files, Heng Li for his significant help with MAQ and SAMtools, Gary Chen for MaCS, The 1000 Genomes Project Consortium for their data, the HapMap 3 Consortium for providing genotype calls and the Gilean McVean group for phasing them, and the Goncalo Abecasis and Richard Durbin groups for comments and feedback. Funding for this project was provided by Microsoft Research and Wellcome Trust grant WT089088/Z/09/Z to R.D.

Footnotes

[Software to implement the methods is available in the QCALL package from ftp://ftp.sanger.ac.uk/pub/rd/QCALL.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.113084.110.

References

  • The 1000 Genomes Project Consortium 2010. A map of human genome variation from population scale sequencing. Nature 467: 1061–1073
  • Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R 2011. Dindel: Accurate indel calls from short-read data. Genome Res (this issue). doi: 10.1101/gr.112326.110
  • Browning BL, Yu Z 2009. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am J Hum Genet 85: 847–861
  • Chen GK, Marjoram P, Wall JD 2009. Fast and flexible simulation of DNA sequence data. Genome Res 19: 136–142
  • Howie BN, Donnelly P, Marchini J 2009. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5: e1000529. doi: 10.1371/journal.pgen.1000529
  • The International HapMap 3 Consortium 2010. Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58
  • Kim JI, Ju YS, Park H, Kim S, Lee S, Yi JH, Mudge J, Miller NA, Hong D, Bell CJ, et al. 2009. A highly annotated whole-genome sequence of a Korean individual. Nature 460: 1011–1015
  • Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, et al. 2007. The diploid genome sequence of an individual human. PLoS Biol 5: e254. doi: 10.1371/journal.pbio.0050254
  • Li H, Durbin R 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760
  • Li H, Ruan J, Durbin R 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18: 1851–1858
  • Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079
  • Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, Wang J 2009. SNP detection for massively parallel whole-genome resequencing. Genome Res 19: 1124–1132
  • Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR 2010. MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 32: 1–19
  • Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, et al. 2009. Population genomics of domestic and wild yeasts. Nature 458: 337–341
  • McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al. 2010. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20: 1297–1303
  • Minichiello MJ, Durbin R 2006. Mapping trait loci by use of inferred ancestral recombination graphs. Am J Hum Genet 79: 910–922
  • Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, et al. 2010. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 42: 30–35
  • Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. 2001. The sequence of the human genome. Science 291: 1304–1351
  • Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Guo Y, et al. 2008. The diploid genome sequence of an Asian individual. Nature 456: 60–65
  • Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452: 872–876

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press