Copyright © 2007 Sebastiani and Abad-Grau; licensee BioMed Central Ltd.

Bayesian estimates of linkage disequilibrium

1 Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
2 Software Engineering Department, University of Granada, Granada 18071, Spain

Corresponding author. Paola Sebastiani: sebas/at/bu.edu; María M Abad-Grau: mabad/at/ugr.es

Received February 4, 2007; Accepted June 25, 2007.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background
The maximum likelihood estimator of D', a standard measure of linkage disequilibrium, is biased toward disequilibrium, and the bias is particularly evident in small samples and rare haplotypes.

Results
This paper proposes a Bayesian estimation of D' to address this problem. The reduction of the bias is achieved by using a prior distribution on the pair-wise associations between single nucleotide polymorphisms (SNPs) that increases the likelihood of equilibrium with increasing physical distances between pairs of SNPs. We show how to compute the Bayesian estimate using a stochastic estimation based on MCMC methods, and also propose a numerical approximation to the Bayesian estimates that can be used to estimate patterns of LD in large datasets of SNPs.

Conclusion
Our Bayesian estimator of D' corrects the bias toward disequilibrium that affects the maximum likelihood estimator. A consequence of this feature is a more objective view about the extent of linkage disequilibrium in the human genome, and a more realistic number of tagging SNPs to fully exploit the power of genome wide association studies.
Background

Single nucleotide polymorphisms (SNPs) are an invaluable resource to identify regions of the human genome that may be associated with disease. A key to this process is linkage disequilibrium (LD), which is defined as the non-random association between the alleles of SNPs [1]. Although LD may occur between SNPs that are not in linkage but are associated, we will focus on the LD due to the spatial structure of the genome. In this situation, the non-random association implies that pairs of alleles in the same haplotype occur differently from what we would expect in a random pairing, and several measures of LD have been proposed to capture the departure from independent pairing of the alleles of SNPs [2]. In this paper we will limit attention to D, its normalized version D', and the well-known bias of the Maximum Likelihood Estimate (MLE) of D' toward disequilibrium [2,3]. This bias is particularly large in small samples and SNPs with rare alleles, to the point that SNPs whose alleles occur independently may be inferred to be in strong LD [4].

However, relying on small samples to identify patterns of LD is not unusual: for example, the International HapMap Project (IHMP) aims to establish genome-wide patterns of LD using genotype data of at most 30 trios or 45 unrelated individuals [5]. The genotype data typed in this small number of samples are used to describe the extent of LD in the human genome, and to derive a map of the haplotypes and the SNPs that are sufficient to tag the human genome. These results will have a deep impact on genome wide association studies and, in particular, inferring larger blocks of LD than the real ones may lead to the selection of an insufficient number of SNPs and hence decrease the power of genome wide association studies. In this scenario, biasing the estimate of LD toward equilibrium appears to be a safer alternative.

Several solutions have been proposed to reduce the bias of the MLE of D' toward disequilibrium [4].
A pragmatic solution is to impose some "ad hoc" threshold on the minimum allele frequency (MAF) of those SNPs that can be used to infer the pattern of LD [6]. Imposing this threshold leads to a non-random selection of SNPs and may introduce ascertainment bias [7,8]. The thought behind our approach is that the bias of the MLEs of D and D' is due to the lack of information in the data to discriminate between equilibrium and different magnitudes of disequilibrium, so that any attempt to correct this bias using the data alone is bound to fail, as was acknowledged in [3]. However, it is known that, on average, the strength of LD due to linkage decreases as the physical distance between SNPs increases [9,10]. Therefore, we propose a Bayesian estimator of D' that allows us to integrate data with prior information about the pattern of LD decay. To this end, we use a prior distribution on the pairwise dependencies between different SNPs that is a decreasing function of their physical distance. We show how to compute the posterior estimate of D' using Markov Chain Monte Carlo methods, and provide a numerical approximation that can be used for fast estimation of LD in large regions of the human genome.

As we show in simulated and real data from the IHMP, the effect of the prior distribution is to drastically reduce the bias toward disequilibrium even in small samples, and to remove the need for arbitrary thresholds on the MAF. We also show that, compared to the MLE, our estimators lead to inferred patterns of LD decay that are much closer to published results [10], and confirm the existence of haplotype blocks as regions of low recombination. The method is implemented in a computer program called Bayesian Linkage (BLink) [11].

Results and discussion

The traditional D and D'

Given two SNPs L1 and L2, with alleles A/a and B/b, A and B the major alleles, we define the probability of the haplotype ij by pij = p(L1 = i, L2 = j), i = A, a, j = B, b.
As in [12], we assume the relation pA ≥ pB on the probabilities pA = p(L1 = A), pB = p(L2 = B), from which the inequality pAb ≥ paB follows. The two SNPs are in linkage equilibrium when the co-occurrence of two alleles on the same haplotype is random, i.e. pij = pipj for all i = A, a, j = B, b. On the other hand, LD implies some form of dependency in the alleles on the same haplotype and hence departure from independence of the probabilities pij. Although there are many ways to measure departure from independence in a 2 × 2 table [13], a widely used measure of LD is the parameter D defined by

\[ D = p_{AB} - p_A\, p_B . \]
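Because the domain of D is a function of the allele frequencies, different normalization methods have been proposed to facilitate the interpretation [2]. The most common one is the measure

\[ D' = |D_s|, \qquad D_s = \frac{D}{\max D}, \qquad \max D = \begin{cases} \min(p_A p_B,\; p_a p_b) & \text{if } D < 0, \\ \min(p_A p_b,\; p_a p_B) & \text{if } D \ge 0. \end{cases} \]

Maximum likelihood estimation

Suppose now that we have a data set of N individuals and n = 2N known haplotypes for the two SNPs (we assume here known phase for all haplotypes and discuss the phasing issue at the end of this section). We denote by nij (i = A, a, j = B, b) the frequencies of the four haplotypes, and by ni and nj the allele frequencies, with nA ≥ nB. Assuming that the four haplotypes follow a multinomial distribution with probabilities pij, the likelihood function can be written as:

\[ L(p \mid n) \propto \prod_{i \in \{A, a\}} \prod_{j \in \{B, b\}} p_{ij}^{\,n_{ij}} \]

and the MLE of pij, pA and pB are

\[ \hat p_{ij} = \frac{n_{ij}}{n}, \qquad \hat p_A = \frac{n_A}{n}, \qquad \hat p_B = \frac{n_B}{n}, \]

from which we derive the MLE of D, Ds and D',

\[ \hat D = \hat p_{AB} - \hat p_A \hat p_B, \qquad \hat D_s = \frac{\hat D}{\min(\hat p_A \hat p_B,\; \hat p_a \hat p_b)}\, I(\hat D < 0) + \frac{\hat D}{\min(\hat p_A \hat p_b,\; \hat p_a \hat p_B)}\, I(\hat D \ge 0), \qquad \hat D' = |\hat D_s|, \]

where I(x ∈ X) is the indicator function defined as I(x ∈ X) = 1 if x ∈ X and 0 otherwise. Note that:
•
•
These two facts determine the bias toward disequilibrium of the MLE of D'.

Bayesian approach

Our Bayesian estimator is based on the following intuition: on average, the magnitude of disequilibrium between two SNPs decreases at an exponential rate with their physical distance. We use this information to build a conjugate prior distribution on the parameters pij with the property that, a priori, the larger the distance between two SNPs, the more likely the two SNPs are in linkage equilibrium. The standard conjugate prior to a multinomial distribution is a Dirichlet distribution with density function defined as:

\[ f(p \mid \alpha) = \frac{\Gamma(\alpha_T)}{\prod_{ij} \Gamma(\alpha_{ij})} \prod_{ij} p_{ij}^{\,\alpha_{ij} - 1} . \]

Given data nij, the posterior distribution is still a Dirichlet distribution with density function:

\[ f(p \mid n) = \frac{\Gamma(\alpha_T + n)}{\prod_{ij} \Gamma(\alpha_{ij} + n_{ij})} \prod_{ij} p_{ij}^{\,\alpha_{ij} + n_{ij} - 1} , \]

in which the prior hyper-parameters αij are updated into αij + nij. The prior means of the parameters pij are αij/αT, where αT = ∑ijαij. The posterior means become E(pij|n) = (αij + nij)/(αT + n) and can be used as point estimates of the parameters.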
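As a minimal sketch of the two point estimates introduced so far, the Python code below computes the MLE of D and D' from haplotype counts and the Dirichlet posterior means of the haplotype probabilities. The function names, the example counts and the hyper-parameter value 0.5 are hypothetical and chosen only for illustration; plugging posterior means into the formula for D' is a simple plug-in calculation, not the estimator proposed in this paper.

```python
from typing import Dict, Tuple

Hap = Dict[str, float]  # keys: "AB", "Ab", "aB", "ab"


def d_and_d_prime(p: Hap) -> Tuple[float, float]:
    """Return (D, D') for haplotype probabilities p that sum to 1."""
    p_A = p["AB"] + p["Ab"]
    p_B = p["AB"] + p["aB"]
    p_a = p["aB"] + p["ab"]
    p_b = p["Ab"] + p["ab"]
    d = p["AB"] - p_A * p_B
    # Lewontin's normalisation: the bound on |D| depends on the sign of D.
    if d < 0:
        d_max = min(p_A * p_B, p_a * p_b)
    else:
        d_max = min(p_A * p_b, p_a * p_B)
    # D' is undefined when one SNP is monomorphic (d_max == 0).
    d_prime = abs(d) / d_max if d_max > 0 else float("nan")
    return d, d_prime


def mle(counts: Hap) -> Hap:
    """Maximum likelihood estimates n_ij / n."""
    n = sum(counts.values())
    return {k: v / n for k, v in counts.items()}


def posterior_mean(counts: Hap, alpha: Hap) -> Hap:
    """Dirichlet posterior means (alpha_ij + n_ij) / (alpha_T + n)."""
    total = sum(counts.values()) + sum(alpha.values())
    return {k: (counts[k] + alpha[k]) / total for k in counts}


if __name__ == "__main__":
    # Hypothetical counts with one empty cell (n = 120 haplotypes).
    counts = {"AB": 99, "Ab": 17, "aB": 4, "ab": 0}
    alpha = {k: 0.5 for k in counts}  # illustrative hyper-parameters
    print("MLE (D, D'):", d_and_d_prime(mle(counts)))
    print("Plug-in (D, D') at posterior means:",
          d_and_d_prime(posterior_mean(counts, alpha)))
```

With an empty haplotype cell, the MLE of D' equals 1 regardless of how little evidence the sample carries, which is the behaviour the prior is meant to temper.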
Furthermore, the posterior distributions of the marginal probabilities pA and pB follow Beta distributions with hyper-parameters (αA + nA, αa + na) and (αB + nB, αb + nb), for αA = αAB + αAb, αa = αT - αA, αB = αAB + αaB, and αb = αT - αB. The inference on the parameters D, Ds and D' is more complex. First, we note that we can write these parameters as follows:
Equations (9) and (10) define the parameters Ds and D' as mixtures of two components, with weights p(D < 0) and p(D ≥ 0). The two components are non linear functions of the parameters pij, as is the parameter D, and make the exact inference on these parameters intractable. However, we can resort to Markov Chain Monte Carlo methods to generate a sample of values of either parameter from its posterior distribution that can be used for further inference.

[Figure 1]
Choice of the prior distribution

To complete the specification of the Bayesian model, we need to provide values for the hyper-parameters. Because we wish to encode the information that departure from equilibrium of any two SNPs is a function of their physical distance, we define:

\[ \alpha_{ij} = \alpha\,(1 - e^{-d}), \qquad i = A, a, \; j = B, b, \]
with α > 0, and the parameter d that represents the physical distance between the two SNPs in Mb (1 Mb = 1,000,000 nucleotide bases). With this choice of hyper-parameters, the prior means of the probabilities pij are E(pij) = 1/4 for all i = A, a, j = B, b, so that, a priori, the two SNPs are expected to be in equilibrium. On the other hand, the posterior means of pij can be written as:

\[ E(p_{ij} \mid n) = \frac{n}{n + 4\alpha(1 - e^{-d})} \cdot \frac{n_{ij}}{n} + \frac{4\alpha(1 - e^{-d})}{n + 4\alpha(1 - e^{-d})} \cdot \frac{1}{4} \]

and, hence, as a weighted average of nij/n (the MLEs of pij) and the prior probabilities 1/4. The first weight n/(n + 4α(1 - exp(-d))) is an increasing function of the sample size n, and a decreasing function of α and d, while the second weight 4α(1 - exp(-d))/(n + 4α(1 - exp(-d))) is an increasing function of α and d, and a decreasing function of n. Therefore, for large sample sizes, the posterior means of pij approach the MLEs. This is consistent with the fact that, in large samples, the effect of the prior distribution on the posterior distribution becomes negligible. Likewise, when the distance d decreases, the function 1 - exp(-d) approaches 0, and the weight n/(n + 4α(1 - exp(-d))) approaches 1, so that the Bayesian estimate becomes closer to the MLE. In the limiting case d = 0, or α = 0, the two estimates are identical. For fixed α and increasing distance (essentially d > 0.5 Mb), the second weight approaches its maximum value 4α/(n + 4α), and larger values of α further increase the weight given to the prior mean. To contain the effect of the prior distribution, we use α = 1; the simulation studies described in the next section show that this choice produces a good trade-off between robustness and bias.

In the absence of a closed form expression for the prior distributions of the parameters D,
As an example, Table 1 displays the frequencies of the four haplotypes AB, Ab, aB and ab that were observed between SNPs S1 and S2 on chromosome 22, at the positions 15040669 and 15043944. These are real data that were derived from the thirty trios of the CEPH population (Utah residents with ancestry from northern and western Europe) who provided the DNA samples for the IHMP [5]. The observed haplotype frequencies are consistent with the hypothesis of linkage equilibrium, because the expected number of haplotypes ab is 0.5 under equilibrium and the assumption that the population allele frequencies equal the marginal estimates pa = 0.03 and pb = 0.14. However, the lack of observed haplotypes ab could also be due to perfect LD between the two SNPs. Given that the physical distance between S1 and S2 is 0.0032 Mb, and the average D' in chromosome 22 ranges between 0.8 and 1 for SNPs that are within 0.01 Mb, and becomes less than 0.5 for SNPs that are more than 0.1 Mb apart [5], it is likely that S1 and S2 are in disequilibrium.

Consider now a third SNP S3 at position 15405264 of chromosome 22. The frequencies of the four haplotypes between S1 and S3 are the same as in Table 1, but now the physical distance between these two SNPs is 0.364 Mb. Given the extent of LD, equilibrium is more likely between S1 and S3, although the haplotype frequencies are the same. The MLE of D' is 1 in both cases, with the same confidence interval, while the Bayesian estimate of D' changes with the distance between the two SNPs.

[Figure 3]
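The effect of distance in this example can be illustrated with a short simulation. The sketch below is not the Gibbs sampler used in BLink: because phase is assumed known, the posterior is the Dirichlet distribution given earlier, so it is sampled directly with NumPy. The hyper-parameter α(1 - exp(-d)) is the one implied by the posterior weights stated above, the counts are hypothetical values with an empty ab cell (they are not the entries of Table 1), and the function names are illustrative.

```python
import numpy as np


def lewontin_d_prime(p: np.ndarray) -> np.ndarray:
    """D' for each row of p, with columns ordered AB, Ab, aB, ab."""
    p_AB, p_Ab, p_aB, p_ab = p.T
    p_A, p_B = p_AB + p_Ab, p_AB + p_aB
    p_a, p_b = p_aB + p_ab, p_Ab + p_ab
    d = p_AB - p_A * p_B
    d_max = np.where(d < 0,
                     np.minimum(p_A * p_B, p_a * p_b),
                     np.minimum(p_A * p_b, p_a * p_B))
    return np.abs(d) / d_max


def posterior_d_prime(counts, d_mb, alpha=1.0, draws=5000, seed=0):
    """Draw D' from the Dirichlet posterior for SNPs d_mb megabases apart.

    d_mb must be > 0 here: at d_mb = 0 the prior vanishes and an empty
    cell would give a zero Dirichlet parameter.
    """
    rng = np.random.default_rng(seed)
    prior = alpha * (1.0 - np.exp(-d_mb))          # same value for all four cells
    post = np.asarray(counts, dtype=float) + prior  # conjugate update
    return lewontin_d_prime(rng.dirichlet(post, size=draws))


if __name__ == "__main__":
    counts = [99, 17, 4, 0]  # hypothetical AB, Ab, aB, ab counts
    for d_mb in (0.0032, 0.364):
        sample = posterior_d_prime(counts, d_mb)
        print(f"d = {d_mb} Mb: posterior median of D' = {np.median(sample):.3f}")
```

The same counts yield a posterior median close to 1 at the short distance and a lower one at the longer distance, which mirrors the S1-S2 versus S1-S3 comparison.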
Approximate estimates

In practical applications, we have computed the Bayesian estimate of D' for regions with at most 200 SNPs, which corresponds to examining a block of approximately 500 kb assuming one SNP every 2.5 kb. However, if the focus is on generating a point estimate of the parameters in order to display LD over large regions or an entire chromosome, as we have shown in [15], resorting to MCMC methods may become unfeasible. It is possible to compute the exact posterior mean of D, and from this we can derive approximate estimates of Ds and D' based on a Taylor expansion. To this end, we replace the weights p(D < 0) and p(D ≥ 0) in Equations (9) and (10) by the indicator functions I(E(D|n) < 0) and I(E(D|n) ≥ 0), and the expectation of the non linear functions D/max D by the first order Taylor expansion:

The main source of error in this approximation is due to replacing the probability p(D ≥ 0) by the indicator function. When we are in a clear situation of disequilibrium, the probability of the event (D < 0) is almost 0 or 1, and the approximate posterior expectation of Ds and D' approaches the exact values. When p(D < 0) is far from 0 and 1, the error increases and biases the estimates toward disequilibrium. This is consistent with the fact that the approximation is close to the MLE and therefore suffers from some bias toward disequilibrium. However, we will show with results of simulations in the next section that this bias is smaller. Because of this similarity with the MLE, we will refer to these approximate estimates as the maximum a posteriori (MAP) estimates.

Unknown phase

When the genotype data are unphased, the ML estimation uses the EM algorithm to infer the unknown phase given the distribution of known haplotypes [16]. We adopt the same procedure for the calculation of the MAP estimates.
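For two biallelic SNPs, only individuals who are heterozygous at both SNPs have ambiguous phase (AB/ab versus Ab/aB), so the EM scheme can be sketched compactly. The code below is an illustration under stated assumptions rather than the BLink implementation: the function name and arguments are hypothetical, and the maximization step uses the shrinkage form (count + α(1 - exp(-d)))/(n + 4α(1 - exp(-d))) as a stand-in for the MAP update.

```python
import math
from typing import Dict


def em_haplotype_freqs(known: Dict[str, float], double_het: int, d_mb: float,
                       alpha: float = 1.0, tol: float = 1e-10,
                       max_iter: int = 100) -> Dict[str, float]:
    """Estimate haplotype frequencies for two SNPs d_mb megabases apart.

    known      -- counts of phase-known haplotypes, keys AB, Ab, aB, ab
                  (assumed to contain at least one haplotype)
    double_het -- number of individuals heterozygous at both SNPs
    """
    a = alpha * (1.0 - math.exp(-d_mb))        # assumed prior weight per cell
    n = sum(known.values()) + 2 * double_het   # total number of haplotypes

    def shrink(counts: Dict[str, float], total: float) -> Dict[str, float]:
        return {k: (v + a) / (total + 4 * a) for k, v in counts.items()}

    p = shrink(known, sum(known.values()))     # start from known haplotypes only
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote carries AB/ab.
        cis = p["AB"] * p["ab"]
        trans = p["Ab"] * p["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        expected = {
            "AB": known["AB"] + double_het * w,
            "ab": known["ab"] + double_het * w,
            "Ab": known["Ab"] + double_het * (1.0 - w),
            "aB": known["aB"] + double_het * (1.0 - w),
        }
        # M-step: shrinkage estimate from observed plus expected counts.
        new_p = shrink(expected, n)
        if max(abs(new_p[k] - p[k]) for k in p) < tol:
            return new_p
        p = new_p
    return p


if __name__ == "__main__":
    known = {"AB": 80, "Ab": 12, "aB": 6, "ab": 2}  # hypothetical counts
    print(em_haplotype_freqs(known, double_het=10, d_mb=0.05))
```

Setting `double_het=0` reduces the procedure to a single shrinkage estimate, which is a convenient sanity check.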
Given the frequencies of known haplotypes, nij, i = A, a, j = B, b, the algorithm first computes the MAP estimates of the haplotype frequencies pij, and then alternates an expectation step, which replaces the unphased haplotypes by their expected phase, and a maximization step, which computes the MAP estimates using observed and expected haplotypes. The algorithm typically converges in fewer than 4 steps. Unknown haplotypes are regarded as missing values in the stochastic analysis, so that they become parameters of the model and are estimated within the Gibbs sampling algorithm. We also note that, when the genotype data are from trios, we use all phased haplotypes to compute the initial frequencies, regardless of whether they are transmitted from parents to offspring.

The method is implemented in the computer program BLink, which is developed in C++ and is available from the supplementary web site [11]. The software accepts genotype data from either unrelated individuals or nuclear families consisting of two parents and one child.

Evaluation

We examined the performance of the Bayesian estimator in three groups of simulated data and a real data set derived from the IHMP. All data used in this evaluation are available from the supplementary web site [11].

Materials and methods

Group 1

The objectives of the first simulation study were (1) to compare the performance of the Bayesian estimates and the MLE for different sample sizes and small values of the MAF, and (2) to assess the accuracy of the MAP approximation to the stochastic estimates of Ds and D'. We generated samples of 60, 120, 240 haplotypes, in which we modeled the true D' as D' = exp(-d) for a distance d ranging from 0 to 0.5 Mb.
For each value of D' and each sample size, we generated 1,000 samples of haplotypes by using the joint probability of haplotypes defined by Equation (1), with pB generated from a uniform distribution in the interval [0.5; 0.9) and pA generated from a uniform distribution in the interval [pB; 0.95). In each simulated sample, we computed the MLE and the MAP estimate of Ds, as well as the stochastic estimate of Ds using Gibbs sampling. To compute the stochastic estimates we ran the chain for an initial burn-in of 1,000 iterations and then based the inference on a second sample of 1,000 iterations. We used as point estimate the median value of the simulated sample and α = 1 in each analysis.

Group 2

In this second set, we generated a sample of 1,000 individuals in a region of 0.5 Mb with the program MS, which simulates genotype data under a variety of neutral models [17]. We considered a population of 1 million individuals, a mutation rate of 10E-9 per base pair, and a recombination rate of 8 × 10E-9 between adjacent base pairs per generation. Only 10% of the 8080 SNPs in the sample of 1,000 individuals were randomly selected and, from this sample, we randomly generated subsamples of sizes 60, 120, 240 and 480 haplotypes. In the absence of "true" values for D', we studied the decay of LD inferred by the MLE and the MAP estimator for increasing physical distances, versus the LD decay inferred in the original sample of 1,000 individuals. Each point in the plot is the average estimate of D' for all the SNPs within a physical distance of d ± 0.01 Mb. By averaging the LD between pairs of SNPs at increasing distance, these plots are used to summarize the decay of LD over large regions [18,10]. Ascertainment bias was assessed by repeating the analysis with these thresholds on the MAF: 0, 0.05, 0.1, 0.2. Sensitivity to the prior distribution was assessed by repeating the analysis for α = 0.25, 1, 2, 4.
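The averaging behind an LD decay plot, as described for Group 2, can be sketched as follows. The function name, its arguments and the toy numbers are hypothetical; the only element taken from the text is the window of d ± 0.01 Mb around each plotted distance. The pairwise D' values may come from any estimator (MLE or MAP).

```python
import numpy as np


def ld_decay_curve(distances_mb, d_prime, centers_mb, half_width_mb=0.01):
    """Average D' over all SNP pairs whose distance lies within
    half_width_mb of each centre; NaN where a window holds no pair."""
    distances_mb = np.asarray(distances_mb, dtype=float)
    d_prime = np.asarray(d_prime, dtype=float)
    curve = []
    for centre in centers_mb:
        in_window = np.abs(distances_mb - centre) <= half_width_mb
        curve.append(d_prime[in_window].mean() if in_window.any() else np.nan)
    return np.array(curve)


if __name__ == "__main__":
    # Toy pairwise distances (Mb) and D' estimates, for illustration only.
    dist = [0.005, 0.012, 0.05, 0.055, 0.2]
    dp = [1.0, 0.8, 0.6, 0.4, 0.1]
    print(ld_decay_curve(dist, dp, centers_mb=[0.01, 0.05, 0.1]))
```

Windows centred less than 0.02 Mb apart overlap, so a pair can contribute to more than one point of the curve; whether the published plots use overlapping windows is not stated in the text.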
Group 3

To examine the robustness of the MAP estimators, we also generated data under a different model of allele frequency, linkage disequilibrium and population differentiation that is implemented in the software COSI [19]. We simulated a sample of 1,000 individuals under the calibrated model for the European population, which considers bottlenecks, migration and recombination hotspots with a spacing of 0.085 Mb [19]. We randomly selected 10% of the generated 32452 SNPs and from this sample we randomly selected subsamples of 60, 120, 240 and 480 haplotypes. We produced LD decay plots using the thresholds 0, 0.05, 0.1, 0.2 on the MAF and the range of α values that were used for the analysis of the simulated data in group 2.

Real data

Real data were obtained from the first phase of the IHMP [5]. We used genotype data of the 30 trios of the CEPH and of the Yoruba in Ibadan, Nigeria, for chromosome 22, because its pattern of LD has been widely studied [9]. This chromosome was genotyped at 19120 and 19854 SNPs in the CEPH and Yoruba samples, respectively. We produced LD decay plots using the thresholds on the MAF and the range of α values that we used for the analysis of the simulated data in group 2. We also produced more informative graphical displays of pairwise LD, by generating bi-dimensional maps similar to those generated by the program Haploview, but with a lower resolution to enable the display of LD over larger regions. The maps were generated with the program BMapBuilder [15] using the MLE and the MAP estimate of D'.

Results

[Figure 4]
[Figure 5]

[Figures 6, 7 and 9: LD decay plots]
Conclusion

A good estimation of D' is crucial for a better understanding of patterns of LD, a robust identification of haplotype blocks, more accurate algorithms for haplotype reconstruction, and better reproducibility of genetic studies. The popular MLE of D' is biased toward disequilibrium, and requires the use of thresholds on the MAF that have been shown to introduce ascertainment bias. By using an informative prior that models the LD between SNPs based on their physical distance, we define a Bayesian estimator that outperforms the MLE without increasing computational complexity. Our estimator is slightly biased toward equilibrium, but this bias tends to disappear quickly with increasing sample sizes, and at a faster rate than the bias toward disequilibrium of the MLE. Furthermore, our evaluation shows that the MAP estimator does not require any thresholds on the MAF.

There are several limitations to this work. The probability distribution of the haplotypes is modelled using a multinomial distribution with a Dirichlet prior, and this assumption can be relaxed to include more general models. Also, the prior distribution does not take into account recombination hotspots. We have assessed the impact of this assumption in our simulations, but more evaluation is needed. Our analysis is currently limited to biallelic SNPs; however, our Bayesian model can be extended to include measures of LD for multi-allelic SNPs. For example, a first-order approximation of the average estimator of D' suggested in [2] can be computed by averaging the MAP estimates. Some more work is needed to examine the effect of the prior hyper-parameters. In future work we will extend our results to other measures of LD, particularly r² = D²/(pApapBpb). Some preliminary results that are posted on our supplementary web site suggest that a Bayesian estimator of r² developed along the lines of the estimator introduced in this paper would gain robustness.
Authors' contributions

PS developed the stochastic method, designed and carried out part of the simulations, and drafted the manuscript. MAG developed and evaluated the approximate method, and she implemented the method in a computer program. Both authors designed the study, and read and approved the final manuscript.

Acknowledgements

Research supported by NIH/NHLBI grant R21 HL080463-01, NIH/NIDDK 1R01DK069646-01A1 and the Spanish research program [projects TIN2004-06204-C03-02 and TIN2005-02516]. Comments of three anonymous reviewers were very helpful to improve the initial version of the manuscript.