
# A Flexible Bayesian Framework for Modeling Haplotype Association with Disease, Allowing for Dominance Effects of the Underlying Causative Variants

## Abstract

Multilocus analysis of single-nucleotide–polymorphism (SNP) haplotypes may provide evidence of association with disease, even when the individual loci themselves do not. Haplotype-based methods are expected to outperform single-SNP analyses because (i) common genetic variation can be structured into haplotypes within blocks of strong linkage disequilibrium and (ii) the functional properties of a protein are determined by the linear sequence of amino acids corresponding to DNA variation on a haplotype. Here, I propose a flexible Bayesian framework for modeling haplotype association with disease in population-based studies of candidate genes or small candidate regions. I employ a Bayesian partition model to describe the correlation between marker-SNP haplotypes and causal variants at the underlying functional polymorphism(s). Under this model, haplotypes are clustered according to their similarity, in terms of marker-SNP allele matches, which is used as a proxy for recent shared ancestry. Haplotypes within a cluster are then assigned the same probability of carrying a causal variant at the functional polymorphism(s). In this way, I can account for the dominance effect of causal variants, here corresponding to any deviation from a multiplicative contribution to disease risk. The results of a detailed simulation study demonstrate that there is minimal cost associated with modeling these dominance effects, with substantial gains in power over haplotype-based methods that do not incorporate clustering and that assume a multiplicative model of disease risks.

It is widely accepted that population-based disease-marker association studies of samples of unrelated affected cases and unaffected controls have the potential to map genes contributing to complex traits, provided that the causative variants are not extremely rare.^{1}^{,}^{2} The success of this approach relies on genotyping genetic markers—typically SNPs that are in strong linkage disequilibrium (LD) with the functional polymorphism(s)—generated as a result of the shared ancestry of sampled individuals in the flanking region. Initial association studies focused on genotyping high-density SNPs in candidate genes with a functional basis for disease and/or located in regions highlighted by the results of previous linkage-based analyses. However, with improvements in the efficiency of high-throughput SNP genotyping technology, genomewide scans of hundreds of thousands of markers are now under way with the large sample sizes required to detect the modest genetic effects we expect for complex traits.^{3}

One of the most attractive features of SNPs for complex disease-gene mapping is their abundance throughout the genome. However, single-locus analyses—testing for disease association with each SNP, in turn—do not take into account the background patterns of LD between loci and, hence, may be inefficient even before the issue of multiple testing with many markers is addressed. Data from the International Haplotype Map (HapMap) project suggest that much of the human genome can be arranged in blocks of common SNPs in strong LD with one another.^{4}^{,}^{5} Haplotype diversity within blocks is very much driven by mutation, rather than by ancestral recombination events. Thus, much of common genetic variation can be structured into haplotypes within blocks that are rarely disturbed by meiosis. Furthermore, Clark^{6} emphasizes that the functional properties of a protein are determined by the linear sequence of amino acids corresponding to DNA variation on a haplotype. For example, there is evidence that a combination of causal variants in *cis* in the *HPC2/ELAC2* gene increases the risk of prostate cancer.^{7} This suggests that appropriate multilocus analyses of SNP haplotypes within blocks of strong LD may provide evidence of association for modest genetic effects, even when the individual polymorphisms themselves do not.

The most convenient framework for the development of statistical methodology for multilocus analyses of population-based association studies is the logistic-regression model. It is common to assume a multiplicative model of disease risks, so that paternally and maternally derived alleles contribute independent effects. Under this assumption, the logistic-regression model can be parameterized in terms of the risk (or, more precisely, the odds) of disease for each marker-SNP haplotype. Within this framework, it is straightforward to accommodate covariates that may include environmental and other nongenetic risk factors, polygenic effects, and genotypes at ancestrally informative markers, to allow for underlying population stratification. To allow for unknown phase, we consider all possible pairs of haplotypes consistent with the observed genotype data for each individual.^{8–11} Each consistent haplotype configuration is weighted in the logistic-regression likelihood by the corresponding phase-assignment probability, easily estimated within blocks of strong LD by statistical algorithms, such as PHASE,^{12}^{,}^{13} that also take into account additional uncertainty due to missing genotype data.

A major limitation of haplotype-based analyses is lack of parsimony, since we require an odds parameter in the logistic-regression model for each distinct marker-SNP haplotype consistent with the observed genotype data. However, we can take advantage of the expectation that chromosomes carrying the same causal variant tend to share more-recent common ancestry at the underlying functional polymorphism(s) than do a random pair from the population and, thus, are more likely to carry similar haplotypes in the flanking genomic region. Thus, by clustering marker-SNP haplotypes according to their similarity, we can assign the same genetic effect(s) to haplotypes within the same clade, reducing the number of parameters in the logistic-regression model without substantial loss of information.^{11}^{,}^{14–21} Morris^{11} clusters marker-SNP haplotypes according to a Bayesian partition model.^{22}^{,}^{23} The model is specified by selecting cluster “centers” from the set of distinct haplotypes consistent with the observed genotype data, identified via implementation of the expectation-maximization (EM) algorithm.^{24} The remaining haplotypes are then assigned to the nearest cluster center, where similarity is defined in terms of marker-allele matches, appropriate for haplotype diversity driven by mutation, such as that expected within blocks of SNPs in strong LD with each other.

In this article, I generalize the approach developed by Morris,^{11} to allow for more-flexible modeling of marker-SNP haplotype association with disease—in particular, to account for dominance effects, here corresponding to any deviation from multiplicative disease risks. This is achieved by introducing a latent variable to describe the presence or absence of causal variants at the functional polymorphism(s) on each marker-SNP haplotype, in a way similar to that of Clayton et al.^{25} I assume that each causal variant has the same genetic effects on disease. Thus, the logistic-regression model can be parameterized in terms of (i) the additive effect (or multiplicative risk) of causal variants and (ii) the dominance effect of causal variants over other alleles at the functional polymorphism(s). The Bayesian partition model is then used to describe the correlation between marker-SNP haplotypes and alleles at the functional polymorphism(s). Under this model, each haplotype allocated to the same cluster is assigned the same probability of carrying a causal variant. I develop a reversible-jump Markov chain–Monte Carlo (MCMC) algorithm, GENEBPMv2, to sample over the space of haplotype clusters and the corresponding probabilities that they carry a causal variant at the functional polymorphism(s), in addition to additive and dominance effects of the causal variants and covariate-regression coefficients.

The GENEBPMv2 algorithm can be used to assess the evidence in favor of disease association with polymorphisms in a candidate gene or a small candidate region. However, as the length of the candidate region increases, recombination will play an increasingly important role in driving haplotype diversity, and there will be greater uncertainty and less accuracy in phase assignment from unphased genotype data. I illustrate the method by application to high-density unphased SNP-genotype data across an 890-kb candidate region flanking the *CYP2D6* gene, for association with a recessive poor-drug-metabolizer (PDM) phenotype.^{26} The results of my analysis provide overwhelming evidence of association of the PDM phenotype with polymorphisms in the candidate region and correctly highlight the recessive effect of the underlying causal variants. Further, I identify two clusters of haplotypes, both with high probability of carrying different causal variants in *CYP2D6,* relative to all other haplotypes. Consequently, I am able to distinguish individuals with PDM carrying two copies of the most common causal variant from those carrying rarer, high-risk mutations at the functional polymorphisms in *CYP2D6.*

Dominance is often overlooked in haplotype-based association studies because the gain in power over a model of multiplicative disease risks is generally not appreciable, unless there is a strong recessive effect of the causative variant or there is overdominance, and because there is a fear that a less parsimonious model will lose power as a result of the multiple-testing burden. However, I demonstrate, by simulation, that, within the Bayesian framework developed here, there is minimal loss in power from allowance for deviations from a multiplicative model of disease risks with the GENEBPMv2 algorithm. Encouragingly, my results suggest substantial gains in power over existing haplotype-based tests of association that do not allow for dominance and clustering. This clearly demonstrates that, with an appropriate analysis such as that implemented here in the GENEBPMv2 algorithm, dominance effects can and should be included in haplotype-based association studies to increase power without penalty for reduced parsimony.

## Model and Methods

Consider a case-control sample of unrelated individuals, typed at *N* marker SNPs in a candidate gene or small candidate region, yielding unphased genotypes **G**. Alleles at each SNP are coded as “1” for the major allele (i.e., the most frequent in the population) and as “2” for the minor allele. The disease status of individual *i* is denoted *y*_{i}=1 if affected and *y*_{i}=0 if unaffected. I allow for additional covariates, **x**, each scaled to have zero mean and unit variance. The set of *J* distinct marker-SNP haplotypes consistent with the observed unphased genotypes is denoted **H** = {*H*_{1}, *H*_{2}, …, *H*_{J}}, where *H*_{j} denotes the *j*th most frequent haplotype. Relative haplotype frequencies, **h**, are estimated by means of maximum likelihood via implementation of the EM algorithm, where *h*_{j} denotes the frequency of *H*_{j}. I denote by *D* the set of observed data and estimated haplotype frequencies.
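The EM step for haplotype frequencies can be illustrated with a minimal sketch. It assumes Hardy-Weinberg equilibrium, a fixed candidate list of haplotypes, and genotypes stored as one unordered allele pair per SNP; the function names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical and are not taken from the GENEBPMv2 software.

```python
from itertools import product

def compatible_pairs(genotype, haplotypes):
    """Ordered haplotype index pairs (j1, j2) consistent with an unphased genotype.

    A genotype is a tuple of unordered allele pairs, one per SNP,
    e.g. ((1, 2), (1, 1)) with alleles coded 1 (major) and 2 (minor).
    """
    pairs = []
    for j1, j2 in product(range(len(haplotypes)), repeat=2):
        h1, h2 = haplotypes[j1], haplotypes[j2]
        if all(sorted((a, b)) == sorted(g) for a, b, g in zip(h1, h2, genotype)):
            pairs.append((j1, j2))
    return pairs

def em_haplotype_frequencies(genotypes, haplotypes, n_iter=100):
    """EM estimates of haplotype frequencies (assumes Hardy-Weinberg equilibrium)."""
    n_hap = len(haplotypes)
    freqs = [1.0 / n_hap] * n_hap  # start from uniform frequencies
    pairs = [compatible_pairs(g, haplotypes) for g in genotypes]
    for _ in range(n_iter):
        counts = [0.0] * n_hap
        for ind_pairs in pairs:
            # E-step: posterior weight of each phase configuration
            weights = [freqs[j1] * freqs[j2] for j1, j2 in ind_pairs]
            total = sum(weights)
            if total == 0.0:
                raise ValueError("genotype has no supported haplotype pair")
            for (j1, j2), w in zip(ind_pairs, weights):
                counts[j1] += w / total
                counts[j2] += w / total
        # M-step: expected haplotype counts over the number of chromosomes
        freqs = [c / (2 * len(genotypes)) for c in counts]
    return freqs

# Toy data (illustrative only): two SNPs, five individuals
haplotypes = [(1, 1), (1, 2), (2, 1), (2, 2)]
genotypes = [((1, 1), (1, 1))] * 3 + [((2, 2), (2, 2))] + [((1, 2), (1, 2))]
freqs = em_haplotype_frequencies(genotypes, haplotypes)
print([round(f, 3) for f in freqs])
```

In this toy sample the double heterozygote is resolved almost entirely in favor of the 1-1/2-2 configuration, because those two haplotypes are also supported by the unambiguous homozygotes.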

I assume that alleles at the functional polymorphism(s) in the candidate gene can be classified as high-risk “causative” variants and as low-risk “protective” variants. Each causative variant is assumed to confer the same risk of disease, and likewise for protective variants. I can then model the log-odds of disease within a logistic-regression framework, parameterized in terms of additive and dominance effects of the causative variant, denoted β_{A} and β_{D}, respectively. Under the null model, *M*_{0}, of no association of disease with polymorphisms in the candidate region, β_{A}=β_{D}=0, whereas the alternative model, *M*_{1}, corresponds to β_{A}>0 and β_{D} unconstrained. Within the Bayesian paradigm, it is common to evaluate the evidence against the null model by means of the Bayes factor,

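$$\Lambda = \log_{10}\left[\frac{f(\mathbf{y} \mid D, M_1)}{f(\mathbf{y} \mid D, M_0)}\right], \tag{1}$$

where $f(\mathbf{y} \mid D, M)$ denotes the marginal likelihood of observed phenotype data, **y**, under model *M*, conditional on the observed data and estimated haplotype frequencies, *D*.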

The marginal likelihood of model *M* can be calculated by integration over model parameters θ, summarized by the directed acyclic graph (DAG) in figure 1. Model parameters include the genetic effects β_{A} and β_{D} but also baseline log-odds of disease, μ, and covariate-regression coefficients, γ. However, since I do not observe genotypes at the functional polymorphism(s) directly, θ must also include parameters to describe the correlation between causal variants and marker-SNP haplotypes in the candidate gene, here taken to be specified by a Bayesian partition model. Then,

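$$f(\mathbf{y} \mid D, M) = \int f(\mathbf{y} \mid \theta, D, M)\, f(\theta \mid M)\, d\theta, \tag{2}$$

where $f(\mathbf{y} \mid \theta, D, M)$ denotes the likelihood of observed phenotype data, given parameters θ, and $f(\theta \mid M)$ is their joint prior density under model *M*.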

### Bayesian Partition Model

The Bayesian partition model is defined by specifying *K* cluster centers, ordered and without replacement from the set of haplotypes, **H**, denoted by *C* = {*C*_{1}, *C*_{2}, …, *C*_{K}}. Haplotype *H*_{j} is then assigned to the cluster with the maximum similarity metric, defined as

$$S_{jk} = \sum_{n=1}^{N} s_{jk(n)}$$

for the *k*th cluster center, *C*_{k}. The *n*th SNP similarity metric, *s*_{jk(n)}, is given by

$$s_{jk(n)} = \begin{cases} 1 - q_n & \text{if } H_{j(n)} = C_{k(n)} = 2, \\ q_n & \text{if } H_{j(n)} = C_{k(n)} = 1, \\ 0 & \text{otherwise,} \end{cases}$$

where *H*_{j(n)} and *C*_{k(n)} denote the allele present at SNP *n* on haplotype *H*_{j} and cluster center *C*_{k}, respectively, and *q*_{n} denotes the relative sample frequency of the minor allele in controls. If haplotype *H*_{j} is equidistant from more than one cluster center, it is assigned to the center with minimum *k.* According to the Bayesian partition model, each haplotype assigned to the same clade is then assumed to have the same probability of carrying a causal variant at the functional polymorphism(s), denoted φ_{k} for the *k*th cluster.

The similarity metric, *S*_{jk}, treats haplotypes that share rare alleles as less diverse than those that share common alleles, because they are expected to share more-recent common ancestry. Thus, I quantify allele sharing by the complementary-allele frequency, in the same way as did Durrant et al.,^{20} that is, by 1-*q*_{n} for minor-allele sharing at SNP *n* and by *q*_{n} for major-allele sharing. A number of alternative metrics have been proposed, for example, those that weight SNP-allele matches according to their distance from a putative disease locus in the context of fine mapping^{18} or that allow for mismatch of alleles due to ancestral mutation or gene-conversion events.^{21}
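The allocation of haplotypes to cluster centers can be sketched as follows. This is a minimal illustration that assumes the overall similarity is the sum of the per-SNP scores described above (1-*q*_{n} for a shared minor allele, *q*_{n} for a shared major allele, and 0 for a mismatch); the function names and the toy haplotypes are hypothetical.

```python
def snp_similarity(h_allele, c_allele, q):
    """Per-SNP score: 1-q for a shared minor allele (2), q for a shared major allele (1)."""
    if h_allele != c_allele:
        return 0.0
    return 1.0 - q if h_allele == 2 else q

def similarity(hap, center, maf):
    """Haplotype-to-center similarity, assumed here to be the sum over SNPs."""
    return sum(snp_similarity(a, b, q) for a, b, q in zip(hap, center, maf))

def assign_clusters(haplotypes, centers, maf):
    """Assign each haplotype to the most similar center; ties go to the smallest k."""
    assignment = []
    for hap in haplotypes:
        scores = [similarity(hap, c, maf) for c in centers]
        # list.index returns the first maximum, i.e. the minimum k among ties
        assignment.append(scores.index(max(scores)))
    return assignment

# Toy example: three SNPs with minor-allele frequencies q_n in controls
maf = [0.3, 0.2, 0.1]
centers = [(1, 1, 1), (2, 2, 2)]
haplotypes = [(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2), (1, 2, 1)]
print(assign_clusters(haplotypes, centers, maf))
```

In this toy example, haplotype (1, 1, 2) joins the second center even though it matches the first center at two of three SNPs, because sharing the rare allele at the third SNP (score 0.9) outweighs sharing two common alleles (score 0.5).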

### Likelihood Calculation

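To calculate the likelihood term, $f(\mathbf{y} \mid \theta, D, M)$, in equation (2), I must consider all possible pairs of marker-SNP haplotypes consistent with the observed unphased genotype data. Then, the likelihood can be expressed as a summation over **H**, weighted by the corresponding phase-assignment probabilities, given by

$$f(\mathbf{y} \mid \theta, D, M) = \prod_{i} \sum_{j_1=1}^{J} \sum_{j_2=1}^{J} f(y_i \mid H_{j_1}, H_{j_2}, \mathbf{x}_i, \theta, M)\, f(H_{j_1}, H_{j_2} \mid G_i, \mathbf{h}). \tag{3}$$

Assuming Hardy-Weinberg equilibrium,

$$f(H_{j_1}, H_{j_2} \mid G_i, \mathbf{h}) = \frac{I_{i j_1 j_2}\, h_{j_1} h_{j_2}}{\sum_{a=1}^{J} \sum_{b=1}^{J} I_{i a b}\, h_{a} h_{b}},$$

where $I_{i j_1 j_2}$ is an indicator variable of the consistency of the genotype *G*_{i} with the pair of haplotypes *H*_{j1} and *H*_{j2}.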
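The phase-assignment weights can be illustrated with a short sketch. It assumes Hardy-Weinberg equilibrium, so that each consistent ordered pair of haplotypes is weighted by the product of its estimated frequencies and the weights are normalized over all consistent pairs; the function names and toy frequencies are hypothetical.

```python
from itertools import product

def consistent(genotype, hap1, hap2):
    """True if the haplotype pair reproduces the unphased genotype at every SNP."""
    return all(sorted((a, b)) == sorted(g) for a, b, g in zip(hap1, hap2, genotype))

def phase_probabilities(genotype, haplotypes, freqs):
    """Probability of each consistent ordered haplotype pair, assuming HWE."""
    weights = {}
    for (j1, h1), (j2, h2) in product(enumerate(haplotypes), repeat=2):
        if consistent(genotype, h1, h2):
            weights[(j1, j2)] = freqs[j1] * freqs[j2]
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no haplotype pair is consistent with this genotype")
    return {pair: w / total for pair, w in weights.items()}

# Toy example: a double heterozygote at two SNPs has ambiguous phase
haplotypes = [(1, 1), (1, 2), (2, 1), (2, 2)]
freqs = [0.4, 0.1, 0.2, 0.3]
probs = phase_probabilities(((1, 2), (1, 2)), haplotypes, freqs)
for pair, p in sorted(probs.items()):
    print(pair, round(p, 4))
```

An unambiguous genotype yields a single configuration with probability 1, so only individuals heterozygous at two or more marker SNPs contribute more than one term to the summation.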

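I must then consider the three possible genotype categories at the functional polymorphism(s), denoted by *Z* ∈ {0, 1, 2}, corresponding, respectively, to two protective variants (not necessarily copies of the same allele), to one protective variant and one causative variant, and to two causative variants (not necessarily copies of the same allele). Thus, denoting the genotype of the *i*th individual at the functional polymorphism(s) by *Z*_{i}, it follows that

$$f(y_i \mid H_{j_1}, H_{j_2}, \mathbf{x}_i, \theta, M) = \sum_{z=0}^{2} f(y_i \mid Z_i = z, \mathbf{x}_i, \theta, M)\, f(Z_i = z \mid H_{j_1}, H_{j_2}, \theta),$$

where

$$f(Z_i = z \mid H_{j_1}, H_{j_2}, \theta) = \begin{cases} \left(1 - \phi_{T_{C(j_1)}}\right)\left(1 - \phi_{T_{C(j_2)}}\right) & \text{if } z = 0, \\ \phi_{T_{C(j_1)}}\left(1 - \phi_{T_{C(j_2)}}\right) + \left(1 - \phi_{T_{C(j_1)}}\right)\phi_{T_{C(j_2)}} & \text{if } z = 1, \\ \phi_{T_{C(j_1)}}\, \phi_{T_{C(j_2)}} & \text{if } z = 2, \end{cases}$$

and $T_{C(j)}$ denotes the cluster assignment of haplotype *H*_{j} in partition *C*, with $\phi_k$ the probability that a haplotype in the *k*th cluster carries a causal variant. Finally, within a logistic-regression framework,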

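$$f(y_i \mid Z_i, \mathbf{x}_i, \theta, M) = \frac{\exp(y_i \eta_i)}{1 + \exp(\eta_i)},$$

where the linear component, η_{i}, is given by

$$\eta_i = \mu + \boldsymbol{\gamma}^{\mathsf{T}} \mathbf{x}_i + \beta_A Z_i + \beta_D\, I(Z_i = 1),$$

with *Z*_{i} coded as the number of causative variants carried (0, 1, or 2) and *I*(·) an indicator function.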
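The marginalization over the unobserved genotype at the functional polymorphism(s) can be sketched for a single individual and a single haplotype pair. This is an illustration under stated assumptions, not the author's implementation: the genotype is coded as the number of causative variants (0, 1, or 2), the two haplotypes carry a causal variant independently with their cluster probabilities, and the dominance effect is added to the linear predictor for heterozygotes only. That coding is one choice that reduces to multiplicative risks when β_{D}=0; all function names are hypothetical.

```python
import math

def genotype_probs(phi1, phi2):
    """P(Z = 0, 1, 2) given the carrier probabilities of the two haplotypes' clusters."""
    return [
        (1.0 - phi1) * (1.0 - phi2),
        phi1 * (1.0 - phi2) + (1.0 - phi1) * phi2,
        phi1 * phi2,
    ]

def linear_predictor(z, mu, beta_a, beta_d, gamma=(), x=()):
    """Assumed coding: additive effect per causal variant, dominance for heterozygotes."""
    eta = mu + sum(g * v for g, v in zip(gamma, x))
    eta += beta_a * z
    if z == 1:
        eta += beta_d
    return eta

def disease_prob(y, eta):
    """Logistic probability of phenotype y (1 = affected, 0 = unaffected)."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p

def likelihood_given_haplotypes(y, phi1, phi2, mu, beta_a, beta_d, gamma=(), x=()):
    """Phenotype likelihood for one haplotype pair, summed over Z = 0, 1, 2."""
    return sum(
        pz * disease_prob(y, linear_predictor(z, mu, beta_a, beta_d, gamma, x))
        for z, pz in enumerate(genotype_probs(phi1, phi2))
    )

# Toy values (illustrative only): a negative dominance effect mimics a recessive variant
print(likelihood_given_haplotypes(1, 0.98, 0.15, mu=-4.0, beta_a=4.0, beta_d=-2.0))
```

The full likelihood for an individual would then average this quantity over all haplotype pairs consistent with the observed genotype, weighted by the phase-assignment probabilities.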

### Prior-Density Function

The Bayes factor (Λ) in favor of disease association with polymorphisms in the candidate gene depends crucially on the prior-density function of parameters, $f(\theta \mid M)$, under each model *M*. Logistic-regression model parameters are assumed a priori to be independent of the partition of haplotypes. The baseline log-odds of disease is assumed to have a uniform prior distribution, whereas covariate-regression coefficients are assumed to have independent standard normal prior distributions, irrespective of the model of association. Under the null model, *M*_{0}, the genetic effects β_{A}=β_{D}=0 and the haplotype clustering is irrelevant to disease risk, so that

$$f(\theta \mid M_0) = f(\mu)\, f(\boldsymbol{\gamma}) \propto \prod_{l} \exp\left(-\tfrac{1}{2}\gamma_l^2\right).$$

Conversely, under the alternative model, *M*_{1}, the genetic effects are assumed a priori to have standard normal distributions, subject to the constraint β_{A}>0. In defining the Bayesian partition model, each haplotype in *H* has equal prior probability of selection as one of the *K* cluster centers, where each cluster has independent uniform prior probability of carrying a causal variant at the functional polymorphism(s). Furthermore, the unconditional prior density of the number of clusters, *K*>1, has a geometric distribution, such that . Thus,

I investigate the properties of the Bayes factor, Λ, for these prior-density functions by simulation, explained below.

### MCMC Algorithm

I have developed a Metropolis-Hastings MCMC algorithm^{27}^{,}^{28} to approximate the posterior density of parameters under model *M*, given by

$$f(\theta \mid \mathbf{y}, D, M) \propto f(\mathbf{y} \mid \theta, D, M)\, f(\theta \mid M),$$

in the integrand of equation (2). The dimensionality of θ depends on the number of clusters, *K,* of haplotypes. This can be addressed by incorporating a birth-death process for the number of clusters via implementation of a reversible-jump step in the MCMC algorithm.^{29} At each stage of the algorithm, a new candidate set of parameter values, θ^{′}, is proposed by making a small change to the current set, θ, as detailed in appendix B. The proposed set of parameter values is then accepted in place of θ, with probability proportional to $f(\theta' \mid \mathbf{y}, D, M) / f(\theta \mid \mathbf{y}, D, M)$; otherwise, the current set is retained.

The MCMC algorithm is run for an initial burn-in period to allow convergence from a randomly selected set of starting values of θ, assessed using standard diagnostics.^{30} After convergence, each set of parameter values accepted, or retained, by the algorithm represents a random draw from the posterior density $f(\theta \mid \mathbf{y}, D, M)$. Autocorrelation between consecutive draws is reduced by recording the sampled set of parameter values only at every *t*th iteration of the algorithm, for some suitably large *t.*
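The roles of the burn-in period and of recording every *t*th iteration can be illustrated with a minimal random-walk Metropolis-Hastings sampler. This sketch is deliberately much simpler than the algorithm described here: it samples a single parameter from a fixed-dimension target, with no reversible-jump step, and the target (a standard normal density constrained to be positive, as for the prior on β_{A}) is chosen only for illustration. The function names and tuning values are hypothetical.

```python
import math
import random

def metropolis_hastings(log_target, start, n_burn, n_sample, thin, step=1.0, seed=1):
    """Random-walk Metropolis-Hastings with burn-in and thinning for one parameter."""
    rng = random.Random(seed)
    theta, log_p = start, log_target(start)
    draws = []
    for it in range(n_burn + n_sample):
        proposal = theta + rng.gauss(0.0, step)  # small symmetric change
        log_p_new = log_target(proposal)
        # Accept with probability min(1, target ratio); otherwise retain theta
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            theta, log_p = proposal, log_p_new
        # After burn-in, record only every `thin`-th iteration
        if it >= n_burn and (it - n_burn + 1) % thin == 0:
            draws.append(theta)
    return draws

def log_half_normal(beta):
    """Log density (up to a constant) of a standard normal constrained to beta > 0."""
    return -0.5 * beta * beta if beta > 0 else float("-inf")

draws = metropolis_hastings(log_half_normal, start=1.0, n_burn=1000, n_sample=20000, thin=20)
print(len(draws), sum(draws) / len(draws))
```

Retained draws enter the output whether the proposal was accepted or not, which is why the current value is recorded even after a rejection.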

To approximate the Bayes factor in equation (1), I perform two independent runs of the MCMC algorithm—one under the null model, *M*_{0}, and one under the alternative model, *M*_{1}. Over *R* recorded MCMC outputs for each independent run,

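$$\Lambda \approx \log_{10}\left[\frac{\hat{f}(\mathbf{y} \mid D, M_1)}{\hat{f}(\mathbf{y} \mid D, M_0)}\right],$$

where

$$\hat{f}(\mathbf{y} \mid D, M) = \left[\frac{1}{R} \sum_{r=1}^{R} \frac{1}{f^{(r)}(\mathbf{y} \mid \theta, D, M)}\right]^{-1}$$

and $f^{(r)}(\mathbf{y} \mid \theta, D, M)$ denotes the likelihood in equation (3), recorded in the *r*th output under model *M*.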
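One standard way to approximate a marginal likelihood from likelihood values recorded at posterior draws is the harmonic-mean estimator. The sketch below assumes that estimator and a log_{10} scale for Λ; both are assumptions made for illustration rather than details confirmed by the text. The computation is carried out on the log scale to avoid numerical underflow, and the function names are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_marginal_harmonic(log_likelihoods):
    """Log of the harmonic mean of the likelihoods recorded over R outputs."""
    r = len(log_likelihoods)
    # log[(1/R * sum(1/L_r))^-1] = log R - logsumexp(-log L_r)
    return math.log(r) - log_sum_exp([-v for v in log_likelihoods])

def log10_bayes_factor(loglik_m1, loglik_m0):
    """Approximate log10 Bayes factor from two independent MCMC runs."""
    diff = log_marginal_harmonic(loglik_m1) - log_marginal_harmonic(loglik_m0)
    return diff / math.log(10.0)

# Toy log-likelihood traces (illustrative only)
loglik_m1 = [-100.2, -99.8, -100.5, -99.9]
loglik_m0 = [-110.1, -109.7, -110.4, -110.0]
print(log10_bayes_factor(loglik_m1, loglik_m0))
```

The harmonic-mean estimator is simple because it needs only the recorded likelihoods, but it is dominated by the smallest likelihood values and can have high variance, so long runs are advisable.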

### Interpretation

The Bayes factor, Λ, reflects the strength of evidence in favor of disease association with polymorphisms in the candidate gene. By convention, Λ>0.5 corresponds to positive evidence of association, whereas Λ>1 and Λ>2 correspond to strong and decisive evidence, respectively.^{31} The posterior probability of association can then be approximated by

$$\frac{\pi\, 10^{\Lambda}}{1 - \pi + \pi\, 10^{\Lambda}},$$

where π is the prior probability against the null model, reflecting beliefs about association between disease and polymorphisms in the candidate gene before the data are examined. This probability might take into account the functional relevance of the gene or the results of previous linkage and association studies of the same disease. The issue of multiple testing with many candidate genes or regions can also be addressed by increasing the prior probability in favor of the null model for each.^{32}
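The conversion from Bayes factor to posterior probability of association can be sketched as follows, assuming that Λ is reported on the log_{10} scale (an assumption made for this illustration) and that the prior probability of association lies strictly between 0 and 1. The function name is hypothetical.

```python
import math

def posterior_probability(log10_bf, prior):
    """Posterior probability of association from a log10 Bayes factor and a prior."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    # Posterior odds = Bayes factor x prior odds, combined on the log10 scale
    log10_odds = log10_bf + math.log10(prior / (1.0 - prior))
    if log10_odds > 300:  # avoid floating-point overflow for overwhelming evidence
        return 1.0
    odds = 10.0 ** log10_odds
    return odds / (1.0 + odds)

# A sceptical prior of 1% combined with increasing evidence
for lam in (0.5, 1.0, 2.0, 4.0):
    print(lam, round(posterior_probability(lam, 0.01), 4))
```

Lowering the prior probability for each of many candidate genes, as suggested for multiple testing, raises the Bayes factor required to reach a given posterior probability.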

### Software Availability

The GENEBPMv2 software has been developed to (i) obtain maximum-likelihood estimates of the relative frequencies of haplotypes consistent with a sample of observed SNP genotypes, via application of the EM algorithm; (ii) implement the MCMC algorithm to sample over the space of covariate-regression coefficients under the null model of no association; (iii) implement the reversible-jump MCMC algorithm to sample over the space of haplotype clusters and the corresponding probabilities that they carry the causal variant at the functional polymorphism(s), in addition to the additive and dominance effects of the causal variant and covariate-regression coefficients under the alternative model of association; and (iv) estimate Λ and summarize the output of the MCMC algorithm. The GENEBPMv2 software is available, as a suite of Linux executables, on request from the author.

## Results

In this section, I demonstrate the utility of the proposed method by applying the GENEBPMv2 algorithm to the detection of association of the PDM phenotype with polymorphisms in the *CYP2D6* gene. I also present the results of a simulation study to investigate the performance of the GENEBPMv2 algorithm to detect disease association with polymorphisms in candidate regions of up to 100 kb in length.

### Example Application: *CYP2D6*

The gene *CYP2D6* on human chromosome 22q13 has an established role in drug response and is known to be involved in the metabolism of ~20% of commonly prescribed compounds.^{33} Four functional polymorphisms have been identified in the gene; the most common mutation, G1846A, occurs with relative frequency of 20.7% in the general population.^{34} The PDM phenotype acts in a recessive fashion and is manifested in individuals homozygous or compound heterozygous for causal variants at these functional polymorphisms.

Hosking et al.^{26} genotyped 1,108 individuals at 32 SNP markers across an 890-kb region flanking *CYP2D6,* to evaluate the efficacy of mapping methods to identify the gene. By the typing of the sample at the four known functional polymorphisms, 41 individuals were predicted to have the PDM phenotype. Single-locus analysis of the markers identified 10 SNPs displaying strong association with the PDM phenotype and residing in a block of strong LD that includes *CYP2D6.* Here, I apply the GENEBPMv2 software to the 32 marker SNPs (excluding the functional polymorphisms) to demonstrate (i) evidence of the recessive effect of causal variants in *CYP2D6,* by allowing for deviations from a multiplicative model of PDM phenotype risks, and (ii) clustering of haplotypes carrying the same causal variant in *CYP2D6.*

Implementation of the EM algorithm identified 906 haplotypes consistent with the observed marker-SNP genotype data. There were 17 common haplotypes with estimated relative frequency of at least 1%. I performed two independent runs of the MCMC algorithm: once under the null model (β_{A}=β_{D}=0) and once under the alternative model (β_{A}>0 and β_{D} unconstrained). Each run of the MCMC algorithm consisted of an initial 100,000 iteration burn-in period, to allow convergence from a random starting parameter set. In the subsequent 1,000,000-iteration sampling period, output of the algorithm was recorded every 1,000th iteration.

Figure 2 presents a summary of output from the sampling period of the MCMC algorithm (1,000 recorded outputs) under the alternative model of association. Figure 2*A* and 2*B* present the prior and posterior distributions of the additive and dominance effects of causative variants at functional polymorphisms in *CYP2D6,* with figure 2*C* demonstrating the posterior correlation between them. Finally, figure 2*D* presents the posterior distribution of the number of clusters, here ranging from 3 to 9, with a mode of 4. The Λ is 48.436, which provides overwhelming evidence in favor of association of the PDM phenotype with polymorphisms in the candidate region. The posterior mean (±SD) additive and dominance effects of causal variants in *CYP2D6* are estimated as 4.17 (±0.52) and −2.26 (±0.67), respectively. The dominance effect is negative, which correctly highlights the recessive nature of deviations from the multiplicative model of disease risks.

**Figure 2.** … *CYP2D6* gene. *A,* Prior and posterior distribution of additive …

Figure 3 presents a summary of the posterior partition of common marker-SNP haplotypes across the 890-kb region flanking the *CYP2D6* gene under the alternative model, identifying four clear clusters, labeled “A–D.” The dendrogram was constructed using standard average-linkage hierarchical-clustering techniques^{35} based on a posterior measure of pairwise similarity, given by the proportion of MCMC outputs in which each pair of haplotypes appears in the same cluster of the Bayesian partition model. To interpret the clustering, I estimate the posterior probability, ψ_{j}, that haplotype *H*_{j} carries a causal variant at the functional polymorphisms in *CYP2D6* (table 1). Over *R* outputs of the MCMC algorithm,

$$\psi_j \approx \frac{1}{R} \sum_{r=1}^{R} \phi^{(r)}_{T_{C(j)}},$$

where $\phi^{(r)}_{T_{C(j)}}$ denotes the probability that haplotype *H*_{j} carries a causal variant at the functional polymorphisms recorded in the *r*th output. Marker-SNP haplotypes in cluster A have a 98% probability of carrying a causal variant, whereas those in cluster B have a 15%–16% probability, both considerably higher than the baseline risk of 1%–2% in clusters C and D. Note that, for a case-control study, ψ_{j} does not represent the population risk of carrying a causal variant on haplotype *H*_{j}, since the sample is enriched for affected individuals and, thus, for high-risk alleles.
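Both posterior summaries used here, the carrier probability for each haplotype and the pairwise co-clustering proportion that feeds the dendrogram, are simple averages over recorded MCMC outputs. The sketch below assumes each output is stored as a pair (cluster assignment per haplotype, carrier probability per cluster); that storage layout and the function names are hypothetical.

```python
def carrier_probability(outputs, j):
    """Posterior mean probability that haplotype j carries a causal variant."""
    return sum(phi[assignment[j]] for assignment, phi in outputs) / len(outputs)

def coclustering_proportion(outputs, j1, j2):
    """Proportion of outputs in which haplotypes j1 and j2 share a cluster."""
    return sum(assignment[j1] == assignment[j2] for assignment, _ in outputs) / len(outputs)

# Three toy outputs for three haplotypes; the number of clusters varies between outputs
outputs = [
    ([0, 0, 1], [0.9, 0.1]),
    ([0, 1, 1], [0.8, 0.2]),
    ([0, 0, 0], [0.7]),
]
print([round(carrier_probability(outputs, j), 3) for j in range(3)])
print(round(coclustering_proportion(outputs, 0, 1), 3))
```

Because cluster labels are not comparable between outputs, both summaries are indexed by haplotype rather than by cluster, which keeps them well defined as the partition changes.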

**Figure 3.** … *CYP2D6* gene. Haplotypes are coded according …

**Table 1.** … *H*_{j} Carries a Causal Variant at the Functional Polymorphisms in *CYP2D6*

As a final stage in the analysis, I investigated the relatedness of the 41 individuals with the PDM phenotype, illustrated by the dendrogram presented in figure 4, constructed using hierarchical-clustering techniques based on the output of the MCMC algorithm for the alternative model. Similarity between a pair of individuals is given by the posterior mean number of haplotypes they share from the same cluster of the Bayesian partition model over all MCMC outputs. For the *r*th output, the mean sharing is calculated over all possible haplotype configurations consistent with the observed genotype data of the pair of individuals, weighted by the corresponding phase-assignment probabilities. For each combination of phase assignments, sharing is scored as “2” if the individuals share both pairs of haplotypes from the same cluster(s), as “1” if they share one pair of haplotypes from the same cluster, and as “0” otherwise.
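The sharing score for a pair of individuals within a single MCMC output can be sketched as follows. This illustration interprets the 2/1/0 scoring as the size of the overlap between the two individuals' pairs of cluster labels, which is one reading of the description above, and weights each combination of phase configurations by the product of the two phase-assignment probabilities. The function names and data layout are hypothetical.

```python
from collections import Counter

def sharing_score(clusters_a, clusters_b):
    """Number of haplotypes (0, 1, or 2) the two individuals share from the same clusters."""
    overlap = Counter(clusters_a) & Counter(clusters_b)  # multiset intersection
    return sum(overlap.values())

def expected_sharing(configs_a, configs_b, assignment):
    """Mean sharing over phase configurations, weighted by phase-assignment probabilities.

    configs_a and configs_b map haplotype index pairs (j1, j2) to probabilities;
    assignment[j] is the cluster of haplotype j in the current MCMC output.
    """
    total = 0.0
    for (a1, a2), pa in configs_a.items():
        for (b1, b2), pb in configs_b.items():
            score = sharing_score(
                (assignment[a1], assignment[a2]),
                (assignment[b1], assignment[b2]),
            )
            total += pa * pb * score
    return total

# Toy example: individual A has known phase, individual B has two possible configurations
configs_a = {(0, 1): 1.0}
configs_b = {(0, 2): 0.5, (1, 2): 0.5}
print(expected_sharing(configs_a, configs_b, assignment=[0, 1, 1]))
```

Averaging this quantity over all recorded outputs would give the posterior mean sharing used as the similarity between the two individuals.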

**Figure 4.** … *CYP2D6* gene. The dendrogram is constructed to illustrate …

Figure 4 indicates the genotype of each individual at functional polymorphism(s) in *CYP2D6*: “1” corresponds to the common G1846A mutation, whereas “2” and “3” correspond to rarer mutations delA2548 and delT1707, respectively. The dendrogram distinguishes, with remarkable accuracy, individuals with different *CYP2D6* genotypes. The 32 individuals carrying the 1/1 genotype form a tight cluster, with posterior mean sharing of haplotypes from the same cluster close to 2, as would be expected. All 32 of these individuals carry two haplotypes from cluster A, which suggests that this clade is highly associated with the G1846A mutation in *CYP2D6.* The seven individuals carrying the 1/2 genotype also form a tight cluster. In addition to carrying one haplotype from cluster A, all seven of these individuals also carry one haplotype from cluster B, which suggests that this rarer clade is highly associated with the delA2548 mutation in *CYP2D6.* Morris et al.^{36} analyzed the same data in an attempt to fine map functional polymorphisms in the *CYP2D6* gene, using the COLDMAP software. Their analysis was also able to distinguish cases carrying zero, one, or two copies of the most common mutation in the gene, although the clustering was not quite so clear cut.

### Simulation Study

For simulation purposes, I consider a range of complex disease models for different causative variant frequencies at a single functional polymorphism. For each simulation model, I generate 1,000 replicates of unphased marker-SNP genotype data for unrelated cases and controls. Each replicate is obtained as follows:

1. Generate an ancestral recombination graph^{37} for a population of 20,000 SNP haplotypes from a realization of the coalescent process with recombination, simulated using the MS software.^{38} I assume scaled mutation and recombination rates of 4 per 10 kb of the region.^{39} For an effective population size of 10,000 individuals, this corresponds to a mutation rate of 10^{-8} per site, per chromosome, and per generation, and a uniform recombination rate of 1 cM per Mb.
2. Select the functional polymorphism at random from all SNPs, so that the causative variant will occur at a rate close to the predefined frequency.
3. Select markers at random from the remaining SNPs, with a selection probability that is a function of the minor-allele frequency (MAF), *p*. This distribution reflects the bias in MAF towards common SNPs in public databases, such as the International HapMap project.^{4}^{,}^{5}
4. Generate a diploid individual by sampling a pair of haplotypes, at random and with replacement, from the population of 20,000 chromosomes. Generate the disease phenotype of the individual according to their genotype at the functional polymorphism and the predefined simulation disease model. Repeat this step until the required number of cases and controls have been simulated.
5. Retain the unphased genotype of each individual only at the marker SNPs.
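The scaled rates in step 1 and the sampling scheme in steps 4 and 5 can be sketched as follows. This is an illustration under stated assumptions: a small hand-written list of haplotypes stands in for the coalescent-simulated population of 20,000 chromosomes, the causative variant is taken to be the allele coded “2” at the functional SNP, and the baseline risk of 0.05 is an arbitrary value chosen so that the genotype relative risks translate into probabilities. The function names are hypothetical.

```python
import random

def scaled_rate(effective_size, rate_per_site, length_bp):
    """Scaled rate 4 * N_e * rate * L used by coalescent simulators."""
    return 4 * effective_size * rate_per_site * length_bp

def simulate_case_control(population, causal_index, risks, n_cases, n_controls, seed=1):
    """Sample diploid individuals until the required cases and controls are obtained.

    risks[z] is the probability of disease for an individual carrying z causal variants.
    Returned genotypes are unphased and exclude the functional polymorphism.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        h1, h2 = rng.choice(population), rng.choice(population)  # with replacement
        n_causal = (h1[causal_index] == 2) + (h2[causal_index] == 2)
        affected = rng.random() < risks[n_causal]
        genotype = tuple(
            tuple(sorted((a, b)))
            for n, (a, b) in enumerate(zip(h1, h2))
            if n != causal_index
        )
        if affected and len(cases) < n_cases:
            cases.append(genotype)
        elif not affected and len(controls) < n_controls:
            controls.append(genotype)
    return cases, controls

# Mutation rate of 1e-8 per site with N_e = 10,000 over 10 kb gives a scaled rate of 4
print(scaled_rate(10_000, 1e-8, 10_000))

# Toy population: the causal variant (allele 2 at index 1) has frequency 0.2
population = [(1, 1, 1, 1)] * 12 + [(1, 2, 2, 1)] * 4 + [(2, 1, 1, 2)] * 4
baseline = 0.05  # assumed baseline risk
risks = [baseline, baseline * 2, baseline * 5]  # heterozygous GRR 2, homozygous GRR 5
cases, controls = simulate_case_control(population, 1, risks, n_cases=50, n_controls=50)
print(len(cases), len(controls))
```

A recombination rate of 1 cM per Mb is 10^{-8} per site, so the same function reproduces the scaled recombination rate of 4 per 10 kb.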

For each replicate of data, I obtain maximum-likelihood estimates of marker-SNP haplotype frequencies via implementation of the EM algorithm. I then perform two independent runs of the MCMC algorithm: once under the null model of no association (β_{A}=β_{D}=0) and once under a general alternative model allowing for nonmultiplicative disease risks (β_{A}>0 and β_{D} unconstrained). Each run of the MCMC algorithm consists of an initial 100,000 iteration burn-in period, with output recorded every 1,000th iteration in the subsequent 1,000,000-iteration sampling period. For each replicate of data, output from each run of the MCMC algorithm is used to approximate Λ.

Table 2 presents summary statistics to assess the properties of the GENEBPMv2 algorithm under the null simulation model, in which all individuals have the same risk of disease, regardless of their genotype at the functional polymorphism. Results are presented for a range of different candidate regions and sample sizes. As expected, the LD between marker SNPs decreases as the length of the candidate region—and, hence, the distance between them—increases. As a consequence, there is greater haplotype diversity in larger candidate regions. For larger sample sizes, the mean number of haplotypes increases, since there is greater opportunity to observe rare haplotypes. The mean number of common haplotypes, however, is unaffected by sample size. The mean number of clusters in the partition of haplotypes increases with the length of the candidate region but decreases with sample size. This reflects the increased haplotype diversity in large candidate regions but the reduced variability in cluster membership for larger sample sizes. The mean Λ is close to zero, irrespective of the length of the candidate region and sample size. The proportions of replicates with positive and strong evidence of association are ~7%–8% and ~1%–2%, respectively.

I next consider a simulation model of disease-marker association, parameterized in terms of (i) the population frequency of the causal variant and (ii) the genotype relative risks (GRRs) of individuals homozygous and heterozygous for the causal variant, with the homozygous protective variant genotype as baseline. Figure 5 presents the mean number of clusters in the partition of haplotypes, as a function of the disease model for candidate regions of 30 and 100 kb in length, typed at 5 and 20 SNPs, respectively, each for samples of 1,000 cases and 1,000 controls. As expected, the mean number of clusters increases with the strength of association between disease and the functional polymorphism. The number of clusters is greatest in candidate regions 100 kb in length, presumably because of the increased haplotype diversity (table 2). Nevertheless, this still reflects improved parsimony in comparison with the number of distinct haplotypes consistent with the observed genotype data.

Figure 6 presents the mean Λ in favor of disease association as a function of the disease model for candidate regions of 30 and 100 kb, typed at 5 and 20 marker SNPs, respectively, each for samples of 1,000 cases and 1,000 controls. The results are entirely as expected, where the magnitude of Λ increases with causal-variant frequency and with the strength of association between disease and the functional polymorphism. Furthermore, for a fixed causal-variant frequency, the mean Λ is generally higher in large candidate regions with greater haplotype diversity. This presumably reflects increased precision in clustering of haplotypes carrying the causal variant, despite the effects of recombination on the similarity metric and phase-reconstruction process.

#### Loss of information by ignoring dominance

For each replicate of data, I perform a third independent run of the MCMC algorithm, this time under the alternative model of association assuming multiplicative disease risks (β_{A}>0 and β_{D}=0). Figure 6 presents the increase in the mean Λ, by allowing for deviations from this multiplicative model, as a function of the disease model for candidate regions of 30 and 100 kb, typed at 5 and 20 marker SNPs, respectively, each for samples of 1,000 cases and 1,000 controls. For the lower causal-variant frequency (i.e., 0.05), affected individuals tend to be heterozygous, rather than homozygous, for the causal variant. As a result, it becomes difficult to disentangle the additive and dominance effects of the causal variant, resulting in minimal gains by allowing for deviations from a multiplicative model of disease risks. However, for higher causal-variant frequencies (e.g., 0.2 and 0.5), affected individuals homozygous for the causal variant are more common, and there is, consequently, a greater loss of information by ignoring dominance, except in situations where the disease risks are approximately multiplicative—for example, a heterozygous GRR of 2 and a homozygous GRR of 5.
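
The scarcity of homozygous carriers among cases at low causal-variant frequency can be illustrated with a short calculation. The sketch below is not part of GENEBPMv2: it assumes Hardy-Weinberg proportions in the population, treats the GRRs as risk multipliers relative to the baseline genotype, and uses illustrative function names.

```python
import math


def case_genotype_proportions(q, grr_het, grr_hom):
    """Proportions of cases carrying 0, 1, and 2 copies of the causal variant.

    Assumes Hardy-Weinberg proportions in the population (an assumption of
    this sketch) and risks expressed relative to the non-carrier genotype.
    """
    population = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]
    relative_risk = [1.0, grr_het, grr_hom]
    weighted = [p * r for p, r in zip(population, relative_risk)]
    total = sum(weighted)
    return [w / total for w in weighted]


def log_deviation_from_multiplicative(grr_het, grr_hom):
    """Zero when grr_hom equals grr_het squared, i.e. multiplicative risks."""
    return math.log(grr_hom) - 2 * math.log(grr_het)


for q in (0.05, 0.2, 0.5):
    _, het, hom = case_genotype_proportions(q, grr_het=2.0, grr_hom=5.0)
    print(f"q={q}: heterozygous cases={het:.3f}, homozygous cases={hom:.3f}")
```

With a heterozygous GRR of 2, a homozygous GRR of 4 would be exactly multiplicative, so a homozygous GRR of 5 is only a modest deviation (a log deviation of ln 1.25).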

#### Comparison with existing methods

For each replicate of data, I also perform a standard likelihood-ratio test of association of disease with marker SNPs in the candidate region, using the haplotype-based methodology developed by Zaykin et al.^{9} that does not allow for clustering or dominance. Disease status is modeled in a logistic-regression framework, parameterized in terms of the multiplicative risk of disease of each haplotype. To allow for unknown phase, all possible pairs of haplotypes consistent with the observed genotype data are considered, weighted in the logistic-regression model by the corresponding phase-assignment probability calculated from the maximum-likelihood estimates of the haplotype frequencies already obtained via implementation of the EM algorithm. Rare haplotypes, occurring with estimated relative sample frequency of <5%, are pooled to improve parsimony.
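
The pooling step in the comparison method can be sketched as follows. This is an illustrative sketch only, not the implementation of Zaykin et al.: the function name, the `pooled` label, and the example haplotypes and frequencies are all hypothetical.

```python
def pool_rare_haplotypes(frequencies, threshold=0.05, pooled_label="pooled"):
    """Merge haplotypes with estimated frequency below `threshold` into one class.

    `frequencies` maps a haplotype label to its estimated relative frequency.
    """
    pooled = {}
    rare_total = 0.0
    for haplotype, freq in frequencies.items():
        if freq < threshold:
            rare_total += freq  # rare haplotype: absorbed into the pooled class
        else:
            pooled[haplotype] = freq
    if rare_total > 0:
        pooled[pooled_label] = rare_total
    return pooled


# Hypothetical five-SNP haplotypes and frequencies, for illustration only.
example = {"00000": 0.45, "01100": 0.30, "11100": 0.18,
           "11101": 0.04, "01101": 0.03}
print(pool_rare_haplotypes(example))
```

Pooling by frequency ignores how similar the rare haplotypes are to one another or to the common haplotypes, which is the information that clustering by similarity is designed to retain.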

Figure 7 presents the power of the GENEBPMv2 algorithm to detect association with the use of a 5% significance threshold, as a function of the disease model, for candidate regions of 30 and 100 kb in length, typed at 5 and 20 marker SNPs, respectively, each for samples of 1,000 cases and 1,000 controls. The significance threshold was determined from the null distributions of Λ for samples of 1,000 cases and 1,000 controls and was 0.639 and 0.678 for candidate regions of 30 and 100 kb, respectively. As expected, power increases with the frequency of the causative variant and with the strength of association between the disease and the functional polymorphism. There is also greater power to detect association with 20 SNPs in a candidate region of 100 kb, compared with 5 SNPs in 30 kb, despite the increase in haplotype diversity, the increased uncertainty in the phase-assignment process, and the effects of recombination on the haplotype-similarity metric.
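
The calibration of the significance threshold from the null distribution, and the resulting power estimate, can be sketched as below. The function names are illustrative, and the exact quantile convention used in the simulation study is not stated in the text, so the rule shown here is an assumption.

```python
def empirical_threshold(null_values, alpha=0.05):
    """Smallest observed null value exceeded by at most a fraction `alpha`
    of the null replicates (assumed quantile convention)."""
    ordered = sorted(null_values)
    n_allowed = int(alpha * len(ordered))  # replicates allowed above the threshold
    return ordered[len(ordered) - 1 - n_allowed]


def empirical_power(alternative_values, threshold):
    """Fraction of replicates simulated under association that exceed the threshold."""
    hits = sum(1 for value in alternative_values if value > threshold)
    return hits / len(alternative_values)
```

Here `null_values` would hold Λ from replicates generated under the null simulation model, and `alternative_values` would hold Λ from replicates generated under a given disease model.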

Figure 7 also presents the gain in power of the GENEBPMv2 algorithm over the standard likelihood-ratio test of association of disease with marker SNPs in the candidate region, again evaluated using a 5% significance threshold. The GENEBPMv2 algorithm is generally at least as powerful as the likelihood-ratio test, with noticeable increases in power for causative variants of frequency 0.2. The difference in power is most noteworthy for candidate regions of 100 kb. With increased marker-SNP haplotype diversity, the causative variant is more likely to be carried by several closely related rare haplotypes, which can be identified through clustering in the Bayesian partition model but are lost when rare haplotypes are pooled by frequency.

## Discussion

In the context of association with a binary trait, “dominance” refers to any deviation from a multiplicative model of disease risks. Allowing for dominance in haplotype-based studies requires one parameter in the logistic-regression model for each observed diplotype (i.e., pair of haplotypes). Methods developed under this diplotype model will lack power to detect association in a standard frequentist-analysis framework, unless the deviation from multiplicative disease risks is extreme. Lin et al.^{40} describe a general-likelihood approach to test for association of a single target haplotype with disease, allowing for dominance. Their codominant model, which allows for deviations from multiplicative risks of the target haplotype, is less powerful than a multiplicative model for detecting association of three SNPs in the *XRCC1* gene with breast cancer. Furthermore, they cannot simultaneously consider the joint effects of all haplotypes in the gene without correcting for multiple testing of each target one by one, which is clearly suboptimal.
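
The lack of parsimony of the diplotype model can be made concrete by counting haplotype pairs. The sketch below gives the number of possible unordered pairs for a given number of distinct haplotypes. The text refers to observed diplotypes, so this count is an upper bound, and the function name is illustrative.

```python
def n_diplotypes(n_haplotypes):
    """Number of unordered haplotype pairs, including homozygous pairs."""
    return n_haplotypes * (n_haplotypes + 1) // 2


for m in (5, 10, 20):
    print(m, "haplotypes ->", n_diplotypes(m), "possible diplotypes")
```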

To overcome the problem of lack of parsimony, I use a Bayesian partition model to cluster SNP haplotypes according to their similarity, which is used as a proxy for recent shared ancestry. Each haplotype allocated to the same clade is assigned the same probability of carrying a causal variant at the functional polymorphism(s). By assuming that each causal variant has the same genetic effect on disease, the logistic-regression model can be parameterized in terms of an additive component (the multiplicative contribution to risk) and a dominance component (any nonmultiplicative contribution to risk). A similar approach has been used by Waldron et al.,^{21} in the context of fine mapping, by clustering phased SNP haplotypes into just two clades, high- and low-risk, according to the Bayesian partition model. In this way, inclusion of dominance effects of causal variants at the functional polymorphism(s) requires only a single additional parameter over a model of multiplicative disease risks. One of the main advantages of this framework is flexibility, since the logistic-regression model could easily be extended to incorporate interaction with nongenetic risk factors and epistasis between functional polymorphisms in two candidate regions, without introducing the large numbers of additional parameters that would be required in existing haplotype-based methods.

I have developed a Bayesian reversible-jump MCMC algorithm, GENEBPMv2, to sample from the posterior distribution of haplotype clusters and the corresponding probabilities that they carry a causal variant at the functional polymorphism(s), in addition to the additive and dominance effects of the causal variant and any additional covariate-regression parameters, given observed phenotype and genotype data. I allow for unphased genotype data by considering all possible haplotype configurations, weighted in the logistic-regression model by the corresponding phase-assignment probabilities. These probabilities are estimated by maximum likelihood via implementation of an EM algorithm, although other, more sophisticated haplotype-reconstruction techniques, such as PHASE,^{12}^{,}^{13} could also be used. Output from the MCMC algorithm can be used to estimate the Bayes factor in favor of association, together with the posterior distribution of additive and dominance effects of the underlying causal variants. The current implementation of the algorithm allows for up to 100 SNPs, or 2,000 distinct haplotypes, consistent with the observed genotype data. Typically, analysis of 20 SNPs, typed in 1,000 cases and 1,000 controls across a 100-kb candidate region, requires <15 min of computation time with a dedicated Pentium IV work station.
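
The way phase-assignment probabilities and cluster probabilities might enter the disease model can be sketched as follows. This is a sketch under stated assumptions, not the GENEBPMv2 code. Model equation (4) is not reproduced in this section, so the parameterization below (log-odds of μ + β_A·Z, plus β_D for heterozygotes) is assumed. So are the independence of carrier status across the two haplotypes of an individual and the names `phi` and `configurations`.

```python
import math


def causal_genotype_probs(phi_a, phi_b):
    """P(Z = 0, 1, 2) for a haplotype pair whose clusters carry the causal
    variant with probabilities phi_a and phi_b (assumed independent)."""
    return [(1 - phi_a) * (1 - phi_b),
            phi_a * (1 - phi_b) + phi_b * (1 - phi_a),
            phi_a * phi_b]


def disease_prob_given_z(z, mu, beta_a, beta_d):
    """Assumed parameterization: logit = mu + beta_a * z + beta_d * [z == 1]."""
    logit = mu + beta_a * z + (beta_d if z == 1 else 0.0)
    return 1.0 / (1.0 + math.exp(-logit))


def disease_prob(configurations, phi, mu, beta_a, beta_d):
    """Average over haplotype configurations, weighted by phase-assignment probability.

    `configurations` is a list of (cluster_a, cluster_b, weight) tuples whose
    weights sum to 1; `phi` maps cluster index to causal-variant probability.
    """
    total = 0.0
    for cluster_a, cluster_b, weight in configurations:
        z_probs = causal_genotype_probs(phi[cluster_a], phi[cluster_b])
        total += weight * sum(p * disease_prob_given_z(z, mu, beta_a, beta_d)
                              for z, p in enumerate(z_probs))
    return total
```

Under this assumed parameterization, setting `beta_d` to 0 gives multiplicative odds, so dominance costs one additional parameter regardless of the number of haplotypes.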

The results of this simulation study suggest that there is minimal cost associated with modeling the dominance effects of causative variants at the functional polymorphism(s) within the Bayesian MCMC framework presented here. In fact, there is an increase in the Bayes factor in candidate regions of up to 100 kb when the causative variants are common. Furthermore, I demonstrate increased power of the GENEBPMv2 algorithm over a standard likelihood-ratio test of association of disease with marker SNPs in the candidate region, using the haplotype-based methodology developed by Zaykin et al.^{9} that does not allow for clustering or dominance. These results clearly demonstrate that, with an appropriate analysis, such as GENEBPMv2, dominance effects can and should be included in haplotype-based association studies to increase power without substantial penalty for reduced parsimony.

Analysis of marker-SNP haplotypes is appropriate within candidate genes or small candidate regions subject to limited ancestral recombination. It is possible that even these small regions will be spanned by a number of blocks of SNPs in strong LD, interrupted by hotspots of recombination. Nevertheless, the results of this simulation study suggest that the GENEBPMv2 algorithm performs well in candidate regions of up to 100 kb, despite the fact that a block model of LD was not used to generate the data. In fact, the GENEBPMv2 algorithm performed best in candidate regions with increased haplotype diversity, particularly in comparison with existing haplotype-based methods that do not allow for dominance or clustering. Further investigation is required to assess the detrimental effects of recombination in genetic regions of >100 kb. Of course, haplotype-based analysis across several megabases or a complete chromosome in a genome scan would be inappropriate because of the effects of recombination on the clustering process and the expected inaccuracies in phase assignment. Exceptions to this rule might include (i) tagging SNPs selected within LD blocks, with each block analyzed independently, and (ii) aggressively selected tagging SNPs, which are often chosen to be tested in specific combinations, as haplotypes, in the subsequent association study.^{41}^{,}^{42} Furthermore, haplotype-based analyses may provide additional information with high-density genotyping in a follow-up study of associated regions from an initial genome scan. The pattern of haplotype clustering may help to refine the likely location of the underlying functional polymorphism(s) and may identify cases with high probability of carrying causal variants to be sequenced for novel mutations.

## Acknowledgments

A.P.M. acknowledges financial support from the Leverhulme Trust and the Wellcome Trust. A.P.M. thanks Louise Hosking and Chun-Fang Xu, from GlaxoSmithKline, for providing the *CYP2D6* data. A.P.M. also thanks Prof. David Balding, from Imperial College, and an anonymous reviewer, for their helpful comments in preparing the revised version of this article.

## Appendix A: Glossary of Notation

- *y*_{i}: Phenotype of individual *i*, where 0 indicates unaffected status and 1 indicates affected status
- *G*_{in}: Genotype of individual *i* at marker SNP *n*
- *x*_{il}: Response of individual *i* for the *l*th covariate
- *H*_{j}: The *j*th most frequent marker-SNP haplotype consistent with genotype data
- *h*_{j}: Estimated relative frequency of haplotype *H*_{j}
- β_{A}: Additive effect of causal variants at the functional polymorphism(s)
- β_{D}: Dominance effect of causal variant at the functional polymorphism(s)
- *Z*_{i}: Genotype of individual *i* at the functional polymorphism(s)
- *K*: Number of clusters of marker-SNP haplotypes
- *C*_{k}: Marker-SNP haplotype center of cluster *k*
- [symbol missing in source]_{k}: Probability that the haplotype in the *k*th cluster carries causal variant at the functional polymorphism(s) in partition *C*
- *T*_{C}(*j*): Cluster assignment of haplotype *H*_{j} in partition *C*
- γ_{l}: Logistic-regression coefficient for the *l*th covariate
- μ: Baseline log-odds of disease
- θ: Model parameters
- [symbol missing in source]: Observed data and estimated haplotype frequencies
- *M*: Model of association between disease and polymorphisms in the candidate gene, where *M*_{0} indicates the model with no association and *M*_{1} indicates the model with association
- Λ: Bayes factor in favor of disease association with polymorphisms in the candidate gene

## Appendix B: Details of the MCMC Algorithm

I have developed a reversible-jump Metropolis-Hastings MCMC algorithm to approximate the posterior-density function, , for model (eq. [4]), where and, for observed data, . For each iteration of the algorithm, a new set of parameter values, θ^{′}, is proposed according to predetermined weights, **w**, chosen to optimize mixing and convergence (table A1). The proposed parameter values are substituted for the current set, provided that

where ε is a standard uniform random variable and Δ denotes the Hastings ratio of proposal probabilities,

Otherwise, the current set of parameter values is retained. The possible changes to the parameter set are summarized below, where ε is a standard uniform random variable.
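
The acceptance step described above follows the usual Metropolis-Hastings pattern, which can be sketched generically as below. This is the standard textbook rule, not a transcription of the expression in this appendix: the function name and the use of log posterior densities are assumptions of the sketch.

```python
import math
import random


def accept_proposal(log_post_current, log_post_proposed, hastings_ratio, rng=random):
    """Generic Metropolis-Hastings test: accept when eps < posterior ratio * Delta."""
    if hastings_ratio <= 0:
        return False
    # Work on the log scale to avoid overflow in the posterior ratio.
    log_ratio = log_post_proposed - log_post_current + math.log(hastings_ratio)
    eps = rng.random()  # standard uniform random variable
    return log_ratio >= 0 or eps < math.exp(log_ratio)
```

When the function returns `False`, the current set of parameter values is retained for that iteration.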

### Table A1.

| Change (j) | Proposal | Parameters | Relative weight, K=1 | Relative weight, K=2 | Relative weight, 2<K<n | Relative weight, K=n |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Cluster birth | K, C, | 0 | .25 | .25 | 0 |
| 2 | Cluster death | K, C, | 0 | 0 | .25 | .25 |
| 3 | Cluster-center swap | C | 0 | .10 | .10 | 0 |
| 4 | Cluster center | C | 0 | .10 | .10 | 0 |
| 5 | Causal-variant probability | | 0 | .10 | .10 | .10 |
| 6 | Baseline log-odds of disease | μ | .05 | .05 | .05 | .05 |
| 7 | Causal-variant additive effect | β_{A} | 0 | .05 | .05 | .05 |
| 8 | Causal-variant dominance effect | β_{D} | 0 | .05 | .05 | .05 |
| 9 | Covariate-regression coefficient | γ | .05 | .05 | .05 | .05 |
| Total weight | | | .10 | .75 | 1.00 | .55 |

Note.— Relative weights for changes under the null model, *M*_{0}, are given by *K*=1. Relative weights for changes under the alternative model, *M*_{1}, are given by *K*>1.
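
Selection of a change according to the weights in table A1 can be sketched as follows. The sketch normalizes the weights within each column, which is an assumption: the text does not state how columns whose total is below 1 are handled. Taking *n* to be the maximum number of clusters is also an assumption, and the names are illustrative.

```python
import random

# Relative weights from table A1, listed in order of change number 1-9.
WEIGHTS = {
    "K=1":   [0, 0, 0, 0, 0, .05, 0, 0, .05],
    "K=2":   [.25, 0, .10, .10, .10, .05, .05, .05, .05],
    "2<K<n": [.25, .25, .10, .10, .10, .05, .05, .05, .05],
    "K=n":   [0, .25, 0, 0, .10, .05, .05, .05, .05],
}


def weight_column(k, n):
    """Pick the column of table A1 that applies to the current number of clusters."""
    if k == 1:
        return "K=1"
    if k == 2:
        return "K=2"
    if k < n:
        return "2<K<n"
    return "K=n"


def choose_change(k, n, rng=random):
    """Draw a change number (1-9) with probability proportional to its weight."""
    weights = WEIGHTS[weight_column(k, n)]
    return rng.choices(range(1, 10), weights=weights, k=1)[0]
```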

#### Change 1: Propose a Cluster Birth

The proposed number of clusters is given by *K*^{′}=*K*+1. Select a position, *k*^{*}, at random for the new cluster in the list of ordered cluster centers. Select at random from *H* a haplotype, *H*_{j}, that is not already a cluster center, so that *C*^{′}_{k*}=*H*_{j}. Generate a new probability that haplotypes in the new cluster carry the causal variant at the functional polymorphism(s), ^{′}_{k*}, from a uniform distribution. Then,

and

To ensure reversibility, .

#### Change 2: Propose a Cluster Death

The proposed number of clusters is given by *K*^{′}=*K*-1. Select a cluster, *k*^{*}, at random for death. The proposed cluster centers and probabilities of carrying the causal variant at the functional polymorphism(s) are then given by

and

To ensure reversibility, .
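
The list manipulations behind changes 1 and 2 can be sketched as follows. The sketch covers only the proposal of new cluster centers and probabilities; the Hastings ratios and reversibility conditions are not shown, because they cannot be recovered from this section. The function names are illustrative, and the uniform distribution is assumed to be on the unit interval.

```python
import random


def propose_birth(centers, probs, haplotypes, rng=random):
    """Insert a new cluster center at a random position in the ordered list."""
    candidates = [h for h in haplotypes if h not in centers]
    if not candidates:
        return None  # every haplotype is already a cluster center
    position = rng.randrange(len(centers) + 1)
    new_center = rng.choice(candidates)
    new_prob = rng.random()  # assumed uniform on the unit interval
    new_centers = centers[:position] + [new_center] + centers[position:]
    new_probs = probs[:position] + [new_prob] + probs[position:]
    return new_centers, new_probs


def propose_death(centers, probs, rng=random):
    """Delete a randomly selected cluster and its causal-variant probability."""
    position = rng.randrange(len(centers))
    return (centers[:position] + centers[position + 1:],
            probs[:position] + probs[position + 1:])
```

Both functions return new lists and leave the current state untouched, so a rejected proposal needs no rollback.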

#### Change 3: Propose a Cluster-Center Swap

The following proposal procedure is performed *K* times. Select a pair of clusters, *k*_{1} and *k*_{2}, at random. The proposed cluster-center swap is given by

and

Δ=1.

#### Change 4: Propose a Cluster-Center Change

The following proposal procedure is performed *K* times. Select a cluster, *k,* at random. Select at random from *H* a haplotype, *H*_{j}, that is not already a cluster center, so that *C*^{′}_{k}=*H*_{j}. Δ=1.

#### Change 5: Propose a New Cluster Causal-Variant Probability

The following proposal procedure is performed *K* times. Select a cluster, *k,* at random. The proposed probability that haplotypes in cluster *k* carry the causal variant at the functional polymorphism(s) is given by . Δ=1 but, to ensure reversibility,

#### Change 6: Propose a New Baseline Log-Odds of Disease

The proposed parameter is given by μ^{′}=μ+ν_{M}(ε-0.5), where *ν*_{M} denotes the maximum change in the parameter value. Δ=1.
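
The uniform random-walk proposal for the baseline log-odds can be sketched directly from the expression above. The function name is illustrative.

```python
import random


def propose_uniform_step(value, max_change, rng=random):
    """Return value + max_change * (eps - 0.5), with eps standard uniform."""
    eps = rng.random()
    return value + max_change * (eps - 0.5)
```

As written, the proposed value lies within half of `max_change` on either side of the current value, and the proposal is symmetric, which is consistent with Δ=1.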

#### Change 7: Propose a New Additive Effect of the Causal Variant

The proposed parameter is given by , where ν_{A} denotes the maximum change in the parameter value. Δ=1.

#### Change 8: Propose a New Dominance Effect of the Causal Variant

The proposed parameter is given by , where *ν*_{D} denotes the maximum change in the parameter value. Δ=1.

#### Change 9: Propose a New Covariate-Regression Coefficient

The following proposal procedure is performed *L* times. Select a covariate, *l,* at random. The proposed regression coefficient for the selected covariate is given by , where ν_{C} denotes the maximum change in the parameter value. Δ=1.

## References

*SCN1A*: implications for linkage-disequilibrium gene mapping. Am J Hum Genet 73:551–565

**American Society of Human Genetics**
