
# Bayesian Fine-Scale Mapping of Disease Loci, by Hidden Markov Models

## Abstract

We present a new multilocus method for the fine-scale mapping of genes contributing to human diseases. The method is designed for use with multiple biallelic markers—in particular, single-nucleotide polymorphisms for which high-density genetic maps will soon be available. We model disease-marker association in a candidate region via a hidden Markov process and allow for correlation between linked marker loci. Using Markov-chain–Monte Carlo simulation methods, we obtain posterior distributions of model parameter estimates including disease-gene location and the age of the disease-predisposing mutation. In addition, we allow for heterogeneity in recombination rates, across the candidate region, to account for recombination hot and cold spots. We also obtain, for the ancestral marker haplotype, a posterior distribution that is unique to our method and that, unlike maximum-likelihood estimation, can properly account for uncertainty. We apply the method to data for cystic fibrosis and Huntington disease, for which mutations in disease genes have already been identified. The new method performs well compared with existing multi-locus mapping methods.

## Introduction

The problem of localizing genes contributing to human diseases has been at the forefront of research in genetic epidemiology for many years. Linkage-based analyses, often performed in candidate regions of the genome, have had success in locating, to within 1 cM, genes contributing major effects to human disease. However, for genes contributing less significant effects to polygenic disorders, linkage methods have been shown to be less powerful than population-based disease-marker–association studies (Risch and Merikangas 1996).

The key to population-based disease-gene mapping is the relationship between physical distance and the strength of disease-marker association. A higher level of association with the disease at marker *A* than at marker *B* suggests that, in previous generations, less recombination has occurred between the disease gene and marker *A* and thus that this marker is the closer of the two to the disease gene. The simplest approach toward identification of a likely location for a disease gene on a map of candidate marker loci is a *single-locus* approach. On the map, the marker with greatest evidence of association with the disease is taken as being most tightly linked to the predisposing gene.

Greater power and accuracy to locate a disease gene would be expected by taking account of information from all markers, simultaneously, in the region of the disease gene, in so-called *multilocus* models. A number of these multilocus methods have been proposed recently (Terwilliger 1995; Xiong and Guo 1997; Collins and Morton 1998) and have had some success in locating the known mutations for cystic fibrosis (CF), Huntington disease (HD), Friedreich ataxia, and progressive myoclonus epilepsy. These methods rely on the assumption of independent marker loci in the region of the disease gene. Under this assumption, log likelihoods are calculated for each marker in turn and summed to form a *composite* log likelihood for the set of loci. This assumption is, of course, incorrect, since we would expect correlation between linked markers. Composite likelihoods are thus only an approximation for the full likelihood obtained by use of complete haplotypes.

In the present report, we present a new multilocus method for the mapping of disease genes, one that takes account of correlation between linked marker loci. The method is designed specifically for use with biallelic markers such as single-nucleotide polymorphisms (SNPs). Current research is likely to provide a highly dense map of SNPs in the near future, in which they are perhaps as frequent as one marker per kilobase of the human genome (Kruglyak 1999).

Consider a disease that, as a result of a single mutation at the disease locus a number of generations ago, was introduced into a population. All affected individuals today will be descended from this founder chromosome. Thus, in a sample of chromosomes ascertained today, the allele that we observe at an SNP linked to the disease gene will depend on whether, at that locus, the chromosome is *identical by descent* (IBD) with the founder chromosome. If, at the marker, the chromosome is IBD with the founder, we observe the ancestral allele. If the chromosome is not IBD with the founder, we may observe either the ancestral or the nonancestral allele, the probability of which will depend on the relative population frequencies of the two alleles. The probability that a chromosome is IBD with the founder will be greater for a marker in the proximity of the disease gene than for more-distant markers, since there will have been less opportunity for recombination. Thus, stronger disease-marker association will be expected at markers adjacent to the disease gene.

For a given location on a chromosome, IBD status itself is not directly observable and can be thought of as a *hidden state.* The probability of changing from one hidden state to another at adjacent markers depends only on previous recombination events between them and thus can be modeled as a function of the physical intermarker distance. For a fine-scale map of markers, this probability will be small and, if it is assumed that there is no interference, will be independent of similar probabilities defined in any other interval between adjacent markers. Under these conditions, we can employ a hidden Markov model (Rabiner 1989) to describe marker haplotype frequencies in the vicinity of the disease gene, accounting for the correlation between linked marker loci.

The model that we present here is similar to that of McPeek and Strahs (1999), who also use a hidden Markov process to account for correlation between linked marker loci. We employ Markov-chain Monte Carlo (MCMC) stochastic simulation methods in a Bayesian framework, which has a number of advantages over the maximum-likelihood approach used by McPeek and Strahs (1999). We are able to properly account for the uncertainty in the ancestral marker haplotype—unlike the method of McPeek and Strahs (1999), which treats it as a nuisance parameter to be estimated. With this approach, we obtain posterior distributions for model parameter estimates, including disease-gene location and the age of the mutation (for a complete list of the model parameters used in the present study, see Appendix A). In addition, the flexibility of this framework allows us to incorporate heterogeneity in recombination rates in the region of the disease gene, to account for crossover hot and cold spots. We apply the method to data for CF (Kerem et al. 1989) and HD (MacDonald et al. 1991), for which mutations in disease genes already have been identified.

## Models and Methods

In this section, we derive a model for disease-marker association in a candidate region, using hidden Markov processes. We begin with the simplest case—of a single founding mutation of a normal allele at the disease locus to a high-risk allele. Any chromosome in the current generation can be divided into regions, each corresponding to one of two possible ancestral states. A region may be IBD with the ancestral founder chromosome and is then labeled “F”; otherwise, the region does not descend from the founder and is labeled “N.” The probability that, at any given locus, a chromosome in the current generation is IBD with the founder is denoted as “α.”

The occurrence of the different ancestral states, F or N, along a chromosome is a result of recombination events in previous generations. Consider two particular loci on a chromosome selected at random from the current generation. Given the chromosome's ancestral state at locus 1, it is straightforward to calculate the probabilities of the two ancestral states at locus 2. Let *S*_{1} and *S*_{2} denote the ancestral states at the two loci, and let “NR” denote the event “no recombination has occurred between the loci”; then, for example,

$$\Pr(S_2=F \mid S_1=F) = \Pr(NR) + [1-\Pr(NR)]\,\Pr(MRR=F), \tag{1}$$

where *MRR*=*F* is used to denote the event “most recent recombination event occurred, at locus 2, with a chromosome IBD with the founder”; similarly,

$$\Pr(S_2=F \mid S_1=N) = [1-\Pr(NR)]\,\Pr(MRR=F), \tag{2}$$

since a recombination event must have occurred between two loci of different ancestral states. We assume that the probability that, at locus 2, a chromosome is IBD with the founder has remained constant over time, *Pr*(*MRR*=*F*)=α.

This principle can be generalized to more than two loci on the chromosome. The probabilities of the two ancestral states at any locus on the chromosome, given the ancestral state at an adjacent locus on the chromosome, depend only on recombination events between the loci and not on recombination elsewhere along the chromosome (under the assumption of no interference). Thus, given the ancestral state at some starting locus of a chromosome, we can calculate joint probabilities of ancestral states at any other loci, using two independent Markov chains, one acting on each side of the starting locus.

Consider a map of SNPs with known location in a candidate region of the chromosome and assume an arbitrary location *x* for the disease locus, 0, on this map. Given this location, the map is effectively divided into two regions with “L” markers present to the left of the disease locus and “R” markers present to the right. The marker loci to the left of the disease locus are denoted “−1, −2, …, −L,” where −1 is adjacent to the disease locus, −2 is adjacent to −1, and so on; similarly, the marker loci to the right of the disease locus are denoted “1, 2, …, R.” The physical distances (in Mb) between the disease locus and marker loci −1 and 1 are denoted “*d*_{−1}” and *d*_{1}, respectively. The distance between any pair of adjacent marker loci to the left of the disease locus, −*i* and -(*i*+1), is denoted “*d*_{-(i+1)}”; similarly, *d*_{(i+1)} denotes the distance between marker loci *i* and (*i*+1) to the right of the disease locus. The choice of location of the disease locus, *x,* thus defines a unique set of interlocus distances.

A chromosome can be considered as two independent paths of ancestral states, conditional on the ancestral state at the disease locus, *S*_{0}. For the marker loci to the left of the disease locus, the path is {*S*_{0},*S*_{-1},*S*_{-2},...,*S*_{-L}}, whereas, for marker loci to the right, the path is {*S*_{0},*S*_{1},*S*_{2},...,*S*_{R}}.

Consider the marker loci to the right of the disease locus. The chromosome's ancestral state at locus *i*+1, *S*_{i+1}, depends only on the chromosome's ancestral state at locus *i,* *S*_{i}, and on the occurrence, in previous generations, of recombination events between the pair of adjacent loci. In the same way as for equations (1) and (2), we define *transition probabilities* τ^{i+1}_{SiSi+1} of ancestral state *S*_{i+1}, at locus *i*+1, given the chromosome's ancestral state, *S*_{i}:

$$\tau^{i+1}_{FF} = e^{-\gamma d_{i+1}} + (1-e^{-\gamma d_{i+1}})\,\alpha, \qquad \tau^{i+1}_{FN} = (1-e^{-\gamma d_{i+1}})(1-\alpha),$$

$$\tau^{i+1}_{NF} = (1-e^{-\gamma d_{i+1}})\,\alpha, \qquad \tau^{i+1}_{NN} = e^{-\gamma d_{i+1}} + (1-e^{-\gamma d_{i+1}})(1-\alpha).$$

Here, $e^{-\gamma d_{i+1}}$ is the probability of no recombination events, in the generations since the founding mutation, in the interval between marker loci *i* and *i*+1. The parameter γ>0 represents the expected frequency, since the founding mutation, of recombination events per 1 Mb of a chromosome in the candidate region. If we assume that the physical distance of 1 Mb corresponds to a genetic distance of 1 cM, then 100γ can be interpreted as the number of generations since the founding mutation.
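The transition probabilities can be illustrated with a short sketch. It assumes that the probability of no recombination in an interval of *d* Mb is exp(−γ*d*) and that, after a recombination, the new segment is IBD with the founder with probability α, as in the two-locus argument above. The function name `transition_matrix`, the state ordering (F, N), and the numerical values are illustrative choices, not part of the original method.

```python
import math

def transition_matrix(alpha, gamma, d):
    """2x2 matrix of Pr(S_{i+1} | S_i), with states ordered (F, N).

    Assumes Pr(no recombination) = exp(-gamma * d) for an interval of d Mb.
    """
    no_rec = math.exp(-gamma * d)
    rec = 1.0 - no_rec
    return [
        [no_rec + rec * alpha, rec * (1.0 - alpha)],  # row: previous state F
        [rec * alpha, no_rec + rec * (1.0 - alpha)],  # row: previous state N
    ]

# Hypothetical values: alpha = 0.02, gamma = 2 (about 200 generations), d = 0.05 Mb.
tau = transition_matrix(alpha=0.02, gamma=2.0, d=0.05)
```

Each row sums to 1, and for coincident loci (*d* = 0) the matrix reduces to the identity, so the ancestral state cannot change without recombination.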

Given values for the transition parameters α and γ, we can calculate the probability, ρ[*S*_{i}|*S*_{0}], that a chromosome is of ancestral state *S*_{i} at marker locus *i,* conditional on the chromosome's ancestral state at the disease locus, *S*_{0}, using the recursive formula

$$\rho[S_i \mid S_0] = \sum_{S_{i-1}\in\{F,N\}} \rho[S_{i-1}\mid S_0]\;\tau^{i}_{S_{i-1}S_i}$$

for all *i*>1, and $\rho[S_1\mid S_0]=\tau^{1}_{S_0S_1}$. Ancestral-state frequencies at loci to the left of the disease locus are calculated in the same way, on the basis of an independent Markov process defined in terms of the same model parameters α and γ.
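The recursion can be sketched as a single pass along the markers on one side of the disease locus. The sketch assumes the same exponential form, exp(−γ*d*), for the probability of no recombination in an interval; the function name `state_probs` and the example distances are hypothetical.

```python
import math

def state_probs(alpha, gamma, distances, s0="F"):
    """Return [Pr(S_i = F | S_0), Pr(S_i = N | S_0)] for markers i = 1..R.

    distances[i-1] is the distance (Mb) between locus i-1 and locus i,
    where locus 0 is the disease locus.
    """
    rho = [1.0, 0.0] if s0 == "F" else [0.0, 1.0]
    out = []
    for d in distances:
        no_rec = math.exp(-gamma * d)
        rec = 1.0 - no_rec
        # One step of the chain: sum over the state at the previous locus.
        rho_f = rho[0] * (no_rec + rec * alpha) + rho[1] * rec * alpha
        rho = [rho_f, 1.0 - rho_f]
        out.append(rho)
    return out

probs = state_probs(alpha=0.02, gamma=2.0, distances=[0.05, 0.10, 0.20])
```

Under these assumptions the recursion has a closed form, Pr(*S*_{i}=*F*|*S*_{0}=*F*) = α + (1−α)exp(−γ*D*), where *D* is the total distance from the disease locus, so the probability of sharing the founder segment decays toward α with distance.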

The model described thus far can be used to calculate the probability that, at any marker locus in the candidate region, a chromosome is IBD with the founder, given the chromosome's ancestral state at the disease locus, *S*_{0}. Of course, ancestral states are hidden and cannot be observed. At each SNP in the candidate region, one of two possible alleles, denoted as “*M*_{i1}” and “*M*_{i2},” can occur at marker locus *i.* The marker allele present is dependent only on the chromosome's ancestral state at marker locus *i* and not on that elsewhere in the region. If, at marker locus *i,* the chromosome is IBD with the founder chromosome, then the allele present will be the same allele that is present on the founder chromosome, if it is assumed that no mutations have occurred at the marker locus; if the chromosome is not IBD with the founder, then either allele may be present, with probability *p*_{i} denoting the frequency of allele *M*_{i1} on such chromosomes. Thus, given that a chromosome is of ancestral state *S*_{0}, the expected frequency of allele *M*_{i1} is given by

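$$m^{S_0}_{i1} = \rho[S_i=F \mid S_0]\,\omega_i + \rho[S_i=N \mid S_0]\,p_i. \tag{6}$$

The parameter ω_{i} is an indicator variable taking the value 1 if allele *M*_{i1} is present on the founder chromosome and taking the value 0 otherwise. Clearly,

$$m^{S_0}_{i2} = 1 - m^{S_0}_{i1}. \tag{7}$$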

We have assumed here that, conditional on a set of adjacent marker loci being in state N, the probability of the observed haplotype is simply the product of the allele frequencies *p*_{i} or 1-*p*_{i}*,* at marker *i,* from which it is constructed. McPeek and Strahs (1999) have suggested the use of a *k*th-order (in practice, *k*=1) Markov-chain model for haplotype frequencies across loci in state N. Such an approach could be easily incorporated into the model presented here.
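The expected allele frequency at a single marker can be written as a one-line mixture over the two ancestral states. The sketch below assumes no marker mutation and independence of alleles across loci in state N, as stated above; the function name and the numerical values are hypothetical.

```python
def expected_allele_freq(rho_f, omega, p):
    """Expected frequency of allele M_i1 given Pr(S_i = F | S_0) = rho_f.

    omega: 1 if M_i1 is the founder (ancestral) allele, else 0.
    p: frequency of M_i1 on chromosomes not IBD with the founder.
    """
    return rho_f * omega + (1.0 - rho_f) * p

# Hypothetical marker: 80% chance of being IBD with the founder,
# M_i1 ancestral, background frequency 0.3.
m1 = expected_allele_freq(rho_f=0.8, omega=1, p=0.3)
m2 = 1.0 - m1  # frequency of the other allele, M_i2
```

When `rho_f` is 0 the expected frequency is simply the background frequency `p`, which is why markers far from the disease locus carry little information about the founder haplotype.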

Consider a sample of *n*_{A} chromosomes obtained from affected cases and *n*_{U} chromosomes obtained from unaffected controls. We do not assume here that we can identify homologous pairs of chromosomes occurring together in the same individual in the sample. If this information is known, it can be easily incorporated in the analysis.

We cannot directly identify the chromosome's ancestral state at the disease locus. Instead, we observe the disease phenotype, φ, of the individual from whom it is obtained, assumed here to be either affected (φ=*A*) or unaffected (φ=*U*). The disease phenotype of an individual depends on the ancestral state at the disease locus on each of their pair of homologous chromosomes. Since we do not assume that we can identify homologous pairs of chromosomes occurring together in the same individual in the sample, we average over the possible ancestral states, *S*^{′}_{0}, at the disease locus for the second chromosome, weighting by their relative frequencies:

$$\Pr(\phi \mid S_0) = \sum_{S'_0\in\{F,N\}} \Pr(\phi \mid S_0S'_0)\,\Pr(S'_0) = \alpha\,\Pr(\phi \mid S_0F) + (1-\alpha)\,\Pr(\phi \mid S_0N). \tag{8}$$

We assume a multiplicative model for the disease, with parameters β_{F} and β_{N} for the ancestral states *S*_{0}=*F* and *S*_{0}=*N*, respectively, at the disease locus. Thus, the penetrance of genotype *S*_{0}*S*^{′}_{0} is given by *Pr*(φ=*A*|*S*_{0}*S*^{′}_{0})=β_{S0}β_{S′0}. Hence,

$$\Pr(A \mid S_0) = \beta_{S_0}\,[\alpha\beta_F + (1-\alpha)\beta_N], \qquad \Pr(U \mid S_0) = 1 - \Pr(A \mid S_0). \tag{9}$$

Under this model, we can calculate expected SNP frequencies in affected and unaffected individuals in the population. As an example, consider marker locus *i.* The probability that a chromosome is obtained from an individual of disease phenotype φ and bears allele *M*_{ij} at marker locus *i* is written *Pr*(φ,*M*_{ij}). Then,

$$\Pr(\phi, M_{ij}) = \sum_{S_0\in\{F,N\}} \Pr(S_0)\,\Pr(\phi \mid S_0)\,m^{S_0}_{ij}, \tag{10}$$

since disease status and SNP type are independent, conditional on the chromosome's ancestral state at the disease locus. Thus, when we substitute for the appropriate *Pr*(φ|*S*_{0}) from equation (9) and for *m*^{S0}_{ij} from equations (6) and (7),

$$\Pr(A, M_{ij}) = [\alpha\beta_F + (1-\alpha)\beta_N]\,[\alpha\beta_F\,m^{F}_{ij} + (1-\alpha)\beta_N\,m^{N}_{ij}],$$

$$\Pr(U, M_{ij}) = \alpha\,m^{F}_{ij} + (1-\alpha)\,m^{N}_{ij} - \Pr(A, M_{ij}).$$

In a case-control study, affected individuals are ascertained with greater probability than their population frequency would imply, so that a sample will be enriched with case chromosomes. We denote by *n*^{φ}_{ij} the observed frequencies of allele *M*_{ij} in the sample of chromosomes obtained from individuals of disease phenotype φ. Table 1 presents the expected case-control frequencies of SNP alleles at marker locus *i.* The parameter *Q* is the population frequency of the disease, which is assumed to be known, and κ is a sample-enrichment factor: κ=[(1-*Q*)*n*_{A}]/(*Qn*_{U}). The expected frequencies are scaled by the parameter *T*=1+*Q*(κ-1) to sum to 1.

Table 1. Expected case-control frequencies of SNP alleles *M*_{i1} and *M*_{i2} at marker locus *i,* for known disease frequency *Q,* sample-enrichment factor κ, and *T* = 1 + *Q*(κ−1)
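The scaling by κ and *T* can be checked numerically. The sketch below is an assumption-laden illustration: it takes the multiplicative penetrance model with a single founder mutation (no phenocopies), multiplies the population probabilities for case chromosomes by κ, and divides all entries by *T*. This is one reading of how the expected frequencies are made to sum to 1, and the function name and input values are hypothetical.

```python
def case_control_expected(alpha, beta_f, beta_n, m_f, m_n, n_a, n_u):
    """Expected sample frequencies of alleles M_i1, M_i2 in cases and controls.

    m_f, m_n: expected frequency of M_i1 given S_0 = F or S_0 = N.
    Returns a dict keyed by (phenotype, allele index).
    """
    mean_pen = alpha * beta_f + (1.0 - alpha) * beta_n
    q = mean_pen ** 2                    # population disease frequency
    kappa = (1.0 - q) * n_a / (q * n_u)  # sample-enrichment factor
    t = 1.0 + q * (kappa - 1.0)
    freqs = {}
    for allele, (mf, mn) in {1: (m_f, m_n), 2: (1 - m_f, 1 - m_n)}.items():
        # Population probability of (affected, allele) and of the allele overall.
        aff = mean_pen * (alpha * beta_f * mf + (1 - alpha) * beta_n * mn)
        tot = alpha * mf + (1 - alpha) * mn
        freqs[("A", allele)] = kappa * aff / t
        freqs[("U", allele)] = (tot - aff) / t
    return freqs

freqs = case_control_expected(alpha=0.02, beta_f=1.0, beta_n=0.1,
                              m_f=0.9, m_n=0.3, n_a=100, n_u=200)
```

With this scaling the four entries sum to 1 and the case entries together equal *n*_{A}/(*n*_{A}+*n*_{U}), the proportion of case chromosomes in the sample.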

### Allowing for Mutations at Marker Loci

In deriving the model thus far, we have assumed that no mutations at marker loci have occurred since the founding disease mutation on the ancestral chromosome. The method is designed for use with SNPs, which are thought to have low mutation rates in humans, ~10^{−8}–10^{−9}/locus/generation (Nielsen 2000). For recent disease mutations, the effects of such a low rate of mutation will be negligible. Nevertheless, we may wish for the model to account for marker mutation.

Under the assumption of no marker mutation, a chromosome that is IBD with the founder at marker *i* carries only the ancestral allele at that locus. However, if we allow for marker mutation, we may observe the nonancestral allele at a locus in state F. In terms of the indicator parameter for marker *i,*

where *m* is the mutation rate per locus per generation and 100γ is the number of generations since the founding mutation.

### Allowing for Phenocopies

The model described has assumed, thus far, that all mutant chromosomes have descended from a single ancestral founder. This assumption is unlikely to be realistic for most human diseases (Pennisi 1998). Phenocopies may occur either as a result of multiple mutations in the same gene or, especially for complex diseases, as a result of the effects of multiple susceptibility loci and the environment. In this section, we develop the model to allow for phenocopies, under the assumption that there is a single major mutation that accounts for a substantial proportion of affected individuals in the current generation. This is true, for example, of CF, for which the major ΔF508 mutation in the CFTR gene accounts for almost 70% of all chromosomes in affected individuals, with many other mutations in the same gene accounting for the remaining 30% (Kerem et al. 1989). Previous approaches, with the exception of that of McPeek and Strahs (1999), fail to explicitly allow for this in their association models.

Assume that the major mutation (*F*) accounts for a proportion λ of all mutant chromosomes in the current generation and that the major mutation and all other mutations have the same penetrance, β_{F}. If equation (8) is generalized to allow for three possible ancestral states,

Hence, as defined in equation (9),

Then, in the same way as for equation (10),

where the appropriate phenotype probability, conditional on *S*_{0}, is obtained from equation (11). However, since the phenocopies may be spurious or will have descended from many different ancestral founding chromosomes, we assume that, in terms of the occurrence of marker alleles, they are indistinguishable from any chromosome not bearing the major mutation at the disease locus; in other words, their expected marker-allele frequencies are those of state N, as defined in equations (6) and (7).

### Likelihood Calculations

Expected SNP allele frequencies to the left and right of the disease locus are determined by independent Markov processes. Thus, over the whole candidate region, the log-likelihood of a sample of data for a fixed location of the disease gene, *x,* and a given set of hidden Markov–model parameters is given by

where Γ is a vector of model parameters, Γ=(α,β_{F},β_{N},γ,λ)^{T}, and *p* and *w* are vectors of allele frequencies and ancestral indicators, respectively:

and

The log-likelihood to the right of the disease locus is given by

where *C*_{R} is constant for a known population disease frequency *Q:* . The independent log-likelihood to the left of the disease locus is calculated similarly.

### Parameter Estimation

The hidden Markov model described here is overparameterized. We reduce the number of free parameters by noticing the following relationships.

First, the population frequency of the disease is given by

The frequency of the disease is generally known so that we can eliminate α from the likelihood calculation:

since α>0.

Second, the likelihood is constant for a fixed ratio of disease-model parameters β=β_{N}/β_{F}. Thus, the two parameters can be eliminated from the likelihood and can be replaced by a single penetrance parameter for which β<1, since it is assumed that, at the disease locus, the mutation has greater propensity for the development of the disease than does the normal allele. Overall, for a known disease frequency *Q,* the vector of model parameters to be estimated reduces to Γ=(β,γ,λ)^{T}, together with the allele frequencies and ancestral indicators.
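The first relationship can be illustrated numerically. The sketch below assumes the simplest setting, a single founder mutation with no phenocopies under the multiplicative model, so that *Q* = [αβ_{F} + (1−α)β_{N}]^{2}, and solves for α by taking the positive square root. The function name is hypothetical, and the formula used when phenocopies are allowed may differ.

```python
import math

def founder_freq(q, beta_f, beta_n):
    """Solve Q = (alpha*beta_f + (1 - alpha)*beta_n)**2 for alpha.

    Takes the positive root of Q, which is the root consistent with alpha > 0
    when beta_f > beta_n.
    """
    return (math.sqrt(q) - beta_n) / (beta_f - beta_n)

# Fully penetrant recessive disease with frequency .0005 (the CF setting).
alpha_cf = founder_freq(q=0.0005, beta_f=1.0, beta_n=0.0)  # about 0.0224
```

For a fully penetrant recessive disease this reduces to α = √*Q*, the familiar relationship between disease frequency and allele frequency.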

We use MCMC methods to obtain posterior distributions for the model parameter estimates. The advantage of this approach is that we can incorporate prior information for the model parameters—which may be useful if we have reliable values for the age of the mutation or disease model, for example. We employ a Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) to obtain realizations of each model parameter by sampling from the full conditional distribution, using a rejection-sampling scheme. Each iteration of the sampling scheme consists of a six-step procedure summarized in Appendix B. From initial parameter values, the algorithm is run for a substantial burn-in period, to allow convergence. During the subsequent sampling period, realizations of the parameter set are recorded every 100th iteration. Over many iterations, posterior distributions of parameter estimates are obtained from these realizations.
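The general shape of such a sampler, with a burn-in period followed by recording every 100th iteration, can be sketched as follows. This is a generic random-walk Metropolis sampler for a single parameter with a toy target distribution. It is not the six-step scheme of Appendix B, and the function name, proposal step size, and run lengths are all illustrative assumptions.

```python
import math
import random

def metropolis(log_post, x0, step, n_burn, n_sample, thin, seed=0):
    """Random-walk Metropolis sampler for one real-valued parameter."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    draws = []
    for it in range(n_burn + n_sample):
        cand = x + rng.gauss(0.0, step)
        lp_cand = log_post(cand)
        # Accept the candidate with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            x, lp = cand, lp_cand
        # After burn-in, record every `thin`-th iteration.
        if it >= n_burn and (it - n_burn + 1) % thin == 0:
            draws.append(x)
    return draws

# Toy target: standard normal log-density, up to an additive constant.
draws = metropolis(lambda v: -0.5 * v * v, x0=3.0, step=1.0,
                   n_burn=1000, n_sample=20000, thin=100)
```

Thinning reduces the autocorrelation between recorded realizations, so that summary statistics computed from `draws` behave more like those of an independent sample from the posterior.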

### Allowing for Heterogeneity in Recombination Rates

In developing the hidden Markov model for fine-scale mapping, we have assumed a constant ratio of recombination fraction to physical distance across the whole of the candidate region. However, it is thought that recombination hot spots and cold spots occur along the genome. Lazzeroni (1998) accounts for heterogeneity in recombination rates in a generalized least-squares approach to fine-scale mapping, by allowing the ratio of physical distance to genetic distance to be different to the left and to the right of the disease locus. However, this may not be sufficient to allow for the variability in recombination rates, particularly in larger candidate regions.

As an alternative, we propose that the ratio of physical distance to genetic distance across the candidate region can be described by a first-order Gaussian autoregressive process. We divide the candidate region into *K* equal intervals so that the rate of recombination, *y*_{r}, in the *r*th interval is given by *y*_{1}=μ+ε^{*}_{1} and *y*_{r}=μ+ψ(*y*_{r-1}-μ)+ε_{r}, where μ is the mean recombination rate across the region and ψ is the first-order correlation coefficient. The errors are assumed to be independently distributed, so that ε^{*}_{1}~*N*(0,σ^{2}/(1-ψ)^{2}) and ε_{r}~*N*(0,σ^{2}) for *r*=2,3,...,*K*. The log-likelihood of a sample of *K* recombination rates from this process is then given by

We assume that μ is known from existing physical and genetic maps. Uncertainty for this parameter can be incorporated by assuming a tight prior distribution for μ, centered about the estimated ratio. For example, in the region of the CFTR gene for CF, a physical distance of 1.6 Mb corresponds to a genetic distance of ~0.8 cM (Collins et al. 1996)—in other words, μ=.5.

The recombination rates across the region can then be used in calculating the probability of no recombination events in the interval between any pair of adjacent marker loci; for example, the probability of no recombination events in the interval between marker *i* and *i*+1 is given by , where θ_{i+1}=Σ^{K}_{r=1}*y*_{r}π_{r}/Σ^{K}_{r=1}π_{r} and π_{r} denotes the proportion of the *r*th recombination-rate interval contained in the interval between the two marker loci. The log-likelihood of the sample of data for a given set of recombination rates *y* and hidden Markov–model parameters is expressed by (*data*|*x*,Γ,*p*,*w*,*y*)_{TOT}. The recombination rates are, in effect, nuisance parameters, so that
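The weighted mean recombination rate θ for a marker interval can be computed directly from its definition. The sketch below assumes that the *K* rate intervals have equal width and together span the candidate region, and that the marker interval [`a`, `b`] has positive length and lies within the region. The function name and example values are hypothetical.

```python
def interval_rate(y, region_start, region_end, a, b):
    """Weighted mean recombination rate over the marker interval [a, b].

    y: rates for K equal-width intervals spanning [region_start, region_end].
    The weight pi_r is the proportion of the r-th rate interval inside [a, b].
    """
    width = (region_end - region_start) / len(y)
    num = den = 0.0
    for r, y_r in enumerate(y):
        lo = region_start + r * width
        hi = lo + width
        overlap = max(0.0, min(b, hi) - max(a, lo))
        pi_r = overlap / width
        num += y_r * pi_r
        den += pi_r
    return num / den

# Three 1-Mb rate intervals; the marker interval covers half of the first two.
theta = interval_rate([0.2, 0.6, 1.0], 0.0, 3.0, a=0.5, b=1.5)
```

Here the first two rate intervals each contribute a weight of 0.5 and the third contributes nothing, so θ is the simple average of 0.2 and 0.6.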

In this way, we can then incorporate heterogeneous recombination rates into the Metropolis-Hastings rejection-sampling scheme for the model parameters as described in Appendix B.

### Allowing for Nonindependent Recombinational Histories

In developing the hidden Markov model for disease-marker association in the region of a disease gene, we have assumed independent recombinational histories for each chromosome in the sample. However, the key to this approach to disease-gene mapping is that all, or at least a majority of, affected individuals share a recent single common ancestor bearing the disease-predisposing mutation. Treating the recombinational histories as independent is equivalent to assuming a star-shaped genealogy, which is not consistent with likely demographic scenarios for the development of a disease mutation in a finite population. Instead, we expect particular pairs of chromosomes to have a more recent common ancestor than do other pairs of chromosomes and, consequently, to share a greater proportion of their recombinational history. The effect of this shared ancestry is to down-weight the contribution of each case chromosome to the total log-likelihood by a factor [1+(*n*_{A}-1)*c*]^{-1}, where *c* is given by

Since the correction factor is <1, we effectively down-weight the contribution of each case chromosome to the total log-likelihood, to account for the dependence between them. We emphasize here that a quasi-likelihood approach is not applicable in a Bayesian framework; but it does suggest the use of a likelihood approximation. We propose to multiply the log-likelihood calculated under a star-shaped genealogy by the same correction factor. This has the effect of increasing the variance of the posterior distribution, to account for the shared ancestry of the case chromosomes.
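The size of the correction can be illustrated with a small calculation. The value of *c* used below is a made-up number chosen only to show the arithmetic, since *c* depends on the assumed genealogy; the function names are hypothetical.

```python
def correction_factor(n_a, c):
    """Down-weighting factor [1 + (n_A - 1) c]^(-1) for n_A case chromosomes."""
    return 1.0 / (1.0 + (n_a - 1) * c)

def corrected_loglik(loglik_star, n_a, c):
    """Scale a log-likelihood computed under a star-shaped genealogy."""
    return correction_factor(n_a, c) * loglik_star

# Hypothetical: 100 case chromosomes with c = 0.01 gives a factor of about 0.5.
factor = correction_factor(n_a=100, c=0.01)
```

Even a small value of *c* can roughly halve the log-likelihood contribution when there are many case chromosomes, which flattens the likelihood and widens the posterior credibility intervals.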

## Examples

To illustrate our proposed method, we consider two diseases: CF and HD. Mutations responsible for the occurrence of these two diseases have been located on the genome and are thus ideal for testing the accuracy and precision of the new method. In this section, we apply the proposed method to marker-haplotype data collected in candidate regions for the two disease genes (Kerem et al. 1989; MacDonald et al. 1991). In both samples, cases and controls have been typed by RFLPs. Since these markers have low rates of mutation, we have assumed that *m*=0, corresponding to no marker mutation in the period since the founding disease mutation.

### CF

CF is one of the most common autosomal recessive disorders affecting whites, occurring with an incidence of 1 case/2,000 births. Initial scans of the genome in the 1980s provided evidence of a single CF gene on chromosome 7q31 (Kerem et al. 1989). More recently, a 3-bp deletion (ΔF508) has been identified within this region in the CFTR gene. It is now known that ΔF508 accounts for ~68% of all chromosomes in affected individuals today, with the remainder consisting of several other, rarer mutations in the same gene. Kerem et al. (1989) collected marker data from affected cases and healthy controls, using 23 RFLPs in a 1.8-Mb candidate region of chromosome 7q31, from the MET locus to marker D7S426.

Figure 1 presents odds ratios for each of the RFLPs in the candidate region. There is strongest evidence of disease-marker association in a region of 0.6–0.9 Mb from the MET locus, with a peak observed at 0.869 Mb. Within this region, however, there is a single marker, 0.889 Mb from the MET locus, at which disease-marker association is much lower. This marker is, in fact, closest to the ΔF508 mutation in the CFTR gene, at ~0.880 Mb from the MET locus.

Previous analyses of these data by published methods have yielded a variety of results. Terwilliger (1995) places the mutation 0.77 Mb from the MET locus, with a 99.9% support interval of 0.69–0.87 Mb. Although this interval overlaps part of the CFTR gene, it does not include the ΔF508 mutation. Xiong and Guo (1997) obtained an improved estimate of the location of ΔF508, at 0.80 Mb, although this was derived from only a selected subset of the CF data, a subset for which any case chromosomes not bearing the ΔF508 mutation were excluded. With additional information for the region of the mutation (Morral et al. 1994), Collins and Morton (1998) analyzed the same subset and obtained an estimate of 0.83 Mb.

We applied the hidden Markov model–based mapping method proposed here to the complete CF data set of Kerem et al. (1989). We assumed a disease frequency of *Q*=.0005, on the basis of estimates for the population from which the sample was ascertained. We also assumed a mean recombination rate of .5, since, in the candidate region around the CFTR gene, the physical distance of 1.6 Mb corresponds to a genetic distance of 0.8 cM (Collins et al. 1996). A number of sets of initial values for the model parameters were considered, all resulting in similar posterior distributions and parameter estimates after an initial burn-in period of the Metropolis-Hastings rejection-sampling scheme followed by a sampling period of a further 1 million iterations for which every 100th iteration was recorded. Regardless of the starting values for the model parameters, there is rapid convergence to parameter estimates, which also appear to mix well (data not shown).

Figure 2 presents the posterior distributions of the location of the mutation, the hidden Markov–model parameters β, α, λ, and γ, and the first-order autoregressive parameters (the correlation coefficient and σ^{2}) for recombination-rate heterogeneity across the candidate region, when independent recombinational histories for the case and control chromosomes are assumed. Also presented is the distribution of the hidden Markov–model log-likelihood obtained throughout the sampling period. Table 2 presents the initial parameter values for this run, together with the true parameter values (where known) and summary statistics from the posterior distributions.

The posterior distribution of the location of the mutation has a 99% credibility interval of 0.731–0.838 Mb from the MET locus. Although there is substantial error in the location estimate, the results are consistent with estimates obtained by other case-control–based mapping methods, which have been described above. The estimated frequency of the mutation, α, is in agreement with a mutation-frequency estimate of .0224 based on a fully penetrant recessive disease with frequency .0005 (Kerem et al. 1989). The estimate of the disease-model parameter β approaches 0, which is as would be expected for a fully penetrant recessive disease for which β_{F}=1 and β_{N}=0. The estimated major-mutation proportion, λ, is close to the estimate that 70% of existing CF chromosomes bear the ΔF508 mutation. The estimated age of the mutation corresponds to 205 generations (γ=2.05). Again, this is not inconsistent with other, independent estimates of the age of ΔF508, which suggest that it is ~200 generations old (Serre et al. 1990). Credibility intervals for the first-order autoregressive parameters do not include 0, suggesting that there is recombination-rate heterogeneity across the candidate region.

For comparison, we have also applied the hidden Markov model–based mapping method to the same set of data but have modeled dependence between case chromosomes by using the conditional coalescent as proposed by McPeek and Strahs (1999). Figure 3 presents the posterior distribution of the model parameters that is based on every 100th of 1 million iterations of the Metropolis-Hastings rejection-sampling scheme, with the log-likelihood being corrected for between-chromosome correlations.

We obtain an improved estimate of the location of ΔF508: 0.798 Mb from the MET locus, with a 99% credibility interval of 0.610–1.069 Mb, this time with the true location of the mutation being included. The other model parameter estimates remain relatively unchanged, but with noticeably wider posterior credibility intervals (data not shown). The exception is in the first-order autoregressive-process–correlation parameter, the mean estimate of which is considerably closer to 0 under the coalescent model (.053) than under independence (.250). This would suggest that much of the correlation between marker loci is accounted for by the correlation between related chromosomes.

With the same correction, McPeek and Strahs (1999) estimate the location of the mutation to be 0.95 Mb from the MET locus, with a 99% confidence interval of 0.28–1.62 Mb (calculated on the basis of their presented 95% confidence interval). The difference in estimated location between the two methods is likely a result of McPeek and Strahs's (1999) assumption of a homogeneous recombination rate of 1 cM per 1 Mb across the map of marker loci.
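Converting a confidence interval from one coverage level to another, as was done for the McPeek and Strahs interval, can be sketched as follows. This is a minimal illustration that assumes a symmetric, normal-theory interval; the text does not state which conversion was actually used, and the function name is hypothetical.

```python
from statistics import NormalDist


def rescale_interval(lower, upper, from_level, to_level):
    """Rescale a symmetric normal-theory interval between coverage levels."""
    z_from = NormalDist().inv_cdf(0.5 + from_level / 2.0)
    z_to = NormalDist().inv_cdf(0.5 + to_level / 2.0)
    centre = (lower + upper) / 2.0
    half_width = (upper - lower) / 2.0 * z_to / z_from
    return centre - half_width, centre + half_width


# Back out the 95% interval implied by the quoted 99% interval (0.28-1.62 Mb),
# then rescale it forward again to confirm the round trip.
lo95, hi95 = rescale_interval(0.28, 1.62, 0.99, 0.95)
lo99, hi99 = rescale_interval(lo95, hi95, 0.95, 0.99)
print(lo95, hi95, lo99, hi99)
```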

Table 3 presents the posterior ancestral-haplotype probabilities realized over the 1 million iterations of the Metropolis-Hastings rejection-sampling scheme. There is complete agreement over all but the markers most distant from the ΔF508 mutation, at which levels of disease-marker association are weakest. For this particular sample, maximizing the model likelihood over the ancestral haplotype, as in the method of McPeek and Strahs (1999), would be expected to yield results similar to those of our proposed method, since the maximum-likelihood estimate has such high posterior probability. With less certainty with regard to ancestral haplotypes, maximum-likelihood–based approaches may suffer bias and warrant further investigation.

### HD

HD is a midlife-onset autosomal dominant neurodegenerative disorder occurring at an incidence of ~1 case/10,000. The HD gene was first mapped to chromosome 4p16, in the region of marker D4S10, by Gusella et al. (1983, 1984). More recently, the Huntington's Disease Collaborative Research Group (1993) has identified within this region a large gene (IT15) with an expandable unstable trinucleotide-repeat sequence. It is now known that IT15 genes with many repeats of the trinucleotide sequence are responsible for the development of the disease. MacDonald et al. (1991) collected marker data from HD and normal chromosomes in a 2.5-Mb region of chromosome 4p16, from marker D4S90 to D4S10, using 27 RFLPs.

Figure 4 presents odds ratios for each of the RFLPs in the candidate region. The strongest evidence of disease-marker association lies in the interval between markers D4S182 and D4S180, at 2.38 Mb and 2.85 Mb, respectively, from marker D4S90. This is in agreement with the location of IT15 at ~2.5–2.6 Mb from marker D4S90. As in the CF data of Kerem et al. (1989), there are RFLPs with low levels of disease-marker association within this interval. Despite this apparent inconsistency, Xiong and Guo (1997), using their case-control–based mapping method, obtained 2.62 Mb from marker D4S90 as the estimated location of the disease gene.
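A single-marker odds ratio of the kind plotted for each RFLP can be computed from a 2×2 table of allele counts on case and control chromosomes. The sketch below uses made-up counts purely for illustration; it is not the calculation code used for the HD data, and the function and argument names are assumptions.

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds ratio for carrying a marker allele on case versus control chromosomes."""
    if case_without == 0 or control_with == 0:
        raise ZeroDivisionError("odds ratio undefined when a denominator cell is zero")
    return (case_with * control_without) / (case_without * control_with)


# Hypothetical counts: the allele is seen on 30 of 40 case chromosomes
# and on 10 of 40 control chromosomes.
example = odds_ratio(30, 10, 10, 30)
print(example)
```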

We also have applied the hidden Markov model–based mapping method to the HD data of MacDonald et al. (1991). We assumed independent recombinational histories for the case chromosomes and a disease frequency of *Q*=10^{-4}, in line with published estimates for populations of European descent. We also assumed a mean recombination rate of 1 in the candidate region, so that the usual 1 Mb–to–1 cM correspondence holds. We considered various sets of initial values for the model parameters, all resulting in similar posterior distributions and parameter estimates after the same burn-in period and sampling period that were employed in the analysis of the CF data.

Figure 5 presents, for the HD data, the posterior distributions for the hidden Markov model and autoregressive parameters. Summary statistics from the posterior distributions of model parameters are presented in table 4, together with true values (where known). The mean estimate of the location of the mutation is 2.52 Mb from marker D4S90, with a 99% credibility interval of 2.20–2.75 Mb. The mean estimate is accurate, being contained within the IT15 gene for HD. The wide credibility interval reflects the considerable variation in the strength of disease-marker association in the IT15 gene (fig. 4).

The estimate of the disease model parameter is >0, which we would expect for a dominant disease. The estimated age of the mutation is , corresponding to 137 generations, and is not inconsistent with other estimates of the age of HD (Kaplan et al. 1995; Xiong and Guo 1997). Credibility intervals for the autoregressive parameters do not include 0, suggesting recombination-rate heterogeneity across the candidate region.

## Discussion

We have presented a new multilocus method for the fine-scale mapping of disease genes. We model disease-marker association in the vicinity of a disease gene by means of a hidden Markov process used in a way similar to that employed by McPeek and Strahs (1999). In this way, both models account for correlation between the markers, a clear advantage over many existing multilocus composite-likelihood methods that assume independence (Terwilliger 1995; Xiong and Guo 1997; Collins and Morton 1998). In addition, both models allow for mutation at marker loci.

We employ MCMC methods in a Bayesian framework, to obtain posterior distributions for model parameter estimates including those for disease-gene location and the age of the disease-predisposing mutation. A potential advantage of this approach, over both the maximum-likelihood estimation used by McPeek and Strahs (1999) and other existing multipoint methods, is that, where appropriate, we are able to incorporate prior information for model parameters. In addition, by integrating over the marker haplotype present on the founding chromosome, we allow for the uncertainty in its makeup, in contrast to McPeek and Strahs (1999), who consider only the maximum-likelihood estimate.

Our model is more sophisticated than previous models in that we allow for recombination-rate heterogeneity across the candidate region, using a first-order Gaussian autoregressive process. In this way, we can allow for recombination hot spots and cold spots that may lead to bias in existing models. However, it would be relatively straightforward to incorporate variable recombination rates in the model proposed by McPeek and Strahs (1999).
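To make the idea of a first-order Gaussian autoregressive process for recombination rates concrete, the following sketch simulates a sequence of rates and evaluates its log-likelihood. The parameterization (stationary process with mean `mu`, correlation `rho`, and innovation standard deviation `sigma`) is a common textbook form and is assumed here; the exact form used in the model is not spelled out in this section, and negative draws would still need to be reflected to keep rates non-negative.

```python
import math
import random


def _normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def simulate_ar1_rates(k, mu, rho, sigma, rng):
    """Simulate k rates from a stationary Gaussian AR(1) process (requires |rho| < 1)."""
    rates = [rng.gauss(mu, sigma / math.sqrt(1.0 - rho ** 2))]
    for _ in range(k - 1):
        rates.append(mu + rho * (rates[-1] - mu) + rng.gauss(0.0, sigma))
    return rates


def ar1_log_likelihood(rates, mu, rho, sigma):
    """Log-likelihood of a sequence of rates under the same AR(1) process."""
    total = _normal_logpdf(rates[0], mu, sigma / math.sqrt(1.0 - rho ** 2))
    for previous, current in zip(rates, rates[1:]):
        total += _normal_logpdf(current, mu + rho * (previous - mu), sigma)
    return total


rng = random.Random(1)
rates = simulate_ar1_rates(k=10, mu=1.0, rho=0.25, sigma=0.2, rng=rng)
print(len(rates), ar1_log_likelihood(rates, 1.0, 0.25, 0.2))
```

With `rho` equal to 0 the rates are independent draws around `mu`; values of `rho` away from 0 make neighboring intervals share hot or cold behavior.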

We have used our method to identify the location of two known mutations—one for CF and one for HD. For HD, we obtain an accurate estimate of the location of the mutation within the IT15 gene, which is known to be responsible for the development of the disorder. In deriving the hidden Markov model for disease-marker association, we have assumed a multiplicative model for the disease. HD is a dominant (i.e., nonmultiplicative) disorder, suggesting that our method is robust to deviations from a multiplicative-disease model.

For CF, we have presented two sets of simulation results, corresponding to two possible models of dependence in the recombinational histories of chromosomes in affected individuals. First, we have assumed independence, implying a star-shaped genealogy, which yields, for the location of the mutation, a 99% credibility interval that does not contain the true location of ΔF508. This result is consistent with analyses of the same data set by other multilocus models that assume independence between case chromosomes (Terwilliger 1995; Xiong and Guo 1997; Collins and Morton 1998). This clearly suggests deficiency in the star-shaped genealogical model of case-chromosome ancestry. For the second set of simulations, we correct for correlation between case chromosomes by means of a conditional coalescent model of dependence, proposed by McPeek and Strahs (1999). They justify the correction by means of quasi-likelihood arguments (Wedderburn 1974) that do not hold in a Bayesian framework. However, the same arguments suggest the use of an approximate log-likelihood, calculated by multiplication, by a correction factor, of the log-likelihood under independent recombinational histories (McPeek and Strahs 1999). This has the effect of increasing the variance of the posterior distribution, to account for the shared ancestry of case chromosomes.
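The effect of multiplying the log-likelihood by a correction factor can be illustrated numerically. The sketch below assumes a factor between 0 and 1, a flat prior, and a toy one-parameter Gaussian log-likelihood; none of these choices come from the analysis itself, and the function names are hypothetical. Under these assumptions the posterior variance is inflated by the reciprocal of the factor.

```python
import math


def corrected_log_likelihood(log_likelihood, correction_factor):
    """Approximate log-likelihood: the independence log-likelihood times a factor."""
    return correction_factor * log_likelihood


def posterior_variance_on_grid(log_likelihood_fn, grid, correction_factor):
    """Posterior variance under a flat prior, evaluated on an evenly spaced grid."""
    logs = [corrected_log_likelihood(log_likelihood_fn(x), correction_factor) for x in grid]
    peak = max(logs)
    weights = [math.exp(value - peak) for value in logs]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, grid)) / total
    return sum(w * (x - mean) ** 2 for w, x in zip(weights, grid)) / total


def toy_log_likelihood(x):
    """Toy log-likelihood with unit posterior variance when uncorrected."""
    return -0.5 * x * x


grid = [-10.0 + 0.01 * i for i in range(2001)]
variance_uncorrected = posterior_variance_on_grid(toy_log_likelihood, grid, 1.0)
variance_corrected = posterior_variance_on_grid(toy_log_likelihood, grid, 0.5)
print(variance_uncorrected, variance_corrected)
```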

An alternative approach to take account of the dependence between chromosomes is to model their ancestry directly, by means of a genealogical tree. In such a model, we can explicitly allow for multiple disease mutations, mutations at marker loci within the candidate region, and recombination events in the ancestry of the case sample. Lam et al. (2000) have constructed a genealogical tree for case chromosomes by using a combination of parsimony and likelihood methods, in which each chromosome in the tree is separated from its parent by a single marker mutation or recombination event. They then proceed to map the disease mutation as if the tree were known with certainty. A more appropriate approach would be to integrate over all possible genealogies, an approach that can be approximated by simulation. Graham and Thomson (1998) used such an approach to generate genealogical trees that are consistent with an observed sample of chromosomes, using a Moran (1962) model with known demographic parameters. However, their model assumes knowledge of the ancestral marker haplotype, the number of generations since the common ancestor, and the development of the population during this period. It is currently restricted to interval mapping using pairs of marker loci. Generalization of this approach to a full multilocus analysis with less stringent assumptions remains a challenge that will require considerable work in the future.

## Acknowledgments

A.P.M. acknowledges financial support from Pfizer Limited. We thank the referees for their helpful comments on the submitted version of this article.

## Appendix A: Summary of Model Parameters

| Parameter | Range | Definition |
|---|---|---|
| α | [0,1] | Probability that chromosome is IBD with founder, at any given locus |
| β_{F} | [0,1] | Disease model parameter associated with mutation at disease locus |
| β_{N} | [0,1] | Disease model parameter associated with normal allele at disease locus |
| γ | [0,∞] | Recombination-rate parameter per 1 Mb of DNA in candidate region |
| p_{i} | [0,1] | Frequency of allele M_{i1} on chromosomes not IBD with founder, at locus i |
| ω_{i} | 0 or 1 | Indicator variable denoting presence/absence of allele M_{i1} |
| Q | [0,1] | Population frequency of disease |
| κ | [0,∞] | Sample-enrichment factor |

## Appendix B: Metropolis-Hastings Rejection-Sampling Scheme

Each iteration of the Metropolis-Hastings sampling scheme consists of a seven-step procedure. We denote the current parameter set by *x*, Γ, *p*, and *w*. In addition, the current set of recombination rates is denoted *y*, modeled as a first-order autoregressive process with known mean recombination rate μ, current correlation parameter, and current variance σ^{2}. If we do not wish to allow for heterogeneity in recombination rates across the candidate region, we set *y*=1 and ignore step 7 of the sampling scheme. The likelihood of a sample of cases and controls for the current parameter set is denoted *L*(*data*|*x*,Γ,*p*,*w*,*y*)_{TOT}. The likelihood of the set of recombination rates for the current parameter set is denoted *L*(*y*|μ,,σ^{2})_{AR1}. Throughout, we assume each to be drawn at random from the proposal distribution U(−.5,.5) and each υ to be drawn from U(0,1).

1. For each marker *j* in turn, propose a new allele frequency, *p*^{*}_{j} = *p*_{j} + ν_{p}, where ν_{p} determines the maximum possible change from the current allele frequency. Since *p*_{j} ∈ [0,1], proposed allele frequencies outside this range are reflected back into the parameter space. The likelihood for the proposed parameter set is denoted *L*(*data*|*x*,Γ,*p*^{j*},*w*,*y*)_{TOT}, where *p*^{j*} is the vector of current allele frequencies with *p*_{j} replaced by the proposed *p*^{*}_{j}. The proposed allele frequency is accepted to the current parameter set if the acceptance probability is
2. For each marker *j* in turn, propose a new ancestral indicator: The likelihood for the proposed parameter set is denoted *L*(*data*|*x*,Γ,*p*,*w*^{j*},*y*)_{TOT}, where *w*^{j*} is the vector of current ancestral indicators with ω_{j} replaced by the proposed ω^{*}_{j}. We then accept the proposed ancestral indicator to the current parameter set if the acceptance probability is
3. Propose a new location for the disease gene, *x*^{*} = *x* + ν_{x}, where the parameter ν_{x} determines the maximum change from the current disease-gene location. We restrict the location of the disease gene to the candidate region, so that proposed locations distal to the first and last markers on the map are reflected back into the candidate region. The likelihood for the proposed parameter set is denoted *L*(*data*|*x*^{*},Γ,*p*,*w*,*y*)_{TOT}, and the proposed location is accepted to the current parameter set if
4. Propose a new penetrance parameter, β^{*} = β + ν_{β}, where ν_{β} determines the maximum change from the current penetrance parameter. The penetrance parameter is restricted to β ∈ [0,1], so that proposed penetrances outside this range are reflected back into the parameter space. The likelihood for the proposed parameter set is denoted *L*(*data*|*x*,Γ^{β*},*p*,*w*,*y*)_{TOT}, where Γ^{β*} is the vector of current hidden Markov–model parameters with β replaced by the proposed β^{*}. The proposed penetrance parameter is then accepted to the current parameter set if the acceptance probability is
5. Propose a new age of the mutation, γ^{*} = γ + ν_{γ}, where ν_{γ} determines the maximum change from the current age of the mutation. The age of the mutation is restricted to be positive, so that a negative proposed age is reflected back into the valid parameter space. The likelihood for the proposed parameter set is denoted *L*(*data*|*x*,Γ^{γ*},*p*,*w*,*y*)_{TOT}, where Γ^{γ*} is the vector of current hidden Markov–model parameters with γ replaced by the proposed γ^{*}. The proposed age of the mutation is then accepted to the current parameter set if the acceptance probability is
6. Propose a new major-mutation proportion, λ^{*} = λ + ν_{λ}, where ν_{λ} determines the maximum change from the current proportion. Since λ is a proportion, it is restricted to [0,1]. Proposed proportions outside this range are reflected back into the valid parameter space. The likelihood for the proposed parameter set is denoted *L*(*data*|*x*,Γ^{λ*},*p*,*w*,*y*)_{TOT}, where Γ^{λ*} is the vector of current hidden Markov–model parameters with λ replaced by the proposed λ^{*}. The proposed major-mutation proportion is then accepted to the current parameter set if the acceptance probability is
7. Propose a new set of *K* recombination rates so that, for each *i* = 1,2,...,*K*, *y*^{*}_{i} = *y*_{i} + ν_{y}, where ν_{y} determines the maximum change from the current recombination rates. Each recombination rate is restricted to be non-negative, so that negative proposals are reflected back into the valid parameter space. In the same step, we also propose new values for the autoregressive correlation parameter and for σ, with σ^{*} = σ + ν_{σ} and the correlation parameter perturbed analogously; the corresponding ν parameters determine the maximum change in each. The correlation parameter is restricted to [0,1], and the standard deviation σ is restricted to be positive. Proposals outside the permitted space are reflected back to valid parameter values. The likelihoods for the proposed parameters and recombination rates are denoted *L*(*y*^{*}|μ,^{*},σ^{*2})_{AR1} and *L*(*data*|*x*,Γ,*p*,*w*,*y*^{*})_{TOT}. The complete set of proposed recombination rates and autoregressive parameters is accepted to the current set if

At any stage we can incorporate prior distributions for model parameters by multiplying the appropriate acceptance probability by the ratio of prior probabilities for the proposed and current parameter values.
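A single bounded-parameter update of the kind described in steps 1 and 3–6 can be sketched as follows. The acceptance expressions are not reproduced in this text, so the sketch assumes the standard Metropolis rule for a symmetric proposal: accept when a U(0,1) draw falls below the ratio of proposed to current likelihood (times the prior ratio, if priors are used). It also assumes that the proposal is a U(−.5,.5) draw scaled by the maximum-change parameter. The function names and the toy binomial target are illustrative only.

```python
import math
import random


def reflect(value, lower, upper):
    """Reflect a proposal back into [lower, upper]; upper may be math.inf."""
    while value < lower or value > upper:
        if value < lower:
            value = 2.0 * lower - value
        else:
            value = 2.0 * upper - value
    return value


def metropolis_update(current, log_target, max_change, lower, upper, rng):
    """One reflected random-walk Metropolis update of a single bounded parameter.

    log_target returns log-likelihood plus log-prior (assumed form of the target).
    """
    proposal = reflect(current + rng.uniform(-0.5, 0.5) * max_change, lower, upper)
    log_ratio = log_target(proposal) - log_target(current)
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal
    return current


def toy_log_target(p):
    """Toy target: binomial log-likelihood for 7 successes in 10 trials, flat prior."""
    if p <= 0.0 or p >= 1.0:
        return -math.inf
    return 7.0 * math.log(p) + 3.0 * math.log(1.0 - p)


rng = random.Random(0)
state = 0.5
samples = []
for _ in range(5000):
    state = metropolis_update(state, toy_log_target, 0.4, 0.0, 1.0, rng)
    samples.append(state)
print(sum(samples) / len(samples))
```

Reflection at the boundaries keeps the proposal symmetric, which is why no proposal-density correction appears in the acceptance step of this sketch.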

**American Society of Human Genetics**

Bayesian Fine-Scale Mapping of Disease Loci, by Hidden Markov Models. American Journal of Human Genetics. Jul 2000; 67(1):155.
