# Joint Modeling of Linkage and Association: Identifying SNPs Responsible for a Linkage Signal

## Abstract

Once genetic linkage has been identified for a complex disease, the next step is often association analysis, in which single-nucleotide polymorphisms (SNPs) within the linkage region are genotyped and tested for association with the disease. If a SNP shows evidence of association, it is useful to know whether the linkage result can be explained, in part or in full, by the candidate SNP. We propose a novel approach that quantifies the degree of linkage disequilibrium (LD) between the candidate SNP and the putative disease locus through joint modeling of linkage and association. We describe a simple likelihood of the marker data conditional on the trait data for a sample of affected sib pairs, with disease penetrances and disease-SNP haplotype frequencies as parameters. We estimate model parameters by maximum likelihood and propose two likelihood-ratio tests to characterize the relationship of the candidate SNP and the disease locus. The first test assesses whether the candidate SNP and the disease locus are in linkage equilibrium so that the SNP plays no causal role in the linkage signal. The second test assesses whether the candidate SNP and the disease locus are in complete LD so that the SNP or a marker in complete LD with it may account fully for the linkage signal. Our method also yields a genetic model that includes parameter estimates for disease-SNP haplotype frequencies and the degree of disease-SNP LD. Our method provides a new tool for detecting linkage and association and can be extended to study designs that include unaffected family members.

## Introduction

Positional cloning is widely used for identification of genes involved in human diseases. To date, hundreds of disease genes have been identified solely on the basis of their chromosomal position (Botstein and Risch 2003); examples include hemochromatosis (Feder et al. 1996), inflammatory bowel disease (Hugot et al. 2001; Ogura et al. 2001), and lactose intolerance (Enattah et al. 2002). The first step in a traditional positional-cloning approach involves a genomewide linkage analysis performed on a collection of families with multiple affected individuals. Often, linkage analysis results in a candidate region of 10–20 Mb. To localize the susceptibility allele more precisely, disease-marker association analyses with additional genetic markers specific to the linked region can be performed. With recent progress on high-throughput SNP genotyping (Sachidanandam et al. 2001; Syvanen 2001; Oliphant et al. 2002; Olivier et al. 2002) and the HapMap project (International HapMap Consortium 2003), these follow-up association studies are becoming less expensive and now routinely include hundreds or thousands of markers.

Association analysis often compares marker-allele frequencies between unrelated case and control subjects. In this design, only a subset of the samples originally collected for linkage analysis can be reused. As an alternative, family-based association methods have been developed. Family-based association tests offer a compromise between traditional linkage studies and case-control association studies. The classic family-based transmission/disequilibrium test was proposed to test for association in the presence of linkage in family trios containing two parents and one affected offspring (Spielman et al. 1993). This approach has been extended to discordant sib pairs (Curtis 1997; Boehnke and Langefeld 1998), sibships with multiple affected and unaffected sibs (Spielman and Ewens 1998), general pedigrees (Martin et al. 2000), and quantitative traits (Allison 1997; Rabinowitz 1997; Abecasis et al. 2000*a*, 2000*b*).

A shortcoming of these family-based association methods is that, although they test for association, they cannot distinguish between potentially causal SNPs and other variants showing weaker association, except in the case of quantitative traits (Cardon and Abecasis 2000). Göring and Terwilliger (2000) proposed a unified theoretical model for linkage and linkage disequilibrium (LD) analysis through the use of a “pseudomarker” locus, but their approach cannot accommodate information contributed by flanking markers. Horikawa et al. (2000) suggested a modified association approach by examining how the evidence of linkage was partitioned in accordance with the genotype at the associated SNP, but they did not explore the properties of this approach. Li et al. (2004) explored the relationship between family-specific weights based on the affected individuals’ genotypes and family-specific nonparametric linkage (NPL) scores, but their method does not quantify the relationship between SNP alleles and the linkage signal. Sun et al. (2002) developed an approach that identifies SNPs whose genotypes can fully explain the observed linkage signal, but their test does not identify SNPs that play a partial role in explaining the linkage signal.

In this article, we describe a statistical framework that identifies candidate SNPs that can fully or partly explain the observed linkage signal, through joint modeling of linkage and association with the use of affected sib pairs (ASPs). Our method uses genotype information contributed by both the candidate SNP and the flanking markers. When a candidate SNP is identified as being able to account for linkage, our approach estimates the degree of LD between the candidate SNP and disease alleles. The estimate of disease-SNP LD quantifies the degree to which the linkage signal can be explained by the candidate SNP. We summarize the available information using a simple likelihood of the marker data conditional on the trait data, with disease penetrances and disease-SNP haplotype frequencies as parameters. We estimate model parameters by maximum likelihood and propose two likelihood-ratio tests to characterize the relationship between the candidate SNP and the putative disease locus. We calculate LD between disease and SNP alleles on the basis of haplotype-frequency estimates. Our method can identify both associated and potentially causal SNPs. Here, we focus on the ASP study design, but our method can be readily extended to accommodate unaffected individuals and other family structures, as well as unrelated individuals.

## Methods

### Assumptions and Definitions

We assume that there is a set of ASPs typed for a candidate SNP and *M* ≥ 0 flanking markers that can help evaluate evidence for linkage. We wish to evaluate evidence for association at the candidate SNP and to estimate the degree of LD with the unobserved disease locus. If there are multiple SNPs, we consider them one at a time as the candidate SNP. We allow LD between the candidate SNP and the unobserved disease alleles, but we assume linkage equilibrium between the flanking markers and the candidate SNP. Our goal is to quantify the relationship between the candidate SNP and the unobserved disease alleles. Our method assumes that a single diallelic polymorphism directly contributes to risk in each linked region. We address the implications of multiple disease variants in the “Discussion” section.

Consider a diallelic disease locus with disease-predisposing allele *D* (with frequency *p*_{D}) and wild-type allele *d* (with frequency *p*_{d}=1-*p*_{D}) and a nearby diallelic SNP with alleles *A* (with frequency *p*_{A}) and *a* (with frequency *p*_{a}=1-*p*_{A}). Denote the four disease-SNP haplotypes as *DA, Da, dA,* and *da* (with frequencies *p*_{DA}, *p*_{Da}, *p*_{dA}, and *p*_{da}, respectively). We assume Hardy-Weinberg equilibrium in the general population for all markers, including the superlocus formed by the combination of the disease and SNP loci. Let *f*_{g}=*P*(*affected*|*g*) be the penetrance for a given genotype *g* at the disease locus. By definition, the population prevalence of the disease, *K,* is equal to *f*_{dd}*p*^{2}_{d}+2*f*_{Dd}*p*_{d}*p*_{D}+*f*_{DD}*p*^{2}_{D}, the attributable fraction equals (*K*-*f*_{dd})/*K*, and the genotype relative risk (GRR) for genotype *g* equals *f*_{g}/*f*_{dd}.
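As a small illustration of these definitions, the following sketch computes the prevalence, attributable fraction, and GRRs for one set of penetrances. The function name and the numerical values are hypothetical and are not taken from table 4.

```python
def disease_summary(p_D, f_dd, f_Dd, f_DD):
    """Return prevalence K, attributable fraction, and GRRs under Hardy-Weinberg equilibrium."""
    p_d = 1.0 - p_D
    K = f_dd * p_d**2 + 2 * f_Dd * p_d * p_D + f_DD * p_D**2
    attributable_fraction = (K - f_dd) / K
    grr = {"Dd": f_Dd / f_dd, "DD": f_DD / f_dd}
    return K, attributable_fraction, grr


# Illustrative (made-up) parameter values.
K, af, grr = disease_summary(p_D=0.1, f_dd=0.01, f_Dd=0.02, f_DD=0.04)
print(K, af, grr)
```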

Let *X*=(*X*_{1},…,*X*_{k},*X*_{SNP},*X*_{k+1},…,*X*_{M}) be the observed marker genotypes for the ASP, and let the probability of no change in identity-by-descent (IBD) status between consecutive markers be ψ_{m}=θ^{2}_{m}+(1-θ_{m})^{2}, where θ_{m} is the recombination fraction between markers *m* and *m* + 1 (1 ≤ *m* ≤ *M*-1). Let *I*_{m}, *I*_{SNP}, and *I*_{D} be the possibly unknown number of alleles shared IBD by an ASP at marker *m,* at the candidate SNP, and at the putative disease locus, respectively. For now, assume that there is no recombination between the candidate SNP and the disease locus, so that *I*_{SNP}=*I*_{D}. Denote disease locus IBD-sharing probabilities for an ASP by *z*_{i}=*P*(*I*_{D}=*i*|*ASP*), where *i*=0,1,2, and *z*=(*z*_{0},*z*_{1},*z*_{2}). For ease of computation, we assume that there is no genetic interference, so that {*I*_{m}} forms a hidden Markov chain.
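The Markov chain on IBD states can be made concrete with a short sketch. The matrix below is the standard sib-pair IBD transition matrix written in terms of ψ; we assume it corresponds to the transition probabilities the text attributes to Risch (1990), and the function names are ours.

```python
def psi(theta):
    """Probability that one parental meiosis pair keeps its IBD status between two loci."""
    return theta**2 + (1 - theta)**2


def ibd_transition(theta):
    """3x3 matrix T[i][j] = P(I_next = j | I_current = i) for a sib pair."""
    s = psi(theta)
    return [
        [s * s, 2 * s * (1 - s), (1 - s) ** 2],
        [s * (1 - s), s * s + (1 - s) ** 2, s * (1 - s)],
        [(1 - s) ** 2, 2 * s * (1 - s), s * s],
    ]


for row in ibd_transition(0.10):
    print(row)
```

Each row sums to 1, and the prior (1/4, 1/2, 1/4) is the stationary distribution, so the same matrix can be used when moving in either direction along the chromosome.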

### Conditional Probability of Marker Data, Given an ASP

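We wish to calculate *P*(*X*|*ASP*), the probability of the marker genotype data *X* for an ASP. By applying the forward and backward algorithms of Baum (1972), *P*(*X*|*ASP*) can be calculated as

$$
P(X \mid ASP) = \sum_{I_D=0}^{2} P(X_1,\ldots,X_k \mid I_D)\, P(X_{SNP}, I_D \mid ASP)\, P(X_{k+1},\ldots,X_M \mid I_D), \tag{1}
$$

where *k* and *k* + 1 are the flanking markers on the left- and right-hand sides of the candidate SNP, $P(X_1,\ldots,X_k \mid I_D)=\sum_{I_k} L_k(I_k)\,P(I_k \mid I_D)$, and $P(X_{k+1},\ldots,X_M \mid I_D)=\sum_{I_{k+1}} R_{k+1}(I_{k+1})\,P(I_{k+1} \mid I_D)$.

At an arbitrary marker *m* (1 ≤ *m* ≤ *M*),

$$
L_m(I_m) = P(X_1,\ldots,X_m \mid I_m) = P(X_m \mid I_m) \sum_{I_{m-1}} L_{m-1}(I_{m-1})\, P(I_{m-1} \mid I_m)
$$

and

$$
R_m(I_m) = P(X_m,\ldots,X_M \mid I_m) = P(X_m \mid I_m) \sum_{I_{m+1}} R_{m+1}(I_{m+1})\, P(I_{m+1} \mid I_m).
$$

Special cases are *L*_{1}(*I*_{1})=*P*(*X*_{1}|*I*_{1}) and *R*_{M}(*I*_{M})=*P*(*X*_{M}|*I*_{M}). The conditional probabilities of the genotype data, given the number of alleles shared IBD by the sib pair at marker *m,* *P*(*X*_{m}|*I*_{m}), are given in table 1 (Thompson 1975). IBD transition probabilities, *P*(*I*_{m+1}|*I*_{m}), are given in table 2 (Risch 1990). Recursive calculation of *L*_{m}(*I*_{m}) and *R*_{m}(*I*_{m}) allows the rapid evaluation of *P*(*X*|*ASP*) in time linear in the number of markers, *M.*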
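The recursions for *L*_{m}(*I*_{m}) and *R*_{m}(*I*_{m}) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-marker probabilities *P*(*X*_{m}|*I*_{m}) are supplied as hypothetical input triples rather than computed from genotypes, and the transition matrix is the standard sib-pair form, assumed to match table 2.

```python
def ibd_transition(theta):
    """Standard sib-pair IBD transition matrix (assumed form of table 2)."""
    s = theta**2 + (1 - theta) ** 2
    return [
        [s * s, 2 * s * (1 - s), (1 - s) ** 2],
        [s * (1 - s), s * s + (1 - s) ** 2, s * (1 - s)],
        [(1 - s) ** 2, 2 * s * (1 - s), s * s],
    ]


def left_chain(emissions, thetas):
    """L[m][i] = P(X_1..X_m | I_m = i); emissions[m][i] = P(X_m | I_m = i)."""
    L = [list(emissions[0])]
    for m in range(1, len(emissions)):
        T = ibd_transition(thetas[m - 1])  # between markers m-1 and m (0-based)
        prev = L[-1]
        L.append([emissions[m][i] * sum(T[i][j] * prev[j] for j in range(3))
                  for i in range(3)])
    return L


def right_chain(emissions, thetas):
    """R[m][i] = P(X_m..X_M | I_m = i), filled from the last marker backward."""
    R = [list(emissions[-1])]
    for m in range(len(emissions) - 2, -1, -1):
        T = ibd_transition(thetas[m])  # between markers m and m+1 (0-based)
        nxt = R[0]
        R.insert(0, [emissions[m][i] * sum(T[i][j] * nxt[j] for j in range(3))
                     for i in range(3)])
    return R


# Hypothetical emission probabilities for three markers.
emissions = [(0.2, 0.5, 0.9), (0.7, 0.1, 0.3), (0.4, 0.4, 0.6)]
thetas = [0.1, 0.2]
print(left_chain(emissions, thetas)[-1], right_chain(emissions, thetas)[0])
```

Because the chain is reversible with respect to the prior (1/4, 1/2, 1/4), the backward conditional *P*(*I*_{m-1}|*I*_{m}) uses the same matrix as the forward one. Each pass visits every marker once, which is where the linear running time comes from.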

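To calculate *P*(*X*_{SNP},*I*_{D}|*ASP*), let *G*_{j} denote the disease-SNP haplogenotype for sib *j*=1,2. Summing over all ordered haplogenotypes that are consistent with the observed SNP genotypes, we get

$$
P(X_{SNP}, I_D \mid ASP) = \frac{\sum_{G_1, G_2} f_{G_1} f_{G_2}\, P(G_1, G_2 \mid I_D)\, P(I_D)}{P(ASP)}, \tag{2}
$$

where *f*_{Gj} is the penetrance of the disease-locus genotype carried by haplogenotype *G*_{j}, and *P*(*G*_{1},*G*_{2}|*I*_{D}) can be calculated from table 1 by regarding each haplogenotype as a genotype of the superlocus that has up to four alleles. For a sib pair, *P*(*I*_{D}) takes the values (1/4, 1/2, 1/4). To illustrate how equation (2) is calculated, consider an ASP with SNP genotype *A*/*A* for the first sib and *a*/*a* for the second sib. The disease-SNP genotypes that are consistent with the observed SNP genotypes are *G*_{1} ∈ {*DA/DA, DA/dA, dA/dA*} and *G*_{2} ∈ {*Da*/*Da*, *Da*/*da*, *da*/*da*}. If *I*_{D}=0, the two haplogenotypes are independent, and the numerator of equation (2) is

$$
\tfrac{1}{4}\left(f_{DD}\,p_{DA}^{2} + 2 f_{Dd}\,p_{DA}\,p_{dA} + f_{dd}\,p_{dA}^{2}\right)\left(f_{DD}\,p_{Da}^{2} + 2 f_{Dd}\,p_{Da}\,p_{da} + f_{dd}\,p_{da}^{2}\right).
$$

Similarly, the probability of an ASP is obtained by summing over all haplogenotypes and IBD states:

$$
P(ASP) = \sum_{I_D=0}^{2} \sum_{G_1, G_2} f_{G_1} f_{G_2}\, P(G_1, G_2 \mid I_D)\, P(I_D). \tag{3}
$$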

In the calculation of equations (2) and (3), we assume that the disease-affection statuses of the ASP are conditionally independent, given their genotypes at the disease locus. This is a common assumption for parametric likelihood calculation. It is exactly true when there are no other genetic or environmental risk factors shared among siblings, and it is a reasonable approximation when there are multiple disease-causing variants or shared environmental risk factors.

Our calculation allows analysis with missing genotypes. For example, to accommodate ASPs in which only one sib is genotyped at the candidate SNP, we sum over all possible SNP genotypes for the sib with missing genotype. Our calculation can also be readily extended to sib-pair samples that include unaffected individuals, by replacing *f*_{Gj} in equations (2) and (3) with 1-*f*_{Gj} for an unaffected individual.
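The calculation of *P*(*X*_{SNP},*I*_{D}|*ASP*) can be sketched by brute-force enumeration. Instead of looking up table 1, this sketch draws the 4, 3, or 2 distinct haplotypes that a sib pair sharing 0, 1, or 2 alleles IBD carries, which is equivalent under Hardy-Weinberg equilibrium at the superlocus. SNP genotypes are coded as the number of *A* alleles. All function names and parameter values are hypothetical.

```python
from itertools import product

HAPS = ("DA", "Da", "dA", "da")
PRIOR = (0.25, 0.5, 0.25)  # P(I_D) for a sib pair


def penetrance(h1, h2, f):
    """f = (f_dd, f_Dd, f_DD), indexed by the number of D alleles."""
    return f[(h1[0] == "D") + (h2[0] == "D")]


def snp_genotype(h1, h2):
    """Number of A alleles carried by a haplogenotype."""
    return (h1[1] == "A") + (h2[1] == "A")


def joint_unnormalised(x1, x2, ibd, freq, f):
    """P(X_SNP = (x1, x2), I_D = ibd, both sibs affected)."""
    total = 0.0
    for h in product(HAPS, repeat=4 - ibd):
        sib1 = (h[0], h[1])
        if ibd == 0:
            sib2 = (h[2], h[3])
        elif ibd == 1:
            sib2 = (h[0], h[2])
        else:
            sib2 = sib1
        if snp_genotype(*sib1) != x1 or snp_genotype(*sib2) != x2:
            continue
        prob = 1.0
        for hap in h:
            prob *= freq[hap]
        total += prob * penetrance(*sib1, f) * penetrance(*sib2, f)
    return PRIOR[ibd] * total


def snp_ibd_given_asp(x1, x2, freq, f):
    """[P(X_SNP, I_D = i | ASP) for i = 0, 1, 2]."""
    p_asp = sum(joint_unnormalised(a, b, i, freq, f)
                for a in range(3) for b in range(3) for i in range(3))
    return [joint_unnormalised(x1, x2, i, freq, f) / p_asp for i in range(3)]


# Made-up haplotype frequencies and penetrances; sib 1 is A/A, sib 2 is a/a.
freq = {"DA": 0.1, "Da": 0.05, "dA": 0.2, "da": 0.65}
f = (0.01, 0.02, 0.05)
print(snp_ibd_given_asp(2, 0, freq, f))
```

For the *A*/*A* and *a*/*a* pair, the third entry is zero because sibs sharing two alleles IBD must have identical genotypes. To include an unaffected sib, the corresponding `penetrance(...)` factor would be replaced by one minus its value.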

### The Relationship between Disease Locus and Candidate SNP

A useful measure of LD between two loci is the squared statistical correlation, defined as *r*^{2}=(*p*_{DA}-*p*_{D}*p*_{A})^{2}/[*p*_{D}(1-*p*_{D})*p*_{A}(1-*p*_{A})] in a sample of phased haplotypes. Multiplying *r*^{2} by the sample size yields the χ^{2} statistic for comparison of allele frequencies between cases and controls in a random population sample. *r*^{2} measures the degree of LD between the candidate SNP and the putative disease locus, as represented by the observed linkage signal, and can quantify the degree to which the linkage signal is explained by the candidate SNP. The candidate SNP and the putative disease locus can be in linkage equilibrium (*r*^{2}=0), complete LD (*r*^{2}=1), or partial LD (0<*r*^{2}<1). Under linkage equilibrium, the candidate SNP is not associated with the putative disease locus and plays no causal role in the linkage signal. Under complete LD, the candidate SNP or a marker in complete LD with it can fully account for the linkage signal; we call this model “plausible causality.” Under partial LD, the candidate SNP partially accounts for the linkage signal.
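The definition of *r*^{2} translates directly into code. In this sketch the function name is ours and the frequencies are illustrative; the marginal allele frequencies are recovered from the three free haplotype frequencies.

```python
def r_squared(p_DA, p_Da, p_dA):
    """Squared correlation between disease and SNP alleles from haplotype frequencies."""
    p_D = p_DA + p_Da
    p_A = p_DA + p_dA
    d = p_DA - p_D * p_A  # LD coefficient
    return d * d / (p_D * (1 - p_D) * p_A * (1 - p_A))


print(r_squared(0.06, 0.14, 0.24))  # linkage equilibrium with p_D = 0.2, p_A = 0.3
print(r_squared(0.30, 0.00, 0.00))  # complete LD with p_D = p_A = 0.3
print(r_squared(0.15, 0.05, 0.10))  # partial LD
```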

We parameterize our models by using three penetrances, *f*_{dd}, *f*_{Dd}, and *f*_{DD}, in addition to (1) allele frequencies *p*_{D} and *p*_{A} for the linkage equilibrium model, (2) single-allele frequency *p*=*p*_{D}=*p*_{A} for the complete LD model, and (3) haplotype frequencies *p*_{DA}, *p*_{Da}, and *p*_{dA} for the general model. Given only ASPs, each of these models is identifiable, except the linkage equilibrium model, in which parameters (*f*_{dd}, *f*_{Dd}, *f*_{DD}, *p*_{D}, and *p*_{A}) are not all identifiable, because the data contain information for only *p*_{A} and *z*=(*z*_{0},*z*_{1},*z*_{2}), corresponding to a total of 3 df, since *z*_{0}+*z*_{1}+*z*_{2}=1. To achieve an identifiable model, note that, under linkage equilibrium, *P*(*X*_{SNP},*I*_{D}|*ASP*)=*P*(*X*_{SNP}|*I*_{D})*P*(*I*_{D}|*ASP*) and that *P*(*X*_{SNP}|*I*_{D}) depends on only *p*_{A}. Thus, the linkage equilibrium model can be reparameterized in terms of (*z*_{0},*z*_{1},*p*_{A}), resulting in a likelihood similar to the traditional maximum LOD score (MLS) linkage test (Risch 1990) but with an additional parameter, *p*_{A}. Here, we assume that the candidate SNP is completely linked to the putative disease locus. In theory, one could allow recombination between the candidate SNP and the putative disease locus as well. However, there is confounding between recombination and IBD sharing at the SNP (Risch 1990). A commonly used approach to avoid confounding in multipoint MLS calculation is to assume no recombination. IBD-sharing probabilities *z*=(*z*_{0},*z*_{1},*z*_{2}) should satisfy the triangle constraint 0 ≤ *z*_{1} ≤ 0.5 and 0 ≤ *z*_{0} ≤ 0.5*z*_{1} (Holmans 1993).
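When the linkage equilibrium model is fitted in terms of (*z*_{0},*z*_{1},*p*_{A}), candidate values have to be checked against the triangle constraint. A minimal sketch, taking the constraint to be 0 ≤ *z*_{1} ≤ 0.5 and 0 ≤ *z*_{0} ≤ 0.5*z*_{1}; the function name and the tolerance are our assumptions.

```python
def in_possible_triangle(z0, z1, tol=1e-12):
    """True if (z0, z1, 1 - z0 - z1) satisfies the triangle constraint."""
    return -tol <= z1 <= 0.5 + tol and -tol <= z0 <= 0.5 * z1 + tol


print(in_possible_triangle(0.25, 0.50))  # null sharing (1/4, 1/2, 1/4): on the boundary
print(in_possible_triangle(0.20, 0.45))  # excess sharing: inside
print(in_possible_triangle(0.30, 0.50))  # deficit of sharing: outside
```

A derivative-free optimizer does not enforce such constraints by itself, so one common option is to return a very poor likelihood value whenever this check fails; whether the original implementation did so is not stated in the text.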

The previous models assume that the candidate SNP is completely linked to the putative disease locus. If the candidate SNP is unlinked, then IBD-sharing probabilities at the SNP should be *z*=(1/4,1/2,1/4), and the only estimable parameter is *p*_{A}. As such, the relationship between the candidate SNP and the putative disease locus falls into one of four models (table 3).

For a sample of independent ASPs, the retrospective likelihood of the data is

$$
L = \prod_{n} P\left(X^{(n)} \mid ASP\right), \tag{4}
$$

where $X^{(n)}$ denotes the marker genotypes of the *n*th ASP and the product is taken over all independent ASPs. Here, we chose to use a retrospective likelihood because the data are ascertained through the disease-affection statuses of the ASPs. The use of a retrospective likelihood can avoid the problem of ascertainment bias so that the parameter estimates are valid for the general population. To maximize equation (4), we use a simplex algorithm (Nelder and Mead 1965), an optimization method that does not require derivatives. Below, we represent the maximum of a particular likelihood subject to its parameter constraints by $\hat{L}$. In addition, we estimate *r*^{2} from the estimated disease-SNP haplotype frequencies. The estimate of *r*^{2} is of particular interest in the case of partial disease-SNP LD; it reflects the degree to which a linkage result is explained by the candidate SNP.

### Likelihood-Ratio Statistic

Given different relationships between the candidate SNP and the disease locus, we can test for linkage, association, and plausible causality. We evaluate evidence for linkage with $\mathrm{MLS}=\log_{10}(\hat{L}_{LE}/\hat{L}_{UL})$ (see table 3 for explanations of *L*_{LE}, *L*_{UL}, *L*_{GM}, and *L*_{LD}). We evaluate evidence for association by testing whether the candidate SNP is in linkage equilibrium with the disease locus by use of the likelihood-ratio statistic $T_{LE}=2\ln(\hat{L}_{GM}/\hat{L}_{LE})$. Rejection of linkage equilibrium between the disease and SNP loci suggests the candidate SNP is associated with the disease locus and can account (in part) for the observed linkage signal. We examine plausible causality by testing whether the candidate SNP is in complete LD with the disease locus by use of the likelihood-ratio statistic $T_{LD}=2\ln(\hat{L}_{GM}/\hat{L}_{LD})$. Rejection of complete LD for an associated SNP suggests that the SNP cannot fully account for the observed linkage signal. If there is a single disease-causing variant in the region, then it must be another SNP; otherwise, there might be other disease-causing variants in the region.
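Once the four models have been maximized, the three statistics follow from the maximized log-likelihoods. The sketch below assumes the usual forms, an MLS on the log10 scale comparing the linkage equilibrium and unlinked models and twice the natural-log likelihood ratio for *T*_{LE} and *T*_{LD}; the function name and the numbers are hypothetical.

```python
import math


def lr_statistics(lnL_UL, lnL_LE, lnL_LD, lnL_GM):
    """Return (MLS, T_LE, T_LD) from maximized natural-log likelihoods."""
    mls = (lnL_LE - lnL_UL) / math.log(10)  # convert to log10 units
    t_le = 2.0 * (lnL_GM - lnL_LE)          # general model vs. linkage equilibrium
    t_ld = 2.0 * (lnL_GM - lnL_LD)          # general model vs. complete LD
    return mls, t_le, t_ld


# Made-up log-likelihood values.
mls, t_le, t_ld = lr_statistics(lnL_UL=-1000.0, lnL_LE=-995.4,
                                lnL_LD=-993.0, lnL_GM=-990.0)
print(round(mls, 3), round(t_le, 3), round(t_ld, 3))
```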

### Empirical Null Distributions for Tests of Linkage Equilibrium and Complete LD

The asymptotic distributions of *T*_{LE} and *T*_{LD} under the null hypotheses might, in principle, be approximated by a mixture of χ^{2} distributions (Self and Liang 1987), but we have not derived the degrees of freedom and mixing parameters, because of the complexity of parameter constraints and boundaries. Alternatively, the significance of the tests can be assessed empirically by simulating marker genotypes under the null hypothesis and comparing the observed statistic with the simulated null distribution. One possibility would be to estimate disease-locus parameters and marker-allele frequencies under the null hypothesis and then to simulate genotypes for the candidate SNP and flanking markers conditional on the estimated parameters and observed phenotypes. In our preliminary investigations, this approach led to inflated type I error rates (data not shown), and we describe below, in detail, alternative strategies that may be theoretically less efficient but perform well in all the settings we examined.
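Comparing an observed statistic with a simulated null distribution amounts to counting how often the simulated values are at least as large. In this sketch the function name is ours, and adding 1 to the numerator and denominator (which avoids reporting a *P* value of exactly 0) is a common convention that we assume rather than take from the text.

```python
def empirical_p_value(t_obs, t_null):
    """Proportion of simulated null statistics at least as large as the observed one."""
    exceed = sum(1 for t in t_null if t >= t_obs)
    return (exceed + 1) / (len(t_null) + 1)


print(empirical_p_value(3.0, [1.0, 2.0, 3.0, 4.0]))
```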

For the *T*_{LE} statistic, employed when the null hypothesis assumes linkage equilibrium between trait and marker loci, we sample SNP genotypes conditional on flanking-marker genotypes and estimated model parameters. In contrast, for the *T*_{LD} statistic, employed when the null hypothesis assumes complete LD between the candidate SNP alleles and disease-susceptibility alleles, we sample flanking-marker genotypes conditional on the observed candidate SNP genotypes and estimated parameters.

For the linkage-equilibrium model, we use the observed data to obtain the SNP allele-frequency estimate $\hat{p}_A$ and the IBD-sharing probability estimates $\hat{z}=(\hat{z}_0,\hat{z}_1,\hat{z}_2)$ at the candidate SNP. To obtain a simulated sample under linkage equilibrium, for each ASP, we retain flanking-marker data and simulate the IBD configuration at the candidate SNP in accordance with

$$
P(I_D \mid X_1,\ldots,X_M, ASP) = \frac{P(X_1,\ldots,X_k \mid I_D)\,\hat{z}_{I_D}\,P(X_{k+1},\ldots,X_M \mid I_D)}{\sum_{i=0}^{2} P(X_1,\ldots,X_k \mid I_D=i)\,\hat{z}_{i}\,P(X_{k+1},\ldots,X_M \mid I_D=i)}
$$

for *I*_{D} = 0,1,2 and where *P*(*X*_{1},…,*X*_{k}|*I*_{D}) and *P*(*X*_{k+1},…,*X*_{M}|*I*_{D}) are the left- and right-chain probabilities calculated in equation (1). Given the IBD configuration at the candidate SNP, the ASP’s candidate-SNP genotypes can then be sampled on the basis of the estimated candidate-SNP allele frequency, $\hat{p}_A$. Note that, when flanking-marker genotypes are not available, *P*(*X*_{1},…,*X*_{k}|*I*_{D}) = *P*(*X*_{k+1},…,*X*_{M}|*I*_{D})=1, so that sampling is conditional on only the estimated parameter values and phenotypes. We obtain the null distribution of *T*_{LE} by simulating a large number of replicates and calculating the statistic for each simulated data set.
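One replicate of this resampling step can be sketched as follows for a single ASP. The left- and right-chain probabilities and the estimates of *z* and *p*_{A} are passed in as hypothetical inputs, and SNP genotypes are coded as counts of the *A* allele; the function names are ours.

```python
import random


def sample_ibd(left, right, z_hat, rng):
    """Draw I_D with weights proportional to left[i] * z_hat[i] * right[i]."""
    weights = [left[i] * z_hat[i] * right[i] for i in range(3)]
    return rng.choices([0, 1, 2], weights=weights)[0]


def sample_snp_genotypes(ibd, p_A, rng):
    """Draw the A-allele counts (x1, x2) of the two sibs given their IBD sharing."""
    # A pair sharing `ibd` alleles IBD carries 4 - ibd distinct alleles.
    alleles = [rng.random() < p_A for _ in range(4 - ibd)]
    sib1 = alleles[0] + alleles[1]
    if ibd == 0:
        sib2 = alleles[2] + alleles[3]
    elif ibd == 1:
        sib2 = alleles[0] + alleles[2]
    else:
        sib2 = sib1
    return sib1, sib2


rng = random.Random(2005)  # fixed seed for reproducibility
ibd = sample_ibd((0.8, 0.5, 0.3), (0.6, 0.5, 0.4), (0.22, 0.50, 0.28), rng)
print(ibd, sample_snp_genotypes(ibd, 0.15, rng))
```

With no flanking markers, `left` and `right` are both `(1, 1, 1)`, and the IBD state is drawn from the estimated sharing probabilities alone.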

Our procedure for simulating the null distribution of *T*_{LD} is different. Note that, if the candidate SNP is in complete LD with the disease locus alleles (or it is the disease locus itself), then candidate SNP genotypes should be sufficient to explain IBD sharing in the region. This observation has previously been used by Sun et al. (2002), who calibrated the significance of their test by sampling flanking-marker genotypes conditional on the observed SNP genotypes for each ASP. For each ASP, we leave the candidate SNP genotypes unchanged from their observed values. Then, we sample an IBD configuration at the candidate SNP conditional on the observed SNP genotypes for the ASP and on the estimated allele frequency and penetrances obtained from the complete LD model, in accordance with

$$
P(I_D \mid X_{SNP}, ASP) = \frac{P(X_{SNP}, I_D \mid ASP)}{\sum_{i=0}^{2} P(X_{SNP}, I_D=i \mid ASP)},
$$

which can be obtained from equation (2). Finally, we sample genotypes for flanking markers, conditional on the IBD configuration at the candidate SNP. Specifically, we sample genotypes at marker *k* in accordance with transition probabilities *P*(*I*_{k}|*I*_{D}) and the allele frequencies of marker *k.* The genotypes of marker *k* + 1 are sampled similarly but with transition probabilities *P*(*I*_{k+1}|*I*_{D}). Moving left and right along the chromosome, we simulate flanking-marker genotypes on the basis of *P*(*I*_{m-1}|*I*_{m}) and *P*(*I*_{m+1}|*I*_{m}), respectively. We obtain the null distribution of *T*_{LD} by simulating a large number of replicates and calculating the statistic for each simulated data set. This procedure for generating the empirical distribution of *T*_{LD} has some limitations. In particular, when there are no flanking markers, our procedure leaves the original data unchanged, and so it is not possible to evaluate the significance of a particular value for *T*_{LD}. Nevertheless, and as shown in the “Results” section, flanking markers provide most of the information required to distinguish between markers in complete LD and those in partial LD with the disease locus; thus, the distribution of *T*_{LD} when there are no flanking markers is of little practical interest.
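The outward walk from the candidate SNP can be sketched as two Markov chains started at the sampled IBD state. This illustration stops at the IBD states; drawing marker genotypes given each state would additionally need the marker allele frequencies and is omitted. The transition matrix is the standard sib-pair form, and all names are hypothetical.

```python
import random


def ibd_transition(theta):
    """Standard sib-pair IBD transition matrix (assumed form of table 2)."""
    s = theta**2 + (1 - theta) ** 2
    return [
        [s * s, 2 * s * (1 - s), (1 - s) ** 2],
        [s * (1 - s), s * s + (1 - s) ** 2, s * (1 - s)],
        [(1 - s) ** 2, 2 * s * (1 - s), s * s],
    ]


def walk(start, thetas, rng):
    """Sample successive IBD states moving away from the starting locus."""
    states, current = [], start
    for theta in thetas:
        row = ibd_transition(theta)[current]
        current = rng.choices([0, 1, 2], weights=row)[0]
        states.append(current)
    return states


def simulate_flanking_ibd(ibd_snp, thetas_left, thetas_right, rng):
    """IBD states at markers 1..M in map order, given the state at the SNP.

    thetas_left lists recombination fractions moving leftward from the SNP
    (SNP to marker k, marker k to k-1, ...); thetas_right moves rightward.
    """
    left = walk(ibd_snp, thetas_left, rng)
    right = walk(ibd_snp, thetas_right, rng)
    return left[::-1] + right


rng = random.Random(7)
print(simulate_flanking_ibd(2, [0.05, 0.10], [0.05, 0.10, 0.10], rng))
```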

### Simulations

We conducted a number of simulations to explore the properties of our proposed tests of no association and plausible causality and the resulting estimates of genetic model parameters. Table 4 describes the disease models we considered, which varied over a range of attributable fractions, disease-allele frequencies, GRRs, and sibling recurrence-risk ratio λ_{s}, defined as the recurrence risk for a sib of an affected individual divided by the population disease prevalence (Risch 1987). For all disease models, the population prevalence *K* of the disease was fixed at 2%.
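For a single diallelic locus, λ_{s} can be computed from the penetrances and allele frequency by using the standard variance-components result λ_{s} = 1 + (*V*_{A}/2 + *V*_{D}/4)/*K*^{2}. That formula is a textbook result that we assume here; it is not written out in the text, and the function name and parameter values are hypothetical.

```python
def sibling_recurrence_ratio(p_D, f_dd, f_Dd, f_DD):
    """lambda_s for a single diallelic disease locus under Hardy-Weinberg equilibrium."""
    p, q = p_D, 1.0 - p_D
    K = f_dd * q**2 + 2 * f_Dd * p * q + f_DD * p**2
    # Additive and dominance variance components of the penetrance.
    v_a = 2 * p * q * (p * (f_DD - f_Dd) + q * (f_Dd - f_dd)) ** 2
    v_d = (p * q) ** 2 * (f_DD - 2 * f_Dd + f_dd) ** 2
    return 1.0 + (0.5 * v_a + 0.25 * v_d) / K**2


# Fully penetrant recessive allele with frequency 0.5: recurrence risk 9/16, K = 1/4.
print(sibling_recurrence_ratio(0.5, 0.0, 0.0, 1.0))
```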

In each model, we assumed the disease- and SNP-allele frequencies to be identical and, except where noted, used a map of 10 markers with eight equally frequent alleles (heterozygosity [*H*] of .875) evenly spaced at 11.16-cM intervals, corresponding to recombination fraction 0.10 under the no-interference map function of Haldane (1919). We centered the disease and SNP loci in the middle of the map and assumed zero recombination between them. We removed disease-locus genotypes prior to data analysis. For each of the disease models in table 4, we simulated 5,000 replicates of 500 ASPs under linkage equilibrium or complete LD to obtain null distributions and to determine critical values for each test. We simulated 2,000 replicates of 500 ASPs with various levels of disease-SNP LD to assess the empirical power of the corresponding tests. Here, we simulated the null distributions by using their generating values. In the “Discussion” section, we consider the impact of the use of null distributions estimated using our computationally intensive resampling procedures.
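The map spacing and marker heterozygosities quoted above can be checked with a few lines. The sketch uses the Haldane map function, θ = (1 − e^{−2d})/2 with *d* in morgans, and *H* = 1 − Σ*p*_{i}^{2}; the function names are ours.

```python
import math


def haldane_theta(distance_cM):
    """Recombination fraction for a map distance in cM (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cM / 100.0))


def haldane_cM(theta):
    """Map distance in cM for a recombination fraction (inverse of the above)."""
    return -50.0 * math.log(1.0 - 2.0 * theta)


def heterozygosity(freqs):
    """Expected heterozygosity for the given allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


print(round(haldane_theta(11.16), 4))
print(round(haldane_cM(0.10), 2))
print(heterozygosity([1 / 8] * 8), heterozygosity([1 / 4] * 4), heterozygosity([1 / 2] * 2))
```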

## Results

### Power to Reject Linkage Equilibrium (No Association)

Figure 1 displays the estimated power to reject the hypothesis of linkage equilibrium as a function of *r*^{2} for the disease models in table 4. As expected, the power of *T*_{LE} increases as *r*^{2} increases, for all disease models, and it is at its maximum when *r*^{2}=1. Figure 1 also shows that, for models with the same λ_{s}, it is relatively easier to detect association for a less-common disease allele than for a common one. More generally, we found that the power of the test is closely related to the GRR. We found that, for a fixed λ_{s}, lower disease-allele frequencies generally corresponded to a higher GRR for the disease models we considered.

Figure 1. Power to reject the hypothesis of linkage equilibrium (*r*^{2}=0). Results are based on 2,000 replicates of 500 ASPs. All models have population disease prevalence *K*=2% and sibling recurrence-risk ratio λ_{s}=1.1. Power was assessed at the 5% level.

These simulation results indicate that our method has good power to detect whether a candidate SNP is associated with the putative disease locus, even when the genetic effect is modest (λ_{s}=1.1). In each set of simulations, we found that the power of the test of linkage equilibrium to detect disease-SNP association does not depend on the magnitude of the observed linkage signal. Our results show that, given the same genetic effect as measured by λ_{s}, the power of *T*_{LE} estimated using replicates with smaller MLS is nearly identical to the power of *T*_{LE} estimated using replicates with larger MLS. For complex diseases with modest genetic effects, increased sharing near a disease locus can be overwhelmed by sampling variation in IBD estimates. Even when the evidence for linkage is absent, our method still yields a valid and useful test of association.

### Power to Reject Complete LD (Plausible Causality)

Intuitively, one would expect the power to reject the hypothesis of complete LD to increase as *r*^{2} decreases from 1, with maximum power when *r*^{2}=0. Figure 2 shows that the simulation results agree well with this expectation for our recessive and additive models but not for our dominant models. In our simulations, we found that the magnitude of the *T*_{LD} statistic is highly correlated with the MLS when *r*^{2} is low, and the dependence becomes less strong as *r*^{2} increases. To illustrate this effect, we estimated power using those replicate data sets in which MLS is >1 or >2.5 for a dominant model with *p*_{D}=*p*_{A}=0.30 and λ_{s}=1.3. Figure 3 suggests that our ability to detect complete LD is dramatically enhanced as the MLS increases. For example, when *r*^{2}=0, the power is 46% given no minimum MLS requirement but increases to 88% when MLS >1 and increases to nearly 100% when MLS >2.5.

Figure 2. Power to reject the hypothesis of complete LD (*r*^{2}=1). Results are based on 2,000 replicates of 500 ASPs. All models have population disease prevalence *K*=2% and sibling recurrence-risk ratio λ_{s}=1.3. Power was assessed at the 5% level.

Figure 3. … *K*=2%, allele frequency *p*_{D}=*p*_{A}=0.30, and sibling recurrence-risk ratio λ_{s}=1.3. Power …

In general, determining that a SNP is not in complete LD with the disease allele is more difficult than detecting whether it is associated with the disease allele, and it generally requires a larger sample. Our simulation results suggest that, at the same level of genetic effect as measured by λ_{s}, it is easier to evaluate whether a SNP might be in complete LD with the disease allele if the disease is recessive. This might be because the ASPs are more likely to share two alleles IBD under a recessive model than under a dominant model, and such excess IBD sharing provides more information on linkage. Since the power of *T*_{LD} depends strongly on the magnitude of the observed linkage signal, this situation provides greater power to evaluate whether a candidate SNP is plausibly causal.

### Parameter Estimates

Our method yields maximum-likelihood estimates of the disease-SNP haplotype frequencies directly, and, from those, we can calculate estimates of LD measures, such as *r*^{2}. Mean parameter estimates and empirical SDs for the additive model, given 500 and 2,000 ASPs, are listed in table 5. The bias of parameter estimates is similar for dominant and recessive models (data not shown).

The bias of the maximum-likelihood allele-frequency estimates generally decreases as *r*^{2} increases to values close to 1, corresponding to greater information about the disease locus. Maximum-likelihood estimates are asymptotically unbiased under appropriate regularity conditions, notably when no null hypothesis parameter value is on the boundary of the parameter space. In our case, *r*^{2}=1 results in two disease-SNP haplotypes with a frequency of 0, so our parameter estimates may be biased even in large samples. From table 5, we see that, when *r*^{2}=1, *p*_{D} can be underestimated and λ_{s} is slightly overestimated, with the bias decreasing as the sample size and magnitude of genetic effect increase.

### Impact of Flanking Markers

To investigate the impact of flanking-marker data on our tests of disease-SNP LD, we simulated additional data sets using a map of 10 flanking markers, each with two equally frequent alleles (*H*=.50). We analyzed each of our data sets using 0, 2, 4, or all 10 flanking markers. Figure 4 suggests that having at least two flanking markers improves performance for both tests, but especially for the test of complete LD. Results based on 2, 4, or 10 flanking markers show only slight differences. Note that results are presented in figure 4*B* for the evaluation of the significance of *T*_{LD} when there are no flanking markers. Although this is possible for a simulation study such as this (in which the true population parameter values are known and were used to simulate the null distribution of the statistic), it is not practical for analysis of real data, since our procedure for evaluating the empirical distribution of *T*_{LD} requires information on flanking markers. In practice, this is not a serious limitation, because *T*_{LD} has very low power when there are no flanking markers, and we recommend that it should be used only when at least two flanking markers are available.

Figure 4. … *K*=2%, allele frequency *p*_{D}=*p*_{A}=0.15, and sibling recurrence-risk ratios λ_{s}=1.1 (*A*) …

We next assessed the impact of flanking-marker heterozygosity on power by conducting additional simulations using two flanking markers, each with two (*H*=.50), four (*H*=.75), or eight (*H*=.875) equally frequent alleles. We found that, for the test of linkage equilibrium, the power for two and four equally frequent alleles is only slightly lower than for eight equally frequent alleles, but the difference in power is more pronounced for the test of complete LD (fig. 5). The differences between power for 2, 4, and 10 flanking markers when flanking-marker *H* was .75 were modest (data not shown), suggesting that even just two highly polymorphic flanking markers may provide substantial power to detect disease-SNP LD. That two flanking markers can suffice is especially useful when linkage data are not available and additional genotyping is required.

Figure 5. … *K*=2%, allele frequency *p*_{D}=*p*_{A}=0.15, and sibling recurrence-risk ratios λ_{s}=1.1 …

We also evaluated the impact of flanking-marker densities on power (fig. 6) by simulating data sets using a map of 10 flanking markers, each with four equally frequent alleles (*H*=.75). Clearly, denser markers give greater power, but the increment of power is not substantial for the range of densities we considered. In practice, having flanking markers within ~10 cM should be enough for the initial evaluation of a candidate SNP.

### Comparison of GIST and STEPC

Li et al. (2004) examined whether a SNP might account in part for an observed linkage signal by testing for a correlation between SNP genotypes and family-specific NPL scores. Their method is implemented in the software package GIST. Their simulations show that GIST is useful for identifying SNPs that are in LD with the disease locus and suggest that GIST could also identify associated SNPs in the absence of evidence for linkage. We compared our test of *r*^{2}=0 with GIST (table 6). We assessed the significance of *T*_{LE} at the 5% significance level by comparing the observed statistic with the empirical null distribution simulated in accordance with the resampling procedure described in the “Methods” section. Results are based on 500 replicates of 500 ASPs. For each replicate, the empirical null distribution of *T*_{LE} was obtained by resampling 1,000 times. The results indicate that our test has greater power than GIST for the models we considered. For example, our test has 89% power to reject linkage equilibrium when *r*^{2}=.67 for an additive model with *p*_{D}=*p*_{A}=0.15 and λ_{s}=1.1 at the 5% significance level, whereas GIST has 76% power.

Sun et al. (2002) examined whether a SNP could fully explain the observed linkage signal and implemented their approach in the software package STEPC. The method of Sun et al. (2002) is based on the observation that if a SNP is the only variant in the region that influences the trait, then conditional on the affected relatives’ SNP genotypes, there should be no increased IBD sharing in the region among affected individuals. We compared our test of *r*^{2}=1 with STEPC (table 6) by using 500 replicates of 500 ASPs. Again, the significance of *T*_{LD} was assessed empirically using the empirical null distribution simulation procedure for *T*_{LD} described in the “Methods” section. The two tests have nearly identical performance when *r*^{2}=0. However, the power of STEPC drops quickly as *r*^{2} increases, so that, when SNPs are in moderate LD with the putative disease locus, our method has better resolving power and thus should identify a smaller set of potential explanatory SNPs.

## Discussion

We have developed a statistical framework that quantifies the relationship between SNP alleles and unobserved trait alleles through joint modeling of linkage and association by use of ASP data. We described a parametric likelihood of the marker genotypes conditional on the trait data under the assumption that there is a single disease-causing variant in the region. Our unified likelihood framework naturally leads to two tests: (1) a test of whether a candidate SNP is in linkage equilibrium with the putative disease locus and (2) a test of whether the candidate SNP is in complete LD with the putative disease locus. In the first case, the rejection of linkage equilibrium suggests that the candidate SNP is associated with the putative disease locus and that the candidate SNP or one in LD with it accounts, at least in part, for the observed linkage signal. In the second case, the rejection of complete LD indicates that the candidate SNP cannot fully account for the linkage signal. Our method also yields estimates of interesting genetic parameters, including the disease-locus and SNP-allele frequencies, the locus-specific risk ratio λ_{s}, and the degree of disease-SNP LD. Our method uses ASPs and does not require parental genotypes. This feature is important for late-onset diseases, for which parents may not be available to study.
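For readers who want to relate penetrances and the disease-allele frequency to λ_{s}, the sketch below uses the classical single-locus variance decomposition (λ_{s} = 1 + (V_{A}/2 + V_{D}/4)/*K*^{2}) under Hardy-Weinberg equilibrium. It is an illustration under those textbook assumptions, not a description of how the authors' program computes λ_{s}; the function and argument names are hypothetical.

```python
def prevalence_and_lambda_s(p_d, f0, f1, f2):
    """Return (K, lambda_s) for a single diallelic disease locus.

    p_d        : frequency of the disease allele D
    f0, f1, f2 : penetrances for 0, 1, and 2 copies of D
    Assumes Hardy-Weinberg equilibrium and no other familial risk factors.
    """
    q = 1.0 - p_d
    # Population prevalence K.
    k = q * q * f0 + 2 * p_d * q * f1 + p_d * p_d * f2
    # Average effect of allele substitution, then additive variance V_A.
    alpha = p_d * (f2 - f1) + q * (f1 - f0)
    v_a = 2 * p_d * q * alpha ** 2
    # Dominance variance V_D.
    v_d = (p_d * q * (f2 - 2 * f1 + f0)) ** 2
    lambda_s = 1 + (v_a / 2 + v_d / 4) / k ** 2
    return k, lambda_s


if __name__ == "__main__":
    # Fully penetrant recessive model with p_D = 0.5 (illustrative values).
    print(prevalence_and_lambda_s(0.5, 0.0, 0.0, 1.0))
```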

Simulation studies show that our method has good power to detect disease-SNP association, even when the sibling recurrence-risk ratio is as low as 1.1. We compared our test of linkage equilibrium with GIST (Li et al. 2004) and found our test to be more powerful in the models that we considered. The increase of power may come from the fact that our method is model-based, whereas GIST is nonparametric and is based on model-free NPL scores. As with GIST, the power of our test of linkage equilibrium does not depend on the overall strength of the linkage signal. Evidence of disease-SNP LD from our test also reveals underlying linkage, which might be overwhelmed by sampling variation in IBD estimates. This feature makes our method a useful tool for detecting linkage as well as association. We also compared our test of complete LD with STEPC (Sun et al. 2002) and found that the two tests have similar performance under linkage equilibrium, but our test has greater power to distinguish those SNPs that are in strong but incomplete LD with the putative disease locus.

In contrast to previous approaches (Sun et al. 2002; Li et al. 2004), our method yields disease-SNP haplotype-frequency estimates in the general population without the requirement of a separate control sample. These quantities lead to the estimate of disease-SNP *r*^{2}, a measure that can be used to quantify the degree to which a linkage signal is explained by a candidate SNP. Our estimate of *r*^{2} also provides information about the distance between the candidate SNP and the unobserved disease locus and helps refine the region in which further candidate SNPs should be examined. The disease-allele frequency estimate may be helpful to researchers in selecting additional nearby SNPs, by focusing on those with frequencies close to the predicted disease-allele frequency. This approach becomes increasingly useful as *r*^{2} increases.
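The step from haplotype-frequency estimates to *r*^{2} uses the standard two-locus definition, *r*^{2} = *D*^{2}/[*p*_{D}(1−*p*_{D})*p*_{A}(1−*p*_{A})], with *D* equal to the D-A haplotype frequency minus *p*_{D}*p*_{A}. A minimal Python sketch, with illustrative argument names and input values chosen to match the allele frequencies used in the simulations:

```python
def ld_r2(h_da, p_d, p_a):
    """Squared correlation r^2 between disease allele D and SNP allele A.

    h_da : population frequency of the D-A haplotype
    p_d  : frequency of disease allele D
    p_a  : frequency of SNP allele A
    """
    if not (0 < p_d < 1 and 0 < p_a < 1):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d = h_da - p_d * p_a  # LD coefficient D
    return d * d / (p_d * (1 - p_d) * p_a * (1 - p_a))


if __name__ == "__main__":
    p = 0.15
    print(ld_r2(p * p, p, p))  # linkage equilibrium: h_DA = p_D * p_A
    print(ld_r2(p, p, p))      # complete LD: every D allele sits on an A haplotype
```

With equal allele frequencies, the two extremes correspond to the two null hypotheses: *r*^{2}=0 for the test of linkage equilibrium and *r*^{2}=1 for the test of complete LD.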

Our tests benefit from genotype information on flanking markers, which are available in many gene mapping studies. Our results show that even two highly polymorphic flanking markers can provide nearly as much information as many more markers for this purpose. Compared with other family-based association tests and a previous joint model of linkage and association (Göring and Terwilliger 2000), our model has the attractive feature of incorporating flanking-marker information when it is available. An alternative joint model of linkage and association was developed independently by Cantor et al. (2005). In contrast to our approach, theirs includes recombination as an additional parameter and fixes disease model parameters, including penetrances and the disease-allele frequency. Like the approach of Göring and Terwilliger (2000), the model of Cantor et al. (2005) does not readily incorporate genotype information contributed by flanking markers.

Since the test of linkage equilibrium does not depend strongly on the overall evidence of linkage, flanking-marker heterozygosity has less impact on power for this test than for the test of complete LD, which is highly dependent on the strength of linkage evidence. We assume linkage equilibrium between the flanking markers and the candidate SNP in our likelihood calculation. Flanking markers that are close together may show strong evidence of LD. For such data, we recommend selecting a small number of flanking markers in linkage equilibrium so that the linkage equilibrium assumption is satisfied.

The likelihood framework described in this article is applicable to dichotomous traits only. However, many disease-related traits—for instance, blood pressure and cholesterol level—are continuous in nature. Dichotomization can result in a loss of power of the corresponding tests. Fulker et al. (1999) developed a method that tests for linkage while simultaneously modeling allelic association by use of the variance-components framework. Their method was further extended to general pedigrees (Abecasis et al. 2000*a*; Cardon and Abecasis 2000). Attenuation of evidence for linkage when association to SNP alleles is modeled suggests that the candidate SNP accounts for linkage and provides information about unobserved trait alleles. Although these methods provide bounds on disease-allele frequencies and disease-marker LD (Cardon and Abecasis 2000), they provide no direct estimate of these quantities. To address the same question for quantitative traits, we plan to develop a statistical framework that summarizes the available information by a retrospective likelihood, with the putative trait locus and the SNP haplotype frequencies as parameters.

In most gene mapping studies, the ASPs are likely to be selected from a sample that was originally collected for linkage analysis and for which flanking-marker genotypes are available. Our likelihood calculation naturally allows for missing genotypes at the candidate SNP. Given fixed genotyping resources, researchers may initially type many SNPs in only one sib per ASP. This approach halves the genotyping costs, and additional simulations suggested that even one sib per ASP can provide meaningful information on whether a candidate SNP is associated with the disease locus. This suggests that, for an initial screen of SNPs, it may be cost effective to genotype only one sib per ASP, with genotyping of the other sibs done when a candidate SNP shows at least suggestive evidence of association.

In our simulations, we assessed the empirical power of *T*_{LE} and *T*_{LD} by simulating the null distributions generated using the true parameter values. For real data sets, these values are unknown, and the null distributions must be generated using the estimated parameter values. We described simulation procedures to obtain the empirical null distributions. For the test of linkage equilibrium, we simulate the SNP genotypes for each ASP, conditional on their flanking-marker genotypes, and leave the flanking-marker genotypes unchanged from their observed values. For the test of complete LD, we leave the SNP genotypes unchanged and simulate the flanking-marker genotypes, conditional on the observed SNP genotypes, to remove excess sharing explained by the flanking markers. The difference in these two simulation procedures is due to the inherent difference of the two tests. Under linkage equilibrium, the candidate SNP provides no information on the unobserved disease locus, and the SNP genotypes can be sampled by gene-dropping simulations. In contrast, under complete LD, the candidate SNP is statistically identical to the unobserved disease locus, and the SNP genotypes need to be preserved to retain complete information on the unobserved disease locus.
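The gene-dropping idea for the linkage-equilibrium null can be illustrated with a deliberately simplified sketch. The procedure described above conditions on flanking-marker genotypes; the code below instead assumes that the number of alleles shared IBD at the SNP position is already known for the sib pair, which is a simplification introduced here for illustration. Under linkage equilibrium, the four parental SNP alleles are drawn independently with frequency *p*_{A} and then passed to the sibs according to the IBD state. All names are hypothetical.

```python
import random


def drop_snp_genotypes(ibd, p_a, rng):
    """Simulate SNP genotypes (counts of allele A) for a sib pair.

    ibd : number of alleles the sibs share IBD at the SNP (0, 1, or 2)
    p_a : population frequency of SNP allele A
    rng : a random.Random instance
    """
    # Founder alleles: 1 codes allele A, 0 codes the other allele.
    father = [int(rng.random() < p_a) for _ in range(2)]
    mother = [int(rng.random() < p_a) for _ in range(2)]
    sib1 = (father[0], mother[0])
    if ibd == 2:
        sib2 = (father[0], mother[0])
    elif ibd == 1:
        # Share either the paternal or the maternal allele, each with prob 1/2.
        if rng.random() < 0.5:
            sib2 = (father[0], mother[1])
        else:
            sib2 = (father[1], mother[0])
    elif ibd == 0:
        sib2 = (father[1], mother[1])
    else:
        raise ValueError("ibd must be 0, 1, or 2")
    return sum(sib1), sum(sib2)


if __name__ == "__main__":
    rng = random.Random(1)
    print([drop_snp_genotypes(1, 0.15, rng) for _ in range(5)])
```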

We examined the performance of our null distribution simulation procedures and found that the simulated null distributions for both tests agree well with those generated in accordance with their true parameter values at all levels of disease-SNP LD and that both simulation procedures give correct type I errors (fig. 7). Evaluation of the significance of our tests by use of their simulated null distributions can be computationally intensive in practice, especially when the sample size is large. For an initial screen of SNPs, one might choose to evaluate the significance of the linkage equilibrium test empirically only if *T*_{LE}>3.84 (at the 5% significance level), since χ^{2}_{1} approximates the lower bound for the asymptotic distribution of *T*_{LE}, and one may test whether a candidate SNP is potentially causal only if it shows significant evidence of association.
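The screening shortcut can be sketched as a two-stage filter: SNPs whose *T*_{LE} does not exceed 3.84, the 5% critical value of χ^{2}_{1}, are set aside, and the expensive empirical evaluation runs only for the rest. In this sketch, `empirical_p` is a hypothetical callback standing in for the resampling procedure, and the SNP labels and statistics in the usage example are invented.

```python
CHI2_1DF_5PCT = 3.84  # 5% critical value of the chi-square distribution, 1 df


def screen_snps(t_le_by_snp, empirical_p, threshold=CHI2_1DF_5PCT):
    """Run the costly empirical test only for SNPs that pass the threshold.

    t_le_by_snp : dict mapping SNP label -> observed T_LE
    empirical_p : callable (snp, t_le) -> empirical P value (hypothetical)
    Returns a dict mapping SNP label -> P value, or None if skipped.
    """
    results = {}
    for snp, t_le in t_le_by_snp.items():
        if t_le > threshold:
            results[snp] = empirical_p(snp, t_le)
        else:
            results[snp] = None  # not significant even by the lower bound
    return results


if __name__ == "__main__":
    evaluated = []

    def fake_empirical_p(snp, t_le):
        # Placeholder for the resampling step; records which SNPs reach it.
        evaluated.append(snp)
        return 0.01

    print(screen_snps({"snp1": 1.2, "snp2": 7.5}, fake_empirical_p))
    print(evaluated)
```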

*K*=2% and allele frequency *p*_{D}=*p*_{A}=0.15. The solid line in each plot is the density of the empirical null distribution **...**

We described our likelihood framework in the context of ASPs, but our method can be readily extended to other study designs. We are extending our method to sibships of arbitrary size and disease-phenotype configuration and to include unrelated affected or unaffected individuals. Unaffected individuals are more representative of the general population and may help to infer the underlying genetic model parameters. We expect the power of our test of linkage equilibrium to increase when unrelated unaffected individuals are added to the study.

Despite its flexibility, our method has limitations. Like all statistical methods, ours is unable to distinguish the true disease causal variant from an allele that is in complete LD with it. We assume that there is a single disease causal variant in the candidate region. However, many complex diseases are influenced by multiple genetic variants and are possibly the result of gene-gene and gene-environment interactions. Individual variants may be neither necessary nor sufficient to explain the effect of a single locus on disease susceptibility—for example, three independent SNPs, a frameshift variant, and two missense variants of *NOD2* were identified as determining susceptibility to inflammatory bowel disease (Hugot et al. 2001). If only one causal variant is assumed, then we expect our model to indicate that each variant is associated with the underlying disease loci but none is causal (i.e., for all, 0<*r*^{2}<1). For complex diseases that are influenced by multiple genetic variants, fitting the data under the assumption of a single-locus disease model is equivalent to testing the marginal effect of a specific locus. If the marginal effect of that locus is modest, then we may have limited power to detect association. For those cases, it might be desirable to develop a method that allows the analysis of two-locus or even multilocus disease models.

In summary, we have developed a unified likelihood framework to estimate useful genetic parameters and to test for both linkage equilibrium and complete LD between a candidate SNP and the putative disease locus. Results from these two tests complement each other in answering the question of whether the candidate SNP can account in part or in full for the observed linkage signal. An estimate of the disease-SNP LD provides a measure to quantify the degree of contribution of the candidate SNP to linkage evidence. Together with the disease locus and the SNP-allele frequency estimates, our method will be valuable in helping researchers to evaluate the role of a candidate SNP in disease susceptibility and to fine-map disease genes. We have implemented our method in a C++ program, which can be downloaded from the University of Michigan Center for Statistical Genetics Web site.

## Acknowledgments

This research is supported by National Institutes of Health grants HG00376 (to M.B.) and HG02651 (to G.R.A.). M.L. is currently supported by a University of Michigan Rackham predoctoral fellowship. We gratefully thank two anonymous reviewers for their valuable comments.

## Electronic-Database Information

The URL for data presented herein is as follows:

## References

*a*) A general test of association for quantitative traits in nuclear families. Am J Hum Genet 66:279–292

*b*) Pedigree tests of transmission disequilibrium. Eur J Hum Genet 8:545–551. doi:10.1038/sj.ejhg.5200494

*NOD2* associated with susceptibility to Crohn’s disease. Nature 411:603–606. doi:10.1038/35079114

**American Society of Human Genetics**


Joint Modeling of Linkage and Association: Identifying SNPs Responsible for a Linkage Signal. American Journal of Human Genetics. 2005 Jun; 76(6):934.
