Am J Hum Genet. May 2006; 78(5): 778–792.
Published online Mar 20, 2006. doi:  10.1086/503711
PMCID: PMC1474028

Efficient Study Designs for Test of Genetic Association Using Sibship Data and Unrelated Cases and Controls

Abstract

Linkage mapping of complex diseases is often followed by association studies between phenotypes and marker genotypes through use of case-control or family-based designs. Given fixed genotyping resources, it is important to know which study designs are the most efficient. To address this problem, we extended the likelihood-based method of Li et al., which assesses whether there is linkage disequilibrium between a disease locus and a SNP, to accommodate sibships of arbitrary size and disease-phenotype configuration. A key advantage of our method is the ability to combine data from different family structures. We consider scenarios for which genotypes are available for unrelated cases, affected sib pairs (ASPs), or only one sibling per ASP. We construct designs that use cases only and others that use unaffected siblings or unrelated unaffected individuals as controls. Different combinations of cases and controls result in seven study designs. We compare the efficiency of these designs when the number of individuals to be genotyped is fixed. Our results suggest that (1) when the disease is influenced by a single gene, the one sibling per ASP–control design is the most efficient, followed by the ASP-control design, and familial cases contribute more association information than singleton cases; (2) when the disease is influenced by multiple genes, familial cases provide more association information than singleton cases, unless the effect of the locus being tested is much smaller than at least one other untested disease locus; and (3) the case-control design can be useful for detecting genes with small effect in the presence of genes with much larger effect. Our findings will be helpful for researchers designing and analyzing complex disease-association studies and will facilitate genotyping resource allocation.

Association analysis provides a powerful tool for identifying genetic variants that predispose to complex diseases. Association analysis with use of genetic markers (such as SNPs) relies on the presence of linkage disequilibrium (LD), which occurs when specific alleles at the disease and marker loci appear together in gametes more frequently than expected by chance. With the recent availability of high-throughput SNP genotyping and decreasing genotyping costs, association studies with use of SNPs are beginning to be conducted genomewide.1,2 Such analyses have been facilitated by progress on the International HapMap Project,3,4 which cataloged and genotyped millions of SNPs, allowing informative tagging SNPs to be selected for different populations. Genomewide association studies typically involve hundreds or thousands of individuals and, since genotyping on such a large scale is still expensive, it is important to choose efficient study designs.

In gene-mapping studies, affected sib pairs (ASPs) or multiplex affected sibships are often collected for linkage analyses. Although these individuals may be reused in follow-up association studies, this is not always done. Traditionally, association-mapping studies with the case-control design have been used to test for disease-marker association by selecting one affected sibling per sibship, to form the case group, and comparing the alleles or genotype frequencies with a random sample of unaffected individuals. It has been shown that power can be substantially increased by including families with more affected siblings5–7 in association studies. The increase in power is due to the enrichment of disease-predisposing alleles in affected sibships; this, in turn, leads to improved power to detect genetic association because of larger allele-frequency differences between cases and controls.

Efficient use of data sets that include related individuals in association studies requires a unified statistical framework that allows the joint analysis of all available sampling units. In this article, we extend the association test proposed by Li et al.8 to the analysis of sibships of arbitrary size and disease-phenotype configuration and to accommodate parental genotypes, when available. Our method allows the analysis of data containing mixed types of sampling units that are based on a unified retrospective likelihood framework and therefore can evaluate evidence of disease-marker association on the basis of different sampling units, ranging from unselected unrelated individuals to large sibships. We consider scenarios for which genotypes are available for unrelated cases, ASPs, or only one sibling per ASP. We construct designs that use affected individuals only and others that use unaffected siblings or unrelated unaffected individuals as controls. Using our unified likelihood framework, we compare efficiency of these study designs when the number of individuals to be genotyped is fixed.

As noted elsewhere by Risch,6 we show that designs with unrelated controls are more powerful than are designs with family-based controls. Our results also suggest that, for diseases that are influenced by multiple genes, familial cases provide more association information than do singleton cases, unless the effect of the test locus is much smaller than at least one other untested disease locus. Similar phenomena have been observed by Risch9 for single major-locus models with an additive polygenic background and by Howson et al.10 for certain two-locus models. Further, we show that the case-control design can be useful for detecting genes with small effect in the presence of genes with much larger effect.

Methods

We consider the problem of disease-marker association analysis with mixed types of sampling units. Our goals are to develop a unified likelihood framework that allows the joint analysis of all available data and to compare efficiency of different study designs for testing association between disease and a candidate SNP, given fixed genotyping resources. We discuss the impact of phenotyping cost in the “Discussion” section.

Assumptions and Definitions

We assume there is a set of sibships genotyped at a candidate SNP and, optionally, M ≥ 0 flanking markers. We assume the SNP, with alleles A and a (with frequencies pA and pa), is completely linked (recombination fraction θ = 0) to a diallelic disease locus, with disease-predisposing allele D and alternate allele d (with frequencies pD and pd). We wish to evaluate evidence of association at the candidate SNP by modeling the disease-SNP haplotypes DA, Da, dA, and da (with frequencies pDA, pDa, pdA, and pda, respectively) and the penetrances fg = P(affected | g) for disease genotypes g ∈ {dd, Dd, DD}. As shown later, unrelated individuals do not allow the estimation of all these independent parameters. In samples that include only unrelated individuals, we assume that the disease and SNP loci are in complete LD (r² = 1), so that their allele frequencies are identical. The assumption that r² = 1 results in an identifiable model but no loss of statistical efficiency, since we can still extract maximum information from the available data.

By definition, the population prevalence of the disease is K = fdd·pd² + 2·fDd·pd·pD + fDD·pD², and the genotype relative risk is GRR = fg/fdd for g ∈ {Dd, DD}. We allow LD between the candidate SNP and the unobserved disease alleles but assume linkage equilibrium between the flanking markers and the superlocus formed by combining the disease and SNP loci. We assume Hardy-Weinberg equilibrium in the general population for all markers, including the superlocus. We further assume that the disease phenotypes of the siblings are independent, given their genotypes at the disease locus, and that there is a single disease causal variant in the region. We investigate the impact of multiple disease variants in the “Simulations” section.
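The prevalence and GRR definitions above can be illustrated with a short sketch. The function names and the numerical penetrances are hypothetical and chosen only for illustration; they are not taken from the simulated models in the article.

```python
def prevalence(f_dd, f_Dd, f_DD, p_D):
    """Population prevalence K under Hardy-Weinberg equilibrium."""
    p_d = 1.0 - p_D
    return f_dd * p_d**2 + 2.0 * f_Dd * p_d * p_D + f_DD * p_D**2


def genotype_relative_risks(f_dd, f_Dd, f_DD):
    """GRRs for genotypes Dd and DD, relative to the dd baseline."""
    return f_Dd / f_dd, f_DD / f_dd


# Hypothetical penetrances and disease-allele frequency (illustration only).
K = prevalence(f_dd=0.04, f_Dd=0.08, f_DD=0.16, p_D=0.1)
grr_Dd, grr_DD = genotype_relative_risks(0.04, 0.08, 0.16)
print(K, grr_Dd, grr_DD)
```

With these illustrative values, the heterozygote and homozygote carry two and four times the baseline risk, respectively.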

For a sibship with s siblings, let

equation image

be the observed unordered marker genotypes, Y be the disease phenotypes, and G be the disease-SNP haplo-genotypes. Let θm be the recombination fraction between markers m and m+1 (1 ≤ m ≤ M−1). The inheritance pattern at marker m is completely described by a binary inheritance vector11,12 vm of length 2s, whose entries indicate the outcome of the paternal and maternal meioses for the s siblings in the sibship. Let vD and vSNP denote the inheritance vectors at the disease locus and the candidate SNP, respectively. Complete linkage between the disease and SNP loci implies vD ≡ vSNP. For ease of computation, we assume there is no genetic interference, so that {vm} forms a hidden Markov chain.

Conditional Probability of Marker Data, Given Disease Phenotypes for a Sibship with s Siblings

We wish to evaluate P(X|Y), the conditional probability of marker genotypes X, given disease phenotypes Y for a sibship with s siblings. By the law of the total probability,

equation image

where the summation is taken over all disease-SNP haplogenotypes that are consistent with the observed SNP genotypes. Summing over all possible inheritance vectors at the disease locus and applying Baum’s13 forward and backward algorithms,

equation image

where k and k+1 are flanking markers on the left and right side of the candidate SNP. The summation over all possible inheritance vectors allows the handling of incomplete inheritance information and phase ambiguity by incorporating prior probabilities of the inheritance vectors. At any marker m (1 ≤ m ≤ M),

equation image

and

equation image

The calculation of equation (2) requires three probabilities: (1) the prior probability of inheritance vector vD, (2) the inheritance vector transition probability between two consecutive markers, and (3) the conditional probability of marker genotypes, given the inheritance vector at that marker. Because each of the 2s meioses has two equally likely outcomes, the prior probability is P(vD) = 2^(−2s).

The transition probability between inheritance vectors at markers m and m+1 can be obtained from the transition matrix, which is expressed as the Kronecker power of 2×2 transition matrices corresponding to transitions at each of the 2s meioses,

equation image

For example, for a sib pair,

equation image

and

equation image
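The Kronecker-power structure described above can be sketched in a few lines. This is a minimal illustration, assuming the standard 2×2 single-meiosis matrix with entries 1−θ on the diagonal and θ off the diagonal; the function names are hypothetical. It also shows the equivalent closed form, in which the transition probability depends only on the number of meioses whose outcome changes.

```python
from itertools import product


def single_meiosis_matrix(theta):
    """2x2 transition matrix for one meiosis between adjacent markers."""
    return [[1.0 - theta, theta], [theta, 1.0 - theta]]


def kronecker(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def transition_matrix(theta, n_meioses):
    """Kronecker power of the single-meiosis matrix (n_meioses = 2s)."""
    result = [[1.0]]
    for _ in range(n_meioses):
        result = kronecker(result, single_meiosis_matrix(theta))
    return result


def transition_probability(v_from, v_to, theta):
    """Closed form: theta^d * (1 - theta)^(n - d), d = number of changed bits."""
    d = sum(a != b for a, b in zip(v_from, v_to))
    return theta**d * (1.0 - theta)**(len(v_from) - d)


# For a sib pair there are 2s = 4 meioses, hence 16 inheritance vectors.
vectors = list(product((0, 1), repeat=4))
T = transition_matrix(0.1, 4)
print(len(T), T[0][0], transition_probability(vectors[0], vectors[0], 0.1))
```

The closed form avoids building the full 2^(2s) × 2^(2s) matrix, which grows quickly with sibship size.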

Let Odadm and Omomm represent the ordered genotypes of the father and the mother at marker m. In ordered genotypes, the maternal allele always precedes the paternal allele. Although observed genotypes are typically unordered, summing over ordered genotypes is computationally convenient, because, taken together, ordered genotypes for the founders and the inheritance vector specify the genotypes of all individuals in the pedigree. Thus, the conditional probability of sibship genotype Xm, given inheritance vector vm, can be calculated as

equation image

where P(Xm|Odadm,Omomm,vm) takes the value of 1 if the sibship’s genotype data Xm are consistent with the ordered parental genotypes Odadm and Omomm and the inheritance vector vm, and 0 otherwise. The summation is taken over all ordered parental genotypes. P(G|vG) can be calculated in a similar fashion, by regarding each haplogenotype as a genotype of the superlocus formed by combining the disease and SNP loci.
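The sum over ordered parental genotypes can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the inheritance vector is represented as one (paternal bit, maternal bit) pair per sibling, each bit selects which of that parent's two ordered alleles is transmitted, founder alleles are drawn independently under Hardy-Weinberg equilibrium, and all sibling genotypes are observed. The function name and data layout are hypothetical.

```python
from itertools import product


def prob_genotypes_given_inheritance(sib_genotypes, inheritance, allele_freqs):
    """P(X_m | v_m), summing over all ordered parental genotypes.

    sib_genotypes: unordered genotypes, e.g. [("A", "a"), ("A", "A")]
    inheritance:   one (paternal_bit, maternal_bit) pair per sibling
    allele_freqs:  dict mapping allele -> population frequency
    """
    alleles = list(allele_freqs)
    observed = [tuple(sorted(g)) for g in sib_genotypes]
    total = 0.0
    for dad in product(alleles, repeat=2):
        for mom in product(alleles, repeat=2):
            # Genotypes implied by the founders and the inheritance vector.
            implied = [tuple(sorted((dad[pb], mom[mb])))
                       for pb, mb in inheritance]
            # The consistency indicator is 1 only if these match the data.
            if implied == observed:
                total += (allele_freqs[dad[0]] * allele_freqs[dad[1]]
                          * allele_freqs[mom[0]] * allele_freqs[mom[1]])
    return total


freqs = {"A": 0.3, "a": 0.7}
# Sib pair sharing no alleles identical by descent, genotypes AA and aa.
print(prob_genotypes_given_inheritance(
    [("A", "A"), ("a", "a")], [(0, 0), (1, 1)], freqs))
```

A sib pair that inherits the same alleles from both parents cannot have genotypes AA and aa, so the same call with inheritance `[(0, 0), (0, 0)]` returns zero.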

Recursive calculation of Lm(vm) and Rm(vm) with use of these three probabilities allows equation (2) to be evaluated in a manner linear in the number of marker loci M. Equation (2) is an extension of the retrospective likelihood calculation for ASPs described by Li et al.8 Here, the sibship size can be >2, and siblings can be either affected or unaffected. Our likelihood calculation easily allows for missing genotypes. For example, to accommodate sibships in which only a subset of the siblings is genotyped at the candidate SNP, we sum over all possible SNP genotypes for those siblings of known disease status but with missing SNP genotypes. It is essential to include all these members, because siblings with known phenotypes but missing genotypes contribute association information.

Our calculation can be readily extended to accommodate parental genotypes. Following the derivation of equation (2), the critical part in the calculation is the conditional probability of marker genotypes for the siblings and their parents, given the inheritance vector at a particular marker. Let Xdadm and Xmomm represent the observed unordered parental genotypes at marker m. Then the conditional probability of the observed genotypes given the inheritance vector at marker m is

equation image

where the summation is taken over all ordered parental genotypes that are consistent with the observed unordered parental genotypes. This extension enables us to analyze nuclear families with genotyped parents, including parent-affected offspring trios, which are the basic sampling units used by the transmission/disequilibrium test.14

Under the assumption that the disease phenotypes are independent given the genotypes at the disease locus, P(Y|G) is the product of simple functions of penetrances. An affected sibling j (1 ≤ j ≤ s) with disease-SNP haplo-genotype Gj contributes a term f(Gj), and an unaffected sibling j contributes a term 1 − f(Gj). By the law of the total probability, the probability of the disease phenotypes for the sibship

equation image

Substituting equation (2), P(Y|G), and P(Y) into equation (1), we obtain the conditional probability for the sibship P(X|Y) as a function of model parameters {fdd,fDd,fDD,pDA,pDa,pdA}.
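The two phenotype probabilities used in this substitution can be sketched directly. This is a minimal illustration with hypothetical function names and penetrance values. Because penetrance depends only on the disease genotype, the sketch works with disease genotypes rather than full disease-SNP haplo-genotypes, and it evaluates P(Y) by conditioning on ordered parental genotypes, which is one convenient way to carry out the sum over sibship genotypes.

```python
from itertools import product


def prob_phenotypes_given_genotypes(phenotypes, genotypes, penetrance):
    """P(Y | G): f_g for each affected sibling, 1 - f_g for each unaffected."""
    prob = 1.0
    for affected, g in zip(phenotypes, genotypes):
        prob *= (penetrance[g] if affected else 1.0 - penetrance[g])
    return prob


def prob_phenotypes(phenotypes, penetrance, p_D):
    """P(Y) for a sibship, assuming Hardy-Weinberg equilibrium in founders."""
    freq = {"D": p_D, "d": 1.0 - p_D}
    total = 0.0
    for dad in product("Dd", repeat=2):
        for mom in product("Dd", repeat=2):
            p_parents = (freq[dad[0]] * freq[dad[1]]
                         * freq[mom[0]] * freq[mom[1]])
            p_sibs = 1.0
            # Siblings are independent given the ordered parental genotypes.
            for affected in phenotypes:
                p_one = 0.0
                for a, b in product(dad, mom):  # four equally likely transmissions
                    f = penetrance["".join(sorted((a, b)))]
                    p_one += 0.25 * (f if affected else 1.0 - f)
                p_sibs *= p_one
            total += p_parents * p_sibs
    return total


# Hypothetical penetrances (illustration only).
pen = {"dd": 0.04, "Dd": 0.08, "DD": 0.16}
print(prob_phenotypes_given_genotypes([True, False], ["Dd", "dd"], pen))
print(prob_phenotypes([True, True], pen, p_D=0.1))  # P(both siblings affected)
```

For a single sibling, `prob_phenotypes([True], pen, p_D)` reduces to the population prevalence K.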

In the calculation of P(Y|G) and P(Y), we assume that the disease statuses of the siblings are conditionally independent, given their genotypes at the disease locus. This assumption is exactly true only when there are no other genetic or environmental risk factors shared among the siblings. If the disease is influenced by multiple disease variants, then the calculation will depend on genotypes at the other disease loci as well. For example, if the disease is influenced by two unlinked disease loci, then

equation image

where subscripts 1 and 2 denote the two unlinked disease loci.

Conditional Probability of Marker Data, Given Disease Phenotype for a Single Individual

In principle, equation (1) can be applied to singleton individuals who can be regarded as sibships with one sibling. However, data sets containing solely unrelated individuals do not allow the estimation of all our model parameters. In this case, we assume that the disease and SNP loci are in complete LD, so that pD=pA, and we reparameterize our model. For case-control data,

equation image

which is a function of {fdd,fDd,fDD,pA}. For a sample of unrelated cases, P(XSNP|Y) is simply a function of the two SNP genotype frequencies, P(AA|case) and P(Aa|case). For studies that involve only unrelated individuals, flanking markers do not contribute information on association; therefore, we need to consider only the SNP genotypes. It is worth noting that, for SNPs that are in incomplete LD with the disease locus, the genetic effect will be underestimated; however, there is no loss of efficiency for the association test.
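For unrelated individuals, the reparameterized model can be illustrated by applying Bayes' rule to the SNP genotype. The sketch below assumes complete LD with allele A coinciding with the disease allele D (so that pD = pA) and Hardy-Weinberg equilibrium; the function name and parameter values are hypothetical.

```python
def snp_genotype_distribution(f_dd, f_Dd, f_DD, p_A):
    """P(X_SNP | case) and P(X_SNP | control) when p_D = p_A (complete LD)."""
    p_a = 1.0 - p_A
    prior = {"AA": p_A**2, "Aa": 2.0 * p_A * p_a, "aa": p_a**2}
    # Under complete LD, SNP genotype AA corresponds to disease genotype DD, etc.
    pen = {"AA": f_DD, "Aa": f_Dd, "aa": f_dd}
    K = sum(pen[g] * prior[g] for g in prior)  # population prevalence
    cases = {g: pen[g] * prior[g] / K for g in prior}
    controls = {g: (1.0 - pen[g]) * prior[g] / (1.0 - K) for g in prior}
    return cases, controls


cases, controls = snp_genotype_distribution(0.04, 0.08, 0.16, p_A=0.1)
print(cases)
print(controls)
```

The case distribution has only two free genotype frequencies, which is why a cases-only sample cannot identify the penetrances and the allele frequency separately.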

Pooling across Different Sampling Units

A key advantage of our likelihood calculation is that it allows the joint analysis of different sampling units in a unified statistical framework, which leads to more efficient use of the available data. The retrospective likelihood for data that contain N independent sibships, which may be of different sizes and disease phenotype configurations, is

equation image

Here, we choose to use a retrospective likelihood, since the sibships are ascertained through disease status. Using a retrospective likelihood avoids the problem of ascertainment bias and provides parameter estimates that are valid for the general population.15,16 In addition, it ensures that our test remains valid even if there are additional genetic or environmental factors that induce correlation between family members.
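Because the sampling units are independent, the pooled retrospective likelihood described above is presumably a product over units, which becomes a sum on the log scale. The following is a reconstruction from the surrounding text rather than a copy of the original display; the symbol Θ for the full parameter set is introduced here for illustration only.

```latex
L(\Theta) \;=\; \prod_{i=1}^{N} P\!\left(X_i \mid Y_i;\, \Theta\right),
\qquad
\log L(\Theta) \;=\; \sum_{i=1}^{N} \log P\!\left(X_i \mid Y_i;\, \Theta\right)
```

Each factor is evaluated with the formula appropriate to its sampling unit (sibship or single individual), which is what allows mixed data to be analyzed jointly.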

Test of Association

We wish to evaluate whether a SNP is associated with the putative disease locus. Under the null hypothesis of no association, the SNP and the disease locus are in linkage equilibrium, and the disease-SNP haplotype frequencies are the product of the corresponding disease and SNP allele frequencies (for example, pDA=pDpA). In this case, parameters that need to be estimated are {fdd,fDd,fDD,pD,pA}, and we set

equation image

Under the alternative hypothesis, we maximize a total of six parameters {fdd,fDd,fDD,pDA,pDa,pdA}. For data including only ASPs or only unrelated individuals, these parameters are not all identifiable, and we reparameterize the likelihood as described by Li et al.8 or maximize a subset of the parameters as detailed in table 1. We perform this maximization using a simplex algorithm,17 an optimization method that does not require derivatives.
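The relationship between the two hypotheses can be made concrete with a small sketch that builds haplotype frequencies under linkage equilibrium and computes r² from haplotype frequencies, using the standard definition r² = D²/(pD·pd·pA·pa) with D = pDA − pD·pA. The function names are hypothetical.

```python
def haplotype_freqs_under_equilibrium(p_D, p_A):
    """Null model: haplotype frequencies are products of allele frequencies."""
    return {
        "DA": p_D * p_A,
        "Da": p_D * (1.0 - p_A),
        "dA": (1.0 - p_D) * p_A,
        "da": (1.0 - p_D) * (1.0 - p_A),
    }


def r_squared(p_DA, p_Da, p_dA):
    """LD measure r^2 from three free haplotype frequencies."""
    p_D = p_DA + p_Da
    p_A = p_DA + p_dA
    D = p_DA - p_D * p_A
    return D**2 / (p_D * (1.0 - p_D) * p_A * (1.0 - p_A))


null = haplotype_freqs_under_equilibrium(0.1, 0.3)
print(r_squared(null["DA"], null["Da"], null["dA"]))  # approximately 0
print(r_squared(0.1, 0.0, 0.0))                       # complete LD
```

The null model therefore has two fewer free frequency parameters than the general model: pD and pA replace pDA, pDa, and pdA.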

Table 1
Parameters and Constraints for Different Sampling Units

Following Li et al.,8 we use a likelihood-ratio statistic to test for association. We compare the likelihood maximized under the general model (0 ≤ r² ≤ 1) with the likelihood maximized under the null model (r² = 0), using the likelihood-ratio statistic TLE. Parameters associated with each model for the different sampling units and the corresponding parameter constraints are summarized in table 1. For data sets that contain only unrelated cases and controls, our association test is similar to the unconstrained genotype test proposed by Thompson et al.,18 except that we do not assume known disease prevalence. Our test is also similar to the goodness-of-fit test proposed by Wittke-Thompson et al.19

In principle, the asymptotic distribution of TLE under the null hypothesis can be approximated by a mixture of χ² distributions,20 but we have not derived the degrees of freedom and mixing parameters because of the complexity of parameter constraints and boundaries. Instead, we assess significance of the test statistic empirically by simulating marker genotypes under the null hypothesis and comparing the observed statistic with the simulated null distribution.

Under the null hypothesis, we sample SNP genotypes for a sibship conditional on their observed flanking-marker genotypes and parameter estimates for the linkage equilibrium model. We leave flanking-marker genotypes unchanged from their observed values. For a single individual, we sample the SNP genotype according to the estimated SNP genotype frequencies. The null distribution of TLE can be obtained by calculating the statistic for a large number of simulated data sets.
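Once statistics have been computed for the simulated null data sets, empirical significance and empirical power follow from simple counting. The sketch below is a generic illustration with hypothetical function names; it uses the plain proportion of null statistics at least as large as the observed one, which is an assumption about the exact convention.

```python
def empirical_p_value(observed, null_statistics):
    """Proportion of simulated null statistics >= the observed statistic."""
    exceed = sum(1 for t in null_statistics if t >= observed)
    return exceed / len(null_statistics)


def empirical_power(alternative_statistics, null_statistics, alpha):
    """Fraction of alternative replicates with empirical P value <= alpha."""
    hits = sum(1 for t in alternative_statistics
               if empirical_p_value(t, null_statistics) <= alpha)
    return hits / len(alternative_statistics)


# Toy numbers for illustration only.
null_stats = [0.0, 1.0, 2.0, 3.0]
alt_stats = [2.5, 0.5, 3.5, 4.0]
print(empirical_p_value(2.5, null_stats))
print(empirical_power(alt_stats, null_stats, alpha=0.25))
```

With thousands of null replicates, sorting the null statistics once and using a binary search would be faster than the direct count shown here.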

Study Designs for Test of Genetic Association

Our likelihood calculation allows the analysis of sibships of arbitrary size and disease-phenotype configuration, including unrelated affected or unaffected individuals, ASPs, and discordant sib pairs (DSPs). For ease of presentation, we consider only sibships of size ≤2. To construct different study designs, we select either (1) one or two cases from each ASP or (2) unrelated affected individuals. We use either cases only or select controls from unrelated unaffected individuals or unaffected siblings. Different combinations of cases and controls result in seven study designs (fig. 1). It is worth noting that both the one sibling per ASP–control design and the case-control design use unrelated affected and unaffected individuals. The difference is that, for the one sibling per ASP–control design, the cases are selected from ASPs, whereas, in the case-control design, the cases are randomly selected from the general population.

Figure  1
Association study designs. The black arrows denote individuals to be genotyped at the candidate SNP. The number of individuals to be genotyped at the SNP is fixed at 1,000 for each study design.

Given fixed genotyping resources, it is important to know which study designs are the most powerful for detecting disease-SNP association. Since disease-mapping studies often start from linkage analysis and since flanking-marker genotypes often are already available, for these studies, we compare the efficiency of different study designs by fixing the total number of individuals to be genotyped at the candidate SNP, and we do not account for the cost or effort associated with collecting flanking-marker data.

Simulations

We performed a set of simulations to evaluate the efficiency of different study designs and to compare the statistical power of our test with other existing association tests. Table 2 describes the single-locus disease models that we considered, which varied over a range of attributable fractions, disease allele frequencies, and GRRs. We set the locus-specific sibling recurrence risk ratio21 (λs) to 1.02.

Table 2
Characteristics of the Simulated Single-Locus Disease Models When λs=1.02

When simulating the data, we assumed that the disease and SNP allele frequencies are identical, in contrast to our model in which these frequencies are allowed to differ. Setting these frequencies to be equal allowed us to compare the efficiency of different study designs over a broad range of LD (0 ≤ r² ≤ 1) between the disease alleles and the SNP. We assumed a map of 10 markers, each with four equally frequent alleles (heterozygosity H = 0.75) evenly spaced at 11.16-cM intervals, corresponding to θ = 0.1 under Haldane’s22 no-interference map function. We centered the disease locus and candidate SNP in the middle of the map and assumed zero recombination between them. The disease locus genotypes were removed prior to data analysis. For each of the disease models in table 2, we simulated 5,000 replicate data sets for each design under linkage equilibrium, to estimate the null distribution. We next simulated 2,000 replicate data sets with various levels of LD, to assess the empirical power of our association test.
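The marker-map settings above can be checked numerically. The sketch uses Haldane's map function, θ = ½(1 − e^(−2d)) with d in Morgans, and the usual definition of heterozygosity as one minus the sum of squared allele frequencies; the function names are hypothetical.

```python
import math


def haldane_theta(distance_cM):
    """Recombination fraction from map distance under Haldane's map function."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cM / 100.0))


def haldane_distance_cM(theta):
    """Inverse of Haldane's map function, returning distance in cM."""
    return -50.0 * math.log(1.0 - 2.0 * theta)


def heterozygosity(allele_freqs):
    """Expected heterozygosity under Hardy-Weinberg equilibrium."""
    return 1.0 - sum(p * p for p in allele_freqs)


print(haldane_theta(11.16))             # close to 0.1
print(haldane_distance_cM(0.1))         # close to 11.16 cM
print(heterozygosity([0.25] * 4))       # four equally frequent alleles
```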

To examine the impact of multilocus inheritance on the relative efficiency of the case-control design and the one sibling per ASP–control design, we also simulated data sets using the additive multilocus disease models for which the multilocus penetrance is the total of the penetrance summands as defined by Risch.23 For example, given L unlinked diallelic loci contributing to susceptibility in a recessive manner, the penetrance for each genotype is

equation image

Here, fbase is the baseline penetrance for the genotype containing no disease-predisposing genotypes, Δl is the increment in penetrance for the disease-predisposing genotype at locus l, and Il is an indicator of whether the individual is homozygous for the disease-predisposing allele at locus l.
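Read literally, this description gives a penetrance equal to the baseline plus the sum of the increments at loci where the individual carries the predisposing genotype. The sketch below illustrates that reading for the recessive case; the function names and numerical values are hypothetical, and the actual penetrances used in the simulations are those given in the article's appendix.

```python
def recessive_indicators(risk_allele_counts):
    """I_l = 1 if the individual is homozygous for the predisposing allele."""
    return [1 if count == 2 else 0 for count in risk_allele_counts]


def additive_penetrance(f_base, increments, indicators):
    """Multilocus penetrance: baseline plus increments at predisposing loci."""
    return f_base + sum(d * i for d, i in zip(increments, indicators))


# Two loci, illustration only: homozygous at locus 1, heterozygous at locus 2.
indicators = recessive_indicators([2, 1])
print(additive_penetrance(0.03, [0.2, 0.1], indicators))
```

Valid parameter choices must keep the total at or below 1 for the genotype that is predisposing at every locus.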

We simulated data sets assuming L ≥ 2 unlinked diallelic disease loci, each with predisposing allele frequency of 0.1. We simulated an associated SNP with minor-allele frequency of 0.1 completely linked to the first disease locus, which we call the test locus. We considered three scenarios: (1) increasing the locus-specific λs at the test locus from 1.02 to 1.25 but fixing the locus-specific λs at the remaining background disease loci at 1.02, (2) fixing the locus-specific λs at each disease locus at 1.02 and increasing the number of disease loci from 2 to 10, and (3) increasing the locus-specific λs at one of the background disease loci from 1.02 to 1.7 but fixing the locus-specific λs at the remaining disease loci, including the test locus, at 1.02. We fixed the disease prevalence at 5% in all scenarios. All disease genotypes were removed prior to data analysis. Precise details of the penetrances are available in the appendix.

Results

In this section, we compare power of different study designs when the number of individuals to be genotyped is fixed under single-locus disease models. Further, we examine designs with familial cases and singleton cases under multilocus disease models. We also evaluate the usefulness of flanking markers, compare our approach with other tests of association, and illustrate how to combine data from different family structures.

Power Comparisons of Different Study Designs

For each of the 12 disease models in table 2, we estimated the empirical power of the seven study designs for test of association at four levels of disease-SNP LD (r² = 0.25, 0.50, 0.75, and 1). We ranked each study design by its estimated power assessed at the 1% empirical significance level, so that the most powerful design has rank 1 and the least powerful design has rank 7. Each study design was ranked 12×4=48 times. Figure 2 displays the histograms of ranks for each study design. Our simulation results for the single-locus models indicate that, for a fixed number of SNP genotypes, the one sibling per ASP–control design is usually most powerful (rank=1 in 41 of 48 settings, average rank = 1.31), followed by the ASP-control design. For all 12 single-locus disease models we considered, the case-control design is less powerful than designs that include familial cases. In addition, we found that the DSP design is always less powerful than designs that include population controls. For a fixed genotyping effort, we also found that, under common dominant (pD=0.7) and rare recessive (pD=0.1) models, designs including only affected individuals can be more powerful than designs that also include unaffected individuals. Nevertheless, we generally do not advocate such designs, since they are more vulnerable to genotyping error and deviations from Hardy-Weinberg equilibrium. Our results suggest that the rankings were similar when r² = 1 and when r² = 0.25, and no designs behave better or worse at these two extremes.

Figure  2
Histograms of ranks for different study designs. Results are based on 2,000 replicates of the corresponding sampling units for each study design. All models have disease prevalence of K=5% and sibling recurrence risk ratio of λs=1.02. ...

Given a set of ASPs, an investigator may initially genotype candidate SNPs in only one sibling per ASP, halving genotyping costs on the cases. We compared the power of the ASP-control design with that of the one sibling per ASP–control design, where the latter uses only one sibling per ASP from the ASPs generated for the previous design (table 3). We found that the loss of power by genotyping only one sibling per ASP generally is modest. This suggests that, for an initial screen of SNPs, it may be cost effective to initially genotype only one sibling per ASP, with genotyping of the other siblings performed only when a candidate SNP shows at least suggestive evidence of association.

Table 3
Power (%) Comparison of the ASP-Control Design and the One Sibling per ASP–Control Design for a Fixed Number of Sibships

Impact of Multiple Disease Loci

We analyzed simulated data sets that included multiple disease loci. We compared the power of the case-control design and the one sibling per ASP–control design, since these designs are typically among the most powerful and represent a choice commonly faced by investigators—namely, whether to collect familial cases or unrelated cases. Figure 3 indicates that, for the five-locus additive disease models that we considered, where the background untested disease loci have small effect (λs=1.02), both study designs have increasing power as the effect of the test locus increases. Further, the increment in power is more pronounced for the one sibling per ASP–control design. Similar patterns were observed for models for which all disease loci are dominant or recessive and for models with larger numbers of disease loci (data not shown).

Figure  3
Comparison of case-control design and one sibling per ASP–control design, under five-locus disease models, when the effect size of the test locus increases and the effect size of the four remaining disease loci are fixed. Results are based on ...

We also investigated the impact of multiple disease loci when all loci have the same effect (fig. 4). We found that the case-control design has approximately constant power across different numbers of disease loci, whereas the power of the one sibling per ASP–control design decreases as the number of disease loci increases, corresponding to greater familial aggregation. A similar finding was reported by Risch,9 who found that, when the sibling residual correlation is high, multiplex affected sibships and familial cases sometimes provide a smaller advantage over randomly selected cases. Although the advantage of familial cases diminishes as the number of disease loci increases, we found that the one sibling per ASP–control design remained more powerful than the case-control design for all the disease models that we considered.

Figure  4
Power comparison of case-control design and one sibling per ASP–control design, under multilocus disease models, when the effect size of each disease locus is fixed and the number of disease loci increases. Results are based on 2,000 replicate ...

As expected, for disease models for which the test locus has a fixed small effect (λs=1.02), we found that the power of the case-control design is not influenced by the effect size of the untested background locus (fig. 5) when the disease prevalence is fixed. In contrast, the power of the one sibling per ASP–control design decreases as the effect size of the background locus increases. Generally, the one sibling per ASP–control design has greater power than the case-control design, but the case-control design becomes more powerful when the effect of the background locus is very large (λs>1.6 in our simulations) relative to the test locus (λs=1.02 in our simulations).

Figure  5
Power comparison of case-control design and one sibling per ASP–control design, under five-locus disease models, when the effect size of the large-effect background disease locus increases and the effect size of the small-effect disease loci, ...

Improvement of Power by Including Flanking Markers

In sib pair samples, our method makes use of genotypes on flanking markers, which may provide valuable information about the underlying disease models, especially when these markers are closely linked to the unobserved disease locus. To assess the utility of flanking-marker data on our test of association, we repeated the power estimation procedure for the ASP design and the one sibling per ASP design. Table 4 suggests that the flanking-marker data can substantially increase the association power for dominant and additive models. Our results also indicate that the flanking markers are more useful for SNPs in which one allele is rare.

Table 4
Improvement of Power (%) by Including Flanking Markers

Comparison with Other Tests of Association

We compared the power of our test with other tests of association. The significance for all tests was determined empirically by simulating null distributions and obtaining critical values. For the case-control design, we compared our test with Pearson’s 2-df χ² statistic for the 2×3 table of genotype frequencies in cases and controls. For the ASP-control design, we compared our test with Risch and Teng’s5 test

equation image

where n is the number of sibships, each with r affected sibs, and u is the number of unrelated controls matched to each sibship,

equation image

and equation M4 and equation M5 are the estimated SNP allele frequency for the ASPs and the controls, respectively. For the design with 250 ASPs and 500 controls, r=2, u=2, and n=250.

Table 5 shows that, for most of the models we considered and especially for recessive models, our test has greater power than Pearson’s 2-df χ² test. Our test is less powerful than Risch and Teng’s test for the additive models examined. This is because Risch and Teng’s test is a 1-df test, whereas our test relies on the estimation of several disease model parameters, resulting in more degrees of freedom. We found that when assuming a prespecified disease model (e.g., additive, dominant, or recessive) by imposing constraints on penetrances estimated in our model, the power of the two tests became comparable (data not shown). This suggests that fixing some disease model parameters is likely to improve the power if these parameters can be approximated from previous studies. Note that neither our test nor Risch and Teng’s test controls for population stratification.

Table 5
Power (%) Comparison with Other Tests of Association

Combining Data from Different Family Structures

A key advantage of our method is its ability to combine data from different sampling units. In many association studies, particularly those that follow an initial linkage study, an investigator may have different sampling units available. For example, the data may contain nuclear families with different numbers of genotyped parents and affected and unaffected siblings collected for the initial linkage analysis and unrelated affected or unaffected individuals from additional sampling. A simple strategy for analyzing such data would be to use all unrelated affected individuals and one affected sibling per sibship to form the case group and then use all unrelated unaffected individuals to form a control group. However, this does not use all available data and can give variable results, depending on which affected siblings are selected.7

To assess the gain in power by using all available data simultaneously, we simulated different combinations of ASPs, DSPs, unrelated cases, and unrelated controls. We compared the power of our test when using all available data with tests that use only partial data obtained by selecting one sibling per sampling unit. Table 6 suggests that there could be a substantial loss of power when only a subset of the data is used. As expected, when the proportion of data being used decreases, the loss of power increases, suggesting that when the majority of the data are sampled from sibships or families, it is important to use all available data.

Table 6
Power (%) Comparison with Complete Data and Partial Data

As in the comparison between genotyping all affected siblings and genotyping only one sibling per ASP, we found that including the genotypes of all affected family members increases power. When this additional genotyping is not cost-effective for all markers, it could be reserved for a follow-up phase.

Discussion

We have developed a unified likelihood framework to test for disease-marker association that allows the analysis of sibships of arbitrary size and disease-phenotype configuration. Our likelihood calculations allow us to accommodate different association-study designs and to compare their efficiencies. By use of simulation studies, we found that when the number of individuals to be genotyped at the candidate SNP is fixed, for single-locus models, the one sibling per ASP–control design was generally most powerful, followed by the ASP-control design. As others have noted,6 we also found that familial cases contributed more association information than did singleton cases and that the DSP design was less powerful than designs that include unrelated unaffected individuals. This pattern holds for disease prevalence 2% ≤ K ≤ 20%, with similar relative efficiency for the seven study designs that we considered (data not shown). Additional simulations reveal that our conclusions regarding the relative efficiency of different study designs remain unchanged at more-stringent critical values (α=.0001, .00001, and .000001).

In most of our simulations, we generated data in which allele frequencies were identical at the disease and SNP loci. To evaluate the robustness of our model to differences in allele frequency between the two loci, we conducted additional simulations over a broad range of allele-frequency differences. We considered combinations of p_D ∈ {.1, .3, .5, .7} and p_A ∈ {.1, .3, .5, .7} for dominant, additive, and recessive models. Results of these additional simulations suggest that the relative efficiency of different study designs remains unchanged, although all designs have low power when the allele frequencies are very different, since r² is low.
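The dependence of r² on the allele-frequency difference can be made concrete with a short calculation. The sketch below is an illustration, not part of the original analysis: it assumes that the two alleles with frequencies p_A and p_D are positively associated (D > 0), in which case D is bounded by min(p_A(1 − p_D), p_D(1 − p_A)) and r² = D²/[p_A(1 − p_A)p_D(1 − p_D)] cannot exceed the value returned. The function name is hypothetical.

```python
def max_r2(p_a, p_d):
    """Upper bound on r^2 between two diallelic loci whose alleles, with
    frequencies p_a and p_d, are positively associated (D > 0)."""
    lo, hi = sorted((p_a, p_d))
    # D_max = lo * (1 - hi); dividing D_max^2 by the product of the allelic
    # variances lo(1 - lo) * hi(1 - hi) simplifies to the ratio below.
    return (lo * (1 - hi)) / (hi * (1 - lo))

for p_a, p_d in [(0.1, 0.1), (0.3, 0.5), (0.1, 0.7)]:
    print(f"p_A={p_a}, p_D={p_d}: max r^2 = {max_r2(p_a, p_d):.3f}")
```

With equal frequencies the bound is 1, whereas for p_A=.1 and p_D=.7 it falls below .05, which is consistent with the low power observed when the frequencies differ greatly.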

Our results show that the proposed test is usually more powerful than Pearson’s χ² test for the case-control design. One reason for this advantage is that our test uses an explicit genetic model for the disease, whereas Pearson’s χ² test is nonparametric in nature. Our results are consistent with those of Thompson et al.,18 who showed that even simple modeling assumptions, such as assuming Hardy-Weinberg equilibrium in the general population, increase power of genetic-association studies.

Our method does not depend on transmission disequilibrium and can incorporate parental genotypes when available. To evaluate the potential gain in power afforded by collecting parental genotypes, we generated data sets with 500 controls and 500 parent–affected offspring trios for disease models (table 2). We analyzed the data, first taking into account only genotypes for the 500 unrelated cases and controls (average power=39%; α=.01) and then also incorporating parental genotypes (average power=54%, α=.01). We expect that parental data will be less useful on a per-genotype basis but will still provide useful information on allelic association.

Our method assumes that the superlocus formed by combining the disease and SNP loci is in Hardy-Weinberg equilibrium in the general population. In the presence of population stratification, the Hardy-Weinberg equilibrium assumption may be violated and our test may be invalid. An important step for avoiding population stratification is to carefully match cases and controls on the basis of their genetic background. When the degree of stratification is small, it may be possible to adjust our test statistics with genomic control24 or a similar strategy.
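For readers unfamiliar with genomic control, the following sketch shows the basic adjustment for 1-df χ² statistics computed at markers presumed to be unassociated with disease. It is an assumption that the same rescaling would carry over to the likelihood-ratio statistic described here, which has more than one degree of freedom; the function and constant names are hypothetical.

```python
import statistics

# Median of the chi-square distribution with 1 df
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(null_marker_stats, test_stats):
    """Estimate the inflation factor from statistics at presumed-null markers
    and divide the statistics of interest by it."""
    inflation = statistics.median(null_marker_stats) / CHI2_1DF_MEDIAN
    inflation = max(1.0, inflation)  # never inflate the test statistics
    return inflation, [s / inflation for s in test_stats]
```

The adjusted statistics are then referred to the usual χ² distribution with 1 df.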

We initially assumed that there is a single disease-predisposing variant in the region. As others have noted,5 under this assumption, familial cases tend to be enriched for the disease-predisposing allele and thus create a stronger contrast with unaffected individuals. For diseases that are influenced by multiple genes, the advantage of familial cases will depend on the underlying disease models. Our results indicate that, for disease models for which the test locus has equal or stronger effect than the remaining background disease loci, familial cases provide more association information than do randomly selected cases. This remains true unless the effect size of the test locus is much smaller (e.g., λs=1.02) than at least one other untested disease locus (e.g., λs=1.6). A similar pattern was observed by Howson et al.10 for additive and crossover two-locus disease models and by Allison et al.25 for extreme sampling in quantitative trait linkage/association studies.

Our findings have important implications for genetic-association studies of many complex diseases, such as depression and schizophrenia, for which loci of large effect have not been identified. For such diseases, designs with familial cases are likely to be a good choice for the initial association studies. One might consider genotyping additional affected family members for those markers that show suggestive evidence of association. Our findings also have implications for diseases for which a major gene is known to play a role, such as many autoimmune disorders for which a strong human leukocyte antigen effect has been demonstrated, and for age-related macular degeneration, for which two major loci have been identified.1,26–30 For these diseases, the standard case-control design might be preferred for detecting genes that contribute only a small fraction of the overall disease risk.

Enabled by improvements in genotyping technologies, association studies are beginning to be conducted genomewide.1,2 We believe our method will be useful for analyzing the results of these studies. Nevertheless, applying our method to hundreds of thousands of markers may present a computational challenge, because it relies on an iterative procedure to maximize the likelihood of the data under alternative models. If computational resources are limited, one option is to first screen all markers with a computationally inexpensive test and then apply our method to markers that show suggestive evidence of association.
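The two-stage strategy described above can be sketched as follows. This is only a schematic: `cheap_test` and `expensive_test` are hypothetical callables that return a P value for one marker's data, and the screening threshold is an arbitrary choice that is not specified in this article.

```python
def two_stage_scan(markers, cheap_test, expensive_test, screen_threshold):
    """Apply expensive_test only to markers whose cheap_test P value
    passes the screening threshold."""
    results = {}
    for marker_id, data in markers.items():
        if cheap_test(data) <= screen_threshold:
            results[marker_id] = expensive_test(data)
    return results
```

Because the second-stage statistic is computed only for markers that pass the screen, the overall significance threshold would need to account for both stages.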

In this article, we focused on comparing efficiency of different study designs when the genotyping cost is fixed. Although familial cases provide more association information than do singleton cases in most settings we considered, familial cases (if not already sampled) are typically more difficult to collect and hence may result in higher phenotyping costs. It would be interesting to investigate the relative efficiency of familial cases and singleton cases, taking into account both genotyping and phenotyping costs, with the goal of minimizing total study cost.

In summary, we have developed a unified statistical framework to test for disease-marker association, using sibships of arbitrary size and disease-phenotype configuration. Our method can be readily extended to allow general pedigrees. We compared the efficiency of seven study designs when the number of individuals to be genotyped at the candidate SNP is fixed. Our results suggest that familial cases are more advantageous than are randomly selected cases when the disease follows a single-locus model. This also appears to be true for multilocus disease models, unless the effect size of the test locus is much smaller than that of at least one untested disease locus. On a cost basis, genotyping one sibling per affected sibship and using existing flanking-marker information provides a powerful design for initial association studies. We believe our findings will be helpful for researchers designing and analyzing complex disease–association studies and will increase power and facilitate genotyping resource allocation. We implemented our method in a C++ program, which can be downloaded from the University of Michigan Center for Statistical Genetics Web site.

Acknowledgments

This research was supported by National Institutes of Health grants HG00376 (to M.B.) and HG02651 and EY12562 (to G.R.A.). M.L. was previously supported by a University of Michigan Rackham predoctoral fellowship. We gratefully thank two anonymous reviewers for their valuable comments.

Appendix: Parameters for Multilocus Disease Models

Assume the disease is influenced by L unlinked diallelic disease loci. For locus l (1 ≤ l ≤ L), let D_l denote the disease-predisposing allele and d_l denote the low-risk allele. Let f_base denote the baseline penetrance for the genotype in which all disease loci are homozygous for the low-risk allele. For an individual with genotype in {d_l d_l, D_l d_l, D_l D_l} at locus l, let g_l denote the genotype score that counts the number of D_l alleles. Further, assume that the penetrance is increased over the baseline by w(g_l). The increment of penetrance depends on the marginal disease model at the corresponding locus. For example, for additive, dominant, and recessive models, w(g_l) can be defined as shown in table A1. For an individual with genotype scores (g_1, …, g_L), the corresponding multilocus penetrance is

$$f(g_1,\ldots,g_L) = f_{\text{base}} + \sum_{l=1}^{L} w(g_l)$$

The individual’s disease status can be determined once the genotype is known. Samples of unrelated cases and controls and familial cases can be simulated as usual.
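A minimal simulation sketch of this model is given below. It assumes that the multilocus penetrance is the baseline plus the sum of the locus-specific increments, that genotypes are in Hardy-Weinberg proportions, and that loci are independent; these assumptions reproduce the 5% prevalence stated for the tabulated parameters. All function names are hypothetical, and the authors' C++ implementation may differ.

```python
import itertools
import random

def increment(g, delta, model):
    """Penetrance increment w(g) for genotype score g, following table A1."""
    if model == "additive":
        return 0.5 * delta * g
    if model == "dominant":
        return delta if g >= 1 else 0.0
    if model == "recessive":
        return delta if g == 2 else 0.0
    raise ValueError(f"unknown model: {model}")

def penetrance(scores, f_base, deltas, model):
    """Baseline penetrance plus the sum of the per-locus increments."""
    return f_base + sum(increment(g, d, model) for g, d in zip(scores, deltas))

def genotype_prob(g, p):
    """Hardy-Weinberg probability of g copies of an allele with frequency p."""
    return ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)[g]

def prevalence(f_base, deltas, model, p):
    """Exact prevalence: penetrance averaged over all multilocus genotypes."""
    total = 0.0
    for scores in itertools.product(range(3), repeat=len(deltas)):
        prob = 1.0
        for g in scores:
            prob *= genotype_prob(g, p)
        total += prob * penetrance(scores, f_base, deltas, model)
    return total

def simulate_individual(f_base, deltas, model, p, rng):
    """Draw genotype scores at each locus, then affection status."""
    scores = [(rng.random() < p) + (rng.random() < p) for _ in deltas]
    affected = rng.random() < penetrance(scores, f_base, deltas, model)
    return scores, affected

# First row of table A2, additive model: five loci, each with increment .0471
print(round(prevalence(0.0264, [0.0471] * 5, "additive", 0.1), 3))
```

Cases and controls can then be collected by repeatedly calling `simulate_individual` and retaining affected or unaffected individuals as required.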

Tables A2–A4 list disease-model parameters for the additive multilocus-disease models described in this article.

Table A1

Increment of Penetrance, w(g_l)

Genotype at Locus l   g_l   Additive   Dominant   Recessive
d_l d_l               0     0          0          0
D_l d_l               1     0.5Δ_l     Δ_l        0
D_l D_l               2     Δ_l        Δ_l        Δ_l

Table A2

Disease-Model Parameters for Figure 3

          Additive                      Dominant                      Recessive
λs,test   f_base  Δ_test  Δ_background  f_base  Δ_test  Δ_background  f_base  Δ_test  Δ_background
1.02      .0264   .0471   .0471         .0255   .0258   .0258         .0435   .1307   .1307
1.05      .0237   .0745   .0471         .0226   .0408   .0258         .0427   .2067   .1307
1.10      .0206   .1054   .0471         .0194   .0578   .0258         .0418   .2924   .1307
1.15      .0182   .1291   .0471         .0169   .0707   .0258         .0412   .3581   .1307
1.20      .0162   .1491   .0471         .0148   .0817   .0258         .0406   .4134   .1307
1.25      .0145   .1667   .0471         .0130   .0913   .0258         .0401   .4622   .1307

Note: Data are a comparison of five-locus disease models for which the effect size of the test locus increases and the effect size of the four remaining disease loci is fixed. The disease prevalence K is fixed at 5%. The predisposing allele frequency at each locus is fixed at 0.1. The locus-specific sibling recurrence risk ratio, λs,test, at the test locus is increased from 1.02 to 1.25, and the locus-specific sibling recurrence risk ratio at the four remaining disease loci is fixed at 1.02. Δ_test is the increment of penetrance at the test locus, and Δ_background is the increment of penetrance at each of the four remaining disease loci.
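The link between the penetrance increment and the locus-specific sibling recurrence risk ratio can be checked numerically. The sketch below uses the standard variance-components identity λs = 1 + (V_A/2 + V_D/4)/K², where V_A and V_D are the additive and dominance variances contributed by the locus. Whether the tabulated values were derived in exactly this way is an assumption, although the identity reproduces them to rounding error. The function name is hypothetical.

```python
def locus_lambda_s(delta, model, p, prevalence):
    """Locus-specific sibling recurrence risk ratio implied by a penetrance
    increment delta, predisposing-allele frequency p, and prevalence K."""
    q = 1 - p
    # Increments w(0), w(1), w(2) from table A1
    w = {
        "additive": (0.0, 0.5 * delta, delta),
        "dominant": (0.0, delta, delta),
        "recessive": (0.0, 0.0, delta),
    }[model]
    a = (w[2] - w[0]) / 2            # half the difference between homozygotes
    d = w[1] - (w[0] + w[2]) / 2     # dominance deviation of the heterozygote
    alpha = a + d * (q - p)          # average effect of allele substitution
    v_additive = 2 * p * q * alpha ** 2
    v_dominance = (2 * p * q * d) ** 2
    return 1 + (0.5 * v_additive + 0.25 * v_dominance) / prevalence ** 2

# Rows of table A2 with lambda_s,test = 1.02 and 1.25 (p = 0.1, K = 5%)
for model, small, large in [("additive", 0.0471, 0.1667),
                            ("dominant", 0.0258, 0.0913),
                            ("recessive", 0.1307, 0.4622)]:
    print(model,
          round(locus_lambda_s(small, model, 0.1, 0.05), 2),
          round(locus_lambda_s(large, model, 0.1, 0.05), 2))
```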

Table A3

Disease Model Parameters for Figure 4

      Additive         Dominant         Recessive
L     f_base  Δ        f_base  Δ        f_base  Δ
2     .0406   .0471    .0402   .0258    .0474   .1307
4     .0301   .0471    .0304   .0258    .0448   .1307
6     .0217   .0471    .0206   .0258    .0422   .1307
8     .0123   .0471    .0107   .0258    .0395   .1307
10    .0029   .0471    .0009   .0258    .0369   .1307

Note.— Data are a comparison of L-locus disease models for which the effect size of each disease locus is fixed and the number of disease loci increases. The disease prevalence K is fixed at 5%. The predisposing allele frequency at each locus is fixed at 0.1. The locus-specific sibling recurrence risk ratio λs at all disease loci is fixed at 1.02. Δ is the increment of penetrance at each locus.

Table A4

Disease Model Parameters for Figure 5

           Additive                   Dominant                   Recessive
λs,large   f_base  Δ_small  Δ_large   f_base  Δ_small  Δ_large   f_base  Δ_small  Δ_large
1.02       .0264   .0471    .0471     .0255   .0258    .0258     .0435   .1307    .1307
1.1        .0206   .0471    .1054     .0194   .0258    .0578     .0418   .1307    .2924
1.2        .0162   .0471    .1491     .0148   .0258    .0817     .0406   .1307    .4134
1.3        .0129   .0471    .1826     .0114   .0258    .1001     .0397   .1307    .5064
1.4        .0101   .0471    .2108     .0084   .0258    .1155     .0389   .1307    .5847
1.5        .0076   .0471    .2357     .0058   .0258    .1292     .0382   .1307    .6537
1.6        .0053   .0471    .2582     .0035   .0258    .1415     .0376   .1307    .7161
1.7        .0033   .0471    .2789     .0013   .0258    .1528     .0370   .1307    .7735

Note: Data are a comparison of five-locus disease models for which the effect size of the large-effect background disease locus increases and the effect size of the small-effect disease loci, including the test locus, is fixed. The disease prevalence K=5%. The predisposing allele frequency at each locus is fixed at 0.1. The locus-specific sibling recurrence risk ratio, λs,large, at the large-effect background locus is increased from 1.02 to 1.7, and the locus-specific sibling recurrence risk ratio at the four remaining loci, including the test locus, is fixed at 1.02. Δ_large is the increment of penetrance at the large-effect background locus, and Δ_small is the increment of penetrance at each of the four remaining loci.

Web Resource

The URL for data presented herein is as follows:

University of Michigan Center for Statistical Genetics, http://www.sph.umich.edu/csg/abecasis/lamp/

References

1. Klein RJ, Zeiss C, Chew EY, Tsai J-Y, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308:385–389 doi:10.1126/science.1109557
2. Maraganore DM, de Andrade M, Lesnick TG, Strain KJ, Farrer MJ, Rocca WA, Pant PVK, Frazer KA, Cox DR, Ballinger DG (2005) High-resolution whole-genome association study of Parkinson disease. Am J Hum Genet 77:685–693
3. International HapMap Consortium (2003) The International HapMap Project. Nature 426:789–796 doi:10.1038/nature02168
4. International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437:1299–1320 doi:10.1038/nature04226
5. Risch N, Teng J (1998) The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases. I. DNA pooling. Genome Res 8:1273–1288
6. Risch N (2000) Searching for genetic determinants in the new millennium. Nature 405:847–856 doi:10.1038/35015718
7. Fingerlin TE, Boehnke M, Abecasis GR (2004) Increasing the power and efficiency of disease-marker case-control association studies through use of allele-sharing information. Am J Hum Genet 74:432–443
8. Li M, Boehnke M, Abecasis GR (2005) Joint modeling of linkage and association: identifying and quantifying SNPs responsible for a linkage signal. Am J Hum Genet 76:934–949
9. Risch N (2001) Implications of multilocus inheritance for gene-disease association studies. Theor Popul Biol 60:215–220 doi:10.1006/tpbi.2001.1538
10. Howson JMM, Barratt BJ, Todd JA, Cordell HJ (2005) Comparison of population- and family-based methods for genetic association analysis in the presence of interacting loci. Genet Epidemiol 29:51–67 doi:10.1002/gepi.20077
11. Lander ES, Green P (1987) Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA 84:2363–2367
12. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet 58:1347–1363
13. Baum LE (1972) An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3:1–8
14. Spielman RS, McGinnis RE, Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52:506–516
15. Cannings C, Thompson EA (1977) Ascertainment in the sequential sampling of pedigrees. Clin Genet 12:208–212
16. Epstein MP, Lin X, Boehnke M (2002) Ascertainment-adjusted parameter estimates revisited. Am J Hum Genet 70:886–895
17. Nelder JA, Mead R (1965) A simplex method for function minimization. Comput J 7:308–313
18. Thompson D, Witte JS, Slattery M, Goldgar D (2004) Increased power for case-control studies of single nucleotide polymorphisms through incorporation of family history and genetic constraints. Genet Epidemiol 27:215–224 doi:10.1002/gepi.20018
19. Wittke-Thompson JK, Pluzhnikov A, Cox NJ (2005) Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet 76:967–986
20. Self SG, Liang KY (1987) Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under non-standard conditions. J Am Stat Assoc 82:605–610
21. Risch N (1987) Assessing the role of HLA-linked and unlinked determinants of disease. Am J Hum Genet 40:1–14
22. Haldane JBS (1919) The combination of linkage values and the calculation of distances between the loci of linked factors. J Genet 8:299–309
23. Risch N (1990) Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet 46:222–228
24. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55:997–1004 doi:10.1111/j.0006-341X.1999.00997.x
25. Allison DB, Heo M, Schork NJ, Wong S-L, Elston RC (1998) Extreme selection strategies in gene mapping studies of oligogenic quantitative traits do not always increase power. Hum Hered 48:97–107 doi:10.1159/000022788
26. Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, Gallins P, Spencer KL, Kwan SY, Noureddine M, Gilbert JR, Schnetz-Boutaud N, Agarwal A, Postel EA, Pericak-Vance MA (2005) Complement factor H variant increases the risk of age-related macular degeneration. Science 308:419–421 doi:10.1126/science.1110359
27. Edwards AO, Ritter R, Abel KJ, Manning A, Panhuysen C, Farrer LA (2005) Complement factor H polymorphism and age-related macular degeneration. Science 308:421–424 doi:10.1126/science.1110189
28. Zareparsi S, Branham KE, Li M, Shah S, Klein RJ, Ott J, Hoh J, Abecasis GR, Swaroop A (2005) Strong association of the Y402H variant in complement factor H at 1q32 with susceptibility to age-related macular degeneration. Am J Hum Genet 77:149–153
29. Jakobsdottir J, Conley YP, Weeks DE, Mah TS, Ferrell RE, Gorin MB (2005) Susceptibility genes for age-related maculopathy on chromosome 10q26. Am J Hum Genet 77:389–407
30. Rivera A, Fisher SA, Fritsche LG, Keilhauer CN, Lichtner P, Meitinger T, Weber BHF (2005) Hypothetical LOC387715 is a second major susceptibility gene for age-related macular degeneration, contributing independently of complement factor H to disease risk. Hum Mol Genet 14:3227–3236 doi:10.1093/hmg/ddi353

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics