# Efficient Study Designs for Test of Genetic Association Using Sibship Data and Unrelated Cases and Controls

^{1}Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania School of Medicine, Philadelphia; and

^{2}Department of Biostatistics and Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor

## Abstract

Linkage mapping of complex diseases is often followed by association studies between phenotypes and marker genotypes through use of case-control or family-based designs. Given fixed genotyping resources, it is important to know which study designs are the most efficient. To address this problem, we extended the likelihood-based method of Li et al., which assesses whether there is linkage disequilibrium between a disease locus and a SNP, to accommodate sibships of arbitrary size and disease-phenotype configuration. A key advantage of our method is the ability to combine data from different family structures. We consider scenarios for which genotypes are available for unrelated cases, affected sib pairs (ASPs), or only one sibling per ASP. We construct designs that use cases only and others that use unaffected siblings or unrelated unaffected individuals as controls. Different combinations of cases and controls result in seven study designs. We compare the efficiency of these designs when the number of individuals to be genotyped is fixed. Our results suggest that (1) when the disease is influenced by a single gene, the one sibling per ASP–control design is the most efficient, followed by the ASP-control design, and familial cases contribute more association information than singleton cases; (2) when the disease is influenced by multiple genes, familial cases provide more association information than singleton cases, unless the effect of the locus being tested is much smaller than at least one other untested disease locus; and (3) the case-control design can be useful for detecting genes with small effect in the presence of genes with much larger effect. Our findings will be helpful for researchers designing and analyzing complex disease-association studies and will facilitate genotyping resource allocation.

Association analysis provides a powerful tool for identifying genetic variants that predispose to complex diseases. Association analysis with use of genetic markers (such as SNPs) relies on the presence of linkage disequilibrium (LD), which occurs when specific alleles at the disease and marker loci appear together in gametes more frequently than expected by chance. With the recent availability of high-throughput SNP genotyping and decreasing genotyping costs, association studies with use of SNPs are beginning to be conducted genomewide.^{1,2} Such analyses have been facilitated by progress on the International HapMap Project,^{3,4} which cataloged and genotyped millions of SNPs, allowing informative tagging SNPs to be selected for different populations. Genomewide association studies typically involve hundreds or thousands of individuals and, since genotyping on such a large scale is still expensive, it is important to choose efficient study designs.

In gene-mapping studies, affected sib pairs (ASPs) or multiplex affected sibships are often collected for linkage analyses. Although these individuals may be reused in follow-up association studies, this is not always done. Traditionally, association-mapping studies with the case-control design have been used to test for disease-marker association by selecting one affected sibling per sibship, to form the case group, and comparing the alleles or genotype frequencies with a random sample of unaffected individuals. It has been shown that power can be substantially increased by including families with more affected siblings^{5–7} in association studies. The increase of power is due to the enrichment of disease-predisposing alleles in affected sibships; this, in turn, leads to improved power to detect genetic association because of larger allele-frequency differences between cases and controls.

Efficient use of data sets that include related individuals in association studies requires a unified statistical framework that allows the joint analysis of all available sampling units. In this article, we extend the association test proposed by Li et al.^{8} to the analysis of sibships of arbitrary size and disease-phenotype configuration and to accommodate parental genotypes, when available. Our method allows the analysis of data containing mixed types of sampling units that are based on a unified retrospective likelihood framework and therefore can evaluate evidence of disease-marker association on the basis of different sampling units, ranging from unselected unrelated individuals to large sibships. We consider scenarios for which genotypes are available for unrelated cases, ASPs, or only one sibling per ASP. We construct designs that use affected individuals only and others that use unaffected siblings or unrelated unaffected individuals as controls. Using our unified likelihood framework, we compare efficiency of these study designs when the number of individuals to be genotyped is fixed.

As noted elsewhere by Risch,^{6} we show that designs with unrelated controls are more powerful than are designs with family-based controls. Our results also suggest that, for diseases that are influenced by multiple genes, familial cases provide more association information than do singleton cases, unless the effect of the test locus is much smaller than at least one other untested disease locus. Similar phenomena have been observed by Risch^{9} for single major-locus models with an additive polygenic background and by Howson et al.^{10} for certain two-locus models. Further, we show that the case-control design can be useful for detecting genes with small effect in the presence of genes with much larger effect.

## Methods

We consider the problem of disease-marker association analysis with mixed types of sampling units. Our goals are to develop a unified likelihood framework that allows the joint analysis of all available data and to compare efficiency of different study designs for testing association between disease and a candidate SNP, given fixed genotyping resources. We discuss the impact of phenotyping cost in the “Discussion” section.

### Assumptions and Definitions

We assume there is a set of sibships genotyped at a candidate SNP and, optionally, *M*⩾0 flanking markers. We assume the SNP, with alleles A and a (with frequencies *p*_{A} and *p*_{a}), is completely linked (recombination fraction θ=0) to a diallelic disease locus, with disease-predisposing allele D and alternate allele d (with frequencies *p*_{D} and *p*_{d}). We wish to evaluate evidence of association at the candidate SNP by modeling the disease-SNP haplotypes DA, Da, dA, and da (with frequencies *p*_{DA}, *p*_{Da}, *p*_{dA}, and *p*_{da}, respectively) and the penetrances *f*_{g}=*P*(*affected*|*g*) for disease genotypes *g*∈{*dd*, *Dd*, *DD*}. As shown later, unrelated individuals do not allow the estimation of all these independent parameters. In samples that include only unrelated individuals, we assume that the disease and SNP loci are in complete LD (*r*^{2}=1), so that their allele frequencies are identical. The assumption that *r*^{2}=1 results in an identifiable model but no loss of statistical efficiency, since we can still extract maximum information from the available data.

By definition, the population prevalence of the disease *K*=*f*_{dd}*p*^{2}_{d}+2*f*_{Dd}*p*_{d}*p*_{D}+*f*_{DD}*p*^{2}_{D}, and the genotype relative risk (*GRR*)=*f*_{g}/*f*_{dd} for *g*∈{*Dd*,*DD*}. We allow LD between the candidate SNP and the unobserved disease alleles but assume linkage equilibrium between the flanking markers and the superlocus formed by combining the disease and SNP loci. We assume Hardy-Weinberg equilibrium in the general population for all markers, including the superlocus. We further assume that the disease phenotypes of the siblings are independent, given their genotypes at the disease locus, and that there is a single disease causal variant in the region. We investigate the impact of multiple disease variants in the “Simulations” section.
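The definitions of prevalence and genotype relative risk can be made concrete with a minimal sketch. The function names and the numerical penetrances below are hypothetical and serve only as an illustration; they are not taken from the models in table 2.

```python
def prevalence(f_dd, f_Dd, f_DD, p_D):
    """Population prevalence K under Hardy-Weinberg equilibrium."""
    p_d = 1.0 - p_D
    return f_dd * p_d**2 + 2.0 * f_Dd * p_d * p_D + f_DD * p_D**2


def genotype_relative_risks(f_dd, f_Dd, f_DD):
    """GRR = f_g / f_dd for the Dd and DD genotypes."""
    return {"Dd": f_Dd / f_dd, "DD": f_DD / f_dd}


# Hypothetical dominant model (illustrative numbers only).
K = prevalence(f_dd=0.04, f_Dd=0.08, f_DD=0.08, p_D=0.1)
grr = genotype_relative_risks(f_dd=0.04, f_Dd=0.08, f_DD=0.08)
```

With these illustrative inputs, both risk genotypes carry twice the baseline penetrance, and the prevalence is the Hardy-Weinberg weighted average of the three penetrances.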

For a sibship with *s* siblings, let

$$X = \{X_1, \ldots, X_k, X_{SNP}, X_{k+1}, \ldots, X_M\}$$

be the observed unordered marker genotypes, *Y* be the disease phenotypes, and *G* be the disease-SNP haplogenotypes. Let θ_{m} be the recombination fraction between markers *m* and *m*+1 (1⩽*m*⩽*M*-1). The inheritance pattern at marker *m* is completely described by a binary inheritance vector *v*_{m} of length 2*s*,^{11,12} whose entries indicate the outcome of the paternal and maternal meioses for the *s* siblings in the sibship. Let *v*_{D} and *v*_{SNP} denote the inheritance vectors at the disease locus and the candidate SNP, respectively. Complete linkage between the disease and SNP loci implies *v*_{D}≡*v*_{SNP}. For ease of computation, we assume there is no genetic interference, so that {*v*_{m}} forms a hidden Markov chain.

### Conditional Probability of Marker Data, Given Disease Phenotypes for a Sibship with *s* Siblings

We wish to evaluate *P*(*X*|*Y*), the conditional probability of marker genotypes *X*, given disease phenotypes *Y* for a sibship with *s* siblings. By the law of total probability,

$$P(X \mid Y) = \sum_{G} P(X, G \mid Y) = \sum_{G} \frac{P(X, G)\,P(Y \mid G)}{P(Y)}, \tag{1}$$

where the summation is taken over all disease-SNP haplogenotypes that are consistent with the observed SNP genotypes. Summing over all possible inheritance vectors at the disease locus and applying Baum’s^{13} forward and backward algorithms,

$$P(X, G) = \sum_{v_D} P(v_D)\,P(G \mid v_D)\,P(X_1, \ldots, X_k \mid v_D)\,P(X_{k+1}, \ldots, X_M \mid v_D), \tag{2}$$

where *k* and *k*+1 are the flanking markers on the left and right side of the candidate SNP. The summation over all possible inheritance vectors allows the handling of incomplete inheritance information and phase ambiguity by incorporating prior probabilities of the inheritance vectors. At any marker *m* (1⩽*m*⩽*M*),

and

The calculation of equation (2) requires three probabilities: (1) the prior probability of inheritance vector *v*_{D}*,* (2) the inheritance vector transition probability between two consecutive markers, and (3) the conditional probability of marker genotypes, given the inheritance vector at that marker. Clearly, the prior probability *P*(*v*_{D})=2^{-2s}.

The transition probability between inheritance vectors at markers *m* and *m*+1 can be obtained from the transition matrix, which is expressed as the Kronecker power of 2×2 transition matrices corresponding to transitions at each of the 2*s* meioses,

$$T(\theta_m) = \begin{pmatrix} 1-\theta_m & \theta_m \\ \theta_m & 1-\theta_m \end{pmatrix}^{\otimes 2s}.$$

For example, for a sib pair,

and

Let *O*^{dad}_{m} and *O*^{mom}_{m} represent the ordered genotypes of the father and the mother at marker *m.* In ordered genotypes, the maternal allele always precedes the paternal allele. Although observed genotypes are typically unordered, summing over ordered genotypes is computationally convenient, because, taken together, ordered genotypes for the founders and the inheritance vector specify the genotypes of all individuals in the pedigree. Thus, the conditional probability of sibship genotype *X*_{m}, given inheritance vector *v*_{m}, can be calculated as

$$P(X_m \mid v_m) = \sum_{O^{dad}_m} \sum_{O^{mom}_m} P(X_m \mid O^{dad}_m, O^{mom}_m, v_m)\,P(O^{dad}_m)\,P(O^{mom}_m),$$

where *P*(*X*_{m}|*O*^{dad}_{m},*O*^{mom}_{m},*v*_{m}) takes the value of 1 if the sibship’s genotype data *X*_{m} are consistent with the ordered parental genotypes *O*^{dad}_{m} and *O*^{mom}_{m} and the inheritance vector *v*_{m}*,* and 0 otherwise. The summation is taken over all ordered parental genotypes. *P*(*G*|*v*_{G}) can be calculated in a similar fashion, by regarding each haplogenotype as a genotype of the superlocus formed by combining the disease and SNP loci.
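The sum over ordered parental genotypes can be sketched by brute-force enumeration. This is a minimal illustration and not the authors' implementation: the function name is hypothetical, and the layout of the inheritance vector (for sibling *j*, one bit selecting which of the father's ordered alleles was transmitted, followed by one bit for the mother) is an assumption made for the example.

```python
from itertools import product


def sibship_genotype_prob(sib_genotypes, v, allele_freqs):
    """P(X_m | v_m): sum over ordered parental genotypes under HWE.

    sib_genotypes: one unordered genotype (pair of alleles) per sibling.
    v: inheritance vector of length 2s; v[2*j] picks the transmitted
       paternal allele and v[2*j + 1] the transmitted maternal allele
       for sibling j (assumed bit layout).
    allele_freqs: dict mapping allele label -> population frequency.
    """
    observed = [tuple(sorted(g)) for g in sib_genotypes]
    alleles = list(allele_freqs)
    total = 0.0
    for dad in product(alleles, repeat=2):       # ordered paternal genotype
        for mom in product(alleles, repeat=2):   # ordered maternal genotype
            implied = [
                tuple(sorted((dad[v[2 * j]], mom[v[2 * j + 1]])))
                for j in range(len(observed))
            ]
            if implied == observed:              # consistency indicator is 1
                total += (allele_freqs[dad[0]] * allele_freqs[dad[1]]
                          * allele_freqs[mom[0]] * allele_freqs[mom[1]])
    return total
```

For a sib pair that is heterozygous at a marker with two equally frequent alleles, the sketch returns 0.5 when the siblings share both parental alleles and 0.25 when they share none, which illustrates how the inheritance vector changes the probability of the same observed genotypes.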

Recursive calculation of *L*_{m}(*v*_{m}) and *R*_{m}(*v*_{m}) with use of these three probabilities allows equation (2) to be evaluated in a manner linear in the number of marker loci *M.* Equation (2) is an extension of the retrospective likelihood calculation for ASPs described by Li et al.^{8} Here, the sibship size can be >2, and siblings can be either affected or unaffected. Our likelihood calculation easily allows for missing genotypes. For example, to accommodate sibships in which only a subset of the siblings is genotyped at the candidate SNP, we sum over all possible SNP genotypes for those siblings of known disease status but with missing SNP genotypes. It is essential to include all these members, because siblings with known phenotypes but missing genotypes contribute association information.
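The inheritance-vector transition probabilities used in this recursion can be sketched without forming Kronecker products explicitly. Because the 2*s* meioses recombine independently, each entry of the Kronecker-power matrix equals θ^h(1−θ)^(2s−h), where h is the number of positions at which the two inheritance vectors differ. The function names below are hypothetical, and the sketch is only an illustration under the no-interference assumption.

```python
from itertools import product


def transition_prob(v_from, v_to, theta):
    """P(v_{m+1} = v_to | v_m = v_from) for recombination fraction theta."""
    n_meioses = len(v_from)
    n_recomb = sum(a != b for a, b in zip(v_from, v_to))  # Hamming distance
    return theta**n_recomb * (1.0 - theta)**(n_meioses - n_recomb)


def transition_matrix(s, theta):
    """Full 2^(2s) x 2^(2s) matrix, with states in lexicographic order.

    Entry-wise this matches the 2s-fold Kronecker power of
    [[1 - theta, theta], [theta, 1 - theta]].
    """
    states = list(product((0, 1), repeat=2 * s))
    return [[transition_prob(a, b, theta) for b in states] for a in states]


# Sib pair (s = 2): 16 inheritance vectors, so a 16 x 16 matrix.
T = transition_matrix(s=2, theta=0.1)
```

Each row of the resulting matrix sums to 1, as required for a transition matrix. The explicit matrix grows as 4^s × 4^s, so the sketch is practical only for small sibships.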

Our calculation can be readily extended to accommodate parental genotypes. Following the derivation of equation (2), the critical part in the calculation is the conditional probability of marker genotypes for the siblings and their parents, given the inheritance vector at a particular marker. Let *X*^{dad}_{m} and *X*^{mom}_{m} represent the observed unordered parental genotypes at marker *m.* Then the conditional probability of the observed genotypes given the inheritance vector at marker *m* is

$$P(X_m, X^{dad}_m, X^{mom}_m \mid v_m) = \sum_{O^{dad}_m} \sum_{O^{mom}_m} P(X_m \mid O^{dad}_m, O^{mom}_m, v_m)\,P(O^{dad}_m)\,P(O^{mom}_m),$$

where the summation is taken over all ordered parental genotypes that are consistent with the observed unordered parental genotypes. This extension enables us to analyze nuclear families with genotyped parents, including parent-affected offspring trios, which are the basic sampling units used by the transmission/disequilibrium test.^{14}

Under the assumption that the disease phenotypes are independent given the genotypes at the disease locus, *P*(*Y*|*G*) is the product of simple functions of penetrances. An affected sibling *j* (1⩽*j*⩽*s*) with disease-SNP haplogenotype *G*_{j} contributes a term *f*_{Gj}, and an unaffected sibling *j* contributes a term 1-*f*_{Gj}. By the law of total probability, the probability of the disease phenotypes for the sibship is

$$P(Y) = \sum_{G} P(Y \mid G)\,P(G),$$

with the summation taken over all possible haplogenotypes.

Substituting equation (2), *P*(*Y*|*G*), and *P*(*Y*) into equation (1), we obtain the conditional probability for the sibship *P*(*X*|*Y*) as a function of model parameters {*f*_{dd},*f*_{Dd},*f*_{DD},*p*_{DA},*p*_{Da},*p*_{dA}}.
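The product form of *P*(*Y*|*G*) can be sketched as follows. The function name and the encoding of disease genotypes as counts of the D allele are assumptions made for this illustration.

```python
def phenotype_prob_given_genotypes(affected, n_D_alleles, penetrance):
    """P(Y | G), assuming phenotypes are independent given disease genotypes.

    affected: list of booleans, one per sibling.
    n_D_alleles: number of D alleles (0, 1, or 2) carried by each sibling.
    penetrance: dict {0: f_dd, 1: f_Dd, 2: f_DD}.
    """
    prob = 1.0
    for is_affected, g in zip(affected, n_D_alleles):
        f = penetrance[g]
        prob *= f if is_affected else (1.0 - f)  # f_g or 1 - f_g
    return prob


# Hypothetical discordant sib pair: affected Dd sibling, unaffected dd sibling.
p = phenotype_prob_given_genotypes(
    affected=[True, False],
    n_D_alleles=[1, 0],
    penetrance={0: 0.04, 1: 0.08, 2: 0.08},
)
```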

In the calculation of *P*(*Y*|*G*) and *P*(*Y*), we assume that the disease statuses of the siblings are conditionally independent, given their genotypes at the disease locus. This assumption is exactly true only when there are no other genetic or environmental risk factors shared among the siblings. If the disease is influenced by multiple disease variants, then the calculation will depend on genotypes at the other disease loci as well. For example, if the disease is influenced by two unlinked disease loci, then

where subscripts 1 and 2 denote the two unlinked disease loci.

### Conditional Probability of Marker Data, Given Disease Phenotype for a Single Individual

In principle, equation (1) can be applied to singleton individuals who can be regarded as sibships with one sibling. However, data sets containing solely unrelated individuals do not allow the estimation of all our model parameters. In this case, we assume that the disease and SNP loci are in complete LD, so that *p*_{D}=*p*_{A}, and we reparameterize our model. For case-control data,

$$P(X_{SNP} \mid Y) = \frac{P(Y \mid X_{SNP})\,P(X_{SNP})}{\sum_{x} P(Y \mid x)\,P(x)},$$

which is a function of {*f*_{dd},*f*_{Dd},*f*_{DD},*p*_{A}}. For a sample of unrelated cases, *P*(*X*_{SNP}|*Y*) is simply a function of the two SNP genotype frequencies, *P*(*AA*|*case*) and *P*(*Aa*|*case*). For studies that involve only unrelated individuals, flanking markers do not contribute information on association; therefore, we need to consider only the SNP genotypes. It is worth noting that, for SNPs that are in incomplete LD with the disease locus, the genetic effect will be underestimated; however, there is no loss of efficiency for the association test.
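For unrelated individuals, the conditional SNP genotype distribution under complete LD can be sketched directly from Bayes' rule. The function name and the numerical inputs are hypothetical; the sketch assumes that allele A tags allele D perfectly and that genotypes follow Hardy-Weinberg proportions in the general population.

```python
def snp_genotype_probs_given_status(f_dd, f_Dd, f_DD, p_A, affected):
    """P(X_SNP | Y) for one unrelated individual when r^2 = 1 (A tags D)."""
    p_a = 1.0 - p_A
    prior = {"aa": p_a**2, "Aa": 2.0 * p_A * p_a, "AA": p_A**2}  # HWE
    pen = {"aa": f_dd, "Aa": f_Dd, "AA": f_DD}
    # P(Y | genotype): penetrance for cases, its complement for controls.
    lik = {g: (pen[g] if affected else 1.0 - pen[g]) for g in prior}
    norm = sum(lik[g] * prior[g] for g in prior)  # P(Y)
    return {g: lik[g] * prior[g] / norm for g in prior}


cases = snp_genotype_probs_given_status(0.04, 0.08, 0.08, 0.1, affected=True)
controls = snp_genotype_probs_given_status(0.04, 0.08, 0.08, 0.1, affected=False)
```

For cases, the normalizing constant is the prevalence *K*; for controls, it is 1−*K*.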

### Pooling across Different Sampling Units

A key advantage of our likelihood calculation is that it allows the joint analysis of different sampling units in a unified statistical framework, which leads to more efficient use of the available data. The retrospective likelihood for data that contain *N* independent sibships, which may be of different sizes and disease phenotype configurations, is

$$L = \prod_{i=1}^{N} P(X_i \mid Y_i).$$

Here, we choose to use a retrospective likelihood, since the sibships are ascertained through disease status. Using a retrospective likelihood avoids the problem of ascertainment bias and provides parameter estimates that are valid for the general population.^{15,16} In addition, it ensures that our test remains valid even if there are additional genetic or environmental factors that induce correlation between family members.

### Test of Association

We wish to evaluate whether a SNP is associated with the putative disease locus. Under the null hypothesis of no association, the SNP and the disease locus are in linkage equilibrium, and the disease-SNP haplotype frequencies are the product of the corresponding disease and SNP allele frequencies (for example, *p*_{DA}=*p*_{D}*p*_{A}). In this case, parameters that need to be estimated are {*f*_{dd},*f*_{Dd},*f*_{DD},*p*_{D},*p*_{A}}, and we set

$$p_{DA} = p_D\,p_A, \qquad p_{Da} = p_D\,(1-p_A), \qquad p_{dA} = (1-p_D)\,p_A.$$

Under the alternative hypothesis, we maximize a total of six parameters {*f*_{dd},*f*_{Dd},*f*_{DD},*p*_{DA},*p*_{Da},*p*_{dA}}. For data including only ASPs or only unrelated individuals, these parameters are not all identifiable, and we reparameterize the likelihood as described by Li et al.^{8} or maximize a subset of the parameters as detailed in table 1. We perform this maximization using a simplex algorithm,^{17} an optimization method that does not require derivatives.
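The relationship between the haplotype-frequency parameters and the LD measure *r*^{2} can be sketched as follows. The function names are hypothetical. The expression *r*^{2}=*D*^{2}/(*p*_{D}*p*_{d}*p*_{A}*p*_{a}), with *D*=*p*_{DA}−*p*_{D}*p*_{A}, is the standard definition and is assumed here to match the one used in the text.

```python
def null_haplotype_freqs(p_D, p_A):
    """Haplotype frequencies under linkage equilibrium (r^2 = 0)."""
    return {
        "DA": p_D * p_A,
        "Da": p_D * (1.0 - p_A),
        "dA": (1.0 - p_D) * p_A,
        "da": (1.0 - p_D) * (1.0 - p_A),
    }


def r_squared(hap):
    """r^2 between the disease locus and the SNP from haplotype frequencies."""
    p_D = hap["DA"] + hap["Da"]
    p_A = hap["DA"] + hap["dA"]
    D = hap["DA"] - p_D * p_A          # LD coefficient
    return D * D / (p_D * (1.0 - p_D) * p_A * (1.0 - p_A))


r2_null = r_squared(null_haplotype_freqs(0.1, 0.3))
r2_full = r_squared({"DA": 0.1, "Da": 0.0, "dA": 0.0, "da": 0.9})
```

The first call illustrates the null model, for which *r*^{2} is zero up to rounding error; the second illustrates complete LD, for which only the DA and da haplotypes occur.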

Following Li et al.,^{8} we use a likelihood-ratio statistic to test for association. We compare the likelihood maximized under the general model (0⩽*r*^{2}⩽1), $\hat{L}_{LD}$, with the likelihood maximized under the null model (*r*^{2}=0), $\hat{L}_{LE}$, using the likelihood-ratio statistic $T_{LE} = 2\ln(\hat{L}_{LD}/\hat{L}_{LE})$. Parameters associated with each model for the different sampling units and the corresponding parameter constraints are summarized in table 1. For data sets that contain only unrelated cases and controls, our association test is similar to the unconstrained genotype test proposed by Thompson et al.,^{18} except that we do not assume known disease prevalence. Our test is also similar to the goodness-of-fit test proposed by Wittke-Thompson et al.^{19}

In principle, the asymptotic distribution of *T*_{LE} under the null hypothesis can be approximated by mixture of χ^{2} distributions,^{20} but we have not derived the degrees of freedom and mixing parameters because of the complexity of parameter constraints and boundaries. Instead, we assess significance of the test statistic empirically by simulating marker genotypes under the null hypothesis and comparing the observed statistic with the simulated null distribution.

Under the null hypothesis, we sample SNP genotypes for a sibship conditional on their observed flanking-marker genotypes and parameter estimates for the linkage equilibrium model. We leave flanking-marker genotypes unchanged from their observed values. For a single individual, we sample the SNP genotype according to the estimated SNP genotype frequencies. The null distribution of *T*_{LE} can be obtained by calculating the statistic for a large number of simulated data sets.
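The empirical assessment of significance can be sketched as follows. The function names are hypothetical, the statistic is assumed to take the usual form of twice the difference in maximized log likelihoods, and the *P* value is taken as the plain proportion of simulated null statistics at least as large as the observed one, which is one common convention.

```python
def likelihood_ratio_statistic(loglik_general, loglik_null):
    """Twice the difference in maximized log likelihoods."""
    return 2.0 * (loglik_general - loglik_null)


def empirical_p_value(t_obs, t_null):
    """Proportion of simulated null statistics that are >= the observed one."""
    exceed = sum(1 for t in t_null if t >= t_obs)
    return exceed / len(t_null)


# Hypothetical log likelihoods and a tiny made-up null sample.
t_obs = likelihood_ratio_statistic(loglik_general=-100.0, loglik_null=-103.5)
p_emp = empirical_p_value(t_obs, t_null=[0.4, 1.2, 2.9, 7.5, 0.1])
```

In practice the null sample would contain thousands of simulated statistics, since the smallest attainable nonzero *P* value is the reciprocal of the number of replicates.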

### Study Designs for Test of Genetic Association

Our likelihood calculation allows the analysis of sibships of arbitrary size and disease-phenotype configuration, including unrelated affected or unaffected individuals, ASPs, and discordant sib pairs (DSPs). For ease of presentation, we consider only sibships of size ⩽2. To construct different study designs, we select either (1) one or two cases from each ASP or (2) unrelated affected individuals. We use either cases only or select controls from unrelated unaffected individuals or unaffected siblings. Different combinations of cases and controls result in seven study designs (fig. 1). It is worth noting that both the one sibling per ASP–control design and the case-control design use unrelated affected and unaffected individuals. The difference is that, for the one sibling per ASP–control design, the cases are selected from ASPs, whereas, in the case-control design, the cases are randomly selected from the general population.

Given fixed genotyping resources, it is important to know which study designs are the most powerful for detecting disease-SNP association. Since disease-mapping studies often start from linkage analysis and since flanking-marker genotypes often are already available, for these studies, we compare the efficiency of different study designs by fixing the total number of individuals to be genotyped at the candidate SNP, and we do not account for the cost or effort associated with collecting flanking-marker data.

### Simulations

We performed a set of simulations to evaluate the efficiency of different study designs and to compare the statistical power of our test with other existing association tests. Table 2 describes the single-locus disease models that we considered, which varied over a range of attributable fractions, disease allele frequencies, and GRRs. We set the locus-specific sibling recurrence risk ratio λ_{s}^{21} to 1.02.
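The locus-specific sibling recurrence risk ratio for a single diallelic locus can be computed from the classical variance-components expression λ_{s}=1+(*V*_{A}/2+*V*_{D}/4)/*K*^{2}, where *V*_{A} and *V*_{D} are the additive and dominance variances of the penetrance. The text does not spell out the formula it used, so the sketch below assumes this standard result; the function name is hypothetical.

```python
def sibling_recurrence_risk_ratio(p_D, f_dd, f_Dd, f_DD):
    """Locus-specific lambda_s for a diallelic locus under HWE."""
    p, q = p_D, 1.0 - p_D
    K = f_dd * q * q + 2.0 * f_Dd * p * q + f_DD * p * p   # prevalence
    # Additive and dominance variance components of the penetrance.
    V_A = 2.0 * p * q * (p * (f_DD - f_Dd) + q * (f_Dd - f_dd)) ** 2
    V_D = (p * q * (f_DD - 2.0 * f_Dd + f_dd)) ** 2
    return 1.0 + (0.5 * V_A + 0.25 * V_D) / K**2


# Fully penetrant recessive disease with allele frequency 0.5.
lam = sibling_recurrence_risk_ratio(p_D=0.5, f_dd=0.0, f_Dd=0.0, f_DD=1.0)
```

For the fully penetrant recessive example, the sibling of an affected individual is affected with probability ((1+*p*_{D})/2)^{2}=0.5625, which agrees with λ_{s}*K*=2.25×0.25.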

When simulating the data, we assumed that the disease and SNP allele frequencies are identical, in contrast to our model in which these frequencies are allowed to differ. Setting these frequencies to be equal allowed us to compare the efficiency of different study designs over a broad range of LD (0⩽*r*^{2}⩽1) between the disease alleles and the SNP. We assumed a map of 10 markers, each with four equally frequent alleles (heterozygosity *H*=0.75) evenly spaced at 11.16-cM intervals, corresponding to θ=0.1 under Haldane’s^{22} no-interference map function. We centered the disease locus and candidate SNP in the middle of the map and assumed zero recombination between them. The disease locus genotypes were removed prior to data analysis. For each of the disease models in table 2, we simulated 5,000 replicate data sets for each design under linkage equilibrium, to estimate the null distribution. We next simulated 2,000 replicate data sets with various levels of LD, to assess the empirical power of our association test.
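The marker-map settings can be checked with a short sketch of Haldane's map function, θ=(1−e^{−2d})/2 with *d* in Morgans, and of marker heterozygosity, *H*=1−Σ*p*_{i}^{2}. The function names are hypothetical.

```python
import math


def haldane_theta(distance_cM):
    """Recombination fraction for a map distance given in centimorgans."""
    d = distance_cM / 100.0                    # convert to Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def haldane_distance_cM(theta):
    """Inverse map function: centimorgans for a recombination fraction."""
    return -50.0 * math.log(1.0 - 2.0 * theta)


def heterozygosity(allele_freqs):
    """Expected heterozygosity under Hardy-Weinberg equilibrium."""
    return 1.0 - sum(p * p for p in allele_freqs)


theta = haldane_theta(11.16)                 # close to 0.1
spacing = haldane_distance_cM(0.1)           # close to 11.16 cM
H = heterozygosity([0.25, 0.25, 0.25, 0.25])
```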

To examine the impact of multilocus inheritance on the relative efficiency of the case-control design and the one sibling per ASP–control design, we also simulated data sets using the additive multilocus disease models for which the multilocus penetrance is the total of the penetrance summands as defined by Risch.^{23} For example, given *L* unlinked diallelic loci contributing to susceptibility in a recessive manner, the penetrance for each genotype is

$$f = f_{base} + \sum_{l=1}^{L} \Delta_l\, I_l.$$

Here, *f*_{base} is the baseline penetrance for the genotype containing no disease-predisposing genotypes, Δ_{l} is the increment in penetrance for the disease-predisposing genotype at locus *l,* and *I*_{l} is an indicator of whether the individual is homozygous for the disease-predisposing allele at locus *l.*
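The additive multilocus penetrance for recessive loci can be sketched as follows. The function name, the encoding of genotypes as counts of the predisposing allele, and the numerical values are assumptions made for this illustration.

```python
def additive_recessive_penetrance(f_base, deltas, n_risk_alleles):
    """f = f_base + sum over loci of delta_l * I_l.

    deltas: penetrance increment for each locus.
    n_risk_alleles: count (0, 1, or 2) of predisposing alleles at each locus;
        the indicator I_l is 1 only for homozygotes (count of 2).
    """
    return f_base + sum(
        delta for delta, n in zip(deltas, n_risk_alleles) if n == 2
    )


# Hypothetical two-locus genotype: homozygous at locus 1, heterozygous at locus 2.
f = additive_recessive_penetrance(0.04, deltas=[0.1, 0.2], n_risk_alleles=[2, 1])
```

Only the first locus contributes in this example, because the individual is a heterozygote at the second locus.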

We simulated data sets assuming *L*⩾2 unlinked diallelic disease loci, each with predisposing allele frequency of 0.1. We simulated an associated SNP with minor-allele frequency of 0.1 completely linked to the first disease locus, which we call the test locus. We considered three scenarios: (1) increasing the locus-specific λ_{s} at the test locus from 1.02 to 1.25 but fixing the locus-specific λ_{s} at the remaining background disease loci at 1.02, (2) fixing the locus-specific λ_{s} at each disease locus at 1.02 and increasing the number of disease loci from 2 to 10, and (3) increasing the locus-specific λ_{s} at one of the background disease loci from 1.02 to 1.7 but fixing the locus-specific λ_{s} at the remaining disease loci, including the test locus, at 1.02. We fixed the disease prevalence at 5% in all scenarios. All disease genotypes were removed prior to data analysis. Precise details of the penetrances are available in the appendix.

## Results

In this section, we compare power of different study designs when the number of individuals to be genotyped is fixed under single-locus disease models. Further, we examine designs with familial cases and singleton cases under multilocus disease models. We also evaluate the usefulness of flanking markers, compare our approach with other tests of association, and illustrate how to combine data from different family structures.

### Power Comparisons of Different Study Designs

For each of the 12 disease models in table 2, we estimated the empirical power of the seven study designs for test of association at four levels of disease-SNP LD (*r*^{2}=0.25, 0.50, 0.75, and 1). We ranked each study design by its estimated power assessed at the 1% empirical significance level, so that the most powerful design has rank 1 and the least powerful design has rank 7. Each study design was ranked 12×4=48 times. Figure 2 displays the histograms of ranks for each study design. Our simulation results for the single-locus models indicate that, for a fixed number of SNP genotypes, the one sibling per ASP–control design is usually most powerful (*rank*=1 in 41 of 48 settings, average rank = 1.31), followed by the ASP-control design. For all 12 single-locus disease models we considered, the case-control design is less powerful than designs that include familial cases. In addition, we found that the DSP design is always less powerful than designs that include population controls. For a fixed genotyping effort, we also found that, under common dominant (*p*_{D}=0.7) and rare recessive (*p*_{D}=0.1) models, designs including only affected individuals can be more powerful than designs that also include unaffected individuals. Nevertheless, we generally do not advocate such designs, since they are more vulnerable to genotyping error and deviations from Hardy-Weinberg equilibrium. Our results suggest that the rankings were similar when *r*^{2}=1 and when *r*^{2}=0.25, and no designs behave better or worse at these two extremes.

[Figure 2: histograms of ranks for each study design; *K*=5% and sibling recurrence risk ratio λ_{s}=1.02.]

Given a set of ASPs, an investigator may initially genotype candidate SNPs in only one sibling per ASP, halving genotyping costs on the cases. We compared the power of the ASP-control design with that of the one sibling per ASP–control design, where the latter uses only one sibling per ASP from the ASPs generated for the previous design (table 3). We found that the loss of power by genotyping only one sibling per ASP generally is modest. This suggests that, for an initial screen of SNPs, it may be cost effective to initially genotype only one sibling per ASP, with genotyping of the other siblings performed only when a candidate SNP shows at least suggestive evidence of association.

### Impact of Multiple Disease Loci

We analyzed simulated data sets that included multiple disease loci. We compared the power of the case-control design and the one sibling per ASP–control design, since these designs are typically among the most powerful and represent a choice commonly faced by investigators—namely, whether to collect familial cases or unrelated cases. Figure 3 indicates that, for the five-locus additive disease models that we considered, where the background untested disease loci have small effect (λ_{s}=1.02), both study designs have increasing power as the effect of the test locus increases. Further, the increment in power is more pronounced for the one sibling per ASP–control design. Similar patterns were observed for models for which all disease loci are dominant or recessive and for models with larger numbers of disease loci (data not shown).


We also investigated the impact of multiple disease loci when all loci have the same effect (fig. 4). We found that the case-control design has approximately constant power across different numbers of disease loci, whereas the power of the one sibling per ASP–control design decreases as the number of disease loci increases, corresponding to greater familial aggregation. A similar finding was reported by Risch,^{9} who found that, when the sibling residual correlation is high, multiplex affected sibships and familial cases sometimes provide a smaller advantage over randomly selected cases. Although the advantage of familial cases diminishes as the number of disease loci increases, we found that the one sibling per ASP–control design remained more powerful than the case-control design for all the disease models that we considered.


As expected, for disease models for which the test locus has a fixed small effect (λ_{s}=1.02), we found that the power of the case-control design is not influenced by the effect size of the untested background locus (fig. 5) when the disease prevalence is fixed. In contrast, the power of the one sibling per ASP–control design decreases as the effect size of the background locus increases. Generally, the one sibling per ASP–control design has greater power than the case-control design, but the case-control design becomes more powerful when the effect of the background locus is very large (λ_{s}>1.6 in our simulations) relative to the test locus (λ_{s}=1.02 in our simulations).

### Improvement of Power by Including Flanking Markers

In sib pair samples, our method makes use of genotypes on flanking markers, which may provide valuable information about the underlying disease models, especially when these markers are closely linked to the unobserved disease locus. To assess the utility of flanking-marker data on our test of association, we repeated the power estimation procedure for the ASP design and the one sibling per ASP design. Table 4 suggests that the flanking-marker data can substantially increase the association power for dominant and additive models. Our results also indicate that the flanking markers are more useful for SNPs in which one allele is rare.

### Comparison with Other Tests of Association

We compared the power of our test with other tests of association. The significance for all tests was determined empirically by simulating null distributions and obtaining critical values from them. For the case-control design, we compared our test with Pearson’s χ^{2}_{2} statistic for the 2×3 table of genotype frequencies in cases and controls. For the ASP-control design, we compared our test with Risch and Teng’s^{5} test

where *n* is the number of sibships, each with *r* affected sibs, and *u* is the number of unrelated controls matched to each sibship,

and and are the estimated SNP allele frequency for the ASPs and the controls, respectively. For the design with 250 ASPs and 500 controls, *r*=2, *u*=2, and *n*=250.

Table 5 shows that, for most of the models we considered and especially for recessive models, our test has greater power than Pearson’s χ^{2}_{2} test. Our test is less powerful than Risch and Teng’s test for the additive models examined. This is because Risch and Teng’s test is a 1-df test, whereas our test relies on the estimation of several disease model parameters, resulting in more degrees of freedom. We found that, when a prespecified disease model (e.g., additive, dominant, or recessive) was assumed by imposing constraints on the penetrances estimated in our model, the power of the two tests became comparable (data not shown). This suggests that fixing some disease model parameters is likely to improve the power if these parameters can be approximated from previous studies. Note that neither our test nor Risch and Teng’s test controls for population stratification.

### Combining Data from Different Family Structures

A key advantage of our method is its ability to combine data from different sampling units. In many association studies, particularly those that follow an initial linkage study, an investigator may have different sampling units available. For example, the data may contain nuclear families with different numbers of genotyped parents and affected and unaffected siblings collected for the initial linkage analysis and unrelated affected or unaffected individuals from additional sampling. A simple strategy for analyzing such data would be to use all unrelated affected individuals and one affected sibling per sibship to form the case group and then use all unrelated unaffected individuals to form a control group. However, this does not use all available data and can give variable results, depending on which affected siblings are selected.^{7}

To assess the gain in power by using all available data simultaneously, we simulated different combinations of ASPs, DSPs, unrelated cases, and unrelated controls. We compared the power of our test when using all available data with tests that use only partial data obtained by selecting one sibling per sampling unit. Table 6 suggests that there could be a substantial loss of power when only a subset of the data is used. As expected, when the proportion of data being used decreases, the loss of power increases, suggesting that when the majority of the data are sampled from sibships or families, it is important to use all available data.

As with the choice between genotyping all affected siblings and genotyping only one sibling per ASP, we found that including the genotypes of all affected family members increases power. When it is not cost-effective to do this additional genotyping for all markers, it could be considered for a follow-up phase.

## Discussion

We have developed a unified likelihood framework to test for disease-marker association that allows the analysis of sibships of arbitrary size and disease-phenotype configuration. Our likelihood calculations allow us to accommodate different association-study designs and to compare their efficiencies. By use of simulation studies, we found that when the number of individuals to be genotyped at the candidate SNP is fixed, for single-locus models, the one sibling per ASP–control design was generally most powerful, followed by the ASP-control design. As others have noted^{6}, we also found that familial cases contributed more association information than did singleton cases and that the DSP design was less powerful than designs that include unrelated unaffected individuals. This pattern holds for disease prevalence 2%⩽*K*⩽20%, with similar relative efficiency for the seven study designs that we considered (data not shown). Additional simulations reveal that our conclusions regarding the relative efficiency of different study designs remain unchanged at more-stringent critical values (α=.0001, .00001, and .000001).

In most of our simulations, we generated data in which allele frequencies were identical at the disease and SNP loci. To evaluate the robustness of our model to differences in allele frequency between the two loci, we conducted additional simulations over a broad range of allele-frequency differences. We considered combinations of *p*_{D} ∈{.1,.3,.5,.7} and *p*_{A} ∈{.1,.3,.5,.7} for dominant, additive, and recessive models. Results of these additional simulations suggest that the relative efficiency of different study designs remains unchanged, although all designs have low power when the allele frequencies are very different, since *r*^{2} is low.
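The link between allele-frequency mismatch and low *r*^{2} can be illustrated with the standard upper bound on *r*^{2} for two diallelic loci. The sketch below is a minimal illustration, not part of the authors' software: the function name `max_r_squared` is hypothetical, and it assumes that the disease allele and the SNP allele whose frequencies are given are positively associated (*D* > 0).

```python
def max_r_squared(p_d, p_a):
    """Upper bound on r^2 between a disease allele (frequency p_d) and a
    SNP allele (frequency p_a), assuming the two alleles are positively
    associated (D > 0). Hypothetical helper for illustration only."""
    q_d, q_a = 1.0 - p_d, 1.0 - p_a
    # Largest admissible positive disequilibrium coefficient D.
    d_max = min(p_d * q_a, q_d * p_a)
    # r^2 = D^2 / (p_d * q_d * p_a * q_a)
    return d_max ** 2 / (p_d * q_d * p_a * q_a)


if __name__ == "__main__":
    # Grid of allele frequencies used in the additional simulations.
    freqs = (0.1, 0.3, 0.5, 0.7)
    for p_d in freqs:
        row = ["%.3f" % max_r_squared(p_d, p_a) for p_a in freqs]
        print(p_d, row)
```

Under these assumptions, equal allele frequencies allow *r*^{2} to reach 1, whereas *p*_{D}=.1 and *p*_{A}=.7 cap *r*^{2} at about 0.048, however strong the disequilibrium.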

Our results show that the proposed test is usually more powerful than Pearson’s χ^{2} test for the case-control design. One reason for this advantage is that our test uses an explicit genetic model for the disease, whereas Pearson’s χ^{2} test is nonparametric in nature. Our results are consistent with those of Thompson et al.,^{18} who showed that even simple modeling assumptions, such as assuming Hardy-Weinberg equilibrium in the general population, increase the power of genetic-association studies.

Our method does not depend on transmission disequilibrium and can incorporate parental genotypes when available. To evaluate the potential gain in power afforded by collecting parental genotypes, we generated data sets with 500 controls and 500 parent–affected offspring trios for disease models (table 2). We analyzed the data, first taking into account only genotypes for the 500 unrelated cases and controls (average *power*=39%; α=.01) and then also incorporating parental genotypes (average *power*=54%, α=.01). We expect that parental data will be less useful on a per-genotype basis but will still provide useful information on allelic association.

Our method assumes that the superlocus formed by combining the disease and SNP loci is in Hardy-Weinberg equilibrium in the general population. In the presence of population stratification, the Hardy-Weinberg equilibrium assumption may be violated and our test may be invalid. An important step for avoiding population stratification is to carefully match cases and controls on the basis of their genetic background. When the degree of stratification is small, it may be possible to adjust our test statistics with genomic control^{24} or a similar strategy.

We initially assumed that there is a single disease-predisposing variant in the region. As others have noted,^{5} under this assumption, familial cases tend to be enriched for the disease-predisposing allele and thus create a stronger contrast with unaffected individuals. For diseases that are influenced by multiple genes, the advantage of familial cases will depend on the underlying disease models. Our results indicate that, for disease models for which the test locus has equal or stronger effect than the remaining background disease loci, familial cases provide more association information than do randomly selected cases. This remains true unless the effect size of the test locus is much smaller (e.g., λ_{s}=1.02) than at least one other untested disease locus (e.g., λ_{s}=1.6). A similar pattern was observed by Howson et al.^{10} for additive and crossover two-locus disease models and by Allison et al.^{25} for extreme sampling in quantitative trait linkage/association studies.

Our findings have important implications for genetic-association studies of many complex diseases, such as depression and schizophrenia, for which loci of large effect have not been identified. For such diseases, designs with familial cases are likely to be a good choice for the initial association studies. One might consider genotyping additional affected family members for those markers that show suggestive evidence of association. Our findings also have implications for diseases for which a major gene is known to play a role, such as many autoimmune disorders, for which a strong human leukocyte antigen effect has been demonstrated, and age-related macular degeneration, for which two major loci have been identified.^{1,26–30} For these diseases, the standard case-control design might be preferred for detecting genes that contribute only a small fraction of the overall disease risk.

Enabled by improvements in genotyping technologies, association studies are beginning to be conducted genomewide.^{1,2} We believe our method will be useful for analyzing the results of these studies. Nevertheless, applying our method to hundreds of thousands of markers may present a computational challenge, because it relies on an iterative procedure to maximize the likelihood of the data under alternative models. If computational resources are limited, one option is to first screen all markers with a computationally inexpensive test and then apply our method to markers that show suggestive evidence of association.
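The two-stage strategy can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which inexpensive test to use, so the allelic χ^{2} test on a 2×2 table of allele counts is our choice, and the function names, the input layout, and the screening threshold (3.841, the 1-df χ^{2} critical value at α=.05) are all hypothetical. The allelic test also treats alleles as independent, so with sibship data it serves only as a rough filter.

```python
def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic (1 df) for a 2x2 table of allele counts.
    Used here as an assumed example of a computationally cheap screen."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case, row_ctrl = case_alt + case_ref, ctrl_alt + ctrl_ref
    col_alt, col_ref = case_alt + ctrl_alt, case_ref + ctrl_ref
    denom = row_case * row_ctrl * col_alt * col_ref
    if denom == 0:
        return 0.0  # monomorphic marker or empty group: no evidence
    return n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denom


def screen_markers(allele_counts, threshold=3.841):
    """Return the markers whose cheap statistic reaches the threshold.
    allele_counts maps marker name -> (case_alt, case_ref, ctrl_alt, ctrl_ref)."""
    return [name for name, counts in allele_counts.items()
            if allelic_chi_square(*counts) >= threshold]


if __name__ == "__main__":
    # Made-up counts for two hypothetical markers.
    counts = {"snp1": (60, 40, 40, 60), "snp2": (50, 50, 50, 50)}
    candidates = screen_markers(counts)
    # Only the candidates would be passed to the likelihood-based test.
    print(candidates)
```

A lenient first-stage threshold keeps more markers for the second stage at the price of more likelihood maximizations, so the threshold is a tuning choice rather than a fixed rule.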

In this article, we focused on comparing efficiency of different study designs when the genotyping cost is fixed. Although familial cases provide more association information than do singleton cases in most settings we considered, familial cases (if not already sampled) are typically more difficult to collect and hence may result in higher phenotyping costs. It would be interesting to investigate the relative efficiency of familial cases and singleton cases, taking into account both genotyping and phenotyping costs, with the goal of minimizing total study cost.

In summary, we have developed a unified statistical framework to test for disease-marker association, using sibships of arbitrary size and disease-phenotype configuration. Our method can be readily extended to allow general pedigrees. We compared the efficiency of seven study designs when the number of individuals to be genotyped at the candidate SNP is fixed. Our results suggest that familial cases are more advantageous than are randomly selected cases when the disease follows a single-locus model. This also appears to be true for multilocus disease models, unless the effect size of the test locus is much smaller than that of at least one untested disease locus. On a cost basis, genotyping one sibling per affected sibship and using existing flanking-marker information provides a powerful design for initial association studies. We believe our findings will be helpful for researchers designing and analyzing complex disease–association studies and will increase power and facilitate genotyping resource allocation. We implemented our method in a C++ program, which can be downloaded from the University of Michigan Center for Statistical Genetics Web site.

## Acknowledgments

This research was supported by National Institutes of Health grants HG00376 (to M.B.) and HG02651 and EY12562 (to G.R.A.). M.L. was previously supported by a University of Michigan Rackham predoctoral fellowship. We gratefully thank two anonymous reviewers for their valuable comments.

## Appendix: Parameters for Multilocus Disease Models

Assume the disease is influenced by *L* unlinked diallelic disease loci. For locus *l* (1⩽*l*⩽*L*), let *D*_{l} denote the disease-predisposing allele and *d*_{l} denote the low-risk allele. Let *f*_{base} denote the baseline penetrance for the genotype in which all disease loci are homozygous for the low-risk allele. For an individual whose genotype at locus *l* is one of {*d*_{l}*d*_{l}, *D*_{l}*d*_{l}, *D*_{l}*D*_{l}}, let *g*_{l} denote the genotype score that counts the number of *D*_{l} alleles. Further, assume that the penetrance is increased over the baseline by *w*(*g*_{l}). The increment of penetrance depends on the marginal disease model at the corresponding locus. For example, for additive, dominant, and recessive models, *w*(*g*_{l}) can be defined as shown in table A1. For an individual with genotype scores (*g*_{1},…,*g*_{L}), the corresponding multilocus penetrance is

$$f(g_{1},\ldots,g_{L}) = f_{\text{base}} + \sum_{l=1}^{L} w(g_{l}).$$

The individual’s disease status can be determined once the genotype is known. Samples of unrelated cases and controls and familial cases can be simulated as usual.
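The appendix model can be sketched in a few lines. The code below is an illustration rather than the authors' implementation: it assumes that the multilocus penetrance is the baseline plus the sum of the per-locus increments (the reading that reproduces the 5% prevalence from the parameters in tables A2–A4), that each locus is in Hardy-Weinberg equilibrium, and that loci are independent. All function names are hypothetical.

```python
import random
from itertools import product


def increment(g, delta, model):
    """Penetrance increment w(g) from table A1 for genotype score g in {0, 1, 2}."""
    if model == "additive":
        return 0.5 * delta * g
    if model == "dominant":
        return delta if g >= 1 else 0.0
    if model == "recessive":
        return delta if g == 2 else 0.0
    raise ValueError("unknown model: %s" % model)


def penetrance(scores, f_base, deltas, model):
    """Assumed multilocus penetrance: baseline plus the sum of increments."""
    return f_base + sum(increment(g, d, model) for g, d in zip(scores, deltas))


def genotype_probs(p):
    """Hardy-Weinberg probabilities of genotype scores 0, 1, 2."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)


def prevalence(f_base, deltas, model, p=0.1):
    """Population prevalence K, summed over all multilocus genotypes."""
    probs = genotype_probs(p)
    total = 0.0
    for scores in product(range(3), repeat=len(deltas)):
        weight = 1.0
        for g in scores:
            weight *= probs[g]
        total += weight * penetrance(scores, f_base, deltas, model)
    return total


def simulate_individual(rng, f_base, deltas, model, p=0.1):
    """Draw genotype scores for one unrelated individual, then disease status."""
    scores = [sum(rng.random() < p for _ in range(2)) for _ in deltas]
    affected = rng.random() < penetrance(scores, f_base, deltas, model)
    return scores, affected


if __name__ == "__main__":
    # First row of table A2: five loci, all with the same increment.
    print(prevalence(0.0264, [0.0471] * 5, "additive"))
    print(simulate_individual(random.Random(1), 0.0264, [0.0471] * 5, "additive"))
```

Sampling unrelated cases and controls then amounts to repeating `simulate_individual` and keeping affected or unaffected draws as needed. Simulating sibships would additionally require drawing parental genotypes and transmitting alleles, which this sketch omits.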

Tables A2–A4 list disease-model parameters for the additive multilocus-disease models described in this article.

### Table A1

| Genotype at Locus l | g_{l} | w(g_{l}): Additive | w(g_{l}): Dominant | w(g_{l}): Recessive |
| --- | --- | --- | --- | --- |
| d_{l}d_{l} | 0 | 0 | 0 | 0 |
| D_{l}d_{l} | 1 | 0.5Δ_{l} | Δ_{l} | 0 |
| D_{l}D_{l} | 2 | Δ_{l} | Δ_{l} | Δ_{l} |

### Table A2

| λ_{s,test} | Additive: f_{base} | Additive: Δ_{test} | Additive: Δ_{background} | Dominant: f_{base} | Dominant: Δ_{test} | Dominant: Δ_{background} | Recessive: f_{base} | Recessive: Δ_{test} | Recessive: Δ_{background} |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.02 | .0264 | .0471 | .0471 | .0255 | .0258 | .0258 | .0435 | .1307 | .1307 |
| 1.05 | .0237 | .0745 | .0471 | .0226 | .0408 | .0258 | .0427 | .2067 | .1307 |
| 1.10 | .0206 | .1054 | .0471 | .0194 | .0578 | .0258 | .0418 | .2924 | .1307 |
| 1.15 | .0182 | .1291 | .0471 | .0169 | .0707 | .0258 | .0412 | .3581 | .1307 |
| 1.20 | .0162 | .1491 | .0471 | .0148 | .0817 | .0258 | .0406 | .4134 | .1307 |
| 1.25 | .0145 | .1667 | .0471 | .0130 | .0913 | .0258 | .0401 | .4622 | .1307 |

Note.— Data are a comparison of five-locus disease models for which the effect size of the test locus increases and the effect size of each of the four remaining disease loci is fixed. The disease prevalence *K* is fixed at 5%. The predisposing allele frequency at each locus is fixed at 0.1. The locus-specific sibling recurrence risk ratio, λ_{s,test}, at the test locus is increased from 1.02 to 1.25, and the locus-specific sibling recurrence risk ratio at the four remaining disease loci is fixed at 1.02. Δ_{test} is the increment of penetrance at the test locus, and Δ_{background} is the increment of penetrance at each of the four remaining disease loci.
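The relation between a penetrance increment Δ and the locus-specific sibling recurrence risk ratio can be checked numerically. The sketch below is not taken from the article: it assumes the standard variance-components expression λ_{s} = 1 + (V_{A}/2 + V_{D}/4)/K^{2}, where V_{A} and V_{D} are the additive and dominance variances of the penetrance at the locus, together with Hardy-Weinberg equilibrium. The function name is hypothetical.

```python
def sibling_risk_ratio(delta, model, p=0.1, prevalence=0.05):
    """Locus-specific lambda_s implied by a penetrance increment delta.

    Assumes lambda_s = 1 + (V_A / 2 + V_D / 4) / K^2 and Hardy-Weinberg
    equilibrium; this formula is an assumption of the sketch, not a
    statement of how the tabulated values were derived.
    """
    q = 1.0 - p
    # Penetrance increments w(0), w(1), w(2) from table A1.
    w = {
        "additive": (0.0, 0.5 * delta, delta),
        "dominant": (0.0, delta, delta),
        "recessive": (0.0, 0.0, delta),
    }[model]
    # Average effect of substituting the predisposing allele.
    alpha = p * (w[2] - w[1]) + q * (w[1] - w[0])
    v_additive = 2.0 * p * q * alpha ** 2
    v_dominance = (p * q * (w[2] - 2.0 * w[1] + w[0])) ** 2
    return 1.0 + (0.5 * v_additive + 0.25 * v_dominance) / prevalence ** 2


if __name__ == "__main__":
    # Increments from the first and last rows of table A2 (additive columns).
    print(sibling_risk_ratio(0.0471, "additive"))
    print(sibling_risk_ratio(0.1667, "additive"))
```

Under these assumptions the tabulated increments give back ratios close to the listed λ_{s,test} values, which offers a quick consistency check when adapting the parameters to other allele frequencies or prevalences.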

### Table A3

| L | Additive: f_{base} | Additive: Δ | Dominant: f_{base} | Dominant: Δ | Recessive: f_{base} | Recessive: Δ |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | .0406 | .0471 | .0402 | .0258 | .0474 | .1307 |
| 4 | .0301 | .0471 | .0304 | .0258 | .0448 | .1307 |
| 6 | .0217 | .0471 | .0206 | .0258 | .0422 | .1307 |
| 8 | .0123 | .0471 | .0107 | .0258 | .0395 | .1307 |
| 10 | .0029 | .0471 | .0009 | .0258 | .0369 | .1307 |

Note.— Data are a comparison of *L*-locus disease models for which the effect size of each disease locus is fixed and the number of disease loci increases. The disease prevalence *K* is fixed at 5%. The predisposing allele frequency at each locus is fixed at 0.1. The locus-specific sibling recurrence risk ratio λ_{s} at all disease loci is fixed at 1.02. Δ is the increment of penetrance at each locus.

### Table A4

| λ_{s,large} | Additive: f_{base} | Additive: Δ_{small} | Additive: Δ_{large} | Dominant: f_{base} | Dominant: Δ_{small} | Dominant: Δ_{large} | Recessive: f_{base} | Recessive: Δ_{small} | Recessive: Δ_{large} |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.02 | .0264 | .0471 | .0471 | .0255 | .0258 | .0258 | .0435 | .1307 | .1307 |
| 1.1 | .0206 | .0471 | .1054 | .0194 | .0258 | .0578 | .0418 | .1307 | .2924 |
| 1.2 | .0162 | .0471 | .1491 | .0148 | .0258 | .0817 | .0406 | .1307 | .4134 |
| 1.3 | .0129 | .0471 | .1826 | .0114 | .0258 | .1001 | .0397 | .1307 | .5064 |
| 1.4 | .0101 | .0471 | .2108 | .0084 | .0258 | .1155 | .0389 | .1307 | .5847 |
| 1.5 | .0076 | .0471 | .2357 | .0058 | .0258 | .1292 | .0382 | .1307 | .6537 |
| 1.6 | .0053 | .0471 | .2582 | .0035 | .0258 | .1415 | .0376 | .1307 | .7161 |
| 1.7 | .0033 | .0471 | .2789 | .0013 | .0258 | .1528 | .0370 | .1307 | .7735 |

Note.— Data are a comparison of five-locus disease models for which the effect size of the large-effect background disease locus increases and the effect size of the small-effect disease loci, including the test locus, is fixed. The disease prevalence *K*=5%. The predisposing allele frequency at each locus is fixed at 0.1. The locus-specific sibling recurrence risk ratio, λ_{s,large}, at the large-effect background locus is increased from 1.02 to 1.7, and the locus-specific sibling recurrence risk ratio at the four remaining loci, including the test locus, is fixed at 1.02. Δ_{large} is the increment of penetrance at the large-effect background locus, and Δ_{small} is the increment of penetrance at each of the four remaining loci.

## Web Resource

The URL for data presented herein is as follows:

## References

**American Society of Human Genetics**

