# Family-Based Tests of Association in the Presence of Linkage

^{1}Biostatistics and ^{2}Epidemiology, Harvard School of Public Health, and ^{3}Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Harvard University, Boston

## Abstract

Linkage analysis may not provide the necessary resolution for identification of the genes underlying phenotypic variation. This is especially true for gene-mapping studies that focus on complex diseases that do not exhibit Mendelian inheritance patterns. One positional genomic strategy involves application of association methodology to areas of identified linkage. Detection of association in the presence of linkage localizes the gene(s) of interest to more-refined regions in the genome than is possible through linkage analysis alone. This strategy introduces a statistical complexity when family-based association tests are used: the marker genotypes among siblings are correlated in linked regions. Ignoring this correlation will compromise the size of the statistical hypothesis test, thus clouding the interpretation of test results. We present a method for computing the expectation of a wide range of association test statistics under the null hypothesis that there is linkage but no association. To standardize the test statistic, an empirical variance-covariance estimator that is robust to the sibling marker-genotype correlation is used. This method is widely applicable: any type of phenotypic measure or family configuration can be used. For example, we analyze a deletion in the *A2M* gene at the 5′ splice site of “exon II” of the bait region in Alzheimer disease (AD) discordant sibships. Since the *A2M* gene lies in a chromosomal region (chromosome 12p) that consistently has been linked to AD, association tests should be conducted under the null hypothesis that there is linkage but no association.

## Introduction

Although linkage analysis has been applied successfully to the mapping of genes involved in the pathogenesis of diseases exhibiting Mendelian inheritance, its application in the setting of genetically complex diseases has been less fruitful (Risch and Merikangas 1996). With complex diseases, the resolution from linkage analysis is reduced, and extended segments of the genome containing large numbers of genes may be implicated in disease etiology (Hauser and Boehnke 1997; Roberts et al. 1999). Fine mapping of these linked regions may be accomplished through the use of allelic-association methods that are designed to jointly detect linkage and gametic-phase disequilibrium. Detecting association significantly refines the search for disease susceptibility genes, because linkage disequilibrium between a genetic marker and disease susceptibility polymorphisms is expected to exist only over relatively small genetic distances in most populations. The sequential approach of linkage-based genomic screening followed by dissection of linked regions with association methodology recently has been used to identify a susceptibility locus for human hypertension (Bray et al. 2000).

Allelic association can be detected through traditional contingency-table analysis using cases and controls (Woolf 1955). Although straightforward to implement, tests based on this approach are sensitive to spurious association caused by population admixture (Ott 1989). Family-based association tests (FBATs) are a class of tests that utilize within- and between-family marker-inheritance patterns to test for association and that are safeguarded, by design, from confounding caused by admixture (Ewens and Spielman 1995). A widely used FBAT is the transmission/disequilibrium test (TDT; Terwilliger and Ott 1992; Spielman et al. 1993), which uses the marker genotypes of an affected child and those of his/her parents to test for association. FBATs have received much attention lately, with numerous extensions and generalizations of the TDT being proposed in the literature. Recently, Rabinowitz and Laird (2000) developed a unified approach to family-based association tests that puts tests of different genetic models, tests of different sampling designs, tests involving different disease phenotypes, tests with missing parents, and tests of different null hypotheses, all in the same framework. Algorithms for calculating the distribution of association test statistics for these many settings are also presented.

A distinction must be made between tests for linkage that use association methods and tests for association in the presence of linkage. Letting θ be the recombination parameter and δ be a measure of allelic association, the tests for linkage that use association methods have a composite null hypothesis (type I *H*_{0}) that can be expressed as *H*_{0}:δ=0 or θ=1/2. The null hypothesis for testing association in the presence of linkage (type II *H*_{0}) is *H*_{0}:δ=0 and θ<1/2. Both settings have the same alternative hypothesis, *H*_{a}:δ>0 and θ<1/2. Complications arise in tests addressing the type II *H*_{0} setting, because sibling marker genotypes are correlated under *H*_{0} (Martin et al. 1997; Lazzeroni and Lange 1998). Ignoring the correlation in the type II *H*_{0} setting compromises the α level of the tests. In this article, we show that valid tests for association in the presence of linkage may be performed using the mean of the test statistic computed via the Rabinowitz-Laird (RL) algorithm for the type I *H*_{0} setting and an empirical variance-covariance estimator that adjusts for the correlation among sibling marker genotypes. This provides a convenient means for testing allelic association in the presence of linkage that can be used with a wide range of test statistics and any pedigree configuration. For example, the nine strategies for testing the type I *H*_{0} advocated by S. Horvath, X. Xu, and N. Laird (unpublished data), which include applications to binary, quantitative and time-to-onset phenotypes, can all be adapted to the type II *H*_{0} setting with the method presented here. We note that in the biallelic setting and with a qualitative trait, the pedigree disequilibrium test (PDT; Martin et al. 2000*c*) is similar to the approach developed here.

As an illustration, we focus on the reported association between alleles of the *A2M* gene and late-onset Alzheimer disease. Blacker et al. (1998) reported a strong association between a deletion near the 5′ splice site of exon 18 of the *A2M* gene (*A2M*-18i) and AD in a sample of sibships from the National Institute of Mental Health (NIMH) Genetics Initiative (Blacker et al. 1997). During the course of the *A2M* association study, linkage to a nearby region on chromosome 12 was reported as part of a genome screen (Pericak-Vance et al. 1997). Subsequent linkage analyses revealed linkage peaks at or near the *A2M* gene (Rimmler et al. 1997; Rogaeva et al. 1998; Wu et al. 1998; Kehoe et al. 1999; Scott et al. 1999). The reported *A2M* association has been controversial, with further findings both confirmatory and nonconfirmatory (Dow et al. 1999; Rogaeva et al. 1999; Rudrasingham et al. 1999; Romas et al. 2000). In any case, *A2M* is useful as an illustration of association tests conducted in the presence of linkage. We use the NIMH data set, in which a strong *A2M*/AD association has been reported (Blacker et al. 1999), to illustrate our method.

## FBATs

We assume that there are *N* nuclear families, with *n*_{i} children in each family. Let *m*_{ij} be the marker genotype for the *j*th child in the *i*th family and *m*_{i} be the vector of marker genotypes for the *n*_{i} children in the *i*th family. In addition, the vector of parental marker genotypes will be denoted by *M*_{i}. Let *X*(*m*_{ij}) be an *h*×1 vector that codes for marker genotype. Depending on the coding scheme, *X*(*m*_{ij}) may be a scalar or a vector (see Schaid 1996; Laird et al. 2000; S. Horvath, X. Xu, and N. Laird [unpublished data]). Last, let *y*_{ij} be the phenotype of the *j*th child in the *i*th family and *T*(*y*_{ij}) be some function of the phenotype. In what follows we will often abbreviate *X*(*m*_{ij}) with *X*_{ij} and *T*(*y*_{ij}) with *T*_{ij} and drop the subscript indicating family when dealing with data from only one family.

Association test statistics are constructed to detect correlation between genotype and phenotype. In this article, we restrict attention to the class of test statistics that can be expressed as

$$S=\sum_{i}\sum_{j}X_{ij}T_{ij}=\sum_{i}S_{i}, \tag{1}$$

where the summation is over all children in all families and *S*_{i} is the contribution from the *i*th nuclear family, *i*=1,…,*N*. Test statistics in this general class constitute the majority of family-based association test statistics proposed in the literature, including tests in the multiallelic setting, tests using quantitative phenotypes, and tests that allow missing parental marker information (Laird et al. 2000; Rabinowitz and Laird 2000). For example, with simplex families, letting *T*_{ij} be an indicator function for child disease status and *X*_{ij} be the count of a particular marker allele, *S*_{i} counts the total number of alleles in the affected child and *S* is the same test statistic used in the TDT. Other types of test statistics are discussed in S. Horvath, X. Xu, and N. Laird (unpublished data).

Under the assumption that the *N* families are unrelated, the distribution of the test statistic *S* under *H*_{0} depends on the distributions of the independent *S*_{i}, *i*=1,…,*N*. For the *i*th family, the general distribution of *S*_{i} depends on the joint distribution of the observed children’s marker genotypes, children’s phenotypes, and parental marker genotypes *p*(*m*_{i},*M*_{i},*y*_{i}). Under the type I *H*_{0}, *p*(*m*_{i},*M*_{i},*y*_{i}) depends on allele frequencies and the genetic model; conditioning on the phenotypes and the parental genotypes eliminates these unknown nuisance parameters and makes the distribution of *S*_{i} dependent only on the conditional distribution of the children’s marker genotypes (Lazzeroni and Lange 1998). When parental genotypes are unknown, the nuisance parameters can be eliminated by conditioning on the sufficient statistic for the parental genotypes *S*(*M*), which is composed of the observed parental genotypes (when available) *M*_{obs} and the children’s genotype configuration *C*_{m} (Rabinowitz and Laird 2000). The distribution under the type II *H*_{0} is discussed in the next section.

Using the conditional distribution of the children's marker genotypes, we take the approach of standardizing *S* and using the large sample normal or χ^{2} approximation. In this case, the mean and variance of the *S*_{i} are required. For the type I *H*_{0}, letting Φ_{I}=[*S*(*M*),*y*], S. Horvath, X. Xu, and N. Laird (unpublished data) show that *E*(*S*_{i}|Φ_{I}) can be computed with the univariate conditional distribution of the children’s marker genotype, and *Var*(*S*_{i}|Φ_{I}) can be computed with the univariate and bivariate conditional distributions of the children’s marker genotypes, where *Var*(·) refers to the variance-covariance matrix. That is, by using just the joint distributions of (*m*_{ij},*m*_{ik}) (which, under the type I *H*_{0}, do not depend on *j* and *k*), we can compute *Var*(*S*_{i}|Φ_{I}). These distributions can be computed using the RL algorithm for the type I *H*_{0}.

## Tests of Association in the Presence of Linkage

As discussed above, association tests performed in areas of known linkage may significantly refine gene-mapping studies. The challenge is that, among siblings, genetic markers that reside within linked regions are correlated even in the absence of association and after conditioning on Φ_{I}=[*y*,*S*(*M*)]. The dependence exists because siblings with similar phenotypes are more likely to share the putative disease genes, even in the absence of allelic association. Linkage between a marker and the putative disease gene, therefore, induces positive correlation between the genetic markers of siblings with similar phenotypes. The opposite holds for siblings with disparate phenotypes. The correlation makes *p*(*m*|Φ_{I}) dependent on the recombination parameter and the genetic model for the phenotype.

Conditioning on the minimal sufficient statistic for θ and the phenotypes removes the dependence of the marker genotypes on θ and *y* under the type II *H*_{0}. When the patterns of allele sharing among siblings can be unambiguously determined, they serve as the minimal sufficient statistic for θ (Rabinowitz and Laird 2000). With incomplete identification of the allele sharing patterns, the outcome space of the children’s marker genotypes given the minimal sufficient statistic under the type II *H*_{0} may be computed using the RL algorithm (type II *H*_{0} case). Therefore, under the type II *H*_{0}, the minimal sufficient statistic Φ_{II} consists of the minimal sufficient statistic for the recombination parameter *S*(θ), the minimal sufficient statistic for the parental marker genotypes *S*(*M*), and the observed phenotypes *y*.

Since patterns of allele sharing are defined by the joint realization of sibling marker genotypes, the conditional outcome space consists of the various joint outcomes of sibling marker genotypes satisfying the constraints of the minimal sufficient statistic for the type II *H*_{0} (Martin et al. 1997; Rabinowitz and Laird 2000). Therefore, after conditioning on Φ_{II}, the convenient expression of *E*(*S*_{i}|Φ_{II}) and *Var*(*S*_{i}|Φ_{II}), in terms of the univariate and bivariate conditional distribution of marker genotypes under the type I *H*_{0}, cannot be paralleled. Rather, under the type II *H*_{0}, expressions for *E*(*S*_{i}|Φ_{II}) and *Var*(*S*_{i}|Φ_{II}) using the RL algorithm can be found with the multinomial distribution.

For a given family, assume that there are *p* compatible realizations of the sibling marker genotypes, and let *r* be a *p*×1 random vector, with the *k*th element being an indicator function that assumes the value 1, when the realization of the sibling marker genotypes corresponds to the *k*th element of the conditional outcome space, and 0 otherwise. The set of possible outcomes is given in tables 4–7 in Rabinowitz and Laird (2000) for nuclear families. Because, under the type II *H*_{0} and conditional on Φ_{II}, all outcomes are equally likely, with probability 1/*p*, *r* follows a multinomial distribution, with mean and variance given by

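$$E(r\mid\Phi_{II})=\frac{1}{p}\,1_{p}$$

and

$$Var(r\mid\Phi_{II})=\frac{1}{p}\left(I_{p}-\frac{1}{p}\,1_{p}1_{p}^{T}\right),$$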

where 1_{p} is a *p*×1 vector of 1s and *I*_{p} is a *p*×*p* dimensional identity matrix.

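The moments of *S*_{i} can be derived using the moments of *r*. Let *S*^{r}_{i} be an *h*×*p* matrix with the *k*th column equal to $\sum_{j=1}^{n_{i}}X(m^{(k)}_{ij})T_{ij}$, where *m*^{(k)}=(*m*^{(k)}_{i1},…,*m*^{(k)}_{in_{i}}) is the vector of sibling marker genotypes corresponding to the *k*th element of the conditional outcome space and *h* is the length of the marker genotype coding vector *X*, so that *S*_{i}=*S*^{r}_{i}*r*. The conditional mean and variance of *S*_{i} are

$$E(S_{i}\mid\Phi_{II})=\frac{1}{p}\,S^{r}_{i}1_{p}$$

and

$$Var(S_{i}\mid\Phi_{II})=\frac{1}{p}\,S^{r}_{i}\left(I_{p}-\frac{1}{p}\,1_{p}1_{p}^{T}\right)\left(S^{r}_{i}\right)^{T}.$$

Under the type II *H*_{0}, the approximate distribution of *S*-*E*(*S*|Φ_{II}) is $N\left(0,\sum_{i}Var(S_{i}\mid\Phi_{II})\right)$.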
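The conditional moments can be checked numerically. The sketch below assumes that *S*_{i} equals the matrix *S*^{r}_{i} times a one-hot vector *r* that selects each of the *p* compatible outcomes with probability 1/*p*; the function name and the example matrix are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def conditional_moments(Sr: np.ndarray):
    """Mean and variance of S_i = Sr @ r, where r is a one-hot vector that
    selects one of the p columns of Sr with equal probability 1/p."""
    h, p = Sr.shape
    ones = np.ones((p, 1))
    mean_r = ones / p                             # E(r) = (1/p) 1_p
    var_r = (np.eye(p) - ones @ ones.T / p) / p   # Var(r) = (1/p)(I_p - (1/p) 1_p 1_p^T)
    mean_S = Sr @ mean_r                          # h x 1 conditional mean
    var_S = Sr @ var_r @ Sr.T                     # h x h conditional variance
    return mean_S, var_S


# Hypothetical family with scalar coding (h = 1) and p = 4 compatible outcomes;
# each column holds the value S_i would take under that outcome.
Sr = np.array([[0.0, 1.0, 1.0, 2.0]])
mean_S, var_S = conditional_moments(Sr)
print(mean_S.item(), var_S.item())  # 1.0 0.5
```

The result agrees with direct enumeration: the four equally likely values 0, 1, 1, 2 have mean 1 and variance 1/2.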

The last column of table 1 indicates which combinations of parental marker genotypes and children marker configurations are potentially informative in the biallelic setting with the RL algorithm applied to the type II *H*_{0} setting. When parental data are missing (as is often the case for late-onset diseases), sibships with more than two sibs and *C*_{m}={*AA*,*AB*} or *C*_{m}={*BB*,*AB*} are not informative, because allele sharing cannot be discerned. The removal of these types of sibships may cause a substantial loss in the effective sample size, especially when one of the alleles is rare, because homozygotes of the rare allele will be infrequent. An alternative to conditioning on the allele sharing is to take advantage of the linear form of the test statistic (eq. [1]) and to use the RL algorithm for the type I *H*_{0} to calculate the expectation, in conjunction with a robust variance-covariance estimator. The development of this approach follows.

## Factorization of *p*(*m*|Φ_{I}) under Type II *H*_{0}

In view of the potentially severe loss of information caused by conditioning on sibling identical-by-descent (IBD) patterns, we here develop a method that employs the type I *H*_{0} RL algorithm to compute *E*(*S*|Φ_{I}) and an empirical variance-covariance estimator that is robust to the correlation among the sibling marker genotypes. To show that *S*-*E*(*S*|Φ_{I}) is a valid measure of association in the presence of linkage, we derive the marginal conditional distribution for the *k*th sibling marker genotype *p*(*m*_{k}|Φ_{I}) and show that this marginal distribution is the same under both the type I *H*_{0} and the type II *H*_{0} and does not depend on the recombination parameter θ or on the observed phenotypes *y* for *k*=1,…,*n* (see Appendix). Since the linear form of the test statistic (eq. [1]) permits its expectation to be found using *p*(*m*_{k}|Φ_{I}), the RL algorithm for the type I *H*_{0} can be used to compute *E*(*S*_{i}|Φ_{I}). Therefore, without specification or estimation of θ and without parameterization of the phenotype distribution, *S*-*E*(*S*|Φ_{I}) can be used to construct an unbiased test for association in the presence of linkage. Since family-specific contributions comprise *S*-*E*(*S*|Φ_{I}), only the variances of these contributions are needed to compute *Var*[*S*-*E*(*S*|Φ_{I})]; the correlation among children need not be addressed when finding *Var*[*S*_{i}-*E*(*S*_{i}|Φ_{I})].
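A toy calculation may help to fix ideas. It is not the Appendix derivation, only an illustration under stated assumptions: two sibs with both parents heterozygous (*AB*×*AB*), no association (so each parental allele is transmitted to the first sib with probability 1/2 regardless of phenotype), and a hypothetical distribution of paternal and maternal allele sharing between the sibs. The mean of the summed allele counts is the same whatever the sharing distribution, whereas the variance is not, which is why a robust variance estimator is paired with the type I expectation.

```python
from fractions import Fraction
from itertools import product


def sib_pair_moments(share_probs):
    """Exact mean and variance of X_1 + X_2 for two sibs of AB x AB parents.

    share_probs maps (share_pat, share_mat) to the probability that the sibs
    inherit the same paternal / maternal allele (1) or different ones (0).
    X_k is the number of A alleles carried by sib k.
    """
    half = Fraction(1, 2)
    mean = Fraction(0)
    second = Fraction(0)
    for (sp, sm), w in share_probs.items():
        # a, b: sib 1 receives A (1) or B (0) from the father / mother
        for a, b in product((0, 1), repeat=2):
            a2 = a if sp else 1 - a   # sib 2's paternal allele
            b2 = b if sm else 1 - b   # sib 2's maternal allele
            total = (a + b) + (a2 + b2)
            prob = Fraction(w) * half * half
            mean += prob * total
            second += prob * total ** 2
    return mean, second - mean ** 2


q = Fraction(1, 4)
no_linkage = {(0, 0): q, (0, 1): q, (1, 0): q, (1, 1): q}
# Hypothetical excess sharing, as might arise for a concordant pair under linkage
linked = {(0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10),
          (1, 0): Fraction(2, 10), (1, 1): Fraction(5, 10)}

print(sib_pair_moments(no_linkage))  # (Fraction(2, 1), Fraction(1, 1))
print(sib_pair_moments(linked))      # (Fraction(2, 1), Fraction(7, 5))
```

In both cases the expected total is 2, but excess sharing inflates the variance from 1 to 7/5; a variance computed as if the sibs were independent would therefore be too small for this pair.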

The derivation in the Appendix employs an ordered notation similar to that of Thomson (1995), where *m*^{*}_{k} is the marker genotype of the *k*th child, expressed in terms of the parental derived haplotypes (see Appendix). In particular, it is shown that under both the type I *H*_{0} and the type II *H*_{0}, the joint conditional probability for a family can be factored into

where *m*_{-k} is the vector of sibling marker alleles with the *k*th sibling information omitted, *M*_{u} denotes the unobserved parental marker genotypes, is the set of unobserved parental marker genotypes that coincide with *S*(*M*) and corresponds to the set of paternal and maternal derived markers for parents with marker genotypes *M* that result in the *k*th sibling’s observed marker genotype *m*_{k}. Marginalization of *Pr*(*m*|Φ_{I}) with respect to *m*_{-k} results in the marginal conditional probability for the *k*th sibling marker genotype with *Pr*(*m*_{k}|Φ_{I})=*Pr*[*m*_{k}|*S*(*M*)]. In addition, we show that *Pr*[*m*_{k}|*S*(*M*)] is not a function of θ and can be computed using the RL algorithm for the type I *H*_{0}. Although the factorization can be used to find the correct conditional expectation of the test statistic, it cannot be used to derive expressions for the covariance between sibling marker genotypes, because it marginalizes over the IBD relationships.

Since *S*_{i}-*E*(*S*_{i}|Φ_{I}) are independent mean 0 random vectors with unspecified variance-covariance matrices, we can apply the results of White (1980) to construct a robust variance-covariance estimator of *S*-*E*(*S*|Φ_{I}). Specifically, White (1980) addresses estimation of the variance-covariance matrix for estimated regression parameters in linear models with heteroscedastic errors. The test statistic *S*-*E*(*S*|Φ_{I}) can be couched as proportional to a vector of parameter estimates from a linear model and, therefore, the White empirical variance-covariance estimator, given by

$$\hat{V}=\widehat{Var}[S-E(S\mid\Phi_{I})]=\sum_{i=1}^{N}\left[S_{i}-E(S_{i}\mid\Phi_{I})\right]\left[S_{i}-E(S_{i}\mid\Phi_{I})\right]^{T}, \tag{2}$$

provides a consistent estimate of the variance-covariance matrix of *S*-*E*(*S*|Φ_{I}). Alternatively, $\hat{V}$ can be derived using the results of Liang and Zeger (1986) on generalized estimating equations. When *S* is vector-valued, $\hat{V}$ may not be full rank. In this case, the test statistic for the type II *H*_{0} is $[S-E(S\mid\Phi_{I})]^{T}\hat{V}^{-}[S-E(S\mid\Phi_{I})]$, where $\hat{V}^{-}$ is the generalized inverse of $\hat{V}$. It should be noted that the empirical variance-covariance estimator (2) reduces to a simple sum of squares for the biallelic case.
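The computation can be sketched in a few lines of Python, assuming the estimator is the sum over families of the outer products of *S*_{i}-*E*(*S*_{i}|Φ_{I}) and that the expectations have already been obtained (for example, from the RL algorithm). The function name and the toy numbers are hypothetical; taking the degrees of freedom as the rank of the estimated matrix and using the Moore-Penrose inverse as the generalized inverse are assumptions of this sketch, not statements about the FBAT program.

```python
import numpy as np


def ev_fbat_statistic(S_fam, E_fam):
    """S_fam, E_fam: per-family S_i and E(S_i | Phi_I), shape (N,) or (N, h).

    Returns (U, V, chi2, df), where
      U    = sum_i [S_i - E(S_i)]                     (h-vector)
      V    = sum_i [S_i - E(S_i)][S_i - E(S_i)]^T     (h x h)
      chi2 = U^T V^- U, with V^- the Moore-Penrose generalized inverse
      df   = rank of V (an assumption of this sketch)
    """
    D = np.asarray(S_fam, dtype=float) - np.asarray(E_fam, dtype=float)
    if D.ndim == 1:
        D = D[:, None]                  # scalar coding: treat as h = 1
    U = D.sum(axis=0)
    V = D.T @ D                         # sum of outer products over families
    chi2 = float(U @ np.linalg.pinv(V) @ U)
    df = int(np.linalg.matrix_rank(V))
    return U, V, chi2, df


# Hypothetical biallelic example with four families; the last family has
# S_i = E(S_i) and therefore adds nothing to U or V.
S_fam = [2.0, 2.0, 0.0, 1.0]
E_fam = [1.0, 1.5, 1.0, 1.0]
U, V, chi2, df = ev_fbat_statistic(S_fam, E_fam)
print(U[0], V[0, 0], df)  # 0.5 2.25 1
```

In the scalar case the matrix collapses to a sum of squared deviations, and the statistic is the squared ratio of the summed deviation to the square root of that sum.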

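Extensions to more-complex pedigrees are straightforward. Assume that the *i*th pedigree can be split into *q*_{i} nuclear families, for *i*=1,…,*F*, and let

$$S_{i}-E(S_{i}\mid\Phi_{I})=\sum_{j=1}^{q_{i}}\left[S_{ij}-E(S_{ij}\mid\Phi_{I})\right],$$

where *S*_{ij} is the test-statistic contribution from the *j*th nuclear family in the *i*th pedigree and *E*(*S*_{ij}|Φ_{I}) is computed using formulas by S. Horvath, X. Xu, and N. Laird (unpublished data). Although the contributions from nuclear families in the same pedigree are not independent, we can again appeal to White (1980) to construct a consistent estimate of the variance-covariance matrix of *S*-*E*(*S*|Φ_{I}):

$$\widehat{Var}[S-E(S\mid\Phi_{I})]=\sum_{i=1}^{F}\left\{\sum_{j=1}^{q_{i}}\left[S_{ij}-E(S_{ij}\mid\Phi_{I})\right]\right\}\left\{\sum_{j=1}^{q_{i}}\left[S_{ij}-E(S_{ij}\mid\Phi_{I})\right]\right\}^{T}.$$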

The advantage of the empirical variance-covariance approach is that more nuclear-family marker configurations are informative than is the case with the type II conditioning method. Table 1 indicates which nuclear family configurations are informative for the two approaches in the setting of a biallelic marker. In addition, since the conditioning is different for the two approaches, the expected values and variance-covariance terms are also not the same. We will refer to the empirical variance-covariance approach as “EV-FBAT.”
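For pedigrees split into nuclear families, the only change to the variance computation is that deviations are pooled within a pedigree before the outer products are formed. The sketch below assumes that pooling rule; the pedigree labels, numbers, and function name are hypothetical.

```python
from collections import defaultdict

import numpy as np


def pedigree_empirical_variance(ped_ids, S_nuc, E_nuc):
    """ped_ids: pedigree label for each nuclear family.
    S_nuc, E_nuc: S_ij and E(S_ij | Phi_I) per nuclear family, shape (n,) or (n, h).

    Returns (U, V): the summed deviation and the empirical variance-covariance
    estimate, with nuclear families pooled within pedigree.
    """
    D = np.asarray(S_nuc, dtype=float) - np.asarray(E_nuc, dtype=float)
    if D.ndim == 1:
        D = D[:, None]
    pooled = defaultdict(lambda: np.zeros(D.shape[1]))
    for ped, d in zip(ped_ids, D):
        pooled[ped] += d                      # sum over nuclear families j in pedigree i
    U = sum(pooled.values())
    V = sum(np.outer(u, u) for u in pooled.values())
    return U, V


# Hypothetical data: pedigree "p1" contains two nuclear families, "p2" one.
U, V = pedigree_empirical_variance(["p1", "p1", "p2"], [2.0, 1.0, 0.0], [1.0, 1.5, 1.0])
print(U[0], V[0, 0])  # -0.5 1.25
```

Treating the three nuclear families as independent would instead give 1 + 0.25 + 1 = 2.25 for the variance term, so the pooling step matters whenever contributions within a pedigree are correlated.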

## Example: Testing for Association in the *A2M* Gene

As an example, we tested for association between the *A2M*-18i deletion and AD in a set of sibships from the National Institute of Mental Health (NIMH) Genetics Initiative AD Sample. The ascertainment and assessment of the AD families collected have been discussed elsewhere (Blacker et al. 1997). The sample we used is composed of 437 individuals in 120 sibships and is identical to the sample analyzed by Blacker et al. (1999); 246 of the siblings met the NINCDS/ADRDA criteria for AD and/or had autopsy confirmation of the diagnosis.

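Table 2 contains the results for testing the *A2M*-18i/AD association. The test statistic used in the applications of the RL algorithm is the sum of the *A2M*-1 alleles in AD-affected siblings. This corresponds to the following coding schemes:

$$X(m_{ij})=\begin{cases}2 & \text{if } m_{ij}=A2M\text{-}1/A2M\text{-}1,\\ 1 & \text{if } m_{ij}=A2M\text{-}1/A2M\text{-}2,\\ 0 & \text{if } m_{ij}=A2M\text{-}2/A2M\text{-}2,\end{cases}$$

and

$$T(y_{ij})=\begin{cases}1 & \text{if sibling } j \text{ in sibship } i \text{ is affected with AD},\\ 0 & \text{otherwise.}\end{cases}$$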

Implementation of the RL algorithm consists of finding the expected value of *X*_{ij} conditional on the minimal sufficient statistic corresponding to the null hypothesis. Variance estimation is accomplished through the procedures described above.

Application of the RL algorithm to test for linkage and association (type I *H*_{0}) results in 51 informative sibships and a significant finding. As discussed above, the type I *H*_{0} may not be appropriate in view of the reported linkage evidence in the region spanning the *A2M* gene. Conditioning on the type II *H*_{0} minimal sufficient statistic results in a dramatic decrease in the effective sample size. With only 10 informative sibships, the test statistic is only marginally significant, and its large sample χ^{2} approximation may not be reliable (table 2). With EV-FBAT, 44 sibships were informative, resulting in a highly significant result (*Z*=2.94, *P*=.0033).
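The reported *P* value can be checked with a short calculation, under the assumption that the statistic 2.94 is on the standard normal scale (so that the equivalent 1-df χ^{2} value is its square). The function name is hypothetical; only the Python standard library is used.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided standard normal tail probability, P(|Z| >= |z|)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z = 2.94
print(round(two_sided_p_from_z(z), 4))  # 0.0033
print(round(z ** 2, 2))                 # 8.64, the equivalent 1-df chi-square value
```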

The discrepancy in the number of informative families is a consequence of the absence of parental genotype data and the distribution of genotypes among the siblings [*p*(*A2M*-1/*A2M*-1)=.732, *p*(*A2M*-1/*A2M*-2)=.231, and *p*(*A2M*-2/*A2M*-2)=.037]. The 34 families that are informative for EV-FBAT but not informative for the type II *H*_{0} conditioning approach have more than two siblings and *C*_{m}={*A2M*-1/*A2M*-1, *A2M*-1/*A2M*-2} or *C*_{m}={*A2M*-2/*A2M*-2, *A2M*-1/*A2M*-2} as the sibling marker configuration. As indicated by table 1, these sibships are not informative for the type II *H*_{0} RL approach because no definite allele sharing can be discerned. Because it does not condition on the allele sharing, the empirical variance approach is not subject to these constraints. The difference between the number of informative families for the type I *H*_{0} RL test and for EV-FBAT is a result of the definition of the empirical variance (2). Families with *S*_{i}=*E*(*S*_{i}|Φ_{I}) do not contribute to the test statistic or the empirical variance-covariance estimate.

To justify the EV-FBAT χ^{2} approximation with 44 informative sibships, we empirically estimated the significance level under the type II *H*_{0} for various numbers of informative sibships. We simulated sibships that were similar to the NIMH sibships in that the size distribution of the sibships was maintained, the biallelic marker had population allele frequencies of 0.20 and 0.80, and the baseline prevalence was fixed at 0.30. Because simulated data with the same number of sibships will have different numbers of informative families, we report the mean number of informative families. For each number of sibships we simulated 10,000 data sets. In figure 1, the circles represent the empirical significance levels for the mean number of informative families. The dashed lines are the pointwise 95% Monte Carlo sampling-error levels (0.0457, 0.0543). Figure 1 shows that the empirical significance level is within Monte Carlo sampling error for a large range of informative sibships. Indeed, the χ^{2} approximation appears to hold even for samples with only 20 informative sibships. With <20 informative sibships, the test appears to become conservative.
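The Monte Carlo band quoted above follows from the binomial standard error of an estimated rejection rate. The sketch assumes a nominal level of 0.05 (implied by the band being centered there but not stated explicitly) and the usual normal multiplier of 1.96; the function name is hypothetical.

```python
import math


def monte_carlo_band(alpha: float, n_sim: int, z: float = 1.96):
    """Pointwise band alpha +/- z * sqrt(alpha * (1 - alpha) / n_sim)."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_sim)
    return alpha - z * se, alpha + z * se


lo, hi = monte_carlo_band(0.05, 10_000)
print(round(lo, 4), round(hi, 4))  # 0.0457 0.0543
```

An empirical significance level outside this interval would indicate a departure from the nominal level that is unlikely to be explained by simulation noise alone.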

Figure 1. Empirical significance levels under the type II *H*_{0} for average number of informative sibships. The dashed lines are the pointwise 95% Monte Carlo sampling error levels (0.0457, 0.0543).

Robust variance-covariance estimation has been implemented in the context of a TDT extension (TRANSMIT; Clayton 1999), conditional logistic regression (Siegmund et al. 2000), and the PDT (Martin et al. 2000*c*). All three procedures are limited to qualitative traits, and the application of Siegmund et al. (2000) is further restricted to discordant sibships. When applied to the *A2M* data set, the Wald statistic from conditional logistic regression with robust variance estimation produces a test statistic that is not as pronounced as that of EV-FBAT but is still significant (table 2). The PDT produces a test statistic that is essentially equivalent to the test statistic of EV-FBAT in these data.

Another alternative is to use the sibship disequilibrium test (SDT; Horvath and Laird 1998). As shown in table 2, the SDT provides the strongest evidence for linkage disequilibrium. The SDT is well suited to the discordant sibships setting of the NIMH data, but it is restricted to qualitative phenotypes and cannot efficiently handle families with genotype-known parents.

## Discussion

One strategy for positional genomic analysis is to focus allelic-association testing on regions that have been identified through linkage analysis as putatively containing a gene or genes influencing phenotypic variation. Supplementing linkage results with association methodology is needed because, with complex diseases, linkage peaks may span regions of 10–20 cM that cover a large number of genes and are beyond the reach of positional cloning (Hauser and Boehnke 1997). A significant association finding may greatly refine the search for the underlying trait gene, since linkage disequilibrium will not generally extend over regions >1 cM in outbred populations (Pericak-Vance 1998). Although the utility of association methodology in this setting has been questioned (Terwilliger and Weiss 1998), the use of association methodology in the dissection of a region linked to human hypertension has recently yielded a susceptibility locus (Bray et al. 2000).

Candidates for the association tests within regions identified by linkage may be chosen via database searches using knowledge of biological pathways (Brookes et al. 2000). In addition, as dense maps of single-nucleotide polymorphisms (SNPs) become available and costs of genotyping decline, the dissection of linked regions may be accomplished by saturating the linked regions with SNPs and performing association tests on them. Martin et al. (2000*a,* 2000*b*) have used the *APOE* gene to illustrate the potential for using SNPs in mapping studies of complex traits.

With these strategies in mind, we have presented a method for evaluating the mean and variance-covariance of a wide range of test statistics computed under the null hypothesis that there is linkage but no association (type II *H*_{0}). The method, EV-FBAT, determines the expected value of an association test statistic by conditioning on the minimal sufficient statistic under the null hypothesis of no linkage and no association (type I *H*_{0}) and uses an empirical variance-covariance estimator that is consistent even when the sibling marker genotypes are correlated. As discussed above, the expectation of the test statistic is computed via the RL algorithm, and the resulting standardized test statistic is unbiased as a test for association in the presence of linkage. In addition, while retaining the robustness of family-based association tests, EV-FBAT does not suffer from the costly reduction in sample size caused by missing parental data that is inherent in approaches that condition on sibling IBD patterns.

The results of the *A2M*/AD example strongly suggest that the *A2M*-18i deletion is in linkage disequilibrium with a polymorphism that contributes to AD development. Whether or not the *A2M*-18i polymorphism is the polymorphism of interest (in which case the linkage disequilibrium is complete) cannot be deduced by association tests. In light of the evidence for linkage, relying on the type I *H*_{0} test alone would leave open the interpretation of the *P* value. Here, the *P* values of the type I RL approach and EV-FBAT agree; in general, we expect the type II *H*_{0} *P* values to be larger if *H*_{0} is true. Additional work will investigate the power of EV-FBAT and various proposed methods under *H*_{a}.

For qualitative traits and biallelic markers, EV-FBAT is similar to the PDT (Martin et al. 2000*c*). In the PDT, pedigrees are broken into nuclear families and discordant sibships. Let *A* and *B* be the two alleles of the marker. The contribution to the test statistic of a particular pedigree consists of weighted sums of the number of *A* alleles for each affected child minus an “expected” number of *A* alleles. This expectation is computed from unaffected siblings when the affected child belongs to a discordant sibship and is computed using a pseudocontrol (as defined by Falk and Rubinstein 1987) when the affected child belongs to a nuclear family. If a child belongs to a nuclear family and a discordant sibship, both differences are computed. Under the type II *H*_{0}, the sum of the pedigree contributions has expectation 0 and is standardized with an empirical estimator of the variance.

In this setting, the difference between the PDT and EV-FBAT is in the derivation of the expected number of *A* alleles under the type II *H*_{0}. In using the RL algorithm for the type I *H*_{0}, EV-FBAT conditions on the minimal sufficient statistic and, by definition, makes the most efficient use of the observed data in constructing the control genotype (see Cox and Hinkley [1974] or Rabinowitz and Laird [2000]). Further, the PDT cannot use concordant sibships with missing parental marker information and is also limited to the dichotomous-phenotype case.

EV-FBAT uses a robust variance-covariance estimation to take into account the correlation among sibling marker genotypes under the type II *H*_{0}. In addition to the PDT and EV-FBAT, a robust variance-covariance estimation for the qualitative setting has been implemented in the context of a TDT extension (TRANSMIT; Clayton 1999) and conditional logistic regression (Siegmund et al. 2000). The method of Clayton (1999) uses the EM algorithm (Dempster et al. 1977) to impute the likelihood contribution from family trios in which there is missing parental information and/or ambiguous genetic transmissions. Such imputation requires a full specification of the family-trio likelihood that depends on estimates of allele frequencies and population genetic assumptions that are difficult to justify. A score test based on these likelihood contributions is used to test for association with a robust variance-covariance estimator when multiple siblings are allowed.

The merits of association tests based on conditional logistic regression have been discussed (Witte et al. 1998; Kraft and Thomas 2000). Siegmund et al. (2000) recommend generalized estimating equations applied to the conditional logistic likelihood when the type II *H*_{0} is used. Unlike EV-FBAT, this method does not make any use of available parental data and is restricted to discordant sibships. As with the PDT, both TRANSMIT and the Siegmund et al. (2000) procedure are limited to qualitative traits.

In summary, EV-FBAT provides a flexible framework for association testing in the presence of linkage because it can be used with any type of phenotype and with any pedigree configuration. Therefore, the researcher is not restricted to particular sampling designs and is free to test for associations with quantitative or time-to-onset traits. Indeed, with EV-FBAT, the approaches to association testing with binary, quantitative, and time-to-onset phenotypes for the type I *H*_{0} advocated by S. Horvath, X. Xu, and N. Laird (unpublished data) can all be adapted to the type II *H*_{0}. Application of EV-FBAT is limited to the class of test statistics that can be expressed in a linear form (eq. [1]), but, as discussed in Laird et al. (2000), a number of family-based association-test statistics are of this form. Furthermore, Clayton and Jones (1999) and Lunetta et al. (2000) have shown that the score statistics from generalized linear models in which the coded marker genotype is the covariate can be expressed in the form of equation (1). The case when the test statistic may depend on unknown nuisance parameters is discussed in Lunetta et al. (2000). The method is also valid as a test of the type I *H*_{0} of no linkage or no association, since the empirical variance-covariance estimator is a consistent estimator under both types of null hypotheses.

The empirical variance approach for testing association in the presence of linkage has been implemented in a program called FBAT. It is invoked with the `-e` (for empirical variance) option for the `fbat` command. The program and its documentation are available free of charge from our Web site. There are different versions of the program for different operating systems: Mac, Solaris/Sparc, and Windows. If you encounter problems, please e-mail fbat@hsph.harvard.edu.

## Acknowledgements

We thank Dr. Steve Horvath for valuable conversations and Dr. John Rogus for helpful comments on the manuscript. Support for this research was provided by National Institutes of Health (NIH) grant MH 59532. We are indebted to two anonymous referees for their helpful suggestions. The genotypes of the sibships were generated in the laboratory of Dr. Rudy Tanzi, with support from NIH grant R01 MH60009. Data and biomaterials were collected in three projects that participated in the NIMH Alzheimer Disease Genetics Initiative. From 1991 to 1998, the principal investigators and coinvestigators were: Marilyn S. Albert, Ph.D., and Deborah Blacker, M.D., Sc.D., Massachusetts General Hospital, Boston, grant U01 MH46281; Susan S. Bassett, Ph.D., Gary A. Chase, Ph.D., and Marshal F. Folstein, M.D., Johns Hopkins University, Baltimore, grant U01 MH46290; and Rodney C. P. Go, Ph.D., and Lindy E. Harrell, M.D., University of Alabama, Birmingham, grant U01 MH46373.

## Appendix A: Proof

We show that, under the type II *H*_{0}, the joint conditional distribution of the sibling marker genotypes *m* given the sufficient statistic for the parental marker genotypes *S*(*M*) and the observed phenotypes *y* can be factored into a form amenable to the approach discussed above. The key point is that the marginal conditional distribution of a child’s marker genotype is not a function of the recombination parameter θ or of the observed phenotypes *y*. Therefore, under the type II *H*_{0}, the expectation of the test statistic conditional on the minimal sufficient statistic for the type I *H*_{0} can be found using the type I *H*_{0} RL algorithm, without modeling the correlation between the children’s marker genotypes.

Since *S*(*M*)=(*C*_{m},*M*_{obs}), where *C*_{m} is the configuration of sibling marker genotypes and *M*_{obs} is any observed parental marker genotype, the joint conditional distribution can be expressed as

where _{S(M)} is the set of possible unobserved parental marker genotypes with elements *M*_{u} that correspond to *S*(*M*) and where *M*=(*M*_{obs},*M*_{u}).

To derive the marginal conditional distribution of a child’s marker genotype we arbitrarily select the *k*th sibling (referred to as the reference sibling) and let *m*_{-k} be the vector of sibling marker alleles with the *k*th sibling information omitted. For all *k*=1,…,*n* we have that

We next show that *Pr*(*m*_{k},*M*,*y*)=*Pr*(*m*_{k},*M*)*f*(*y*), where *f*(*y*) is the joint distribution of the sibling phenotypes. To do this, we adopt a notation similar to the ordered notation of Thomson (1995), which identifies the paternally and maternally derived haplotypes that comprise the marker genotypes of the children. This is accomplished by expanding the parental marker genotypes into specific haplotypes, *M*^{*}_{i}=[*m*^{(p)}_{i1}/*m*^{(p)}_{i2}, *m*^{(m)}_{i1}/*m*^{(m)}_{i2}], and letting *m*^{*}_{ij} be the marker genotype of the *j*th child expressed in terms of the parental-derived haplotypes. That is, *m*^{*}_{ij}=[*m*^{(p)}_{id_{j}}/*m*^{(m)}_{id′_{j}}], where *d*_{j}, *d*^{′}_{j}=1,2 indicate inheritance from each parent. Furthermore, let _{m*k,M} correspond to the set of paternally and maternally derived markers from parents with marker genotypes *M* that result in the *k*th sibling’s observed marker genotype *m*_{k}, and let *G*=[*g*^{(p)}_{1}/*g*^{(p)}_{2}, *g*^{(m)}_{1}/*g*^{(m)}_{2}] be the unobserved disease genotypes for the parents and *g* be the vector of unobserved disease genotypes for the children. The joint probability, *Pr*(*m*_{k},*M*,*y*), thus can be expressed as the summation

where the additional summations in (A1) are with respect to the set of possible parental disease genotype combinations and the set of siblings’ disease genotypes conditional on parental disease genotypes and where *H*=[*m*^{(p)}_{1}*g*^{(p)}_{1}/*m*^{(p)}_{2}*g*^{(p)}_{2},*m*^{(m)}_{1}*g*^{(m)}_{1}/*m*^{(m)}_{2}*g*^{(m)}_{2}] describes the parental haplotypes.

Under the assumption that sibling disease genotypes are conditionally independent given parental haplotypes, equation (A1) can be expressed as

Under the type II null hypothesis of no association, we have that *Pr*(*g*_{i}|*H*)= 1/4 for *i*=1,…,*n*; *i*≠*k*, and *Pr*(*m*^{*}_{k},*H*)=*Pr*(*m*^{*}_{k},*M*)*Pr*(*G*). Therefore, (A2) can be simplified to

Let *F*_{G} denote the expression within square brackets in equation (A3). There are 4^{n} terms in *F*_{G}, corresponding to all the combinations of disease genotypes in the *n* children. The summation over all combinations of parental disease genotypes makes the terms in *F*_{G} with the same parental disease allele sharing patterns equivalent. For example, in the case of two children with the first child being the reference sibling,

Furthermore, if we assume *m*^{*}_{1}=(*m*^{(p)}_{1},*m*^{(m)}_{1}), then we have that

where _{IBD=1p} is the set of disease allele–sharing patterns, between the two siblings, that result in them sharing the paternally but not the maternally derived disease allele. Because of the ordered notation, *Pr*(*g*_{1}|*m*^{*}_{1},*H*) is a simple function of the recombination parameter θ, which cancels in the summation.

The same logic can be applied to any disease allele–sharing patterns for any number of children, making it straightforward to show that . Therefore, *Pr*(*m*_{k},*M*,*y*)=*Pr*(*m*_{k},*M*)*f*(*y*), where *Pr*(*m*_{k},*M*) is not a function of θ or of *y*, and we have the following factorization of the joint conditional distribution:

where we have used the fact that, under the type II *H*_{0}, *Pr*[*S*(*M*),*y*]=*Pr*[*S*(*M*)]*Pr*(*y*). We can marginalize the joint distribution with respect to *m*_{-k} to obtain

The term on the right side of (A4) is the conditional distribution of marker genotypes for the *k*th sibling, *Pr*[*m*_{k}|*S*(*M*)], under the null hypothesis of no linkage and no association. It has been tabulated by Rabinowitz and Laird (2000), for arbitrary missing parental marker information, and can be used to derive *E*(*S*_{i}|Φ_{I}) under the type II *H*_{0}. In summary, we have shown that is a valid measure of association in the presence of linkage.
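The central claim of the proof can be checked numerically in a simple special case. The Python sketch below enumerates a two-locus model exactly: both parents are *A*/*B* heterozygotes at the marker, a biallelic disease locus is in linkage equilibrium with the marker (no association), the recombination fraction is θ, and both siblings are affected. The risk-allele frequency, the penetrances, and all function names are assumptions chosen for illustration; this is not part of the FBAT software.

```python
from itertools import product

def sib_pair_marker_distribution(theta, p=0.2, penetrance=(0.05, 0.3, 0.6)):
    """Exact joint distribution of (x1, x2), the numbers of A alleles in two
    affected siblings whose parents are both A/B at the marker."""
    joint = {}
    # Disease alleles (1 = risk allele) on the four parental haplotypes,
    # drawn independently of the marker alleles: no association.
    for G in product((0, 1), repeat=4):
        pr_G = 1.0
        for g in G:
            pr_G *= p if g else 1.0 - p
        pat, mat = G[:2], G[2:]
        # Per sibling: (dp, rp, dm, rm), where d is the transmitted marker
        # haplotype (0 carries A, 1 carries B) and r flags a recombination.
        for sibs in product(product((0, 1), repeat=4), repeat=2):
            pr = pr_G
            xs = []
            for dp, rp, dm, rm in sibs:
                pr *= 0.25  # Mendelian transmission of marker haplotypes
                pr *= (theta if rp else 1.0 - theta)
                pr *= (theta if rm else 1.0 - theta)
                n_risk = pat[dp ^ rp] + mat[dm ^ rm]
                pr *= penetrance[n_risk]  # the sibling is affected
                xs.append((dp == 0) + (dm == 0))
            key = tuple(xs)
            joint[key] = joint.get(key, 0.0) + pr
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

def marginal_first(joint):
    """Marginal distribution of the first sibling's A-allele count."""
    m = [0.0, 0.0, 0.0]
    for (x1, _), pr in joint.items():
        m[x1] += pr
    return m

def covariance(joint):
    """Covariance between the two siblings' A-allele counts."""
    e1 = sum(x1 * pr for (x1, _), pr in joint.items())
    e2 = sum(x2 * pr for (_, x2), pr in joint.items())
    e12 = sum(x1 * x2 * pr for (x1, x2), pr in joint.items())
    return e12 - e1 * e2

for theta in (0.0, 0.1, 0.5):
    joint = sib_pair_marker_distribution(theta)
    print(theta, marginal_first(joint), covariance(joint))
```

In this setting the marginal distribution of one sibling's genotype stays at the Mendelian proportions whatever the value of θ, while the covariance between siblings is positive under tight linkage and vanishes at θ = 1/2. This is why the expectation can be taken from the type I *H*_{0} calculation, whereas the variance must allow for the sibling correlation.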

## Electronic-Database Information

The URL for data in this article is as follows:

## References

Bray MS, et al (2000) Positional genomic analysis identifies the β_{2}-adrenergic receptor gene as a susceptibility locus in human hypertension. Circulation 101:2877–2882

Martin ER, et al (2000*a*) Analysis of association at single nucleotide polymorphisms in the APOE region. Genomics 63:7–12

Martin ER, et al (2000*b*) SNPing away at complex diseases: analysis of single-nucleotide polymorphisms around APOE in Alzheimer disease. Am J Hum Genet 67:383–394

Martin ER, et al (2000*c*) A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am J Hum Genet 67:146–154

**American Society of Human Genetics**

Family-Based Tests of Association in the Presence of Linkage. American Journal of Human Genetics. 2000 Dec; 67(6):1515.