Am J Hum Genet. Aug 2007; 81(2): 321–337.
Published online Jul 10, 2007. doi:  10.1086/519497
PMCID: PMC1950805

Case-Control Association Testing with Related Individuals: A More Powerful Quasi-Likelihood Score Test

Abstract

We consider the problem of genomewide association testing of a binary trait when some sampled individuals are related, with known relationships. This commonly arises when families sampled for a linkage study are included in an association study. Furthermore, power to detect association with complex traits can be increased when affected individuals with affected relatives are sampled, because they are more likely to carry disease alleles than are randomly sampled affected individuals. With related individuals, correlations among relatives must be taken into account, to ensure validity of the test, and consideration of these correlations can also improve power. We provide new insight into the use of pedigree-based weights to improve power, and we propose a novel test, the MQLS test, which, as we demonstrate, represents an overall, and in many cases, substantial, improvement in power over previous tests, while retaining a computational simplicity that makes it useful in genomewide association studies in arbitrary pedigrees. Other features of the MQLS are as follows: (1) it is applicable to completely general combinations of family and case-control designs, (2) it can incorporate both unaffected controls and controls of unknown phenotype into the same analysis, and (3) it can incorporate phenotype data about relatives with missing genotype data. The methods are applied to data from the Genetic Analysis Workshop 14 Collaborative Study of the Genetics of Alcoholism, where the MQLS detects genomewide significant association (after Bonferroni correction) with an alcoholism-related phenotype for four different single-nucleotide polymorphisms: tsc1177811 (P=5.9×10⁻⁷), tsc1750530 (P=4.0×10⁻⁷), tsc0046696 (P=4.7×10⁻⁷), and tsc0057290 (P=5.2×10⁻⁷) on chromosomes 1, 16, 18, and 18, respectively. Three of these four significant associations were not detected in previous studies analyzing these data.

We focus on the problem of testing for association between a binary trait and a genetic marker when cases and/or controls are related, with the pedigree(s) assumed known. An advantage in using multiplex families in association studies is that affected individuals who have affected relatives have a higher expected frequency of the alleles that increase susceptibility for a genetic trait than do affected individuals who do not have affected relatives. As a result, the power to detect association is expected to increase when affected individuals with affected relatives are included in the study.

Family-based association tests, such as the transmission/disequilibrium test (TDT),1 have the advantage that they are robust to population heterogeneity. However, such tests typically require genotype data for family members of an affected individual. Case-control designs are less restrictive than family-based designs, because they can allow but do not require genotype data for relatives of affected individuals, and they are generally more powerful than family-based designs.2

When related individuals are used in case-control studies, one must account for the fact that subjects who are biologically related have correlated genotypes. One approach is to use the standard χ2 statistic with a correction factor that takes into account the pedigree information3 (Wχ2corr) or with a correction factor that takes into account the conditional probability of identity-by-descent (IBD) sharing, given both the observed genotype data and the pedigree information (the “posterior kinship coefficient”)4, giving a statistic that we refer to as “WSS.” Such approaches correct the type I error but still use equal weighting of individuals, among cases and among controls, which is expected to be suboptimal, in terms of power, when individuals are related. As an alternative approach, a quasi-likelihood score (WQLS) test has been proposed.3 Like the Wχ2corr, the WQLS accounts for the correlations among related individuals, to obtain the proper type I error rates. In addition, for a given alternative model, optimal weights, depending on the pedigree information, are used in the WQLS, in an effort to improve power. For the situation when controls are not related to cases, a method for testing association of a binary trait to a haplotype has been proposed,5 where this method uses a similar weighting scheme to that of the WQLS.

We analyze the strengths and weaknesses of the Wχ2corr, WQLS, and WSS tests, and we use this improved understanding to propose a new and more powerful test, the MQLS test. (The M in MQLS stands for “more powerful” or “modified.”) The MQLS test is more widely applicable than the previously proposed tests, in two ways. First, it distinguishes between unaffected controls and controls of unknown phenotype (i.e., individuals on whom no direct phenotype information is measured) and can incorporate both into the analysis. Unaffected and unphenotyped general-population controls are the two standard types of controls in case-control studies of disease. Association tests based on combined samples may include both types. Sample sizes of cases and of controls strongly influence the power of the test, so including all available controls is desirable. However, under the alternative hypothesis, individuals who are known to be unaffected have a lower expected frequency of a predisposing allele than do individuals of unknown phenotype (when other factors such as relatives' phenotypes are held constant). This poses the problem of how best to combine the two types of controls in the analysis without compromising power. The MQLS method provides a solution to this problem.

A second way in which the MQLS test is more widely applicable than are the previously proposed tests is that it incorporates phenotype data about relatives who have missing genotype data at the marker being tested. This information is used to optimize the weights given to relatives with nonmissing genotype data at the marker being tested, following the principle that there is enrichment for predisposing variants in affected individuals with affected relatives. This enrichment principle implies, for example, that an affected individual with no phenotyped relatives should be weighted differently from an affected individual with an affected sibling and that this should still hold true when the affected sibling happens to have missing genotype data at the marker being tested. At the same time, the genotypes of the two siblings are dependent, so there should be downweighting of the siblings when they are both genotyped, which does not occur when only one is typed. The MQLS takes into account both the enrichment principle and the effects of dependence in setting the weights.

In addition to the differences just mentioned, we show that, compared with the Wχ2corr and WSS, the MQLS improves power by providing a more efficient estimator of allele frequency under the null hypothesis. In large pedigrees, a further power difference between the MQLS and WSS would theoretically arise from the fact that the WSS essentially corrects for the presence of linkage in a family when testing for association,5 whereas the MQLS allows both linkage and association to contribute to the test statistic. A more relevant difference between the WSS and MQLS is that the MQLS is computationally feasible in large pedigrees, whereas the WSS is not. Improvement of the MQLS over the WQLS is obtained primarily by capitalizing on the property that there is an enrichment for predisposing variants in affected individuals with affected relatives. The MQLS is remarkable in its computational simplicity and can be used for any set of individuals, regardless of the complexity of the relationships of the individuals.

We give a statistical argument that the MQLS should be a powerful test: the MQLS maximizes the noncentrality parameter over a general class of linear statistics for all two-allele (single-locus) disease models in outbred samples, as the effect size tends to 0, where we allow environmental effects that do not have familial correlation (see the “Development and Justification of the MQLS Test” section). We simulate various multilocus disease models and directly compare the type I error and the power of the MQLS, WQLS, and Wχ2corr tests in samples of related individuals. Since the current implementation of the WSS does not allow it to be applied to these particular simulated data sets, we instead use the true IBD-sharing information from the simulations to investigate the use of posterior kinship coefficients to compute the variances of any of the case-control statistics. We find that, even if correction for linkage is desired, the extra computation required to use posterior kinship coefficients is generally not worthwhile in small pedigrees and is not computationally practical in larger pedigrees. We apply our methods to the Genetic Analysis Workshop (GAW) 14 Collaborative Study of the Genetics of Alcoholism (COGA) data,6 to identify regions of the genome that are associated with alcohol dependence (MIM 103780).

Methods

Association Testing with a Biallelic Marker

Suppose that we have phenotype information about a binary trait for n+m sampled individuals, with each individual coded as “affected,” “unaffected,” or “unknown.” Consider a single biallelic marker with allele labels “0” and “1” (the extension to multiallelic markers is given in appendix A), and suppose that the first n of the n+m listed individuals have nonmissing genotype data at the given marker, whereas the last m individuals have missing genotype data at the marker. For the first n individuals, let Y=(Y1,…,Yn)T, where Yi=(1/2)×(the number of alleles of type 1 in individual i), so the value of Yi is 0, 1/2, or 1. The n+m individuals are assumed to be ascertained with respect to phenotype, and they may be arbitrarily related (including inbreeding), with the pedigree(s) that specify the relationships assumed to be known. Let p be the frequency of allele 1 in the general population, where 0<p<1. Under the null hypothesis of no association between the given marker and the trait (and making the obvious assumption that ascertainment is conditionally independent of marker genotype, given phenotype and pedigree information), Y has mean p1, where 1 is a column vector of length n with every entry equal to 1. To obtain the null variance of Y, more assumptions are required. Namely, we assume that, under the null hypothesis, (1) there is neither linkage nor association between the given marker and the trait and (2) the pedigree founders are drawn from a population in Hardy-Weinberg equilibrium (HWE) for the given marker. (Note that we do not require HWE in the founders under the alternative model.) In that case, the null variance is given by Var0(Y)=(1/2)p(1−p)Φ (see, e.g., the work of Bourgain et al.3), where Φ is the kinship matrix of the nonmissing individuals, given by

$$\Phi_{ij} = \begin{cases} 1 + h_i, & i = j \\ 2\varphi_{ij}, & i \neq j \end{cases} \qquad (1)$$

where hi is the inbreeding coefficient of individual i, and φij is the kinship coefficient between individuals i and j, where 1≤i≤n and 1≤j≤n.
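As an illustration of how Φ can be assembled, the following minimal sketch builds the matrix for a hypothetical sample of three outbred individuals (two full siblings and one unrelated individual) and scales it by (1/2)p(1−p), the null variance of Yi for an outbred individual under HWE. The function name `kinship_matrix`, the kinship values, and the allele frequency p=0.3 are assumptions chosen only for this example.

```python
import numpy as np

def kinship_matrix(phi, h):
    """Return Phi with diagonal 1 + h_i and off-diagonal 2 * phi_ij.

    phi : (n, n) symmetric array of pairwise kinship coefficients
          (its diagonal is ignored).
    h   : length-n array of inbreeding coefficients.
    """
    Phi = 2.0 * np.asarray(phi, dtype=float)                  # off-diagonal entries
    np.fill_diagonal(Phi, 1.0 + np.asarray(h, dtype=float))   # diagonal entries
    return Phi

# Hypothetical sample: individuals 0 and 1 are outbred full siblings
# (kinship coefficient 1/4); individual 2 is unrelated to both.
phi = np.array([[0.0, 0.25, 0.0],
                [0.25, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
h = np.zeros(3)
Phi = kinship_matrix(phi, h)

p = 0.3                                  # assumed allele frequency
null_var = 0.5 * p * (1.0 - p) * Phi     # null covariance matrix of Y
```

Here the sibling pair has covariance equal to half the variance of a single individual, and the unrelated individual is uncorrelated with both.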

Overview of the WQLS, Wχ2corr, WSS, and MQLS Tests

To our knowledge, it has not been previously pointed out that the case-control association test statistics WQLS, Wχ2corr, WSS, and MQLS can all be thought of as having the common form

$$\frac{(\hat{p}_{\mathrm{test}} - \hat{p}_{\mathrm{null}})^2}{\mathrm{Var}_0(\hat{p}_{\mathrm{test}} - \hat{p}_{\mathrm{null}})},$$

where

$$\hat{p}_{\mathrm{null}} = \frac{w_{\mathrm{null}}^T Y}{w_{\mathrm{null}}^T \mathbf{1}}$$

is an estimator of allele frequency calculated under the assumption of no association,

$$\hat{p}_{\mathrm{test}} = \frac{w_{\mathrm{test}}^T Y}{w_{\mathrm{test}}^T \mathbf{1}}$$

is a contrasting estimator of allele frequency that should have a different expectation from p̂null when there is association, and Var0 denotes variance calculated under the assumption that the null hypothesis is true. For each of the four statistics, wtest and wnull are column vectors of length n, which can be written as functions of the phenotype and pedigree information.

The WQLS, Wχ2corr, and WSS tests assume that every genotyped individual is coded as either “case” or “control,” with no particular distinction made between unaffected controls and controls of unknown phenotype (e.g., general-population controls). Furthermore, these tests ignore the phenotype information on the additional m relatives with missing genotypes. In contrast, the MQLS treats unaffected controls and controls of unknown phenotype differently and can handle samples that contain both types. It also takes into account the phenotype data for the additional m relatives with missing genotypes. For the WQLS, Wχ2corr, and WSS, phenotype data are accordingly assumed to be coded as 1c, a column vector of length n with ith entry 1 if individual i is a case and 0 if i is a control. We define nc=1T1c as the number of cases among the n individuals with nonmissing genotypes. For the MQLS test, the phenotype data are coded as A, a column vector of length n+m having ith entry 1 if individual i is affected,

$$-\frac{k}{1-k}$$

if i is unaffected, and 0 if i is of unknown phenotype. Here 0<k<1 is a constant that must be specified (see the “Development and Justification of the MQLS Test” section for details). Write AT=(ATN,ATM), where AN and AM are column vectors containing the first n elements and last m elements of A, respectively, where N and M stand for the sets of individuals with nonmissing and missing genotypes, respectively, at the particular marker. For the MQLS test to use the additional phenotype information from the m individuals with missing genotype data, additional relationship information is needed in the form of an n×m matrix ΦN,M, giving the kinship coefficients between the nonmissing and missing individuals; that is, ΦN,M has (i,j)th element equal to 2φi,n+j, where φi,n+j is the kinship coefficient between the ith nonmissing individual and the jth missing individual. We write ΦN,N∪M for the n×(n+m) matrix with first n columns equal to the Φ of equation (1) and last m columns equal to ΦN,M.
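The phenotype coding can be sketched in a few lines. This is a hedged illustration, not the authors' software: the function name `phenotype_vector` and the status labels are hypothetical, and the code assumes the unaffected value is −k/(1−k). With that choice, if k is the population prevalence, the coded phenotype has mean zero in the general population, since k·1+(1−k)·(−k/(1−k))=0.

```python
def phenotype_vector(status, k):
    """Code phenotypes for the MQLS vector A.

    status : list of "affected", "unaffected", or "unknown" (hypothetical labels)
    k      : constant with 0 < k < 1 (e.g., an estimate of population prevalence)
    Assumes affected -> 1, unaffected -> -k/(1-k), unknown -> 0.
    """
    if not 0.0 < k < 1.0:
        raise ValueError("k must satisfy 0 < k < 1")
    codes = {"affected": 1.0, "unaffected": -k / (1.0 - k), "unknown": 0.0}
    return [codes[s] for s in status]

# Example with an assumed k = 0.2
A = phenotype_vector(["affected", "unaffected", "unknown", "affected"], k=0.2)
```

Individuals of unknown phenotype contribute nothing through A, whereas known unaffected individuals contribute a negative value whose size grows with k.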

Table 1 gives wtest and wnull for Wχ2corr, WSS, WQLS, and MQLS, and equation M4 for Wχ2corr, WQLS, and MQLS. When the WQLS, Wχ2corr, and MQLS statistics are calculated from data, the p in the variance expression is estimated by equation M5. From table 1, we see that the weights and variances for WQLS and MQLS involve Φ-1. It can be shown that Φ is invertible provided that the sample of individuals with nonmissing genotypes does not include both members of any MZ twin pair (see appendix B). We have also developed an extension of the WQLS and MQLS to the case when the genotyped sample does include both members of one or more MZ twin pairs. In that case, we calculate the WQLS and MQLS by coding each MZ twin pair as a single individual, where, for the WQLS, we assign the phenotype to be the average of the phenotype values for the pair (so the phenotype would take value 0, 1, or  1/2 if the members of the pair were both controls, both cases, or one of each, respectively) and, for the MQLS, we assign the phenotype value to be the sum of the phenotype values for the MZ twin pair (so the phenotype would take value 0,

$$-\frac{2k}{1-k},$$

or 2 if the members of the pair were both of unknown phenotype, both unaffected, or both affected, respectively, and value

$$1 - \frac{k}{1-k} = \frac{1-2k}{1-k}$$

if one is affected and the other unaffected, etc.). It turns out that this procedure is justified by the fact that it is mathematically equivalent to substituting the Moore-Penrose generalized inverse (see, e.g., the work of Schott7) of Φ in place of Φ⁻¹ in the formulas of table 1. Under regularity conditions on how ΦN,N∪M behaves as n+m→∞, which are sufficient for a central-limit theorem to hold, each of the four statistics follows a χ² distribution with 1 df asymptotically under the null hypothesis. A previous simulation study documented the accuracy of the χ² approximation for type I error of Wχ2corr and WQLS.3

Table 1.
wtest, wnull, and Var0equation M36 for Wχ2corr, WSS, WQLS, and MQLS

For both the Wχ2corr and WSS, the contrasting estimator p̂test is the sample mean based on cases and the null estimator p̂null is the sample mean based on everyone. The distinction between the two statistics is in the variance calculation. Previous work4 gives the variance calculation for WSS, for the case of outbred individuals, as a function of variances and covariances of genotype indicators for sampled individuals. The variances of the indicators are then calculated as a function of genotype probabilities and are valid under violations of HWE. However, the covariances are calculated with use of an ITO method that is valid only under HWE and with use of posterior IBD probabilities that are also calculated assuming HWE. Arguably the most substantial distinction between the variance calculations for Wχ2corr and WSS is in the use of so-called prior versus posterior kinship coefficients or, equivalently, the probability of IBD sharing for a pair based on the pedigree alone versus conditional on genotype data as well as on the pedigree. When the data are in HWE and all genotyped individuals are outbred, the variance expression for WSS is the same as that for Wχ2corr in table 1, except that Φ is replaced by Φposterior, where Φposterior has (i,j)th element equal to

$$\tfrac{1}{2}\,\xi_{ij}(1) + \xi_{ij}(2),$$

where ξij(k)=P(i and j share k alleles IBD at the given marker | all genotype and pedigree information). We call Φposterior the posterior kinship matrix, and, in the “Assessment of Use of Φ versus Φposterior in Variance Calculations” section, we consider the effect on power of the use of Φ versus Φposterior in the variance calculations for each of the statistics.
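The corrected χ² statistic described above can be sketched as the squared difference between the case mean and the overall mean of Y, divided by the null variance of that difference, which follows from a null covariance of (1/2)p(1−p)Φ for Y. This is a minimal sketch under stated assumptions: the function names are hypothetical, the estimate `p_hat` is supplied by the caller because the estimator used in practice is not reproduced here, and passing a posterior kinship matrix in place of Φ is only an illustration of the substitution discussed in the text.

```python
import numpy as np

def corrected_chi2(Y, is_case, Phi, p_hat):
    """Squared (case mean - overall mean) over its null variance.

    Y       : length-n array with entries in {0, 0.5, 1}
    is_case : length-n 0/1 array (the vector 1_c)
    Phi     : (n, n) kinship matrix (prior, or posterior for a conditional variant)
    p_hat   : allele-frequency estimate used in the variance (caller-supplied)
    """
    Y = np.asarray(Y, dtype=float)
    c = np.asarray(is_case, dtype=float)
    n, n_c = len(Y), c.sum()
    v = c / n_c - np.ones(n) / n                  # contrast vector
    diff = v @ Y                                  # case mean minus overall mean
    var0 = 0.5 * p_hat * (1.0 - p_hat) * (v @ Phi @ v)
    return float(diff ** 2 / var0)

def posterior_kinship_entry(xi1, xi2):
    """Twice the kinship coefficient implied by IBD-sharing probabilities."""
    return 0.5 * xi1 + xi2

# Four unrelated individuals (Phi = identity); the first two are cases.
Y = [1.0, 0.5, 0.0, 0.5]
stat = corrected_chi2(Y, [1, 1, 0, 0], np.eye(4), p_hat=0.5)

# Prior sharing probabilities for outbred full sibs: P(1 IBD) = 1/2, P(2 IBD) = 1/4.
sib_entry = posterior_kinship_entry(0.5, 0.25)
```

In this unrelated-sample example the statistic coincides with the ordinary allelic χ² computed from the 2×2 table of allele counts, which is the behavior one would expect when Φ is the identity matrix.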

For WQLS, p̂null is the best linear unbiased estimator (BLUE)8 based on everyone, and, in the special situation when the controls are not related to the cases, p̂test is the BLUE based on cases. The original presentation3 of the WQLS was as a quasi-likelihood score statistic based on the mean model

$$E(Y) = \mu = (\mu_1, \ldots, \mu_n)^T,$$

with μ=p1+r1c—that is,

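$$\mu_i = \begin{cases} p + r, & \text{if } i \text{ is a case} \\ p, & \text{if } i \text{ is a control} \end{cases} \qquad (2)$$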

where 0<p+r<1. Under the null hypothesis of no association, r=0. An equivalent expression3 for WQLS is

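$$W_{QLS} = \frac{\left[\mathbf{1}_c^T \hat{\Sigma}_0^{-1} (Y - \hat{\mu}_0)\right]^2}{\mathbf{1}_c^T \hat{\Sigma}_0^{-1} \mathbf{1}_c - (\mathbf{1}_c^T \hat{\Sigma}_0^{-1} \mathbf{1})^2 (\mathbf{1}^T \hat{\Sigma}_0^{-1} \mathbf{1})^{-1}}, \qquad (3)$$

where $\hat{p} = (\mathbf{1}^T \Phi^{-1} \mathbf{1})^{-1} \mathbf{1}^T \Phi^{-1} Y$ is the quasi-likelihood estimator of p when r=0, $\hat{\Sigma}_0$ is the covariance matrix Σ evaluated at r=0 and $p = \hat{p}$ (that is, $\hat{\Sigma}_0 = \tfrac{1}{2}\hat{p}(1-\hat{p})\Phi$), and $\hat{\mu}_0$ is the mean vector μ evaluated at r=0 and $p = \hat{p}$ (that is, $\hat{\mu}_0 = \hat{p}\mathbf{1}$). The WQLS test has been shown3 to have certain optimality properties based on this model.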

For the MQLS, p̂null is the BLUE based on everyone. In the special case of no missing genotype data at the marker, p̂test for the MQLS is calculated by taking a weighted average of the elements of Y in which each affected individual’s Y value gets a weight of 1, each unaffected individual’s Y value gets a weight of

$$-\frac{k}{1-k},$$

and Y values of individuals of unknown phenotype get weight of 0. When some individuals have missing genotype data, then wtest,i, the weight of the ith typed individual’s Y value in p̂test, has an additional term added on, whose value is a function of the phenotypes of individuals with missing genotype data that occur in the same pedigree with individual i. In that case, wtest,i=Ai+(Φf⁻¹Φf,N,MAf,M)i, where f is the pedigree containing individual i, and Φf⁻¹, Φf,N,M, and Af,M are, respectively, Φ⁻¹, ΦN,M, and AM, restricted to members of f. In the “Development and Justification of the MQLS Test” section, we justify the form of the MQLS, given in table 1, on theoretical grounds, and, in the “Power Comparison of Wχ2corr, WQLS, and MQLS” and the “Assessment of Use of Φ versus Φposterior in Variance Calculations” sections, we demonstrate that it generally improves power over previously proposed tests.
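The weight formula wtest=AN+Φ⁻¹ΦN,MAM can be illustrated with a small sketch. The function name `mqls_test_weights` and the two toy samples are assumptions made for this example; the code applies the formula to the whole sample at once, which gives the same result as working pedigree by pedigree whenever Φ is block diagonal across pedigrees.

```python
import numpy as np

def mqls_test_weights(Phi, Phi_NM, A_N, A_M):
    """w_test = A_N + Phi^{-1} Phi_NM A_M, using a linear solve instead of an explicit inverse."""
    A_N = np.asarray(A_N, dtype=float)
    A_M = np.asarray(A_M, dtype=float)
    return A_N + np.linalg.solve(Phi, Phi_NM @ A_M)

# Toy sample 1: one genotyped affected individual whose affected full sibling
# has a missing genotype (n = 1, m = 1; 2 * kinship = 1/2 for outbred sibs).
w_one_typed = mqls_test_weights(np.array([[1.0]]), np.array([[0.5]]), [1.0], [1.0])

# Toy sample 2: both affected siblings are genotyped, so there are no
# missing individuals (m = 0) and the extra term vanishes.
Phi_sibs = np.array([[1.0, 0.5],
                     [0.5, 1.0]])
w_both_typed = mqls_test_weights(Phi_sibs, np.zeros((2, 0)), [1.0, 1.0], [])
```

In the first sample the typed sibling receives weight 1.5 rather than 1, reflecting the phenotype information of the untyped affected sibling; in the second, each typed sibling receives weight 1.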

Development and Justification of the MQLS Test

The model on which WQLS is based (eq. [2]) has the advantage of being simple and intuitive, and it works well in samples of unrelated individuals. However, when the sample consists of related individuals and the trait is complex, this model does not capture certain features of allele-frequency differences associated with a genetic trait. Situations in which the model in equation (2) would be expected to hold more or less exactly, even in a sample of related individuals, include (1) testing for an allele-frequency difference between two distinct populations (e.g., Swedes and Japanese) without admixture and (2) testing for association with a trait within a single population when the true genetic model is a rare, fully penetrant dominant allele. In such cases, the WQLS enjoys certain optimality properties,3 which can be verified by simulation (see the “Results” section). One might hope that the simple model would be robust enough to maintain power with complex traits, but we find that power can be improved substantially by modifying the model while still retaining the computational simplicity of the original method.

The motivation for development of the MQLS is to improve the power of the quasi-likelihood score test WQLS by explicitly taking into account the fact that affected individuals who have affected relatives have a higher expected frequency of the alleles that increase susceptibility for a genetic trait than do individuals who do not have affected relatives. The simple model in equation (2) ignores this fact. Furthermore, when case individual i and control individual j are related, the model in equation (2) specifies that E(Yi-Yj)=r, regardless of how closely related i and j are, whereas, for example, if i and j were MZ twins, then, in reality, Yi-Yj=0, and, more generally, the expected difference in Y values between two individuals depends on their relationship as well as on their phenotypes.

For the MQLS, we develop a new mean model that is a function of the relationships and the phenotypes of all n+m individuals and is as follows: E(Y|A)=μ=(μ1,…,μi,…,μn)T, with μ=p1+rΦN,N∪MA; that is,

$$\mu_i = p + r\,(\Phi_{N,N\cup M}A)_i, \qquad (4)$$

where we constrain 0<p+r(ΦN,N∪MA)i<1 for i=1,…,n. For a better understanding of this mean model, let us consider, for example, a case-control study in which all of the individuals in the study are outbred. If a sampled individual i has no relatives in the study or if the phenotypes of all of i’s relatives are unknown, then, under the model of equation (4), i’s expected allele frequency is μi=p if i is of unknown phenotype, μi=p+r if i is affected, and

$$\mu_i = p - \frac{k}{1-k}\,r$$

if i is unaffected. For each affected sibling that i has, (1/2)r is added to the baseline value of μi, whereas, for each unaffected sibling,

$$-\frac{1}{2}\cdot\frac{k}{1-k}\,r$$

is added. For example, if i is affected and has three affected siblings and no other relatives of known phenotype in the study, then μi=p+(5/2)r. Note that, under model (4), MZ twins would always have the same expected allele frequency, even if one is a case and the other is a control.
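The worked example in this paragraph can be checked numerically. The sketch below evaluates μ=p1+rΦN,N∪MA for a hypothetical sibship of four affected, outbred, genotyped siblings, and for a single unaffected individual with no relatives; the function name `mqls_mean` and the values of p, r, and k are assumptions chosen for illustration, and the unaffected code is assumed to be −k/(1−k).

```python
import numpy as np

def mqls_mean(p, r, Phi_full, A):
    """Mean model mu = p*1 + r * Phi_full @ A, with Phi_full of shape n x (n+m)."""
    return p + r * (np.asarray(Phi_full, dtype=float) @ np.asarray(A, dtype=float))

p, r = 0.2, 0.04          # assumed baseline allele frequency and effect

# Four affected outbred full siblings, all genotyped (m = 0):
# diagonal 1, off-diagonal 2 * (1/4) = 1/2.
Phi_sibs = np.full((4, 4), 0.5)
np.fill_diagonal(Phi_sibs, 1.0)
mu_sibs = mqls_mean(p, r, Phi_sibs, np.ones(4))      # each entry equals p + (5/2) r

# One unaffected individual with no relatives, with an assumed k = 0.1.
k = 0.1
mu_unaffected = mqls_mean(p, r, np.array([[1.0]]), np.array([-k / (1.0 - k)]))
```

Each affected sibling has one unit from its own phenotype plus one half from each of its three affected siblings, which gives the factor 5/2.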

The theoretical justification for this new mean model is as follows: assume, for the moment, that the trait is caused by the marker according to an arbitrary two-allele (single-locus) disease model, in which we allow environmental effects that do not have familial correlation. This model is specified by the population-allele frequency p and the penetrances of the three possible genotypes. Set the constant k (used in the calculation of A) to be Kp, the population prevalence of the trait. Consider an arbitrary set of possibly related outbred individuals, and calculate the true value, under the model, of the expected allele frequency in individual i conditional on all available phenotypes of i and i’s relatives, μ*i=E(Yi|A), for any individual i in the sample. If we consider the ratio

$$\frac{\mu_i^* - p}{\mu_a^* - p},$$

where a is any affected individual with no phenotyped relatives, then, as the effect size (or the differences among penetrance probabilities) tends to zero, this ratio tends to (ΦN,N∪MA)i (see appendix C for the proof). Thus, the mean model in equation (4) is asymptotically the correct one for any two-allele disease model as the effect size goes to zero. Moreover, if the individuals are inbred, then the model is asymptotically correct for an additive or multiplicative, two-allele disease model as the effect size goes to zero (but not for a general, two-allele disease model in the inbred case). Although the assumptions under which this mean model is derived are somewhat simplistic, the model captures the important feature that individuals with affected relatives are likely to be enriched for the predisposing allele relative to individuals without affected relatives.

The MQLS statistic given in table 1 is derived as the quasi-likelihood score statistic based on the model in equation (4). This is obtained by substituting ΦN,N∪MA in place of 1c in equation (3). The resulting formula for MQLS is

equation image

where α=AN+Φ⁻¹ΦN,MAM,

equation image

equation M21, equation M22, and equation M23. Following the same reasoning3 as for the WQLS, the MQLS has maximal noncentrality parameter, against the alternative specified in equation (4), among a general class of linear statistics of the form equation M24, where S=VTY and V≠0 where 0 is a column vector of 0’s of length n. Note that Wχ2corr, WQLS, and MQLS are all of this form. As a result, under suitable regularity conditions, which are not discussed here, the MQLS would be asymptotically locally most powerful against the alternative specified in equation (4). Simulation studies are undertaken in the “Results” section, to assess the usefulness of this test for complex traits, where two-allele models are not expected to hold.
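To make the construction concrete, the following sketch computes a statistic obtained by carrying out the substitution described above: ΦN,N∪MA replaces the case indicator in the quasi-likelihood score form, with null covariance (1/2)p(1−p)Φ for Y and p estimated by the BLUE. Under these assumptions the statistic reduces to [αT(Y−p̂1)]² divided by (1/2)p̂(1−p̂)[αTΦα−(αT1)²/(1TΦ⁻¹1)]. This is an algebraic reading of the text rather than the authors' implementation; the function name `mqls_statistic` and the toy data are hypothetical.

```python
import math
import numpy as np

def mqls_statistic(Y, Phi, Phi_NM, A_N, A_M):
    """Return (statistic, p_value) for the score form described in the lead-in."""
    Y = np.asarray(Y, dtype=float)
    ones = np.ones(len(Y))
    Phi_inv_one = np.linalg.solve(Phi, ones)
    # BLUE of the allele frequency under the null hypothesis
    p_hat = (Phi_inv_one @ Y) / (Phi_inv_one @ ones)
    # alpha = A_N + Phi^{-1} Phi_NM A_M
    alpha = np.asarray(A_N, dtype=float) + np.linalg.solve(
        Phi, Phi_NM @ np.asarray(A_M, dtype=float))
    score = alpha @ (Y - p_hat * ones)
    var0 = 0.5 * p_hat * (1.0 - p_hat) * (
        alpha @ Phi @ alpha - (alpha @ ones) ** 2 / (Phi_inv_one @ ones))
    stat = float(score ** 2 / var0)
    # Upper tail of a chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Toy data: four unrelated genotyped individuals, two affected and two
# unaffected, with an assumed k = 0.5 so that unaffected individuals are coded -1.
Y = [1.0, 0.5, 0.0, 0.5]
A_N = [1.0, 1.0, -1.0, -1.0]
stat, p_value = mqls_statistic(Y, np.eye(4), np.zeros((4, 0)), A_N, [])
```

For this unrelated toy sample the statistic equals 2, the same value as the ordinary allelic χ² on the corresponding 2×2 table, which is a useful sanity check on the algebra.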

To use the MQLS test, a value for the constant k must be specified. We emphasize that the test will be valid for any value of k satisfying 0<k<1. The value of k affects the power of the test, and, under a two-allele model, for outbred individuals, optimal power is attained when k is the population prevalence of the trait. For complex traits, we recommend setting k equal to an estimate of the population prevalence from previous studies or registry data from the population. In the “Results” section, we demonstrate, through simulation, that power is in fact quite robust to misspecification of k.

MQLS Power-Improvement Diagnostic

We propose a method that uses only pedigree information and phenotype data (without genotype data) to determine whether the analyses using Wχ2corr, WQLS, and MQLS would be expected to give similar or dissimilar results. (If they are expected to give dissimilar results, then, in most cases, the MQLS is expected to have higher power.) One possible advantage of applying this diagnostic could be to avoid having to correct association P values for having performed multiple analyses unnecessarily, in the case when the results are predicted to be similar on the basis of the diagnostic. Because the diagnostic is based only on phenotype and pedigree information, and not on genotype data, such an approach does not create any bias in the results.

The idea is to calculate the weights assigned to the observations for each of the statistics and compare them. For the Wχ2corr, the weight of an observation depends only on the individual’s case-control status. In contrast, the weights can vary among cases and among controls in WQLS and MQLS, depending on the relationship configurations, as well as on the phenotypes of relatives. The total weights vary slightly from locus to locus, depending on missing data patterns. However, on the basis of the study design, weights for the genotyped individuals could be computed to get an idea of how different the analyses are likely to be under WQLS, MQLS, and Wχ2corr. Details of the power improvement diagnostic are given in appendix D, where we find that, to compare the Wχ2corr weights with the MQLS weights, we need only calculate the coefficients of variation of total MQLS weights among cases and total MQLS weights among controls. If these values are close to zero, then the MQLS test results should be similar to the Wχ2corr test results. The same principle holds for comparison of Wχ2corr weights with WQLS weights. Large differences in association-testing results between the Wχ2corr and either the MQLS or the WQLS tests would be expected to occur when one or both of the relevant coefficients of variation are far from zero. Our experience, in the context of our simulations, is that a coefficient of variation of ~1 or more in absolute value for the cases and/or controls will result in an improvement in power for the MQLS over the Wχ2corr.
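A rough sketch of such a diagnostic follows. The details are in appendix D and are not reproduced here, so this code rests on two assumptions that should be checked against that appendix: the total weight vector V is taken to be the vector for which the score numerator αT(Y−p̂1) equals VTY, with p̂ the BLUE, and the coefficient of variation is computed as the sample standard deviation divided by the mean. All function names and the toy sample are hypothetical.

```python
import numpy as np

def total_mqls_weights(Phi, alpha):
    """V such that alpha^T (Y - p_hat * 1) = V^T Y, with p_hat the BLUE (assumed form)."""
    alpha = np.asarray(alpha, dtype=float)
    ones = np.ones(len(alpha))
    Phi_inv_one = np.linalg.solve(Phi, ones)
    return alpha - (alpha @ ones) / (Phi_inv_one @ ones) * Phi_inv_one

def coefficient_of_variation(weights):
    """Sample standard deviation divided by the mean (assumed definition)."""
    w = np.asarray(weights, dtype=float)
    return float(w.std(ddof=1) / w.mean())

# Toy sample: an affected sib pair, one unrelated affected individual, and
# two unrelated unaffected individuals (assumed k = 0.5, so unaffected = -1).
Phi = np.eye(5)
Phi[0, 1] = Phi[1, 0] = 0.5
alpha = np.array([1.0, 1.0, 1.0, -1.0, -1.0])   # no missing genotypes, so alpha = A_N

V = total_mqls_weights(Phi, alpha)
cv_cases = coefficient_of_variation(V[:3])
cv_controls = coefficient_of_variation(V[3:])
```

In this toy sample both coefficients are close to zero, so under the stated assumptions the MQLS and corrected χ² analyses would be expected to give similar results.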

Use of Prior versus Posterior Kinship Coefficients

Kinship coefficients (and, for an inbred population, inbreeding coefficients) are used in two different ways in the construction of the four statistics: Wχ2corr, WSS, WQLS, and MQLS. First, kinship coefficients can be used in the calculation of the total weight vector, V, for each statistic (as they are in WQLS and MQLS but not in Wχ2corr and WSS). Second, they are used in the calculation of the variance term equation M25 for each statistic. The question arises3,5 as to whether it is preferable to use the unconditional (i.e., prior) kinship coefficients,3 given by Φ, or the posterior kinship coefficients, Φposterior, which are the conditional probabilities of IBD sharing given both the observed genotype data and the pedigree information,4 keeping in mind that Φ is vastly easier to compute than Φposterior. In fact, we generally caution against the use of Φposterior in the calculation of V for WQLS and MQLS. To understand why, first consider the simpler problem of allele-frequency estimation from a sample of related individuals. The BLUE for that problem8 is the same as

$$\hat{p} = (\mathbf{1}^T \Phi^{-1} \mathbf{1})^{-1}\, \mathbf{1}^T \Phi^{-1} Y.$$

When Φposterior is substituted for Φ in the formula for this estimator, the resulting estimator is, in general, biased and even inconsistent as n→∞ when there is partial IBD information. This occurs because the random variables Φposterior and Y are, in general, dependent. (Note that, in principle, these difficulties could be avoided in certain situations when Φposterior is estimated in such a way that E(Y|Φposterior)=E(Y); for example, if the true IBD sharing is known or if it is estimated from markers not in linkage disequilibrium [LD] with the marker of interest.) From these considerations, it seems clear that, if Φposterior is used in the calculation of V for the WQLS and MQLS, the resulting test statistic could be badly behaved. This might explain the difficulties encountered in previous work,5 which considered the use of Φposterior in the calculation of V for a haplotype-based method with a weighting scheme similar to that of the WQLS.
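For reference, the allele-frequency estimator discussed here can be sketched with the prior kinship matrix Φ, using the standard generalized-least-squares form (1TΦ⁻¹1)⁻¹1TΦ⁻¹Y. The function name `blue_allele_frequency` and the toy sample are assumptions for illustration only.

```python
import numpy as np

def blue_allele_frequency(Y, Phi):
    """Generalized-least-squares estimate (1' Phi^-1 1)^-1 1' Phi^-1 Y."""
    Y = np.asarray(Y, dtype=float)
    ones = np.ones(len(Y))
    w = np.linalg.solve(Phi, ones)      # unnormalized weights Phi^{-1} 1
    return float((w @ Y) / (w @ ones))

# Toy sample: two outbred full siblings and one unrelated individual.
Phi = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
Y = np.array([1.0, 1.0, 0.0])

p_blue = blue_allele_frequency(Y, Phi)   # siblings get weight 2/7 each, the unrelated individual 3/7
p_mean = float(Y.mean())                 # unweighted sample mean, for comparison
```

The siblings are downweighted relative to the unrelated individual because their genotypes are correlated, so the estimate moves from the sample mean of 2/3 to 4/7.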

There remains the possibility of using Φposterior in the calculation of the variance term. Intuitively, this would be expected to correct the statistic for the presence of linkage.5 In a context in which one would be willing to detect either linkage or association or a combination of the two, then the use of Φposterior instead of Φ in the variance calculation would be expected to result in lower power, as well as more onerous computation. However, in a context in which linkage has already been established and one wishes to correct for it in testing for association, the use of Φposterior instead of Φ in the variance calculation would be expected to result in better control of type I error. In the “Assessment of Use of Φ versus Φposterior in Variance Calculations” section, we describe simulations to test the validity of this intuition.

To use Φposterior in the variance calculation for Wχ2corr, one need only replace Φ with Φposterior in the variance formula in table 1. We call the resulting statistic WCONDχ2, where “COND” refers to conditional kinship coefficients. To calculate WCONDQLS and MCONDQLS, we replace the corresponding variance formulas in table 1 with

equation image

where wtest is defined differently for WQLS and MQLS, with the definitions given in table 1. Software such as Merlin9 can be used to calculate Φposterior in small-to-moderate–sized pedigrees.

GAW 14 COGA Data

We analyze a COGA data set6 that was previously analyzed in the Genetic Analysis Workshop (GAW) 14. There are a total of 1,614 individuals from 143 pedigrees, with each pedigree containing at least three affected individuals. We include in our analysis only those individuals who are coded as “white, non-Hispanic.” We designate as cases those individuals who are affected with ALDX1 or who have symptoms of ALDX1, where ALDX1 is defined to be DSM-III-R alcohol dependence with the Feighner Alc Definite phenotype. By these criteria, there are 830 cases with available SNP data. We designate as “unaffected controls” those individuals who are labeled as “pure unaffected,” and we designate as “controls of unknown phenotype” those individuals who are labeled as “never drank alcohol.” Among individuals with available SNP data, these criteria result in 187 unaffected controls and 13 unknown controls. Note that the MQLS makes a distinction between the two control types, whereas the Wχ2corr and WQLS do not. The data set includes 10,810 autosomal SNPs. We exclude 403 SNPs that are not polymorphic (minor-allele frequency <0.01). We analyze the remaining 10,407 SNPs using the Wχ2corr, WQLS, and MQLS tests. We could not use the WSS software package to analyze these data because, at the time of our analysis, to the best of our knowledge, there was no implementation available that would handle the situation in which controls are related to cases (although, in principle, it could be extended to that situation).

Results

Simulation Studies

We perform simulation studies to (1) assess the type I error of the MQLS; (2) compare power of the Wχ2corr, WQLS, and MQLS; (3) assess the practical impact of the use of Φ versus Φposterior in the variance calculations for the statistics; and (4) assess the robustness of power of MQLS to the choice of the parameter k. For each simulation described below, 5,000 replicates were performed.

We consider three different study designs. In the first, affected and unaffected individuals from 60 outbred, three-generation pedigrees are sampled. Each pedigree has a total of 16 individuals, related as in figure 1, with the pattern of affected and unaffected individuals varying randomly according to one of the trait models described in the next paragraph. Pedigrees are sampled conditional on obtaining exactly 20 pedigrees with 4 affected individuals, 20 with 5, and 20 with 6. In each sampled pedigree, phenotypes for all 16 individuals are observed. For each individual in a sampled pedigree, the individual’s genotypes are observed if and only if at least 30% of the individual’s siblings, parents, and offspring in the sampled pedigree are affected. The second study design is similar to the first, with two differences: (1) an additional 200 unrelated, unaffected controls are included in the study; and (2) for each individual in a sampled pedigree, the individual’s genotypes are observed if and only if at least half of the individual’s siblings, parents, and offspring in the sampled pedigree are affected. In the third study design, individuals from three extended pedigrees are sampled, with each pedigree consisting of 154 individuals over five generations. The pedigrees are sampled conditional on having at least 50 affected individuals. In each sampled pedigree, phenotypes for all 154 individuals are observed. For each individual in a sampled pedigree, the individual’s genotypes are observed if and only if at least half of the individual’s siblings, parents, and offspring in the sampled pedigree are affected.
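The genotype-observation rule shared by the three designs can be written as a small predicate. The following is a minimal sketch in Python, not the authors' simulation code; the function name `genotype_observed`, its arguments, and the handling of an individual with no first-degree relatives in the pedigree are assumptions made for illustration.

```python
def genotype_observed(first_degree_affected, threshold):
    """Return True if an individual's genotype is observed under the design rule.

    first_degree_affected: list of booleans, one per sibling, parent, or
        offspring of the individual in the sampled pedigree (True = affected).
    threshold: minimum affected fraction (0.30 in design 1, 0.50 in designs 2 and 3).
    """
    if not first_degree_affected:
        # Assumption: with no first-degree relatives, the genotype is unobserved.
        return False
    fraction = sum(first_degree_affected) / len(first_degree_affected)
    return fraction >= threshold


# One affected relative out of three: observed in design 1, not in designs 2 and 3.
relatives = [True, False, False]
design1 = genotype_observed(relatives, 0.30)
design2 = genotype_observed(relatives, 0.50)
```

Under this rule, raising the threshold from 30% to half enriches the genotyped sample for individuals from more heavily affected nuclear families, at the cost of fewer genotyped individuals per pedigree.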

Figure  1.
Example pedigree for study design 1 consisting of 60 outbred, three-generation pedigrees, where the overall structure of each pedigree is as depicted, but the pattern of affected and unaffected individuals in each pedigree varies randomly according to ...

We consider four different classes of multigene trait models.10 Model I has two unlinked causal SNPs, with epistasis between them and both of them acting dominantly. In model I, the frequencies of allele 1 at SNPs 1 and 2 are p1 and p2, respectively. Individuals with at least one copy of allele 1 at SNP 1 and at least one copy of allele 1 at SNP 2 have a penetrance of f1. All other individuals have a penetrance of f2<f1. We consider three different parameter settings for model I, which are listed as models I-a, I-b, and I-c in table 2. Model II also consists of two unlinked causal SNPs with epistasis between them, with SNP 1 acting recessively and SNP 2 following a general two-allele model. There are four penetrance parameters for this model, with f1>f2>f3>f4. Individuals with two copies of allele 1 at SNP 1 and two copies of allele 1 at SNP 2 have a penetrance of f1. Individuals with two copies of allele 1 at SNP 1 and one copy of allele 1 at SNP 2 have a penetrance of f2. Individuals with two copies of allele 1 at SNP 1 and no copies of allele 1 at SNP 2 have a penetrance of f3. All other individuals have a penetrance of f4. We consider one parameter setting for this class of model, which is listed as model II-a in table 2. Model III has three unlinked causal SNPs with epistasis between them and with each SNP acting dominantly. Individuals with at least one copy of allele 1 at SNP 1 and at least one copy of allele 1 at SNP 2 and/or SNP 3 have a penetrance of f1. All other individuals have a penetrance of f2<f1. We consider two different parameter settings for this class of model, which are listed as models III-a and III-b in table 2. Model V is the same as model I except that, in model V, the two causal SNPs are tightly linked and in linkage equilibrium, whereas, in model I, the two causal SNPs are unlinked and in linkage equilibrium. We consider one parameter setting for model V, which is listed as model V-a in table 2. 
In addition to the multigene models, we also consider a single-gene dominant model, model IV, in which individuals with at least one copy of allele 1 at SNP 1 have a penetrance of f1 and all other individuals have a penetrance of f2<f1. We consider one parameter setting for model IV, which is listed as model IV-a in table 2 and which represents a rare dominant trait that is almost fully penetrant. In addition to the allele frequencies and penetrance parameters for each model, table 2 contains the resulting population prevalence Kp, prevalence Ks conditioned on having an affected sibling, and the sibling risk ratio

λs=Ks/Kp;

these last 3 are calculated in outbreds. A broad range of models were chosen for our simulation studies: from highly penetrant disease models to disease models with low penetrance and models with high heritability to models with low heritability.
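To make the penetrance definitions concrete, the population prevalence Kp for models I, III, and IV can be computed directly from the allele frequencies and penetrances. The sketch below is illustrative Python and assumes Hardy-Weinberg equilibrium and linkage equilibrium between the causal SNPs in outbred individuals; the function names and the parameter values in the example are hypothetical and are not the settings of table 2.

```python
def carrier_prob(p):
    """Probability of carrying at least one copy of allele 1 (HWE, frequency p)."""
    return 1.0 - (1.0 - p) ** 2


def prevalence_model_I(p1, p2, f1, f2):
    """Model I: penetrance f1 for carriers at both SNP 1 and SNP 2, f2 otherwise."""
    high_risk = carrier_prob(p1) * carrier_prob(p2)
    return f1 * high_risk + f2 * (1.0 - high_risk)


def prevalence_model_III(p1, p2, p3, f1, f2):
    """Model III: penetrance f1 for carriers at SNP 1 and at SNP 2 and/or SNP 3."""
    carrier_2_or_3 = 1.0 - (1.0 - carrier_prob(p2)) * (1.0 - carrier_prob(p3))
    high_risk = carrier_prob(p1) * carrier_2_or_3
    return f1 * high_risk + f2 * (1.0 - high_risk)


def prevalence_model_IV(p1, f1, f2):
    """Model IV: single-gene dominant, penetrance f1 for carriers at SNP 1."""
    high_risk = carrier_prob(p1)
    return f1 * high_risk + f2 * (1.0 - high_risk)


# Hypothetical parameter values, chosen only to exercise the functions.
kp_I = prevalence_model_I(p1=0.1, p2=0.1, f1=1.0, f2=0.0)
kp_III = prevalence_model_III(p1=0.1, p2=0.1, p3=0.1, f1=1.0, f2=0.0)
kp_IV = prevalence_model_IV(p1=0.1, f1=1.0, f2=0.0)
```

Computing Ks, and hence the sibling risk ratio, additionally requires summing over parental mating types, which is omitted here for brevity.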

Table 2.
Allele Frequencies and Penetrance Parameters for Simulation Models

Assessment of Type I Error of MQLS

Previous simulation studies3 verified that the use of the χ2 approximations to the null distributions of Wχ2corr and WQLS gives the appropriate type I error for the tests. We perform a similar verification for the MQLS by simulating under the null hypothesis of no association and no linkage. We compare the proportion of simulations in which the statistic exceeds the (1-α)th quantile of the χ2 distribution to the nominal type I error level α, for α=.01 and .05. Simulations are performed on the basis of the second study design, which consists of individuals from 60 moderate-sized pedigrees plus 200 unrelated unaffected individuals. The phenotype is simulated from model I-a (table 2). We test at an unlinked, unassociated SNP with three different allele-frequency settings, which are given in table 3.

Table 3.
Empirical Type I Error of the MQLS Test, Based on 5,000 Simulated Replicates

Table 3 gives the empirical type I error of the MQLS, estimated from 5,000 simulations, for nominal levels .05 and .01. For each simulation scenario, the empirical type I error is not significantly different from the nominal. These results verify that the use of the χ2 approximation results in an accurate assessment of significance for the MQLS.
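One common way to judge whether an empirical type I error differs from the nominal level is to compare it with a normal-approximation interval around α based on the binomial standard error for 5,000 replicates. The sketch below is illustrative Python; the exact criterion used for table 3 is not stated in this section, so the 95% interval and the function name `type1_interval` are assumptions.

```python
import math


def type1_interval(alpha, n_reps, z=1.96):
    """Approximate 95% range for an empirical rejection rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_reps)
    return alpha - z * se, alpha + z * se


def consistent_with_nominal(empirical, alpha, n_reps):
    """True if the empirical rate falls inside the approximate 95% range."""
    lower, upper = type1_interval(alpha, n_reps)
    return lower <= empirical <= upper


# Ranges for the two nominal levels used in the paper, with 5,000 replicates.
range_05 = type1_interval(0.05, 5000)
range_01 = type1_interval(0.01, 5000)
```

With 5,000 replicates, the range is roughly .044 to .056 at α=.05 and roughly .007 to .013 at α=.01, which indicates how much sampling noise the empirical rates in table 3 can carry.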

Power Comparison of Wχ2corr, WQLS, and MQLS

To compare the power of Wχ2corr, WQLS, and MQLS, we perform simulations on the basis of the second study design, which consists of individuals from 60 moderate-sized pedigrees plus 200 unrelated unaffected individuals. Five thousand replicates from each of models I-a, I-b, I-c, II-a, III-a, and III-b are simulated. The test is performed at SNP 2 for each model, with the significance threshold set to .05.

Estimated power with the SE for the Wχ2corr, WQLS, and MQLS tests is given in figure 2. Recall that the numbers of cases and controls in each replicate are randomly determined. The average number of cases for a given simulation setting has a range of 70.2–121.6, and the average number of controls has a range of 390.2–417.4. The average coefficient of variation for the total weights of cases in the MQLS has a range of 0.3–0.4, and, for controls, it has a range from −2.6 to −1.6. Because the average coefficient of variation of total weights of controls is >1 in absolute value for the MQLS for every model considered, we expect an improvement in power of the MQLS over the Wχ2corr. As shown in figure 2, the MQLS is more powerful than both the Wχ2corr and WQLS in our simulation studies. The increase in power for the MQLS is substantial (a difference in power of at least 0.20) for models I-a, I-b, and III-a. Our theoretical results indicate that the MQLS should be powerful for two-allele disease models as effect size tends to zero. Our simulation studies indicate that, in fact, the MQLS performs well for a range of more-complex disease models.

Figure  2.
Estimated power with SE for the WQLS, Wχ2corr, and MQLS tests, on the basis of 5,000 simulated realizations for each of six different models.

In the “Development and Justification of the MQLS Test” section, we note that an example in which the WQLS has certain theoretical optimality properties is the case of a rare, fully penetrant dominant trait. To demonstrate by simulation that there can be cases in which the WQLS has higher power than the other statistics, we perform simulations that are based on the first study design (60 moderate-sized pedigrees) with the phenotype simulated from model IV-a, which approximates a rare, fully penetrant dominant. We perform the tests at the causal SNP and also at a tightly linked SNP that has allele frequency 0.5 and is associated with the causal SNP, with D=.2. When the causal SNP was tested, all three tests had power close to 1 (results not shown). Table 4 compares the power of the three tests when they are performed at the tightly linked SNP that has allele frequency 0.5 and is associated with the causal SNP with D=.2. As expected, WQLS has higher power than the other statistics in the case of a rare, fully penetrant dominant. This is because the conditional expected frequency of the allele in an individual given the phenotype information on everyone depends on only whether the individual is affected or unaffected, so the model on which the WQLS is based holds.

Table 4.
Example in Which WQLS Is More Powerful than Wχ2corr and MQLS

Assessment of Use of Φ versus Φposterior in Variance Calculations

We assess by simulation two predicted consequences of the use of Φposterior in place of Φ in the variance calculations of the statistics. To avoid the extra burden of calculating Φposterior, we instead use the true IBD-sharing information in place of Φposterior. This corresponds to the best-case scenario for the use of Φposterior, in which the markers provide complete IBD information.

The first predicted consequence is that use of Φposterior in place of Φ in the variance calculations would result in lower power in a context in which one would be willing to detect either linkage or association or a combination of the two; that is, when the null hypothesis is no association and no linkage. To test this, we simulate on the basis of the first (60 moderate-sized pedigrees) and third (three extended pedigrees) study designs, with the phenotype simulated from model V-a. We perform the tests at SNP 2, which is both linked and associated with the phenotype.

The second predicted consequence is that use of Φposterior would give better control of type I error in the context in which linkage has already been established and one wishes to correct for it in testing for association; that is, when the null hypothesis is no association. To test this, we use the same simulation scenario as for the first predicted consequence except that, instead of testing at SNP 2, which is both linked and associated with the phenotype, we tested at SNP 3, with allele frequency 0.5, which is tightly linked to both SNPs 1 and 2 but is not associated with either of them.

Table 5 demonstrates that, for moderate-size pedigrees, there is almost no difference between the use of Φ and Φposterior in the variance calculations for the statistics. This lack of difference between the two approaches holds for a SNP that is linked but not associated, as well as for a SNP that is both linked and associated. This is good news, because the calculation of Φ is vastly simpler than that of Φposterior.

Table 5.
Power Comparison for the Use of Φ versus Φposterior in Variance Calculations, for Simulations under Model V-a, When Testing at a Tightly Linked Marker, Based on 5,000 Simulated Replicates

For extended pedigrees, on the other hand, table 5 demonstrates that our predictions were correct—namely, that use of Φposterior in place of Φ in the variance calculations results in (1) lower power to test the joint null hypothesis of no association and no linkage and (2) better control of type I error to test the null hypothesis of no association in the presence of linkage. This is because, when Φ is used in the variance calculation, linkage is allowed to contribute to the signal, whereas, when Φposterior is used in the variance calculation, linkage is not allowed to contribute to the signal. In extended pedigrees, the calculation of Φposterior can present substantial difficulties. Particularly in the context of whole-genome association, use of Φposterior might be unfeasible in extended pedigrees, and, as we have seen, it makes almost no difference in moderate-size pedigrees. Therefore, as a practical matter, it seems to make sense to use Φ, with the understanding that, in extended pedigrees, this provides a test of the joint null hypothesis of no association and no linkage.

Robustness of Power of MQLS to Choice of k

The MQLS test is valid for any choice of the parameter k. In outbreds, when k is equal to the population prevalence of the disease, we have argued that the MQLS is asymptotically locally most powerful for all two-allele disease models as the effect size tends to zero. In reality, the trait will usually be complex, and the prevalence will be estimated. To see how the power of the MQLS test is affected by different choices of k, we perform a simulation study on the basis of the second study design, with the phenotype simulated from model III-a. For model III-a, the true population prevalence is Kp=.078. We perform the MQLS test with different settings of the parameter k, which are given in table 6.

Table 6.
Robustness of MQLS to Misspecification of Kp

Table 6 gives power results for the MQLS test for values for k ranging from one-quarter of the true value to 6 times the true value. In this case, choosing k to be within a factor of 3 or 4 of the population prevalence appears to give high power, suggesting that the procedure is quite robust to choice of k.

GAW 14 COGA Data

The National Institute on Alcohol Abuse and Alcoholism has estimated11 that the prevalence of alcohol dependence in the United States is ~5%. For the MQLS, we accordingly set k=0.05 in the analysis. The average coefficient of variation of the weights given by the WQLS for the cases is 1.568 and for the controls is −0.655. For the MQLS, they are 2.510 for the cases and −0.668 for the controls. The fact that the coefficients of variation for the cases are >1 for both the MQLS and WQLS suggests that these tests have the potential to give different results from those given by the Wχ2corr test, and, in particular, we might expect that there is some advantage to be gained by applying the MQLS. Table 7 gives the results of the analyses for those SNPs for which at least one of the tests has a nominal P<4.0×10⁻⁵. For 13 of these 15 SNPs, the MQLS test has the smallest P value among the three tests used. After Bonferroni correction to adjust for three different tests of association at each of 10,407 SNPs, the MQLS test is significant at the 5% level for four SNPs: tsc1177811 on chromosome 1 (P=5.9×10⁻⁷ uncorrected; .018 corrected), tsc1750530 on chromosome 16 (P=4.0×10⁻⁷ uncorrected; .012 corrected), tsc0046696 on chromosome 18 (P=4.7×10⁻⁷ uncorrected; .015 corrected), and tsc0057290 on chromosome 18 (P=5.2×10⁻⁷ uncorrected; .016 corrected). Note that the two significant SNPs on chromosome 18 are 71 cM apart. The Wχ2corr test is significant at the 5% level for an additional SNP, tsc0571038 on chromosome 11 (P=6.2×10⁻⁷ uncorrected; .019 corrected).
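The corrected P values above follow from multiplying each uncorrected P value by the total number of tests, 3×10,407=31,221. A short Python sketch of that arithmetic, using only the figures quoted in the text (the variable names are illustrative):

```python
N_SNPS = 10_407   # SNPs analyzed
N_TESTS = 3       # Wχ2corr, WQLS, and MQLS applied at each SNP
N_COMPARISONS = N_SNPS * N_TESTS

# Uncorrected P values quoted in the text for the five significant SNPs.
uncorrected = {
    "tsc1177811": 5.9e-7,
    "tsc1750530": 4.0e-7,
    "tsc0046696": 4.7e-7,
    "tsc0057290": 5.2e-7,
    "tsc0571038": 6.2e-7,
}

# Bonferroni correction: multiply by the number of comparisons, capped at 1.
corrected = {snp: min(1.0, p * N_COMPARISONS) for snp, p in uncorrected.items()}

# Equivalent per-comparison threshold for a 5% family-wise level.
threshold = 0.05 / N_COMPARISONS

for snp, p in corrected.items():
    print(f"{snp}: corrected P = {p:.3f}")
```

Equivalently, an uncorrected P value is significant at the 5% level after correction when it falls below 0.05/31,221, which is about 1.6×10⁻⁶.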

Table 7.
COGA Data Results

Three of the five significant SNPs in table 7 are near genes: tsc1750530 is 3 kb from the gene encoding HEAT repeat containing 3 (HEATR3 [Affymetrix; EntrezGene]), located at 16q12.1. The same SNP is also 25 kb from the gene encoding transmembrane protein 188 (TMEM188 [Affymetrix; EntrezGene]), located at 16q12.1. SNP tsc0046696 is 238 kb from the gene encoding F-box protein 15 (FBXO15 [MIM 609093; Affymetrix; EntrezGene]), located at 18q22.3. The other significant SNP on chromosome 18, tsc0057290, is 411 kb from the gene encoding VAMP (vesicle-associated membrane protein)–associated protein A (VAPA [MIM 605703; Affymetrix; EntrezGene]), 33 kDa, located at 18p11.22. To our knowledge, none of these genes are obvious candidates. The two significant SNPs that are not in close proximity to any known genes, tsc1177811 and tsc0571038, are located at 1p31.1 and 11q21, respectively (Affymetrix). Among the other SNPs in the table, two are located in or near genes of potential interest: (1) tsc1687605 is in the 3′ UTR of the gene that encodes cytochrome P450, family 2, subfamily C, polypeptide 18 (CYP2C18 [MIM 601131; Affymetrix; EntrezGene]) located at 10q24—CYP2C18 is a member of the CYP2C subfamily of P450 enzymes that is involved with drug metabolism (EntrezGene; OMIM); and (2) SNP tsc0768481 is in the same cytogenetic region (13q14-q21) as the gene that encodes 5-hydroxytryptamine receptor 2A (HTR2A [MIM 182135; Affymetrix; EntrezGene]). There is evidence that HTR2A is associated with alcohol dependence.12

Three previous analyses13–15 of these data used the ALDX1 phenotype and performed family-based association tests using FBAT.16 When only white individuals were analyzed, no SNPs were significant at the 5% level with use of FBAT with Bonferroni correction.13 When all individuals were analyzed, three SNPs were significant after Bonferroni correction,14 only one of which (tsc1750530 on chromosome 16) is in the set of five SNPs we detect. These SNPs and their P values were reported14 as tsc0515272 on chromosome 3 (P=3.8×10⁻⁷ uncorrected), tsc0029429 on chromosome 9 (P=2.0×10⁻⁸ uncorrected), and tsc1750530 on chromosome 16 (P=4.5×10⁻⁷ uncorrected). False-discovery rates for these SNPs are reported15 as .0270, .0019, and .0094, respectively. The corresponding uncorrected P values for these SNPs by the MQLS are .058, .061, and 4.0×10⁻⁷, respectively. Note that the MQLS detected four significant SNPs (and the Wχ2corr detected a fifth significant SNP) with a smaller sample than that used by the FBAT to detect three significant SNPs. This indicates that the MQLS is a powerful test that provides additional results complementary to those provided by FBAT.

Analysis of 10,407 SNPs with three tests (MQLS, WQLS, and Wχ2corr) took ~35 minutes on a Pentium 4 3-GHz machine with 1 GB RAM. The calculations should scale linearly with the number of SNPs. We have not made serious attempts to optimize the code, so this time could presumably be improved. The slow step is the Cholesky decomposition7 of Φ, which would need to be performed only once if, for every SNP, the same individuals had missing genotype data. However, this is generally not the case, so, in our implementation, the Cholesky decomposition is recomputed for every SNP.
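Because the Cholesky factor depends only on which individuals have nonmissing genotypes at a SNP, one possible speedup is to reuse the factor across SNPs that share the same missingness pattern. The Python sketch below illustrates that idea only; it is not the implementation described above, which recomputes the decomposition for every SNP, and the function name, the cache layout, and the example matrix are all assumptions.

```python
import numpy as np


def cholesky_for_pattern(phi, observed, cache):
    """Cholesky factor of the submatrix of phi for individuals with observed genotypes.

    phi: (n, n) kinship-based covariance matrix for all individuals.
    observed: length-n boolean array, True where the genotype is nonmissing.
    cache: dict mapping missingness patterns to previously computed factors.
    """
    observed = np.asarray(observed, dtype=bool)
    key = observed.tobytes()
    if key not in cache:
        sub = phi[np.ix_(observed, observed)]
        cache[key] = np.linalg.cholesky(sub)
    return cache[key]


# Hypothetical example: two unrelated parents and their child
# (off-diagonal entries are twice the kinship coefficient).
phi = np.array([[1.0, 0.0, 0.5],
                [0.0, 1.0, 0.5],
                [0.5, 0.5, 1.0]])
patterns = [
    [True, True, True],    # SNP 1: everyone genotyped
    [True, False, True],   # SNP 2: second parent missing
    [True, True, True],    # SNP 3: same pattern as SNP 1, so the factor is reused
]
cache = {}
factors = [cholesky_for_pattern(phi, pattern, cache) for pattern in patterns]
```

How much this helps depends on how many distinct missingness patterns occur in the data; if nearly every SNP has its own pattern, the cache provides little benefit.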

Discussion

Despite major advances in high-density genome scans, disappointing results in the mapping of many common diseases illustrate the need for more-powerful methods for detection of susceptibility loci. We specifically address the problem of genomewide association analysis of binary traits when some individuals in the sample are related with known kinship. This arises naturally, for instance, when families sampled for a linkage study are included in an association study. This can be desirable, because it is expected that affected individuals from multiplex families would have a higher expected frequency of the alleles that increase susceptibility for a genetic trait than would affected individuals who do not have affected relatives. As a result, the power to detect association is expected to increase when affected individuals from multiplex families are included in a study. However, analysis of such data presents statistical and computational challenges.

We have developed a new test, the MQLS, which is applicable to association studies with completely general combinations of family and case-control designs. For instance, the MQLS allows cases to be related to controls, and it is equally applicable to complex inbred pedigrees and to simpler study designs consisting of unrelated individuals and small outbred families. The MQLS distinguishes between unaffected controls and controls of unknown phenotype and can incorporate both into the same analysis. Furthermore, it makes use of phenotype data about relatives who have missing genotype data at a given SNP, where this information is used to optimize the weights given to relatives with nonmissing data at the SNP. We also extend the test to multiallelic markers. Our method is computationally feasible to use for genomewide association studies with hundreds of thousands or millions of SNPs. Our simulations indicate that the MQLS represents an overall—and, in many cases, substantial—improvement in power over competing methods for a broad range of multigene trait models, while controlling type I error. In a reanalysis of the GAW 14 COGA data, the MQLS detected four SNPs with genomewide-significant association to alcoholism, three of which had not been identified as significant in previous analyses.

We suggest a simple diagnostic, based on only phenotype information, that determines whether the MQLS, WQLS, and Wχ2corr would be expected to give different results. This can allow one to avoid correcting for use of three different tests in situations in which they are expected to give similar results. In our simulation studies, when the diagnostic indicated that the tests would give different results, the MQLS was generally the most powerful. In the GAW 14 COGA data, the diagnostic indicated that the tests would give different results, and, indeed, this was the case with the MQLS, WQLS, and Wχ2corr identifying 4, 0, and 1 significant markers, respectively, where the marker identified by Wχ2corr was not among those identified by MQLS. (In this case, the Bonferroni correction took into account the three different tests as well as the number of SNPs tested.) In our simulations, a diagnostic result of >1 in absolute value in either cases or controls corresponded to a noticeable power difference between the tests, but, with larger sample sizes, a smaller diagnostic result might still correspond to a substantial power difference.

We have developed a modified version of the CC-QLS software program3 that outputs the results of our new MQLS test for each SNP, as well as the results of the previously proposed3 Wχ2corr and WQLS tests. The source code will be available (see M.S.M.'s Web page).

In the simulations and data analysis, we focus on inclusion of small-to-moderate–size outbred families in case-control association studies. However, it is important to note that the MQLS is equally applicable to case-control association testing in founder populations, provided that the genealogy is known. Founder populations—for example, the Tasmanian population17 and the Hutterites18—are of interest for the mapping of complex traits for various reasons, including (1) avoidance of the problems of unknown population substructure and (2) the expectation that there would be fewer risk alleles involved in complex disorders in founder populations than in diverse continental populations. The MQLS is computationally feasible, even in a founder population as complicated as the Hutterites, among whom many of the individuals are related through multiple lines of descent and exact likelihood calculation is not feasible.18

We have examined the question of whether to use prior or posterior kinship coefficients in calculating the weights for WQLS and MQLS and the variances for Wχ2corr, WQLS, and MQLS. We recommend that prior kinship coefficients always be used in calculating the weights; otherwise, the theoretical justification for the statistics might not hold, and they could be badly behaved. In calculating the variance, we found no difference in the results obtained for small-to-moderate–size pedigrees with the two different types of kinship coefficients. Therefore, we recommend prior kinship coefficients for that calculation also, because they are much faster and simpler to compute. For large pedigrees, posterior kinship coefficients are unfeasible to obtain exactly, so it is somewhat academic to debate which is better. Nevertheless, on the basis of our simulations, we can say that, for a design consisting of a small number of large pedigrees, if one is willing to detect a signal that is driven by a combination of linkage and association, then one should obtain higher power with prior kinship coefficients, whereas, if one wants to correct for a known linkage signal to obtain a pure association test, then better type I error properties would be obtained with posterior kinship coefficients. (It is reasonable that these differences should disappear with multiple small-to-moderate–size pedigrees, because, if there is linkage but no association, then different alleles would tend to be associated with the trait in different pedigrees.)

Use of the MQLS requires specification of a constant k in the test statistic. We emphasize that the test is valid for any value of k. To optimize power, we recommend that k be set to the best available estimate of the population prevalence of the trait. Our simulation studies suggest that the power of the test is very robust to the choice of k. When k was misspecified within a factor of 3 or 4 of the true prevalence, there was little or no loss of power in our simulations.

Acknowledgments

We thank Dr. Mark Abney, for discussion and critical comments; Wataru Yoshimura from Affymetrix Application Support, for assistance with the Affymetrix (NetAffx) Web site; and two anonymous reviewers, for helpful comments. This study was supported in part by a David and Lucile Packard Fellowship (to T.T.) and by National Institutes of Health grants HG001645 and HL084715. Data were provided by the COGA (U10AA008401). We gratefully acknowledge COGA and thank Dr. Ray Crowe of the University of Iowa and Dr. Jean MacCluer of the Southwest Foundation for Biomedical Research for their help in obtaining permission to analyze the COGA data.

Appendix A: Extension of MQLS to Multiallelic Case

We extend the MQLS procedure to test for association between a trait and a multiallelic marker. Extensions of the Wχ2corr and WQLS tests to multiallelic markers have been given elsewhere.3 Suppose there are a allelic types at the marker, and let equation M27 be an [(a-1)n] vector, where Yi=(Yi1,…,Yin) and Yij= 1/2× (the number of alleles of type i that individual j has). Let p=(p1,…,pa-1)T denote the allele-frequency distribution at the marker in the general population, where pi>0 is the frequency of allelic type i, and 1Tp<1. Define r=(r1,…,ra-1)T to be the (a-1) vector of expected changes in allele frequencies for a case randomly sampled from the population. Then the mean model for the MQLS in the multiallelic case is EY=μ=p⊗1+r⊗N,N∪MA), where ⊗ is the Kronecker product (see, e.g., the work of Schott7 [p. 253]), and 1 is a vector of 1s of length n; that is

equation image

where we constrain

equation image

for all 1≤i≤a-1, 1≤j≤n. Under the null hypothesis of no association between the marker and the trait, we have r=0, where 0 is a zero vector of length (a-1). Let F denote an (a-1)×(a-1) matrix with (i,j)th entry Fij= 1/2pi(1-pi) if i=j and Fij=- 1/2pipj if i≠j. Note that, under the null hypothesis of no association and no linkage, and when the pedigree founders are drawn from a population in HWE under the null hypothesis, Var(Y)=F⊗Φ. The MQLS test statistic for the multiallelic case is

equation image

where α and Γ are as defined for equations (5) and (6) and equation M28 is the (i,k) entry of F-1 evaluated at equation M29, where equation M30 is the maximum quasi-likelihood estimate of p when r=0, or equivalently, the BLUE of p based on everyone, which previous work8 has shown to be equation M31 for each i. Under the null hypothesis, the MQLS statistic follows a χ2 distribution with a-1 df (asymptotically, under regularity conditions).
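The null covariance Var(Y)=F⊗Φ used in this appendix can be assembled numerically. The following Python sketch builds F from the allele frequencies and forms the Kronecker product with a kinship-based matrix; the three-allele frequencies and the sib-pair matrix are hypothetical, and the convention that the off-diagonal entries of Φ equal twice the kinship coefficient (with diagonal 1 for outbred individuals) is an assumption carried over from the earlier definition of Φ, which is not repeated in this section.

```python
import numpy as np


def allele_covariance(p):
    """F matrix: F_ii = p_i(1 - p_i)/2 and F_ij = -p_i p_j/2 for i != j.

    p holds the frequencies of the first a-1 allelic types.
    """
    p = np.asarray(p, dtype=float)
    return 0.5 * (np.diag(p) - np.outer(p, p))


# Hypothetical marker with a = 3 alleles; the third frequency is 1 - 0.2 - 0.3.
p = np.array([0.2, 0.3])
F = allele_covariance(p)

# Hypothetical outbred full-sib pair (assumed convention: off-diagonal = 2 x kinship).
phi = np.array([[1.0, 0.5],
                [0.5, 1.0]])

# Var(Y) with Y ordered as (Y_1, ..., Y_{a-1}), each block of length n.
var_y = np.kron(F, phi)
```

With this ordering of Y, block (i,k) of the result is Fik·Φ, which matches the block structure implied by the Kronecker product in the text.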

Appendix B: Connection between MZ Twins and Invertibility of Φ

We prove that Φ is invertible if and only if the set N does not include both members of any MZ twin pair. Because Σ is a covariance matrix, it must be symmetric and positive semidefinite and is invertible if and only if it is positive definite. Φ inherits these properties. Note that Σ and hence Φ is positive definite if and only if there is no linear combination cTY, c∈Rⁿ\{0}, such that Pr(cTY=0)=1. This is because Var(cTY)=cTΣc, and Var(cTY)=0 if and only if Pr(cTY=0)=1. If N includes i and j who are MZ twins, then Pr(Yi-Yj=0)=1, and Φ is not invertible. Suppose Φ is not invertible. Then there must be some individual i in N such that Yi can be written as a linear combination of Y-i, Pr(Yi=dTY-i)=1, d∈Rⁿ⁻¹, where Y-i is Y excluding the ith element. Consider the situation in which every individual in N\{i} is heterozygous at the binary marker, which has positive probability for 0<p<1. Then, if i is not MZ twin to anyone in N\{i}, it is possible for i to have any genotype, and so it cannot be true that Pr(Yi=dTY-i)=1. Thus, Pr(Yi=dTY-i)=1 implies that i has an MZ twin in N\{i}.
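The result can be checked numerically on the smallest possible example. The Python sketch below compares a full-sib pair with an MZ twin pair, assuming the convention that the off-diagonal entries of Φ equal twice the kinship coefficient (1/2 for outbred full sibs, 1 for MZ twins) and that the diagonal is 1 for outbred individuals; the helper name `is_invertible` and the tolerance are illustrative choices.

```python
import numpy as np


def is_invertible(phi, tol=1e-12):
    """A symmetric positive semidefinite matrix is invertible iff its smallest eigenvalue is > 0."""
    return bool(np.min(np.linalg.eigvalsh(phi)) > tol)


# Outbred full sibs: kinship coefficient 1/4, so the off-diagonal entry is 1/2.
phi_sibs = np.array([[1.0, 0.5],
                     [0.5, 1.0]])

# MZ twins: kinship coefficient 1/2, so the off-diagonal entry is 1 and the rows coincide.
phi_mz = np.array([[1.0, 1.0],
                   [1.0, 1.0]])

# The contrast c = (1, -1) corresponds to Y_i - Y_j, which is 0 with probability 1 for MZ twins.
c = np.array([1.0, -1.0])
quadratic_form_mz = float(c @ phi_mz @ c)
quadratic_form_sibs = float(c @ phi_sibs @ c)
```

For the MZ pair the quadratic form cTΦc is exactly zero, so Φ is singular; in practice, keeping only one member of each genotyped MZ pair in N restores invertibility.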

Appendix C: Derivation of MQLS Mean Model

We show that, under a two-allele disease model, for an arbitrary set of possibly related outbred individuals, the ratio

equation image

(given in the “Development and Justification of the MQLS Test” section) tends to

equation image

as the effect size (or differences among penetrance probabilities) tends to zero. Throughout, we condition on the pattern of missing phenotype information. Consider a two-allele disease model with penetrance probabilities k, k-c2, and k-c3 for individuals who have 2, 1, or 0 alleles of type 1, respectively, where k-1≤c2≤k and k-1≤c3≤k, with at least one of c2 and c3 nonzero. (Note that, as c2 and c3 tend to zero, the population prevalence Kp will tend to k.) Under this model, we have

equation image
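
The parenthetical remark that Kp tends to k can be illustrated with a short calculation. This is a minimal sketch that assumes Hardy-Weinberg genotype frequencies for an outbred individual, with p the frequency of allele 1. The function name prevalence and the numerical values are hypothetical and chosen only for illustration.

```python
def prevalence(p, k, c2, c3):
    """Population prevalence Kp under assumed Hardy-Weinberg proportions.

    Penetrances are k, k - c2, and k - c3 for 2, 1, and 0 copies of allele 1.
    """
    q = 1.0 - p
    return k * p * p + (k - c2) * 2.0 * p * q + (k - c3) * q * q

p, k = 0.3, 0.1
# Shrink the penetrance differences toward zero and watch Kp approach k.
for scale in (1.0, 0.1, 0.01, 0.0):
    c2, c3 = 0.05 * scale, 0.08 * scale
    print(scale, prevalence(p, k, c2, c3))
```

Expanding the sum gives Kp = k - 2p(1-p)c2 - (1-p)^2 c3 under the same assumption, so the gap between Kp and k is linear in c2 and c3.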

Furthermore, we can express

equation image

where A is the entire phenotype vector; Y_{-i} denotes (Y_1,…,Y_{i-1},Y_{i+1},…,Y_{n+m}); Y_{-1} denotes (Y_2,…,Y_{n+m}); Y_{-(n+m)} denotes (Y_1,…,Y_{n+m-1}); and Y_{n+1},…,Y_{n+m} denote the true (unobserved) genotype values for individuals n+1,…,n+m. As c2 and c3 approach 0, we have the expansions

equation image

and

equation image

where nc is the number of affected, nu is the number of unaffected, n+m-nc-nu is the number of individuals of unknown phenotype in the study, nv,c is the number of affected individuals having a Y value of v (regardless of whether Y is actually observed or is missing in the study), and, similarly, nv,u is the number of unaffected individuals having a Y value of v. Note that, for example,

equation image

so

equation image

where P(Yj=.5|Yi=s) can be calculated in terms of the relationship between the pair of individuals (i,j), without having to consider multiple individuals jointly. Applying a similar argument to the other terms in equation (C2) and reorganizing terms, we can obtain

equation image
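
The pairwise probabilities P(Yj=t|Yi=s) used in this step depend only on the relationship between i and j. The sketch below shows one standard way to compute them for an outbred pair. It assumes the relationship is summarized by the probabilities (k0, k1, k2) of sharing 0, 1, or 2 alleles identical by descent, and that a non-shared allele is of type 1 with probability p. The function names and the use of IBD-sharing probabilities as inputs are assumptions of this sketch, not notation from the article.

```python
VALUES = (0.0, 0.5, 1.0)  # Y = half the number of copies of allele 1

def marginal(p):
    """Distribution of Y for an outbred individual (Hardy-Weinberg assumed)."""
    q = 1.0 - p
    return {0.0: q * q, 0.5: 2.0 * p * q, 1.0: p * p}

def conditional(s, p, k0, k1, k2):
    """P(Yj = t | Yi = s) for an outbred pair with IBD probabilities k0, k1, k2."""
    q = 1.0 - p
    m = marginal(p)
    # One allele shared IBD: the shared allele is of type 1 with probability s,
    # and the other allele of j is an independent draw from the population.
    ibd1 = {0.0: (1.0 - s) * q, 0.5: (1.0 - s) * p + s * q, 1.0: s * p}
    # Two alleles shared IBD: j has the same genotype as i.
    ibd2 = {t: 1.0 if t == s else 0.0 for t in VALUES}
    return {t: k0 * m[t] + k1 * ibd1[t] + k2 * ibd2[t] for t in VALUES}

def conditional_mean(s, p, k0, k1, k2):
    dist = conditional(s, p, k0, k1, k2)
    return sum(t * dist[t] for t in VALUES)

# Full sibs: (k0, k1, k2) = (1/4, 1/2, 1/4), so twice the kinship coefficient is 1/2.
p = 0.2
for s in VALUES:
    print(s, conditional(s, p, 0.25, 0.5, 0.25), conditional_mean(s, p, 0.25, 0.5, 0.25))
```

Under these assumptions the conditional mean satisfies E(Yj|Yi=s) = p + (k1/2 + k2)(s - p), where k1/2 + k2 is twice the kinship coefficient of the pair, so only pairwise relationship information is needed.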

Plugging into equation (C1) and noting that, for a pair of outbred individuals (i,j),

equation image

and

equation image

we obtain

equation image

Then, plugging in the derived expressions for μ*a-p and P(A), and letting c2 and c3 tend to 0, we get

equation image

If we let r represent the quantity μ*a-p, then this leads to our model equation M32.

Appendix D: Details of MQLS Power Improvement Diagnostic

Recall that the test statistics Wχ2corr, WQLS, and MQLS each have the form equation M33, where S=V^T Y, V=(V1,…,Vi,…,Vn)^T, and Vi can be viewed as the total weight given by the test statistic to individual i, with

equation image

Here, equation M34. Note that, by construction, equation M35 for each test statistic. To measure the difference in weights between a pair of statistics, say Wχ2corr and MQLS, for case and control individuals, we propose

equation image

and

equation image

where, for each statistic, the corresponding B is a vector of length n with ith component

equation image

which is Vi normalized by the mean total weight among cases if i is a case and is 0 if i is a control. Similarly, for each statistic, the corresponding C is a vector of length n with ith component

equation image

which is Vi normalized by the mean total weight among controls if i is a control and is 0 if i is a case. The definitions of the ψs for any other pair of statistics are analogous. Note that Bχ2corr = 1_c and Cχ2corr = 1 - 1_c, so that ψ(Wχ2corr,MQLS,case) reduces to the absolute value of the coefficient of variation of V_MQLS among cases, and, similarly, ψ(Wχ2corr,MQLS,control), ψ(Wχ2corr,WQLS,case), and ψ(Wχ2corr,WQLS,control) reduce to the absolute values of the coefficients of variation of V_MQLS among controls, V_WQLS among cases, and V_WQLS among controls, respectively, where the coefficient of variation is the SD divided by the mean.

In the situation in which there is a group G of permutations of the n+m individuals such that (1) every element of G preserves genotyped/missing status; (2) every element of G preserves A; (3) every element of G preserves Φ_{N,N∪M}; (4) for every pair of genotyped case individuals i and j, there is a permutation in G that maps i to j; and (5) for every pair of genotyped control individuals k and l, there is a permutation in G that maps k to l, then the coefficients of variation for the total weights of the cases and for the total weights of the controls are both equal to 0 for the MQLS and the WQLS statistics, and these two statistics are equivalent to the Wχ2corr test. For example, these conditions hold, at a marker with no missing genotypes, if the cases are affected sib pairs and the controls are unaffected unrelated individuals.
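
The reduced form of the diagnostic, the absolute coefficient of variation of the total weights within cases and within controls, can be sketched as follows. The weight vectors here are made-up numbers, not weights computed from any of the three statistics. The function names are hypothetical, and the use of the population SD (rather than the sample SD) is an assumption, since the text states only that the coefficient of variation is the SD divided by the mean.

```python
from statistics import mean, pstdev

def normalize_within_group(v, in_group):
    """Divide each group member's weight by the group mean; non-members get 0."""
    group_mean = mean(vi for vi, g in zip(v, in_group) if g)
    return [vi / group_mean if g else 0.0 for vi, g in zip(v, in_group)]

def abs_cv(v, in_group):
    """Absolute coefficient of variation of the weights within one group."""
    members = [vi for vi, g in zip(v, in_group) if g]
    return abs(pstdev(members) / mean(members))  # population SD: an assumption

is_case = [True] * 4 + [False] * 4
is_control = [not c for c in is_case]

# Hypothetical total weights; each vector sums to zero.
v_uniform = [0.5, 0.5, 0.5, 0.5, -0.5, -0.5, -0.5, -0.5]
v_varied = [0.5, 0.5, 0.3, 0.7, -0.5, -0.5, -0.5, -0.5]

b_uniform = normalize_within_group(v_uniform, is_case)      # case indicator
c_uniform = normalize_within_group(v_uniform, is_control)   # control indicator
cv_cases = abs_cv(v_varied, is_case)
cv_controls = abs_cv(v_varied, is_control)
print(b_uniform, c_uniform, cv_cases, cv_controls)
```

When all cases share one weight and all controls share another, the normalized vectors are the case and control indicators and both coefficients of variation are 0. Unequal weights among cases give a positive value for the case diagnostic.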

Web Resources

The URLs for data presented herein are as follows:

Affymetrix, https://www.affymetrix.com/analysis/netaffx/index.affx (for NetAffx information on the location of SNPs in and identification of genes that are in close proximity to the SNPs)
EntrezGene, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=gene (for information about genes that are close to SNPs in )
M.S.M.'s Web page, http://www.stat.uchicago.edu/~mcpeek/software/index.html (for the source code described in the text)
Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/ (for alcohol dependence, FBXO15, VAPA, CYP2C18, and HTR2A)

References

1. Spielman R, McGinnis R, Ewens W (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52:506–516
2. Risch N, Teng J (1998) The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I: DNA pooling. Genome Res 8:1273–1288
3. Bourgain C, Hoffjan S, Nicolae R, Newman D, Steiner L, Walker K, Reynolds R, Ober C, McPeek MS (2003) Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Hum Genet 73:612–626
4. Slager SL, Schaid D (2001) Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet 68:1457–1462
5. Browning S, Briley J, Chandra G, Charnecki J, Ehm M, Johansson K, Jones B, Karter A, Yarnall D, Wagner M (2005) Case-control single marker and haplotypic association analysis of pedigree data. Genet Epidemiol 28:110–122. doi:10.1002/gepi.20051
6. Edenberg HJ, Bierut LJ, Boyce P, Cao M, Cawley S, Chiles R, Doheny KF, Hansen M, Hinrichs T, Jones K, et al (2005) Description of the data from the Collaborative Study on the Genetics of Alcoholism (COGA) and single-nucleotide polymorphism genotyping for Genetic Analysis Workshop 14. BMC Genetics Suppl 6:S2. doi:10.1186/1471-2156-6-S1-S2
7. Schott JR (1996) Matrix analysis for statistics. John Wiley, New York
8. McPeek MS, Wu X, Ober C (2004) A quasi-likelihood method for allele frequency estimation. Biometrics 60:359–367. doi:10.1111/j.0006-341X.2004.00180.x
9. Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30:97–101. doi:10.1038/ng786
10. Sun L, Cox N, McPeek MS (2002) A statistical method for identification of polymorphisms that explain a linkage result. Am J Hum Genet 70:399–411
11. Grant BF, Harford TC, Dawson DA, Chou P, Dufour M, Pickering R (1994) Prevalence of DSM-IV alcohol abuse and dependence: United States 1992. Alcohol Health Res World 18:243–248
12. Hwu HG, Chen CH (2000) Association of 5HT2A receptor gene polymorphism and alcohol abuse with behavior problems. Am J Med Genet 96:797–800. doi:10.1002/1096-8628(20001204)96:6<797::AID-AJMG20>3.0.CO;2-K
13. Zhu X, Cooper R, Kan D, Cao G, Wu X (2005) A genome-wide linkage and association study using COGA data. BMC Genetics Suppl 6:S128. doi:10.1186/1471-2156-6-S1-S128
14. Chiu YF, Liu SY, Tsai YY (2005) A comparison in association and linkage genome-wide scans for alcoholism susceptibility genes using single-nucleotide polymorphisms. BMC Genetics Suppl 6:S89. doi:10.1186/1471-2156-6-S1-S89
15. Chen L, Liu N, Wang S, Oh C, Carriero NJ, Zhao H (2005) Whole-genome association studies on alcoholism comparing different phenotypes using single-nucleotide polymorphisms and microsatellites. BMC Genetics Suppl 6:S130. doi:10.1186/1471-2156-6-S1-S130
16. Rabinowitz D, Laird N (2000) A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered 50:211–223. doi:10.1159/000022918
17. Stankovich J, Bahlo M, Rubio JP, Wilkinson CR, Thomson R, Banks A, Ring M, Foote SJ, Speed TP (2005) Identifying nineteenth century genealogical links from genotypes. Hum Genet 117:188–199. doi:10.1007/s00439-005-1279-y
18. Abney M, McPeek MS, Ober C (2000) Estimation of variance components of quantitative traits in inbred populations. Am J Hum Genet 66:629–650

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics