- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

- Journal List
- NIHPA Author Manuscripts
- PMC2684820

# Proper Analysis of Secondary Phenotype Data in Case-Control Association Studies

## Abstract

Case-control association studies often collect extensive information on secondary phenotypes, which are quantitative or qualitative traits other than the case-control status. Exploring secondary phenotypes can yield valuable insights into biological pathways and identify genetic variants influencing phenotypes of direct interest. All publications on secondary phenotypes have used standard statistical methods, such as least-squares regression for quantitative traits. Because of unequal selection probabilities between cases and controls, the case-control sample is not a random sample from the general population. As a result, standard statistical analysis of secondary phenotype data can be extremely misleading. Although one may avoid the sampling bias by analyzing cases and controls separately or by including the case-control status as a covariate in the model, the associations between a secondary phenotype and a genetic variant in the case and control groups can be quite different from the association in the general population. In this article, we present novel statistical methods that properly reflect the case-control sampling in the analysis of secondary phenotype data. The new methods provide unbiased estimation of genetic effects and accurate control of false-positive rates while maximizing statistical power. We demonstrate the pitfalls of the standard methods and the advantages of the new methods both analytically and numerically. The relevant software is available at our website.

**Keywords:**case-control sampling, complex diseases, genomewide association studies, linear regression, maximum likelihood, meta-analysis, quantitative traits, secondary traits, SNPs

## INTRODUCTION

There is a proliferation of genomewide association (GWA) studies worldwide. These studies usually employ the case-control design, which consists of a sample of cases (i.e. diseased individuals) and a sample of controls (i.e., disease-free individuals). Most GWA studies measure a variety of quantitative or qualitative traits other than the disease trait that defines the case-control status. Exploring these secondary phenotypes can discover genetic variants influencing previously unstudied phenotypes and provide important clues about causal pathways. Although the main interest of case-control association studies lies in the comparison of cases and controls, analysis of secondary phenotype data may supplement the case-control comparison in the initial reports or become the primary focus of subsequent publications.

Indeed, recent months have seen an explosion of publications on genetic variants influencing human quantitative traits, such as height [Weedon et al., 2007; Sanna et al., 2008; Weedon et al., 2008; Lettre et al., 2008; Gudbjartsson et al., 2008], body mass index (BMI) [Frayling et al., 2007; Loos et al., 2008], and lipid levels [Saxena et al., 2007; Willer et al., 2008; Kathiresan et al., 2008]. The data for those publications came mostly from case-control association studies of complex diseases (e.g., diabetes, cancer and hypertension). The most recent publications were based on meta-analysis of multiple GWA studies involving thousands or tens of thousands of individuals.

All publications on quantitative traits have relied on standard linear regression analysis (i.e., classical least-squares estimation under the linear model). Five types of analysis have been conducted to assess the effects of SNPs on quantitative traits using data from case-control association studies: (1) controls only; (2) cases only; (3) combined sample of cases and controls; (4) meta-analysis of cases and controls; (5) joint analysis of cases and controls adjusted for the disease status. Methods (1) and (2) are restricted to controls and cases, respectively. Method (3) analyzes cases and controls together and ignores the disease status. In method (4), cases and controls are analyzed separately and the results are combined by a meta-analytic procedure. Method (5) analyzes cases and controls together and includes the disease status as a covariate in the model.

None of the aforementioned analysis methods is statistically correct. Because cases and controls are selected at different rates from their respective subpopulations, the case-control sample does not constitute a random sample of the general population. As a result, the population association between a SNP and a secondary trait can be distorted in the case-control sample. Thus, method (3) can be grossly misleading, especially when the secondary trait is strongly correlated with the disease and the sampling rates are very different between cases and controls. The other four methods are conditional on the disease status and are thus unaffected by the biased case-control sampling. However, the associations between a SNP and a secondary trait in the case and control groups can be quite different from the association in the general population if the secondary trait is correlated with the disease.

To illustrate the above points, we consider a dichotomous secondary trait taking the values 0 and 1 with equal probability and a SNP with minor allele frequency (MAF) of 0.2. Assume that the disease rate is 10%, the odds ratio of disease with the SNP is 1.5 under the dominant mode of inheritance, and the odds ratio of disease with the secondary trait is 2. If the SNP is unrelated to the secondary trait in the general population (i.e., the odds ratio is 1), then the odds ratios of the secondary trait with the SNP will be approximately 0.975 in the case and control groups, and the odds ratio in the combined sample of cases and controls with an equal number of cases and controls will be approximately 1.045; see Table I for details. In other words, there exist (spurious) associations between the SNP and the secondary trait in the case and control groups, as well as in the combined sample, despite the absence of association at the population level. This phenomenon implies that methods (1), (2) and (3) are all biased; methods (4) and (5) are also biased because they combine the biased results from the case and control groups. Because GWA studies typically have large sample sizes, even modest levels of bias can lead to grossly inflated false-positive rates, especially in meta-analysis.

The fact that method (3) is biased seems to contradict with the universally accepted practice of performing standard logistic regression on case-control data. Standard logistic regression analysis of case-control data indeed yields correct maximum likelihood estimates of odds ratios, although the estimate of the disease rate is generally biased [Prentice and Pyke, 1979]. This remarkable result, however, applies only to the primary disease outcome that defines the case-control status, and not to a secondary outcome that is correlated with the disease.

We have developed valid and efficient statistical methods to analyze secondary phenotype data in case-control association studies. Our methods are based on likelihood functions that properly reflect the case-control sampling. The corresponding maximum likelihood estimates are approximately unbiased and normally distributed. Furthermore, the estimates are statistically efficient in that they have the smallest variances among all valid estimates, and the corresponding association tests are the most powerful among all valid tests.

In the next section, we describe in more detail the ingredients of the new methods and the pitfalls of standard methods, particularly methods (1)-(5). In the subsequent Results section, we use Monte Carlo simulation to quantify the bias of standard methods in common situations and to evaluate the operating characteristics of the new methods. We relegate the technical details about the new and standard methods to Appendices A and B, respectively.

## METHODS

### NEW METHODS

Let *D* denote the case-control status (1 = disease; 0 = no disease) and *Y* denote the secondary phenotype. Also, let *X* denote the genotype score for a SNP of interest. Under the additive mode of inheritance, *X* is the number of minor alleles; under the dominant (recessive) model, *X* indicates, by the values 1 versus 0, whether or not the individual carries at least one minor allele (two minor alleles). We use a generalized linear model to formulate the effects of *X* on *Y*, and write the conditional density of *Y* given *X* as *P*(*Y*|*X*). If *Y* is a quantitative trait, we use the linear regression model, which specifies that the conditional distribution of *Y* given *X* is normal with mean *β*_{0} + *β*_{1}*X* and variance *σ*^{2}. If *Y* is a dichotomous trait, we use the logistic regression model, under which

We relate *Y* and *X* to *D* through the logistic regression model

We are mainly interested in *β*_{1}.

For a case-control study with a total of *n* subjects, the data consist of (*D _{i}, Y_{i}, X_{i}*) (

*i*= 1,…,

*n*). Because the sampling is conditional on the case-control status, the likelihood function takes the retrospective form, ${\mathrm{\Pi}}_{i=1}^{n}\phantom{\rule{0.2em}{0ex}}P\left({Y}_{i},{X}_{i}|{D}_{i}\right),$ which is

where *P*(*D _{i}* = 1) = Σ

_{y}Σ

_{x}

*P*(

*D*= 1|

_{i}*x, y*)

*P*(

*y*|

*x*)

*P*(

*x*),

*P*(

*D*= 0) = 1 −

_{i}*P*(

*D*= 1), and

_{i}*P*(

*D*= 0|

_{i}*X*) = 1 −

_{i}, Y_{i}*P*(

*D*= 1|

_{i}*X*). We maximize this function by the Newton-Raphson algorithm. Likelihood-based statistics (i.e., Wald, score and likelihood-ratio statistics) can be used to make inference about the parameter of main interest

_{i}, Y_{i}*β*

_{1}.

The estimation of *γ*_{0} may be unstable, especially for dichotomous *Y*. When the disease is rare such that *P*(*D* = 1|*X, Y*) ≈ *e*^{γ0+γ1X+γ2Y}, the parameter *γ*_{0} cancels in the numerator and denominator of the likelihood function. When the disease is not rare but the disease rate is known approximately, we can incorporate the information about the disease rate into the estimation. We can include environmental covariates in the models for *Y* and for *D*, but then the probability distribution of continuous environmental covariates will enter into the likelihood function as a high-dimensional nuisance parameter. We eliminate such nuisance parameters through the profile-likelihood approach. The interested readers are referred to Appendix A for the theoretical and computational details of the new methods.

## STANDARD METHODS

As described in the Introduction section, standard statistical methods have been applied to the secondary phenotype data from case-control studies in five different ways, which we will refer to as methods (1)-(5). Methods (1)-(3) are based on the prospective likelihood function Π* _{i} P*(

*Y*|

_{i}*X*), where the product is taken over the controls, the cases or all study subjects. The maximization of this function yields the classical least-squares estimation for quantitative traits and the standard logistic regression for dichotomous traits. Method (4) combines the least-squares or odds ratio estimates of the case and control samples through the inverse-variance meta-analysis procedure. Method (5) is based on the prospective likelihood function ${\mathrm{\Pi}}_{i=1}^{n}\phantom{\rule{0.2em}{0ex}}P({Y}_{i}|{X}_{i},{D}_{i}),$ which is a parametric way of combining the results of the case and control samples.

_{i}We have conducted a thorough investigation into the properties of the five standard methods. We state here the main conclusions while relegating the details to Appendix B. If the secondary phenotype is not related to the case-control status, or more precisely, *D* is independent of *Y* given *X* (i.e., *γ*_{2} = 0), then all five methods are valid. If the SNP is not associated with the case-control status, or more precisely, *D* is independent of *X* given *Y* (i.e., *γ*_{1} = 0), then all five methods yield correct estimates of odds ratios for dichotomous traits, but the least-squares estimates for quantitative traits produced by the five methods are biased unless *β*_{1} = 0 or *γ*_{2} = 0. When the disease is rare in that the probability of disease is virtually 0, all standard methods except (3) are approximately valid.

The fact that standard methods are generally invalid unless *γ*_{1} = 0 or *γ*_{2} = 0 is disconcerting. Most secondary phenotypes are strongly correlated with the case-control status, so that *γ*_{2} ≠ 0. Thus, any SNPs that are associated with the case-control status will tend to be detected as being associated with secondary phenotypes by standard methods even when the latter associations do not exist. When the associations truly exist, all five methods may produce estimates that are biased toward the null and thus reduce statistical power.

## RESULTS

We conducted extensive simulation studies to assess the performance of the standard and new methods in the analysis of secondary quantitative traits. We considered a SNP with MAF of 0.3 and additive mode of inheritance. For the model of the secondary quantitative trait, we set *β*_{0} = *σ*^{2} = 1, and let *β*_{1} = 0 and −0.12 under the null and alternative hypotheses, respectively; *β*_{1} = −0.12 means that each copy of the minor allele decreases the trait value by 12% of its standard deviation. For the disease-risk model, we set *γ*_{2} = log 2, varied the value of *γ*_{1} from 0 to log 2, and chose the value of *γ*_{0} to yield a disease rate of 1% or 5%. Note that *γ*_{1} pertains to the change in the log odds ratio of disease for each copy of the minor allele of the SNP and *γ*_{2} to the change in the log odds ratio of disease for one standard-derivation increase of the quantitative trait. The choice of *γ*_{2} = log 2 represents a strong, but not uncommon, association between the secondary phenotype and the disease. For each combination of simulation parameters, we generated 1,000,000 data sets with 1,000 cases and 1,000 controls. We compared the new method to standard methods (1)-(3), i.e., the least-squares methods based on the data from the control group, the case group, and the combined sample of cases and controls. We focused on methods (1)-(3) because methods (4) and (5) are closely related (1) and (2).

We summarize the key results in Figures 1 and and2.2. Figure 1 shows the biases of effect estimates and the coverage probabilities of 99% confidence intervals for *β*_{1} as a function of the odds ratio of disease with the SNP (i.e., *e*^{γ1}) under the alternative hypothesis (i.e., *β*_{1} = −0.12). Figure 2 displays the type I error and power for testing the null hypothesis of *β*_{1} = 0 at the nominal significance level of 1%. We restricted the horizontal axis to 1.5 because the odds ratios that have been observed in GWA studies thus far are mostly less than 1.5 although the odds ratios may be higher in candidate genes studies.

**...**

**...**

The new method performs very well. The effect estimates are virtually unbiased, and the variance estimates accurately reflect the true variations (latter data not shown). A a result, the confidence intervals have proper coverage probabilities, and the association tests have correct type I error rates. The power of the new method is always above 80%.

The least-squares method based on the combined sample of cases and controls, i.e., method (3), can be very wrong, especially when the SNP is strongly related to the disease. The effect estimates are biased, the confidence intervals have poor coverage probabilities, and the type I error is inflated. The problems associated with this method are more severe for rarer diseases. The type I error rates are 8 times and 5 times the nominal significance level under the disease rates of 1% and 5%, respectively, when the odds ratio of disease with the SNP is 1.3. This strategy may be more powerful or less powerful than the new method, dependent on how the SNP affects the disease status and the secondary phenotype. Figure 2 shows that this strategy can be substantially less powerful than the new method.

When the disease is rare, say less than 1%, the least-squares methods based on controls only and cases only, i.e., methods (1) and (2), are appropriate in that the effect estimates are approximately unbiased, the confidence intervals have adequate coverage probabilities, and the association tests have reasonable type I error rates. These two methods, however, use half of the study subjects and are thus much less powerful than the new method. For relatively common diseases, methods (1) and (2) yield biased effect estimates, improper confidence intervals and inflated type I error, especially when the SNP is strongly related to the disease. When the disease rate is 5% and the odds ratio of disease with the SNP is 1.5, the type I error rates for methods (1) and (2) are about 1.3% and 1.8%, respectively.

All the aforementioned results pertain to MAF of 0.3, *γ*_{2} of log 2, 1,000 cases and 1,000 controls, and nominal significance level of 1%. We also considered other combinations of simulation parameters. The new methods continued to perform well. The performance of standard methods became worse as MAF, *γ*_{2} or sample size was increased and as the nominal significance level was lowered. In addition, the performance of methods (1) and (2) became worse as the disease rate was increased.

Figure 3 displays the type I error rates of the five standard methods for MAF of 0.2, disease rate of 7%, 2,000 cases and 2,000 controls, and nominal significance level of 10^{−6}. The type I error rates for all five methods increase rapidly with increasing values of *γ*_{1} and *γ*_{2} and are seriously inflated even with moderate values of *γ*_{1} and *γ*_{2}. Under *γ*_{1} = log 1.3 and *γ*_{2} = 0.5, the type I error rates of methods (1)–(5) are, respectively, 1.7, 2.4, 20, 3.2 and 3.2 times the nominal significance level; under *γ*_{1} = log 1.2 and *γ*_{2} = 1, the five type I error rates are 1.9, 6.2, 25, 6.7 and 7 times the nominal significance level. Controls-only analysis has the least inflation of type I error, followed by cases-only analysis. Meta-analysis of cases and controls has higher type I error than cases-only and controls-only analyses, mainly because it has twice the sample size. The joint analysis of cases and controls with the disease status as a covariate in the linear model has slightly higher inflation of type I error than the meta-analysis. The analysis of the combined sample without adjusting for the disease status, i.e., method (3), performs much worse than the other four methods.

## DISCUSSION

The purpose of this article is two-fold: to evaluate the standard methods and to develop better methods for the analysis of secondary phenotype data in case-control association studies. We have demonstrated both analytically and numerically that all the standard methods can have severely inflated type I error and reduced power in practical situations. The new methods provide accurate control of the type I error and have the highest efficiency/power among all valid methods. We have developed efficient numerical algorithms to implement the new methods and posted the software online. With our software, analysis of a GWA study (with 500K-1,000K SNPs) can be completed in a few hours.

As mentioned in the Introduction section, the recent publications on human height, BMI and lipid levels relied on standard linear regression analysis of quantitative trait data from case-control studies of complex diseases. We are not challenging the conclusions of those publications because some of the initial results were validated in cross-sectional or cohort data, but the effect estimates and *P*-values reported in the papers might be questionable. Using valid and efficient statistical methods in the initial scans would reduce the number of false positives and increase the number of true positives among the initial results and thus enhance the success rates of validation efforts. It would be even more important to use proper statistical methods in any validation analysis.

When the disease is rare, all standard methods except (3) are approximately valid. Our simulation studies revealed that this assumption is problematic when the disease rate is appreciably higher than 1%. As shown in Figures 1 and and2,2, methods (1) and (2) have serious problems when the disease rate is 5%. We recommend to use the rare disease assumption only when the disease rate is less than 2%.

Many of the complex diseases currently under study are relatively common (5-10%), so the results of Figure 3 are particularly relevant. If the disease is type 2 diabetes, then *γ*_{2} is close to 1 for BMI and triglyceride levels and close to 0 for height. Thus, standard linear regression analysis of BMI and triglyceride levels in case-control studies of type 2 diabetes would be more biased than that of height. The most recent publications on quantitative traits were based on meta-analyses of tens of thousands of subjects, so the inflation of type I error would be much more profound than what is seen in Figure 3.

Although only a small fraction of SNPs are truly associated with a complex disease, any SNPs that are associated with the disease in the observed data (mostly by chance) will tend to be spuriously associated with secondary traits in the case and control groups as well as in the combined sample; therefore, all five standard methods can cause large-scale increases of false positive results in GWA studies. Indeed, the quantile-quantile plots in several publications [Weedon et al., 2007; Weedon et al., 2008; Gudbjartsson et al., 2008] showed substantial deviations of observed statistics from the expected (after correcting for population stratification), even at the 0.01 level of significance.

## Acknowledgments

This research was supported by the National Institutes of Health. The authors thank Chad He for his assistance in preparing the figures.

## APPENDIX A: THEORETICAL AND COMPUTATIONAL ASPECTS OF NEW STATISTICAL METHODS

We provide theoretical and computational details for the new statistical methods. In the main text of this article, *X* pertains to a single SNP. In this appendix, we expand *X* to contain all genetic and environmental factors of interest (including gene-environment interactions). We use *P _{θ}*(

*y*|

*x*) to denote the conditional density of

*Y*given

*X*=

*x*, which is formulated through a parametric model with a set of parameters

*θ*. We assume that

where *γ*_{1} is now a vector.

The data consist of (*D _{i}, Y_{i}, X_{i}*) (

*i*= 1,…,

*n*). The likelihood function is

where *P*(*x*) is the density of *X*, and

For discrete components, the integration is replaced by the summation.

Before deriving estimation methods, we need to determine which parameters are estimable or identifiable. We wish to show that two sets of parameters {*γ*_{0}, *γ*_{1}, *γ*_{2}, *θ*, *P*(*x*)} and {*γ*_{0}^{*}, *γ*_{1}^{*}, *γ*_{2}^{*}, *θ**, *P**(*x*)} are identical if they yield the same likelihood. It is natural to assume that *P _{θ}*(

*Y*|

*X*) =

*P*(

_{θ*}*Y*|

*X*) if and only if

*θ*=

*θ**and that the data matrix for (1,

*X, Y*) is of full rank.

We first consider the case in which the disease is rare such that *P*(*D* = 1|*X, Y*) ≈ e^{γ0+γ1TX+γ2Y} and *P*(*D* = 0|*X, Y*) ≈ 1. Then the likelihood function becomes

We let *D* = 0 to obtain *P _{θ}*(

*Y*|

*X*)

*P*(

*X*) =

*P*(

_{θ*}*Y*|

*X*)

*P**(

*X*), which implies

*P*(

*X*) =

*P**(

*X*) and

*θ*=

*θ**. We let

*D*= 1 to obtain

*γ*

_{1}

^{T}

*X*+

*γ*

_{2}

*Y*=

*γ*

_{1}

^{*T}

*X*+

*γ*

_{2}

*+*

^{*}Y*c*, where

*c*is a constant. Because the data matrix for (1,

*X, Y*) is of full rank, this equation yields

*γ*

_{1}=

*γ*

_{1}

^{*}and

*γ*

_{2}=

*γ*

_{2}

^{*}(implying

*c*= 0). Thus, all parameters are identifiable.

Next, we consider the case in which the disease rate is known. Since *P*(*D* = 1) is known,

Summing over *D* = 0 and 1 yields *P _{θ}*(

*Y*|

*X*)

*P*(

*X*) =

*P*(

_{θ*}*Y*|

*X*)

*P**(

*X*), which implies

*θ*=

*θ**and

*P*(

*X*) =

*P**(

*X*). This further implies

*γ*

_{0}+

*γ*

_{1}

^{T}

*X*+

*γ*

_{2}

*Y*=

*γ*

_{0}

^{*}+

*γ*

_{1}

^{*T}

*X*+

*γ*

_{2}

^{*}

*Y*. Hence, all parameters are identifiable.

The remaining case is when the disease is not rare and the disease rate is unknown. It can be shown that *γ*_{1} = *γ*_{1}^{*}, *γ*_{2}= *γ*_{2}^{*}, and

where *c*_{0} is a constant [Roeder et al., 1996]. In most cases, the above equation implies *θ* = *θ**. (In particular, we can show that *θ* = *θ** for continuous traits satisfying the linear regression model; this is also true for dichotomous traits satisfying the logistic regression model if *γ*_{2} = 0 or if *γ*_{2} ≠ 0 and there is a continuous component of *X* that is related to *D*.) Then

If *γ*_{2} ≠ 0, then the above equation clearly implies *γ*_{0}= *γ*_{0}^{*} and *P*(*x*) = *P**(*x*), so all parameters are identifiable. If *γ*_{2} = 0, we have

so *γ*_{1}, *γ*_{2} and *θ* are identifiable while *P*(*x*)/(1 + *e*^{γ0+γ1Tx}) is identifiable up to some constant.

In all cases, we estimate the parameters by maximizing the likelihood functions. Because *P*(*x*) is potentially high-dimensional, we use the profile likelihood approach. We provide estimation procedures separately for the three cases discussed above.

We first consider the rare-disease case. Write *p _{i}* =

*P*(

*X*). By differentiating the log-likelihood function with respect to

_{i}*p*, we obtain

_{i}where *λ* is the Lagrange multiplier for the constraint Σ_{i} *p _{i}* = 1, and

*n*

_{1}is the number of cases. Multiplying the above equation by

*p*and summing over

_{i}*i*, we see

*λ*=

*n*−

*n*

_{1}. Thus,

where

By the arguments of Lin and Zeng [2006], maximizing the likelihood function is equivalent to maximizing

which is called the profile log-likelihood function for *γ*_{1}, *γ*_{2}, *θ* and *ξ*.

When the disease rate is known, we maximize

subject to the constraints that Σ* _{i} p_{i}* = 1 and

where *ξ*_{0} is the known value of *P*(*D* = 1). By using the Lagrange multipliers, we see that the estimate for *p _{i}* satisfies

The two constraints imply *λξ*_{0} + = n; therefore,

where *λ* satisfies Σ* _{i} p_{i}* = 1. Thus, the profile log-likelihood function for

*θ*,

*γ*

_{0},

*γ*

_{1}and

*γ*

_{2}is

where *λ* is determined by the equation

under the constraint that each term in the summation is positive.

Finally, we deal with the case in which the disease is not rare and the disease rate is unknown. If *γ*_{2} = 0, we can use standard statistical methods; see Appendix B. Suppose that *γ*_{2} ≠ 0. By introducing the Lagrange multiplier *λ*, we differentiate the log-likelihood function to obtain

where *n*_{1} and *n*_{0} are the numbers of cases and controls. It is easy to see that *λ* = 0, so

where *ξ* = *P*(*D* = 1). Plugging this expression into the log-likelihood function yields the profile log-likelihood function for *γ*_{0}, *γ*_{1}, *γ*_{2}, *θ* and *ξ*

In all three cases, we maximize the profile log-likelihood functions via the Newton-Raphson algorithm or optimization algorithms. By the arguments used in the proofs of Theorems 2 and 3 of Lin and Zeng [2006], we can show that the maximum likelihood estimators are consistent and asymptotically normal. In addition, the limiting covariance matrix attains the efficiency bound and can be consistently estimated by the inverse of the negative Hessian matrix of the profile log-likelihood function.

## APPENDIX B: THEORETICAL PROPERTIES OF STANDARD STATISTICAL METHODS

We investigate the validity of standard statistical methods. Let *P* denote the true probability, and let *P _{obs}* denote the observed probability under the case-control design. Since

and *P _{obs}*(

*Y, X*|

*D*) =

*P*(

*Y, X*|

*D*), we have

Likewise,

and

If *P _{obs}*(

*Y*|

*X*) =

*P*(

*Y*|

*X*), then the usual (prospective) likelihood based on the combined sample of cases and controls will yield correct estimates of

*θ*as well as correct variance estimates. The sufficient and necessary condition for this equality is that

*Y*is independent

*D*conditional on

*X*, i.e.,

*γ*

_{2}= 0. To show the sufficiency, we note that equation (1) under

*γ*

_{2}= 0 entails

Similarly, *P*(*X*) = *P*(*X* |*D* = 0)(*e*^{γ0+γ1TX} + 1)*P*(*D* = 0). Plugging these two expressions into equation (1), we see *P _{obs}*(

*Y*|

*X*) =

*P*(

*Y*|

*X*). We show that

*γ*

_{2}= 0 is the necessary condition by contradiction. Suppose that

*P*(

_{obs}*Y*|

*X*) =

*P*(

*Y*|

*X*) but

*γ*

_{2}≠ 0. Then equation (1) implies

This is false since *P _{obs}*(

*D*= 1) is arbitrary under the case-control design. Using equations (2) and (3), we can show that

*γ*

_{2}= 0 is also the necessary and sufficient condition for the standard analysis based on controls only or cases only to be valid. In most case-control studies, secondary phenotypes are correlated with the disease status, so that

*γ*

_{2}≠ 0.

What happens if *D* is independent of *X* conditional on *Y*, i.e., *γ*_{1} = 0? We first consider dichotomous traits. By switching the roles of *X* and *Y* in the above derivation, we see *P _{obs}*(

*X*|

*Y*) =

*P*(

*X*|

*Y*) under

*γ*

_{1}= 0; similar calculations show that

*P*(

_{obs}*X*|

*Y, D*) is proportional to

*P*(

*X*|

*Y*). Therefore, standard logistic regression analysis based on cases only, controls only, or the combined sample yields correct estimates of the odds ratios in the general population. For quantitative traits, the linear model specifies that

*Y*=

*β*

_{0}+

*β*

_{1}

^{T}

*X*+ , where is zero-mean normal with variance

*σ*

^{2}. Write

*Ỹ*=

*Y*−

*β*

_{1}

^{T}

*X*. Clearly,

*Ỹ*is independent of

*X*, so standard linear regression will estimate

*β*

_{1}correctly if and only if

*P*(

_{obs}*Ỹ*|

*X*) has mean zero. In light of equation (1), this means

for all *X*, implying *γ*_{1} = −*γ*_{2}*β*_{1}. Thus, standard regression analysis is biased under *γ*_{1} = 0 unless *β*_{1} = 0 or *γ*_{2} = 0. This explains the patterns of bias seen in Figure 2.

We now investigate the bias of including the case-control status as a predictor in a regression model. It follows from equations (2) and (3) that for dichotomous traits,

and for continuous traits,

where *c*_{0}(*X*) = *∫*_{y} *P*(*y*|*X*)(1 + *e*^{γ0+γ1TX+γ2y})^{−1}*dy*. Thus, the logistic regression of *Y* on *X* and *D* is not a valid analysis of the effects of *X* on *Y* in the general population unless *γ*_{1} = 0 or *γ*_{2} = 0, whereas the linear regression of *Y* on *X* and *D* generally yields biased estimates of the effects of *X* on *Y* in the general population unless *γ*_{1} = 0 and *γ*_{2} = 0.

Finally, we consider rare disease. When the disease is rare, equation (1) becomes

Clearly, *P _{obs}*(

*Y*|

*X*) is not equivalent to

*P*(

*Y*|

*X*) unless

*γ*

_{2}= 0. That is, standard analysis based on the combined sample is generally invalid. For rare disease, equation (2) becomes

Thus, standard analysis based on controls only is approximately valid. On the other hand, equation (3) becomes

Thus, *P _{obs}*(

*Y*|

*X, D*= 1) is not equivalent to

*P*(

*Y*|

*X*) unless

*γ*

_{2}= 0. Interestingly, when

*Y*is dichotomous,

so standard logistic regression based on cases only yields approximately correct estimates of odds ratios. If *Y* is continuous satisfying the linear model, then *P _{obs}* (

*Y*|

*X, D*= 1) is approximately proportional to exp{−(

*Y*−

*β*

_{1}

^{T}

*X*−

*β*

_{0}−

*σ*

^{2}

*γ*

_{2})

^{2}/2

*σ*

^{2}}; therefore, standard linear regression based on cases only is approximately correct for rare disease. Whether the trait is dichotomous or continuous, the regression means between cases and controls differ only by a constant, so (for rare disease) the regression of

*Y*on

*X*and

*D*yields approximately correct estimation of the effects of

*X*on

*Y*in the general population.

## References

- Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, Lindgren CM, Perry JR, Elliott KS, Lango H, Rayner NW, et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science. 2007;316:889–894. [PMC free article] [PubMed]
- Gudbjartsson DF, Walters GB, Thorleifsson G, Stefansson H, Halldorsson BV, Zusmanovich P, Sulem P, Thorlacius S, Gylfason A, Steinberg S, et al. Many sequence variants affecting diversity of adult human height. Nat Genet. 2008;40:609–615. [PubMed]
- Kathiresan S, Melander O, Guiducci C, Surti A, Burtt NP, Rieder MJ, Cooper GM, Roos C, Voight BF, Havulinna AS, et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat Genet. 2008;40:189–197. [PMC free article] [PubMed]
- Lettre G, Jackson AU, Gieger C, Schumacher FR, Berndt SI, Sanna S, Eyheramendy S, Voight BF, Butler JL, Guiducci C, et al. Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet. 2008;40:584–591. [PMC free article] [PubMed]
- Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies. J Am Stat Ass. 2006;101:89–118. with discussion.
- Loos RJF, Lindgren CM, Li S, Wheeler E, Zhao JH, Prokopenko I, Inouye M, Freathy RM, Attwood AP, Beckmann JS, et al. Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet. 2008;40:768–775. [PMC free article] [PubMed]
- Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
- Roeder K, Carroll RJ, Lindsay BG. A semiparametric mixture approach to case-control studies with errors in covariables. J Am Stat Ass. 1996;91:722–732.
- Sanna S, Jackson AU, Nagaraja R, Willer CJ, Chen WM, Bonnycastle LL, Shen H, Timpson N, Lettre G, Usala G. Common variants in the GDF5-UQCC region are associated with variation in human height. Nat Genet. 2008;40:198–203. [PMC free article] [PubMed]
- Saxena R, Voight BF, Lyssenko V, Burtt NP, de Bakker PIW, Chen H, Roix JJ, Kathiresan S, Hirschhorn JN, Daly MJ, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science. 2007;316:1331–1336. [PubMed]
- Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JRB, Stevens S, Hall AS, et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet. 2008;40:575–583. [PMC free article] [PubMed]
- Weedon MN, Lettre G, Freathy RM, Lindgren CM, Voight BF, Perry JR, Elliott KS, Hackett R, Guiducci C, Shields B, et al. A common variant of HMGA2 is associated with adult and childhood height in the general population. Nat Genet. 2007;39:1245–1250. [PMC free article] [PubMed]
- Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R, Heath SC, Timpson NJ, Najjar SS, Stringham HM, et al. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet. 2008;40:161–169. [PubMed]

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (992K)

- A Gaussian copula approach for the analysis of secondary phenotypes in case-control genetic association studies.[Biostatistics. 2012]
*He J, Li H, Edmondson AC, Rader DJ, Li M.**Biostatistics. 2012 Jul; 13(3):497-508. Epub 2011 Sep 19.* - LRTae: improving statistical power for genetic association with case/control data when phenotype and/or genotype misclassification errors are present.[BMC Genet. 2006]
*Barral S, Haynes C, Stone M, Gordon D.**BMC Genet. 2006 Apr 27; 7:24. Epub 2006 Apr 27.* - Using cases to strengthen inference on the association between single nucleotide polymorphisms and a secondary phenotype in genome-wide association studies.[Genet Epidemiol. 2010]
*Li H, Gail MH, Berndt S, Chatterjee N.**Genet Epidemiol. 2010 Jul; 34(5):427-33.* - [Genome-wide association study on complex diseases: genetic statistical issues].[Yi Chuan. 2008]
*Yan WL.**Yi Chuan. 2008 May; 30(5):543-9.* - Combining case-control and case-trio data from the same population in genetic association analyses: overview of approaches and illustration with a candidate gene study.[Am J Epidemiol. 2009]
*Infante-Rivard C, Mirea L, Bull SB.**Am J Epidemiol. 2009 Sep 1; 170(5):657-64. Epub 2009 Jul 27.*

- An alternative hypothesis testing strategy for secondary phenotype data in case-control genetic association studies[Frontiers in Genetics. ]
*Lutz SM, Hokanson JE, Lange C.**Frontiers in Genetics. 5188* - Bias correction to secondary trait analysis with case-control design[Statistics in medicine. 2013]
*Chen HY, Kittles R, Zhang W.**Statistics in medicine. 2013 Apr 30; 32(9)1494-1508* - A Note on Penalized Regression Spline Estimation in the Secondary Analysis of Case-Control Data[Statistics in biosciences. 2013]
*Gazioglu S, Wei J, Jennings EM, Carroll RJ.**Statistics in biosciences. 2013 Nov 1; 5(2)250-260* - Genomic study in Mexicans identifies a new locus for triglycerides and refines European lipid loci[Journal of medical genetics. 2013]
*Weissglas-Volkov D, Aguilar-Salinas CA, Nikkola E, Deere KA, Cruz-Bautista I, Arellano-Campos O, Muñoz-Hernandez LL, Gomez-Munguia L, Ordoñez-Sánchez ML, Reddy PM, Lusis AJ, Matikainen N, Taskinen MR, Riba L, Cantor RM, Sinsheimer JS, Tusie-Luna T, Pajukanta P.**Journal of medical genetics. 2013 May; 50(5)298-308* - Unified Analysis of Secondary Traits in Case-Control Association Studies[Journal of the American Statistical Associa...]
*Ghosh A, Wright FA, Zou F.**Journal of the American Statistical Association. 2013; 108(502)10.1080/01621459.2013.793121*

- PubMedPubMedPubMed citations for these articles