Am J Hum Genet. 2011 Sep 9; 89(3): 354–367.
PMCID: PMC3169821

A General Framework for Detecting Disease Associations with Rare Variants in Sequencing Studies

Abstract

Biological and empirical evidence suggests that rare variants account for a large proportion of the genetic contributions to complex human diseases. Recent technological advances in high-throughput sequencing platforms have made it possible for researchers to generate comprehensive information on rare variants in large samples. We provide a general framework for association testing with rare variants by combining mutation information across multiple variant sites within a gene and relating the enriched genetic information to disease phenotypes through appropriate regression models. Our framework covers all major study designs (i.e., case-control, cross-sectional, cohort and family studies) and all common phenotypes (e.g., binary, quantitative, and age at onset), and it allows arbitrary covariates (e.g., environmental factors and ancestry variables). We derive theoretically optimal procedures for combining rare mutations and construct suitable test statistics for various biological scenarios. The allele-frequency threshold can be fixed or variable. The effects of the combined rare mutations on the phenotype can be in the same direction or different directions. The proposed methods are statistically more powerful and computationally more efficient than existing ones. An application to a deep-resequencing study of drug targets led to a discovery of rare variants associated with total cholesterol. The relevant software is freely available.

Introduction

Genome-wide association studies (GWAS) with tagSNPs have successfully identified common SNPs with small to modest effects for virtually every complex human disease. Technological advances in high-throughput sequencing platforms have made it possible for researchers to extend association studies to rare variants in targeted exons and soon in the entire genome. Rare variants tend to be functional alleles and have stronger effects on complex diseases than common variants.1,2 Indeed, deep-resequencing studies of candidate genes have already demonstrated the influence of rare variants on several complex traits.3–5

Association testing with a single rare variant has limited power because only a small percentage of study subjects carry a rare mutation and there are a large number of tests to be adjusted for. Collapsing or grouping methods, which combine information across multiple variant sites within a gene, can enrich association signals and reduce the penalty of multiple testing. The simplest collapsing method is the burden test, which is based on the number of rare mutations each subject carries in a gene.6,7 A second approach is the weighted sum statistic of Madsen and Browning,8 which weights each mutation according to its frequency in the unaffected subjects and permutes the disease status to assess the significance of a Wilcoxon-type test statistic. A third approach is the variable-threshold (VT) idea of Price et al.,9 which uses the maximum of the test statistics over all allele-frequency thresholds and assesses statistical significance by permutation. The foregoing methods assume that the effects of the combined rare mutations on the phenotype are in the same direction. To detect opposite effects, Han and Pan10 incorporated the signs of the observed effects into the burden test, whereas Neale et al.11 and Wu et al.12 tested the variance of the effects.

In this article, we provide a general framework for association testing with rare variants that reflects the spirit of the existing methods but is statistically more powerful and computationally more efficient. Our framework covers all major study designs (i.e., case-control, cross-sectional, cohort and family studies) and all common phenotypes (e.g., binary and quantitative traits, and potentially censored ages at onset of disease) and allows any covariates (e.g., environmental factors and ancestry variables). The ability to accommodate covariates is critically important because population stratification is expected to be a more severe issue with rare variants than with common variants but could be corrected by including suitable ancestry variables (e.g., the percentage of African ancestry or principal components for ancestry) in the association analysis. We combine information across multiple variant sites within a gene by taking a weighted sum of the mutation counts for each study subject and relate the combined information and covariates to disease phenotypes through appropriate regression models. We derive theoretically optimal weights that would produce the most powerful tests among all valid tests and develop the corresponding testing procedures. We employ score-type statistics, which are numerically stable even in the case of extremely rare variants and computationally fast even in the presence of covariates. We provide asymptotic normal approximation for both fixed-threshold and VT methods and develop permutation and other resampling tests that can accommodate covariates. We investigate theoretically and numerically when normal approximation is appropriate and when resampling is required. We modify the popular methods of Madsen and Browning8 and Price et al.9 to enhance statistical power, avoid permutation, and accommodate covariates. We construct data-adaptive test statistics that are powerful even when the combined rare mutations have opposite effects on the phenotype. The advantages of the proposed methods over the existing ones are demonstrated both analytically and empirically. The software implementing the proposed methods is available at our website.

Material and Methods

Suppose that a total of n subjects are genotyped on a total of m SNPs in a gene and that there are d covariates. Here, the word “gene” refers to the group of variants that will be collectively analyzed and might pertain to a subset of SNPs within a gene or to a region or pathway involving multiple genes; covariates might include nongenetic variables, such as age and smoking status, as well as ancestry variables, such as the percentage of African ancestry and principal components for ancestry. For i = 1, …, n, let Yi be the phenotype value of the ith subject; for i = 1, …, n and j = 1, …, m, let Xji denote the number of the rare mutation the ith subject carries at the jth SNP; for i = 1, …, n and j = 1, …, d, let Zji denote the value of the jth covariate on the ith subject. We can define

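$$X_i = \begin{bmatrix} X_{1i} \\ \vdots \\ X_{mi} \end{bmatrix}, \qquad Z_i = \begin{bmatrix} 1 \\ Z_{1i} \\ \vdots \\ Z_{di} \end{bmatrix}.$$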

We focus on binary phenotypes in the main text but consider all common phenotypes in Appendix A. It is natural to relate Yi to Xi and Zi through the logistic regression model:

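$$\Pr(Y_i = 1) = \frac{e^{\beta^{\mathrm T} X_i + \gamma^{\mathrm T} Z_i}}{1 + e^{\beta^{\mathrm T} X_i + \gamma^{\mathrm T} Z_i}},$$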
(Equation 1)

where β and γ are m × 1 and (d + 1)×1 vectors of unknown regression coefficients. Because the first component of Zi is 1, the first component of γ corresponds to the intercept. We can write β = τξ, where τ is a scalar constant, and ξ = β / τ. Then Equation (1) becomes

$$\Pr(Y_i = 1) = \frac{e^{\tau S_i + \gamma^{\mathrm T} Z_i}}{1 + e^{\tau S_i + \gamma^{\mathrm T} Z_i}},$$
(Equation 2)

where $S_i = \xi^{\mathrm T} X_i$. Note that $\xi = (\xi_1, \ldots, \xi_m)^{\mathrm T}$ is an m × 1 vector of weights and that $S_i$ is a weighted linear combination of $X_{1i}, \ldots, X_{mi}$ with $X_{ji}$ receiving the weight $\xi_j$. We will refer to ξ as the weight function.

The score statistic for testing the null hypothesis H0:τ = 0 takes the form

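$$U = \sum_{i=1}^{n} \left( Y_i - \frac{e^{\hat\gamma^{\mathrm T} Z_i}}{1 + e^{\hat\gamma^{\mathrm T} Z_i}} \right) S_i,$$

where $\hat\gamma$ is the restricted maximum likelihood estimator of γ and solves the equation

$$\sum_{i=1}^{n} \left( Y_i - \frac{e^{\gamma^{\mathrm T} Z_i}}{1 + e^{\gamma^{\mathrm T} Z_i}} \right) Z_i = 0.$$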

The variance of U is estimated by

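$$V = \sum_{i=1}^{n} v_i S_i^2 - \left( \sum_{i=1}^{n} v_i S_i Z_i \right)^{\mathrm T} \left( \sum_{i=1}^{n} v_i Z_i Z_i^{\mathrm T} \right)^{-1} \left( \sum_{i=1}^{n} v_i S_i Z_i \right),$$

where

$$v_i = \frac{e^{\hat\gamma^{\mathrm T} Z_i}}{\left( 1 + e^{\hat\gamma^{\mathrm T} Z_i} \right)^2}.$$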

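Under H0, the test statistic $T = U / V^{1/2}$ is asymptotically standard normal. In the absence of covariates,

$$U = \sum_{i=1}^{n} (Y_i - \bar{Y}) S_i,$$

and

$$V = \bar{Y}(1 - \bar{Y}) \left\{ \sum_{i=1}^{n} S_i^2 - n^{-1} \left( \sum_{i=1}^{n} S_i \right)^2 \right\},$$

where $\bar{Y} = n^{-1} \sum_{i=1}^{n} Y_i$.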
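The statistics U, V, and T above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `fit_null_logistic` and `score_test` are hypothetical, the null model is fitted by Newton-Raphson (the source does not say which solver is used), and the toy data are made up. With an intercept as the only covariate, the result should agree with the closed-form expressions for the covariate-free case.

```python
import numpy as np

def fit_null_logistic(y, Z, max_iter=50, tol=1e-10):
    """Solve sum_i {y_i - expit(gamma^T Z_i)} Z_i = 0 by Newton-Raphson (assumed solver)."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ gamma)))        # fitted null probabilities
        grad = Z.T @ (y - mu)                           # left side of the score equation
        hess = Z.T @ (Z * (mu * (1.0 - mu))[:, None])   # sum_i v_i Z_i Z_i^T
        step = np.linalg.solve(hess, grad)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma

def score_test(y, X, Z, xi):
    """Return (U, V, T) for binary y, genotypes X (n x m), covariates Z
    (n x (d+1), first column all ones), and weights xi (length m)."""
    y, X, Z = (np.asarray(a, dtype=float) for a in (y, X, Z))
    S = X @ np.asarray(xi, dtype=float)                 # S_i = xi^T X_i
    gamma_hat = fit_null_logistic(y, Z)
    mu = 1.0 / (1.0 + np.exp(-(Z @ gamma_hat)))
    v = mu * (1.0 - mu)                                 # v_i
    U = np.sum((y - mu) * S)
    b = Z.T @ (v * S)                                   # sum_i v_i S_i Z_i
    A = Z.T @ (Z * v[:, None])                          # sum_i v_i Z_i Z_i^T
    V = np.sum(v * S**2) - b @ np.linalg.solve(A, b)
    return U, V, U / np.sqrt(V)

# Toy data (made up): 8 subjects, 3 variant sites, intercept-only covariates.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0],
              [0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]])
Z = np.ones((8, 1))
U, V, T = score_test(y, X, Z, np.ones(3))

# Closed-form check for the covariate-free case.
S = X.sum(axis=1).astype(float)
ybar = y.mean()
U_closed = np.sum((y - ybar) * S)
V_closed = ybar * (1 - ybar) * (np.sum(S**2) - S.sum()**2 / len(y))
print(U, V, T, U_closed, V_closed)
```

Because the null model contains no genotype terms, `fit_null_logistic` needs to be run only once per phenotype, after which each gene costs a handful of matrix-vector products.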

The true value of the weight function $\xi = (\xi_1, \ldots, \xi_m)^{\mathrm T}$ is unknown and must be determined biologically or empirically. If we set ξj = 1 (j = 1, …, m), then T is a burden test, which counts the total number of rare mutations each subject carries over the m SNPs. If we believe that common variants are not associated with the phenotype, then we set ξj = 0 if pj > c, where pj is the minor allele frequency (MAF) of the jth SNP, and c is a given threshold. If we set $\xi_j = \{p_j(1 - p_j)\}^{-1/2}$ (j = 1, …, m), then the weight function is in the same vein as that of Madsen and Browning.8

If the choice of the weight function ξ is not proportional to β or ξ is estimated from the data, then U is no longer the score statistic. However, we show in Appendix A that the test statistic T is asymptotically standard normal under H0 regardless of how ξ is determined. The only condition is that if ξ is estimated from the data, then the estimate converges to a constant vector as the sample size n increases. This condition is satisfied by all sensible estimates, including those based on estimated allele frequencies. If the choice of ξ or the limit of the estimate of ξ is proportional to β, then the corresponding test statistic T is the most powerful among all valid tests.

The weight function ξ is similar to that of Price et al.9 The latter authors showed that, for case-control studies with known allele frequencies in the control population, the choice of $\xi_j = \{p_j(1 - p_j)\}^{-1/2}$ (j = 1, …, m) corresponds to the implicit assumption that $\log(\mathrm{OR}_j) \propto \{p_j(1 - p_j)\}^{-1/2}$ (j = 1, …, m), where ORj is the odds ratio in the 2 × 2 table for the jth SNP. Our theory is much more general in that it assumes unknown allele frequencies and accommodates covariates. Indeed, the proposed test statistic is optimal if ξ is proportional to the set of regression coefficients (in the limit); this result holds for all phenotypes, including binary and continuous traits, as well as potentially censored ages at onset of disease.

Madsen and Browning8 suggested setting $\xi_j = \{\hat{p}_j(1 - \hat{p}_j)\}^{-1/2}$ (j = 1, …, m), where $\hat{p}_j$ is the estimate of the MAF of the jth SNP in the unaffected subjects. Because the weights depend on the phenotype values, the authors suggested a permutation-based test. Our testing framework allows such data-dependent weights because the frequency estimates converge to the true values as n increases. To improve the accuracy of asymptotic approximation, we suggest estimating the frequencies from all study subjects rather than the unaffected subjects. Because the variants can be very rare, we recommend adding pseudocounts when estimating the frequencies, as was done by Madsen and Browning.8 The weight functions based on the frequency estimates in the pooled sample and the unaffected subjects will be denoted by Fp and Fu, respectively; the constant weight function will be denoted by C. The corresponding tests will be referred to as the Fp test, the Fu test and the C test.
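The three weight functions C, Fp, and Fu can be illustrated as follows. This is a sketch, not the authors' code: the helper name `frequency_weights` is hypothetical, and the pseudocount form (one extra allele in the numerator, two in the denominator) is an assumption, since the text only says that pseudocounts are added. The optional `threshold` argument mimics setting ξj = 0 for variants whose estimated MAF exceeds c.

```python
import numpy as np

def frequency_weights(X, rows=None, threshold=None):
    """Weights {p_hat (1 - p_hat)}^(-1/2) from rare-allele counts X (n x m).

    rows      : optional boolean mask selecting the subjects used to estimate
                the frequencies (e.g., the unaffected subjects for Fu).
    threshold : optional MAF cutoff c; variants with p_hat > c get weight 0.
    """
    X = np.asarray(X, dtype=float)
    if rows is not None:
        X = X[rows]
    n = X.shape[0]
    # Pseudocount-adjusted frequency estimate (assumed form).
    p_hat = (X.sum(axis=0) + 1.0) / (2.0 * n + 2.0)
    xi = 1.0 / np.sqrt(p_hat * (1.0 - p_hat))
    if threshold is not None:
        xi = np.where(p_hat > threshold, 0.0, xi)
    return xi

# Toy data (made up): 9 subjects, one variant carried by a single affected subject.
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0])
X = np.array([[1], [0], [0], [0], [0], [0], [0], [0], [0]])

xi_C = np.ones(X.shape[1])                    # constant weights (C)
xi_Fp = frequency_weights(X)                  # pooled-sample frequencies (Fp)
xi_Fu = frequency_weights(X, rows=(y == 0))   # unaffected subjects only (Fu)
print(xi_C, xi_Fp, xi_Fu)
```

Note that `xi_Fu` changes whenever the phenotype labels change, whereas `xi_C` and `xi_Fp` do not; this is the dependence on phenotype values that the text refers to.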

Although Fu is the weight function used by Madsen and Browning,8 our Fu test is fundamentally different from the Madsen and Browning (MB) test. The latter is based on the sum of the ranks of the Si's with weight function Fu over the affected subjects. Madsen and Browning8 proposed to assess the statistical significance of their rank-sum statistic by permutation. They also suggested an asymptotic normal approximation by standardizing the rank-sum statistic by its mean and standard deviation. Because the mean and standard deviation are estimated by permutation, the asymptotic version of the MB test is many orders of magnitude slower than our asymptotic tests. The rank-sum statistic is confined to case-control analysis without covariates.

Price et al.9 developed a VT method by taking the maximum of the test statistics (i.e., Z scores) over all allele-frequency thresholds and assessing statistical significance by permutation. We describe below a more general approach that allows not only multiple allele-frequency thresholds but also different types of weight function; it also accommodates covariates and does not require permutation.

We consider K choices of ξ, which could correspond to different thresholds or different types of weight function, or both. (It is assumed that K is small relative to n.) For the kth choice of ξ, the corresponding Si is denoted by Ski. Then the score statistic is

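$$U_k = \sum_{i=1}^{n} \left( Y_i - \frac{e^{\hat\gamma^{\mathrm T} Z_i}}{1 + e^{\hat\gamma^{\mathrm T} Z_i}} \right) S_{ki},$$

and the test statistic is $T_k = U_k / V_k^{1/2}$, where

$$V_k = \sum_{i=1}^{n} v_i S_{ki}^2 - \left( \sum_{i=1}^{n} v_i S_{ki} Z_i \right)^{\mathrm T} \left( \sum_{i=1}^{n} v_i Z_i Z_i^{\mathrm T} \right)^{-1} \left( \sum_{i=1}^{n} v_i S_{ki} Z_i \right).$$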

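It is shown in Appendix A that, under H0, the random vector $(U_1, \ldots, U_K)^{\mathrm T}$ is approximately K-variate normal with mean 0 and covariance matrix {Vkl; k, l = 1, …, K}, where

$$V_{kl} = \sum_{i=1}^{n} U_{ki} U_{li},$$

and

$$U_{ki} = \left( Y_i - \frac{e^{\hat\gamma^{\mathrm T} Z_i}}{1 + e^{\hat\gamma^{\mathrm T} Z_i}} \right) \left\{ S_{ki} - \left( \sum_{i'=1}^{n} v_{i'} S_{ki'} Z_{i'} \right)^{\mathrm T} \left( \sum_{i'=1}^{n} v_{i'} Z_{i'} Z_{i'}^{\mathrm T} \right)^{-1} Z_i \right\}.$$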

For the two-sided test, we consider the maximum of the absolute test statistics

$$T_{\max} = \max_{k = 1, \ldots, K} |T_k|.$$

Let tmax be the observed value of Tmax. The p value is given by

Pr (Tmax ≥ tmax) = 1 − Pr (|T1| < tmax, …, |TK| < tmax), 

which is evaluated by treating $(T_1, \ldots, T_K)^{\mathrm T}$ as a K-variate normal random vector with a mean of 0 and a covariance matrix of {rkl; k, l = 1, …, K}, where $r_{kl} = V_{kl} / (V_{kk} V_{ll})^{1/2}$. (The one-sided p value can be calculated in a similar manner.) We reject H0 if the p value is smaller than the nominal significance level α.
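To make the p value for Tmax concrete, the sketch below approximates Pr(Tmax ≥ tmax) by drawing from the K-variate normal distribution with correlation matrix {rkl}. Monte Carlo is substituted here for whatever numerical evaluation of the multivariate normal probability the authors use, so this is an approximation under that assumption; the function name `tmax_pvalue` and the example inputs are hypothetical.

```python
import math
import numpy as np

def tmax_pvalue(t_max, R, n_draws=200_000, seed=0):
    """Monte Carlo estimate of Pr(max_k |T_k| >= t_max) when (T_1, ..., T_K)
    is K-variate normal with mean 0 and correlation matrix R."""
    rng = np.random.default_rng(seed)
    R = np.asarray(R, dtype=float)
    draws = rng.multivariate_normal(np.zeros(R.shape[0]), R, size=n_draws)
    return float(np.mean(np.max(np.abs(draws), axis=1) >= t_max))

# Example: K = 2 uncorrelated statistics and an observed maximum of 2.
p_mc = tmax_pvalue(2.0, np.eye(2))

# For independent statistics the probability has a closed form,
# 1 - {2 * Phi(t) - 1}^K, which serves as a sanity check.
phi = 0.5 * (1.0 + math.erf(2.0 / math.sqrt(2.0)))
p_exact = 1.0 - (2.0 * phi - 1.0) ** 2
print(p_mc, p_exact)
```

When the K statistics are highly correlated, as with nested allele-frequency thresholds, the resulting p value is typically much smaller than the Bonferroni-style bound K times the single-test p value.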

The tests based on positive weight functions, such as C, Fu, and Fp, will have low power if the mutations being combined have opposite effects on the phenotype. The optimal choice of ξj is βj, which is unknown. We can estimate βj from the data. It would be tempting to set ξj to $\hat\beta_j$, where $\hat\beta_j$ is an appropriate estimate of βj. There are two major problems with this strategy. First, the test statistic T will not be asymptotically normal. Second, the $\hat\beta_j$'s are highly variable (because the individual variants are very rare) and can be quite different from the true values of the βj's. As a compromise, we set $\xi_j = \hat\beta_j + \delta$, where δ is a given constant. We refer to this weight function as EREC, an abbreviation of estimated regression coefficients. The corresponding test statistic T will be asymptotically standard normal as long as δ is nonzero. Indeed, the EREC test is asymptotically optimal in that ξj will converge to βj if we let δ decrease to 0 as the sample size n increases to ∞. The asymptotic normality and optimality require very large samples. For small samples, we recommend using a relatively large value of δ so that the weights are not unduly driven by the highly variable $\hat\beta_j$'s. For n < 2000, we set δ = 1 for binary traits and δ = 2 for standardized quantitative traits.
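A minimal sketch of EREC weights for a binary trait without covariates follows. It estimates each βj by the log odds ratio of a 2 × 2 table with a pseudocount of 1 added to every cell, as the simulation section later describes, and then adds δ. The layout of the table (carrier status by case status) and the helper name `erec_weights` are assumptions made for illustration.

```python
import numpy as np

def erec_weights(y, X, delta=1.0):
    """EREC-style weights xi_j = beta_hat_j + delta for binary y.

    beta_hat_j is the log odds ratio from the 2 x 2 table of carrier status
    by case status, with a pseudocount of 1 in each cell (table layout assumed).
    """
    case = np.asarray(y).astype(bool)
    carrier = np.asarray(X) > 0
    a = carrier[case].sum(axis=0) + 1.0       # carriers among cases
    b = (~carrier[case]).sum(axis=0) + 1.0    # non-carriers among cases
    c = carrier[~case].sum(axis=0) + 1.0      # carriers among controls
    d = (~carrier[~case]).sum(axis=0) + 1.0   # non-carriers among controls
    beta_hat = np.log(a * d / (b * c))
    return beta_hat + delta

# Toy data (made up): variant 1 is seen only in cases, variant 2 only in controls.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
X = np.array([[1, 0], [1, 0], [0, 0], [0, 0],
              [0, 1], [0, 1], [0, 0], [0, 0]])
xi = erec_weights(y, X, delta=1.0)
print(xi)
```

In this toy example the second weight is negative, so a protective variant contributes to the statistic with the opposite sign instead of cancelling the risk variant. Because the weights depend on the phenotype, they would be recomputed inside every resample.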

The sequence kernel association test (SKAT) of Wu et al.12 assumes that βj follows an arbitrary distribution with a mean of 0 and a variance of $\xi_j \nu$, and tests the null hypothesis that ν = 0 by using a variance-component score statistic. The SKAT statistic can be written as $Q = \sum_{j=1}^{m} \xi_j U_j^2$, where Uj is the jth component of the score statistic for testing the null hypothesis that β = 0 under Equation 1. The C-alpha statistic of Neale et al.11 is a special case of Q with ξj = 1 for binary traits without covariates. Our score statistic U can be written as $\sum_{j=1}^{m} \xi_j U_j$. The Han and Pan10 (HP) statistic is a special case of U (for binary traits without covariates) in which ξj = −1 if $\hat\beta_j < 0$ and the corresponding p value < 0.1 and in which ξj = 1 otherwise.
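The relationship between the linear statistic U and the quadratic statistic Q can be seen in a toy calculation. The sketch assumes a binary trait without covariates, so that the per-variant score is Uj = Σi (Yi − Ȳ) Xji; the function name and data are hypothetical, and no p value is computed.

```python
import numpy as np

def per_variant_scores(y, X):
    """U_j = sum_i (y_i - ybar) X_ji for each variant j (no covariates)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    return X.T @ (y - y.mean())

# Toy data (made up): one variant seen in a case, another in a control.
y = np.array([1, 1, 0, 0])
X = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])
xi = np.ones(2)

U_j = per_variant_scores(y, X)     # component scores
U = float(np.sum(xi * U_j))        # linear (burden-type) statistic
Q = float(np.sum(xi * U_j**2))     # quadratic (SKAT/C-alpha-type) statistic
print(U_j, U, Q)
```

With positive weights, the two opposite-signed component scores cancel in U but not in Q, which is why variance-component statistics retain power under opposite effects while tests based on positive weight functions do not.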

Because the asymptotic approximation might not be accurate in small samples, especially when the weight function ξ involves the phenotype values Yi's, we also provide permutation-type tests. In the absence of covariates, we simply permute the phenotype values Yi's and calculate the test statistic T for each permutation. Note that it is necessary to recalculate the Si's after permuting the Yi's if the weight function ξ depends on the Yi's.
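The permutation procedure for the covariate-free case might look like the sketch below. The names `t_statistic` and `permutation_pvalue` are hypothetical, the weight function is passed as a callable so that it can be recomputed after each permutation, and the p value uses the common (exceedances + 1)/(permutations + 1) convention, which the source does not specify.

```python
import numpy as np

def t_statistic(y, S):
    """T = U / sqrt(V) in the absence of covariates."""
    n = y.size
    ybar = y.mean()
    U = np.sum((y - ybar) * S)
    V = ybar * (1.0 - ybar) * (np.sum(S**2) - S.sum()**2 / n)
    return U / np.sqrt(V)

def permutation_pvalue(y, X, weight_fn, n_perm=999, seed=0):
    """Two-sided permutation p value; weight_fn(y, X) returns the weights xi."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    t_obs = abs(t_statistic(y, X @ weight_fn(y, X)))
    exceed = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)
        # Recompute S_i because the weights may depend on the phenotype.
        S_perm = X @ weight_fn(y_perm, X)
        if abs(t_statistic(y_perm, S_perm)) >= t_obs - 1e-12:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

def constant_weights(y, X):
    """Weight function C: every variant gets weight 1."""
    return np.ones(X.shape[1])

# Toy data (made up): 100 cases, 100 controls, 20 carriers, all of them cases.
y = np.r_[np.ones(100), np.zeros(100)]
X = np.zeros((200, 1))
X[:20, 0] = 1
p = permutation_pvalue(y, X, constant_weights)
print(p)
```

With 999 permutations the smallest attainable p value is 0.001, which illustrates why very small p values require many more resamples.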

Our permutation differs from that of Price et al.9 in that we permute T, whereas they permuted $\sum_{i=1}^{n} Y_i S_i$. The former is a pivotal statistic, whereas the latter is not. (It is desirable to permute a pivotal statistic.13) If the test is one-sided and the weight function does not depend on the phenotype values, then our permutation is equivalent to Price et al.'s9; otherwise, the two are different. For VT methods, the numerators in the Z scores of Price et al.9 are the same as ours, but the denominators are not the same as or proportional to ours. Thus, the permutation p values are generally different between the two methods. The permutation version of the MB test requires ranking the Si's for each permutation and is thus substantially slower than our permutation tests.

In the presence of covariates, permuting the Yi's is not appropriate because Yi is generally correlated with Zi. Instead, we generate $\tilde{Y}_i$ from the fitted null model:

$$\Pr(\tilde{Y}_i = 1) = \frac{e^{\hat\gamma^{\mathrm T} Z_i}}{1 + e^{\hat\gamma^{\mathrm T} Z_i}},$$

replace the Yi's with the $\tilde{Y}_i$'s, and recalculate the test statistic. (The recalculation of the test statistic starts with re-estimating γ and recalculating the Si's.) This process is repeated and is called the (parametric) bootstrap.13 Both permutation and bootstrap are resampling methods. In the absence of covariates, $\Pr(\tilde{Y}_i = 1)$ is the sample proportion of cases.
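The parametric bootstrap can be sketched as follows. All function names are hypothetical, the null model is fitted by Newton-Raphson (an assumed solver), the simulated data are made up, and the p value again uses the (exceedances + 1)/(resamples + 1) convention as an assumption. Each bootstrap replicate re-estimates γ and recomputes the Si's, as the text describes.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_null(y, Z, max_iter=50, tol=1e-10):
    """Newton-Raphson fit of the null logistic model (assumed solver)."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        mu = expit(Z @ gamma)
        step = np.linalg.solve(Z.T @ (Z * (mu * (1 - mu))[:, None]), Z.T @ (y - mu))
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma

def adjusted_t(y, S, Z):
    """Covariate-adjusted T = U / sqrt(V); gamma is re-estimated from y."""
    mu = expit(Z @ fit_null(y, Z))
    v = mu * (1 - mu)
    b = Z.T @ (v * S)
    V = np.sum(v * S**2) - b @ np.linalg.solve(Z.T @ (Z * v[:, None]), b)
    return np.sum((y - mu) * S) / np.sqrt(V)

def bootstrap_pvalue(y, X, Z, weight_fn, n_boot=200, seed=0):
    """Two-sided parametric-bootstrap p value."""
    rng = np.random.default_rng(seed)
    t_obs = abs(adjusted_t(y, X @ weight_fn(y, X), Z))
    mu_null = expit(Z @ fit_null(y, Z))       # fitted null probabilities
    exceed = 0
    for _ in range(n_boot):
        y_tilde = (rng.random(y.size) < mu_null).astype(float)  # draw from null model
        S_tilde = X @ weight_fn(y_tilde, X)                      # recompute S_i
        if abs(adjusted_t(y_tilde, S_tilde, Z)) >= t_obs:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)

def constant_weights(y, X):
    return np.ones(X.shape[1])

# Made-up data: 300 subjects, 5 rare variants, one normally distributed covariate.
rng = np.random.default_rng(1)
n = 300
z = rng.normal(size=n)
Z = np.column_stack([np.ones(n), z])
X = rng.binomial(2, 0.02, size=(n, 5)).astype(float)
y = (rng.random(n) < expit(-0.2 + 0.3 * z)).astype(float)
p = bootstrap_pvalue(y, X, Z, constant_weights)
print(p)
```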

Obtaining an accurate estimate of a small p value requires a large number of resamples (i.e., permutations or bootstrap samples). However, most p values are relatively large and can be estimated accurately with a small number of resamples. Thus, we employ a multistage procedure which filters out large p values with small numbers of resamples and uses large numbers of resamples only for the most extreme p values.
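One way to organize such a multistage procedure is sketched below. The stage sizes and the stopping cutoff are hypothetical, because the source does not give them, and `count_exceed` stands in for any routine that runs a requested number of resamples and reports how many resampled statistics are at least as extreme as the observed one.

```python
def multistage_pvalue(count_exceed, stages=(100, 1000, 10000), cutoff=0.1):
    """Resample in stages, stopping early once the p value is clearly large.

    count_exceed(k) must run k new resamples and return the number whose
    statistic is at least as extreme as the observed statistic.
    stages : cumulative resample counts per stage (hypothetical values).
    cutoff : stop as soon as the running p value estimate exceeds this
             (hypothetical value).
    Returns (p value estimate, number of resamples used).
    """
    total = 0
    hits = 0
    p = 1.0
    for target in stages:
        hits += count_exceed(target - total)   # only the additional resamples
        total = target
        p = (hits + 1) / (total + 1)
        if p > cutoff:
            break                              # not extreme; no need to refine
    return p, total

# Two artificial extremes: every resample exceeds, or none does.
p_large, used_large = multistage_pvalue(lambda k: k)
p_small, used_small = multistage_pvalue(lambda k: 0)
print(p_large, used_large, p_small, used_small)
```

In a gene-based scan most genes would stop after the first stage, so the total cost is dominated by the few genes that reach the final stage.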

Results

Simulation Studies

We conducted extensive simulation studies to investigate the performance of the proposed and existing methods. We simulated case-control data with an equal number of cases and controls from Equation 1 in which the first component of γ was set to –2. We considered mainly the following six combinations of MAFs: (1) pj = 0.001j (j = 1, …, 10) with a total frequency of 5.5%; (2) pj = 0.0005j (j = 1, …, 10) with a total frequency of 2.75%; (3) pj = 0.00025j (j = 1, …, 20) with a total frequency of 5.25%; (4) pj = 0.005 (j = 1, …, 10) with a total frequency of 5%; (5) pj = 0.0025 (j = 1, …, 10) with a total frequency of 2.5%; and (6) pj = 0.0025 (j = 1, …, 20) with a total frequency of 5%. The genotype values were simulated under Hardy-Weinberg equilibrium and linkage equilibrium. We did not use sophisticated population genetics models because we wished to control the number of variants and their frequencies, which allowed us to see clearly how the proposed and existing methods perform under various scenarios. We evaluated both asymptotic and resampling methods. When the simulation studies involved asymptotic methods only, we used 10 million replicates (i.e., simulated data sets) to evaluate type I error and 100,000 replicates to evaluate power at α = $10^{-2}$, $10^{-3}$, and $10^{-4}$. When the simulation studies involved resampling methods, we used 1 million replicates to evaluate type I error and 10,000 replicates to evaluate power at α = $10^{-2}$ and $10^{-3}$. The resampling p values were obtained from a three-stage procedure with a maximum of 1 million resamples. The null hypothesis corresponded to H0: βj = 0 (j = 1, …, m). We considered alternative hypotheses such as H1: βj = x (j = 1, …, m) and H1: $\beta_j = x / \{p_j(1 - p_j)\}^{1/2}$ (j = 1, …, m), where x was chosen such that the power (of the most powerful method) was reasonably high at α = $10^{-2}$. We report below results from six series of simulation studies, the first four without covariates and the last two with covariates. The tests were two-sided except for the third series.
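A data-generating step of this kind could be sketched as follows. The function name, the batch-wise sampling of a source population until enough cases and controls are collected, and the default arguments are assumptions; only the intercept of –2, the independent binomial genotypes (Hardy-Weinberg and linkage equilibrium), and the MAF pattern come from the text.

```python
import numpy as np

def simulate_case_control(n, maf, beta, intercept=-2.0, seed=0, batch=10_000):
    """Draw n/2 cases and n/2 controls from the logistic model in Equation 1.

    Genotypes are independent Binomial(2, maf_j) counts, i.e., Hardy-Weinberg
    equilibrium within sites and linkage equilibrium across sites.
    """
    rng = np.random.default_rng(seed)
    maf = np.asarray(maf, dtype=float)
    beta = np.asarray(beta, dtype=float)
    half = n // 2
    cases, controls = [], []
    n_case = n_ctrl = 0
    while n_case < half or n_ctrl < half:
        G = rng.binomial(2, maf, size=(batch, maf.size))
        prob = 1.0 / (1.0 + np.exp(-(intercept + G @ beta)))
        status = rng.random(batch) < prob
        cases.append(G[status])
        controls.append(G[~status])
        n_case += int(status.sum())
        n_ctrl += int((~status).sum())
    X = np.vstack([np.vstack(cases)[:half], np.vstack(controls)[:half]])
    y = np.r_[np.ones(half, dtype=int), np.zeros(half, dtype=int)]
    return y, X

# MAF combination (1): p_j = 0.001 j for j = 1, ..., 10.
maf = 0.001 * np.arange(1, 11)
total_frequency = maf.sum()            # 0.055, i.e., 5.5%

# Null hypothesis: all regression coefficients are zero.
y, X = simulate_case_control(1000, maf, beta=np.zeros(10))
print(total_frequency, y.sum(), X.shape)
```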

We designed our first series of simulation studies to evaluate the proposed asymptotic methods with different weight functions. We considered the aforementioned six combinations of MAFs and generated data under the null hypothesis H0: βj = 0 (j = 1, …, m), as well as two alternative hypotheses H1: βj = x (j = 1, …, m) and H1: $\beta_j = x / \{p_j(1 - p_j)\}^{1/2}$ (j = 1, …, m). We considered three (positive) weight functions: C, Fp, and Fu. We also considered the maximum of the test statistics based on weight functions C and Fp, which will be referred to as Tmax. The results for the first combination of MAFs are displayed in Table 1, whereas those of the remaining five combinations are provided in Tables S1–S5, available online. The performance of the tests is affected more by the total allele frequency than the number of variants or individual MAFs. The C test, Fp test, and Tmax are conservative but less so as n, α, or total allele frequency increases. As expected, the C test is more powerful than the Fp test under the first alternative hypothesis and less powerful under the second alternative hypothesis; Tmax is nearly as powerful as the C test under the first alternative and nearly as powerful as the Fp test under the second alternative. The Fu test is unacceptably liberal; therefore, we will not consider this asymptotic test any further.

Table 1
Type I Error and Power of Asymptotic Methods with Different Weight Functions

Our second series of studies was devoted to comparisons of asymptotic and permutation methods. In addition to the proposed methods, we evaluated the asymptotic and permutation versions of the MB test, as well as the permutation method of Price et al.9 with weight function Fu. We simulated data in the same manner as the first series of studies. We performed one-sided tests because the MB and Price et al. tests were designed as one-sided. The results for the first combination of MAFs are displayed in Table 2. Because of the discreteness of the test statistic, the permutation version of the C test is more conservative than its asymptotic counterpart and consequently less powerful. The permutation Fp and Fu tests do not appear to be conservative; the former appears to be slightly more powerful than the latter. The MB test was designed for the second alternative hypothesis, for which the proposed asymptotic test based on weight function Fp is more powerful than the asymptotic version of the MB test whereas the proposed permutation tests based on weight functions Fp and Fu are more powerful than the permutation version of the MB test. For weight function Fu, our permutation test is more powerful than that of Price et al.9

Table 2
Type I Error and Power of Asymptotic and Permutation Methods

In the third series of studies, we compared fixed-threshold and VT methods. We simulated 11 SNPs with MAFs pj = 0.001j(j = 1, …, 10) and p11 = 0.03. We considered the null hypothesis H0:β1 = β2 = … = β11 = 0, as well as two alternative hypotheses H1:β1 = β2 = … = β10 = x, β11 = 0 and H1:β1 = β2 = … = β11 = x. For fixed-threshold methods, we considered the thresholds of 0.01 and 0.05; the corresponding tests are referred to as the T1 and T5 tests. For VT methods, we excluded the thresholds for which the total numbers of rare mutations were fewer than 10. As shown in Table 3, all the tests appear to be conservative, especially when n and α are small. The permutation T1 and T5 tests are more conservative than their asymptotic counterparts. In theory, T1 and T5 are the most powerful under the first and second alternatives, respectively. Because the frequency estimates for rare variants are highly variable, T1 turns out to be the least powerful among all the tests under the first alternative. The VT tests have good power under both alternatives, and the asymptotic and permutation versions have similar power. The permutation version of our VT test is slightly more powerful than that of Price et al.9

Table 3
Type I Error and Power of Fixed-Threshold and VT Methods

In the fourth set of studies, we compared the C test, Fp test, and EREC test, as well as the HP, C-alpha, and SKAT tests. Note that the last four tests were designed to detect variants with opposite effects. The EREC, HP, and C-alpha tests were based on permutation, whereas the SKAT was based on the Davies method.12 For the EREC test, $\hat\beta_j$ was the estimate of the log odds ratio βj (after adding a pseudocount of 1 to each of the four cells in the 2 × 2 table). For the SKAT test, we used the default weighted linear kernel function. We set pj = 0.001j (j = 1, …, 10) and considered the null hypothesis H0: βj = 0 (j = 1, …, 10) and six alternative hypotheses representing different numbers of causal variants and different patterns of positive and negative effects. As shown in Table 4, the SKAT is highly conservative, especially when n and α are small. The EREC test is slightly less powerful than the C test and Fp test when the SNP effects are all positive but is much more powerful than the latter when there are opposite effects. The EREC test is more powerful than the HP test. It is also more powerful than the C-alpha and SKAT, especially when the mean of the regression coefficients is not 0.

Table 4
Type I Error and Power of Asymptotic and Permutation Tests for Detecting Potentially Opposite Effects

The above four sets of studies contained no covariates. We also conducted extensive studies with covariates. We generated data in the same manner as before except that we added a normally distributed covariate whose mean is equal to the total number of rare mutations and whose variance is equal to 1 and we set its regression coefficient to 0.3. Some key results are presented in Tables 5 and 6. The T1, T5, Fp, and VT tests are less conservative than in the case of no covariates, and their asymptotic and bootstrap versions have similar power. The EREC test has similar power to the C and Fp tests when all SNP effects are positive and is much more powerful than the latter when there are opposite effects. The EREC test tends to be more powerful than the SKAT, especially when the mean of the regression coefficients is not 0.

Table 5
Type I Error and Power of Fixed-Threshold and VT Methods with Covariates
Table 6
Type I Error and Power of Asymptotic and Bootstrap Tests for Detecting Potentially Opposite Effects in the Presence of Covariates

Real Data

We considered high-depth sequence data from the exons of 202 genes encoding known or potential drug targets14 for 1957 subjects randomly drawn from the CoLaus population-based collection.15 We analyzed total cholesterol (available in 1899 subjects) as a quantitative trait and included eight covariates in the analysis: gender, age, age2, and the top five principal components for ancestry constructed from the GWAS SNP data. One subject without the gender and age information was removed. We employed the methods for quantitative traits described in Appendix A.

We restricted our analysis to polymorphic variants that are nonsense, missense, or splice site mutations. We removed variants with observed MAFs > 5% or missingness > 10%. We excluded any gene whose total number of rare mutations was fewer than five and ended up with a total of 172 genes. There were a total of 2304 variants in these 172 genes, and the number of variants per gene varied from 1 to 70, with a median of 11. We applied both the asymptotic and permutation versions of our T1, T5, Fp, and VT tests, as well as the permutation EREC test. We calculated the two-sided p values. With 172 genes, the Bonferroni threshold at the 0.05 significance level corresponds to a p value of 0.0003 or –log10(p value) of 3.5.
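The Bonferroni threshold quoted above follows from dividing the significance level by the number of genes tested, as this short calculation illustrates (variable names are arbitrary):

```python
import math

alpha = 0.05        # family-wise significance level
n_genes = 172       # number of genes tested

threshold = alpha / n_genes                 # per-gene p value threshold
neg_log10_threshold = -math.log10(threshold)

print(round(threshold, 4), round(neg_log10_threshold, 1))
```

The unrounded threshold is slightly below 0.0003, so a p value that is reported as 0.0003 after rounding would need to be checked against the exact value.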

The results based on the asymptotic and permutation methods are shown in Figures 1 and 2, respectively. One gene was identified as the most significant by all the tests: the asymptotic p values for T1, T5, Fp, and VT are 0.00011, 0.00011, 0.00021, and 0.00057, respectively; the corresponding permutation p values are 0.00013, 0.00013, 0.00025, and 0.0012, respectively; the p value of the EREC test is 0.00012. (The name of the gene is not disclosed here because the main study has not been published yet.) All the p values, except the VT's, pass the Bonferroni criterion. Similar evidence of association has been observed in other samples of the sequencing project.14 There were 13 variants in the top gene. Their observed MAFs ranged from 0.00026 to 0.0024, the total frequency being 1.13%. Because the observed MAFs are all less than 1% in this case, T1 and T5 are the same test. For the VT test, the maximum occurs at the highest MAF. It is interesting to point out that common SNPs in the top gene were previously identified to be associated with total cholesterol.16

Figure 1
Quantile-Quantile Plots of p Values on the –log10 Scale for the Asymptotic T1, T5, Fp, and VT Tests in the Quantitative Trait Analysis of Total Cholesterol
Figure 2
Quantile-Quantile Plots of p Values on the –log10 Scale for the Permutation EREC, T5, Fp, and VT Tests in the Quantitative Trait Analysis of Total Cholesterol

We also performed a binary trait analysis by comparing high (i.e., >6.2 mmol/l) and desirable (i.e., <5.2mmol/l) total cholesterol values. There were 451 subjects with high total cholesterol and 683 subjects with desirable total cholesterol. The results of the analysis are shown in Figures 3 and 4. All the tests identified the same top gene as was identified in the quantitative trait analysis: the asymptotic p values for T1, T5, Fp, and VT are 0.00022, 0.00022, 0.00057, and 0.00088, respectively; the corresponding bootstrap p values are 0.00019, 0.00019, 0.00039, and 0.00033, respectively. Again, T1 and T5 are the same test. The maximum of the VT test occurs at the highest MAF, at which threshold 18 out of the 451 subjects with high cholesterol values carry the rare mutations as opposed to 7 out of 683 subjects with desirable cholesterol values. The p value of the bootstrap EREC test is 0.000021, which is the most extreme among all the tests and is even more extreme than all the p values of the quantitative trait analysis. For eight out of the 10 variants in the top gene, there were more mutations in the high group than in the desirable group (17 versus two); for the remaining two variants, there were fewer mutations in the high group than in the desirable group (one versus five). Thus, allowing opposite effects yielded stronger evidence of association than assuming effects of the same direction.

Figure 3
Quantile-Quantile Plots of p Values on the –log10 Scale for the Asymptotic T1, T5, Fp, and VT Tests in the Binary Trait Analysis of Total Cholesterol
Figure 4
Quantile-Quantile Plots of p Values on the –log10 Scale for the Bootstrap EREC, T5, Fp, and VT Tests in the Binary Trait Analysis of Total Cholesterol

Finally, we compared the proposed methods to the existing ones. The results for the SKAT are shown in Figure S1 (top panel). For the top gene, the SKAT yielded the p values of 0.0014 and 0.00024 in the quantitative and binary trait analyses, respectively, which are roughly 10 times larger than the corresponding p values of our EREC test. Because the other existing methods do not allow covariates and some of them require binary traits, we also performed the binary trait analysis without the covariates for all the methods. The results are shown in the bottom panel of Figure S1 and in Figures S2–S4. Although the top gene remains the same, the results without covariate adjustment (for the top gene) are considerably less significant than those with covariate adjustment. For the top gene, the EREC test yielded a much more significant result (p value = 0.00013) than all the other tests.

Discussion

We developed a very general framework for the association analysis of rare variants. This framework enabled us to evaluate existing methods and develop other methods. Our theoretical analysis and simulation studies yielded insights into the behavior of the existing methods. The normal approximation works very well for the proposed methods, and resampling is required only when the weight function depends on the phenotype values. The proposed methods are numerically stable and easy to implement. The asymptotic tests are extremely fast. A computer program implementing the proposed methods is posted at our website. For a typical exome-sequencing study, it takes only a few hours to run all the proposed asymptotic and resampling tests.

We have adopted score-type statistics, which are computationally faster and more stable than Wald and likelihood ratio (LR) statistics because the null model does not involve rare variants and needs to be fit only once. Our simulation studies revealed that Wald tests tend to be overly conservative (resulting in substantial loss of power) whereas likelihood ratio tests tend to be too liberal (resulting in excessive false-positive findings), especially for small n and low MAFs; see Tables S6–S8.

Our work improves upon the pioneering work of Madsen and Browning8 by using more powerful test statistics, accommodating covariates, and avoiding permutation. For case-control studies, Madsen and Browning8 estimated the allele frequencies in the unaffected subjects only so that a true signal from an excess of mutations in the affected subjects would not be deflated by using the total number of mutations in both affected and unaffected subjects. According to our theory, weighting by the allele frequencies in the unaffected subjects will be optimal if $\log(\mathrm{OR}_j) \propto \{p_j(1 - p_j)\}^{-1/2}$ $(j = 1, \ldots, m)$ and $p_j$ is the frequency of the jth variant in the unaffected subjects. Even if that is the truth, the frequency estimates are highly variable and can be very different from the true values. The frequency estimates in the pooled sample of affected and unaffected subjects are more stable, and the corresponding Fp test can be implemented through normal approximation (rather than resampling).
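As a rough illustration of frequency-based weighting, the following Python sketch computes weights proportional to {p(1 − p)}^(−1/2) from pooled frequency estimates and forms a weighted burden score per subject. The function names, the toy genotype matrix, and the smoothed estimator (count + 1)/(2n + 2) are assumptions made for this example; they are not taken from the released software.

```python
import numpy as np

def pooled_frequency_weights(genotypes):
    """Weights 1/sqrt(p(1-p)) from pooled allele-frequency estimates.

    genotypes: (n, m) array of minor-allele counts (0, 1, or 2).
    """
    g = np.asarray(genotypes, dtype=float)
    n = g.shape[0]
    # Assumed smoothed estimate; keeps p strictly inside (0, 1)
    # even for variants that are never observed.
    p_hat = (g.sum(axis=0) + 1.0) / (2.0 * n + 2.0)
    return 1.0 / np.sqrt(p_hat * (1.0 - p_hat))

def burden_scores(genotypes, weights):
    """Weighted sum of mutation counts for each subject."""
    return np.asarray(genotypes, dtype=float) @ weights

# Toy data: 4 subjects, 3 variant sites (illustrative only).
genotypes = np.array([[0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 0],
                      [0, 1, 2]])
weights = pooled_frequency_weights(genotypes)
scores = burden_scores(genotypes, weights)
```

Rarer variants receive larger weights, so a subject carrying a single very rare mutation can contribute as much to the score as one carrying several more frequent ones.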

The optimal choice of the frequency threshold depends on the nature of association, which is generally unknown. In addition, the frequency estimates for rare variants are highly variable, especially for small samples with substantial missing data. Thus, VT methods might be preferable to fixed-threshold methods. Our VT approach improves upon that of Price et al.9 in three aspects: (1) it uses more powerful test statistics, (2) it can accommodate covariates, and (3) it can be implemented by normal approximation instead of permutation.

The EREC test is capable of detecting rare mutations with opposite effects. Simulation studies (Tables 4 and 6) showed that the EREC test has similar power to the tests assuming the same direction of effects when that assumption holds and is much more powerful than the latter when that assumption fails. In addition, the EREC test outperforms the HP, C-alpha and SKAT tests. In the real data example, the EREC test produced the most convincing evidence of association for the top gene among all the tests. Thus, we recommend the EREC test for general use.

The SKAT is computationally faster than the EREC, HP, and C-alpha tests because it calculates p values analytically. Simulation studies revealed that the SKAT is overly conservative, especially when n and α are small. The resampling methods developed in this article can be used to obtain accurate p values for the SKAT, and indeed any other tests, with or without covariates.

Statistical analysis of rare variants is a very active research area. Several other methods have been published during the preparation of this article.17–19 We have not compared our methods to all existing methods for several reasons: (1) we wished to focus on the most commonly used current methods, (2) some of the newly published methods are based on different philosophies and thus would be difficult to compare directly, and (3) a comprehensive comparison of all existing methods is beyond the scope of this article.

It is possible to incorporate biological and computational information about the functional effects of rare variants, such as SIFT20 and PolyPhen21 scores, into the association analysis. Indeed, our theory allows incorporation of any prior knowledge into the weight function. Efficient use of functional or bioinformatics information requires further investigation. It would be worthwhile to explore Bayesian methods.

Grouping methods for rare variants are in the same vein as the SNP-set methods for GWAS studies22–24 in that multiple SNPs within a group are analyzed collectively to enhance statistical power. Because the data are extremely sparse for individual rare variants, the SNP-set methods for common variants might not be applicable to rare variants. On the other hand, the methods for rare variants can potentially be used to combine low-frequency SNPs in GWAS studies.

We have considered one group of variants at a time. It might be desirable to analyze several groups of variants simultaneously. Our approach can be readily extended to multiple groups of variants. Specifically, we divide variants into, say, K groups according to certain criteria (e.g., MAFs) and combine the information within each group. We can express the score statistic for each group of variants as a sum of n efficient score functions (see Appendix A) so that the asymptotic joint distribution of the K score statistics follows from the multivariate central limit theorem. We can then use the asymptotic joint distribution to form a multivariate test statistic. If we choose the maximum of the K test statistics, then the formulas for K weight functions presented in Material and Methods can be directly applied. If we choose the chi-square statistic with K degrees of freedom, then our method would be a generalization of the combined multivariate and collapsing (CMC) method of Li and Leal.7

We used the Bonferroni correction in the analysis of the real data. This criterion is conservative if there is strong linkage disequilibrium (LD) among the genes. More accurate correction for multiple testing can be achieved by accounting for the correlations of the test statistics. There are two possible ways to do so: one is to use permutation and the other is to use Monte Carlo.25 The latter is based on efficient score functions, which are provided in Appendix A.
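One way a Monte Carlo correction based on efficient score functions can be realised is sketched below in Python: each subject's vector of efficient scores is multiplied by an independent standard normal variate, which preserves the correlations among the K statistics. The normal-multiplier construction, the function name, and the toy inputs are assumptions made for this illustration and are not a transcription of reference 25.

```python
import numpy as np

def monte_carlo_max_pvalue(eff_scores, n_draws=2000, seed=0):
    """Adjusted p value for the maximum of K correlated score tests.

    eff_scores: (n, K) array; entry (i, k) is subject i's estimated
    efficient score for the kth test.
    """
    eff_scores = np.asarray(eff_scores, dtype=float)
    rng = np.random.default_rng(seed)
    u = eff_scores.sum(axis=0)                    # K score statistics
    sd = np.sqrt((eff_scores ** 2).sum(axis=0))   # their estimated SDs
    observed = np.max(np.abs(u / sd))
    # Each row of g yields one simulated realisation of all K statistics.
    g = rng.standard_normal((n_draws, eff_scores.shape[0]))
    simulated = np.abs((g @ eff_scores) / sd).max(axis=1)
    return (np.sum(simulated >= observed) + 1) / (n_draws + 1)

# Toy inputs (illustrative only): no signal versus a strong signal.
rng = np.random.default_rng(3)
noise = rng.normal(size=(100, 3))
p_null = monte_carlo_max_pvalue(noise - noise.mean(axis=0))
p_signal = monte_carlo_max_pvalue(1.0 + 0.1 * noise)
```

Because the same multipliers are applied to all K columns, strongly correlated tests are penalised less than under the Bonferroni correction, while independent tests are penalised about equally.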

This work and indeed all existing literature assume that the quantitative trait data are obtained from a random sample. In many sequencing studies, including several in the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project that we are involved with, only the subjects with the extreme values of a quantitative trait are selected for sequencing. The case-control testing is a valid option but might be inefficient if there is a quantitative association. In addition, it might be desirable to analyze quantitative traits that are not the one used to select the subjects for sequencing. We are currently developing valid and efficient methods for the association analysis of quantitative traits under such trait-dependent sampling.

Acknowledgments

This research was supported by the National Institutes of Health grants R01 CA082659, R37 GM047845, and P01 CA142538. The authors thank GlaxoSmithKline, especially Matthew R. Nelson, Margaret G. Ehm, and Li Li, and the co-principal investigators of the CoLaus study, Gerard Waeber and Peter Vollenweider, for the use of the resequencing data. They are also grateful to Yun Li and Kuo-Ping Li for their assistance with the preparation of the data.

Appendix A

We relate $Y_i$ to $X_i$ and $Z_i$ through a generalized linear model with the linear predictor $\beta^T X_i + \gamma^T Z_i$, where $\beta = \tau\xi$. Let $\eta$ consist of $\gamma$ and other nuisance parameters. Let $l(\tau, \eta; \xi)$ denote the log-likelihood function for $\tau$ and $\eta$ with a fixed value of $\xi$. The corresponding score function and observed Fisher information matrix are

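$$\begin{bmatrix} U_\tau(\tau,\eta;\xi) \\ U_\eta(\tau,\eta;\xi) \end{bmatrix},$$

and

$$\begin{bmatrix} I_{\tau\tau}(\tau,\eta;\xi) & I_{\tau\eta}(\tau,\eta;\xi) \\ I_{\eta\tau}(\tau,\eta;\xi) & I_{\eta\eta}(\tau,\eta;\xi) \end{bmatrix},$$

where $U_\tau(\tau,\eta;\xi) = \partial l(\tau,\eta;\xi)/\partial\tau$, $U_\eta(\tau,\eta;\xi) = \partial l(\tau,\eta;\xi)/\partial\eta$, $I_{\tau\tau}(\tau,\eta;\xi) = -\partial^2 l(\tau,\eta;\xi)/\partial\tau^2$, $I_{\tau\eta}(\tau,\eta;\xi) = -\partial^2 l(\tau,\eta;\xi)/\partial\tau\,\partial\eta^T$, $I_{\eta\tau}(\tau,\eta;\xi) = I_{\tau\eta}^T(\tau,\eta;\xi)$, and $I_{\eta\eta}(\tau,\eta;\xi) = -\partial^2 l(\tau,\eta;\xi)/\partial\eta\,\partial\eta^T$. The score statistic for testing the null hypothesis $H_0\colon \tau = 0$ is $U_\tau(0,\hat\eta;\xi)$, where $\hat\eta$ is the solution to the equation $U_\eta(0,\eta;\xi) = 0$. Under $H_0$, the random variable $n^{-1/2}U_\tau(0,\hat\eta;\xi)$ is asymptotically zero-mean normal with a variance that can be consistently estimated by26

$$n^{-1}\left\{ I_{\tau\tau}(0,\hat\eta;\xi) - I_{\tau\eta}(0,\hat\eta;\xi)\, I_{\eta\eta}^{-1}(0,\hat\eta;\xi)\, I_{\eta\tau}(0,\hat\eta;\xi) \right\}.$$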

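Suppose that $\xi$ is estimated from the data by $\hat\xi$. Then we replace $\xi$ in $U_\tau(0,\hat\eta;\xi)$ by $\hat\xi$. It can be shown that $U_\tau(0,\eta;\xi) = \xi^T U_\beta(0,\eta)$, where $U_\beta(\beta,\eta)$ is the score function of $\beta$ under Equation 1. Because $n^{-1/2}U_\beta(0,\hat\eta)$ is asymptotically zero-mean normal, $\hat\xi^T n^{-1/2}U_\beta(0,\hat\eta)$ has the same asymptotic distribution as $\xi^{*T} n^{-1/2}U_\beta(0,\hat\eta)$, where $\xi^*$ is the limit of $\hat\xi$. As a result, $n^{-1/2}U_\tau(0,\hat\eta;\hat\xi)$ has the same asymptotic distribution as $n^{-1/2}U_\tau(0,\hat\eta;\xi^*)$. Thus, the test statistic

$$\frac{U_\tau(0,\hat\eta;\hat\xi)}{\left\{ I_{\tau\tau}(0,\hat\eta;\hat\xi) - I_{\tau\eta}(0,\hat\eta;\hat\xi)\, I_{\eta\eta}^{-1}(0,\hat\eta;\hat\xi)\, I_{\eta\tau}(0,\hat\eta;\hat\xi) \right\}^{1/2}}$$

is asymptotically standard normal as long as $\hat\xi$ converges to a nonzero constant as $n \to \infty$.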

Let $U_{\tau,i}(\tau,\eta;\xi)$ and $U_{\eta,i}(\tau,\eta;\xi)$ be the ith subject's contributions to $U_\tau(\tau,\eta;\xi)$ and $U_\eta(\tau,\eta;\xi)$, respectively, and let $\Sigma_{\tau\eta}$ and $\Sigma_{\eta\eta}$ be the limits of $n^{-1}I_{\tau\eta}(0,\eta;\xi)$ and $n^{-1}I_{\eta\eta}(0,\eta;\xi)$, respectively. It is easy to show that $n^{-1/2}U_\tau(0,\hat\eta;\xi)$ is asymptotically equivalent to $n^{-1/2}\sum_{i=1}^n u_i$, where

$$u_i = U_{\tau,i}(0,\eta;\xi) - \Sigma_{\tau\eta}\Sigma_{\eta\eta}^{-1}U_{\eta,i}(0,\eta;\xi).$$

We refer to $u_i$ as the ith subject's efficient score function.27 To derive the joint distribution of the test statistics with K weight functions, we use the fact that $n^{-1/2}U_k$ is asymptotically equivalent to $n^{-1/2}\sum_{i=1}^n u_{ki}$, where $u_{ki}$ is the ith subject's efficient score function associated with the kth weight function. Note that $(u_{1i}, \ldots, u_{Ki})$ $(i = 1, \ldots, n)$ are n independent random vectors. By the multivariate central limit theorem and law of large numbers, the null distribution of $n^{-1/2}(U_1, \ldots, U_K)$ is asymptotically zero-mean normal, and the covariance between $n^{-1/2}U_k$ and $n^{-1/2}U_l$ is consistently estimated by $n^{-1}\sum_{i=1}^n U_{ki}U_{li}$, where the $U_{ki}$'s are obtained from the $u_{ki}$'s by replacing all unknown parameters by their sample estimators.

For quantitative traits, we replace Equation 2 with the linear regression model:

$$Y_i = \tau S_i + \gamma^T Z_i + \epsilon_i,$$

where εi is normal with mean 0 and variance σ2. Then the score statistic and its variance are

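$$U = \sum_{i=1}^n \left( Y_i - \hat\gamma^T Z_i \right) S_i,$$

and

$$V = \hat\sigma^2 \left\{ \sum_{i=1}^n S_i^2 - \left( \sum_{i=1}^n S_i Z_i \right)^T \left( \sum_{i=1}^n Z_i Z_i^T \right)^{-1} \left( \sum_{i=1}^n S_i Z_i \right) \right\},$$

where

$$\hat\gamma = \left( \sum_{i=1}^n Z_i Z_i^T \right)^{-1} \sum_{i=1}^n Y_i Z_i,$$

and

$$\hat\sigma^2 = n^{-1} \sum_{i=1}^n \left( Y_i - \hat\gamma^T Z_i \right)^2.$$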
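The formulas for U, V, the least-squares estimate of γ, and the variance estimate translate directly into a few lines of linear algebra. The Python sketch below is a minimal illustration under the stated linear model; the function name and the simulated data are assumptions made for this example, and the covariate matrix is assumed to include a column of ones.

```python
import numpy as np
from math import erfc, sqrt

def score_test_quantitative(y, s, z):
    """Score test of tau = 0 in Y = tau*S + gamma'Z + error.

    y: (n,) trait values; s: (n,) combined mutation scores;
    z: (n, q) covariates, including a column of ones.
    Returns (U, V, U/sqrt(V), two-sided normal p value).
    """
    y, s, z = (np.asarray(a, dtype=float) for a in (y, s, z))
    n = y.shape[0]
    zz = z.T @ z
    gamma_hat = np.linalg.solve(zz, z.T @ y)   # null-model fit
    resid = y - z @ gamma_hat
    sigma2_hat = resid @ resid / n
    u = resid @ s
    sz = z.T @ s
    v = sigma2_hat * (s @ s - sz @ np.linalg.solve(zz, sz))
    t = u / sqrt(v)
    p = erfc(abs(t) / sqrt(2.0))               # equals 2 * (1 - Phi(|t|))
    return u, v, t, p

# Simulated data (illustrative only, not from the study).
rng = np.random.default_rng(1)
n = 200
z = np.column_stack([np.ones(n), rng.normal(size=n)])
s = (np.arange(n) % 10 == 0).astype(float)     # 20 hypothetical carriers
y = z @ np.array([1.0, 0.5]) + 0.3 * s + rng.normal(size=n)
u, v, t, p = score_test_quantitative(y, s, z)
```

Only the null model is fitted, so the same residuals can be reused for every gene or weight function, which is where the computational advantage of the score statistic comes from.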

For multiple weight functions,

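$$U_k = \sum_{i=1}^n \left( Y_i - \hat\gamma^T Z_i \right) S_{ki},$$

and

$$U_{ki} = \left( Y_i - \hat\gamma^T Z_i \right) \left\{ S_{ki} - \left( \sum_{i=1}^n S_{ki} Z_i \right)^T \left( \sum_{i=1}^n Z_i Z_i^T \right)^{-1} Z_i \right\}.$$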
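For several weight functions at once, the per-subject terms can be stored as an n × K matrix whose column sums are the K score statistics and whose cross-products estimate their covariance. The Python sketch below illustrates this for the linear model; the function name, the two hypothetical score columns, and the simulated data are assumptions made for this example.

```python
import numpy as np

def efficient_scores(y, s_mat, z):
    """Per-subject efficient scores U_ki under the linear model.

    y: (n,) trait; s_mat: (n, K) scores, one column per weight function;
    z: (n, q) covariates, including a column of ones.
    """
    zz = z.T @ z
    resid = y - z @ np.linalg.solve(zz, z.T @ y)
    # Part of each S_k that is not explained by the covariates.
    s_adj = s_mat - z @ np.linalg.solve(zz, z.T @ s_mat)
    return resid[:, None] * s_adj

# Simulated data (illustrative only).
rng = np.random.default_rng(2)
n = 300
z = np.column_stack([np.ones(n), rng.normal(size=n)])
idx = np.arange(n)
s_mat = np.column_stack([idx % 10 == 0, idx % 7 == 0]).astype(float)
y = z @ np.array([0.5, 1.0]) + rng.normal(size=n)

eff = efficient_scores(y, s_mat, z)
u_vec = eff.sum(axis=0)                          # U_1, ..., U_K
cov = eff.T @ eff / n                            # covariance of n^(-1/2) U
t_vec = u_vec / np.sqrt(np.diag(eff.T @ eff))    # standardized statistics
chi2 = u_vec @ np.linalg.solve(eff.T @ eff, u_vec)  # K-df quadratic form
```

The maximum of the absolute standardized statistics and the quadratic form are two ways of combining the K statistics; both rely only on the matrix of efficient scores.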

To perform permutation tests without covariates, we simply permute the $Y_i$'s. In the presence of covariates, we adopt the following procedure: (1) calculate the residuals $R_i = Y_i - \hat\gamma^T Z_i$ $(i = 1, \ldots, n)$, (2) permute the $R_i$'s to yield the $R_i^*$'s, (3) create new trait values $Y_i^* = \hat\gamma^T Z_i + R_i^*$ $(i = 1, \ldots, n)$, (4) replace the $Y_i$'s by the $Y_i^*$'s, (5) recalculate the test statistic, and (6) repeat steps 2–5 a large number of times.
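The six-step residual permutation procedure can be sketched in Python as follows, using the standardized score statistic for a quantitative trait as the test statistic. The function name, the default number of permutations, the (count + 1)/(B + 1) p value convention, and the simulated data are assumptions made for this illustration.

```python
import numpy as np

def residual_permutation_pvalue(y, s, z, n_perm=199, seed=0):
    """Permutation p value with covariates, by permuting null residuals."""
    rng = np.random.default_rng(seed)
    proj = np.linalg.solve(z.T @ z, z.T)         # (Z'Z)^{-1} Z'
    s_var = s @ s - (z.T @ s) @ (proj @ s)

    def stat(y_vec):
        resid = y_vec - z @ (proj @ y_vec)
        v = (resid @ resid / len(y_vec)) * s_var
        return abs(resid @ s) / np.sqrt(v)

    fitted = z @ (proj @ y)
    resid = y - fitted                           # step 1
    observed = stat(y)
    exceed = 0
    for _ in range(n_perm):                      # step 6
        y_star = fitted + rng.permutation(resid) # steps 2 to 4
        if stat(y_star) >= observed:             # step 5
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Simulated data with a strong effect (illustrative only).
rng = np.random.default_rng(4)
n = 100
z = np.column_stack([np.ones(n), rng.normal(size=n)])
s = (np.arange(n) % 5 == 0).astype(float)
y = z @ np.array([1.0, 0.5]) + 2.0 * s + rng.normal(size=n)
p_perm = residual_permutation_pvalue(y, s, z)
```

Permuting the residuals rather than the raw trait values keeps the fitted covariate effects attached to the right subjects, so the relationship between the trait and the covariates is preserved in every permuted data set.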

We have implicitly assumed that the trait is univariate and the subjects are unrelated. For repeated measures or family studies, we use generalized linear mixed models28 to capture the dependence of trait values. Suppose that the study contains n families with $n_i$ members in the ith family. For $i = 1, \ldots, n$ and $l = 1, \ldots, n_i$, let $Y_{il}$, $S_{il}$, and $Z_{il}$ denote the values of Y, S, and Z for the lth member of the ith family. The random effects $b_i$ $(i = 1, \ldots, n)$ are independent zero-mean random vectors with density function $f(b; \theta)$ indexed by a set of parameters $\theta$. Conditional on $b_i$, the trait values $Y_{i1}, \ldots, Y_{in_i}$ are independent and follow a generalized linear model with density $f(y \mid S_{il}, Z_{il}; b_i)$. The log-likelihood function is

$$l(\tau,\eta;\xi) = \sum_{i=1}^n \log \int_b \prod_{l=1}^{n_i} f(Y_{il} \mid S_{il}, Z_{il}; b)\, f(b;\theta)\, db,$$

where τ is the fixed effect of Sil, and η includes the fixed effects of Zil and parameters θ. For repeated measures, the log-likelihood takes the same form with Yil and Zil being the trait and covariate values at the lth measurement time for the ith subject and with Sil replaced by Si. We can then use the arguments of the first three paragraphs to derive the test statistics.

For potentially censored age-at-onset traits, we specify that the hazard function for the age at onset conditional on Si and Zi satisfies the proportional hazards model29

$$\lambda(t \mid S_i, Z_i) = \lambda_0(t)\, e^{\tau S_i + \gamma^T Z_i},$$

where λ0 is an arbitrary baseline hazard function and Zi is redefined to exclude the unit component. Let Ti denote the duration of follow-up for the ith subject, and let Δi indicate, by the values 1 versus 0, whether Ti is the actual age at onset or the censoring time. Then the score statistic and its variance are

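$$U = \sum_{i=1}^n \Delta_i \left( S_i - \frac{\sum_{j \in \mathcal{R}_i} e^{\hat\gamma^T Z_j} S_j}{\sum_{j \in \mathcal{R}_i} e^{\hat\gamma^T Z_j}} \right),$$

and $V = I_{\tau\tau} - I_{\tau\gamma} I_{\gamma\gamma}^{-1} I_{\gamma\tau}$, where $\mathcal{R}_i$ denotes the set of subjects whose durations of follow-up are no shorter than $T_i$, $\hat\gamma$ is the solution to the equation

$$\sum_{i=1}^n \Delta_i \left( Z_i - \frac{\sum_{j \in \mathcal{R}_i} e^{\gamma^T Z_j} Z_j}{\sum_{j \in \mathcal{R}_i} e^{\gamma^T Z_j}} \right) = 0,$$

$$\begin{bmatrix} I_{\tau\tau} & I_{\tau\gamma} \\ I_{\gamma\tau} & I_{\gamma\gamma} \end{bmatrix} = \sum_{i=1}^n \frac{\Delta_i}{\sum_{j \in \mathcal{R}_i} e^{\hat\gamma^T Z_j}} \left\{ \sum_{j \in \mathcal{R}_i} e^{\hat\gamma^T Z_j} \begin{bmatrix} S_j \\ Z_j \end{bmatrix}^{\otimes 2} - \left( \sum_{j \in \mathcal{R}_i} e^{\hat\gamma^T Z_j} \right)^{-1} \left( \sum_{j \in \mathcal{R}_i} e^{\hat\gamma^T Z_j} \begin{bmatrix} S_j \\ Z_j \end{bmatrix} \right)^{\otimes 2} \right\},$$

and $a^{\otimes 2} = aa^T$. For multiple weight functions, we obtain the efficient score functions by approximating the partial likelihood score function with a sum of n independent terms.30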
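To make the risk-set sums concrete, the Python sketch below evaluates the score statistic and its variance in the special case with no covariates, where every exponential weight equals 1 and V reduces to the τ-τ information term. The function name, the toy data, and the one-event-at-a-time handling of any tied times are assumptions made for this illustration; with covariates, γ would first have to be estimated from the null model.

```python
import numpy as np

def cox_score_no_covariates(time, event, s):
    """Score statistic U and variance V for tau = 0, without covariates.

    time: (n,) follow-up durations; event: (n,) 1 if onset was observed,
    0 if censored; s: (n,) combined mutation scores.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    s = np.asarray(s, dtype=float)
    u = 0.0
    v = 0.0
    for i in np.flatnonzero(event == 1):
        at_risk = time >= time[i]        # risk set for subject i
        s_bar = s[at_risk].mean()
        u += s[i] - s_bar                # observed minus risk-set average
        v += (s[at_risk] ** 2).mean() - s_bar ** 2
    return u, v

# Toy data: 4 subjects, one censored (illustrative only).
u_cox, v_cox = cox_score_no_covariates(
    time=[1.0, 2.0, 3.0, 4.0],
    event=[1, 0, 1, 1],
    s=[1.0, 0.0, 1.0, 0.0],
)
```

Each event contributes the difference between the affected subject's score and the average score among those still at risk, so U is positive when mutation carriers tend to have earlier onset.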

Supplemental Data

Document S1. Four Figures and Eight Tables:

Web Resources

The URL for data presented herein is as follows:

References

1. Pritchard J.K. Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 2001;69:124–137.
2. Gorlov I.P., Gorlova O.Y., Sunyaev S.R., Spitz M.R., Amos C.I. Shifting paradigm of association studies: Value of rare single-nucleotide polymorphisms. Am. J. Hum. Genet. 2008;82:100–112.
3. Cohen J.C., Kiss R.S., Pertsemlidis A., Marcel Y.L., McPherson R., Hobbs H.H. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004;305:869–872.
4. Ahituv N., Kavaslar N., Schackwitz W., Ustaszewska A., Martin J., Hebert S., Doelle H., Ersoy B., Kryukov G., Schmidt S. Medical sequencing at the extremes of human body mass. Am. J. Hum. Genet. 2007;80:779–791.
5. Nejentsev S., Walker N., Riches D., Egholm M., Todd J.A. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009;324:387–389.
6. Morgenthaler S., Thilly W.G. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutat. Res. 2007;615:28–56.
7. Li B., Leal S.M. Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. Am. J. Hum. Genet. 2008;83:311–321.
8. Madsen B.E., Browning S.R. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384.
9. Price A.L., Kryukov G.V., de Bakker P.I.W., Purcell S.M., Staples J., Wei L.J., Sunyaev S.R. Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 2010;86:832–838.
10. Han F., Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum. Hered. 2010;70:42–54.
11. Neale B.M., Rivas M.A., Voight B.F., Altshuler D., Devlin B., Orho-Melander M., Kathiresan S., Purcell S.M., Roeder K., Daly M.J. Testing for an unusual distribution of rare variants. PLoS Genet. 2011;7:e1001322.
12. Wu M.C., Lee S., Cai T., Li Y., Boehnke M., Lin X. Rare variant association testing for sequencing data using the sequence kernel association test (SKAT). Am. J. Hum. Genet. 2011;89:82–93.
13. Davison A.C., Hinkley D.V. Bootstrap Methods and Their Application. Cambridge University Press; Cambridge: 1997.
14. Li L., Li Y., Browning S.R., Browning B.L., Slater A.J., Kong X., Aponte J.L., Mooser V.E., Chissoe S.L., Whittaker J.C., Nelson M.R., Ehm M.G. Performance of genotype imputation for rare variants identified in exons and flanking regions of genes. PLoS One. 2011; in press.
15. Firmann M., Mayor V., Vidal P.M., Bochud M., Pecoud A., Hayoz D., Paccaud F., Preisig M., Song K.S., Yuan X. The CoLaus study: A population-based study to investigate the epidemiology and genetic determinants of cardiovascular risk factors and metabolic syndrome. BMC Cardiovasc. Disord. 2008;8:6.
16. Teslovich T.M., Musunuru K., Smith A.V., Edmondson A.C., Stylianou I.M., Koseki M., Pirruccello J.P., Ripatti S., Chasman D.I., Willer C.J. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466:707–713.
17. Li Y., Byrnes A.E., Li M. To identify associations with rare variants, just WHaIT: Weighted Haplotype and Imputation-based Tests. Am. J. Hum. Genet. 2010;87:728–735.
18. Liu D.J., Leal S.M. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet. 2010;6:e1001156.
19. King C.R., Rathouz P.J., Nicolae D.L. An evolutionary framework for association testing in resequencing studies. PLoS Genet. 2010;6:e1001202.
20. Ng P.C., Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814.
21. Adzhubei I.A., Schmidt S., Peshkin L., Ramensky V.E., Gerasimova A., Bork P., Kondrashov A.S., Sunyaev S.R. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249.
22. Schaid D.J., McDonnell S.K., Hebbring S.J., Cunningham J.M., Thibodeau S.N. Nonparametric tests of association of multiple genes with human disease. Am. J. Hum. Genet. 2005;76:780–793.
23. Wessel J., Schork N.J. Generalized genomic distance-based regression methodology for multilocus association analysis. Am. J. Hum. Genet. 2006;79:792–806.
24. Tzeng J.Y., Zhang D. Haplotype-based association analysis via variance component score test. Am. J. Hum. Genet. 2007;81:939–963.
25. Lin D.Y. An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics. 2005;21:781–787.
26. Cox D.R., Hinkley D.V. Theoretical statistics. Chapman and Hall; New York: 1974.
27. Lin D.Y. Evaluating statistical significance in two-stage genomewide association studies. Am. J. Hum. Genet. 2006;78:505–509.
28. Diggle P.J., Heagerty P., Liang K.-Y., Zeger S.L. Analysis of longitudinal data. Second Edition. Oxford University Press; Oxford: 2002.
29. Cox D.R. Regression models and life-tables (with discussion). J. R. Stat. Soc. B. 1972;34:187–220.
30. Lin D.Y., Wei L.J. The robust inference for the Cox proportional hazards model. J. Am. Stat. Assoc. 1989;84:1074–1078.

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics