# A General Framework for Detecting Disease Associations with Rare Variants in Sequencing Studies

^{1}Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, USA

## Abstract

Biological and empirical evidence suggests that rare variants account for a large proportion of the genetic contributions to complex human diseases. Recent technological advances in high-throughput sequencing platforms have made it possible for researchers to generate comprehensive information on rare variants in large samples. We provide a general framework for association testing with rare variants by combining mutation information across multiple variant sites within a gene and relating the enriched genetic information to disease phenotypes through appropriate regression models. Our framework covers all major study designs (i.e., case-control, cross-sectional, cohort and family studies) and all common phenotypes (e.g., binary, quantitative, and age at onset), and it allows arbitrary covariates (e.g., environmental factors and ancestry variables). We derive theoretically optimal procedures for combining rare mutations and construct suitable test statistics for various biological scenarios. The allele-frequency threshold can be fixed or variable. The effects of the combined rare mutations on the phenotype can be in the same direction or different directions. The proposed methods are statistically more powerful and computationally more efficient than existing ones. An application to a deep-resequencing study of drug targets led to a discovery of rare variants associated with total cholesterol. The relevant software is freely available.

## Introduction

Genome-wide association studies (GWAS) with tagSNPs have successfully identified common SNPs with small to modest effects for virtually every complex human disease. Technological advances in high-throughput sequencing platforms have made it possible for researchers to extend association studies to rare variants in targeted exons and soon in the entire genome. Rare variants tend to be functional alleles and have stronger effects on complex diseases than common variants.^{1,2} Indeed, deep-resequencing studies of candidate genes have already demonstrated the influence of rare variants on several complex traits.^{3–5}

Association testing with a single rare variant has limited power because only a small percentage of study subjects carry a rare mutation and there are a large number of tests to be adjusted for. Collapsing or grouping methods, which combine information across multiple variant sites within a gene, can enrich association signals and reduce the penalty of multiple testing. The simplest collapsing method is the burden test, which is based on the number of rare mutations each subject carries in a gene.^{6,7} A second approach is the weighted sum statistic of Madsen and Browning,^{8} which weights each mutation according to its frequency in the unaffected subjects and permutes the disease status to assess the significance of a Wilcoxon-type test statistic. A third approach is the variable-threshold (VT) idea of Price et al.,^{9} which uses the maximum of the test statistics over all allele-frequency thresholds and assesses statistical significance by permutation. The foregoing methods assume that the effects of the combined rare mutations on the phenotype are in the same direction. To detect opposite effects, Han and Pan^{10} incorporated the signs of the observed effects into the burden test, whereas Neale et al.^{11} and Wu et al.^{12} tested the variance of the effects.

In this article, we provide a general framework for association testing with rare variants that reflects the spirit of the existing methods but is statistically more powerful and computationally more efficient. Our framework covers all major study designs (i.e., case-control, cross-sectional, cohort and family studies) and all common phenotypes (e.g., binary and quantitative traits, and potentially censored ages at onset of disease) and allows any covariates (e.g., environmental factors and ancestry variables). The ability to accommodate covariates is critically important because population stratification is expected to be a more severe issue with rare variants than with common variants but could be corrected by including suitable ancestry variables (e.g., the percentage of African ancestry or principal components for ancestry) in the association analysis. We combine information across multiple variant sites within a gene by taking a weighted sum of the mutation counts for each study subject and relate the combined information and covariates to disease phenotypes through appropriate regression models. We derive theoretically optimal weights that would produce the most powerful tests among all valid tests and develop the corresponding testing procedures. We employ score-type statistics, which are numerically stable even in the case of extremely rare variants and computationally fast even in the presence of covariates. We provide asymptotic normal approximations for both fixed-threshold and VT methods and develop permutation and other resampling tests that can accommodate covariates. We investigate theoretically and numerically when normal approximation is appropriate and when resampling is required. We modify the popular methods of Madsen and Browning^{8} and Price et al.^{9} to enhance statistical power, avoid permutation, and accommodate covariates. We construct data-adaptive test statistics that are powerful even when the combined rare mutations have opposite effects on the phenotype. The advantages of the proposed methods over the existing ones are demonstrated both analytically and empirically. The software implementing the proposed methods is available at our website.

## Material and Methods

Suppose that a total of *n* subjects are genotyped on a total of *m* SNPs in a gene and that there are *d* covariates. Here, the word “gene” refers to the group of variants that will be collectively analyzed and might pertain to a subset of SNPs within a gene or to a region or pathway involving multiple genes; covariates might include nongenetic variables, such as age and smoking status, as well as ancestry variables, such as the percentage of African ancestry and principal components for ancestry. For *i* = 1, …, *n*, let *Y*_{i} be the phenotype value of the *i*th subject; for *i* = 1, …, *n* and *j* = 1, …, *m*, let *X*_{ji} denote the number of rare mutations the *i*th subject carries at the *j*th SNP; for *i* = 1, …, *n* and *j* = 1, …, *d*, let *Z*_{ji} denote the value of the *j*th covariate on the *i*th subject. We can define *X*_{i} = (*X*_{1i}, …, *X*_{mi})^{T} and *Z*_{i} = (1, *Z*_{1i}, …, *Z*_{di})^{T}.

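We focus on binary phenotypes in the main text but consider all common phenotypes in Appendix A. It is natural to relate *Y*_{i} to *X*_{i} and *Z*_{i} through the logistic regression model:

$$\mathrm{Pr}\left(Y_i = 1 \mid X_i, Z_i\right) = \frac{e^{\beta^{\mathrm{T}} X_i + \gamma^{\mathrm{T}} Z_i}}{1 + e^{\beta^{\mathrm{T}} X_i + \gamma^{\mathrm{T}} Z_i}}, \tag{1}$$

where β and γ are *m* × 1 and (*d* + 1) × 1 vectors of unknown regression coefficients. Because the first component of *Z*_{i} is 1, the first component of γ corresponds to the intercept. We can write *β* = *τξ*, where τ is a scalar constant, and *ξ* = *β* / *τ*. Then Equation (1) becomes

$$\mathrm{Pr}\left(Y_i = 1 \mid X_i, Z_i\right) = \frac{e^{\tau S_i + \gamma^{\mathrm{T}} Z_i}}{1 + e^{\tau S_i + \gamma^{\mathrm{T}} Z_i}},$$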

where *S*_{i} = *ξ*^{T}*X*_{i}. Note that *ξ* = (*ξ*_{1}, …, *ξ*_{m})^{T} is an *m* × 1 vector of weights and that *S*_{i} is a weighted linear combination of *X*_{1i}, …, *X*_{mi} with *X*_{ji} receiving the weight *ξ*_{j}. We will refer to ξ as the weight function.

The score statistic for testing the null hypothesis *H*_{0}:*τ* = 0 takes the form

$$U = \sum_{i=1}^{n} \left( Y_i - \frac{e^{\hat{\gamma}^{\mathrm{T}} Z_i}}{1 + e^{\hat{\gamma}^{\mathrm{T}} Z_i}} \right) S_i,$$

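where $\hat{\gamma}$ is the restricted maximum likelihood estimator of γ and solves the equation

$$\sum_{i=1}^{n} \left( Y_i - \frac{e^{\gamma^{\mathrm{T}} Z_i}}{1 + e^{\gamma^{\mathrm{T}} Z_i}} \right) Z_i = 0.$$

The variance of *U* is estimated by

$$V = \sum_{i=1}^{n} v_i S_i^2 - \left( \sum_{i=1}^{n} v_i S_i Z_i^{\mathrm{T}} \right) \left( \sum_{i=1}^{n} v_i Z_i Z_i^{\mathrm{T}} \right)^{-1} \left( \sum_{i=1}^{n} v_i S_i Z_i \right),$$

where

$$v_i = \frac{e^{\hat{\gamma}^{\mathrm{T}} Z_i}}{\left( 1 + e^{\hat{\gamma}^{\mathrm{T}} Z_i} \right)^2}.$$

Under *H*_{0}, the test statistic *T* = *U* / *V*^{1/2} is asymptotically standard normal. In the absence of covariates,

$$U = \sum_{i=1}^{n} \left( Y_i - \overline{Y} \right) S_i$$

and

$$V = \overline{Y} \left( 1 - \overline{Y} \right) \left\{ \sum_{i=1}^{n} S_i^2 - n^{-1} \left( \sum_{i=1}^{n} S_i \right)^2 \right\},$$

where $\overline{Y} = n^{-1} \sum_{i=1}^{n} Y_i$.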
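The score test described above can be illustrated numerically. The following is a minimal sketch, assuming the standard score statistic and model-based variance for a logistic null model fitted by Newton-Raphson; the function names `fit_null_logistic` and `score_test` are illustrative and are not taken from the authors' software.

```python
import numpy as np


def fit_null_logistic(Y, Z, n_iter=25, tol=1e-10):
    """Fit logit Pr(Y = 1) = gamma^T Z by Newton-Raphson (no genetic term)."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ gamma)))
        v = mu * (1.0 - mu)
        step = np.linalg.solve(Z.T @ (Z * v[:, None]), Z.T @ (Y - mu))
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma


def score_test(Y, S, Z=None):
    """Return (U, V, T) for H0: tau = 0, given the combined scores S_i.

    Z holds the covariates without the intercept column; an intercept is
    always added, so Z=None corresponds to the no-covariate case.
    """
    Y = np.asarray(Y, dtype=float)
    S = np.asarray(S, dtype=float)
    n = len(Y)
    Zfull = np.ones((n, 1)) if Z is None else np.column_stack([np.ones(n), Z])
    gamma_hat = fit_null_logistic(Y, Zfull)
    mu = 1.0 / (1.0 + np.exp(-(Zfull @ gamma_hat)))  # fitted null probabilities
    v = mu * (1.0 - mu)
    U = np.sum((Y - mu) * S)
    vSZ = Zfull.T @ (v * S)
    V = np.sum(v * S**2) - vSZ @ np.linalg.solve(Zfull.T @ (Zfull * v[:, None]), vSZ)
    return U, V, U / np.sqrt(V)
```

Because the null model does not involve the variants, `fit_null_logistic` needs to run only once per phenotype; only the cheap sums over `S` change from gene to gene.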

The true value of the weight function *ξ* = (*ξ*_{1}, …, *ξ*_{m})^{T} is unknown and must be determined biologically or empirically. If we set *ξ*_{j} = 1 (*j* = 1, …, *m*), then *T* is a burden test, which counts the total number of rare mutations each subject carries over the *m* SNPs. If we believe that common variants are not associated with the phenotype, then we set *ξ*_{j} = 0 if *p*_{j} > *c*, where *p*_{j} is the minor allele frequency (MAF) of the *j*th SNP, and *c* is a given threshold. If we set *ξ*_{j} = {*p*_{j}(1 − *p*_{j})}^{−1/2} (*j* = 1, …, *m*), then the weight function is in the same vein as that of Madsen and Browning.^{8}

If the choice of the weight function ξ is not proportional to β or ξ is estimated from the data, then *U* is no longer the score statistic. However, we show in Appendix A that the test statistic *T* is asymptotically standard normal under *H*_{0} regardless of how ξ is determined. The only condition is that if ξ is estimated from the data, then the estimate converges to a constant vector as the sample size *n* increases. This condition is satisfied by all sensible estimates, including those based on estimated allele frequencies. If the choice of ξ or the limit of the estimate of ξ is proportional to β, then the corresponding test statistic *T* is the most powerful among all valid tests.

The weight function ξ is similar to that of Price et al.^{9} The latter authors showed that, for case-control studies with known allele frequencies in the control population, the choice of *ξ*_{j} = {*p*_{j}(1 − *p*_{j})}^{−1/2} (*j* = 1, …, *m*) corresponds to the implicit assumption that log(*OR*_{j}) ∝ {*p*_{j}(1 − *p*_{j})}^{−1/2} (*j* = 1, …, *m*), where *OR*_{j} is the odds ratio in the 2 × 2 table for the *j*th SNP. Our theory is much more general in that it assumes unknown allele frequencies and accommodates covariates. Indeed, the proposed test statistic is optimal if ξ is proportional to the set of regression coefficients (in the limit); this result holds for all phenotypes, including binary and continuous traits, as well as potentially censored ages at onset of disease.

Madsen and Browning^{8} suggested setting $\xi_j = \{\hat{p}_j (1 - \hat{p}_j)\}^{-1/2}$ (*j* = 1, …, *m*), where $\hat{p}_j$ is the estimate of the MAF of the *j*th SNP in the unaffected subjects. Because the weights depend on the phenotype values, the authors suggested a permutation-based test. Our testing framework allows such data-dependent weights because the frequency estimates converge to the true values as *n* increases. To improve the accuracy of asymptotic approximation, we suggest estimating the frequencies from all study subjects rather than the unaffected subjects. Because the variants can be very rare, we recommend adding pseudocounts when estimating the frequencies, as was done by Madsen and Browning.^{8} The weight functions based on the frequency estimates in the pooled sample and the unaffected subjects will be denoted by *F*_{p} and *F*_{u}, respectively; the constant weight function will be denoted by *C*. The corresponding tests will be referred to as the *F*_{p} test, the *F*_{u} test, and the *C* test.
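The three weight functions can be sketched in a few lines. This is an illustration only: the pseudocount form (count + 1) / (2N + 2) is an assumption made here for concreteness, and the function name `weight_functions` and the optional `maf_threshold` argument are hypothetical.

```python
import numpy as np


def weight_functions(X, Y, maf_threshold=None):
    """Compute the C, F_p and F_u weights for an (n, m) matrix of mutation counts.

    X[i, j] is the number of rare mutations (0, 1 or 2) subject i carries at
    SNP j; Y[i] is 1 for affected and 0 for unaffected subjects.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    n, m = X.shape
    # Assumed pseudocount estimator: (mutation count + 1) / (2 * subjects + 2).
    p_pooled = (X.sum(axis=0) + 1.0) / (2.0 * n + 2.0)
    X_unaff = X[Y == 0]
    p_unaff = (X_unaff.sum(axis=0) + 1.0) / (2.0 * X_unaff.shape[0] + 2.0)
    weights = {
        "C": np.ones(m),                                    # burden test
        "Fp": 1.0 / np.sqrt(p_pooled * (1.0 - p_pooled)),   # pooled-sample frequencies
        "Fu": 1.0 / np.sqrt(p_unaff * (1.0 - p_unaff)),     # unaffected subjects only
    }
    if maf_threshold is not None:
        # Fixed-threshold variant: zero weight for SNPs above the MAF cutoff.
        keep = p_pooled <= maf_threshold
        weights = {name: w * keep for name, w in weights.items()}
    return weights
```

The combined score for any of these choices is then `S = X @ weights["Fp"]` (or `"C"`, `"Fu"`), which is the quantity that enters the test statistic.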

Although *F*_{u} is the weight function used by Madsen and Browning,^{8} our *F*_{u} test is fundamentally different from the Madsen and Browning (MB) test. The latter is based on the sum of the ranks of the *S*_{i}'s with weight function *F*_{u} over the affected subjects. Madsen and Browning^{8} proposed to assess the statistical significance of their rank-sum statistic by permutation. They also suggested an asymptotic normal approximation by standardizing the rank-sum statistic by its mean and standard deviation. Because the mean and standard deviation are estimated by permutation, the asymptotic version of the MB test is many orders of magnitude slower than our asymptotic tests. The rank-sum statistic is confined to case-control analysis without covariates.

Price et al.^{9} developed a VT method by taking the maximum of the test statistics (i.e., *Z* scores) over all allele-frequency thresholds and assessing statistical significance by permutation. We describe below a more general approach that allows not only multiple allele-frequency thresholds but also different types of weight function; it also accommodates covariates and does not require permutation.

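We consider *K* choices of ξ, which could correspond to different thresholds or different types of weight function, or both. (It is assumed that *K* is small relative to *n*.) For the *k*th choice of ξ, the corresponding *S*_{i} is denoted by *S*_{ki}. Then the score statistic is

$$U_k = \sum_{i=1}^{n} \left( Y_i - \frac{e^{\hat{\gamma}^{\mathrm{T}} Z_i}}{1 + e^{\hat{\gamma}^{\mathrm{T}} Z_i}} \right) S_{ki},$$

and the test statistic is $T_k = U_k / V_k^{1/2}$, where

$$V_k = \sum_{i=1}^{n} v_i S_{ki}^2 - \left( \sum_{i=1}^{n} v_i S_{ki} Z_i^{\mathrm{T}} \right) \left( \sum_{i=1}^{n} v_i Z_i Z_i^{\mathrm{T}} \right)^{-1} \left( \sum_{i=1}^{n} v_i S_{ki} Z_i \right).$$

It is shown in Appendix A that, under *H*_{0}, the random vector (*U*_{1}, …, *U*_{K})^{T} is approximately *K*-variate normal with mean 0 and covariance matrix {*V*_{kl}; *k*, *l* = 1, …, *K*}, where

$$V_{kl} = \sum_{i=1}^{n} v_i S_{ki} S_{li} - \left( \sum_{i=1}^{n} v_i S_{ki} Z_i^{\mathrm{T}} \right) \left( \sum_{i=1}^{n} v_i Z_i Z_i^{\mathrm{T}} \right)^{-1} \left( \sum_{i=1}^{n} v_i S_{li} Z_i \right)$$

and

$$v_i = \frac{e^{\hat{\gamma}^{\mathrm{T}} Z_i}}{\left( 1 + e^{\hat{\gamma}^{\mathrm{T}} Z_i} \right)^2}.$$

For the two-sided test, we consider the maximum of the absolute test statistics

$$T_{\max} = \max_{k = 1, \ldots, K} \left| T_k \right|.$$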

Let *t*_{max} be the observed value of *T*_{max}. The p value is given by

Pr (*T*_{max} ≥ *t*_{max}) = 1 − Pr (|*T*_{1}| < *t*_{max}, …, |*T*_{K}| < *t*_{max}),

which is evaluated by treating (*T*_{1}, …, *T*_{K})^{T} as a *K*-variate normal random vector with a mean of 0 and a covariance matrix of {*r*_{kl}; *k*, *l* = 1, …, *K*}, where *r*_{kl} = *V*_{kl} / (*V*_{kk}*V*_{ll})^{1 / 2}. (The one-sided p value can be calculated in a similar manner.) We reject *H*_{0} if the p value is smaller than the nominal significance level α.
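A minimal sketch of this calculation for the no-covariate case with *K* allele-frequency thresholds follows. It makes two assumptions that are not in the text: constant weights within each threshold, and a Monte Carlo estimate of the multivariate normal probability in place of exact numerical integration. The function name `vt_test` is illustrative, and every threshold is assumed to retain at least one polymorphic variant.

```python
import numpy as np


def vt_test(Y, X, maf, thresholds, n_draws=200_000, seed=0):
    """Variable-threshold test without covariates; returns (t_max, p value).

    S[k, i] is the number of rare mutations subject i carries among the
    variants whose MAF does not exceed thresholds[k].
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    maf = np.asarray(maf, dtype=float)
    S = np.stack([X[:, maf <= c].sum(axis=1) for c in thresholds])
    S_centered = S - S.mean(axis=1, keepdims=True)
    ybar = Y.mean()
    U = S_centered @ (Y - ybar)                            # U_k, k = 1..K
    V = ybar * (1.0 - ybar) * (S_centered @ S_centered.T)  # V_kl without covariates
    sd = np.sqrt(np.diag(V))
    T = U / sd
    R = V / np.outer(sd, sd)                               # r_kl = V_kl / (V_kk V_ll)^(1/2)
    t_max = float(np.max(np.abs(T)))
    # Monte Carlo estimate of Pr(max_k |T_k| >= t_max) under N(0, R).
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(np.zeros(len(T)), R, size=n_draws)
    p_value = float(np.mean(np.max(np.abs(draws), axis=1) >= t_max))
    return t_max, p_value
```

With a single threshold the result reduces to the usual two-sided normal p value, which gives a simple sanity check. For very small p values a deterministic multivariate normal integrator would be preferable to sampling.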

The tests based on positive weight functions, such as *C*, *F*_{u}, and *F*_{p}, will have low power if the mutations being combined have opposite effects on the phenotype. The optimal choice of *ξ*_{j} is *β*_{j}, which is unknown. We can estimate *β*_{j} from the data. It would be tempting to set *ξ*_{j} to $\hat{\beta}_j$, where $\hat{\beta}_j$ is an appropriate estimate of *β*_{j}. There are two major problems with this strategy. First, the test statistic *T* will not be asymptotically normal. Second, the $\hat{\beta}_j$'s are highly variable (because the individual variants are very rare) and can be quite different from the true values of the *β*_{j}'s. As a compromise, we set $\xi_j = \hat{\beta}_j + \delta$, where δ is a given constant. We refer to this weight function as EREC, an abbreviation of *e*stimated *re*gression *c*oefficients. The corresponding test statistic *T* will be asymptotically standard normal as long as δ is nonzero. Indeed, the EREC test is asymptotically optimal in that *ξ*_{j} will converge to *β*_{j} if we let δ decrease to 0 as the sample size *n* increases to ∞. The asymptotic normality and optimality require very large samples. For small samples, we recommend using a relatively large value of δ so that the weights are not unduly driven by the highly variable $\hat{\beta}_j$'s. For *n* < 2000, we set *δ* = 1 for binary traits and *δ* = 2 for standardized quantitative traits.
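As a concrete illustration of the EREC weights for a binary trait, the sketch below estimates each regression coefficient by a per-variant log odds ratio with a pseudocount of 1 in each cell, as in the simulation studies. Treating the 2×2 table as carrier status by case status is an assumption made here; the function name `erec_weights` is hypothetical.

```python
import numpy as np


def erec_weights(X, Y, delta=1.0):
    """EREC weights xi_j = beta_hat_j + delta for a binary phenotype.

    X is an (n, m) matrix of mutation counts and Y an (n,) vector of 0/1
    phenotypes. beta_hat_j is the log odds ratio from the carrier-by-status
    2x2 table after adding a pseudocount of 1 to each cell (assumed layout).
    """
    carrier = np.asarray(X) > 0
    case = (np.asarray(Y) == 1)[:, None]
    a = np.sum(carrier & case, axis=0) + 1.0     # carriers among cases
    b = np.sum(~carrier & case, axis=0) + 1.0    # non-carriers among cases
    c = np.sum(carrier & ~case, axis=0) + 1.0    # carriers among controls
    d = np.sum(~carrier & ~case, axis=0) + 1.0   # non-carriers among controls
    beta_hat = np.log((a * d) / (b * c))
    return beta_hat + delta
```

Because these weights depend on the phenotype values, they must be recomputed for every resample when the test is carried out by permutation or bootstrap.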

The sequence kernel association test (SKAT) of Wu et al.^{12} assumes that *β*_{j} follows an arbitrary distribution with a mean of 0 and a variance of *ξ*_{j}*ν*, and tests the null hypothesis that *ν* = 0 by using a variance-component score statistic. The SKAT statistic can be written as $Q = \sum_{j=1}^{m} \xi_j U_j^2$, where *U*_{j} is the *j*th component of the score statistic for testing the null hypothesis that *β* = 0 under Equation 1. The C-alpha statistic of Neale et al.^{11} is a special case of *Q* with *ξ*_{j} = 1 for binary traits without covariates. Our score statistic *U* can be written as $\sum_{j=1}^{m} \xi_j U_j$. The Han and Pan^{10} (HP) statistic is a special case of *U* (for binary traits without covariates) in which *ξ*_{j} = −1 if $\hat{\beta}_j < 0$ and the corresponding p value < 0.1 and in which *ξ*_{j} = 1 otherwise.
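The relation between the linear statistic *U* and the quadratic statistic *Q* can be made concrete with a short sketch for the no-covariate case, in which the per-variant score is taken to be the sum over subjects of (*Y*_{i} − mean of *Y*) times *X*_{ji}. The function names are illustrative.

```python
import numpy as np


def per_variant_scores(Y, X):
    """U_j = sum_i (Y_i - Ybar) X_ji for each variant j (no covariates)."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    return X.T @ (Y - Y.mean())


def linear_statistic(Y, X, xi):
    """U = sum_j xi_j U_j: signed effects can cancel."""
    return float(np.dot(xi, per_variant_scores(Y, X)))


def quadratic_statistic(Y, X, xi):
    """Q = sum_j xi_j U_j^2: insensitive to the signs of the effects."""
    return float(np.dot(xi, per_variant_scores(Y, X) ** 2))
```

In a toy data set where one variant is seen only in cases and another only in controls, the two per-variant scores have opposite signs, so *U* with constant weights is 0 while *Q* is positive. This is the situation in which positive-weight burden tests lose power.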

Because the asymptotic approximation might not be accurate in small samples, especially when the weight function ξ involves the phenotype values *Y*_{i}'s, we also provide permutation-type tests. In the absence of covariates, we simply permute the phenotype values *Y*_{i}'s and calculate the test statistic *T* for each permutation. Note that it is necessary to recalculate the *S*_{i}'s after permuting the *Y*_{i}'s if the weight function ξ depends on the *Y*_{i}'s.

Our permutation differs from that of Price et al.^{9} in that we permute *T*, whereas they permuted $\sum_{i=1}^{n} Y_i S_i$. The former is a pivotal statistic, whereas the latter is not. (It is desirable to permute a pivotal statistic.^{13}) If the test is one-sided and the weight function does not depend on the phenotype values, then our permutation is equivalent to Price et al.'s^{9}; otherwise, the two are different. For VT methods, the numerators in the *Z* scores of Price et al.^{9} are the same as ours, but the denominators are not the same as or proportional to ours. Thus, the permutation p values are generally different between the two methods. The permutation version of the MB test requires ranking the *S*_{i}'s for each permutation and is thus substantially slower than our permutation tests.
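The permutation procedure without covariates can be sketched as follows. The standardized statistic *T* is permuted, and the weights are recomputed inside the loop so that phenotype-dependent weight functions are handled correctly. The names `t_statistic`, `permutation_p_value`, and `weight_fn` are illustrative, and the (hits + 1) / (permutations + 1) estimator is a common convention assumed here rather than one stated in the text.

```python
import numpy as np


def t_statistic(Y, S):
    """T = U / V^(1/2) for a binary phenotype without covariates."""
    ybar = Y.mean()
    S_centered = S - S.mean()
    U = np.sum((Y - ybar) * S_centered)
    V = ybar * (1.0 - ybar) * np.sum(S_centered**2)
    return U / np.sqrt(V)


def permutation_p_value(Y, X, weight_fn, n_perm=1000, seed=0):
    """Two-sided permutation p value for T.

    weight_fn(X, Y) returns the m weights; it is called again for every
    permuted phenotype vector because the weights may depend on Y.
    """
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    t_obs = abs(t_statistic(Y, X @ weight_fn(X, Y)))
    hits = 0
    for _ in range(n_perm):
        Y_perm = rng.permutation(Y)
        S_perm = X @ weight_fn(X, Y_perm)   # recompute S_i after permuting Y
        hits += int(abs(t_statistic(Y_perm, S_perm)) >= t_obs)
    return (hits + 1) / (n_perm + 1)
```

For a burden test one could pass `weight_fn=lambda X, Y: np.ones(X.shape[1])`; in that case the scores do not change across permutations and could be computed once outside the loop.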

In the presence of covariates, permuting the *Y*_{i}'s is not appropriate because *Y*_{i} is generally correlated with *Z*_{i}. Instead, we generate $Y_i^{\ast}$ from the fitted null model:

$$\mathrm{Pr}\left( Y_i^{\ast} = 1 \right) = \frac{e^{\hat{\gamma}^{\mathrm{T}} Z_i}}{1 + e^{\hat{\gamma}^{\mathrm{T}} Z_i}},$$

replace the *Y*_{i}'s with the $Y_i^{\ast}$'s, and recalculate the test statistic. (The recalculation of the test statistic starts with re-estimating γ and recalculating the *S*_{i}'s.) This process is repeated and is called (parametric) bootstrap.^{13} Both permutation and bootstrap are resampling methods. In the absence of covariates, $\mathrm{Pr}\left( Y_i^{\ast} = 1 \right)$ is the sample proportion of cases.
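A minimal sketch of the parametric bootstrap is given below, under the simplifying assumption that the weights do not depend on the phenotype, so the scores `S` are held fixed across bootstrap samples; with phenotype-dependent weights they would be recomputed from each bootstrap phenotype vector. The function names and the Newton-Raphson null fit are illustrative choices, not the authors' implementation.

```python
import numpy as np


def null_probs(Y, Z, n_iter=25):
    """Fitted Pr(Y_i = 1) under the null logistic model; Z includes the intercept."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ gamma)))
        v = mu * (1.0 - mu)
        gamma = gamma + np.linalg.solve(Z.T @ (Z * v[:, None]), Z.T @ (Y - mu))
    return 1.0 / (1.0 + np.exp(-(Z @ gamma)))


def score_T(Y, S, Z):
    """Covariate-adjusted T; the null model is re-estimated from Y each time."""
    mu = null_probs(Y, Z)
    v = mu * (1.0 - mu)
    U = np.sum((Y - mu) * S)
    vSZ = Z.T @ (v * S)
    V = np.sum(v * S**2) - vSZ @ np.linalg.solve(Z.T @ (Z * v[:, None]), vSZ)
    return U / np.sqrt(V)


def bootstrap_p_value(Y, S, Z, n_boot=500, seed=0):
    """Two-sided parametric-bootstrap p value for T."""
    rng = np.random.default_rng(seed)
    mu_hat = null_probs(Y, Z)                 # fitted null model from observed data
    t_obs = abs(score_T(Y, S, Z))
    hits = 0
    for _ in range(n_boot):
        Y_star = (rng.random(len(Y)) < mu_hat).astype(float)  # draw Y* from null fit
        hits += int(abs(score_T(Y_star, S, Z)) >= t_obs)
    return (hits + 1) / (n_boot + 1)
```

Drawing `Y_star` from the fitted probabilities preserves the dependence of the phenotype on the covariates, which simple permutation of the phenotype would destroy.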

Obtaining an accurate estimate of a small p value requires a large number of resamples (i.e., permutations or bootstrap samples). However, most p values are relatively large and can be estimated accurately with a small number of resamples. Thus, we employ a multistage procedure which filters out large p values with small numbers of resamples and uses large numbers of resamples only for the most extreme p values.
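One way to organize such a multistage procedure is sketched below. The stage sizes and the stopping rule (stop once at least `min_hits` resampled statistics are as extreme as the observed one) are hypothetical choices for illustration; the text does not specify them.

```python
def multistage_p_value(exceeds, stages=(1_000, 100_000, 1_000_000), min_hits=10):
    """Adaptive resampling p value.

    exceeds() performs one resample and returns True when the resampled
    |T| is at least as large as the observed |T|. Resampling continues to
    the next stage only while fewer than min_hits exceedances were seen.
    Returns (p value estimate, number of resamples used).
    """
    hits = 0
    done = 0
    for target in stages:
        while done < target:
            hits += bool(exceeds())
            done += 1
        if hits >= min_hits:
            break   # p value is large enough to be estimated accurately
    return (hits + 1) / (done + 1), done
```

A gene with a p value near 0.5 stops after the first stage, whereas only genes with very few exceedances consume the full budget of resamples.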

## Results

### Simulation Studies

We conducted extensive simulation studies to investigate the performance of the proposed and existing methods. We simulated case-control data with an equal number of cases and controls from Equation 1 in which the first component of γ was set to –2. We considered mainly the following six combinations of MAFs: (1) *p*_{j} = 0.001*j* (*j* = 1, …, 10) with a total frequency of 5.5%; (2) *p*_{j} = 0.0005*j* (*j* = 1, …, 10) with a total frequency of 2.75%; (3) *p*_{j} = 0.00025*j* (*j* = 1, …, 20) with a total frequency of 5.25%; (4) *p*_{j} = 0.005 (*j* = 1, …, 10) with a total frequency of 5%; (5) *p*_{j} = 0.0025 (*j* = 1, …, 10) with a total frequency of 2.5%; and (6) *p*_{j} = 0.0025 (*j* = 1, …, 20) with a total frequency of 5%. The genotype values were simulated under Hardy-Weinberg equilibrium and linkage equilibrium. We did not use sophisticated population genetics models because we wished to control the number of variants and their frequencies, which allowed us to see clearly how the proposed and existing methods perform under various scenarios. We evaluated both asymptotic and resampling methods. When the simulation studies involved asymptotic methods only, we used 10 million replicates (i.e., simulated data sets) to evaluate type I error and 100,000 replicates to evaluate power at *α* = 10^{−2}, 10^{−3}, and 10^{−4}. When the simulation studies involved resampling methods, we used 1 million replicates to evaluate type I error and 10,000 replicates to evaluate power at *α* = 10^{−2} and 10^{−3}. The resampling p values were obtained from a three-stage procedure with a maximum of 1 million resamples. The null hypothesis corresponded to *H*_{0}:*β*_{j} = 0 (*j* = 1, …, *m*). We considered alternative hypotheses such as *H*_{1}:*β*_{j} = *x* (*j* = 1, …, *m*) and *H*_{1}:*β*_{j} = *x* / {*p*_{j}(1 − *p*_{j})}^{1/2} (*j* = 1, …, *m*), where *x* was chosen such that the power (of the most powerful method) was reasonably high at *α* = 10^{−2}. We report below results from six series of simulation studies, the first four without covariates and the last two with covariates. The tests were two-sided except for the third series.
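The data-generating scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: genotypes are drawn as independent Binomial(2, *p*_{j}) counts (Hardy-Weinberg and linkage equilibrium), disease status follows the logistic model with intercept –2, and cases and controls are collected by sampling a population in batches until both quotas are met. The function name and the `batch` parameter are hypothetical.

```python
import numpy as np


def simulate_case_control(maf, beta, n_cases, n_controls,
                          intercept=-2.0, seed=0, batch=10_000):
    """Simulate case-control genotype data without covariates.

    Returns (X, Y) with cases first: X is (n_cases + n_controls, m),
    Y is the matching 0/1 phenotype vector.
    """
    rng = np.random.default_rng(seed)
    maf = np.asarray(maf, dtype=float)
    beta = np.asarray(beta, dtype=float)
    cases, controls = [], []
    n_ca = n_co = 0
    while n_ca < n_cases or n_co < n_controls:
        # Independent binomial counts: HWE within SNPs, equilibrium across SNPs.
        X = rng.binomial(2, maf, size=(batch, len(maf)))
        prob = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
        affected = rng.random(batch) < prob
        cases.append(X[affected])
        controls.append(X[~affected])
        n_ca += int(affected.sum())
        n_co += int((~affected).sum())
    X_cases = np.vstack(cases)[:n_cases]
    X_controls = np.vstack(controls)[:n_controls]
    X_all = np.vstack([X_cases, X_controls])
    Y_all = np.r_[np.ones(n_cases), np.zeros(n_controls)]
    return X_all, Y_all
```

For the first MAF combination, `maf = 0.001 * np.arange(1, 11)` sums to 0.055, matching the stated total frequency of 5.5%; passing `beta = np.zeros(10)` generates data under the null hypothesis.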

We designed our first series of simulation studies to evaluate the proposed asymptotic methods with different weight functions. We considered the aforementioned six combinations of MAFs and generated data under the null hypothesis *H*_{0}:*β*_{j} = 0 (*j* = 1, …, *m*), as well as two alternative hypotheses *H*_{1}:*β*_{j} = *x* (*j* = 1, …, *m*) and *H*_{1}:*β*_{j} = *x* / {*p*_{j}(1 − *p*_{j})}^{1/2} (*j* = 1, …, *m*). We considered three (positive) weight functions: *C*, *F*_{p}, and *F*_{u}. We also considered the maximum of the test statistics based on weight functions *C* and *F*_{p}, which will be referred to as *T*_{max}. The results for the first combination of MAFs are displayed in Table 1, whereas those of the remaining five combinations are provided in Tables S1–S5, available online. The performance of the tests is affected more by the total allele frequency than the number of variants or individual MAFs. The *C* test, *F*_{p} test, and *T*_{max} are conservative but less so as *n*, α, or total allele frequency increases. As expected, the *C* test is more powerful than the *F*_{p} test under the first alternative hypothesis and less powerful under the second alternative hypothesis; *T*_{max} is nearly as powerful as the *C* test under the first alternative and nearly as powerful as the *F*_{p} test under the second alternative. The *F*_{u} test is unacceptably liberal; therefore, we will not consider this asymptotic test any further.

Our second series of studies was devoted to comparisons of asymptotic and permutation methods. In addition to the proposed methods, we evaluated the asymptotic and permutation versions of the MB test, as well as the permutation method of Price et al.^{9} with weight function *F*_{u}. We simulated data in the same manner as the first series of studies. We performed one-sided tests because the MB and Price et al. tests were designed as one-sided. The results for the first combination of MAFs are displayed in Table 2. Because of the discreteness of the test statistic, the permutation version of the *C* test is more conservative than its asymptotic counterpart and consequently less powerful. The permutation *F*_{p} and *F*_{u} tests do not appear to be conservative; the former appears to be slightly more powerful than the latter. The MB test was designed for the second alternative hypothesis, for which the proposed asymptotic test based on weight function *F*_{p} is more powerful than the asymptotic version of the MB test whereas the proposed permutation tests based on weight functions *F*_{p} and *F*_{u} are more powerful than the permutation version of the MB test. For weight function *F*_{u}, our permutation test is more powerful than that of Price et al.^{9}

In the third series of studies, we compared fixed-threshold and VT methods. We simulated 11 SNPs with MAFs *p*_{j} = 0.001*j*(*j* = 1, …, 10) and *p*_{11} = 0.03. We considered the null hypothesis *H*_{0}:*β*_{1} = *β*_{2} = … = *β*_{11} = 0, as well as two alternative hypotheses *H*_{1}:*β*_{1} = *β*_{2} = … = *β*_{10} = *x*, *β*_{11} = 0 and *H*_{1}:*β*_{1} = *β*_{2} = … = *β*_{11} = *x*. For fixed-threshold methods, we considered the thresholds of 0.01 and 0.05; the corresponding tests are referred to as the T1 and T5 tests. For VT methods, we excluded the thresholds for which the total numbers of rare mutations were fewer than 10. As shown in Table 3, all the tests appear to be conservative, especially when *n* and α are small. The permutation T1 and T5 tests are more conservative than their asymptotic counterparts. In theory, T1 and T5 are the most powerful under the first and second alternatives, respectively. Because the frequency estimates for rare variants are highly variable, T1 turns out to be the least powerful among all the tests under the first alternative. The VT tests have good power under both alternatives, and the asymptotic and permutation versions have similar power. The permutation version of our VT test is slightly more powerful than that of Price et al.^{9}

In the fourth set of studies, we compared the *C* test, *F*_{p} test, and EREC test, as well as the HP, C-alpha, and SKAT tests. Note that the last four tests were designed to detect variants with opposite effects. The EREC, HP, and C-alpha tests were based on permutation, whereas the SKAT was based on the Davies method.^{12} For the EREC test, $\hat{\beta}_j$ was the estimate of the log odds ratio *β*_{j} (after adding a pseudocount of 1 to each of the four cells in the 2×2 table). For the SKAT test, we used the default weighted linear kernel function. We set *p*_{j} = 0.001*j* (*j* = 1, …, 10) and considered the null hypothesis *H*_{0}:*β*_{j} = 0 (*j* = 1, …, 10) and six alternative hypotheses representing different numbers of causal variants and different patterns of positive and negative effects. As shown in Table 4, the SKAT is highly conservative, especially when *n* and α are small. The EREC test is slightly less powerful than the *C* test and *F*_{p} test when the SNP effects are all positive but is much more powerful than the latter when there are opposite effects. The EREC test is more powerful than the HP test. It is also more powerful than the C-alpha and SKAT, especially when the mean of the regression coefficients is not 0.

Table 4. … and Power of Asymptotic and Permutation Tests for Detecting Potentially Opposite Effects

The above four sets of studies contained no covariates. We also conducted extensive studies with covariates. We generated data in the same manner as before except that we added a normally distributed covariate whose mean is equal to the total number of rare mutations and whose variance is equal to 1 and we set its regression coefficient to 0.3. Some key results are presented in Tables 5 and 6. The T1, T5, *F*_{p}, and VT tests are less conservative than in the case of no covariates, and their asymptotic and bootstrap versions have similar power. The EREC test has similar power to the *C* and *F*_{p} tests when all SNP effects are positive and is much more powerful than the latter when there are opposite effects. The EREC test tends to be more powerful than the SKAT, especially when the mean of the regression coefficients is not 0.

### Real Data

We considered high-depth sequence data from the exons of 202 genes encoding known or potential drug targets^{14} for 1957 subjects randomly drawn from the CoLaus population-based collection.^{15} We analyzed total cholesterol (available in 1899 subjects) as a quantitative trait and included eight covariates in the analysis: gender, age, age^{2}, and the top five principal components for ancestry constructed from the GWAS SNP data. One subject without the gender and age information was removed. We employed the methods for quantitative traits described in Appendix A.

We restricted our analysis to polymorphic variants that are nonsense, missense, or splice site mutations. We removed variants with observed MAFs > 5% or missingness > 10%. We excluded any gene whose total number of rare mutations is less than five and ended up with a total of 172 genes. There were a total of 2304 variants in these 172 genes, and the number of variants per gene varied from 1 to 70, with a median of 11. We applied both the asymptotic and permutation versions of our T1, T5, *F*_{p}, and VT tests, as well as the permutation EREC test. We calculated the two-sided p values. With 172 genes, the Bonferroni threshold at the 0.05 significance level corresponds to a p value of 0.0003 or –log_{10}(p value) of 3.5.
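The Bonferroni arithmetic can be checked directly; the variable names below are illustrative.

```python
import math

n_genes = 172          # genes retained after filtering
alpha = 0.05           # family-wise significance level

threshold = alpha / n_genes                 # per-gene p value cutoff
neg_log10_threshold = -math.log10(threshold)

print(f"{threshold:.6f}")            # 0.000291, i.e. about 0.0003
print(f"{neg_log10_threshold:.2f}")  # 3.54, i.e. about 3.5
```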

The results based on the asymptotic and permutation methods are shown in Figures 1 and 2, respectively. One gene was identified as the most significant by all the tests: the asymptotic p values for T1, T5, *F*_{p}, and VT are 0.00011, 0.00011, 0.00021, and 0.00057, respectively; the corresponding permutation p values are 0.00013, 0.00013, 0.00025, and 0.0012, respectively; the p value of the EREC test is 0.00012. (The name of the gene is not disclosed here because the main study has not been published yet.) All the p values, except the VT's, pass the Bonferroni criterion. Similar evidence of association has been observed in other samples of the sequencing project.^{14} There were 13 variants in the top gene. Their observed MAFs ranged from 0.00026 to 0.0024, the total frequency being 1.13%. Because the observed MAFs are all less than 1% in this case, T1 and T5 are the same test. For the VT test, the maximum occurs at the highest MAF. It is interesting to point out that common SNPs in the top gene were previously identified to be associated with total cholesterol.^{16}

Figure 1. … log_{10} Scale for the Asymptotic T1, T5, *F*_{p}, and VT Tests in the Quantitative Trait Analysis of Total Cholesterol

Figure 2. … log_{10} Scale for the Permutation EREC, T5, *F*_{p}, and VT Tests in the Quantitative Trait Analysis of Total Cholesterol

We also performed a binary trait analysis by comparing high (i.e., >6.2 mmol/l) and desirable (i.e., <5.2 mmol/l) total cholesterol values. There were 451 subjects with high total cholesterol and 683 subjects with desirable total cholesterol. The results of the analysis are shown in Figures 3 and 4. All the tests identified the same top gene as was identified in the quantitative trait analysis: the asymptotic p values for T1, T5, *F*_{p}, and VT are 0.00022, 0.00022, 0.00057, and 0.00088, respectively; the corresponding bootstrap p values are 0.00019, 0.00019, 0.00039, and 0.00033, respectively. Again, T1 and T5 are the same test. The maximum of the VT test occurs at the highest MAF, at which threshold 18 out of the 451 subjects with high cholesterol values carry the rare mutations as opposed to 7 out of 683 subjects with desirable cholesterol values. The p value of the bootstrap EREC test is 0.000021, which is the most extreme among all the tests and is even more extreme than all the p values of the quantitative trait analysis. For eight out of the 10 variants in the top gene, there were more mutations in the high group than in the desirable group (17 versus two); for the remaining two variants, there were fewer mutations in the high group than in the desirable group (one versus five). Thus, allowing opposite effects yielded stronger evidence of association than assuming effects of the same direction.

Figure 3. … log_{10} Scale for the Asymptotic T1, T5, *F*_{p}, and VT Tests in the Binary Trait Analysis of Total Cholesterol

Figure 4. … log_{10} Scale for the Bootstrap EREC, T5, *F*_{p}, and VT Tests in the Binary Trait Analysis of Total Cholesterol

Finally, we compared the proposed methods to the existing ones. The results for the SKAT are shown in Figure S1 (top panel). For the top gene, the SKAT yielded the p values of 0.0014 and 0.00024 in the quantitative and binary trait analyses, respectively, which are roughly 10 times larger than the p values of our EREC test. Because the other existing methods do not allow covariates and some of them require binary traits, we also performed the binary trait analysis without the covariates for all the methods. The results are shown in the bottom panel of Figure S1 and in Figures S2–S4. Although the top gene remains the same, the results without covariate adjustment (for the top gene) are considerably less significant than those with covariate adjustment. For the top gene, the EREC test yielded a much more significant result (p value = 0.00013) than all the other tests.

## Discussion

We developed a very general framework for the association analysis of rare variants. This framework enabled us to evaluate existing methods and develop other methods. Our theoretical analysis and simulation studies yielded insights into the behavior of the existing methods. The normal approximation works very well for the proposed methods, and resampling is required only when the weight function depends on the phenotype values. The proposed methods are numerically stable and easy to implement. The asymptotic tests are extremely fast. A computer program implementing the proposed methods is posted at our website. For a typical exome-sequencing study, it takes only a few hours to run all the proposed asymptotic and resampling tests.

We have adopted score-type statistics, which are computationally faster and more stable than Wald and likelihood ratio (LR) statistics because the null model does not involve rare variants and needs to be fit only once. Our simulation studies revealed that Wald tests tend to be overly conservative (resulting in substantial loss of power) whereas likelihood ratio tests tend to be too liberal (resulting in excessive false-positive findings), especially for small *n* and low MAFs; see Tables S6–S8.

Our work improves upon the pioneering work of Madsen and Browning^{8} by using more powerful test statistics, accommodating covariates, and avoiding permutation. For case-control studies, Madsen and Browning^{8} estimated the allele frequencies in the unaffected subjects only so that a true signal from an excess of mutations in the affected subjects would not be deflated by using the total number of mutations in both affected and unaffected subjects. According to our theory, the allele frequencies in the unaffected subjects will be optimal if $\log(\mathrm{OR}_j) \propto \{p_j(1 - p_j)\}^{-1/2}$ $(j = 1, \ldots, m)$ and $p_j$ is the frequency of the *j*th variant in the unaffected subjects. Even if that is the truth, the frequency estimates are highly variable and can be very different from the true values. The frequency estimates in the pooled sample of affected and unaffected subjects are more stable, and the corresponding *F*_{p} test can be implemented through normal approximation (rather than resampling).
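The frequency-based weighting discussed above can be sketched as follows. This minimal example estimates allele frequencies from the pooled sample, forms weights proportional to $\{p_j(1 - p_j)\}^{-1/2}$, and computes a weighted mutation count per subject. The pseudo-count smoothing and all function names are assumptions made for illustration, not details taken from the article.

```python
from math import sqrt

def pooled_maf(genotypes):
    """Estimate each variant's allele frequency from the pooled sample.

    genotypes: list of per-subject lists of minor-allele counts (0, 1, or 2).
    The pseudo-count (one extra allele among two extra chromosomes) keeps the
    estimate away from 0; this smoothing is an assumption of the sketch.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    return [(sum(row[j] for row in genotypes) + 1) / (2 * n + 2) for j in range(m)]

def frequency_weights(mafs):
    """Weights proportional to {p(1 - p)}^(-1/2)."""
    return [1.0 / sqrt(p * (1.0 - p)) for p in mafs]

def burden_scores(genotypes, weights):
    """Weighted mutation count for each subject: sum over variants of w_j * X_ij."""
    return [sum(w * x for w, x in zip(weights, row)) for row in genotypes]

# Toy data: 4 subjects, 3 variant sites.
genotypes = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
]
p = pooled_maf(genotypes)
w = frequency_weights(p)
S = burden_scores(genotypes, w)
print(p)
print(S)
```

Rarer variants receive larger weights, so a subject carrying a mutation at a rarer site contributes more to the combined score than one carrying a mutation at a more common site.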

The optimal choice of the frequency threshold depends on the nature of association, which is generally unknown. In addition, the frequency estimates for rare variants are highly variable, especially for small samples with substantial missing data. Thus, VT methods might be preferable to fixed-threshold methods. Our VT approach improves upon that of Price et al.^{9} in three aspects: (1) it uses more powerful test statistics, (2) it can accommodate covariates, and (3) it can be implemented by normal approximation instead of permutation.
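To make the variable-threshold idea concrete, the sketch below scans every observed MAF threshold, collapses the variants at or below it into a mutation count, and records the largest absolute standardized score statistic. It assumes a quantitative trait with no covariates other than an intercept, and the function names are hypothetical. Obtaining a p value for the maximum additionally requires the joint null distribution of the statistics (by normal approximation or resampling), which is not implemented here.

```python
from math import sqrt

def score_z(y, s):
    """Standardized score statistic for a quantitative trait, intercept only."""
    n = len(y)
    y_bar = sum(y) / n
    s_bar = sum(s) / n
    u = sum((yi - y_bar) * si for yi, si in zip(y, s))
    sigma2 = sum((yi - y_bar) ** 2 for yi in y) / n
    v = sigma2 * sum((si - s_bar) ** 2 for si in s)
    return u / sqrt(v) if v > 0 else 0.0

def variable_threshold(y, genotypes, mafs):
    """Return (threshold, max |Z|) over all observed MAF thresholds."""
    best = (None, 0.0)
    for t in sorted(set(mafs)):
        keep = [j for j, p in enumerate(mafs) if p <= t]
        s = [sum(row[j] for j in keep) for row in genotypes]
        z = abs(score_z(y, s))
        if z > best[1]:
            best = (t, z)
    return best

# Toy data: 4 subjects, 2 variant sites with assumed MAFs 0.01 and 0.02.
y = [0.0, 0.0, 1.0, 1.0]
genotypes = [[0, 1], [0, 0], [1, 0], [1, 0]]
mafs = [0.01, 0.02]
print(variable_threshold(y, genotypes, mafs))
```

In this toy example the signal comes entirely from the rarer variant, so adding the second variant at the higher threshold dilutes the statistic and the maximum is attained at the lower threshold.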

The EREC test is capable of detecting rare mutations with opposite effects. Simulation studies (Tables 4 and 6) showed that the EREC test has similar power to the tests assuming the same direction of effects when that assumption holds and is much more powerful than the latter when that assumption fails. In addition, the EREC test outperforms the HP, C-alpha and SKAT tests. In the real data example, the EREC test produced the most convincing evidence of association for the top gene among all the tests. Thus, we recommend the EREC test for general use.

The SKAT is computationally faster than the EREC, HP, and C-alpha tests because it calculates p values analytically. Simulation studies revealed that the SKAT is overly conservative, especially when *n* and α are small. The resampling methods developed in this article can be used to obtain accurate p values for the SKAT, and indeed any other tests, with or without covariates.

Statistical analysis of rare variants is a very active research area. Several other methods have been published during the preparation of this article.^{17–19} We have not compared our methods to all existing methods for several reasons: (1) we wished to focus on the most commonly used current methods, (2) some of the newly published methods are based on different philosophies and thus would be difficult to compare directly, and (3) a comprehensive comparison of all existing methods is beyond the scope of this article.

It is possible to incorporate biological and computational information about the functional effects of rare variants, such as SIFT^{20} and PolyPhen^{21} scores, into the association analysis. Indeed, our theory allows incorporation of any prior knowledge into the weight function. Efficient use of functional or bioinformatics information requires further investigation. It would be worthwhile to explore Bayesian methods.

Grouping methods for rare variants are in the same vein as the SNP-set methods for GWAS^{22–24} in that multiple SNPs within a group are analyzed collectively to enhance statistical power. Because the data are extremely sparse for individual rare variants, the SNP-set methods for common variants might not be applicable to rare variants. On the other hand, the methods for rare variants can potentially be used to combine low-frequency SNPs in GWAS.

We have considered one group of variants at a time. It might be desirable to analyze several groups of variants simultaneously. Our approach can be readily extended to multiple groups of variants. Specifically, we divide variants into, say, *K* groups according to certain criteria (e.g., MAFs) and combine the information within each group. We can express the score statistic for each group of variants as a sum of *n* efficient score functions (see Appendix A) so that the asymptotic joint distribution of the *K* score statistics follows from the multivariate central limit theorem. We can then use the asymptotic joint distribution to form a multivariate test statistic. If we choose the maximum of the *K* test statistics, then the formulas for *K* weight functions presented in Material and Methods can be directly applied. If we choose the chi-square statistic with *K* degrees of freedom, then our method would be a generalization of the combined multivariate and collapsing (CMC) method of Li and Leal.^{7}

We used the Bonferroni correction in the analysis of the real data. This criterion is conservative if there is strong linkage disequilibrium (LD) among the genes. More accurate correction for multiple testing can be achieved by accounting for the correlations of the test statistics. There are two possible ways to do so: one is to use permutation and the other is to use Monte Carlo.^{25} The latter is based on efficient score functions, which are provided in Appendix A.
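The Monte Carlo option can be sketched as follows, under the assumption that it perturbs the estimated efficient scores with independent standard normal multipliers in order to simulate the joint null distribution of the test statistics. This is one common scheme and is not claimed to be the exact procedure of the cited reference; the function name and input layout are hypothetical.

```python
import random
from math import sqrt

def max_stat_pvalue(u, n_draws=2000, seed=0):
    """Monte Carlo p value for the maximum absolute standardized statistic.

    u: list of K lists, where u[k][i] is the estimated efficient score of
       subject i for test k (each row is assumed not to be identically zero).
    """
    rng = random.Random(seed)
    n = len(u[0])
    # Standard deviation estimate of each summed score under the null.
    scale = [sqrt(sum(x * x for x in row)) for row in u]
    observed = max(abs(sum(row)) / s for row, s in zip(u, scale))
    exceed = 0
    for _ in range(n_draws):
        # The same multipliers are used for every test, which preserves
        # the correlations among the K statistics.
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sim = max(abs(sum(x * gi for x, gi in zip(row, g))) / s
                  for row, s in zip(u, scale))
        if sim >= observed:
            exceed += 1
    return (exceed + 1) / (n_draws + 1)

# Toy input: two perfectly correlated tests with a strong signal.
strong = [[1.0] * 25, [1.0] * 25]
print(max_stat_pvalue(strong))
```

Because correlated statistics share the same multipliers, the resulting adjustment is typically less conservative than the Bonferroni correction when the tests are strongly dependent.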

This work and indeed all existing literature assume that the quantitative trait data are obtained from a random sample. In many sequencing studies, including several in the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project that we are involved with, only the subjects with the extreme values of a quantitative trait are selected for sequencing. The case-control testing is a valid option but might be inefficient if there is a quantitative association. In addition, it might be desirable to analyze quantitative traits that are not the one used to select the subjects for sequencing. We are currently developing valid and efficient methods for the association analysis of quantitative traits under such trait-dependent sampling.

## Acknowledgments

This research was supported by the National Institutes of Health grants R01 CA082659, R37 GM047845, and P01 CA142538. The authors thank GlaxoSmithKline, especially Matthew R. Nelson, Margaret G. Ehm, and Li Li, and the co-principal investigators of the CoLaus study, Gerard Waeber and Peter Vollenweider, for the use of the resequencing data. They are also grateful to Yun Li and Kuo-Ping Li for their assistance with the preparation of the data.

## Appendix A

We relate $Y_i$ to $X_i$ and $Z_i$ through a generalized linear model with the linear predictor $\beta^{\mathrm{T}} X_i + \gamma^{\mathrm{T}} Z_i$, where $\beta = \tau\xi$. Let $\eta$ consist of $\gamma$ and other nuisance parameters. Let $l(\tau, \eta; \xi)$ denote the log-likelihood function for $\tau$ and $\eta$ with a fixed value of $\xi$. The corresponding score function and observed Fisher information matrix are

$$U(\tau, \eta; \xi) = \begin{bmatrix} U_{\tau}(\tau, \eta; \xi) \\ U_{\eta}(\tau, \eta; \xi) \end{bmatrix}$$

and

$$I(\tau, \eta; \xi) = \begin{bmatrix} I_{\tau\tau}(\tau, \eta; \xi) & I_{\tau\eta}(\tau, \eta; \xi) \\ I_{\eta\tau}(\tau, \eta; \xi) & I_{\eta\eta}(\tau, \eta; \xi) \end{bmatrix},$$

where $U_{\tau}(\tau, \eta; \xi) = \partial l(\tau, \eta; \xi) / \partial\tau$, $U_{\eta}(\tau, \eta; \xi) = \partial l(\tau, \eta; \xi) / \partial\eta$, $I_{\tau\tau}(\tau, \eta; \xi) = -\partial^{2} l(\tau, \eta; \xi) / \partial\tau^{2}$, $I_{\tau\eta}(\tau, \eta; \xi) = -\partial^{2} l(\tau, \eta; \xi) / \partial\tau\,\partial\eta^{\mathrm{T}}$, $I_{\eta\tau}(\tau, \eta; \xi) = I_{\tau\eta}^{\mathrm{T}}(\tau, \eta; \xi)$, and $I_{\eta\eta}(\tau, \eta; \xi) = -\partial^{2} l(\tau, \eta; \xi) / \partial\eta\,\partial\eta^{\mathrm{T}}$. The score statistic for testing the null hypothesis $H_0: \tau = 0$ is $U_{\tau}(0, \hat{\eta}; \xi)$, where $\hat{\eta}$ is the solution to the equation $U_{\eta}(0, \eta; \xi) = 0$. Under $H_0$, the random variable $n^{-1/2} U_{\tau}(0, \hat{\eta}; \xi)$ is asymptotically zero-mean normal with a variance that can be consistently estimated by^{26}

$$n^{-1} V(0, \hat{\eta}; \xi) \equiv n^{-1} \left\{ I_{\tau\tau}(0, \hat{\eta}; \xi) - I_{\tau\eta}(0, \hat{\eta}; \xi)\, I_{\eta\eta}^{-1}(0, \hat{\eta}; \xi)\, I_{\eta\tau}(0, \hat{\eta}; \xi) \right\}.$$

Suppose that $\xi$ is estimated from the data by $\hat{\xi}$. Then we replace $\xi$ in $U_{\tau}(0, \hat{\eta}; \xi)$ by $\hat{\xi}$. It can be shown that $U_{\tau}(0, \eta; \xi) = \xi^{\mathrm{T}} U_{\beta}(0, \eta)$, where $U_{\beta}(\beta, \eta)$ is the score function of $\beta$ under Equation 1. Because $n^{-1/2} U_{\beta}(0, \hat{\eta})$ is asymptotically zero-mean normal, $\hat{\xi}^{\mathrm{T}}\, n^{-1/2} U_{\beta}(0, \hat{\eta})$ has the same asymptotic distribution as $\xi^{\ast\mathrm{T}} n^{-1/2} U_{\beta}(0, \hat{\eta})$, where $\xi^{\ast}$ is the limit of $\hat{\xi}$. As a result, $n^{-1/2} U_{\tau}(0, \hat{\eta}; \hat{\xi})$ has the same asymptotic distribution as $n^{-1/2} U_{\tau}(0, \hat{\eta}; \xi^{\ast})$. Thus, the test statistic

$$\frac{U_{\tau}(0, \hat{\eta}; \hat{\xi})}{V^{1/2}(0, \hat{\eta}; \hat{\xi})}$$

is asymptotically standard normal as long as $\hat{\xi}$ converges to a nonzero constant as $n \to \infty$.

Let $U_{\tau, i}(\tau, \eta; \xi)$ and $U_{\eta, i}(\tau, \eta; \xi)$ be the *i*th subject's contributions to $U_{\tau}(\tau, \eta; \xi)$ and $U_{\eta}(\tau, \eta; \xi)$, respectively, and let $\Sigma_{\tau\eta}$ and $\Sigma_{\eta\eta}$ be the limits of $n^{-1} I_{\tau\eta}(0, \eta; \xi)$ and $n^{-1} I_{\eta\eta}(0, \eta; \xi)$, respectively. It is easy to show that $n^{-1/2} U_{\tau}(0, \hat{\eta}; \xi)$ is asymptotically equivalent to $n^{-1/2} \sum_{i=1}^{n} u_i$, where

$$u_i = U_{\tau, i}(0, \eta; \xi) - \Sigma_{\tau\eta} \Sigma_{\eta\eta}^{-1} U_{\eta, i}(0, \eta; \xi).$$

We refer to $u_i$ as the *i*th subject's efficient score function.^{27}

To derive the joint distribution of the test statistics with $K$ weight functions, we use the fact that $n^{-1/2} U_k$ is asymptotically equivalent to $n^{-1/2} \sum_{i=1}^{n} u_{ki}$, where $u_{ki}$ is the *i*th subject's efficient score function associated with the *k*th weight function. Note that $(u_{1i}, \ldots, u_{Ki})$ $(i = 1, \ldots, n)$ are $n$ independent random vectors. By the multivariate central limit theorem and law of large numbers, the null distribution of $n^{-1/2}(U_1, \ldots, U_K)$ is asymptotically zero-mean normal, and the covariance between $n^{-1/2} U_k$ and $n^{-1/2} U_l$ is consistently estimated by $n^{-1} \sum_{i=1}^{n} U_{ki} U_{li}$, where the $U_{ki}$'s are obtained from the $u_{ki}$'s by replacing all unknown parameters by their sample estimators.

For quantitative traits, we replace Equation 2 with the linear regression model

$$Y_i = \tau S_i + \gamma^{\mathrm{T}} Z_i + \epsilon_i,$$

where $\epsilon_i$ is normal with mean 0 and variance $\sigma^{2}$. Then the score statistic and its variance are

and

where

and

For multiple weight functions,

and
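As a hedged sketch rather than a reproduction of the article's displays, applying the general score-test theory of this appendix to the linear model, with the factor $\sigma^{-2}$ cleared from the score, gives one standard form of the statistic and its variance. The scaling convention and the explicit expressions for $\hat{\gamma}$ and $\hat{\sigma}^{2}$ are assumptions here; the standardized statistic $U / V^{1/2}$ does not depend on that scaling.

$$U = \sum_{i=1}^{n} \left( Y_i - \hat{\gamma}^{\mathrm{T}} Z_i \right) S_i, \qquad V = \hat{\sigma}^{2} \left\{ \sum_{i=1}^{n} S_i^{2} - \left( \sum_{i=1}^{n} S_i Z_i^{\mathrm{T}} \right) \left( \sum_{i=1}^{n} Z_i Z_i^{\mathrm{T}} \right)^{-1} \left( \sum_{i=1}^{n} Z_i S_i \right) \right\},$$

$$\hat{\gamma} = \left( \sum_{i=1}^{n} Z_i Z_i^{\mathrm{T}} \right)^{-1} \sum_{i=1}^{n} Z_i Y_i, \qquad \hat{\sigma}^{2} = n^{-1} \sum_{i=1}^{n} \left( Y_i - \hat{\gamma}^{\mathrm{T}} Z_i \right)^{2}.$$

Here $Z_i$ is taken to include the unit component, so that $\hat{\gamma}$ is the ordinary least-squares estimator under the null model $\tau = 0$.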

To perform permutation tests without covariates, we simply permute the $Y_i$'s. In the presence of covariates, we adopt the following procedure: (1) calculate the residuals $R_i = Y_i - \hat{\gamma}^{\mathrm{T}} Z_i$ $(i = 1, \ldots, n)$, (2) permute the $R_i$'s to yield the $R_i^{\ast}$'s, (3) create new trait values $Y_i^{\ast} = \hat{\gamma}^{\mathrm{T}} Z_i + R_i^{\ast}$ $(i = 1, \ldots, n)$, (4) replace the $Y_i$'s by the $Y_i^{\ast}$'s, (5) recalculate the test statistic, and (6) repeat steps 2–5 a large number of times.
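The six-step procedure above can be sketched for a quantitative trait as follows. To stay self-contained, this example assumes an intercept plus a single covariate and uses the absolute standardized score statistic as the test statistic; the function names, the add-one p value convention, and the toy data are all assumptions made for illustration.

```python
import random
from math import sqrt

def fit_line(z, y):
    """Least-squares fitted values of y on an intercept and one covariate z."""
    n = len(y)
    z_bar, y_bar = sum(z) / n, sum(y) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    slope = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y)) / szz
    return [y_bar + slope * (zi - z_bar) for zi in z]

def score_stat(y, s, z):
    """|U| / sqrt(V) for the combined score s, adjusting for an intercept and z."""
    n = len(y)
    resid_y = [yi - fi for yi, fi in zip(y, fit_line(z, y))]
    resid_s = [si - fi for si, fi in zip(s, fit_line(z, s))]
    u = sum(r * si for r, si in zip(resid_y, s))
    sigma2 = sum(r * r for r in resid_y) / n
    v = sigma2 * sum(r * r for r in resid_s)
    return abs(u) / sqrt(v)

def permutation_pvalue(y, s, z, n_perm=999, seed=0):
    rng = random.Random(seed)
    observed = score_stat(y, s, z)
    fitted = fit_line(z, y)                                   # fitted covariate part
    resid = [yi - fi for yi, fi in zip(y, fitted)]            # step 1
    hits = 0
    for _ in range(n_perm):                                   # step 6
        perm = resid[:]
        rng.shuffle(perm)                                     # step 2
        y_star = [fi + ri for fi, ri in zip(fitted, perm)]    # steps 3 and 4
        if score_stat(y_star, s, z) >= observed:              # step 5
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: 20 subjects, 4 mutation carriers with a large trait effect.
z = [float(i % 5) for i in range(20)]
s = [1.0 if i in (3, 7, 11, 15) else 0.0 for i in range(20)]
y = [0.5 * z[i] + 3.0 * s[i] + ((7 * i) % 11 - 5) / 10 for i in range(20)]
print(permutation_pvalue(y, s, z))
```

Permuting residuals rather than raw trait values keeps the covariate effects attached to the correct subjects, so the permuted data sets mimic the null hypothesis of no genetic effect without discarding the covariate adjustment.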

We have implicitly assumed that the trait is univariate and the subjects are unrelated. For repeated measures or family studies, we use generalized linear mixed models^{28} to capture the dependence of trait values. Suppose that the study contains $n$ families with $n_i$ members in the *i*th family. For $i = 1, \ldots, n$ and $l = 1, \ldots, n_i$, let $Y_{il}$, $S_{il}$, and $Z_{il}$ denote the values of $Y$, $S$, and $Z$ for the *l*th member of the *i*th family. The random effects $b_i$ $(i = 1, \ldots, n)$ are independent zero-mean random vectors with density function $f(b; \theta)$ indexed by a set of parameters $\theta$. Conditional on $b_i$, the trait values $Y_{i1}, \ldots, Y_{i, n_i}$ are independent and follow a generalized linear model with density $f(y \mid S_{il}, Z_{il}; b_i)$. The log-likelihood function is

$$\sum_{i=1}^{n} \log \int \prod_{l=1}^{n_i} f(Y_{il} \mid S_{il}, Z_{il}; b)\, f(b; \theta)\, db,$$

where $\tau$ is the fixed effect of $S_{il}$, and $\eta$ includes the fixed effects of $Z_{il}$ and parameters $\theta$. For repeated measures, the log-likelihood takes the same form with $Y_{il}$ and $Z_{il}$ being the trait and covariate values at the *l*th measurement time for the *i*th subject and with $S_{il}$ replaced by $S_i$. We can then use the arguments of the first three paragraphs to derive the test statistics.

For potentially censored age-at-onset traits, we specify that the hazard function for the age at onset conditional on $S_i$ and $Z_i$ satisfies the proportional hazards model^{29}

$$\lambda(t \mid S_i, Z_i) = \lambda_0(t)\, e^{\tau S_i + \gamma^{\mathrm{T}} Z_i},$$

where $\lambda_0$ is an arbitrary baseline hazard function and $Z_i$ is redefined to exclude the unit component. Let $T_i$ denote the duration of follow-up for the *i*th subject, and let $\Delta_i$ indicate, by the values 1 versus 0, whether $T_i$ is the actual age at onset or the censoring time. Then the score statistic and its variance are

and $V = I_{\tau\tau} - I_{\tau\gamma} I_{\gamma\gamma}^{-1} I_{\gamma\tau}$, where $\mathcal{R}_i$ denotes the set of subjects whose durations of follow-up are no shorter than $T_i$, $\hat{\gamma}$ is the solution to the equation

and $a^{\otimes 2} = a a^{\mathrm{T}}$. For multiple weight functions, we obtain the efficient score functions by approximating the partial likelihood score function with a sum of $n$ independent terms.^{30}
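For orientation, the standard partial-likelihood quantities that are consistent with the definitions above can be written as follows. This is a sketch of the textbook forms evaluated at $\tau = 0$, not a reproduction of the article's own displays, and the stacked vector $W_j = (S_j, Z_j^{\mathrm{T}})^{\mathrm{T}}$ is a symbol introduced here for illustration.

$$U = \sum_{i=1}^{n} \Delta_i \left\{ S_i - \frac{\sum_{j \in \mathcal{R}_i} e^{\hat{\gamma}^{\mathrm{T}} Z_j} S_j}{\sum_{j \in \mathcal{R}_i} e^{\hat{\gamma}^{\mathrm{T}} Z_j}} \right\}, \qquad \sum_{i=1}^{n} \Delta_i \left\{ Z_i - \frac{\sum_{j \in \mathcal{R}_i} e^{\gamma^{\mathrm{T}} Z_j} Z_j}{\sum_{j \in \mathcal{R}_i} e^{\gamma^{\mathrm{T}} Z_j}} \right\} = 0,$$

$$\begin{bmatrix} I_{\tau\tau} & I_{\tau\gamma} \\ I_{\gamma\tau} & I_{\gamma\gamma} \end{bmatrix} = \sum_{i=1}^{n} \Delta_i \left[ \frac{\sum_{j \in \mathcal{R}_i} e^{\hat{\gamma}^{\mathrm{T}} Z_j} W_j^{\otimes 2}}{\sum_{j \in \mathcal{R}_i} e^{\hat{\gamma}^{\mathrm{T}} Z_j}} - \left\{ \frac{\sum_{j \in \mathcal{R}_i} e^{\hat{\gamma}^{\mathrm{T}} Z_j} W_j}{\sum_{j \in \mathcal{R}_i} e^{\hat{\gamma}^{\mathrm{T}} Z_j}} \right\}^{\otimes 2} \right].$$

The second expression in the first line is the estimating equation that $\hat{\gamma}$ would solve under the null model, and the partitioned matrix supplies the blocks that enter $V$.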

## Web Resources

The URL for data presented herein is as follows:

- SCORE-Seq: Score-Type Tests for Detecting Disease Associations With Rare Variants in Sequencing Studies, http://www.bios.unc.edu/~lin/software/SCORE-Seq/

## References

**American Society of Human Genetics**


- A General Framework for Detecting Disease Associations with Rare Variants in Sequencing Studies. American Journal of Human Genetics. 2011 Sep 9; 89(3):354.
