- Journal List
- BMC Proc
- v.5(Suppl 9); 2011
- PMC3287858

# An aggregating *U*-Test for a genetic association study of quantitative traits

^{1}1Department of Epidemiology, Michigan State University, B601 West Fee Hall, East Lansing, MI 48824, USA

^{}Corresponding author.

#### Supplement

#### Conference

## Abstract

We propose a novel aggregating *U*-test for gene-based association analysis. The method considers both rare and common variants. It adaptively searches for potential disease-susceptibility rare variants and collapses them into a single “supervariant.” A forward *U*-test is then used to assess the joint association of the supervariant and other common variants with quantitative traits. Using 200 simulated replicates from the Genetic Analysis Workshop 17 mini-exome data, we compare the performance of the proposed method with that of a commonly used approach, QuTie. We find that our method has an equivalent or greater power than QuTie to detect nine genes that influence the quantitative trait Q1. This new approach provides a powerful tool for detecting both common and rare variants associated with quantitative traits.

## Background

In the past decade, extensive genome-wide association studies (GWAS) have been conducted to understand the genetic etiology of complex diseases. However, the common variants identified so far explain only a small fraction of the variations of complex diseases [1]. It now seems clear that the genetic etiology of complex diseases is highly heterogeneous. Some genetic mutations, although individually rare, may impose a high risk for the development of diseases [2]. These rare variants have been the recipients of growing attention by investigators. With the fast development of biotechnology, it is now feasible to genotype rare sequence variations in the general population with unprecedented speed [3]. Meanwhile, statistical methods are greatly needed to detect the association between these genetic variants and common complex diseases.

The most commonly used approach for detecting the association between rare variants and disease outcome is to collapse multiple rare variants into a single “supervariant,” which is tested further as a common variant. Based on this idea, Li and Leal developed a combined multivariate and collapsing (CMC) method for rare variant analysis [4]. This method was further extended for quantitative traits and was implemented in the software package QuTie [5]. In the past two years, a number of collapsing methods that use various strategies have been proposed, and they provide alternatives to the CMC method [6]. Compared to the multivariate analysis of multiple rare variants, these collapsing methods could reduce the degrees of freedom by creating a single supervariant composed of multiple individual rare variants, thus improving the testing power. In addition, testing on a single supervariant could reduce the burden of multiple testing. However, the existing methods also have a few limitations that may affect their performance. Collapsing all the rare variants in the same gene or genomic region, although biologically meaningful, can also introduce nonfunctional variants into the supervariant, which may diminish the signal that the functional variants carry. Intuitively, this limitation can be addressed by collapsing only a subset of the disease-susceptibility rare variants. In what follows, we refer to the collapsing process using trait information as aggregation.

In this paper, we propose an aggregating *U*-test to examine the association between the quantitative traits and multiple genetic variants, including both rare and common variants. First, the method adaptively collapses the disease-susceptibility rare variants into a supervariant; it then searches the supervariant and the remaining common variants for the best multi-SNP (single-nucleotide polymorphism) combination, using a forward selection. We have applied our method to the Genetic Analysis Workshop 17 (GAW17) mini-exome data and compared its performance with QuTie.

## Methods

We have recently developed a forward *U*-test to detect gene-gene interactions using general multisample *U* statistics [7]. Here, we first describe the definition of *U* statistics and then explain the aggregation of rare variants and the forward selection of multi-SNP combinations using *U* statistics.

Suppose that we have a study population of *N* subjects. Let *Y _{i}* denote the observed value of the quantitative trait for the

*i*th individual (

*i*= 1, 2, …,

*N*); and let

*X*=

_{i}*X*,

_{i1}*X*, … ,

_{i2}*X*, denote the genotypes of

_{iK}*K*SNPs for the

*i*th individual, each taking its value from one of the three possible genotypes,

*X*{

_{ij}*AA*,

*Aa*,

*aa*},

*j*= 1, 2, …,

*K*. Without loss of generality, we assume that

*a*is the minor allele, and the first

*r*SNPs (

*X*,

_{i1}*X*, …,

_{i2}*X*) are rare variants.

_{ir}*U* statistics

Suppose that we have *L* multi-SNP genotypes formed by *k* SNPs of interest, denoted as *G*_{1}, *G*_{2}, …, *G _{L}*. A multi-SNP genotype

*G*is defined here as a vector of the

_{l}*k*genotypes that an individual carries (e.g.,

*g*

_{1},

*g*

_{2}, …,

*g*). The

_{k}*k*SNPs and

*L*multi-SNP genotypes are selected sequentially out of a total of

*K*genotyped SNPs (see the “Forward

*U*-test” section for details). Let:

be the set of subjects carrying multi-SNP genotype *G _{l}*, and let

*m*= |

_{l}*S*| be the number of subjects in

_{l}*S*. We measure the trait difference between two sets of subjects

_{l}*S*and

_{l}*S*as:

_{l′}where the kernel function is chosen as:

*U _{l,l′}*, is the summation of all possible pairwise trait comparisons for any two subjects from

*S*and

_{l}*S*. In the presence of an association, we would expect individuals carrying different multi-SNP genotypes to have different trait values (e.g., those carrying a high-risk multi-SNP genotype would have higher trait values than those carrying a low-risk multi-SNP genotype). Based on

_{l′}*U*, we can form the global

_{l,l′}*U*statistic. We assume that the expected quantitative trait value of the

*L*multi-SNP genotypes decreases with

*l*(i.e., ). Practically, we sort the multi-SNP genotypes according to their average trait values (i.e., ). We define the global

*U*statistic for

*L*sets of subjects with different multi-SNP genotypes as:

where:

Here, the weight parameter *ω _{l,l′}* is chosen to account for the number of subjects in each genotype group. This global

*U*statistic measures the overall trait differences among individuals from a total number of

*L*multi-SNP genotype groups. It is equivalent to Eq. (2) when

*L*= 2.

### Aggregation of the rare variants

When dealing with a large number of rare variants, it is likely that a significant proportion of the rare variants will not be associated with a disease; thus collapsing on a selected subset of rare variants will be necessary. Each rare variant can form two single-SNP genotypes, {*g*_{1} = *Aa* | *aa*, *g*_{2} = *AA*}, for which a *U* statistic can be calculated by using Eq. (2). We rank the *U* statistics in decreasing order as *U*_{(1)}, *U*_{(2)}, …, *U*_{(}_{r}_{)}. Assume that *V*_{(1)}, *V*_{(2)}, …, *V*_{(}_{r}_{)} are the corresponding rare variants in a candidate gene and that *X _{i}*

_{(1)},

*X*

_{i}_{(2)}, …,

*X*

_{i}_{(}

_{r}_{)}are their observed genotypes for individual

*i*. We start from variant

*V*

_{(1)}, and define a supervariant as:

At each step of the aggregation process, we add a rare variant with the largest *U* statistic to the supervariant. Accordingly, we redefine the supervariant as:

The supervariant in Eq. (7) always forms two different genotypes, {*R _{ij}* = 1,

*R*= 0}, for which a

_{ij}*U*statistic can be calculated using Eq. (2). The collapsing procedure stops at step

*t*, where the

*U*statistics start to decrease (i.e., ).

### Forward *U*-test

A forward *U*-test is used to evaluate the supervariant and other common variants for their joint association with the trait [7]. We start the process by treating all individuals as a single group. In the first step, each common SNP *j* can form two single-SNP genotypes, , in three possible ways, denoted , , and . As a special case, the supervariant can form only two single-SNP genotypes, . This leads to a total of 3(*K* − *r*) + 1 possible grouping strategies, which can be represented by , where denotes the *l*th multi-SNP genotype at step *s*. A *U* statistic can be calculated for two sets of subjects grouped by . The SNP with the largest *U* statistic value is selected, and the corresponding grouping is recorded.

In the second step, based on the first selected SNP, a second SNP *j*′ is chosen to form four two-SNP genotypes, denoted by {, , , }. We calculate the global *U* statistics for each of these grouping strategies using Eq. (4) and choose the one with the largest *U* statistics. It should be noted that, if the same SNP from step 1 is chosen in step 2, then only three single-SNP genotypes will be formed, denoted by {, , }. As the algorithm moves forward, the *U* statistic is expected to increase until groups cannot be split further. This results in a series of models with different numbers of groups. The best model, with the appropriate number of groups, can be determined by using a 10-fold cross-validation procedure. The *U* statistic of the best model is calculated using the whole data set, and the significance of its association can be obtained using permutation. For each permutation replicate, the same procedure (including the aggregation process and model selection) is applied to calculate the *U* statistics. An empirical *p*-value, which accounts for inflated type I error resulting from model selection, can be calculated by using a large number of permutations.

## Results

We applied the proposed method to analyze the quantitative trait Q1 in the GAW17 mini-exome data. Thirty-nine SNPs, located in nine genes, were associated with trait Q1. The minor alleles of these SNPs were associated with the higher mean of Q1, and their frequencies ranged from 0.07% to 16.5%. We first adjusted the trait by age, using a linear regression model. The residual scores were then used for our association studies. Based on 200 replicates, we conducted a gene-based association study for each of the nine causal genes. For each gene, the traits were permutated 1,000 times, to generate the empirical null distribution of the *U* statistics. We then evaluated the power of our method by counting the number of replicates whose *U* statistics exceeded the 95th percentile of the null distribution. A similar analysis was also conducted using QuTie, version 0.2. The threshold for rare variants was chosen as a minor allele frequency (MAF) less than 0.01. The performance of the two methods varied according to the number of SNPs within the genes, the number of causal SNPs, and their effect sizes. We divided the nine genes into five groups accordingly (Table (Table11).

We found that both methods had a high power to detect the association for genes in group 1 and group 2. As a special case, gene *VEGFC* had only one rare variant, and therefore no selection was necessary. Both methods were able to detect this SNP because of its large effect size. The aggregating *U*-test showed a significant power improvement over QuTie for genes in group 3. For example, gene *ELAVL4* was composed of seven rare variants and three common variants, among which only two rare variants were causal. The individual effects of these variants, though relatively high (0.769 and 0.304), were mitigated by collapsing them with other SNPs, which led to low power by QuTie. In addition, both methods had high power to detect the association for genes in group 4, whereas QuTie attained higher power than the aggregating *U*-test. Because most of the rare variants in the gene were causal and their effects were relatively large, it would be ideal to collapse all the variants. In such a case, the aggregation would have a smaller advantage because the selection process would introduce additional variations. However, we believe that this is not a common scenario in a real data application. Finally, both methods had low power to detect the association for genes in group 5. For both genes, the selection of rare variants did not show any advantage because of the low effect of each functional rare variant.

## Discussion

Our method has two major advantages: (1) It can substantially improve the testing power when only a small proportion of rare variants under examination are functional; and (2) it collapses only a subset of selected rare variants that are potentially trait related. Therefore it can help to identify disease-susceptibility rare variants, which makes our results easier to interpret and replicate in follow-up studies. The existing methods, such as QuTie, collapse all rare variants within the same genomic region and analyze the rare and common variants without differentiating the functional and nonfunctional variants. As illustrated by our analysis, such methods are subject to low power when a significant proportion of the variants in the genomic region are not trait related. Nevertheless, they could have comparable or even higher power over our method when most (or all) of the variants within a gene are trait related. Moreover, because existing methods do not adopt a model selection algorithm to eliminate the noise loci, they are computationally faster than the proposed method. However, if we make the same assumption as existing methods (i.e., that all the variants are associated with disease), then we could also use the asymptotic result of the proposed method to test for association without model selection and permutation [7]. Under the null hypothesis, asymptotically the global *U* statistic has a mean of zero and follows a normal distribution.

We also note that the aggregation of rare variants is different from the forward selection of multi-SNP genotypes. During the aggregation process, the supervariant always forms two genotype groups: one with all rare alleles and the other without any rare allele. The corresponding *U* statistic first increases by adding the risk rare alleles and then decreases when nonrisk rare alleles are added. On the contrary, the number of genotype groups in the forward selection process keeps increasing as the algorithm moves forward, which results in an increasing global *U* statistic (i.e., the model with the largest number of genotype groups has the highest global *U* statistic). Therefore the cross-validation procedure is necessary for forward selection to avoid overfitting the data. The cross-validation procedure, however, is not practical for the aggregation of rare variants, because their low MAFs may cause the absence of rare alleles in the testing set.

## Conclusions

The proposed aggregating *U*-test provides a powerful tool for genetic association studies with both common and rare variants. Our method could also be useful for identifying disease-susceptibility variants underlying quantitative traits.

## Competing interests

The authors declare that there are no competing interests.

## Authors’ contributions

ML developed the method, carried out the analysis and drafted the manuscript. WF participated in the design and helped to draft the manuscript. QL conceived of the study, developed the method, participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.

## Acknowledgments

The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575. This work is supported by start-up funds from Michigan State University. We wish to thank the two editors and the two anonymous referees for their helpful comments, which improved the manuscript.

This article has been published as part of *BMC Proceedings* Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.

## References

- Schork N, Murray S, Frazer K, Topol KE. Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev. 2009;19:212–219. doi: 10.1016/j.gde.2009.04.010. [PMC free article] [PubMed] [Cross Ref]
- McClellan J, King M. Genetic heterogeneity in human disease. Cell. 2010;141:210–217. doi: 10.1016/j.cell.2010.03.032. [PubMed] [Cross Ref]
- Schuster S. Next-generation sequencing transforms today’s biology. Nat Meth. 2008;5:16–18. doi: 10.1038/nmeth1156. [PubMed] [Cross Ref]
- Li B, Leal S. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024. [PMC free article] [PubMed] [Cross Ref]
- Lawrence R, Day-Williams A, Elliott K, Morris A, Zeggini E. CCRaVAT and QuTie-enabling analysis of rare variants in large-scale case-control and quantitative trait association studies. BMC Bioinformatics. 2010;11:527–534. doi: 10.1186/1471-2105-11-527. [PMC free article] [PubMed] [Cross Ref]
- Dering C, Pugh E, Ziegler A. Statistical analysis of rare sequence variants: An overview of collapsing methods. Genet Epidemiol. 2011;X(suppl X):X–X. [PMC free article] [PubMed]
- Li M, Ye C, Fu W, Elston R, Lu Q. Detecting genetic interactions for quantitative trait with U statistics. Genet Epidemiol. 2011;35:1–12. [PMC free article] [PubMed]

**BioMed Central**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (250K) |
- Citation

- Gene-based multiple trait analysis for exome sequencing data.[BMC Proc. 2011]
*Zhao J, Thalamuthu A.**BMC Proc. 2011 Nov 29; 5 Suppl 9:S75. Epub 2011 Nov 29.* - Comparative study of statistical methods for detecting association with rare variants in exome-resequencing data.[BMC Proc. 2011]
*Saad M, Pierre AS, Bohossian N, Macé M, Martinez M.**BMC Proc. 2011 Nov 29; 5 Suppl 9:S33. Epub 2011 Nov 29.* - CCRaVAT and QuTie-enabling analysis of rare variants in large-scale case control and quantitative trait association studies.[BMC Bioinformatics. 2010]
*Lawrence R, Day-Williams AG, Elliott KS, Morris AP, Zeggini E.**BMC Bioinformatics. 2010 Oct 21; 11:527. Epub 2010 Oct 21.* - Propensity score analysis in the Genetic Analysis Workshop 17 simulated data set on independent individuals.[BMC Proc. 2011]
*Lin CM, Sathirapongsasuti FJ, Kerner B.**BMC Proc. 2011 Nov 29; 5 Suppl 9:S71. Epub 2011 Nov 29.* - Detecting genetic interactions for quantitative traits with U-statistics.[Genet Epidemiol. 2011]
*Li M, Ye C, Fu W, Elston RC, Lu Q.**Genet Epidemiol. 2011 Sep; 35(6):457-68. Epub 2011 May 26.*

- A method to incorporate prior information into score test for genetic association studies[BMC Bioinformatics. ]
*Zakharov S, Teoh GH, Salim A, Thalamuthu A.**BMC Bioinformatics. 1524* - Combined genotype and haplotype tests for region-based association studies[BMC Genomics. ]
*Zakharov S, Wong TY, Aung T, Vithana EN, Khor CC, Salim A, Thalamuthu A.**BMC Genomics. 14569* - Comparison of similarity-based tests and pooling strategies for rare variants[BMC Genomics. ]
*Zakharov S, Salim A, Thalamuthu A.**BMC Genomics. 1450*

- PubMedPubMedPubMed citations for these articles

- An aggregating U-Test for a genetic association study of quantitative traitsAn aggregating U-Test for a genetic association study of quantitative traitsBMC Proceedings. 2011; 5(Suppl 9)S23

Your browsing activity is empty.

Activity recording is turned off.

See more...