
# GATES: A Rapid and Powerful Gene-Based Association Test Using Extended Simes Procedure

^{1,2,3} Hong-Sheng Gui,^{1} Johnny S.H. Kwan,^{1} and Pak C. Sham^{1,2,3}

^{1}Department of Psychiatry and State Key Laboratory for Cognitive and Brain Sciences, the University of Hong Kong, Pokfulam, Hong Kong

^{2}The Centre for Reproduction, Development and Growth, the University of Hong Kong, Pokfulam, Hong Kong

^{3}Genome Research Centre, the University of Hong Kong, Pokfulam, Hong Kong

Corresponding author; email: pcsham/at/hkucc.hku.hk

## Abstract

The gene has been proposed as an attractive unit of analysis for association studies, but a simple yet valid, powerful, and sufficiently fast method of evaluating the statistical significance of all genes in large, genome-wide datasets has been lacking. Here we propose the use of an extended Simes test that integrates functional information and association evidence to combine the p values of the single nucleotide polymorphisms within a gene to obtain an overall p value for the association of the entire gene. Our computer simulations demonstrate that this test is more powerful than the SNP-based test, offers effective control of the type 1 error rate regardless of gene size and linkage-disequilibrium pattern among markers, and does not need permutation or simulation to evaluate empirical significance. Its statistical power in simulated data is at least comparable, and often superior, to that of several alternative gene-based tests. When applied to real genome-wide association study (GWAS) datasets on Crohn disease, the test detected more significant genes than SNP-based tests and alternative gene-based tests. The proposed test, implemented in an open-source package, has the potential to identify additional novel disease-susceptibility genes for complex diseases from large GWAS datasets.

## Introduction

Genome-wide association studies (GWASs) are being used for identification of susceptibility loci for complex diseases.^{1} These studies typically use the single nucleotide polymorphism (SNP) as the basic unit of analysis, which is a convenient strategy and has led to the discovery of many important genetic loci for human diseases.^{2} However, the statistically significant variants detected so far explain only a modest proportion of the total variance in liability to disease, and inadequate statistical power is likely to have contributed to the failure to detect true effects.^{3,4} The problem of statistical power is exacerbated by the necessity of adopting stringent p value thresholds for significance (typically 5 × 10^{−8}) in order to control false-positive association from the large number of SNPs tested. In addition, many significant SNPs are likely to represent surrogate markers in linkage disequilibrium (LD) with the variants causing diseases, and differences in LD patterns across populations can lead to nonreplication of the same SNP in another population but significant association for some other surrogate SNPs.^{5}

Shifting from SNP-based association analysis to gene-based analysis is one possible way to improve the power of GWASs. In a gene-based analysis, one jointly analyzes all variants within a putative gene to obtain a single p value representing the significance of association of the entire gene. Analysis using the gene as the basic unit has several attractive features. First, the gene is the functional unit of the human genome. Unlike genetic variants that have different allele frequencies, LD structure, and heterogeneity across diverse human populations, the gene itself is highly consistent across populations.^{6} Gene-based analysis might therefore lead to more consistent results and alleviate difficulties in replication. Second, gene-based analysis reduces the multiple-testing burden substantially; it requires correction for approximately 20,000–30,000 genes rather than potentially millions of SNPs. Finally, with the gene as the unit of analysis, extension of the findings to further functional analyses, such as protein-protein interactions (PPIs) and biological pathways, is more straightforward. The integration of association evidence and functional information might facilitate the unraveling of the pathogenic mechanisms of complex diseases.

A number of gene-based association tests have been proposed. Linear regression (for quantitative traits) and logistic regression (for binary traits) are straightforward methods of evaluating the overall association between a gene and a trait. In these tests, all the SNPs or haplotypes in the gene are entered as predictor variables simultaneously, except for redundant SNPs, whose inclusion would result in collinearity.^{6} However, a simple regression analysis might suffer from low statistical power if many SNPs or haplotypes are included, resulting in a test with many degrees of freedom. Many methods reduce the dimensionality of the test by compressing the information in the multiple correlated SNPs, for example by Fourier transformation,^{7} principal-components analysis,^{8,9} the use of fixed SNP weights based on the LD pattern across the gene,^{10} and cluster analysis.^{11} All these regression-based methods require the availability of the raw, individual phenotype and genotype data.

Methods involving the combination of the SNP-based test statistics or p values have also been proposed. The largest test statistic from all the SNP-based tests in a gene has been proposed as a gene-based test statistic.^{12} However, the value of this statistic is expected to be positively correlated with the number of SNPs in the gene, and although adjustment for gene size by a permutation procedure is possible, this is time consuming for large datasets.^{12} Another possible method is to combine the p values of the SNPs in a gene by Fisher's combination test.^{13} However, this method assumes that the constituent p values are based on independent tests, which is unlikely to be true for SNPs in the same gene. Violation of this assumption is likely to inflate the type 1 error rate unless a permutation procedure is used to provide empirical statistical significance. A variant of Fisher's method is the truncated-product p value method,^{14} which was originally developed to deal with “publication bias” in meta-analysis.^{15} However, like Fisher's combination method, this test is also sensitive to LD among the SNPs in a gene and therefore requires a permutation procedure if an empirical p value is to be obtained. Instead of permutation, which requires raw genotype data, a recent variation of Fisher's combination test uses a simulation approach based on normal variables with correlations that are assigned values according to the LD structure between SNPs.^{16} The p values of this method are highly correlated with those obtained from a permutation procedure. The simulation method, although faster than permutation, is still computationally intensive when applied to genome-wide datasets.

A separate issue for the design of gene-based tests is the possibility of improving the power of the test by imposing weights on the SNPs according to prior information on their likely relative importance. The idea of p value weights was introduced in the context of a sequential step-down test for maintaining the family-wise type 1 error rate^{17} and was subsequently incorporated into a false-discovery rate (FDR) procedure.^{18} A procedure for assigning prior p value weights based on a mixture model for p values has been suggested.^{19} Indeed, given the observed p values, it is possible to optimize the choice of p value weights to be applied to tests grouped by prior information.^{20} However, because the observed dataset might contain limited information, it might be desirable to also make use of established functional information and prior data in the assignment of p value weights.

In this paper, we propose a rapid gene-based association test that uses an extended Simes procedure (GATES) to assess the gene-level statistical significance of association. The test can efficiently handle results based on millions of SNPs (possibly from imputation and meta-analysis) in the later stages of GWASs and next-generation sequencing studies. It can rapidly combine the p values of SNPs within a gene, without relying on raw, individual phenotype and genotype data, to produce valid gene-based p values. It can also incorporate functional information on SNPs by the use of prior weights to increase statistical power. After introducing the test, we present a series of computer simulations that investigate the test's type 1 error rate, and we compare the test's statistical power with that of alternative gene-based tests. To assess its performance in real datasets, we applied the method to GWAS data on Crohn disease (CD [MIM 266600]).

## Material and Methods

### Construction of Gene-Based-Association p Value

We assume that a test of association between the disease and each of the available SNPs within a gene has been carried out and that the resulting p values and pair-wise correlation coefficients *r* for all the SNPs are available. The proposed method, GATES, a modification of the Simes test, combines these available p values to give a gene-based p value. Let $p_{(1)}, …, p_{(m)}$ be the ascending p values of *m* SNPs within a gene. We propose combining the *m* SNP-based p values to obtain an overall p value for the gene as follows:

$P_G=\mathrm{min}\left(\frac{m_e\,p_{(j)}}{m_{e(j)}}\right)$

where $m_e$ is the effective number of independent p values among the *m* SNPs and $m_{e(j)}$ is the effective number of independent p values among the top *j* SNPs. The null hypothesis of this gene-based test is that no SNP within the gene is associated with the disease, whereas the alternative is that at least one SNP in the gene is associated with the disease.

In the test proposed above, we used a measure that is more robust than those currently available^{21–24} (unpublished data) to obtain $m_e$. The value of $m_e$ is estimated as $M-\sum_{i=1}^{M}\left[I(\lambda_i>1)(\lambda_i-1)\right]$, $\lambda_i>0$, where *I*(*x*) is an indicator function and $\lambda_i$ is the *i*^{th} eigenvalue of the p value correlation coefficient matrix $\left[\rho_{i,j}\right]$ of the SNP-based statistical tests. Negative eigenvalues are set to zero and ignored; they should arise only in the presence of missing data, and they are usually few in number and close to zero.^{21} When the SNPs are independent, the eigenvalues are all 1, so that $m_e$ is equal to the number of SNPs. When all the SNPs are in complete LD, the first eigenvalue is equal to the number of SNPs and the rest are 0, so that $m_e$ = 1. For intermediate situations, we have performed simulation and permutation studies (see below) to show that the formula also provides an appropriate effective number of SNP p values and that $P_G$ will thus have an approximate uniform (0,1) distribution.^{18,25}
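The eigenvalue-based estimate can be sketched in a few lines of Python. This is a minimal illustration rather than the KGG implementation: it assumes NumPy is available, that the p value correlation matrix has already been obtained, and that the gene-based p value takes the Simes-like form min over *j* of $m_e p_{(j)}/m_{e(j)}$. The function names `effective_number` and `gates` are hypothetical.

```python
import numpy as np

def effective_number(corr):
    """Effective number of independent p values: M - sum[I(lambda>1)(lambda-1)]."""
    corr = np.atleast_2d(np.asarray(corr, dtype=float))
    eig = np.linalg.eigvalsh(corr)          # eigenvalues of a symmetric matrix
    eig = np.where(eig > 0, eig, 0.0)       # negative eigenvalues are set to zero
    return corr.shape[0] - np.sum(eig[eig > 1.0] - 1.0)

def gates(pvalues, corr):
    """Gene-based p value from SNP p values and their p value correlation matrix.

    Assumption: corr[i][j] is the correlation between the p values of SNPs i
    and j, given in the same order as `pvalues`.
    """
    p = np.asarray(pvalues, dtype=float)
    corr = np.asarray(corr, dtype=float)
    order = np.argsort(p)                   # ascending p values
    p_sorted = p[order]
    corr_sorted = corr[np.ix_(order, order)]
    m_e = effective_number(corr_sorted)
    candidates = []
    for j in range(1, len(p_sorted) + 1):
        m_e_j = effective_number(corr_sorted[:j, :j])   # top j SNPs only
        candidates.append(m_e * p_sorted[j - 1] / m_e_j)
    return float(min(candidates))
```

With an identity correlation matrix the sketch reduces to the original Simes test, and with a matrix of ones (complete LD) it returns the smallest SNP p value, matching the two limiting cases described above.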

For a simple case-control study, the pair-wise SNP p value correlation coefficient ρ is expected to be mainly determined by the pair-wise LD between the two corresponding SNPs, as measured by the allelic correlation coefficient *r*, although it could also be influenced by the allele frequencies of the two SNPs and the numbers of cases and controls in the study. We explored the relationship between ρ and *r*, for different allele frequencies and sample sizes, empirically by simulation. Genotype data of two biallelic SNPs were simulated for 1,500 cases and 1,500 controls, for a particular set of values of *r* and allele frequencies, under Hardy-Weinberg equilibrium. We then performed an allelic association test for each of the two SNPs to obtain two p values. Repeating this procedure 100,000 times resulted in 100,000 sets of p values, from which the correlation coefficient of the p values of the two SNPs, ρ, was calculated. We increased the allele frequencies and *r* in steps of 0.05 from their minimum to their maximum values to generate a series of data points. It turned out that the p value correlation coefficient ρ could be accurately approximated by a sixth-order polynomial function of the pair-wise allelic correlation coefficient *r* (coefficient of determination *R*^{2} = 0.9986), regardless of allele frequencies (see Figure 1). Repeated simulations using samples of different sizes and quantitative traits (analyzed by linear regression) also yielded the same polynomial approximation.
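The simulation just described can be approximated as follows. This is a sketch under stated assumptions, not the authors' program: it assumes NumPy, draws haplotype counts for cases and controls from the same distribution (the null model), uses a 1-df allelic chi-square test on a 2×2 allele-count table, and the function names `allelic_pvalues` and `pvalue_correlation` are hypothetical. The polynomial coefficients themselves are not given in the text and are not reproduced here.

```python
import math
import numpy as np

def allelic_pvalues(case_counts, control_counts, n_case_alleles, n_control_alleles):
    """1-df allelic chi-square p values from allele counts (one per replicate)."""
    a = case_counts.astype(float)            # allele count in cases
    b = n_case_alleles - a                   # other allele in cases
    c = control_counts.astype(float)         # allele count in controls
    d = n_control_alleles - c                # other allele in controls
    n = n_case_alleles + n_control_alleles
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a 1-df chi-square is erfc(sqrt(x / 2)).
    return np.vectorize(math.erfc)(np.sqrt(chi2 / 2.0))

def pvalue_correlation(r, freq1, freq2, n_cases=1500, n_controls=1500,
                       n_reps=100_000, seed=0):
    """Correlation of the two SNPs' p values under the null, for allelic correlation r."""
    d = r * math.sqrt(freq1 * (1 - freq1) * freq2 * (1 - freq2))
    probs = np.array([freq1 * freq2 + d,                  # haplotype A-B
                      freq1 * (1 - freq2) - d,            # haplotype A-b
                      (1 - freq1) * freq2 - d,            # haplotype a-B
                      (1 - freq1) * (1 - freq2) + d])     # haplotype a-b
    if probs.min() < -1e-12:
        raise ValueError("r is not attainable for these allele frequencies")
    probs = np.clip(probs, 0.0, None)
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    # Under Hardy-Weinberg equilibrium each person carries two independent haplotypes.
    cases = rng.multinomial(2 * n_cases, probs, size=n_reps)
    controls = rng.multinomial(2 * n_controls, probs, size=n_reps)
    p1 = allelic_pvalues(cases[:, 0] + cases[:, 1], controls[:, 0] + controls[:, 1],
                         2 * n_cases, 2 * n_controls)
    p2 = allelic_pvalues(cases[:, 0] + cases[:, 2], controls[:, 0] + controls[:, 2],
                         2 * n_cases, 2 * n_controls)
    return float(np.corrcoef(p1, p2)[0, 1])
```

Calling `pvalue_correlation` over a grid of *r* and allele-frequency values would give the data points to which a polynomial in *r* could then be fitted.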

The gene-based test can be further extended to incorporate differential SNP weights as follows:

$P_G=\mathrm{min}\left(\frac{m_e\,p_{(j)}}{\sum_{k=1}^{j}w_{(k)}}\right)$

where $w_{(1)}, …, w_{(m)}$ are non-negative and sum to $m_e$. These weights are calculated from prior weights $r_{(1)}, …, r_{(m)}$, which are set to non-negative values according to the relative functional importance of the SNPs but are otherwise unconstrained. The procedure takes the sorted SNPs in turn and assigns $w_{(i)}=c\,(m_{e(i)}-m_{e(i-1)})\,r_{(i)}$, where $m_{e(0)}=0$ and *c* is defined such that the weights sum to $m_e$:

$c=\frac{m_e}{\sum_{i=1}^{m}(m_{e(i)}-m_{e(i-1)})\,r_{(i)}}$

The use of weights is expected to increase statistical power if SNPs with higher weights are more likely to be associated with disease than SNPs with lower weights. In the absence of information, equal weights can be used.
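A minimal sketch of the weighting step follows. It assumes NumPy, that the weighted gene-based p value is the minimum over *j* of $m_e p_{(j)}$ divided by the cumulative weight of the top *j* SNPs, and that the weights are built from the prior weights as $w_{(i)} = c\,(m_{e(i)} - m_{e(i-1)})\,r_{(i)}$ with *c* chosen so the weights sum to $m_e$. The function names are hypothetical and the code is not taken from KGG.

```python
import numpy as np

def effective_number(corr):
    """M - sum[I(lambda>1)(lambda-1)], with negative eigenvalues set to zero."""
    corr = np.atleast_2d(np.asarray(corr, dtype=float))
    eig = np.linalg.eigvalsh(corr)
    eig = np.where(eig > 0, eig, 0.0)
    return corr.shape[0] - np.sum(eig[eig > 1.0] - 1.0)

def weighted_gates(pvalues, corr, prior_weights):
    """Weighted gene-based p value (sketch; inputs share the same SNP order)."""
    p = np.asarray(pvalues, dtype=float)
    corr = np.asarray(corr, dtype=float)
    r = np.asarray(prior_weights, dtype=float)
    if np.any(r < 0) or r.sum() <= 0:
        raise ValueError("prior weights must be non-negative and not all zero")
    order = np.argsort(p)
    p_sorted, r_sorted = p[order], r[order]
    corr_sorted = corr[np.ix_(order, order)]
    m = len(p_sorted)
    # m_e(j) for the top j SNPs, j = 1..m; the last entry is m_e for the gene.
    m_e_j = np.array([effective_number(corr_sorted[:j, :j]) for j in range(1, m + 1)])
    m_e = m_e_j[-1]
    increments = np.diff(np.concatenate(([0.0], m_e_j)))   # m_e(i) - m_e(i-1)
    raw = increments * r_sorted
    c = m_e / raw.sum()                                     # weights then sum to m_e
    cum_w = np.cumsum(c * raw)
    valid = cum_w > 0                                       # skip zero cumulative weight
    return float(np.min(m_e * p_sorted[valid] / cum_w[valid]))
```

With equal prior weights the cumulative weight of the top *j* SNPs equals $m_{e(j)}$, so the sketch returns the same value as the unweighted form.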

### Alternative Gene-Based Tests

We performed simulation studies to compare the type 1 error rate and statistical power of GATES with those of the following alternative gene-based tests:

- Logistic regression. Each SNP is entered as an explanatory variable, coded as 0, 1, or 2 for the number of copies of the minor allele in the genotype, and case-control status is coded as the response variable. A gene-based p value is provided by the likelihood ratio test comparing the full model with all available SNPs and the null model without any SNP.

- Fisher combination test. The gene-based test statistic is given by $T=-2\sum_{j=1}^{m}\mathrm{ln}\,p_{(j)}$, which has a chi-square distribution with 2*m* degrees of freedom under the null hypothesis when the *m* tests are independent.^{26} The test is expected to be liberal for positively correlated tests, such that a permutation procedure is needed if a valid p value is to be obtained.^{13}

- Original Simes test. The gene-based p value is $P_S=\mathrm{min}\left(m\,p_{(j)}/j\right)$. For independent tests, $P_S$ is uniform (0,1) under the null hypothesis. For positively correlated tests, $P_S$ is expected to be conservative.

- A versatile gene-based test for genome-wide association studies (VEGAS) proposed recently by Liu et al. (2010).^{16} The test allows the SNP-based chi-square test statistics within a gene to be combined in a flexible manner to give a gene-based test statistic (e.g., it can take the sum of all the statistics, or the sum of the several top statistics, or simply the largest statistic). An empirical null distribution for this gene-based test statistic is obtained through a simulation of multivariate standard normal random vectors with correlations equal to those between the SNPs in the gene; the component variables are squared to give correlated chi-square random variables, and then appropriate variables are summed as dictated by how the gene-based test statistic was calculated. In our simulations, we calculated two versions of the test, one based on the sum of all the SNP-based chi-square statistics in the gene (VEGAS-Sum) and one based on just the largest statistic (VEGAS-Max).
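The two closed-form alternatives in the list above, the Fisher combination test and the original Simes test, can be written with the Python standard library alone. This sketch assumes independent tests and p values in (0, 1]; the function names are hypothetical. The chi-square tail probability uses the finite-sum expression that holds for an even number of degrees of freedom.

```python
import math

def fisher_combination(pvalues):
    """Fisher combination p value, valid only when the m tests are independent."""
    m = len(pvalues)
    t = -2.0 * sum(math.log(p) for p in pvalues)    # T = -2 * sum(ln p_j)
    half = t / 2.0
    # Chi-square survival function with 2m df:
    # exp(-T/2) * sum_{k=0}^{m-1} (T/2)^k / k!
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return math.exp(-half) * total

def simes(pvalues):
    """Original Simes p value: min over j of m * p_(j) / j."""
    m = len(pvalues)
    return min(m * p / j for j, p in enumerate(sorted(pvalues), start=1))
```

For a single SNP both functions return the SNP's own p value, which is a quick sanity check on either implementation.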

Note that only logistic regression requires the raw phenotype and genotype data, whereas the other tests require only the SNP-based p values. However, a permutation procedure, which is necessary to ensure the correct type 1 error rates for the Fisher and original Simes tests when the SNPs are correlated, also requires the raw data. The VEGAS method does not require raw data but instead requires only the correlation matrix of the SNPs.

### Simulation Studies of Type 1 Error Rate and Statistical Power

The simulation involved the generation of genotype data on 30 SNPs, which were all biallelic and under Hardy-Weinberg equilibrium. We considered three different scenarios in terms of LD structure: (1) the SNPs are situated in six strong LD blocks (see Table S1), (2) the SNPs are situated in six moderate LD blocks (see Table S2), or (3) the SNPs are in linkage equilibrium. Given the LD pattern and the allele frequencies of the 30 SNPs, we used a program based on the HapSim algorithm^{27} to generate genotype data. We then considered three different scenarios in terms of gene size: (1) a three-SNP gene containing the first three SNPs, (2) a ten-SNP gene containing the first ten SNPs, and (3) a 30-SNP gene containing all 30 SNPs. Finally, we considered three scenarios in terms of disease model: (1) a null model where no SNP has any effect on disease risk, (2) an additive model where one SNP in each LD block has a minor allele that increases the risk ratio additively by 0.14, and (3) a multiplicative model where one SNP in each LD block has a minor allele that increases the risk ratio multiplicatively by a factor of 1.14 (see Tables S1 and S2).^{28} Because three-SNP, ten-SNP, and 30-SNP genes contain one, two, and six LD blocks, respectively, the numbers of susceptibility SNPs they contain are correspondingly one, two, and six. The baseline risk corresponding to the absence of any risk-increasing alleles is calculated from the allele frequencies and risk ratios of the susceptibility SNPs and gives a population disease prevalence of 0.1. For each combination of scenarios, a population of 1,000,000 individuals was generated. A random sample of 1,500 cases and 1,500 controls was drawn, without replacement, from the population and subjected to the different methods of gene-based association. Type 1 error rates and statistical power estimates under the different scenarios were obtained from the proportion of simulated datasets, out of 1,000 simulated populations, that resulted in significant p values (set at 0.05).
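As an illustration of how a baseline risk can be derived from a target prevalence, the following sketch treats the multiplicative model only. It assumes that the susceptibility SNPs are mutually independent, that effects multiply across loci as well as across alleles, and that genotypes follow Hardy-Weinberg proportions, so that the prevalence equals the baseline risk times the product of $(1 - f_i + f_i\gamma)^2$ over loci. The allele frequencies used in the paper are in its supplemental tables and are not reproduced here; the function names and any frequencies passed in are hypothetical.

```python
from itertools import product
from math import comb

def baseline_risk(risk_allele_freqs, risk_ratio=1.14, prevalence=0.1):
    """Baseline risk giving the target prevalence under a multiplicative model."""
    scale = 1.0
    for f in risk_allele_freqs:
        # E[risk_ratio ** g] for g ~ Binomial(2, f) under Hardy-Weinberg equilibrium.
        scale *= (1.0 - f + f * risk_ratio) ** 2
    return prevalence / scale

def population_prevalence(base, risk_allele_freqs, risk_ratio=1.14):
    """Prevalence by brute-force enumeration of all multi-locus genotypes."""
    total = 0.0
    for genotype in product(range(3), repeat=len(risk_allele_freqs)):
        prob, risk = 1.0, base
        for g, f in zip(genotype, risk_allele_freqs):
            prob *= comb(2, g) * f ** g * (1.0 - f) ** (2 - g)
            risk *= risk_ratio ** g
        total += prob * risk
    return total
```

The enumeration function is included only as a cross-check that the closed-form baseline reproduces the intended prevalence.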

### Impact of Weighting on Type 1 Error Rates and Statistical Power

To evaluate the impact of weighting the SNPs in the construction of the gene-based test, we assigned some SNPs a high weight ($w_i$ > 1) and the others a low weight (0 < $w_i$ < 1) in simulated data generated as described above. We considered two scenarios of weight assignment: (1) the SNPs assigned the high weight are the true susceptibility SNPs, whereas the SNPs assigned the low weight have no direct causal effect, and (2) the assignment of weight is random. Although the first scenario is expected to increase statistical power, the latter scenario is expected to have no effect or to result in reduced statistical power. Although random assignment is not the worst possible scenario, it might be the worst that is likely to occur in real data analyses. We also varied the ratio of high to low weights from 1 to 16 to see the impact on type 1 error rates and statistical power.

### Genome-wide Type 1 Error Rates under Realistic LD Patterns

The above evaluation of type 1 error rates in simulation was based on arbitrary LD structure and might not represent realistic examples of the actual LD structure of genes in real populations. In order to assess the genome-wide type 1 error rates under realistic situations, we calculated the various gene-based test statistics for genotype data from a real GWAS, where the phenotypes were reassigned at random. The real GWAS data used were on a sample of 2,514 Chinese subjects typed by the Illumina Human610-Quad BeadChip from projects in Hong Kong with Institutional Review Board approval. After standard quality-control procedures, 473,931 SNPs were left for analysis; among these, 209,784 SNPs were in 23,672 genes. SNP-based association analysis was carried out with a genotypic association test in Plink.^{29} Two LD datasets from different sources were prepared: the pair-wise *r*-squares estimated through Plink^{29} from the genotype data of the actual case-control sample and the *r*-squares from the latest HapMap LD dataset (CHB panel) released on April 19, 2009. We used GATES to combine SNP-level p values to obtain gene-based p values. We assessed type 1 error rates for the gene-based tests by examining the proportion of genes for which the gene-based p value is lower than various threshold values (0.05, 0.01, 0.001, 0.0001). In addition, we used a quantile-quantile (Q-Q) plot to compare the overall distribution of the gene-based p values to a uniform (0,1) distribution.
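The proportion-below-threshold summary described above amounts to a short computation. The sketch below assumes a plain list of gene-based p values obtained under randomly reassigned phenotypes; the function name is hypothetical.

```python
def empirical_type1_rates(gene_pvalues, thresholds=(0.05, 0.01, 0.001, 0.0001)):
    """Proportion of genes whose p value is strictly below each threshold."""
    n = len(gene_pvalues)
    return {alpha: sum(p < alpha for p in gene_pvalues) / n for alpha in thresholds}
```

If the gene-based p values are uniform (0,1) under the null hypothesis, each proportion should be close to its threshold, up to sampling error.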

### Application to GWASs

To further evaluate the performance of GATES under realistic situations, we used it to reanalyze the data from a published meta-analysis of three CD GWASs with a total of 3,230 cases and 4,829 controls.^{30} We used the *r*-square values from the HapMap CEU sample to adjust for marker dependency. Prior to applying GATES, we subjected the SNP-based p values to genomic control correction^{31} to avoid inflated significance levels. SNPs were mapped onto genes according to the gene coordinate information from NCBI. SNPs within 5 kilobase pairs of each gene were also assigned to the gene. In the very rare case where a SNP was in the overlapping region of two genes, the SNP was assigned to both genes. We compared the results of the SNP-based tests, the original Simes test, and GATES in terms of the number of significant hits after Bonferroni correction.
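The SNP-to-gene assignment rule can be illustrated with a small sketch. The tuple layouts, identifiers, and function names below are hypothetical, and the nested loop is chosen for clarity rather than speed; a genome-wide implementation would more likely use sorted coordinates or an interval index.

```python
def map_snps_to_genes(snps, genes, flank=5000):
    """Assign each SNP to every gene whose region, extended by `flank` bp, contains it.

    snps:  iterable of (snp_id, chromosome, position)
    genes: iterable of (gene_symbol, chromosome, start, end)
    """
    genes = list(genes)
    mapping = {symbol: [] for symbol, _, _, _ in genes}
    for snp_id, chrom, pos in snps:
        for symbol, gene_chrom, start, end in genes:
            if chrom == gene_chrom and start - flank <= pos <= end + flank:
                mapping[symbol].append(snp_id)   # a SNP may fall in several genes
    return mapping

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a family-wise error rate alpha."""
    return alpha / n_tests
```

For example, `bonferroni_threshold(0.05, len(mapping))` would give the per-gene threshold for a family-wise error rate of 0.05 over the mapped genes.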

## Results

### Simulation Studies of Type 1 Error Rate and Statistical Power

The empirical type 1 error rates and statistical powers of GATES and the five alternative methods at a nominal type 1 error rate (α) of 0.05 are given in Table 1. When the markers within a gene are independent, the empirical type 1 error rates of all tests are approximately 0.05. For dependent markers, however, the Fisher combination test is a liberal test with an inflated type 1 error rate. In contrast, the original Simes test becomes conservative for a gene with multiple SNPs in strong LD. The type 1 error rates of the other five tests (including the one we propose) are all correct regardless of the marker dependency.

The statistical powers of the tests are affected by the number of disease-susceptibility loci (DSL) and the marker dependency. When the markers are independent and there are only 1 or 2 susceptibility loci (i.e., in the case of the three-SNP or ten-SNP gene), all the tests have approximately equal power to identify the susceptibility genes. When a gene has 30 SNPs and six susceptibility loci, the most powerful tests are those that combine the evidence from all the SNPs in an additive manner, i.e., logistic regression, Fisher's combination, and VEGAS-Sum (see Table 1). GATES has power comparable to that of the VEGAS-Sum test in the three-SNP and ten-SNP scenarios, but it is slightly less powerful for a gene with 30 SNPs and six susceptibility loci. It is more powerful than logistic regression when the markers are in strong LD, and it is similar or superior in power to the original Simes test or the VEGAS-Max test in all situations.

The powers of the Fisher combination test with permutation, the original Simes test with permutation, and GATES are shown in Table 2. In general, all three tests have very similar powers, with a few exceptions. One of these situations is when there are six susceptibility loci (among 30 SNPs), in which case the Fisher combination test is more powerful than the other two tests. Another is when there is only one susceptibility locus among a large number (i.e., 10) of independent SNPs, in which case the Fisher combination test is less powerful than the other two tests.

### Impact of Weighting on Type 1 Error Rates and Statistical Power

The use of weights does not lead to an inflated type 1 error rate for GATES (see Figure 2). However, weight setting can have substantial effects on statistical power. When the SNPs are independent or in moderate LD, the assignment of relatively high weights to the true susceptibility SNPs can substantially increase the power of the gene-based test (see Figure 2). The bigger the difference between the high and the low weights, the greater the power gain. However, the assignment of high weights to nonpredisposing SNPs can decrease power; bigger differences between high and low weights lead to greater power loss. Fortunately, the power loss is generally much less than the potential power gain that can result from favorable weight setting for genes. For example, when the high:low weight ratio is 3, the randomly assigned weights result in only 2% power loss for the scenario with one susceptibility locus among ten independent SNPs in the gene, whereas a favorable weight assignment would result in more than a 10% increase in power in the same situation. However, this pattern does not seem applicable to the gene with three SNPs in strong LD; in that case, the power loss due to random weighting might be larger than the power gain when the ratio of high to low weight is large. When all SNPs are in strong LD, the effective number of p values approaches 1, and the higher weight is also close to 1 so that the type 1 error can be controlled. Hence, the favorable weight has only a slight effect on the SNPs' p values and thus on the power of the statistical test. According to the empirical simulation, a high:low weight ratio less than 5 seems preferable because the power loss due to random weights is trivial, at least across the scenarios we have tested, whereas the power gain as a result of correct weights can be substantial.

### Genome-wide Type 1 Error Rates under Realistic LD Patterns

In the simulation study with real genotypes and permuted phenotypes from an actual GWAS dataset, GATES does not show inflation of type 1 error rates across all genes at the α levels of 0.05, 0.01, 0.001, and 0.0001, regardless of the number of SNPs in the gene (see Table 3). The use of LD derived from the current GWAS dataset or from HapMap CHB data leads to similar results (Table 3). The Pearson correlation coefficient between the two sets of gene-based p values was 0.997. An examination of the Q-Q plot of the p values of all genes, genes with three or fewer SNPs, and genes with more than three SNPs reveals no deviation from a uniform (0,1) distribution (Figure 3).

### Application to Genome-wide Association Dataset on CD

GATES was implemented in an open-source tool named *K*nowledge-Based Mining System for *G*enome-wide *G*enetic Studies (KGG), which was used for analysis of the SNP-based p values for CD. The program took less than 2 min to perform a whole-genome scan for the dataset on an ordinary desktop computer with an Intel Core 2 CPU at 2.66 GHz, 1.97 GB of RAM, and 32-bit Windows XP Professional Version 2002.

There was an overall inflation of SNP-based p values (genomic control λ = 1.1586) in the meta-analysis dataset on CD. Barrett et al. (2008)^{30} argued that, given the large sample size (3,230 CD cases and 4,829 controls), the overall inflation was modest and would not introduce spurious differences between cases and controls. Nevertheless, we adjusted the SNP-based p values by the genomic control inflation factor^{31} to reduce potential false positives. In the dataset, 311,638 (49.09%) SNPs were assigned to be within one or more of 23,974 genes. The numbers of significant p values for the SNP-based test, the original Simes test, and GATES at three levels of family-wise significance are shown in Table 4. GATES detected more significant genes than the original Simes test or SNP-based test. At the family-wise error rate of 0.05, GATES detected five more significant genes than the SNP-based p values alone. All significant genes according to SNP-based p values were also significant by the original Simes test. The extended Simes test reported two more significant genes, *MST1* [MIM 142408] and *BSN* [MIM 604020], than the original Simes test; the other genes were significant for both tests. *MST1* was convincingly replicated in independent samples by Barrett et al. (2008).^{30} Recent studies also support a contribution from *BSN*^{32,33} to CD. At the family-wise error rate of 0.1, GATES detected five more significant genes than the SNP-based p values. Among these five genes, Barrett et al. (2008)^{30} successfully replicated *ITLN1* [MIM 609873], and there is also support for *TNFSF15* [MIM 604052] as a candidate gene involved in CD.^{34–36} In a recent genome-wide meta-analysis of CD in a larger sample, the susceptibility of the four genes was reconfirmed.^{37} The significant genes (FWER ≤ 0.1) and their SNPs are detailed online in Table S3.
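One common way to apply a genomic control adjustment to p values is to convert each p value to a 1-df chi-square statistic, divide by the inflation factor λ, and convert back. The sketch below follows that convention using only the standard library. It is an assumption that the adjustment in this study took exactly this form, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
CHI2_1DF_MEDIAN = 0.4549364   # median of a chi-square distribution with 1 df

def p_to_chi2(p):
    """1-df chi-square statistic corresponding to a two-sided p value in (0, 1]."""
    z = _STD_NORMAL.inv_cdf(p / 2.0)
    return z * z

def chi2_to_p(x):
    """Upper-tail probability of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(x / 2.0))

def inflation_factor(pvalues):
    """Genomic control lambda: median observed chi-square over its null median."""
    return median(p_to_chi2(p) for p in pvalues) / CHI2_1DF_MEDIAN

def genomic_control(pvalues, lam):
    """Deflate the test statistics by lambda (assumption: no change when lambda <= 1)."""
    lam = max(lam, 1.0)
    return [chi2_to_p(p_to_chi2(p) / lam) for p in pvalues]
```

With λ = 1.1586, as reported for this dataset, a SNP p value of 0.05 would be adjusted upward under this convention, to roughly 0.07.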

#### Using the Gene-Based Test to Guide Replication Studies

After a genome-wide gene-based scan, the next practical issue is how to use the results to guide follow-up replication studies. A straightforward strategy is to prioritize genes on the basis of their p values and then select the SNPs with the smallest p values within each prioritized gene for replication. We conceptually validated this idea by using the released replication results in Table S2 of Barrett et al. (2008)^{30} for CD. There were 23 SNPs with a significant replication p value < 3.85E−4 (= 0.05/130) among the 130 SNPs in their Table S2. These SNPs could be mapped onto 19 known genes. In 13 of these 19 genes, the same SNP was the most significant SNP within the gene in both the original meta-analysis and the independent replication study (see Table S4 online), suggesting that choosing the most significant SNP within each selected gene is usually optimal. However, functional considerations are also potentially relevant for SNP selection. The most significant SNPs of two genes (*IL23R* [MIM 607562] and *RTEL1* [MIM 608833]) in the meta-analysis were later surpassed in the replication study by other SNPs in the same gene with greater functional significance. For *IL23R*, the most significant SNP in the replication study is rs11209026, a missense variant. For *RTEL1*, it is rs2297441, a variant in the 3′ UTR. Interestingly, rs2297441 is also mapped onto a miRNA binding site of *RTEL1* in Sanger's miRBase.

## Discussion

The proposed gene-based test, GATES, is a Simes test extension that is valid for correlated SNPs and capable of incorporating previously assigned functional weights of the SNPs in the gene. The test does not require the raw genotype or phenotype data as inputs but requires only the SNP-based p values and SNP-SNP correlations, and it need not assume that all SNPs of a gene have the same direction of effect. It is also very fast because there is no need for permutation or simulation. GATES can handle millions of SNPs in less than 10 min, which makes it convenient for post-GWAS analyses, especially for the huge datasets that are being generated by genome-wide meta-analyses^{38} and imputation,^{39,40} as well as by next-generation sequencing technology,^{41} although it will lack power for rare variants. We have shown GATES to have correct type 1 error rates in both simulated and permuted datasets, regardless of the number of typed SNPs in the gene or LD structure. We have also shown that it is similar in statistical power to alternative gene-based tests that require permutation or simulation.^{12,16,42} Furthermore, we have shown that the power of the test can be improved by the appropriate assignment of differential prior weights to the SNPs within a gene.

In the present study, we made a systematic comparison between several simple and efficient methods of combining p values to guide gene-level association studies. These tests can be generally categorized into two groups: those that combine all SNPs simultaneously and those that focus mainly on the best SNPs. The first group includes the logistic regression method, the Fisher combination test (adjusted by permutation), and the VEGAS-Sum test proposed by Liu et al.^{16}; tests belonging to the second group are the Simes test, the VEGAS-Max test proposed by Liu et al.,^{16} and GATES in the present study. The first group of tests are generally more powerful for detecting a gene with multiple independent DSL, whereas the second group of tests can work better when a gene has only one or a few independent DSL. In addition, the performance of the first group of tests is more sensitive to the number of neutral SNPs within a gene. That is, they can be much less powerful than the second group of tests for detecting a large gene with many typed SNPs but only a few truly associated ones. Interestingly, the presence of LD invalidates only the Fisher combination test and tends to increase the statistical power of the other tests, except for logistic regression, which has the same power regardless of LD. As a result, logistic regression is more powerful than other tests when the SNPs in the gene are uncorrelated but less powerful when the SNPs are in LD. Among the second group of tests, GATES has comparable power but is much faster than the best-SNP test proposed by Liu et al.^{16} and can be more powerful than the original Simes test when the SNPs within a gene are in strong LD.

GATES could be less powerful than the permutation-based Fisher combination test and the simulation-based summation statistic test proposed by Liu et al.^{16} when it comes to detecting a gene that is of small or moderate size but that includes quite a few (say, five or more) independent DSL. However, to the best of our knowledge, this would be a rare scenario in real datasets. Instead, it is probably more usual for a gene to contain only one or two independent DSL, in which case the power of GATES to detect a susceptibility gene is similar to that of the permutation-based Fisher combination test and the simulation-based summation statistic test proposed by Liu et al.^{16} Moreover, the methods based on summation of SNP-based statistics also have their own weakness, in that they are less powerful for detecting a large gene with many typed SNPs that do not have a true effect. Therefore, when we are uncertain about the true pattern of association in a gene, it might be reasonable to adopt GATES because computation is fast and convenient.

The construction of prior weights is still an open question. There is no guarantee that true susceptibility SNPs will always be assigned high or favorable weights, because we do not yet understand the relationship between sequence and function well enough to accurately predict the functional consequences of a sequence change. One potentially useful resource for weight construction is the Catalog of Published GWAS.^{43} In comparison to SNPs randomly selected from genotyping arrays, trait/disease-associated SNPs (TASs) were significantly overrepresented only in nonsynonymous sites (odds ratio [OR] = 3.9 [2.2–7.0], p = 3.5E−7) and 5 kb promoter regions (OR = 2.3 [1.5–3.6], p = 3E−7); however, they were not overrepresented in introns, although 88% of TASs collected through December 31, 2008 in the Catalog of Published GWAS were intronic. Nicolae et al. found that TASs were more likely to be expression quantitative trait loci (eQTL), and the eQTL information can be used to enhance discovery of trait-associated SNPs for complex phenotypes.^{44} Hence, it might be possible to construct the prior weights for each SNP on the basis of the ORs associated with their genomic annotations. However, many of the GWAS hits are likely to represent indirect associations, and the sequence at the associated SNP itself might therefore be of no significance. Moreover, different classes of diseases (e.g., neurological diseases and immunological diseases) might have different distributions of the enrichment across various categories. If this is true, weights that are specific to a disease, or a disease class, might be more effective. Unfortunately, the number of available GWAS hits is still too limited to allow stable estimates even for a class of diseases, not to mention an individual disease. As the number of GWAS hits increases, this obstacle will diminish.
In any case, as we have shown in our simulation, the power gain resulting from a favorable weight setting in GATES is expected to be greater than the power loss resulting from an arbitrary weight setting, especially when the ratio of high to low weight is <5. Therefore, the use of prior weights to evaluate gene-based association may be worthwhile when it is feasible to generate reliable weights.
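One way to picture annotation-based weights is the sketch below, which turns the two odds ratios quoted above into per-SNP weights. The annotation labels, the default OR of 1.0 for every other category, and the rescaling of the weights to a mean of 1 are all assumptions made for illustration; the paper does not prescribe this particular scheme.

```python
# Odds ratios quoted in the text; all other categories default to 1.0,
# which is an assumption made for this sketch.
ANNOTATION_OR = {"nonsynonymous": 3.9, "promoter_5kb": 2.3}


def prior_weights(annotations, default_or=1.0):
    """Map each SNP annotation to an OR and rescale so the mean weight is 1."""
    raw = [ANNOTATION_OR.get(label, default_or) for label in annotations]
    mean = sum(raw) / len(raw)
    return [value / mean for value in raw]


# Hypothetical gene with four typed SNPs.
weights = prior_weights(["nonsynonymous", "intron", "intron", "promoter_5kb"])
print(weights)
print("high/low ratio:", max(weights) / min(weights))
```

With these odds ratios the ratio of the highest to the lowest weight is 3.9, which stays below the ratio of 5 mentioned above.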

The statistically valid gene-level p values attained with GATES can facilitate in-depth bioinformatics analysis because it is usually more appropriate to take the entire gene (rather than individual SNPs) as a basic analysis unit. The evaluation of association at the gene level nicely avoids the difficulties in processing the evidence from numerous dependent SNPs in biological pathways or networks. On the basis of these gene-level p values, many bioinformatics methods^{45} originally developed for gene-set enrichment analysis of microarray expression data could be readily adopted for the functional analysis of GWAS hits. A common basic assumption of the enrichment analysis is that genes responsible for the same diseases tend to be distributed within the same biological modules.^{46} Such an assumption implies that many disease susceptibility genes might not function alone but could be connected to each other in one or more biological modules. A module can be a protein complex,^{47} a pathway,^{48} or a subnetwork of PPIs.^{49} Within a module, unknown underlying disease-susceptibility genes could be predicted on the basis of some known ones. The coexistence of multiple significantly associated genes within the same biological modules could, in turn, strengthen the evidence of the involvement of the modules in the development of disease.^{50–52} More importantly, the biological modules could also aid our understanding of the pathogenic mechanisms of the disease and therefore suggest novel targets for drug development. The strategy of integrating multiple bioinformatics resources into genetic analysis is a promising and important trend for genetic studies in the near future.

An advantage of GATES is that it can use LD information from a known reference population (e.g., HapMap), and it therefore can be used even when individual genotype information on the study sample is not available, as long as the SNP-based p values are accessible. The method behaves well when the reference population closely matches the actual study population. For example, using the LD information from the HapMap Chinese reference sample on the SNP-based p values of a permuted Chinese dataset gave the correct type 1 error rate (Table 3), and the gene-based p values correlated highly (r = 0.997) with gene-based p values obtained from an analysis in which LD was estimated from the genotype data of the actual study sample. However, if the reference population does not match well with the study population, then the type 1 error rate will be affected. If the reference population has a generally higher level of LD than the actual study population, then ${m}_{e}$ will be underestimated, and the gene-based test will tend to be liberal. Conversely, if the reference population has a generally lower level of LD than the actual study population, then ${m}_{e}$ will be overestimated, and the gene-based test will tend to be conservative. One problematic scenario is when the SNP-based p values have been obtained from a meta-analysis of multiple populations with differing LD structures. In practice, apart from African populations and population isolates, most outbred populations such as Europeans and Asians have rather similar levels of LD, and when the type 1 errors of the gene-based tests for these populations are calculated from HapMap reference samples, they are unlikely to be grossly inflated or deflated.
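The direction of this bias can be seen in a rough worked example. It rests on two simplifying assumptions that are ours rather than results of the study: the $M$ SNPs of a gene share a common pairwise p value correlation $\rho > 0$, and ${m}_{e}$ is obtained from the eigenvalues $\lambda_i$ of the correlation matrix as ${m}_{e} = M - \sum_i I(\lambda_i > 1)(\lambda_i - 1)$. Such a matrix has one eigenvalue $1 + (M-1)\rho$ and $M-1$ eigenvalues equal to $1 - \rho$, so that

$$
{m}_{e} = M - (M-1)\rho .
$$

For a hypothetical gene with $M = 10$ SNPs, a study sample with $\rho = 0.2$ and a reference sample with $\rho = 0.6$ give

$$
{m}_{e}^{\text{study}} = 10 - 9 \times 0.2 = 8.2, \qquad {m}_{e}^{\text{ref}} = 10 - 9 \times 0.6 = 4.6 .
$$

If the gene-based p value is determined by the most significant SNP, with $p_{(1)} = 0.005$, it equals ${m}_{e}\,p_{(1)}$: $8.2 \times 0.005 = 0.041$ with the study LD but $4.6 \times 0.005 = 0.023$ with the reference LD. The reference-based value is smaller than it should be, which is the liberal behaviour described above.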

In principle, one can apply this method to combine the SNP p values of genes within a pathway to produce a pathway-based p value. However, the complex structure of pathways might make it more difficult to interpret the results. A single highly significant SNP p value within a pathway might lead to a significant pathway p value. If the gene containing this SNP is only involved in a single pathway, then this would suggest that this pathway is important. However, because a gene can belong to multiple pathways and a large pathway can contain multiple small pathways, it might be difficult to clearly identify which pathways are involved in disease etiology.

A gene-based test can obviously cover only SNPs within and near genes, and although genes are the most interesting regions of the genome, it is certain that some intergenic SNPs are still of functional significance, for example in altering the expression of genes at a distance. We suggest that a gene-based analysis should be complemented by SNP-based tests of SNPs outside of genes, so that the entire genome is exhaustively explored for all possible association signals. We have implemented this strategy in KGG, which is a standalone tool with a graphical interface. It can read SNP p values produced by any statistical test and LD information from various sources to perform a gene-based test. In addition, supported by multiple integrated bioinformatics databases, KGG can also use the generated gene-based p values to explore biological pathways and PPI networks.
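The complementary strategy suggested above, gene-based tests for genic SNPs and SNP-based tests for the rest, starts from a partition of the SNPs. The sketch below shows one possible way to make that partition; the tuple layouts, the identifiers, and the 5 kb flanking window are hypothetical choices for illustration and are not taken from KGG.

```python
def split_snps(snps, genes, flank=5000):
    """Assign SNPs to genes (with a flanking window) or to the intergenic set.

    snps:  iterable of (snp_id, chromosome, position, p_value)
    genes: iterable of (gene_name, chromosome, start, end)
    A SNP that falls in several overlapping genes is assigned to each of them.
    """
    by_gene = {name: [] for name, _, _, _ in genes}
    intergenic = []
    for snp_id, chrom, pos, p_value in snps:
        assigned = False
        for name, gene_chrom, start, end in genes:
            if chrom == gene_chrom and start - flank <= pos <= end + flank:
                by_gene[name].append((snp_id, p_value))
                assigned = True
        if not assigned:
            intergenic.append((snp_id, p_value))
    return by_gene, intergenic


# Hypothetical coordinates and p values.
genes = [("GENE_A", "1", 100_000, 120_000), ("GENE_B", "2", 50_000, 60_000)]
snps = [("snp1", "1", 97_000, 0.01),   # within the 5 kb flank of GENE_A
        ("snp2", "1", 300_000, 0.20),  # intergenic
        ("snp3", "2", 55_000, 0.03)]   # inside GENE_B
by_gene, intergenic = split_snps(snps, genes)
print(by_gene, intergenic)
```

The per-gene lists would then feed a gene-based test, while the intergenic list would be analysed with ordinary SNP-based tests.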

## Acknowledgments

We are grateful to Mark J. Daly for sharing data on CD.^{30} This work was funded by Hong Kong Research Grants Council GRF HKU 774707, the European Community's Seventh Framework Program under grant agreement No. HEALTH-F2-2009-241909 (Project EU-GEI), the Small Project Funding HKU 201007176166, and The University of Hong Kong Strategic Research Theme on Genomics. We also thank two anonymous reviewers for their useful comments, which improved this paper significantly.

## Web Resources

The URLs for data presented herein are as follows:

- The Catalog of Published Genome-Wide Association Studies, http://www.genome.gov/gwastudies
- Gene coordinates information from NCBI, ftp://ftp.ncbi.nlm.nih.gov/genomes/MapView/Homo_sapiens/sequence/BUILD.36.3/updates/seq_gene.md.gz
- HapMap, http://www.hapmap.org/
- KGG website, http://bioinfo.hku.hk/kggweb/
- Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/omim
- Sanger's miRBase, http://microrna.sanger.ac.uk/

## References

**American Society of Human Genetics**

GATES: A Rapid and Powerful Gene-Based Association Test Using Extended Simes Procedure. American Journal of Human Genetics. Mar 11, 2011; 88(3):283.
