• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of bmcprocBioMed Centralsearchsubmit a manuscriptregisterthis articleBMC Proceedings
BMC Proc. 2009; 3(Suppl 7): S38.
Published online Dec 15, 2009.
PMCID: PMC2795936

Genome-wide association studies of rheumatoid arthritis data via multiple hypothesis testing methods for correlated tests

Abstract

Genome-wide association studies often involve testing hundreds of thousands of single-nucleotide polymorphisms (SNPs). These tests may be highly correlated because of linkage disequilibrium among SNPs. Multiple testing correction ignoring the correlation among markers, as is done in the Bonferroni procedure, can cause loss of power. Several multiple testing adjustment methods accounting for correlations among tests have been developed and have shown improved power compared to the Bonferroni procedure. These methods include a Monte Carlo (MC) method and a method of computing p-values adjusted for correlated tests. The objective of this study is to apply these two multiple testing methods to genome-wide association study of the Genetic Analysis Workshop 16 rheumatoid arthritis data from the North American Rheumatoid Arthritis Consortium, to compare the performance of these two methods to the Bonferroni procedure in identifying susceptibility loci underlying rheumatoid arthritis, and to discuss the strengths and weaknesses of these methods. The results show that both the MC method and p-values adjusted for correlated tests method identified more significant SNPs, thus potentially have higher power than the corresponding Bonferroni methods using the same test statistics as in the MC method and p-values adjusted for correlated tests, respectively. Simulation studies demonstrate that the MC method may have slightly higher power than the p-values adjusted for correlated tests method.

Background

Genome-wide association studies (GWAS) for complex diseases involve multiple hypothesis testing. The Bonferroni procedure is commonly used to control family-wise error rate (FWER) for multiple hypothesis testing. However, the Bonferroni procedure becomes more conservative as the number of hypotheses tested increases and the test statistics are correlated [1,2]. To handle the correlation among test statistics, a permutation method [3] was proposed based on estimation of the joint distribution of test statistics. However, this approach is computationally intensive and not appropriate for GWAS. Therefore, several efficient methods that can account for the correlation among test statistics have been developed for multiple testing [1,2]. Lin [2] proposed a Monte Carlo (MC) sampling approach based on approximating the joint distribution of test statistics. This method does not require repeated analyses of simulated datasets as in the permutation method, and therefore is much less computationally demanding. Conneely and Boehnke [1] proposed a method of computing p-values adjusted for correlated tests (p_ACT) by numerical integration of the asymptotic multivariate normal distribution of the test statistics. This approach is very computationally efficient and attains even greater speed. In this study we applied three multiple testing procedures, the Bonferroni procedure, MC method, and p_ACT method, to GWAS of the Genetic Analysis Workshop 16 (GAW16) rheumatoid arthritis (RA) data from the North American Rheumatoid Arthritis Consortium (NARAC). We compared the performance of these three procedures by simulation studies.

Methods

We describe the three multiple hypothesis testing procedures in the context of association studies. Suppose there are n individuals with m markers in the observed case-control data. We test m null hypotheses H1, H2, ..., Hm for the m markers. The corresponding p-values are p1, p2, ..., pm. In the Bonferroni procedure, if pi α/m, then Hi is rejected (i = 1, ..., m), where α is the pre-set significance level. While in the Bonferroni procedure all tests are assumed to be independent, the MC method and the p_ACT methods described below account for dependence among test statistics by considering the joint distribution of test statistics. All the three methods can control the FWER well.

MC method

The test statistic for the jth marker (corresponding to hypothesis Hj) is defined as

where An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i2.gif, An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i3.gif, Yi is the phenotypic value of individual i and An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i4.gif; Xji is the genotypic score of individual i at locus j and An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i5.gif; An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i6.gif. If the hypothesis Hj is true, the statistic Tj has approximately a χ2 distribution with dj degrees of freedom, where dj is the dimension of Uj. For GWAS of the RA data in this article, we only consider an additive genetic model with dj = 1, and Xji = 0, 1, or 2, indicating the number of minor alleles in the genotype of individual i at locus j.

The test statistics (T1, T2, ..., Tm) may be correlated due to linkage disequilibrium (LD) among markers. The multiple testing procedure using the actual joint distribution of (T1, T2, ..., Tm) can be computationally intensive. The MC method provides an approach to approximate the actual joint distribution by MC sampling. The MC method defines An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i7.gif, where An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i8.gif and G1, G2, ..., Gn are independent standard normal random variables that are independent of the data, and then the method uses the joint distribution of An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i9.gif values to approximate the joint distribution of Tj values based on obtaining realizations from distributions of An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i9.gif values by repeatedly generating the normal random samples G1, G2, ..., Gn. Let t(1) t(2) (...)t(m) be the ordered observed values of the test statistics (T1, T2, ..., Tm), and let H(1), H(2)..., H(m) be the hypotheses and An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i10.gif are An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i9.gif variables corresponding to (t(1), t(2), ..., t(m)), respectively. The MC method works as a step-down procedure as follows: starting with hypothesis H(1), the method rejects H(j), j = 1, 2, ..., m and removes the corresponding marker and variable An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i11.gif from consideration, if An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i12.gif, provided that H(1), ..., H(j-1) have been tested and rejected. This probability is calculated based on a large number (e.g., 10,000) of realizations of the An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i9.gif values. We have implemented this method by using the statistical package R [4].

p_ACT method

Suppose the test statistics T = (T1, T2, (...), Tm) for m markers follow multivariate normal distribution N(0, Σ) asymptotically when all null hypotheses are true (i.e., no markers are associated with the disease), where 0 is an m-dimensional vectors of zeros and Σ is a m × m correlation matrix. Let pmin p(2) (...)p(m) be the ordered p-values calculated from the observed data. The probability of observing at least one p-value as small as pmin is

where Z1, Z2, (...), Zm are random variables from the multivariate normal distribution N(0, Σ). Computation of pACT requires integration of the multiple normal density function. The current version of the p_ACT method can handle integration of up to 1,000 dimensions (i.e., 1,000 markers) at one time [5]. Although any test statistics T with asymptotic multivariate normal distribution can be used for Eq. (2), Conneely and Boehnke [1] described the test statistic

where An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i15.gif, An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i16.gif. These test statistics (T1, T1, ..., Tm) asymptotically follow multivariate normal distribution N(0, R), where R = {rjk}, j = 1, ..., m, k = 1, ..., m, and An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i17.gif; An external file that holds a picture, illustration, etc.
Object name is 1753-6561-3-S7-S38-i18.gif. The p_ACT method also works as a step-down procedure: 1) If pACT <α, then reject the null hypothesis associated with pmin, and remove pmin and the marker associated with pmin. 2) Let pmin = p(2) in Eq. (2), change m into m-1, and repeat Step 1. Continuing in this fashion for the remaining p(j) until for some p(k), pACT = α, then accept all remaining hypotheses (including that associated with p(k)). This p_ACT method has been implemented in a computer program in R [6].

Partitioning of genome-wide single-nucleotide polymorphism (SNP) data

As stated earlier, the p_ACT method can only handle up to 1,000 tests at a time, and the MC method can handle more than 1,000 tests but may become computationally intensive when the number of tests is very large. To apply these methods to GWAS of the RA data, we divided the whole genome into small blocks; each block includes hundreds or up to one thousand of SNPs. We assume that tests within each block are dependent, and that tests from different blocks are independent. To control the FWER at α for the whole genome, we apply Bonferroni procedure among blocks, that is, we assign αb to each block such that the sum of these αb equals α (i.e., ∑ αb = α), where αb is proportional to the number of SNPs within the block (i.e., block size). We applied the MC method and p_ACT method separately to each block and control FWER at αb for the tests within the block. In this study, we considered the block sizes of 100, 500, and 1,000 separately to evaluate the effect of block size on the performance of the MC method and p_ACT method.

Application to RA data

The RA dataset contains 868 cases and 1,194 controls with 545,080 SNPs after removing duplicated and contaminated samples. If an individual has missing genotype at a marker, we imputed the most-frequent genotype observed in the data at that marker. We removed SNPs with minor allele frequencies (MAF) less than 0.01 and SNPs with p-values less than 1 × 10-4 in Hardy-Weinberg equilibrium (HWE) test in controls. We did not consider SNPs on sex chromosomes. After these procedures, 515,050 SNPs remained in our analysis.

We applied the three multiple testing procedures to the GWAS of the RA dataset. First, we performed the association analysis on the RA dataset by using the test statistics in Eqs. (1) and (3) separately and obtained p-values for each SNP. We call these p-values raw p-values. For each test statistic, we applied the Bonferroni procedure to the p-values and set the nominal level of FWER as 0.05. The raw p-values calculated from statistic in Eq. (3) were used in the p_ACT method (see below). The MC method is based on the statistic in Eq. (1), and we set the number of replicates of the normal random samples G1, G2, ..., Gn as 25,000. The p_ACT method is based on the test statistic in Eq. (3), and all tests are two-sided. In the p_ACT method we set the limit on the number of simulations or integrand values in R function "pmvnorm" as maxpts = 25,000.

Simulation studies

To evaluate the performance of the two statistics in Eqs. (1) and (3) and of three multiple testing methods, we simulated 10,000 replicated data sets in a manner similar to Lin [2]; each data set included N1 cases and N2 controls with 100 SNPs in a chromosomal region (or one block). We considered two sets of values of N1 and N2: 1) N1 = N2 = 100, and 2) N1 = 100, and N2 = 150. For each individual, the simulated chromosomal region consisted five independent consecutive subregions. Each subregion has 20 biallelic SNPs in LD with coefficient of r2 = 0.9 between two successive SNPs. At each SNP, HWE was assumed and MAF was 0.3. We chose one SNP in the first subregion as a disease SNP, and determined case/control status based on an additive disease model with disease prevalence of 0.1 and genotype relative risk of 2.

Results

Table Table11 shows the estimated FWER and power for the simulated datasets with nominal level of FWER = 0.05 (i.e., α = 0.05). The estimated FWER was calculated based on whether any SNP in any of the last four subregions was significant. The power was estimated based on whether any SNP in the first subregion is significant. In the situation N1 = N2 = 100, the two test statistics in Eqs. (1) and (3) with Bonferroni correction generated almost the same results (on FWER and power), and the MC method and p_ACT method also had nearly same results. In the situation N1 = 100 and N2 = 150, the test statistic in Eq. (1) had slightly higher power than that in Eq. (3), and consequently, the MC method had slightly higher power than the p_ACT method. In both situations, the MC method and p_ACT method had higher power than the Bonferroni methods using statistics in Eqs. (1) and (3), respectively.

Table 1
Estimated FWER and power from the simulated 10,000 replicated data sets

Table Table22 reports the number of significant SNPs associated with RA on 22 chromosomes detected by the three procedures, where the overall significant level α across the whole genome was 0.05. By using the Bonferroni procedure, we identified 634 and 589 significant SNPs for the 22 chromosomes based on the statistics defined in Eqs. (1) and (3), respectively. The test statistic in Eq. (1) identified more significant SNPs. Table Table22 also describes the number of significant SNPs on chromosomes 1, 6, 9, 16, and 22, which had more identified significant SNPs than other chromosomes. Based on the statistic in Eq. (1), the MC method identified 667, 679, and 682 significant SNPs for the 22 chromosomes when the block sizes are 100, 500, and 1,000, respectively. These numbers of identified significant SNPs are greater than the 634 identified by the Bonferroni procedure using the same statistic. Similarly, based on the statistic in Eq. (3), the p_ACT method identified 611, 621, and 635 significant SNPs, when the block sizes were 100, 500, and 1,000, respectively. These numbers of identified significant SNPs are also greater than the 589 identified by the Bonferroni procedure using the statistic in Eq. (3). As the block size increased, the numbers of significant SNP identified by both the MC method and the p_ACT method increased. The MC method using the statistic in Eq. (1) identified more significant SNPs than the p_ACT method using the statistic in Eq. (3).

Table 2
The numbers of identified significant SNPs from the RA data set

We compared computing times of the MC method and p_ACT method. As an example, we only showed the times for chromosome 9. With block sizes of 100, 500, and 1,000, the MC method used about 6.24 hr, 2.89 hr, and 2.48 hr, while the p_ACT method used 0.27 hr, 0.47 hr, and 1.08 hr, respectively. The p_ACT method is faster than the MC method.

Discussion

We have applied two multiple testing methods (MC and p_ACT), which account for correlation among tests by splitting each chromosome into smaller blocks, to the GWAS of the RA dataset and then compared the results of these methods to those of the Bonferroni procedure. Both the MC method and p_ACT method identified more significant SNPs than the Bonferroni procedure. The numbers of significant SNPs identified by the MC and p_ACT methods increased as the block size increased.

The test statistic in Eq. (3) is transformed from a traditional score statistic from a generalized linear model. The essential difference between the statistics in Eqs. (1) and (3) is that variance Vj is estimated by different ways. Our simulation studies show that when the numbers of cases and controls are equal, the two test statistics have almost the same power, and that when the numbers of cases and controls are different, the test statistic in Eq. (1) can have slightly higher power than that in Eq. (3), and consequently, the power of the MC method can be slightly higher than that of p_ACT method. Our simulation studies were only based on additive model, small sample sizes, and small number of SNPs. More extensive simulation studies are necessary in the future research.

In our analysis, we divided the whole genome into blocks with a fixed number of SNPs. We only accounted for LD within each block, and we assumed independence between tests from different blocks by ignoring LD between blocks. This assumption can cause loss of power. To avoid losing power, each entire chromosome may be treated as a block. However, the MC method will become computationally intensive or infeasible, and the p_ACT cannot handle more than 1,000 SNPs in each block. Another possible solution is to split each chromosome into blocks according to LD pattern, to group a set of consecutive SNPs in strong LD into one block and to ignore weak LD between blocks. As described earlier, the larger the block size we select, the higher the power we can obtain. This is an issue we will pursue in the future. Also we did not consider population stratification, which may cause spurious false-positive results.

List of abbreviations used

FWER: Family-wise error rate; GAW16: Genetic Analysis Workshop 16; GWAS: genome-wide association studies; HWE: Hardy-Weinberg equilibrium; LD: Linkage disequilibrium; MAF: Minor allele frequency; MC: Monte Carlo; NARAC: North American Rheumatoid Arthritis Consortium; p_ACT: p-Values adjusted for correlated tests; RA: Rheumatoid arthritis; SNP: Single-nucleotide polymorphism.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

GK did all simulation studies and data analysis, and drafted and revised the manuscript. DKC participated in the design of the study. NL and KZ participated in the design of the study and revised the manuscript. GG directed the study and partially drafted and revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This research was supported by grants GM073766, GM074913, GM081488, and GM077490 from the National Institute of General Medical Sciences and T32-HL072757 from the National Heart, Lung, and Blood Institute.

This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.

References


Articles from BMC Proceedings are provided here courtesy of BioMed Central
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

  • MedGen
    MedGen
    Related information in MedGen
  • Nucleotide
    Nucleotide
    Published Nucleotide sequences
  • PubMed
    PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...