- Hum Hered
- PMC3171280

# A New Approach to Account for the Correlations among Single Nucleotide Polymorphisms in Genome-Wide Association Studies

^{a}Biostatistics Epidemiology Research Design Core, Center for Clinical and Translational Sciences, University of Texas Health Science Center at Houston, Houston, Tex., USA

^{b}Department of Computer Science, Sam Houston State University, Huntsville, Tex., USA

## Abstract

In genetic association studies, such as genome-wide association studies (GWAS), the number of single nucleotide polymorphisms (SNPs) can be as large as hundreds of thousands. Due to linkage disequilibrium, many SNPs are highly correlated; assuming they are independent is not valid. The commonly used multiple comparison methods, such as Bonferroni correction, are not appropriate and are too conservative when applied to GWAS. To overcome these limitations, many approaches have been proposed to estimate the so-called effective number of independent tests to account for the correlations among SNPs. However, many current effective number estimation methods are based on eigenvalues of the correlation matrix. When the dimension of the matrix is large, the numeric results may be unreliable or even unobtainable. To circumvent this obstacle and provide better estimates, we propose a new effective number estimation approach which is not based on the eigenvalues. We compare the new method with others through simulated and real data. The comparison results show that the proposed method has very good performance.

**Key Words:** Effective number, Genome-wide association studies, Multiple comparisons, Single nucleotide polymorphisms

## Introduction

In a multiple-comparison setting, a certain statistical test is applied to each individual variable. Tests with p values less than a preset threshold are declared statistically significant. It is important but usually difficult to set the cutoff value in advance. With a large cutoff value, there will be many false-positive results due to chance alone; on the other hand, with an overly stringent cutoff, many true-positive results will not pass the threshold and will therefore be overlooked. The Šidák [1] and Bonferroni [2, 3] corrections are two commonly used methods to control the experiment-wise error rate.

In a multiple testing problem, if the individual tests are not independent, the Šidák and Bonferroni corrections are conservative in the sense that the actual experiment-wise error rate will be lower than the given nominal value. In recent genome-wide association studies, the number of variables (e.g. single nucleotide polymorphisms, SNPs), which are often densely genotyped, can be up to hundreds of thousands. Due to linkage disequilibrium (LD), many SNPs are highly correlated. Given this situation, neither the Šidák nor the Bonferroni correction should be used since they are only appropriate for independent tests. An alternative method based on permutation has been proposed [4]. This method shuffles the case and control labels in each permutation and then calculates the p values (or the corresponding statistics) for all variables. For each permutation, the smallest p value (or the statistic with the largest absolute value) is recorded. After a large number of permutations, say M, have been conducted, the *q*-th quantile of the M smallest p values (or the largest absolute statistics) is then the estimated point-wise cutoff p value (or statistic) to control the experiment-wise error rate at level *q* [4]. Usually, the cutoff p values from this approach control the experiment-wise error rates quite well, and it has served as the gold standard method. However, it is a computation-intensive approach that requires many permutations to get accurate estimates. With a large number of variables, it could take from several days to many years [5].
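The permutation scheme described above can be sketched in a few lines. This is a minimal illustration, not the implementation of [4]: the function name `permutation_cutoff`, the callable `pvalue_fn` (which stands in for whatever per-variable test is used) and the simple order-statistic quantile rule are all assumptions made for the example.

```python
import random


def permutation_cutoff(labels, pvalue_fn, n_perm=1000, level=0.05, seed=0):
    """Estimate a point-wise p-value cutoff with the min-p permutation scheme.

    labels    : list of 0/1 case-control labels, shuffled in each permutation
    pvalue_fn : hypothetical callable; given a label list it returns the
                p values of all variables (e.g. one association test per SNP)
    n_perm    : number of permutations (M in the text)
    level     : experiment-wise error rate to control (q in the text)
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    min_ps = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                    # break the label-genotype link
        min_ps.append(min(pvalue_fn(shuffled)))  # smallest p value of this round
    min_ps.sort()
    # Order-statistic estimate of the level-quantile of the recorded minima
    # (an assumed, simple quantile rule).
    idx = max(int(level * n_perm) - 1, 0)
    return min_ps[idx]
```

The cost is dominated by `n_perm` calls to `pvalue_fn`, each of which tests every variable, which is why the approach becomes impractical for hundreds of thousands of SNPs.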

Some methods that are less computation dependent have been proposed [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. We assume there exist N_{eff} independent tests which are equivalent to the N correlated tests in the sense that the cutoff value based on these independent tests will control the experiment-wise error rate at the given nominal level. N_{eff} is called the effective number of independent tests. Cheverud [6] was the first to propose the idea of an effective number of independent tests and developed a simple method to estimate this number based on the eigenvalues of the correlation matrix. However, studies have shown that Cheverud's method is too conservative [5, 8, 9, 10]. Several other eigenvalue-based methods have also been proposed to improve the performance [5, 7, 8, 9, 10]. One should notice that although the method proposed by Dudbridge and Gusnanto [12] also utilized eigenvalues, the correlation matrix used in their method was different from those used in the above methods. Unfortunately, eigenvalue-based methods have limitations associated with the eigenvalue calculation. When the dimension of the correlation matrix is large, the numerical results are either unreliable or difficult to obtain. In order to circumvent these difficulties and provide better estimates, we propose a new approach that does not require calculating the eigenvalues of the correlation matrix. Instead, we use the correlation coefficients themselves to estimate the effective number. To evaluate the performance of the new approach, we compare it with other methods on simulated and real data, using the permutation-based method as the gold standard. Our comparisons show that the proposed method performs better than existing eigenvalue-based methods.

## Methods

### Effective Number and Its Estimation

In a multiple comparison problem that controls the experiment-wise error rate, the Šidák point-wise cutoff p value is defined as

$$\alpha_p = 1 - (1 - \alpha_e)^{1/N}, \tag{1}$$

where α_{e} is the experiment-wise threshold and *N* is the total number of comparisons in the study [1]. The Bonferroni point-wise cutoff p value is an approximation for that of Šidák and is given by [2, 3]:

$$\alpha_p = \frac{\alpha_e}{N}. \tag{2}$$
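As a quick numerical check of the two corrections, the sketch below evaluates the Šidák cutoff 1 − (1 − α_e)^{1/N} and the Bonferroni cutoff α_e/N. The function names are chosen for this example only.

```python
def sidak_cutoff(alpha_e, n_tests):
    """Sidak point-wise cutoff: 1 - (1 - alpha_e)^(1/N)."""
    return 1.0 - (1.0 - alpha_e) ** (1.0 / n_tests)


def bonferroni_cutoff(alpha_e, n_tests):
    """Bonferroni point-wise cutoff: alpha_e / N."""
    return alpha_e / n_tests


if __name__ == "__main__":
    for n in (1, 100, 100_000):
        print(n, sidak_cutoff(0.05, n), bonferroni_cutoff(0.05, n))
```

For any *N* > 1 the Bonferroni cutoff is slightly smaller than the Šidák cutoff, and the two agree closely when α_e is small.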

Both Šidák and Bonferroni corrections are conservative and only appropriate for independent comparisons (tests). In a genome-wide association study (GWAS), it is very common that some SNPs are highly correlated due to LD. Observing that ‘higher correlation among the traits leads to higher eigenvalue variance’, Cheverud proposed the following formula to estimate the effective number [6]:

$$N_{\mathrm{Chev}} = 1 + (N - 1)\left(1 - \frac{V\lambda_{\mathrm{obs}}}{N}\right), \tag{3}$$

where

$$V\lambda_{\mathrm{obs}} = \frac{\sum_{i=1}^{N} (\lambda_i - 1)^2}{N - 1}$$

is the observed variance of the eigenvalues λ_{i} (*i* = 1, 2, …, *N*) of the correlation matrix of all SNPs. *N*_{Chev} has the following properties [6, 8]:

#### Property 1

When all tests are completely independent, the *N* eigenvalues are all equal to 1 and *V*λ_{obs} = 0, therefore *N*_{Chev} = *N*.

#### Property 2

When all the tests are perfectly correlated, one eigenvalue is equal to *N* and all the others are equal to 0, so *V*λ_{obs} = *N* and hence *N*_{Chev} = 1.

Replacing *N* by *N*_{Chev} in (1) or (2) gives Cheverud's point-wise cutoff p value. This method is usually very conservative in the sense that the actual experiment-wise error rates are much smaller than the preset value [5, 8, 9, 10, 15]. Nyholt [10] tried to improve Cheverud's method by excluding all SNPs in perfect LD except one before using formula (3). However, Nyholt's method is still overly conservative [5].
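Properties 1 and 2 can be checked numerically. The sketch below assumes that formula (3) has the usual form *N*_{Chev} = 1 + (*N* − 1)(1 − *V*λ_{obs}/*N*), with *V*λ_{obs} the sample variance of the eigenvalues around their mean of 1; the function name `n_cheverud` is illustrative, and the eigenvalues are taken as given rather than computed.

```python
def n_cheverud(eigenvalues):
    """Cheverud-type effective number from correlation-matrix eigenvalues.

    Assumes N_Chev = 1 + (N - 1) * (1 - V / N), where V is the sample
    variance of the eigenvalues (their mean is 1 for a correlation matrix).
    """
    n = len(eigenvalues)
    if n == 1:
        return 1.0
    var_obs = sum((lam - 1.0) ** 2 for lam in eigenvalues) / (n - 1)
    return 1.0 + (n - 1) * (1.0 - var_obs / n)


if __name__ == "__main__":
    print(n_cheverud([1.0] * 5))                  # independent tests
    print(n_cheverud([5.0, 0.0, 0.0, 0.0, 0.0]))  # perfectly correlated tests
```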

Li and Ji [8] pointed out that the effective number should also have the third property:

#### Property 3

If the *N* tests can be composed of *c* (1 ≤ *c* ≤ *N*) copies of *N*/*c* independent tests, then the effective number is *N*/*c*.

*N*_{Chev} does not possess this property since in this situation *N*/*c* of the *N* eigenvalues are equal to *c* and the remainder are equal to 0. The estimated effective number from (3) is then *N* + 1 − *c*, not *N*/*c* [8]. Observing this limitation of Cheverud's method, Li and Ji [8] proposed an improved version of *N*_{Chev} which satisfies property 3:

$$N_{\mathrm{LJ}} = \sum_{i=1}^{N} f(|\lambda_i|), \tag{4}$$

where *f*(*x*) = *I*(*x* ≥ 1) + (*x* − [*x*]) for *x* ≥ 0 and [*x*] is the floor function which gives the largest integer less than or equal to *x*.
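The function *f* is easy to evaluate directly. The sketch below sums *f*(|λ_{i}|) over a list of eigenvalues, which is the usual form of the Li and Ji estimator; the name `n_li_ji` and the example eigenvalues are chosen for illustration.

```python
import math


def n_li_ji(eigenvalues):
    """Sum of f(|lambda_i|) with f(x) = I(x >= 1) + (x - floor(x))."""
    total = 0.0
    for lam in eigenvalues:
        x = abs(lam)
        indicator = 1.0 if x >= 1.0 else 0.0
        total += indicator + (x - math.floor(x))
    return total


if __name__ == "__main__":
    # Property 3 with N = 6 and c = 2: three eigenvalues equal 2, three equal 0.
    print(n_li_ji([2.0, 2.0, 2.0, 0.0, 0.0, 0.0]))
```

Each eigenvalue equal to *c* contributes exactly 1 and each zero eigenvalue contributes 0, so the sum is *N*/*c*, as property 3 requires.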

Recently, Gao et al. [5, 7] proposed another eigenvalue-based method to estimate the effective number through principal component analysis:

$$N_{\mathrm{Gao}} = \min\left\{ m : \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \ge C \right\}, \tag{5}$$

where λ_{1} ≥ λ_{2} ≥ … ≥ λ_{N} are the ordered eigenvalues of the correlation matrix and *C* is a parameter, which is typically set as 99.5% [5, 7]. Obviously, *N*_{Gao} does not possess property 1. When all the tests are completely independent, all eigenvalues are equal to 1, so *N*_{Gao} is roughly *CN* and underestimates the effective number for any *C* < 1 when *N* is large.
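The behaviour under complete independence can be seen with a small sketch. It assumes the estimator is the smallest number of leading eigenvalues whose cumulative share of the total reaches *C*; whether the comparison is strict or not is an assumption here, and `n_gao` is an illustrative name.

```python
def n_gao(eigenvalues, c=0.995):
    """Smallest m such that the m largest eigenvalues explain at least c of the total."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for m, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= c:
            return m
    return len(lams)


if __name__ == "__main__":
    print(n_gao([1.0] * 300))                # independent tests: fewer than 300
    print(n_gao([5.0, 0.0, 0.0, 0.0, 0.0]))  # perfectly correlated tests
```

With 300 independent tests the sketch stops at 299 components, below the true value of 300, while the perfectly correlated case correctly yields 1.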

### New Method to Estimate Effective Number N_{Chen}

*Step 1.* For the *i*-th (*i* = 1, 2, …, *N*) SNP, estimate the absolute composite LD (CLD) coefficient *r*_{ij} between this SNP and any other SNP *j*, *j* ≠ *i*;

*Step 2.* Calculate

$$S_i = \sum_{j \ne i} r_{ij}^{k}, \tag{6}$$

where *k* is a positive constant number;

*Step 3.* Estimate the effective number:

$$N_{\mathrm{Chen}} = \sum_{i=1}^{N} \frac{1}{1 + S_i}. \tag{7}$$

The constant *k* is a statistical test-dependent parameter. In this paper, the Cochran-Armitage trend test is used to test associations between the phenotype and the genotype [16, 17]. It is easy to verify that our new method satisfies properties 1–3 mentioned above.
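The three steps can be sketched as follows. The code assumes that Step 2 sums the *k*-th powers of the absolute CLD coefficients of SNP *i* with all other SNPs, S_i = Σ_{j≠i} r_{ij}^k, and that Step 3 sums 1/(1 + S_i) over all SNPs; this reading is an assumption that is consistent with properties 1–3, and `n_chen` is an illustrative name rather than the authors' code.

```python
def n_chen(corr, k=7):
    """Correlation-based effective number (assumed form, see lead-in).

    corr : square list-of-lists of pairwise correlation (CLD) coefficients
    k    : positive, test-dependent constant (7 is used with the trend test)
    """
    n = len(corr)
    n_eff = 0.0
    for i in range(n):
        # Step 2: sum of |r_ij|^k over all other SNPs j != i.
        s_i = sum(abs(corr[i][j]) ** k for j in range(n) if j != i)
        # Step 3: SNP i contributes 1 / (1 + S_i) to the effective number.
        n_eff += 1.0 / (1.0 + s_i)
    return n_eff


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    ones = [[1.0] * 4 for _ in range(4)]
    # Two copies of three independent SNPs (N = 6, c = 2).
    copies = [[1.0 if i % 3 == j % 3 else 0.0 for j in range(6)] for i in range(6)]
    print(n_chen(identity), n_chen(ones), n_chen(copies))
```

No eigenvalue decomposition is needed: the cost is one pass over the pairwise correlations, and raising |*r*| to a power such as 7 means weak correlations contribute almost nothing to S_i.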

Once we obtain the effective number, we can estimate the point-wise cutoff p value by using Šidák or Bonferroni correction with the total number of tests *N* being replaced by the estimated effective number in (1) and (2), respectively. In this paper, we use *N*_{P}, *N*_{Chev}, *N*_{LJ}, *N*_{Gao} and *N*_{Chen} to denote the estimated effective number from the permutation-based method, the methods of Cheverud, Li and Ji, Gao et al. and our new method, respectively. We actually do not need to estimate the effective number if the permutation-based method is used; however, in order to use it as a standard in method comparisons, we will estimate *N*_{P} by α_{e}/α_{p}, where α_{p} is the estimated point-wise cutoff p value from the permutation-based method.
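The back-calculation of *N*_{P} and the use of an effective number in place of *N* amount to simple arithmetic, sketched below with illustrative function names. The value 348 is the effective number reported for the real data later in the paper; the Bonferroni form is used for the cutoff.

```python
def effective_number_from_cutoff(alpha_e, alpha_p):
    """Back out N_P = alpha_e / alpha_p from a permutation-based cutoff alpha_p."""
    return alpha_e / alpha_p


def bonferroni_cutoff_from_neff(alpha_e, n_eff):
    """Point-wise cutoff with the total number of tests replaced by n_eff."""
    return alpha_e / n_eff


if __name__ == "__main__":
    cutoff = bonferroni_cutoff_from_neff(0.05, 348)
    print(cutoff)                                      # about 1.44e-4
    print(effective_number_from_cutoff(0.05, cutoff))  # recovers 348
```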

For our proposed method, we estimate the effective number based on the correlation matrix. As suggested by Gao et al. [5], we use the CLD coefficient to estimate the correlation coefficient between each pair of SNPs since it has certain advantages over the LD correlation [5, 18, 19, 20, 21]. For example, the expectation-maximization algorithm-based estimate of LD correlation makes a strong assumption of Hardy-Weinberg equilibrium [22], which may not be met in practice [21, 23, 24]. Some researchers have shown that CLD can capture the relationships among SNPs comparably to gametic LD without requiring Hardy-Weinberg equilibrium [5, 18, 19, 20, 21]. In addition, the CLD coefficient is simple to estimate: code the wild-type homozygote, heterozygote and variant homozygote as 2, 1 and 0, respectively, for each individual genotype and then calculate the correlation coefficient in the usual way [e.g. R function cor()] for each pair of SNPs. For more details, see Gao et al. [5].
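The CLD calculation described above is an ordinary Pearson correlation on genotype codes. The sketch below is a plain-Python equivalent of calling a correlation routine on two coded genotype vectors; the name `cld_coefficient` and the choice to skip pairs with a missing genotype (coded `None`) are assumptions made for this example.

```python
import math


def cld_coefficient(geno_a, geno_b):
    """Absolute Pearson correlation of two genotype vectors coded 2/1/0.

    Individuals with a missing genotype (None) at either SNP are skipped,
    which is an assumed pairwise-complete strategy. Both SNPs must be
    polymorphic in the retained individuals, otherwise the variance is zero.
    """
    pairs = [(a, b) for a, b in zip(geno_a, geno_b)
             if a is not None and b is not None]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return abs(cov) / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    print(cld_coefficient([2, 2, 1, 0], [2, 1, 1, 0]))        # about 0.853
    print(cld_coefficient([2, 1, None, 0], [0, 1, 2, 2]))     # missing value skipped
```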

### Simulation Settings

The R package ‘popgen’ (version 0.0-4; http://cran.r-project.org/src/contrib/Archive/popgen/) is used to generate phenotype data. We simulate data sets with settings similar to those used in Gao et al. [5, 25]. More specifically, we simulate two different data sets. For simulation 1, we simulate 8 cold regions (each 10 kb long) separated by hotspots (each 1 kb long). For simulation 2, we simulate 4 cold regions (each 10 kb long) separated by hotspots (each 15 kb long). The mutation rate is θ = 4*N*_{e}μ, where the effective population size *N*_{e} and the mutation rate per base pair per generation μ are set to be 10,000 and 1.4 × 10^{−8}, respectively. The recombination rate is *r* = 4*N*_{e}δ, where the recombination rate per base pair per generation δ is chosen to be 2.5 × 10^{−8} and 9 × 10^{−10} for cold regions in simulations 1 and 2, respectively, to get patterns similar to those observed in the SeattleSNP database [5, 25]. For the hot regions, the recombination rates per base pair per generation are set to be 100 times greater than those in the cold regions. For both simulations, 100 experiments are generated, each with 200 cases and 200 controls. To see whether sample sizes affect the outcomes, we also simulate data sets with 1,000 cases and 1,000 controls. The lowest minor allele frequency (MAF) is set to 0.05, as commonly chosen in practice; SNPs with MAF <0.05 are removed. Gao et al. [5] used 0.1 as the MAF cutoff in their simulations. In the permutation-based method, we use Cochran-Armitage trend tests with 10,000 permutations to estimate the point-wise cutoff p values and the corresponding estimated effective numbers *N*_{P} with experiment-wise levels 0.05 and 0.01. Regarding the method of Gao et al. [5], we use 99.5% for the parameter *C*; in our new approach, we use *k* = 7.

### Real Data

A real SNP data set is also used to compare the methods [26]. In this data set, we use the data from the 167 Eastern Asian Chinese people. There are 1,272 SNPs across 16 regions of chromosome 21; among them 226 SNPs with MAF <5% are removed, resulting in 1,046 SNPs in the final analysis. Among the 167 samples, 83 are assumed as hypothetical cases and 84 as controls in the permutation-based test.

## Results

### Simulation Results

The number of SNPs generated from the 100 experiments in simulation 1 varies from 33 to 184 with a mean of 69.1 and a median of 65.5. Similarly, the number of SNPs in simulation 2 is between 30 and 170, with a mean of 74.4 and a median of 75.5. We first use the permutation-based method to estimate *N*_{P}, which then serves as the gold standard: the closer the estimated effective number is to *N*_{P}, the better the method performs. Figure 1 plots the estimated effective numbers *N*_{Chev}, *N*_{LJ}, *N*_{Gao} and *N*_{Chen} by the methods of Cheverud, Li and Ji, and Gao et al., and our new method, respectively, in simulation 1, with *N*_{P} estimated from the trend test with experiment-wise level 0.05. The effective numbers are sorted by *N*_{P} before they are plotted to give better visualization (this applies to all figures). Figure 2 compares those estimated effective numbers in simulation 1 with *N*_{P} estimated with experiment-wise level 0.01. Figures 1 and 2 clearly show that regardless of the experiment-wise levels used, most of the time, *N*_{Chen} performs better than *N*_{LJ} and *N*_{Gao}, which are close to each other and both perform better than *N*_{Chev}. Figures 3 and 4 plot the effective numbers estimated in simulation 2 with *N*_{P} estimated with experiment-wise levels 0.05 and 0.01, respectively. Again the overall performance of our new method is better than the methods of Li and Ji and Gao et al., which both are better than the Cheverud method.

**Figs. 1–4.** Estimated effective numbers in simulation 1 (**1, 2**) and in simulation 2 (**3, 4**) with *N*_{P} estimated at experiment-wise level 0.05 (**1, 3**) and 0.01 (**2, 4**). Perm = Permutation; L&J = Li and Ji.

Tables 1 and 2 summarize the statistics of the estimated effective numbers from various methods in simulations 1 and 2, respectively. It is noticeable that our new method *N*_{Chen} has very similar characteristics to the permutation-based method. For example, in simulation 1 the estimated overall effective numbers are 1,574 and 1,592 from *N*_{P} with experiment-wise level 0.05 and 0.01, respectively, which are very close to the 1,594 obtained by our new method. The estimated overall effective numbers from *N*_{LJ} and *N*_{Gao} are both greater than those from *N*_{P}.

We also perform statistical tests to compare *N*_{Chev}, *N*_{LJ}, *N*_{Gao} and *N*_{Chen} with *N*_{P}. A one-sample test (e.g. a paired t test or a signed-rank test) is applied to the differences between the effective numbers estimated by each pair of methods across the 100 experiments within the same simulation. Tables 3 and 4 report the p values from a one-sample t test for simulations 1 and 2, respectively. For both simulations (1 and 2), the estimated effective numbers from our new method and the permutation-based method are not statistically significantly different. For all other methods, the estimated effective numbers are always highly statistically significantly different from those of the permutation-based method. We also applied the Wilcoxon signed-rank test and obtained very similar results. When we increase the numbers of cases and controls to 500 or 1,000 each, we obtain results very similar to those with 200 cases and 200 controls; this is consistent with the findings from Gao et al. [5].

For SNP data, it is very common to have missing values. Methods based on principal component analysis, such as *N*_{Gao}, cannot be applied directly to this kind of data [5]. Although some data imputation strategies can be used, they introduce errors of their own. Another problem with principal component analysis is that it becomes inefficient with a large number of SNPs (>1,000) [5]. The newly proposed method is not sensitive to missing values since it only needs the correlation coefficient between each pair of SNPs.

When the number of SNPs becomes very large, many effective number estimation methods need to group the SNPs into subsets. However, we may not know exactly which SNPs should be grouped together in practice. It is desirable to know how the subset size affects the effective number estimation. For this purpose, we treat the generated SNP data from the 100 experiments within the same simulation as one single data set and then choose different subset sizes (e.g. 100, 200, …, 1,000) to separate the single data set into several subsets with equal numbers of SNPs (only the last subset may have more SNPs). The overall estimated effective number is taken to be the sum of the effective numbers estimated from the individual subsets. Tables 5 and 6 report the results for simulations 1 and 2, respectively. It can be seen that our new method is not sensitive to the subset size. With subset sizes between 500 and 1,000, our new method gives reliable effective numbers, which are very close to those from the permutation-based method.
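The subset scheme can be sketched as below: consecutive SNPs are cut into blocks of a chosen size, the last block absorbs any remainder, and the per-block effective numbers are summed. The function names and the callable `estimate_fn`, which stands in for any effective-number estimator applied to SNPs `start` to `end - 1`, are hypothetical.

```python
def split_into_subsets(n_snps, subset_size):
    """Return (start, end) index pairs; only the last subset may be larger."""
    n_subsets = max(n_snps // subset_size, 1)
    bounds = []
    for b in range(n_subsets):
        start = b * subset_size
        end = n_snps if b == n_subsets - 1 else start + subset_size
        bounds.append((start, end))
    return bounds


def overall_effective_number(n_snps, subset_size, estimate_fn):
    """Sum the effective numbers estimated separately on each subset."""
    return sum(estimate_fn(start, end)
               for start, end in split_into_subsets(n_snps, subset_size))


if __name__ == "__main__":
    print(split_into_subsets(1050, 500))  # [(0, 500), (500, 1050)]
```

Summing over subsets ignores correlations between SNPs that fall into different subsets, which is why the choice of subset size can matter for some estimators.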

To see how those methods work for large SNP data, we combine the data from the two simulations into one single data set with 14,453 SNPs in total. Due to the memory limit of R, we need to divide the whole data set into four subsets of almost equal size. The running times (in seconds) using R are 274, 631, 628, 630 and 1,086,961 (about 302 h) for our new method, the methods of Cheverud, Gao et al. and Li and Ji, and the permutation method with the trend test and 10,000 replicates, respectively. The estimated effective numbers by those methods are 3,198, 14,225, 1,405, 2,030 and 3,359, respectively. The estimated effective number by Cheverud is too large, while those by Gao et al. and Li and Ji are too small. Our proposed method and the permutation method obtained very similar results.

### Results from Real Data

From the real data, the estimated effective numbers from the permutation-based method with experiment-wise level 0.05 and 0.01 are 354 and 349, respectively. The effective number from our new method is 348, which is very close to those from the permutation-based method. The estimated effective numbers from Gao et al., Li and Ji, and Cheverud methods are 361, 371, and 859, respectively. Again we can see that the Cheverud method is too conservative.

## Discussion

In multiple-comparison problems with highly correlated tests, such as GWAS using SNPs, statistical methods that account for the dependence and give reasonable cutoff p values are highly desirable. Some of these methods have been proposed and successfully applied to genetic association studies. The concept of effective numbers of independent tests is simple but very useful.

Like the constant *C* in the method of Gao et al., the parameter *k* in our new method needs to be chosen in advance. This constant may vary among different situations. Although we used *k* = 7 for our new method when the statistical tests were Cochran-Armitage trend tests, we found that, unlike the method of Gao et al. [5], our method was not sensitive to the parameter *k*. For example, if we replace 7 by any value between 6 and 8, we obtain results very similar to those from our new method with *k* = 7. From both simulation and real data, we have shown that the estimated effective numbers from our new method with *k* = 7 were very close to those from the permutation-based method; we feel that *k* = 7 is appropriate for most situations. If other association tests are used, another constant *k* should be chosen. For example, we find that if Pearson's χ^{2} test with 2 degrees of freedom (d.f.) is used, *k* = 3 is more appropriate (data not shown). In GWAS, one may want to adjust for some covariates, e.g. age and gender. In this situation, a logistic regression model with genotype and several other covariates as independent variables is more appropriate. As the statistical method changes, we would expect that a different constant *k* needs to be chosen. We do not have such data to support a suggestion for choosing *k* in this situation; however, in principle, we may estimate this constant based on the permutation method with a small portion of the data. Although, based on the results from our simulations and real data, *k* = 7 is suitable for both significance levels 0.05 and 0.01, this constant *k* may also depend on the significance level used, as pointed out by other researchers [11, 14].

For the methods of Gao et al., and Li and Ji, it is important to find a suitable group size to divide a large data set into some subsets as the grouping effects are not negligible for those methods. However, for our proposed method, the grouping effect is limited based on our simulations.

It should be noticed that the method proposed by Moskvina and Schmidt [11] also utilizes the correlation coefficients among SNPs, although in a different way, to estimate the effective number. In general, the effective numbers estimated by their method are conservative since their *r*_{j}'s are usually underestimated, resulting in overestimated *k*_{j}'s and therefore overestimated effective numbers. Furthermore, Moskvina and Schmidt's method is independent of the association tests used. This may be a limitation of their method. Recently, Han et al. [13] proposed another correction method based on the observation that the covariance of their test statistics from two markers was the sample correlation coefficient of the two markers. However, the test they used was related to the allelic χ^{2} test (Pearson's χ^{2} test with 1 d.f. for a 2 × 2 contingency table); it was neither the trend test nor the χ^{2} test with 2 d.f. for a 2 × 3 contingency table. It is unclear how this method works if we choose the commonly used trend test.

We have proposed a new simple method to estimate the effective number to account for the correlations among SNPs in genetic association studies. It is less computation dependent and easy to implement. Through simulation and real data, we have shown that the proposed method outperforms existing effective number estimation methods.

## Acknowledgements

The authors would like to thank Dr. Noah Rosenberg for providing the real SNP data. We are very grateful to the Editor, the Associate Editor and two anonymous reviewers for their helpful comments that resulted in a substantial improvement of the original version of this paper. We also would like to thank Ms. Naturaleza Jolivet for editorial assistance and the support from the NIH grant (UL1 RR024148), awarded to the University of Texas Health Science Center at Houston.
