
# A Simple Correction for Multiple Testing for Single-Nucleotide Polymorphisms in Linkage Disequilibrium with Each Other

## Abstract

In this report, we describe a simple correction for multiple testing of single-nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD) with each other, on the basis of the spectral decomposition (SpD) of matrices of pairwise LD between SNPs. This method provides a useful alternative to more computationally intensive permutation tests. A user-friendly interface (SNPSpD) for performing this correction is available online (http://genepi.qimr.edu.au/general/daleN/SNPSpD/). Additionally, output from SNPSpD includes eigenvalues, principal-component coefficients, and factor “loadings” after varimax rotation, enabling the selection of a subset of SNPs that optimize the information in a genomic region.

SNPs in disease-related genes are increasingly being used as candidates in the search for causative variations. Both theoretical (Long and Langley 1999; Service et al. 1999; Zollner and von Haeseler 2000; Akey et al. 2001; Bader 2001; Morris and Kaplan 2002) and empirical studies (Clark et al. 1998; Terwilliger and Weiss 1998; Escamilla et al. 1999; Martin et al. 2000) have produced contradictory results on whether haplotypes of two or more SNPs provide greater power than individual SNPs to find useful linkage disequilibrium (LD) between a causative mutation and linked marker loci. Moreover, variability in LD across the genome, the large dependence of the strength of association on allele-frequency differences between the disease variant and the SNP (e.g., Ohashi and Tokunaga 2001), and questions regarding the suitability of the “common disease common variant” (CDCV) hypothesis (i.e., depending on the ascertainment method) (Pritchard and Cox 2002) all suggest that an initial investigation of a candidate gene or interval should test many SNPs individually for association. However, unless the selected SNPs are all in complete LD with each other, such multiple testing will increase the false-positive (type I error) rate under nominal significance thresholds (e.g., α=0.05). On the other hand, when background LD exists between SNPs but they are assumed to be completely independent, the Šidák correction (Šidák 1968, 1971), which is approximated by the popular Bonferroni correction, would markedly overcorrect for the inflated false-positive rate, resulting in a reduction in power. Here we describe a simple correction for multiple testing of SNPs in LD with each other, on the basis of the spectral decomposition (SpD) of matrices of pairwise LD between SNPs. This method provides a useful alternative to more computationally intensive permutation tests.

It has previously been shown that the collective correlation among a set of variables can be measured by the variance of the eigenvalues (λs) derived from a correlation matrix (e.g., Cheverud et al. 1983, 2001). As detailed by Cheverud (2001), high correlation among variables leads to high λs. For example, if all variables are completely correlated, the first λ equals the number of variables in the correlation matrix (*M*) and the rest of the λs are zero. In this case, the variance of the λs is at its maximum, and it is equal to the number of variables in the matrix. Conversely, if no correlation exists among variables, all of the λs will be equal to one, and the set of λs will have no variance. Hence, the variance of the λs will range between zero, when all the variables are independent, and *M,* where *M* is the total number of variables included in the matrix. Therefore, the ratio of observed eigenvalue variance, *Var*(λ_{obs}), to its maximum (*M*) gives the proportional reduction in the number of variables in a set, and the effective number of variables (*M*_{eff}) may be calculated as follows:

$$M_{\mathrm{eff}} = 1 + (M - 1)\left(1 - \frac{\mathit{Var}(\lambda_{\mathrm{obs}})}{M}\right)$$
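The limiting cases described above can be checked numerically. The following is a minimal sketch, assuming the Cheverud (2001) relation *M*_{eff} = 1 + (*M* − 1)(1 − *Var*(λ_{obs})/*M*) and the sample (n − 1) variance, which is the convention under which the maximum eigenvalue variance equals *M*. The function name `effective_number` is hypothetical and is not part of SNPSpD.

```python
from statistics import variance  # sample variance (divides by n - 1)


def effective_number(eigenvalues):
    """Effective number of variables from correlation-matrix eigenvalues.

    Assumes M_eff = 1 + (M - 1) * (1 - Var(lambda) / M).
    """
    m = len(eigenvalues)
    if m < 2:
        return float(m)  # variance is undefined for fewer than two values
    return 1.0 + (m - 1) * (1.0 - variance(eigenvalues) / m)


# Limiting cases described in the text, for M = 5 variables.
print(effective_number([5.0, 0.0, 0.0, 0.0, 0.0]))  # complete correlation -> 1.0
print(effective_number([1.0, 1.0, 1.0, 1.0, 1.0]))  # complete independence -> 5.0
```

With complete correlation the eigenvalue variance equals *M* and the effective number collapses to one; with complete independence the variance is zero and all *M* variables count.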

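The common LD measure Δ is also the correlation coefficient for a 2 × 2 table (Hill and Robertson 1968), where

$$\Delta = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\sqrt{\pi_{1+}\,\pi_{2+}\,\pi_{+1}\,\pi_{+2}}}$$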

and the notation for estimated haplotype and marker allele frequencies in the 2 × 2 table is as follows:

| SNP 1 | SNP 2: Allele 1 | SNP 2: Allele 2 | Total |
|---|---|---|---|
| Allele 1 | π_{11} | π_{12} | π_{1+} |
| Allele 2 | π_{21} | π_{22} | π_{2+} |
| Total | π_{+1} | π_{+2} | 1 |
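Using the notation of this 2 × 2 table, Δ can be computed directly from the four haplotype frequencies. The sketch below assumes the standard correlation coefficient for a 2 × 2 table, Δ = (π_{11}π_{22} − π_{12}π_{21}) / √(π_{1+}π_{2+}π_{+1}π_{+2}). The function name `ld_delta` and the example frequencies are illustrative only.

```python
import math


def ld_delta(p11, p12, p21, p22):
    """Correlation coefficient between alleles at two biallelic SNPs."""
    p1_, p2_ = p11 + p12, p21 + p22  # SNP 1 allele frequencies (row totals)
    p_1, p_2 = p11 + p21, p12 + p22  # SNP 2 allele frequencies (column totals)
    denom = math.sqrt(p1_ * p2_ * p_1 * p_2)
    if denom == 0.0:
        raise ValueError("a monomorphic SNP has no defined correlation")
    return (p11 * p22 - p12 * p21) / denom


# Illustrative (made-up) haplotype frequencies.
print(ld_delta(0.5, 0.0, 0.0, 0.5))      # complete LD
print(ld_delta(0.25, 0.25, 0.25, 0.25))  # linkage equilibrium
```

The first call corresponds to two SNPs whose alleles always occur together, and the second to haplotype frequencies that are exactly the products of the marginal allele frequencies.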

Consequently, λs for the LD correlation (Δ) matrix may be calculated by principal-components analysis or, more generally, by spectral decomposition (SpD), and the approach of Cheverud (2001) may be applied to obtain the effective number of independent SNPs (*M*_{eff}) represented in the matrix.
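To illustrate the full path from a Δ matrix to *M*_{eff}, the sketch below computes eigenvalues with a plain cyclic Jacobi rotation scheme and then applies the relation *M*_{eff} = 1 + (*M* − 1)(1 − *Var*(λ_{obs})/*M*). The Jacobi routine is a standard-library stand-in chosen so the example is self-contained; it is not the routine used by SNPSpD. The function names and the 3 × 3 toy matrix are hypothetical.

```python
import math
from statistics import variance  # sample variance (divides by n - 1)


def symmetric_eigenvalues(matrix, tol=1e-20, max_sweeps=50):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    n = len(matrix)
    a = [list(row) for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break  # off-diagonal mass is negligible: diagonal holds eigenvalues
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Angle that zeroes a[p][q] after the similarity transform.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def m_eff(delta_matrix):
    """Effective number of independent SNPs for a matrix of two or more SNPs."""
    lambdas = symmetric_eigenvalues(delta_matrix)
    m = len(lambdas)
    return 1.0 + (m - 1) * (1.0 - variance(lambdas) / m)


# Made-up pairwise Delta values: SNPs 1 and 2 in strong LD, SNP 3 nearly independent.
toy = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(symmetric_eigenvalues(toy))
print(m_eff(toy))
```

In practice a tested linear-algebra routine would be the usual choice; the hand-written rotation is used here only to avoid external dependencies.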

Although *M*_{eff} could easily be calculated using standard statistical packages and/or free software in the public domain, we developed a user-friendly Web interface (SNPSpD) because we believe a wide variety of researchers may have use for this approach, which simply requires users to upload a MERLIN-format pedigree and map file (Abecasis et al. 2002). The uploaded files are run through a slightly altered version of Gonçalo Abecasis’s LDMAX program, part of the GOLD Command Line Tools package [gold-1.1.0.tar.gz] (Abecasis and Cookson 2000), which uses the expectation-maximization–based approach of Excoffier and Slatkin (1995) to estimate haplotype frequencies in case-control or family data. Using these haplotype frequencies, LDMAX calculates a number of pairwise LD statistics. A Perl script then creates a matrix of pairwise Δ measures, from which SNPSpD calculates λs by SpD, by use of the EIGEN function of R (v1.7.1) (R Development Core Team 2003). SNPSpD output includes the matrix of SNP-SNP Δ measures, *M,* λs, *Var*(λ_{obs}), *M*_{eff}, and a Šidák-corrected significance threshold (for *M*_{eff} tests) required to keep the type I error rate at 5%.

To investigate the performance of the *M*_{eff}-Šidák correction we utilized two real data sets. The first data set consisted of 10 highly associated SNPs, spanning ~27 kb within the *angiotensin-I converting enzyme* (*ACE*) gene (Keavney et al. 1998), and the second data set consisted of 23 SNPs, spanning ~794 kb within the T-cell antigen receptor (TCR) α/δ locus (Moffatt et al. 2000). The results of SNPSpD were validated by permutation (e.g., Westfall and Young 1993).

For the Keavney data set, 88 founders were utilized. For each permutation, 44 founders were randomly selected (without replacement) and labeled “cases,” and the remaining 44 founders were labeled “controls.” This selection process maintained each founder’s haplotype and, hence, the LD information between each SNP. For each permuted case-control sample (replicate), a χ^{2} test of homogeneity was used to compare genotype frequencies between the permuted case and control populations for each SNP. Thus, for each replicate, a total of 10 χ^{2} values were produced. This process was repeated 50,000 times. Finally, the number of replicates in which at least one SNP had a χ^{2} value with *P*≤.05 [i.e., χ^{2}≥5.991476; df 2] was counted to estimate the probability of a type I error. For example, the numbers of replicates producing at least one χ^{2} value ≥5.991476 were 2,235, 2,655, 4,100, 4,328, 4,328, 5,909, 7,042, 7,042, 7,042, and 7,844 for the first 1, 2, 3, 4, 5, 6, 7, 8, 9, and all 10 SNPs (in chromosomal order), respectively.
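The permutation scheme described above can be sketched as follows. This is an illustration in Python rather than the R code used for the study: the function names are hypothetical, the genotypes are randomly generated toy data (not the Keavney data), and only 200 replicates are run to keep the example fast. The fixed critical value assumes all three genotype classes are observed at each SNP, so that the test has 2 df.

```python
import random

CHI2_CRIT_DF2 = 5.991476  # critical value quoted in the text (P = .05, df 2)


def chi2_homogeneity(cases, controls):
    """Pearson chi-square comparing genotype counts (0/1/2) in two groups."""
    stat = 0.0
    n_total = len(cases) + len(controls)
    for g in (0, 1, 2):
        col = cases.count(g) + controls.count(g)
        if col == 0:
            continue  # genotype absent in both groups: contributes nothing
        for group in (cases, controls):
            expected = len(group) * col / n_total
            stat += (group.count(g) - expected) ** 2 / expected
    return stat


def familywise_error(genotypes, n_cases, n_reps, seed=0):
    """Share of replicates in which at least one SNP reaches the critical value.

    genotypes: one list per founder, holding that founder's genotype at each
    SNP. Founders are relabeled as whole rows, so inter-SNP LD is preserved.
    """
    rng = random.Random(seed)
    n, n_snps = len(genotypes), len(genotypes[0])
    hits = 0
    for _ in range(n_reps):
        case_idx = set(rng.sample(range(n), n_cases))
        for snp in range(n_snps):
            cases = [genotypes[i][snp] for i in range(n) if i in case_idx]
            controls = [genotypes[i][snp] for i in range(n) if i not in case_idx]
            if chi2_homogeneity(cases, controls) >= CHI2_CRIT_DF2:
                hits += 1
                break  # one significant SNP is enough for this replicate
    return hits / n_reps


# Toy data: 88 founders, 3 SNPs; SNP 2 copies SNP 1 (complete LD).
data_rng = random.Random(1)
toy = []
for _ in range(88):
    g1 = data_rng.choice((0, 1, 2))
    toy.append([g1, g1, data_rng.choice((0, 1, 2))])
print(familywise_error(toy, n_cases=44, n_reps=200))
```

Because case and control labels are assigned at random, every significant result is a false positive, so the returned proportion estimates the familywise type I error rate for the SNPs tested.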

Permutations were performed in R, utilizing the SAMPLE and CHISQ.TEST functions. Permuting 50,000 replicates took 26 min for the Keavney data set and 62 min for the Moffatt data set, whereas our SNPSpD Web interface took only 12 s and 14 s, respectively. Considering the fact that the R permutations were performed on a 2.8 GHz Xeon (Linux v2.4.20) server with exclusive CPU use, whereas the SNPSpD interface was run on our 300 MHz Sun4 SPARC 10 (SunOS 5.8) Web server, the SNPSpD approach was well over 100 times faster than the R permutations.

The 10 SNPs in the Keavney data set produced an *M*_{eff} of 4.59, representative of high intermarker LD (see table 1). Figure 1 shows the probability (Pr) of a type I error plotted against the number of SNPs tested for the Keavney et al. (1998) data set. Compared with the permuted rate, a Šidák correction ignoring intermarker LD (standard-Šidák correction) would clearly overcorrect for the inflated type I error rate, whereas the *M*_{eff}-Šidák rate, although slightly conservative in the presence of higher order intermarker LD (i.e., very strong LD across >2 SNPs), provides a good approximation to the permuted rate. For example, in terms of the significance threshold required to keep the type I error rate at 5% if all 10 SNPs were individually tested for association with ACE levels, the standard-Šidák [i.e., 1-(1-α)^{1/M}], *M*_{eff}-Šidák [i.e., 1-(1-α)^{1/Meff}], and permutation-based corrections would specify thresholds of *P*≤.005, *P*≤.011, and *P*≤.015, respectively.
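The two Šidák thresholds quoted above follow directly from the bracketed formulas. A minimal sketch, with hypothetical function names, that also shows the Bonferroni approximation α/*M* for comparison:

```python
def sidak_threshold(alpha, n_tests):
    """Per-test threshold keeping the familywise error rate at alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def bonferroni_threshold(alpha, n_tests):
    """Bonferroni approximation to the Sidak threshold."""
    return alpha / n_tests


alpha = 0.05
# M and M_eff values reported in the text for the Keavney data set.
for label, n_tests in (("standard-Sidak, M = 10", 10), ("M_eff-Sidak, M_eff = 4.59", 4.59)):
    print(label, sidak_threshold(alpha, n_tests), bonferroni_threshold(alpha, n_tests))
```

Rounded to three decimal places, the two Šidák values are .005 and .011, matching the thresholds in the text. Note that *M*_{eff} need not be an integer; the formula accepts any positive number of effective tests.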

**Figure 1.** Probability of a type I error versus number of SNPs tested, Keavney et al. (1998) data. The graph shows the expected increase in the false-positive rate for completely independent SNPs [i.e., 1-(1-α) ...

Analysis of the 23 SNPs in the Moffatt data set indicated low levels of intermarker LD with an *M*_{eff} of 22.53 and resulted in thresholds to keep the type I error rate at 5% of *P*≤.0022, *P*≤.0023, and *P*≤.0028 for the standard-Šidák, *M*_{eff}-Šidák, and permutation-based corrections, respectively.

It is worth noting that >50,000 permutations would be required to avoid rounding highly significant *P* values (*P*<.00002). Consequently, to correct for multiple testing of SNPs in LD with each other, our SNPSpD approach provides a simple and useful alternative to more computationally intensive permutation tests. Furthermore, by providing an estimate of the number of independent tests (*M*_{eff}), the SNPSpD approach allows researchers to apply any flavor of multiplicity correction they prefer: for example, the modified Bonferroni procedures of Holm (1979), Hochberg (1988), and Hommel (1988) or the more recently proposed false-discovery-rate (FDR) approach of Benjamini and Hochberg (1995).

Coincidentally, during the preparation of this manuscript, Meng et al. (2003) described a method based on the SpD of matrices of pairwise LD between markers to select a subset of SNPs that optimize the information in a genomic region. Although there are some parallels between the approach of Meng et al. (2003) and that presented here, our study, unlike that of Meng et al. (2003), not only is primarily concerned with the correction for multiple testing when using multiple SNPs in LD with each other but also provides important validation of the use of an SpD-based approach to correct for such nonindependence. That said, to complete the usefulness of our SNPSpD interface, we have extended analyses to include results after varimax rotation. Specifically, we report λs, proportions of variance, and principal-component coefficients after varimax rotation (an orthogonal rotation method that minimizes the number of variables that have high loadings on each factor, thus simplifying the interpretation of the factors). Furthermore, we maximize interpretability of these results by flagging the SNP(s) contributing the *most* to each rotated factor (i.e., group of SNPs). These flagged SNPs may be viewed as “haplotype-tagging SNPs.” Indeed, even in data with strong LD, the rotated factors correspond well with haplotypes obtained via traditional methods. For example, the seven haplotypes reported in the Keavney et al. (1998) study correspond to the seven factors produced by SNPSpD after varimax rotation.

Finally, because the user may then easily select SNPs to represent either each factor, the factor(s) with the largest *M*_{eff} λs, or the factor(s) explaining a selected proportion of variance, we believe many researchers will appreciate the convenience of our SNPSpD Web interface.

## Acknowledgments

The author thanks Professor Nicholas G. Martin, Dr. Grant W. Montgomery, and, in particular, Dr. David L. Duffy, for many helpful discussions, and Dr. Martin Farrall, for generously sharing the ACE data set. Special thanks go to David C. Smyth for assisting with the development of the SNPSpD Web interface. This research was supported in part by a National Health and Medical Research Council (NHMRC) Peter Doherty Fellowship and an NHMRC (Australia) grant 241916.

## Electronic-Database Information

The URLs for data presented herein are as follows:

## References

**American Society of Human Genetics**

American Journal of Human Genetics. Apr 2004; 74(4):765.
