Am J Hum Genet. Apr 2004; 74(4): 765–769.
Published online Mar 2, 2004. doi:  10.1086/383251
PMCID: PMC1181954

A Simple Correction for Multiple Testing for Single-Nucleotide Polymorphisms in Linkage Disequilibrium with Each Other

Abstract

In this report, we describe a simple correction for multiple testing of single-nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD) with each other, on the basis of the spectral decomposition (SpD) of matrices of pairwise LD between SNPs. This method provides a useful alternative to more computationally intensive permutation tests. A user-friendly interface (SNPSpD) for performing this correction is available online (http://genepi.qimr.edu.au/general/daleN/SNPSpD/). Additionally, output from SNPSpD includes eigenvalues, principal-component coefficients, and factor “loadings” after varimax rotation, enabling the selection of a subset of SNPs that optimize the information in a genomic region.

SNPs in disease-related genes are increasingly being used as candidates in the search for causative variations. Both theoretical (Long and Langley 1999; Service et al. 1999; Zollner and von Haeseler 2000; Akey et al. 2001; Bader 2001; Morris and Kaplan 2002) and empirical studies (Clark et al. 1998; Terwilliger and Weiss 1998; Escamilla et al. 1999; Martin et al. 2000) have produced contradictory results on whether haplotypes of two or more SNPs provide greater power than individual SNPs to find useful linkage disequilibrium (LD) between a causative mutation and linked marker loci. Moreover, variability in LD across the genome, the large dependence of the strength of association on allele-frequency differences between the disease variant and the SNP (e.g., Ohashi and Tokunaga 2001), and questions regarding the suitability of the “common disease common variant” (CDCV) hypothesis (i.e., depending on the ascertainment method) (Pritchard and Cox 2002) all suggest that an initial investigation of a candidate gene or interval should test many SNPs individually for association. However, unless the selected SNPs are all in complete LD with each other, such multiple testing will increase the false-positive (type I error) rate under nominal significance thresholds (e.g., α=0.05). On the other hand, when background LD exists between SNPs but they are assumed to be completely independent, then the Šidák correction—which is approximated by the popular Bonferroni correction (Šidák 1968, 1971)—would markedly overcorrect for the inflated false-positive rate, resulting in a reduction in power. Here we describe a simple correction for multiple testing of SNPs in LD with each other, on the basis of the spectral decomposition (SpD) of matrices of pairwise LD between SNPs. This method provides a useful alternative to more computationally intensive permutation tests.

It has previously been shown that the collective correlation among a set of variables can be measured by the variance of the eigenvalues (λs) derived from a correlation matrix (e.g., Cheverud et al. 1983; Cheverud 2001). As detailed by Cheverud (2001), high correlation among variables leads to high λs. For example, if all variables are completely correlated, the first λ equals the number of variables in the correlation matrix (M) and the rest of the λs are zero. In this case, the variance of the λs is at its maximum, and it is equal to the number of variables in the matrix. Conversely, if no correlation exists among variables, all of the λs will be equal to one, and the set of λs will have no variance. Hence, the variance of the λs will range between zero, when all the variables are independent, and M, when all the variables are completely correlated, where M is the total number of variables included in the matrix. Therefore, the ratio of the observed eigenvalue variance, Var(λobs), to its maximum (M) gives the proportional reduction in the number of variables in a set, and the effective number of variables (Meff) may be calculated as follows:

$$M_{\mathrm{eff}} = 1 + (M - 1)\left(1 - \frac{\mathrm{Var}(\lambda_{\mathrm{obs}})}{M}\right)$$
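As an illustration of the calculation of Meff, the following minimal Python sketch takes a correlation matrix and returns the effective number of variables. It is not the SNPSpD code: the function name `effective_number_of_tests` is hypothetical, NumPy's `eigvalsh` stands in for the eigenvalue routine, and the sample variance (denominator M − 1) is assumed because it makes the maximum variance equal to M, as described above.

```python
import numpy as np


def effective_number_of_tests(corr):
    """Meff = 1 + (M - 1) * (1 - Var(lambda_obs) / M) for a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    m = corr.shape[0]
    # eigvalsh is appropriate because a correlation matrix is symmetric.
    eigenvalues = np.linalg.eigvalsh(corr)
    # Assumption: sample variance (ddof=1), so that complete correlation
    # gives a variance of exactly M.
    var_obs = np.var(eigenvalues, ddof=1)
    return 1.0 + (m - 1.0) * (1.0 - var_obs / m)


# Two limiting cases described in the text.
independent = np.eye(4)        # no correlation: Meff should be close to M = 4
identical = np.ones((4, 4))    # complete correlation: Meff should be close to 1
print(effective_number_of_tests(independent))
print(effective_number_of_tests(identical))
```

The two limiting cases reproduce the bounds given in the text: Meff equals M for independent variables and 1 for completely correlated variables, up to floating-point error.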

The common LD measure Δ is also the correlation coefficient for a 2 × 2 table (Hill and Robertson 1968), where

$$\Delta = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\sqrt{\pi_{1+}\,\pi_{2+}\,\pi_{+1}\,\pi_{+2}}}$$

and the notation for estimated haplotype and marker allele frequencies in the 2 × 2 table is as follows:

                      SNP 2
SNP 1       Allele 1   Allele 2   Total
Allele 1    π11        π12        π1+
Allele 2    π21        π22        π2+
Total       π+1        π+2        1
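The correlation Δ for a pair of SNPs can be computed directly from the four haplotype frequencies in the 2 × 2 table. The sketch below is illustrative only; the function name `ld_delta` and its argument names are hypothetical, and the frequencies are assumed to be already estimated (e.g., by an expectation-maximization procedure) and to sum to 1.

```python
import math


def ld_delta(p11, p12, p21, p22):
    """Correlation coefficient (Delta) for a 2 x 2 table of haplotype frequencies."""
    # Marginal allele frequencies (row and column totals of the table).
    row1, row2 = p11 + p12, p21 + p22
    col1, col2 = p11 + p21, p12 + p22
    denominator = math.sqrt(row1 * row2 * col1 * col2)
    if denominator == 0:
        raise ValueError("Delta is undefined when a SNP is monomorphic")
    return (p11 * p22 - p12 * p21) / denominator


# Complete LD: only two haplotypes are present.
print(ld_delta(0.5, 0.0, 0.0, 0.5))
# Linkage equilibrium: haplotype frequencies are products of allele frequencies.
print(ld_delta(0.25, 0.25, 0.25, 0.25))
```

Applying this to every pair of SNPs gives the symmetric matrix of Δ values whose eigenvalues are used to obtain Meff.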

Consequently, λs for the LD correlation (Δ) matrix may be calculated by principal-components analysis or, more generally, by spectral decomposition (SpD), and the approach of Cheverud (2001) may be applied to obtain the effective number of independent SNPs (Meff) represented in the matrix.

Although Meff could easily be calculated using standard statistical packages and/or free software in the public domain, we developed a user-friendly Web interface (SNPSpD) because we believe a wide variety of researchers may have use for this approach, which simply requires users to upload a MERLIN-format pedigree and map file (Abecasis et al. 2002). The uploaded files are run through a slightly altered version of Gonçalo Abecasis’s LDMAX program, part of the GOLD Command Line Tools package [gold-1.1.0.tar.gz] (Abecasis and Cookson 2000), which uses the expectation-maximization–based approach of Excoffier and Slatkin (1995) to estimate haplotype frequencies in case-control or family data. Using these haplotype frequencies, LDMAX calculates a number of pairwise LD statistics. A Perl script then creates a matrix of pairwise Δ measures, from which SNPSpD calculates λs by SpD, by use of the EIGEN function of R (v1.7.1) (R Development Core Team 2003). SNPSpD output includes the matrix of SNP-SNP Δ measures, M, λs, Var(λobs), Meff, and a Šidák-corrected significance threshold (for Meff tests) required to keep the type I error rate at 5%.

To investigate the performance of the Meff-Šidák correction we utilized two real data sets. The first data set consisted of 10 highly associated SNPs, spanning ~27 kb within the angiotensin-I converting enzyme (ACE) gene (Keavney et al. 1998), and the second data set consisted of 23 SNPs, spanning ~794 kb within the T-cell antigen receptor (TCR) α/δ locus (Moffatt et al. 2000). The results of SNPSpD were validated by permutation (e.g., Westfall and Young 1993).

For the Keavney data set, 88 founders were utilized. For each permutation, 44 founders were randomly selected (without replacement) and labeled “cases,” and the remaining 44 founders were labeled “controls.” This selection process maintained each founder’s haplotype and, hence, the LD information between each SNP. For each permuted case-control sample (replicate), a χ² test of homogeneity was used to compare genotype frequencies between the permuted case and control populations for each SNP. Thus, for each replicate, a total of 10 χ² values were produced. This process was repeated 50,000 times. Finally, the number of replicates in which at least one SNP had a χ² value with P ≤ .05 (i.e., χ² ≥ 5.991476; df 2) was counted to estimate the probability of a type I error. For example, the number of replicates producing at least one χ² value ≥ 5.991476 was 2,235, 2,655, 4,100, 4,328, 4,328, 5,909, 7,042, 7,042, 7,042, and 7,844 for the first 1, 2, 3, 4, 5, 6, 7, 8, 9, and all 10 SNPs (in chromosomal order), respectively.
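The permutation scheme described above can be sketched as follows. This is a minimal Python illustration rather than the R code used in the study: the function names `chi2_homogeneity` and `familywise_error` are hypothetical, the genotypes are randomly simulated toy data (not the Keavney data), and only 200 replicates are run to keep the example fast. Only the critical value 5.991476 (df 2) is taken from the text.

```python
import random

CRITICAL = 5.991476  # chi-square critical value for P <= .05 with df 2 (from the text)


def chi2_homogeneity(cases, controls):
    """Chi-square statistic comparing genotype counts (coded 0, 1, 2) in two groups."""
    n_cases, n_controls = len(cases), len(controls)
    total = n_cases + n_controls
    statistic = 0.0
    for genotype in (0, 1, 2):
        observed = (cases.count(genotype), controls.count(genotype))
        column_total = sum(observed)
        if column_total == 0:
            # Simplification: an absent genotype is skipped, although this
            # would strictly reduce the degrees of freedom.
            continue
        for count, group_size in zip(observed, (n_cases, n_controls)):
            expected = column_total * group_size / total
            statistic += (count - expected) ** 2 / expected
    return statistic


def familywise_error(genotypes, n_replicates, rng):
    """Fraction of replicates in which at least one SNP reaches the critical value.

    genotypes: one list per founder, holding that founder's genotype at each SNP.
    Whole founders are relabeled, so the LD between SNPs is preserved.
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    hits = 0
    for _ in range(n_replicates):
        order = list(range(n))
        rng.shuffle(order)
        cases = [genotypes[i] for i in order[: n // 2]]
        controls = [genotypes[i] for i in order[n // 2:]]
        if any(
            chi2_homogeneity([g[s] for g in cases], [g[s] for g in controls]) >= CRITICAL
            for s in range(n_snps)
        ):
            hits += 1
    return hits / n_replicates


# Toy data: 88 simulated founders typed at 5 independent SNPs.
data_rng = random.Random(1)
toy = [[data_rng.choice((0, 1, 2)) for _ in range(5)] for _ in range(88)]
print(familywise_error(toy, 200, random.Random(2)))
```

Because each replicate shuffles founders rather than individual genotypes, SNPs in complete LD contribute no additional false positives, which is the property the permuted rate is used to capture.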

Permutations were performed in R, utilizing the SAMPLE and CHISQ.TEST functions. Permuting 50,000 replicates took 26 min for the Keavney data set and 62 min for the Moffatt data set, whereas our SNPSpD Web interface took only 12 s and 14 s, respectively. Considering the fact that the R permutations were performed on a 2.8 GHz Xeon (Linux v2.4.20) server with exclusive CPU use, whereas the SNPSpD interface was run on our 300 MHz Sun4 SPARC 10 (SunOS 5.8) Web server, the SNPSpD approach was well over 100 times faster than the R permutations.

The 10 SNPs in the Keavney data set produced an Meff of 4.59, representative of high intermarker LD (see table 1). Figure 1 shows the probability (Pr) of a type I error plotted against the number of SNPs tested for the Keavney et al. (1998) data set. Compared with the permuted rate, a Šidák correction ignoring intermarker LD (standard-Šidák correction) would clearly overcorrect for the inflated type I error rate, whereas the Meff-Šidák rate, although slightly conservative in the presence of higher order intermarker LD (i.e., very strong LD across >2 SNPs), provides a good approximation to the permuted rate. For example, in terms of the significance threshold required to keep the type I error rate at 5% if all 10 SNPs were individually tested for association with ACE levels, the standard-Šidák [i.e., 1 − (1 − α)^(1/M)], Meff-Šidák [i.e., 1 − (1 − α)^(1/Meff)], and permutation-based corrections would specify thresholds of P ≤ .005, P ≤ .011, and P ≤ .015, respectively.

Figure  1
Probability (Pr) of a type I error plotted against the number of SNPs (in chromosomal order) tested in the Keavney et al. (1998) data. The graph shows the expected increase in the false-positive rate for completely independent SNPs [i.e., 1-(1-α) ...
Table 1
LD Matrix with Pairwise Correlations (Δ) and Eigenvalues (λs)[Note]

Analysis of the 23 SNPs in the Moffatt data set indicated low levels of intermarker LD with an Meff of 22.53 and resulted in thresholds to keep the type I error rate at 5% of P ≤ .0022, P ≤ .0023, and P ≤ .0028 for the standard-Šidák, Meff-Šidák, and permutation-based corrections, respectively.
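The standard-Šidák and Meff-Šidák thresholds reported for the two data sets follow from the formula 1 − (1 − α)^(1/n), with n set to M or Meff. The short Python sketch below recomputes them; the function name `sidak_threshold` is hypothetical, and the values of M and Meff are those reported in the text.

```python
def sidak_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold keeping the overall type I error at alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


# M and Meff as reported for the two data sets.
for label, n_tests in (
    ("Keavney, standard-Sidak (M = 10)", 10),
    ("Keavney, Meff-Sidak (Meff = 4.59)", 4.59),
    ("Moffatt, standard-Sidak (M = 23)", 23),
    ("Moffatt, Meff-Sidak (Meff = 22.53)", 22.53),
):
    print(f"{label}: {sidak_threshold(n_tests):.4f}")
```

The computed values round to the thresholds quoted in the text (.005 and .011 for the Keavney data; .0022 and .0023 for the Moffatt data). The permutation-based thresholds cannot be obtained this way, because they depend on the permuted data themselves.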

It is worth noting that >50,000 permutations would be required to avoid rounding highly significant P values (P<.00002). Consequently, to correct for multiple testing of SNPs in LD with each other, our SNPSpD approach provides a simple and useful alternative to more computationally intensive permutation tests. Furthermore, by providing an estimate of the number of independent tests (Meff), the SNPSpD approach allows researchers to apply any flavor of multiplicity correction they prefer—for example, the modified Bonferroni procedures of Holm (1979), Hochberg (1988), and Hommel (1988) or the more recently proposed false-discovery-rate (FDR) approach of Benjamini and Hochberg (1995).

Coincidentally, during the preparation of this manuscript, Meng et al. (2003) described a method based on the SpD of matrices of pairwise LD between markers to select a subset of SNPs that optimize the information in a genomic region. Although there are some parallels between the approach of Meng et al. (2003) and that presented here, our study, unlike that of Meng et al. (2003), not only is primarily concerned with the correction for multiple testing when using multiple SNPs in LD with each other but also provides important validation of the use of an SpD-based approach to correct for such nonindependence. That said, to complete the usefulness of our SNPSpD interface, we have extended analyses to include results after varimax rotation. Specifically, we report λs, proportions of variance, and principal-component coefficients after varimax rotation (an orthogonal rotation method that minimizes the number of variables that have high loadings on each factor, thus simplifying the interpretation of the factors). Furthermore, we maximize interpretability of these results by flagging the SNP(s) contributing the most to each rotated factor (i.e., group of SNPs). These flagged SNPs may be viewed as “haplotype-tagging SNPs.” Indeed, even in data with strong LD, the rotated factors correspond well with haplotypes obtained via traditional methods. For example, the seven haplotypes reported in the Keavney et al. (1998) study correspond to the seven factors produced by SNPSpD after varimax rotation.
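To make the rotation step concrete, the following Python sketch applies a standard iterative (SVD-based) varimax algorithm to principal-component loadings of a small, made-up LD matrix and flags the SNP with the largest absolute loading on each rotated factor. It should be read as an illustration under stated assumptions, not as the SNPSpD implementation: the function name `varimax`, the rule of retaining factors with λ > 1, the flagging rule, and the toy matrix are all hypothetical.

```python
import numpy as np


def varimax(loadings, max_iter=50, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the varimax criterion; the SVD then gives
        # the orthogonal rotation that best aligns the loadings with it.
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        previous, criterion = criterion, s.sum()
        if previous > 0 and criterion / previous < 1 + tol:
            break
    return loadings @ rotation


# Toy LD matrix (made-up values): two blocks of two SNPs each.
ld = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
])
eigenvalues, eigenvectors = np.linalg.eigh(ld)
keep = eigenvalues > 1.0  # assumed retention rule, for illustration only
loadings = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])
rotated = varimax(loadings)

# Assumed flagging rule: the SNP with the largest absolute loading per factor.
flagged = np.argmax(np.abs(rotated), axis=0)
print(flagged)
```

In this toy case each rotated factor loads on one block of SNPs, so one SNP is flagged per block. Because the rotation is orthogonal, the proportion of each SNP's variance explained by the retained factors is unchanged.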

Finally, because the user may then easily select SNPs to represent either each factor, the factor(s) with the largest Meff λs, or the factor(s) explaining a selected proportion of variance, we believe many researchers will appreciate the convenience of our SNPSpD Web interface.

Acknowledgments

The author thanks Professor Nicholas G. Martin, Dr. Grant W. Montgomery, and, in particular, Dr. David L. Duffy, for many helpful discussions, and Dr. Martin Farrall, for generously sharing the ACE data set. Special thanks go to David C. Smyth for assisting with the development of the SNPSpD Web interface. This research was supported in part by a National Health and Medical Research Council (NHMRC) Peter Doherty Fellowship and an NHMRC (Australia) grant 241916.

Electronic-Database Information

The URLs for data presented herein are as follows:

SNP Spectral Decomposition (SNPSpD) Web Interface, http://genepi.qimr.edu.au/general/daleN/SNPSpD/
LDMAX Program (Part of the GOLD Command Line Tools Package [gold-1.1.0.tar.gz]), http://www.sph.umich.edu/csg/abecasis/GOLD/download/index.html
TCR α/δ Locus Data of Moffatt et al. (2000), http://www.well.ox.ac.uk/asthma/public/TCR/index.shtml

References

Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin: rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30:97–101. doi:10.1038/ng786
Abecasis GR, Cookson WO (2000) GOLD: graphical overview of linkage disequilibrium. Bioinformatics 16:182–183. doi:10.1093/bioinformatics/16.2.182
Akey J, Jin L, Xiong M (2001) Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur J Hum Genet 9:291–300. doi:10.1038/sj.ejhg.5200619
Bader JS (2001) The relative power of SNPs and haplotype as genetic markers for association tests. Pharmacogenomics 2:11–24
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc [Ser B] 57:289–300
Cheverud JM (2001) A simple correction for multiple comparisons in interval mapping genome scans. Heredity 87:52–58. doi:10.1046/j.1365-2540.2001.00901.x
Cheverud J, Rutledge J, Atchley W (1983) Quantitative genetics of development: genetic correlations among age-specific trait values and the evolution of ontogeny. Evolution 37:895–905
Clark AG, Weiss KM, Nickerson DA, Taylor SL, Buchanan A, Stengard J, Salomaa V, Vartiainen E, Perola M, Boerwinkle E, Sing CF (1998) Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am J Hum Genet 63:595–612
Escamilla MA, McInnes LA, Spesny M, Reus VI, Service SK, Shimayoshi N, Tyler DJ, Silva S, Molina J, Gallegos A, Meza L, Cruz ML, Batki S, Vinogradov S, Neylan T, Nguyen JB, Fournier E, Araya C, Barondes SH, Leon P, Sandkuijl LA, Freimer NB (1999) Assessing the feasibility of linkage disequilibrium methods for mapping complex traits: an initial screen for bipolar disorder loci on chromosome 18. Am J Hum Genet 64:1670–1678
Excoffier L, Slatkin M (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12:921–927
Hill WG, Robertson A (1968) Linkage disequilibrium in finite populations. Theor Appl Genet 38:226–231
Hochberg Y (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800–802
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Statist 6:65–70
Hommel G (1988) A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75:383–386
Keavney B, McKenzie CA, Connell JM, Julier C, Ratcliffe PJ, Sobel E, Lathrop M, Farrall M (1998) Measured haplotype analysis of the angiotensin-I converting enzyme gene. Hum Mol Genet 7:1745–1751. doi:10.1093/hmg/7.11.1745
Long AD, Langley CH (1999) The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res 9:720–731
Martin ER, Lai EH, Gilbert JR, Rogala AR, Afshari AJ, Riley J, Finch KL, Stevens JF, Livak KJ, Slotterbeck BD, Slifer SH, Warren LL, Conneally PM, Schmechel DE, Purvis I, Pericak-Vance MA, Roses AD, Vance JM (2000) SNPing away at complex diseases: analysis of single-nucleotide polymorphisms around APOE in Alzheimer disease. Am J Hum Genet 67:383–394
Meng Z, Zaykin DV, Xu CF, Wagner M, Ehm MG (2003) Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am J Hum Genet 73:115–130
Moffatt MF, Traherne JA, Abecasis GR, Cookson WO (2000) Single nucleotide polymorphism and linkage disequilibrium within the TCR α/δ locus. Hum Mol Genet 9:1011–1019. doi:10.1093/hmg/9.7.1011
Morris RW, Kaplan NL (2002) On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genet Epidemiol 23:221–233. doi:10.1002/gepi.10200
Ohashi J, Tokunaga K (2001) The power of genome-wide association studies of complex disease genes: statistical limitations of indirect approaches using SNP markers. J Hum Genet 46:478–482. doi:10.1007/s100380170048
Pritchard JK, Cox NJ (2002) The allelic architecture of human disease genes: common disease-common variant…or not? Hum Mol Genet 11:2417–2423. doi:10.1093/hmg/11.20.2417
R Development Core Team (2003) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, http://www.R-project.org (accessed March 1, 2004)
Service SK, Lang DW, Freimer NB, Sandkuijl LA (1999) Linkage-disequilibrium mapping of disease genes by reconstruction of ancestral haplotypes in founder populations. Am J Hum Genet 64:1728–1738
Šidák Z (1968) On multivariate normal probabilities of rectangles: their dependence on correlations. Ann Math Statist 39:1425–1434
Šidák Z (1971) On probabilities of rectangles in multivariate normal Student distributions: their dependence on correlations. Ann Math Statist 41:169–175
Terwilliger JD, Weiss KM (1998) Linkage disequilibrium mapping of complex disease: fantasy or reality? Curr Opin Biotechnol 9:578–594. doi:10.1016/S0958-1669(98)80135-3
Westfall PH, Young SS (1993) Resampling-based multiple testing: examples and methods for p-value adjustment. John Wiley & Sons, New York
Zollner S, von Haeseler A (2000) A coalescent approach to study linkage disequilibrium between single-nucleotide polymorphisms. Am J Hum Genet 66:615–628

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics