Logo of bmcbioiBioMed Centralsearchsubmit a manuscriptregisterthis articleBMC Bioinformatics
BMC Bioinformatics. 2005; 6: 32.
Published online Feb 17, 2005. doi:  10.1186/1471-2105-6-32
PMCID: PMC551600

A note on generalized Genome Scan Meta-Analysis statistics

Abstract

Background

Wise et al. introduced a rank-based statistical technique for meta-analysis of genome scans, the Genome Scan Meta-Analysis (GSMA) method. Levinson et al. recently described two generalizations of the GSMA statistic: (i) a weighted version of the GSMA statistic, so that different studies could be ascribed different weights for analysis; and (ii) an order statistic approach, reflecting the fact that a GSMA statistic can be computed for each chromosomal region or bin width across the various genome scan studies.

Results

We provide an Edgeworth approximation to the null distribution of the weighted GSMA statistic, and, we examine the limiting distribution of the GSMA statistics under the order statistic formulation, and quantify the relevance of the pairwise correlations of the GSMA statistics across different bins on this limiting distribution. We also remark on aggregate criteria and multiple testing for determining significance of GSMA results.

Conclusion

Theoretical considerations detailed herein can lead to clarification and simplification of testing criteria for generalizations of the GSMA statistic.

Background

Wise, Lanchbury and Lewis [1] introduced a rank-based statistical technique for meta-analysis of genome scans, the Genome Scan Meta-Analysis (GSMA) method, and derived its exact null distribution using a clever inclusion/exclusion argument. Koziol and Feng [2] provided an alternative derivation of the null distribution of the GSMA statistic via a combinatoric approach involving probability generating functions, and suggested an Edgeworth series approximation to its exact null distribution that improves upon the Wise [1] normal approximation.

Levinson [3] described two generalizations to the GSMA statistic: (i) a weighted version of the GSMA statistic, so that different studies could be ascribed different weights for analysis; and (ii) an order statistic approach, reflecting the fact that a GSMA statistic can be computed for each chromosomal region or bin across the various genome scan studies. Wise [1] had suggested that each chromosomal region (bin) be about 30 cM, leading to a total of about n = 120 bins spanning the entire genome, and correspondingly 120 GSMA statistics. Wise [1] and Koziol and Feng [2] had investigated the marginal distribution of any of these (exchangeable) GSMA statistics, whereas under the order statistic formulation of Levinson [3], the joint distribution of the entire set of GSMA statistics is taken into account. In this note, we consider both generalizations in turn. In particular, (i) we provide an Edgeworth approximation to the null distribution of the weighted GSMA statistic, analogous to that in Koziol and Feng [2]; and (ii) we examine the limiting distribution of the GSMA statistics under the order statistic formulation, and quantify the relevance of the pairwise correlations of the GSMA statistics across different bins on this limiting distribution. We conclude with remarks concerning the Levinson [3] aggregate criteria and multiple testing for determining significance of GSMA results.

Results

The GSMA statistics

We first introduce some notation. Let Xij, i = 1, ..., m, j = 1, ..., n, denote the rank of any particular linkage test statistic (e.g., LOD score) in the jth chromosomal region (bin) from the ith study, with each study being ranked separately. Levinson [3] rank the bins from 1 = "best" to n = "worst" on the basis of, say, maximum LOD score or lowest p value observed within each bin, but the reverse ranking from 1 = "worst" to n = "best" is also feasible. In practice, m can be as few as 4 (e.g., [4,5]); and, following Wise [1], n is generally about 120. The GSMA statistics are then S1, ..., Sn, where An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i1.gif. The exact (marginal) null distribution of each Sj was derived in Wise [1]; in the notation of Levinson [3], PAvgRnk, the "pointwise probability" of any Sj, is obtained from its marginal null distribution. The normal approximation to the exact distribution of the Sj is straightforward: the Sj are identically distributed, and each Sj has an approximate normal distribution with mean An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i2.gif and variance An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i3.gif under the null hypothesis that ranks are randomly assigned within each study. Koziol and Feng [2] provided an Edgeworth correction to this approximation, and recommended that the correction be used, at least for m ≤ 12.

The weighted GSMA statistic

Levinson [3] proposed a weighted version of the GSMA statistic, namely, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i4.gif, with the weight wi ascribed to the ith study reflecting the relative linkage information from that study. (We are temporarily omitting the j subscript for clarity.) The normal approximation to the marginal null distribution of Sw is straightforward, and depends on the two parameters An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i5.gif and An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i6.gif. The combinatorial argument utilized by Koziol and Feng [2] to derive the exact distribution of the unweighted GSMA statistic (which relies on probability generating functions) is generally no longer applicable in the weighted setting. Nevertheless, as in Koziol and Feng [2], we here provide an Edgeworth correction that may be applied to the weighted GSMA statistic. To this end, we equivalently consider the linear transform An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i7.gif, where An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i8.gif. We then have E[Rw] = 0, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i9.gif, and

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i10.gif

(We have used Koziol and Feng [2], eqn. 11, for An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i11.gif.)

The Edgeworth Type A series approximation to the density of An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i12.gif up to 4th order terms, is f (z) = [var phi] (z)[1 + c4H4 (z)], where [var phi](·) is the standard normal density function, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i13.gif, H4 (·) is the 4th degree Hermite-Chebyshev polynomial, H4 (z) = z4 - 6z2 + 3, and the constant c4 given by An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i14.gif (Stuart and Ord [6], eqn. 6.42). Furthermore, the cumulative distribution function of the Edgeworth series is given simply by

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i15.gif

where Φ(·) denotes the cumulative distribution function of the standard normal distribution, c4 and [var phi](·) are as above, and H3(·) is the 3rd degree Hermite-Chebyshev polynomial, H3 (z) = z3 - 3z (Stuart and Ord [6], eqn. 6.43).

In practice, we would expect the Edgeworth series approximation to provide an adequate representation of the exact distribution of the weighted GSMA statistic, in a manner analogous to the unweighted case [2]. Here, we briefly investigate the adequacy of the Edgeworth approximation, using an example from Lewis et al. [7]. They had applied the GSMA methodology to data from m = 20 schizophrenia genome scans, and found strong evidence for linkage on chromosome 2q, as well as suggestive evidence for linkage at several other chromosomal locations. The rank data for each scan are available online at D.F. Levinson's website (accessed July 14, 2004) [8], and we use these data to reconstruct first the unweighted GSMA statistics Sj, j = 1, 2, ..., 120, corresponding to the 120 bins spanning the entire genome, then their preferred weighted versions. Lewis [7] had recommended weights for each individual study proportional to the square root of the number of affected cases for that study. From Levinson's website [8], the individual weights wi, i = 1, 2, ..., 20, are 2.32, 1.77, 1.20, 1.17, 1.17, 1.16, 1.15, 1.08, 1.03, 1.01, 0.95, 0.88, 0.80, 0.80, 0.68, 0.67, 0.59, 0.54, 0.53, 0.51, greater than a four-fold range.

We simulated the null distribution of the weighted GSMA statistic An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i7.gif, where An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i8.gif with m = 20, n = 120, by drawing each Xi as an independent random integer from 1 to 120 (that is, a uniform distribution of the integers from 1 to 120), then forming Rw with the Levinson [8] weights. We used the random number generator in R [9], to produce 10,000 values for Rw. We then formed An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i12.gif, and compared the empirical distribution of the 10,000 Z values with the Edgeworth approximation as described above; for comparative purposes, we also computed the normal approximation, which is based on matching the first two moments rather than the first four moments with Edgeworth. Figure Figure11 shows the resulting quantile-quantile plot of the empirical distribution of the weighted GSMA Z values with both a normal approximation, panel A, and the Edgeworth approximation, panel B. Note that, even in this setting of the weighted combination of m = 20 individual GSMA statistics, the normal approximation is particularly ill-fitting in the tails. Agreement in the tails would be of particular relevance in practical applications, as these represent the areas of potentially significant findings (p-values). The noticeable disagreement in the tails between the weighted GSMA statistic and its normal approximation is largely ameliorated with the Edgeworth approximation. With attendant computational savings, the Edgeworth approximation provides a practical means of determining significance of weighted GSMA results compared to simulation; tail probabilities derived from the normal approximation should only be used with extreme caution.

Figure 1
Quantile-quantile plots of the weighted GSMA statistic vs. a normal approximation and an Edgeworth approximation. A. Quantile-quantile plot of the null distribution of the weighted GSMA statistic, x-axis, versus the Edgeworth approximation, y-axis. B ...

The GSMA order statistics

We turn next to order statistic considerations (and reintroduce the subscript j). The Levinson [3] order statistic approach to inference relating to the GSMA statistics takes into account the inherent ordering of the Sj: their Pord refers to the probability of any observed Sk given the kth bin's place in the ordering of all of the Sj. We here derive approximations to this distribution. Let An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i16.gif, j = 1, ..., n, and T(1) <T(2) < ... T(n) denote the order statistics. We note first that the Tj have (approximately) a singular symmetric multivariate normal distribution, with means 0, variances 1, and correlations An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i17.gif. That the joint distribution is singular follows from the observation that An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i18.gif for all i, hence, An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i19.gif is identically 0. If we dismiss the correlations as negligible (of absolute magnitude < 0.01 for n > 100), then the Tj are (approximately) independent, identically distributed N(0,1) (standard normal) random variates, and the cumulative distribution function (cdf) Fk of the kth order statistic T(k) is given by

An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i20.gif

with Φ(·) as above (David [10], eqn. 2.1.3).

We briefly examine whether correlations can be ignored when determining the distributions of the T(j). Numerical computation of the distributions of the order statistics from a symmetric multivariate normal distribution is feasible in a number of cases; we here examine perhaps the most relevant case, concerning the extreme T(n). Note that

Prob (T(n) x) = Prob (T1 x,T2 x, ...,Tn x);     (2)

this latter probability may be calculated in R using the mvtnorm package [9], based on methodology by Genz [11,12]. With n = 120, we depict in Figure Figure22 a Q-Q plot of the (approximate) distribution of T(n) under independence, eqn. (1), compared to the distribution from eqn. (2) with pairwise correlations An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i17.gif. The independence model tends to agree quite closely to the correlation model in this particular case, especially in the critical right tail, and has the virtue of numerical simplicity. We remark that one might improve slightly on the normal independence model by incorporating the Edgeworth correction into the individual cumulative distribution functions in equation (1).

Figure 2
Quantile-quantile Plot of T(n) Distribution. Quantile-quantile plot of the approximate normal distribution of T(n), the largest (order) GSMA statistic, under the correlation model (eqn. 2, with pairwise correlations An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i17.gif), x-axis, versus the independence ...

Aggregate criteria and multiple testing

Levinson [3] had proposed an aggregate criterion for detecting linkage based on both the marginal distributions and the order statistic distributions of the GSMA statistics. In particular, they argued that bins that have achieved both PAvgRnk < 0.05 and Pord < 0.05 "are the most likely to contain linked loci or to be adjacent to linked bins". Note that their criterion entails both the marginal distribution of the Tj, through PAvgRnk, and the (joint) order statistic distribution of the Tj, through Pord. We remark that there is some redundancy to the aggregate criterion {PAvgRnk < 0.05 and Pord < 0.05}, as can be seen through consideration of critical values relating to their aggregate criterion. With the normal approximation to the distribution of each normalized GSMA statistic Tj, the criterion {PAvgRnk < 0.05} is equivalent to the criterion {Tj > 1.645}. The criterion {Pord < 0.05} may be computed from eqn. (1), and depends on the ordering of the individual Tj. With n = 120, then for the ten largest order statistics T(120), T(119), ..., T(111), the criterion {PAvgRnk < 0.05 and Pord < 0.05} reduces to {Pord < 0.05}, since their 95th percentiles under their joint order statistic distribution exceed 1.645 [implying that, if {Pord < 0.05} obtains, then {PAvgRnk < 0.05} will automatically be satisfied]; and, for the remaining order statistics T(110), T(109), ..., T(1), the criterion {PAvgRnk < 0.05 and Pord < 0.05} reduces to {PAvgRnk < 0.05}, equivalently, {T(j) > 1.645}, as their 95th percentiles under the order distribution, eqn. (1), are less than 1.645 [implying that, if {PAvgRnk < 0.05} obtains, then {Pord < 0.05} will automatically be satisfied].

We conclude with a remark concerning multiple testing. Levinson [3] suggested a simple Bonferroni correction for multiple testing when determining the significance of GSMA results. In particular, they used the criterion {PAvgRnk < 0.000417} (0.05 corrected for 120 tests) for evidence that a bin is likely to contain a linked locus or loci. One can improve on this procedure by using Holm's [13] paradigm for multiple testing rather than Bonferroni. We illustrate Holm's [13] procedure by returning to the Lewis [7] study with m = 20 schizophrenia genome scans. As noted above, we used the online data to reconstruct the normalized unweighted GSMA statistics Tj,j = 1, 2, ..., 120, corresponding to the 120 bins spanning the entire genome. With m = 20 studies, we shall invoke the normal approximation to the distributions of the individual Tj.

Lewis [7] had extensively investigated various criteria for linkage from the 20 schizophrenia genome scans, and we shall not reproduce their analyses. Rather, we here illustrate a graphical procedure for the simultaneous evaluation of p-values from tests on the same data; this procedure is immediately applicable to the simultaneous consideration of the 120 GSMA statistics. The procedure, originally suggested by Schweder and Spjøtvoll [14], consists of a probability plot of the p-values versus the uniform distribution. Koziol [15] subsequently suggested that Holm's [13] paradigm for multiple testing be applied to Schweder and Spjøtvoll's [14] scenario, for a graphical determination of the number of true hypotheses.

Let us briefly review the Holm [13] method, which is an extension of the Bonferroni method for multiple comparisons. Suppose we compare the smallest p-value P(1) among n p-values with α/n and we find that the p-value is less than α/n. Then our multiple testing problem has been reduced by one test, and we should compare the next smallest p-value P(2) to An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i21.gif. In general, we would compare P(i) with An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i22.gif. Holm's [13] step-down test begins with i = 1, comparing P(i) with An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i22.gif, and stops as soon as P(i) exceeds An external file that holds a picture, illustration, etc.
Object name is 1471-2105-6-32-i22.gif, rejecting at overall level α all prior tests with smaller p-values. The Holm [13] method, like Bonferroni, makes no assumption on the dependence of tests, but beyond P(1) is less conservative than Bonferroni.

In Figure Figure3A3A we present a probability plot of the 120 p-values corresponding to the 120 individual Tj statistics, which we have recomputed from the online Levinson dataset [8]. On this plot, the points corresponding to the "true" hypotheses of no linkage in individual bins should roughly fall along a straight line passing through the origin. We have also superimposed the Bonferroni and Holm boundaries for overall alpha level 0.05 and n = 120 p-values; but, the two boundaries are virtually indistinguishable. There is little indication of large departures from the global null hypothesis of no linkage.

Figure 3
Probability plot of GSMA schizophrenia statistics. A. Probability plot of the 120 p-values corresponding to the 120 GSMA statistics T1 , ..., T120 from the 20 schizophrenia genome scans reported in Lewis [7], versus the (expected) uniform distribution. ...

In Figure Figure3B3B we rescale the y-axis, and focus solely on the Bonferroni and Holm boundaries. Differences are most readily apparent for the largest ordered p-values. On the other hand, with a large number of hypotheses (here 120), the improvement of Holm over Bonferroni at the smallest ordered p-values is marginal at best. As a reviewer has presciently remarked, the Holm procedure generally is most helpful (advantageous) relative to Bonferroni with only a small number of hypotheses.

In Figure Figure3C3C we zoom in on the part of the probability plot nearest the origin; we here have superimposed the Holm [13] boundary. In accord with Lewis [7], we find that only one GSMA statistic achieves statistical significance at overall alpha level 0.05, namely, the statistic corresponding to bin 2.5. [Recall that the Holm and Bonferroni boundaries coincide at the smallest p-value, P(1).] That is, in this particular instance, the unweighted GSMA statistics with either Bonferroni or Holm [13] correction for multiple testing identify statistically evidence for linkage on chromosme 2q.

Conclusion

For practitioners utilizing GSMA statistics, the question arises as to whether the methods proposed here as well as in Koziol and Feng [2] are merely of theoretical interest, or have practical import. If one utilizes solely the unweighted GSMA statistic, and chooses to consider its marginal distribution (corresponding to the PAvgRnk formulation of Levinson [2]), then the exact null distribution of the GSMA statistic is available from Wise [1] or Koziol and Feng [2], and should be preferred over any approximate methods. If the exact null distribution is computationally intractable for practitioners, then the Edgeworth approximation of Koziol and Feng [2] provides a simple and accurate means of assessing significance; we would argue that the Edgeworth approximation is preferable to a normal approximation in this instance. When weights are introduced into the GSMA statistic, then the combinatoric arguments of Wise [1] and Koziol and Feng [2] will typically be insufficient to derive the exact null distribution [though we remark that a moment generating function approach patterned after the probability generating function formulation of Koziol and Feng [2] can be brought to bear on this problem.] One can either simulate the null distribution or derive an Edgeworth approximation: we do not believe either method enjoys global advantages over the other. We caution against simple reliance on a normal approximation: in the situation investigated here, Figure Figure1,1, the weighted combination of m = 20 individual GSMS statistics, the normal approximation is particularly ill-fitting in the tails. [Agreement in the tails is of particular relevance to practitioners, as these represent the areas of potentially significant findings (p-values).] As for the order statistic formulation and the aggregate criteria of Levinson [3], we believe that the theoretical considerations given in this paper can lead to clarification and simplification of testing criteria.

Authors' contributions

JAK conceived, designed and drafted the manuscript. ACF performed the statistical simulation.

Both authors read and approved the final manuscript.

Acknowledgements

This research was supported in part by NIH grant RR00833 to the Scripps General Clinical Research Center. We thank the reviewers for their insightful comments and suggestions.

References

  • Wise LH, Lanchbury JS, Lewis CM. Meta-analysis of genome searches. Ann Hum Genet. 1999;63:263–72. doi: 10.1046/j.1469-1809.1999.6330263.x. [PubMed] [Cross Ref]
  • Koziol JA, Feng AC. A note on the genome scan meta-analysis statistic. Ann Hum Genet. 2004;68:376–380. doi: 10.1046/j.1529-8817.2004.00103.x. [PubMed] [Cross Ref]
  • Levinson DF, Levinson MD, Segurado R, Lewis CM. Genome scan meta-analysis of schizophrenia and bipolar disorder, part I: Methods and power analysis. Am J Hum Genet. 2003;73:17–33. doi: 10.1086/376548. [PMC free article] [PubMed] [Cross Ref]
  • Chiodini BD, Lewis CM. Meta-analysis of 4 coronary heart disease genome-wide linkage studies confirms a susceptibility locus on chromosome 3q. Arterioscler Thromb Vasc Biol. 2003;23:1863–1868. doi: 10.1161/01.ATV.0000093281.10213.DB. [PubMed] [Cross Ref]
  • Demenais F, Kanninen T, Lindgren CM, Wiltshire S, Gaget S, Dandrieux C, Almgren P, Sjogren M, Hattersley A, Dina C, Tuomi T, McCarthy MI, Froguel P, Groop LC. A meta-analysis of four European genome screens (GIFT consortium) shows evidence for a novel region on chromosome 17p11.2-q22 linked to type 2 diabetes. Hum Mol Genet. 2003;12:1865–1873. doi: 10.1093/hmg/ddg195. [PubMed] [Cross Ref]
  • Stuart A, Ord JK. Kendall's Advanced Theory of Statistics. 5. New York: Oxford University Press; 1987. [Distribution Theory, Vol 1.]
  • Lewis CM, Levinson DF, Wise LH, DeLisi LE, Straub RE, Hovatta I, Williams NM, Schwab SG, Pulver AE, Faraone SV, Brzustowicz LM, Kaufmann CA, Garver DL, Gurling HM, Lindholm E, Coon H, Moises HW, Byerley W, Shaw SH, Mesen A, Sherrington R, O'Neill FA, Walsh D, Kendler KS, Ekelund J, Paunio T, Lonnqvist J, Peltonen L, O'Donovan MC, Owen MJ, Wildenauer DB, Maier W, Nestadt G, Blouin JL, Antonarakis SE, Mowry BJ, Silverman JM, Crowe RR, Cloninger CR, Tsuang MT, Malaspina D, Harkavy-Friedman JM, Svrakic DM, Bassett AS, Holcomb J, Kalsi G, McQuillin A, Brynjolfson J, Sigmundsson T, Petursson H, Jazin E, Zoega T, Helgason T. Genome scan meta-analysis of schizophrenia and bipolar disorder, part II: Schizophrenia. Am J Hum Genet. 2003;73:34–48. doi: 10.1086/376549. [PMC free article] [PubMed] [Cross Ref]
  • Levinson Research Page http://depressiongenetics.med.upenn.edu/meta-analysis/index.htm
  • The R Project for Statistical Computing http://www.r-project.org/ [PubMed]
  • David HA. Order Statistics. New York: John Wiley; 1970.
  • Genz A. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics. 1992;1:141–150.
  • Genz A. Comparison of methods for the computation of multivariate normal probabilities. Computing Science and Statistics. 1993;25:400–405.
  • Holm S. A simple sequentially rejective multiple test procedure. Scand J Statist. 1979;6:65–70.
  • Schweder T, Spjøtvoll E. Plots of p-values to evaluate many tests simultaneously. Biometrika. 1982;69:493–502.
  • Koziol JA. A note on plots of p-values to evaluate many tests simultaneously. Biom J. 1989;8:969–972.

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

  • MedGen
    MedGen
    Related information in MedGen
  • PubMed
    PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...