BMJ. Nov 3, 2007; 335(7626): 914–916.

# Uncertainty in heterogeneity estimates in meta-analyses

Clinical Trials and Evidence-Based Medicine Unit, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina 45110, Greece

Accepted August 29, 2007.


**John Ioannidis**, **Nikolaos Patsopoulos**, and **Evangelos Evangelou** argue that, although meta-analyses often measure heterogeneity between studies, these estimates can have large uncertainty, which must be taken into account when interpreting evidence.

### Summary points

The extent of between study heterogeneity should be measured when interpreting results of meta-analyses

Meta-analyses rarely document uncertainty in estimates of heterogeneity

Our evaluation of a large number of meta-analyses shows that, in most of them, the uncertainty about the extent of heterogeneity is wide

Confidence intervals of I^{2} should be calculated and considered when interpreting meta-analyses

An important aim of systematic reviews and meta-analyses is to assess the extent to which different studies give similar or dissimilar results.^{1} Clinical, methodological, and biological heterogeneity are often topic specific, but statistical heterogeneity can be examined with the same methods in all meta-analyses. Therefore, the perception of statistical heterogeneity or homogeneity often influences meta-analysts and clinicians in important decisions. These decisions include whether the data are similar enough to combine different studies; whether a treatment is applicable to all or should be “individualised” because of variable benefits or harms in different types of patients; and whether a risk factor affects all people exposed or only select populations. How uncertain is the extent of statistical heterogeneity in meta-analyses? Moreover, is this uncertainty properly factored in when interpreting the results?

## Evaluating heterogeneity between studies

Many statistical tests are available for evaluating heterogeneity between studies.^{2} ^{3} Until recently, the most popular was Cochran's Q, a statistic based on the χ^{2} test.^{4} Cochran's Q usually has low power to detect heterogeneity, however. It also depends on the number of studies and so cannot be compared across different meta-analyses.^{2} ^{3} Higgins and colleagues, in two highly cited papers,^{5} ^{6} proposed the routine use of the I^{2} statistic. I^{2} is calculated as [(Q−df)/Q]×100%, where df is degrees of freedom (number of studies minus 1). Negative values are set to zero, so I^{2} ranges from 0% to 100%; it describes the proportion of the total variation across studies that is beyond chance. This statistic can be used to compare the amount of inconsistency across different meta-analyses even with different numbers of studies.^{7} I^{2} is routinely implemented in all Cochrane reviews (standard option in RevMan) and is increasingly used in meta-analyses published in medical journals.
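As a minimal sketch of the calculation described above, assume each study contributes an effect estimate and its variance, and that Q is the usual inverse variance weighted sum of squared deviations from the fixed effect pooled estimate. The function name and the three example studies are illustrative assumptions, not data from any review discussed here.

```python
def cochran_q_and_i2(effects, variances):
    """Return (Q, df, I2) for study effect estimates and their variances.

    I2 is returned as a fraction between 0 and 1; negative values are set to 0.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    # Inverse variance weights, as in a fixed effect meta-analysis.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# Made-up example: three equally precise studies (hypothetical log odds ratios).
q, df, i2 = cochran_q_and_i2([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
print(f"Q={q:.2f}, df={df}, I2={100 * i2:.0f}%")
```

With these made-up inputs the weights are equal, the pooled estimate is 0.3, Q is 8 on 2 degrees of freedom, and I^{2} is 75%.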

Higgins and colleagues suggested that we could “tentatively assign adjectives of low, moderate, and high to I^{2} values of 25%, 50%, and 75%.”^{6} Like any metric, however, I^{2} has some uncertainty, and Higgins and Thompson provided methods to calculate this uncertainty.^{5} Recently, other investigators compared the performance of I^{2} and Q in Monte Carlo simulations across diverse simulated meta-analytic conditions. They found that I^{2} also has low statistical power when the number of studies is small and that its confidence intervals can be wide.^{8}

## Interpreting heterogeneity in selected meta-analyses

Inferences about the extent of heterogeneity must be especially cautious when the 95% confidence intervals around I^{2} are wide, ranging from low to high heterogeneity. Such uncertainty is usually ignored in systematic reviews, however. This can result in misconceptions. For example, a systematic review of corticosteroids for Kawasaki disease found a point estimate I^{2}=59%.^{9} The authors decided to exclude the two studies that were most different, saying that their removal eliminated all of the across study heterogeneity (Q=5.59, P=0.588, I^{2}=0.00). In fact, the 95% confidence interval for this I^{2}=0% estimate still extends from 0% to 56%. With two small randomised trials and six non-randomised comparisons remaining, the meta-analysis concluded that corticosteroids consistently halve the risk of coronary aneurysms. However, the two largest randomised trials on this topic were published after the meta-analysis. Heterogeneity resurfaced: the largest trial found no effect on coronary dimensions,^{10} while the other trial showed an 80% reduction in the risk of coronary artery abnormalities.^{11}

Eight systematic reviews published in the *BMJ* between 1 July 2005 and 1 January 2006 performed meta-analyses of randomised trials, and seven of them performed some statistical analysis of heterogeneity between studies (table on bmj.com).^{12} ^{13} ^{14} ^{15} ^{16} ^{17} ^{18} Each of these reviews stated that its authors had tried to interpret heterogeneity, and seven meta-analyses provided enough information for us to calculate the 95% confidence interval of I^{2}. The lower 95% confidence interval was always as low as 0% (rounded to integer percentage), with one exception. The upper 95% confidence interval always exceeded the 50% threshold, and in four cases it also exceeded the 75% threshold. A conclusive statement was feasible in only one case, where I^{2} was 69%, the 95% confidence interval was 40% to 80%, the Q statistic had P<0.001, and the authors justifiably concluded that “there was significant heterogeneity among these trials.”^{13} This meta-analysis had 15 studies, so the power of both Q and I^{2} was good. In all other meta-analyses (two to 12 studies each), strong statements in interpreting heterogeneity would be difficult to make. Only one review presented 95% confidence intervals for an I^{2} estimate.^{12} The authors concluded that “we could not observe significant heterogeneity.” Indeed the Q statistic had P=0.19. However, with only five studies, the power to detect heterogeneity was negligible. The I^{2} statistic was 35% and the 95% confidence interval ranged from 0% (no heterogeneity) to 76% (high heterogeneity).
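The reported figures can be cross-checked by inverting the definition of I^{2}. The sketch below is only an approximate consistency check, because the published I^{2} values are rounded. It recovers Q from I^{2} and the number of studies, then computes the P value for Q from the χ^{2} distribution using a closed form that holds for even degrees of freedom. The function names are our own illustrative choices.

```python
import math


def q_from_i2(i2, n_studies):
    """Invert I2 = (Q - df) / Q to recover Q; meaningful only for 0 < I2 < 1."""
    if not 0.0 < i2 < 1.0:
        raise ValueError("I2 must lie strictly between 0 and 1 to recover Q")
    df = n_studies - 1
    return df / (1.0 - i2)


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-squared variable (even df only).

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# Review with five studies and I2 = 35% (df = 4).
q5 = q_from_i2(0.35, 5)
print(f"Q ~ {q5:.2f}, P ~ {chi2_sf_even_df(q5, 4):.2f}")

# Review with 15 studies and I2 = 69% (df = 14).
q15 = q_from_i2(0.69, 15)
print(f"Q ~ {q15:.1f}, P < 0.001: {chi2_sf_even_df(q15, 14) < 0.001}")
```

For the five study review this gives Q of about 6.15 and P of about 0.19, in line with the reported value. For the 15 study review the P value falls well below 0.001.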

## Uncertainty in I^{2}: large scale survey of meta-analyses

This limitation is not confined to the selected examples presented here—it is probably the rule rather than the exception. We used two large datasets of meta-analyses to evaluate empirically the extent of uncertainty in I^{2} estimates. Firstly, we looked at meta-analyses of the *Cochrane Database of Systematic Reviews* (Issue 4, 2005) that had four or more synthesised studies and binary outcomes. Because each Cochrane review may include several meta-analyses, we looked only at the one with the highest number of studies; in the case of ties, we used the one with the largest sample size. We did not look at meta-analyses of two or three studies. Such studies form a sizeable proportion of the Cochrane Library,^{19} but their 95% confidence intervals of I^{2} almost always span a wide range of heterogeneity, unless the studies are large and they give very different results. In total, we calculated the I^{2} statistic and its 95% confidence intervals for 1011 meta-analyses. The second dataset was a previously described database of 50 meta-analyses of gene-disease associations that had found a nominally statistically significant effect (P<0.05) for the proposed genetic risk factors.^{20}

Figure 1 shows the upper and lower 95% confidence intervals of I^{2} for the two sets of meta-analyses. The pattern is similar. Of the meta-analyses where I^{2} is ≤25% (low heterogeneity), 83% of the Cochrane meta-analyses and 73% of the genetic risk factor meta-analyses have upper 95% confidence intervals that cross into the range of large heterogeneity (I^{2} ≥50%). Of the meta-analyses where I^{2} is ≥50% (large heterogeneity), 67% of the Cochrane meta-analyses and 52% of the genetic risk factor meta-analyses have lower 95% confidence intervals that cross into the range of low heterogeneity (I^{2} ≤25%).
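The proportions quoted above amount to a simple tabulation. As a hedged sketch, assume each meta-analysis is summarised as a tuple of I^{2} and its lower and upper 95% confidence limits, all as fractions. The function name and the seven records are invented for illustration and are not taken from either dataset.

```python
def crossing_proportions(results, low=0.25, high=0.50):
    """Summarise how often intervals cross into the opposite heterogeneity range.

    results: iterable of (i2, ci_lower, ci_upper), each a fraction in [0, 1].
    Returns a pair:
      - share of analyses with I2 <= low whose upper limit is >= high
      - share of analyses with I2 >= high whose lower limit is <= low
    A share is None when the corresponding group is empty.
    """
    results = list(results)
    low_group = [r for r in results if r[0] <= low]
    high_group = [r for r in results if r[0] >= high]
    low_crossing = (
        sum(1 for _, _, upper in low_group if upper >= high) / len(low_group)
        if low_group else None
    )
    high_crossing = (
        sum(1 for _, lower, _ in high_group if lower <= low) / len(high_group)
        if high_group else None
    )
    return low_crossing, high_crossing


# Invented records, for illustration only: (I2, lower limit, upper limit).
made_up = [
    (0.00, 0.00, 0.60),
    (0.10, 0.00, 0.45),
    (0.20, 0.00, 0.70),
    (0.40, 0.00, 0.75),
    (0.55, 0.10, 0.80),
    (0.65, 0.40, 0.80),
    (0.80, 0.60, 0.90),
]
low_share, high_share = crossing_proportions(made_up)
print(f"low I2 with upper limit >= 50%: {low_share:.0%}")
print(f"high I2 with lower limit <= 25%: {high_share:.0%}")
```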

**Fig 1** Confidence intervals for estimated I^{2} in 1011 Cochrane meta-analyses and 50 meta-analyses of genetic risk factors. The median number of studies was 7 (interquartile range 5-11) and 20 (13-26), respectively, and the median total sample size was 1112 **...**

Meta-analyses where I^{2} is estimated at 0% are affected by an especially important misconception. Many reviews interpret this as absence of heterogeneity, but the upper 95% confidence interval may be substantial (as in the Kawasaki example discussed above^{9}). Figure 2 shows the uncertainty for the upper 95% confidence interval of I^{2} for the two sets of meta-analyses, limited to those with I^{2}=0% (n=373 for Cochrane reviews, n=12 genetic studies). The upper 95% confidence interval exceeds 33% in all these meta-analyses. For 81% of the meta-analyses with I^{2}=0%, the upper 95% confidence interval is 50% or higher. Because of the way that research is currently reported, considerable heterogeneity between studies cannot be excluded with confidence in most meta-analyses. Some heterogeneity between studies is probably present in most meta-analyses. Claims for homogeneity may sometimes be stronger than the evidence allows. Trusting a non-significant P value for the Q statistic and an I^{2} estimate of 0% may sometimes lead to spurious certainty about the comparability and similarity of study results.

**Fig 2** Proportion of meta-analyses with estimated I^{2}=0% whose upper 95% confidence interval of I^{2} is lower than a given value

## Technical aspects

The confidence interval of I^{2} can be calculated by several methods.^{5} Two methods, a test based approach and a non-central χ^{2} based approach, have been implemented in Stata (heterogi module). The performance of these two methods is comparable, although the test based approach often gives lower values for both the lower and the upper confidence limits, so the non-central χ^{2} based approach may be preferable.
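For readers without Stata, a test based interval can be sketched in a few lines. The version below assumes the standard error formulas for ln H (where H^{2}=Q/df) as we understand them from Higgins and Thompson,^{5} with one formula for Q greater than the number of studies k and another otherwise. It is not the heterogi code, the function name is our own, and its results need not match intervals obtained with the non-central χ^{2} approach.

```python
import math
from statistics import NormalDist


def i2_test_based_ci(q, n_studies, level=0.95):
    """Return (I2, lower, upper) as fractions, using a test based interval.

    Assumed formulas (our reading of Higgins and Thompson, 2002):
      SE[ln H] = 0.5 * (ln Q - ln(k-1)) / (sqrt(2Q) - sqrt(2k-3))    if Q > k
      SE[ln H] = sqrt((1 / (2(k-2))) * (1 - 1 / (3(k-2)^2)))         if Q <= k
    The interval for ln H is converted back with I2 = 1 - 1/H^2, floored at 0.
    """
    k = n_studies
    if k < 3:
        raise ValueError("this sketch needs at least three studies")
    if q < 0:
        raise ValueError("Q cannot be negative")
    df = k - 1
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    if q > k:
        se = 0.5 * (math.log(q) - math.log(df)) / (
            math.sqrt(2 * q) - math.sqrt(2 * k - 3)
        )
    else:
        se = math.sqrt((1.0 / (2 * (k - 2))) * (1.0 - 1.0 / (3 * (k - 2) ** 2)))
    ln_h = 0.5 * math.log(q / df) if q > 0 else float("-inf")

    def i2_from_ln_h(value):
        # H^2 <= 1 corresponds to no excess variation, so I2 is set to 0.
        h_squared = math.exp(2 * value)
        return 1.0 - 1.0 / h_squared if h_squared > 1.0 else 0.0

    point = max(0.0, (q - df) / q) if q > 0 else 0.0
    return point, i2_from_ln_h(ln_h - z * se), i2_from_ln_h(ln_h + z * se)


# Q of about 6.15 on five studies corresponds to I2 near 35%.
point, lower, upper = i2_test_based_ci(6.15, 5)
print(f"I2={point:.0%}, 95% CI {lower:.0%} to {upper:.0%}")
```

Under these assumptions, a Q of about 6.15 from five studies yields an interval of roughly 0% to 76%, which spans the range from no heterogeneity to high heterogeneity.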

## Concluding comments

All statistical tests for heterogeneity are weak, including I^{2}. The clinical implications of this are considerable and must be examined on a case by case basis. Putting too much trust in homogeneity of effects may give a false sense of reassurance that one size fits all. Lack of evidence of heterogeneity is not evidence of homogeneity. Conversely, putting too much trust in the presence of heterogeneity of effects may lead to spurious subgroup and exploratory analyses. Given that I^{2} is not precise, 95% confidence intervals should always be given.

## Notes

Contributors and sources: JPAI has a long standing interest in meta-analyses and heterogeneity and had the original idea for this article. NAP and EE collected the data. NAP performed statistical analyses with help from JPAI and EE. JPAI wrote the manuscript and NAP and EE commented on it. JPAI is guarantor.

Competing interests: None declared.

Provenance and peer review: Not commissioned; externally peer reviewed.

## References

1. Lau J, Ioannidis JPA, Schmid CH. Summing up evidence: one answer is not always enough. Lancet 1998;351:123-7.

2. Sutton A, Abrams K, Jones D, Sheldon T, Song F. Methods for meta-analysis in medical research. Chichester: Wiley, 2000.

3. Petitti DB. Approaches to heterogeneity in meta-analysis. Stat Med 2001;20:3625-33.

4. Cochran WG. The combination of estimates from different experiments. Biometrics 1954;10:101-29.

5. Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med 2002;21:1539-58.

6. Higgins JPT, Thompson SG, Deeks J, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003;327:557-60.

7. Mittlbock M, Heinzl H. A simulation study comparing properties of heterogeneity measures in meta-analyses. Stat Med 2006;25:4321-33.

8. Huedo-Medina TB, Sánchez-Meca F, Marín-Martínez F, Botella J. Assessing heterogeneity in meta-analysis: Q statistic or I^{2} index? Psychol Methods.

9. Wooditch AC, Aronoff SC. Effect of initial corticosteroid therapy on coronary artery aneurysm formation in Kawasaki disease: a meta-analysis of 862 children. Pediatrics 2005;116:989-95.

10. Newburger JW, Sleeper LA, McCrindle BW, Minich LL, Gersony W, Vetter VL, et al; Pediatric Heart Network Investigators. Randomized trial of pulsed corticosteroid therapy for primary treatment of Kawasaki disease. N Engl J Med 2007;356:663-75.

11. Inoue Y, Okada Y, Shinohara M, Kobayashi T, Kobayashi T, Tomomasa T, et al. A multicenter prospective randomized trial of corticosteroids in primary therapy for Kawasaki disease: clinical course and coronary artery outcome. J Pediatr 2006;149:336-41.

12. Maier PC, Funk J, Schwarzer G, Antes G, Falck-Ytter YT. Treatment of ocular hypertension and open angle glaucoma: meta-analysis of randomised controlled trials. BMJ 2005;331:134.

13. Dennis CL. Psychosocial and psychological interventions for prevention of postnatal depression: systematic review. BMJ 2005;331:15.

14. Devereaux PJ, Beattie WS, Choi PT, Badner NH, Guyatt GH, Villar JC, et al. How strong is the evidence for the use of perioperative beta blockers in non-cardiac surgery? Systematic review and meta-analysis of randomised controlled trials. BMJ 2005;331:313-21.

15. Taylor SJ, Candy B, Bryar RM, Ramsay J, Vrijhoef HJ, Esmond G, et al. Effectiveness of innovations in nurse led chronic disease management for patients with chronic obstructive pulmonary disease: systematic review of evidence. BMJ 2005;331:485.

16. Webster AC, Woodroffe RC, Taylor RS, Chapman JR, Craig JC. Tacrolimus versus ciclosporin as primary immunosuppression for kidney transplant recipients: meta-analysis and meta-regression of randomised trial data. BMJ 2005;331:810.

17. McDonald MA, Simpson SH, Ezekowitz JA, Gyenes G, Tsuyuki RT. Angiotensin receptor blockers and risk of myocardial infarction: systematic review. BMJ 2005;331:873.

18. Glass J, Lanctot KL, Herrmann N, Sproule BA, Busto UE. Sedative hypnotics in older people with insomnia: meta-analysis of risks and benefits. BMJ 2005;331:1169.

19. Ioannidis JP, Trikalinos TA, Zintzaras E. Extreme between-study homogeneity in meta-analyses could offer useful insights. J Clin Epidemiol 2006;59:1023-32.

20. Ioannidis JP, Trikalinos TA, Khoury MJ. Implications of small effect sizes of individual genetic variants on the design and interpretation of genetic association studies of complex diseases. Am J Epidemiol 2006;164:609-14.