
NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

West SL, Gartlehner G, Mansfield AJ, et al. Comparative Effectiveness Review Methods: Clinical Heterogeneity [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2010 Sep.



As noted in Chapter 1, we address here the five (of six) key questions (KQs) for which we had some empirical information relating to systematic reviews (SRs), comparative effectiveness reviews (CERs), or meta-analyses. Table 2 listed the full set of KQs. We provide summary tables of primary findings here; the three evidence tables pertaining to these KQs can be found in Appendix C. Chapter 2 described our various reviews of the literature for different KQs and documented the yields from those searches. It also explained how we conducted the key informant interviews.

KQ 1. What Is Clinical Heterogeneity?

The focus of this report is on best practices for addressing clinical heterogeneity within SRs. Ideally, SRs would provide summary effect estimates that differentiate patients who would benefit from an intervention from those who would not benefit or who might be harmed, allowing clinicians to tailor treatment to their patients. Thus, clinical heterogeneity should be valued because it helps inform patient care. The Cochrane Handbook for Systematic Reviews of Interventions20 defines heterogeneity as “any kind of variability among studies in a systematic review,” but defines clinical heterogeneity as variability in the participants, interventions, and outcomes studied.

The term “heterogeneity” as used in the epidemiology literature and assessed in clinical studies refers to an intervention-disease association that differs according to the level of a factor under investigation. The term “effect-measure modification” is often used to clarify that heterogeneity can be observed on the relative scale, the absolute scale, neither, or both, and may be present on one scale but not the other (hence, it is the specific effect measure where the heterogeneity is observed).

The presence of effect-measure modification may suggest a biologic (or etiologic) effect of a factor upon the intervention-disease relationship, or it may reflect one or more biases. A factor can modify an effect measure for the intervention-disease relationship whether or not baseline rates of the disease vary among factor subgroups. It is important to note, however, that baseline rates may vary within subgroups of a factor whether or not effect-measure modification is observed on any scale: whether a given factor modifies baseline risk of disease is unrelated to whether (and in what direction or to what degree) it modifies the effect of a particular treatment on that disease. Many different clinical factors can be evaluated as influencing the intervention-disease association (i.e., as modifiers of one or more effect measures), including demographics (age, sex, race/ethnicity), severity of disease, disease risk factors, coexisting diseases, and cointerventions. Many, but not all, such factors influence baseline rates of disease as well.
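The scale dependence described above can be illustrated with a small Python sketch (all numbers hypothetical): a treatment with a constant risk ratio across two subgroups still yields different absolute benefits when the subgroups' baseline risks differ.

```python
# Hypothetical illustration: a treatment halves risk in every subgroup
# (constant risk ratio), yet the absolute benefit differs because the
# subgroups have different baseline risks.

def effect_measures(baseline_risk, risk_ratio):
    treated_risk = baseline_risk * risk_ratio
    risk_difference = baseline_risk - treated_risk
    return treated_risk, risk_difference

for label, baseline in [("low-risk subgroup", 0.10), ("high-risk subgroup", 0.40)]:
    treated, rd = effect_measures(baseline, risk_ratio=0.5)
    print(f"{label}: baseline {baseline:.2f}, treated {treated:.2f}, "
          f"risk difference {rd:.2f}")

# The risk ratio is homogeneous (0.5 in both subgroups) while the risk
# difference is heterogeneous (0.05 vs. 0.20): effect-measure
# modification on the absolute scale but not on the relative scale.
```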

Ideally, expert advice and the prior literature should be used during the protocol development stage to identify factors that may impact the heterogeneity of treatment effects. Often, however, subgroup analyses in trials either are not defined a priori or are done inadequately, leading to false-positive findings because of multiple statistical tests having been conducted or false-negative findings because of lack of power.21

Nevertheless, clinical knowledge is constantly evolving, and the impact of heterogeneity on treatment effects may be unknown at the design stage of a trial. Post hoc subgroup analyses, therefore, have an important role in research but should be viewed as hypothesis generating and not as an assessment of an associative relationship. The strength of SRs and CERs in regard to heterogeneity is that, because they review multiple studies on the same intervention, they offer a new opportunity to explore reasons for varying study results.5

Evaluating whether there is heterogeneity of the treatment effect in an SR or CER is one of the first steps in an analysis because it is linked to the effect being studied. The fact that clinical heterogeneity is present is a finding to be reported because it helps identify who benefits the most, who benefits the least, and who has the greatest risk of experiencing adverse outcomes. These are central concerns for most users of SRs and CERs because clinicians do not treat “average” patients; they want to know the extent to which a test or treatment might benefit the next patient they see. Thus, more information on the treatment effect across diverse groups of patients may assist clinicians’ work and improve the quality of care they can render.

KQ 1a. Definitions of Clinical Heterogeneity by Various Groups

To provide an overview of approaches and definitions of various international institutions, we reviewed the methods manuals from nine organizations or public-sector agencies that produce SRs (or clinical practice guidelines in which SRs are embedded). They are located in the United States (three organizations) and abroad (two from the United Kingdom, one each from Germany and Australia, and two European or global enterprises).

We summarize here the range of definitions and recommendations about clinical heterogeneity found in their methods manuals.

Of the nine methods manuals reviewed, only five—AHRQ’s EPC program, CRD, Cochrane Collaboration, DERP, and EUnetHTA—provided explicit definitions of clinical heterogeneity.20,22,23,25,29 AHRQ, Cochrane, and CRD used the term “clinical diversity” rather than “clinical heterogeneity.” Their manuals define “clinical diversity” as variability in study population characteristics, interventions, and outcomes; “differential response to an intervention” is another way of referring to differences in treatment effects on specific outcome measures when the underlying effect truly differs by one of these factors. Table 6 lists the main definitions from these five organizations.

Table 6. Definitions of clinical heterogeneity by five organizations.

Table 6

Definitions of clinical heterogeneity by five organizations.

Cochrane Collaboration definition. The Cochrane manual provides the most detailed discussion of what we are referring to as “clinical heterogeneity,” which it defines as “the variability in the participants, interventions, and outcomes studied.” Variability in participants includes personal characteristics (e.g., age, sex, ethnicity), disease severity and progression, varying baseline risks of experiencing certain events, coexisting conditions, past treatments, and other factors. Differing interventions refer to varying dosages, cointerventions, and surgical techniques; the concept also encompasses differing control interventions (e.g., placebo or active controls).

Cochrane clearly distinguishes clinical heterogeneity from methodological heterogeneity by defining methodological heterogeneity as “the variability in study designs and risk of bias.”30 With respect to methodological heterogeneity, the Cochrane manual contends that differences in methodological factors such as adequate randomization, allocation concealment, and use of blinding among studies will lead to differences in observed treatment effects. Such differences, however, do not necessarily indicate that the true intervention effect varies. Empiric studies have shown that poor study design can lead to an overestimation of the magnitude of the effect.31,32

In short, true clinical heterogeneity exists when patient-level factors—most commonly variables related to patient characteristics, comorbidities, and accompanying treatment—may influence or modify the magnitude of the treatment effect.

AHRQ and CRD definition. Like Cochrane, CRD and AHRQ acknowledge that some variation in treatment effects among studies always arises by chance alone. However, if clinical heterogeneity influences the estimated intervention effect beyond what is expected by chance alone, then clinical heterogeneity becomes important. The AHRQ EPC Methods Guide lists common examples of factors contributing to clinical heterogeneity: age, sex, disease severity, site of lesion, evolving diagnostic criteria, changes in standard care, time-dependent care, differences in baseline risks, and dose-dependent effects.

Other organizations. The DERP and EUnetHTA manuals use definitions similar to those of AHRQ, CRD, and Cochrane, but DERP also uses the term “qualitative heterogeneity.” No other manuals explicitly define clinical heterogeneity; their chapters about heterogeneity deal primarily with statistical heterogeneity and its consequences for meta-analyses.

“Restriction” as a related concept. Cochrane and CRD both caution against “restriction” (i.e., constraining enrollment of subjects, study settings, or what measures to use) as a way of addressing clinical heterogeneity because, they argue, doing so limits the applicability (see below) of the information to patient populations with the condition of interest. Any restrictions with respect to specific population characteristics should be based on a sound rationale.

No other manual addresses restriction. However, “applicability” is sometimes considered a further related concept. Applicability, as related to evidence-based practice, can be thought of as generalizability or external validity of the evidence in an SR or CER; it concerns whether information can be said to pertain directly to a broad selection of patient populations, outcomes, settings, and so forth. The AHRQ EPC Methods Guide22 does address questions of applicability as a characteristic of bodies of evidence. A recent publication on grading the strength of evidence also discusses applicability.33 We provide more information on applicability in SRs in the discussion section (Chapter 4).

KQ 1b. Distinctions Between Clinical and Statistical Heterogeneity

In contrast to how clinical heterogeneity was defined for KQ 1a, statistical heterogeneity refers to variability in the observed treatment effects beyond what would be expected by random error (chance). Assessing statistical heterogeneity involves testing, at a chosen significance level, the null hypothesis that the studies share a common treatment effect. Clinical heterogeneity can result in statistical heterogeneity.6

Authors of SRs have to put forward convincing arguments that clinical heterogeneity did or did not occur when evaluating an outcome of interest in a given review. When clinical heterogeneity is detected, the onus is on the author to determine whether this finding is clinically relevant. In other words, systematic reviewers have to determine whether differences in population characteristics among studies can lead to clinical heterogeneity that could change clinical decisions.

For example, in a CER on treatments for rheumatoid arthritis (RA), the relative benefit of biologic treatments over methotrexate was smaller in patients with early RA than in patients who had long-lasting RA and had failed to respond to other disease-modifying antirheumatic drugs (DMARDs).12 Although study protocols, drug dosages, follow-up periods, and methodological rigor were very similar between the two sets of trials, the differing stages of RA in the study populations may have produced a substantial variation in the magnitude of treatment effects.

Such an assessment is not always so straightforward. Historically, the impact of clinical heterogeneity has been both under- and overestimated, based on flawed subgroup analyses or anecdotal clinical evidence.21 Exploring the impact of clinical heterogeneity in SRs, therefore, has to involve both clinical understanding and formal statistical tests.

Investigating the extent of among-study variation in results is an important part of any SR. Results of a careful assessment provide the foundation from which one of two clinically important conclusions can be drawn:

  1. Treatment effects are similar statistically despite clinical heterogeneity. Such a finding is an important corroboration of the applicability of study results to more diverse clinical populations.
  2. Treatment effects exhibit variation beyond what would be expected by chance alone as indicated by a statistical test. Such a result requires careful investigation of the reasons for and the magnitude of the variation of treatment effects. Findings from such an investigation might then dictate choice of the statistical model for meta-analyses, employment of sensitivity analyses to determine the effect of the variation on the overall pooled estimate, subgrouping of studies to estimate separate pooled estimates by subgroup, or a decision to forego any meta-analysis that pools data inferentially across studies.

Formal statistical methods to assess heterogeneity. Inevitably, even with thoughtfully defined eligibility criteria and well-formulated, focused KQs, studies included in SRs will differ in various ways and will exhibit some variation in treatment effects. This is to be expected, by chance alone (random error). The underlying rationale of statistical tests to assess heterogeneity is to investigate whether existing variations in treatment effects go beyond what would be expected by chance fluctuations alone.

Various statistical methods exist to determine and quantify the degree of variation. Commonly used approaches are Cochran’s Q test,34 the I2 index,35 and meta-regression.10 Table 7 summarizes common statistical approaches to test for heterogeneity.
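As a rough illustration of the first two approaches, the following Python sketch computes Cochran's Q and the I2 index from inverse-variance weights. All effect estimates and variances here are made up for the example; a real analysis would use a dedicated meta-analysis package.

```python
# Illustrative sketch (hypothetical data): Cochran's Q and the I^2 index
# for k studies, each supplying an effect estimate (e.g., a log odds
# ratio) and its within-study variance.

effects   = [0.10, 0.80, 0.20, 1.20, 0.15]   # assumed log odds ratios
variances = [0.04, 0.09, 0.05, 0.16, 0.06]   # assumed within-study variances

weights = [1 / v for v in variances]          # inverse-variance weights
pooled  = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Cochran's Q: weighted sum of squared deviations from the pooled effect
Q  = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
df = len(effects) - 1

# I^2: percentage of total variation attributable to heterogeneity
# rather than chance (truncated at zero)
I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0

print(f"Q = {Q:.2f} on {df} df, I^2 = {I2:.1f}%")
```

With these hypothetical inputs, Q exceeds its degrees of freedom, suggesting variation beyond chance; the I2 statistic expresses the proportion of that total variation attributable to heterogeneity.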

Table 7. Summary of common statistical approaches to test for heterogeneity.

Table 7

Summary of common statistical approaches to test for heterogeneity.

Relationship between clinical and statistical heterogeneity. In SRs, clinical and methodological heterogeneity across studies is often present, regardless of whether treatment effect is measured on the relative or absolute scale.39 Also, it is possible for one effect measure to be homogeneous and another to be heterogeneous. More problematic is that this heterogeneity is not always measured in its full detail because of incomplete descriptions of intervention protocols, populations, and outcomes. Moreover, it can, but does not always, result in detectable statistical heterogeneity (i.e., variation in treatment effect beyond that expected by chance alone). Thus, an overall test of heterogeneity may be nonsignificant but a specific aspect of the study populations may be significantly associated with study findings.

Clinical and statistical heterogeneity are closely intertwined, but the relationship is not one-to-one: high clinical heterogeneity does not always produce statistical heterogeneity, and statistical heterogeneity can be caused by methodological heterogeneity, clinical heterogeneity, or both.

Common reasons for statistical heterogeneity include the following:

  1. Methodological heterogeneity. This can refer to variability in study design, study conduct, outcome measures, and study quality (internal validity). It concerns differences in methodological quality that lead to variations in bias. Empiric studies have shown that high risk of bias often leads to an overestimation of the magnitude of the effect. Such methodological issues could include problems with randomization, allocation concealment, drop-out rates, or statistical analyses (e.g., intention-to-treat vs. per-protocol analyses).31
  2. Chance. Individual studies, particularly studies with small sample sizes or low event rates, can exhibit extreme results based simply on chance. Such outliers can cause statistical heterogeneity.
  3. Biases. In addition to biases that threaten the validity of individual studies and that are captured under methodological heterogeneity, various other biases, including funding and reporting (publication) biases, may cause variability in treatment effects estimated across studies.6,40 For example, small trials with nonsignificant findings have a higher risk of remaining unpublished than small trials showing significant (or very large) effects.
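The role of chance in item 2 can be illustrated with a small simulation (hypothetical setup): every trial estimates the same true event risk, so there is no real heterogeneity at all, yet small trials scatter far more widely than large ones and can produce apparent outliers.

```python
import random

# Hypothetical simulation: all trials share one true event risk, so any
# spread among their estimates is pure chance. Small trials scatter much
# more than large ones, so a small-trial outlier can create apparent
# statistical heterogeneity by chance alone.

random.seed(1)
TRUE_RISK = 0.30  # assumed event risk, identical in every trial

def simulated_risk(n_patients):
    events = sum(random.random() < TRUE_RISK for _ in range(n_patients))
    return events / n_patients

small = [simulated_risk(20) for _ in range(1000)]     # 1,000 small trials
large = [simulated_risk(1000) for _ in range(1000)]   # 1,000 large trials

def spread(estimates):
    return max(estimates) - min(estimates)

print(f"range of estimates, n=20:    {spread(small):.3f}")
print(f"range of estimates, n=1000:  {spread(large):.3f}")
```

The small trials' estimates range widely around the single true risk, while the large trials cluster tightly; none of the variation reflects clinical heterogeneity.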

Consequently, for systematic reviewers assessing heterogeneity, the relationship between clinical and statistical heterogeneity is not always straightforward. Table 8 outlines the different relations between clinical and statistical heterogeneity under the assumption that random error, methodological heterogeneity, and biases do not play a role. When both clinical and statistical heterogeneity are present, the reviewers must consider whether the differences in treatment effect may be due to clinical variability or methodological characteristics. Thus, in some cases, reviewers have to pay close attention to the methods of each study.

Table 8. Summary of relationships between clinical and statistical heterogeneity.

Table 8

Summary of relationships between clinical and statistical heterogeneity.

The “possible underlying situation” (right column) explains what inferences might be drawn and whether reviewers need to examine the situation further. Figures 1, 2, and 3 illustrate different underlying situations graphically.

Figure 1. Clinical heterogeneity is present but has a minimal impact on the treatment effect.

Figure 1

Clinical heterogeneity is present but has a minimal impact on the treatment effect.

Figure 2. Clinical heterogeneity is present but the relevance of the impact has to be determined on clinical grounds.

Figure 2

Clinical heterogeneity is present but the relevance of the impact has to be determined on clinical grounds.

Figure 3. Clinical heterogeneity is present and leads to a clinically relevant impact on the treatment effect (reversed direction).

Figure 3

Clinical heterogeneity is present and leads to a clinically relevant impact on the treatment effect (reversed direction).

Exploration of statistical heterogeneity. As outlined in Table 8, statistical heterogeneity can be present with or without clinical heterogeneity and can be caused by reasons other than clinical heterogeneity. SR authors might be tempted to overinterpret apparent relationships between statistical heterogeneity and clinical variations based on results at hand. Particularly when findings are caused by chance, searching for causes can be misleading.6 The problem is similar to that of subgroup analyses.21 Therefore, systematic reviewers must carefully and cautiously explore the reasons for statistical heterogeneity and view results as exploratory rather than causal.

False conclusions about clinical heterogeneity based on statistical heterogeneity can be summarized in two ways:

  1. False-positive conclusion (type I error). The presence of statistical heterogeneity is attributed to clinical differences rather than random variation or confounding.
  2. False-negative conclusion (type II error). The presence of statistical heterogeneity is attributed to other factors such as methodological heterogeneity or chance because no clinical heterogeneity is apparent. In reality, unidentified or “not obvious” factors cause variability of the treatment effect. A not-obvious factor might involve items that are important but not measured (or not easily measurable), such as socioeconomic status or genetic makeup.

Generally, reviewers use one or more of three common approaches to explore heterogeneity:

Stratified analyses of homogeneous subgroups. We distinguish between subgroup analysis and sensitivity analysis using the Cochrane Collaboration Glossary of terms as our basis (http://www.cochrane.org/resources/glossary.htm). Subgroup analysis is an “analysis in which the intervention effect is evaluated in a subset” of particular study participants, or one defined by study characteristics. For example, the subgroup might be defined by sex (men vs. women) or by study location (urban vs. rural setting). Subgroup analyses tend to be defined a priori, that is, as part of the study protocol.

The Cochrane manual advises that subgroup analysis should be tested via an interaction test, not by comparing P-values.20
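A minimal sketch of such an interaction test, using hypothetical subgroup estimates on the log odds ratio scale: the test asks whether the two subgroup effects differ from each other, rather than comparing their separate P-values.

```python
import math

# Sketch of an interaction (test-for-subgroup-differences) calculation
# with hypothetical numbers: rather than noting that one subgroup's
# effect is "significant" and the other's is not, test whether the two
# subgroup estimates differ from each other.

# Assumed subgroup summary estimates on the log odds ratio scale
log_or_men,   se_men   = -0.50, 0.20
log_or_women, se_women = -0.10, 0.25

diff    = log_or_men - log_or_women
se_diff = math.sqrt(se_men ** 2 + se_women ** 2)
z       = diff / se_diff

# Two-sided P-value from the standard normal distribution
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"interaction z = {z:.2f}, P = {p:.2f}")
```

Here the men's estimate looks much larger than the women's, yet the interaction test gives P above 0.2, illustrating why eyeballing separate subgroup P-values can mislead.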

Meta-regression. These types of analyses, as discussed earlier, enable investigators to explore sources of heterogeneity in terms of study-level covariates. They must be done with due attention to potential pitfalls and challenges, however.

Sensitivity analyses. Sensitivity analysis is defined as analysis used to “assess how robust the results are to assumptions about the data and the methods that were used.” Generally such analyses are post hoc, that is, during the analysis phase of the study. For example, a sensitivity analysis might be conducted to determine if changing study inclusion/exclusion criteria changes the conclusions substantially, or to assess if methods for imputing missing data impact the results. Due to its post hoc nature, sensitivity analysis should be considered exploratory, not confirmatory. Both subgroup and sensitivity analyses are constrained practically and inferentially in terms of the availability of studies and sample size. Both may be subject to the challenge of multiple comparisons.
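One common form of sensitivity analysis is leave-one-out re-pooling: recompute the summary estimate with each study removed in turn to see whether any single study drives the overall result. A minimal fixed-effect sketch with hypothetical study data:

```python
# Hypothetical leave-one-out sensitivity analysis: re-pool the
# fixed-effect (inverse-variance) estimate with each study removed in
# turn, to see whether any single study drives the overall result.

effects   = [0.10, 0.80, 0.20, 1.20, 0.15]   # assumed study estimates
variances = [0.04, 0.09, 0.05, 0.16, 0.06]   # assumed variances

def pooled(effs, vars_):
    weights = [1 / v for v in vars_]
    return sum(w * e for w, e in zip(weights, effs)) / sum(weights)

overall = pooled(effects, variances)
print(f"all studies: {overall:.3f}")
for i in range(len(effects)):
    effs  = effects[:i] + effects[i + 1:]
    vars_ = variances[:i] + variances[i + 1:]
    print(f"without study {i + 1}: {pooled(effs, vars_):.3f}")
```

If dropping one study shifts the pooled estimate substantially, the conclusions rest heavily on that study and should be interpreted cautiously.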

KQ 1c. Clinical Heterogeneity and Other Issues in the AHRQ Methods Manual

For systematic reviewers, especially those doing CERs in the context of guidance from AHRQ through the Evidence-based Practice Center (EPC) program’s Methods Guide,23 identifying potential effect-modifying clinical characteristics is important from the planning stages of the review to the synthesis of the evidence. Specifically, systematic reviewers should consider which factors may be associated with effect-measure modification at all stages of the review: from framing the KQs, through protocol development when inclusion and exclusion criteria are determined, in the development of the abstraction forms, analysis, and finally, when the data are summarized either qualitatively or quantitatively. However, assessing heterogeneity is not an explicit part of the workplan template that EPCs are presently expected to follow.

As outlined above, clinical heterogeneity can be the cause of statistical heterogeneity. Systematic reviewers who consider combining studies statistically must explore existing statistical heterogeneity. If clinical heterogeneity is suspected to be the cause of statistical heterogeneity, researchers might abstain from meta-analyses because populations across different trials might be too different to be combined in a meaningful meta-analysis. Even if statistical heterogeneity does not appear to be present, suspicion of clinical heterogeneity may be cause to limit meta-analysis. The distinction will be important to clinicians. If clinical heterogeneity is confirmed, it may change clinical decisionmaking with individual patients.

As mentioned earlier, clinical heterogeneity is also closely related to a broader issue of SRs and CERs: namely, the assessment of the applicability of findings and conclusions. “Applicability” has been defined as inferences about the extent to which a causal relationship holds over variations in persons, settings, treatments, and outcomes.41 For many audiences in the broader world of health services research, policy research, or quality improvement and patient safety evaluations, this concept is often equated with generalizability or external validity.

Deciding to whom findings of SRs apply requires a close understanding of which patient groups benefit the most and which the least from a given medical intervention. Any specific intervention is unlikely to benefit everyone equally, even with a statistically significant and clinically relevant overall treatment effect. A hypothetical intervention with a number needed to treat (NNT) of 3 to achieve a beneficial outcome would be considered highly effective.42 Nevertheless, in this scenario, two of three treated patients would not experience any benefit from the intervention. Moreover, they might even experience harm from the treatment with no gain or benefit. Identifying those who benefit the most or the least therefore provides important information; with it, clinicians can appropriately tailor treatments to individuals.
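The arithmetic behind the NNT example can be made explicit. The NNT is the reciprocal of the absolute risk reduction (ARR); the event rates below are hypothetical, chosen so the NNT comes out near the report's example of 3.

```python
# The number needed to treat (NNT) is the reciprocal of the absolute
# risk reduction (ARR). Hypothetical event rates chosen so the NNT
# comes out near 3.

control_event_rate = 0.60   # assumed risk of the outcome without treatment
treated_event_rate = 0.27   # assumed risk with treatment

arr = control_event_rate - treated_event_rate
nnt = 1 / arr
print(f"ARR = {arr:.2f}, NNT = {nnt:.1f}")

# An NNT of about 3 means roughly one in three treated patients is
# spared the outcome; the other two gain no benefit from treatment.
```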

In turn, being aware of treatments for which clinical heterogeneity is not a significant issue is also important. A common criticism of SRs is that they provide average results that are not applicable to individual patients with varying risks and prognostic factors. Identifying treatments that are not or are only minimally affected by clinical heterogeneity can lead to more rational use of interventions and help avoid both over- and undertreatment.

Input from experts and stakeholders is important to identify issues of clinical heterogeneity and to frame applicability issues.22 These experts can provide insights into typical health care practice. Numerous studies have reported important differences between patients enrolled in trials and those treated with the same condition in everyday practice.43–45

Whether such differences translate into varying treatment effects remains unclear in many areas. Some treatment- and condition-specific knowledge, however, can be gained from the exploration of clinical heterogeneity in SRs and CERs.

KQ 2. How Have Systematic Reviews Dealt with Clinical Heterogeneity in the Key Questions?

KQs 2a and b. Key Questions and Pre-Identified Subgroups

KQs 2a and 2b addressed how the various research groups dealt with clinical heterogeneity in their KQs (KQ 2a) and how they identified (a priori) population subgroups of interest (KQ 2b). We note below the distribution of reviews with respect to including demographic variables and addressing disease variables (i.e., disease stage, type, severity, or site) or similar clinical variables in KQs, as well as pre-identifying population subgroups based on clinical characteristics. Results are presented by each of the four research groups included in this study and the reviews identified from CRD’s DARE and HTA abstracts database. Of interest were the following 15 clinical conditions: breast cancer, lung cancer, prostate cancer, cesarean section, chronic kidney disease, chronic obstructive pulmonary disease (COPD), depression, dyspepsia, heart failure (including congestive heart failure), heavy menstrual bleeding, hypertension, irritable bowel syndrome (IBS), labor induction, myocardial infarction, and osteoarthritis. The reviews completed for each research group and selected from the DARE and HTA databases were listed in Table 4 (Chapter 2).

Agency for Healthcare Research and Quality (AHRQ). To address KQs 2a and 2b for AHRQ, we obtained their SRs for 11 medical conditions: breast cancer, lung cancer, prostate cancer, heart failure, cesarean section, COPD, depression, dyspepsia, hypertension, labor induction, and osteoarthritis (Table 9).46–56 No AHRQ SRs were available for chronic kidney disease, heavy menstrual bleeding, IBS, or myocardial infarction.

Table 9. AHRQ’s use of clinical heterogeneity in key questions.

Table 9

AHRQ’s use of clinical heterogeneity in key questions.

Cochrane Collaboration. To address KQs 2a and 2b for the Cochrane Collaboration, we obtained SRs from Cochrane for 14 medical conditions (39 reviews in all): breast cancer, lung cancer, prostate cancer, heart failure, cesarean delivery, chronic kidney disease, COPD, depression, heavy menstrual bleeding, hypertension, IBS, labor induction, myocardial infarction, and osteoarthritis (Table 10).14,16,18,57–92 No Cochrane SRs were available for dyspepsia.

Table 10. Cochrane Collaboration use of clinical heterogeneity in key questions.

Table 10

Cochrane Collaboration use of clinical heterogeneity in key questions.

Database of Abstracts of Reviews of Effects (DARE). To address KQs 2a and 2b for SRs located in CRD’s DARE database, we identified and obtained SRs for 12 medical conditions (37 reviews in all): breast cancer, lung cancer, prostate cancer, cesarean delivery, chronic kidney disease, COPD, depression, heart failure, hypertension, IBS, myocardial infarction, and osteoarthritis (Table 11).15,17,93–127 No SRs for dyspepsia, heavy menstrual bleeding, or labor induction were available from the DARE database.

Table 11. DARE use of clinical heterogeneity in key questions.

Table 11

DARE use of clinical heterogeneity in key questions.

Drug Effectiveness Review Project (DERP). To address KQs 2a and 2b for DERP, we obtained SRs for eight medical conditions: COPD, depression, dyspepsia, heart failure, hypertension, IBS, myocardial infarction, and osteoarthritis. We randomly selected 18 of their SRs; however, six reports were duplicates (among heart failure, hypertension, and myocardial infarction) (Table 12).128–139 No DERP SRs were available for breast cancer, lung cancer, prostate cancer, cesarean section, chronic kidney disease, heavy menstrual bleeding, or labor induction.

Table 12. DERP use of clinical heterogeneity in key questions.

Table 12

DERP use of clinical heterogeneity in key questions.

Health Technology Assessment Database from CRD. To address KQs 2a and 2b from the Health Technology Assessment (HTA) database, we obtained SRs for eight medical conditions: breast cancer, lung cancer, prostate cancer, depression, dyspepsia, hypertension, myocardial infarction, and osteoarthritis (Table 13).140–149 There were no SRs for cesarean section, chronic kidney disease, COPD, heart failure, heavy menstrual bleeding, IBS, or labor induction completed during the 2007–2009 time period.

Table 13. HTA use of clinical heterogeneity in key questions.

Table 13

HTA use of clinical heterogeneity in key questions.

National Institute for Health and Clinical Excellence (NICE). To address this question for NICE, we reviewed five SRs—one each on breast cancer, depression, dyspepsia, myocardial infarction, and osteoarthritis. We were able to identify NICE reviews for all of the conditions, but due to time and resource constraints, we were only able to focus on five SRs (Table 14).150–154

Table 14. NICE use of clinical heterogeneity in key questions.

Table 14

NICE use of clinical heterogeneity in key questions.

KQ 2c. “Best Practices” for Key Questions

We used the manuals reviewed for KQ 1 to address KQ 2c. In contrast with our discussion above for KQs 2a and 2b, which provides our findings by review group, we do not carry through with this format below.

Only the DERP manual recommends explicitly that investigators develop a KQ on patient subgroups. The context is the need to assess whether the comparative effectiveness or tolerability and safety of drugs vary in patient subgroups defined by demographics (age, racial groups, sex or gender, or similar factors), use of other medications, or presence of coexisting conditions. This advice is not couched in “scoping” terms.

Three groups appear to suggest one or another method of “scoping” KQs. AHRQ recommends one approach—namely, performing a preliminary search for relevant trials and the consultation of experts in the field. In this context AHRQ recommends that authors focus carefully on all aspects of the review questions to ensure that they specifically examine subgroups of interest in their review. CRD suggests considering factors that may be investigated for subgroup analysis, including participants’ age, sex or gender, socioeconomic status, ethnicity, and geographical area; disease severity; and presence of any comorbidities, before any KQs are stated. Finally, NICE recommends convening a scoping workshop before KQs are formulated to identify which patient or population subgroups should be specified (if any).

The Cochrane handbook does not explicitly discuss subgroups during the process of formulating the KQs; it deals with subgroup analysis in the data analysis chapter. Nevertheless, the Cochrane handbook does discuss restriction with respect to specific population characteristics or settings during the formulation of KQs; this might be regarded as a way to lay out the scope of the issues insofar as clinical heterogeneity is concerned. It specifically advises that authors should consider any relevant demographic factors and notes (as mentioned for KQ 1, above) that any restriction should be based on a sound rationale because restriction limits the applicability of SRs.

No other manual provides guidance on how to address clinical heterogeneity in KQs.

KQ 3. How Have Systematic Reviews Dealt With Clinical Heterogeneity in the Review Process?

For this KQ, we summarized recommendations from the guidance documents we abstracted for KQ 1 and provided best practices from these documents. Although this report focuses on addressing clinical heterogeneity in the KQs (KQ 2) and in the analysis phase (KQ 3), we did not find guidance documents, studies, or commentaries indicating that clinical heterogeneity must be considered at all stages of the review: from its inception (forming the KQs) through developing the inclusion and exclusion criteria, designing the abstraction form, and abstracting the information, to analyzing the findings and synthesizing the results.

Besides reviewing the guidance documents from KQ 1, we also identified whether AHRQ EPCs considered clinical heterogeneity during the analysis phase of their reviews.

KQ 3a. Recommendations from Guidance Documents

Agency for Healthcare Research and Quality. The AHRQ EPC Methods Guide recommends that biological or clinical factors that may influence the occurrence of clinical heterogeneity in the treatment effect be determined a priori based on previous reviews or expert opinion. Then, when framing the KQs for the review, the authors can develop the questions to include the factors contributing to clinical heterogeneity or suggest subgroup analyses to explore these factors in the analysis. With respect to handling clinical heterogeneity in analyses, the manual advises that, in deciding whether to combine studies using meta-analysis, authors should explain the issues they considered, including the range of differences in clinical factors they would consider acceptable for pooling. Any meta-analyses should include sensitivity analyses. The Methods Guide does not address restriction as a way of addressing clinical heterogeneity. Currently, assessing heterogeneity is not part of the work plan template.

Cochrane Collaboration and Centre for Reviews and Dissemination. The Cochrane Collaboration and CRD are the only institutions that provide guidance on how to assess clinical heterogeneity and how to deal with clinical heterogeneity in SRs. Both manuals recommend assessing the importance of clinical heterogeneity by visually exploring differences in the magnitudes of treatment effects as a first step. This approach requires plotting point estimates with confidence intervals on a common scale for each study. A forest plot, as used for meta-analysis, would probably be the most appropriate graph. Both institutions recommend investigating the overlap of confidence intervals. If confidence intervals do not overlap or overlap only to a small degree, more formal statistical methods (e.g., chi-square tests) should be considered.
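The visual first step described above—plotting each study's point estimate with its confidence interval and inspecting the overlap—can be approximated numerically. The sketch below, with invented study data, flags pairs of studies whose 95 percent confidence intervals fail to overlap; the function and study names are ours for illustration, not taken from either manual.

```python
# Hedged sketch: flag pairs of studies whose 95% CIs do not overlap,
# a crude numeric stand-in for visual forest-plot inspection.
# Study labels and CI bounds below are invented for illustration.

def non_overlapping_pairs(cis):
    """cis: dict mapping study label -> (lower, upper) 95% CI bounds."""
    labels = sorted(cis)
    flagged = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            lo_a, hi_a = cis[a]
            lo_b, hi_b = cis[b]
            if hi_a < lo_b or hi_b < lo_a:  # intervals are disjoint
                flagged.append((a, b))
    return flagged

studies = {
    "Study A": (0.60, 0.90),  # risk-ratio CIs, invented
    "Study B": (0.70, 1.05),
    "Study C": (1.10, 1.60),  # disjoint from both A and B
}
print(non_overlapping_pairs(studies))
```

Per the guidance above, any flagged pair would prompt more formal statistical assessment (e.g., a chi-square test) rather than a conclusion on its own.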

Specifically, the Cochrane handbook suggests that authors consider subgroup analyses as well as meta-regression for addressing clinical heterogeneity. However, meta-analysis will be informative and appropriate only if the study participants, interventions, and outcomes are sufficiently homogeneous. It also provides guidance on the use of restriction with respect to specific population characteristics or settings.

As Cochrane reviews are intended to be widely relevant internationally, the manual advises that authors must justify exclusion of studies based on population characteristics using a sound rationale and must explain this in their review. For example, focusing a review of the effectiveness of prostate cancer screening on men between 50 and 60 years of age may be justified on the basis of biological plausibility, previously published SRs, and existing controversy. By contrast, authors should avoid focusing a review on a particular subgroup based on age, sex, or ethnicity when no underlying biologic or sociological justification can be found for doing so, as this would increase the likelihood of type I error. When reviewers are uncertain whether effects among various subgroups of people may differ in important ways, they may be best advised to include all the relevant subgroups and then test for important and plausible differences in the analysis (see Chapter 9, Section 9.6 of the handbook). Subgroup analyses should be planned a priori, stated as a secondary objective, and not driven by the availability of data.

The CRD manual suggests that investigators explore clinical heterogeneity using subgroup analyses that are planned during protocol development. However, when authors cannot plan for subgroups a priori because little information is available at the protocol development stage, they should use an adaptive process, specifying that process itself in the protocol. (The developers of the manual do not provide an example of what an adaptive process might look like.) When authors plan to use restriction, CRD advises that the restrictions put in place should be clinically justifiable such that the results are relevant to the population of concern.

Drug Effectiveness Review Program. DERP guidance does not distinguish among clinical, methodologic, and statistical heterogeneity; rather it discusses heterogeneity in general. Authors of DERP reviews are instructed to consider heterogeneity using the populations, interventions, comparators, outcomes (PICO) framework to determine whether meta-analysis is appropriate. The guidance states that reviewers should use qualitative summaries when meta-analysis is not appropriate. The DERP guidance does not discuss use of restriction for addressing clinical heterogeneity.
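The PICO screen DERP describes can be made concrete by tabulating each study's population, intervention, comparator, and outcome and flagging the elements on which studies differ. The sketch below is our own illustration (the field values and trials are invented), not a DERP tool.

```python
# Hedged sketch: screen candidate studies for clinical heterogeneity
# along the PICO dimensions (population, intervention, comparator,
# outcome) before deciding whether pooling is appropriate.
# The trials and field values below are invented for illustration.

PICO_FIELDS = ("population", "intervention", "comparator", "outcome")

def differing_pico_elements(studies):
    """Return the PICO fields on which the studies are not identical."""
    return [f for f in PICO_FIELDS
            if len({s[f] for s in studies}) > 1]

trials = [
    {"population": "adults 18-65", "intervention": "drug A 10 mg",
     "comparator": "placebo", "outcome": "remission at 8 weeks"},
    {"population": "adults 65+",   "intervention": "drug A 10 mg",
     "comparator": "placebo", "outcome": "remission at 8 weeks"},
]
print(differing_pico_elements(trials))
```

Here the trials differ only on population, which might still justify pooling with a planned subgroup analysis by age; differences on several PICO elements at once would point instead toward a qualitative summary.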

Other organizations. The EUnetHTA document does not advise on how to address clinical heterogeneity, but it does indicate that authors should note whether it is present.29 Whether clinical heterogeneity is present can be conveyed using tables specifying the populations, interventions, settings, and outcome measures. EUnetHTA does not include restriction as a way of dealing with clinical heterogeneity.

The HuGENet handbook addresses clinical heterogeneity through use of subgroups based on disease or sociodemographic characteristics. Authors should clearly specify subgroups. Details of the subgroup analysis can be provided in text rather than in tabular format unless the subgroup analysis was pre-specified as a primary issue to be evaluated in the review. HuGENet does not include restriction as a way of dealing with clinical heterogeneity.

IQWIG focuses on subgroups as a way to evaluate the consistency of treatment results across populations and subgroups such as gender and baseline disease risk. The manual does not discuss restriction as a means of handling clinical heterogeneity.

Finally, neither the NHMRC guidance nor the NICE manual makes recommendations about exactly how to handle clinical heterogeneity in analyses, and neither discusses restriction.

KQ 3b. Evidence-based Practice Center Practices for Clinical Heterogeneity

KQ 3b asked how AHRQ’s EPCs have dealt with the concept of clinical heterogeneity in their SRs and CERs. To address this question, we sought SRs (including CERs) from AHRQ for all 15 medical conditions noted earlier. Because AHRQ requested a broad review of conditions, the principal investigator for this study selected one condition to represent each body system.

Of these, we obtained reviews on 11 conditions: breast cancer, lung cancer, prostate cancer, cesarean delivery, COPD, depression, dyspepsia, heart failure, hypertension, labor induction, and osteoarthritis. Because no AHRQ report dealt with dyspepsia itself, we used an SR for gastroesophageal reflux disease (GERD) for that condition. We selected one SR or CER for each of the 11 conditions (counting dyspepsia), regardless of how many reviews EPCs might have completed for a given topic over the years.46–56 No AHRQ SR was available for chronic kidney disease, heavy menstrual bleeding, IBS, or myocardial infarction.

We note the distribution of reviews with respect to whether they included demographic variables and addressed disease variables (i.e., disease stage, type, severity, or site) or similar clinical variables (Table 15).

Table 15. Use of demographic or disease variables in AHRQ systematic reviews.

KQ 3c. “Best Practices” for Considering Intervention-Outcome Associations

This subquestion pertained to all organizations considered for KQ 1, not just AHRQ. We comment in detail below only if a manual or handbook provided some explicit advice about analyses or statistical tests to be used in examining associations between interventions and treatment outcomes taking clinical heterogeneity into account.

Cochrane Collaboration and Centre for Reviews and Dissemination. The Cochrane manual suggests that authors determine, at the point of writing their protocols, which characteristics may be associated with clinical heterogeneity so they can develop a plan to assess these factors during the analysis; the manual also suggests consideration of meta-regression. When studies show inconsistent results, an initial step is to evaluate whether statistical heterogeneity exists. However, because the chi-square test for heterogeneity has low power, the manual suggests using a P-value threshold of 0.10 rather than the conventional 0.05 and quantifying inconsistency with the I2 statistic. When evaluating forest plots, authors should also consider the overlap in confidence intervals.
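The statistics named above can be computed directly from study-level effect estimates and standard errors. The sketch below, with invented data, computes Cochran's Q, its P-value against the suggested 0.10 threshold, and I2; the function and variable names are ours, not from the handbook.

```python
# Hedged sketch: Cochran's Q and the I-squared statistic from
# study-level effect estimates (e.g., log risk ratios) and their
# standard errors. The three studies below are invented.
import math

def q_and_i2(effects, ses):
    """Fixed-effect weights 1/se^2; returns (Q, I2 as a percentage)."""
    w = [1.0 / se ** 2 for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2

effects = [0.20, 0.35, 0.80]   # invented log risk ratios
ses     = [0.10, 0.12, 0.15]
q, i2 = q_and_i2(effects, ses)
# With 3 studies (df = 2) the chi-square survival function reduces
# to exp(-Q/2), so no statistics library is needed for the P-value.
p = math.exp(-q / 2.0)
print(f"Q = {q:.2f}, P = {p:.4f}, I2 = {i2:.0f}%")
```

For these invented data P falls well below the handbook's suggested 0.10 threshold, so a reviewer would go on to investigate the heterogeneity (e.g., with pre-specified subgroup analyses) rather than simply pool.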

The CRD guidance suggests that authors examine forest plots, chi-square tests (Q-statistic), and the I2 statistic as means of assessing whether clinical heterogeneity is influencing the treatment effect.

Drug Effectiveness Review Program. The DERP manual suggests that reviewers consider whether there are differences in the patient populations, interventions, and outcomes and if the studies are of similar quality before determining whether a meta-analysis should be performed. When meta-analyses are inappropriate, the data should be summarized qualitatively.

Human Genome Epidemiology Network. Clinical heterogeneity with respect to intervention-outcome associations is not addressed specifically by the HuGENet handbook. It does advise, however, that heterogeneity in general can be assessed in one or more ways: with the I2 statistic (the proportion of total variation attributable to among-study heterogeneity) and with meta-regression accompanied by sensitivity analyses.

German Institute for Quality and Efficiency in Health Care. The IQWIG manual does not address clinical heterogeneity specifically, although it does provide guidance on assessing heterogeneity in general. Their guidance suggests a priori determination of possible effect-measure modifiers that might affect the treatment-outcome association in particular patient subgroups. Studies that are strongly heterogeneous may be meta-analyzed only when the reasons for the heterogeneity are plausible and justifiable. The extent of heterogeneity should be quantified using the I2 statistic.

Other organizations. AHRQ does not make any explicit recommendation regarding how authors should assess whether clinical heterogeneity affects the intervention-outcome relationships in its SRs or CERs but does provide guidance on the possible choices for its evaluation. The EUnetHTA guidance has no recommendation for assessing whether clinical heterogeneity influences the intervention effect. The NHMRC manual discusses heterogeneity only in general but suggests that authors should explore possible causes of variation in outcome estimates even when the test for heterogeneity is not statistically significant. Finally, the NICE manual also does not specifically address clinical heterogeneity with respect to outcome estimates or effects. It states, however, that authors should describe and justify their meta-analytical techniques and approaches. This guidance includes specifications for any subgroup analyses and sensitivity analyses.

KQ 4. What Are Critiques in How Systematic Reviews Handle Clinical Heterogeneity?

KQ 4a. Critiques from Peer and Public Reviews of AHRQ Evidence-based Practice Center Reports

As with KQ 3a, this issue related only to CERs from AHRQ EPCs. It specifically deals with external peer review and public comments for three draft CERs from AHRQ EPCs:

  • Comparative Effectiveness of Drug Therapy for Rheumatoid Arthritis and Psoriatic Arthritis in Adults12
  • Comparative Effectiveness of Percutaneous Coronary Interventions and Coronary Artery Bypass Grafting for Coronary Artery Disease13
  • Comparative Effectiveness of Second-Generation Antidepressants in the Pharmacologic Treatment of Adult Depression.52

All three had KQs addressing subgroups, the most relevant of which are reproduced in Table 16.

Table 16. Clinical heterogeneity variables specified in key questions for AHRQ comparative effectiveness reviews.

The source of these comments was compilations provided by the SRC. We did not review the Peer Review Disposition Reports (PRDRs) that all EPCs produce to indicate how they dealt with peer or public comments in their final reports. Thus, comments noted below may have been accurate (leading to revisions in the final CER) or inaccurate or irrelevant (meaning that the final reports had no related revisions); in all cases, however, the PRDR would have explained the disposition of each comment.

Table 17 presents our summary synthesis by type of comments or concerns that either independent peer reviewers or public commentators made about these draft reports. All three reports were criticized for lacking either information on the clinically relevant subgroups or clarity on which comparisons were being made. One reviewer suggested that “to avoid confusion in the interpretation of [the] analysis, it must be made clear exactly what is being compared.” The reviewer cautioned that when the population is not well defined or the subgroups being compared are not clearly stated, the reader may apply the findings inappropriately.

Table 17. Types of comments received on draft comparative effectiveness reviews.

Some reviewers expressed confusion about restriction to a very specific subset of the population vs. a more general subgroup analysis. One noted that specifying unusual eligibility criteria that other studies might have considered exclusion criteria would be important. The example was the AWESOME (Angina With Extremely Serious Operative Mortality Evaluation) trial of special high-risk patients with ischemic symptoms refractory to medical therapy who were at increased risk for adverse events after either percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG).

Another comment concerned the concept of applicability. The claim was that included trials are typically efficacy studies and that they do not provide information for real practice in which, for instance, patients have multiple comorbidities. Note: the draft AHRQ EPC guidance recommends distinguishing between efficacy and effectiveness studies and references work done by Gartlehner et al.165

Reviewers also pointed out cases in which studies or data with relevant information on clinical heterogeneity were never mentioned at all. They advised that if no studies existed that addressed important clinical subgroups, then EPC authors should state that fact clearly; if such data did exist but were not considered in a CER, then EPC authors should explain the reasons for excluding the studies or the data. For example, one reviewer noted that clinicians are very interested in patients with comorbid depression and chronic pain and this was not addressed in the report. Yet several studies were available that would shed light on this issue.

More generally, reviewers criticized the lack of consideration of disease activity and severity. They emphasized that these factors are very important for understanding drug efficacy and safety in patients with differing severity of disease.

Inappropriate analysis or interpretation (or both) of subgroup data was sometimes criticized. One critique focused on forced grouping of subgroups: “The mix of simple and complex randomized controlled trials in a forest plot with a summary line is simply inappropriate. There is a great hazard in blending trials with such divergent target populations as the authors have done in multiple forest plots.” Whether this was intended to be a call for more discrete subset, subgroup, or heterogeneity analyses, however, was not clear.

Another point involved where in the report subgroup analyses might be explored and discussed in more depth. Some comments advised that discussion or conclusion chapters of the reports should include commentary on subgroup analysis. For example, one commentator thought that this paragraph from the results section of the report should have been brought forth into the discussion or the conclusions section:

We were interested in finding studies that would allow us to predict individual responses to a specific drug based on [a] patient’s clinical and genetic characteristics. In theory, drugs have varying side effect profiles and [an] individual’s tolerance of those side effects varies but overall incidence of side effects is relatively high. The lack of data relating individual’s characteristics to drug effects makes it difficult to predict which drug will be best tolerated by a specific individual. This is indicated by substantial discontinuation rates and frequent need to try multiple drugs before finding an effective drug that is well-tolerated. Studies of tailoring therapy would have been eligible for this review, but we did not find any. Most of these studies looked only at average effectiveness, excluded subjects with comorbidities, and did not even assess difference in effectiveness according to broad demographic characteristics.

Although the KQs for these three reviews did call for consideration of clinical heterogeneity, the reviewers critiqued the reports on which subgroups were evaluated, how the evaluation was done, and the lack of information on factors contributing to clinical heterogeneity.

KQ 4b. General Critiques (in the Literature) about Clinical Heterogeneity in Systematic Reviews

The most frequently mentioned concern noted for assessing clinical heterogeneity in SRs is that authors should specify in advance, during the development of the design or protocol for their review, which factors they will be investigating. Analysis of factors identified a posteriori may be considered a “data dredging” exercise that is likely to produce unreliable results.4,5,11,38,166–185

A related concern is that analyses identifying factors that appear to modify intervention-outcome associations should be regarded with caution. Factors investigated may not be biologically plausible or based on disease pathophysiology186 and may be misleading.170 Subgroups arising from a “per protocol” rather than an intention-to-treat analysis of randomized controlled trials may be particularly suspect because the control of confounding afforded by randomization no longer holds.166,176,178–180,182,187,188 Hence, the analysis of potential indicators of clinical heterogeneity is considered hypothesis-generating.

Many authors suggest that requisite caution should be exercised by severely limiting the number of pre-specified factors and by controlling the overall type I error probability for the entire group of factors. The latter approach has the effect of reducing the type I error probability for each factor.166,168,179,180,186,188–191
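Controlling the overall type I error across a small set of pre-specified subgroup factors can be done with, for example, a Bonferroni correction; the sketch below is a generic illustration with invented factor names and P-values, not a procedure any of the cited authors prescribe in this exact form.

```python
# Hedged sketch: Bonferroni control of the family-wise type I error
# across a small set of pre-specified subgroup factors. Splitting the
# family alpha across factors lowers the per-factor threshold, which
# is the "reducing the type I error probability for each factor"
# effect described in the text. Factor names/P-values are invented.

def bonferroni_significant(p_values, family_alpha=0.05):
    """Test each factor at family_alpha / (number of factors)."""
    per_test_alpha = family_alpha / len(p_values)
    flags = {factor: p <= per_test_alpha for factor, p in p_values.items()}
    return flags, per_test_alpha

subgroup_tests = {"age": 0.004, "sex": 0.030, "baseline severity": 0.012}
flags, alpha = bonferroni_significant(subgroup_tests)
print(f"per-factor alpha = {alpha:.4f}")  # 0.05 / 3 = 0.0167
print(flags)
```

Note how "sex" (P = 0.030) would pass an unadjusted 0.05 threshold but fails the family-wise one; that is exactly the conservatism the cited authors recommend.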

Homogeneity tests have low power, and this problem can cause authors of SRs to miss clinical heterogeneity that has an impact on the magnitude of a treatment effect. For this reason, some experts suggest the use of a higher alpha level than usual, such as 0.10.6,170,183,191,192

Two approaches to understanding or dealing with clinical heterogeneity were popular in the earlier literature: excluding outlier studies without sufficient justification11,178,182,193 and using the control groups of included studies to estimate the underlying risk of the outcome.194–196 Because excluding studies without just cause risks selection bias, the first approach is no longer accepted practice; the second approach, using the event rate in the comparison group to account for baseline risk, remains in use.
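The control-group approach mentioned above is often implemented as a meta-regression of each study's effect estimate on its control-group event rate. The sketch below fits a weighted straight line to invented data; it is a naive illustration (the variable names and numbers are ours) that ignores measurement error in the control rate, a known limitation of simple control-rate regression.

```python
# Hedged sketch: naive control-rate meta-regression -- regress each
# study's effect estimate (log odds ratio) on its control-group event
# rate, weighting by inverse standard error. Data are invented; a
# serious analysis would also model error in the control rate itself.
import numpy as np

control_rate = np.array([0.05, 0.10, 0.20, 0.30])   # invented
log_or       = np.array([-0.10, -0.25, -0.45, -0.60])
se           = np.array([0.15, 0.12, 0.10, 0.14])

# np.polyfit's w weights multiply the residuals, so use 1/se.
slope, intercept = np.polyfit(control_rate, log_or, 1, w=1.0 / se)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
# A negative slope would suggest larger treatment benefit in
# higher-risk populations (on the log odds ratio scale).
```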

KQ 5. What Evidence Is There To Support How Best To Address Clinical Heterogeneity in a Systematic Review?

This section describes the literature search to identify best practices on handling clinical heterogeneity in SRs and CERs and our discussions with key informants.

Review of Methodologic Studies Addressing Clinical Heterogeneity

As described in Chapter 2, we conducted a formal literature search in an effort to identify best practices for handling effect-measure modification within SRs and CERs (or constituent meta-analyses). The intent of this search was to identify whether guidance on the conduct of SRs and CERs (1) differentiated among different types of heterogeneity, and (2) described how to identify factors causing clinical heterogeneity, including evaluating particular subgroups and conducting analyses on individual patient data rather than using the summary results from publications.

For this question (which yielded more than 1,000 citations at the outset with an additional 387 identified via citation search), two senior reviewers independently reviewed the output and identified 60 publications that discussed how to handle clinical heterogeneity in SRs (broadly defined). After removing two duplicate articles, we initially had 58 papers for review.

We also reviewed an additional 25 papers, addressing clinical heterogeneity or subgroup analyses, that the SRC identified in its publication library. The overall sample included 83 papers (summarized in Evidence Table C3 [Appendix C]). These 83 publications represent neither the product of a fully systematic search nor a random sample: although the 1,000+ citations were identified through a formal systematic search, the final 83 papers were not independent with regard to authorship but rather exhibited extensive clustering among a small number of experts (Table 18).

Table 18. Clustering of authors of publications about clinical heterogeneity.

Of the 83 papers we reviewed, 80 (96 percent) addressed heterogeneity among studies in one form or another; the other three focused on evaluating heterogeneity within individual studies rather than in SRs. In all, 57 (69 percent) of the papers addressed within-study heterogeneity. Fifty-four (65 percent) addressed both within- and among-study heterogeneity, often in the context of comparing conventional meta-analysis with individual patient data meta-analysis or distinguishing study-level characteristics (e.g., randomized or observational) from patient-level characteristics (e.g., disease severity). These publications did indicate that analysis of individual patient-level data in meta-analyses allows better assessment of clinical heterogeneity, but the time, cost, and difficulty of obtaining such data are often prohibitive.

Of these 83 studies, 53 (64 percent) distinguished between heterogeneity regarding methodologic characteristics of studies that would affect their internal validity (e.g., allocation concealment in trials) and characteristics of patients and clinical settings that would affect external validity (e.g., presence of coexisting conditions). The papers that did not draw this distinction tended to be those that focused more on the statistical aspects of heterogeneity assessment than on substantive applications.

Finally, 14 articles (17 percent) gave guidance for defining indicators or measures of clinical heterogeneity. For the most part, this guidance tended to be very general, such as using clinical judgment, conducting interviews with patients, and looking for leads in previous research.

At present, there is no guidance on how to identify which factors should be considered potential effect-measure modifiers of the treatment-outcome association. The literature is very general, suggesting the use of experts and information from the literature, but how does a systematic reviewer determine which literature is most relevant? For example, systematic reviewers typically include demographic factors such as age, sex, and race/ethnicity with little forethought about why these factors may be relevant or whether more critical effect modifiers should be considered. Guidance and processes for selecting important effect modifiers are not currently available in the public domain.

Results of Key Informant Interviews

We interviewed six authors in all; three pertained to osteoarthritis reviews (one each from Cochrane, an author of a health technology assessment, and AHRQ via a search of the DARE database),14,15,19 and three pertained to myocardial infarction (one from DARE, two from Cochrane).16–18 RTI staff attempted to include authors of NICE SRs, but we were unable to schedule interviews because of the authors’ limited availability within the specified time frame of this task.

Topical Analysis

Typical approach for developing a study protocol for a systematic review. Five of the six participants indicated that they follow a process or protocol, such as the process described in the Cochrane Collaboration guidance,197 when developing a study protocol for an SR. Four participants specifically mentioned use of the “PICO” scheme, which addresses the patient(s), intervention(s), comparison(s) (comparator[s]), and outcomes. This is a slightly abbreviated version of the “PICOTS” framework often used by EPCs, which also includes timeframe and setting elements. These observations generally pertain to the general process for developing a workplan or protocol for the entire review, not to particular elements such as stating KQs or outlining specific analytic techniques. Two participants noted that they also consult with experts in the field:

“We would go through a preliminary literature search. We’ll have our own conversations with experts that we have contacts with already,” explained one participant. “We’ll be in touch with whoever did the topic development in the SRC [Scientific Resource Center], but now that’s being done more and more by the Evidence-based Practice Centers [EPCs] themselves. Whatever technical experts they consulted with, we try to make an effort to contact them…then it’s all aimed at us gaining ownership of the topic to make sure that we have mastery of the issues as we are able to get started on the review.”

Timing of subgroup identification and ideal process for subgroup identification. Participants were asked to indicate the specific point in the review process at which authors should formulate subgroups and the ideal process for identifying subgroups in an SR. Four authors said that subgroups should be developed during the protocol development process; however, when asked specifically about a priori vs. a posteriori identification of subgroups, five participants indicated that subgroups should be identified a priori.

Timing of subgroup identification. With regard to the timing of subgroup selection, one author stated,

“[P]eople should think about whether there are clearly defined, if you like obvious, subgroups of patients who may display or react differently to a given intervention. And if there is substantial evidence to support such an assumption, they should then plan appropriate subgroup analyses to investigate in their review if this is the case. Now this isn’t to say that you’re not allowed to do a subgroup analysis that you didn’t pre-specify. I see research in general, but also systematic reviews and meta-analyses, as a creative process and exploration of the data is a good thing because you can discover something. But, of course, there is a risk that what you see is a chance finding if you have explored very extensively and you happen to fall upon a finding that’s not real. For this reason, I think you should always make it clear whether a subgroup analysis was pre-specified or whether it was explored.”

Another commented:

“I think, ideally, it should be done during the protocol development process and I think that’s why we’re so intent when we do systematic reviews to, as I said, gain ownership of the topic. To really become immersed as quickly as possible in the important issues because once the protocol has already been developed, I think there’s a real interest in getting through the search and abstraction phases as quickly as possible, and it can be very disruptive to have to redesign your abstraction instruments midway through the process. It can be really frustrating and there can be a lot of duplication so I would vote for the protocol development process.”

A third noted:

“[I] guess it should be done early in the . . . protocol development. My experience has been that ideally it should be done almost in isolation, but the reality is that because of your previous knowledge of the literature, that’s going to influence some of that subgroup development. Your knowledge may be a little biased on what you’ve already read or what you already know of the literature, so I guess early on in the development of the process or the protocol before you’ve even started your search.”

A fourth said:

“If you can, it should be done before—a priori. That would be the best way of doing it, but sometimes, like I said, when you’re reviewing literature some things stand out. You find a subgroup of patients dying more often than others and you probe a little farther. When you basically publish those results, they may not be as robust as a priori hypothesis, but still something may be clinically meaningful. The short answer is you should have a hypothesis before. If you want to look at subgroups, you should have a hypothesis generated before.”

Along similar lines of reasoning, another responded:

“[I]f the question was: ‘Is there meaningful difference between the subgroups based on age, race, or sex?’ then we would have created a priori hypothesis and assessed the number needed to have meaningful differences in the subgroups. But, occasionally what happens is that when you do the analyses, some subgroups fall out (they look quite abnormal), and then we probe a little better into that but we do not make that our primary result of the analysis.”

Two participants felt that clearly stating how and when authors specified subgroups is of great importance.

“[T]here’s nothing wrong with doing subgroup analyses that were not pre-specified. There may be some new finding that wasn’t known at the time when you wrote the protocol that leads you to think about subgroup analyses that you didn’t think about before … you could just creatively explore the data, but in this case, you should make it clear that this is how it happened and that it wasn’t pre-specified.”

“What I do and what I recommend as an editor at the Cochrane Collaboration, we constantly say that if people tend to, or plan to, interpret the statistical analysis in terms of inference afterwards, they are supposed to present in the protocol what key clinical characteristics they would consider relevant, or in terms of exploring it afterwards. So the main issue here, which I’m feeling very strongly about, would be if they were supposed to interpret it. Actually, I did a meta-analysis on weight loss for knee osteoarthritis, and in that, we had a strong a priori saying we wanted to include the dosage that would be the average weight lost as a covariate. That was an example of something we knew prior to doing the statistical analysis. But in the [other] paper, we did it the other way around; we wanted to explore reasons for heterogeneity. So if we used a statistical analysis package to generate an inference, then it should be carefully stated in the full paper. That’s a very strong argument and I really feel strongly about that because it’s obvious that very often we see people doing whatever subgroup analysis or meta-regression analysis and sometimes it seems that they have been inspired to do that following looking at the data.”

One author suggested a two-step process for identifying subgroups: (1) looking a priori for patient populations that make clinical sense to clinicians working at the bedside and to clinical opinion leaders, drawing on knowledge that has already been developed; and (2) looking after the fact at heterogeneity evident in the results to see whether it highlights patient or study characteristics that could guide the development of subgroups.

Ideal process for identifying subgroups of clinical conditions in an SR. The participants provided a range of responses when asked about the ideal process for identifying subgroups in an SR. One author said:

“Authors should, together with content experts, consider what’s clinically relevant … in terms of what we anticipate would mean something to the response to therapy. That should be based solely on external knowledge without looking at the data. It’s very important that the content expert is not involved in the data handling and that the ideas for how you are supposed to explore reasons for heterogeneity is made a priori. That would mean that the protocol is based on content expertise rather than looking at the particular study.”

Two participants cited the importance of considering additional sources of potential information, such as content experts or a literature review, when attempting to identify subgroups. One felt that identification of subgroups should include:

“A blend of leaning on the usual suspects like age, disease duration, disease severity, sex—those kinds of things that are almost always considered—and then also leaning on what the literature suggests might be important subgroups. That’s why it’s so important before even beginning on the review to have done a preliminary literature search, to get a sense of what important subgroups might exist, and also talking with experts who may already be familiar with what subgroups exist.”

Two participants noted that subgroup analyses are feasible only when a sufficient number of studies is available. One participant noted that many Cochrane reviews include too few studies to do a subgroup analysis; the other added that, to do subgroup analyses that combine data across studies, patient demographics, treatments or interventions, and outcomes should be fairly homogeneous. He further elaborated, saying,

“[F]or my particular study, there was little information at the time of the study, so we had a small number of patients . . . .[T]he four randomized trials that we eventually identified to include in our analyses were fairly small and we had a total of only seven to twenty-five patients. So therefore, that itself limited us to do subgroup analyses for our studies. As you know from our study, we really didn’t do any subgroup analysis . . . we did not stratify the results based on subgroups because the numbers were too small.”

Considerations when developing key questions. Four of six participants considered demographic factors when developing the KQs for their SRs. One participant acknowledged that all the included factors were data-driven and were considered post-hoc because the authors did not anticipate that they would be able to find many relevant studies. He felt that all studies should be pooled and split again afterwards to reduce differences in clinical heterogeneity. One participant did not consider demographic factors at all, and another stated that there were too few relevant studies for subgroup analysis.

Half of the participants considered disease severity during development of the KQs. One participant noted that he considered disease severity post-hoc. Another said that severity will be a factor that will be considered in the future but not at present, given the limited number of studies available pertaining to the topic. One author indicated that the author team did not consistently look for severity across the studies.

None of the three authors of osteoarthritis SRs considered the affected joint when developing the KQs for their review. For example, one said,

“The particular project that we did was focused on OA [osteoarthritis] of the knee, so that wasn’t a specific issue and whether it was the right or left knee wasn’t really a concern to us.”

Three participants considered disease recurrence, one did so post-hoc, and the remaining two did not account for it at all. However, one author of a myocardial infarction review qualified his response:

“If the patient didn’t have troponin* or ECG [electrocardiogram] changes, then they had to have chest pain in the setting of a previous MI [myocardial infarction]. When we developed the questions, we were actually thinking of it that way so we did kind of include that, but it was by chance, rather than by design.”

Consideration of other clinical factors. Participants mentioned several other clinical factors that they considered when developing the KQs for their reviews. Among them were timing factors, such as time between symptom onset and intervention administration and duration of trial (i.e., sufficient time for effectiveness to be noted or measured); prior or concurrent interventions; baseline risk factors (e.g., body mass index); and whether the disease was classified as primary or secondary (i.e., the reason for the trial vs. a comorbidity in trials where other conditions were primary). One participant stated,

“[C]linically we were interested in looking at the early phase of treatment. So we were trying to limit our inclusion criteria to location of treatment thinking that would be a proxy marker for ‘acutes.’ So we were trying to look at patients that were included only very early on in their presentation. . . . [W]e also struggled with looking at outcomes to look at how far out should be an appropriate look at outcomes.”

During our recruitment efforts, we contacted one author who declined to participate, indicating that her SR pertained to exercise after knee arthroplasty for osteoarthritis rather than osteoarthritis and exercise. However, her email offered insight on factors that her team considered when developing the KQs for their SR. Regarding diagnosis and disease severity, the patients were post-arthroplasty (i.e., all patients were considered to have sufficiently severe disease that had not responded to all previous forms of treatment and thus warranted total knee arthroplasty). All had clinical and radiographic changes that led their providers to perform the operation. As part of their inclusion criteria, the patients had to be able to undertake an exercise rehabilitation intervention. Thus, the authors of this review did consider clinical heterogeneity: both severity of disease and baseline risk factors for carrying out the planned intervention.

Selection of factors to report in the analysis. Two authors indicated that they selected factors a priori to include in their analysis. One author elaborated on the importance of having a clinical hypothesis, saying,

“[W]hen I start extracting data, I have a very optimistic outlook, so to speak. I try extracting almost whatever and if I don’t feel strongly about any specific covariates—those that we are discussing here, study characteristics, clinical heterogeneity reasons . . . when writing the protocol, if I allow one cell to be blank then I obviously don’t feel strongly about it. If I’m feeling like I have too many blanks, like 50 percent blanks, I omit it from the publications table. If that’s the case then obviously I don’t feel that strongly about it and I only include it in terms of external validity, making sure that the readers of the full paper feel confident that they understand the paper. But the thing about statistical analysis… if I’m supposed to believe in my own results afterwards, I need to have a clinical hypothesis.”

One participant mentioned that the small number of studies included in his team’s SR precluded them from even looking at clinical heterogeneity. Another noted,

“[O]ur stratification or our subgroup analyses were driven by the discussion around the discrepancy between the small studies showing substantial benefit and the very large ISIS-4 study [Fourth International Study of Infarct Survival] that didn’t show any benefit at all. . . . I’m not sure that this is a very typical situation because it was really driven by this very large study which didn’t show any effect and discussion about if it was possible how the smaller studies had shown quite substantial benefit and these results were then nullified, if you like [and] were not confirmed by the large ISIS-IV trial.”

Another participant indicated that, for his team’s SR for myocardial infarction, they referred back to evidence from previous SRs and correspondence with authors to decide whether something might be an important contributor to clinical heterogeneity. The authors considered doing a subgroup analysis looking at patients who were troponin positive. Given that these patients would be potentially sicker or have more severe disease than others who were not troponin positive, the authors thought that these patients might respond differently than patients who were troponin negative. They also hoped to do a similar subgroup analysis looking at patients with positive electrocardiogram changes, reasoning that such changes may indicate a different disease or different severities of disease. However, the data were not available to conduct such analyses in either case.

One author noted the importance of using tests of interaction to evaluate the strength of the evidence for the differences between subgroups.

“[Y]ou should always use appropriate statistical tests to investigate to what extent differences you observe between two subgroups are real or mainly the play of chance.”
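The test of interaction this author describes can be sketched in a few lines of Python (an illustrative example, not code from any of the reviews discussed; the function name and inputs are hypothetical). Given pooled effect estimates for two subgroups on the same scale (e.g., log odds ratios) and their standard errors, a z-test on the difference indicates whether the gap between subgroups exceeds what chance alone would produce.

```python
import math

def subgroup_interaction_test(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference between two subgroup
    effect estimates (e.g., pooled log odds ratios) with
    standard errors se_a and se_b."""
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p

# Hypothetical subgroups: effect 1.0 (SE 0.3) vs. effect 0.0 (SE 0.4)
z, p = subgroup_interaction_test(1.0, 0.3, 0.0, 0.4)
```

A small p-value suggests the subgroup difference is unlikely to be the play of chance; a large one counsels caution in interpreting apparent subgroup effects.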

Reference to guidelines and manuals during study protocol development. All six participants mentioned use of or reference to the Cochrane manual during study protocol development; three of the participants were, of course, authors of Cochrane reviews. Other guidance documents mentioned included QUOROM,198 PRISMA,199 the paper by Harris et al.,200 and the RTI-UNC EPC report by West et al. pertaining to systems for rating the strength of evidence.201 The participants used these documents as reference tools to help resolve problems encountered during study protocol development and to assist with quality assessment of the included studies.

Additional considerations in the selection of patient or disease factors. Several authors provided additional thoughts on determining which factors to consider for assessing clinical heterogeneity.

Limited number of relevant studies published. Although he had mentioned it earlier in the interview, one participant thought it important to reiterate the limitation of having a small number of relevant published studies when conducting an SR.

Benefits of the PICO format. Many of the participants valued the PICO format. One participant stated:

“I’m very happy to recommend that people use the PICO format, as that’s how meta-analyses are supposed to be written. When authors use the PICO [format], then we know that all the studies that they are able to consider eligible should be pooled per se and then the I2 [inconsistency index] is very important. . . . If you have lots of very different studies, but they all fulfill your PICO framework initially a priori, then they should be pooled. But if the I2 goes nuts, meaning that it’s far too high, say extremely over 50 percent, then the overall estimate is not relevant. Then you need to explore in more detail why the I2 went nuts . . . [M]y overall conclusion would be that we should always combine the studies that fulfill the PICO framework that they considered initially, but if the I2 goes nuts, they should not put that much emphasis on the overall results and make sure to continue and explore reasons for clinical heterogeneity.”
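The pool-then-diagnose workflow this respondent describes can be illustrated with a short fixed-effect sketch (hypothetical function and data, not taken from the report; real reviews would typically use a dedicated meta-analysis package and consider random-effects models). Cochran’s Q measures variability among study effects beyond chance, and I2 expresses the share of that variability not attributable to chance.

```python
def pooled_q_i2(estimates, std_errs):
    """Fixed-effect pooled estimate, Cochran's Q, and the I2 index
    (percent) from per-study effects and their standard errors."""
    weights = [1.0 / se ** 2 for se in std_errs]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2

# Hypothetical studies with conflicting effects: I2 comes out high,
# signaling that the overall estimate deserves further exploration.
pooled, q, i2 = pooled_q_i2([0.0, 1.0], [0.1, 0.1])
```

On the respondent’s reading, an I2 far above 50 percent would prompt exploring reasons for clinical heterogeneity rather than emphasizing the overall estimate.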

Exclusion of poor-quality studies. Quality of studies should be evaluated before inclusion in an SR. One participant remarked:

“[W]e made a great effort to really identify all studies. We went into the Chinese literature, and there was even some hand searching in the Chinese literature. All we found was all positive and not very well conducted studies. I personally think the problem here wasn’t publication bias, it was just low-quality, inadequate quality, bad, small studies.”

Scope of inclusion/exclusion criteria. Development of the study questions largely influences the extent to which clinical heterogeneity is addressed. As noted by one participant,

“[W]hen we were developing the protocol, our study question was very specific in that we were looking for patients initially who presented to the emergency departments with acute coronary syndromes [ACS]. That was our initial question and when we submitted that to the Cochrane [group], if I remember correctly, the review group came back and wanted us to make this a broader review so that we would include inpatients in the analysis. So we had a pretty lengthy discussion and one of the issues that we felt was, in part, around heterogeneity is that inpatients that developed ACS were in fact different from those presenting [to the] emergency department with ACS. We successfully were able to argue or communicate our point to the review group, so we kept things fairly narrow, but our biggest tool for dealing with clinical heterogeneity was initially in developing the question and that took a fair amount of revisions to make sure that we had fairly narrow definitions of what we would include and what we would not include.”


  • This report focuses on clinical heterogeneity, which we define as the variation in study population characteristics, coexisting conditions, cointerventions, and outcomes evaluated across studies included in an SR or CER that may influence or modify the magnitude of the intervention measure of effect. This is distinct from methodological heterogeneity, which refers to variation in study designs and analyses as reasons for differences in treatment effects among studies.
  • All five organizations (AHRQ, CRD, Cochrane, DERP, and EUnetHTA) refer to variation in population characteristics, interventions, and outcomes to define clinical heterogeneity. AHRQ, Cochrane, and CRD use the term “clinical diversity” rather than “clinical heterogeneity.”
  • The underlying rationale of statistical tests to assess heterogeneity is to investigate whether existing variations in treatment effects go beyond what would be expected from chance fluctuations alone. Commonly used statistical approaches are Cochran’s Q test, the I2 index, and meta-regression.
  • Common reasons for statistical heterogeneity include clinical heterogeneity, methodological heterogeneity, chance, and biases. Drawing conclusions about clinical heterogeneity from statistical heterogeneity alone can produce false-positive (type I error) or false-negative (type II error) conclusions.
  • Generally, reviewers use one or more of three common approaches to explore heterogeneity: stratified analyses of homogeneous subgroups, meta-regression, and sensitivity analyses. We also consider restriction as a way to understand clinical heterogeneity.
  • We did not find guidance documents, studies, or commentaries indicating that clinical heterogeneity should be considered at all stages of the review.
  • Most EPC authors considered demographic variables such as age, sex, race, and ethnicity, and variables reflecting coexisting disease in their subgroup analyses.
  • Key informant interview respondents generally agreed that subgroups should be developed during the protocol development phase (a priori); several consult with experts in the field during the process and recommend this as a best practice. They tended to rely upon Cochrane Collaboration guidance and the PICO(TS) scheme in their review processes; some also referred to QUOROM, PRISMA, and methodological papers.
  • Similar to studies reviewed for this report, key informant interview respondents tended to consider disease severity, disease recurrence, and demographic factors in assessing clinical heterogeneity.
  • Studies assessing clinical heterogeneity methodology often conclude that systematic reviewers include demographic factors with little forethought about why these factors may be relevant or whether they should consider other, possibly more critical factors. However, guidance and processes to determine how to select important potential effect-measure modifiers are not readily available.
  • Analysis of individual, patient-level data in meta-analyses allows for better assessment of both within- and across-study clinical heterogeneity, but the time, cost, and difficulty of obtaining these data are often prohibitive.
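The meta-regression approach listed above can likewise be sketched as a weighted least-squares fit of study effects on a single study-level covariate (a simplified, single-covariate illustration with hypothetical names and data; real analyses would use a dedicated package and usually a random-effects formulation):

```python
def meta_regression(effects, std_errs, covariate):
    """Inverse-variance-weighted least-squares fit of study effects
    on one study-level covariate (e.g., mean dosage or mean age).
    Returns (intercept, slope)."""
    w = [1.0 / se ** 2 for se in std_errs]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, covariate)) / sw
    ybar = sum(wi * y for wi, y in zip(w, effects)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, covariate))
    sxy = sum(wi * (x - xbar) * (y - ybar)
              for wi, x, y in zip(w, covariate, effects))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical example: study effects rise with the covariate
intercept, slope = meta_regression([1.0, 3.0, 5.0],
                                   [0.1, 0.2, 0.3],
                                   [0.0, 1.0, 2.0])
```

A slope credibly different from zero suggests the covariate explains part of the between-study heterogeneity; a flat slope leaves the heterogeneity unexplained.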

* The troponin test is used to help diagnose a heart attack, to detect and evaluate mild to severe heart injury, and to distinguish chest pain that may be due to other causes. In patients who experience heart-related chest pain, discomfort, or other symptoms and do not seek medical attention for a day or more, the troponin test will still be positive if the symptoms are due to heart damage. Troponin tests are often preferred as they are more specific for heart injury than other tests (which may become positive in skeletal muscle injury) and remain elevated longer. (http://www.labtestsonline.org/understanding/analytes/troponin/test.html).
