NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Treadwell JR, Singh S, Talati R, et al. A Framework for "Best Evidence" Approaches in Systematic Reviews [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2011 Jun.


In a series of conference calls, the two project leaders from ECRI Institute and three collaborators from three other EPCs (Vanderbilt University, University of Connecticut, and Johns Hopkins University) discussed methods for accomplishing the tasks noted above. After the initial conference call, the project leaders prepared a series of discussion documents specific to the first three tasks. Subsequent conference calls were scheduled to discuss comments and suggestions from the collaborators, whose feedback was incorporated into revisions of the task documents. These documents were combined into a single draft summary report, which all members of the group approved before submission to AHRQ. The document was then externally reviewed by experts from other institutions, revisions were made based on reviewer comments, and the final report was resubmitted for posting on the AHRQ Web site.

Task 1. Lists of Inclusion Criteria and Factors That Might Affect a Reviewer's Decision to Use Each Criterion

Two basic types of inclusion criteria are typically used in systematic reviews. The first set includes criteria pertaining to publication characteristics, such as full-article publication (not just an abstract), peer-reviewed publication, year of publication sufficiently recent (to ensure exclusion of outdated technologies), English-language publication (depending on the topic), and exclusion of duplicate publications (to avoid double-counting study participants) unless duplicate studies contain unique outcome data. These criteria are usually unaffected by subsequent decisions regarding “best evidence” and analysis.

The second set includes criteria pertaining to study design, study conduct and reporting, and study relevance to the Key Question(s). These criteria are context sensitive and require clinical and methodological judgments from the review team; in addition, the decision to use certain criteria may be influenced by the limitations of the available evidence (discussed in more detail later in this section). Given their importance, our focus in task 1 was this latter set of inclusion criteria.

Figure 1 illustrates the logical flow for application of inclusion criteria from a best evidence perspective. Note that this figure depicts a sequence of decisions rather than a hierarchy based on study design. The layout of the figure (with randomization at the top) may give an unintended impression: that the most important consideration is whether to limit the evidence to randomized controlled trials (RCTs). In reality, that may not be the most important consideration, particularly for Key Questions that do not address causation. The figure is structured in this manner because the decision to require or not require RCTs is often the first one made, and therefore leads naturally to other types of decisions about the inclusion criteria.

The first questions concern study design; the initial question at the top of the chart asks whether randomization is required. If the answer is yes, the reviewer moves down the left pathway, beginning with three questions about blinding: is blinding required of study participants, of providers, and/or of outcome assessors? The reviewer then asks three more questions related to study design. Is an adequate washout period required? Is direct comparison required? Is baseline comparability required? After these questions are answered, the reviewer moves to questions related to study conduct and reporting criteria, and lastly to questions related to relevance criteria (that is, relevance of study participants, interventions, and settings specified in the Key Questions).

If the answer to the initial question about randomization is no, the reviewer moves down the right branch of the pathway and asks whether an independent control group is required. If the answer is yes, the reviewer asks five more questions related to study design. Is a concurrent comparison group required? Is an adequate washout period required? Is baseline comparability required? Is prospective planning required? Is consecutive enrollment required? After these questions are answered, the reviewer moves to questions related to study conduct and reporting criteria, and lastly to questions related to relevance criteria.

If the answer to the independent control group question is no, the reviewer asks four more questions related to study design. Is any comparison required? Is an adequate washout period required? Is prospective planning required? Is consecutive enrollment required? After these questions are answered, the reviewer again moves to questions related to study conduct and reporting criteria, and lastly to questions related to relevance criteria.

Figure 1. Process chart of application of inclusion criteria. This figure depicts a sequence of decisions about study inclusion criteria. Randomization is listed first because the decision to require or not require RCTs is often the first one made.
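The decision sequence described for Figure 1 can be sketched as a small function. This is only an illustration and not part of the framework itself: the function name `design_questions` and its two parameters are hypothetical, and the question wording is paraphrased from the description of the figure.

```python
def design_questions(randomization_required, independent_control_required=False):
    """Return the ordered questions for one path through the Figure 1 flow.

    Hypothetical helper: the names and the wording of the questions are
    illustrative paraphrases, not an official encoding of the figure.
    """
    if randomization_required:
        # Left pathway: three blinding questions, then three design questions.
        questions = [
            "Is blinding of study participants required?",
            "Is blinding of providers required?",
            "Is blinding of outcome assessors required?",
            "Is an adequate washout period required?",
            "Is direct comparison required?",
            "Is baseline comparability required?",
        ]
    elif independent_control_required:
        # Right pathway, independent control group required: five questions.
        questions = [
            "Is a concurrent comparison group required?",
            "Is an adequate washout period required?",
            "Is baseline comparability required?",
            "Is prospective planning required?",
            "Is consecutive enrollment required?",
        ]
    else:
        # Right pathway, no independent control group required: four questions.
        questions = [
            "Is any comparison required?",
            "Is an adequate washout period required?",
            "Is prospective planning required?",
            "Is consecutive enrollment required?",
        ]
    # Every pathway ends with the same two groups of criteria.
    return questions + [
        "Which study conduct and reporting criteria are required?",
        "Which relevance criteria are required?",
    ]
```

Each pathway ends with the same two groups of criteria (conduct and reporting, then relevance), which mirrors the point that the figure depicts a sequence of decisions rather than a hierarchy of study designs.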

The relevance criteria involve whether the study participants, interventions, and settings are relevant to the Key Question. For example, suppose a Key Question specifies that the population of interest is adults with type 2 diabetes, and some studies enrolled not only these participants but also some adults with type 1 diabetes (and only presented combined results for the two populations). The reviewer must decide whether the combined results are sufficiently relevant to the Key Question. By “relevance,” we do not mean relevance to typical clinical practice, which is a concept we refer to as applicability, and is generally addressed at a later stage in the review (see task 2 in this paper).

Table 1, Table 2, and Table 3 list inclusion criteria and the factors that may affect a reviewer's decision to use each criterion. For example, if no RCTs are identified, the reviewer may consider inclusion of nonrandomized studies (if the risk of bias is not too high). Likewise, if the outcome is a harm related to treatment, the reviewer may believe that nonrandomized studies still provide useful information. This is not to imply that well-designed studies measuring harms are suboptimal for determining the true risks of harms. Rather, in some instances, less reliable evidence (even case reports) of rare harms associated with an intervention may be useful in decisionmaking.

Table 1. Inclusion criteria/analysis criteria pertaining to study design.

Table 2. Inclusion criteria/analysis criteria pertaining to study conduct and reporting.

Table 3. Inclusion criteria/analysis criteria pertaining to relevance.

For some reviewers, the decision to apply any given criterion may be influenced by the number of studies that meet it. Rigid adherence to criteria that none of the available studies meet may result in exclusion of a considerable amount of lower quality evidence that might have provided some (albeit weak) evidence to address a Key Question. Ultimately, the reviewer must decide whether modifying initial criteria to allow for inclusion of lower quality evidence would result in inclusion of evidence with an unacceptably high risk of bias. If it would, the reviewer may decide to keep the initial criteria, even if they result in no included studies for the Key Question.

Some reviewers may select a subset of these criteria for study presentation (encompassing studies whose data will be tabled but not necessarily analyzed) and a different subset of criteria for study analysis (studies that met the criteria for presentation and also criteria for analysis). We refer to the latter subset of criteria as analysis criteria. For example, a reviewer might choose “concurrent comparison group” as a criterion for study presentation and “random assignment to intervention groups” as an analysis criterion. In this case the reviewer would tabulate information from all studies with concurrent comparison groups, but only analyze data from RCTs. Alternatively, some reviewers may choose to have only one set of criteria such that any studies that are included will also be analyzed (quantitatively if appropriate, qualitatively if not). In either case, reviewers may choose from the list of criteria presented in Table 1, Table 2, and Table 3.
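The two-tier arrangement in the example above (tabulate every study with a concurrent comparison group, analyze only the RCTs) can be sketched as two successive filters. The `Study` class, the `partition` function, and the three studies are hypothetical and exist only to make the distinction concrete.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # Hypothetical record; real reviews track many more attributes.
    name: str
    concurrent_comparison: bool
    randomized: bool


def partition(studies):
    """Split studies into the presentation set and the analysis set."""
    # Presentation criterion: concurrent comparison group.
    presented = [s for s in studies if s.concurrent_comparison]
    # Analysis criterion, applied only to presented studies: randomization.
    analyzed = [s for s in presented if s.randomized]
    return presented, analyzed


# Made-up evidence base for illustration only.
studies = [
    Study("A", concurrent_comparison=True, randomized=True),
    Study("B", concurrent_comparison=True, randomized=False),
    Study("C", concurrent_comparison=False, randomized=False),
]
presented, analyzed = partition(studies)
```

Here studies A and B would be tabled, only A would be analyzed, and C would be excluded. A reviewer using a single set of criteria would simply make the two filters identical.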

A cross-tabulation of each inclusion criterion against each modifying factor is presented in Appendix A. In that representation, a check mark indicates that a specific criterion (row) is influenced by a specific modifying factor (column).

Inclusion criteria developed when a reviewer has insufficient knowledge of an evidence base sometimes require modification based upon findings of the initial literature searches or even review of retrieved study data. As noted in the AHRQ Methods Guide for Comparative Effectiveness Reviews,3 conditional modification of inclusion criteria can still be a priori as long as it is specified in the review protocol (e.g., if Type A studies are not available, Type B studies will be included). For many topics, the best possible evidence (e.g., RCTs with the lowest risk of bias) does not exist, and this absence may not be discovered until the reviewer scans the literature search results. If the initial inclusion criteria specified studies directly comparing specific interventions, the criteria might be modified to allow for indirect comparisons. Conversely, for other topics, overly broad inclusion criteria (e.g., allowing nonrandomized studies or indirect comparisons) may be impractical within the restrictions of time and budget; these criteria may be narrowed to include only the “best” evidence.
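A conditional rule of the kind "if Type A studies are not available, Type B studies will be included" stays a priori because the fallback order is fixed in the protocol before the evidence is examined. A minimal sketch follows, assuming purely for illustration that Type A means RCTs and Type B means nonrandomized comparative studies; the function name, tier definitions, and study records are all hypothetical.

```python
def select_by_protocol(studies, tiers):
    """Return (tier label, matching studies) for the first tier with any match.

    `tiers` is the prespecified, ordered list of (label, predicate) pairs
    written into the protocol before the literature is searched.
    """
    for label, meets_criterion in tiers:
        matched = [s for s in studies if meets_criterion(s)]
        if matched:
            return label, matched
    # No tier matched: the Key Question has no included studies.
    return None, []


# Assumed mapping of "Type A" and "Type B"; the source does not define them.
PROTOCOL_TIERS = [
    ("Type A", lambda s: s["design"] == "rct"),
    ("Type B", lambda s: s["design"] == "nonrandomized comparative"),
]

# Made-up search results containing no RCTs.
search_results = [
    {"id": "S1", "design": "nonrandomized comparative"},
    {"id": "S2", "design": "case series"},
]
tier, included = select_by_protocol(search_results, PROTOCOL_TIERS)
```

Because no RCT is present in this made-up search result, the rule falls through to Type B and includes S1 only. The same ordered-tier structure could express narrowing as well as broadening of criteria.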

Task 2. List of Evidence Prioritization Strategies

After the set of included studies for the Key Question is determined (based on task 1), a reviewer must decide which studies make up the “best evidence” set. We define this as the set of studies that will be assessed and/or analyzed in an attempt to answer the Key Question. Reaching this answer may or may not involve meta-analysis.

Studies not considered as part of the “best evidence” set, but still included, would be tabled but not otherwise used. Some reviewers may choose to use all included studies in the attempt to draw evidence-based conclusions. If so, then the full list of included studies already defines the “best evidence” set.

Sometimes, however, the included studies are so variable in their risk of bias and/or applicability that some further prioritization is necessary. In this effort, several strategies can be employed. The simplest strategy would be to take the single “best” study, and using it alone, determine what conclusions can be drawn. The definition of “best” would be based on a careful balance of both risk of bias and applicability. For example, this strategy might be employed when evaluating an evidence base that contains a single, high-quality mega-trial and a few smaller trials of clearly lesser quality. Alternatively, a single smaller high-quality trial might represent the best evidence in other circumstances.

The single-best-study approach has the advantage of maximizing quality (i.e., minimizing risk of bias and maximizing applicability). However, it has three disadvantages: (1) the lack of scientific replication of findings, (2) the inability to determine consistency across studies (e.g., heterogeneity of effect sizes), and (3) the likelihood of low statistical power (if the study is not a mega-trial) precluding an answer to the Key Question (resulting in an evidence grade of Insufficient). However, this latter consideration should not influence a reviewer's choice if the remaining evidence (outside of the “best” study) is inapplicable or has an unacceptably high risk of bias.

A second strategy is to add studies that, relative to the single best study, are more susceptible to bias and/or less applicable. This permits measurement of cross-study consistency, and increases power. However, this strategy does not explicitly consider whether the “best set” actually permits a conclusion.

This suggests a third strategy, which involves a further lowering of the bar: admit still lower quality studies into the formal analysis, to increase the chance of obtaining an answer to the Key Question. This approach underscores a tradeoff: increasing quantity in this way will also increase the risk of an inappropriate conclusion, because the just-added studies are of lower quality. An example of this third strategy can be found in the AHRQ Methods Guide chapter and recent paper by Norris et al.,3 which recommended that study inclusion decisions be influenced by whether the results of RCTs permit a conclusion (see Introduction for more details on this chapter). Note that this third strategy does not guarantee an answer to the Key Question, but does consider conclusiveness for the purpose of evidence prioritization.

One core component of the strength of the evidence is precision. For example, strength cannot be considered high if the confidence interval is wide with respect to a decision threshold (e.g., the interval is too wide to determine whether a difference can be considered clinically significant). Adding still more studies to the evidence base will increase the chance of obtaining a narrow confidence interval (unless the results show substantial heterogeneity), which may in turn increase the overall strength of the evidence. The resulting increase in overall risk of bias, however, may negate this possibility. This represents a fourth strategy: consider not only conclusiveness, but also the strength of the evidence, when making prioritizations.

With any of the above strategies, a decision to include lower quality evidence means that all studies on that level should be included. Selective inclusion based on study results or observed consistency with higher quality evidence would introduce bias. Thus, neither strategy 3 nor strategy 4 involves the exclusion of outlier studies in an attempt to reach a conclusion or increase evidence strength. With strategy 4, if the newly included studies do not increase precision and thereby increase the overall strength of evidence, then the reviewer should exclude all studies on this level and only evaluate evidence from higher quality studies.
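The all-or-none rule for strategy 4 can be illustrated with a short sketch. Several assumptions are made here that the report does not specify: effects are pooled with a DerSimonian-Laird random-effects model, the width of the 95 percent confidence interval stands in for "precision," and the study data are invented. A real strength-of-evidence rating involves more than interval width, so this is a simplification rather than a recommended procedure.

```python
import math


def random_effects_ci(effects, variances, z=1.96):
    """DerSimonian-Laird random-effects pooled estimate with a 95% CI."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sw
    # Cochran's Q measures between-study heterogeneity.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    estimate = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    half_width = z * math.sqrt(1.0 / sum(w_star))
    return estimate, estimate - half_width, estimate + half_width


def choose_evidence_set(higher, lower):
    """Admit the entire lower-quality level or none of it.

    Each study is an (effect, variance) pair. Individual lower-quality
    studies are never selected based on their results.
    """
    def width(studies):
        _, lo, hi = random_effects_ci([e for e, _ in studies],
                                      [v for _, v in studies])
        return hi - lo

    combined = higher + lower
    # Assumed decision rule: keep the lower level only if the CI narrows.
    return combined if width(combined) < width(higher) else higher


# Invented data for illustration only.
higher = [(0.30, 0.04), (0.20, 0.04)]
consistent_lower = [(0.25, 0.04), (0.28, 0.04)]
discordant_lower = [(-0.90, 0.04), (1.50, 0.04)]
```

With the invented data, adding `consistent_lower` narrows the interval, so all four studies are kept; adding `discordant_lower` introduces enough heterogeneity to widen it, so the whole lower level is dropped and only the two higher quality studies remain.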

For strategies 3 and 4, the potential disadvantage of adding lower quality studies (increased risk of bias) is somewhat minimized in that lower quality studies can only increase the strength of the evidence if the findings are consistent with the findings of higher-quality studies. For example, if two higher quality studies together lead to a low strength of evidence, additional lower level studies would only boost the strength to moderate if they generally agreed with higher level studies.

Additional issues occur when the only available evidence is low in quality. Studies of low quality with biases in the same direction might have similar (consistent) effect sizes, which could lead to an overestimate of the strength of evidence. Furthermore, consistency or precision in the findings of a low-quality evidence base does not change the fact that the evidence is low quality.

Table 4 outlines the four strategies. Checkmarks indicate which facets of the evidence are explicitly considered during evidence prioritization. All four strategies consider both risk of bias and applicability in prioritizing evidence.

Table 4. Strategies for defining the “best evidence” set for a given Key Question.

The specific implementations could involve:

  • The use of a criterion that was not employed for study inclusion (e.g., in a group of included RCTs, define the “best evidence” set as those studies that blinded study participants)
  • The use of a more stringent threshold (e.g., in a group of studies that all reported data on at least 50 percent of study participants, define the “best evidence” set as those that reported data on at least 80 percent of study participants)
  • The combination of several criteria involving risk of bias and applicability
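The three implementation approaches listed above can be sketched as one filter over the already included studies. The `IncludedStudy` class, the `best_evidence_set` function, and the three study records are hypothetical; the 80 percent threshold is taken from the example in the list.

```python
from dataclasses import dataclass


@dataclass
class IncludedStudy:
    # Hypothetical record of a study that already met the inclusion criteria.
    name: str
    participants_blinded: bool
    fraction_reported: float  # share of participants with reported data


def best_evidence_set(studies, require_blinding=False, min_fraction_reported=0.0):
    """Apply extra or stricter criteria to studies that are already included."""
    return [
        s for s in studies
        if (s.participants_blinded or not require_blinding)
        and s.fraction_reported >= min_fraction_reported
    ]


# Made-up RCTs that all reported data on at least 50 percent of participants.
included = [
    IncludedStudy("A", participants_blinded=True, fraction_reported=0.95),
    IncludedStudy("B", participants_blinded=False, fraction_reported=0.85),
    IncludedStudy("C", participants_blinded=True, fraction_reported=0.60),
]

# A criterion not used for inclusion: participant blinding.
blinded_only = best_evidence_set(included, require_blinding=True)
# A more stringent threshold: at least 80 percent of participants reported.
stricter = best_evidence_set(included, min_fraction_reported=0.80)
# A combination of both criteria.
combined = best_evidence_set(included, require_blinding=True,
                             min_fraction_reported=0.80)
```

Studies that fail these filters remain included in the review; under the framework they would be tabled but not otherwise used.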

Note that these implementation approaches are derived from the earlier list of potential inclusion criteria for selection of individual studies (Task 1).

Strategies 2–4 further consider both replication and cross-study consistency; strategies 3 and 4 consider whether the evidence is sufficient to permit an answer to the Key Question; and only 4 attempts to maximize the strength of the evidence underpinning the conclusion for that Key Question. Note that a conclusion is still possible using strategies 1 and 2, but strategies 3 and 4 are the only strategies that explicitly use the conclusiveness of the evidence as a factor.

Selecting an evidence prioritization strategy involves a number of tradeoffs. Although strategy 1 (best single study) is the most feasible and has a low risk of leading to an inappropriate conclusion, it also has a high risk of an inappropriate lack of conclusion. At the other extreme, strategy 4 is the least feasible as it may require analysis of a large number of studies, and it has a high risk of an inappropriate conclusion due to inclusion of lower quality studies; however, due to its greater statistical power it has the lowest risk of an inappropriate lack of conclusion. Strategies 2 and 3 allow for an intermediate level of tradeoffs between these two extremes.

A reviewer may specify in the protocol that they will initially use a more stringent strategy regarding study inclusion, but if the resulting evidence is insufficient to permit a conclusion, they may choose a less stringent strategy to increase the chances of reaching a conclusion. However, there is no guarantee that inclusion of lower quality studies will permit a conclusion. Even if a large amount of evidence is available, problems in quality or consistency or precision may preclude a conclusion.

As noted in the Introduction, the reviewer is free to decide whether meta-analysis is appropriate for a given evidence base. If so, a reviewer may choose to synthesize different bodies of evidence (e.g., RCTs and nonrandomized studies) separately and then decide whether the lower quality body of evidence may be used to enhance the overall strength-of-evidence rating.

Task 3. Methods for Evaluating Evidence Prioritization Strategies

The tradeoffs inherent in the choice of prioritization strategy raise at least two important questions. The first is whether different strategies on average would lead to similar or different conclusions. The second question is which strategy leads to the “most appropriate” conclusions. For a meta-analysis, this would include the best estimate of the true effect size with the highest strength of evidence.

The use of an alternate prioritization strategy can be viewed as a sensitivity analysis of inclusion criteria. For example, the Cochrane Handbook for Systematic Reviews of Interventions (version 5.0.2, September 2009)13 recommends numerous sensitivity analyses, especially for review decisions that were arbitrary or unclear. These include the addition or removal of studies wherein poor reporting made it difficult to determine whether they met inclusion criteria; changing criteria about participants (e.g., age range); changing criteria about interventions (e.g., doses); changing criteria about outcomes (e.g., length of followup); and changing criteria about study design (e.g., whether to include randomized studies with unblinded outcome assessment).13 Similar recommendations have been made by several other authors.14–18

Note that one can consider two types of conclusions: one about the size of the effect, or one about the direction of the effect. One possible output of a strategy is that there is insufficient evidence, which reflects a non-conclusion about the size and direction of the effect. This may be the most appropriate reviewer decision. Also note that one could compare strategies not only on the conclusions drawn, but also on the strength of the conclusions drawn.

Question 1. Do Different Strategies Lead to Similar or Different Conclusions?

Addressing this question requires using methods that compare these strategies in systematic reviews. Three alternatives might be used.

Method 1. Compare Published Systematic Reviews

A literature search could identify and compare the conclusions of different systematic reviews that used different prioritization strategies to address the same clinical question. The advantage of this method is its relative ease of implementation. Provided a reviewer can find published reviews that addressed the same clinical question using different strategies, the comparison of the reviews' conclusions can be done relatively quickly. Although this would be the least labor-intensive method, it has some drawbacks. First, it may be difficult to identify clinical questions where different systematic reviews used different prioritization strategies. Second, the systematic reviews may have differed in other methodological areas, such as risk-of-bias assessment and strength of evidence assessment, which could then lead to differences in conclusions among reviews. This would make it difficult to determine whether different evidence prioritization strategies truly led to different conclusions, or whether they would have led to the same conclusion if the reviews had been similar in other methodological areas. In addition, methodology is not always well reported in published systematic reviews, often simply due to article length limitations in journals.

Method 2. Test the Robustness of an Existing Systematic Review

A reviewer could identify a single existing systematic review, determine its evidence prioritization strategy (by examining the report's inclusion criteria), and test other prioritization strategies on the same evidence base, while keeping all other methodology the same. The advantage of this method over method 1 is that other methodological aspects of the review (e.g., risk-of-bias assessment) would no longer confound the comparison. However, this method is more labor-intensive than method 1, as it requires performing an independent research synthesis using the other prioritization strategies.

Method 3. Initiate a Systematic Review and Compare Prioritization Strategies

A reviewer could initiate a systematic review of a given clinical question and compare the conclusions generated by two or more different evidence prioritization strategies. Similar to method 2, the reviewer would use the same methods for risk-of-bias assessment and strength-of-evidence rating, so that any differences in conclusions could be attributed only to differences in the evidence prioritization strategies. The advantage over method 2 is that the reviewer would not be dependent on the quality of reporting of a published review, which may often lack important information. However, this would be more labor-intensive than methods 1 and 2 since there would be no reliance on already-published reviews.

Although methods 2 and 3 address the inherent drawbacks of comparing already published systematic reviews, they do not address the more important question of what is the “most appropriate” conclusion (or non-conclusion) to reach.

Question 2. Which Strategy (or Strategies) Leads to the Most Appropriate Conclusions?

In order to measure “appropriateness,” a reviewer needs to define the correct answer to a given clinical question. This could be based on meta-analysis of a complete evidence base on a well-understood clinical topic. However, we note that a meta-analysis is not a prerequisite for reaching the most appropriate conclusion.

Meta-analysis of a Complete Evidence Base

Perform a meta-analysis of all well-done studies of a given clinical topic (using participant-level data if available). Define criteria for which of the published studies are actually entered into this meta-analysis (e.g., only randomized blinded trials, or any direct comparison studies, etc.). This represents the reference standard.

Next, define a set of partial evidence that excludes the most recent x years of studies. The question is: which prioritization strategy best estimates the reference standard using only this partial evidence? The summary effect size of the complete evidence base would be the benchmark for comparison. However, reviewers should check to determine whether the standard of care that would be used in intervention comparisons has changed during the chosen time interval. Reviewers should also check to ensure that factors other than inclusion criteria, such as selective outcome reporting, publication bias, or changes in implementation strategies, are not potential explanations for observed changes in evidence base findings over time.
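The comparison described above can be sketched as follows. The sketch assumes, for simplicity, fixed-effect inverse-variance pooling and represents each prioritization strategy as a filter on study design; neither choice is prescribed by this report. The function names, the strategy names, and the four studies are invented for illustration.

```python
def pooled(studies):
    """Fixed-effect inverse-variance pooled effect size (assumed model)."""
    weights = [1.0 / s["var"] for s in studies]
    return sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)


def compare_strategies(studies, strategies, reference_filter, latest_year, x):
    """Compare each strategy's estimate on partial evidence to the reference."""
    # Reference standard: the complete evidence base meeting the entry criteria.
    reference = pooled([s for s in studies if reference_filter(s)])
    # Partial evidence: drop the most recent x years of studies.
    partial = [s for s in studies if s["year"] <= latest_year - x]
    errors = {}
    for name, keep in strategies.items():
        subset = [s for s in partial if keep(s)]
        # A strategy that admits no studies yields a non-conclusion.
        errors[name] = abs(pooled(subset) - reference) if subset else None
    return reference, errors


# Invented studies; effect sizes and variances are not real data.
studies = [
    {"year": 2001, "design": "rct", "effect": 0.10, "var": 0.04},
    {"year": 2003, "design": "nonrandomized", "effect": 0.40, "var": 0.02},
    {"year": 2008, "design": "rct", "effect": 0.22, "var": 0.01},
    {"year": 2010, "design": "rct", "effect": 0.18, "var": 0.01},
]
strategies = {
    "rcts_only": lambda s: s["design"] == "rct",
    "all_designs": lambda s: True,
}
reference, errors = compare_strategies(
    studies, strategies,
    reference_filter=lambda s: s["design"] == "rct",
    latest_year=2010, x=5,
)
```

In this invented example the RCT-only strategy lands closer to the reference standard than the all-designs strategy, but a single comparison of this kind says nothing general about which strategy is preferable. The cautions above about changes in standard of care, selective outcome reporting, and publication bias are not modeled here.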

Limitations of this approach include the lack of agreement on reliable validity standards for meta-analysis and the possibility of incorporation bias due to testing the validity of a subset of evidence using the whole evidence base as the gold standard. In some instances, a small evidence base (consisting of one or a few well-designed, appropriately powered studies) may be sufficient to reach the most appropriate conclusion.
