NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Shojania KG, Sampson M, Ansari MT, et al. Updating Systematic Reviews. Rockville (MD): Agency for Healthcare Research and Quality (US); 2007 Sep. (Technical Reviews, No. 16.)

Updating Systematic Reviews.


Study Identification

Search Strategy

The first four questions are explored through a cohort of 100 systematic reviews identified through a search of the ACP Journal Club database on Ovid, undertaken January 31, 2006. The search to identify candidates screened for inclusion in the cohort was:






data sources.ab.


(search$ or MEDLINE®).ab.




limit 5 to articles with commentary

Additional Cochrane reviews included for Question 3 were identified through the same search. Additional AHRQ reports used for Question 3 were identified through PubMed® with the query “Evid Rep Technol Assess (Summ)”[Journal:__jrid21544]. Searches were undertaken April 10, 2006.

Eligibility Criteria

The time to important changes in evidence might vary depending on a number of factors, including the type of question posed by the original review (e.g., therapeutic, diagnostic, prognostic, or health policy), the type of studies included (e.g., randomized controlled trials, observational studies), and whether or not the systematic review provided quantitative synthesis. In the interest of reducing potential sources of variation, we focused on systematic reviews that evaluated the clinical benefit or harm of a specific (class of) drug, device, or procedure and provided quantitative synthesis that included a point estimate and 95% confidence interval for at least one clinical outcome (disease endpoint, functional status, mortality) or established intermediate outcome (e.g., blood pressure, glycemic control, standard instrument for measuring disease activity, such as a depression scale). We excluded evaluations of alternative and complementary medicines, as well as educational and behavioral interventions.

Further eligibility requirements were as follows:

  • Publication from 1995 to 2005 (but with search date no later than Dec 31, 2004 to ensure at least one full year for new evidence to appear)
  • Reporting of at least one conventional meta-analytic estimate of treatment benefit or harm. We excluded individual patient data meta-analyses, meta-regressions, and indirect meta-analyses because of the difficulty of determining whether or not data from new trials would alter previous quantitative results.
  • Included at least one randomized controlled trial; other eligible designs were restricted to quasi-randomized or controlled clinical trials (CCTs).
  • Meta-analytic outcomes reported in the form of a relative risk, odds ratio, or absolute risk difference for binary outcomes and weighted mean differences for continuous outcomes. We excluded standardized effect sizes to avoid the complexity of assessing candidate new data reported using different outcome scales to determine if they would have met the authors' criteria for incorporation into the standardized effect measure in the original review.

We used as our sampling frame systematic reviews that were selected for commentaries in ACP Journal Club, a bimonthly publication of the American College of Physicians that aims “to select from the biomedical literature articles that report original studies and systematic reviews that warrant immediate attention by physicians attempting to keep pace with important advances in internal medicine.” 11 The article selection process involves “reliable application of explicit criteria for scientific merit, followed by assessment of relevance to medical practice by clinical specialists.” Moreover, systematic reviews indexed in ACP Journal Club must meet specific quality criteria. Thus, choosing this sampling frame allowed us to identify systematic reviews of reasonable quality (or better) that are directly relevant to clinical practice.

Cohort Selection Process

Each record identified through the search of ACP Journal Club was screened for eligibility on the basis of title and abstract by 2 reviewers. Records with consensus in favor of eligibility were promoted to the next stage, where final confirmation of eligibility was made based on the full report. Records were screened in alphabetical order by first author until 100 eligible reviews were identified. We chose a sample size of 100 to balance the practical issue of time required to ascertain the need for updating for each review in the cohort with power considerations, such as the expected width of confidence intervals given a denominator of 100 and the ability to evaluate predictive models of the need for updating with at least 3 to 5 potential predictors in the models. Of the 100 total reviews, we set the maximum number of Cochrane reviews to 30. We chose to limit the number of Cochrane reviews, as evidence suggests that they differ in important respects from other systematic reviews in the peer-reviewed literature on the basis of style and possibly on topic coverage.1

A supplemental sample was formed for question 3. Additional eligible reviews beyond the 100 had been identified, and because data extraction was quick and a larger cohort would facilitate comparisons between report types, these additional reports were included in the cohort for question 3. Few eligible HTA reports were identified through ACP Journal Club, so Evidence Reports that were otherwise eligible were added to permit comparisons of production milestones between HTA reports undertaken by AHRQ and other types of reviews.

When an eligible review was an explicit update of an earlier review (e.g., in the case of Cochrane reviews, which are updated and reissued periodically as a matter of policy), we used the earliest version in the time frame of 1995-2005. Similarly, when more than one review on the same topic was identified, only the earliest was included, to avoid double counting the same changes in evidence (or lack thereof).

We abstracted data on primary outcomes for each systematic review. To qualify as primary outcomes, we required that authors use the words “primary” or “main” and that they identify no more than 3 such outcomes (i.e., we regarded identification of more than 3 “primary” outcomes as inconsistent with the concept of primary outcome). For reviews that did not identify primary outcomes, we selected outcomes in the order in which their results were presented, including up to 4 efficacy outcomes and up to 2 harm outcomes. Eligible outcomes were clinical outcomes (disease endpoint, functional status, mortality) or established intermediate outcome (e.g., blood pressure, glycemic control, standard instrument for measuring disease activity, such as a depression scale). Each must have provided an eligible quantitative synthesis in the formats noted above (relative risk, odds ratio, or absolute risk difference for binary outcomes and weighted mean differences for continuous outcomes).
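The outcome selection rule described above can be sketched in code. This is an illustration only, not a tool used in the study; the record layout (the `name`, `kind`, and `primary` keys) is hypothetical, and the sketch assumes that a review labeling more than 3 outcomes as primary is handled like a review that identified none.

```python
def select_outcomes(outcomes, max_primary=3, max_efficacy=4, max_harm=2):
    """Pick the outcomes to abstract from one review.

    `outcomes` is a list of dicts in the order the review presents its
    results. Each dict has hypothetical keys: 'name', 'kind' ('efficacy'
    or 'harm'), and 'primary' (True if the authors used the word
    "primary" or "main" for it).
    """
    primary = [o for o in outcomes if o["primary"]]
    # Accept the authors' designation only if they named 1 to 3 outcomes.
    if 0 < len(primary) <= max_primary:
        return primary

    # Assumption: otherwise fall back to presentation order, taking up to
    # 4 efficacy outcomes and up to 2 harm outcomes.
    limits = {"efficacy": max_efficacy, "harm": max_harm}
    counts = {"efficacy": 0, "harm": 0}
    selected = []
    for outcome in outcomes:
        kind = outcome["kind"]
        if counts[kind] < limits[kind]:
            counts[kind] += 1
            selected.append(outcome)
    return selected


# Hypothetical review with no designated primary outcomes.
example = (
    [{"name": f"efficacy {i}", "kind": "efficacy", "primary": False} for i in range(5)]
    + [{"name": f"harm {i}", "kind": "harm", "primary": False} for i in range(3)]
)
chosen = select_outcomes(example)
```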

Data Collection

The several questions reported here (detecting updating signals for the cohort of 100 quantitative systematic reviews, publication time lags for a larger cohort of 148 reviews, analysis of the patterns of growth in evidence in different clinical areas, and the survey of organizations involved in systematic review work regarding updating practices) involved different data collection methods. These details are presented in sections for each project.

I. Signals for Updating and Survival Analysis for the Cohort of 100 Systematic Reviews

Data extraction from the cohort reviews. For each of the 100 systematic reviews, we characterized the type of intervention (drug, device, or procedure), the numbers of included trials and participants, methodological features, such as the presence of heterogeneity or publication bias, descriptions of reported outcomes and identification of those explicitly identified as ‘primary’ or ‘main,’ the meta-analytic results for each outcome, and excerpted quotations of the authors' characterizations of these results and their interpretation of them.

We also classified all reviews into a clinical area. For reviews published in print journals, we primarily based this classification on the ISI classification of the clinical area of the journal in which the review appeared. For reviews published in general journals, Cochrane reviews, and HTA reports, we considered the specialty journals for which the review would have been most suitable. In the case of Cochrane reviews, we also based the classification of clinical content area on the review group that carried out the work (e.g., the Cochrane Musculoskeletal Group, the Cochrane Metabolic and Endocrine Disorders Group). For other types of reviews (e.g., HTAs), we searched the Cochrane library to find reviews on similar topics and examined the reviews to determine which review group undertook them. Two investigators undertook these classifications (AI, MS), with their results confirmed by a third reviewer (MA) with a clinical and research background.

Identification of new data for each review in the cohort. We performed systematic searches for each of the 100 reviews using a variety of electronic search strategies. Constructing searches as comprehensive as one would undertake for a formal systematic review (or an update) would involve a prohibitive amount of work given our cohort size of 100 systematic reviews.

Therefore, we adopted a combination of efficient strategies. Briefly, these involved developing simple subject searches and then limiting the results to the Core Clinical Journals subset plus the Randomized Controlled Trial publication type, subject searches run using the Clinical Query* filter in Ovid, applying the Related Articles function in PubMed® to the three largest and the three most recent studies in the original review (i.e., up to 6 studies in total), and using a ‘citing references’ search (through Scopus™) to identify new randomized trials that cited the original review. These search strategies served two purposes: one was to identify all new studies appropriate for updating the original systematic review; the other was to compare the performance of different strategies and evaluate their relative efficiency as surveillance methods for detecting signals for the need to update prior reviews. For studies where an updating signal occurred, we searched CENTRAL, The Cochrane Collaboration's Central Register of Controlled Trials, using the subject search developed for MEDLINE®. Examples of subject searches with the limits tested are shown in Appendix C *. A sample recording sheet used as the basis for assessing search performance is shown in Appendix D.

For each systematic review in our cohort, project team members who had backgrounds in both medicine and research screened citations retrieved by the above methods to identify trials that would have met the inclusion criteria in the original meta-analysis. Retrieved records were screened in chronological order, and the full text of articles was used when necessary to determine eligibility or extract data. The review protocol stopped when one of the signals for the need for updating (defined below) was met. Wherever possible we identified new systematic reviews on the same topic. When the search strategies yielded no eligible new trials, we conducted more comprehensive electronic searches and reviewed relevant chapters in sources such as Clinical Evidence and UpToDate to ensure that we had not missed new sources of evidence. Figure 1 outlines the overall review protocol for assessing the presence or absence of signals for updating for each of the systematic reviews in the cohort.

Figure 1. Review protocol to detect signals for updating.



Outcomes: Signals for Changes in Evidence That Would Warrant Updating. Ideally, assessments of the need to update previous systematic reviews would involve assessments by experts of new evidence relevant to the original review. Shekelle and colleagues used such an approach in order to determine if guidelines required updating.7 By choosing a small number of guidelines (17) produced by a single agency, they were able to ask the authors of the original guidelines to assess changes in evidence. This approach would clearly not be feasible for a larger sample (100 systematic reviews in the present case). It is also worth noting that identifying experts is not a straightforward task, requiring a balance of context expertise, methodological expertise, and freedom from bias regarding the question under consideration (not always easy to find among experts in a given area).

In designing a method for detecting changes in evidence without resorting to consulting experts, we considered the work of previous investigators12–14 who have addressed similar problems involving the comparison of two sets of results related to the same question: randomized and non-randomized studies of the same intervention,14 initial and subsequent trials evaluating the same therapy,13 and conference proceedings versus full-length journal articles for the same trials.12 In all of these examples, investigators made determinations of important changes or differences between results without resorting to expert review. They achieved such determinations credibly by using a combination of quantitative signals (roughly the same as the ones we have chosen) and qualitative signals based on the language used to describe the results. For instance, if an article characterized a therapy as effective and another article evaluating the same therapy described it as ineffective, this would represent a major change. In a similar manner, we conceptualized quantitative and qualitative signals of potential changes in evidence sufficiently important to warrant updating of a previous systematic review.

Quantitative signals consisted of changes in statistical significance (using the conventional alpha of 0.05) or large changes in effect size (a relative change in effect magnitude of at least 50%). We restricted these changes to those involving one of the primary outcomes of the original systematic review or any mortality outcome (i.e., all-cause mortality or any cause-specific mortality outcome for which the original review provided a meta-analytic estimate of effect). We also discounted ‘borderline’ changes in statistical significance, which we defined as having occurred when the original and updated meta-analytic results both had p-values in the range of 0.04 to 0.06. For instance, a change from p = 0.041 to p = 0.059 would not count as a quantitative signal to update, nor would the converse change (from p = 0.059 to p = 0.041). We discounted such changes, as well as changes in effect magnitude less than 50% and all changes involving non-primary outcomes, so that quantitative signals of changes in evidence would represent robust indicators of the need to update previous reviews. Quantitative signals were detected by performing updated meta-analyses that combined data from eligible new trials with the previous meta-analytic results.
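The quantitative signal rules in this paragraph can be expressed as a short function. The sketch below is illustrative and is not the study's worksheet. The detailed definitions are in Appendix A and are not reproduced here, so the sketch makes a labeled assumption: effects are supplied on a scale where zero means no effect (for example a log relative risk, a risk difference, or a mean difference), and "relative change" is the absolute difference divided by the original magnitude.

```python
import math


def relative_change(original_effect, updated_effect):
    """Relative change in effect magnitude.

    Assumption: both effects are on a scale where 0 means 'no effect'.
    """
    if original_effect == 0:
        return math.inf if updated_effect != 0 else 0.0
    return abs(updated_effect - original_effect) / abs(original_effect)


def quantitative_signal(orig_effect, orig_p, new_effect, new_p,
                        alpha=0.05, borderline=(0.04, 0.06),
                        min_relative_change=0.5):
    """Return a label for the quantitative signal, or None if there is none."""
    low, high = borderline
    significance_changed = (orig_p < alpha) != (new_p < alpha)
    # Changes where both p-values sit between 0.04 and 0.06 are discounted.
    both_borderline = low <= orig_p <= high and low <= new_p <= high
    if significance_changed and not both_borderline:
        return "change in statistical significance"
    if relative_change(orig_effect, new_effect) >= min_relative_change:
        return "change in effect size"
    return None
```

With hypothetical inputs, `quantitative_signal(-0.30, 0.041, -0.28, 0.059)` returns `None`, matching the borderline example in the text, because both p-values fall between 0.04 and 0.06 and the effect changes by less than 50%.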

Qualitative signals of the need to update involved factors relevant to the application of evidence beyond changes in the original meta-analytic estimates. These included new information about harm sufficient to impact clinical decision making, important caveats to the original results, emergence of a superior alternate therapy, and important changes in certainty or direction of effect. Qualitative signals were detected using explicit criteria for comparing the language used to characterize findings in the original systematic review with descriptions of findings in new systematic reviews that addressed the same topic, new ‘pivotal trials’, new clinical practice guidelines, or new editions of major textbooks (e.g., UpToDate). Pivotal trials were defined as trials that had a sample size at least three times the previous largest trial or were published in one of the 5 top general medical journals (New England Journal of Medicine, The Lancet, JAMA, Annals of Internal Medicine, and BMJ) based on a ranking by journal impact factor. We defined qualitative signals with two levels of importance: signals of ‘potentially invalidating changes in evidence’, which we considered as changes such that one would no longer want clinicians or policy makers to base decisions on the original systematic review (e.g., a pivotal trial characterizes treatment effectiveness in opposite terms to those in the original review); and signals of ‘major changes in evidence’, which we regarded as changes that would not completely invalidate the previous results but would still affect clinical decision making in important ways. 
Such changes might include information about the way the treatment must be delivered to confer benefit, identification of populations of patients for whom treatment is more or less beneficial, or information about impact on harder outcomes than those reported in the previous systematic review (e.g., the previous review analyzed intermediate endpoints, such as blood pressure or lipid levels, whereas new trials provide data on disease end-points, such as myocardial infarction or stroke, functional status, mortality).
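The pivotal trial definition given above reduces to a simple test. The following is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, and journal names are matched as exact strings, which a real implementation would need to normalize.

```python
# The five general medical journals named in the text.
TOP_GENERAL_JOURNALS = {
    "New England Journal of Medicine",
    "The Lancet",
    "JAMA",
    "Annals of Internal Medicine",
    "BMJ",
}


def is_pivotal_trial(sample_size, journal, previous_largest_sample_size):
    """True if a new trial meets either criterion for a pivotal trial.

    Criterion 1: sample size at least three times the previous largest trial.
    Criterion 2: publication in one of the five listed journals.
    """
    large_enough = sample_size >= 3 * previous_largest_sample_size
    return large_enough or journal in TOP_GENERAL_JOURNALS
```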

Major changes also included changes in characterizations of effectiveness that were less extreme than those for potentially invalidating signals, but which would still affect clinical decision making. For example, whereas a change from ‘effective’ to ‘ineffective’ would represent a signal for a potentially invalidating change in evidence, a change from ‘possibly beneficial’ to ‘definitely beneficial’ would represent a major change. Importantly, no attempt was made to distinguish between varying descriptions of “possibly effective.” Characterizations such as “may be effective,” “promising,” “trends towards effectiveness,” and other similar phrases or concepts were all categorized as “possibly effective.” Thus, qualitative signals for changes in evidence captured substantive differences in the characterization of treatment effects, not merely semantic differences.

Detailed definitions of the criteria for qualitative and quantitative signals are provided in Appendix A *; Appendix B provides specific examples of their application.

Detection of Quantitative Signals for Updating

An Excel worksheet was developed in which, for a given systematic review, project team members could enter the original meta-analytic result for each outcome into a template for the appropriate format (relative risk, odds ratio, risk difference, or weighted mean difference) and enter the results of new trials identified as eligible for inclusion. For a given outcome, the worksheet allowed entry of a summary estimate or raw data (e.g., a relative risk and 95% confidence interval or the number of events for patients in each study group using the format of a two by two table).

The worksheet was programmed to perform updated meta-analytic estimates and to apply logical tests to indicate when the updated result met one of the criteria for a quantitative signal (change in statistical significance or relative change in effect size of at least 50%). Because many of the original systematic reviews included a large number of trials and often did not report data for the individual trials in a complete fashion, it was impractical for us to obtain data for each trial included in the original meta-analytic estimate. Consequently, we performed the updated meta-analyses by combining the original pooled result with the individual results of eligible new trials. With fixed effects models for meta-analysis, this procedure gives the same result as would be obtained using the individual trials from the original meta-analysis. Therefore, for pragmatic reasons (avoiding having to obtain original data from each trial included in each of the 100 systematic reviews), we employed fixed effects models in our updated meta-analyses. Though random effects models are usually preferred to avoid spurious precision in the face of heterogeneity, we regarded this approach as reasonable, since our goal consisted of detecting changes in evidence that had likely occurred, not producing exact estimates of updated treatment effects.
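The equivalence claimed for fixed effects models can be checked numerically. The sketch below is not the Excel worksheet; it assumes inverse-variance fixed-effect pooling of log relative risks, with standard errors recovered from 95% confidence intervals, and all trial values are hypothetical. Pooling the original summary with a new trial gives the same estimate as pooling every trial individually.

```python
import math

Z95 = 1.96  # normal quantile conventionally used for 95% confidence intervals


def log_estimate_from_ci(point, lower, upper):
    """Convert a ratio and its 95% CI into (log effect, standard error)."""
    return math.log(point), (math.log(upper) - math.log(lower)) / (2 * Z95)


def pool_fixed(estimates):
    """Inverse-variance fixed-effect pooling of (effect, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * effect for w, (effect, _) in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


def two_sided_p(effect, se):
    """Two-sided p-value from a normal approximation."""
    return math.erfc(abs(effect / se) / math.sqrt(2))


# Hypothetical trials: (relative risk, lower 95% limit, upper 95% limit).
original_trials = [log_estimate_from_ci(*t) for t in
                   [(0.80, 0.60, 1.07), (0.90, 0.70, 1.16), (0.75, 0.50, 1.13)]]
new_trial = log_estimate_from_ci(0.85, 0.72, 1.00)

original_pooled = pool_fixed(original_trials)
# Shortcut used in the report: treat the old summary as a single study.
updated_from_summary = pool_fixed([original_pooled, new_trial])
# Reference: pool all individual trials directly.
updated_from_trials = pool_fixed(original_trials + [new_trial])
```

The two routes agree because the weight of the original summary equals the sum of the weights of the trials it contains. This identity does not hold for random effects models, where the weights depend on the between-trial variance estimated from all trials.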

Data from new trials were entered into the meta-analytic calculator in chronological order, so that the time at which a quantitative signal was met could be identified. In general, we stopped the review protocol once a change in statistical significance or change in effect size of at least 50% occurred, though we sometimes continued to add new trials to confirm stability of the results.

Group review and classification. After assessment by the reviewer, each systematic review in the cohort was discussed at a case conference attended by the team of KS, MS, MA, and JJ. At this meeting, the final classification of the updating signal status was decided by group consensus, and the completeness of the evidence base was discussed. The team had the option to request additional searching, or search directly for new studies known or suspected by team members to be relevant.

Date definitions for survival analysis. Two survival analyses were undertaken. The first used the publication date as birth. We used the MEDLINE Entrez date as a surrogate for publication date of the systematic review, as this date always includes a day, month, and year (not always the case for journal publication dates) and because the Entrez Date closely follows the publication date (typically within days to several weeks). In a second survival analysis, we defined birth as the end of the search period reported in the review. (This date did not always include a day and month. We imputed all missing months as June and all missing days as the 15th.) The end point, ‘death’ for both survival analyses was the Entrez date associated with the new evidence that resulted in the signal for updating. Where the updating signal derived from non-MEDLINE sources (e.g., an advisory from the Centers for Disease Control and Prevention, the Food and Drug Administration, or a chapter in a textbook), we used the date of publication as the date of the signal for updating. For surviving systematic reviews, observations were censored on September 1, 2006, the approximate midpoint of the 4-month period during which searches were performed for the entire cohort.
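The date handling described here can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, and survival time is measured in days.

```python
from datetime import date

CENSOR_DATE = date(2006, 9, 1)  # censoring date stated in the text


def impute_date(year, month=None, day=None):
    """Fill a missing month with June and a missing day with the 15th."""
    return date(year,
                month if month is not None else 6,
                day if day is not None else 15)


def survival_observation(birth, signal_date=None, censor_date=CENSOR_DATE):
    """Return (days of survival, event flag) for one systematic review.

    `birth` is either the Entrez date of the review or the imputed end of
    its search period. `signal_date` is the date of the evidence that
    produced the updating signal, or None if no signal occurred.
    """
    if signal_date is not None:
        return (signal_date - birth).days, True
    return (censor_date - birth).days, False
```

For example, a review whose search period is reported only as "2003" would get a birth date of `impute_date(2003)`, that is, June 15, 2003, in the second survival analysis.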

Performance of the surveillance searches in detecting signaling evidence. Three main types of signaling evidence were used: new RCTs that were added to a meta-analysis from the original systematic review in the manner of a cumulative meta-analysis, single RCTs that met our criteria for a pivotal trial, and new systematic reviews that provided evidence that appeared to overturn the findings of the original review, either by contradicting the original findings, adding an important caveat, or demonstrating a significant harm. Other signaling evidence (i.e., evidence that provided the basis for signals) included FDA advisories, expert opinion from UpToDate, and clinical trials that did not meet the criteria for pivotal trial. These sources were used as sources of signals for updating in only five reviews.

The surveillance searches looked for primary studies and for systematic reviews with the publication type meta-analysis in MEDLINE. To determine the effectiveness of these searches in detecting signaling evidence, we examined recall of signaling articles in a subset of the systematic reviews: those that were updated by search (n=79) and for which a major or notable signal occurred. For the analysis of RCTs added to the cumulative meta-analysis, only those systematic reviews that also had a quantitative signal and were updated by search were studied.

Any signaling evidence added by nomination was tested to determine if it was indexed in MEDLINE and if it would have been retrieved by the searches. In some cases, the evidence was published after the searches for new evidence for that systematic review were run. The database was updated with the search results for those nominated publications.

Targets for the cumulative meta-analysis were any RCT added to the meta-analysis of the outcome which had the signal, up to the point where the signal occurred. Targets for the final RCTs were the pivotal RCTs. Targets for final MAs were the newer meta-analyses that contained the evidence that rendered the cohort systematic review potentially in need of update. These were meta-analyses that were not explicit updates. Finally, all signaling evidence was considered. For each of these analyses, recall was calculated for each type of search. For the final analysis of recall of any signaling evidence, two additional variables were created representing recall from MEDLINE by any of the subject search methods (CQ, AIM RCT or MA) and recall from MEDLINE by either of the related articles search methods (RI RCT and RI MA).
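Recall for a search method is the share of target articles that the method retrieved. The sketch below shows the per-method and combined calculations. The method labels (CQ, AIM RCT, MA, RI RCT, RI MA) come from the text; the article identifiers and retrieval sets are hypothetical.

```python
def recall(targets, retrieved):
    """Fraction of target articles found by a search; None if no targets."""
    targets = set(targets)
    if not targets:
        return None
    return len(targets & set(retrieved)) / len(targets)


def combined_recall(targets, results_by_method, methods):
    """Recall when the results of several search methods are pooled."""
    pooled = set()
    for method in methods:
        pooled |= set(results_by_method.get(method, ()))
    return recall(targets, pooled)


# Hypothetical signaling articles and retrieval sets for one review.
targets = {"A", "B", "C", "D"}
results = {
    "CQ": {"A", "B", "X"},
    "AIM RCT": {"A"},
    "MA": set(),
    "RI RCT": {"C", "Y"},
    "RI MA": {"C"},
}

per_method = {method: recall(targets, found) for method, found in results.items()}
subject_recall = combined_recall(targets, results, ["CQ", "AIM RCT", "MA"])
related_recall = combined_recall(targets, results, ["RI RCT", "RI MA"])
```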

II. Publication Time Lags

For question 3, the impact of publication time lags on updating, we supplemented the data set used in the survival analysis with additional eligible systematic reviews identified through ACP Journal Club, as well as AHRQ Evidence Reports that met all eligibility criteria for the main cohort, except inclusion in ACP Journal Club.

We determined dates for performance of the original search, manuscript acceptance, and publication of the review. We regarded the search date as the most recent date reported in the methods section of the systematic review. For Cochrane reviews, we used the most recent of the following dates: the search date reported in the search strategy section in the body of the review, the date new studies were found and included/excluded (e.g., for updated reviews), or the date new studies were sought but not found. For database dates, the end date reported for MEDLINE searching was used (e.g., 1966-June Week 4, 2003) if available. If the MEDLINE date was not reported, any other database end date was used. If no end date was reported, the variable was treated as missing. For all types of reviews, the publication date and indexing date were taken from the Ovid MEDLINE records.

For each date (original search, manuscript acceptance, publication), we identified a year, month, and day. When month was missing, we imputed the 6th month; when day was missing, we imputed the 15th day of the month.

III. Growth of the Literature by Clinical Area

MEDLINE searches based on high-level MeSH headings corresponding to the ISI journal categories were undertaken. The resulting set of citations was limited to the publication type Randomized Controlled Trial, to the publication type Clinical Trial but not Randomized Controlled Trial, to the MEDLINE Systematic Review subset, and to the publication type Clinical Practice Guidelines. Searches were then limited by year for each year between 1988 and 2006. We chose 1988 as the beginning of the time period of interest, as this date corresponded to the period five years prior to the earliest search date for any systematic review in the cohort. Search strategies are illustrated in Appendix E *.

IV. Survey of Organizations Engaged in Funding or Production of Systematic Reviews

This exploratory Internet pilot survey on current updating practices and policies employed a purposeful sample to allow for investigation of likely information-rich cases. We chose 9 organizations well known to fund or carry out systematic reviews; 12 EPCs, in addition to AHRQ, were also asked to complete this survey. The identities of the organizations have been kept anonymous per the statements contained in the informed consent signed by participants in the survey and as stipulated in the research protocol approved by the institutional ethics review board at the Children's Hospital of Eastern Ontario.

The survey was provided to participants via the Survey Monkey15 web-based service. This was considered a suitable forum given the distribution of the sample across a wide international geographical area and given that key informants are frequent Internet users with email addresses.16, 17 Emails were sent directly to organizational Directors or to the highest ranking scientific or administrative official, asking them to identify the most appropriate internal respondent to answer the questionnaire. The questionnaire consisted of approximately 50 questions (including skip-logic functionality). These questions focused on the following topics: (a) updating policies, (b) responsibility for updating, (c) estimates of outdated reviews, (d) updating strategies and practices, including when to update, surveillance and triggers impacting updating decisions, (e) strategies for how to conduct an update, (f) barriers and facilitators to this process, (g) views on updating collaboration between groups, and (h) descriptive demographics and characteristics of the organization and the representative key informant. We estimated that the survey took 20 to 30 minutes to complete. (Appendix G *: Survey Instrument)

We attempted to increase our overall response rate by employing recommended survey methods to maximize Internet survey participation.17–20 Participants were contacted four times. A small financial incentive was offered to all participants who completed the survey. On clicking on the link to the survey, participants were presented with a description of the purpose of the study, assurance of confidentiality, and a statement of approval of the research protocol by the hospital ethics review board, followed by a request to provide informed consent or decline participation in the survey. Reminder emails were scheduled for days 10, 15, and 25 of the survey.


We fit non-parametric Kaplan-Meier curves to the data set of censored and uncensored observations and used multivariable proportional-hazards models to examine the association between survival and various features of the systematic reviews at the time of publication. We distinguished two categories of potential predictors of survival. The first category consisted of features knowable at the time of publication for a given systematic review, including clinical content area (e.g., cardiovascular medicine, obstetrics and gynecology, critical care, infectious diseases), numbers of participants and trials included in the meta-analysis, the identification of heterogeneity or publication bias, and ‘recent or ongoing activity in the field’, which we defined as present if the review included at least one trial published within the last year of its search period or if the review identified ongoing trials eligible for inclusion. Because some evidence exists to suggest that Cochrane reviews differ in important ways from other systematic reviews,1 we also included a dichotomous variable for Cochrane review versus other systematic reviews. The second category of predictors consisted of features knowable only after some surveillance of the literature (but not performance of a full update of the review). Such predictors included the number of new trials eligible for inclusion in an update of the original review, the number of new participants in these trials, the ratio of the new total number of trials to the previous total, and the ratio of the new total number of participants to the previous total.
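For readers unfamiliar with the Kaplan-Meier estimator, the following is a minimal sketch of the product-limit calculation on censored data. It is not the SAS procedure used in the study, it omits confidence intervals, and the observations are hypothetical.

```python
def kaplan_meier(observations):
    """Product-limit survival estimate.

    `observations` is a list of (time, event) pairs, where event is True
    when an updating signal occurred and False when the review was
    censored. Returns a list of (time, survival probability) pairs, one
    per distinct event time.
    """
    event_times = sorted({time for time, event in observations if event})
    survival = 1.0
    curve = []
    for t in event_times:
        # Reviews still under observation just before time t.
        at_risk = sum(1 for time, _ in observations if time >= t)
        events = sum(1 for time, event in observations if time == t and event)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical survival times in years; False marks a censored review.
curve = kaplan_meier([(1, True), (2, False), (3, True), (4, False)])
```

Censored reviews contribute to the at-risk count until their censoring time but never count as events, which is how the estimator uses partial follow-up without treating it as a signal.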

After confirming that the assumption of proportionality applied, we performed stepwise multivariate analyses using a threshold of p≤ 0.1 for variable selection and retention. In addition to the proportional hazards analysis to estimate predictors of survival, we conducted logistic regression analysis to identify predictors of survival less than two years. Cohort members that were censored in less than two years were counted as missing for this analysis. All analyses were performed with SAS version 9.0 (The SAS Institute, Cary, North Carolina).

Analysis of group differences in time lags in the publication process was made using nonparametric statistics (e.g., Kruskal-Wallis test for differences in median publication times between groups).

Survey Analysis

Closed-ended questions were analyzed primarily using a descriptive summary of findings in the form of frequencies. In addition, percentages were calculated and other details reported in text and tabular form. Participating organizations were not identified in the results, as only aggregate data are reported. The EPCs also responded to several additional open-ended questions of particular interest to the AHRQ's EPC Program. The responses to these supplemental questions were compiled for internal use by the AHRQ and are therefore not discussed in this report.



In Ovid MEDLINE, there are three clinical queries available for therapies: sensitivity, specificity, and optimized. We used the optimized query.


Appendixes cited in this report are available electronically at http://www​​.htm.

