NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Reliability Testing of the AHRQ EPC Approach to Grading the Strength of Evidence in Comparative Effectiveness Reviews

Methods Research Reports

Investigators: ND Berkman, PhD, KN Lohr, PhD, LC Morgan, MA, E Richmond, MPH, TM Kuo, PhD, MPH, S Morton, PhD, M Viswanathan, PhD, D Kamerow, MD, S West, PhD, and E Tant, BA.

Rockville (MD): Agency for Healthcare Research and Quality (US); May 2012.
Report No.: 12-EHC067-EF

Structured Abstract


Objectives:

This project examined Agency for Healthcare Research and Quality (AHRQ) methods guidance to its Evidence-based Practice Center (EPC) program on grading the strength of evidence (SOE) for therapeutic interventions. We tested the inter-rater reliability of the two main components of the AHRQ approach to grading SOE for specific outcomes: (1) scoring evidence on the four required domains (risk of bias, consistency, directness, and precision), separately for randomized controlled trials (RCTs) and observational studies, and (2) developing an overall SOE grade, given the scores for the individual domains.

Data Sources and Methods:

We conducted inter-rater reliability testing using data obtained from two published comparative effectiveness reviews (CERs). We designed 10 exercises (5 on positive outcomes [benefits] and 5 on harms [adverse effects]); all 10 included RCTs, and 6 of the 10 also included 1 or more observational studies.

Eleven pairs of reviewers (22 participants) completed the exercises. Each reviewer first completed every exercise independently; each pair of reviewers then reconciled their independent responses.

We calculated summary statistics to describe agreement among reviewers and their difficulty in making each rating assessment. We used logistic regression analysis to describe the relationship between domain scores and the final SOE grade, both in relation to the specific grade selected and level of agreement among reviewers. We examined the change in independent reviewer ratings following reconciliation among reviewer pairs.


Results:

Inter-rater agreement among independent reviewers on domain scores varied considerably, from substantial for RCT risk of bias and directness to slight for observational study risk of bias. Agreement on all other domains was either moderate or fair. Agreement was generally better for RCTs than for observational studies, and agreement among reconciled reviewer pairs was as good as or better than agreement among individual independent reviewers.
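The labels used above (slight, fair, moderate, substantial) match the bands that Landis and Koch proposed for interpreting kappa-type agreement statistics. This abstract does not name the statistic the investigators used, so the following is only a minimal sketch, assuming Cohen's kappa for a single pair of raters and the Landis and Koch cutoffs. The function names and the ratings are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who graded the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same grade by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal grade frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical grade throughout; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch descriptive band."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical SOE grades from two reviewers on ten outcomes (illustration only).
reviewer_1 = ["low", "low", "moderate", "high", "insufficient",
              "low", "moderate", "low", "insufficient", "moderate"]
reviewer_2 = ["low", "moderate", "moderate", "high", "low",
              "low", "low", "insufficient", "insufficient", "moderate"]

kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f} ({landis_koch_label(kappa)})")
```

With more than two raters grading each item, as in this study's 22 reviewers, a multi-rater statistic would be needed in place of the pairwise version sketched here.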

Agreement on independent reviewer SOE grades was generally poorer than for domain scores. Overall agreement was slight and it was not appreciably better when limited to the exercises that included only RCTs. Neither agreement on domain scores nor agreement about the level of difficulty in evaluating particular domains predicted the overall SOE grades.

When evidence was limited to RCTs, higher SOE grades (moderate or high) were associated with the RCT evidence being scored as consistent and precise. Including observational studies alongside RCTs in an exercise was a strong predictor of a lower SOE grade, namely insufficient or low.


Conclusions:

Our findings demonstrate that the conclusions reached by experienced reviewers based on the same evidence can differ greatly, particularly when they are faced with bodies of evidence that do not lend themselves to meta-analysis and they need to rely more heavily on their own judgment. Of particular concern is how to deal with (a) outcomes that are evaluated through a combination of RCTs and observational studies, (b) outcomes that are evaluated through more than one measure, and (c) evidence that appears to show no difference.

We conclude that additional methodological guidance is needed, including more details and examples, supported by more training, particularly on how best to evaluate the “thornier” bodies of evidence as discussed above. However, some potential will always exist for disagreement even among experienced reviewers. EPC reviewer teams need to be transparent in how they have conducted this task. This will help to ensure that stakeholders can be confident of their interpretation of the evidence.

Our study provided only a first approximation of reviewers’ rationales for differences in SOE decisions. Additional research is needed to understand gaps in guidance that should be filled, areas of insufficient understanding of the guidance itself and how best to overcome that deficit, and complex decisions that may still need to be left to the review team’s substantive expertise.

Prepared for: Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services, Contract No. 290-2007-10056-I. Prepared by: RTI International–University of North Carolina Evidence-based Practice Center, Research Triangle Park, NC

Suggested citation:

Berkman ND, Lohr KN, Morgan LC, Richmond E, Kuo TM, Morton S, Viswanathan M, Kamerow D, West S, Tant E. Reliability Testing of the AHRQ EPC Approach to Grading the Strength of Evidence in Comparative Effectiveness Reviews. Methods Research Report. (Prepared by RTI International–University of North Carolina Evidence-based Practice Center under Contract No. 290-2007-10056-I.) AHRQ Publication No. 12-EHC067-EF. Rockville, MD: Agency for Healthcare Research and Quality. May 2012.

This report is based on research conducted by the RTI International–University of North Carolina Evidence-based Practice Center (EPC) under contract to the Agency for Healthcare Research and Quality (AHRQ), Rockville, MD (Contract No. 290-2007-10056-I). The findings and conclusions in this document are those of the authors, who are responsible for its contents; the findings and conclusions do not necessarily represent the views of AHRQ. Therefore, no statement in this report should be construed as an official position of AHRQ or of the U.S. Department of Health and Human Services.

The information in this report is intended to help health care decisionmakers—patients and clinicians, health system leaders, and policymakers, among others—make well-informed decisions and thereby improve the quality of health care services. This report is not intended to be a substitute for the application of clinical judgment. Anyone who makes decisions concerning the provision of clinical care should consider this report in the same way as any medical reference and in conjunction with all other pertinent information, i.e., in the context of available resources and circumstances presented by individual patients.

This report may be used, in whole or in part, as the basis for development of clinical practice guidelines and other quality enhancement tools, or as a basis for reimbursement and coverage policies. AHRQ or U.S. Department of Health and Human Services endorsement of such derivative products may not be stated or implied.


540 Gaither Road, Rockville, MD 20850

Bookshelf ID: NBK98221; PMID: 22764383

