A Process for Robust and Transparent Rating of Study Quality: Phase 1 [Internet]

Review
Rockville (MD): Agency for Healthcare Research and Quality (US); 2011 Nov. Report No.: 12-EHC004-EF.

Excerpt

Background: Critical appraisal of individual studies, with a formal summary judgment of methodological quality, and subsequent assessment of the strength of a body of evidence addressing a specific question are essential activities in conducting comparative effectiveness reviews (CERs). Uncertainty concerning the optimal approach to quality assessment has given rise to wide variation in practice. A well-defined and transparent methodology for evaluating the robustness of quality assessments is critical for the interpretation of systematic reviews as well as for the larger CER process.

Purpose: To complete the first phase of a project to develop such a methodology, we aimed to examine the extent and potential sources of inter- and intra-rater variations in quality assessments, as conducted in our Evidence-based Practice Center (EPC).

Methods: We conducted three sequential exercises: (1) quality assessment of randomized controlled trials (RCTs) based on the default quality item checklist used in EPC reports, without further instruction; (2) quality assessment of RCTs guided by explicit definitions of the quality items; and (3) quality assessment of RCTs based on manuscripts stripped of identifying information, combined with sensitivity analyses of the quality items. The RCTs used in these exercises had been included in a previous CER on sleep apnea. Three experienced systematic reviewers participated in these exercises.

Data synthesis: In exercise 1, an initial set of 11 RCTs was subjected to a quality assessment process without any guidance, conducted in parallel by three independent reviewers. We found that the overall study quality ratings were discordant among the reviewers 64 percent of the time. In exercise 2, quality assessments were performed on a second set of RCTs, guided by explicit quality item definitions. The overall study quality ratings were discordant in 55 percent of the cases. In exercise 3, the provenance (i.e., title, authors, journal, etc.) of the published papers used in exercise 2 was concealed and, simultaneously, “influential” factors such as study dropout rate and blinding were variably modified in a sensitivity analysis scheme. Comparing inter-rater disagreements between exercises 2 and 3, we observed that reviewers were less often in disagreement regarding the overall study quality rating (54.5 percent in exercise 2 vs. 45.5 percent in exercise 3). Anonymization of the papers resulted in an increased proportion of disagreements for several items (e.g., “definition of outcomes,” “appropriate statistics”). We also observed that for certain items with a less subjective interpretation (e.g., blinding of outcome assessors or patients), the extent of disagreement was consistent between exercises 2 and 3.
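The discordance figures above correspond to the proportion of trials for which the three reviewers' overall ratings were not all identical. A minimal sketch of that calculation is shown below; the reviewer labels and A/B/C ratings are invented for illustration and are not data reported in this study.

    # Minimal sketch (Python): proportion of trials whose overall quality
    # ratings are discordant among three reviewers. The ratings below are
    # hypothetical examples, not data from the report.
    def discordance_rate(ratings_by_reviewer):
        """Fraction of trials for which the reviewers' ratings are not all identical."""
        per_trial = list(zip(*ratings_by_reviewer.values()))  # one tuple of ratings per trial
        discordant = sum(1 for trial in per_trial if len(set(trial)) > 1)
        return discordant / len(per_trial)

    example = {  # hypothetical A/B/C overall ratings for 11 trials
        "reviewer_1": ["A", "B", "B", "C", "A", "B", "C", "B", "A", "B", "C"],
        "reviewer_2": ["A", "B", "C", "C", "B", "B", "C", "A", "A", "B", "B"],
        "reviewer_3": ["B", "B", "C", "C", "A", "A", "C", "B", "A", "C", "B"],
    }

    print(f"Discordance rate: {discordance_rate(example):.0%}")  # 64% for this invented example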

Limitations: The results presented here are based on a small sample of RCTs, selected from a single CER and assessed by three reviewers from one EPC only. The definitions of the items in our checklist were not evaluated for adequacy and clarity, other than for their face validity as assessed by the reviewers in this study. We acknowledge that this default checklist may not be in widespread use across evidence synthesis practices and is not directly aligned with the current trend to shift the focus from methodological (and reporting) quality to explicit assessment of the risk of bias in studies. For these reasons, the generalizability and the target audience of this research activity may be limited. Furthermore, we did not examine how our quality assessment tool compared with other available tools or how our assessments would differ if applied to a different clinical question. Thus, our findings are preliminary, and no definitive conclusions can or should be drawn from this pilot study.

Conclusions: We identified extensive variation in overall study ratings among three experienced reviewers. Discrepancies among reviewers in the assignment of individual items are common. While it may be desirable to have a single rating agreed upon by multiple reviewers through a process of reconciliation, in the absence of a gold standard method it may be even more important to report the variation in assessments among different reviewers. A study with large variation in quality assessments may be fundamentally different from one with little variation, even if both are assigned the same consensus quality rating. Further investigations are needed to evaluate these hypotheses.

Publication types

  • Review

Grants and funding

Prepared for: Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services, Contract No. 290-2007-100551. Prepared by: Tufts Medical Center Evidence-based Practice Center, Boston, Massachusetts.