NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Validity and Inter-Rater Reliability Testing of Quality Assessment Instruments

Methods Research Reports

Investigators: Hartling L, PhD; Hamm M, MSc; Milne A, MLIS; Vandermeer B, MSc; Santaguida PL, PhD; Ansari M, MBBS, MMedSc, MPhil; Tsertsvadze A, MD, MSc; Hempel S, PhD; Shekelle P, MD, PhD; and Dryden DM, PhD.

University of Alberta Evidence-based Practice Center
Rockville (MD): Agency for Healthcare Research and Quality (US); March 2012.
Report No.: 12-EHC039-EF

Structured Abstract


Background: Numerous tools exist to assess methodological quality, or risk of bias, in systematic reviews; however, few have undergone extensive reliability or validity testing.


Objectives: To (1) assess the reliability of the Cochrane Risk of Bias (ROB) tool for randomized controlled trials (RCTs) and the Newcastle-Ottawa Scale (NOS) for cohort studies between individual raters, and, for the ROB tool, between consensus agreements of pairs of raters; (2) assess the validity of the Cochrane ROB tool and the NOS by examining the association between study quality and treatment effect size (ES); and (3) examine the impact of study-level factors on reliability and validity.


Methods: Two reviewers independently assessed risk of bias for 154 RCTs. For a subset of 30 RCTs, two reviewers from each of four Evidence-based Practice Centers assessed risk of bias and reached consensus. Inter-rater agreement was assessed using kappa statistics. We assessed the association between ES and risk of bias using meta-regression, and examined the impact of study-level factors on that association using subgroup analyses. Two reviewers independently applied the NOS to 131 cohort studies from 8 meta-analyses, and inter-rater agreement was again calculated using kappa statistics. Within each meta-analysis, we generated a ratio of pooled estimates for each quality domain. The ratios were combined, using inverse-variance weighting and a random effects model, to give an overall estimate of differences in effect estimates.
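As an illustration of the agreement measure named above, the following is a minimal sketch of an unweighted Cohen's kappa for two raters. It is not the report's analysis code: the abstract says only that "kappa statistics" were used, so the choice of the unweighted form, the function name `cohen_kappa`, and the example judgments are all assumptions made for illustration. A weighted or multi-rater variant may well have been used in the report itself.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters judging the same items.

    Hypothetical helper for illustration; not taken from the report.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same category.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement expected from each rater's marginal category frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up judgments for one ROB domain across 10 trials (NOT data from the report).
reviewer_1 = ["low", "low", "high", "unclear", "low",
              "high", "unclear", "low", "high", "low"]
reviewer_2 = ["low", "unclear", "high", "unclear", "low",
              "low", "unclear", "low", "high", "high"]

kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw percent agreement for the agreement two raters would reach by chance given how often each uses every category, which is why it can be low even when raters agree on most studies.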


Results: Inter-rater reliability between two reviewers was fair for most ROB domains (κ ranging from 0.24 to 0.37), except for sequence generation (κ=0.79, substantial). Inter-rater reliability of consensus assessments across four reviewer pairs was moderate for sequence generation (κ=0.60), fair for allocation concealment and “other sources of bias” (κ=0.37 and 0.27, respectively), and slight for the remaining domains (κ ranging from 0.05 to 0.09). Inter-rater variability was influenced by study-level factors, including nature of outcome, nature of intervention, study design, trial hypothesis, and funding source. It resulted more often from differing interpretations of the tool than from differing information identified in the study reports. No statistically significant differences in ES were found when comparing studies categorized as high, unclear, or low risk of bias. Inter-rater reliability of the NOS varied from substantial for length of followup to poor for selection of the non-exposed cohort and for demonstration that the outcome was not present at the outset of the study. We found no association between individual NOS items or overall NOS score and effect estimates.
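The verbal labels attached to the kappa values above (slight, fair, moderate, substantial, poor) appear consistent with the widely used Landis and Koch scale, although the abstract does not name the scale it used. The sketch below maps a kappa value to a label under that assumption. The function name `agreement_label` is hypothetical, and the cut points are the Landis and Koch ones rather than anything stated in this abstract.

```python
def agreement_label(kappa):
    """Map a kappa value to a Landis and Koch style label.

    Assumed scale: <0 poor, 0-0.20 slight, 0.21-0.40 fair,
    0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0:
        return "poor"
    # Upper bounds are inclusive, so 0.60 is still "moderate".
    upper_bounds = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in upper_bounds:
        if kappa <= upper:
            return label


# Kappa values quoted in the Results paragraph above.
for value in (0.79, 0.60, 0.37, 0.27, 0.05):
    print(value, agreement_label(value))
```

Under this assumed scale, the labels reproduce those given in the Results for each quoted value.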


Conclusions: More specific guidance is needed to apply risk of bias and quality assessment tools. The study-level factors shown to influence agreement indicate where detailed guidance is most needed. Low agreement across pairs of reviewers has implications for incorporating risk of bias into results and for grading the strength of evidence. Variable agreement for the NOS, and the lack of evidence that it discriminates studies that may provide biased results, underscore the need for more detailed guidance on applying the tool in systematic reviews.


540 Gaither Road, Rockville, MD 20850; www​

Prepared for: Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services. Contract No. 290-2007-10021-I. Prepared by: University of Alberta Evidence-based Practice Center, Edmonton, Alberta, Canada.

Suggested citation:

Hartling L, Hamm M, Milne A, Vandermeer B, Santaguida PL, Ansari M, Tsertsvadze A, Hempel S, Shekelle P, Dryden DM. Validity and inter-rater reliability testing of quality assessment instruments. (Prepared by the University of Alberta Evidence-based Practice Center under Contract No. 290-2007-10021-I.) AHRQ Publication No. 12-EHC039-EF. Rockville, MD: Agency for Healthcare Research and Quality. March 2012.

This report is based on research conducted by the University of Alberta Evidence-based Practice Center under contract to the Agency for Healthcare Research and Quality (AHRQ), Rockville, MD (Contract No. 290-2007-10021-I). The findings and conclusions in this document are those of the author(s), who are responsible for its content, and do not necessarily represent the views of AHRQ. No statement in this report should be construed as an official position of AHRQ or of the U.S. Department of Health and Human Services.

The information in this report is intended to help clinicians, employers, policymakers, and others make informed decisions about the provision of health care services. This report is intended as a reference and not as a substitute for clinical judgment.

This report may be used, in whole or in part, as the basis for the development of clinical practice guidelines and other quality enhancement tools, or as a basis for reimbursement and coverage policies. AHRQ or U.S. Department of Health and Human Services endorsement of such derivative products or actions may not be stated or implied.

None of the investigators have any affiliations or financial involvement that conflicts with the material presented in this report.


Bookshelf ID: NBK92293; PMID: 22536612