NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Hartling L, Hamm M, Milne A, et al. Validity and Inter-Rater Reliability Testing of Quality Assessment Instruments [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2012 Mar.


Summary and Discussion

Key Points

Risk of Bias Tool and Randomized Controlled Trials

  • Inter-rater reliability between reviewers was fair for all domains except sequence generation, which was substantial.
  • Inter-rater reliability between pairs of reviewers was moderate for sequence generation, fair for allocation concealment and “other sources of bias,” and slight for the remaining domains.
  • Low agreement between reviewers suggests the need for more specific guidance regarding interpretation and application of the Risk of Bias (ROB) tool or possibly re-phrasing of items for clarity.
  • Examination of study-level variables and their association with inter-rater agreement identifies areas that require specific guidance in applying the ROB tool, for example the nature of the outcome (objective vs. subjective), study design (parallel vs. other), and trial hypothesis (efficacy/superiority vs. other).
  • Low agreement between pairs of reviewers indicates the potential for inconsistent application and interpretation of the ROB tool across different groups and systematic reviews.
  • Post hoc analyses showed that disagreement most often arose from interpretation of the tool rather than discrepancies in the information that was extracted from studies.
  • The majority of trials in the sample were assessed as high or unclear risk of bias for many domains, likely due to inadequate reporting at the study level. This raises concerns about the ability of the ROB tool to detect differences across trials that may relate to biases in estimates of treatment effects.
  • No statistically significant differences were found in effect sizes (ES) across high, unclear, and low risk of bias categories.

Newcastle-Ottawa Scale and Cohort Studies

  • Inter-rater reliability between reviewers ranged from poor to substantial, but was poor or fair for the majority of domains.
  • No association was found between individual quality domains and measures of association.


Risk of Bias Tool and Randomized Controlled Trials

We found that inter-rater reliability between reviewers was low for all but one domain in the ROB tool. These findings are similar to the results of a previous study38 (Table 10). The samples of trials were distinct for the previous and current studies, focusing on pediatric and adult populations, respectively. The common feature of the two samples was that the trials were not part of a systematic review; rather, they were trials randomly selected from a larger pool. Hence, the trials covered a wide range of topics. This may have contributed to some of the low agreement, as reviewers had to consider different nuances for each trial. Hartling et al. showed improved agreement within the context of a systematic review, where all trials examined the same interventions in similar populations39 (Table 12).

Table 12. Inter-rater reliability on risk of bias assessments, comparison across studies.


Nevertheless, the low agreement raises concerns and points to the need for clear and detailed guidance on applying the ROB tool. Despite pilot testing and providing supplemental guidance for this study, we still found low agreement, likely due to nuances encountered in individual studies. A compilation of examples, especially problem areas, with information on how experts would interpret and apply the domains would be of particular benefit for this field. This could build on the examples we have provided in Appendix J, where disagreements in interpretation occurred across pairs of reviewers. One of the unique contributions of the present study was the analysis of inter-rater reliability stratified by study-level variables. This provides some direction as to where more specific guidance may be beneficial. For instance, agreement was considerably lower for: allocation concealment when trials did not have a parallel design; blinding when the nature of the outcome was subjective; selective outcome reporting when the trial hypothesis was not one of efficacy/superiority; and “other sources of bias” for nonpharmacological interventions and when the outcome was subjective. In summary, agreement for some domains may be better in classic parallel trials of pharmacological interventions, whereas trials with different design features (e.g., crossover) or hypotheses (e.g., equivalence, noninferiority), and those examining nonpharmacological interventions, appear to create more ambiguity for risk of bias assessments.
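The agreement labels used throughout this report (poor, slight, fair, moderate, substantial) are the conventional Landis and Koch bands for the kappa statistic. As a minimal sketch of how a domain-level kappa is computed and labeled, the following uses entirely hypothetical ratings, not data from this study:

```python
# Illustrative only: unweighted Cohen's kappa for two reviewers rating the
# same trials on one ROB domain, mapped to the Landis-Koch agreement labels.
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two equal-length lists of ratings."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal category frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[cat] * c2[cat] for cat in c1) / n ** 2
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Map a kappa value to the agreement label used in the report."""
    if kappa < 0:
        return "poor"
    for cutoff, label in [(0.20, "slight"), (0.40, "fair"),
                          (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= cutoff:
            return label
    return "almost perfect"

# Hypothetical allocation-concealment ratings for 10 trials.
r1 = ["low", "unclear", "high", "unclear", "low",
      "high", "unclear", "low", "high", "unclear"]
r2 = ["low", "unclear", "unclear", "unclear", "high",
      "high", "unclear", "low", "high", "low"]
k = cohens_kappa(r1, r2)
print(f"kappa = {k:.2f} ({landis_koch(k)})")  # kappa = 0.55 (moderate)
```

Here the two hypothetical reviewers agree on 7 of 10 trials, but after discounting chance agreement (0.34, from the marginal frequencies) the kappa is only 0.55, illustrating why raw percent agreement overstates reliability for three-category ROB judgments.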

Another unique contribution of the present study was the examination of the consensus ratings across pairs of reviewers. These ratings should be free of individual rater errors and bias, given that disagreements were resolved by consensus. This is based on the assumption that the consensus process was valid: the reviewers jointly made a decision and did not simply defer to the more senior reviewer. Further, the consensus rating is a more meaningful measure of agreement (as opposed to reliability between two reviewers), as these ratings are the ones reported in systematic reviews. In this study, the pairs of reviewers were from four different centers, each with a long history of producing systematic reviews. The agreement across the pairs of reviewers was generally lower than the agreement between reviewers. This raises concerns about the variability in interpreting and applying the ROB tool that can occur across different systematic review groups and across systematic reviews. Further, we found that discrepancies more often resulted from interpretation of the tool rather than from different information being identified and recorded for the same study. We were not able to examine how consensus occurred, which could be the focus of future research. Specifically, it is not clear whether the consensus process involves joint decisionmaking or whether consensus arises from deferring to one of the reviewers within the pair on the basis of seniority or some other factor.

Risk of bias for the sample of trials used for this study is described in Table 13 and is compared with samples from other studies. Of particular note is that 99 percent of this sample had an overall risk of bias assessment of high or unclear. This is similar to three of the four other samples, which had more than 90 percent assessed as high or unclear risk of bias overall (the fourth sample did not assess overall risk of bias). If the vast majority of trials are assessed as high or unclear risk of bias, the tool may not be sufficiently sensitive to detect differences in methodology that might explain variation in treatment effect estimates across studies, or to identify study methodology as a potential explanation for heterogeneity in meta-analyses. Questions also arise regarding whether poor assessments are a result of inadequate or unclear reporting at the trial level. While the focus of the ROB tool is intended to be on methods rather than reporting, reviewers regularly indicate that they rely on the trial reporting to make their assessments. Even within recent samples of trials published after the emergence and widespread dissemination of reporting guidelines,73 we see large proportions assessed as high or unclear risk of bias. This is consistent with other recent reports of unacceptable reporting in trials.44 The risk of bias assessments were less severe within the individual domains. However, for the current sample the majority of trials were assessed as high or unclear risk of bias for three of the six domains: allocation concealment, blinding, and “other sources of bias.” These findings may be beneficial for developers and promoters of reporting guidelines, as well as for researchers who are reporting RCTs.

Table 13. Trials at high or unclear risk of bias across samples.


Our sampling allowed us to broadly compare our assessments with those of another independent research team.44 The other team did not apply the ROB tool but did assess some of the same domains. Further, the other team examined a larger sample of trials published in 2006, from which our sample was randomly drawn. Nevertheless, the assessments of the two research teams were consistent for several domains. They found that 75 percent of trials did not report their method of allocation concealment, while we found that 79 percent were at high or unclear risk of bias for allocation concealment. Likewise, they found that 59 percent of reports were either not blinded or did not report methods of blinding, while we found that 62 percent of trials were at high or unclear risk of bias for blinding. They found that attrition (intention-to-treat analysis) was not reported in 31 percent of trials, while we found incomplete outcome data for 36 percent. There was variation for one of the domains that both groups assessed: the other team found that sequence generation was not reported for 66 percent of the sample, whereas we found high or unclear risk of bias for sequence generation in only 46 percent of our sub-sample.

We found no statistically significant association between effect estimates and risk of bias assessments. There are three main explanations for this finding. The first is that there was in fact no association between effect estimates and risk of bias. The second is that bias can lead to either underestimation or overestimation of treatment effects; hence, when studies were combined, the associations may have cancelled out. The first two explanations may have resulted in part from the sample of studies selected for this study. Third, and possibly most likely, is that there was insufficient power to detect differences. One of the factors contributing to low power was the small number of studies within certain domains in the low risk of bias category. This was particularly the case for overall risk of bias, as there was only one study in the low category.

Newcastle-Ottawa Scale and Cohort Studies

This is the first study to our knowledge that has examined inter-rater reliability and construct validity of the Newcastle-Ottawa Scale (NOS). We found a wide range of agreement across the domains of the NOS, ranging from slight to substantial. The domain with substantial agreement was not surprising. This domain asked “was the followup long enough for the outcome to occur?” A priori we asked clinical experts to provide the minimum length of followup for each review question. Thus, the assessors had very specific guidance for this item. The agreement for ascertainment of exposure and assessment of outcome was moderate, suggesting that the wording and response options are reasonable. The remaining items had poor, slight, or fair agreement which may be attributable to some of the problems discussed below. Some of the disagreement is likely attributable to inadequate reporting at the study level, which is likely worse for observational studies than RCTs.

We found no association between NOS items and the measures of association using meta-epidemiological methods that control for heterogeneity due to condition and intervention. This may be due to inadequate power; nevertheless, it supports previous claims that “the NOS includes problematic items with an uncertain validity.”76

Implications for Practice

The findings of this research have important implications for practice and the interpretation of evidence. The low level of agreement between reviewers and pairs of reviewers calls into question the validity of risk of bias/quality assessments using the ROB tool or NOS within any given systematic review. Moreover, in measurement theory, reliability is a necessary condition for validity (i.e., without being reliable a test cannot be valid). Systematic reviewers are urged to incorporate considerations of risk of bias/quality into their results. Furthermore, integration of the GRADE tool into systematic reviews necessitates the consideration of risk of bias/quality assessments in rating the strength of evidence and ultimately recommendations for practice.77 While the ROB tool considers risk of bias for an individual study, the GRADE tool assesses the risk of bias across all relevant studies for a given outcome (e.g., most information is from studies at high/moderate/low risk of bias).77 The results of risk of bias assessments and their interpretation in a systematic review, as well as the strength of evidence assessments, will be misleading if they are based on flawed assessments of risk of bias/quality. Moreover, Stang declared with respect to the NOS that “use of this score in evidence-based reviews and meta-analyses may produce highly arbitrary results.”76

We do not intend our results to suggest that reviewers abandon existing tools in favor of others, unless those alternatives have demonstrated greater reliability and validity. Rather, our results underscore the need for reviewers and review teams to be aware of the limitations of existing tools and to be transparent in the process of risk of bias/quality assessment. Detailed guidelines, decision rules, and transparency are needed so that readers and end-users of systematic reviews can see how the tools were applied. Further, pilot testing and development of review-specific guidelines and decision rules should be mandatory and reported in detail.

The NOS in its current form does not appear to provide reliable quality assessments and requires further development and more detailed guidance. The NOS was previously endorsed by The Cochrane Collaboration; however, more recently the Collaboration has proposed a modified ROB tool to be used for nonrandomized studies.11 A new tool developed through the EPC Program for quality assessment of nonrandomized studies offers another alternative.78

Future Directions

There is a need for more detailed guidelines to apply both the ROB tool and the NOS, as well as revisions to the tools to enhance clarity. Additional testing should occur after further revisions to the tool and when expanded guidelines are available. We have identified specific trial features for which clearer guidance is needed. In addition, we have collated examples of discrepancies across pairs of reviewers. A living database that collects examples of risk of bias/quality assessments and consensus from a group of experts would be a valuable contribution to this field. Individual review teams and research groups should be encouraged to begin identifying examples and these could be compiled across programs (e.g., the EPC Program) and entities (e.g., The Cochrane Bias Methods Group), and made widely accessible. We have identified specific problems with application and interpretation of the NOS tool. Further revisions and guidance are needed to support the continued use of NOS in systematic reviews. Investment in further reliability and validity testing of other tools may be more appropriate (e.g., Cochrane ROB tool for nonrandomized studies, the EPC tool). Finally, consensus in this field is needed in terms of the threshold for inter-rater reliability of a measurement before it can be used for any purpose, even descriptive purposes (i.e., describing the risk of bias or quality of a set of studies).

Strengths and Limitations

This is one of few studies examining the reliability and validity of the ROB tool. It is the first to our knowledge that examines reliability between the consensus assessments of pairs of reviewers for a systematic review quality/risk of bias assessment tool. Further, it is the first study to provide empirical evidence on study-level variables that may impact reliability of ROB assessments. This is the first study to our knowledge that examined reliability and validity of the NOS.

The main limitation of the research is that the sample sizes (154 RCTs, 131 cohort studies) may not have provided sufficient power to detect statistically significant differences in ES estimates according to risk of bias/quality. Another potential limitation is that we did not use a ‘meta-epidemiological approach’72 (i.e., reanalysis of data from existing meta-analyses) to examine the association between ES and risk of bias; therefore, heterogeneity across trials may have limited our ability to detect differences. However, for the NOS sample we used ‘meta-epidemiological’ methods and found no significant associations between quality and measures of association within the cohort studies, which could be attributable to low power. We specifically selected meta-analyses with substantial heterogeneity in order to optimize our potential to see whether quality as assessed with the NOS might explain variations in measures of association.

We included only 30 RCTs in our analysis of inter-rater agreement across consensus ratings by two reviewers. This small sample may limit generalizability of the findings.

We involved a number of reviewers with different levels of training, type of training, and extent of experience in quality assessment and systematic reviews. Some of the variability or low agreement may be attributable to characteristics of the reviewers. Agreement may be higher among individuals with more direct experience or specific post-graduate training in research methods or epidemiology. Nevertheless, all reviewers had previous experience in systematic reviews and quality assessments, and likely represent the range of individuals that would typically be involved in these activities within a systematic review.

A final caveat to note is that the ROB tool has undergone some revisions since we initiated the study. These are detailed in the most recent version of the Cochrane Handbook11 but were not incorporated into our research. The changes affected primarily the blinding and the “other sources of bias” domains. This does not impact the general findings from our research; however, further testing with the modified tool is warranted.


Conclusions

More specific guidance is needed to apply and interpret risk of bias/quality assessment tools. We identified a number of study-level factors that influence agreement, as well as examples where agreement is low. This information provides direction for more detailed guidance. Low agreement between reviewers has implications for incorporation of risk of bias into results and grading the strength of evidence. Low agreement across pairs of reviewers has implications for interpretation of evidence reported by different groups. There was variable agreement across items in the NOS. This finding, combined with a lack of evidence that the NOS discriminates studies that may provide biased results, underscores the need for more detailed guidance to apply the tool in systematic reviews.
