NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Dahabreh IJ, Trikalinos TA, Lau J, et al. An Empirical Assessment of Bivariate Methods for Meta-Analysis of Test Accuracy [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2012 Nov.

An Empirical Assessment of Bivariate Methods for Meta-Analysis of Test Accuracy [Internet].


Key Findings

We present a comprehensive empirical comparison of meta-analytic methods for studies of test accuracy, both in terms of the number of meta-analyses included and in terms of the scope of the meta-analytic methods considered.

Univariate and bivariate meta-analyses most often resulted in similar point estimates, regardless of the estimation method (inverse variance or MLE) or the distribution used to model within-study variability (normal or exact binomial). Use of a normal approximation (in both univariate and bivariate meta-analyses) resulted in lower summary estimates and narrower confidence intervals, compared to methods that used the exact binomial likelihood. Although some of the differences between estimates were numerically large, their clinical importance is entirely context-specific. As expected, differences were larger in meta-analyses of small studies, where continuity corrections (for the normal approximation) were needed for a large proportion of analyzed studies.

Bivariate models fit using Bayesian and maximum likelihood methods produced almost identical summary estimates of sensitivity and specificity. The methods gave practically identical results in meta-analyses with moderate to large numbers of studies and when included studies had large sample sizes. The credibility intervals produced by Bayesian bivariate meta-analysis methods were substantially wider than the confidence intervals of maximum likelihood methods (using the exact binomial likelihood to describe within-study variability for both models).

Although often not well estimated, the between-study correlation (of sensitivity and specificity) was frequently far from zero. This indicates that ignoring it is generally inappropriate for meta-analyses.

Alternative meta-analytic methods for obtaining SROC curves resulted in substantially different curves. The differences were also substantial between alternative parameterizations of the HSROC curve (particularly when the correlation between sensitivity and specificity was estimated to be positive).

Meta-Analysis of Sensitivity and Specificity

Our findings substantially extend previous comparisons between methods for meta-analysis of test accuracy. Table 5 summarizes selected empirical comparisons of meta-analytic methods for test accuracy in which at least one of the methods allowed for correlation of sensitivity and specificity at the between-study level. Generally, previous reports have assessed only a few applied meta-analysis examples (ranging from 1 to 50 meta-analyses), whereas we analyzed a much larger database using a wide array of analytic approaches.

Table 5. Summary of selected previous empirical comparisons of meta-analysis methods, including simulation studies.

Previous theoretical and simulation studies have suggested that the binomial distribution may be preferable to the normal approximation for modeling within-study variability. We believe that our observations agree with this position. Not unexpectedly, the differences between the two methods were more pronounced in studies of small sample size and in meta-analyses where tests had high sensitivity and specificity. In such cases the normal distribution is a poor approximation to the binomial. Furthermore, in studies where some of the counts are zero, analysis using the normal likelihood requires a continuity correction (so that the variance and point estimate of the study-level logit-sensitivity or logit-specificity can be calculated). The continuity correction biases the point estimates of individual studies; this is why the difference in summary estimates between methods that rely on the normal approximation and those that do not is greater when the summary sensitivity or specificity is closer to one.35 An additional reason for the systematically smaller summary sensitivity or specificity with normal approximation methods may be that, in the meta-analysis, the estimate (logit-transformed sensitivity or specificity) and its variance are correlated, because the variance is a function of the estimate and the sample size. This correlation is positive for proportions larger than 0.5, and thus estimates near one have larger variance (and receive less weight in the meta-analysis) than estimates near 0.5. The net effect is that summary sensitivity and specificity are biased towards one-half.28,33 Such a bias is not a problem for meta-analysis methods using the exact likelihood, and is not observed when variance-stabilizing transformations (such as the arcsine transformation) are used for meta-analysis of proportions.
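The mechanism described above can be illustrated with a small numerical sketch. The counts below are hypothetical, not data from this report, and the pooling is a simple fixed-effect inverse-variance average on the logit scale rather than the random-effects models we fitted; the function names and the convention of adding 0.5 only when a cell is zero are assumptions made for illustration.

```python
import math


def expit(x):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-x))


def logit_and_variance(events, total, correction=0.5):
    """Study-level logit proportion and its approximate (normal) variance.

    Assumed convention for this sketch: the continuity correction is added
    to both cells only when one of them is zero.
    """
    non_events = total - events
    if events == 0 or non_events == 0:
        events += correction
        non_events += correction
    estimate = math.log(events / non_events)
    variance = 1.0 / events + 1.0 / non_events
    return estimate, variance


def inverse_variance_pool(studies):
    """Fixed-effect inverse-variance summary, back-transformed to a proportion."""
    pairs = [logit_and_variance(e, n) for e, n in studies]
    weights = [1.0 / v for _, v in pairs]
    pooled = sum(w * est for w, (est, _) in zip(weights, pairs)) / sum(weights)
    return expit(pooled)


# Hypothetical data: (true positives, diseased patients) in five small studies.
studies = [(20, 20), (19, 20), (18, 20), (20, 20), (17, 20)]

# Weights fall as the observed sensitivity approaches one.
weights = [1.0 / logit_and_variance(e, n)[1] for e, n in studies]

pooled_normal = inverse_variance_pool(studies)

# Under a common-sensitivity binomial model the maximum likelihood estimate
# is simply the pooled proportion of true positives.
pooled_binomial = sum(e for e, _ in studies) / sum(n for _, n in studies)

print(f"normal approximation: {pooled_normal:.3f}")
print(f"exact binomial:       {pooled_binomial:.3f}")
```

In this toy example the studies with the highest observed sensitivity receive the smallest weights, so the inverse-variance summary falls below the pooled proportion, in the direction of 0.5.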

We found that univariate and bivariate meta-analysis methods produced generally similar summary estimates and marginal confidence intervals for sensitivity and specificity. Differences are likely to be more pronounced when evaluating linear combinations of the estimates (e.g., the sum of sensitivity and specificity), particularly in problems of higher dimensionality (e.g., multiple index tests applied to the same patients and compared against a common reference standard). This issue is addressed in detail in a separate report on diagnostic tests in preparation by the EPC.

Few studies have compared the results of bivariate meta-analysis of sensitivity and specificity using maximum likelihood versus fully Bayesian methods, and those that did used models that were not directly comparable.17,19,20 Many investigators have commented that Bayesian methods are less accessible to meta-analysts than the corresponding maximum likelihood methods. We provide the BUGS code we used to fit the bivariate model in Appendix B. We found that convergence problems were not common when fitting the bivariate model; when present, they were mostly due to numerical instability in cases where the number of studies was small, sensitivity and specificity were close to 1, or the between-study variance was very low. For Bayesian analyses, we were able to obtain model convergence in most datasets by slightly modifying the non-informative prior distributions used. Bayesian analyses resulted in summary estimates of sensitivity and specificity that were very close to those obtained from maximum likelihood estimation. However, there were substantial differences in the width of the credibility and confidence intervals produced by Bayesian and maximum likelihood analyses, respectively.

Bivariate methods provide estimates of the correlation between sensitivity and specificity at the between-study level. Alternative models (normal approximation for within-study variability versus exact binomial distribution) and estimation methods (non-iterative versus MLE; frequentist versus Bayes) can yield quite different correlation estimates. This may be another symptom of the fact that the correlation parameter is generally poorly estimated. A telling observation is the following: frequentist approaches (maximum likelihood and inverse variance methods) often estimated the correlation parameter in the extremes of its domain, namely -1 (and sometimes +1). Riley 2007 made a similar observation in a simulation study.36 In contrast, Bayesian methods rarely produced extreme correlation values, due to shrinkage toward the mean of the prior distribution (the mean is zero for the uniform (-1,1) prior distribution that we used).
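For reference, the bivariate model with exact binomial within-study likelihoods can be written as follows. The notation is ours and is introduced only for this sketch (it is not taken from the appendices): for study i, y_{Ai} is the number of true positives among n_{Ai} diseased participants, y_{Bi} is the number of true negatives among n_{Bi} non-diseased participants, and ρ is the between-study correlation discussed above.

```latex
\begin{align*}
y_{Ai} \mid \mathrm{Se}_i &\sim \mathrm{Binomial}(n_{Ai}, \mathrm{Se}_i), \\
y_{Bi} \mid \mathrm{Sp}_i &\sim \mathrm{Binomial}(n_{Bi}, \mathrm{Sp}_i), \\
\begin{pmatrix} \operatorname{logit}\mathrm{Se}_i \\ \operatorname{logit}\mathrm{Sp}_i \end{pmatrix}
&\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},
\begin{pmatrix} \sigma_A^2 & \rho\,\sigma_A\sigma_B \\ \rho\,\sigma_A\sigma_B & \sigma_B^2 \end{pmatrix}
\right), \qquad -1 \le \rho \le 1 .
\end{align*}
```

In this formulation the values ρ = -1 and ρ = +1 lie on the boundary of the parameter space, which is where the frequentist estimates often ended up. A uniform(-1, 1) prior on ρ has mean zero, so with few studies the posterior for ρ tends to be pulled away from those boundaries.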

Constructing Meta-Analytic ROC Curves

Arguably, ROC curves provide additional information compared to meta-analytic estimates of sensitivity and specificity, because they illustrate the relationship between sensitivity and specificity. Based on our previous survey, the most commonly used method for constructing SROC curves is the approach proposed by Moses and Littenberg.12 Despite its popularity, this model has several shortcomings, including its failure to account for the underlying binomial distribution of the data, between-study heterogeneity, and measurement error in its independent variable. These shortcomings of the Moses-Littenberg SROC model are overcome by hierarchical modeling approaches, including the increasingly used model proposed by Rutter and Gatsonis.13 It can be shown that the Rutter-Gatsonis HSROC model is equivalent to the bivariate meta-analysis of sensitivity and specificity in the absence of covariates in the regression.14,28 Thus, the parameters of the HSROC model can be “back-calculated” using estimates from the bivariate meta-analysis model (an approach we followed in this report).
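A minimal sketch of such a back-calculation is given below. It assumes the usual form of the Rutter-Gatsonis model, logit(Se_i) = (θ_i + α_i/2)·exp(-β/2) and logit(1 - Sp_i) = (θ_i - α_i/2)·exp(β/2), with θ_i and α_i independent and normally distributed; matching the first two moments of logit(Se_i) and logit(Sp_i) gives the conversions in the code. The function names, dictionary keys, and input values are hypothetical, and both between-study variances are assumed to be strictly positive.

```python
import math


def expit(x):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-x))


def bivariate_to_hsroc(mu_a, mu_b, var_a, var_b, cov_ab):
    """Convert bivariate-model estimates to HSROC parameters.

    mu_a, mu_b   : mean logit sensitivity and mean logit specificity
    var_a, var_b : between-study variances (assumed strictly positive)
    cov_ab       : between-study covariance of logit(Se) and logit(Sp)
    """
    sd_a, sd_b = math.sqrt(var_a), math.sqrt(var_b)
    scale = math.sqrt(sd_b / sd_a)  # equals exp(beta / 2)
    return {
        "beta": math.log(sd_b / sd_a),                      # shape
        "Lambda": scale * mu_a + mu_b / scale,              # mean accuracy
        "Theta": 0.5 * (scale * mu_a - mu_b / scale),       # mean threshold
        "var_accuracy": 2.0 * (sd_a * sd_b + cov_ab),
        "var_threshold": 0.5 * (sd_a * sd_b - cov_ab),
    }


def hsroc_sensitivity(specificity, accuracy, beta):
    """Sensitivity on the HSROC curve at a given specificity.

    The curve is obtained by fixing accuracy at its mean and letting the
    threshold vary; its slope in logit space is exp(-beta), always positive.
    """
    x = math.log((1.0 - specificity) / specificity)  # logit(1 - Sp)
    y = accuracy * math.exp(-beta / 2.0) + math.exp(-beta) * x
    return expit(y)


# Hypothetical bivariate estimates, for illustration only.
mu_a, mu_b = 1.5, 2.0
params = bivariate_to_hsroc(mu_a, mu_b, var_a=0.5, var_b=0.8, cov_ab=-0.3)

# The HSROC curve passes through the bivariate summary point.
summary_spec = expit(mu_b)
curve_sens = hsroc_sensitivity(summary_spec, params["Lambda"], params["beta"])
print(round(curve_sens, 6), round(expit(mu_a), 6))
```

The final check reflects a useful property of this parameterization: evaluated at the summary specificity, the curve returns the summary sensitivity.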

The Rutter-Gatsonis HSROC model is one of several possible parameterizations of the HSROC curve. Alternative parameterizations are discussed by Arends 2008,28 and we implemented them for all meta-analyses we performed (plots available from the authors upon request). These parameterizations often result in substantially different curves compared to the one produced by the Rutter-Gatsonis HSROC model.13,29 Importantly, with some parameterizations the slope of the ROC curve is not always positive (in contrast to the Rutter-Gatsonis method); when it is not, the relationship between sensitivity and specificity cannot be explained by threshold effects across studies. Based on this, Chappell et al. conclude that SROC curves are not always a helpful summary of the data, and propose a stepwise algorithm for determining the most appropriate approach to summarize accuracy studies.29
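The dependence of the slope on the chosen line can be seen from the geometry of the bivariate normal distribution. The sketch below compares three lines in the plane of x = logit(1 - Sp) and y = logit(Se): the regression of y on x, the regression of x on y, and the Rutter-Gatsonis line. The labels and input values are ours and hypothetical; this is not a reproduction of the full set of parameterizations in the cited work.

```python
import math


def sroc_slopes(var_a, var_b, cov_ab):
    """Slopes of three candidate summary lines in logit ROC space.

    var_a, var_b : between-study variances of logit(Se) and logit(Sp)
    cov_ab       : between-study covariance of logit(Se) and logit(Sp)

    Because x = logit(1 - Sp) = -logit(Sp), Cov(x, y) = -cov_ab.
    """
    cov_xy = -cov_ab
    return {
        # Regression of logit(Se) on logit(1 - Sp).
        "y_on_x": cov_xy / var_b,
        # Regression of logit(1 - Sp) on logit(Se), expressed as dy/dx.
        "x_on_y": var_a / cov_xy if cov_xy != 0 else math.inf,
        # Rutter-Gatsonis line: ratio of standard deviations, exp(-beta).
        "rutter_gatsonis": math.sqrt(var_a / var_b),
    }


# Negative correlation between logit(Se) and logit(Sp), as expected when
# studies differ mainly in their positivity threshold.
negative_corr = sroc_slopes(var_a=0.5, var_b=0.8, cov_ab=-0.3)

# Positive correlation between logit(Se) and logit(Sp).
positive_corr = sroc_slopes(var_a=0.5, var_b=0.8, cov_ab=0.3)

for label, slopes in (("negative", negative_corr), ("positive", positive_corr)):
    print(label, {name: round(value, 3) for name, value in slopes.items()})
```

With a positive correlation, both regression lines slope downward in ROC space while the Rutter-Gatsonis line still slopes upward, which is one way the curves can diverge sharply in that setting.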


Limitations

Some limitations need to be considered when interpreting our results. Because of the way we constructed the database of systematic reviews of test accuracy, all included meta-analyses were conducted prior to 2003 and were published in English-language journals. Although this may limit the clinical applicability of their actual findings, it does not substantially affect the conclusions of our empirical comparison of methods because the datasets included are very diverse in terms of number of included studies, sample size, and reported test accuracy (Table 1). In a recent comprehensive review of reporting and design characteristics of systematic reviews of test accuracy that gave quantitative synthesis results (covering meta-analyses published up to 2010), we found no substantial change over time in the number of included studies or the number of meta-analyses conducted per review article.12

Another limitation of our work is that many systematic reviews contributed multiple datasets to the empirical comparison (approximately two datasets per review, on average). We believe that the effect of this clustering is probably minor because in most cases when multiple meta-analyses are presented in the same systematic review, they typically address different index or reference standard tests (often based on nonoverlapping sets of primary studies). Unfortunately, data to explore the potential effect of such clustering are typically not available in meta-analyses or primary diagnostic test studies. Nonetheless, our approach can be considered representative of current practice in applied meta-analyses, where pairs of tests and diagnostic outcomes are almost always evaluated one at a time.

Finally, we focused on meta-analysis of sensitivity and specificity and on meta-analytic ROC curves, and did not consider other metrics such as likelihood ratios, odds ratios, or areas under the ROC curve. We note that these metrics can be derived from the methods we assess (for example, likelihood ratios can be estimated from the output of the bivariate model) and are generally less commonly used in the diagnostic literature.
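As a simple illustration of such a derivation, point estimates of the likelihood ratios and the diagnostic odds ratio follow directly from the summary sensitivity and specificity using their standard definitions. The function name and the input values below are hypothetical. Interval estimates are not shown; they would need to account for the joint uncertainty in the two summary estimates, which the bivariate model provides.

```python
def derived_metrics(sensitivity, specificity):
    """Likelihood ratios and diagnostic odds ratio at a summary point.

    Both inputs are assumed to lie strictly between 0 and 1.
    """
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    diagnostic_odds_ratio = lr_positive / lr_negative
    return lr_positive, lr_negative, diagnostic_odds_ratio


# Hypothetical summary estimates, for illustration only.
lr_pos, lr_neg, dor = derived_metrics(sensitivity=0.90, specificity=0.80)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}, DOR = {dor:.1f}")
```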
