Diagnostic and predictive tests are an important component of medical care, and clinicians rely on test results to establish diagnoses and guide patient management.1 Despite their central role in patient care, evaluating the effectiveness of specific tests is challenging. Tests affect clinical outcomes indirectly, through the effect of test results on physicians' diagnostic thinking and subsequent management decisions, making it difficult to ascribe patient outcomes to the use of a particular test. The many existing frameworks for assessing the value of testing propose a stepwise appraisal process, moving from analytic validity (technical test performance) to clinical validity (diagnostic and predictive accuracy), clinical utility (effect on clinical outcomes), and overall cost-effectiveness.2 Primary studies that directly address all components of an assessment framework are very uncommon. Systematic reviewers are therefore typically faced with the task of putting together the pieces of the puzzle by synthesizing studies that address each component of the framework. Although the diagnostic or predictive accuracy of a medical test is not by itself informative about the clinical value of testing, it is a crucial piece of the overall puzzle and one that is essential to synthesize in systematic reviews. The level of test accuracy required for a test to have any impact on clinical outcomes depends on its role (replacement, add-on, or triage), the setting of test use (screening, diagnosis, prognosis/prediction), and the specific clinical context.3,4 Meta-analysis of test accuracy can provide an estimate of average test accuracy and can identify patient-, disease-, or test-related modifiers of test performance.5

Meta-analyses of test accuracy present particular challenges compared to reviews of randomized trials of therapeutic interventions, not only because the studies reviewed are exclusively observational, but also because of the inherent associations among the metrics of test performance. Sensitivity and specificity are likely to be correlated (between studies) because of threshold effects (i.e., because changing the diagnostic threshold affects sensitivity and specificity in opposite directions), necessitating the use of multivariate analytic methods.6 In the presence of such correlation, univariate meta-analyses of sensitivity and specificity may produce “average” values for each metric that are mutually incompatible and have misleading confidence intervals. Only recently have these methods penetrated common practice and methodological guidelines, aided by their implementation in readily available software.7-11 The large number of metrics that can be used to summarize information on test accuracy has added to the analytic complexity. In addition to sensitivity and specificity, metrics such as the odds ratio,12 the area under the receiver operating characteristic (ROC) curve,13 and likelihood ratios have been proposed for the synthesis of studies of test accuracy.5 Finally, clinical heterogeneity is omnipresent because studies differ substantially in their settings, patient disease spectra, and the versions of the tests used. This diversity often manifests as statistical heterogeneity. Thus, meta-analyses of test accuracy need to quantify and account for heterogeneity and allow exploration of the factors that may be causing it.
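The threshold effect described above can be illustrated with a minimal sketch. The test values, thresholds, and `accuracy_at` helper below are all hypothetical, invented for illustration only: as the positivity threshold is raised, sensitivity falls while specificity rises, so studies that implicitly use different thresholds will trace out negatively correlated (sensitivity, specificity) pairs.

```python
# Hypothetical continuous test values for diseased and non-diseased subjects.
diseased = [3.1, 4.0, 4.5, 5.2, 5.8, 6.1, 6.9, 7.4]
healthy = [1.0, 1.8, 2.2, 2.9, 3.3, 4.1, 4.8, 5.5]

def accuracy_at(threshold):
    """Classify values >= threshold as test-positive; return (sensitivity, specificity)."""
    tp = sum(v >= threshold for v in diseased)   # true positives
    tn = sum(v < threshold for v in healthy)     # true negatives
    return tp / len(diseased), tn / len(healthy)

# Three hypothetical "studies" using progressively stricter thresholds.
for t in (3.0, 4.5, 6.0):
    sens, spec = accuracy_at(t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
# threshold=3.0: sensitivity=1.00, specificity=0.50
# threshold=4.5: sensitivity=0.75, specificity=0.75
# threshold=6.0: sensitivity=0.38, specificity=1.00
```

Averaging sensitivity and specificity separately across such studies ignores this trade-off, which is why bivariate models that estimate the two metrics jointly, with their between-study correlation, are preferred.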

Early on, it was recognized that the quality of medical test accuracy studies was often inadequate.14,15 Many items typically considered in the appraisal of studies of therapeutic interventions, such as the use of randomization, blinding of patients to the interventions used, or allocation concealment, do not apply to studies of test accuracy. A number of studies have investigated study design and reporting items that may affect estimated test accuracy, but the evidence on which items are most important is inconclusive.16-18 Drawing on empirical evidence and expert opinion on the quality assessment of accuracy studies, investigators developed the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool, published in November 2003.19,20 This tool has now been validated for use in systematic reviews of diagnostic tests.21 Further, a reporting checklist for primary studies of test accuracy, the Standards for the Reporting of Diagnostic Accuracy Studies (STARD),22,23 was published in January 2003. Although the checklist was intended as a guide for the reporting of primary research studies on diagnostic tests, the 25 STARD items pertaining to the design, analysis, and reporting of studies are often used to develop items for quality assessment in systematic reviews. As QUADAS and STARD have now been available for some time, it is reasonable to assess their impact on quality assessment methods in meta-analyses of medical tests.

Along with the overall number of meta-analytic publications, the number of meta-analyses of test accuracy studies has skyrocketed, increasing from fewer than 10 per year in the early 1990s to almost 100 per year in recent years. The question therefore appears to have shifted from whether meta-analysis of medical test accuracy studies is useful24 to what methods are best for undertaking such analyses, in terms of study identification and selection, assessment of study quality, statistical analysis, and reporting. This report is the first in a series of three on meta-analysis of test accuracy, conducted by the Tufts Evidence-based Practice Center under contract with AHRQ. For this project we performed a systematic overview of meta-analyses of medical test accuracy to assess the current state of the literature and evaluate trends over time in the methods and reporting of such studies. Here, we aimed to produce a descriptive summary of the current state of the literature of applied meta-analyses, with a focus on methods and reporting. Subsequent reports in this series will include an empirical assessment of alternative analytic methods and the development of novel methods for the analysis of diagnostic test networks.