NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.
Lefebvre C, Glanville J, Beale S, et al. Assessing the performance of methodological search filters to improve the efficiency of evidence information retrieval: five literature reviews and a qualitative study. Southampton (UK): NIHR Journals Library; 2017 Nov. (Health Technology Assessment, No. 21.69.)
The research plan had several stages. It began with a series of five literature reviews into various aspects of search filter reporting and use. The reviews informed the development of an interview schedule and a web-based questionnaire (see Appendix 1). The reviews, interviews and questionnaire informed the development of suggested approaches to gathering and reporting search filter performance and a test website, on which we invite further feedback [see https://sites.google.com/a/york.ac.uk/search-filter-performance/ (accessed 22 August 2017)].
Reviews
The research was grounded in a series of five reviews. We conducted two reviews on how the performance of methodological search filters has been measured, in single studies and also in studies comparing the performance of search filters. In a third review we sought to find inspiration and synergies in the DTA literature by reviewing the literature on diagnostic test reporting and included an exploration of the potential relevance of performance measures used in DTA studies. Search filters are analogous to diagnostic tests, being designed to distinguish relevant records from irrelevant records, and the performance of search filters and diagnostic tests is reported using similar measures, such as sensitivity and specificity. A fourth review sought reports on how searchers make choices about filters based on the information presented to them and a fifth review sought to identify any information on how clinicians make choices about diagnostic tests to gain insights into how searchers do or might in the future be encouraged to make choices about search filters.
The reviews were informed by literature searches conducted in databases in a number of disciplines including information science. Further information about the searches can be found within each of the reviews described later in this chapter and the search strategies are all included in the relevant appendices. The sources searched were:
- The Cochrane Library
- EMBASE
- European network for Health Technology Assessment (EUnetHTA)
- health technology assessment (HTA) organisation websites
- Health Technology Assessment international (HTAi) Vortal
- Inter Technology Appraisal Support Collaboration (InterTASC) Information Specialists’ Sub-Group (ISSG) Search Filters Resource
- Library and Information Science Abstracts (LISA)
- MEDLINE
- PsycINFO.
The reviews were conducted to reflect the project objectives, which were to determine:
- what performance measures are reported for single studies of search filters and how they are presented (review A)
- what performance measures are reported when comparing a range of search filters and how the performance measures are synthesised (review B)
- what performance measures are reported in DTA studies and DTA reviews (review C)
- how searchers choose search filters (review D)
- how filter/test performance data are presented (e.g. text, graphs, tables, graphics) to assist users (searchers or clinicians) in choosing which filters or tests to use (reviews A, B and C)
- how clinicians or organisations choose diagnostic tests (review E).
Interviews and questionnaire
The objective of the reviews was to identify information about:
- performance measures in use
- the presentation of performance measures
- how searchers and clinicians choose search filters or diagnostic tests.
The next stage, consisting of two phases (semistructured interviews and a questionnaire survey), was to ascertain which search filter performance measures were deemed to be the most important by searchers for informed decision-making. We sought to gain information on how search filter performance information could most usefully be presented to assist decisions and whether or not there is scope for performance information to be obtained as part of routine project work.
Phase 1: semistructured interviews
As this project was funded to inform NICE methods development, the involvement of NICE staff was central to it. We contacted NICE information specialists and project managers and offered them the opportunity to participate in the project. Each interview, which was recorded, lasted for no more than 45 minutes. Once the interview time and date were agreed, confirmation details (date, time, length of interview and interviewer details), along with a topic guide and assurance of anonymity, were sent to each interviewee. After each interview, an e-mail containing a summary of the key points raised during the interview was sent to each interviewee, who was offered the opportunity to check the notes for accuracy and add any additional points that may have occurred to him or her after the interview had ended.
Phase 2: questionnaire survey
Information from the literature reviews and the interviews was used to inform the design and content of a web-based questionnaire. NICE information specialists and project managers were invited to complete the questionnaire but it was also used to collect the views of the wider (national and international) systematic review, HTA and guidelines information community. This information community is well networked and was reached via e-mail lists, as described in Chapter 4 (see Questionnaire methods).
Presentation of filter information
Information from the reviews and interview and questionnaire responses was used to develop suggested approaches to measuring search filter performance.
We also developed a series of pilot formats for presenting search filter performance information. With the approval of the authors, some of the data from the Cochrane methodology review of the performance of search filters in identifying DTA studies,1,2 which at the time of the project was not yet published, was used to populate the pilot formats.
Performance tests, reports and performance resource
We developed a prototype web resource (using content management systems available at the University of York) to present performance data and to facilitate feedback and comments from NICE staff and others from within the evidence synthesis information community. Without prejudging users’ requirements or the results of the research, the performance resource presented a matrix of information showing how well published search filters perform for specific study designs in different clinical specialties and with different user preferences for measures such as sensitivity or precision.
Based on the suggested approaches, we developed performance tests and performance reports, which were uploaded onto the project website. We also developed detailed procedures with the intention of assisting researchers to conduct and report future performance tests. We considered that if we could ascertain that users valued information in a specific format then we could try to develop suggested approaches to promoting these methods. The intention was to develop user-friendly tools for the future and to explore options to make these tools widely available.
Performance measures for methodological search filters (review A)
Introduction
Although there are a large number of search filters in existence, many have been developed pragmatically and have not undergone validation. Even for those search filters that have been validated, few have been validated beyond the data in the original publication. This method is described as internal validation and is a less rigorous approach than external validation, in which a filter is tested using a different gold standard from the one used to develop the filter. External validation provides an independent assessment of filter performance and gives a better indication of how a filter is likely to perform in the real world.
Selection of a search filter will depend on the particular searching task and on the performance of the search filter. Thus, it is important to report performance measures for search filters. There are a few tools available that can be used to assess or appraise search filters and these can help in the selection of search filters for specific tasks.3–5
The aim of this review was to look at the performance measures that are reported for search filters (single studies) and how they are presented. Single studies were defined as those in which a new search filter (or series of filters) was developed, or a search filter was revised, and in which performance measures of the search filter(s) were also reported.
The objectives of the review were to:
- identify and summarise the methods used to develop and validate search filters
- identify and summarise the performance measures used in single studies of search filters
- describe how these performance measures are presented.
Methods
Identification of studies
Studies were identified from the ISSG Search Filters Resource.6 The ISSG Search Filters Resource is a collaborative venture to identify, assess and test search filters designed to retrieve health-care research by study design. It includes published filters and ongoing research on filter design, research evaluating the performance of filters and articles providing a general overview of search filters. At the time of this project, regular searches were being carried out in a number of databases and websites, and tables of contents of key journals and conference proceedings were being scanned to populate the site. Researchers working on search filter design are encouraged to submit details of their work. The 2010 update search carried out by the UK Cochrane Centre to support the ISSG Search Filters Resource website was also scanned to identify any relevant studies that were not included on the website at that time.
We acknowledge that there has been a regrettable delay between carrying out the project, including the searches, and the publication of this report, because of serious illness of the principal investigator. The searches were carried out in 2010/11.
Inclusion criteria
The review included studies that reported the development and evaluated the performance of methodological search filters for health-care bibliographic databases. For pragmatic reasons, the review specifically focused on studies that developed and evaluated methodological search filters for economic evaluations, DTA studies, systematic reviews and RCTs. These study types are the ones most commonly used by organisations such as NICE to underpin their decision-making when producing technology appraisals and economic evaluations of health-care technologies and subsequent clinical guidelines. Publications prior to 2001 were excluded partly for pragmatic reasons but also because during this period search filters tended to be derived by subjective methods and because some of the filters had subsequently been updated or were now out-of-date because of changes in database indexing.
Exclusion criteria
Studies were excluded from the review if they:
- were available only in abstract form (e.g. conference abstracts)
- did not develop or revise a search filter
- did not report details of the methods used in developing the search filter
- did not evaluate search filter performance
- were published before 2001.
Data extraction
Data were extracted from selected studies using a standardised data extraction form to identify information regarding gold/reference standards, filter development/validation and performance measures reported.
Results
Fifty-eight studies were identified from the ISSG Search Filters Resource. After applying the outlined inclusion and exclusion criteria, 23 studies were identified for inclusion in the review.7–29 Details from the included studies, grouped according to type of methodological search filter (economic, diagnostic, systematic review and RCT), are provided in Tables 1–4.
TABLE 1
Review A: included studies – economic search filter studies
TABLE 2
Review A: included studies – diagnostic search filter studies
TABLE 3
Review A: included studies – systematic review search filter studies
TABLE 4
Review A: included studies – RCT search filter studies
Of the 35 studies excluded, 19 were rejected because they were published before 2001. The reasons why the remaining 16 studies were excluded are presented in Table 5.
TABLE 5
Review A: excluded studies
Study details
Three studies included analyses of more than one search filter type: one study12 included details of a diagnostic filter and a secondary (systematic review) filter and two studies16,21 included details of both systematic review and RCT search filters. Thus, there were two studies examining economic search filters, seven studies examining diagnostic search filters, seven studies examining systematic review search filters and 10 studies examining RCT search filters.
The majority of the studies (n = 14)8–10,12–14,17–19,22,23,26,27,29 addressed the development of search filters for use with MEDLINE, 10 for the Ovid platform,8,9,13,14,17,19,22,23,27,29 three for PubMed12,18,26 and one for DataStar.10 Six studies developed search filters for the EMBASE database,7,11,15,20,24,28 four for the Ovid platform,7,15,20,28 one for DataStar11 and one that used three different platforms (DataStar, Dialog and Ovid).24 The remaining three studies developed search filters for the Cumulative Index to Nursing and Allied Health Literature (CINAHL),21 PsycINFO16 and the Latin American and Caribbean Health Sciences Literature (LILACS) database25 respectively. The CINAHL and PsycINFO search filters used the Ovid platform whereas the LILACS database was searched using an internet interface.
Internal gold standards
A reference standard is a set of relevant records against which a search filter’s performance can be measured. In some studies the reference standard is used both to derive and to test a search filter. In these cases the standard is described as an internal standard.
Almost all of the studies used an internal standard to derive and/or validate the search filters. Only three of the 23 studies did not include an internal standard.18,26,29 These studies tested the search filters against external standards (see External standards). Seventeen7–11,13,15–17,19–21,23–25,27,28 of the 20 studies that included an internal standard had derived this standard by hand-searching journals. The number of journals searched ranged from 2 to 161. In the other three studies12,14,22 the internal standards were generated by a PubMed subject-specific search or from studies included in a number of systematic reviews, or from a database search [MEDLINE and the Cochrane Central Register of Controlled Trials (CENTRAL)]. One other study24 used a search of EMBASE as well as hand-searching of journals to derive an internal standard. The size of the gold or reference standards varied from 58 to 1587 records. In three studies, the reference standard was initially split into two, with one set used to derive the filter and the second set used to internally validate the performance.17,19,23
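The split of a reference standard into a derivation set and an internal validation set can be sketched as follows. This is an illustration only: the included studies do not describe their splitting procedure in this section, so the random shuffle, the fixed seed and the 75% derivation fraction are all assumptions, and `split_reference_standard` is a hypothetical helper name.

```python
import random


def split_reference_standard(record_ids, derivation_fraction=0.75, seed=0):
    """Randomly split reference-standard record IDs into two disjoint sets.

    derivation_fraction and seed are illustrative assumptions, not values
    taken from any of the included studies.
    """
    ids = sorted(record_ids)            # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)    # shuffle with a private, seeded generator
    cut = round(len(ids) * derivation_fraction)
    return set(ids[:cut]), set(ids[cut:])


# Hypothetical reference standard of 100 record IDs.
derivation, validation = split_reference_standard(range(1, 101))
print(len(derivation), len(validation))
```

The derivation set would be used to identify candidate terms and the validation set to check performance on records that played no part in building the filter.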
Inter-rater reliability in selecting studies for inclusion in the reference standard was assessed for almost all of the studies produced by the McMaster Hedges team7,8,13,15–17,20,21,23,28 and exceeded 80% in every case. In one McMaster Hedges team study,8 articles were independently assessed by two reviewers with disagreement being resolved by a third independent reviewer. Two studies quoted inter-rater reliabilities of 71%12 and 81%27 after articles were assessed by two reviewers. Two further studies10,19 reported that articles were assessed by two reviewers, whereas one study11 reported that articles were assessed by one reviewer with 10% of articles assessed by a second reviewer and one study9 reported that articles were assessed by three researchers with discrepancies resolved through discussion. None of these studies reported values for inter-rater reliability. The remaining four studies that derived internal standards14,22,24,25 did not describe how the studies were selected.
Identifying candidate terms and combining them to create filters
In the 20 studies with internal standards, the internal standard records were used as a source for the identification of candidate search terms. Ten of these studies7,8,13,15–17,20,21,23,28 were carried out by the McMaster Hedges team and used essentially the same methodology for deriving search filters. This method involved the identification of index terms and text words from an internal standard of records as well as consultation with clinicians, librarians and other experts to add any other relevant terms. The individual terms identified were analysed for sensitivity and specificity and then terms with specified values of sensitivity and specificity were combined to create multiple-term search filters using the Boolean OR operator. The specified values for term inclusion varied for sensitivity and specificity from > 10% to > 75%. In one of the 10 studies23 stepwise logistic regression was also used to try to optimise search filter performance. The use of logistic regression, however, did not result in better-performing search filters than those developed simply using the Boolean OR operator and therefore this approach was not used in any of the subsequent studies.
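The general idea of selecting single terms by threshold and combining them with the Boolean OR operator can be sketched as below. This is a minimal illustration under stated assumptions, not the McMaster Hedges team's implementation: each term is represented by the set of record IDs it retrieves, the term names, record IDs and threshold values are hypothetical, and every retrieved record is assumed to belong to `all_records`.

```python
def sensitivity_specificity(retrieved, relevant, all_records):
    """Return (sensitivity, specificity) for one set of retrieved record IDs."""
    tp = len(retrieved & relevant)                 # relevant and retrieved
    fp = len(retrieved - relevant)                 # non-relevant but retrieved
    fn = len(relevant - retrieved)                 # relevant but missed
    tn = len(all_records) - tp - fp - fn           # non-relevant and not retrieved
    return tp / (tp + fn), tn / (tn + fp)


def combine_with_or(term_hits, relevant, all_records, min_sens, min_spec):
    """Keep terms exceeding both thresholds and OR them (set union)."""
    chosen = []
    for term, hits in term_hits.items():
        sens, spec = sensitivity_specificity(hits, relevant, all_records)
        if sens > min_sens and spec > min_spec:
            chosen.append(term)
    combined = set().union(*(term_hits[t] for t in chosen))
    return chosen, combined


# Hypothetical data: 20 records, of which 4 are relevant.
all_records = set(range(20))
relevant = {0, 1, 2, 3}
term_hits = {
    "term_a": {0, 1, 2, 5},
    "term_b": {2, 3, 6, 7},
    "term_c": set(range(20)),   # retrieves everything, so specificity is 0
}

chosen, combined = combine_with_or(term_hits, relevant, all_records, 0.25, 0.75)
combined_sens, combined_spec = sensitivity_specificity(combined, relevant, all_records)
print(chosen, combined_sens, combined_spec)
```

In this toy data set, ORing the two selected terms raises sensitivity above that of either term alone while lowering specificity, which is the trade-off that the threshold values are meant to control.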
Another study25 also identified terms from an internal standard and then combined terms with particular values for sensitivity, specificity and accuracy to derive multiple-term strategies to produce a maximally sensitive strategy. Single terms with an individual sensitivity of > 20% and specificity and accuracy of > 60% were combined to give two-term strategies. Terms in the two-term strategies with sensitivity, specificity and accuracy of > 60% were then combined to give three- or four-term strategies. All terms in the three- and four-term strategies were then combined to give a maximally sensitive strategy consisting of 10 terms. This final strategy was refined further by using the Boolean AND NOT operator to exclude single terms with zero sensitivity and high specificity. This increased the specificity of the final strategy while maintaining high sensitivity.
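The AND NOT refinement step can be illustrated with sets of record IDs. This sketch is an assumption-laden simplification rather than the study's procedure: it checks only the zero-sensitivity condition (an exclusion term is used only if it retrieves no relevant records) and does not test specificity, and all names and record IDs are hypothetical.

```python
def refine_with_and_not(retrieved, candidate_exclusions, relevant):
    """Remove records hit by exclusion terms that retrieve no relevant records."""
    usable = {
        term: hits
        for term, hits in candidate_exclusions.items()
        if not (hits & relevant)        # zero sensitivity against the standard
    }
    to_remove = set().union(*usable.values())
    return retrieved - to_remove, sorted(usable)


relevant = {0, 1, 2, 3}
retrieved = {0, 1, 2, 3, 5, 6, 7, 8}        # output of a sensitive OR strategy
candidate_exclusions = {
    "term_x": {6, 7, 12},   # hits no relevant records, so safe to exclude
    "term_y": {3, 8},       # hits relevant record 3, so not used
}

refined, used = refine_with_and_not(retrieved, candidate_exclusions, relevant)
print(used, sorted(refined))
```

Because a term with zero sensitivity retrieves none of the relevant records in the standard, excluding it cannot reduce sensitivity measured against that standard, but it removes non-relevant records and so raises specificity.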
Five studies10,11,19,22,27 used bibliographic software to undertake a more formal frequency analysis of the terms in the internal standard. Two of these studies10,11 carried out word frequency analysis for all of the records in the internal standard and then created search strategies by combining those terms that had the highest scores as determined by multiplying the sensitivity and precision scores. Two studies19,22 used textual analysis of the internal standard records followed by discriminant analysis using logistic regression to determine the best terms to be included in the search strategy. The fifth study27 also used frequency analysis to identify candidate terms for building a search strategy.
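Ranking candidate terms by the product of sensitivity and precision can be sketched as follows. This is an illustration of the scoring idea only: the studies used bibliographic software for the frequency analysis, whereas here each term is simply represented by a hypothetical set of retrieved record IDs.

```python
def sensitivity_precision_score(retrieved, relevant):
    """Score one term as sensitivity multiplied by precision."""
    if not retrieved:
        return 0.0                       # a term that retrieves nothing scores 0
    tp = len(retrieved & relevant)
    sensitivity = tp / len(relevant)
    precision = tp / len(retrieved)
    return sensitivity * precision


relevant = {0, 1, 2, 3}                  # hypothetical internal standard
term_hits = {
    "term_a": {0, 1, 2, 5},
    "term_b": {2, 3, 6, 7},
    "term_c": set(range(20)),
}

scores = {t: sensitivity_precision_score(h, relevant) for t, h in term_hits.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A term that retrieves everything has perfect sensitivity but very low precision, so the product penalises it relative to terms that are more selective.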
Previously published filters were used as a source of terms for four studies.9,12,14,24 These strategies were then further developed by adding extra medical subject heading (MeSH) and text terms identified from the internal standard records. In one of these studies24 the MeSH terms were first translated from a MEDLINE strategy into Emtree terms before adding additional Emtree terms and free-text terms identified from the internal standard records. This study also consulted experts for further suggestions. Individual terms were tested against the internal standard and those with a precision of > 40% and sensitivity of > 1% were added sequentially to develop the filter. Astin et al.9 also used the sequential addition of search terms to develop the search filter.
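Sequential addition of terms that pass precision and sensitivity thresholds might look like the sketch below. The thresholds of > 40% precision and > 1% sensitivity are those quoted above; everything else (the set-based representation, the term names, the record IDs and the order in which terms are tried) is an assumption made for illustration.

```python
def build_sequentially(term_hits, relevant, min_precision=0.40, min_sensitivity=0.01):
    """Add qualifying terms one at a time, tracking cumulative sensitivity."""
    filter_terms, history = [], []
    retrieved = set()
    for term, hits in term_hits.items():
        if not hits:
            continue                                  # term retrieves nothing
        tp = len(hits & relevant)
        precision = tp / len(hits)
        sensitivity = tp / len(relevant)
        if precision > min_precision and sensitivity > min_sensitivity:
            filter_terms.append(term)
            retrieved |= hits                         # Boolean OR with filter so far
            history.append((term, len(retrieved & relevant) / len(relevant)))
    return filter_terms, history


relevant = {0, 1, 2, 3}
term_hits = {
    "term_a": {0, 1, 2, 5},      # precision 0.75, sensitivity 0.75
    "term_b": {2, 3, 6, 7},      # precision 0.50, sensitivity 0.50
    "term_c": set(range(20)),    # precision 0.20, rejected
}

filter_terms, history = build_sequentially(term_hits, relevant)
print(filter_terms, history)
```

Recording the cumulative sensitivity after each addition shows how much each new term contributes to the developing filter.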
Internal validation performance measures
The performance of the search filters was tested against the gold or reference standard in 19 studies7–17,19–21,23–25,27,28 to test internal validity. Nine studies7,13,15–17,20,21,23,28 carried out by the McMaster Hedges team reported the results for single-term and combined-term search strategies, whereas the remaining study8 from this team reported only the performance of combination-term strategies. Studies reporting single-term strategies included between one and six single-term strategies whereas the number of combination strategies reported varied between four and 14. The performance of strategies was usually reported in terms of high sensitivity, high specificity or optimised balance between sensitivity and specificity. The other nine studies9–12,14,19,24,25,27 tested between one and eight filters, with some single-term strategies but mostly combination strategies. The focus of these search filters was to produce highly sensitive, highly specific or highly precise outcomes.
The performance measures reported for internal validation are presented in Table 6. Sensitivity was reported by all 19 studies, precision was reported by 16 studies and specificity was reported by 14 studies. Accuracy was reported by seven studies and the number needed to read (NNR) by four studies. Positive likelihood ratio (LR+) values and fall-out were each only reported in a single study. All of the performance measures were presented in tables with the exception of one study,25 for which the results were presented in a figure that contained the full search strategy and values for sensitivity and specificity.
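For reference, the measures named above can all be computed from the four cells of a 2 × 2 table of retrieved versus relevant records. The sketch below uses the standard definitions; the counts are hypothetical and not drawn from any included study, and the function assumes that no denominator is zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # relevant records retrieved by the filter
    fp: int  # non-relevant records retrieved by the filter
    fn: int  # relevant records missed by the filter
    tn: int  # non-relevant records not retrieved


def performance(c: Counts) -> dict:
    """Compute common search filter performance measures as proportions."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)              # equivalent to PPV
    fall_out = c.fp / (c.fp + c.tn)               # 1 - specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn),
        "nnr": 1 / precision,                     # number needed to read
        "lr_plus": sensitivity / fall_out,        # positive likelihood ratio
        "fall_out": fall_out,
    }


# Hypothetical test: 100 relevant records among 10,100 in total.
measures = performance(Counts(tp=90, fp=910, fn=10, tn=9090))
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these hypothetical counts the filter is highly sensitive (0.90) and specific (about 0.91), yet its precision is only 0.09, which illustrates why several measures are usually reported together.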
TABLE 6
Review A: performance measures – internal standards
External standards
Nine of the 23 studies used external standards to test or validate the search filters that had been developed or revised.9,10,17–19,22,26,27,29 For these studies, a reference standard that was different from the one used to derive the search filter was used. These studies included studies of diagnostic test, systematic review and RCT filters. Four studies9,10,17,18 used hand-searching of journals to generate the external standard. The number of journals searched ranged from 1 to 161, resulting in between 53 and 332 records in the external standards. Two of these four studies17,18 increased the numbers in the external standard by adding records from a search of either the Cochrane Database of Systematic Reviews (CDSR) or the Database of Abstracts of Reviews of Effects (DARE).
Four22,26,27,29 of the other five studies that used external standards were of RCT search filters and one19 was of a systematic review search filter. Two of these studies27,29 identified records for their standards by searching systematic reviews (one searched 61 reviews from the CDSR29 and one27 searched seven systematic reviews of cluster RCTs). Another study26 searched for records in 11 journals in the CENTRAL database, generating 308 references. In the remaining RCT search filter study22 MEDLINE was searched to identify records that were assessed as being trials. In the study that examined a systematic review search filter19 models were tested using a validation data set and against a ‘real-world’ scenario using Ovid MEDLINE on compact disc, read-only memory (CD-ROM). The validation data set had been created from a hand-search of five journals. The results of this hand-search had been split into an internal test set (n = 256, 75%) and an external validation set (n = 89, 25%).
External validation performance measures
The performance of the search filters was tested against external standards in nine studies.9,10,17–19,22,26,27,29 The performance measures reported for external validation are presented in Table 7. All nine studies reported sensitivity and seven of the nine studies reported precision. Two studies9,17 reported specificity and two10,29 reported the NNR (described as ‘article read ratio’ in one article). Two studies26,27 reported a single performance measure, that is, sensitivity only, three studies18,19,22 reported two performance measures and four studies9,10,17,29 reported three performance measures. The performance measures were again presented almost exclusively in tables, with one exception,26 in which the performance measures were simply discussed in the text of the article.
TABLE 7
Review A: performance measures – external standards
Discussion
Methods used to develop and validate search filters
A total of 23 studies were included in this review. In the majority of these studies an internal gold or reference standard was used to develop the search filter by identifying candidate terms and assessing performance. The way in which terms were chosen for inclusion, however, and how the combinations were determined varied. The internal gold standards were mainly derived from journal hand-searches although a few were derived by other methods (from a database search or studies identified from systematic reviews). Ten of the studies were produced by the McMaster Hedges team and these all used the same method of search filter development, which combined use of an internal gold or reference standard with consultation with experts. Five other studies made use of statistical methods for filter development. The use of statistical methods helps to make the process more objective rather than depending on human expertise. In a few cases, the search filter was not developed using a gold standard or reference standard but was adapted from a previous search filter. Only nine studies undertook external validation, that is, validation against a standard that was different from the one used to develop the filter. Because external validation provides an independent assessment of filter performance, it is more rigorous and gives a better indication of how a filter is likely to perform in the real world.
Reported performance measures
Across the 23 studies included in the review, eight different performance measures were reported; however, as precision and positive predictive value (PPV) are equivalent, there were actually seven different performance measures. The performance measures used for internal and external validation and their frequency of use are listed in Tables 6 and 7 respectively. The most frequently reported performance measures were, in descending order, sensitivity, precision and specificity.
All studies reported sensitivity, reflecting the importance of this measure when determining the usefulness of a search filter. As the filters are used to identify relevant articles, it is important to measure the number of relevant articles retrieved by the filter compared with the total possible number of relevant articles. When carrying out a systematic review, in which it is important to identify as many relevant studies as possible, it makes sense to use a search filter with a high sensitivity value.
Specificity and precision were the next most frequently reported measures. A search filter should reject non-relevant articles, so high specificity is desirable alongside high sensitivity: there is little point in using a filter that retrieves all of the relevant articles if it also retrieves large numbers of non-relevant ones. The articles in the review often included search filters that were optimised for the best balance of sensitivity and specificity.
Precision measures the number of relevant articles as a proportion of all articles retrieved, so the aim is to maximise the precision of a search filter. In practice, however, sensitivity and precision tend to be inversely related, and it is difficult to achieve both high sensitivity and high precision. The NNR is another way of reporting precision, as it is calculated by dividing 1 by the precision value (expressed as a proportion). This measure gives the number of articles that need to be read to find one relevant article and may, therefore, be more easily understood than precision, which is usually quoted as a percentage value.
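As a worked illustration of the relationship between precision and the NNR (the precision value of 5% is hypothetical and chosen only for ease of arithmetic):

```latex
\[
\text{NNR} = \frac{1}{\text{precision}}, \qquad
\text{precision} = 5\% = 0.05 \;\Rightarrow\; \text{NNR} = \frac{1}{0.05} = 20 .
\]
```

In other words, with a precision of 5% a searcher would, on average, need to read 20 retrieved records to find one relevant record.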
The accuracy performance measure was used only in articles produced by the McMaster Hedges team. It provides a measure of the number of articles that are classified correctly as either relevant or non-relevant. The usefulness of this measure on its own, however, is unclear as a high accuracy value may be obtained when the specificity value is high but the sensitivity value is medium or low. In most cases the accuracy value is close to the specificity value and does not give an indication of the sensitivity value.
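The reason accuracy tends to track specificity can be shown with a short calculation. Accuracy is a weighted average of sensitivity and specificity, with the weights given by the proportions of relevant and non-relevant records, and in bibliographic databases non-relevant records usually dominate. The values below are hypothetical, and `prevalence` (the proportion of records that are relevant) is a name introduced here for illustration.

```python
def accuracy(sensitivity, specificity, prevalence):
    """Accuracy as a prevalence-weighted average of sensitivity and specificity."""
    return prevalence * sensitivity + (1 - prevalence) * specificity


# Hypothetical database in which 1% of records are relevant.
low_sensitivity = accuracy(sensitivity=0.50, specificity=0.98, prevalence=0.01)
high_sensitivity = accuracy(sensitivity=0.95, specificity=0.98, prevalence=0.01)
print(round(low_sensitivity, 4), round(high_sensitivity, 4))
```

Under these assumptions, nearly doubling sensitivity changes accuracy by less than half a percentage point, and both accuracy values sit close to the specificity of 0.98.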
The other two performance measures that were found (LR+ and fall-out) each appeared in one article. These performance measures were reported in addition to sensitivity and either specificity or precision.
Presentation of performance measures
The most commonly used format for the presentation of performance measures used for single studies of search filters was tables. Only two studies of RCT filters did not present the performance measures in tables. One of these studies presented the search strategy and its performance measures in a figure whereas the other study simply discussed the performance measures in the text of the article. Thus, tables seem to be a popular and useful way of presenting performance measures. Often the results are ordered in tables according to one of the performance measures, for example sensitivity, thus making it easy to identify the most sensitive and the least sensitive search filter. The studies often presented the performance measures in a number of tables to allow ordering by different performance measures, for example tables ordered by sensitivity or specificity or precision. This makes it easier to select a search filter for a specific need, for example researchers involved in performing systematic reviews requiring very sensitive search filters could select the most sensitive search filters whereas busy clinicians who are simply looking for some relevant articles could select a filter with the highest precision.
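Presenting the same results ordered by different measures is straightforward to reproduce programmatically. The sketch below is illustrative only: the filter names and performance values are invented, and `ordered_by` and `as_table` are hypothetical helper names.

```python
results = [
    {"filter": "Filter A", "sensitivity": 0.99, "specificity": 0.70, "precision": 0.05},
    {"filter": "Filter B", "sensitivity": 0.90, "specificity": 0.92, "precision": 0.15},
    {"filter": "Filter C", "sensitivity": 0.75, "specificity": 0.98, "precision": 0.40},
]


def ordered_by(rows, measure):
    """Return the rows sorted from highest to lowest on one measure."""
    return sorted(rows, key=lambda row: row[measure], reverse=True)


def as_table(rows):
    """Format rows as a plain-text table with percentages."""
    lines = [f"{'Filter':<10}{'Sens':>8}{'Spec':>8}{'Prec':>8}"]
    for row in rows:
        lines.append(
            f"{row['filter']:<10}{row['sensitivity']:>8.0%}"
            f"{row['specificity']:>8.0%}{row['precision']:>8.0%}"
        )
    return "\n".join(lines)


most_sensitive_first = ordered_by(results, "sensitivity")
most_precise_first = ordered_by(results, "precision")
print(as_table(most_sensitive_first))
print(as_table(most_precise_first))
```

A systematic reviewer would read from the top of the sensitivity-ordered table, whereas a clinician wanting a few relevant articles quickly would read from the top of the precision-ordered table.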
Key findings
- Internal gold or reference standards were mostly derived by hand-searching of journals.
- Validation of filters was mostly carried out using internal validation.
- The most commonly used performance measures were sensitivity, precision and specificity.
- The majority of the studies presented performance measures in tables.
Measures for comparing the performance of methodological search filters (review B)
Reproduced with permission from Harbour et al.46 © 2014 The authors. Health Information and Libraries Journal © 2014 Health Libraries Journal. Health Information & Libraries Journal, 31, pp. 176–194.
Introduction
A variety of methodological search filters are already available to find RCTs, economic evaluations, systematic reviews and many other study designs. In principle, these filters can offer efficient, validated and consistent approaches to study identification within large bibliographic databases. Search filters, however, are an under-researched tool. Although there are many published search filters, few have been extensively validated beyond the data offered in the original publications.47–49 This means that their performance in the real-world setting of day-to-day information retrieval across a range of search topics is unknown.50 Furthermore, search filters are seldom assessed against common data sets, which makes a comparison of performance across filters problematic. Consequently, the use of search filters as a standard tool within technology assessment, guideline development and other evidence syntheses may be pragmatic rather than evidence based.50,51
As search filters proliferate, the key question becomes how to choose between them. The most useful information to assist search filter choice is likely to be performance data derived from well-conducted and well-reported performance tests or comparisons. Methods exist to test search filter performance and to build the performance picture, including reviews of search filter performance.48,49,52–54 There is no formal guidance, however, on the best methods for testing filter performance, on which performance measures are valued by searchers and on which measures should ideally be reported to assist searchers in choosing between filters. The performance picture for filters across different disciplines, questions and databases is therefore largely unknown. Different performance measures are reported in studies describing search filters and the process whereby searchers choose a filter remains unclear.
The purpose of this review was to consider the measures and methods used in reporting the comparative performance of multiple methodological search filters.
Objectives
This review addressed the following questions:
- What performance measures are reported in studies comparing the performance of one or more methodological search filters in one or more sets of records?
- How are the results presented in studies comparing the performance of one or more methodological search filters in one or more sets of records?
- How reliable are the methods used in studies comparing the performance of methodological search filters?
- Are there any published methods for synthesising the results of several filter performance studies?
- Are there any published methods for reviewing the results of several syntheses?
Methods
Identification of studies
Studies were identified from the ISSG Search Filters Resource.6 The ISSG Search Filters Resource is a collaborative venture to identify, assess and test search filters designed to retrieve health-care research by study design. It includes published filters and ongoing research on filter design, research evaluating the performance of filters and articles providing a general overview of search filters. At the time of this project, regular searches were being carried out in a number of databases and websites, and tables of contents of key journals and conference proceedings were being scanned to populate the site. Researchers working on search filter design are encouraged to submit details of their work. The 2010 update search carried out by the UK Cochrane Centre to support the ISSG Search Filters Resource website was also scanned to identify any relevant studies not at that time included on the website. We acknowledge that there has been a regrettable delay between carrying out the project, including the searches, and the publication of this report, due to serious illness of the principal investigator. The searches were carried out in 2010/2011.
Inclusion criteria
For the purpose of this review, methodological search filters were defined as any search filter or strategy used to identify database records of studies that use a particular clinical research method. A pragmatic decision was taken to include only studies comparing the performance of filters for RCTs, DTA studies, systematic reviews or economic evaluation studies. These study types are the ones most commonly used by organisations such as NICE to underpin their decision-making when producing technology appraisals and economic evaluations of health-care technologies and subsequent clinical guidelines.
Studies were selected for inclusion in the review if they compared the performance of two or more methodological search filters in one or more sets of records. Studies reporting the development of new methodological filters whose performance was compared with that of previously published filters were also included.
Exclusion criteria
Studies were excluded from the review if they:
- reported the development and initial testing of a single search filter that did not include any formal comparison with the performance of other search filters
- compared methodological search filters that had not been designed to retrieve RCTs, DTA studies, systematic reviews or economic evaluation studies
- compared the performance of a single filter in multiple databases or interfaces
- were not available as a full report, for example conference abstracts
- were protocols for studies or reviews
- lacked sufficient methodological detail to undertake the data extraction process.
Data extraction and synthesis
A data extraction form was developed by two reviewers (JH, CF) to standardise the extraction of data from the selected studies and allow cross-comparisons between studies. Details extracted included the methods used to identify published filters for comparison, the methods used to test filter performance and the performance measures reported. Data extraction for each study was carried out by one reviewer (JH) and verified by a second reviewer (CF). A narrative synthesis was used to summarise the results from the review.
Results
Twenty-one studies were identified as potentially meeting the inclusion criteria for this review based on titles and abstracts.2,10,14,15,17,19,22,23,25,33,48,49,55–63 Of these studies, 10 reported the development of one or more search filters, whose performance was then compared against the performance of existing filters10,14,15,17,19,22,23,25,56,57 and 11 reported the comparative performance of existing filters.2,33,48,49,55,58–63 On receipt of the full articles, three studies55,60,62 were excluded from the review based on the criteria outlined in the methods section. The 18 included studies are listed in Tables 8 and 9 and the excluded studies are listed in Table 10. No studies were identified that synthesised the results of several performance reports or reviewed the results of several syntheses.
TABLE 8
Review B: characteristics of the performance comparison studies included in this review
TABLE 9
Review B: table of included studies
TABLE 10
Review B: excluded studies
Of the 18 studies included in the review:
- one reported the performance of filters for economic evaluations59
- one reported the performance of RCT and systematic review filters.63
The methodological filters evaluated in the included studies had been developed in a variety of interfaces including the interfaces to LILACS, Ovid, PubMed and SilverPlatter. Most studies, however, did not specify the interface used in the development of some or all of the filters being compared.2,15,17,19,22,23,49,56–59,61 This absence of detail was particularly common in studies in which performance comparison was secondary to the development of one or more new filters.15,17,19,22,23,56,57
Fourteen studies compared the performance of filters in MEDLINE (various interfaces).2,10,14,17,19,22,23,33,48,49,56–58,61 Two studies tested filters in MEDLINE and EMBASE.59,63 One study only tested EMBASE filters15 and one study compared filters in LILACS.25 Seven of the eight studies comparing DTA filters used MEDLINE to test performance, although the interface used varied.2,10,14,48,49,57,58
Studies included in the review used a variety of methods to identify relevant filters for comparison, including database searches,2,14,48,49,61 consulting relevant websites14,23,59,61 and contacting experts in the field.2,49,59 Ten studies used other methods of identifying filters such as using studies that they already knew about or studies that they had conducted themselves.2,10,17,22,23,49,57,58,61,63 Five studies did not provide explicit details on how the filters for testing were identified.15,19,25,33,56
The number of filters compared in a single study ranged from 2 to 38. DTA study and RCT filters were the most common filters compared and systematic review and economic evaluation filters were the least common.
Gold standards
In search filter research a gold standard or reference set is a set of relevant records against which a filter’s performance can be assessed. For example, a collection of records of confirmed RCT studies would be used when testing the performance of a methodological search filter designed to identify RCTs.
Studies included in this review used a range of techniques to identify and/or create a gold or reference standard against which to test the performance of multiple filters. One study did not use a gold standard;33 instead, each of the filters was combined with single terms describing four topics (hypertension, hepatitis, diabetes and heart failure) and the retrieved studies were checked to confirm whether or not they were RCTs.
The size of the gold or reference standards used to test filter performance ranged from 33 to 1955 records. None of the studies included in this review reported whether or not they had carried out a sample size calculation when developing their gold or reference standard (a sample size calculation is a statistical process that determines the minimum number of records required for a gold standard to provide accurate estimates of performance). Four of the DTA filter studies2,14,49,58 and one RCT filter study22 limited their gold standard to specific clinical topics.
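As an illustration of what such a sample size calculation might look like, the sketch below uses the generic formula for estimating a proportion to within a chosen margin of error, n = z² p(1 − p) / d². This is a standard statistical approximation and is not a method taken from any of the included studies; the function name, the expected sensitivity and the margin are assumptions made for the example.

```python
import math


def gold_standard_size(expected_sensitivity: float, margin: float, z: float = 1.96) -> int:
    """Minimum number of gold standard records needed to estimate sensitivity
    to within +/- margin, using the normal approximation for a proportion.

    z = 1.96 corresponds to a 95% confidence level (an assumption of this sketch).
    """
    if not 0 < expected_sensitivity < 1:
        raise ValueError("expected_sensitivity must lie strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    n = (z ** 2) * expected_sensitivity * (1 - expected_sensitivity) / margin ** 2
    return math.ceil(n)


# Hypothetical planning values: sensitivity expected to be about 90%,
# to be estimated to within 5 percentage points.
print(gold_standard_size(0.90, 0.05))
```

Under these assumed values, a gold standard of a few dozen records would be too small to give a sensitivity estimate of the stated precision.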
Ten studies developed their gold or reference standards by hand-searching journals.10,15,17,19,23,25,56,57,61,63 The number of journals hand-searched ranged from 4 to 161. The time span covered by hand-searching varied from 1 to 23 years. All of the studies using hand-searching had specific criteria for the identification of the desired study type for inclusion in their gold or reference standard.
Of the 10 studies identifying their gold or reference standard from hand-searching journals, eight were studies in which the authors had developed new search filters and then compared those filters with existing filters.10,15,17,19,23,25,56,57 One study that created a reference standard from hand-searching journals created a ‘control set’ of records from the same group of journals that were not of the desired study design.57
Five studies developed a gold or reference standard based on the studies included in systematic reviews [relative recall (RR) gold standard]2,14,48,49,58 and four studies used database searches to identify records to include in their gold standard.22,56,58,59 The number of completed systematic reviews used as a source of gold standard records varied: one study used included studies from 27 systematic reviews,48 one used included studies from two reviews,58 one used included studies from seven reviews of DTA studies2 and a fourth used studies included in a single case study review.49 One study that developed a DTA study filter and compared it with published filters used the studies included in 16 reviews as the gold standard.14
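The relative recall approach can be sketched as a simple set comparison: the studies included in one or more systematic reviews form the gold standard, and relative recall is the proportion of those records that a search retrieves. The record identifiers and function name below are hypothetical.

```python
def relative_recall(gold_standard: set, retrieved: set) -> float:
    """Proportion of gold standard records (for example, the included studies
    of a systematic review) that appear in the retrieved set."""
    if not gold_standard:
        raise ValueError("gold standard must contain at least one record")
    return len(gold_standard & retrieved) / len(gold_standard)


# Hypothetical record identifiers.
included_in_reviews = {"rec01", "rec02", "rec03", "rec04", "rec05"}
retrieved_by_filter = {"rec01", "rec02", "rec03", "rec05", "rec17", "rec42"}

print(relative_recall(included_in_reviews, retrieved_by_filter))
```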
Translation of filters
Search filters were developed using a range of different search platforms (or interfaces), including Ovid, PubMed or WebSPIRS for MEDLINE filters. Any study comparing the performance of filters may therefore need to ‘translate’ the filters from the syntax used in the original development interface to the syntax required by the interface used in the filter comparison.
Four of the studies included in this review did not translate or adapt the filters that were being compared because the filters had been developed in the same interface as was used in the performance comparison.25,33,56,63 When one or more filter required translation, most of the studies comparing the performance of existing filters reported the complete details of the changes made so that the accuracy of the translation could be verified.2,48,58,59,61 In contrast, most of the studies reporting the development of new filters that included a comparison with existing filters did not mention the requirement to translate any of the filters or provide details of the translation, so it is unclear if valid comparisons were being made.10,17,22,23,57 The review of economic evaluation filters applied an exclusion strategy (animal studies and publication types such as letters and editorials, which are unlikely to be economic evaluations) to filters being tested in MEDLINE and EMBASE.59
Methods of testing
Four of the filter studies that used included studies from systematic reviews as their gold or reference standard replicated the original searches when possible with the addition of the filters being tested.2,48,49,58 None of the original searches incorporated a study method search filter.2,48,49,58 A fifth study using references from systematic reviews as a reference standard combined the filters with ‘terms for deep vein thrombosis’ but did not specify what these terms were or if the original search strategy was used.14
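The testing approach described above, in which a filter is added to a replicated subject search, amounts to a Boolean AND between two result sets. A minimal sketch, using hypothetical record identifiers and a hypothetical function name, is:

```python
def combine_with_filter(subject_hits: set, filter_hits: set, reference_set: set) -> dict:
    """Compare a subject search alone with the same search ANDed with a filter."""
    combined = subject_hits & filter_hits  # subject search AND filter
    found_by_subject = reference_set & subject_hits
    return {
        "retrieved_subject_only": len(subject_hits),
        "retrieved_with_filter": len(combined),
        "reference_found_subject_only": len(found_by_subject),
        "reference_found_with_filter": len(reference_set & combined),
        # Reference records the subject search found but the filter removed.
        "reference_lost_to_filter": sorted(found_by_subject - combined),
    }


# Hypothetical data: eight subject-search hits, three reference standard records.
subject_hits = {f"r{i}" for i in range(1, 9)}
filter_hits = {"r1", "r2", "r3", "r9"}
reference_set = {"r1", "r2", "r4"}

summary = combine_with_filter(subject_hits, filter_hits, reference_set)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this invented example the filter reduces the retrieval from eight records to three, at the cost of losing one reference standard record.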
The performance analyses carried out by Leeflang et al.48 and Ritchie et al.49 occurred after the original reviews (on which the gold or reference standard was based) had been undertaken and therefore attempted to recreate a ‘historical’ search. Ritchie et al.49 noted a small discrepancy in the number of records retrieved between the original searches and the rerun searches, whereas Leeflang et al.,48 who could replicate only 6 out of 27 reviews, did not provide details of any differences in the numbers of retrieved records. Using the complete reference standard from the original reviews, Leeflang et al.48 tested whether those studies were captured by the filters being compared.
Two studies did not provide any information about whether the performance analysis had been undertaken concurrently with the reviews or at a later date.14,58 The review by Whiting et al.,2 which was published online in 2010 and to which we had prepublication access at the time of our study, recreated the original subject search and compared the results of the subject search alone with those of the subject search combined with each of 22 filters.
Four studies by the McMaster Hedges team at McMaster University used their internally developed database for testing filters, with the DTA, RCT and systematic review subsets acting as gold standards.17,23,61,63 One of these studies did not undertake any new analysis but collated the results from previous publications that had used a common gold standard.63
The economic filters study identified a gold standard by searching the NHS Economic Evaluation Database (NHS EED).59 Published MEDLINE and EMBASE economic filters were then tested for their ability to retrieve these gold standard records from MEDLINE and EMBASE. Corrao et al.33 had no gold standard but manually checked whether the records retrieved after applying the filters were RCT studies.
Studies that compared new search filters with existing filters can be divided into two groups based on the type of gold standard used to compare filter performance. One group used a reference standard that had not been used to develop the new filter strategy so that all of the filters in the comparison underwent external validation.10,17,22,23,57 In other words, the performance of all of the filters being compared was tested in a set of records that had not been used to develop any of the included filters. The other group of studies used the same reference standard that had been used in the development of the new filters, so that, although the new filters underwent only internal validation (filter performance was tested only on the one set of records that had also been used to develop the new filters), the comparison filters underwent external validation.14,15,19,25,56 The methodology used in the latter group risks introducing bias in favour of the new filters.
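One way to ensure that a new filter also undergoes external validation is to divide the reference standard before development begins, so that one part is used to derive the filter and the other is reserved for testing all filters. The sketch below shows a simple random split; it is an illustration under assumed names and an assumed 50:50 split, not a procedure reported by the included studies.

```python
import random


def split_reference_standard(record_ids, validation_fraction=0.5, seed=0):
    """Randomly divide a reference standard into a development set and a
    validation set that share no records."""
    ids = sorted(record_ids)          # fixed order so the split is reproducible
    random.Random(seed).shuffle(ids)  # seeded shuffle (seed is an arbitrary choice)
    cut = int(len(ids) * (1 - validation_fraction))
    return set(ids[:cut]), set(ids[cut:])


# Hypothetical reference standard of 100 records.
reference_standard = {f"rec{i:03d}" for i in range(100)}
development_set, validation_set = split_reference_standard(reference_standard)

# The new filter would be derived from development_set only; the new filter and
# all comparator filters would then be tested on validation_set.
print(len(development_set), len(validation_set))
```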
Performance measures reported
The most commonly reported performance measures in studies comparing the performance of search filters were sensitivity/recall and precision (Table 11). A total of 16 studies reported sensitivity/recall2,10,14,15,17,19,22,23,25,49,56–59,61,63 and 13 studies reported precision values.2,10,15,17,19,22,33,49,56,58,59,61,63 Specificity was reported in seven studies.15,17,23,25,57,61,63
TABLE 11
Review B: measures reported in filter performance comparisons
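For readers less familiar with these measures, the following sketch shows how sensitivity/recall, precision and specificity are derived from the four counts obtained when a filter is tested against a gold standard. The class name, field names and counts are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterResult:
    """Counts from testing one filter against a gold standard (hypothetical structure)."""
    true_pos: int   # gold standard records retrieved by the filter
    false_pos: int  # other records retrieved by the filter
    false_neg: int  # gold standard records missed by the filter
    true_neg: int   # other records correctly not retrieved


def sensitivity(r: FilterResult) -> float:
    """Proportion of relevant records retrieved (also called recall)."""
    return r.true_pos / (r.true_pos + r.false_neg)


def precision(r: FilterResult) -> float:
    """Proportion of retrieved records that are relevant."""
    return r.true_pos / (r.true_pos + r.false_pos)


def specificity(r: FilterResult) -> float:
    """Proportion of irrelevant records that are not retrieved."""
    return r.true_neg / (r.true_neg + r.false_pos)


# Invented counts: 100 relevant records in a test set of 10,000.
example = FilterResult(true_pos=90, false_pos=410, false_neg=10, true_neg=9490)
print(f"sensitivity = {sensitivity(example):.3f}")
print(f"precision   = {precision(example):.3f}")
print(f"specificity = {specificity(example):.3f}")
```

Precision can be calculated from the retrieved records alone, which is why it can still be reported when there is no gold standard, whereas sensitivity and specificity cannot.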
In one study that did not use a gold standard or reference standard, sensitivity could not be calculated and instead the proportion of retrieved records that met the authors’ criteria for being a RCT was reported.33 In another study the proportions of gold standard records retrieved and missed for each filter were reported.48 When the original search strategy could not be replicated, this article reported the NNR.48 Bachmann et al.10 reported the NNR for the filter that they developed but not the previously published filter that they used as a comparator. Whiting et al.2 reported the NNR and the number of records missed from the reference set.
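The NNR is conventionally calculated as the reciprocal of precision, that is, the number of retrieved records divided by the number of relevant records among them. A minimal sketch with invented counts:

```python
def number_needed_to_read(relevant_retrieved: int, total_retrieved: int) -> float:
    """NNR = 1 / precision = total retrieved / relevant retrieved."""
    if relevant_retrieved <= 0:
        raise ValueError("NNR is undefined when no relevant records are retrieved")
    return total_retrieved / relevant_retrieved


# Hypothetical search: 500 records retrieved, of which 90 are relevant.
nnr = number_needed_to_read(relevant_retrieved=90, total_retrieved=500)
print(f"NNR = {nnr:.1f}")
```

With these invented counts a searcher would need to assess between five and six records, on average, to find one relevant study.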
No studies comparing the performance of two or more existing filters reported accuracy values (the number of records correctly retrieved or correctly not retrieved as a proportion of all records). The study by Manríquez25 reporting the development of a RCT filter for the LILACS database did report accuracy values for the new filter, as did the study by Wilczynski et al.15 for their newly developed DTA study filters. Additional measures reported in performance comparisons were:
- number of records retrieved49
- retrieval gain (absolute and percentage variations in the number of citations retrieved)33
- the proportion of articles missed per original review48
- the proportion of articles not identified per year48
- diagnostic odds ratio (DOR) (the odds of a relevant record being retrieved divided by the odds of an irrelevant record being retrieved)57
- the number of relevant articles retrieved.56
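Two of the less commonly reported measures, accuracy and the DOR, can be computed from the same four counts as sensitivity and specificity. The sketch below uses the standard definitions; the function names and counts are illustrative assumptions.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Records correctly retrieved or correctly not retrieved, as a proportion of all records."""
    return (tp + tn) / (tp + fp + fn + tn)


def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """Odds of retrieval among relevant records divided by odds of retrieval
    among irrelevant records: (tp / fn) / (fp / tn)."""
    if fp == 0 or fn == 0:
        # A continuity correction is sometimes applied in this situation;
        # this sketch simply refuses to compute the ratio.
        raise ValueError("DOR is undefined when fp or fn is zero")
    return (tp * tn) / (fp * fn)


# Invented counts: 100 relevant records in a test set of 10,000.
print(f"accuracy = {accuracy(90, 410, 10, 9490):.3f}")
print(f"DOR      = {diagnostic_odds_ratio(90, 410, 10, 9490):.1f}")
```

Because irrelevant records usually far outnumber relevant ones in a bibliographic database, accuracy tends to be dominated by the true negatives and can be high even when precision is low, as it is in this example.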
Confidence intervals surrounding performance results were reported by three of the studies that compared the performance of existing search filters.2,61,63 Five of the studies comparing the performance of developed search filters with that of existing search filters reported confidence intervals.10,15,17,25,57
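The included studies did not all state how their confidence intervals were derived. As one commonly used option for a proportion such as sensitivity or precision, the Wilson score interval can be computed as sketched below; the choice of method and the counts are assumptions made for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score confidence interval for a proportion.

    z = 1.96 gives an approximate 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical result: a filter retrieves 90 of 100 gold standard records.
low, high = wilson_interval(90, 100)
print(f"sensitivity 0.90, 95% CI {low:.3f} to {high:.3f}")
```

The width of the interval depends on the size of the gold standard, which is one reason why reporting that size matters when filters are compared.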
Methods used to display performance comparisons/data
All of the studies included in the review displayed the results using a table format, with only two studies supplementing tables of results with graphical (non-tabular) displays of comparative data.2,48 None of the studies reporting the development of new filters displayed comparative performance in a graphical format.10,14,15,17,19,22,23,25,56,57
The majority of tables presenting performance comparison data displayed the filters in rows and performance measures in columns (an example is provided in Table 12). The results in the tables in all included studies were provided as percentages or proportions. Within tables, authors generally listed filter results in descending order by the measure of interest, for example decreasing sensitivity. Four studies reporting the development of a filter only included data on comparative performance in the text of the study report.10,23,25,57
TABLE 12
Review B: example of a filter performance comparison table as commonly presented in the literature
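The common layout described above (filters in rows, measures in columns, ordered by the measure of interest) can be generated consistently from the underlying data. The filter names and values in this sketch are invented and do not come from any included study.

```python
def format_comparison(rows, sort_by="sensitivity"):
    """Return a plain-text table with one filter per row, sorted in descending
    order of the chosen measure and reported as percentages."""
    ordered = sorted(rows, key=lambda r: r[sort_by], reverse=True)
    lines = [f"{'Filter':<12}{'Sensitivity (%)':>17}{'Precision (%)':>15}"]
    for r in ordered:
        lines.append(
            f"{r['filter']:<12}{r['sensitivity'] * 100:>17.1f}{r['precision'] * 100:>15.1f}"
        )
    return "\n".join(lines)


# Hypothetical results for three filters tested on the same gold standard.
rows = [
    {"filter": "Filter A", "sensitivity": 0.88, "precision": 0.21},
    {"filter": "Filter B", "sensitivity": 0.97, "precision": 0.06},
    {"filter": "Filter C", "sensitivity": 0.74, "precision": 0.35},
]

print(format_comparison(rows))                       # most sensitive filter first
print(format_comparison(rows, sort_by="precision"))  # most precise filter first
```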
Tables that did not list filter results in descending order by the measure of interest instead arranged results by:
- filter criteria (sensitive, accurate, etc.)48
- filter alone compared with a clinical subject strategy58
- use or not of an exclusion strategy59
- subject search alone compared with the same subject search with each test filter2
- descending order of cumulative precision or cumulative sensitivity.56
Tables were also used to present information on the number of studies retrieved58 and the specificity, sensitivity and precision of single terms.63 One study that reported highest precision combined with sensitivity of > 69% showed the results of the filters meeting these criteria in a separate table.49
Leeflang et al.48 used a bar graph to display the average proportion of retrieved and missed gold standard records per filter tested (Figure 1). Whiting et al.2 presented the overall sensitivity and specificity of each filter tested in a forest plot, including confidence intervals (Figure 2).
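To indicate what a forest plot-style display conveys, the following sketch draws a crude text-only version: each row shows a point estimate (`*`) and its confidence interval (`-`) on a scale from 0 to 1. It is not a reproduction of the published figure; the filter names, estimates and intervals are invented.

```python
def ascii_forest_row(label: str, estimate: float, low: float, high: float, width: int = 50) -> str:
    """One row of a text forest plot on a 0-1 scale."""
    def col(x: float) -> int:
        # Map a proportion to a column index, clamped to the plotting area.
        return min(width, max(0, round(x * width)))

    cells = [" "] * (width + 1)
    for i in range(col(low), col(high) + 1):
        cells[i] = "-"
    cells[col(estimate)] = "*"
    return f"{label:<10}|{''.join(cells)}| {estimate:.2f} ({low:.2f} to {high:.2f})"


# Hypothetical sensitivity estimates with 95% confidence intervals.
results = [
    ("Filter A", 0.88, 0.80, 0.93),
    ("Filter B", 0.97, 0.92, 0.99),
    ("Filter C", 0.74, 0.65, 0.82),
]
for label, est, low, high in results:
    print(ascii_forest_row(label, est, low, high))
```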
Discussion
Eighteen published articles met the criteria for inclusion in this review. No numerical syntheses of filter performance comparisons were identified, which may be because of the limited availability of performance comparison articles. The majority of included studies reported the development of one or more new filters and compared performance against the performance of existing filters as an adjunct to the main research. This would seem to indicate a focus within filters research on the development of new, ‘better’ filters rather than on a comparison of performance across existing filters. The proliferation in search filters, however, may make it more difficult for searchers to quickly select the most appropriate filter for their particular purpose. The development of increasingly effective filters and the transparent reporting of performance comparisons are important in demonstrating improvements in the performance of new filters compared with current methodological filters.
The number of comparisons of performance varied across study designs. A single study was identified that compared the performance of economic evaluation filters59 whereas studies reporting on the performance of DTA study and RCT filters were much more common. As there have been, until recently, several specialist economics databases [NHS EED, the Health Economic Evaluations Database (HEED) and the Cost-effectiveness Analysis Registry], it may be that filters for the retrieval of economic evaluation studies have been given a lower research priority than filters for other study designs such as RCTs and DTA studies.
Reporting methods of comparison
It was difficult to assess the reliability of the methods used in studies comparing the performance of multiple search filters because the size of the gold or reference standard, the method of testing, the performance measures reported and the presentation of the results varied greatly across studies. In addition, among studies that developed new filters, the methodological detail provided on the comparison of filter performance between new and existing filters was limited.
The description of the methods used in studies reporting the development of new filters and studies comparing only published filter performance differed. Those developing new filters focused their methods section on describing the selection and combination of terms for use in the new filters, with only minimal detail provided in the sections dedicated to describing the performance comparison of the new filters and existing filters. The comparison was often secondary to the main analysis and suffered from a lack of transparency. In contrast, in studies in which the focus was on comparing the performance of multiple existing filters, the methods used in identifying and testing the published filters included in the study tended to be reported more fully.
Many filter development studies did not clearly explain how they had identified filters for inclusion in performance testing. Not reporting how filters were identified and whether or not they were developed in the same interface used for testing could have implications for reliability and bias within the studies. If studies do not report how the filters used in comparisons were identified, it is not possible to determine whether the filters were selected in an unbiased fashion or whether they might have been preferentially selected to suit the test environment. In this review, studies reporting the development and testing of one or more filters all found that the new filter performed better than the existing filters used as comparators. Such a uniform finding could itself be a sign of bias, which makes it particularly important that studies clearly report how comparator filters were selected and how the comparison was performed.
Details about the translation of published filters for different interfaces were lacking in many filter development studies. Generally, more details about methods of translation were provided in studies that reported filter performance comparisons separately from the development of new filters. Combined with the lack of information about the original interface used in the development of published filters, the lack of translation details in many filter development studies makes it almost impossible to determine the accuracy of any alterations. As incorrect or imprecise translation of a filter is likely to impact on the results retrieved, the lack of methodological detail provided is a cause for concern.66
Almost all of the included studies used a gold or reference standard to test the comparative performance of developed and existing filters. This would seem to indicate that using a gold or reference standard to test and compare filter performance is widely accepted in the filter research community. The size of the gold or reference standard used, however, varied widely, from tens to thousands of records. It is possible that the size and content of the gold standard may have an impact on the performance measures recorded for a specific filter, and so it would be helpful if researchers could justify their choice, by, for example, reporting a sample size calculation.
Some of the studies included in the review used a single gold or reference standard for both developing a new filter and comparing the new filter with published filters. This could potentially introduce performance bias in favour of the new filter as the new filter undergoes only internal validation whereas the comparator filters undergo external validation. In other words, the new filter is tested only against the set of records from which it was developed, whereas the comparator filters are tested against a set of records that are different from the gold or reference standards that were used to develop them. When a filter is tested against the same set of records from which it was developed, it is likely that the filter will perform better than it might in a different sample of records.
Reporting performance measures
Sensitivity and precision appear to be considered the most useful measures of filter performance as they are the most commonly reported measures in the literature. As the same performance measures were reported in studies developing new search filters and studies reporting the comparative performance of existing filters, this is one area of methodological consistency between the two types of performance comparison study included in this review.
There is a suggestion, from the small number of studies included in this review, that there are some measures that are preferentially reported for DTA study filters, for example the NNR. Similarly to the metric ‘number needed to treat’ (NNT), the NNR reflects the number of retrieved records that need to be assessed to identify a relevant study. By reporting the NNR, studies seek to make it easier for searchers to determine how effective a filter will be in reducing the number of irrelevant records retrieved and therefore the relative reduction in time needed to identify relevant studies for inclusion or full-text retrieval.
The results of filter performance comparisons were presented almost exclusively in tables, with only two studies also presenting data graphically, perhaps reflecting the difficulties in presenting filter performance comparisons visually. Many of these tables were long and complicated, making interpretation of the results and the selection of an appropriate filter challenging. In most cases it would not be easy to identify the most suitable filter without reading several studies, including tables, in detail. A lack of time and search filter expertise potentially compounds the problem of selecting an appropriate filter based on performance data as they are currently reported in the literature.
Of the two graphics used in the included studies to present results, a design similar to a forest plot (see Figure 2) may prove attractive to searchers as it is a familiar format used in systematic reviews and meta-analyses. This design may also make it easier to identify visually the most precise, most sensitive and best-balanced filter. A further exploration of methods for graphically presenting filter performance comparisons would be useful for both researchers involved in filter performance research and searchers needing to identify a suitable filter for their project.
Limitations of this review
There are a number of potential limitations to this review. It was not possible to undertake a full systematic review because of time constraints. It was also not possible to review all filters for all study methods. The review was, however, focused on study types that were felt to be the key study designs of current interest in evidence-based health research (namely RCTs, DTA studies, systematic reviews or economic evaluation studies). Finally, research carried out on the performance of multiple search filters that has not yet been published or has been presented only at conferences was excluded from the review, possibly resulting in some alternative formats for the presentation of results being missed. Conference abstracts, however, would be likely to report even fewer methodological details than full articles included in this review.
Key findings
- The main measures of search filter performance reported in the literature are sensitivity/recall, precision and specificity.
- Filter performance comparison studies most commonly report highest sensitivity, highest precision and optimal/balanced filter strategies.
- Articles reporting the development of new search filters and a comparison with existing filters provide limited methodological details.
- Tables are the most frequently used method for reporting the results of filter performance comparisons but graphs may be more useful.
Recommendations
The following recommendations for the presentation of filter performance comparisons are made based on the results of this review.
- Studies that compare search filter performance should explicitly report the methods and results to help searchers identify the most appropriate filter for their particular purpose.
- Studies presenting the development of new search filters that include comparisons with existing filters should present detailed methods describing how the performance comparisons were undertaken.
- One or more gold or reference standards should be used for testing filter performance.
- Search filters should be validated on gold or reference standards that are different from those from which they were developed.
- The size of the gold or reference standard(s) should be clearly stated and a sample size calculation presented to justify the size of the standard(s).
- Any translation of filters should be specifically reported in all articles in which a filter has been used in a different interface from that in which it was developed.
- Results should be presented systematically, identifying clearly the best-performing filter for specific purposes (sensitive strategy, specific strategy, balanced strategy).
- When tables of performance results are provided, a consistent format and order should be used to make the information easy to extract.
Measuring performance in diagnostic test accuracy studies (review C)
Introduction
Performance measurement of search filters can be seen as analogous to DTA in that diagnostic tests aim to reliably differentiate those who have a specific disease (relevant studies for searchers) from those who do not have the disease (irrelevant studies for searchers). Diagnostic tests also aim to be as accurate as possible in distinguishing cases of disease from cases of non-disease, by minimising false positives (positive results for those who do not have the disease) and false negatives (negative results for those who do have the disease). Similarly, search filters aim to identify all relevant studies (true positives) while aiming to minimise the retrieval of irrelevant studies (false positives).
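The analogy can be made concrete by cross-classifying records in the same way that a DTA study cross-classifies patients: the gold standard plays the role of the reference standard and retrieval by the filter plays the role of a positive index test result. The sketch below builds the resulting 2 × 2 table from sets of hypothetical record identifiers.

```python
def two_by_two(all_records: set, relevant: set, retrieved: set) -> dict:
    """Cross-classify records by relevance (reference standard) and retrieval
    (index test), as in a diagnostic 2 x 2 table."""
    return {
        "TP": len(relevant & retrieved),                # relevant and retrieved
        "FP": len(retrieved - relevant),                # irrelevant but retrieved
        "FN": len(relevant - retrieved),                # relevant but missed
        "TN": len(all_records - relevant - retrieved),  # irrelevant and not retrieved
    }


# Hypothetical database of ten records.
all_records = {f"rec{i}" for i in range(10)}
relevant = {"rec0", "rec1", "rec2"}
retrieved = {"rec0", "rec1", "rec5", "rec6"}

table = two_by_two(all_records, relevant, retrieved)
print(table)
```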
This review explores published guidance and recommendations that inform best practice in the measurement and reporting of DTA and assesses their applicability to the area of search filter performance.
Objectives
- To identify recommended methods for conducting DTA studies and evaluating test performance.
- To identify the diagnostic test performance measurements that have been reported and presented.
- To identify methods to compare DTA performance from primary studies.
- To assess how applicable these measures and methods are to search filter performance and how these measures might add value to the filter selection process.
Methods
We undertook literature searches of electronic databases to identify articles that reviewed methodological aspects of undertaking DTA studies and DTA reviews or provided guidelines and other recommendations on how DTA studies or reviews should be carried out and how the results should be reported. These searches were supplemented by consulting key HTA agencies and Cochrane websites for relevant reports or recommendations.
The following databases were searched in October 2011: Cochrane Methodology Register, The Cochrane Library (Issue 4, 2011), Medion (October 2011), MEDLINE (1950 to October Week 3 2011), MEDLINE In-Process & Other Non-Indexed Citations (28 October 2011) and EMBASE (1980 to Week 43 2011). Full details of the strategies used are reproduced in Appendix 2 along with a list of websites that provided potentially useful reports.
From the electronic database searches, 1454 records were retrieved, which was reduced to 972 records after deduplication. After screening titles and abstracts, 97 records were selected as being potentially useful. The full articles were obtained and read for relevance. In addition, eight reports were obtained from organisation websites. Forty-seven of these reports contributed information to the review.36,67–112 A list of the remaining 58 retrieved documents that were excluded from the review is provided in Appendix 3. Studies were excluded because they were considered to be irrelevant, described issues or methods that were better expressed or more thoroughly considered in another publication or were duplicate publications. A flow chart showing the selection process for inclusion of studies in the review is provided in Figure 3.

FIGURE 3
Review C: selection of reports for inclusion in the review. CMR, Cochrane Methodology Register.
We acknowledge that there has been a regrettable delay between carrying out the project, including the searches, and the publication of this report, because of serious illness of the principal investigator. The searches were carried out in 2010/11.
Results for diagnostic test accuracy studies
Conducting diagnostic test accuracy studies
Diagnostic test accuracy measures the ability of the diagnostic test being evaluated, the index test, to distinguish between patients with and patients without the targeted disease or condition.67 The results are verified against the results of a reference standard in the same group of patients. The reference standard is independent of the index test and is usually the best available method to identify patients with the target condition.68,69 When a comparator test is also under evaluation, the index and comparator test must be evaluated against the same reference standard and in the same population.69 In the absence of a suitable reference standard a number of alternative methods have been proposed.70–72
Test accuracy is not fixed and can vary between patient subgroups, with disease severity, in different clinical settings and with different test interpreters.67 Several guidance documents describe how these variations in the design and conduct of diagnostic tests can lead to bias, resulting in substantial differences being observed between primary studies.69,73–76 The effects of different types of bias have been estimated using empirical data.76–79
As diagnostic tests do perform differently in different populations, the importance of testing in a suitable sample of patients receives much attention in the literature. The patient sample should be representative in terms of the disease severity of the target population for whom the test is intended, to avoid spectrum bias (i.e. the variation in the sensitivity and/or specificity of a diagnostic test when applied to people of different ages, genders, nationalities or specific disease manifestations).69,73,75,80 Ideally, patients should be recruited consecutively or randomly in a single cohort and be unselected by disease state.74 Case–control studies are likely to lead to bias because patients with and without the condition are recruited using different sets of criteria69,73 and because they overestimate diagnostic accuracy.77 Other main sources of bias relate to the unsuitability of the reference standard, how the reference and index tests have been undertaken, interpreter blinding and interpretation of the results.79
Uncertainty around estimates of diagnostic accuracy decreases with increasing sample size75 and it is recommended that sample size calculations should be undertaken during study planning to ensure that a reasonably precise estimate of test accuracy can be achieved.81,82 Tables have been published to assist in determining the minimum sample size required83 for a DTA study once the prevalence of the target condition in the population as well as the expected sensitivity have been determined. However, two reviews of DTA studies found that very few studies gave any consideration to sample size.81,82
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool has been developed to assist researchers to assess the quality of primary DTA studies84,85 and as such provides a useful guide to the issues that should be addressed when undertaking a DTA study. The questions cover aspects of methodology that are thought to make a difference to the reliability of a study, such as the suitability of the patient sample and the reproducibility of the reference standard and index test. Poor reporting of DTA studies, however, can make applying the QUADAS tool difficult.78 Since the searches for this review were undertaken, a revised version of the QUADAS tool has been published. The QUADAS 2 tool, which is applied in four phases, will, according to the publishers, allow for a more transparent rating of bias and the applicability of primary diagnostic accuracy studies than the original QUADAS tool.
Measuring diagnostic test accuracy
Contingency table
The primary outcomes of interest in DTA studies are the data required to populate 2 × 2 contingency tables presenting the presence or absence of the target condition or disease, as defined by the reference standard against the result of the index test (Table 13). From this all DTA measures can be derived.
TABLE 13
Review C: contingency table
Measures
Table 14 describes the measures of diagnostic accuracy that are commonly calculated, namely sensitivity, specificity, likelihood ratio (LR), DOR and predictive value.
TABLE 14
Review C: measures of diagnostic accuracy
Two statistical measures of diagnostic accuracy are traditionally used in a clinical setting: the true positive rate or the sensitivity of the test (the proportion of those with the disease who have an abnormal test result) and the specificity of the test (the proportion of those without the disease who have a normal test result). To rule out a diagnosis a test must have high sensitivity whereas to confirm a diagnosis a test must have high specificity.69,73,80 Both measures are susceptible to spectrum bias76,86 but are not directly influenced by prevalence.76
The predictive value is the probability of the test correctly diagnosing patients. The PPV is the proportion of patients with a positive test result who are correctly diagnosed. Conversely, the negative predictive value (NPV) is the proportion of patients with a negative test result who are correctly diagnosed. Predictive values depend on the prevalence of the condition in the population being tested. When prevalence is high, it is more likely that a positive test result is correct and a negative result is wrong.86,87
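The dependence of predictive values on prevalence can be illustrated with a minimal sketch. The function name and the sensitivity, specificity and prevalence values below are hypothetical and are not taken from any study in this review.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence.

    All three arguments are proportions between 0 and 1.
    """
    tp = sensitivity * prevalence              # true positives per person tested
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    return tp / (tp + fp), tn / (tn + fn)


# Same hypothetical test (90% sensitivity, 90% specificity) at three prevalences
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, the PPV rises and the NPV falls as prevalence increases.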
Likelihood ratios describe the performance of diagnostic tests and can be useful in a clinical setting. The ratio describes whether or not a test result usefully changes the probability that a condition exists. The positive likelihood ratio (LR+) is the probability of a person who has the disease testing positive divided by the probability of a person who does not have the disease testing positive. The negative likelihood ratio (LR–) is the corresponding ratio for a negative result: the probability of a person who has the disease testing negative divided by the probability of a person who does not have the disease testing negative. An LR+ of > 10 and an LR– of < 0.1 are judged to provide convincing diagnostic evidence.88 Their interpretation, however, depends on the clinical context.87
The DOR is a summary measure of the diagnostic accuracy of a diagnostic test. It is calculated as the odds of positivity among diseased persons divided by the odds of positivity among non-diseased persons. When a test provides no diagnostic evidence then the DOR is 1.0.89 This measure has a number of limitations. In particular, it combines sensitivity and specificity into a single value, hence losing the relative values of the two, and is difficult to interpret clinically.87
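As an illustration, the measures described above can all be derived from the four cells of a 2 × 2 contingency table. The sketch below uses the standard definitions. The class name, function name and counts are hypothetical, and it assumes that no cell is zero (otherwise some ratios are undefined).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a 2 x 2 table: index test result against reference standard."""
    tp: int  # test positive, condition present
    fp: int  # test positive, condition absent
    fn: int  # test negative, condition present
    tn: int  # test negative, condition absent


def accuracy_measures(t: TwoByTwo) -> dict:
    """Derive the common accuracy measures (assumes all cells are non-zero)."""
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
        # Odds of positivity in diseased / odds of positivity in non-diseased
        "dor": (t.tp * t.tn) / (t.fp * t.fn),
    }


example = TwoByTwo(tp=90, fp=20, fn=10, tn=180)  # hypothetical counts
for name, value in accuracy_measures(example).items():
    print(f"{name:12s} {value:.3f}")
```

The DOR equals LR+ divided by LR–, which shows how it collapses the sensitivity and specificity pair into a single value.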
Sensitivity and specificity are based on binary classification of test results (either positive or negative). Test measures, however, are often categorical or continuous and so a cut-off point must be defined to classify results as either positive or negative. As the threshold shifts, the sensitivity and specificity of a test will change, with an increase in one resulting in a decrease in the other. This trade-off at different thresholds can be presented graphically in a receiver operating characteristic (ROC) curve, describing the relationship between the true-positive rate (sensitivity) and the false-positive rate (1 – specificity), and can be used to identify a suitable threshold for clinical practice.69 Figure 4 displays a sample ROC curve of test performance using different threshold values from ≥ 5 to > 25.

FIGURE 4
Review C: example ROC curve.
The Q* value is the point on the ROC curve where sensitivity equals specificity. It can be used as a single indicator of overall test performance when there is no preference for maximising sensitivity (minimising false negatives) or specificity (minimising false positives), but can give misleading results if used to compare performance between tests.69,90 Overall diagnostic accuracy is summarised by the area under the curve (AUC), which ranges from 0.5 (very poor test accuracy and equivalent to chance) to 1.0.69,87 The more accurate the test, the more closely the curve approaches the top left-hand corner and the closer the AUC is to 1.0.
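A minimal sketch of how ROC points and an approximate AUC could be computed from continuous test results follows. The test values, thresholds and function names are hypothetical, and the trapezoidal rule is only one of several ways to estimate the area.

```python
def roc_points(diseased, healthy, thresholds):
    """Return one (false-positive rate, true-positive rate) pair per threshold.

    A result greater than or equal to the threshold counts as test-positive.
    """
    points = []
    for t in thresholds:
        tpr = sum(x >= t for x in diseased) / len(diseased)  # sensitivity
        fpr = sum(x >= t for x in healthy) / len(healthy)    # 1 - specificity
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points):
    """Approximate the area under the ROC curve with the trapezoidal rule."""
    # Anchor the curve at (0, 0) and (1, 1), then order points left to right
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Hypothetical test results for people with and without the target condition
diseased = [12, 18, 22, 25, 27, 30]
healthy = [3, 6, 8, 11, 14, 20]

points = roc_points(diseased, healthy, thresholds=[5, 10, 15, 20, 25])
for fpr, tpr in points:
    print(f"1 - specificity={fpr:.2f}  sensitivity={tpr:.2f}")
print(f"AUC (trapezoidal) = {auc_trapezoid(points):.3f}")
```

Lowering the threshold moves along the curve towards the top right: sensitivity rises while specificity falls.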
Whiting et al.87 have undertaken an overview of the various types of graphical presentations that have been used in the DTA literature and describe other graphical displays that could be used to present DTA data. These include dot plots, box-and-whisker plots and flow charts (Figure 5).

FIGURE 5
Review C: example graphical displays for primary study data. (a) Dot plot; (b) box-and-whisker plot; (c) flow chart.
Dot plots are used for test results that take many values and display the distribution of results in patients with and without the target condition but do not directly display diagnostic performance. Box-and-whisker plots summarise the distributions of true-positive and true-negative groups by a continuous measure. Flow diagrams depict the flow of patients through the study, for example how many patients were eligible, how many entered the study, how many of these had the target condition and the numbers testing positive and negative.
Reporting of test accuracy results
The Standards for the Reporting of Diagnostic Accuracy Studies (STARD) statement68 provides guidance on how DTA studies should be reported to provide transparency and allow the reader to assess the validity of a study. Full details on participants, method of recruitment, reference and index tests, statistical methods and results are required. Several predominantly small reviews of between 16 and 243 studies91–98 have looked at the reporting of DTA studies and found poor description of the methods used. Studies either lacked completeness of reporting, with < 50% of studies reporting over half of the STARD items,95,96 or lacked clarity, hence making assessment difficult.97 These reviews concluded that the STARD statement seems to have resulted in little improvement in study reporting. Most of these reviews, however, included studies that were published prior to or soon after the STARD statement was published91,92,98,99 and so it may be the case that insufficient time had elapsed to make a valid assessment.
Guidance documents provide few recommendations about which DTA measures should be reported. The choice of accuracy measures presented depends on the aims of a particular study and on who is likely to use the information. For example, LRs may be more useful in a clinical setting as they can be used to calculate the probability of disease for individual patients, whereas DORs are difficult to interpret clinically. US,75 Australian76 and UK69 guidance suggests that the 2 × 2 contingency table together with sensitivity and specificity pairs and LR pairs should be presented, along with 95% confidence intervals.75,76 The US Food and Drug Administration (FDA) also recommends that measures are reported both as fractions and as percentages.75
There is some information about measures reported in the literature.100 In a review of 90 DTA reviews,101 sensitivity or specificity was the most common measure used to report the results of primary studies (in 72% of reviews); predictive values were included in 28% of reviews; and LRs were included in 22% of reviews. In reviewing the reporting of DTA measures in primary studies, two studies have noted that sensitivity and specificity were reported in most studies, with ROC curves reported in less than half of the studies.95,96
There is some evidence that studies rarely present diagnostic information graphically.87,91,102 In a review of 57 primary studies,99 57% used graphical displays to present results. Dot plots or box-and-whisker plots were the most commonly used graphs in the primary studies (in 39% of studies) whereas ROC curves were displayed in 26% of studies.
Methods to compare and synthesise diagnostic test accuracy performance from primary studies
Several HTA organisations, in guidance for undertaking DTA evidence synthesis,69,76,90,103,104 recommend using the QUADAS tool or a modified version to assess the methodological quality of primary studies. Undertaking a formal assessment provides an indication of the degree to which the included studies are prone to bias100,102,105,106 and hence the reliability of the study results. A report from the Agency for Healthcare Research and Quality (AHRQ)100 found that there had been a trend in recent years for an increasing number of DTA reviews to formally assess study quality.
Several organisations have developed guidance on carrying out systematic reviews of DTA studies69,75,76,90,102–104 and agree that analysis is more complex than for clinical effectiveness. Combining results from individual studies can be problematic because of the methodological variability (heterogeneity) found across the studies. In particular, combining test accuracy studies with heterogeneity can produce biased, and hence inaccurate, results.74,79,104,107,108
It is recognised that variability among studies is to be expected. Some of the variability is due to chance, because many diagnostic studies have small sample sizes. The remaining heterogeneity may be the result of differences in study populations or differences in study methods or the result of variation in the diagnostic threshold adopted.74 Several methods have been described to measure heterogeneity, using graphical plots and statistical tests.36,76,109 Although it is recommended that such a thorough investigation be undertaken prior to meta-analysis,69,75,76,86,90,100,102–104 this is often not carried out. In a review of 189 systematic reviews,109 only 32% investigated heterogeneity and the authors concluded that this underuse reflected uncertainty about the correct approach to adopt.
It is recommended that only studies using the same reference standard, including substantially similar patients and showing minimal heterogeneity should be synthesised by meta-analysis.69,74,76,90,104 When this type of complex analysis is undertaken it has been recommended that reviewers should enlist the specialist support of an experienced statistician in the field.36,69,109 When it is not suitable to undertake meta-analysis a narrative approach should be adopted using graphical presentations, such as forest plots and ROC space plots,69 to provide a visual overview of the results from the included studies.
Paired forest plots (Figure 6) can show the spread of estimated values for sensitivity and specificity for each study. Point estimates are shown as dots or squares and can be sized according to the precision of the estimate or sample size. Confidence intervals around the estimate are shown by horizontal lines either side of the point estimate. If meta-analysis is then undertaken, the pooled estimate is displayed as a diamond. ROC space plots (Figure 7) present the relationship between sensitivity and specificity, with each point representing the summary performance for each study.69

FIGURE 6
Review C: example of a paired forest plot. FN, false negative; FP, false positive; TN, true negative; TP, true positive.

FIGURE 7
Review C: example of a ROC space plot showing summary sensitivity and specificity. df, degrees of freedom.
When performance measures are pooled, separate meta-analyses of sensitivity and specificity data are both the simplest and the most useful approach.69,104 Such an approach, however, assumes that all included studies are using the same threshold value. Summary ROC (SROC) curves are a form of meta-analysis in which the result is a ROC curve with each data point representing the paired estimate of sensitivity and 1 – specificity from the separate studies (Figure 8). Hierarchical and bivariate statistical models have been developed to estimate the SROC curve.110,111 The SROC curve is a useful presentation when a threshold effect is observed. The curve provides a global summary of test accuracy and, as with a ROC curve, shows the trade-off between sensitivity and specificity at different threshold levels. It does not, however, provide a single statistic of overall test performance104 and a review has indicated slow uptake of these newer methods.112

FIGURE 8
Review C: example of a paired SROC curve, comparing the accuracy of test 1 with that of test 2.
Other graphical methods that can be used to present data in a way that is useful in a clinical context have been suggested.87 The two main methods are LR nomograms and the probability-modifying plot. These graphs enable the clinician to estimate the post-test probability of a patient having the disease, based on their pretest probability, when the LRs of tests are known.
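The calculation that an LR nomogram performs graphically can be sketched as follows: the pretest probability is converted to odds, multiplied by the LR and converted back to a probability. The function name and example values are hypothetical.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Post-test probability of disease, given pretest probability and an LR."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


# Hypothetical patient with a 20% pretest probability
print(f"after a positive result (LR+ = 10): {post_test_probability(0.20, 10):.3f}")
print(f"after a negative result (LR- = 0.1): {post_test_probability(0.20, 0.1):.3f}")
```

An LR of 1 leaves the probability unchanged, which is why such a test result provides no diagnostic information.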
Whiting et al.87 reviewed the graphical presentation of diagnostic information in 49 systematic reviews. Just over half (53%) of the reviews used graphical displays to present the results. ROC plots were the most common type of graph and were included in 22 reviews (45%), whereas forest plots were used in 10 reviews (20%) to display individual study results. In another review of DTA reviews, Honest and Khan101 found that, when meta-analysis had been undertaken, pooled sensitivity or specificity was reported in 35 out of 60 (58%) reviews, pooled predictive values in 11 out of 60 (18%) reviews, pooled LRs in 13 out of 60 (22%) reviews and pooled DORs in five out of 60 (8%) reviews. SROC plots were reported in 44 out of 60 (73%) of the meta-analyses. Dinnes et al.109 noted that, out of 189 systematic reviews included in their review, 30% had involved narrative analysis and, when meta-analysis had been undertaken, 52% statistically pooled data, 18% reported SROC plots and a further 30% employed both techniques.
Summary
- Diagnostic test accuracy studies should be carried out on a sample of patients who are representative of the target population, particularly in terms of disease state, and should use an appropriate reference standard with interpreter blinding to previous test results.
- Sensitivity (true-positive rate) and specificity (true-negative rate) are the most commonly reported outcomes and are subject to spectrum bias.
- Predictive values, used to calculate the probability of a test giving a correct result, are influenced by the disease prevalence in the population.
- LRs are useful in a clinical setting to determine the probability of a patient having the target disease.
- DORs provide a summary measure combining sensitivity and specificity but are difficult to interpret clinically.
- ROC curves present sensitivity and specificity pairs at different test thresholds, whereas the AUC gives an overall value of DTA.
- International HTA organisations that have addressed the issue recommend that DTA studies should present 2 × 2 contingency tables, sensitivity and specificity pairs and LR pairs.
- Several types of graphical presentations can be used to display DTA data but these have not been used extensively in the DTA literature.
- In undertaking systematic reviews of DTA studies, heterogeneity between studies is a common feature and should be investigated before combining data in a meta-analysis.
- A narrative approach, presenting forest plots and ROC space plots, is recommended when heterogeneity exists.
- Poor quality in relation to methodology and reporting affects the inferences that can be drawn from DTA studies.
Applicability to research in search filter performance
Diagnostic test accuracy and search filter studies share similar characteristics in that both evaluate the performance of an index test (or search filter) against that of a reference standard in the same sample of patients (or records). In the clinical literature, the reference standard should be the best available method to identify the ‘target condition’. In the search filter literature the reference standard usually refers not to the method per se but rather the set of relevant records that the method has been designed to identify.51,76 Typically, the reference standard is described as the records obtained by hand-searching a set of journals over a specified time period (i.e. the ‘positive’ records in the sample to be tested) rather than describing the reference standard as the method used (i.e. ‘hand-searching’). Other reference standards used, such as the records of included studies from systematic reviews or studies held in a specialised register, again conflate the method and the sample. In these cases, the method used is implicit: searching and screening to identify relevant studies. Although the terminology is different, the principle is the same: the results of applying the index test or filter to a sample are compared with the results of a method that is considered to be robust.
Methods for conducting a search filter performance study
Guidance on measuring DTA performance emphasises the importance of using a sample of patients who are representative of the intended population, particularly in relation to the target condition, otherwise the study may be subject to spectrum bias. Likewise, when measuring search filter performance of a filter intended for a particular bibliographic database, the set of records on which the filter is tested should be representative of that database.
When hand-searching is undertaken, the selection of journals used should be representative of the journals that are indexed in the bibliographic database for which the filter is intended. In terms of subject/clinical focus this can be problematic because hand-searching is labour intensive and so the requirement to include a representative selection of journals has to be balanced against the need to obtain a sufficient yield of articles efficiently by using specialist high-yield journals. For example, when testing or developing a DTA study filter, hand-searching radiology journals may be an efficient way to provide a good yield of DTA studies but these will not be representative of health-care journals in general. The underlying prevalence in the test sample is likely to be much higher than for the whole database and will result in overestimation of the internal precision of the resulting filter. Other factors to consider in selecting journals might include language (including UK/US variations), impact factors and the inclusion of abstracts in the database records.
Using included studies from reviews or a study register such as CENTRAL is likely to provide a wider range of publication sources. The original search strategies used in the reviews should be sensitive and ideally not include methodological search filters so that bias is not introduced by limitations in the searches. However, the inclusion criteria used to select the studies for the reviews or registers may also introduce bias. For example, the reviews may include only large RCTs so the reference standard under-reports all RCTs on the review topic retrieved by the subject search. This will impact on the measurement of the performance of the search filter, particularly in terms of reducing precision. Reduction in the NNR, which calculates a reduction in the number of records to be screened, may be a more appropriate parameter in these circumstances.
As bibliographic databases have changed over time in terms of both content and indexing vocabulary, the publication span for hand-searched journals and included studies also deserves attention to ensure representative coverage.
The DTA literature mentions sample size as another important issue, although the literature suggests that this is seldom formally reported. This is also the case for search filter performance literature. The performance measures calculated for the test sample are an estimate of the population value and uncertainty around these performance measures (as demonstrated by the confidence intervals) decreases with an increase in the sample size.
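The effect of sample size on uncertainty can be sketched by computing a confidence interval for the same observed sensitivity at two sample sizes. The Wilson score interval used here is an assumption made for illustration (it is one common method for a proportion, not one prescribed by the guidance reviewed), and the counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Same observed sensitivity (90%) in a small and a large reference set
for retrieved, relevant in ((63, 70), (630, 700)):
    low, high = wilson_interval(retrieved, relevant)
    print(f"n={relevant:4d}  sensitivity=0.90  95% CI {low:.3f} to {high:.3f}")
```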
Tables have been published to assist in sample size calculations for DTA studies and would be appropriate to use for search filter studies.83 An example is shown in Table 15.
When the prevalence of relevant records across the results set is expected to be < 0.50 (which would be the case in search filter design studies), the following steps can be followed to calculate the sample size:
Reference set:
- for example, based on the assumption that the expected sensitivity of the filter will be 90% (see Table 15, seventh row) and
- if we specify that the minimal acceptable lower confidence limit is, for example, 0.75 (see Table 15, sixth column)
- then the minimal sample size for the reference set (Ncases) is read from the table as 70 records.
Results set:
- the minimum results set is calculated from the equation (Ncases) + Ncontrols, where Ncontrols = Ncases[(1 – prevalence)/prevalence]
- if we assume that the expected prevalence of relevant records is 5% of the hand-search or search results then the results set is calculated as 70 + 70[(1 – 0.05)/0.05] = 70 + 1330 = 1400 records
A lower assumed prevalence would increase the size of the required results set. For example, for a 1% assumed prevalence, the results set should be 7000 records (70 + 70[(1 – 0.01)/0.01] = 70 + 6930).
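The arithmetic above can be sketched as a small function. The function name is hypothetical. Ncases is taken as an input because it is read from the published table rather than calculated here.

```python
def results_set_size(n_cases, prevalence):
    """Minimum results set: Ncases + Ncases * (1 - prevalence) / prevalence."""
    n_controls = n_cases * (1 - prevalence) / prevalence
    # round() guards against floating-point noise in the division
    return round(n_cases + n_controls)


for prevalence in (0.05, 0.01):
    total = results_set_size(70, prevalence)
    print(f"Ncases=70  prevalence={prevalence:.0%}  results set={total}")
```

The expression simplifies to Ncases divided by prevalence, so halving the assumed prevalence doubles the number of records that must be assessed.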
Other main sources of bias mentioned in the DTA literature relate to the suitability of the reference standard (appropriate to the target condition and independent from the index test) and to the methods used in carrying out the test (interpreter blinding and standard interpretation of the results). In terms of search filter testing, there are factors that might affect the independence between the index test and the reference test. For example, when screening journal abstracts, hand-searchers should be unaware of the indexed terms used in the corresponding database records and, when the included studies in a review are used as the reference set, the original search strategy terms should not include any of the search terms being tested. Ideally, the review’s search strategies should have no methodological terms.
Irrespective of how the reference standard is obtained, methods should be standardised to help limit variability. When multiple hand-searchers are involved in creating the reference standard, they should work to the same inclusion and exclusion criteria, which match the study type(s) that the test filter is intended to retrieve, and reviewers’ reliability should be formally assessed before commencement.
Checklists similar to the QUADAS tool85 and the STARD statement,68 but designed for search filter studies, would enable a formal assessment of study quality and might assist search filter researchers to adopt a more consistent and high-quality methodology. Examples of checklists for search filter studies have been reported,3,4,51 with only that of Bak et al.3 including a scoring system.
Search filter performance measures
In DTA performance measurement, sensitivity and specificity are the most commonly reported values and are judged to be essential by most guidance. Other measures that tend to be reported are PPVs and NPVs, LRs and DORs. For search filter performance, sensitivity (or recall) is almost universally reported, with specificity and precision (equivalent to PPV) the next most frequently reported measures (see reviews A and B).
Specificity and precision (or PPV) both reflect the number of false positives; the former is measured in relation to the total number of negatives whereas the latter relates to the number selected by the filter or test. In situations in which data are highly skewed, as is the case with literature retrieval, when typically a very small fraction of records in a bibliographic database are relevant (positive), precision rather than specificity better captures changes in the number of false positives. This is because the number of false positives is being compared with a relatively small number of true positives rather than the much larger number of true negatives.113
This phenomenon is illustrated by the precision and specificity of the three filters shown in Table 16. Filter A has 83% sensitivity, 25% precision and 92% specificity. For filters B and C, the number of relevant records retrieved is the same and so sensitivity is maintained at 83%. The number of retrieved irrelevant records, however, varies. For filter B, the number has more than doubled from 750 to 1750 and consequently precision has been halved to 12.5% whereas specificity has been reduced from 92% to 82%, a reduction of only 11%. A large increase in the number of irrelevant records retrieved has led to a substantial change in precision but a relatively small change in specificity. For filter C, the number of retrieved irrelevant records has increased almost seven-fold, resulting in specificity being reduced by half to 46%. The resulting change in precision of approximately 80%, from 25% to 4.6%, again better reflects the huge increase in number of irrelevant records being retrieved.
TABLE 16
Review C: precision and specificity illustration
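The contrast between precision and specificity can be reproduced with a short sketch. The counts below are assumptions chosen to approximate the percentages quoted in the text; they are not copied from Table 16. Only the 750 and 1750 irrelevant records for filters A and B appear in the text; the 250 relevant records retrieved, 50 missed, 9700 irrelevant records in total and 5200 irrelevant records for filter C are illustrative.

```python
RELEVANT_RETRIEVED = 250   # assumed true positives (same for all three filters)
RELEVANT_MISSED = 50       # assumed false negatives
TOTAL_IRRELEVANT = 9700    # assumed number of irrelevant records in the sample

# Irrelevant records retrieved (false positives); the value for C is assumed
false_positives = {"A": 750, "B": 1750, "C": 5200}


def filter_performance(fp):
    """Return (sensitivity, precision, specificity) for a given FP count."""
    sensitivity = RELEVANT_RETRIEVED / (RELEVANT_RETRIEVED + RELEVANT_MISSED)
    precision = RELEVANT_RETRIEVED / (RELEVANT_RETRIEVED + fp)
    specificity = (TOTAL_IRRELEVANT - fp) / TOTAL_IRRELEVANT
    return sensitivity, precision, specificity


for name, fp in false_positives.items():
    sens, prec, spec = filter_performance(fp)
    print(f"Filter {name}: sensitivity={sens:.0%}  "
          f"precision={prec:.1%}  specificity={spec:.0%}")
```

Because the irrelevant records retrieved are set against only 250 relevant records for precision but against 9700 irrelevant records for specificity, precision responds far more sharply to the same increase.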
In the context of evidence synthesis, a searcher’s primary interest is to know how many relevant records have been missed by the search as well as how many retrieved records are irrelevant but will still require to be screened. These factors affect how efficiently and accurately data gathering for evidence synthesis will be carried out. Sensitivity and precision are therefore of most interest. A busy clinician, however, may prefer to retrieve a small set of records in which a high proportion are relevant, and so high precision is very important whereas sensitivity is less important. Knowing the proportion of irrelevant records in a bibliographic database that have not been retrieved, as measured by specificity, is of lesser value.
Likelihood ratios, although useful in a clinical situation for indicating a patient’s probability of truly having the target condition, are probably of less use in literature searching because searchers are less interested in individual records. The DOR, sometimes referred to as ‘accuracy’, is a single indicator of diagnostic performance and has occasionally been calculated in search filter literature. As with a clinical situation, however, it provides a summary measure and hence does not provide as much useful information on performance as other measures.
Presentation of results
In search filter performance studies, tabular presentation of the results is the norm. DTA study guidance suggests several different graphical presentations that can be used, although they seem to be underused in the DTA literature.
In clinical situations, test measurements are frequently continuous in nature and so thresholds are set to define positive and negative results. The trade-off between sensitivity and specificity at different thresholds is often graphically presented in a ROC plot. This situation does not occur in standard literature searching: a search filter produces a binary result, either selected or not. At the filter development stage, however, a ROC plot could be a useful way to display the performance characteristics of variations in a filter, showing the change that results from the inclusion or exclusion of particular search terms.
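A minimal sketch of how filter variants could be placed in ROC space at the development stage follows. The records, relevance judgements, search terms and variant names are all invented for illustration, and matching is a simple substring test rather than a real database search.

```python
# Toy reference standard: (record title, judged relevant?)
records = [
    ("sensitivity and specificity of ultrasound", True),
    ("diagnostic accuracy of a rapid test", True),
    ("predictive value of troponin", True),
    ("randomised trial of aspirin", False),
    ("diagnostic imaging workload survey", False),
    ("cohort study of diet", False),
    ("qualitative study of patient views", False),
    ("accuracy of self-reported weight", False),
]

# Hypothetical filter variants: a record is retrieved if any term matches
variants = {
    "narrow": ["specificity"],
    "medium": ["specificity", "accuracy"],
    "broad": ["specificity", "accuracy", "diagnostic", "predictive"],
}


def roc_space_point(terms):
    """Return (1 - specificity, sensitivity) for one filter variant."""
    tp = fp = fn = tn = 0
    for title, relevant in records:
        retrieved = any(term in title for term in terms)
        if relevant:
            tp += retrieved
            fn += not retrieved
        else:
            fp += retrieved
            tn += not retrieved
    return fp / (fp + tn), tp / (tp + fn)


for name, terms in variants.items():
    fpr, sens = roc_space_point(terms)
    print(f"{name:7s} 1 - specificity={fpr:.2f}  sensitivity={sens:.2f}")
```

Each variant yields one point. Plotting the points shows what each added term gains in sensitivity and costs in specificity.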
Other graphical presentations that have been used in the DTA literature include dot plots, box-and-whisker plots and flow diagrams. Dot plots and box-and-whisker plots are suited to tests that can take a range of values and so, again, would not be applicable to search filter performance. A flow diagram, however, could be considered as a method for presenting search filter performance.
Comparing the results of search filters
Systematic reviews of the DTA literature are complex, largely because of the variability (heterogeneity) between studies in terms of the reference standards that have been used and the populations that have been tested. When heterogeneity exists, meta-analysis is not recommended and a narrative approach is advised using graphical presentations such as forest plots and ROC space plots.
In the search filter literature, a variety of approaches have been adopted to test search filters using different search interfaces and so heterogeneity is likely to be present between filters. There have been few systematic reviews undertaken in the search filter literature and these have tended to adopt a different approach from that taken in the DTA literature. Although DTA reviews frequently compare studies that have evaluated the performance of one index test against the performance of the same reference standard but in different samples, search filter reviews published to date compare several search filters using both the same reference standard and sample (review B). In this situation, synthesising the results is not applicable; rather, we can directly compare performance between filters. These reviews have tended to display the results only in tabular form but ROC space plots or paired forest plots would be highly appropriate for displaying these comparisons. Displaying the results using graphs may convey them more effectively and assist users to choose between filters.
Conclusions
Guidance on conducting and analysing the results of DTA studies is applicable to several aspects of search filter research. The identification of a representative sample of records, of sufficient size and using a standardised approach will assist in producing robust and generalisable results. Although appropriate performance measurements are generally reported, the greater use of some graphical presentations may facilitate the dissemination and interpretation of results.
How do searchers choose search filters? (review D)
Objectives
The objective of this review was to identify any published research into how searchers (information specialists, librarians, researchers and clinicians) choose search filters based on the information presented to them.
Methods
Studies were eligible for inclusion if they reported criteria or methods that searchers used to choose filters, for example:
- the characteristics of the filter, such as how the filter was designed, what performance measurements were used and the currency of the filter
- whether or not searchers asked for advice from others on the choice of filters, including colleagues, recognised experts in the field (such as members of the ISSG or the McMaster Hedges project team) or other professional networks
- where searchers found the filter; for example, did they choose the filter because they found it in a source they regarded as ‘reputable’ (such as MEDLINE/PubMed or the ISSG Search Filters Resource) or in published guidance documents [such as those produced by the Centre for Reviews and Dissemination (CRD)69 or Cochrane114].
Studies were excluded if they were not specifically about search filter choice or were in languages other than English. Studies from any discipline were eligible.
Although there is a large volume of literature on resource selection, this is not directly applicable to this very specific type of tool selection. At the protocol stage we decided against searching for generic literature about resource selection ‘choices’ as this was likely to retrieve a large number of records with little or no direct relevance to the review question.
To identify relevant studies we searched databases in a number of disciplines including information science and health care. Table 17 summarises the database and other resources searched to identify relevant studies.
TABLE 17
Review D: databases and other resources searched
The search strategy consisted of subject indexing (e.g. MeSH, Emtree) and free-text terms (in the title and abstract). It included search terms for ‘searchers/information specialists’ in combination with terms for ‘choice/decision’ and terms for ‘methodological search filters’. No date or language limits were applied to the search. Full search strategies are listed in Appendix 4. Records were downloaded from databases and then imported into EndNote X5 bibliographic software (Thomson Reuters, CA, USA), which allowed categorisation and coding, as well as streamlining of the production of draft and final reports. Duplicate records were then removed.
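The removal of duplicate records retrieved from several databases can be sketched as follows. This is a simplified illustration only, assuming that records are dictionaries with hypothetical 'title' and 'year' fields and that two records are duplicates when their normalised title and year match. It is not the algorithm used by EndNote, and real records usually need further checks (e.g. on authors, journal and pagination).

```python
import re


def dedup_key(record: dict) -> tuple:
    """Build a matching key from a normalised title and the publication year."""
    # Lower-case the title and collapse punctuation and spacing differences.
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return (title, record.get("year"))


def remove_duplicates(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each record, preserving the input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Invented records, as might be exported from two different databases.
records = [
    {"title": "Search filters: a survey of awareness and use.", "year": 2004, "source": "database 1"},
    {"title": "Search Filters - A Survey of Awareness and Use", "year": 2004, "source": "database 2"},
    {"title": "Developing a filter for diagnostic studies", "year": 2009, "source": "database 2"},
]

unique_records = remove_duplicates(records)
print(len(records), "records before deduplication;", len(unique_records), "after")
```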
The titles and abstracts of the records identified in the searches were assessed for relevance. The intention was to select those studies reporting how searchers make choices about search filters. Studies not specifically about search filter choice and studies in languages other than English were excluded.
We acknowledge that there has been a regrettable delay between carrying out the project, including the searches, and the publication of this report, because of serious illness of the principal investigator. The searches were carried out in 2010/11.
Results
In total, 2266 records were identified by the searches. Table 18 shows the numbers of records by resource identified from the searches.
TABLE 18
Review D: numbers of records identified from various resources
After the removal of duplicates, 837 records remained for assessment. The titles and abstracts of these 837 records were assessed for relevance and no records met the inclusion criteria (Figure 9).
Discussion
The search strategy used search terms relevant to systematic review methods (‘search strategy’, ‘search filter’, ‘information specialist’, ‘choice/decision’) and as a result a high proportion of the records identified were systematic reviews, which typically report search strategies in their abstracts. In total, 48% (402/837) of the records assessed were Cochrane reviews, which report their methods in detail and whose abstracts tend to include search terms similar to those used in this search strategy. Many other non-Cochrane reviews were also identified for the same reason. This also explains the high number of duplicate records retrieved as Cochrane reviews were identified across most of the databases searched.
Studies about the creation, testing, evaluation and awareness of search filters were also identified because of the similarity of the search terms used in the strategy and those used in the bibliographic records. Other studies looked at search techniques for identifying study populations by age or sex; investigated the differences between databases and database interfaces; and discussed the growing importance of searching via the internet. In addition, a significant number of records were completely irrelevant, such as those about searching bioinformatics (genes, proteins) databases.
However, we did not identify any studies that had explored how searchers select search filters. The absence of studies was not unexpected, despite the fact that our searches were relatively sensitive and were undertaken across a wide range of resources (including databases covering health care and information science as well as HTA organisation websites).
It was decided when developing the protocol that, given the resources available for this project, it would not be possible to undertake broader searches to identify research about how searchers or information specialists (including librarians) make choices about the resources/tools they use. It was felt that this literature would be very large as it would include library stock selection, database selection and other situations in which informed choice is required. It may be that this literature could suggest how information seekers choose between tools. The literature would not be specific, however, to the choice of search filters and might be qualitatively different as many stock selection decisions may be governed by factors such as cost and subject coverage rather than sensitivity and precision.
There is literature about the development and quality of search filters, as well as research comparing published filters, but we did not identify any studies reporting the use and choice of filters by searchers in practice. A survey about the awareness of search filters among searchers was published in 2004 and, although awareness of filters was relatively high at that time, usage was still low.5 Since that questionnaire was undertaken, the promotion of search filters through the ISSG Search Filters Resource, through training courses conducted in the UK, the USA and elsewhere and through the increasing numbers of published filters may have increased awareness and usage by searchers. We have not identified any current published evidence, however, to support this. Investigations of how searchers are choosing filters seem not to have been published.
How do clinicians choose between diagnostic tests? (review E)
Introduction
Database searchers have access to a range of methodological search filters that have been designed to retrieve records relating to studies that employ a particular research design. It is unclear, however, what factors influence the choice of an appropriate filter. As search filters can be viewed as analogous to diagnostic tests (as outlined above), it is hypothesised that the factors that lead clinicians to choose between diagnostic tests or health-care organisations to choose between screening tests might offer insights into how searchers do, or might in the future be encouraged to, make choices about search filters.
Objective
To identify and summarise evidence, in a narrative review, on factors that influence clinicians’ choice between diagnostic tests.
Methods
Evidence for this review was obtained from literature searches of the major health-care databases and consultation of national screening programme websites. MEDLINE, MEDLINE In-Process & Other Non-Indexed Citations and EMBASE were searched in March 2011 and CINAHL, PsycINFO and Applied Social Sciences Index and Abstracts (ASSIA) were searched in June 2011. The search strategies that were used are reproduced in Appendix 5. No date restrictions were applied but a pragmatic decision was taken to search only for English-language publications. Reference lists of relevant studies were scrutinised and citation searching of key articles was undertaken in Scopus and ISI Web of Knowledge. Results were downloaded into Reference Manager 12 (Thomson ResearchSoft, San Francisco, CA, USA). Titles and abstracts were screened and full-text copies of all studies deemed to be potentially relevant were obtained and assessed for inclusion by one researcher.
We acknowledge that there has been a regrettable delay between carrying out the project, including the searches, and the publication of this report, because of serious illness of the principal investigator. The searches were carried out in 2010/11.
Inclusion criteria
- Studies that report how clinicians choose between diagnostic tests and what factors influence their decisions.
- Screening programmes that provide criteria for the selection of screening tests.
Exclusion criteria
- Studies that report on any factors influencing test ordering decision behaviour without reference to test choice.
- Studies that consider the decision whether or not to order one particular test.
- Studies that report interventions designed to influence test ordering behaviour.
- Studies written in languages other than English.
Data extraction
For studies meeting our criteria, the following information was collected:
- research method(s) used to elicit data
- clinical discipline of participants and setting
- clinical condition or disease and diagnostic tests from among which clinicians made their choice
- factors implicated in clinicians’ choice.
Results
The electronic searches retrieved 1559 records after deduplication (Figure 10). Titles and abstracts were screened and 47 records were selected for full-text assessment. Seven studies met the inclusion criteria.115–121 Table 19 provides details of the included studies. The references and citations of these seven publications generated an additional 38 articles for further checking, none of which met the inclusion criteria.

FIGURE 10
Review E: numbers of records retrieved and assessed for relevance.
TABLE 19
Review E: included studies
Studies were excluded for a variety of reasons. One-quarter (10/40) of the excluded studies considered the reasoning that underpins diagnostic decisions, mainly factors that can lead to errors and suboptimal diagnostic strategies, and one-quarter (10/40) surveyed the use of a range of tests for different conditions. Six articles examined factors that influence the diagnostic process or adopted strategy, characterised by a stepwise series of hypothesis testing using information from a variety of sources and series of tests. These included symptoms elicited from patients, patient and physician characteristics and structural issues.
Other reasons for exclusions were examination of patient choice or compliance (n = 4), use of interventions designed to influence test ordering behaviour (n = 2) and use of an economic model to assess screening strategies (n = 1). An additional two articles did examine test choice but did not elicit the reasons involved. Appendix 6 provides details of the excluded studies together with the primary reason for exclusion.
Of the seven studies that met the inclusion criteria, none was set in the UK. Four studies were set in the USA,115,116,118,120 one was set in Canada,121 one was set in Switzerland117 and one was multinational.119 Information from the clinicians was obtained by survey (n = 3),117,119,121 questionnaire (n = 2)115,118 or interview (n = 2),116,120 and the number of participants ranged from 11 in the smallest study116 to 1184 in the largest.117 Three studies looked at cancer screening tests (two for colorectal cancer),117,120,121 two at imaging tests for pulmonary embolism,115,119 one at balance assessment tests116 and one at tests to diagnose pertussis.118
Four studies mentioned high test performance as a reason in support of clinician choice. In the study by Jha et al.,115 90% of emergency physicians and 95% of radiologists who responded to a questionnaire cited test accuracy as a reason for test choice. Both Stein et al.119 and Zettler et al.121 noted that perceived test performance was a factor in decision-making whereas Sox et al.118 reported that 70% of participants who had received information on DTA performance chose the best-performing test compared with 21% of controls who had not received this information. One further study, which interviewed physiotherapists about balance assessment tests, found that the perceived value of information gathered was a deciding factor in clinician choice of test rather than the psychometric properties of the assessment tests.116
Two studies reported economic factors: the perceived cost-effectiveness of colorectal cancer screening tests121 and the perceived added benefit as set against resource use of various diagnostic tests for pulmonary embolism.119 One further study looked at the influence of equity in physician choice.117 The participants were asked to choose between one test given to the whole population and a better (in terms of lives saved) and more expensive test given to half of the population. Three-quarters (75%) opted for the universal test although the better, more expensive test was seen as being more acceptable if clinical factors determined who would receive it.
Two studies reported patient characteristics as factors influencing test choice. Stein et al.119 mentioned age and sex whereas Wackerbarth et al.120 identified family history as an influencing factor for screening at an earlier age. Patient acceptance of the proposed tests and whether or not the tests were covered by patients’ insurance were also mentioned.120
Other factors considered were clinician experience (McGinnis et al.116 reported this as the primary influence on test choice for balance assessment), mortality reduction121 and adverse events, primarily in terms of radiation exposure.119 The study by Jha et al.,115 which took place in an emergency department, found that ready access to the test and whether or not 24-hour interpretation support was available were the two most frequently reported factors after test performance.
In addition to the studies identified in the review, information on selection criteria for four screening programmes was identified (Table 20). Three of the four screening programmes that provided information were national, set in the UK,122 USA123 and Australia.124 The fourth, providing criteria for cancer screening, was produced by the World Health Organization.125 Most programmes identified high test performance in terms of sensitivity,123–125 specificity,124,125 PPV124,125 and/or NPV124,125 as important. The UK programme122 stipulates that the test should be precise and that the distribution of test values in the target population should be known and a suitable cut-off level should be defined.
TABLE 20
Review E: reports from national screening programmes
Other characteristics listed included being safe,122,124,125 being reliable,124 having been validated,122,124 being easy to administer122,124 and being acceptable to the target population.122,124,125 All of the programmes consider factors other than test performance. The effectiveness of undertaking a screening programme, in terms of morbidity and mortality reduction, should be established,122,123,125 with effective identification of disease at an early stage124 and the availability of effective treatment.123,125 The condition under investigation should be sufficiently prevalent123,125 so that a screening programme can be effective. The UK programme122 adds that there should be an agreed policy on further diagnostic investigation and disease management. Both the UK122 and the USA123 programmes mention that the perceived benefits of the screening programme should outweigh any harms resulting from screening and treatment.
Discussion
From this overview it seems that there is limited evidence to clarify how clinicians choose between diagnostic tests. What evidence there is suggests that test performance is the main factor that informs their choice. It has been reported, however, that a substantial proportion of clinicians have an inaccurate understanding of test performance parameters and apply them inaccurately126–131 and so it may be the case that choices are being based on false assumptions. Other factors mentioned in more than one study were the pretest probability of having the condition, as defined by patient characteristics, patient acceptance of the test and the costs involved in carrying out the test, which are factors that are not readily transferable to the search process. Additional attributes reported related to the particular scenario being investigated: the harmful effect of radiation when imaging tests were being considered and the need for immediate testing and interpretation in an emergency department were important criteria in two studies.
The screening programmes also valued high test performance but add that a test should have been proven to be valid and reliable. Furthermore, the screening committees set other criteria to ensure the effectiveness of public health programmes: the prevalence of the target disease or condition as well as whether or not there is effective disease management and treatment available. In a screening setting, where patients are asymptomatic, acceptability was mentioned as crucial by three of the screening programmes and the need to evaluate benefits against harms was also considered to be an important criterion.
Conclusion
From the very limited evidence available in a clinical setting, it is difficult to gain much insight into how searchers might make choices about search filters. Diagnostic test performance (perceived or known) was the most frequent factor mentioned and is the main factor that is readily applicable to search filter choice. However, it may be beneficial to provide additional explanatory information when reporting search filter performance to ensure that searchers make choices based on an accurate understanding of test performance parameters.