BMJ. Feb 23, 2002; 324(7335): 477–480.
PMCID: PMC1122397
Evidence base of clinical diagnosis

Evaluation of diagnostic procedures

J André Knottnerus, professor of general practice, Chris van Weel, professor of general practice, and Jean W M Muris, senior lecturer in general practice

Development and introduction of new diagnostic techniques have greatly accelerated over the past decades. The evaluation of diagnostic techniques, however, is less advanced than that of treatments. Unlike with drugs, there are generally no formal requirements for adoption of diagnostic tests in routine care. In spite of important contributions,1,2 the methodology of diagnostic research is poorly defined compared with study designs on treatment effectiveness, or on aetiology, so it is not surprising that methodological flaws are common in diagnostic studies.3–5 Furthermore, research funds rarely cover diagnostic research starting from symptoms or tests.

Since quality of the diagnostic process largely determines quality of care, overcoming deficiencies in standards, methodology, and funding deserves high priority. This article summarises objectives of diagnostic testing and research, methodological challenges, and options for design of studies.

Summary points

  • Development of diagnostic techniques has greatly accelerated but the methodology of diagnostic research lags far behind that for evaluating treatments
  • Objectives of diagnostic investigations include detection or exclusion of disease; contributing to management; assessment of prognosis; monitoring clinical course; and measurement of general health or fitness
  • Methodological challenges include the “gold standard” problem; spectrum and selection biases; “soft” measures (subjective phenomena); observer variability and bias; complex relations; clinical impact; sample size; and rapid progress of knowledge

Objectives of testing

Diagnostic investigations collect information to clarify patients' health status, using personal characteristics, symptoms, signs, history, physical examination, laboratory tests, and additional facilities. Objectives include the following.

  • Increasing certainty of the presence or absence of disease—This requires sufficient discriminative power. Measures of discrimination are commonly derived from a 2×2 table relating test outcome to a reference standard (figure), thus allowing tests to be compared. Tests for similar purposes may vary in accuracy, invasiveness, and risk, and, for example, history may be no less valuable than laboratory tests (table). To be useful, additional investigations should add relevant information to less invasive and cheaper tests performed earlier.
  • Supporting clinical management—For example, determining presence, localisation, and shape of arterial lesions is necessary for treatment decisions.
  • Assessing prognosis—As the starting point for clinical follow up and informing patients.
  • Monitoring clinical course—When a disease is untreated, or during or after treatment.
  • Measuring fitness—For example, for sporting activity or for employment.
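The measures of discrimination mentioned under the first objective can be derived from a 2×2 table as in the following minimal sketch. The class name `TwoByTwo` and the counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not data from this article.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from a 2x2 table of test outcome against a reference standard."""
    tp: int  # test positive, disease present
    fp: int  # test positive, disease absent
    fn: int  # test negative, disease present
    tn: int  # test negative, disease absent

    def sensitivity(self) -> float:
        # Proportion of diseased subjects with a positive test
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of non-diseased subjects with a negative test
        return self.tn / (self.tn + self.fp)

    def positive_predictive_value(self) -> float:
        # Proportion of positive tests that are correct
        return self.tp / (self.tp + self.fp)

    def negative_predictive_value(self) -> float:
        # Proportion of negative tests that are correct
        return self.tn / (self.tn + self.fn)

    def likelihood_ratio_positive(self) -> float:
        return self.sensitivity() / (1 - self.specificity())

    def likelihood_ratio_negative(self) -> float:
        return (1 - self.sensitivity()) / self.specificity()


# Hypothetical counts (assumption, for illustration only)
table = TwoByTwo(tp=80, fp=30, fn=20, tn=170)
print(f"sensitivity = {table.sensitivity():.2f}")
print(f"specificity = {table.specificity():.2f}")
print(f"PPV = {table.positive_predictive_value():.2f}")
print(f"NPV = {table.negative_predictive_value():.2f}")
print(f"LR+ = {table.likelihood_ratio_positive():.2f}")
print(f"LR- = {table.likelihood_ratio_negative():.2f}")
```

Because every measure comes from the same four counts, two tests evaluated against the same reference standard can be compared on a like for like basis.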

Tests must be evaluated in accordance with their intended objectives, also taking into consideration possible inconvenience and complications, such as intestinal perforation during endoscopy. Using and not using a test, or using alternative tests, should therefore be compared.

If a test is evaluated before introduction into routine care, using or not using it can still be freely compared to study the effect on prognosis. Early evaluation helps decisions on whether to introduce a test and on planning its postmarketing surveillance.

Methodological challenges

The “gold standard” problem

To evaluate discriminatory power (accuracy), the outcome of a test is compared with an independently established standard diagnosis. “Gold standards” providing full certainty are rare. Even biopsies can fail to do so. Generally the challenge is to find a standard as close as possible to the theoretical gold standard.

Sometimes no suitable reference standard at all is available—in determining the accuracy of liver tests, neither imaging techniques nor biopsies will detect all liver abnormalities. Moreover, invasive procedures cannot easily be made the standard in a study. An independent standard may not even conceptually exist, as for example when evaluating symptoms incorporated in the definition of a disease (as in migraine), or when the symptoms are more important than anatomical status, as with prostatism. In studying the value of physical examination to detect severe disease in non-acute abdominal pain, comprehensive screening, including invasive procedures (if ethically allowable), might yield many irrelevant findings but still fail to exclude relevant pathology. An appropriate clinical follow up—a “delayed type cross sectional study,” with a final assessment by independent experts—is then the best approach.1,9

New diagnostic tests superior to prevailing reference standards may be developed. If research into accuracy of test procedures were to consist only of comparing tests with standards, possible new standards would be ignored as they are not in agreement with prevailing standards. Up to date pathophysiological expertise is therefore required to be able to change a reference standard.

Spectrum and selection bias

Spectrum bias may occur when the study population has a different clinical spectrum (more advanced cases, for instance) than the population in whom the test is to be applied.1,10,11 If sensitivity is determined in seriously diseased subjects and specificity in clearly healthy subjects, both will be grossly overestimated relative to practical situations where diseased and healthy subjects cannot be clinically distinguished in advance.

Selection bias is likely if inclusion in a study is related to test results. As subjects with abnormal exercise electrocardiograms are more often referred for coronary angiography, calibration of this investigation among preselected subjects will show higher sensitivity and lower specificity than if there had been no preselection.

Spectrum and selection bias often occur together—for example, when tests calibrated in hospital are introduced in primary care; all measures of accuracy may then be affected.12
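The effect of preselection on apparent accuracy can be illustrated with a small worked example. This is a sketch under stated assumptions: the population counts, the true sensitivity and specificity, and the referral fractions are all hypothetical, and referral is assumed to depend only on the index test result.

```python
def accuracy(tp, fp, fn, tn):
    """Return (sensitivity, specificity) for the given 2x2 counts."""
    return tp / (tp + fn), tn / (tn + fp)


def verified_counts(tp, fp, fn, tn, refer_if_positive, refer_if_negative):
    """Expected counts among subjects who receive the reference standard,
    when the chance of referral depends only on the index test result."""
    return (tp * refer_if_positive, fp * refer_if_positive,
            fn * refer_if_negative, tn * refer_if_negative)


# Hypothetical source population: 1000 subjects, 200 with disease,
# true sensitivity 0.70 and true specificity 0.80.
full = dict(tp=140, fp=160, fn=60, tn=640)

# Assumed referral pattern: 80% of test positives and 10% of test
# negatives go on to the reference standard.
verified = verified_counts(**full, refer_if_positive=0.8, refer_if_negative=0.1)

true_sens, true_spec = accuracy(**full)
obs_sens, obs_spec = accuracy(*verified)

print(f"whole population: sensitivity {true_sens:.2f}, specificity {true_spec:.2f}")
print(f"referred only:    sensitivity {obs_sens:.2f}, specificity {obs_spec:.2f}")
```

Under these assumptions the referred subgroup shows a higher sensitivity and a lower specificity than the whole population, which is the direction of bias described for exercise electrocardiography and coronary angiography.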

“Soft” measures

Subjective phenomena such as pain and feeling unwell often evoke diagnostic and therapeutic actions and thus may themselves be “tests.” Also, they are indispensable for assessment of clinical outcome.13 Evaluation studies should measure these factors as reproducibly as possible, recognising that interindividual and intraindividual differences always have a role.

Observer variability and observer bias

Interobserver and intraobserver variability in reading and interpreting diagnostic data not only influence “soft” diagnostic aspects, but also results of “harder” investigations like x rays and biopsies. Even without human interpretation, interinstrument and intrainstrument variations occur. Variability should be limited in order to assure utility of information.
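One common way to quantify interobserver agreement beyond chance is Cohen's kappa. The article does not name a specific statistic, so the choice of kappa here is an assumption, and the readings below are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two observers rating the same cases.

    Undefined (division by zero) when expected agreement is 1, that is,
    when both observers use a single identical category for every case.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same, non-empty set of cases")
    n = len(ratings_a)
    # Observed proportion of cases on which the observers agree
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each observer's marginal frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical readings of 100 x rays by two observers
observer_a = ["abnormal"] * 50 + ["normal"] * 50
observer_b = (["abnormal"] * 40 + ["normal"] * 10
              + ["abnormal"] * 5 + ["normal"] * 45)

print(f"kappa = {cohen_kappa(observer_a, observer_b):.2f}")
```

In this example the observers agree on 85 of 100 films, but half of that agreement would be expected by chance alone, so the chance-corrected figure is lower than the raw agreement.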

Prior knowledge may evoke observer bias. If doctors' accuracy in diagnosing ankle fractures on the basis of physical examination is being evaluated, they should not know the x ray results; pathologists establishing an independent diagnosis must not know the clinical conclusion already.14 Bias can also occur if, in comparing two techniques, observers are prejudiced and perform one more carefully than the other. And since, for a fair assessment, diagnostic skills should be at a similar level for each technique, new tests can be at a disadvantage shortly after being introduced.

Complex relations

Ideally evidence reflects the clinical context,15 where tests are often not applied in isolation but in combinations, as, for instance, in the context of protocols. Moreover, tests can be used to differentiate between a number of diseases, rather than just checking for one. Multivariate statistical techniques then help to evaluate the (added) value of diagnostic items separately and in combination. While analysis of data to determine aetiology generally addresses the overall impact of factors adjusted for covariables, analysis of diagnostic data focuses on the best individual prediction. Accordingly, diagnostic data analysis needs specific methodological development.16–18

Sample size

Whether sample size is adequate to provide the desired information with sufficient precision is often ignored in diagnostic studies. Progress in diagnostic performance consists of a series of small steps that gradually increase certainty rather than one big breakthrough. Evaluating small steps, however, requires large study populations.
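Why small steps need large populations can be seen from a rough precision calculation. The sketch below uses the normal approximation for a proportion, which is an assumption rather than a method prescribed by the article; the expected sensitivity, the desired half width of the 95% confidence interval, and the prevalence are hypothetical inputs.

```python
import math

Z_95 = 1.96  # standard normal quantile for a two sided 95% confidence interval


def diseased_needed(expected_sensitivity, half_width, z=Z_95):
    """Diseased subjects needed so that the confidence interval for
    sensitivity has roughly the requested half width (normal approximation)."""
    variance = expected_sensitivity * (1 - expected_sensitivity)
    return math.ceil(z ** 2 * variance / half_width ** 2)


def total_needed(expected_sensitivity, half_width, prevalence, z=Z_95):
    """Total subjects to recruit from an indicated population with the
    given prevalence in order to obtain enough diseased subjects."""
    return math.ceil(diseased_needed(expected_sensitivity, half_width, z) / prevalence)


for half_width in (0.10, 0.05):
    n_diseased = diseased_needed(0.80, half_width)
    n_total = total_needed(0.80, half_width, prevalence=0.25)
    print(f"half width {half_width:.2f}: {n_diseased} diseased, {n_total} in total")
```

Halving the acceptable half width roughly quadruples the number of subjects required, so demonstrating a modest gain in accuracy quickly demands a large study.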

Clinical impact

More accurate tests do not necessarily improve management. They may add little to what is known already, or to the results of earlier, perhaps less invasive or cheaper, investigations. Also, clinicians may not make full use of information from results. In a classic study of the value of upper gastrointestinal endoscopy, management changed in 23% of cases without a change in diagnosis, while in 30% of those with changes in diagnosis management was not altered.19 Also, tests may have no practical benefit; brain scans showing details of untreatable brain conditions would be an example. Therefore, diagnostic research should consider not only the accuracy of diagnostic tests but also their practical clinical value.

If the probability of disease is extremely low or high, the outcome of subsequent investigations rarely influences management and false positive or false negative results, respectively, are common.2 Generally, investigations are indicated when the probability of disease is somewhere between the two extremes. Evaluation studies must take place in populations with prior probabilities for which the test is particularly suitable. For example, tests with moderate specificity are inappropriate for population screening (with low probability of disease) because of the high risk of false positive results.
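The dependence of a test result's meaning on the prior probability follows from Bayes' theorem, as in this minimal sketch. The sensitivity, specificity, and prior probabilities are hypothetical values chosen for illustration.

```python
def post_test_probability(prior, sensitivity, specificity, positive=True):
    """Probability of disease after a test result, by Bayes' theorem."""
    if positive:
        rate_in_diseased, rate_in_healthy = sensitivity, 1 - specificity
    else:
        rate_in_diseased, rate_in_healthy = 1 - sensitivity, specificity
    diseased = prior * rate_in_diseased
    healthy = (1 - prior) * rate_in_healthy
    return diseased / (diseased + healthy)


# Hypothetical test with sensitivity 0.90 and specificity 0.90
for label, prior in (("population screening", 0.001), ("indicated population", 0.30)):
    ppv = post_test_probability(prior, 0.90, 0.90, positive=True)
    print(f"{label}: prior {prior:.3f} -> probability after positive test {ppv:.3f}")
```

With a prior probability of 0.001, fewer than 1 in 100 positive results would be true positives under these assumptions, whereas at a prior of 0.30 about four in five would be.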

Changes over time and the mosaic of evidence

Thorough evaluation may take longer than developing better techniques. The position of computer assisted tomography was not yet defined when magnetic resonance imaging and positron emission tomography appeared; evaluation studies can thus be outdated before they are completed. Progress is especially rapid where information technology and molecular genetics are important. Therefore, we need comprehensive scenarios with relatively stable overall frameworks into which new data are inserted like pieces of a puzzle. For example, evaluation of the impact of new imaging techniques on the effectiveness of breast screening can be based on data on the accuracy of the techniques being compared if other “mosaic” pieces are already available and unchanged. Since accuracy can often be assessed cross sectionally, lengthy new prospective studies may then be avoided.

Options in diagnostic research

Clinical studies

Methodological approaches must be relevant to the type of study objective (box). Diagnostic accuracy, that is, the relation between the test under study and the disorder as expressed in measures of discrimination (see table 1), can be assessed cross sectionally if the results of the test and the reference standard procedure are known for all subjects in the study population. Possible designs are: comparing test distributions in samples already known to have the disorder (cases) and known to be free of it (controls); comparing disease distributions in samples with already known test results; and a survey in an “indicated” population (a target population in which testing would be relevant). Case-control sampling or sampling based on test results is efficient as a phase I study1 (see also the next article in this series20), and it should be considered before any extensive study in a population where neither the distribution of a disease nor the test results are known. If tests already adopted are also applied to all study subjects, the added value of a new test can be directly estimated. Furthermore, and very importantly, the clinical diagnostic contribution of the test being evaluated can also be assessed if tests that have been performed earlier or are less invasive (for example, history or physical examination) are also included in the design. Invasiveness and possible adverse effects of the approaches being compared can then be measured.

Options in diagnostic research in relation to study objectives

Clinical studies

Objective: Diagnostic accuracy

Options:
  • Cross sectional studies
  • Case-control sampling
  • Sampling based on test results
  • Surveys in indicated population

Objective: Impact of (additional or replacing) diagnostic testing on prognosis or management

Options:
  • Randomised controlled trial
  • Cohort study
  • Case-control study
  • Before and after study

Synthesising findings and expertise

Objective: Synthesising results of multiple studies

Options:
  • Systematic review
  • Meta-analysis

Objective: Determining the most (cost) effective diagnostic strategy

Options:
  • Clinical decision analysis
  • Cost effectiveness analysis

Objective: Translating findings for practice

Options:
  • Integrating results of the above approaches
  • Expert consensus methods
  • Developing guidelines

Integrating information in clinical practice

Options:
  • ICT support studies
  • Studying diagnostic problem solving
  • Evaluation of implementation in practice

For studying the impact of a test on clinical decision making and prognosis the randomised controlled trial is the standard method. The experimental group undergoes the index test and the control group the usual test or no test. The value of the index test in addition to or as a replacement for the usual procedure, or instead of no test, can be assessed as (possible) gain in correct diagnoses, management, and prognosis. A variant is to apply the index test to all subjects but randomise disclosure of its results to the care givers, if this is ethically permissible. This constitutes an ideal placebo procedure for the patient. Studies on breast cancer screening, with a treatment protocol linked to the screening result, were classic examples of randomised controlled trials of diagnostic methods.21

If such a trial is not feasible, observational approaches can be considered. The cohort design compares the clinical outcome of previously tested and untested groups, without the diagnostic information being randomised.22 A point of concern is whether both groups have a similar clinical spectrum at baseline, especially regarding unmeasured factors. The case-control design is efficient if patient outcome among indicated subjects is already known: were fewer cases than controls tested? Examples are studies on the relation between breast cancer mortality and previous mammographic screening.23 Comparability of tested and not tested subjects at baseline is, again, important.

The impact on clinical management can also be investigated by comparing the (intended) management before and after test results are available, as was done early in the evaluation of computer assisted tomography of the brain. Such before and after comparisons have specific potentials and limitations.24

Appropriate inclusion and exclusion criteria are indispensable for focusing on the relevant clinical question, target population, clinical spectrum, and setting (primary care or a population referred to hospital, for instance).

Synthesising findings and expertise

If results from a number of studies are available, a systematic review of diagnostic methods and meta-analysis of pooled data can provide a comprehensive synthesis of present knowledge. Diagnostic accuracy can be assessed overall and for subgroups. Much effort is being invested to make systematic reviews of diagnostic methods as solid as the methodologically more established systematic reviews of treatment methods.25,26

If the diagnostic problem is well structured, and if estimates are available for accuracy and risks of testing, occurrence and prognosis of the suspected disorder, and “values” of clinical outcomes, quantitative decision analysis can identify the most effective or most cost effective strategy. A combined analysis of diagnostic and treatment aspects is essential. Often a qualitative analysis can already be very useful. For example, non-invasive techniques can nowadays detect carotid stenoses reasonably well in asymptomatic patients. This allows preselection of patients for the more invasive investigation, carotid angiography, to decide about surgical intervention; it would yield quite a complex “decision tree.” But if surgery of asymptomatic stenosis is not shown to improve prognosis,27 the decision tree is greatly simplified: it would no longer include either angiography or surgery, and maybe not even non-invasive testing.
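A quantitative decision analysis of the kind described above can be sketched as a comparison of expected utilities across three strategies: treat nobody, treat everybody, or test first and treat those with a positive result. Everything numerical here is a hypothetical assumption (the outcome “values” on a 0 to 1 scale, the test characteristics, and the burden of testing); the structure, not the numbers, is the point.

```python
def expected_utilities(prior, sensitivity, specificity, utility, test_burden):
    """Expected utility of each strategy for a given prior probability of disease."""
    u = utility
    treat_none = prior * u["diseased_untreated"] + (1 - prior) * u["healthy_untreated"]
    treat_all = prior * u["diseased_treated"] + (1 - prior) * u["healthy_treated"]
    # Test first: true positives and false positives are treated,
    # false negatives and true negatives are not; testing itself has a cost.
    test_first = (
        prior * (sensitivity * u["diseased_treated"]
                 + (1 - sensitivity) * u["diseased_untreated"])
        + (1 - prior) * (specificity * u["healthy_untreated"]
                         + (1 - specificity) * u["healthy_treated"])
        - test_burden
    )
    return {"treat none": treat_none, "treat all": treat_all, "test first": test_first}


def best_strategy(prior, sensitivity, specificity, utility, test_burden):
    """Name of the strategy with the highest expected utility."""
    scores = expected_utilities(prior, sensitivity, specificity, utility, test_burden)
    return max(scores, key=scores.get)


# Hypothetical outcome values (assumptions, not estimates from any study)
UTILITY = {
    "diseased_treated": 0.90,
    "diseased_untreated": 0.50,
    "healthy_treated": 0.95,
    "healthy_untreated": 1.00,
}

for prior in (0.01, 0.20, 0.90):
    choice = best_strategy(prior, 0.90, 0.90, UTILITY, test_burden=0.005)
    print(f"prior probability {prior:.2f}: best strategy is '{choice}'")
```

With these assumed values, testing is preferred only at the intermediate prior probability; at very low or very high priors the test result would not change the preferred management.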

Decision analysis cannot always provide an answer. Problems may be too complex to be summarised in a tree; data may be missing; and there can be disagreement over valuing outcomes. Consensus procedures are then essential to translate research into practice guidelines. Clinical experts can integrate current knowledge with experience to achieve agreement on clinical guidelines for diagnostic approaches to particular medical problems.

Integrating information in practice

To help clinical investigators harvest data from clinical databases to support clinicians in improving diagnostic decisions, innovations in information and communication technology are indispensable.28 To realise the potential of this field, specific methodological requirements apply, such as avoiding confounding by indications or contraindications.

Ensuring that information providing approaches have optimal impact on the diagnostic decision making of individual clinicians is far from simple. The growing cognitive efforts associated with diagnostic management make insight into diagnostic problem solving increasingly important.29

Clinical studies, systematic reviews, and guideline construction are all necessary but not, on their own, sufficient to improve practice. Implementation research has been developed to bridge the gap from clinical science to routine diagnostic management.

Setting formal standards

Assessment of diagnostic technologies would be greatly stimulated if formal standards for acceptance of diagnostic procedures in routine care were adopted by health authorities. Professional organisations are responsible for setting, implementing, maintaining, and improving clinical standards. International cooperation is important, as has been proved in the field of quality control of drugs. Along these lines, governmental, industrial, and societal funding for assessments of diagnostic technologies should be intensified.

Figure: Derivation of measures of discrimination
Table: Discrimination of some diagnostic tests (estimates based on several sources)


This is the first of a series of five articles


Series editor: J A Knottnerus

 Competing interests: None declared.


“The Evidence Base of Clinical Diagnosis,” edited by J A Knottnerus, can be purchased through the BMJ Bookshop (www.bmjbookshop.com)


1. Feinstein AR. Clinical epidemiology. The architecture of clinical research. Philadelphia: WB Saunders; 1985.
2. Sackett DL, Haynes RB, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. Boston: Little, Brown; 1985.
3. Sheps SB, Schechter MT. The assessment of diagnostic tests. A survey of current medical research. JAMA. 1984;252:2418–2422.
4. Reid ML, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic research. Getting better but still not good. JAMA. 1995;274:645–651.
5. Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA. 1999;282:1061–1066.
6. Panzer RJ, Black ER, Griner PF, editors. Diagnostic strategies for common medical problems. Philadelphia: American College of Physicians; 1991.
7. Stoffers HEJH, Kester ADM, Kaiser V, Rinkens PELM, Knottnerus JA. Diagnostic value of signs and symptoms associated with peripheral arterial obstructive disease seen in general practice: a multivariable approach. Med Decis Making. 1997;17:61–70.
8. Fijten GHF. Rectal bleeding, a danger signal? Amsterdam: Thesis Publishers; 1993.
9. Knottnerus JA, Dinant GJ. Medicine based evidence, a prerequisite for evidence based medicine. BMJ. 1997;315:1109–1110.
10. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med. 1978;299:926–930.
11. Begg CB. Biases in the assessment of diagnostic tests. Stat Med. 1987;6:411–423.
12. Knottnerus JA, Leffers P. The influence of referral patterns on the characteristics of diagnostic tests. J Clin Epidemiol. 1992;45:1143–1154.
13. Feinstein AR. Clinimetrics. New Haven: Yale University Press; 1987.
14. Schwartz WB, Wolfe HJ, Pauker SG. Pathology and probabilities, a new approach to interpreting and reporting biopsies. N Engl J Med. 1981;305:917–923.
15. Van Weel C, Knottnerus JA. Evidence-based interventions and comprehensive treatment. Lancet. 1999;353:916–918.
16. Spiegelhalter DJ, Crean GP, Holden R, Knill-Jones RP. Taking a calculated risk: predictive scoring systems in dyspepsia. Scand J Gastroenterol. 1987;22(suppl 128):S152–S160.
17. Knottnerus JA. Application of logistic regression to the analysis of diagnostic data. Med Decis Making. 1992;12:93–108.
18. Moons KG, Stijnen T, Michel BC, Buller HR, Van Es GA, Grobbee DE, et al. Application of treatment thresholds to diagnostic-test evaluation: an alternative to the comparison of areas under receiver operating characteristic curves. Med Decis Making. 1997;17:447–454.
19. Liechtenstein JI, Feinstein AR, Suzio KD, DeLuca V, Spiro HM. The effectiveness of panendoscopy on diagnostic and therapeutic decisions about chronic abdominal pain. J Clin Gastroenterol. 1980;2:31–36.
20. Sackett DL, Haynes RB. The architecture of clinical diagnosis. In press.
21. Shapiro S, Venet W, Strax Ph, Roeser R. Ten to fourteen year effect of screening on breast cancer mortality. J Natl Cancer Inst. 1982;69:349–355.
22. Harms LM, Schellevis FG, van Eijk JT, Donker AJ, Bouter LM. Cardiovascular morbidity and mortality among hypertensive patients in general practice: the evaluation of long-term systematic management. J Clin Epidemiol. 1997;50:779–786.
23. Verbeek ALM, Hendriks JHCL, Holland R, Mravunac M, Sturmans F, Day NE. Reduction of breast cancer mortality through mass-screening with modern mammography. Lancet. 1984;i:1222–1224.
24. Guyatt GH, Tugwell P, Feeny DH, Drummond MF, Haynes RB. The role of before-after studies of therapeutic impact in the evaluation of diagnostic technologies. J Chron Dis. 1986;39:295–304.
25. Irwig L, Macaskill P, Glasziou P, Fahey M. Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol. 1995;48:119–130.
26. Buntinx F, Brouwers M. Relation between sampling device and detection of abnormality in cervical smears: meta-analysis of randomised and quasi-randomised studies. BMJ. 1996;313:1285–1290.
27. Benavente O, Moher D, Pham B. Carotid endarterectomy for asymptomatic carotid stenosis: a meta-analysis. BMJ. 1998;317:1477–1480.
28. Van Wijk MA, van der Lei J, Mosseveld M, Bohnen AM, van Bemmel JH. Assessment of decision support for blood test ordering in primary care. A randomized trial. Ann Intern Med. 2001;134:274–281.
29. Elstein AS. Heuristics and biases: selected errors in clinical reasoning. Acad Med. 1999;74:791–794.

Articles from BMJ : British Medical Journal are provided here courtesy of BMJ Group