NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Bruening W, Uhl S, Fontanarosa J, et al. Noninvasive Diagnostic Tests for Breast Abnormalities: Update of a 2006 Review [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2012 Feb. (Comparative Effectiveness Reviews, No. 47.)

  • This publication is provided for historical reference only and the information may be out of date.


Topic Development

AHRQ requested an update of the evidence report Effectiveness of Noninvasive Diagnostic Tests for Breast Abnormalities.7 The original report was finalized in February 2006. Due to technological advances and continuing innovation in the fields of noninvasive imaging, the conclusions of the original report are possibly no longer relevant to current clinical practice. Consequently, the topic was selected for update. The EPC recruited a technical expert panel (TEP) to give input on key steps including the selection and refinement of the questions to be examined. The expert panel membership is provided in the front matter of this report.

Upon AHRQ approval, the draft Key Questions were posted for public comment. After receipt and consideration of the public commentary, ECRI Institute finalized the Key Questions and submitted them to AHRQ for approval. These Key Questions are presented in the Scope and Key Questions section of the Introduction.

ECRI Institute created a work plan for developing the evidence report. The process consisted of working with AHRQ and the TEP to outline the report’s objectives, performing a comprehensive literature search, abstracting data, constructing evidence tables, synthesizing the data, and submitting the report for peer review.

In designing the study questions and methodology at the outset of this report, the EPC consulted several technical and content experts. Broad expertise and perspectives were sought. Divergent and conflicted opinions are common and perceived as healthy scientific discourse that results in a thoughtful, relevant systematic review. Therefore, in the end, study questions, design and/or methodologic approaches do not necessarily represent the views of individual technical and content experts.

The topic development procedure employed the “PICOTS” approach; namely, carefully and clearly defining the Patients, the Intervention(s), the Comparator(s), the Outcomes, the Timing of followup, and the Setting of care.105


The patient population of interest is the general population of women participating in routine breast cancer screening programs (including mammography, clinical examination, and self-examination) who have been recalled after discovery of a possible abnormality and who have already undergone standard work-up, which may include diagnostic mammography and/or ultrasound (BI-RADS 0, and 3 to 5). Populations that will not be evaluated in this review include: women thought to be at very high risk of breast cancer due to family history or BRCA mutations; women with a personal history of breast cancer; women with overt symptoms such as nipple discharge or pain; and men.


The noninvasive diagnostic tests to be evaluated are:

  • Ultrasound (conventional B-mode, harmonic, tomography, color Doppler, and power Doppler)
  • Magnetic resonance imaging (MRI) with breast-specific coils and gadolinium-based contrast agents, with or without computer-aided diagnosis (CADx)
  • Positron emission tomography (PET) with 18-fluorodeoxyglucose (FDG) as the tracer, with or without concurrent computed tomography (CT) scans, and positron emission mammography
  • Scintimammography with technetium-99m sestamibi (MIBI) as the tracer, including Breast Specific Gamma Imaging (BSGI)

Technologies that were proposed for evaluation but, after discussion by the TEP, were not included, are: elastography; molecular breast imaging; scintimammography using tracers other than MIBI; PET using tracers other than FDG; digital tomosynthesis mammography; computer-aided diagnostic x-ray mammography; breast thermography; electrical impedance tomography; and optical breast imaging. The primary reasons that the TEP decided not to include these technologies in the current CER were a) insufficient robust evidence available about the technology at this time; b) no devices that employ the technology are currently available or approved in the United States; and/or c) the technology is primarily intended to be used in the screening setting.


The accuracy of the noninvasive imaging tests was evaluated by a direct comparison to histopathology (biopsy or surgical specimens) or to clinical followup, or a combination of these methods. In addition, the relative accuracy of the different tests under evaluation was evaluated by directly and indirectly comparing the tests (as the reported evidence permitted).


Outcomes of interest are diagnostic test characteristics, namely, sensitivity, specificity, and likelihood ratios. Adverse events related to the procedures, such as radiation, discomfort, and reactions to contrast agents, were also discussed.


Any duration of followup, from same-day interventions to many years of clinical followup, was evaluated.


Any care setting was acceptable, including general hospitals, physician’s offices, and specialized breast imaging centers.

Search Strategy

The medical literature was searched from December 1994 through September 2010. The full strategy is provided in Appendix A. In brief, we searched 10 external and internal databases, including PubMed and EMBASE, for clinical trials addressing the Key Questions. To supplement the electronic searches, we also examined the bibliographies/reference lists of included studies and recent narrative reviews, and scanned the content of new issues of selected journals and selected relevant gray literature sources.

Study Selection

We selected the studies that we consider in this report using a priori inclusion criteria. Some of the criteria we employed are geared towards ensuring that we used only the most reliable evidence. Other criteria were developed to ensure that the evidence is not derived from atypical patients or interventions, and/or outmoded technologies.

Studies of diagnostic test performance compare results of the experimental test to a reference test. The reference test is intended to measure the “true” disease status of each patient. It is important that the results of the reference test be very close to the truth, or the performance of the experimental test will be poorly estimated. For the diagnosis of breast cancer, the “gold standard” reference test is open surgical biopsy. However, an issue with the use of open surgical biopsy as the reference standard in large cohort studies of screening-detected breast abnormalities is the difficulty of subjecting women with probably benign lesions to open surgical biopsy. Furthermore, restricting the evidence base to studies that used open surgery as the reference standard for all enrolled subjects would eliminate the majority of the evidence. Therefore, we have chosen to use a combination of clinical and radiologic followup as well as core-needle biopsy and open surgical biopsy as the reference standard for our analysis, although we acknowledge that this decision may cause our analysis to over-estimate the accuracy of the noninvasive tests.106

We used the following formal criteria to determine which studies would be included in our analysis. Many of our inclusion criteria were intended to reduce the potential for spectrum bias. Spectrum bias refers to the fact that diagnostic test performance is not constant across populations with different spectrums of disease. For example, patients presenting with severe symptoms of disease may be easier to diagnose than asymptomatic patients in a screening population; and a diagnostic test that performs well in the former population may perform poorly in the latter population. The results of our analysis are intended to apply to a general population of women participating in routine breast cancer screening programs (mammography, clinical examination, and self-examination programs) and therefore many of our inclusion criteria are intended to eliminate studies that enrolled populations of women at very high risk of breast cancer due to family history, or populations of women at risk of recurrence of a previously diagnosed breast cancer.

  1. The study must have directly compared the test of interest to core-needle biopsy, open surgery, or clinical followup of the same patient.
    Although it is possible to estimate diagnostic accuracy from a two-group trial, the results of such indirect comparisons must be viewed with great caution. Diagnostic cohort studies, wherein each patient acts as her own control, are the preferred study design for evaluating the accuracy of a diagnostic test.107 Studies may have performed biopsy procedures on all patients, or may have performed biopsy on some patients and followed the other patients with clinical examination and mammograms. Fine-needle aspiration of solid lesions is not an acceptable reference standard for the purposes of this assessment.108-111
    Retrospective cohort studies that enrolled all or consecutive patients were considered acceptable for inclusion. However, retrospective case-control studies and case reports were excluded. Retrospective case-control studies have been shown to overestimate the accuracy of diagnostic tests, and case reports often report unusual situations or individuals that are unlikely to yield results that are applicable to general practice.106,107 Retrospective case studies (studies that selected cases for study on the basis of the type of lesion diagnosed) were also excluded because the data such studies report cannot be used to accurately calculate the overall diagnostic accuracy of the test.106
  2. The studies must have used current generation scanners and protocols of the selected technologies only. Other noninvasive breast imaging technologies are out of the scope of this assessment.
    Studies of outdated technology and experimental technology are not relevant to current clinical practice. Definitions of “outdated technology” and “current technology” were developed through discussions with experts in relevant fields. Definitions of “current technology to be included” are given in Table 2.
  3. The study enrolled female human subjects.
    Animal studies or studies of “imaging phantoms” are outside the scope of the report. Studies of breast cancer in men are outside the scope of the report. However, studies of predominantly female patients that enrolled one or two men were considered acceptable.
  4. The study must have enrolled patients referred for the purpose of primary diagnosis of a breast abnormality detected by routine screening (mammography and/or physical examination).
    Studies that enrolled women who were referred for evaluation after discovery of a possible breast abnormality by screening mammography or routine physical examination were included. Studies that enrolled subjects that were undergoing evaluation for any of the following purposes were excluded as being out of scope of the report: screening of asymptomatic women; breast cancer staging; evaluation for a possible recurrence of breast cancer; monitoring response to treatment; evaluation of the axillary lymph nodes; evaluation of metastatic or suspected metastatic disease; or diagnosis of types of cancer other than primary breast cancer. Studies that enrolled patients from high-risk populations such as BRCA1/2 mutation carriers, or patients with a strong family history of breast cancer, are also out of scope. If a study enrolled a mixed patient population and did not report data separately, it was excluded if more than 15 percent of the subjects did not fall into the “primary diagnosis of women at average risk presenting with an abnormality detected on routine screening” category.
  5. Study must have reported test sensitivity and specificity, or sufficient data to calculate these measures of diagnostic test performance; or (for Key Question 3) reported factors that affected the accuracy of the noninvasive test being evaluated.
    Other outcomes are beyond the scope of this report.
  6. Fifty percent or more of the subjects must have completed the study.
    Studies with extremely high rates of attrition are prone to bias and were excluded.
  7. Study must be published in English.
    Moher et al. have demonstrated that exclusion of non-English language studies from meta-analyses has little impact on the conclusions drawn. Juni et al. found that non-English studies typically were of lower methodological quality and that excluding them had little effect on effect size estimates in the majority of meta-analyses they examined. Although we recognize that in some situations exclusion of non-English studies could lead to bias, we believe that the few instances in which this may occur do not justify the time and cost typically necessary for translation of studies to identify those of acceptable quality for inclusion in our reviews.112,113
  8. Study must be published as a peer-reviewed full article. Meeting abstracts were not included.
    Published meeting abstracts have not been peer-reviewed and often do not include sufficient details about experimental methods to permit one to verify that the study was well designed.114,115 In addition, it is not uncommon for abstracts that are published as part of conference proceedings to have inconsistencies when compared to the final publication of the study, or to describe studies that are never published as full articles.116-120
  9. The study must have enrolled 10 or more individuals per arm.
    The results of very small studies are unlikely to be applicable to general clinical practice. Small studies are unable to detect sufficient numbers of events for meaningful analyses to be performed, and are at risk of enrolling unique individuals.
  10. When several sequential reports from the same patients/study are available, only outcome data from the most recent report were included. However, we used relevant data from earlier and smaller reports if the report presented pertinent data not presented in the more recent report.
Table 2. Noninvasive current technologies to be evaluated.


The abstracts of articles identified by the literature searches were screened for possible relevance in duplicate by four analysts. All exclusions at the abstract level were approved by the lead research analyst. The full-length articles of studies that appeared relevant at the abstract level were then obtained and three research assistants examined the articles to see if they met the inclusion criteria. All exclusions were approved by the lead research analyst. The excluded articles and primary reason for exclusion are shown in the Appendixes.

Data Abstraction

Standardized data abstraction forms were created and data were entered by each reviewer into the SRS© 4.0 database (see Appendixes). Three research assistants abstracted the data. The first fifty articles were abstracted in duplicate. All conflicts were resolved by the lead research analyst.

Study Quality Evaluation

We used an internal validity rating scale for diagnostic studies to grade the quality (internal validity) of the evidence base (see Appendixes). This instrument is based on a modification of the QUADAS instrument with reference to empirical studies of design-related bias in diagnostic test studies.106,121 Each question in the instrument addresses an aspect of study design or conduct that can help to protect against bias. Each question can be answered “yes,” “no,” or “not reported,” and each is phrased such that an answer of “yes” indicates that the study reported a protection against bias on that aspect.

Responses to the questions in the quality assessment instrument for each study are presented in the Evidence Tables in Appendix C.

Strength of Evidence Assessment

We applied a formal grading system that conforms with the CER Methods Guide recommendations on grading the strength of evidence.122,123

The overall strength of evidence supporting each major conclusion was graded as High, Moderate, Low, or Insufficient. The grade was developed by considering four important domains: the risk of bias in the evidence base, the consistency of the findings, the precision of the results, and the directness of the evidence.

The risk of bias (internal validity) of each individual study was rated as being Low, Medium, or High; and the risk of bias of the aggregate evidence base supporting each major conclusion was similarly rated as being Low, Medium, or High. We used our inclusion/exclusion criteria to eliminate studies with designs known to be prone to bias from the evidence base. Namely, case reports, case-control studies, and retrospective studies that did not enroll all or consecutive patients were not included for analysis. Because we eliminated all studies with a High risk of bias from the evidence base, we consider the remaining evidence base to have either a Low or Medium risk of bias.

We initially used an internal validity rating instrument for diagnostic studies to grade the internal validity of the individual studies (see the Study Quality Evaluation section above). However, after we had conducted meta-regressions investigating the correlation between key individual items on the quality rating instrument and the results reported by the studies (see Appendix D for details), we consistently found that the majority of the items on the instrument had no statistically significant correlation with the reported results (with one exception, discussed below). We therefore concluded that the quality instrument was not adequately capturing the potential for bias of the studies in our sample (after eliminating study designs known to be prone to bias, such as retrospective case-control studies and case reports, during the inclusion/exclusion process). Unlike studies of interventions, diagnostic cohort studies are quite simple in design, with one group of patients acting as their own controls. As long as all enrolled patients receive both the diagnostic test and the reference standard test, opportunities for bias (due to study design or conduct) to affect the results are limited. As mentioned above, we eliminated all studies with a High risk of bias due to their study design from the evidence base. We did not identify any obvious design flaws in the remaining studies that suggested they were at Medium risk of bias; therefore, we rated all of the included studies, and the aggregate evidence bases, as being at Low risk of bias.

Meta-regressions did identify a statistically significant correlation between blinding of image readers to patient clinical information and the reported results of studies of MRI and ultrasound. Studies that blinded image readers to patient clinical information generally reported the blinded image readers had less accurate findings. It may, therefore, be that lack of blinding is a design flaw that is biasing the results. However, an alternative interpretation, which we favor, is that blinding image readers to patient clinical information is an artificial construct that will rarely if ever occur in clinical practice; therefore, non-blinded studies are generating an estimate of accuracy that is closer to the “real” accuracy that can be obtained in clinical practice. The majority of the studies are either non-blinded or did not specifically state whether they were blinded, leading us to believe that our aggregate pooled summary estimate of accuracy is close to the “real” accuracy of the technologies as used in routine clinical practice.

We rated the consistency of conclusions supported by meta-analyses with the I² statistic.124,125 Datasets that were found to have an I² of less than 50 percent were rated as being “Consistent”; those with an I² of 50 percent or greater were rated as being “Inconsistent”; and datasets for which I² could not be calculated (e.g., a single study) were rated as “Consistency Unknown.”
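The consistency thresholds above can be sketched in a few lines of Python. This is a minimal illustration, not the report's analysis code: the function names are hypothetical, and the I² formula shown is the commonly used definition based on Cochran's Q and its degrees of freedom, which the text itself does not spell out.

```python
from typing import Optional


def i_squared(q: float, df: int) -> float:
    """Return I² as a percentage from Cochran's Q and its degrees of freedom.

    Assumption: the usual definition I² = max(0, (Q - df) / Q) * 100.
    """
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def rate_consistency(i2_percent: Optional[float]) -> str:
    """Apply the thresholds stated in the text.

    Pass None when I² cannot be calculated (for example, a single study).
    """
    if i2_percent is None:
        return "Consistency Unknown"
    return "Consistent" if i2_percent < 50.0 else "Inconsistent"


if __name__ == "__main__":
    # Hypothetical Q statistic of 12.0 from 10 studies (df = 9).
    print(rate_consistency(i_squared(12.0, 9)))
    print(rate_consistency(None))
```

Note that an I² of exactly 50 percent falls in the “Inconsistent” category, matching the “50 percent or greater” wording.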

For qualitative direct comparisons between different diagnostic tests, we rated conclusions as consistent if the effect sizes were all in the same direction. For example, when comparing the accuracy of ultrasound without a contrast agent to the accuracy of ultrasound with a contrast agent, if the estimates of sensitivity of the individual studies are consistently higher for studies that used a contrast agent, then the evidence base would be rated as “consistent.”

We defined a “precise” estimate of sensitivity or specificity as one for which both the upper and lower bounds of the 95 percent confidence interval were no more than 5 percentage points away from the summary estimate; for example, sensitivity 98 percent (95% CI: 97 to 100%) would be a precise estimate of sensitivity, whereas sensitivity 95 percent (95% CI: 88 to 100%) would be an imprecise estimate of sensitivity. Precision could be rated separately for summary estimates of sensitivity and specificity for each major conclusion.
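As a minimal sketch of this precision rule, the check below reproduces the two examples given in the text. The function name and the `max_distance` parameter are illustrative assumptions; all values are expressed in percentage points.

```python
def is_precise(estimate: float, ci_lower: float, ci_upper: float,
               max_distance: float = 5.0) -> bool:
    """Return True when both 95% CI bounds lie within max_distance
    percentage points of the summary estimate."""
    return (estimate - ci_lower) <= max_distance and \
           (ci_upper - estimate) <= max_distance


if __name__ == "__main__":
    # The two examples from the text.
    print(is_precise(98, 97, 100))  # lower bound 1 point away, upper 2 points
    print(is_precise(95, 88, 100))  # lower bound 7 points away
```

The first call evaluates to True and the second to False, because the lower bound of the second interval is 7 points below the estimate.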

For qualitative direct comparisons between different diagnostic tests, the conclusion is “Precise” if the confidence intervals around the summary estimates being compared do not overlap. We did not derive any formal conclusions (or formally rate the strength of evidence for any speculative statements) about indirect comparisons between different diagnostic tests.

According to the Methods Guide,122

The rating of directness relates to whether the evidence links the interventions directly to health outcomes.

For studies of diagnostic test accuracy, the evidence should always be rated as “Indirect” because the outcome of test accuracy is indirectly related to health outcomes. However, the Key Questions in this particular comparative effectiveness review do not ask about the impact of test accuracy on health outcomes. We therefore did not incorporate the “Indirectness” of the evidence into the overall rating of strength of evidence for these Key Questions because they did not ask about health outcomes.

Overall Rating of Strength of Evidence

The initial rating is based on the risk of bias. If the evidence base has a Low risk of bias, the initial strength of evidence rating is High; if the evidence base has a Medium risk of bias, the initial strength of evidence rating is Moderate; if the evidence base has a High risk of bias, the initial strength of evidence rating is Low. For this particular comparative effectiveness review, as explained above, the rating of risk of bias was Low for all evidence bases, and therefore the initial strength of evidence rating is High. The remaining two domains, consistency and precision, are used to upgrade or downgrade the initial rating as follows:

  • Consistent, Precise: High
  • Inconsistent, Precise: Moderate
  • Consistent, Imprecise: Moderate
  • Inconsistent, Imprecise: Low
  • “Consistency Unknown,” Precise: Low
  • “Consistency Unknown,” Imprecise: Insufficient

Evidence bases judged to be too small to support an evidence-based conclusion (e.g., one or two small studies) were simply rated “Insufficient” without formally considering the various domains. Further details about grading the strength of evidence may be found in the Evidence Tables section of the Appendixes.
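The grading rules listed above amount to a small lookup table. The sketch below is one way to express them, assuming (as the text states for this review) that every evidence base starts from a Low risk of bias; the function and parameter names are hypothetical, and the `too_small` flag stands in for the reviewers' judgment that an evidence base cannot support a conclusion.

```python
# Mapping of (consistency rating, precise?) to overall strength of evidence,
# transcribed from the list in the text (Low risk of bias assumed).
GRADE_TABLE = {
    ("Consistent", True): "High",
    ("Inconsistent", True): "Moderate",
    ("Consistent", False): "Moderate",
    ("Inconsistent", False): "Low",
    ("Consistency Unknown", True): "Low",
    ("Consistency Unknown", False): "Insufficient",
}


def grade_strength(consistency: str, precise: bool,
                   too_small: bool = False) -> str:
    """Return the overall strength-of-evidence grade.

    too_small: evidence base judged too small to support a conclusion
    (e.g., one or two small studies); graded Insufficient outright.
    """
    if too_small:
        return "Insufficient"
    try:
        return GRADE_TABLE[(consistency, precise)]
    except KeyError:
        raise ValueError(f"unknown consistency rating: {consistency!r}")


if __name__ == "__main__":
    print(grade_strength("Consistent", True))
    print(grade_strength("Consistency Unknown", False))
    print(grade_strength("Consistent", True, too_small=True))
```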


The issue of applicability was chiefly addressed by excluding studies that enrolled patient populations that were not a general population of asymptomatic women participating in routine breast cancer screening programs. We defined the population of interest as women at average risk of breast cancer participating in routine breast cancer screening programs (including mammography, clinical examination, and self-examination) who had been recalled after discovery of an abnormality and who had already undergone a standard work-up (diagnostic mammography and/or ultrasound and/or physical examination). We excluded studies that enrolled women thought to be at very high risk of breast cancer due to personal history, family history, or known carriers of BRCA mutations, and also excluded studies that enrolled patients presenting with overt symptoms such as nipple discharge or pain.

Data Analysis and Synthesis

The majority of studies reported data on a per-lesion rather than a per-patient basis, and therefore we analyzed the data on a per-lesion basis assuming that statistical assumptions about data independence were not being violated. Because the number of lesions was usually very similar to the number of patients (i.e., the vast majority of patients only had one lesion) we do not believe that this assumption will have a significant impact on the results.

We performed a standard diagnostic accuracy analysis. For the diagnostic accuracy analysis:

  • True negatives were defined as lesions diagnosed as benign on imaging that were found to be benign by the reference standard;
  • False negatives were defined as lesions diagnosed as benign on imaging that were found to be malignant (invasive or in situ) by the reference standard;
  • True positives were defined as lesions diagnosed as malignant (invasive or in situ) on imaging that were found to be malignant (invasive or in situ) by the reference standard; and
  • False positives were defined as lesions diagnosed as malignant (invasive or in situ) on imaging that were found to be benign by the reference standard.
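To make these definitions concrete, the following sketch computes sensitivity, specificity, and likelihood ratios from a single 2×2 table of lesion counts. It illustrates only the per-study arithmetic, not the bivariate meta-analytic model used for pooling; the class name, function name, and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Lesion counts for one study (imaging result vs. reference standard)."""
    tp: int  # malignant on imaging, malignant by reference standard
    fp: int  # malignant on imaging, benign by reference standard
    fn: int  # benign on imaging, malignant by reference standard
    tn: int  # benign on imaging, benign by reference standard


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return sensitivity, specificity, and positive/negative likelihood ratios."""
    if t.tp + t.fn == 0 or t.tn + t.fp == 0:
        raise ValueError("need at least one malignant and one benign lesion")
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    # Likelihood ratios are undefined at the extremes; report infinity there.
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_pos,
        "lr_negative": lr_neg,
    }


if __name__ == "__main__":
    # Hypothetical counts, not taken from any included study.
    print(accuracy_measures(TwoByTwo(tp=90, fp=40, fn=10, tn=160)))
```

With the hypothetical counts shown, sensitivity is 0.90 and specificity is 0.80, giving a positive likelihood ratio of about 4.5 and a negative likelihood ratio of about 0.125.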

We meta-analyzed the data reported by the studies using a bivariate mixed-effects binomial regression model as described by Harbord et al.8 All such analyses were computed by the STATA 10.0 statistical software package using the “midas” command.9 The summary likelihood ratios and Bayes’ theorem were used to calculate the post-test probability of having a benign or malignant lesion. In cases where a bivariate binomial regression model could not be fit, we meta-analyzed the data using a random-effects model and the software package Meta-Disc.10 Meta-regressions were also performed with the STATA software and the “midas” command. We did not assess the possibility of publication bias because statistical methods developed to assess the possibility of publication bias in treatment studies have not been validated for use with studies of diagnostic accuracy.126,127

Diagnostic tests all have a trade-off between minimizing false-negative and minimizing false-positive errors. False-positive errors that occur during breast screening diagnostic workups are not considered to be as clinically relevant as false-negative errors. Women who experience a false-positive error will be sent for unnecessary procedures, and may suffer from anxiety and a temporarily reduced quality of life, as well as morbidities related to the procedures. However, women who experience a false-negative error may suffer morbidities, reduced quality of life, and possibly even a shortened lifespan from a delayed cancer diagnosis.

Likelihood ratios can be used along with Bayes’ theorem to directly compute an individual woman’s risk of actually having a malignancy following a diagnosis on imaging. However, each individual woman’s post-test risk varies by her pre-test risk of malignancy. Simple nomograms are available for in-office use that allow clinicians to directly read individual patients’ post-test risk off a graph without having to go through the tedium of calculations. Predictive value is another commonly used measure of errors; however, negative and positive predictive values are specific to specific populations of women. Predictive values vary by the prevalence of disease in each specific population and should not be applied to other populations with different prevalences of disease. For this reason, we have avoided the use of predictive values in this systematic review.
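The calculation that the nomograms replace is the odds form of Bayes' theorem: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back to a probability. A minimal sketch follows; the function name and the example pre-test probability and likelihood ratios are hypothetical and are not results from this review.

```python
def post_test_probability(pre_test_probability: float,
                          likelihood_ratio: float) -> float:
    """Post-test probability of malignancy via the odds form of Bayes' theorem."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must be strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood ratio must be non-negative")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


if __name__ == "__main__":
    # Hypothetical woman with a 20% pre-test risk of malignancy.
    pre = 0.20
    print(post_test_probability(pre, 0.125))  # after a negative imaging result
    print(post_test_probability(pre, 4.5))    # after a positive imaging result
```

For this hypothetical case, a negative result with a likelihood ratio of 0.125 lowers the risk from 20 percent to about 3 percent, while a positive result with a likelihood ratio of 4.5 raises it to about 53 percent. The same likelihood ratios applied to a different pre-test risk give different post-test risks, which is why the post-test risk must be computed per patient.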

Peer Review and Public Commentary

A draft of the completed report was sent to the peer reviewers and representatives of AHRQ. The draft report was posted to the Effective Health Care Web site for public comment. In response to the comments of the peer reviewers and the public, revisions were made to the evidence report, and a summary of the comments and their disposition has been submitted to AHRQ, and will be made publicly available within 3 months of publication of this final report. Synthesis of the scientific literature presented here does not necessarily represent the views of individual reviewers.

