NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Humphrey L, Chan BKS, Detlefsen S, et al. Screening for Breast Cancer [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2002 Aug. (Systematic Evidence Reviews, No. 15.)

This publication is provided for historical reference only and the information may be out of date.



Identification and Selection of Articles

Based on input from USPSTF members, our search focused on the effectiveness of screening with mammography, CBE, and BSE. We also sought to identify key recent studies about the accuracy and adverse effects of these tests, but did not systematically review these areas.

We identified controlled trials and meta-analyses by searching the Cochrane Controlled Trials Registry (all dates), supplemented by a search for recent publications in MEDLINE (January 1994 to December 2001). Other sources were a PREMEDLINE search (December 2001 through February 2002); the reference lists of previous reviews, commentaries, and meta-analyses11–24; and suggestions from experts.

In the electronic searches, the Medical Subject Heading (MeSH) terms breast neoplasms and breast cancer were combined with the terms mammography and mass screening or physical examination and breast self examination, and terms for controlled trials or prospective studies to yield 954 citations. Titles and abstracts were reviewed to identify publications that were randomized controlled trials (RCTs) of breast cancer screening and had a relevant clinical outcome (advanced breast cancer, breast cancer mortality, or all-cause mortality). In all, the searches identified 146 controlled trials, of which 132 were excluded at the title and abstract phase because they concerned promoting screening rather than the efficacy of mammography (Figure 2). Four of the remaining 14 trials were excluded. Two were randomized trials of screening with mammography that have not yet presented mortality or advanced breast cancer outcomes.25, 26 The third was a controlled trial that reported a reduction in breast cancer mortality but was not randomized.27, 28 The fourth, the Malmo Prevention Study, was apparently a randomized trial of a variety of preventive interventions including mammography.29 It reported significantly lower mortality from cancer among women younger than age 40 at entry, but provided no information about the mammography protocol, referring the reader to another randomized trial, the Malmo Mammographic Screening Program, for further information. We believe that the two trials were in fact separate and that the results of the Malmo Mammographic Screening Program probably do not include results for the 8,000 women who participated in the Malmo Prevention Study.

Figure 2. Results of literature search.


The remaining eight randomized trials of mammography* were conducted between 1963 and 1994. Using the electronic searches and other sources, we retrieved the full text of 241 publications about these trials (these are listed in the bibliography, Appendix 1). We also identified 11 previous systematic reviews of the trials. Eight15–18, 20, 22, 24, 30–34 of these concerned breast cancer mortality and three addressed test performance.35–37 We identified three non-randomized controlled trials38–40 that are not included in the meta-analysis but are discussed in the report. Two randomized trials of BSE were identified and reviewed,41, 42 as were prior observational studies9, 43 and one non-randomized controlled trial evaluating BSE.44

Data Abstraction

Eight randomized trials* conducted between 1963 and 1994 provide almost all of the pertinent information about the effect of mammography and clinical breast examination on breast cancer mortality. Table 1 summarizes the setting, compared groups, mammography protocol, size, and years of follow-up for each randomized trial. (The studies are described in detail in Appendix 2.) The trials varied in the number of mammographic views, the frequency or interval between screens, the number of rounds of screening, and the length of follow-up. Since 1996, when the USPSTF last examined this issue, longer-term results have been made available for most studies, especially in women aged 40 to 49.

Table 1. Controlled Trials of Mammography and Clinical Breast Examination.


Two of the authors abstracted information about each RCT. We compiled an appendix (Appendix 2) consisting of detailed information about the patient population, design, potential flaws, missing information, and analysis conducted in each trial. For the primary endpoint of breast cancer mortality, we abstracted results for each reported length of follow-up. All but two of these trials were designed to assess the overall effectiveness of screening rather than the effect in different subgroups of women based on their age. In addition to recording the overall results reported in the trials, whenever possible we abstracted data separately by age decade. Subgroup analysis has been criticized because there is no clear demarcation in breast cancer incidence or screening effectiveness at ages 50, 60, or older. However, since there are differences in the burden of illness, and there are thought to be differences in the accuracy of tests, among these groups, it is possible that there are also differences in the effectiveness of early detection. The purpose of our subgroup evaluation was to compare these age groups with respect to the degree of agreement among randomized trials and to compare the size and direction of effects, in terms of both relative and absolute risk. Finally, when available, we abstracted results for total mortality as a measure of the comparability of the two randomized groups.

The randomized trials of screening provide little information about morbidity or the adverse effects of screening or treatment. A systematic review of adverse effects was beyond the scope of our review. While examining titles and abstracts, however, we obtained and reviewed the full text of several recent articles reporting the frequency of false-positive screening mammograms in the community and surveys of women's reactions to positive screening test results.

Assessment of Study Quality: General Approach

We used predefined criteria developed by the USPSTF to assess the internal validity of each randomized trial of screening.45 Two authors rated each study as “good,” “fair,” or “poor,” resolving disagreements by discussion among the authors after review of the data and of comments from 12 peer reviewers of earlier drafts of the report. We tried to apply the same standards to the mammography trials as we have applied to other prevention topics. We based our quality ratings on the entire set of publications from a trial rather than on individual articles.

The USPSTF criteria were designed to be adaptable to the circumstances of different clinical questions. Like other current systems to assess the quality of trials, the criteria are based as much as possible on empiric evidence of bias in relation to study characteristics. However, while the body of such evidence is growing, it does not permit a high degree of certainty about the importance of specific quality criteria in judging the mammography trials. This is because nearly all empiric studies evaluating the impact of bias on effect size examined drug treatment or other therapies, rather than screening.46, 47 Thus, generalizing these findings to large, population-based trials of screening is not straightforward. In recognition of this fact, cancer screening literature from the 1970s emphasizes that design standards for conventional trials of treatment should not always be applied to cancer screening trials.48

The quality of reporting of trials limits precision in critical appraisal.49 This is a particular issue in the mammography screening trials, many conducted in the 1960s and 1970s. In several trials, the methods were poorly described. Although some reviewers have promoted extensive query of trial authors to fill in gaps in published articles, the reliability of such data, as well as the appropriate interpretation of query data that contradicts what has been published in multi-authored, peer-reviewed papers, is uncertain. Moreover, authors are often unable to provide clarifying information.50 For these reasons, we relied on published data and did not query study authors.

Assessment of Study Quality: Application of Specific Criteria

All of the trials clearly defined interventions and co-interventions (CBE and BSE), all considered mortality outcomes, and all used intention-to-screen analysis. Because these criteria were met uniformly, the following received particular emphasis in judging the quality of the mammography trials: (1) initial assembly of comparable groups, (2) maintenance of comparable groups and minimization of differential or overall loss to follow-up, and (3) use of outcome measurements that were equal, reliable, and valid. As described below, we used a systematic approach to assess the flaws of the trials in each of these areas.

Assembly of Comparable Groups

In the mammography trials, randomization was done either individually or by clusters. Randomization of individuals is preferable because it is less likely to result in baseline differences among compared groups. In individually-randomized trials, we classified allocation concealment as adequate, inadequate, or poorly described, according to the criteria used by Schulz and colleagues.47 In a cluster-randomized trial, concealment of the assignment of individual patients is impossible, and the importance of concealing the allocation of clusters is unclear. Accordingly, we placed more importance on concealment in individually randomized trials.

We rated how each trial compared the subjects in the screened and control groups. To obtain the highest rating in this category, a trial must obtain baseline data on possible covariates prior to randomization, and the distribution of these covariates must be similar in screening and control groups. In a large, individually randomized trial, baseline differences in sociodemographic variables would suggest that randomization failed, especially if there were opportunities for subversion (that is, if allocation was not concealed).

This standard applies only if baseline data can be reliably collected in all patients in both groups. In several of the mammography screening trials, subjects in the usual care arm were followed passively, and there was no opportunity to collect baseline data from all of them. The decision not to contact each individual in the control group has logistic advantages and probably reduced contamination, but it limits comparison between the screened and control groups. When clusters are used, some baseline differences in the compared groups are almost inevitable.

We evaluated whether the method of identifying clusters (geographic areas, month or year of birth, etc.) was likely to result in bias, and whether measures such as matching were used to reduce it. If bias in assigning clusters to intervention or control groups seemed likely, we considered this a major flaw that was enough to invalidate the findings and rated this study poor. However, in contrast to individually randomized trials, we did not take small differences in the mean age of compared groups to be an indicator that randomization failed to distribute more important confounders equally among the groups.

Several of the trials measured mortality from causes of death other than breast cancer to establish the comparability of the mammography and control groups. When available, we recorded this information. Although comparable total mortality supports balanced randomization, it does not assure it. However, if there were dramatic differences in mortality from other causes, we considered it to be evidence that randomization failed.

Maintenance of Comparable Groups

Exclusions after randomization are considered to be a serious flaw in the execution of randomized trials, although empiric evidence of this bias is inconsistent.46, 47 Post-randomization exclusions were poorly described in several of the mammography trials and could have resulted in bias if the exclusions resulted in different levels of risk for breast cancer death between the groups. In most of the mammography trials, however, exclusion of subjects after randomization was an expected consequence of the protocol since some exclusion criteria, such as prior mastectomy, could not be applied to all subjects before randomization, because subjects were not individually contacted. We examined the number of, reasons for, and methods for exclusion of subjects after randomization. We based our rating on whether the methods used to ascertain patients were objective and consistent, not on the numbers of exclusions in the compared groups. Since ascertainment of clinical variables that might result in exclusion of a participant will be greater among intervention subjects and is an expected consequence of the study design, we did not consider unequal numbers of excluded subjects in the treatment and control groups after randomization as definitive evidence of bias.

Measurements are Equal, Reliable, and Valid (Including Masking of Outcome Assessment)

Over the duration of most of the trials, breast cancer death (the primary endpoint) occurred in 2 to 9 per 1,000 subjects. The relatively low number of events means that misclassification or biased exclusion of a few deaths could change the direction and statistical significance of the trial results. For this reason, selecting cases for review of cause of death based on broad criteria, using reliable sources of information to ascertain vital status (death certificates, medical records, autopsies, registries), and using independent blinded review of the cause of death are important measures to prevent bias. We considered blinded review of deaths a requirement for a fair or better quality rating.
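To illustrate how few events are in play, the following arithmetic sketch (with hypothetical counts, not data from any of the trials) shows how reclassifying a handful of deaths can move a trial's result across the significance threshold. The confidence interval uses the standard normal approximation on the log relative risk scale.

```python
import math

def rr_ci(deaths_screen, n_screen, deaths_control, n_control):
    """Relative risk and 95% CI via the normal approximation on the log scale."""
    rr = (deaths_screen / n_screen) / (deaths_control / n_control)
    se = math.sqrt(1 / deaths_screen - 1 / n_screen
                   + 1 / deaths_control - 1 / n_control)
    lower = math.exp(math.log(rr) - 1.96 * se)
    upper = math.exp(math.log(rr) + 1.96 * se)
    return rr, lower, upper

# Hypothetical trial: 5 vs. 7 deaths per 1,000 among 20,000 women per arm.
print(rr_ci(100, 20000, 140, 20000)[2] < 1)  # True: upper bound < 1, significant
# Misclassify 10 deaths from the control arm into the screened arm:
print(rr_ci(110, 20000, 130, 20000)[2] < 1)  # False: significance is lost
```

With tens of thousands of women per arm but only a couple of hundred deaths, shifting ten deaths between arms is enough to erase statistical significance, which is why blinded, criterion-based cause-of-death review matters.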

Approach to Multiple Analyses

The mammography trials have been criticized for decades,10, 51–53 and the trialists have responded by conducting additional analyses intended to address these criticisms. In our assessment of quality, we took into account the results of these supplemental analyses. For example, the cluster-randomized trials have been criticized because they analyzed results using statistical methods appropriate only to individually randomized trials, but an independent re-analysis using the correct statistical method found that the results were unchanged.54 The Canadian trialists addressed criticisms that women who had palpable nodes might have been enrolled preferentially in the mammography group55 by re-analyzing their data and showing that the exclusion of these subjects did not affect the results.56

Data Synthesis

Although there are many meta-analyses dealing with this topic, we conducted another to incorporate new information about the quality of the trials and updated results from several studies reflecting longer follow-up. Four trials compared mammography alone to usual care, and four compared mammography plus CBE to usual care. Because of uncertainty about whether CBE is effective, and in consultation with USPSTF members, we decided that these trials were qualitatively homogeneous. Homogeneity was also assessed using the standard χ² test; the P-value from the test was greater than 0.1, indicating that the effect sizes estimated by the studies are homogeneous.
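The chi-square test of homogeneity referred to here is conventionally Cochran's Q: each trial's log relative risk is weighted by its inverse variance, and the weighted sum of squared deviations from the pooled estimate is compared to a χ² distribution with k − 1 degrees of freedom. A minimal sketch with purely illustrative numbers (not the trials' actual results):

```python
import math

def cochran_q(log_rrs, ses):
    """Cochran's Q statistic for homogeneity of trial effects (log-RR scale)."""
    weights = [1.0 / s**2 for s in ses]              # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, log_rrs)) / sum(weights)
    return sum(w * (y - pooled)**2 for w, y in zip(weights, log_rrs))

# Illustrative: four trials with similar log relative risks.
effects = [math.log(0.80), math.log(0.75), math.log(0.85), math.log(0.78)]
ses = [0.10, 0.12, 0.11, 0.09]
q = cochran_q(effects, ses)
# Compare Q to the chi-square critical value with k - 1 = 3 df at P = 0.1 (6.25):
print(q < 6.25)  # True: small Q, no evidence of heterogeneity (P > 0.1)
```

A Q below the critical value, as here, corresponds to the P > 0.1 result reported in the text.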

We conducted two meta-analyses to address two key questions posed by the USPSTF: (1) Does mammography reduce breast cancer mortality among women over a broad range of ages when compared to usual care? and (2) If so, does mammography reduce breast cancer mortality among women aged 40 to 49 when compared to usual care?

In the first analysis, we included all data from the seven fair-quality trials, treating the two Canadian studies as one trial in subjects aged 40 to 59. In the second analysis, we included the six fair-quality trials that reported results for women less than age 50. See Appendix 3 for all data contributing to the meta-analyses.

We conducted each meta-analysis in two parts. First, using WinBUGS software, we constructed a two-level Bayesian random-effects model to estimate the effect size from multiple data points for each study and to derive a pooled estimate of the relative risk reduction and a credible interval (CrI) for a given length of follow-up.57 The purpose of this analysis was to use repeated measures of the effect over time to estimate the relationship between length of follow-up and effect size. Second, we pooled the most recent results of each trial to calculate the absolute and relative risk reduction, using the results of the first analysis to estimate the mean length of observation. Risks were modeled on the logit scale.

To avoid bias that could result from excluding any data from valid studies, we included the results of all trials of fair-or-better quality in the base case analysis. The disadvantage of this approach is that it combines results from two distinct types of studies. The six population-based trials randomized women to an invitation to screening or to a control group that received “usual care” and was followed passively. In these trials, women who were invited but chose not to be screened were included in the analysis of the “screened” group. The other two trials, conducted in Canada, differed in two ways. First, the Canadian trials used mass media to recruit a sample of volunteers, and all women randomized to mammography had at least one mammogram.58, 59 Second, in one of the Canadian trials, the control group was screened periodically with CBE. To estimate the relative risk reduction and the number needed to invite to screening to prevent one breast cancer death compared with usual care, we re-analyzed the data excluding the results of the Canadian studies.
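The number needed to invite follows directly from the absolute risk reduction: it is the reciprocal of the control-group risk multiplied by the relative risk reduction. A short sketch with purely illustrative figures (the report's actual estimates come from the pooled analysis, not these values):

```python
def number_needed_to_invite(risk_control, relative_risk_reduction):
    """Number needed to invite to screening to prevent one breast cancer death."""
    absolute_risk_reduction = risk_control * relative_risk_reduction
    return 1.0 / absolute_risk_reduction

# Illustrative only: ~4 breast cancer deaths per 1,000 over follow-up in usual
# care, and a hypothetical 20% relative risk reduction from invitation.
print(round(number_needed_to_invite(0.004, 0.20)))  # → 1250
```

Because the baseline risk of breast cancer death rises with age, the same relative risk reduction implies a much smaller number needed to invite in older women than in younger women.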

To model the relationship between length of follow-up and relative risk, a two-level hierarchical model was used. The first level was the result of a trial at a given average or median follow-up time, x_ij, where i indexes the trial and j indexes the data point within a trial. The second level is the trial itself. The model allows for within-trial and between-trial variability. Specifically, the model was:

α* ~ Normal(·, ·)
β* ~ Normal(·, ·)
α_i ~ Normal(α*, σ²_α)
β_i ~ Normal(β*, σ²_β)
μ_ij = α_i + β_i x_ij + τ z_ij
τ ~ Γ(·, ·)
z_ij ~ Normal(0, 1)
log RR_ij ~ Normal(μ_ij, S²)

A global regression curve was estimated as log RR = α* + β* x. The random effect is τ z_ij. The model to estimate summary risk was:

# deaths_control,i ~ Binomial(π_control,i, n_control,i)
# deaths_intervention,i ~ Binomial(π_intervention,i, n_intervention,i)
logit(π_control,i) = α + τ z_i
logit(π_intervention,i) = α + β + τ z_i
α ~ Normal(·, ·)
β ~ Normal(·, ·)
τ ~ Γ(·, ·)

The absolute risk difference was calculated as π_control,i − π_intervention,i. The relative risk was calculated as exp(β).
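On the logit scale, the summary-risk parameters convert back to risks through the inverse-logit function. The sketch below uses hypothetical values for α and β (not the report's posterior estimates). Note that exp(β) in a logit model is strictly an odds ratio; for an outcome this rare it is numerically close to the relative risk, which is presumably why the report treats it as such.

```python
import math

def inv_logit(x):
    """Inverse of the logit function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical parameter values (not the report's estimates):
alpha = math.log(0.004 / 0.996)  # control-group risk of 4 per 1,000, on the logit scale
beta = math.log(0.8)             # screening effect on the log-odds scale

pi_control = inv_logit(alpha)
pi_intervention = inv_logit(alpha + beta)

absolute_risk_difference = pi_control - pi_intervention
relative_risk = math.exp(beta)   # the report's convention: RR estimated as exp(beta)

print(round(pi_control, 6), round(pi_intervention, 6))  # → 0.004 0.003203
```

In the actual analysis these quantities are computed draw by draw from the Gibbs sampler output, so the absolute and relative risk reductions carry full posterior uncertainty rather than being plug-in point estimates.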

The models were estimated using a Bayesian data analytic framework.60 The data were analyzed using WinBUGS,57 which uses Gibbs sampling to simulate posterior probability distributions. Noninformative (proper) prior probability distributions were used: Normal(0, 10⁶) and Γ(0.001, 0.001). Five separate Markov chains with overdispersed initial values were used to generate draws from posterior distributions. Point estimates (means) and 95% credible intervals (2.5th and 97.5th percentiles) were derived from the subsequent 5 × 10,000 draws after reasonable convergence of the five chains was attained. The code to model the data in WinBUGS is available upon request from the authors.
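Convergence of multiple overdispersed chains is conventionally checked with the Gelman–Rubin potential scale reduction factor (R-hat), which compares between-chain and within-chain variance and approaches 1 at convergence. The sketch below is a generic version of that diagnostic, not the authors' WinBUGS code (which they state is available on request):

```python
import random
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) across parallel MCMC chains."""
    n = len(chains[0])                               # draws per chain
    means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean(statistics.variance(c) for c in chains)
    between = n * statistics.variance(means)
    # Pooled posterior-variance estimate: weighted mix of within and between.
    var_hat = (n - 1) / n * within + between / n
    return (var_hat / within) ** 0.5

# Five chains that have converged to the same distribution:
random.seed(0)
chains = [[random.gauss(0, 1) for _ in range(10000)] for _ in range(5)]
print(round(gelman_rubin(chains), 2))  # → 1.0
```

Chains stuck near different modes inflate the between-chain variance, pushing R-hat well above 1 and signaling that more burn-in (or a reparameterized model) is needed before summarizing the draws.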



* Four of these were Swedish studies: Malmo, the Two-County Trial (comprising Kopparberg and Ostergotland), Stockholm, and Gothenburg. The remaining studies were Edinburgh, the New York Health Insurance Plan (HIP) study, and the two Canadian National Breast Screening Studies (NBSS-1 and NBSS-2).