
Institute of Medicine (US) and National Research Council (US) National Cancer Policy Board; Curry SJ, Byers T, Hewitt M, editors. Fulfilling the Potential of Cancer Prevention and Early Detection. Washington (DC): National Academies Press (US); 2003.


5. Potential of Screening to Reduce the Burden of Cancer

A complementary strategy to preventing the occurrence of cancer (primary prevention) is early detection of cancer through screening (secondary prevention). The fundamental tenet of screening for cancer is that finding the disease before symptoms develop enables detection at a less advanced stage and that instituting treatment at that time leads ultimately to improved health outcomes. Although this syllogism seems intuitive and is widely assumed to be true by both health professionals and the lay public, its validity is unclear for many cancers.

The term screening, as used in this report, refers to the early detection of cancer or premalignant disease in persons without signs or symptoms suggestive of the target condition (the type of cancer that the test seeks to detect). Some investigators draw a distinction between screening and case finding, using the former term to describe population-based screening programs, such as those conducted at health fairs or shopping malls, and the latter term to refer to testing of patients in the clinical setting. This report refers to both forms of testing as screening because the evidence base is similar in both contexts. Diagnostic testing, which is not addressed in this report, refers to the evaluation of patients with signs or symptoms associated with cancer (e.g., a breast lump, blood in stool, fatigue, or weight loss), often by use of the same tests used for screening. Surveillance refers to follow-up screening for new evidence of cancer in patients who have already been diagnosed with and treated for cancer or premalignant disease.

This chapter reviews the principles used for determination of the effectiveness of cancer screening and applies those principles in an examination of current scientific evidence regarding the benefits and harms of screening for four types of cancer (cancers of the colon and rectum, breast, prostate, and cervix). Chapter 6 examines strategies for optimization of the delivery of recommended cancer screening tests from the perspective of the health care system, providers, and, most importantly, the patient. Chapter 7 presents a case study that reviews the history and prospects for screening for lung cancer, illustrating the difficulties of adopting new technologies in the face of uncertain science.

PRINCIPLES FOR ASSESSMENT OF THE EFFECTIVENESS OF SCREENING FOR CANCER

The principal considerations in judging the effectiveness of cancer screening are (1) the burden of suffering, the frequency of cancer, and the severity of its health effects; (2) the accuracy and reliability of the screening test in detecting cancer and minimizing inaccurate test results; (3) the effectiveness of early detection, including the incremental benefit of detecting and treating cancer at an earlier stage; (4) the harms of screening, both from the testing process and from the incremental harms from evaluation and treatments that follow; and (5) costs. These considerations form the tradeoffs used to weigh the benefits and harms of screening. All of the preceding analytical steps are necessary to address the pivotal question of whether patients and populations experience better outcomes with screening than without it.

Burden of Suffering

The first consideration in assessing the effectiveness of cancer screening is the frequency with which cancer occurs in the population and its attendant health effects. The prevalence rate determines the pretest probability of disease or the average likelihood that a person in the screened population will have cancer. The lower this value is, the larger the number of tests that must be performed to detect one case of cancer (i.e., it will have lower yield) and, for statistical reasons discussed below, the greater the chances that a positive test result will be erroneous (a false-positive result).

Mortality rates and other measures of the probability of adverse health effects from cancer influence the absolute benefit of screening (see the discussion of absolute benefit versus relative benefit below). For example, if a screening test reduces the risk of dying from cancer by 20 percent (relative risk reduction), the number of lives saved by screening or the probability that a person undergoing screening will avert death (absolute benefit) depends directly on the baseline mortality rate in the screened population. If that rate is 30/100,000 per year, screening will save six lives per 100,000 screened. The same form of screening would save only one life in a lower-risk setting where the mortality rate is only 5/100,000. As detailed below, absolute risks influence the number of individuals who need to be screened to achieve a health benefit.
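The arithmetic behind this distinction can be sketched in a few lines (a minimal illustration of the report's example; the function name is ours, not the report's):

```python
def lives_saved_per_100k(baseline_mortality_per_100k, relative_risk_reduction):
    """Absolute benefit of screening: deaths averted per 100,000 screened.

    The relative risk reduction is the same everywhere the test is used;
    the absolute benefit scales with the baseline mortality rate.
    """
    return baseline_mortality_per_100k * relative_risk_reduction

# The text's example: a 20 percent relative risk reduction
print(lives_saved_per_100k(30, 0.20))  # high-risk setting -> 6 lives per 100,000
print(lives_saved_per_100k(5, 0.20))   # low-risk setting  -> 1 life per 100,000
```

The same test, with the same relative effect, saves six times as many lives in the higher-risk population.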

Most published rates for cancer morbidity and mortality are derived from patients with clinically detected disease (i.e., cancers that come to the attention of health care providers in the evaluation of abnormal symptoms or physical findings), but the types of cancers detected by screening also include those that were not destined to manifest clinical symptoms and that are therefore of uncertain clinical significance. Autopsy studies have demonstrated for decades that a large proportion of persons live their lives harboring occult cancers that cause little or no clinical symptoms because of their slow rate of growth or late onset. Screening often detects such lesions, but it is often difficult to determine at the time of diagnosis whether those cancers were destined to progress. This phenomenon of screening is known as overdiagnosis and is important because the degree to which cancers that are not destined to progress are represented among cancers detected by screening limits the net health benefits of screening.

Overdiagnosis figures prominently in debates about the benefits of screening for various cancers. A common criticism of screening for prostate cancer, for example, is that many of the cancers detected by screening are latent carcinomas that, due to that disease's slow growth characteristics, are unlikely to progress or cause clinical symptoms (Woolf, 1995). Screening mammography has led to increased detection of ductal carcinoma in situ (Feig, 2000; Winchester et al., 2000), the clinical significance of which is debated. Cervical cancer screening uncovers various forms of cervical atypia for which the need to treat and proper approach for follow-up are uncertain. Finally, new imaging technologies for lung cancer screening are finding small cancers and pulmonary nodules about which the natural history is uncertain (Frame, 2000).

Accuracy and Reliability of Screening Tests

The second consideration in judging the effectiveness of screening for cancer is whether the available test(s) can detect cancer at an early stage without producing large numbers of false-positive and false-negative results. Of greatest concern is the test's accuracy, the degree to which it measures the true value of the attribute it is testing, and its reliability, the consistency of the result when it is repeated. The principal parameters for measuring accuracy are sensitivity, specificity, and predictive value.

Sensitivity, Specificity, and Predictive Value

Sensitivity is the proportion of persons with cancer who correctly test positive, and specificity is the proportion of persons without cancer who correctly test negative (Box 5.1). Sensitivity and specificity are usually inversely related, so that tests with high sensitivities (i.e., those that miss few cases of cancer) tend to have low specificities (i.e., produce a higher proportion of false-positive results). If patients are put at substantial risk by receiving false-positive results, it may be worth compromising sensitivity—even though it means fewer cancers will be detected—in the interest of adopting a screening test or threshold with a higher specificity that generates fewer false-positive results.

BOX 5.1

Definitions of Screening Test Performance. The performance of a screening test is often defined by three related measurements: sensitivity, specificity, and positive predictive value. The sensitivity (se) of a screening test is the proportion of people with cancer who test positive.

Although the sensitivity and specificity of a cancer screening test are generally constant across populations and settings, this is not true for the positive predictive value (PPV), which is the probability that an abnormal result correctly indicates cancer (Box 5.1). The PPV depends on the pretest probability or likelihood that cancer is present at the time that the person is tested. For any cancer screening test, the PPV is lower (and the chances of false-positive results are higher) when there is a lower prevalence of cancer.

This important principle that underlies many concerns about cancer screening is best understood by example (Table 5.1). Suppose a test has a sensitivity and a specificity of 90 percent each. Clinicians would characteristically misinterpret these data to mean that a patient who has a positive result has a 90 percent likelihood of having cancer (i.e., PPV = 90 percent). In actuality the PPV is dependent on a third variable, the prevalence, or pretest probability, of cancer. Suppose the prevalence of cancer is 1 percent (1,000/100,000 population). This means that if 100,000 persons are screened, of whom 1,000 actually have cancer, then the 90 percent sensitivity means that 900 of these 1,000 will test positive, and the 90 percent specificity means that 89,100 of the 99,000 people without cancer will test negative. The chances that a positive test result is indicative of cancer (the PPV question asked above) would not be 90 percent, but 900/10,800, or 8 percent. The seeming accuracy conveyed by the “90 percent” figure for both sensitivity and specificity obscures the disturbing problem that the test would give false-positive information to 92 percent of those testing positive (11 people for every 1 person who truly had cancer). Because PPV correlates with prevalence, if the same test is administered in a community with a lower prevalence, the PPV would fall even further and the risk of producing false-positive results would climb higher (99 percent of those testing positive, or 111 people for every 1 person who truly had cancer) (Table 5.1). The policy significance of these mathematics is that, regardless of the accuracy of a screening test, the administration of a test to populations or individuals with a low risk of cancer has a potential to introduce major problems with false-positive results, leading to harms that can offset the benefits of screening.

TABLE 5.1. Illustration of Influence of Prevalence on Positive Predictive Value.
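The dependence of PPV on prevalence is straightforward to verify with Bayes' rule. In the sketch below (function name ours), the 0.1 percent prevalence for the lower-prevalence community is our assumption, not a figure from the report, but it reproduces the "111 people for every 1 person" ratio cited above:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(cancer | positive test) = TP / (TP + FP)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# The chapter's example: se = sp = 90 percent, prevalence 1 percent
print(round(ppv(0.90, 0.90, 0.01), 3))   # -> 0.083, i.e., 8 percent

# Assumed lower-prevalence community (0.1 percent): PPV falls below 1 percent,
# about 111 false positives for every true positive
print(round(ppv(0.90, 0.90, 0.001), 3))
```

The same test, with identical sensitivity and specificity, yields a tenfold worse PPV when applied to a population with one-tenth the prevalence.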

Reliability

Reliability (or reproducibility) is the degree to which a screening test yields the same result when it is repeated under the same conditions. A laboratory assay for a serum tumor marker, for example, lacks reliability if it yields significantly different results when the test is repeated with a sample from the same tube of blood. Radiologists' interpretation of a screening chest radiograph can suffer from poor reliability due to either interobserver variation (differences between radiologists' interpretation of the same film) or intraobserver variation (different interpretations of the same film by one radiologist).

Effectiveness of Early Detection

A common mistake in determining whether screening for cancer is justified and a reason for premature enthusiasm for promoting screening tests is to limit consideration to the issues described above: burden of suffering and accuracy. Proponents argue that if the disease is serious and an accurate test is available, routine screening should be instituted. What this argument overlooks is the possibility that early detection of the disease may not improve outcomes either for the screened population as a whole or even for the individuals who will be found to have cancer.

Effectiveness of Treatment

The efforts and potential adverse effects of screening are not justified if an effective treatment is unavailable for persons found to have cancer. The tragedy of many cancers is that they progress inexorably, despite the use of the best available treatment regimens, because of the inability of these therapies to alter the natural history of the disease. Screening for such cancers serves only to identify the disease earlier in its course, not to improve the prognosis. This longer apparent survival time is not a benefit to the patient (and indeed may be a psychological and social cost) if that earlier diagnosis did not result in either less morbidity from treatment or longer life.

The benefits of early detection are muted for cancers that have a short preclinical period because the time window for early detection is short and the opportunity to affect outcomes is brief. Screening is also unlikely to confer benefits by detecting cancers that would have excellent outcomes under usual circumstances, when treatment is not initiated until patients present with symptoms. This concern underlies skepticism about the incremental benefit of screening for endometrial or testicular cancer, for example. Latent cancers detected by screening-induced overdiagnosis also may not benefit from early detection if the lesions were not destined to progress or affect the patient's health. Although little harm would have occurred if the cancer went undetected, the excellent outcomes of screening programs that predominantly detect such lesions are often cited as evidence of the benefits of screening. These principles are embodied in Whitmore's now-famous aphorism about prostate cancer: “Is cure possible for those for whom it is necessary, and is cure necessary for those in whom it is possible?” (Whitmore, 1988, pp. 7–11).

Incremental Benefit of Early Detection

Having an effective treatment is not enough. The logic behind screening rests on the argument that outcomes are improved by the early institution of treatment. If there is no incremental health benefit to early detection and patients fare just as well if their cancers are diagnosed after signs or symptoms appear, then there is not a good argument for screening. In this case, screening produces only harms: adverse effects on people without cancer, many of whom will experience anxiety and undergo workups for false-positive results, and the diversion of resources that could help patients more effectively if invested elsewhere.

The presumption that early detection improves outcomes is almost axiomatic in U.S. society. Epidemiological evidence would seem to support this belief. For almost all forms of cancer, 5-year survival rates are substantially lower for persons with advanced-stage disease (see Chapter 1). Such statistics are often mistakenly interpreted as evidence that patients are likely to live longer if their cancer is diagnosed early (see discussion of “lead time bias” below). Screening is consistently associated with the diagnosis of smaller and more localized tumors and with the familiar phenomenon of “stage shift,” in which the proportion of cancers diagnosed at an earlier stage increases after screening is introduced. Also, observational studies demonstrate that patients whose cancers are diagnosed through screening often have better outcomes than those whose cancers are diagnosed otherwise. Many advocates of cancer screening find such evidence more than adequate to justify the intuitive notion that early detection is beneficial.

Whether such evidence is indeed adequate lies at the heart of many controversies about cancer screening. Critics of such evidence argue that such observations do not offer proof of benefit because the same patterns would be expected even if screening did not improve outcomes. For example, the fact that patients who participate in screening programs have better outcomes than those in other settings may be due to the fact that patients who participate in screening are more likely to have a college education, to be nonsmokers, and to have other healthier habits (Rimer et al., 1996b). Similarly, the fact that screening detects disease at an earlier stage and that patients diagnosed with localized disease have higher 5-year survival rates may reflect length and lead-time biases rather than true lengthening of life (Welch et al., 2000). The influence of these factors cannot be excluded unless outcomes are examined for a control group that is comparable in all respects other than exposure to screening, as has been done in trials of screening mammography.

Lead-Time Bias Lead-time bias refers to the overestimation of survival time simply due to a backward shift in the starting point for the measurement of survival as a result of early detection (Last, 1988). Patients diagnosed earlier can seem to live longer after diagnosis even if the time of death does not change. For illustration, consider a man who is destined to develop symptoms from prostate cancer at age 65 and to die at age 70. His survival after diagnosis (5 years) can be doubled (10 years) if the cancer is detected through screening at age 60, even if he still dies from that same cancer at age 70. Because of lead-time bias, the fact that 5-year survival rates are higher for early-stage cancer than for advanced-stage cancer does not, by itself, prove that patients who are screened benefit from that screening and live longer; it may mean only that their disease is detected earlier. Similarly, the tendency of screening to detect smaller, localized tumors proves that cancers are being found at an earlier stage of their progression, not that the outcomes of that progression will necessarily be altered.
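The hypothetical patient above can be expressed numerically (a trivial sketch; the names are ours):

```python
def survival_after_diagnosis(age_at_diagnosis, age_at_death):
    """Years lived after diagnosis -- the statistic inflated by lead time."""
    return age_at_death - age_at_diagnosis

# The text's example: symptoms at 65, death at 70, screen detection at 60.
clinical = survival_after_diagnosis(65, 70)   # 5 years
screened = survival_after_diagnosis(60, 70)   # 10 years
lead_time = 65 - 60

# Survival after diagnosis doubles, yet the entire gain equals the lead
# time: the date of death has not moved at all.
print(clinical, screened, screened - clinical == lead_time)
```

Measured survival doubles while the patient gains nothing, which is why screening trials compare mortality rates rather than survival after diagnosis.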

Length Bias Length bias refers to the tendency of screening to detect slowly growing lesions more readily than aggressive cancers. Rapidly progressive cancers, because they lead more hastily to death, are present in the screened population for a shorter period of time, thereby reducing their prevalence in the population and, thus, their odds of being detected when a screening test is administered. The consequence of length bias is that cancers detected by screening include a higher proportion of slowly growing cancers than do cancers detected because of symptoms. The favorable prognosis observed for cancers detected through screening may therefore imply a benefit from screening even when there is none.
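A toy Monte-Carlo simulation (our construction, not from the report) makes the mechanism concrete. The 10-year and 1-year preclinical windows, the 20-year horizon, and the 50/50 tumor mix are all arbitrary assumptions chosen only for illustration:

```python
import random

def screen_detected_mix(n=200_000, horizon=20.0, seed=1):
    """Fraction of screen-detected tumors that are slow-growing.

    Slow tumors spend 10 years in the detectable-but-asymptomatic
    (preclinical) window; aggressive tumors spend 1 year. Tumor onset is
    uniform over the horizon, and a single screen occurs at the end of
    it. A tumor is screen-detected only if the screen falls inside its
    preclinical window, so the chance of detection is proportional to
    the time spent in that window.
    """
    random.seed(seed)
    slow_detected = total_detected = 0
    for _ in range(n):
        slow = random.random() < 0.5          # half of tumors are slow-growing
        sojourn = 10.0 if slow else 1.0       # years detectable before symptoms
        onset = random.uniform(0.0, horizon)
        if onset + sojourn >= horizon:        # still preclinical when screened
            total_detected += 1
            slow_detected += slow
    return slow_detected / total_detected

# Although slow and aggressive tumors arise equally often, roughly 90
# percent of the screen-detected tumors are the slow-growing kind.
print(round(screen_detected_mix(), 2))
```

Even with no effect on anyone's outcome, the screen-detected group is dominated by indolent disease, so its prognosis looks favorable by construction.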

Screening Interval and Duration Under conditions of uncertainty, in which the optimal frequency of screening has not been determined directly in clinical studies, there is a tendency to assume that a shorter interval is appropriate if the individual is at high risk of acquiring cancer. This assumption, which underlies the common advice that individuals in high-risk groups undergo more frequent screening, may be invalid because the proper determinants of the frequency of a screening test are the rate of progression of the disease and the sensitivity of the test. If these variables are held constant, increasing the frequency of testing offers little benefit, regardless of one's underlying risk of acquiring cancer (Frame and Frame, 1998).

Many controversies in cancer screening surround the question of when to stop. For most cancers, the absolute risk of dying from cancer increases with age, making elderly individuals the largest subset of people with cancer. On the other hand, the decreasing life expectancy and the greater likelihood of having other diseases that accompany advancing age tend to offset these benefits. One analysis, based on certain assumptions of efficacy, estimated that lifetime screening for breast cancer from age 50 until death results in a maximum potential life expectancy gain of 43 days, whereas the cessation of screening at age 75 or 80 would result in women giving up a maximum potential life expectancy gain of 9 or 5 days, respectively (Rich and Black, 2000). Rather than relying on such modeling data, which have their limitations, it would be preferable to examine direct evidence of the relative benefits of screening with advancing age, but most screening trials have limited enrollment to patients under the age of 70, limiting access to definitive data. Because many older adults have excellent life expectancy and quality of life, current thinking is shifting away from reliance on strict age cutoffs for screening and looking more closely at the life expectancy and health status of each individual to assess the potential benefits of screening.

Study Designs For the reasons outlined above, epidemiological studies reporting better outcomes for individuals with early-stage cancer tend not to persuade skeptics that early detection improves outcomes. Study designs fall in a hierarchy of persuasiveness (Box 5.2), in which uncontrolled epidemiological data and case series rank lowest in proving effectiveness.

BOX 5.2

Hierarchy of Effectiveness of Study Designs. Experimental trials: randomized controlled trials.

Controlled observational studies compare outcomes among those who do or do not receive screening and bring investigators and clinicians one step closer to having definitive evidence of the effectiveness of screening. Historical studies (before-and-after studies), such as a comparison of outcomes within a community before and after the introduction of a screening program, raise questions about the influence of temporal factors (e.g., improved treatment regimens) other than screening that occurred contemporaneously with the screening program. Cross-sectional comparisons, such as comparisons of outcomes for patients screened at a local institution with those for other patients in the community, also lack persuasiveness because of potential confounding variables: the characteristics of patients at these institutions may have an independent effect on the observed outcomes that are unrelated to screening.

In a case-control study, a retrospective review of medical records is undertaken to compare patients who died of cancer to a matched group of patients who did not die from cancer. If the patients who died from cancer were significantly less likely to have undergone screening, it is tempting to infer that the screening test was beneficial. The limitations of such studies include their retrospective design (e.g., medical records may not systematically capture relevant variables) and the difficulties of addressing confounding variables (persons who underwent screening may have other characteristics, such as healthier lifestyles, which may have contributed to the observed outcomes). Matching of the two groups by known confounding variables (e.g., age and risk factors) and the formulation of statistical adjustments in the odds ratios to control for such cofactors address some of these problems, but such studies cannot exclude the role of unknown or unmeasured confounding variables.

Prospective cohort studies overcome some of the limitations of retrospective analyses by establishing the variables of interest at the start of the study and collecting them systematically over time, often with long periods of follow-up, but the potential influence of confounding remains. Unless the decision to screen patients is made randomly, it is possible that screened and unscreened persons differ in characteristics other than screening that may account, at least in part, for the observed outcomes. It is this concern that accounts for the primacy of randomized controlled trials in demonstrating the effectiveness of screening (Jadad, 1998). The defining characteristic of such trials is that the assignment of patients to undergo screening is made randomly, creating comparison groups that are essentially the same in all respects other than exposure to screening. Unrecognized, as well as known, confounding variables are thereby distributed equally and should therefore not contribute to observed differences in outcomes.

Outcome Measures

The persuasiveness of evidence that screening does or does not improve outcomes depends in large part on which outcomes are considered. The outcomes that matter most are health outcomes, which in this report refer to outcomes that are perceptible to patients (e.g., pain, dysfunction, and death). Because of the lengthy follow-up periods and methodological challenges associated with the measurement of such outcomes, however, many studies infer effectiveness by measuring intermediate or surrogate outcomes. Intermediate outcomes are findings that are not health outcomes in themselves (e.g., histological features of a cancer) but that are thought to increase the risk of such outcomes. Surrogate outcomes are indicators that correlate with but that are not themselves health outcomes (e.g., length of hospital stay). One must be cautious, however, in relying on such indicators to infer effectiveness because screening can improve intermediate outcomes without necessarily improving health (Bucher et al., 1999; Gøtzsche et al., 1996).

The most definitive health outcome in terms of both importance to patients and relative ease of measurement is death, and thus, much of the focus in cancer screening is on evaluating whether death rates are lowered. As noted earlier, lead-time bias limits the utility of measuring survival after diagnosis, and thus, the conventional basis of comparison in screening trials is the proportion of persons in the intervention and control groups who die from cancer in a defined follow-up period.

The customary endpoint is the cancer-specific mortality rate and not mortality from all causes. In theory, a demonstrated reduction in all-cause mortality would be ideal, to ensure that death from cancer is not traded for death from another cause (such as fatal complications induced by screening or treatment). But because any specific cancer accounts for a relatively small proportion of all deaths in a population, the statistical power required to demonstrate an effect on all-cause mortality would require trials to have a sample size and duration that would render them unfeasible. Although most trials are therefore not powered to show an effect on all-cause mortality, their failure to do so is often mistakenly interpreted as evidence of a lack of benefit or, more erroneously, as evidence that screening somehow induces deaths from other causes.

Results can be statistically significant without having clinical or public health significance. Proponents of screening, in making their case, often emphasize the relative benefits rather than the absolute benefits of interventions. The absolute benefit of a 20 percent relative reduction in the risk of dying from cancer depends on the baseline probability of death. If that probability is 100/100,000 over some defined interval of time, the intervention reduces the risk of death to 80/100,000, an absolute difference of 20/100,000 or an absolute risk reduction of 0.02 percent, a far less impressive figure than the relative risk reduction of 20 percent. Although both figures are true, the absolute risk reduction has important policy implications, because it indicates that a large number of people must receive the intervention to save the life of one individual. The number of people who need to be treated to save one life is known as “the number-needed-to-treat” (NNT), which in this case is 100/0.02, or 5,000 people.

The NNT and its specific counterpart in screening, “the number-needed-to-screen” (NNS), have their limitations. They do not stipulate the health outcome prevented—two screening tests can have the same NNS, with, for example, one saving lives and the other one preventing fractures—nor do they address the harms and costs of interventions. In the context of screening, however, this measure can help place in context the size of the populations that do and do not benefit from early detection (Rembold, 1998). In the example presented above, most of the 5,000 people who must be screened to save one life will experience no personal benefit from screening but will be exposed to the inconvenience, discomfort, and potential harms of the screening experience. The difficult policy and ethical challenge in recommending that screening test turns on whether it is proper to expose that number of people to those particular harms to benefit one individual. Payers must decide whether it is worth the monetary costs (see below), but even without such considerations, for health reasons alone the NNS may sometimes be too large to make the argument that the population is better off with screening.

HARMS

Importance of Harms in Cancer Screening

The Hippocratic principle of primum non nocere (“first, do no harm”) establishes an ethical duty to ensure that medical interventions result in more good than harm. This duty is manifest throughout medicine but takes on special implications with regard to cancer screening (Ewart, 2000; Stewart-Brown and Farmer, 1997). Unlike patients who seek treatment for health complaints, persons undergoing cancer screening are, by definition, asymptomatic. With some exceptions (Rogers, 2000), most ethicists recognize a stronger moral imperative to avoid net harm in the case of preventive interventions and to ensure that what is offered is good for people.

Screening differs from conventional treatment interventions because in the latter case everyone exposed to a potential harm has a disorder, whereas the group exposed to potential harms from screening is the entire screened population, which is generally large (sometimes numbering in the millions) and predominantly free of disease. For every person found to have disease through screening, many more people in the screened population are exposed to potential harms. If the NNS for a screening test is 5,000, those who advocate screening must make the ethical argument that the large benefits to 1 individual justify the sum of the harms to which 4,999 people are exposed. Whether this holds up to moral scrutiny depends on the nature of the harms.

Test Procedure Although many screening test procedures are innocuous, involving little more than venipuncture, others (e.g., colonoscopy) are associated with various degrees of cost, discomfort, and potential complications (e.g., colonic perforation). Separate from the physical or psychological harms of the procedure are other difficulties such as the inconvenience of arranging testing, embarrassment in undergoing the procedure, and the unpleasantness of preparing for some procedures (e.g., bowel preparation for colonoscopy). The degree to which these matters are troublesome to patients is highly dependent on individual circumstances and personal values.

False-Positive and False-Negative Results The more common adverse effects of screening emanate from the information generated by testing. Positive or indeterminate test results plant the seed of anxiety, at least for some patients and especially for serious diseases, and usually require follow-up tests to determine whether the disease is present. In some cases the follow-up procedure is simple, such as a repeat blood test, but in other cases the patient is advised to undergo more invasive studies or procedures (e.g., biopsy) that may be associated with greater inconvenience, discomfort, or potential complications. Patients awaiting appointments for these confirmatory tests spend days or weeks, often in a state of worry and anxiety, not knowing whether they have a serious disease. The tally of harms against which the potential benefits of screening should be measured includes psychological morbidity and the accumulated potential physical risks associated with the cascade of tests and treatments triggered by screening. These harms are often borne by a sizable proportion of the screened population. For tests with a very low PPV, the net sum of the harms experienced by persons without disease can outweigh the benefits to the small proportion of individuals with the disease.

For some individuals the harms do not end once a false-positive result has been clarified. Patients who are ultimately told that their positive test results were erroneous may continue to believe that something is still wrong. For example, as discussed later in this chapter, some studies of women who have received false-positive mammography results reveal continued anxiety on long-term follow-up, well after biopsies have shown no breast cancer.

These concerns, as well as ethical and legal ramifications, become more intense in the context of emerging technologies that screen for genetic susceptibility to cancer. Although such testing is currently considered primarily for families with a high likelihood of having uncommon familial cancer syndromes, technological advances raise the specter that screening of the population for genetic susceptibility to cancer will become more commonplace (Evans et al., 2001; Golub, 2001; Wilfond et al., 1997). The growing difficulty of keeping pace with the breathtaking advances in genetic technology makes it more likely that primary care physicians will provide misleading interpretations to patients undergoing genetic screening for cancer (Emery and Hayflick, 2001). The cascade of potential adverse consequences of genetic screening can reach beyond the patient to relatives and descendants.

Finally, normal results on screening are potentially harmful. False-negative results allow cancers to escape detection, but even true-negative results pose a potential risk. Patients may mistakenly assume that they are no longer in need of repeat screening at recommended intervals or that the clean bill of health makes it unnecessary to engage in other preventive behaviors or to seek clinical attention for abnormal signs or symptoms. Arguing against routine screening for lung cancer, Frame wrote: “A significant potential harm of screening is that smokers will interpret negative results of screening tests as assurance that they are disease free and will be less motivated to quit smoking” (Frame, 2000, p. 1982).

Harms of Treatment

The benefits of early detection of cancer must be weighed not only against the harms of screening but also against the harms of treatment. For some conditions it is possible for the adverse effects of treatment to offset the more modest benefits of screening. This is especially problematic when screening results in the overdiagnosis of latent cancers of uncertain clinical significance. Lesions that may pose little threat to patients' health are often treated, sometimes aggressively, with surgery, chemotherapy, radiotherapy, or other modalities that carry substantial risks of untoward side effects and complications.

A less obvious, but very real, harm of screening is the diversion of attention, time, and resources away from the primary prevention of cancer and other measures with greater health benefit to patients than screening. The American public has a particular fascination with technology (Smith, 2001) and is often more interested in getting a test for cancer than in adopting lifestyle measures that can prevent the very occurrence of cancer (e.g., the cessation of smoking or the consumption of a healthy diet). The limited time that individuals spend with clinicians is often consumed with testing, discussions of whether testing is necessary, and the interpretation of test results, leaving little time to talk about smoking cessation, dietary modification, or other primary prevention issues. This “opportunity cost” represents one of the most important arguments against the promotion of screening tests of unknown effectiveness, even if they are harmless and low-cost.

Costs

Expenditures for Screening

Those who are concerned about the costs of screening measure expenditures in different ways. The simplest measures are clinician and laboratory charges for screening. Charges are not the same as costs, however, especially if one considers the indirect costs of screening, such as the administrative overhead required to process the results or the patient's lost time from work. What counts as an expenditure also depends on one's perspective. The costs faced by a managed care organization or an employer differ from the copayments faced by the patient. A population-based perspective, which is recommended for economic analyses, considers all the costs faced by society, including time spent in treatment and time spent by unpaid caretakers (Gold et al., 1996).

The obvious criticism of considering only the up-front costs of screening is that it ignores the benefits, in both health and economic terms, of early detection. The pivotal economic question for most health services is not how much they cost but their value (the ratio of expenditures to benefits). An intervention with a low value, even if it is relatively cheap in terms of up-front costs, represents a poor use of resources, whereas a highly costly service may be an excellent value if it is highly effective. This argument makes perfect sense from a societal perspective but is often less compelling to insurance plans. Managed care organizations, for example, in which patients are unlikely to remain members for more than a few years, face the up-front costs of screening with little confidence that they will be the recipients of the downstream economic benefits.

Cost Analyses

The argument that screening pays for itself is often made on economic grounds, based on the contention that the up-front costs of screening are offset by the economic benefits of avoiding treatment for advanced-stage cancer. This argument is an example of cost-benefit analysis, in which the benefit is, by definition, measured in monetary units. This approach is not favored in economic analyses for both methodological and moral reasons. The methodological limitation stems from difficulties in quantifying the economic benefits of early detection and treatment. The moral difficulty is in assigning a monetary value to improved health or lengthened survival.

The more accepted approach is to compare health services on the basis of how much health benefit is purchased per dollar. A common measure is the cost-effectiveness ratio, in which the numerator is the monetary cost of the intervention and the denominator is the incremental health gain (e.g., years of life saved) achieved with that expenditure. A screening test can be more or less cost-effective, depending on how it is used (Russell, 2000), and estimates can vary markedly depending on the methods used in the cost-effectiveness analysis. The greatest ambiguity surrounding cost-effectiveness analyses is deciding which cutoff ratio constitutes a “good buy”: how does one decide whether $50,000, $70,000, or $100,000 per year of life saved is a good value? As reviewed below, cancer screening escapes many of these quandaries because the cost-effectiveness ratios for recommended tests generally fall within widely accepted ranges of affordability, often below $20,000 per year of life saved.

The measurement of health benefits in terms of years of life saved does not capture the beneficial effects of screening on morbidity or quality of life, and thus, the ideal approach is to measure cost-utility ratios, in which health benefits are adjusted to reflect the relative importance of the outcome to patients. A common example of this ratio is the dollar cost of interventions per quality-adjusted life year (QALY), or disability-adjusted life year.
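The relationship between the two ratios can be illustrated with a brief sketch (Python; the dollar amounts, life-years gained, and utility weight are all hypothetical values chosen for illustration):

```python
def cost_effectiveness_ratio(incremental_cost, life_years_gained):
    """Dollars spent per (unadjusted) year of life saved."""
    return incremental_cost / life_years_gained

def cost_per_qaly(incremental_cost, life_years_gained, utility_weight):
    """Cost-utility ratio: each life year is discounted by the utility
    (0 to 1) that patients assign to the health state in which it is lived."""
    return incremental_cost / (life_years_gained * utility_weight)

# Hypothetical screening program: costs $4 million more than no screening
# and gains 250 life-years, lived at an average utility of 0.8.
print(cost_effectiveness_ratio(4_000_000, 250))  # 16000.0 per life year
print(cost_per_qaly(4_000_000, 250, 0.8))        # 20000.0 per QALY
```

Because utilities are below 1, quality adjustment raises the ratio: the same program that costs $16,000 per year of life saved costs $20,000 per QALY.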

The validity of such comparisons—and the validity of economic calculations more generally—is highly dependent on the quality of the available cost estimates, which is often poor, and on the sophistication of the analytical methods. Standards for good cost-effectiveness analyses have been developed (Gold et al., 1996), but few published studies abide by these methods. Estimates of the cost-effectiveness of health services, cancer screening tests included, often vary widely because of differences in how the analyses were approached.

Trade-Offs and Shared Decision Making

Responsible decisions about whether cancer screening is appropriate require a methodical weighing of benefits and harms to determine whether the screened population gains more than it loses through screening. This judgment can be straightforward when the trade-offs are stark. Screening of all women for ovarian cancer, for example, is likely to result in unnecessary biopsies and laparotomies for a large proportion of women, few of whom will experience any proven benefit, making it clear that routine screening is inappropriate (National Institutes of Health, 1994). Cervical cancer screening illustrates a scale that tips the other way. The sizable benefits in terms of decreased mortality rates clearly offset the inconvenience of testing and the consequences of false-positive results, so that routine screening of the population has been widely accepted for decades and has been implemented around the world.

The more difficult controversies in cancer screening relate to what Kassirer and Pauker (1981) call “toss-ups,” in which the balance between benefits and harms is less obvious, so which way the scales tip depends on subjective value judgments. Cancer screening tests often have both proponents and critics who, examining the same body of evidence, reach different conclusions about whether benefits outweigh harms. The facts (i.e., the data) are often less contentious than their interpretation. In some cases disagreements occur because reviewers set different thresholds for the quality of evidence that must be demonstrated to infer effectiveness. An element of subjectivity enters into assessments of science: the relative importance of various study design flaws, the validity of generalizing from one clinical context to another, and whether there have been enough studies with sufficient consistency.

However, the subjective value judgments that are perhaps most dominant in controversies about cancer screening do not concern the magnitude of benefits and harms but their relative importance. Differences in utilities, the relative importance that people assign to potential outcomes, explain why those examining the same data reach different conclusions about whether screening is appropriate. Proponents consider the benefits worth the harms, whereas skeptics take the opposing view. Guideline developers attempting to decide whether cancer screening is good or bad for a population inevitably apply their own value judgments (in effect, the average utility of the committee) in reaching a decision. Implicit in this act is the presumption that their value judgments are representative of the population to which their recommendations will be applied.

Studies have demonstrated that the relative importance that physicians assign to potential outcomes is often discordant with the relative importance assigned by their patients (Holmes et al., 1987), but even guideline panels composed of patients would have difficulty with “toss-up” screening controversies because of the degree to which preferences vary from person to person. If a hypothetical screening test with an NNS of 10,000 to prevent 1 death from cancer induces a non-fatal pulmonary embolus in 10 patients, people will differ regarding the appropriateness of screening, depending on the importance that they assign to these outcomes. Studies show that patients given the same facts about four colorectal cancer screening tests make different choices about which option is best (Leard et al., 1997; Pignone et al., 1999). When the best choice depends highly on personal preferences that vary substantially in the population, groups that issue uniform guidelines for or against screening expose a sizable proportion of the population to the wrong choice (Woolf, 1997a). Many who follow the guideline are screened (or not screened) in a manner that they would have declined if given the opportunity to choose for themselves.

These considerations explain one of the most striking transitions in cancer screening guidelines in the last decade: for a growing number of screening tests, organizations are moving away from making uniform recommendations for or against screening and are instead encouraging clinicians to adopt shared decision making as part of an individualized, patient-centered approach to screening (Kassirer, 1994; Woolf, 1997a). This approach entails (1) giving patients information about potential benefits and harms, the probability of such outcomes, and the quality of evidence on which the estimates are based; (2) assisting them in considering their personal preferences and risk profile; and (3) helping them arrive at a choice that best suits their needs (Coulter, 1997; Frosch and Kaplan, 1999; Woolf, 1997a). The trend is toward giving patients the opportunity, if they so desire, to decide for themselves which choice is best.

A full discussion of the opposing arguments and logistical impediments to shared decision making is beyond the scope of this report and is addressed elsewhere (Barry, 1999; Elwyn et al., 1999; Frosch and Kaplan, 1999; Lang, 2000; Woolf, 2001). However, because consideration of patient preferences is embedded in many of the cancer screening guidelines discussed in this report, several fundamental challenges to the idea deserve mention. First, although most patients appreciate information about options, the degree to which they want to exert control over decisions is unclear (Deber et al., 1996; Strull et al., 1984). Many patients would rather have their physicians make such decisions and are overwhelmed both by the cognitive challenges of processing the facts and by the emotional toll of having made the wrong choice on a life-threatening matter.

For their part, physicians are unaccustomed to truly informed decision making; in one study, they fully informed patients for only 9 percent of decisions (Braddock et al., 1999). Some physicians consider the sharing of decisions an abdication of their role as doctors and a slight to their medical expertise. Others support the notion but in busy practices are too hurried to entertain the lengthy discussions that such decision making would require. In a recent survey, only 17 percent of internists reported that they would make their decision to order a prostate-specific antigen (PSA) test contingent on patient preferences (Dunn et al., 2001). The leading reasons for not discussing PSA testing were a lack of time (51 percent), the complexity of the topic (48 percent), and a language barrier between the physician and the patient (32 percent) (Dunn et al., 2001). Many physicians lack the knowledge, decision aids, or support staff to give patients the objective data that they need to make informed choices.

There are considerable difficulties in presenting risk information to patients (Bogardus et al., 1999) and uncertainties about how best to communicate probabilities (Goyder et al., 2000). In an intriguing study of 500 women, 96 percent of whom were high school graduates, 80 to 90 percent were unable to interpret simple probabilities (e.g., how many coin flips will come up heads?) or to understand relative or absolute risk reductions when they were applied to their perceived risk of breast cancer (Schwartz et al., 1997). How information is framed affects its interpretation: relative benefits are more impressive than absolute risk reductions, and percent gains are more attractive than percent losses. Visual displays to convey risks can help, but the optimal approach is unclear. A systematic review of 17 studies of aids for shared decision making concluded that they improved knowledge, reduced decisional conflict, and stimulated patients to engage in decision making without increasing their anxiety; but they had variable effects on decisions and no discernible effect on satisfaction (O'Connor et al., 1999).

The boundaries for shared decision making are indistinct (Woolf, 2001). It is unclear whether health systems and payers can afford to provide the choices that patients might prefer. Clearly, it is not the duty of clinicians and health systems to deliver services that patients might want but that are ineffective or medically contraindicated. What constitutes the dividing line between such services and the reasonable options from which patients have a right to choose must be delineated if the trend reflected in cancer screening guidelines continues its progression into other areas of medicine.

EFFECTIVENESS OF CANCER SCREENING

The remainder of this chapter focuses on four cancers for which there is a large body of evidence regarding the effectiveness of routine screening, including three cancers that are among the leading causes of cancer deaths in the United States: breast, colorectal, and prostate cancer. The review also examines cervical cancer, which claims fewer lives but for which important evidence and screening guidelines are available. Screening for lung cancer is discussed in detail in Chapter 6, as it offers a timely case study of the ongoing challenge of dealing with the uncertainties of screening efficacy as new screening technologies are developed.

Screening through routine self-examination, physician examination, or laboratory testing and imaging studies has been advocated for some of the cancers excluded from this report. Examples include cancers of the ovary, oral cavity, stomach, pancreas, bladder, endometrium, testis, and thyroid. For various reasons, however, few organizations recommend routine screening of the population for these conditions, and therefore, they are not reviewed here. Many guidelines do advocate screening for these cancers in individuals at especially high risk. For example, although no organization recommends routine screening of the population for ovarian cancer, a National Institutes of Health Consensus Conference did advocate screening of women at high risk for ovarian cancer due to their cancer family history (National Institutes of Health, 1994). Skin cancer screening is also excluded from this review. Although some organizations (e.g., American Cancer Society) recommend for the adult general population periodic examination of the skin by a physician, there is no direct evidence from either randomized trials or case-control studies that such screening reduces rates of morbidity or mortality from skin cancer (U.S. Preventive Services Task Force, 2001b). In a recent review, the Institute of Medicine (2000c) concluded that evidence was insufficient to support the adoption of a new program of clinical screening for skin cancer among asymptomatic Medicare beneficiaries.

The review presented in the remainder of this chapter focuses on the cancers detected and the screening options available in the United States. The findings may not be applicable in other countries, where there may be important differences in cancer prevalence, the availability and performance characteristics of screening tests, the values placed on benefits and harms, and the structures and resources of health care systems.

The evidence reviewed in this chapter was compiled from multiple sources, beginning with studies previously evaluated or currently under review by the U.S. Preventive Services Task Force (USPSTF) and other groups that have developed evidence-based cancer screening guidelines. It was supplemented by a manual search of recent literature on cancer screening and a computerized search of the National Library of Medicine's bibliographical MEDLINE database, conducted in February 2001, of relevant studies published since 1995, the closing year for the review of evidence for the second edition of the Guide to Clinical Preventive Services, the report of USPSTF (1996). Evidence published after that date is not included in the report.

Colorectal Cancer

The colorectal screening tests considered in the review in this part of the chapter are the fecal occult blood test (FOBT), flexible sigmoidoscopy, double-contrast barium enema, and colonoscopy. The review does not consider a variety of investigational technologies, such as computerized colography (virtual colonoscopy), and testing of feces for mutations in DNA (Traverso et al., 2002), which are less invasive than current screening options but which have not been sufficiently validated for routine use in the clinical setting.

Recent studies have shown an association between screening for colorectal cancer and a decreased incidence of the disease (Mandel et al., 2000), lending support to the notion that the removal of polyps, precipitated by screening, prevents colorectal cancer. For many years, the most compelling evidence was from the National Polyp Study, which demonstrated that the detection and removal of adenomatous polyps in patients with a prior history of such lesions could reduce the subsequent incidence of colorectal cancer by 76 to 90 percent (Winawer et al., 1993b). Such evidence lends support to the existence of an adenoma-carcinoma sequence: the hypothesis that colorectal cancer arises largely from adenomatous polyps. That said, an unknown proportion of colorectal cancers may arise de novo or from hyperplastic polyps (Bedenne et al., 1992). Concerns remain over flat lesions that are not discernible on colonoscopy and that may progress to cancer (Rembacken et al., 2000).

Certain patients are at increased risk for colorectal cancer; these high-risk groups account for 30 to 35 percent of colorectal cancer cases (Winawer et al., 1997). Risk factors include a personal or family history of polyps or prior colorectal cancer and inflammatory bowel disease.

Fecal Occult Blood Testing

The test that has undergone the most extensive evaluation is the home FOBT, which in most studies consists of two samples from each of three consecutive stool specimens (six samples in total) applied to guaiac-impregnated cards. The cards are mailed or delivered to the clinician's office or screening center, where a positive reaction indicates the possible presence of occult blood. The sensitivity of a single FOBT is limited, in some studies as low as 40 percent (Ransohoff and Lang, 1997), because cancers and polyps may not bleed or may bleed intermittently and because blood is not distributed uniformly in the stool. The test is therefore typically repeated every 1 to 2 years to improve the likelihood of sampling blood and thereby achieve greater “program” sensitivity. In screening trials, a program of FOBT every 1 to 2 years has been reported to detect 72 to 92 percent of colorectal cancers (Hardcastle et al., 1996; Kronborg et al., 1996; Mandel et al., 1993), with the higher values obtained by rehydration of the slides. Rehydration increases sensitivity at the expense of specificity, reducing the latter from approximately 98 percent (unrehydrated slides) to 90 to 92 percent. FOBT is less sensitive for the detection of polyps than for the detection of cancers because polyps are less likely to bleed.
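The gain in “program” sensitivity from repeated testing can be sketched as follows (Python; the calculation assumes, simplistically, that successive screening rounds are independent draws, which intermittent bleeding makes only approximately true):

```python
def program_sensitivity(single_round_sensitivity, rounds):
    """Probability that at least one of `rounds` screens detects the lesion,
    assuming each round is an independent trial (a simplification)."""
    return 1 - (1 - single_round_sensitivity) ** rounds

# A single FOBT with 40 percent sensitivity, repeated over several rounds:
for n in (1, 2, 3, 4):
    print(n, round(program_sensitivity(0.40, n), 3))
```

Under this independence assumption, cumulative sensitivity rises from 40 percent for one round to roughly 78 to 87 percent after three or four rounds, in the neighborhood of the 72 to 92 percent range reported for multi-year screening programs.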

The chief limitation of FOBT is its limited specificity for the detection of colorectal neoplasms. False-positive results may occur if patients have ingested peroxidase-containing foods or gastric irritants (e.g., anti-inflammatory agents) or bleed from noncancerous sources anywhere in the gastrointestinal tract (e.g., gastritis or duodenal ulcers or hemorrhoids). The problems with specificity are reflected in the reported PPV of FOBT, which for unrehydrated slides is 5 to 18 percent for cancer and which for the combination of curable cancers or large adenomas is 20 to 40 percent (Ransohoff and Lang, 1997). In a trial with primarily rehydrated slides, the PPV was 2.2 percent (Mandel et al., 1993). Thus, a substantial majority of persons with abnormal FOBT results do not have neoplasms but must undergo further evaluation (typically colonoscopy) to rule out disease. In a 13-year trial of screening by FOBT, 38 percent of patients invited for annual screening underwent at least one colonoscopy (Mandel et al., 1993).
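The dependence of the PPV on specificity and on the low prevalence of disease in a screened population follows from Bayes' rule, as the following sketch shows (Python; the prevalence figure is an assumption chosen for illustration, and the sensitivity and specificity values are taken from the ranges cited above):

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Fraction of positive test results that are true positives (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assumed prevalence of undetected colorectal cancer: 0.25 percent.
# Rehydrated slides (higher sensitivity, lower specificity):
print(round(positive_predictive_value(0.92, 0.90, 0.0025), 3))  # ~0.023
# Unrehydrated slides (lower sensitivity, higher specificity):
print(round(positive_predictive_value(0.72, 0.98, 0.0025), 3))  # ~0.083
```

Even at 98 percent specificity, fewer than 1 in 10 positive results reflects cancer, because the disease is rare in the screened population; at 90 percent specificity the PPV falls to roughly 2 percent, consistent with the trial result cited above.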

Office FOBT (testing of stool from the examination glove following a digital rectal examination) is thought to have lower sensitivity and specificity than home FOBT, but direct proof is limited. Studies have reported that the two tests have equivalent yields and PPVs (Bini et al., 1999; Eisner and Lewis, 1991), but because the profiles of the patient populations receiving each test may have been dissimilar, such findings provide a weak basis for contrasting the sensitivities and specificities of the tests.

Newer stool tests in development may increase the sensitivity and specificity of screening. These include immunochemical tests for blood and molecular biology-based analysis for neoplastic markers (e.g., testing for DNA markers of colorectal cancer present in the stool) (Ahlquist et al., 2000). One study of such a test for DNA reported sensitivities of 91 percent for cancer and 82 percent for adenomas (at least 1 centimeter [cm] in diameter) and a specificity of 93 percent (Ahlquist et al., 2000), though others have found less sensitivity for other mutations (Traverso et al., 2002).

Randomized controlled trials in Minnesota (Mandel et al., 1993, 1999, 2000), the United Kingdom (Hardcastle et al., 1996), and Denmark (Kronborg et al., 1996) have demonstrated that a program of annual or biennial screening by home FOBT reduces the rate of mortality from colorectal cancer by 15 to 33 percent (Table 5.2). This was achieved by referring patients with positive results on rehydrated FOBT slides for colonoscopy or barium enema. The Minnesota trial demonstrated that annual and biennial screening by FOBT reduced the mortality rates by 33 percent (Mandel et al., 1993) and 21 percent (Mandel et al., 1999), respectively. The European trials (in which screening was biennial, unrehydrated cards were used, and rates of colonoscopy were lower) reported lower reductions in the mortality rates (15 to 18 percent). Longer follow-up data from the Minnesota trial revealed that screening also reduced the incidence of colorectal cancer by 20 percent and 17 percent, respectively, for the annually and biennially screened groups (Mandel et al., 2000). This suggests that, in addition to secondary prevention (early detection of cancer), FOBT also achieves primary prevention (preventing the occurrence of cancer), presumably by leading those screened to colonoscopy, thus facilitating the detection (and removal) of premalignant polyps.

TABLE 5.2. Randomized Controlled Trials of Fecal Occult Blood Testing.

Flexible Sigmoidoscopy

Endoscopy (sigmoidoscopy or colonoscopy) has a high sensitivity for the detection of large adenomas and cancers, but only for the portion of the bowel that is directly visualized. The 60-cm flexible sigmoidoscope can reach as far as the descending colon in approximately 80 percent of examinations and is thus capable of reaching 40 to 60 percent of the colorectum (Winawer et al., 1997). Examiner skill, bowel preparation, and spasm influence the depth of insertion. Although the detection of distal lesions by sigmoidoscopy often prompts examinations by colonoscopy or barium enema that thereby can lead to the detection of proximal lesions beyond the reach of the sigmoidoscope, this approach fails to detect proximal lesions that are unaccompanied by distal disease. Colonoscopic studies suggest that 20 to 32 percent of advanced adenomas or cancers would go undetected if patients were screened only by sigmoidoscopy (Imperiale et al., 2000; Lieberman et al., 2000). Thus, even though sigmoidoscopy directly examines only the lower half of the colorectum, it can lead to the identification of advanced adenomas or cancers in 70 to 80 percent of people who have such lesions.

Although the specificity of sigmoidoscopy falls short of 100 percent because normal mucosa is occasionally mistaken for polyps, a more common form of “false-positive” result occurs even with tissue confirmation of an adenomatous polyp, because most adenomas do not progress to cancer. Tubulovillous, villous, or large (greater than 1 cm in diameter) adenomas are established precursors to colorectal cancer, and their presence increases the risk of developing colorectal cancer (Atkin et al., 1992), but only some progress to cancer.

Randomized controlled trials of sigmoidoscopy screening with mortality as an endpoint are under way, and thus, prospective evidence that sigmoidoscopy has a benefit in terms of reducing the rate of mortality is lacking. The Prostate, Lung, Colorectal, and Ovarian trial sponsored by the National Cancer Institute has randomized more than 150,000 subjects to receive sigmoidoscopy screening or not, but the trial will not be completed until 2014 unless it is stopped early due to a large mortality benefit (Gohagan et al., 2000). Case-control studies have demonstrated, however, that patients who die of colorectal cancer are significantly less likely than matched controls to have undergone sigmoidoscopy (Muller and Sonnenberg, 1995; Newcomb et al., 1992; Selby et al., 1992). To address concerns that confounding variables might account for this observation, Selby and colleagues (1992) conducted an intriguing analysis in which they demonstrated that the benefit was observed only for lesions within 20 cm of the anus, a pattern consistent with an effect from sigmoidoscopy and unlikely to result from lifestyle or other confounders. The adjusted odds ratio for the detection of such lesions was 0.41 (95 percent CI, 0.25 to 0.69), suggesting a 59 percent reduction in the rate of mortality, whereas there was no benefit for the detection of more proximal colon cancers, beyond the reach of the scope (adjusted odds ratio 0.96) (Selby et al., 1992).

Several studies in California report an association between a lower incidence of distal colorectal cancers and increased rates of screening by sigmoidoscopy (Cress et al., 2000; Inciardi et al., 2000). It is unclear whether these temporal trends can be attributed to sigmoidoscopy or other confounding variables, such as dietary or environmental changes.

The combination of FOBT and sigmoidoscopy is often recommended to enhance effectiveness, but evidence of its incremental benefit is limited. A nonrandomized trial found that adding FOBT to rigid sigmoidoscopy detected more colorectal cancers on initial screening, but the mortality rate was not significantly lowered (Winawer et al., 1993a). Other trials have examined the incremental benefit of adding flexible sigmoidoscopy to FOBT and have found that the amount of previously unrecognized disease identified was significantly increased compared to FOBT alone (Berry et al., 1997; Rasmussen et al., 1999).

Barium Enema

Data regarding the accuracy of barium enema for the detection of polyps and colon cancer in asymptomatic screened populations are limited. Studies that were poorly designed to assess test accuracy report sensitivities of 70 to 90 percent for the detection of polyps larger than 1 cm and 55 to 85 percent for the detection of colorectal cancer and specificities of 90 to 95 percent and 99 percent, respectively (Winawer et al., 1997). One study reported that double-contrast barium enema has a sensitivity of 85 percent for the detection of colorectal cancer (Rex et al., 1997b). In one study, patients undergoing surveillance for previously diagnosed adenomatous polyps underwent double-contrast barium enema followed by colonoscopy in which the endoscopist was unaware of the results of the barium enema. (The endoscopist was later given the results if a neoplasm was seen so that the involved segment could be reexamined.) Compared with colonoscopy, barium enema detected only 48 percent of the polyps larger than 1 cm and had an estimated specificity of 85 percent (Winawer et al., 2000). No trial has studied the effect of barium enema screening on the incidence of or rate of mortality from colorectal cancer.

Colonoscopy

Colonoscopy serves as the reference standard for most studies, and thus, its sensitivity and specificity are difficult to determine. Studies of back-to-back colonoscopic examinations report a sensitivity of 90 percent for the detection of polyps larger than 1 cm (Rex et al., 1997a). Specificity for the correct tissue diagnosis by biopsy approaches 100 percent. As noted earlier, however, a large proportion of polyps, although correctly identified by colonoscopy, do not progress to clinically significant disease.

A trial to test the effect of colonoscopy screening on mortality rates is under consideration. Given the evidence from sigmoidoscopy screening reviewed above, one can postulate that the effect of colonoscopy screening on mortality would be of equal or, more likely, of greater magnitude, but direct evidence is lacking. One of the previously mentioned case-control studies of sigmoidoscopy, which also included colonoscopic examinations, reported that the odds ratio for reduced colon cancer mortality was 0.47 (95 percent CI, 0.37 to 0.58) for patients who underwent colonoscopy (Muller and Sonnenberg, 1995), but this subgroup analysis is of limited persuasiveness in proving a benefit in terms of a reduction in the rate of mortality.

Colonoscopy screening is routinely advocated for patients with a family history of familial polyposis syndrome and hereditary nonpolyposis colorectal cancer or for patients with inflammatory bowel disease. Controlled observational studies suggest that such screening improves survival from familial polyposis syndrome, a hereditary syndrome associated with an extremely high risk of colorectal cancer (Heiskanen et al., 2000). A randomized trial to confirm the benefits of this practice is unlikely for ethical reasons.

Harms

Performing the FOBT is not harmful, but the results can be distressing. Patients screened for colorectal cancer were distressed by an invitation letter, a positive test result, and delay in the process of screening (e.g., a wait of 10 days to colonoscopy) but later reported that it was worthwhile to have had the test (Mant et al., 1990). The more important consequence is the large proportion of patients who must undergo colonoscopy because of false-positive FOBT results. Aside from the risks of bleeding and perforation outlined below, patients experience the discomfort, embarrassment, and inconvenience associated with bowel preparation and the examination itself, and the anxiety and other negative consequences of awaiting the evaluation of positive results.

A British trial of sigmoidoscopy screening reported bleeding in 3 percent of patients and moderate to severe pain in 14 percent (Atkin et al., 1998). Bowel perforation is estimated to occur in 1 to 2 of every 10,000 sigmoidoscopy examinations performed (Winawer et al., 1997). An audit of 49,501 sigmoidoscopies performed over 10 years at the Mayo Clinic in Scottsdale, Arizona, reported perforations in 1 in 25,000 examinations (Anderson et al., 2000).

Bowel perforation is more common with colonoscopy than with sigmoidoscopy, although the exact incidence is uncertain and varies depending on whether the procedure is diagnostic or therapeutic. It has been estimated that 1 in 1,000 patients experience perforation, 3 in 1,000 have major hemorrhage, and 1 to 3 in 10,000 die from the procedure (Winawer et al., 1997). A British trial reported a colonoscopy complication rate of 1 in 200, and most of these complications required surgical intervention (Robinson et al., 1999). An audit of 10,486 colonoscopies performed over 10 years at the Mayo Clinic in Scottsdale, Arizona, reported perforations and colonoscopy-related deaths in approximately 1 in 500 and 1 in 5,000 examinations, respectively (Anderson et al., 2000). Perforations are estimated to occur in 1 in 25,000 barium enema examinations, and death occurs in 1 in 55,000 examinations (Winawer et al., 1997).

Uncertainties About Periodicity

Direct evidence about the optimal frequency of screening exists only for FOBT, for which annual and biennial screening intervals were prospectively evaluated in the trials discussed earlier. Although sigmoidoscopy screening was once recommended every 3 years and more recently every 5 years, such guidelines are based on expert opinion and not outcomes data. Indirect evidence from case-control studies of sigmoidoscopy suggests that an interval of at least 6 years (Muller and Sonnenberg, 1995) or even 9 to 10 years (Selby et al., 1992) may have an equivalent benefit. Colonoscopy screening is recommended every 10 years not on the basis of direct evidence but on the basis of inferences about the time it takes for the evolution from normal mucosa to invasive carcinoma (approximately 10 years). A 5-year interval for barium enema screening is advocated because of concerns that its lower sensitivity makes it more likely that extant lesions would escape detection in the interim (Smith et al., 2001).

Cost-Effectiveness Studies

A series of recent analyses (Frazier et al., 2000; Khandker et al., 2000; Sonnenberg et al., 2000; Theuer et al., 2001) have confirmed the findings of earlier reports (Eddy, 1990a; Wagner et al., 1996; Winawer et al., 1997) that screening for colorectal cancer has acceptable cost-effectiveness ratios. Calculations of the cost-effectiveness of screening for colorectal cancer are highly sensitive to certain assumptions, such as the assumed time for polyps to evolve into cancer and the performance characteristics of the tests. Thus, individual reports often reach different conclusions about which test or which combination of tests is most cost-effective. The options that tend to dominate most analyses are the combination of annual FOBT and flexible sigmoidoscopy every 5 years or colonoscopy alone every 10 years. Cost-effectiveness ratios also vary to some extent by racial and ethnic group (Theuer et al., 2001). Nonetheless, in most reports every option, including colonoscopy, costs less than $20,000 per quality-adjusted life year (QALY).
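The ratios quoted above are incremental cost-effectiveness ratios: the added cost of a strategy divided by the quality-adjusted life years it adds relative to a comparator. A minimal sketch of this arithmetic, using entirely hypothetical per-person figures rather than estimates from the cited analyses:

```python
# Illustrative incremental cost-effectiveness ratio (ICER) calculation.
# All numbers are hypothetical, chosen only to show the arithmetic;
# they are not estimates from the studies cited in the text.

def icer(cost_new, cost_old, qaly_new, qaly_old):
    """Incremental cost per quality-adjusted life year gained."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)

# Hypothetical per-person lifetime figures, screened vs. unscreened cohort
ratio = icer(cost_new=2500.0, cost_old=1800.0,   # dollars
             qaly_new=17.95, qaly_old=17.90)     # quality-adjusted life years

print(round(ratio))  # 700 / 0.05 = 14,000 dollars per QALY
```

Because the QALY denominator is typically a small fraction of a life year, the ratio is highly sensitive to the modeling assumptions noted above, which is why individual reports disagree about the most cost-effective strategy.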

Subjective Value Judgments

Although there is universal agreement that screening for colorectal cancer beginning at age 50 is worthwhile, the controversy over which test is best is heavily influenced by subjective value judgments. The current trend is to promote colonoscopy as the “preferred” test, a message amplified by celebrities and the news media (Gorman, 2000) and formalized in a position recently adopted by one gastrointestinal specialty society (Rex et al., 2000). Advocacy for colonoscopy is fueled by its superior accuracy over other screening tests. Accepting the alternative of sigmoidoscopy was said in one editorial to be the equivalent of performing mammography on one breast (Podolsky, 2000). The counterargument is that sigmoidoscopy screening can detect 80 percent of people who have significant neoplasms and that the incremental benefit of colonoscopy may be offset by its added harms, costs, and limited availability.

In actuality, which screening test is best from an individual's perspective depends on subjective judgments regarding multiple variables: the relative importance that one assigns to scientific certainty, accuracy, benefit, safety, acceptability, costs, and feasibility (Woolf, 2000a). One value judgment concerns whether the absolute benefit of screening is large enough, which some question (Budenholzer, 1998). Although the relative reduction in mortality rates in FOBT trials was large (15 to 33 percent), the absolute benefit of screening is limited by the low prevalence of clinically significant disease in the screened population. The prevalence of advanced neoplasia (cancer, adenoma, or villous histology) in screened populations is only 6 percent (Imperiale et al., 2000; Lieberman et al., 2000), and cancers account for a relatively small proportion of these lesions. By one calculation, 1,374 patients must undergo screening by FOBT for 5 years to prevent 1 death from colorectal cancer (Rembold, 1998).
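Calculations such as Rembold's rest on the relationship between relative and absolute risk reduction: the number needed to screen is the reciprocal of the absolute risk reduction. The sketch below uses hypothetical inputs of roughly the right order of magnitude, not Rembold's actual figures:

```python
# Number needed to screen (NNS) = 1 / absolute risk reduction (ARR).
# Inputs below are hypothetical, chosen only to illustrate why NNS
# is large when the baseline risk is low; they are not Rembold's figures.

baseline_mortality = 0.003      # assumed 5-year risk of colorectal cancer death
relative_risk_reduction = 0.24  # within the 15 to 33 percent range cited above

arr = baseline_mortality * relative_risk_reduction
nns = 1 / arr
print(f"ARR = {arr:.5f}, NNS = {nns:.0f}")  # NNS on the order of 1,400
```

The same relative risk reduction applied to a population with a higher baseline risk would yield a proportionally smaller NNS, which is why absolute benefit, not relative benefit, drives these value judgments.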

Research has shown that patients' preferences, given the same factual information, vary considerably. For example, Leard and colleagues (1997) gave 100 patients a 10-minute, scripted oral presentation about the benefits and risks of the four screening tests for colorectal cancer. When the patients were asked which test they would prefer on the basis of the information that they had just heard, 38 chose colonoscopy, 31 chose FOBT, 14 selected barium enema, and 13 chose sigmoidoscopy (Leard et al., 1997). Another study found that preferences for FOBT and sigmoidoscopy changed with more information. When patients received general descriptions of colon cancer and the two tests, their order of preference was FOBT alone (45 percent), both tests (38 percent), and sigmoidoscopy alone (13 percent). When they next learned about test accuracy, more patients preferred both tests (47 percent) and fewer wanted FOBT alone (36 percent). When they were then told about out-of-pocket costs, preferences for FOBT alone rose to 53 percent and requests for both tests fell to 31 percent (Pignone et al., 1999).

With an appreciation of this heterogeneity in patient preferences, shared decision making was recommended in a guideline produced in 1997 by a consortium led by the American Gastroenterological Association and endorsed by the American Cancer Society and a dozen other organizations. Beginning with a generic statement that screening was recommended for average-risk persons beginning at age 50, it offered five options for screening regimens and encouraged physicians to individualize the choice: “Decisions about which test or tests to use should take into account the patient's preferences, the patient's age, any existing comorbidity, and local resources and expertise. The panel considers planning of colorectal cancer screening an ideal opportunity for clinicians to share the decision making process with their patients as well as for exercising their own clinical judgment” (Winawer et al., 1997, p. 603).

Subsequent guidelines, including a recent update from the American Cancer Society (Smith et al., 2001), continue to advocate this shared decision-making approach to increase the likelihood that patients will be screened in a manner that suits their preferences. Preliminary evidence suggests that such engagement gives patients the knowledge they need to make informed choices. One randomized trial compared an informational intervention that simulated one of two informed consent presentations (one emphasizing absolute risk, the other relative risk) with a brief scripted message about screening options in elderly patients. Interest in screening did not differ between the groups: 63 percent of patients in each intended to be screened. The informational intervention group demonstrated a more accurate understanding of the PPV, but those receiving the absolute risk information rated efficacy lower than did those who received relative risk information; controls rated efficacy highest (Wolf and Schorling, 2000).

Current Guidelines

Most organizations are in agreement that all Americans age 50 and older should be periodically screened for colorectal cancer and should be allowed to choose from options that include FOBT, flexible sigmoidoscopy, colonoscopy, or double-contrast barium enema (see Box 5.3 for guidelines published since 1996). Most organizations offer additional guidelines on the screening of adults at increased risk for colorectal cancer. For example, the American Cancer Society recommends colonoscopy screening beginning at age 40 (or 10 years before the age of occurrence of colorectal cancer or adenomatous polyps in the youngest affected person in the family) for persons with a first-degree relative who has or who has had colorectal cancer or adenomatous polyps before age 60 or for persons with two or more first-degree relatives who have or who have had colorectal cancer or adenomatous polyps at any age (Smith et al., 2001; Winawer et al., 1997). Most groups recommend early colonoscopy screening (by puberty or age 21) in persons with a family history of familial adenomatous polyps or hereditary nonpolyposis colon cancer and colonoscopy screening of persons with inflammatory bowel disease within 8 years of the onset of pan-colitis. Guidelines for surveillance of persons with previously diagnosed polyps or colorectal cancer have also been disseminated but fall outside the scope of this review.

BOX 5.3

Recommendations for Colorectal Cancer Screening in Average-Risk Persons.

Breast Cancer

The screening tests reviewed here include clinical breast examination, mammography, screening for mutations in BRCA1 and BRCA2 (mutations of breast cancer-associated tumor suppressor genes), and breast self-examination. Ultrasound and newer technologies that offer promise in improving the accuracy of breast cancer screening (e.g., full-field digital mammograms, magnetic resonance imaging, and filmless imaging) are not reviewed because their performance characteristics and incremental benefits over existing screening modalities require further evaluation before they are suitable for routine clinical use (Lewin et al., 2001; Orel, 2000). They are likely to offer important advances in the screening of high-risk groups, such as women suspected of carrying breast cancer susceptibility genes (Brown et al., 2000; Kuhl et al., 2000; Tilanus-Linthorst et al., 2000). The review also does not examine computer-based algorithms and other measures used to improve the accuracy of mammographic interpretations (Boccignone et al., 2000; Floyd et al., 2000).

The incidence of ductal carcinoma in situ (DCIS) (and of mastectomies) has increased in association with more widespread screening mammography (Ernster et al., 1996). DCIS now represents 12 to 20 percent of all newly diagnosed breast cancers (Feig, 2000; Winchester et al., 2000). In one community case series, DCIS accounted for 41 percent of cancers detected by screening in women ages 40 to 49 (Linver and Paster, 1997). The American Cancer Society estimated that 46,400 cases of carcinoma in situ breast cancer were diagnosed in 2001, and of these, approximately 88 percent were DCIS (Greenlee et al., 2001). A spectrum of controversy surrounds DCIS, ranging from those who are convinced that DCIS is potentially fatal to those who consider it a false-positive finding. There is no direct evidence from controlled trials that women benefit from the early detection and treatment of DCIS (Ernster et al., 1996). Observational data suggest that it is a major risk factor for the development of invasive carcinoma and for death from breast cancer (Ernster et al., 2000), but a large proportion of DCIS cases remain indolent (Page et al., 1995; Winchester et al., 2000). Among women diagnosed with DCIS, the 10-year death rate from breast cancer is 3.4 percent (Ernster et al., 2000). Diagnostic criteria for DCIS are imprecise, producing inconsistencies in the differentiation of low-grade DCIS from ductal hyperplasia. Whether small nonpalpable foci of DCIS require treatment and the proper treatment modalities are also debated (Morrow and Schnitt, 2000).

Clinical Breast Examination

An analysis based on pooled data from controlled trials and case-control studies calculated that the sensitivity and specificity of the clinical breast examination were 54 and 94 percent, respectively. Clinical breast examination was estimated to detect 3 to 45 percent of breast cancers that screening mammography missed (Barton et al., 1999). A sensitivity of 59 percent and a specificity of 93 percent were recently reported in an evaluation of 752,000 clinical breast examinations (Bobo et al., 2000). The sensitivity of the clinical examination appears to be higher in younger women, although it is less likely than mammography to detect small lesions. In a Canadian trial, lesions of at least 20 millimeters (mm) accounted for 56 percent of cancers detected in women screened only by physical examination, whereas they accounted for 21 percent of cancers detected in women screened only by mammography (Miller et al., 2000). False-positive results are common, however. The PPV of the clinical breast examination ranges from 4 to 50 percent (Barton et al., 1999), reflecting differences in prevalence rates, technique, and case definitions. In one study, the cumulative risk of a false-positive finding over 10 years of screening by clinical breast examination was 22 percent (Elmore et al., 1998).

There is limited evidence about the effect on mortality of routine clinical breast examination in isolation. Trials have evaluated it in combination with mammography and have reported significant reductions in rates of mortality (Alexander et al., 1999; Shapiro, 1988), but the contribution of the clinical breast examination to outcomes is uncertain because similar effects were observed in trials that evaluated mammography alone.

Mammography

Because several clinical trials combined screening mammography with clinical breast examination, the performance characteristics of mammography in isolation are uncertain. The sensitivity, specificity, and PPV reported by trials that examined mammography alone are 68 to 88, 95 to 98, and 4 to 22 percent, respectively (Fletcher et al., 1993). Another review that used modified definitions of sensitivity and specificity reported sensitivity and specificity ranges of 83 to 95 and 94 to 99 percent, respectively (Mushlin et al., 1998). The sensitivity of mammography is somewhat lower (e.g., 72 percent) in some community practice audits (Poplack et al., 2000). Higher and lower sensitivities have been reported in individual trials depending on case definitions, the length of follow-up, and the number of views and interpreters. A retrospective review found discernible cancers in 5 to 50 percent of older films obtained before incidence cancers were diagnosed, depending on the methods and number of radiologists involved (Moberg et al., 2000).

Estimates of the specificity of mammography vary widely (82 to 99 percent in one review), as do the reported PPVs (4 to 22 percent) (Fletcher et al., 1993). The Canadian Breast Cancer Screening Initiative reported a PPV of 7 to 8 percent in population-based screening across seven provinces (Paquette et al., 2000). As these values indicate, most abnormal screening mammograms are falsely positive. In one study, among 2,400 women screened over 10 years (median of four mammograms per woman), 24 percent had at least one false-positive mammogram; the estimated cumulative risks of a false-positive result and the need to have a biopsy increased to 49 and 19 percent, respectively, after 10 mammograms (Elmore et al., 1998). There are also considerable inter- and intraobserver variations among radiologists in the interpretation of mammograms (Kerlikowske et al., 1998).
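The cumulative figures reported by Elmore and colleagues behave roughly as repeated independent draws would predict. Under the simplifying (and not strictly accurate) assumption that each screen carries the same independent false-positive rate p, the cumulative risk over n screens is 1 - (1 - p)^n; with an assumed per-screen rate of 6.5 percent, this reproduces figures close to those cited:

```python
# Cumulative probability of at least one false-positive result over n screens,
# assuming (unrealistically) that screens are independent with a constant
# false-positive rate p. The per-mammogram rate below is an assumption.

def cumulative_false_positive(p, n):
    return 1 - (1 - p) ** n

p = 0.065  # assumed per-mammogram false-positive rate
for n in (4, 10):
    print(n, round(cumulative_false_positive(p, n), 2))  # 0.24 and 0.49
```

In practice, screens are not independent (a woman with dense breasts or a prior biopsy has an elevated rate at every screen), which is why the modeling analysis cited below reports wide variation around the average cumulative risk.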

The sensitivity of screening mammography appears to be lower for women under age 50 than for older women, approximately 14 to 20 percent lower in most screening trials (Fletcher et al., 1993; Kerlikowske et al., 1996, 2000). This is likely due to differences in breast density, the more rapid growth of cancers in younger women, or both. Sensitivity also appears to be lower with longer intervals between examinations. In one study, the sensitivity of the first mammogram for registry-confirmed breast cancer was 99 percent with 7 months of follow-up, but sensitivities were 93 and 86 percent with 13 and 25 months of follow-up, respectively (Kerlikowske et al., 1996).

As with any screening test, the PPV of screening mammography depends on the prevalence (pretest probability) of breast cancer. The PPV of the first screening mammogram is between 5 and 38 percent (Kerlikowske et al., 1993) and is generally higher with increasing age or a positive family history (Kerlikowske et al., 2000). In a Canadian trial, the rates of false positivity for the combination of screening mammography and clinical breast examination were 7 to 10 percent for women ages 40 to 49 and 5 to 8 percent for women ages 50 to 59 (Miller et al., 1992a,b). In an American analysis of claims data, for every 1,000 women aged 65 to 69 who underwent mammography, 85 had follow-up testing in the subsequent 8 months (23 had biopsies); the PPVs of mammograms requiring further testing were 8 percent for women aged 65 to 69 and 14 percent for older women (Welch and Fisher, 1998). An audit of 36,850 screening mammography examinations of women with a mean age of 59 reported that 5 percent of the examinations were abnormal, and approximately 25 percent of these led to a biopsy. The PPV of an abnormal mammogram was approximately 10 percent (Dee and Sickles, 2001). A modeling analysis predicted that the cumulative probability of experiencing a false-positive result over nine mammograms in women ages 40 to 69 was 43 percent, but it ranged widely, depending on the combination of risk factors for false-positive results (Christiansen et al., 2000). These included young age, the number of breast biopsies, family history of breast cancer, estrogen use, time between screenings, no comparison with previous mammograms, and the radiologist's tendency to label mammograms as abnormal.
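The dependence of the PPV on prevalence follows directly from Bayes' theorem. The sketch below holds sensitivity and specificity fixed at illustrative values within the ranges quoted above and varies only the pretest probability:

```python
# Positive predictive value (PPV) from sensitivity, specificity, and
# prevalence via Bayes' theorem. The inputs are illustrative values drawn
# from the ranges quoted in the text, not from any single study.

def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test characteristics, two prevalence levels: PPV rises with prevalence
print(round(ppv(0.80, 0.95, 0.005), 3))  # low prevalence -> PPV ~ 0.074
print(round(ppv(0.80, 0.95, 0.02), 3))   # higher prevalence -> PPV ~ 0.246
```

This is why the PPV of mammography is higher in older women and in women with a positive family history: the test is unchanged, but the pretest probability is higher.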

Effectiveness of Early Detection

Although the effectiveness of early detection of breast cancer has been examined in observational (e.g., case-control) studies, eight randomized controlled trials of screening mammography provide more compelling evidence and are less subject to bias from confounding variables and other methodological factors. This review therefore focuses on the trials.

The eight trials include studies from Scotland (Alexander et al., 1999), Canada (Miller et al., 1992a,b, 2000), and the United States (Shapiro, 1988) and four trials from Sweden: Malmö (Andersson and Janzon, 1997), Kopparberg and Östergotland (the Swedish Two-County Trial) (Tabàr et al., 1999, 2000), Stockholm (Frisell and Lidbrink, 1997), and Gothenburg (Nyström et al., 1993) (Table 5.3). The studies, conducted from 1963 to 1990, varied in certain important respects: the age of the women at entry ranged between 39 and 74 years; individual, cluster, and combined methods of randomization were used; the mammograms included one or two views; and clinical breast examination was not part of the intervention in most trials. The subjects underwent between two and six rounds of screening at intervals that ranged from 12 to 33 months, and follow-up in the most recent reports varied between 11 and 20 years.

TABLE 5.3. Results of the Randomized Clinical Trials of Screening Mammography.

Despite these differences, the results of the trials are relatively consistent. Most reported that screening mammography reduced the risk of death from breast cancer, with relative risk reductions ranging from 3 to 32 percent. The relative risk reduction achieved statistical significance in only a few trials, but the combined data from all trials show a highly significant reduced risk. The pooled relative risk ratio reported in one meta-analysis (across all age groups) was 0.79 (a 21 percent reduction in the risk of dying from breast cancer) (Kerlikowske et al., 1995), and the pooled relative risk ratio in a more recent meta-analysis based on longer follow-up data was 0.84 (95 percent CI, 0.77–0.91) (Humphrey et al., 2002). The narrow 95 percent CI around this estimate denotes the high level of statistical certainty about the magnitude of benefit.
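Pooled estimates of this kind are typically produced by inverse-variance weighting of the trial-level log relative risks; as more trials contribute information, the pooled confidence interval narrows even when no single trial is significant. A generic sketch of fixed-effect pooling with hypothetical trial estimates, not the actual trial data:

```python
# Fixed-effect (inverse-variance) pooling of relative risks on the log scale,
# the standard approach behind pooled estimates like those quoted above.
# The per-trial RRs and confidence intervals below are hypothetical.
import math

def pooled_rr(trials):
    """trials: list of (rr, ci_low, ci_high); returns (rr, ci_low, ci_high)."""
    weights, log_rrs = [], []
    for est, ci_lo, ci_hi in trials:
        se = (math.log(ci_hi) - math.log(ci_lo)) / (2 * 1.96)  # SE from CI width
        weights.append(1 / se ** 2)
        log_rrs.append(math.log(est))
    pooled_log = sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return tuple(math.exp(v) for v in
                 (pooled_log, pooled_log - 1.96 * pooled_se,
                  pooled_log + 1.96 * pooled_se))

# Hypothetical trials: individually nonsignificant, jointly significant
rr, lo, hi = pooled_rr([(0.79, 0.60, 1.04), (0.88, 0.70, 1.10),
                        (0.81, 0.62, 1.06)])
print(f"pooled RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Note that each hypothetical trial's interval crosses 1.0, yet the pooled interval excludes it, mirroring the pattern described above in which few individual trials achieved significance but the combined data did.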

That said, the designs of these trials were imperfect and have been criticized (Berry, 1998; Gøtzsche and Olsen, 2000). A recent analysis by Danish investigators, published as a Cochrane Collaboration report (Olsen and Gøtzsche, 2001), drew attention to poor documentation of concealment of allocation, imbalances in the baseline characteristics, inconsistent data on the number of subjects, and inadequate accounting for loss to follow-up and breast cancer ascertainment.2 In contrast to the positive findings in other trials, the combined evidence from the two trials that the investigators considered “adequately randomized” (Canada, Malmö) showed no effect on breast cancer mortality (pooled relative risk, 1.04; 95 percent CI, 0.84 to 1.27). The investigators concluded that screening mammography is unjustified, generating a flurry of broadcast and print media reports in early 2002 questioning the merits of mammography and prompting an announcement by the National Cancer Institute PDQ advisory panel expressing reservations about the evidence (Kolata, 2001a,b; Kolata, 2002a; Circling the Mammography Wagons, 2002; Henderson, 2002; Kolata and Moss, 2002; Ernster, 2002; Excerpts from Speech, 2002; Kolata, 2002b).

There has not been uniform acceptance of the conclusions of Olsen and Gøtzsche within the scientific community. Most of the design flaws of concern to the Danish authors have been discussed in the literature, and by the investigators themselves, for many years. Their review therefore did less to “discover” these problems than to give them public airing. The disagreement is less about whether the trials have flaws than about whether such imperfections are grounds, as the Danish authors contend, for invalidating the results. Not all analysts and review groups agree that the design flaws are fatal (Duffy and Tabàr, 2000; Cates and Senn, 2000; Law et al., 2000; Moss et al., 2000; Nyström, 2000; Hayes et al., 2000; de Koning, 2000; Duffy et al., 2001; Wald, 2000; Smith et al., 2002; Miller, 2001; Lee and Zuckerman, 2001; Senn, 2001; Duffy et al., 2002; Nyström et al., 2002; Gelmon and Olivotto, 2002; Humphrey et al., 2002).

Those who are willing to accept the trial data advance a variety of arguments. They note that some design imperfections, such as the imbalance between groups that occurs with cluster randomization, were fully expected at the outset and are a necessary compromise in conducting population-based screening trials. Other flaws that seem unacceptable by today's standards, such as failure to conceal allocation or lack of blinding, received less attention in prior decades when these trials were designed. Rather than rejecting trials in a formulaic fashion because of the presence of such flaws, groups such as the U.S. Preventive Services Task Force examine whether the biases potentially introduced by these design flaws were of sufficient magnitude, duration, and direction to account for the observed mortality reductions. They concluded that the mortality patterns are more likely attributable to a real effect from mammography than to an artifact from improper study designs (US Preventive Services Task Force, 2002; Humphrey et al., 2002). Others disagree with the very premises that underlie the Danish critique, noting the lack of empirical evidence demonstrating that the types of design flaws of concern to the Danish authors actually produce invalid results. Still others question the appropriateness of using all-cause mortality to judge the merits of a cancer screening test, or the analytical methods used in their meta-analysis (Cates and Senn, 2000; Duffy and Tabàr, 2000; Hayes et al., 2000; Law et al., 2000; Moss et al., 2000; Nyström, 2000; Woolf, 2000b).

Prior to this recent controversy, debates within the scientific community about mammography have historically centered less on whether mammography is effective than on questions about the magnitude of benefit and the proper starting and stopping ages and interval for screening. Controversies about magnitude relate to uncertainties about the relative risk reduction associated with screening (values vary somewhat from one meta-analysis to another) and how this uncertainty affects estimates of absolute benefit. Even with a relative risk reduction of 20–25 percent, large numbers of women (perhaps 1,000) must be screened to prevent a single death from breast cancer, raising a legitimate question about whether the benefit is of sufficient magnitude to outweigh potential harms (see below).

Since the 1980s there has been controversy regarding the effectiveness of screening mammography for women under age 50. The results for women who were age 40 to 49 at entry into the trials tend to be smaller in magnitude and less statistically significant than results for women ages 50 and older, possibly because the trials were not designed with sufficient statistical power (sample size, length of follow-up) to detect a difference in outcome for this age group. At the frequency with which mammography was conducted in the trials, the mortality benefits that do occur for women ages 40 to 49 appear to be delayed. A clear separation in survival curves suggesting a mortality benefit did not become apparent in most studies until after 8 to 10 years of follow-up, whereas benefits were observed within 4 to 5 years of screening in women ages 50 to 69.

Because of the modest absolute benefit observed when women ages 40 to 49 were screened, it has been uncertain whether the lower mortality rates observed in this age group are statistically significant or due to chance. For many years, no trial had reported reductions in mortality that were statistically significant for women ages 40 to 49 at the time of screening. Early meta-analyses for this age group also failed to demonstrate statistical significance. The pooled relative risk ratios reported by Kerlikowske and colleagues (1995) and Smart and colleagues (1995) were 0.92 (95 percent CI, 0.75 to 1.13) and 0.84 (95 percent CI, 0.69 to 1.02), respectively, suggesting no significant benefit. A 1993 overview of the data from the Swedish trials for women ages 40 to 49 reported a pooled relative risk ratio of 0.90 (95 percent CI, 0.65 to 1.24) (Nyström et al., 1993). The ratio was 0.91 (95 percent CI, 0.72 to 1.15) when the results were updated through 1997 (Jonsson et al., 2000). A meta-analysis by an Australian team, limited to seven trials, yielded a pooled relative risk ratio of 0.95 (95 percent CI, 0.77 to 1.18) (Glasziou et al., 1995).

In recent years, however, extended follow-up of women who were ages 40 to 49 when they were recruited into the trials and at the start of screening has revealed a delayed separation in survival curves that approaches or achieves statistical significance. This trend was first reported at a 1996 conference in Falun, Sweden (Swedish Cancer Society and the Swedish National Board of Health and Welfare, 1996). In 1997, updates from the Gothenburg and Malmö trials revealed for the first time a statistically significant reduction in the mortality rate for women in this age group. The relative risk ratios were 0.56 (95 percent CI, 0.32 to 0.98) (Bjurstam et al., 1997) and 0.64 (95 percent CI, 0.45 to 0.89) (Andersson and Janzon, 1997), respectively. A meta-analysis incorporating the new data with the results from other clinical trials concluded that the pooled relative risk ratio for women ages 40 to 49 was 0.82 (95 percent CI, 0.71 to 0.95) (Hendrick et al., 1997). Other meta-analyses suggested slightly more modest benefits that bordered on statistical significance, yielding values of 0.84 (95 percent CI, 0.71 to 1.01) (Kerlikowske, 1997), 0.85 (95 percent CI, 0.71 to 1.01) (Glasziou and Irwig, 1997), and 0.85 (95 percent CI, 0.73 to 0.99) (Humphrey et al., 2002). An updated 1997 meta-analysis of the Swedish trials yielded a relative risk of 0.77 (95 percent CI, 0.59 to 1.01) (Larsson et al., 1997).

The estimates presented above rely on subgroup analysis of the cohort aged 40 to 49. The only trial designed specifically to evaluate screening for women ages 40 to 49 was the Canadian National Breast Screening Study. That randomized trial, which assessed the effectiveness of the combination of annual mammography, physical breast examination, and teaching of breast self-examination, initially reported a relative risk ratio of 1.36 (95 percent CI, 0.84 to 2.21) (Miller et al., 1992a), which some interpreted as evidence that screening was harmful. The ratio reported at the latest follow-up was 1.06 (95 percent CI, 0.80 to 1.40) (Miller et al., 2002), supporting an interpretation of no effect.

The design and conduct of the Canadian trial have been criticized. Questions have been raised about the randomization methods, prompted by differences in the baseline characteristics of the study arms (Boyd, 1997; Burhenne and Burhenne, 1993; Tarone, 1995). However, independent investigations have disclosed no evidence that the randomization process was flawed or subverted (Bailar and MacMahon, 1997; Cohen et al., 1996). Critics have also asserted that the physical examination that preceded mammography may have confounded its effects, that the mammographic technique used by some centers was outdated or of poor quality, that the trial lacked statistical power, and that the trial included excess numbers of individuals with advanced disease (Burhenne and Burhenne, 1993; Kopans et al., 1994). Moreover, because patients in the control arm received clinical breast examination (rather than the “usual care” that control arms in other trials received), many have suggested that the trial was less likely to show a benefit, or at least tested a different question than did other mammography screening trials. To provide further evidence, additional trials of screening in this age group are under way or are being considered in Europe and the United States (Forrest and Alexander, 1995; Moss, 1999).

Whether the existing evidence is sufficient to justify routine screening mammography for women ages 40 to 49 has long been a matter of debate (Smith, 2000). Beyond concerns about statistical significance, some question the absolute benefit of screening, given the low prevalence of breast cancer in premenopausal women. Analyses suggest that 1,500 to 2,500 women aged 40 to 49 would need to undergo screening mammography for 10 to 15 years to avert 1 death from breast cancer (Berry, 1998; Salzmann et al., 1997). In light of the 8- to 10-year delay in observing a significant separation in survival curves for women who are aged 40 to 49 at the start of screening, some speculate that the benefit does not occur until after age 50 and could be retained if screening were deferred until that age. In the Health Insurance Plan (HIP) trial, for example, all of the decrease in mortality observed among women who began screening at age 45 to 49 occurred among those whose breast cancer was detected at ages 50 to 54 (Shapiro et al., 1982).
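The arithmetic behind such number-needed-to-screen (NNS) estimates is straightforward: the NNS is the reciprocal of the absolute risk reduction. A minimal sketch in Python, using hypothetical round numbers chosen only to illustrate how an estimate in the cited range arises, not figures from any trial:

```python
# Illustrative only: how the number needed to screen (NNS) follows from
# a baseline mortality risk and a relative risk. The inputs below are
# hypothetical round numbers, not data from the studies cited above.

def number_needed_to_screen(baseline_deaths_per_10k, relative_risk):
    """NNS = 1 / absolute risk reduction."""
    baseline_risk = baseline_deaths_per_10k / 10_000
    absolute_risk_reduction = baseline_risk * (1 - relative_risk)
    return 1 / absolute_risk_reduction

# Suppose roughly 30 breast cancer deaths per 10,000 unscreened women
# ages 40 to 49 over a decade, and a relative risk of 0.85 with screening.
nns = number_needed_to_screen(30, 0.85)
print(round(nns))  # -> 2222
```

Because the baseline risk for premenopausal women is low, even a respectable relative risk reduction yields a large NNS; the same relative risk applied to the higher baseline risk of older women would yield a much smaller one.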

For their part, proponents of screening at age 40 to 49 argue that the modest benefit observed in clinical trials is due to limitations in study design. The trials were not designed to test the effectiveness of screening in this age group, recruiting instead a large proportion of older women, and the studies lacked adequate sample sizes and adequate durations of followup to detect a difference in this subgroup. They note that the trials used outdated methods (e.g., single views) and screened women too infrequently (e.g., every 2 years) to detect rapidly growing tumors (Feig, 1996). They also emphasize that the age of 50 years is an arbitrary cutoff point that has little biological significance; younger women with certain risk factors face the same absolute risk of breast cancer and may benefit as much from mammography as older women with fewer risk factors (Gail and Rimer, 1998). Indeed, the pooled relative risk reductions in the various meta-analyses reported above suggest that screening mammography reduces mortality by an average of 15–21 percent for women in all age groups, with reductions being smaller on average for younger women (e.g., ages 40 to 49) than for those age 50 and older.

Most of the evidence from screening trials suggests that the effectiveness of mammography is equivalent whether it is performed annually or every 2 years. However, a body of natural history and modeling evidence suggests that an annual interval may be more effective (Boer et al., 1999; Michaelson et al., 1999; Ren and Peer, 2000), especially among women ages 40 to 49. Breast cancer appears to have a shorter mean sojourn time in premenopausal women (Moskowitz, 1986; Tabàr et al., 1995). A greater ratio of interval cancers (cancers that arise between screenings) to total cancers among women ages 40 to 49 may account for the diminished effectiveness of mammography observed in clinical trials, most of which screened women every 18 to 24 months. (Only the American and Canadian trials screened women ages 40 to 49 annually.) The limited number of women of this age in the trials gave the trials inadequate power to conclude whether the difference in relative risk reduction between annual and less frequent screening is statistically significant. Indirect evidence suggests that annual screening might significantly lower the rate of mortality from breast cancer (Feig, 1995) without increasing the frequency with which women are recalled for follow-up imaging studies (Hunt et al., 1999).

Women Age 70 and Older

Data on the effectiveness of breast cancer screening in older women are lacking because most trials enrolled women younger than age 69. The Swedish Two-County Trial, in which women up to age 74 were enrolled for screening, reported a 32 percent reduction in the rate of mortality from breast cancer (95 percent CI, 0.51 to 0.89) for women aged 70 to 74 at the time of randomization (Chen et al., 1995). Modeling studies also support the screening of women over age 65 (Mandelblatt et al., 1992), but beyond age 69 the relative improvement in detecting metastatic cancer may be lower (Smith-Bindman et al., 2000) and the incremental benefit in terms of reduced mortality may be modest (Kerlikowske et al., 1999). Others have argued that the evaluation of incremental benefit should give greater consideration to differences in life expectancy, comorbidities, effects on treatment, and value preferences (Mandelblatt et al., 2000). It is generally held that continued screening is unwarranted in elderly women who have significant comorbidities, poor functional status, low bone mineral density (an indicator of low estrogen levels), or an unwillingness to accept the potential harms of screening (Parnes et al., 2001).

Testing for BRCA1 and BRCA2 Mutations

Cloning of the BRCA1 and BRCA2 genes has made it possible to identify individuals who carry breast cancer susceptibility mutations in these genes. In one British study, carriers of mutations in the BRCA1 and BRCA2 genes accounted for 2 percent of breast cancer cases diagnosed before the age of 55 in population-based screening (Anglian Breast Cancer Study Group, 2000). Advances in genetic technology may soon enable testing for other disturbances in molecular pathways to identify women susceptible to breast cancers unrelated to the early-onset familial form (Golub, 2001).

The potential benefits of early identification of genetic susceptibility to breast cancer include the opportunity for earlier or more frequent screening, antiestrogen therapy, prophylactic surgery, and lifestyle modifications (Goodwin, 2000). Because these mutations are also associated with an increased risk of ovarian cancer, the opportunity to consider prophylactic oophorectomy is another potential benefit. There is some evidence that testing influences women's choices. In a Canadian survey of women who tested positive for mutations in the BRCA1 or BRCA2 genes, 58 percent indicated that their screening practices had changed, 28 percent had undergone prophylactic mastectomy, and nearly two-thirds were considering prophylactic mastectomy or oophorectomy (Metcalfe et al., 2000). In a Dutch study, 51 percent opted for prophylactic mastectomy (Meijers-Heijboer et al., 2000).

Breast Self-Examination

There is limited evidence regarding the accuracy of breast self-examination. The upper limit of its sensitivity has been estimated at 12 to 25 percent (Fletcher et al., 1993), making it considerably less sensitive than either clinical breast examination or mammography. Its specificity is unknown, but Chinese women instructed in breast self-examination in a large randomized trial (see below) identified 331 cases of breast cancer but also had 1,457 false-positive findings, whereas the control group (which received education about low back pain) found 322 cancers and had 623 false-positive results (Thomas et al., 1997).
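Treating the case and false-positive counts quoted above as test positives, one can compute roughly how often a finding in each arm actually represented cancer. This back-of-the-envelope positive predictive value is our illustration of the counts, not an analysis reported by the cited study:

```python
# Rough positive predictive value (PPV) of findings in each arm of the
# Shanghai trial, computed from the counts quoted in the text:
# cancers found vs. false-positive findings.

def ppv(true_pos, false_pos):
    """Fraction of positive findings that were actually cancer."""
    return true_pos / (true_pos + false_pos)

instruction_arm = ppv(331, 1_457)  # women taught breast self-examination
control_arm = ppv(322, 623)        # controls (education about low back pain)

print(f"{instruction_arm:.1%}")  # -> 18.5%
print(f"{control_arm:.1%}")      # -> 34.1%
```

The similar cancer counts but far larger false-positive count in the instruction arm are what drive the concern about the specificity of breast self-examination.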

There remains little direct evidence that breast self-examination improves the outcomes from breast cancer. A nonrandomized study in the United Kingdom found that two centers at which women were invited to education sessions on breast self-examination had combined mortality rates that were similar to those at control centers, but one of the two centers did have significantly lower death rates (Lancet, 1999).

A randomized trial in Russia sponsored by the World Health Organization is evaluating the effectiveness of the training of small groups in breast self-examination combined with reinforcement techniques. Over 10 years there has been no significant difference in mortality rates, although the intervention group has had more physician visits, referrals, and breast biopsies (Semiglazov et al., 1999). Methodological limitations raise doubts about the ability of this trial to rule out a benefit associated with the intervention. A randomized trial in Shanghai, China, involved 267,040 textile workers in which the intervention group received instruction in breast self-examination and reinforcement interventions. No reduction in mortality was observed at the 5-year follow-up (Thomas et al., 1997).

Harms

The potential harms of screening mammography relate primarily to false-negative and false-positive results (which some define as including detection of cancer of unknown clinical significance such as DCIS). The latter can be especially significant because of the emotional (e.g., anxiety) and physical (e.g., biopsy) implications. Approximately 8 to 12 percent of women who undergo screening mammography must be reevaluated because of abnormal results, which typically entails repeat and more intensive imaging studies (Carney et al., 2000; Paquette et al., 2000; Poplack et al., 2000).

A growing literature has analyzed the psychological impact of this experience. Not surprisingly, women who have had abnormal mammogram results are substantially more worried about getting breast cancer than women who have not had such results (Lipkus et al., 2000a). Although the psychological impact of false-positive results requires further study (Lerman and Rimer, 1995), surveys suggest that after receiving a false-positive result women experience increased anxiety at both short-term and long-term follow-ups and experience added stress when they undergo biopsy (Gilbert et al., 1998; Gram et al., 1990; Lerman et al., 1991b; Lowe et al., 1999; Olsson et al., 1999). In one survey, 41 to 47 percent of women with suspicious mammograms expressed anxiety and worry about breast cancer (Lerman et al., 1991b). In a British study, psychological effects were reported for at least 5 months by 44 to 61 percent of women (Brett et al., 1998). About half of the women with normal mammograms said that the results reduced their fears, but concerns about breast cancer persisted for 28 percent.

These feelings do not appear to dampen interest in future screening. Among American women, having had a false-positive mammogram does not diminish interest in subsequent screening and may even heighten interest (Burman et al., 1999; Pisano et al., 1998b). In one study, women who had had an abnormal mammogram within the 2 years before the interview were more likely to be on schedule for mammography than women who had never had an abnormal mammogram (Lipkus et al., 2000a). Similarly, although moderate to extreme discomfort from mammography was reported by 52 percent of American women, it was not associated with disinterest in future testing (Dullum et al., 2000). A Dutch study reported that 73 percent of women found mammography mildly to severely painful, but only 3 percent indicated that the pain would deter them from future screening (Keemers-Gels et al., 2000).

It has been estimated that for every 1,000 mammographic examinations, between 3 and 42 biopsies are performed to investigate abnormal results (Fletcher et al., 1993). Recall examinations, often with special views, can reduce the need for unnecessary biopsies (Sickles, 2000b). A more recent audit of 36,850 consecutive screening mammograms reported that 1.4 percent of women underwent biopsy; these women represented approximately 25 percent of all those with abnormal mammograms (Dee and Sickles, 2001). In recent community studies, 22 to 38 percent of biopsy specimens were positive for cancer (Dee and Sickles, 2001; Poplack et al., 2000). At the biopsy rate in the Canadian trial, women have a 10 to 20 percent chance of undergoing biopsy at some time during 5 years of screening by annual mammography and clinical breast examination (Fletcher et al., 1993). The probability of undergoing a biopsy for a false-positive result depends on breast cancer risk factors and may be higher among younger women. In the Stockholm trial, 41 to 56 percent of the costs related to false-positive results occurred in women who were under age 50 when screening began (Frisell and Lidbrink, 1997). In an American study, women ages 40 to 49 having their first mammogram underwent twice as many diagnostic tests per cancer detected as women ages 50 to 59 (43.9 versus 21.9 diagnostic tests) (Kerlikowske et al., 1993).

There is no direct evidence that ionizing radiation from mammography causes breast cancer. Given a mean dose to the breast of 0.1 rad during mammography, modeling based on data from studies of higher levels of radiation exposure suggests that 100,000 women screened annually from ages 50 to 75 would lose 12.9 years from radiogenic cancers but would gain 12,623 years from an assumed 20 percent reduction in the rate of mortality from breast cancer (Feig and Ehrlich, 1990). The benefit-risk ratio was narrower in a Swedish model (Mattsson et al., 2000). Other models have estimated that mammography would detect 114 to 815 cancers for every case of cancer that it might induce (Beemsterboer et al., 1998; Law, 1993).

Testing for mutations in the BRCA1 or BRCA2 genes introduces complex psychological, medical, social, and legal consequences if women are informed that they are carriers (Fasouliotis and Schenker, 2000; Koenig et al., 1998). A study of a kindred tested for a BRCA1 mutation reported higher levels of psychological distress (Croyle et al., 1997), but such experiences are not uniform. A prospective study of families that underwent testing for BRCA mutations revealed that individuals found to be non-carriers of a mutation reported fewer depressive symptoms and functional impairment and that mutation carriers did not exhibit increased levels of depression; 1 month later, 17 percent of carriers intended to have mastectomies (Lerman et al., 1996). Those who declined the test and who had high baseline levels of cancer-related stress had higher rates of depression at 1 and 6 months of follow-up (Lerman et al., 1998). At 1 year only 3 percent of unaffected carriers of a mutation in the BRCA1 or BRCA2 genes had undergone prophylactic mastectomy. Although their mammography rates were higher than those for noncarriers, analysis revealed that this was because noncarriers decreased their rate of adherence to screening (Lerman et al., 2000). Another study reported that women who underwent prophylactic mastectomy had lower levels of psychological morbidity than those who decided not to have prophylactic mastectomy (Hatcher et al., 2001).

The potential harms of genetic screening reach beyond the patient being tested. When women in one survey were asked whether it was appropriate to share genetic test results with family members, 100 percent were supportive if the only option was prophylactic mastectomy, 97 percent were supportive if it was a preventable disease, and 85 percent were supportive if it was a nonpreventable disease (Lehmann et al., 2000). Even if the individual was opposed to sharing the results with family members, 16 to 22 percent of respondents said the physician should seek out and inform the family members against the patient's wishes. A survey of adults in one study revealed that one-fourth would permit testing of children (under age 18) (Hamann et al., 2000). Under such a policy, minors (who are not in a position to choose otherwise) would be preempted from living their lives without knowing their genetic susceptibility, a choice they might make as adults. The ripple effects of genetic testing for mutations in the BRCA1 and BRCA2 genes pose formidable challenges to the physician attempting to offer patients informed consent before undergoing such testing (Miesfeldt et al., 2000).

Cost-Effectiveness

A review of economic evaluations of breast cancer screening published through 1997 found that estimates ranged widely. Estimates of the cost per case of cancer detected ranged from $5,226 to $58,331 (Brown et al., 1999). Studies suggest that the cost-effectiveness of screening mammography is within customarily accepted ranges: approximately $16,100 to $21,400 per year of life saved (Leivo et al., 1999; Rosenquist and Lindfors, 1998; Salzmann et al., 1997). Studies in other countries estimate the cost at $3,750 (U.S. dollars) per year of life saved (Wan et al., 2001).

Compared with beginning screening at age 50, however, the incremental cost-effectiveness of beginning screening at age 40 appears to be high, perhaps as much as $105,000 per year of life saved (Salzmann et al., 1997). The incremental costs of performing mammography annually rather than biennially may also be substantial, especially for older women (Boer et al., 1999). The incremental cost-effectiveness of continuing screening mammography beyond age 69 for women with increased bone mineral density was estimated to be $66,773 per year of life saved (Kerlikowske et al., 1999).
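An incremental cost-effectiveness ratio of this kind is simply the extra cost of the more intensive strategy divided by the extra life-years it yields. A sketch, with invented per-woman figures chosen only to reproduce a ratio of the magnitude cited above:

```python
# Illustrative incremental cost-effectiveness ratio (ICER). The
# per-woman costs and life-year gains below are hypothetical, chosen
# only to show the arithmetic, not figures from the cited analyses.

def icer(cost_a, life_years_a, cost_b, life_years_b):
    """ICER of strategy B relative to strategy A, in dollars per life-year."""
    return (cost_b - cost_a) / (life_years_b - life_years_a)

# Strategy A: begin screening at age 50; strategy B: begin at age 40.
print(round(icer(cost_a=1_000, life_years_a=0.060,
                 cost_b=2_050, life_years_b=0.070)))  # -> 105000
```

The key point is that the denominator is the *incremental* life-year gain of the earlier start, which is small, so even a modest extra cost produces a large dollars-per-life-year ratio.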

Given certain assumptions, a modeling study estimated that the cost-effectiveness of screening Ashkenazi Jewish women for BRCA1 and BRCA2 mutations ranged from $20,717 to $72,780 per year of life saved, depending on which surgical options were selected by carriers, but testing followed only by surveillance was not cost-effective ($134,273 per year of life saved) (Grann et al., 1999).

Subjective Value Judgments

The longstanding controversy about whether mammography screening should begin at age 40 underscores the important role of subjective value judgments in the crafting of screening guidelines. For some time the argument centered on statistical significance because no individual trial or meta-analysis could exclude the possibility that lower mortality rates in that age group were due to chance. The extended follow-up data now available have convinced most analysts that mammography reduces breast cancer mortality in women ages 40 to 49 as well. The relative risk reduction is probably slightly less than that observed in women age 50 and older, but not markedly so. The more important distinction between these age groups is the difference in absolute benefit. The probability of acquiring breast cancer is much lower for premenopausal women, so that even an equivalent relative risk reduction in both age groups would mean that a much larger number of women would need to be screened to prevent a breast cancer death—that is, the NNS would be higher—than is the case for older women.

Unless one is a payer concerned about costs, a higher NNS is, by itself, an inadequate argument against screening. It is the trade-off between benefits and harms that makes a higher NNS a concern. Weighing that trade-off is made more complicated for younger women because of their higher probability of receiving false-positive results. The precise nature of that trade-off is unclear; estimates of the NNS and the risk of harms for women ages 40 to 49 vary, depending on the data and methods used for the calculation and the risk profile of the woman involved. Several analyses suggest that the NNS to prevent a breast cancer death over 10 years is 1,500 to 2,500, and the risk of follow-up biopsy or surgery during that decade is 8 to 20 percent (Harris and Leininger, 1995; Kerlikowske, 1997; National Institutes of Health, 1997).

Whether this trade-off is worthwhile has no objective answer and is clearly dependent on personal values. Some women, perhaps the majority (Schwartz et al., 2000), would say that the benefits outweigh the risks. Some women might feel otherwise. A consensus panel convened in 1997 by the National Institutes of Health, faced with this evidence, concluded that it was inappropriate to issue a uniform recommendation for all women ages 40 to 49. Instead, it recognized the diversity of women's views and recommended that the choice be individualized for each woman on the basis of “how she perceives and weighs each potential risk and benefit, the values the woman places on each, and how she deals with uncertainty” (National Institutes of Health, 1997, p. 1015). The recommendation ignited a firestorm of controversy, beginning at the conference and followed by harsh criticism in the lay press, a unanimous resolution by the U.S. Senate denouncing the statement, congressional pressure on the National Cancer Institute to take a firm position in favor of screening, and hearings on Capitol Hill. The pressure ultimately culminated in the issuance of guidelines by the National Cancer Institute in favor of screening women beginning at age 40 and an immediate announcement at the White House of expanded Medicare coverage for such screening (Begley, 1997; Taubes, 1997).

Debates of similar intensity erupted more recently in response to the critique of the mammography trials by Gøtzsche and Olsen. Fueled by prominent news media attention, the controversy compelled organizations and agencies to take public positions on the importance they assigned to design flaws in the trials, stances that ultimately reflected subjective value judgments rather than hard facts. Following the announcement that the Physician Data Query (PDQ) panel of the National Cancer Institute was also concerned about the design flaws, a dozen professional and cancer advocacy organizations published a full-page advertisement in the New York Times to reassure women that guidelines in favor of mammography screening were still in effect, and the National Cancer Institute reaffirmed its 1997 guideline in favor of screening (http://newscenter.cancer.gov/pressreleases/mammstatement31jan02.html). A subsequent editorial in the same newspaper accused the groups of “circling the wagons” and urged an independent review of the evidence by an impartial body (Circling the Mammography Wagons, 2002). When the Secretary of Health and Human Services announced the results of just such a review (the U.S. Preventive Services Task Force completed its 3-year update of its recommendations shortly thereafter), skeptics questioned whether the conclusions had been influenced by politics.

Much has been written in the medical literature about what occurred at the 1997 National Institutes of Health consensus conference (Ernster, 1997; Fletcher, 1997; Kassirer, 1997; Pauker and Kassirer, 1997; Ransohoff and Harris, 1997; Woolf and Lawrence, 1997) and much more will probably be written about the more recent controversy. Concerns that swirl around both incidents include the intrusion of politics into science, the loss of comity by the belligerents in the debate, and the difficult circumstances in which the consensus panel in 1997 and the U.S. Preventive Services Task Force in 2002 were placed. The intensity of the acrimony was amplified by the strong emotions surrounding breast cancer and the powerful agendas of interest groups. Analyses of the disparate perspectives of these parties and the larger policy implications of this controversy are covered elsewhere (Woolf and Dickey, 1999). What underlies these debates are ultimately differences in value judgments. The pivotal decision points for setting policy—e.g., whether design flaws are fatal, the NNS is too high, or benefits outweigh harms—are in many ways matters of opinion.

In the context of this report, it is also worth noting why the 1997 consensus panel's recommendation for shared decision making—a position scientifically justified for the reasons outlined above and consistent with the patient-centered theme appearing in screening guidelines for other cancers issued at the same time by other groups—met with such misfortune. The answer is complex and reflects in part the challenges of satisfying the public desire for clear, explicit guidelines when scientific uncertainty precludes such statements. Some of the resistance stemmed from fundamental discomforts with the entire notion of shared decision making and the tension between “paternalists” and “pluralists” in defining which choice is best (Woolf and Lawrence, 1997).

However, it is also conceivable that nuances in the wording of the recommendation, prepared hurriedly under pressure from the National Institutes of Health to produce a statement within 1 day of hearing the evidence, influenced its reception. A close examination of the recommendation (italics added) reveals the degree to which the language implied that the decision was the patient's rather than a shared decision as the panel intended: “Each woman should decide for herself whether to undergo mammography. ...Her decision may be based...” (National Institutes of Health, 1997, p. 1015). This wording may have fueled the adverse press coverage, which accused the panel of abdicating its authority and leaving to women the responsibility for a decision that experts could not make.

If the locus of control over decisions is viewed as a continuum, with physician paternalism on one extreme and patients making independent decisions on the other, it is important for guidelines that advocate shared decision making to use precise language that occupies the middle ground. Research indicates that only a small minority of patients, characterized by a high degree of self-empowerment and autonomy, are comfortable at the far extreme and that most patients who want to share decisions prefer a partnership role. The success with which the language used in recent prostate cancer screening guidelines has clarified this role is discussed in the next section of this chapter.

Increasingly evident in the years since the consensus conference is that the risk of breast cancer is not a dichotomous variable in which the need for subjective value judgments exists only before age 50. The artificial demarcation of the age 40 to 49 cohort is an artifact of the recruitment ages in the mammography trials. In reality there is a continuum of risk with advancing age; the probability of benefit increases and the probability of false-positive results decreases as women grow older, and there is nothing distinctive at age 50, other than approximating the age of menopause, that disrupts this pattern (Smith, 2000). Whatever the age, the magnitude of benefit that outweighs the risk is a subjective judgment that varies from woman to woman. If the appropriateness of screening mammography in a 48-year-old woman depends on how she weighs the NNS against the risk of a false-positive result, the same is true of a 52-year-old woman; it is only the ratios, and not the need for value judgments, that change with time. As Smith advocated, rather than continuing to focus on women ages 40 to 49 as a distinct cohort, “an alternative and more productive view is that women of all ages need to be fully informed about the benefits and limitations of breast cancer screening” (Smith, 2000, p. 331). This should include elderly individuals, who studies indicate are significantly less likely to be offered active involvement in decision making about mammography (Burack et al., 2000a).

The presumption in considering personal preferences is that a substantial proportion of women will defer screening because of concerns about false-positive results. The first studies to examine this subject are challenging that premise, suggesting that women are not terribly troubled by the downsides of screening (Cockburn et al., 1999; Schwartz et al., 2000). It is unclear whether these data reflect true preferences, patient overestimation of risk for breast cancer (Lipkus et al., 2000a), the challenges of understanding the meaning of relative and absolute risk reductions (Schwartz et al., 1997), or an artifact of the ways in which questions are framed (Peticolas, 2000). In one survey (Schwartz et al., 2000), women expressed a high degree of tolerance of a 20 percent risk of false-positive results, with 63 percent of women indicating that 500 or more false-positive results per life saved was a reasonable trade-off. Fully 37 percent were willing to tolerate a rate of 10,000 or more false-positive results per life saved. Only 8 percent viewed mammography as potentially harmful, and 62 percent indicated that they did not want to consider false-positive results when deciding about screening.

The challenge of shared decision making lies in ensuring that preferences are expressed on the basis of an accurate understanding of the facts. Software tools have been developed for use in the physician-patient discussion to help women quantify their personal risk of developing breast cancer (Gail and Rimer, 1998). Materials in both English and Spanish are being developed for this purpose (Lawrence et al., 2000). However, the difficulties of presenting probabilistic information in a narrative, numerical, or graphic format will need to be overcome to help women make choices that are grounded in accurate perceptions of likely outcomes.

In response to the renewed controversy surrounding mammography, the National Cancer Institute has formed a Breast Screening Working Group to promote timely examination of issues related to breast screening and to track research progress in these areas (P. Greenwald, Director, Division of Cancer Prevention, personal communication to Roger Herdman, National Cancer Policy Board, March 21, 2002). A subcommittee on mammography and communication issues will focus on approaches to evidence synthesis and communicating the implications of evolving evidence to the public. Other subcommittees will focus on the basic biology of early breast cancers, and new technologies and molecular methods to advance early detection.

Box 5.4 summarizes selected recommendations for breast cancer screening from major organizations.

BOX 5.4

Recommendations for Screening for Breast Cancer.

Prostate Cancer

This part of the chapter examines digital rectal examination and prostate-specific antigen (PSA) testing as screening tests for prostate cancer. Investigational tests of potential usefulness for screening (e.g., testing for the human glandular kallikrein 2 protein) are not discussed (Partin et al., 1999).

Digital Rectal Examination

The digital rectal examination lacks sensitivity for the detection of small tumors (McNeal et al., 1986). By definition, stage A tumors are nonpalpable. The digital rectal examination has a PPV of 15 to 30 percent and a sensitivity of approximately 60 percent. There is little evidence that digital rectal examinations reduce the rate of mortality from prostate cancer. One observational study of digital rectal examination found evidence of benefit (Jacobsen et al., 1998), but several others have reported no effect (Friedman et al., 1991; Gerber et al., 1996; Jacobsen et al., 1998; Richert-Boe et al., 1998) and all have been the target of methodological criticisms (Weiss et al., 1999).

Prostate-Specific Antigen Testing

A PSA concentration greater than 4 nanograms per milliliter (ng/ml) has a sensitivity of up to 80 to 85 percent in detecting prostate cancer (Catalona et al., 1994; Jacobsen et al., 1996). Analyses of archived blood samples suggest that PSA level elevations (and low free PSA ratios [see below]) precede the clinical diagnosis of prostate cancer by as much as 13 years (Gann et al., 1995; Parkes et al., 1995; Tibblin et al., 1995; Tornblom et al., 1999). PSA testing appears to be significantly more sensitive for the detection of aggressive prostate cancers than for the detection of nonaggressive (small, well-differentiated) prostate cancers.

PSA testing has limited specificity, producing false-positive results in the presence of benign prostate disease. About 25 to 46 percent of men with benign prostatic hypertrophy have elevated PSA levels (Oesterling, 1991; Sershon et al., 1994). Biological variability and differences among PSA assays can also affect accuracy (Wu, 1994). PSA levels fluctuate by as much as 20 to 30 percent for physiological reasons (Komatsu et al., 1996; Stamey et al., 1995). The specificity of PSA testing is age-related. On the basis of population data from one region in the United States, the specificities of PSA testing are 98, 87, and 81 percent for men ages 50 to 59, 60 to 69, and 70 to 79, respectively (Jacobsen et al., 1996).

In asymptomatic men, the PPV of a PSA level above 4 ng/ml is 28 to 35 percent (Bretton, 1994; Catalona et al., 1991, 1994; Cooner et al., 1990); that is, roughly two of three men with elevated levels of PSA do not have prostate cancer. The reported PPV when the digital rectal examination is negative is 20 percent (Andriole and Catalona, 1993). It is unclear whether these data can be extrapolated to healthy men screened in clinical practice. Most of the participants in those studies were either patients seen at urology clinics or volunteers recruited from the community through advertising. Such volunteers may not be representative of men in the general population (Demark-Wahnefried et al., 1993). In one PSA screening study, 53 percent of the volunteers had symptoms of prostatism (Catalona et al., 1994).
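The dependence of PPV on test accuracy and disease prevalence can be illustrated with Bayes' rule. The sketch below uses hypothetical inputs chosen to be consistent with the ranges cited in this section (sensitivity about 80 percent, specificity about 90 percent, prevalence of detectable cancer about 5 percent among screened men); these values are illustrative assumptions, not data from the studies cited.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = true positives / all positives, computed via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative (assumed) inputs: sensitivity 0.80, specificity 0.90,
# prevalence of detectable cancer 0.05 among screened men.
ppv = positive_predictive_value(0.80, 0.90, 0.05)
print(f"PPV = {ppv:.0%}")  # roughly 30 percent: about 2 of 3 positives are false
```

Even with high specificity, the low prevalence of detectable cancer in a screened population keeps the PPV near the 28 to 35 percent range reported above.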

Recent interest has focused on the incremental benefit of lowering the cutoff for abnormal PSA levels to below 4.0 ng/ml, especially in men with a suspicious digital rectal examination. The increased sensitivity of this change is accompanied by decreased specificity. In one study of this approach, the PPVs were 5, 14, and 30 percent for PSA levels of 0.0 to 1.0, 1.1 to 2.5, and 2.6 to 4.0 ng/ml, respectively (Carvalhal et al., 1999a). In another study, the PPV of a digital rectal examination and transrectal ultrasound for men with PSA levels below 4.0 ng/ml was 9.7 percent. It was estimated that only 37 percent of prostate cancers were diagnosed by using a cutoff PSA level of 2 to 4 ng/ml (Schroder et al., 2000).

There is little direct evidence regarding the optimal periodicity for performing PSA tests. A cohort study suggested that few curable cancers would be missed in men with PSA levels less than 2.0 ng/ml if screening occurred every 2 years (Carter et al., 1997). Modeling studies suggest that biennial screening is most cost-effective and may reduce the need for biopsies (Etzioni et al., 1999a; Ross et al., 2000).

Other strategies to improve PSA specificity. Most research on PSA screening has focused on improving specificity to reduce the probability of false-positive results and unnecessary biopsies. Several approaches have been used.

PSA density is the concentration of circulating PSA divided by the gland volume as measured by ultrasound (Bazinet et al., 1994). This measure accounts for the relationship between PSA and prostatic enlargement apart from cancer. A value greater than 0.15 ng/ml per milliliter of gland volume may be predictive of cancer (Benson et al., 1992).

PSA velocity is the rate of change in PSA levels over time. An increase of at least 0.75 ng/ml within 1 year has a reported specificity of 90 percent in differentiating cancer from benign disease (Carter et al., 1992).

PSA cutoff levels can also be defined for specific age and race categories, as PSA levels generally increase with age and are higher in certain racial or ethnic groups (Kalish and McKinlay, 1999; Morgan et al., 1996; Oesterling et al., 1993). Tables that stratify normal PSA levels by the age or the race or ethnicity of the patient have been created.

Finally, the free PSA ratio is the proportion of circulating PSA that is unbound rather than complexed to α1-antichymotrypsin and other moieties (Catalona et al., 1998). A low free PSA ratio (e.g., <25 percent) is more common with prostate cancer than with benign prostatic hypertrophy. Free PSA has garnered much interest in recent years. A cutoff of 15 percent has been reported to be most discriminatory in predicting favorable outcomes (Southwick et al., 1999). Some recommend the use of this ratio for men with total PSA levels of 2.51 to 4 ng/ml to optimize cancer detection and minimize unnecessary biopsies (Catalona et al., 1999b). Assays for the specific moieties to which PSA binds offer promise in further reducing the rates of false-positive results. However, no single approach has yet been proved to be more accurate than the others (Hayek et al., 1999; Wald et al., 2000), although age-specific PSA cutoffs tend to be less sensitive than either free PSA or PSA density (Catalona et al., 2000).
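Each of these derived measures is simple arithmetic on routine laboratory values. The sketch below applies the cutoffs reported above (0.15 for PSA density, 0.75 ng/ml per year for PSA velocity, 25 percent for free PSA); the patient values are hypothetical, used only for illustration.

```python
def psa_density(total_psa_ng_ml, gland_volume_ml):
    """Serum PSA concentration divided by ultrasound-measured gland volume."""
    return total_psa_ng_ml / gland_volume_ml

def psa_velocity(earlier_psa, later_psa, years_between):
    """Rate of change in PSA level (ng/ml per year)."""
    return (later_psa - earlier_psa) / years_between

def free_psa_percent(free_psa, total_psa):
    """Percentage of circulating PSA that is unbound."""
    return 100.0 * free_psa / total_psa

# Hypothetical patient: total PSA 3.8 ng/ml, gland volume 40 ml,
# PSA 2.9 ng/ml one year earlier, free PSA 0.6 ng/ml.
print(psa_density(3.8, 40))        # about 0.095, below the 0.15 cutoff
print(psa_velocity(2.9, 3.8, 1))   # about 0.9, above the 0.75 ng/ml/year cutoff
print(free_psa_percent(0.6, 3.8))  # about 15.8, below the 25 percent cutoff
```

Note that the measures can disagree for the same patient, as here, which is one reason no single approach has proved clearly superior.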

Many cancers detected by testing for PSA (true positives) lack clinical significance. Autopsy studies suggest that 30 percent of men over age 50 have histological evidence of latent prostate cancers that are unlikely to produce symptoms or affect survival (Scardino, 1989; Woolf, 1995). Because of the indolent behavior of most prostate tumors, men are more likely to die of other causes (e.g., coronary artery disease or stroke) before their prostate cancers progress to clinical significance or metastasize. There are methodological challenges to ascertaining the true cause of death of men with prostate cancer (Albertsen, 2000), but one analysis of elderly American decedents known to have prostate cancer reported that 61 percent died of other causes (Newschaffer et al., 2000). Other investigators have reported similar findings (Satariano et al., 1998). Whether screening leads to the overdiagnosis of such cases is controversial. One study estimated that overdiagnosis increased the number of cases by 51 and 93 percent for men screened at ages 60 and 65, respectively (Zappa et al., 1998).

Some tumors detected by PSA testing do progress and are fatal. Certain histopathological features provide important clues regarding the likelihood of progression; observational studies and modeling data suggest that advanced tumor grade, stage, and volume increase the probability of progression and metastasis (Chodak et al., 1994; Epstein et al., 1993; Stamey et al., 1999). Gleason scoring is a system used by pathologists to grade cell differentiation of prostate cancers. Patients with Gleason scores of 2 to 4 (well-differentiated cancer) face a 4 to 7 percent probability of dying from prostate cancer within 15 years when treated conservatively, but the mortality rate is 60 to 87 percent when the Gleason scores are 8 to 10 (poorly differentiated cancer) (Albertsen et al., 1998).

The high prevalence of unfavorable histopathological features in tumors detected by PSA testing suggests that PSA-detected cancers might be more clinically important than latent cancers detected on autopsy. Pathological staging of cancers detected through PSA screening and radical prostatectomy reveals extension beyond the prostate capsule, poorly differentiated cells, large gland volumes, and metastases in 31 to 41 percent of PSA-detected patients (Catalona et al., 1993; Epstein et al., 1994a; Humphrey et al., 1996; Mettlin et al., 1994; Smith and Catalona, 1994). Other retrospective studies also report high prevalences of large tumor volumes, extracapsular extension, and positive surgical margins in cancers detected through PSA screening (Scaletscky et al., 1994; Stormont et al., 1993).

Effectiveness of Early Detection

Few studies to date have used randomized controlled designs to test whether screening for prostate cancer reduces rates of morbidity or mortality. One randomized controlled trial of screening has reported preliminary results: Canadian men offered screening were reported to have a 67 percent reduction in the rate of mortality from prostate cancer (Labrie et al., 1999). Critics have expressed concern about the design of the study: men were randomized not to screening itself but to receiving a letter of invitation for screening, introducing potential volunteer bias, and substantial crossover occurred between the groups. The investigators did not perform an intention-to-treat analysis; when such an analysis is applied to the available data, it suggests no significant reduction in mortality rates (Ruffin, 1999). Other randomized controlled trials of screening are under way in Europe (Schroder et al., 1995) and the United States (Gohagan et al., 2000). The results of these trials will not be available for at least several years, however, leaving only indirect evidence with which to evaluate effectiveness.

Indirect evidence that early detection improves outcome is also limited. Men with localized tumors at diagnosis appear to live longer and have higher 5-year survival rates than those with more advanced disease. Five-year survival rates are 98 to 100 percent for patients diagnosed with localized disease but are only 30 to 32 percent for those diagnosed with distant metastases (Greenlee et al., 2001). Men who undergo PSA screening are more likely to have early-stage tumors at diagnosis (stage shift) (Catalona et al., 1993; Mettlin et al., 1998; Rietbergen et al., 1999). Ongoing screening programs in countries and regions where PSA screening is prevalent report that the proportion of cancers that are clinically or pathologically advanced has been steadily decreasing over time (Farkas et al., 1998; Labrie et al., 1996; Smith et al., 1996a; Spapen et al., 2000). Many have not been persuaded by this evidence, however, because of concerns about lead-time and length biases.

Recent attention has focused on evidence that prostate cancer mortality rates began declining in the United States (Tarone et al., 2000) and Canada (Meyer et al., 1999; Skarsgard and Tonita, 2000) within several years of the introduction of PSA screening. The same pattern has been observed in Olmsted County, Minnesota, where PSA screening has been longstanding (Roberts et al., 1999). Preliminary reports from the Austrian province of Tyrol indicate a 42 percent decrease in mortality 5 years after the introduction of PSA screening, although death rates remained unchanged throughout the rest of Austria (Bartsch et al., 2001). Epidemiologists indicate that some (Hankey et al., 1999), but not all (Etzioni et al., 1999b), of this decline may be attributable to PSA screening. A decline occurring so soon after the introduction of screening would be unexpected for a cancer known for its long latency. Other potential contributors to these trends (e.g., misclassification of deaths and improved treatment) therefore cannot be excluded (Feuer et al., 1999).

Efficacy of treatment for prostate cancer. The principal treatment options for localized prostate cancer include radical prostatectomy, external beam or interstitial radiation therapy, hormonal treatment, cryosurgical ablation, brachytherapy, and no treatment (expectant management or “watchful waiting”). New and investigational treatments, such as gene therapy, are not reviewed here.

For men with localized prostate cancer, the stage of disease for which screening is intended, there are no results from controlled trials to indicate that any treatment improves survival over watchful waiting. Studies that suggest otherwise tend to be uncontrolled case series (Bagshaw et al., 1994; Catalona and Smith, 1998; Gerber et al., 1996; Hanks et al., 1994, 1995; Lerner et al., 1995; Shipley et al., 1999; Tefilli et al., 1999; Zincke et al., 1994), which suffer from a host of methodological problems. These include the lack of internal control groups (to which another treatment or no treatment was offered), the selection of subjects on clinical grounds that introduce a favorable selection bias, failure to control for confounding, and defining “cure” and progression on the basis of surrogate measures (e.g., PSA levels) rather than hard clinical endpoints. Biochemical failure is an unreliable surrogate for survival (Chodak, 1998; Jhaveri et al., 1999).

Aside from these problems, the survival rates reported from those studies differ little from the expected rates after adjustment for stage and grade of disease. A number of randomized trials have compared the relative benefits of one treatment regimen over another, and to the extent that some have shown benefit, it could be argued that treatment is effective; but most trials rely on intermediate and surrogate outcomes. Some studies do report improved survival with specific treatments (Adolfsson et al., 1993), but the differences typically lack clinical and statistical significance because of confounding variables and wide confidence intervals.

A randomized controlled trial of the effectiveness of treatment was conducted in the United States in the 1970s (Graversen et al., 1990) and found no difference in 15-year survival rates between radical prostatectomy and watchful waiting, but the sample size may have been too small to observe an effect. Larger randomized controlled trials comparing radical prostatectomy with expectant management are under way in Scandinavia (Johansson, 1994) and the United States (Wilt and Brawer, 1994), but the results will not be available for some years.

Conservative treatment. Enthusiasm about the efficacy of treatment has been dampened by evidence that long-term survival for localized prostate cancer may be good even without treatment. In a widely cited study, Johansson et al. (1997) monitored a population-based cohort of 223 men with prostate cancer who were initially untreated. After a mean follow-up period of 12.5 years, 10 percent had died of prostate cancer and 56 percent had died of other causes (Johansson, 1994). Although regional extension occurred in more than one-third of the patients and metastases occurred in 17 percent, the 15-year disease-specific survival rate was 81 percent (Johansson et al., 1997). By comparison, 10-year survival rates after radical prostatectomy in the United States are 75 to 97 percent for patients with well-differentiated and moderately differentiated cancers and 60 to 86 percent for patients with poorly differentiated disease (Krongrad et al., 1997).

Critics of the Swedish study worry that survival rates were high because of the large proportion of older men with small, well-differentiated tumors (Walsh and Brooks, 1997). Other studies, however, have also reported high 10-year survival rates (74 to 96 percent) in untreated men with palpable but clinically localized prostate cancer (Adolfsson et al., 1994, 1999; Warner and Whitmore, 1994). A retrospective 16-year cohort study of American patients ages 65 to 75 who underwent conservative treatment for localized prostate cancer (either no treatment or hormonal treatment) found that life expectancy was unchanged from that of the general population if the tumor was low grade (Albertsen et al., 1995). However, survival was reduced by 4 to 5 years if the tumor was moderate grade and 6 to 8 years if it was high grade.

Other studies have reported more pessimistic outcomes from conservative therapy. A retrospective study in Sweden reported a disease-specific mortality rate of 50 to 100 percent for patients with conservatively treated localized tumors, but the denominator included only men who had died of prostate cancer (Aus, 1994). The same denominator problem affects other studies reporting high mortality rates with conservative treatment (Borre et al., 1997). A prospective Canadian study reported that 60 percent of men (median age, 75 years) placed in a watchful waiting program had clinical progression (McLaren et al., 1998). A study of Danish men with prostate cancer who survived for at least 10 years after diagnosis reported that prostate cancer was the direct cause of death in 43 percent of the men (Brasso et al., 1999).

Researchers have attempted to model the natural history of untreated prostate cancer by pooling the results of the studies mentioned above, but the assumptions used in the models are controversial. On the basis of the results of six studies, one model concluded that conservative management (delayed hormone therapy but no surgical or radiation therapy) was associated with 10-year disease-specific survival rates of 87 percent for men with well-differentiated or moderately differentiated tumors and 34 percent for men with poorly differentiated tumors (Chodak et al., 1994). For patients alive after 10 years, the probabilities of having metastatic disease were 19, 42, and 74 percent for those with well-, moderately, and poorly differentiated cancers, respectively. Critics of the analysis disagree with the study's probability estimates (Catalona, 1994; Scardino et al., 1994).

An older review using pooled data from 144 articles estimated that the annual risks of metastasis and death from untreated prostate cancer were low (1.7 and 0.9 percent, respectively) (Wasson et al., 1993). That study has been criticized for including a large proportion of patients with well-differentiated tumors and those receiving early androgen deprivation therapy (Walsh, 1993). A different review calculated higher annual rates of metastasis and death (2.5 and 1.7 percent, respectively), but that analysis was limited to studies of patients with palpable, localized cancers and excluded cancers found incidentally at prostatectomy (Adolfsson, 1993).

Harms

Screening all men age 50 and older in the United States (35 million persons) would expose a large population to potential harms. Both false-positive and false-negative results commonly occur with PSA screening. On the basis of the reported PPV of 20 to 35 percent (see above), two to four men with abnormal results on routine PSA screening will not have cancer for every man who does. These individuals must generally return for one or more repeat PSA measurements and rectal examinations or more invasive testing (e.g., ultrasound and fine-needle biopsy) to rule out cancer. If PSA levels are suspicious, one or more biopsies may be required. In one recent study, approximately 75 percent of the men who underwent biopsy did not have cancer (Naughton et al., 2000a). As with breast cancer, the psychological morbidity that patients may experience while awaiting a definitive diagnosis may be significant, although fewer studies have examined this topic.
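The "two to four" figure follows directly from the PPV: for every man with a true-positive result, (1 − PPV)/PPV men receive false-positive results. A brief sketch over the PPV range cited above:

```python
def false_positives_per_cancer(ppv):
    """Men with false-positive results for each man truly found to have cancer."""
    return (1.0 - ppv) / ppv

# Evaluate across the 20 to 35 percent PPV range reported in the text.
for ppv in (0.20, 0.35):
    ratio = false_positives_per_cancer(ppv)
    print(f"PPV {ppv:.0%}: {ratio:.1f} false positives per cancer detected")
```

At a PPV of 20 percent the ratio is 4.0; at 35 percent it is about 1.9, which is the basis for the two-to-four range in the text.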

Biopsy itself carries its own morbidity. In a large American screening program, needle biopsy was performed for 18 percent of patients screened by digital rectal examination and PSA testing (Catalona et al., 1994). Discomfort from the biopsy procedure is reported by half of men. Patients recall more pain 2 and 4 weeks after the procedure than immediately after the biopsy (Naughton et al., 2000b). The standard sextant (6-core) biopsy procedure allows cancers to escape detection, prompting recent interest in obtaining 12 cores, which appears to increase the incidence of hematochezia and hematospermia but which has no incremental adverse effect on quality of life (Naughton et al., 2000b, 2001). Other potential harms include local infection (0.3 to 5 percent of patients), sepsis (0.6 percent of patients), and significant bleeding (0.1 percent of patients) (Aus et al., 1993; Cooner et al., 1990; Desmond et al., 1993; Hammerer and Huland, 1994). One study reported that bacteremia occurred in 16 percent of patients and that 12 percent of patients described dysuria 1 day after the procedure (Lindert et al., 2000).

Published rates of mortality from radical prostatectomy are 0.2 to 2 percent, with lower rates reported at specialized centers or for patients under age 65 (Andriole et al., 1994; Catalona and Avioli, 1987; Kramer et al., 1993; Lu-Yao et al., 1993; Mark, 1994; Murphy et al., 1994; Optenberg et al., 1995; Wasson et al., 1993; Wilt et al., 1999; Zincke et al., 1994). The potential iatrogenic complications of treatment for prostate cancer are substantial. Chief among these are impotence and incontinence, but several other adverse effects are possible. Surgical complications have been reduced to some extent by using bilateral nerve-sparing techniques and by limiting the operation to younger and healthier men; at experienced centers, recovery of erections occurs in 68 percent of preoperatively potent men (with bilateral nerve-sparing surgery), and 92 percent of men regain continence (Catalona et al., 1999a). One such center, at the Washington University School of Medicine, evaluated men at least 1 year after starting treatment and reported that urinary control was a problem for 9 percent. Sexual function was a moderate or major problem for 58 percent of those treated by prostatectomy, 48 percent treated by radiotherapy, 64 percent treated by hormonal therapy, 45 percent treated by cryoablation, and 30 percent treated by observation (Smith et al., 2000). Overall dissatisfaction with treatment, primarily because of urinary dysfunction and bothersome symptoms, was reported by 11 to 21 percent of patients who underwent prostatectomy, 14 percent of those treated with radiotherapy, and 8 percent of those treated by observation (Carvalhal et al., 1999b).

Such outcomes in experienced hands are not always reproducible in normal community practice (Talcott et al., 1997). In a study of 1,291 men, 56 to 66 percent of men who were potent before radical prostatectomy complained of impotence at least 18 months later, and 8 percent were incontinent (Stanford et al., 2000). Other surveys report high complication rates (Shrader-Bogen et al., 1997). An audit of procedures performed from 1986 to 1996 at veterans' hospitals revealed major cardiopulmonary, vascular, and colorectal injury complications in 1.7, 0.2, and 1.8 percent of men, respectively (Wilt et al., 1999). Complication rates are lower in healthier and younger patients (Zincke et al., 1994).

The reported incidences of acute and chronic gastrointestinal complications and genitourinary complications from radiotherapy are 55 to 76 and 11 to 12 percent, respectively (Leibel et al., 1994). Sexual dysfunction is common, with approximately 40 percent of persons who were potent before diagnosis being impotent 24 months later (Hamilton et al., 2001). Comparisons across studies to contrast the relative safety of radical prostatectomy and radiation therapy are generally unreliable because of differences in study design, patient populations, and outcome measures. Patients tend to report more bowel dysfunction with radiation therapy and more sexual dysfunction with radical prostatectomy (Shrader-Bogen et al., 1997). At a median follow-up of 14 years, patients who undergo radiotherapy report worse bladder, bowel, and erectile functions than are reported for men without prostate cancer (Johnstone et al., 2000). A recent study reported that almost 2 years after treatment, men receiving radical prostatectomy were more likely than men receiving radiotherapy to be incontinent and impotent. Radiotherapy produced greater declines in bowel function (Potosky et al., 2000).

Cost-Effectiveness

The widespread performance of PSA testing is costly. One Canadian study estimated that screening all eligible men in Canada in 1995 would have cost $317 million (Canadian dollars) (Krahn et al., 1999). An earlier U.S. analysis estimated that 1 year of PSA screening could cost the United States $28 billion (Kramer et al., 1993).

Several cost-effectiveness analyses have been published. One estimated that digital rectal examination and PSA screening at ages 50 to 69 would cost $12,491 to $18,769 per year of life saved (Coley et al., 1997). An analysis from the Medicare perspective by the Office of Technology Assessment of the U.S. Congress estimated that, given favorable assumptions, a one-time digital rectal examination-PSA screening would cost from $14,200 per year of life saved at age 65 to $51,290 per year of life saved at age 75, although the report emphasized that the estimates were highly sensitive to arguable assumptions (U.S. Congress, 1995). Similarly, other analyses have conjectured that screening for prostate cancer would have favorable cost-effectiveness ratios given certain assumptions about benefits and performance characteristics (Benoit and Naslund, 1997; Littrup, 1997). A screening program in Sweden estimated that screening costs about $14,900 per patient (U.S. dollars, 158,000 SEK) (Holmberg et al., 1998). Claims of cost-effectiveness are dubious if the denominator, the magnitude of benefit from screening, is uncertain and if assumptions used in the model are debatable.
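The caveat about uncertain denominators can be illustrated numerically: a cost-effectiveness ratio is program cost divided by life-years gained, so a modest change in the assumed benefit swings the ratio widely. The numbers below are hypothetical, chosen only to show this sensitivity, and are not drawn from the analyses cited.

```python
def cost_per_life_year(program_cost, life_years_gained):
    """Cost-effectiveness ratio: dollars per life-year saved."""
    return program_cost / life_years_gained

# Hypothetical screening program costing $10 million in total.
# The benefit estimate (life-years gained) varies with model assumptions.
cost = 10_000_000
for life_years in (700, 400, 200):
    ratio = cost_per_life_year(cost, life_years)
    print(f"{life_years} life-years gained -> ${ratio:,.0f} per life-year saved")
```

Halving the assumed benefit doubles the ratio, which is why estimates such as those above span a several-fold range depending on the assumptions used.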

Modeling Studies to Weigh Trade-Offs

In contrast to many of the other forms of cancer screening reviewed in this report, investigators studying prostate cancer screening have attempted to use modeling techniques to quantify the influence of subjective value judgments on the weighing of benefits and harms. The models take account of potential harms by adjusting for patient utilities. Older analyses found that screening achieves minor improvements in absolute survival (Love et al., 1985; Thompson et al., 1987), but more recent analyses that adjust for utilities have concluded that screening produces, at best, a modest gain, measured in days to weeks, or a net loss in QALYs (Cantor et al., 1995; Coley et al., 1997; Krahn et al., 1994; Mold et al., 1992). According to those studies, the harmful effects of screening and treatment on quality of life undercut the potential gains in life expectancy, but the assumptions used in the models have been challenged (Miles et al., 1995).

Some modeling studies have compared the relative impacts of different testing protocols. One examined the benefits of conducting screenings less frequently, estimating for a hypothetical population of 1,000 men that annual PSA testing beginning at age 50 would require 10,500 PSA tests, prevent 3.2 deaths, and require 600 biopsies, whereas a policy of PSA testing at ages 40 and 45 years followed by biennial testing beginning at age 50 would require 7,500 PSA tests, prevent 3.3 deaths, and require 450 biopsies (Ross et al., 2000).
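The trade-off in the cited modeling study can be restated as tests and biopsies per death prevented. The counts below are those reported in the text for the hypothetical cohort of 1,000 men (Ross et al., 2000); the per-death ratios are simple division, added here for illustration.

```python
# Figures reported by Ross et al. (2000) for a hypothetical cohort of 1,000 men.
protocols = {
    "annual PSA from age 50": {
        "tests": 10_500, "biopsies": 600, "deaths_prevented": 3.2,
    },
    "PSA at ages 40 and 45, then biennial from 50": {
        "tests": 7_500, "biopsies": 450, "deaths_prevented": 3.3,
    },
}

for name, p in protocols.items():
    tests_per_death = p["tests"] / p["deaths_prevented"]
    biopsies_per_death = p["biopsies"] / p["deaths_prevented"]
    print(f"{name}: {tests_per_death:,.0f} tests, "
          f"{biopsies_per_death:,.0f} biopsies per death prevented")
```

On these figures, the less frequent protocol prevents slightly more deaths while requiring roughly 30 percent fewer tests and biopsies per death prevented.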

Other decision analyses have focused on treatment. An analysis for men aged 60 to 75 concluded that treatment increases quality-adjusted survival by less than 1 year (in most cases, by less than 0.2 QALY) compared with observation (Fleming et al., 1993). For men over age 70 and younger men with well-differentiated disease, treatment appeared to be more harmful than watchful waiting. Critics of the analysis questioned the probabilities for certain components of the model and the inclusion of a relatively older population of men with low-volume and low-grade tumors (Beck et al., 1994; Walsh, 1993). The investigators emphasize that the data were adjusted for age and tumor grade. Other studies also concluded that radical prostatectomy and radiation therapy produce a net decrease in quality of life, even after adjusting for prevalence rates for sexual and urinary dysfunction (Litwin et al., 1995). Although some patients are willing to risk these complications of treatment, others do not believe that the risks are justified. In one study, 26 percent of patients (mean age, 66 years) indicated a preference for expectant management over surgery, even if the latter would extend life by 10 years (Mazur and Merz, 1996).

Subjective Value Judgments

Given the lack of direct evidence about the benefits of early detection, the uncertainty about complication rates, and the indefinite implications of modeling studies, the ultimate judgment of whether benefits outweigh harms remains subjective (Woolf and Rothemich, 1999). Physicians who weigh the trade-offs are affected by personal beliefs about the intuitive benefits of early detection, clinical training and experience, practice norms, patients' expectations, insurance coverage, and medicolegal concerns. Many clinicians feel compelled to screen patients for prostate cancer to protect themselves against litigation and damages should patients later develop prostate cancer.

However, what is best for the individual patient depends on personal preferences, subjective values, and individual risks (Woolf, 1997a,b). A man's fears, lifestyle plans, and priorities dictate whether the balance of benefits and harms is favorable. These issues received little attention in the early 1990s, when PSA screening guidelines first emerged, and organizations assumed polar positions on whether men should receive the test. Groups on one extreme recommended that all men uniformly undergo screening (American College of Radiology, 1991; American Urological Association, 1992; Mettlin et al., 1993), and opposing groups argued against routine screening (U.S. Preventive Services Task Force, 1989). Most guidelines now take account of the importance of patient preferences (Box 5.5). The move toward a policy of shared decision making has subdued the heated controversy that once characterized guidelines on this topic; that controversy is giving way to an emerging consensus around a patient-centered approach. The shift in policy began to occur in 1997 when the American College of Physicians, a group previously known for its resistance to prostate cancer screening, recommended that the physician “describe the potential benefits and known harms of [prostate] screening ..., listen to the patient's concerns, and then individualize the decision to screen” (American College of Physicians, 1997, p. 482) (italics added). In 1998, the American Academy of Family Physicians adopted a similar policy (American Academy of Family Physicians, 1998), advising that physicians “counsel [men] regarding the known risks and uncertain benefits of screening for prostate cancer” (American Academy of Family Physicians, 2000, p. 14).
In 2000, the American Urological Association, once the staunchest advocate of routine screening, stated that “early detection of prostate cancer should be offered” and emphasized in its 2000 practice policy that “the decision to use PSA for the early detection of prostate cancer should be individualized. Patients should be informed of the known risks and the potential benefits” (American Urological Association, 2000, p. 271). The 2001 guidelines of the American Cancer Society state that “the PSA test and the digital rectal examination should be offered.... Information should be provided to patients about benefits and limitations of testing. Specifically, prior to testing, men should have an opportunity to learn about the benefits and limitations of testing for early prostate cancer detection and treatment” (Smith et al., 2001, pp. 42–43).

Box Icon

BOX 5.5

Recommendations for Screening for Prostate Cancer.

Given the controversy surrounding the National Institutes of Health consensus conference statement on breast cancer screening, it is worth noting the absence of a similar phenomenon for prostate cancer screening and the likely role that language has played in the acceptability of prostate cancer screening policy. The organizations' consistent advice that physicians “offer” choices to patients conveys a sense of partnership and places the locus of control at a more moderate position in the continuum of control than did the National Institutes of Health consensus conference's reference to the “woman's decision” about mammography.

Guidelines that individuals use to make personal decisions about prostate cancer screening differ in important respects from the guidelines that governments and health plan policy makers must use to make decisions for populations (Woolf, 1997b; Woolf and Rothemich, 1999). Which way the scales tip for a population depends on the average utilities of men as a whole. Although some men favor screening, modeling studies that incorporate the full distribution of men's utilities, cited above, suggest that screening decreases QALYs. Population policy also requires consideration of resources: whether it is appropriate to invest in screening, especially an intervention of uncertain effectiveness and safety, if it comes at the expense of other services.

Policy positions opposing routine screening of the population for prostate cancer have therefore been issued in the United States by the U.S. Preventive Services Task Force (1996) and the Office of Technology Assessment (U.S. Congress, 1995), as well as in Canada (Canadian Task Force on the Periodic Health Examination, 1994), the United Kingdom (Morris, 1997), Sweden (Swedish Council on Technology Assessment in Health Care, 1996), and Australia (Australian Health Technology Advisory Committee, 1996). For reasons that are too extensive to outline in this report, such positions are not inconsistent with the clinical recommendations presented above that each man should decide for himself whether to be screened (Woolf, 1997b).

A factor that influences the balance of benefits and harms at the societal level is the cascade effect of screening on stimulating inappropriate procedures. For example, the dramatic escalation in PSA screening in the United States in the early 1990s was accompanied by a striking increase in the performance of radical prostatectomies (Lu-Yao and Greenberg, 1994; Wilt et al., 1999). Many of these operations, especially the large number performed on men over age 75, may not have been indicated. A similar phenomenon is becoming apparent in other countries, such as the Netherlands (Spapen et al., 2000) and Australia (Ansari et al., 1998).

Cervical Cancer

This section of the chapter reviews the Pap (Papanicolaou) smear and adjunctive technologies that can be used to improve the accuracy of detection of cervical cancer. Alternative screening strategies, such as testing for human papillomavirus (HPV), testing for molecular biomarkers (e.g., fluorescent immunochemical labeling) (Patterson et al., 2001), and cervicography, are not reviewed.

Pap Smear

A fundamental difficulty in the evaluation of screening tests for cervical cancer is the lack of reliability of the reference standard: cytological and histological interpretation of cervical specimens. Even among expert pathologists, interobserver variations in interpreting atypical squamous cells of undetermined significance and low-grade squamous intraepithelial lesions are substantial (Stoler and Schiffman, 2001).

A principal limitation of the Pap smear is its poor sensitivity. A recent meta-analysis calculated that the sensitivity and specificity of the Pap smear were 51 percent (95 percent CI, 37 to 66 percent) and 98 percent (95 percent CI, 97 to 99 percent), respectively (Agency for Health Care Policy and Research, 1999). False-negative results are due to both sampling errors (in obtaining the sample from the cervix and in cell collection and cell preparation techniques) and interpretation errors, with the latter accounting for about one-third of false-negative results. Efforts to improve sensitivity have included the introduction of the cytobrush, broom brushes, and plastic spatulas to gain better access to the squamocolumnar junction and endocervix. Other measures have been programmatic, such as federal legislation mandating manual reexamination of a portion of negative slides under the Clinical Laboratory Improvement Amendments.
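The practical consequence of these accuracy figures can be made concrete with a short sketch. The 51 percent sensitivity and 98 percent specificity below are the meta-analysis estimates quoted above; the 1 percent prevalence of detectable lesions and the function name are assumptions chosen purely for illustration.

```python
# Positive and negative predictive values from sensitivity, specificity,
# and prevalence. Sensitivity (0.51) and specificity (0.98) are the
# meta-analysis estimates quoted in the text; the 1 percent prevalence
# is an assumed figure used only for illustration.
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a screened population."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

ppv, npv = predictive_values(sensitivity=0.51, specificity=0.98, prevalence=0.01)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")  # roughly 20% and 99.5%
```

Under these assumed conditions, only about one in five positive smears reflects true disease, and although a negative result is reassuring at the population level, a test with 51 percent sensitivity still misses nearly half of the lesions present at the time of testing.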

Adjunctive Technologies

New technologies have been introduced in recent years to improve the sensitivity and specificity of screening. These include thin-layer cytology (ThinPrep), computerized rescreening neural network technology (Papnet), and algorithm-based computer rescreening (AutoPap). Although these innovations offer the promise of improving the sensitivity and specificity of screening, a systematic review by the Agency for Health Care Policy and Research concluded that existing data for making comparisons were inadequate to reach conclusions about their incremental impacts on health outcomes (Agency for Health Care Policy and Research, 1999). Coupling Pap smears with screening for HPV infection has also been advocated, but testing for HPV plays a larger role (including the evaluation of cervical atypia) that is beyond the scope of this review.

Since the introduction of the Pap smear by Papanicolaou in the 1930s, experience has revealed a consistent association between the routine use of cervical smears for cytological examination and lower rates of mortality from cervical cancer. Virtually all evidence of a mortality benefit is observational rather than experimental. The consistency of the evidence is impressive, however, coming as it does from studies conducted over time, ecological studies, cross-national comparisons, and case-control studies. A body of literature suggesting a 20 to 60 percent relative reduction in mortality rates has been reviewed by the U.S. Preventive Services Task Force (1996).

Harms

Cervical cancer screening carries few direct physical harms beyond the inconvenience, discomfort, and embarrassment that may accompany the examination procedure. The principal harms relate to the consequences of false-positive and false-negative results. As with other screening tests, psychological harms are a potential concern. In one study, 3 months after a positive Pap smear result, women were significantly more worried about cancer and had greater impairments in mood, daily activities, interest in sexual activity, and sleep patterns (Lerman et al., 1991a). Other studies have also reported an association between a positive Pap smear result and adverse emotional reactions, fears of cancer, decreased sexual function, social dysfunction, and feelings of unattractiveness (Khanna and Phillips, 2001). False-positive results also incur the inconvenience of follow-up re-examinations and colposcopic procedures. False-negative results introduce the risk of failing to detect interval dysplasia and cancer and underlie concerns about the need for frequent screening.

Periodicity of Screening

Although many physicians recommend an annual interval for Pap smears, there is little evidence that it confers greater benefit than screening every 2 or 3 years. A collaborative study of screening programs in eight countries, published in 1986 by the International Agency for Research on Cancer, shed considerable light on the incremental benefit of frequent screening and clarified how little additional protection annual screening affords over screening every 3 years. Across a wide range of intervals, screening afforded many-fold greater protection than no screening at all: about 15-fold at intervals of less than 1 year, about 12-fold at intervals of 1 to 2 years, and about 8-fold at intervals of 2 to 3 years (IARC Working Group on Evaluation of Cervical Cancer Screening Programmes, 1986).

More recent insights have been gained from a large prospective cohort study of 128,805 women at community-based clinics throughout the United States who were screened for cervical cancer within 3 years of normal smears (Sawaya et al., 2000a). It documented that the yield of screening is relatively low (for high-grade squamous intraepithelial lesions or lesions suggestive of squamous cell carcinoma, the incidence per 10,000 women was 66 for women under age 30, 22 for women ages 30 to 49, 15 for women ages 50 to 64, and 10 for women ages 65 and older). Moreover, the incidence rate did not differ significantly on the basis of the frequency of screening: 25/10,000 for screening at 9 to 12 months, 29/10,000 for screening at 13 to 24 months, and 33/10,000 for screening at 25 to 36 months. In previously screened postmenopausal women, in whom the incidence of new cytological abnormalities was low, the PPV of an abnormal smear was zero 1 year after a normal smear and 0.9 percent within 2 years. On the basis of this evidence, the investigators concluded that cervical smears should not be performed for postmenopausal women within 2 years of normal cytological results (Sawaya et al., 2000b).

Cost-Effectiveness

A variety of studies document that cervical cancer screening has acceptable cost-effectiveness ratios compared with those for no screening (Agency for Health Care Policy and Research, 1999). In one analysis, the cost-effectiveness ratios for screening every 1 or 3 years were $7,345 and $2,254 per year of life saved, respectively, and cost savings seemed likely if screening was targeted to women who have not had regular screenings (Fahs et al., 1992). Several analyses have cast doubt on the incremental cost-effectiveness of computerized rescreening versus conventional cytological evaluation (Meerding et al., 2001; Troni et al., 2000). The imprecision of the available data makes it inappropriate to draw conclusions about the relative cost-effectiveness of these modalities (Agency for Health Care Policy and Research, 1999), but the ratios tend to be more favorable if screening is less frequent. For example, in one analysis, annual use of the AutoPap cytology smear was estimated to cost $166,000 per year of life saved, whereas use of AutoPap every 4 years cost $7,777 per year of life saved (Brown and Garber, 1999). Some have cautioned that the resources expended to pay for these adjunctive technologies could compromise the delivery of cervical cancer screening to high-risk groups (Sawaya and Grimes, 1999).
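Ratios such as those above are obtained by dividing incremental cost by incremental years of life saved. The sketch below shows the structure of that calculation; every number in it is hypothetical and chosen only for illustration, not drawn from the studies cited.

```python
# Incremental cost-effectiveness ratio (ICER): extra dollars spent per
# extra year of life saved when switching from one screening strategy
# to another. All figures below are hypothetical illustrations.
def icer(cost_new, cost_old, life_years_new, life_years_old):
    """Dollars per additional year of life saved."""
    return (cost_new - cost_old) / (life_years_new - life_years_old)

# Hypothetical comparison: a more frequent schedule costs $300 more per
# person and yields 0.01 additional discounted life-years per person.
ratio = icer(cost_new=500.0, cost_old=200.0,
             life_years_new=0.05, life_years_old=0.04)
print(f"${ratio:,.0f} per year of life saved")
```

The small denominator illustrates why such ratios are so sensitive to modest changes in the estimated benefit, which is one reason published analyses of adjunctive cytology technologies reach divergent conclusions.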

Box 5.6 summarizes the cervical cancer screening recommendations of selected organizations.

BOX 5.6

Recommendations for Screening for Cervical Cancer.

SUMMARY AND CONCLUSIONS

The intuitive notion that early detection saves lives is supported by scientific evidence for some cancers. As detailed in this chapter, studies demonstrate that screening for colorectal, breast, and cervical cancer significantly lowers cancer mortality rates. For other cancers, however, the evidence is less direct. Although screening increases the likelihood that cancer will be diagnosed at an early stage, when survival rates are generally higher than those for individuals with advanced-stage disease, these findings do not necessarily prove that screening improves outcomes because of potential statistical artifacts (e.g., length and lead-time biases). Doubts about the value of screening grow even stronger when the available treatment options appear to be of limited efficacy and are unable to alter the natural progression of the disease.
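Lead-time bias, one of the artifacts mentioned above, can be illustrated with a hypothetical timeline (all ages below are invented for the example): earlier diagnosis lengthens measured survival even when the date of death is unchanged.

```python
# Hypothetical timeline illustrating lead-time bias. A tumor becomes
# screen-detectable at age 60, causes symptoms at age 63, and (with
# treatment equally ineffective either way) leads to death at age 66.
detectable_age, symptom_age, death_age = 60, 63, 66

survival_without_screening = death_age - symptom_age    # 3 years
survival_with_screening = death_age - detectable_age    # 6 years

# Measured survival doubles, yet the patient dies at the same age:
# the apparent gain is entirely lead time, not postponed death.
print(survival_without_screening, survival_with_screening)  # 3 6
```

This is why a higher survival rate among screen-detected cases, by itself, cannot establish that screening improves outcomes; mortality rates in screened versus unscreened populations are the more trustworthy measure.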

Given the alarming death toll from cancer, many would argue that the mere possibility of benefit from screening offers sufficient grounds for moving forward, even in the absence of scientific certainty. However, screening can itself be harmful. A substantial proportion of a population that is screened for cancer can receive false-positive or false-negative results (depending on the accuracy of the test), and this misinformation can set off a cascade of adverse physical and emotional health consequences. Even for those in whom cancer is accurately diagnosed, the incremental benefit of early detection may be outweighed by the side effects and complications of treatment.

The monetary costs of screening, which can be substantial, may be offset by the savings achieved through early detection, but quantifying the health gains achieved per dollar invested in screening requires evidence that screening produces health gains. Determining whether resources spent on screening are wise investments is a concern not only of insurance companies and other third-party payers but also of society at large. This is an era in which escalating health care costs and mounting pressures on service delivery are stretching the capacities of the U.S. health care system to the point of compromising quality (Institute of Medicine, 2001c) and threatening patient safety (Institute of Medicine, 2000e). Under such conditions it is reasonable to ask whether resources spent on screening tests of uncertain benefit would save more lives and achieve greater health gains if they were invested in health care services whose effectiveness is more certain.

Health care organizations, government agencies, advocacy organizations, and expert panels have struggled for decades with these issues in deciding what constitutes prudent policy and guidelines for cancer screening. Groups that develop such guidelines approach these issues from different perspectives—depending on their audiences, methods of developing guidelines, and the importance that they place on supporting scientific evidence (Woolf and George, 2000; Woolf et al., 1996)—and have reached different conclusions about who should be screened, how often, and by which tests (see Boxes 5.3 to 5.6).

Despite these inconsistencies, however, a core consensus has emerged about the appropriateness of certain types of cancer screening. There is essentially universal agreement across organizations that all adults age 50 and older should be screened for colorectal cancer, that all women should receive mammograms every 1 to 2 years beginning at least by age 50 (some say age 40), and that all sexually active women with a cervix should be screened regularly for cervical cancer. Of course, controversies about cancer screening persist, the details of which receive some attention in this report and are dissected in detail elsewhere (U.S. Preventive Services Task Force, 1996). The debate over whether men should routinely receive the PSA test symbolizes such controversies. A case study describing efforts to screen individuals for lung cancer, first using chest radiography and more recently using low-dose spiral computed tomography (CT), is presented in Chapter 7 to illustrate the dilemma of adoption of a new screening technology in the face of uncertain science.

From a public health perspective, the disturbing paradox is that the cancer screening tests for which there is a core consensus are not being administered to a large proportion of the Americans for whom they are recommended. Upward trends in the proportion of Americans receiving recommended cancer screening tests are heartening, but disparities in screening by socioeconomic status remain substantial, and many individuals are tested too late to obtain the full benefits of early detection, are tested incorrectly, or receive inadequate follow-up of their results. Chapter 6 examines the size of this gap and reviews evidence regarding potential strategies to improve the delivery of cancer screening services.

Footnotes

1

This chapter is a condensed version of a background paper prepared by Steven H. Woolf (www.iom.edu/ncpb).

2

Although this chapter is based on a review of evidence available in February 2001, this review was published in October 2001 as a follow-up to a January 2000 article. It is given special attention here because of the questions it has stimulated about the effectiveness of mammography.

Copyright 2003 by the National Academy of Sciences. All rights reserved.
Bookshelf ID: NBK223933
