NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Institute of Medicine (US) Committee on New Approaches to Early Detection and Diagnosis of Breast Cancer; Herdman R, Norton L, editors. Saving Women's Lives: Strategies for Improving Breast Cancer Detection and Diagnosis: A Breast Cancer Research Foundation and Institute of Medicine Symposium. Washington (DC): National Academies Press (US); 2005.


Saving Women's Lives: Strategies for Improving Breast Cancer Detection and Diagnosis: A Breast Cancer Research Foundation and Institute of Medicine Symposium.


3 Simultaneous Group Discussions with Invited Speakers

First Group Discussion: Delivering Better Breast Cancer Screening Services

, M.D.

Professor of Radiology and Biomedical Engineering, Chief of Breast Imaging and Director, UNC Biomedical Research Imaging Center and Member, Committee on Saving Women's Lives: Strategies for Improving Breast Cancer Detection and Diagnosis

MODERATOR AND RAPPORTEUR: The first speaker on this afternoon's panel is Stephen Taplin, a senior scientist in the applied research program at the National Cancer Institute, who will tell us about the organization of breast cancer screening services.

STEPHEN TAPLIN, M.D., Senior Scientist, Applied Research Program, NCI: Today we want to talk about how we can improve breast cancer screening by organizing care. I underline that screening is a process that leads to outcomes, not a test. Figure 3.1 presents the steps in the screening process. There are at least four different steps in the screening process: risk assessment, looking at who we are trying to reach; detection, finding an abnormality, where today's focus may be; diagnosis, evaluating the abnormality to find the cancer; and treatment. The transitions between these steps need to be organized as well. Focusing only on improving the steps, not on how women get from one step to another, will not result in improved breast cancer screening.

FIGURE 3.1. The screening process.


The whole process leads to at least two outcomes that can be examined - the long-term outcome of mortality and some short-term outcomes, like reductions in late stage disease.

To dissect this process, we can start by looking at recruitment. What are the ways to get people from the population at risk into screening? The next step is the detection process, and for this step we need to evaluate sensitivity and specificity, the quality and validity of the test itself. The third part is follow-up. In a study we just completed, we tried to simplify this process and isolate the problems (Taplin et al., 2004a). We decided to identify the source of all the late stage cases within an organized system. Were they people who were not being recruited, were not being detected, or had a breakdown in the follow-up of their care after a positive screen? Box 3.1 illustrates the sources of advanced cancers in populations with health insurance coverage where 70 to 80 percent of the women reported they had been screened. We found that 52 percent of the advanced cancers were recruitment failures. This teaches us that organized screening must include organized recruitment.


BOX 3.1

Sources of Late Stage Cancers. Failures in the process are associated with poor outcomes: 1,347 late stage cancers from 10 integrated health plans.

The limitations of mammography are not trivial, however. Forty percent of these women had a mammogram within the prior three years. So reduction of these detection failures by improving the quality of technology is important, although not the whole story. Failure of follow-up, where I thought the action would be, was, in fact, the cause of only about 8 percent of advanced breast cancers in this population.

We need to think about how to change the system. The Institute of Medicine (IOM) report recommends organized screening as a way to make screening happen in our populations. This builds on several previous IOM reports, beginning with Crossing the Quality Chasm, which stressed the need for systematic change to improve quality (Institute of Medicine, 2001). Organizing screening is a dramatic system change.

Then, measurement is critically important to clinicians making changes. Showing people they are making progress, that they are reaching entire populations, is a critical feedback loop. It tells people they are getting results from their actions in a way they otherwise might not appreciate.

European models of care have been mentioned. They demonstrate that organized care does have a definition. As Box 3.2 lists, it means an explicit policy with specified age categories, a method, and an interval for screening. It is a policy with a very explicit approach and a defined target population, a population at risk as specified in Figure 3.1. It is also about somebody being responsible, a leadership team. These things do not just happen. A quality assurance structure and measures to create feedback are also essential.


BOX 3.2

Definitions of European Organized Screening. European Models of Organized Care (IARC, 2002): An explicit policy, with specified age categories, method and interval for screening.

So organized screening is happening in Europe. Are these programs having an impact? Of course they are. They have shown a clear trend towards reductions in mortality in the populations as illustrated from 1974 through 1990 in Table 3.1.

TABLE 3.1. Results of Organized Screening Programs.


Can it be done in the United States? I just completed 20 years working at Group Health Cooperative, where we put this kind of program in place, and we are also beginning a pilot project in 20 Bureau of Primary Health Care clinics. Group Health is somewhat unique. However, there are a number of plans around the country that have an organized insurance structure, and, in any event, I think our plan's success was due more to leadership and paying attention and commitment to a direction than to the structure of our medical system.

We organized five mammography screening facilities within our system which serves 400,000 people in the Northwest with more than 70,000 women age 40 and above. About 35,000 women were screened each year in that population. We created a multidisciplinary team and a team for leadership. There were different providers involved in each of the steps. When we started to influence, and get feedback on, what was happening in our population, it was not a big problem to get these people together to ask what we were doing, what we needed to do.

We had a group that included surgeons, oncologists, nurses, radiologists, administrators, and primary care physicians, all working together to organize this care. We had groups in each region delivering the care. There was clinical leadership at a facility which involved the radiologist and primary care physicians as well as the nursing staff. And we had an information system which mailed reminders. The critical part is identifying the population and communicating with the women. We organized outreach to look across the whole process and explore how to improve it.

One of the first things we learned was that follow-up was not coordinated, and there were people falling through the cracks. So we created a system in which there were nurses in the radiology center who were responsible for the follow-up of all positives. All positive mammograms were in a database, and the nurses took responsibility for communicating with the primary care physician and the surgeon.

We carried out a number of studies funded by the National Cancer Institute (NCI) to look at recruitment. A reminder postcard was found to be effective in improving recruitment (Taplin et al., 1994). We did a risk survey. Collecting risk factor information from a survey and informing women about their personalized risk increased the likelihood that they would come in for mammography to 66.7 percent compared to 42.9 percent among controls, who received generalized risk information (Curry et al., 1993). We found that a simple call, in which the woman was asked to come in and was scheduled, was as effective as a call addressing all the care issues. The reminding phone call itself was sufficient (Taplin et al., 2000).

Then we turned to detection. We measured sensitivity, specificity, recall rates, and positive predictive values, and we provided yearly reports back to the radiologists including all the false negatives, which is more than is required by the Mammography Quality Standards Act (MQSA). Radiologists then went through their own quality assessment program. We looked at a method of improving clinical image quality and reported that interval cancers were more likely to occur in mammograms of poor quality (Taplin et al., 2002), and we are currently studying computer assisted detection. Then we conducted a teaching session with our technologists to try to improve clinical image quality.
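The four measures named above can all be read off a two-by-two table of screening results against cancer status. The following is a minimal sketch of that arithmetic, assuming every positive mammogram counts as a recall; the class name, function name, and counts are hypothetical illustrations and are not data from Group Health or any study cited here.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Hypothetical 2x2 table of screening results versus cancer status."""
    true_pos: int   # positive mammogram, cancer confirmed
    false_pos: int  # positive mammogram, no cancer
    false_neg: int  # negative mammogram, cancer diagnosed later
    true_neg: int   # negative mammogram, no cancer


def performance(c: ScreeningCounts) -> dict:
    """Return the four measures as proportions between 0 and 1."""
    total = c.true_pos + c.false_pos + c.false_neg + c.true_neg
    positives = c.true_pos + c.false_pos
    return {
        # Share of women with cancer whose mammogram was positive.
        "sensitivity": c.true_pos / (c.true_pos + c.false_neg),
        # Share of women without cancer whose mammogram was negative.
        "specificity": c.true_neg / (c.true_neg + c.false_pos),
        # Share of all screens called positive (assumed equal to recalls).
        "recall_rate": positives / total,
        # Share of positive screens that turned out to be cancer.
        "ppv": c.true_pos / positives,
    }


# Illustrative counts only; not taken from any report discussed here.
example = ScreeningCounts(true_pos=40, false_pos=960, false_neg=10, true_neg=8990)
measures = performance(example)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Counting false negatives is what makes the sensitivity figure possible, which is why a report that includes them carries more information than recall and biopsy outcomes alone.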

We evaluated follow-up and treatment. As I said earlier, our nurses assumed responsibility for communicating both with the primary care physicians and the surgeons to ensure follow up. If it was not occurring, they contacted the women. Group Health has also been looking at how treatment is organized. We already know that within our group the odds of breast conserving therapy are about 300 percent higher than in the surrounding community.

Over a 17-year period, we systematically changed the entire system for individuals, by surveying and by recruiting; for the physicians, by giving them feedback about patient participation and results of screening; for the organization as a whole, by a steering committee and a multidisciplinary team; and by creating an information system.

Did it have an impact? Absolutely. Our screening rates (ever had a mammogram) among women age 50 and above increased from under 50 percent to over 80 percent between 1986 and 1990, and we had similar increases in rates for women between 40 and 50 years of age and for mammograms within the last two years (increased from about 26 to 51 percent in women age 50 and above). We also reduced the numbers of women with late stage disease as shown in Figure 3.2. Our report of these data provides evidence that enrollment in an organized screening program is associated with increased likelihood of mammography and reduced odds of late-stage breast cancer compared to community controls (Taplin et al., 2004).

FIGURE 3.2. Group health rates of women found with late stage disease are lower as a result of early detection.


We can't describe the total use of resources in the surrounding community, because we don't have individual level data, but my suspicion is that our program consumed fewer resources, that is, was more efficient. So organized care has advantages within this health plan in promoting efficient use of screening, affecting screening and screening interval rates, and in reducing late stage disease.

Having made the case in one place, what is the next step? Is it possible to organize screening in other settings? We turned to the cancer collaborative, which is a consortium of agencies, the Bureau of Primary Health Care, the NCI, the CDC, and the Institute for Health Care Improvement, which have joined in a 2-year effort to change screening in a comparable way within the Bureau of Primary Health Care.

The Bureau's clinic sites are intentionally spread around the country. Of the 800 clinics, there are 20 sites participating, so it is a small proportion, but it is the pilot proportion. The 800 clinics as a whole serve more than a million women over age 45, so there is a chance to have an impact on a large population of people. The collaborative's program makes the policy explicit, sets targets for improvement, and then asks about measurement. We create a leadership and implementation team within the primary care group. That team is the physician, the nurse, the PA, the medical records person, and the receptionist. We organize the recruitment, the follow-up, and the referral for treatment. Then we encourage regular changes within the clinic in order to achieve these, and we create a data system to identify how many people receive care. We emphasize systematic reorganization and practice teams. We meet monthly with these teams. We have three sessions within the year in which we discuss progress and assess what they are doing. We then meet on a monthly basis as they put the new plan into place. The bottom line is they are being asked to look at what they are doing, change it, and measure what the results are.

So in conclusion, we can systematically change the screening process. It does not require rewriting legislation; it requires the will and the leadership to do it; it takes time, and it takes data. We need also to address barriers to collaboration and whether we can create an environment in which quality improvement is encouraged and reinforced. Those are important questions for us and for our society.

REBECCA SMITH-BINDMAN, M.D., Associate Professor, Radiology, Epidemiology, and Biostatistics, Obstetrics, Gynecology, and Reproductive Medicine, University of California, San Francisco: There has been an enormous amount published over the last few decades about who is getting screening mammography. There have been several predictors of screening noted—age, race, ethnicity, having a usual source of care, rural residence, as well as financial barriers. I think there is a general belief now, however, that the differences by these predictors have declined as a result of numerous mammography outreach efforts.

Our knowledge of the current status of mammography is based largely on two very widely cited surveys, the CDC's Behavioral Risk Factor Surveillance System and National Health Interview Survey, both of which assess mammography use annually. The surveys have found that the historical discrepancies in mammography use that have been seen by race and ethnicity have declined significantly. According to these self-report surveys, most eligible women are now getting mammography; 70 to 80 percent of women report that they have had a mammogram in the last two years, and there is no reported difference by race and ethnicity. In fact, some minority groups report higher rates of screening than white women. So many people, including policy makers, have concluded that compliance with recommendations for screening mammography is no longer a problem.

We have not discussed differences in cancer statistics a lot today, although Dr. Esserman touched on this in her presentation. However, I think it is understood that there are substantial differences in breast cancer outcomes and breast cancer detection rates by race and ethnicity. In general, non-white women have more advanced disease at diagnosis. There have been improvements in breast cancer mortality over the last decade, but SEER statistics show that the improvements have been largely limited to white women. Mortality curves have been essentially flat for other racial and ethnic groups.

It seemed to me that one might question the value of mammography if mammography use is now the same (and high) among different racial and ethnic groups, but differences persist in breast cancer mortality and tumor stage at diagnosis. Clearly, if mammography were working, we would expect breast cancer mortality rates to decline coincident with improvements in screening mammography rates.

One possible explanation for the failure of mortality rates to decline among racial minorities along with their higher use of mammography is that estimates of their use of mammography based on self-reports may be inaccurate. I think there is growing concern that this might be the case, and that women, particularly minority women, may overestimate their use of mammography.

So, I have been interested in investigating whether there are persistent differences in the use of mammography. We have just completed a study that evaluates screening mammography use among a large number of racially and ethnically diverse women diagnosed with cancer. This study examines recorded mammography use from medical records, and, therefore, we believe these data are more accurate than self-report data. We used data from the NCI funded Breast Cancer Surveillance Consortium which comprises mammography registries in seven states and is probably the largest data set available to assess actual mammography in the United States.

In this data set, we learned about mammography use based on medical records, radiologist reports, and a survey that each patient completed every time she had a mammogram (such as a patient self-reported breast mass at the time of a mammogram). These data allowed us to assess mammography use in a more detailed way compared with the self-report surveys that are relatively crude in terms of their assessment of mammography—that don't differentiate between a screening mammogram, or a diagnostic mammogram, or whether there was a mass at the time of mammography. Clearly if a woman has a mass at the time of a mammogram, it should not be considered a screening test.

Cancer outcomes are complete in this data set, since over 95 percent of cancers are ascertained based on linkage to cancer registries. The data describe approximately 900,000 women who are racially and ethnically diverse. In these women ages 40 to 85, there were approximately 26,000 breast cancers diagnosed.

Table 3.2 displays the characteristics of tumors in these women by race and ethnicity. These numbers should be, and are, very similar to recent reports using SEER data. They display the adjusted odds of advanced stage cancer by race and ethnicity using white women (set at one) as the reference. Essentially, African American, Hispanic, and Native American women are at increased risk of having an advanced cancer at diagnosis, and such tumors are less likely to be curable. These are the results you would expect, looking at current population tumor registry data.

TABLE 3.2. Breast Cancer Characteristics by Race and Ethnicity with White Women as the Reference (adjusted odds ratios).


We next looked at the mammography use among these women in the five years prior to breast cancer diagnosis. We categorized women into five groups based on their screening frequency. The most screened women had a mammogram a year before cancer diagnosis. The least screened woman had not had a screening mammogram for at least five years prior to cancer diagnosis. We considered women to have inadequate mammography if they had either never had a mammogram, had not had a mammogram for at least three and a half years, or had their first mammogram after age 55, or only coincident with the diagnosis of cancer.

There were substantial differences in mammography use by race and ethnicity. Compared to white women, all minority women were less likely to be screened regularly and more likely to be screened infrequently. In comparison to whites, the odds ratios varied from 1.4 to 1.8, and these ratios suggest minority women were around 40 to 80 percent more likely to not be screened.
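For readers less familiar with odds ratios, the sketch below shows how one is computed from a two-by-two table and how it relates to a ratio of proportions. It is an unadjusted illustration with hypothetical counts, whereas the ratios reported above were adjusted, and the function names are assumptions introduced here. An odds ratio is close to the ratio of proportions only when the outcome is uncommon.

```python
def odds_ratio(group_events: int, group_nonevents: int,
               ref_events: int, ref_nonevents: int) -> float:
    """Unadjusted odds ratio of the event in a group versus a reference group."""
    return (group_events / group_nonevents) / (ref_events / ref_nonevents)


def risk_ratio(group_events: int, group_nonevents: int,
               ref_events: int, ref_nonevents: int) -> float:
    """Ratio of the proportions experiencing the event in the two groups."""
    group_risk = group_events / (group_events + group_nonevents)
    ref_risk = ref_events / (ref_events + ref_nonevents)
    return group_risk / ref_risk


# Hypothetical counts: "event" means inadequately screened.
# Comparison group: 300 inadequately screened, 700 adequately screened.
# Reference group:  200 inadequately screened, 800 adequately screened.
or_value = odds_ratio(300, 700, 200, 800)
rr_value = risk_ratio(300, 700, 200, 800)
print(f"odds ratio: {or_value:.2f}")
print(f"risk ratio: {rr_value:.2f}")
```

With these made-up counts the odds ratio is larger than the ratio of proportions because the event is common in both groups.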

We then looked at tumor characteristics after adjusting for mammography. We asked whether women who were similarly screened would have similar types of cancer, or are minority women who are similarly screened still at increased risk for advanced cancers. The latter would suggest a biological explanation, the former discrepancies in mammography use.

Once we stratified by mammography, the differences in tumor size, stage, lymph node involvement, and symptoms by race and ethnicity were reduced or eliminated. Thus no matter what a woman's race or ethnicity, similarly screened women had similar types of tumors. Interestingly, differences in tumor grade persisted even after adjusting for mammography.

Figure 3.3 shows the percent of women with large tumors in each racial and ethnic group by use of mammography. From left to right, the groups go from most to least use of mammography. The percentage of women with large tumors increases as mammography decreases, and large tumors were found in about 80 percent of women who were never screened.

FIGURE 3.3. Increasing tumor size with increasing interval since mammography.


However, once plotted by mammography, there are few differences by race or ethnicity at each screening level. Similar results were found for advanced stage, lymph node involvement, and symptoms. Tumor grade did not change much from the most recent to the most distant mammography groups, and African Americans persistently had higher grade tumors as noted earlier by Dr. Esserman, apparently for biological reasons. It is clear that mammography use is associated with size of tumors, but differences by race and ethnicity, represented by the four different lines of the figure, are no longer significant.

In summary, there are persistent and dramatic differences in cancer characteristics which are reduced or eliminated when data are adjusted for mammography use. Therefore, mammography appears to be in large part causal for the differences in tumor characteristics by race and ethnicity. Since mammography clearly contributes to racial and ethnic differences in mortality and to reduction of mortality in general, it is vital to increase the regular use of screening. Having had a mammogram once does not protect against advanced stage cancer; regular screening is required. We consider a 3-year interval to be a minimum requirement.

I now turn to mammography use among elderly women. Medicare billing records are a great source of data to assess population-based use of mammography in Medicare eligible elderly women. Clearly mammography rates in this population have increased over time. However, the overall rates are substantially lower than suggested by self-report surveys. In contrast to self-report data, data from Medicare billing records suggest substantial and persistent racial and ethnic disparities.

Figure 3.4 illustrates biennial mammography use over time in women aged 65 to 69 in whom the evidence is clearest that mammography is beneficial. Rates of mammography have increased, but even as recently as 2001, those rates are only 50 percent, far lower than the 70 to 80 percent from the self-report surveys. There is obviously a significant gap between white and African American women which is growing rather than shrinking.

FIGURE 3.4. Racial and ethnic differences in mammography in Medicare beneficiaries.


In summary, there are persistent disparities in who is getting mammography. Repeat mammography is very important, and it is less frequent than widely believed. Dr. Taplin showed how to increase regular mammography use, but that may be more easily said than done with respect to racial and ethnic minorities and underserved women. Clearly the current efforts to recruit these women need improvement.

We have heard a lot about the accuracy of mammography today. It is not a perfect test. Not all cancers are found, and not all normal women have a normal test result. But it is the best test we have, and it is a pretty good test when done well.

Dramatic differences have been reported between the U.S. and other countries in the performance of screening mammography, but it is unclear whether these represent true differences in how mammograms are interpreted, or whether they relate to patient characteristics, such as age, the mix of screening and diagnostic exams, or the screening intervals, that is, is it the performance of mammography or the case mix? The data seem to show that performance has a lot to do with it.

I recently spent a year in Britain which gave me the opportunity to compare screening mammography there to that in the U.S. To examine that, I pooled data from the two countries (Smith-Bindman et al., 2003). In the case of the U.K., I used data from the National Health Service Breast Screening Program on 3.94 million women. This is a single organized screening program which differs in many ways from the United States, but basically provides mammograms very similar to those in the United States, by radiologists whose training is very similar, and with very similar technology. In the United States, I included data from two sources, 978,591 women from the Breast Cancer Surveillance Consortium and 613,388 women from the CDC's National Breast and Cervical Cancer Early Detection Program. Since performance in mammography varies by age, all the analyses I will discuss are either age-adjusted or age-stratified. Performance in mammography also varies by whether women have undergone previous mammography, so all those results are stratified by first or subsequent exams, which is made easy because in both the U.S. and the U.K. separate data are kept for the first and subsequent exams.

We looked at several measures of performance. First, what percentage of mammograms resulted in a recommendation to do a further examination, a recall for either additional non-invasive workup, a diagnostic mammogram, an ultrasound, a clinical breast exam, or a recommendation for a biopsy (open or percutaneous), cytology or histology? I looked in particular at the rate of open surgical biopsies per 100 mammograms and likewise the rates of biopsies that did or did not discover cancer per 100 mammograms.

Balanced with the recall rate is the resulting cancer detection rate, how many cancers are you finding by calling back women. You might be willing to accept a very high recall rate if you are finding a lot of cancers, including invasive cancer and ductal carcinoma in situ (DCIS). This rate was defined as the number of cancers detected per thousand mammograms. The denominator for cancers is 1,000, whereas the denominator for recall is 100, because cancers are much rarer than recalls. I counted a cancer if it was diagnosed within one year of screening. I did not look at the false negative rate, or sensitivity, because the method of ascertainment for the three programs is very different. Furthermore, cancer detection rates have been found to very closely parallel sensitivity.
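The two denominators described above can be illustrated with a short calculation. This is a minimal sketch with hypothetical counts; the helper name `rate_per` and the numbers are assumptions for illustration and are not drawn from the three programs.

```python
def rate_per(events: int, screens: int, per: int) -> float:
    """Express a count of events as a rate per `per` screening mammograms."""
    if screens <= 0:
        raise ValueError("screens must be positive")
    return events * per / screens


# Hypothetical program totals, for illustration only.
screens = 20_000   # screening mammograms performed
recalls = 2_600    # screens followed by a recommendation for further evaluation
cancers = 100      # cancers diagnosed within one year of a screen

recall_per_100 = rate_per(recalls, screens, 100)
cancers_per_1000 = rate_per(cancers, screens, 1000)
cancers_per_100_recalls = rate_per(cancers, recalls, 100)

print(f"recalls per 100 mammograms: {recall_per_100:.1f}")
print(f"cancers per 1,000 mammograms: {cancers_per_1000:.1f}")
print(f"cancers per 100 recalls: {cancers_per_100_recalls:.1f}")
```

Reporting the two rates side by side makes the trade-off visible: a program can be judged on how many women it calls back for each cancer it finds.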

Screening is more frequent in the U.S. than in the U.K. Therefore, it is important to look at recalls and cancers detected over an interval of screening rather than at just a single screening examination. You want to know basically what would happen to a woman in any of these programs if she participated in the program over some length of time, say 10 or 20 years.

Women in the U.K. will have around three exams over a 10-year period. We examined total cancer detections and recalls assuming one first exam and several subsequent exams, and then added up the total number of recalls and cancer diagnoses. We did the same thing in the U.S., except there were more exams to add up. The cancer rate (Table 3.3) is slightly different in the different programs because of the different ages of the women screened. But we found approximately the numbers of cancers you would expect, 5 cancers per 1,000 screening examinations. In terms of the recall rate, we asked what percent of women are recalled for additional evaluation after a screening exam. The numbers are similar between the two U.S. data sources. Approximately 13 or 14 percent of women are recalled for additional evaluation. In the U.K. the number is approximately half that (shown in Table 3.4). The recall rate is lower for subsequent exams, approximately 50 percent lower, but the trend is the same. The recall rates are twice as high in the U.S. as in the U.K.
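The adding-up over one first exam and several subsequent exams can be sketched as a simple sum. The rates and exam counts below are hypothetical and are not the age-specific values used in the study, so the output is not expected to reproduce the published figures. The sum gives expected recalls per 100 women, which equals the percent of women recalled at least once only if no woman is recalled more than once.

```python
def recalls_per_100_women(first_rate: float, subsequent_rate: float,
                          total_exams: int) -> float:
    """Expected recalls per 100 women over a screening period.

    Assumes one first exam followed by (total_exams - 1) subsequent exams,
    with each rate given as a proportion of exams that lead to a recall.
    """
    if total_exams < 1:
        raise ValueError("total_exams must be at least 1")
    per_woman = first_rate + (total_exams - 1) * subsequent_rate
    return 100 * per_woman


# Hypothetical inputs: the same per-exam rates, different screening frequency.
fewer_exams = recalls_per_100_women(first_rate=0.10, subsequent_rate=0.05,
                                    total_exams=3)
more_exams = recalls_per_100_women(first_rate=0.10, subsequent_rate=0.05,
                                   total_exams=6)

print(f"3 exams in the period: {fewer_exams:.1f} recalls per 100 women")
print(f"6 exams in the period: {more_exams:.1f} recalls per 100 women")
```

Even with identical per-exam rates, more frequent screening produces more recalls over the period, which is why the comparison is made over an interval rather than per exam.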

TABLE 3.3. Cancer Rates in Three Data Sets from the U.S. and the U.K.


TABLE 3.4. Comparative Recall Rates of Women After Mammography in the U.S. and the U.K.


In terms of what kinds of additional tests result from the recall exams, it turns out that most of the difference is in the non-invasive workup. In the U.S., we recall about 10 to 12 percent of women in the 50 to 54 year age group for a non-invasive further evaluation (such as ultrasound), and in the U.K. it is half that at 5 percent, as shown in Table 3.5. In terms of pathologic evaluation (that is, biopsy), the numbers are much closer together.

TABLE 3.5. Percent Recall Exams in the U.S. and the U.K.


The rates of cancer detection are very similar in the two countries. In 50- to 54-year-old women, approximately 6 cancers per 1,000 were detected by screening mammography. The cancer rate increases with age, from 6 to 12, but remains similar across the different programs. Thus, the same numbers of cancers are found despite much higher recall rates in the U.S. If we examine the data over a 10-year period for women in the 50 year age range, about 17.5 percent would be recalled in the U.K. and in the U.S., where there is a greater frequency of screening, between 40 and 50 percent. These numbers are high, but they are exactly the same as have been reported by others using different data sets. Similar results are found for women in their sixties—substantially higher recall rates in the U.S. by two- to three-fold. The numbers of cancers detected in all three data sets, however, are similar based on estimating screening over 20 years.

So in summary, the U.S. programs are really very similar, but the United States is very different from the U.K. Recall rates are twice as high in the U.S. Negative open surgical biopsy rates are two to three times as frequent in the U.S. as well. Cancer detection rates, however, are similar, and there is no difference in the detection of large cancers.

I can speculate that the differences between the two countries may reflect the higher rate of litigation here which is focused on delayed breast cancer diagnosis. This may lead U.S. physicians to recall patients, even when they see a finding that has a low likelihood of cancer. Additionally, in the U.K. a much smaller number of radiologists focus on screening mammography. On average their mammographers read ten times as many mammograms as their U.S. counterparts.

British radiologists know there is limited manpower. They know they can't recommend that 10 or 20 percent of women come back for diagnostic exams. There is not the capacity to handle this number, so they consciously limit recalls to the number of diagnostic mammograms they can handle. But, in fact, this is helping their program as they are finding the same numbers and types of cancers without all of the additional evaluation. Also, centralized reading and double reading are the standard, almost 100 percent, and they use this system to limit recalls.

Lastly, and I think most importantly, the U.K. has nationally set quality standards that are intensively monitored through a QA network. They have very targeted CME programs that teach radiologists to reach the standards. There are agreed-upon targets about what is desirable, that is, benchmarking. We don't have that here. We don't have a set of standards that label a recall rate of 20 percent unacceptable. On the other hand, a recall rate of two percent is not acceptable unless you find a certain number of cancers.

So they have targets, and because they have set them and because they have a coordinated effort to reach very specific recall and cancer detection rates, they are better able to reach their targets. Programs and individual physicians are subject to annual peer review. Under-performing programs and physicians are reviewed. Physicians take an exam that includes practice tests; it is voluntary, but over 90 percent of physicians take it once a year. As a result the performance of outlier radiologists has improved dramatically. There is no similar program in the U.S., but I think it might be incorporated in the programs of many health care organizations, or perhaps most easily under MQSA. I think the U.K. experience teaches us that we should focus on standardizing the interpretive components of mammography and setting performance parameters as has been done for technical performance in MQSA (summarized in Box 3.3).

BOX 3.3

Summary Recommendations. The accuracy of screening mammography can and should be improved. The U.K. provides an example of a success model that relies on setting clear goals and continuous quality improvement. Access to regular screening mammography, ...

CHARLES FINDER, M.D., Associate Director, Division of Mammography Quality and Radiation Programs, Center for Devices and Radiological Health, Food and Drug Administration: In this presentation, I will describe the historical basis for the passage of the MQSA, briefly review the current MQSA program, and outline objective indicators of program performance.

In 1985, a nationwide evaluation of X-ray trends (the NEXT study), found that there was wide variation in image quality and radiation dose among mammography facilities. In 1987, the American College of Radiology (ACR) established a voluntary program of accreditation for mammography. By July 31, 1992, 2,684 (37 percent) of the 7,246 facilities that applied had failed, which meant that only 4,662 (or about 42 percent) of the approximately 11,000 total facilities then in service were fully accredited, and there had been no on-site evaluation of these facilities. Also by 1992, only 10 states had adopted any form of legislation referable to the quality of mammography. Michigan had the most comprehensive program, which had begun in 1989. This program had equipment and personnel requirements, carried out some annual inspections, and found that 34 percent of its units failed a quality test.

These findings supported enactment of MQSA, which was signed into law on October 27, 1992, and stipulated that all mammography facilities were to be certified by October 1, 1994. The Food and Drug Administration (FDA) was tasked with developing and implementing regulations for MQSA. Interim regulations became effective on October 1, 1994. These regulations and accompanying procedures set quality standards, standards for accreditation and certification, and dealt with inspections. They closely conformed to those of the ACR. The biggest change was the initiation of annual on-site inspections.

The final regulations were implemented after a long process that dealt with notice and comment, and they went into effect on April 28, 1999. They expanded and clarified many of the interim regulations' requirements. For example, the interim regulations required that equipment be specifically designed for mammography; in the final regulations, the FDA listed the specific requirements that the equipment had to meet.

Currently, the ACR is the only national accrediting body, and Iowa, Arkansas, and Texas are state accrediting bodies under MQSA. California withdrew as a state accrediting agency May 4, 2004, and since then accreditation in that state has been taken over by the ACR. The major function that an accrediting body performs is the review of clinical and phantom images from each mammography facility at least once every three years. Additional reviews are performed when there is a suspected public health risk. These are more intensive evaluations of a facility to determine whether or not there is a public health risk; finding such risk, FDA would go in and ask the facility to notify those patients who were at risk.

In addition to accreditation, there is also certification. Currently, the Food and Drug Administration is the only national certifier, and two states, Iowa and Illinois, are state certifiers. Issuance of MQSA certificates, which are required to lawfully provide services, is the major function of certification bodies. Certification also involves annual inspections of facilities to ensure compliance with regulatory standards. In those cases where a risk to human health has been determined, the certification agency will require the facility to notify all referring physicians and their patients that there is a problem which may require review of mammograms. This has happened several times over the course of the program, and occasionally involves as many as 10,000 patients at a time. The certifying agency provides enforcement through compliance activities, and if necessary, can impose sanctions and court actions, although this is rare.

The final regulations also set up quality standards. These cover personnel qualifications in three different categories: the interpreting physicians; the radiologic technologists; and the medical physicists. We also have standards for the reports that are sent to referring physicians and patients. All patients should now receive a lay summary of their results. There are also requirements for record retention, a medical outcomes audit, which I know many people are interested in, quality control testing, and standards for equipment and quality assurance. There is a requirement for an annual physics survey and for evaluation of equipment before it is used on patients. There are also requirements that a consumer complaint mechanism be established at all facilities and that all facilities have infection control procedures.

All interpreting physicians, radiologic technologists, and medical physicists that provide mammography services must meet specific initial and continuing training, education, and experience requirements. Specifically, the interpreting physician must have a valid state license, be either board certified in diagnostic radiology or have at least three months of formal training in mammography, have 60 category one continuing medical education (CME) credits in mammography at least 15 hours of which were obtained in the 3 years prior to qualifying as an interpreting physician, and then have interpreted mammography examinations from 240 patients in the preceding 6-month period under the direct supervision of a qualifying interpreting physician. All interpreting physicians must have 15 CME credits within a 36-month period and must interpret 960 mammography examinations in a 24-month period.
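The continuing requirements just described (15 CME credits in 36 months and 960 examinations in 24 months) amount to a simple threshold check. The following is a minimal sketch of that check only; the record type and field names are hypothetical illustrations, not an FDA or MQSA data format, and the initial-qualification rules are not modeled.

```python
from dataclasses import dataclass

# Continuing-qualification thresholds as stated by the speaker.
MIN_CME_CREDITS_36_MONTHS = 15
MIN_EXAMS_24_MONTHS = 960


@dataclass
class InterpretingPhysicianRecord:
    """Hypothetical record; field names are illustrative, not an FDA schema."""
    cme_credits_last_36_months: float
    exams_interpreted_last_24_months: int


def unmet_continuing_requirements(record: InterpretingPhysicianRecord) -> list[str]:
    """Return the names of the continuing requirements the record fails to meet."""
    unmet = []
    if record.cme_credits_last_36_months < MIN_CME_CREDITS_36_MONTHS:
        unmet.append("CME credits")
    if record.exams_interpreted_last_24_months < MIN_EXAMS_24_MONTHS:
        unmet.append("interpretation volume")
    return unmet
```

An empty list means both continuing requirements are met; for example, a physician with 20 credits but only 500 interpreted examinations would be flagged for interpretation volume.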

Reporting standards require that all reports must contain an overall assessment of findings. There are requirements for communicating the results to the referring physicians and patients, and there are also requirements that the films and medical reports be retained for as much as 10 years. Reports to referring physicians must have one of the six assessment categories: negative; benign; probably benign; suspicious; highly suggestive of malignancy; or incomplete, needs additional imaging evaluation.

For the medical outcomes audit, we have a very general, some might say superficial, requirement, but it is amazing how few facilities were even implementing this level of evaluation. All mammography facilities must have a system to follow up all positive (suspicious or highly suggestive of malignancy) mammograms, and an audit physician must be assigned responsibility to ensure that the data are collected and analyzed on a regular facility and individual physician basis with correction of any problems identified.

The FDA requires many equipment quality assurance tests, daily, weekly, quarterly, or semiannually. There is also a requirement that a medical physicist perform a series of annual tests. These cover the equipment's basic requirements, including evaluation of the automatic exposure control, dose (which generates a lot of patient interest and concern, although problems are few), phantom image quality, and radiation output, among others. Finally, there are required tests for other mammographic modalities, meaning full field digital mammography, which has its own list of specific tests designed for that equipment.

Table 3.6 displays the data on numbers of facilities at the start of four recent fiscal years. In 2000 we had almost 10,000; as of October 1, 2003 we were down to a little over 9,100, but note that the average number of mammography units per facility has increased from 1.2 to 1.5, so actually the availability of mammography units appears to be increasing slightly.
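The point about unit availability can be checked with rough arithmetic. This sketch uses the rounded figures quoted above ("almost 10,000" and "a little over 9,100" facilities, and averages of 1.2 and 1.5 units per facility), so the result is only an order-of-magnitude estimate and not a value taken from Table 3.6.

```python
def estimated_units(facilities: float, units_per_facility: float) -> float:
    """Estimate total mammography units from a facility count and an average."""
    return facilities * units_per_facility


# Rounded inputs taken from the speaker's remarks, not exact table values.
units_2000 = estimated_units(10_000, 1.2)
units_2003 = estimated_units(9_100, 1.5)

# Relative change in estimated units between the two dates, in percent.
percent_change = (units_2003 - units_2000) / units_2000 * 100

print(f"Estimated units in 2000: {units_2000:.0f}")
print(f"Estimated units in 2003: {units_2003:.0f}")
print(f"Estimated change: {percent_change:.1f} percent")
```

Because both the facility counts and the per-facility averages are rounded, the computed change should be read as showing the direction of the trend (more units despite fewer facilities) rather than its precise size.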

TABLE 3.6. Mammography Facility Numbers Have Been Declining.

The FDA's annual inspection at each facility reviews personnel qualifications, the medical reports and lay summaries, and the outcomes audit, primarily to ensure that they have been done. We don't capture the data, but we do make sure that the facility is meeting our requirements. We check to make sure that the equipment is performing. We check dose and phantom images and processing and darkroom fog. We also check to see that the medical physicists have performed the tests as required and that there is a consumer complaint mechanism. When the inspector finds that the facility is not meeting all requirements, the facility is given an inspection observation. We have broken that down into three levels: level three is the most minor deviation, generally satisfactory; level two is facility performance that is acceptable with a deviation that may affect quality; and level one is a more significant problem, deviations that may seriously compromise quality.

The FDA began inspections under the interim regulations in July 1995, and we are currently doing about 8,500 inspections each year. When we started the inspection program in 1995, only 30 percent of facilities were violation free and 2.7 percent were level one. Our data for this year through the end of April show 68.2 percent of facilities violation free and 1.9 percent at level one. By and large, this represents a steady improvement over the years. However, there have been some bumps in the road, such as when we put in new regulatory requirements like continuing personnel requirements in the late 1990s and 2000. It is noteworthy that when the mandatory accreditation program began, and clinical images from all facilities were being reviewed, about three quarters of facilities were passing on first attempt. Now, the current percentage is about 99 percent, so there has been an objective improvement there also.

A phantom is one of the ways that we evaluate image quality, not actual clinical image quality, but as a surrogate for that. The purpose of the phantom is to simulate some of the structures that we will find in a breast. The typical phantom contains 16 objects. The more objects you can see on the image, the better the image quality. Figure 3.5 displays what has happened to dose and image quality over the past two or three decades. Historically, doses were fairly high, but they have declined significantly with only a slight increase recently because breast imagers have determined that more exposed (darker) films improve image quality. Clearly, over the same period in which doses declined, image quality, as measured by a phantom, improved dramatically.

FIGURE 3.5. Image quality has improved and radiation dose decreased.

Those interested in more information about our program can find it on our website at http://www.fda.gov/cdrh/mammography, and if anybody ever has any questions about a specific facility, we also have a facility hotline, 1-800-838-7715, that facilities or patients can call.

JAMES BORGSTEDE, M.D., FACR, Chairman, Board of Chancellors, American College of Radiology, Clinical Professor of Radiology, University of Colorado Health Science Center: I am a practicing radiologist in Colorado Springs who personally interprets more than 3,600 mammograms and performs more than 100 image guided breast biopsies each year. I will talk about quality and access from the perspective of the ACR and from the practitioner perspective. I commend the IOM for this report, and I stress that quality is a concern of both the IOM and the ACR; we have very few differences of opinion on how to achieve it.

Today, I will focus on four subjects: work force, liability, economics and reimbursement, and then the College's efforts. Let's talk about work force first. The U.S. General Accounting Office (GAO) recently reported on mammography capacity, and that report, which dealt with data from 1998 to 2000, gives me both some optimism for the present and some concern for the future (GAO, 2002). GAO concluded that capacity at the time of the report was adequate to deliver mammography services. Facilities had decreased during the study period by 5 percent, numbers of machines had increased by 11 percent, technologists had increased by 21 percent, and mammography had increased by 15 percent. ACR finds that these trends are continuing. In 2002-2003 the College sent a survey to 16,147 of our members, and 9,048 (56 percent) responded; of those, 4,924 (54 percent) were doing some mammography and 654 (7 percent) reported being specialists in mammography.

On the other hand, population projections predict that the numbers of women at eligible ages for mammography will be growing by 1.25 million each year through 2020. Hence the reason for concern: are we going to be able to provide all of those women with mammography services in the future?

Furthermore, improving quality while simultaneously increasing access for growing numbers of women may be difficult. Others today have cited an analysis (Beam et al., 2003b) that concluded that improving interpretation by eliminating poorly performing mammographers would likely result in an inadequate workforce. There is the potential for an inverse relationship between quality and access. As we increase quality, we have the potential to decrease access, and vice versa. I think we all need to be concerned about that for the future.

Another study that was also mentioned by others reported a 20-minute survey of radiology residents who had completed their breast imaging rotation (Bassett et al., 2003). Large majorities of the residents said they were more concerned about missing a potentially important finding on mammography than on abdominal CT, which is not an easy examination, and they endured more stress as a result. This illustrates some points that I want to talk about in a moment in the context of liability and reimbursement. People were very concerned about this issue. Stress is an important factor in interpreting mammograms.

Let's turn to non-physician prescreening, which is taken up in the IOM report. We need to ask why one would want to use prescreeners—presumably to increase accuracy and access. I believe computer assisted detection would be a more appropriate solution. Increased access should result only if there is a decrease in physician time per case so that more mammograms could be read per unit time. But I am concerned that this would lead to lesser quality.

We can examine this from another perspective, too. Use of prescreeners will decrease reimbursement because insurers will not pay the physician work rate for non-physician work. The Centers for Medicare and Medicaid Services (CMS) pays 85 percent of the M.D. rate in the Medicare fee schedule for services of physician assistants or nurse practitioners. If a facility employed these prescreeners and held the time the same, but divided it between the non-physician and the physician so the latter could see more cases, total reimbursement per case would be reduced since the non-physician fraction would be paid at a 15 percent discount. The situation would be even less economically attractive in the case of radiologic technologists whose time is worth 43 cents per minute according to the U.S. Bureau of Labor Statistics.
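The reimbursement argument above can be illustrated with a small calculation. This is a sketch under a simplifying assumption that is not stated in the talk: that payment per case is split in proportion to the share of time each person spends, with the non-physician share paid at 85 percent of the physician rate. The function name and the example rate are hypothetical.

```python
# Fraction of the physician rate paid for physician assistant or nurse
# practitioner services, as quoted in the text.
NON_PHYSICIAN_RATE_FACTOR = 0.85


def blended_payment(physician_rate: float, non_physician_share: float) -> float:
    """Payment per case when part of the work time is billed at the non-physician rate.

    Assumes (for illustration only) that payment scales with each party's time share.
    """
    if not 0.0 <= non_physician_share <= 1.0:
        raise ValueError("non_physician_share must be between 0 and 1")
    physician_share = 1.0 - non_physician_share
    return physician_rate * (
        physician_share + NON_PHYSICIAN_RATE_FACTOR * non_physician_share
    )
```

Under this assumption, with a hypothetical physician rate of 100 per case and half of the time shifted to a prescreener, the payment per case is 100 × (0.5 + 0.85 × 0.5) = 92.50, a 7.5 percent reduction for the same total time.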

But non-physician prescreening raises additional questions. Will quality change? In my opinion, mammography differs from cervical cancer pap smear screening. This is not a binary procedure, where you can say it is normal or it is something else. There is a wide range of normal, and it is very difficult to distinguish normal from abnormal. Will radiologists agree to supervise in this environment? And will professional liability insurance carriers be willing to insure with non-physician prescreeners doing part of the work? Will it save radiologist time? Will this approach potentially create a shortage of breast imaging technologists by diverting them from performing to interpreting mammograms? Will the number of diagnostic examinations change? Will these individuals, in effect, take screening cases that would have been called negative and move them into the diagnostic category resulting in radiologists working up more false positives? And most importantly, what will happen to women? What will happen to our patients? It is my opinion that prescreening will not improve access. It will take the best technologists out of the work flow, and it will not increase the productivity of radiologists.

Turning now to liability, the Physicians Insurance Association of America (PIAA) analyzed 450 paid malpractice claims involving breast cancer in 2002 (PIAA, 2002). In all of medicine, missed diagnosis of breast cancer was the number one condition for which patients filed a medical malpractice claim. Radiologists were the most frequent defendants, and this was the second most expensive condition in terms of indemnity, exceeded only by problems with deliveries and injured babies. Of those patients, 88 percent had at least one mammogram, and 80 percent of those mammograms were interpreted as negative or equivocal. That does not necessarily mean that those mammograms were misinterpreted, but it does imply that the radiologists were at risk in those cases.

Lawsuits, even unsuccessful ones, are a reason that radiologists are reluctant to interpret mammograms. Dr. Smith-Bindman's Figure 3.10 compared recall rates for first mammograms in the U.K. and the U.S. It would have been interesting to have an additional figure comparing numbers of attorneys. In my opinion, the threat of malpractice is one explanation for increased recalls for evaluation and more biopsies for benign disease.

Malpractice insurance premiums also appear to be higher for radiologists who interpret mammograms. Four companies quoted premiums to a Connecticut radiology practice ranging from no difference to 14, 17, and 29.5 percent lower if no mammography was done (Kaye, 2004). Similar data from Virginia were reported to me when I visited there. These premiums clearly discourage breast imaging.

I turn now to reimbursement, which is another factor that plays a role in discouraging radiologists from interpreting mammograms. CMS arrives at reimbursement by valuing the current procedural terminology, or CPT, codes that physicians use for billing in relative value units. These units are calculated according to resource costs needed to provide the services, and they are multiplied by a conversion factor to arrive at the actual dollars that are paid. Screening mammography as a service is broken down into subunits, as illustrated in Box 3.4: physician work (55 percent), practice expense (42 percent), and professional liability insurance (3 percent). Physician work, which does not include support staff, is broken down further into time and intensity, and intensity is assessed by technical skill and physical effort, mental effort and judgment, and the stress associated with patient risk.

BOX 3.4

Factors Used to Calculate Reimbursement for Physician Services in the Resource Based Relative Value Scale. Physician work (does not include support staff)—55 percent Time

In my opinion, there are three factors that are particularly germane to mammography and present a particular problem in valuing the service. Time is very important and, if reduced by the use of nonphysician prescreeners, could result in decreased reimbursement. As for intensity and mental effort and judgment, we should recall the radiology resident survey which reported that effort and judgment in mammography exceeded that required for interpretation of abdominal CT scans. The same survey also emphasized the stress involved in this kind of work (Bassett et al., 2003).

Practice expense involves reimbursement for technologist work, the cost of equipment, and the like, and the final item is professional liability insurance. This latter, in my opinion, is more generously reimbursed for facilities than for physicians, although I believe the physicians incur more of the risk.

Table 3.7 provides an example of how Medicare reimbursement relates to cost in Colorado. The all inclusive payment for screening mammography in my practice, including physician work, practice expense, and professional liability insurance, is $83.58. My costs include $14.78 for compliance with MQSA and either $124.54 for hospital or $86.60 for office costs, all according to an ACR survey of 37 radiology practices in the spring of 2001. Clearly, this is not an attractive economic proposition.
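The gap between payment and cost can be worked out directly from the figures quoted above. This sketch assumes that the MQSA compliance cost is added to the hospital or office cost to give the total cost per examination; the talk does not say explicitly whether the figures are additive, so that is an assumption of the example, as are the names used.

```python
from decimal import Decimal

# Dollar figures quoted by the speaker for screening mammography in Colorado.
MEDICARE_PAYMENT = Decimal("83.58")
MQSA_COMPLIANCE_COST = Decimal("14.78")
SITE_COSTS = {
    "hospital": Decimal("124.54"),
    "office": Decimal("86.60"),
}


def shortfall_per_exam(site: str) -> Decimal:
    """Cost minus payment per examination, assuming MQSA cost is additive."""
    total_cost = MQSA_COMPLIANCE_COST + SITE_COSTS[site]
    return total_cost - MEDICARE_PAYMENT


for site in SITE_COSTS:
    print(f"{site}: shortfall of ${shortfall_per_exam(site)} per examination")
```

Under the additive assumption, the shortfall is $55.74 per examination in the hospital setting and $17.80 in the office setting. If the compliance cost were already included in the site costs, the shortfalls would be smaller by $14.78 but payment would still fall below cost in both settings.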

TABLE 3.7. Costs Versus Payment in Screening Mammography in Colorado.

I would like to conclude by describing some of the College's efforts with organized radiology to improve mammography. Our efforts have been continuous for more than 25 years, working with government, the FDA, industry, and other organizations. Our efforts with the residency review committee of the Accreditation Council for Graduate Medical Education have resulted in an increase in the number of residency positions. The number of mammography units has increased, also. We will maintain quality, and we will improve access. This is a commitment from the College.

Among English speaking countries, England, Australia, Canada, New Zealand, and the U.S., there are comparable rates of mammography screening, but the 5-year survival is best in the United States, and only Australia has a slightly better mortality rate. So I think there is some cause for optimism here. My concern is for access in the future.

What should we be doing? In my opinion we need to enhance the current work force. We should use physician extenders, not for prescreening, but for hanging mammograms, contacting patients, and logistics work, and we certainly do that in our practice. Computer assisted detection is the way to go for that second pair of eyes, and we need to continue to work to improve its quality. We need to further advance the use and lower the cost of digital mammography. We also need to work on transmission of data and provide governmental incentives for manufacturers and communication system providers to enhance electronic transmission. That would simplify the transmission of full-field digital mammograms and encourage the use of centers of excellence.

There has been an increase in the number of radiology residency positions by 300 due to our efforts. We have to have relief from litigation, perhaps some sort of no-fault system as was suggested earlier. And we need appropriate reimbursement. Mammography cannot be the loss leader. It has to stand on its own. We also need to promote an environment of enthusiasm; enthusiasm by those of us performing mammography stimulates interest by residents. I believe that mammography offers tremendous research opportunities if one is interested in epidemiology, statistics, or new technologies such as MRI, tomosynthesis, and the use of ultrasound.

DR. PISANO: Now is the time to open the discussion. I have some questions, but are there people from the audience who have questions as well?

DR. RICHARD WAGNER, Wisconsin Radiology Specialists: I have been practicing mammography for 25 years and have experienced unpleasant turf issues with surgeons. I think we need MQSA standards broadened to include stereotactic as well as open surgical biopsies. It is a quality issue that has to be resolved.

DR. BORGSTEDE: The College has accreditation programs in those areas. That would be something you could certainly promote.

DR. FINDER: Ultrasound biopsy is not covered by MQSA; we have no authority over ultrasound, but stereotactic biopsy certainly has been brought up, and we have been looking at that issue. I'm not sure that anything we could come up with would necessarily have affected your situation. The major question for the FDA is whether there is a problem and whether there is something we can do to alleviate it. Many approaches to assuring quality of stereotactic biopsy have not yet been explored, such as the use of the audit; that sort of thing would probably have to be invoked in order to address the situation.

DR. SMITH-BINDMAN: We do two to three times as many open surgical biopsies in the United States as they do in the U.K. Your example seems to involve the surgeons encouraging it. Perhaps we should have targets, or benchmarks, of what is desirable.

DR. BORGSTEDE: Are the surgeons doing cores?

DR. SMITH-BINDMAN: They are doing both. If they are doing exactly the same thing as radiologists are doing, that is one issue. But if they are proposing more open surgical biopsies, that could call for MQSA, or maybe another organization, to propose performance benchmarks.

DR. BORGSTEDE: We also have to be careful about adding more MQSA requirements as the way to solve turf battles. I personally would disagree with that. We are going to kill mammography programs with love if we keep adding on more and more requirements. They need to be appropriate, but they need to benefit the patient as the first priority. I would hope that I could prove that I should do the examinations because I can do them with quality. But anybody who can do them with quality should be able to do them.

DR. PISANO: Dr. Taplin, you talked a lot about how to improve the delivery system from risk assessment all the way to care after the patient was diagnosed. Similar things have been done at the University of North Carolina, and I think at other places as well, for example, the University of California, San Francisco, and Sloan-Kettering. We are salaried employees, not fee-for-service physicians, and I think that makes it easier to implement some care improvements. How do we motivate practitioners who are not in similar model systems to practice integrated health care? What are the financial incentives and disincentives? How do we get such systems implemented across more practices in the U.S.?

DR. TAPLIN: Certainly at my institution of the last 20 years everybody was on a salary and this affects motivations. Instead of battling for procedures, surgeons and radiologists were content to see their colleagues doing more work. So there is no doubt that the financial incentives motivate the people in a health care system.

However, I was very encouraged to hear this morning that CMS is thinking about alternative ways of reimbursement. I think those explorations and perhaps demonstration projects could be constructive. How do you fix the structure in which delivery occurs? CMS's openness to beginning to think about alternative structures for reimbursement may help us.

In our demonstration, the critical part of improving quality was that the surgeons, radiologists, and others were talking together. It turns out it is pretty radical to have all those people, primary care physicians, nurses, radiologists, surgeons, all at the table at the same time, and meaningfully defining the kind of care they want to organize. I think that we need to think about more ways of reimbursing that kind of organization.

I should say also that our quality reviews and reporting occurred in the context of the quality improvement structure. It was important that we were reporting results to a committee which was responsible for the quality of care within the entire organization. That meant that the reviews, the information, and the reports were their business only and could not be discovered, including the reports of all the women who were given a negative interpretation and had a cancer within 1 year.

DR. BORGSTEDE: Speaking as a past president of a state medical board, you want to do that in a system with peer review protection, so that it is not discoverable.

DR. TAPLIN: I don't know what happens outside of our structure, whether there is also a quality improvement structure that can be set up for people in indemnity plans.

DR. D'ORSI: We know that breast cancer survival is worse in the U.K. than in the U.S. How do you explain this in view of your data on their screening programs?

DR. SMITH-BINDMAN: Are you saying that, given the higher breast cancer mortality rates in the U.K., those data are inconsistent with our results suggesting mammography is done very well there?

DR. D'ORSI: Yes.

DR. SMITH-BINDMAN: I think looking at cancer mortality rates for a country can be quite complicated. It is difficult to compare mortality rates between the United States and the U.K. In the U.K. they usually look at survival rates. A recent article on cancer survival rates in Europe including England, Wales, and Scotland reported that the U.K. survival rates were below those of most other European countries for most cancers whether screenable or not (Coleman et al., 2003). It is not clear why they are doing poorly, but it is a huge issue for them, and they are studying it. You would hate to pick out one cancer, for example, breast cancer, and conclude that they are doing worse than us, and therefore they cannot be doing better in mammography. I think they are doing a great job with their mammography program, and for the breast cancers they diagnose, they are doing very well in terms of finding the same proportion of small cancers as found in the U.S. So, I cannot confidently address why their survival statistics are below average.

Mortality is harder to compare. It is a more important comparison, and really has not been done. Mortality data are a little more objective. Survival data are influenced by the over diagnosis of early disease which will make the data look better even in the absence of real improvement. Thus, I just think that simple comparisons may be misleading.

DR. PISANO: I'm sorry we don't have more time to talk. We will be moving now to the wrap-up session in the other room, and there will be more time to interact over there.

Second Group Discussion: Developing and Delivering New Detection Technologies

, M.B.Ch.B., M.P.H.

Assistant Professor, Harvard Business School and Member, Committee on Saving Women's Lives: Strategies for Improving Breast Cancer Detection and Diagnosis

MODERATOR AND RAPPORTEUR: We are going to be focusing on the development and delivery of innovative and new detection technologies. I wanted to highlight two elements of the thinking of the committee that were clearly influential in some of the report's recommendations largely because I think we are going to be talking a lot about technology assessment in its various forms this afternoon.

The first is the expectation that many of the new technologies with which we are dealing and will probably be discussing, based on this morning's conversation, are more likely to be complements than substitutes. That is, they will add to the clinical armamentarium available to physicians in screening and detection of breast cancer, rather than represent wholesale replacement of any one technology.

That has some important implications. It means that with the development and introduction of more new and innovative technologies, physicians are going to have a much wider range of choices, choices that are potentially applicable to and limited to perhaps ever smaller subsets of the patient population. From a clinical practice point of view, that will substantially increase the complexity of the task ahead for clinicians. Some of those technologies will be used singly, and some of them will be used in combination with other technologies, in circumstances perhaps where the sequence and organization of technology use will be important.

The second observation is that in some cases, technology adoption is highly context dependent. Some technologies are much easier to slip into the context of routine practice than others. There are medical technologies that tend to wreak substantial change on the organizations that are adopting them. CT scanning was perhaps one of the more famous examples of that.

We know that different organizations around the country are better or worse at undertaking the organizational and clinical process and work routine redesign that is needed in order to successfully adopt and make maximum use of a new technology. So we might expect more regional variation in the effectiveness of the use of new technologies.

Taken together, I think these two observations gave the committee a sense that various sorts of new data will need to be available to clinicians adopting new screening and diagnostic technologies. These data are likely to be both quantitative and qualitative, the latter being a class of data we are a little bit less used to using in medical practice. So there will be data not only about how well the technology performs in the absolute, which is the kind of data that goes to the FDA, but also about how the technology performs in comparison to other technologies or how technologies perform when used in concert with other technologies. The third class of data that I think will be needed is information on the appropriate organizational model or organizational design that best makes use of a new technology. These will be data on what we in the report call the deployment of technology, what kind of organizational capabilities, organizational structures, or clinical management processes will be required in order to make best use of a new screening or detection technology. The purpose of such data is to inform several decisions for clinicians and the organizations in which they work. The first is where to use a new technology and who to use it for, but also how to use a new technology and moreover, what supports need to be put in place to be able to make the best use of that new technology, what resources, skills, processes.

So the perception of need for these data has shaped a lot of the committee's thinking. At least in my view of the report, that concern has been reflected in a number of the recommendations that we are going to be discussing this afternoon.

DANIEL SULLIVAN, M.D., Associate Director, Cancer Imaging Program, Division of Cancer Treatment and Diagnosis, NCI: My first general comment is about NIH funding policy and priority setting. The decisions at NIH are complex. To some extent the leaders of the Institutes and the Director of NIH have significant input, as do their Executive Committees with help from the staff. But those are affected by and take into consideration recommendations by external advisors and various external committees, statutory or otherwise. Since NIH is in the Executive Branch of government, there is also, of course, significant input from various individuals who allocate executive funds, and clearly, legislative authorizers and appropriators can have a major influence.

With regard specifically to technology assessment at NIH, this morning Dr. Tunis mentioned that AHRQ has the mandate for that, but not much money. The fact that Congress created AHRQ and then followed up with relatively modest funding has sent an implicit signal to the NIH that, firstly, it is not the NIH mission to do technology assessment, and secondly, that it should not be one of NIH's high priorities. The Institutes have many potential priorities for their funds. Technology assessment does not get high on the list.

So I think in terms of moving the report's recommendations forward, one of the most helpful things that could be done, in addition to making reports like this available to NIH leaders and external advisory groups, is to get a positive signal from Congress.

With that as background, I would like to briefly discuss three messages I took from the report. First, how could federal agencies help facilitate useful multidisciplinary collaborations? This is a subject of increasing interest for a variety of groups and committees. The Biomedical Engineering Consortium (BECON) held a symposium on this subject a year ago. The report is on the website http://www.becon.nih.gov/becon_symposia.htm. Key recommendations to NIH from that report are summarized in Box 3.5. These changes are viewed as incentives, or removals of barriers, to team science. There are also recommendations in that report to academic institutions and to science publications. There are various committees and activities at NIH currently addressing all the recommendations. Similar comments have come from related reports, one of which was the IOM report, Large-Scale Biomedical Science: Exploring Strategies for Future Research that was released June 19, 2003 (Institute of Medicine, 2003).

BOX 3.5

Changes in NIH Policy Would Remove Barriers to Team Science. Allow more than one Principal Investigator on individual grants. Allow multiple performance sites to receive appropriate indirect cost recovery.

In particular, there is a subcommittee on Research in Business Models (RBM) to the Committee on Science in the Office of Science and Technology Policy at the White House which is intended to harmonize the research business approaches across federal agencies. One of the issues on their agenda is the acknowledgement of co-PIs across all federal agencies. At NIH, the BECON group is taking the lead in proposing a definition of co-PIs that RBM might propose be used by all federal agencies. There is an NIH co-PI implementation committee, which I co-chair, that will have a specific plan very soon. We hope that NIH will have the capacity to appoint true co-PIs on grants in a year.

My second subject involves better coordination between NIH and other federal agencies on technology assessment. Some of this has already been going on, and, I think, even increasing a little bit. You probably know about the ongoing trial comparing full-field digital mammography to conventional film-screen mammography (DMIST). That protocol was developed a few years ago, and planning included coordination among NIH staff, industry, FDA, and CMS, so all views could be heard and incorporated into the overall design. The trial is being carried out by the American College of Radiology Imaging Network (ACRIN), which is the cooperative group that we fund at NCI to review clinical trials in digital mammography. They are doing several related breast cancer screening studies, such as MRI and ultrasound (the latter co-funded by the Avon Foundation), and one for colon cancer screening by CT. In these smaller breast cancer studies, there actually was not very much input from other agencies. So I think there is an opportunity to do better.

A good recent example is the plan for the CT colonography study. There have been several reports with conflicting results in the last year of the potential of virtual colonoscopy to equal or approach the sensitivity and specificity of optical colonoscopy. We have been aware of those studies and recent results, in particular the work of Pickhardt and colleagues (Pickhardt et al., 2003). Prior to that report, we had organized a meeting which we held on December 9, just a few days after the publication appeared. We included extramural researchers in gastroenterology, epidemiology, and biostatistics, NCI, CMS, and FDA staff. In fact, several staff members from CMS attended, indicating their considerable interest in the potential economic impact of virtual colonoscopy.

At that meeting, a number of very specific recommendations were made by the participants, and I want to highlight one that was of particular salience to CMS. They specifically requested that the trial be structured to allow evaluation of inter-site variability. The trial team went back and incorporated all the ideas, but on that particular issue, they debated whether there should be the same number of subjects at each site, for example, 150 or 200 subjects at each of 15 sites totaling 3,000 subjects, or whether it should be powered with the same number of polyps, the same number of suspicious polyps, or the same number of cancers. A scientific argument could be made for any of those. The final decision was to power it for the same number of cancers. This trial is now designed to accrue until each site finds 12 colon cancers. Therefore, the absolute final number of subjects is not predetermined, but should be approximately 2,500.

I think this experience provided a good example for the future of how agencies can come together and pool their respective interests to design a trial. Dr. Tunis this morning mentioned that there is a new NCI/CMS agreement. I suspect that this will help to assure that this kind of activity will occur more often.

My third subject is promotion of research over the entire spectrum of technology assessment, especially investigating how technology gets used after dissemination as opposed to the current emphasis on early feasibility and efficacy trials. This is a difficult problem, because it involves not just setting the priorities and providing money but also some tough scientific issues. There is not a generally accepted paradigm for technology assessment as there is for drug trials. Discovery, development, maturation, and dissemination stages comprise one schema.

Stage four dissemination studies are very difficult to do if you want to get a truly representative sample of all the people that are using the technology, particularly for an imaging technology, where the radiologist's interpretation is an integral part of the use of that technology. There is no database or registry that could give you information on all the radiologists that interpret chest X-rays, or all the conventional radiologists who do some conventional procedure. But mammography is the exception because all radiologists who interpret mammograms have to be certified as meeting quality standards by the FDA. When I was practicing mammography, I saw lots of films coming from other sites that I thought were poorly interpreted. Although we focus much on the quality of the image acquisition and the quality of the film, the best quality film is of no value to the patient unless it is interpreted appropriately. We generally think of this in terms of a so-called linear process of perception and cognition, first seeing it and then deciding what it is.

In the early 1990s, Beam, a biostatistician at Duke University at that time, and I decided to send a large set of mammograms to a truly random sample of all radiologists in the country for interpretation in their usual settings (Beam et al., 1996). We found that there was a wide range of sensitivity and specificity. This variability is now generally well accepted, and that needs to be understood in terms of technology assessment.

This morning, Dr. Tunis listed the factors that CMS takes into consideration, and he specifically listed sensitivity and specificity. That is not the most appropriate method for determining the value of a technology. There is a very nice section in the IOM report about that issue and a very clear statement that says the receiver operating characteristic (ROC) curve is a better method for analysis.

For the aggregate performance of a technology, it is appropriate to determine the ROC curve for the technology. The ROC method takes into account the variability due to subjective interpretations of the image. Some of the comments this morning referred to the tradeoff between sensitivity and specificity. That may be true if you are talking about a technology in aggregate, because you will move up or down the ROC curve. It need not be the case for a single radiologist, however. I think there were some comments this morning that suggested some confusion about that: that if a radiologist increased his or her sensitivity, there would inevitably be loss of specificity. An individual can improve both sensitivity and specificity by moving to a different curve, because an individual does not necessarily have to move on the same curve.
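The distinction between moving along one ROC curve and moving to a different curve can be illustrated with a minimal sketch. It assumes an equal-variance binormal model, which is a common textbook simplification and not something stated by the speaker; the function name `operating_point` and the values chosen for `d_prime` and `threshold` are hypothetical and purely illustrative.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution (mean 0, sd 1)


def operating_point(d_prime, threshold):
    """Return (sensitivity, specificity) for one reader.

    Assumed model (illustrative only): scores for normal cases follow
    N(0, 1), scores for cancer cases follow N(d_prime, 1), and a case is
    called positive when its score exceeds `threshold`.
    d_prime fixes which ROC curve the reader is on; threshold fixes the
    position along that curve.
    """
    sensitivity = 1.0 - _STD.cdf(threshold - d_prime)
    specificity = _STD.cdf(threshold)
    return sensitivity, specificity


# Same curve (d_prime = 1.0), two thresholds: a trade-off.
strict = operating_point(d_prime=1.0, threshold=1.0)
lenient = operating_point(d_prime=1.0, threshold=0.0)

# Different curves: a hypothetical reader whose discrimination improves
# (d_prime 1.0 -> 2.0) can gain on both measures at once.
baseline = operating_point(d_prime=1.0, threshold=0.5)
skilled = operating_point(d_prime=2.0, threshold=1.0)

for name, (sens, spec) in [("strict", strict), ("lenient", lenient),
                           ("baseline", baseline), ("skilled", skilled)]:
    print(f"{name:9s} sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this sketch, lowering the threshold on a fixed curve raises sensitivity at the cost of specificity, whereas the reader on the higher curve exceeds the baseline reader on both measures.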

In the report there is a section noting that signal detection theory can be applied to imaging technology. This is a true statement, but it can be misleading, because the example that is given in the report talks about finding an airplane with radar. The issue in imaging is that, although it is a signal to noise problem, the noise is not random noise but a highly structured coherent noise. So the problem would be more like looking for an airplane on a background of many other airplanes, or, the other analogy of signal to noise that people often use, looking for a polar bear in a snowstorm. But in imaging it would be like looking for a polar bear on a background of a lot of other things that look like a polar bear.

So the task is very much like the child's game, Where's Waldo. I use this as an example to illustrate that I think we need to understand the interpretive process and its implications for development of computer assisted detection (CAD) a lot better. I think it is an enormously important issue for improving mammography today.

The task in a typical picture is to find Waldo. For those of you who are a couple of years away from being five years old and may have forgotten, Waldo is identified by having a red and white striped hat, a red and white striped jersey, blue trousers, round face, glasses, and brown hair. One of the things that one might wonder is, if there are some eyes looking out of a house, is that Waldo. This would be equivalent in mammography studies to seeing something that is completely obscured, and there is no possible way to make any statement about that. You would have to open up the house in order to get more information, which in mammographic equivalence would mean using ultrasound, MRI, and so forth.

There may be an individual who has a round face, brown hair, a hat that appears to be red, a red and white jersey partially obscured and we don't see the blue trousers at all. Our brain says, that meets enough of the criteria that it is probably Waldo. But even a 5-year-old would look at that and immediately say that it can't be Waldo because it's a girl. So how does our brain come to that decision? What I am suggesting is that this is not a clearly linear process of perception and cognition. There are still a lot of other things going on, so it is not so straightforward. I don't think we understand this process very well.

Some kids can do this task much more quickly than others; some probably have an innate ability, and some can learn to do it. If you think about the analogy here and how it could inform training radiologists, you probably wouldn't do it by giving them a lecture. You would do this in a very interactive way, because it requires developing skill.

I believe we do not train radiologists very effectively to develop a skill and use appropriate immediate feedback. I also think we give too much credit to the ability of a computer to sort this out. It is not surprising to me that there are recent reports suggesting that CAD in practice is not performing the way it did in the early trials (Zheng et al., 2004).

One of the things that we are doing to get at the issue of what agencies and industry can do collaboratively is to develop a very large database, a large imaging archive. We are doing a demonstration project with the spiral CT lung screening study to develop a database of images which will be available to help develop CAD. We think that this could be a model for public-private partnerships to develop multiple such databases for this kind of work.

So to summarize my third point, I think that research on the interpretive process is essential to the notion of evaluating technology. Radiologist training and feedback needs to be much more interactive than it is now. Studies of CAD implementation are necessary, that is, how does CAD really work when it is in the hands of radiologists, as opposed to how it works in the laboratory or in the hands of experts.

DR. NORTON: Over the years in drug development, we learned how to structure clinical trials, phases 1, 2, and 3, so that their results could change practice, even though we might ask many other kinds of scientific questions, targeting, relative efficacy, about a drug. In imaging, we seem not to have defined and universally agreed on practice changing, definitive trials, the results of which would change what people do.

DR. SULLIVAN: In a drug trial, it doesn't matter who gives the drug or how competent you are, the drug does or does not have an effect, and the trial can be designed to show that end point. In imaging trials, it very much depends on the skill and the abilities of the imager. Very large trials like our DMIST trial will attempt to get at that by providing data that will allow us to compare the ROC curves of the two different technologies. The technology that has the higher ROC curve is the one that will be credible, that we will want to use. Of course, there is always the element of the level of comfort of practicing radiologists with a technology and what level of evidence they will finally accept to persuade them that they need to make a change.

DR. NORTON: I have been in oncology drug development trials long enough to know that when different doctors get dramatically different results, it could reflect better handling of drug toxicity; they got closer to the proper drug dose level. In any therapeutic protocol, there are clear instructions on preparing the drug, calculating the dose, handling toxicity, and the like. Those things are all subject to audit, and, as we have shown over the years, repeated auditing makes doctors better because of adherence to the protocol, which not only gives more reliable trial results but better therapeutic outcome. That is what I hear is lacking here, the development of criteria for getting assessed, quantifying and measuring the human element. Our audits, our criteria, detect the doctors that say they can't do it right; they cannot follow the instructions, and they are not allowed to put patients on the protocol. In the absence of such criteria, you do get the variability that can make it very hard to influence practice, because some doctors believe that their better personal skills will prove effective even when evidence from the trial is not there.

DR. SULLIVAN: I agree that the culture of drug development has matured to deal with those issues. The culture of diagnostic development is much less mature. However, I think improved criteria and maturity have been built into the four studies that I showed you, and it will be interesting to see how that plays out. We have not previously built in the ability to examine inter-site variability, for example.

DR. BOHMER: Related to what we have been discussing, how do radiologists tend to self-select into trials? Is there a population of radiologists who wish to participate in trials that is observably different from radiologists who might be using the outcome of the trial at some future date? Do we have a mechanism for screening the entry of radiologists into the trials?

DR. SULLIVAN: Again, this is relatively new. ACRIN has only been doing trials for about 4 years, and before that there were only a couple of isolated multi-site trials, so there is relatively little experience. I don't think anybody has looked at that issue.

DR. ESSERMAN: But even in the breast imaging trials, those participating are not general radiologists, by and large, right?

DR. SULLIVAN: That's right.

DR. ESSERMAN: Only about a third of breast cancer screening is done by breast imagers, so knowing that the trialists are mostly breast imagers, you can immediately say that they are very different. They are the dedicated and highly skilled, as it is in drug trials, the top of the group.

DR. SULLIVAN: Yes, it is going to be self-selected people who are interested and motivated to learn new interventions in a well-structured way.

DR. ESSERMAN: And you cannot disseminate those results to the general population which reinforces your point that you cannot stop with the trial. It is the same thing we found in the drug trials.

DR. BOHMER: Is it feasible to even think about deliberately recruiting different types of radiologists into future trials to get an early sense of the gap between efficacy and effectiveness prior to dissemination?

DR. SULLIVAN: We might do that through the community cancer center program at NCI. It is probably a matter of developing human resources at a limited number of sites. There is a tendency now to choose sites where there are motivated people who are willing to go through a big trial.

DR. BOHMER: And involving a diversity of radiologists might require much more involvement on behalf of the PIs, so there is a resource issue there too.

DR. SULLIVAN: In medical oncology, it is official policy that testing cancer drugs requires a medical oncologist, someone who is trained and knows how to do it. That is another difference for imaging, because specialization is not necessarily required, and that can affect results. I think finding who is qualified to use these machines should be part of the process.

DR. ESSERMAN: The first thing you want to discover is whether the technology works when it is implemented. If it doesn't work in that setting, forget it. But that is not where you stop. How you then test it or implement it generally might be in the context of registration trials where the focus is very specifically on tracking implementation and dissemination of skill sets. Maybe our concept should be that if you want to use it and be paid for it you must be part of the registration trial.

DR. KIRBY VOSBURGH, Ph.D., Associate Director, Center for Integration of Medicine and Innovative Technologies, Partners Health Care: I am a physicist by training, who spent 28 years in industrial research and development. Over the years, I worked on several breast imaging approaches, CT, MRI, and ultrasound-tagged optics. Our goal was to try to replace conventional X-ray mammography. Generally this did not pan out; it is a hard business. Today, I am representing and addressing the scientists and engineers that we might charge to go back to their labs and get us some more effective technology to detect and characterize this difficult disease.

Since we have a significant amount of money to put into research on breast cancer screening, where do we get the best return on our investment? A logical extension of the results of our report suggests that we will probably not get it by trying to develop straight-up replacements for X-ray mammography. Several presentations today have reminded us that what we have now works well. In the IOM report, Mammography and Beyond, the difficulty of replacing screen-film mammography was strongly stated. A replacement has to be specific, sensitive, have the right ROC curves, be stable and inexpensive, run forever, and not be breakable, even by a Ph.D. It has to be really bulletproof. Our committee consensus has affirmed that the primary emphasis should rather be on getting every practitioner up to the best current levels using both technology (such as computer-aided diagnosis or tomosynthesis) and practice changes.

Every potential mammography replacement technology we could identify faces major challenges in competing with today's best current practices. So, a question is how many of these new ideas could we develop, recognizing, based on experience with digital mammography which required comparatively modest changes in clinical practice, that it will be at least a decade before we see them in widespread use. Many of us are frustrated by encounters with the inventor who has a great idea for a better way to detect breast cancer, tries it out on a sample of 12 women, and gets “very strong” results. The inventor then cannot understand why the idea is not immediately adopted. We should, of course, support ideas at a proof-of-concept stage, but recognize this is the beginning of a much more sophisticated process. You have to move quickly to a more complete evaluation. How will this reflect the clinical course of the disease and, ultimately, patient cohort survival at a reasonable cost to society?

In breast cancer screening, it is hard to obtain gold standards of the best possible performance, so large sample sizes are needed. Since the gold standard can be used as a basis to evaluate more than one new system, it is good to test new techniques in a multi-modality context, rather than looking at each technique in isolation. Early evaluations need to have adequate statistics and be designed in such a way that allows a decision on whether or not to scale up, but the bar for starting large-scale clinical trials should be very high. Since large-scale clinical trials will require that the technology be fixed over a long period of time, the changes in our understanding of the development and treatment of breast cancer, which are likely to continue at a rapid pace over that period, may not be accommodated. Lacking a major breakthrough that gives very high sensitivity and specificity on initial tests, validating a replacement technology for mammography will be so time-consuming that it is likely that the clinical landscape will change significantly and that the benefits of the trial will be diminished as a result.

An obvious way to improve conventional mammography is to increase image contrast. That gets us to the potential for physical or molecular markers to augment the detectability of disease in either a screening or diagnostic context. Contrast-enhancing agents have not been used in screening because they are expensive, and they may cause allergic reactions. Of course, at some point there may be an inexpensive and safe marker developed that lights up cancers and markedly improves sensitivity and specificity, so we should keep our eyes open and encourage biologists and imagers to talk to each other, to maintain a partnership.

A comparison of the development of virtual colonoscopy to our attempts to apply technology to X-ray diagnosis may be illuminating. In colon cancer, we have a disease process which is extremely well characterized and for which there is a very well-accepted treatment. If a polyp is bigger than, for example, five millimeters, it is excised, and that leads to a better outcome. There is not a lot of variation in how gastroenterologists evaluate and screen for polyps, so the “gold standard” is quite solid. However, the current screening procedure, optical colonoscopy, is expensive, time consuming, and not particularly beloved by patients, leading to poor compliance. It is, therefore, a good target for a high-tech approach. And, when virtual colonoscopy was proposed, a very strong consensus emerged, with all involved parties trying to develop a better system. That was an example of how progress can be made through scientist-engineer-research clinician partnerships. Unfortunately, some of the positive factors which have made virtual colonoscopy such a strong contender for clinical use are not present in mammography.

Prescreening and the consequent stratification of risk may open “niche” opportunities for novel imaging approaches that may not be suitable for broad application. We heard this in some of the presentations this morning. But prescreening has an important attribute that has not been mentioned. It may also account for the potential for effective treatment. An example of this would be the recent observation that tamoxifen chemotherapy for breast cancer may be more effective for patients with certain genetic characteristics. If you know that a patient has the potential for an effective response to chemotherapy, you may want to change your screening strategy to differentiate such women, perhaps by screening them more often. To the extent this type of correlation becomes more evident, the rationale for investments in screening and the practical targets for disease detection will be moving targets. Overall, these factors imply that care is progressing from “one size fits all” stratified only by age to individually tailored management of women at risk.

The same factors that apply to screening apply to diagnostic imaging, but you do have more resources; you can use contrast agents; you can take more time. And you have the opportunity to integrate information and display it for the physician more effectively. However, in other medical applications, it has been found that it is not always a good idea to provide a packaged solution to physicians. It may be better to provide a richer set of information directly, and let the caregiver do the integration mentally. The optimal approach may be best established, as in many other cases, by iterative studies, with the clinicians and technologists working collegially. In this connection, high quality long-term archives of images with serum and tissue samples will be of great value.

The bottom line is that physical scientists and engineers should try to work as much as possible in concert with clinical care providers and the biologists who are studying the disease. They should recognize the power of biomedical science and informatics to improve the diagnosis and treatment of breast cancer. In this way the physical aspects of disease detection will be optimally designed and deployed to save more lives.

CAROLE FLAMM, M.D., M.P.H., Technology Evaluation Center (TEC), Blue Cross and Blue Shield Association: I will be presenting the Blue Cross Blue Shield system perspective on technology assessment. I am a radiologist, and I have been doing technology assessment for the past seven years. Dr. Baugh is going to present after me, speaking on coverage and reimbursement from the health plan perspective.

You have already heard from CMS how they look at the evidence. As outlined in Figure 3.6, the TEC process is a systematic review of the body of evidence in the literature, not performance of the primary studies. There is a formal set of TEC criteria, listed in Box 3.7. I saw references to these types of criteria in the report, and CMS has similar criteria.

FIGURE 3.6. Technology assessment at Blue Cross and Blue Shield.

BOX 3.7

Criteria for Technology Assessment. The technology must have final approval from the appropriate government regulatory bodies. The scientific evidence must permit conclusions concerning the effect of the technology on health outcomes.

One of the arts of doing a technology assessment is framing the key questions, framing how you are going to look at it from a decision maker's point of view. We apply the same set of hierarchical criteria to therapeutic and diagnostic technologies, and I will discuss some of the differences in the way that plays out. The first one, FDA approval, is a necessary but not sufficient piece of information regarding effectiveness. Most technologies run into trouble on the second criterion, the sufficiency or quality of evidence. We just don't have the right kind of studies.

Then we ask if the technology improves health outcomes. Is it as beneficial as alternatives? Here, where you are talking about a technology that is going to replace another technology, it has to be as beneficial. If it is going to be used in addition to other technologies, there has to be an additional benefit, obviously.

The fifth criterion is about effectiveness versus efficacy, in other words, does the technology work in everyday practice. A similar hierarchical model (illustrated in Box 3.8) of looking at the contribution of diagnostic imaging to patient management is useful in thinking about the different kinds of studies published (Fryback and Thornbury, 1991). Many studies look at technical quality and diagnostic accuracy, the sensitivity and specificity. More recently, I have been encouraged to see studies asking how a technology changes the diagnosis, changes management, and changes outcomes. An effect on outcomes is the ultimate test we are looking for. We do not factor in cost or cost-effectiveness. The pure technology assessment is really a clinical evaluation of the evidence.


BOX 3.8

A Hierarchical Model of Efficacy. Level 1: Technical efficacy Level 2: Diagnostic accuracy efficacy

Ideally, we would like to see direct evidence, randomized controlled trials comparing outcomes with and without the test. You have heard something about the barriers to that, but it is the standard of evidence in most therapeutic technology assessments, such as drug trials. In the screening setting, we do see some randomized controlled trials and that is great, but in the diagnostic imaging literature in general, that is not the reality. The reality is indirect evidence, a chain that links the performance of the diagnostic test, its effect on patient management, and what it does to health outcomes. What are the criteria for a positive test? Does it permit the avoidance of other tests, or invasive procedures? Does it detect a treatable condition earlier?

It is vital when using this kind of indirect framework of evidence to consider different patient indications separately. MRI of the breast differs depending on the indication, the kind of patient, the situation, and what it is being compared to. Is it a replacement for mammography, specifically for screening high risk women, or an adjunct decision aid for biopsy in women with positive mammograms? These are different questions and require very different diagnostic performance of the test. So, the clinical context is critical.

Also, in terms of the effect on patient management, when a non-invasive test is replacing an invasive test or procedure, that represents an obvious advantage. That is an easier technology assessment question than thinking about the ultimate effect on mortality.

I'll just move on to a couple of examples. I mentioned earlier MRI of the breast in the screening setting. We have looked at this from the technology assessment point of view (http://www.bcbs.com/tec/Vol18/18_15.pdf). For women at high genetic risk, there have been studies comparing MRI and screen-film mammography (Kriege et al., 2004). In the specific population with higher than average breast density, conventional mammography is not as sensitive. Specificity is a little bit more of a tossup. But if the screening test is positive, there will be a biopsy or further workup; if the test is negative, screening will continue. The trade-off is the benefit to the true positives of earlier detection against the harms of false positives, unnecessary biopsies, and delayed diagnosis. About 6 percent of the women in a high-risk population have breast cancer. About four additional cancers will be detected for every seven additional unnecessary biopsies. That is the sensitivity tradeoff, the risk-benefit equation that is part of technology assessment.
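The trade-off described above can be sketched as a small calculation. This is only an illustration: the 6 percent prevalence comes from the talk, but the sensitivity and specificity pairs below are hypothetical values chosen for the example, not figures from the assessment, and the function names are invented here.

```python
def screening_outcomes(n_women, prevalence, sensitivity, specificity):
    """Expected counts when n_women are screened once with a single test."""
    with_cancer = n_women * prevalence
    without_cancer = n_women - with_cancer
    true_pos = with_cancer * sensitivity
    false_pos = without_cancer * (1.0 - specificity)
    return {
        "true_positives": true_pos,
        "false_negatives": with_cancer - true_pos,
        "false_positives": false_pos,
        "true_negatives": without_cancer - false_pos,
    }


def added_yield(n_women, prevalence, current, candidate):
    """Extra cancers found and extra false positives if `candidate` replaces `current`.

    Each test is given as a (sensitivity, specificity) pair.
    """
    a = screening_outcomes(n_women, prevalence, *current)
    b = screening_outcomes(n_women, prevalence, *candidate)
    return (
        b["true_positives"] - a["true_positives"],
        b["false_positives"] - a["false_positives"],
    )


# Prevalence of 6 percent is from the talk; both accuracy pairs are hypothetical.
extra_cancers, extra_false_pos = added_yield(
    1000, 0.06, current=(0.40, 0.95), candidate=(0.80, 0.90)
)
print(f"{extra_cancers:.0f} more cancers, "
      f"{extra_false_pos:.0f} more false positives per 1000 women")
```

With other assumed accuracy values the ratio of extra cancers to extra false positives changes, which is why the assessment has to be made for a specific population and comparison.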

In the high risk population, where the prevalence is high, there will be a relatively high number of false positives. Dr. Sullivan mentioned earlier that sensitivity and specificity are not the important parameters, and I agree. When comparing one test against an alternative possible replacement test, you examine the ROC curves. But particularly if you are looking for the add-on value to current management of a test, you need to understand how frequently the test is going to be called positive. What is your operating point on the ROC curve if you are going to use decision analysis modeling to figure how a positive compared to a negative test affects management, affects outcomes? So there is a little bit of that bind when we are stuck using some summary estimates of sensitivity and specificity.

The next example is positron emission tomography or PET (http://www.bcbs.com/tec/Vol18/18_14.pdf). In a woman with a positive mammogram or clinical exam who is told she needs a biopsy, can some unnecessary biopsies be avoided by using PET? A negative PET scan will spare the woman a biopsy; a positive scan will lead to biopsy. So the balance on health outcome is between the harm of delayed diagnosis and the benefit of avoiding an unnecessary biopsy. The specificity and sensitivity are not bad, but in the populations that generated these results, there was actually a 50 percent prevalence of cancer. In such a population, there is actually a 12 percent risk of a false negative scan. So that is the way the technology assessment equation plays out in that case.
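One way to read the 12 percent figure is as the chance that a woman with a negative scan actually has cancer (one minus the negative predictive value). The sketch below works under that reading, which is an assumption; the talk does not state the sensitivity and specificity, so the values 0.89 and 0.80 are hypothetical, picked only because they give a result of about that size at 50 percent prevalence.

```python
def false_negative_risk(prevalence, sensitivity, specificity):
    """Probability that a patient with a negative scan has cancer (1 - NPV)."""
    false_neg = prevalence * (1.0 - sensitivity)    # cancers the scan misses
    true_neg = (1.0 - prevalence) * specificity     # correctly cleared patients
    return false_neg / (false_neg + true_neg)


# 50 percent prevalence is from the talk; 0.89 and 0.80 are assumed values.
risk = false_negative_risk(prevalence=0.50, sensitivity=0.89, specificity=0.80)
print(f"Risk of cancer after a negative scan: {risk:.1%}")
```

The same test applied in a lower-prevalence population would give a smaller risk after a negative scan, which is why the prevalence in the study populations matters for this indication.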

I thought these illustrations would shed some light on technology assessment in our hands. For anyone who is interested in learning more about specific technology assessments and the kind of things we do, we have a website which you can visit at http://www.bcbs.com/tec.

ERIC BAUGH, M.D., Senior Vice President, Medical Affairs, Care First Blue Cross and Blue Shield: I will discuss how we go from the technology assessment that Dr. Flamm described to coverage decisions. At Care First Blue Cross and Blue Shield we serve approximately two and a half million members in Maryland, Washington, D.C., and Delaware. We formed our own coverage policy using a variety of informational resources. A technology assessment from our Technology Evaluation Center is only one of them. The evidence for our medical policy is also reviewed by a committee of community physicians, academic experts, and plan staff. Our community is sophisticated. We have people at Johns Hopkins, the University of Maryland, George Washington, and Georgetown Hospital Center that will participate at some level in our coverage decisions.

Care First, like all Blue Cross and Blue Shield plans, is an independent company and determines its own policies. Figure 3.8 illustrates our medical policy, which is the foundation for everything we do. Medical policy is a proven plan or course of action or guiding principles affecting community standards of diagnosis and treatment. As you can see, technology assessment, FDA approval, pharmacy and therapeutics committees, community standards of care, and the medical literature all go into helping formulate medical policy. This then determines quality of medical care, which is defined as the right care at the right time in the right setting at the right cost.

FIGURE 3.8. Care First Medical Policy.


When we reach the stage of building a set of benefits (contractual services, provided to implement medical policy, that people can buy at a reasonable cost), cost enters into the decision on coverage. Then, of course, we have utilization management. All of these filter through our claims adjudication policy as to whether or not we are going to pay for something and how much.

Medical policy development must fit contractual definitions and employ an objective standard of review and process for considering and reaching decisions.

We use the TEC criteria described in Figure 3.9 for determining if a new technology provides net health benefits at least as great as the best available alternative by objective evidence in peer-reviewed literature. We also use a Hayes report (http://www.hayesinc.com/) that tracks new and emerging health technologies and gives us impact utilization and cost data.

FIGURE 3.9. Coverage criteria.


I use the term adjunctive for the new breast cancer detection technologies evaluated to date that provide no clinical benefit when compared to mammography or biopsy, or a small benefit for a limited subset of the population when added to mammography. They do not substitute for existing technologies, but may add to the benefit of existing technologies for certain patients. For a new technology of this type, Care First will develop a medical or coverage policy that clearly defines for which patients and indications the technology is available. An example is MRI to investigate a woman with a positive lymph node and negative mammogram. For coverage, we must be able to verify adherence to the policy definition.

There are a number of mechanisms to implement a policy of this type. The first mechanism is prior authorization. We could require that MRI of the breast be prior authorized, and specify the documentation required before approval can be granted. This information will be reviewed by a reviewer. If the reviewer feels that the criteria have not been met, the case will be referred to a physician reviewer. Only physicians have the authority to deny coverage.

But prior authorization programs are burdensome and unpopular. We limit the number of services that require authorizing and prefer to use other mechanisms to implement medical policies for very specific indications.

Service claims editing is a second mechanism that could be employed. Certain CPT or ICD-9 codes may be selected for review, and the claims will be separated out after service. Certain medical information will be requested from physicians to document that the indications in the policy are met. This gets back to the whole concept of evidence-based medicine and the use of protocols—how was this supposed to be used and was the doctor applying those protocols appropriately? This is burdensome to the clinician, member, and plan, and plans need to consider the time and cost of this as with other coverage restrictions.

A retrospective review after payment is a third approach to implement medical policy of this type. This gets painful if we decide that the criteria have not been met, and we ask for our money back. We look at claims experience to gauge the appropriateness of this mechanism, and apply it typically when the volume of claims is small and with limited indications. The review information will typically be used to educate participating physicians about the policy and about guidelines and protocols.

New technologies are frequently more expensive than existing technologies. PET and MRI imaging are clearly more complex than mammography. Cost is not considered in the technology assessment, but it may be a factor in formulating the coverage and payment policies, which set the coding and payment rules, the frequency limits, and payment level. Health plans need to establish payment policies when there are no existing rates, for example, for new technologies or new applications of existing technologies. The payment level may be set based on the cost of the device and the operating costs. Payors may attempt to establish a rate based on price or cost of a comparable technology, or payors may attempt to reimburse certain new technologies or drugs at the same rate as existing technologies that provide comparable clinical benefit for the condition in question. These approaches try to link price with value. They are not very popular with the manufacturers or providers of these services and could have the effect of retarding the dissemination of the technology in question.

I have tried to identify some of the selection criteria for those things that we use to establish coverage policy. These are listed in Figure 3.9. As the report for this meeting documents, FDA approval does not assure that a technology provides clinical benefit or utility. In fact, only ten percent of the new devices and tests that make it to the market have undergone trials to establish safety and effectiveness because they are cleared by the FDA through the 510(k) process. It is also known that many payors will not cover a new technology that cannot demonstrate an improvement in health outcomes at least as great as or better than the available alternatives.

One way to promote development of data on clinical utility after FDA approval would be for payors to provide coverage during clinical trials. Contingent on completion, such trials could establish utility for coverage. Payment only within these trials will assure their completion. In Maryland, health plans are mandated to cover patient care costs of clinical trials involving serious or life-threatening conditions. In effect, this mandate requires contingency coverage. However, these trials are not limited to those establishing clinical value.

It is important, therefore, for the health plans, other payors, and employers to interact with the research community to communicate the importance of trials that measure the impact on net health outcomes. If the trials demonstrate that the technology in question does not improve net health outcomes or is no better than conventional treatment with higher side effects, coverage will not be continued once the trial is concluded. This was our experience with autologous bone marrow transplantation for metastatic or advanced breast cancer. Coverage contingent on conduct of a trial (that is, not limited entirely to the trials) dissipates the impetus to field trials, and once provided, runs the risk of alienating members and clinicians when it must be withdrawn downstream.

DIANA PETITTI, M.D., M.P.H., Director, Research and Evaluation, Kaiser Permanente, Southern California: I am also going to speak from the decision-making payor point of view, but with the added perspective of a population and an organized system. In the United States we don't have an overall health care system, we have a non-system. I am fortunate to work within a system which defines its responsibility in terms of providing health care coverage and maximizing health insurance investment for the improvement of the health of its population.

The population served by our system comprises the 3.1 million members of the Southern California health plan—a population that is larger in size than that of New Zealand, many other countries, and 35 states.

For breast cancer, the population health perspective means that our investment of premium dollars must improve early detection for the whole population of members. Within this population health context, I am going to specifically talk about our technology assessment of CAD and our decision not to adopt and deploy this technology.

The goal for our population is to improve detection of breast cancer. We must weigh what we might spend on CAD against alternative investments of our resources. Increasing the number of women in our population who are eligible to be screened and have the test is the first competing alternative investment. Even in our system with the ability to deploy resources and outreach to our members, we have a rate of breast cancer screening in the 50- to 72-year age group of only 80 percent according to our reports in the Health Plan Employer Data and Information Set (HEDIS). We consider this screening rate unacceptable given benchmarks from our other health plans of upwards of 93 percent; we believe we could attain such rates if we deployed our resources appropriately.

So in thinking through the CAD decision, it was against a backdrop of an overall screening rate of 80 percent, and a recognition that our first priority in this system would have to be to improve this rate.

However, that was not the only consideration. Within a set of possible new technologies or ways of improving performance of existing technologies, there are a number of competing approaches. Even among screened women, we have incomplete sensitivity, imperfect specificity, and high false positive (suspicious and not cancer) rates, which is typical of the technology. How can we change our system and technology to improve performance of the test for the women being screened? The first way is to change the way we organize our existing services. In the IOM report, there is a case study which describes how this was accomplished in the Colorado Kaiser plan (see also Adcock, 2004). Replicating the Colorado model in our 11 facilities in Southern California is our main focus.

Now we can look carefully at CAD in the context of other alternatives. To begin our assessment, we supplemented the Blue Cross and Blue Shield evidence assessment. On reviewing this evidence, we concluded first of all that use of CAD was really not better than an experienced radiologist in terms of sensitivity. We felt that we had a pressing need to get experienced radiologists, or train a few radiologists so that they had high levels of experience, as had been achieved by the organizational changes in Colorado.

Secondly, we concluded that the evidence about the effect of CAD on callback rates in populations similar to ours and in similar broad screening efforts was poor. Evaluations of CAD had mostly been done in highly-specialized centers against specially constructed test sets. Such evaluations did not give us a very good idea of what our callback rates would be. This is important information for us because of the possible burden imposed on our system already stressed by existing service demands of about 80,000 women age-eligible for mammography and performance of about 290,000 screening mammographies each year.

And finally, at the time that we were considering CAD, we were in the process of rebuilding a number of our hospitals due to the seismic safety standards in the state of California. We were cognizant of the fact that imaging technology is moving in the direction of digital imaging, and that we would likely need to replace any CAD devices that we invested in somewhere between two and five years later. All in all, this did not seem to us a good use of resources compared to investments to increase our screening rates and better organize our services.

This kind of assessment and decision-making exemplifies why it is a privilege to work within a system. I believe it also probably represents the kind of thinking that is typical of some of the countries that have been discussed today as models. In such settings, there is a recognition that resources available to spend on any health service are fixed, and that the responsibility of decisionmakers is to maximize the deployment of those resources for the benefit of the populations needing that health care service. In his presentation this morning, Dr. Tunis also suggested that we all think comprehensively in these terms about the broad clinical, economic, and social value of new or added technologies.

DR. NORTON: We heard two different views on the role of payors in doing research in imaging technology development. There is existing evidence in existing trials, or you try to get information out of ongoing trials.

DR. PETITTI: We would participate in the trial if we had that opportunity, and if it could be done for the same price as the existing service, or someone else was going to help foot the bill. I was encouraged by what Dr. Tunis said this morning. The amount of money going into direct head to head comparison, or even the trials for imaging, has been incredibly limited because the big payor is CMS, and we have a limited ability to mount them on our own.

DR. NORTON: But on the other hand, going back to autologous bone marrow transplantation for breast cancer, the Blues were paying for transplants for ten years. It took us that long to find out it did not work, and the Blues could stop paying for it. Had they supported trials, we would have gotten out in two years.

DR. BAUGH: It was not that we wanted to go in and pay for this intervention. We were mandated to pay by the courts of the United States. We got the cart before the horse.

DR. NORTON: But you wouldn't pay for participation in clinical trials.

DR. BAUGH: We would have paid for clinical trials had that been an option, but at the time, it was mandated in the courts that we pay without benefit of a clinical trial.

DR. NORTON: We could go on about that, but the point is, in terms of costs, there may be more cost-effective technology out there and we may continue to pay for technology that isn't as good.

DR. PETITTI: I think we are agreeing that someone should pay. The question is, what pocket does it come out of? So it is not so much a matter of Blue Cross, or Kaiser, or CMS paying; it is that society bears the burden of the inefficiencies that are created by using ineffective or unproven technologies or even multiple, layered technologies as seems to be happening in imaging. We need to find out how we can pay to get the evidence to sort this out.

DR. BOHMER: Or making the decision to use technologies in the absence of an evaluation of the system-wide impact of those decisions and the system-wide resources that those decisions imply. To some extent, most payors are obliged to make one by one decisions in a way that Kaiser, because it combines payor and provider, can avoid and is better off for it.

DR. BAUGH: I agree. I think they are better off. I think the kind of evidence we are talking about needs to be gathered and paid for. People will look to find the money. We have some of the responsibility in the state of Maryland at least. When it first happened, we looked at it with mixed emotions, but I think at this point we are ready to step up to our portion.

DR. NORTON: The argument for reimbursement of the patient care costs in clinical trials is that it is cheaper in the long run for everybody, and better for patients.

DR. BAUGH: But I think it is not up to a single payor. This is something that has to be across the board and shared.

DR. PENHOET: The committee heard some strange ideas about very large trials during the course of our work. There have to be some controls if we are going to expect other people to pay for studies. And are we talking primarily about what Dr. Tunis this morning called practical clinical trials, trials focused on evidence that will help improve practice?

PARTICIPANT: And would you centralize decision-making so you have a public/private collective that could evaluate at the proof of concept point? Dr. Vosburgh, you were talking about technologies that could be out there, that could replace something that exists today, how the evidence could be collected and they could be brought to the marketplace by working with a large enough collective.

DR. PETITTI: At least from our point of view, we looked to the NIH, and maybe they become the clearinghouse, given these public-private partnerships. They have enormous credibility in the kinds of trials and the decision making process.

For example, the ALLHAT trial example from our report was an NIH trial where $20 million came from the pharmaceutical companies to pay for it. The fact that it came to us as an NIH trial with all the oversight and the integrity that implies made us willing to participate even though, I can tell you, we lost a ton of money. I have documented to the NHLBI how much money we lost in participating in that trial in the short run.

DR. BOHMER: It is an investment.

DR. PENHOET: I think the NIH review mechanisms are pretty good at sorting out the bad ideas. The issue left hanging in the air: is AHRQ a hindrance to further progress in this field or a help? It is possible that before we look at this again, we might think about refolding it back to the NIH. It is very hard given the current situation to see how it is going to work otherwise. The existing agency today, NIH, is clearly in the best position.

DR. PETITTI: I am thinking of the CT colonography trial. We would be willing to be in that trial, but it will be competitive. There will be more competent sites interested in participating than can be accommodated. That often happens with the really good NIH trials. Why would we want to be part of the trial? First of all, we get the information early. We are able to tell our members that we are looking at it. We have the satisfaction of making a public contribution, but we would probably lose money on that trial, too.

DR. VOSBURGH: Importantly, I think the NIH is taking a broader view of funding that has marketplace implications rather than scientific implications by broadening out the review process and the participation of different communities. That is essential here. It is not just a matter of clinical efficacy, but of the business case and of the potential to market it. So I think headway is being made there, but it is something that will bear attention as you move forward.

DR. PENHOET: Dr. Hanash, you never did give us your prediction of the date your proteomic marker will come to market.

DR. HANASH: Personally, I think there are multiple strategies which have merit, so which one would pan out remains to be determined. Some investment is needed for early discovery and validation to determine which markers are the winners. I think it would be very premature, at this point, to predict which particular one. In the end, I think there is going to be a continuum whereby we could start with something cheaper than a mammogram and apply progressively more expensive testing to subpopulations to confirm a diagnosis.

DR. NORTON: What is the best way to collect serum proteins for proteomic analysis?

DR. HANASH: To some extent it depends on what type of marker you are after, but in terms of representing what is circulating in the blood, it is clear that plasma is best. When you subject blood to clotting, you burst a lot of cells and you activate many different subcellular systems. So what you are seeing in serum may not represent what is normally in the circulation. Plasma is a cleaner preparation because avoiding the clotting process eliminates a lot of the proteins from burst cells that you see in serum.

DR. NORTON: I am thinking about the implications of having the most informative samples. I am talking about my experiences in trials. Historically, we did not collect tissue from the tumor over the many years we were doing trials, so that, as we developed therapies that worked, we did not have samples that might identify the responsive subset of patients. Now we have the molecular technologies for classification by gene expression and gene copying and various other things that we can measure. It seems obvious to me that at some point we are also going to have protein patterns which may be informative. If we do not collect specimens prospectively during imaging or other trials, 5 years from now we will not have the opportunity to look back and identify the various subsets of patients in which those technologies were effective. We will have missed a real golden opportunity, don't you think?

DR. HANASH: Absolutely. There is still a complete disconnect between clinical trials and molecular approaches to cancer biology. We must somehow deal with that. The cooperative trial groups do not seem very adept at designing molecular components into their studies. We have been looking for support for that without much success. At the moment, it seems that the trials are aimed only at finding out if the drug does or does not work.

DR. NORTON: We all know it is critically important. We can't get agreement on who is supposed to pay for it. Therefore, it is a question of doing something that you can afford.

DR. HANASH: We still need to figure out how to synergize molecular approaches to tumor profiling or serum profiling with cooperative trials. The NCI is very interested in having another workshop like this one to deal with that specific issue. Many challenges remain.

DR. PENHOET: I think it is worth pointing out that it is almost inevitable that in this case, as in many others, the screening test will evolve from the diagnostic test. Most of the money is going into paying for therapeutic trials and not for screening trials. Now we are finding genetic markers that predict therapies, so I think your point is well taken that money invested in clinical trial diagnostics is not money wasted in terms of eventual screening techniques.

DR. VOSBURGH: I had somewhat the same thought as Dr. Norton, but from a different perspective. This came up in early discussions of the committee and may be in our report somewhere. There is a significant role and perhaps some advocacy for the education of patients to support the acquisition of blood or tissue samples so that we can build these longitudinal databases and then go back and validate new technologies as they are developed. This is something that people can do now for the long-term advancement of detection of disease. There is a call for action here that we probably haven't emphasized as much because there are so many other good things in the report.

DR. PENHOET: It is possible that a gene chip, if you have a candidate number of genes—you would still need a few hundred—could be very inexpensive to run—or proteomics—if you only have a dozen markers or so.

DR. HANASH: We should be careful not to embellish this. That creates disappointment later. This is really a very slow painful process of an incremental nature. There is not going to be a revolution overnight; you wake up and mammography has been replaced by a 100 percent sensitive and specific test. It is incremental and very tedious.

Copyright © 2005, National Academy of Sciences.
Bookshelf ID: NBK83874
