Executive Summary

Introduction

Breast cancer is the most commonly diagnosed cancer in women. This tumor is the second leading cause of cancer-related deaths in women in the United States, with approximately 178,000 new cases and 40,000 deaths expected among U.S. women in 2007. Treatment for breast cancer usually involves surgery to remove the tumor and involved lymph nodes. Frequently, surgery is followed by radiation therapy (in case of breast conservation or in women with large tumors or many involved lymph nodes), endocrine therapy (for essentially all women with tumors that express the estrogen receptor (ER-positive)), and/or chemotherapy (for women having a high risk for a poor outcome such as those with large tumors, involved lymph nodes, advanced disease, or inflammatory breast cancer). More than three-quarters of patients are expected to survive with this multi-modality approach.

Gene expression profiling has been proposed as an approach to address this issue in clinical settings, and three breast cancer gene expression assays are now available in the U.S.: the Oncotype DX™ Breast Cancer Assay, the MammaPrint® Test, and the Breast Cancer Profiling test (BCP or H/I ratio). MammaPrint is based on the use of microarray technology, while the other two assays are based on the reverse transcriptase polymerase chain reaction (RT-PCR). All of these tests combine the measurements of gene expression levels within the tumor to produce a number associated with the risk of distant disease recurrence. These tests aim to improve on risk stratification schemes based on clinical and pathologic factors currently used in clinical practice. As therapeutic decisions are based on risk estimates, tests that improve such estimates have the potential to affect clinical outcome in breast cancer patients by either avoiding unnecessary chemotherapy and its attendant morbidity or by employing it where it might not otherwise have been used, thereby reducing recurrence risk.

The literature was searched for evidence about the use of gene expression profiling in breast cancer. Our analytical framework for reporting the results distinguishes between the assays, as they are offered to patients, and the underlying signatures, which comprise the genes whose expression is measured. This measurement of expression can be done in a number of ways that may not be identical to the procedures used for the marketed test, producing an unknown number of different predictions. We also distinguish between developmental and validation studies.

Methods

Working with the Agency for Healthcare Research and Quality (AHRQ), the Centers for Disease Control and Prevention (CDC), the Evaluation of Genomic Applications in Practice and Prevention (EGAPP) working group, and members of a technical expert panel, we formulated four key questions and addressed them on the basis of the evidence available about the specific assays and the underlying gene expression signatures. The original set of key questions was refined to focus primarily on two gene expression profiling tests: Oncotype DX (Genomic Health, Inc.) and MammaPrint (Agendia). During the course of the evaluation, a third gene expression profiling test came to our attention, the H/I ratio test based on the two-gene signature (AviaraDX/Quest Diagnostics, Inc.), and was thus investigated. We searched and retrieved studies in MEDLINE®, EMBASE, and the Cochrane databases (1990-2006). We supplemented this search with recent publications that appeared after the time period initially considered in the systematic search, and with publications about the two-gene test (H/I ratio). We also searched for relevant documents on the Food and Drug Administration's web site, and solicited additional documentation from the companies offering the tests. The systematic searches yielded a total of 12,983 citations. Specific inclusion and exclusion criteria were developed and pairs of readers reviewed each title; the same procedure was used to review selected abstracts. We identified 63 studies for full text review. We developed tables to summarize each article. Initial data were abstracted by investigators and entered directly into evidence tables. Quality and consistency of the abstracted data were then evaluated by a second reviewer, and a senior investigator examined all reviews to identify potential problems with data abstraction. These were discussed at meetings of group members. A system of random data checks was applied to ensure data abstraction accuracy.

Results

Literature on Key Questions

Key Question 1. What is the direct evidence that gene expression profiling tests in women diagnosed with breast cancer (or any specific subset of this population) lead to improvement in outcomes?

Direct evidence was defined as a study where the primary intervention is the use of a prognostic test (with therapeutic decisionmaking directed by the result) and the outcomes are patient morbidity, mortality, and/or quality of life. No direct evidence was found in the published data on improvement of patients' outcomes due to such testing in women diagnosed with breast cancer, nor were there any randomized studies using the tests' predictions to manage patients. However, as described under Key Questions 3 and 4, some of the tests' supporting evidence was derived from past randomized controlled trials (RCTs) with prospectively gathered patient samples, giving it strong evidential value. Two ongoing RCTs, TAILORx and MINDACT (using Oncotype DX and MammaPrint, respectively), will provide further evidence allowing almost direct inference about the impact on patient outcomes.

Key Question 2. What are the sources of and contributions to analytic validity in these two gene expression-based prognostic estimators for women diagnosed with breast cancer?

In the field of gene expression there are no “gold standards” outside the technologies used in the tests under study, i.e., microarrays and RT-PCR. Consequently, a definitive evaluation of the analytic validity of expression-based tests is difficult. Evidence about operational characteristics was partial and limited to a few publications. A 2007 paper by Cronin and colleagues on the analytic validity of Oncotype DX was the most detailed study for any of these tests so far, showing good performance for a number of analytic components of the assay. Data about the sources of and contributions to variability of the tests, and about their reproducibility, were generally limited to analyses of few samples, and thus a complete evaluation of the impact of such variability on risk assessment was not available. Partial evidence about analytic validity was provided by the percentage of subjects whose samples were successfully analyzed with these tests, and those numbers were fairly good. Continuous monitoring of laboratory procedures and careful evaluation of the quality of the submitted specimens are major factors affecting test reliability.

Key Question 3. What is the clinical validity of these tests in women diagnosed with breast cancer?

a. How well does this testing predict recurrence rates for breast cancer compared to standard prognostic approaches? Specifically, how much do these tests add to currently known factors or combination indices that predict the probability of breast cancer recurrence (e.g., tumor type or stage, age, ER, and human epidermal growth factor receptor 2 (HER-2) status)?

b. Are there any other factors, which may not be components of standard predictors of recurrence (e.g., race/ethnicity or adjuvant therapy), that affect the clinical validity of these tests, and thereby generalizability of results to different populations?

Clinical validity is defined as the degree to which a test accurately predicts the risk of an outcome (i.e., calibration), as well as its ability to separate patients with different outcomes into separate risk classes (i.e., discrimination). Clinical validity was documented to some degree for all three gene expression signatures. Oncotype DX was validated on a homogeneous population of lymph node-negative, ER-positive patients all treated with tamoxifen, derived from an arm of an RCT, the National Surgical Adjuvant Breast and Bowel Project B-14 trial (NSABP B-14). MammaPrint, on the other hand, was validated on samples from a clinical series with a wide range of clinical and treatment characteristics, and sometimes it was the signature and not the MammaPrint test itself that was validated. Data that made clear the incremental value of the test over standardized risk predictors using classical clinical factors, in the form of risk reclassification tables, were limited to Oncotype DX in one population and, for MammaPrint, to one of those predictors (Adjuvant! Online). The evidence behind the two-gene test is quite heterogeneous, in that the specific manner in which the index was calculated differed in each study, and only one study examined the index that is to be used as part of the BCP (or H/I ratio) test; that study was still using statistical methods to find optimal cut points, i.e., it was a training study. The Oncotype DX test, which has been validated in exactly the form given to patients, on clinically homogeneous samples with clear treatment implications, is therefore regarded as the index with the strongest claim to clinical validity. It is not yet as clear to which populations MammaPrint best applies, and how much incremental value it would have within those clinically homogeneous populations above various standard predictors.

Since the number of validation studies for any of the tests is still relatively small, more remains to be learned about the stability, across different populations, of the relationship between the expression-based score and the absolute observed risk. Essentially nothing is known about how specific characteristics of these populations might affect test performance.
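
The distinction between calibration and discrimination drawn above can be made concrete with a small sketch. The Python code below is illustrative only: the function names and the eight-patient data set are hypothetical and are not drawn from any of the reviewed studies. It computes a concordance (C) statistic as a simple measure of discrimination and compares mean predicted risk with the observed event rate in each risk group as a simple check of calibration.

```python
def c_statistic(scores, events):
    """Discrimination: share of (event, non-event) patient pairs in which
    the patient with the event has the higher score (ties count one half)."""
    pairs = 0
    concordant = 0.0
    for s_event, e in zip(scores, events):
        if not e:
            continue
        for s_none, n in zip(scores, events):
            if n:
                continue
            pairs += 1
            if s_event > s_none:
                concordant += 1.0
            elif s_event == s_none:
                concordant += 0.5
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    return concordant / pairs


def calibration_by_group(predicted, events, groups):
    """Calibration: (mean predicted risk, observed event rate) per group."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, label in enumerate(groups) if label == g]
        mean_predicted = sum(predicted[i] for i in idx) / len(idx)
        observed_rate = sum(events[i] for i in idx) / len(idx)
        result[g] = (mean_predicted, observed_rate)
    return result


# Hypothetical data: predicted recurrence risk, observed recurrence (1 = yes),
# and the risk group each patient was assigned to.
predicted = [0.05, 0.05, 0.10, 0.10, 0.30, 0.30, 0.40, 0.40]
events = [0, 0, 0, 1, 0, 1, 1, 1]
groups = ["low"] * 4 + ["high"] * 4

print("C statistic:", c_statistic(predicted, events))
for group, (pred, obs) in calibration_by_group(predicted, events, groups).items():
    print(f"{group}: mean predicted {pred:.3f}, observed {obs:.3f}")
```

In this made-up example the score ranks patients well (high C statistic) while the observed risk in each group is well above the predicted risk, which is the situation of good discrimination with uncertain calibration that the validation literature leaves open.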

While the H/I ratio test shows some promise, it must be regarded as still being in a developmental phase; it cannot yet be considered fully validated. It was not clear whether samples were processed by Quest Diagnostics, which holds the current license. There are a number of intriguing biological insights and plausible mechanisms to support the rationale for the test, but its consistent value in well-defined clinical settings has not yet been firmly established.

Key Question 4. What is the clinical utility of these tests?

a. To what degree do the results of these tests predict the response to chemotherapy, and what factors affect the generalizability of that prediction?

b. What are the effects of using these two tests and the subsequent management options on the following outcomes: testing- or treatment-related psychological harms, testing- or treatment-related physical harms, disease recurrence, mortality, utilization of adjuvant therapy, and medical costs?

c. What is known about the utilization of gene expression profiling in women diagnosed with breast cancer in the United States?

d. What projections have been made in published analyses about the cost-effectiveness of using gene expression profiling in women diagnosed with breast cancer?

Few studies addressed the clinical utility of Oncotype DX recurrence score (RS) in predicting the benefits of adjuvant chemotherapy, although the probability of recurrence represents an upper bound on the degree of absolute benefit. One fairly strong retrospective study produced preliminary evidence that the RS has predictive power in assessing the benefit of chemotherapy usage in ER-positive, lymph node negative breast cancer patients. This study was embedded within a large, well conducted RCT (National Surgical Adjuvant Breast and Bowel Project (NSABP B-20)). Some patients from the tamoxifen-only arm of the trial were in the training data sets for the Oncotype DX assay development, and this could potentially translate into a somewhat enhanced estimate of the discriminatory effect of Oncotype DX, although it is unlikely to eliminate entirely the effect seen here. Other studies produced preliminary evidence that the RS from the Oncotype DX assay has predictive power in assessing the likelihood of pathologic complete response after pre-operative chemotherapy with various drugs and regimens, although very limited sets of patients have been used. One study produced preliminary evidence that the RS cannot predict pathologic complete response after primary chemotherapy in advanced breast cancer patients.

One study produced preliminary evidence that knowledge of the RS from the Oncotype DX assay can have an impact on the clinical management of patients diagnosed with ER-positive, lymph node-negative, early breast cancer. However, it did not report specifically what the patients (or doctors) were told or understood about their absolute risk of recurrence, and therefore was minimally informative as to the actual risk thresholds used by women and their treating physicians, or whether absolute risks even entered into the decision.

There were no studies that addressed the clinical utility of the MammaPrint or H/I ratio tests.

Three published studies have addressed economic outcomes associated with use of the breast cancer gene expression tests. One study reported that using the 21-gene RT-PCR assay to reclassify patients who were defined by 2005 National Comprehensive Cancer Network (NCCN) criteria as low risk (to intermediate or high risk) would lead to an average gain in survival per reclassified patient of 1.86 years. The associated cost-utility of using recurrence score testing for this cohort was $31,452 per quality-adjusted life-year (QALY) gained. The analysis also reported that using the 21-gene RT-PCR assay to reclassify patients who were defined by 2005 NCCN criteria as high risk (to low risk) was cost saving. In a hypothetical population of 100 patients with characteristics similar to those of the NSABP B-14 participants, more than 90 percent of whom were NCCN-defined as high risk, using the 21-gene RT-PCR assay was expected to improve quality-adjusted survival by a mean of 8.6 years and reduce overall costs by about $203,000. However, the EPC team had only moderate confidence in the results of this analysis because the study was sponsored in part by the manufacturer of the 21-gene RT-PCR assay and the authors did not provide sufficient information about methodological and structural uncertainties as well as other potential sources of bias such as the derivation of the utility estimates. Furthermore, the 2007 NCCN guideline indicates that the use of chemotherapy in these patients is now considered optional, further diminishing the usefulness of these projections.

The second study reported that use of the 21-gene RT-PCR assay was associated with a gain of 0.97 QALYs and a cost-utility ratio of $4,432 per QALY compared with use of tamoxifen alone, and a gain of 1.71 QALYs with net cost savings when compared with the chemotherapy and tamoxifen combination. However, the EPC team had little confidence in the results of this analysis, which was supported in part by the manufacturer, because the study did not meet many of the standards that the team used for appraising the quality of the analysis.

The third study compared the cost-effectiveness of the Netherlands Cancer Institute gene expression profiling (GEP) assay (MammaPrint) to the U.S. National Institutes of Health (NIH) guidelines for identification of early breast cancer patients who would benefit from adjuvant chemotherapy. The GEP assay was projected to yield a poorer quality-adjusted survival than the NIH guidelines (9.68 vs. 10.08 QALYs) and lower total costs ($29,754 vs. $32,636). To improve quality-adjusted survival, the GEP assay would need to have a sensitivity of at least 95 percent for detecting high risk patients while also having a specificity of at least 51 percent. The EPC team had confidence in the results of this analysis because it met most of the standards for appraising the quality of an economic analysis.
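
The cost and QALY figures quoted for the third study can be combined into an incremental cost-effectiveness ratio (ICER), the extra cost per QALY gained when moving from one strategy to another. The Python sketch below is illustrative arithmetic only: the function name `icer` is hypothetical, and the resulting ratio is derived here from the four numbers above rather than taken from the study itself.

```python
def icer(cost_new, qaly_new, cost_ref, qaly_ref):
    """Incremental cost-effectiveness ratio of a new strategy versus a
    reference strategy: (difference in cost) / (difference in QALYs)."""
    delta_qaly = qaly_new - qaly_ref
    if delta_qaly == 0:
        raise ValueError("strategies yield identical QALYs; ICER is undefined")
    return (cost_new - cost_ref) / delta_qaly


# Figures quoted above for the third study.
gep_cost, gep_qaly = 29_754, 9.68    # GEP assay (MammaPrint)
nih_cost, nih_qaly = 32_636, 10.08   # NIH guidelines

# Cost per QALY gained by using the NIH guidelines instead of the GEP assay.
ratio = icer(nih_cost, nih_qaly, gep_cost, gep_qaly)
print(f"NIH guidelines vs. GEP assay: about ${ratio:,.0f} per QALY gained")
```

With these inputs the NIH guidelines cost $2,882 more and yield 0.40 more QALYs than the GEP assay, or roughly $7,205 per QALY gained. This is a reading of the summary figures under the simplest possible definition of the ratio, not a reanalysis of the underlying model.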

Based on the appraisal of these three studies, the overall body of evidence on economic outcomes was inconclusive.

Limitations of the Report

The report included only English publications and was restricted to three gene expression tests.

Limitations of the Literature and Implications for Future Research

There are several issues that concern all of these tests.

1. While all of the tests exhibit a fair degree of risk discrimination (i.e., separating patients into different risk groups), the calibration of the estimates (i.e., how close the predicted risk is to the observed risk) in varying settings is still not as well established. Of greatest interest is the observed risk in the lowest risk groups, since the absolute level of this risk is critical for informed decisionmaking, and patients may forgo chemotherapy on the basis of this information.

2. The manner in which the tests are best used (in combination with other prediction scores, as continuous scores, or as categorical predictors) has not been established. In addition, the current cut-points for designation of Low and High risks (with or without an intermediate category) are not clearly derived from decision-analytic criteria.

3. The incremental value of these tests is best assessed from cross-classification tables that show how many subjects are placed in different risk categories (corresponding to different clinical decisions) by the addition of the information from the test in comparison or in addition to standard predictors. Such tables have been developed for Oncotype DX, but for only one set of risk thresholds, and some of the conventional guidelines used for those comparisons have since been updated.

4. Pre-analytic issues related to sample preparation, transport, and processing could cause the tests to perform differently in practice than in investigational contexts; continued monitoring of test procedures and performance will be important as they are used more widely.

5. The relevance of validation studies in past tamoxifen-treated populations for current populations treated with aromatase inhibitors needs further research.

6. Studies examining the use of the tests should provide women and physicians with quantitative risk information and report how this alters clinical decisionmaking. The manner in which this risk information is presented should also be studied.
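
The cross-classification (risk reclassification) tables described in item 3 can be sketched as follows. This Python example is a minimal illustration with hypothetical function names and made-up category assignments for eight patients; it does not reproduce any table from the reviewed studies. Rows are the risk category assigned by a standard predictor, and columns are the category assigned once the test result is taken into account.

```python
from collections import Counter


def reclassification_table(standard, with_test, categories):
    """Cross-classify patients: rows are the category from the standard
    predictor, columns the category after adding the test result."""
    counts = Counter(zip(standard, with_test))
    return [[counts[(row, col)] for col in categories] for row in categories]


def fraction_reclassified(table):
    """Share of patients whose category changed (off-diagonal cells)."""
    total = sum(sum(row) for row in table)
    unchanged = sum(table[i][i] for i in range(len(table)))
    return (total - unchanged) / total


# Hypothetical category assignments for eight patients.
categories = ["low", "high"]
standard = ["low", "low", "low", "high", "high", "high", "high", "high"]
with_test = ["low", "low", "high", "low", "low", "high", "high", "high"]

table = reclassification_table(standard, with_test, categories)
for label, row in zip(categories, table):
    print(f"standard {label:>4}: {row}")
print("fraction reclassified:", fraction_reclassified(table))
```

Only the off-diagonal cells correspond to patients for whom the test could change a clinical decision. A full assessment would also compare the observed outcomes within each cell, which this sketch omits.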

Oncotype DX

1. The role of the RS in guiding treatment of HER-2 positive patients is unclear, as most of these patients were classified in the high RS group in the initial trials.

2. While awaiting the TAILORx results, the findings of the Paik 2006 study predicting treatment benefit need independent confirmation.

MammaPrint

1. The prognostic value of the 70-gene signature has been assessed in different populations facing different therapeutic choices. In the analysis by van de Vijver and colleagues, 130 of the 295 patients received adjuvant therapy in a non-randomized fashion. Patients in the original development cohort were not treated, and Buyse validated the marketed assay in untreated patients. It is not yet clear which are the optimal patient populations for the use of this test, exactly what its performance is in those populations, and how many of its predictions would result in different therapeutic decisions. Larger independent validation studies in therapeutically homogeneous groups would be very valuable.

2. There is no evidence for the degree to which this test predicts the benefit of adjuvant chemotherapy.

Breast Cancer Profile (H/I ratio) Test

1. The BCP test is not yet as well validated as either of the other tests, with most of the supporting studies examining slightly different ways of either performing (e.g., different reference standards) or calculating the index. More work needs to be done documenting the risk discrimination and risk calibration of the marketed test in clinically homogeneous populations, as well as its incremental value.

2. There is no evidence for the degree to which this test predicts the benefit of adjuvant chemotherapy.

In addition to the conclusions above, a series of other observations were made on the basis of what was learned in this investigation.

Assay Validation

In general, it is clear that validation studies need to deal with populations for whom the decisionmaking implications of various risk groupings are clear. For all tests except Oncotype DX, both validation and development studies have been conducted on mixed populations, without sufficient sample sizes to stratify into large enough homogeneous groups to guide clinical decisionmaking. In addition, validation samples are often re-used by other investigators; the pool of such samples in the public domain needs to be greatly expanded.

Potential for Scale Problems

One problem that may be faced in the future is that of the consequences of an increase in demand for these tests. Whether the degree of accuracy seen in investigational settings can be maintained with increasing demands should be monitored by scientific or regulatory bodies.

Genetic Variability and Gene Expression

It is unknown whether gene expression profiles are more or less likely than more traditional biomarkers to be generalizable beyond the populations in which they were initially developed. Gene expression may reflect fundamental biological tumor features, and thus be relatively stable across ethnic groups. This speaks to the importance of validating these tests in populations with varying genetic background. Of particular interest will be the variation of the observed absolute risk in those populations, and its correlates.

The Need for Databases, Reproducibility, and Standards

Consideration should be given to the development of databases with complete data on each patient tested with these and future tests (absent identifiers). The data should include all the analyses performed, laboratory logs, the raw and processed data, and all the information about procedures and analyses that have been performed to produce a risk estimate from a tumor sample.

Where is the Field Going?

We can expect many new tests, as well as new uses for the assays that already exist. More genes might be added to the signatures, and in the particular case of MammaPrint this will be possible without changing the experimental procedures, since the array contains more genes than the ones that are incorporated in the 70-gene signature. In this regard, we might also expect other modifications: subsets of the current signatures might be proposed as alternatives to current clinical risk factors, or be proposed in different populations or for different purposes. For Oncotype DX, a natural evolution could be related to its use as an alternative to immunohistochemistry and/or pathology to evaluate tumor grade, S-phase index, ER, progesterone receptor, and HER-2 expression, since such genes are part of the set included in the assay. Reporting of individual gene expression results may also prove useful.

“Comparative Effectiveness” Studies

As these tests mature and proliferate, an important question will be how they compare to each other, and whether there is value in their combination. In the therapeutic domain, this has been called “comparative effectiveness” research. Such research has traditionally been difficult to fund by government or by industry, because it may not hold out as much therapeutic promise as new discoveries, and because industry understandably is not eager to fund head-to-head comparisons with competing products. This same dynamic could easily take hold in the risk prediction arena, with a proliferation of licensed prediction indices without any clear notion of what new ones are contributing over previous tests. From this perspective, development of future expression-based predictors should account for direct contrasts with “established” methods.

Conclusion

The introduction of these gene-expression tests has ushered in a new era in which many conventional clinical markers and predictors may be seen merely as surrogates for more fundamental genetic and physiologic processes. The multidimensional nature of these predictors demands both large numbers of clinically homogeneous patients and exceptional rigor and discipline in the validation process, all with an eye toward how the test will be used in a clinical decisionmaking context. Every study provides an opportunity to tweak a genetic signature, but we must find the right balance between speed of innovation and development of scientifically and clinically reliable tools. Going forward, it will be important to harness as much genetic and clinical information as possible on patients who undergo these tests, to facilitate achieving each goal without unduly sacrificing the other.