NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Marchionni L, Wilson RF, Marinopoulos SS, et al. Impact of Gene Expression Profiling Tests on Breast Cancer Outcomes. Rockville (MD): Agency for Healthcare Research and Quality (US); 2008 Jan. (Evidence Reports/Technology Assessments, No. 160.)


Impact of Gene Expression Profiling Tests on Breast Cancer Outcomes.


3. Results

Key question 1. What is the direct evidence that gene expression profiling tests in women diagnosed with breast cancer, or any specific subset of this population, lead to improvement in outcomes?

In a study defined as providing direct evidence of improvement in outcomes, the use of the test in decisionmaking is compared to not using the test, with health outcomes as an endpoint, generally in the form of an RCT. There is currently no direct evidence that the investigated gene expression profiling tests lead to improvement in outcomes in any subset of women diagnosed with breast cancer. Two ongoing RCTs aim to provide almost direct evidence for Oncotype DX™ and for MammaPrint®. These studies are described at the end of this chapter.

Key question 2. What are the sources of and contributions to analytic validity in these gene expression-based prognostic estimators for women diagnosed with breast cancer?

Analytical validity is usually assessed by determining how much observed measurements differ from expected values derived from a standard reference method. In the measurement of gene expression, however, universal standard reference RNAs and universally accepted, definitive methods of analysis are not available. Consequently, a definitive evaluation of the analytic validity of this type of test is difficult. It is more appropriate to focus instead on test variability. In clinical use, gene expression-based prognostic tests involve multiple steps with individual components that are difficult to separate. Ultimately, reproducibility of patient classification into clinically relevant risk groupings is what matters. From this perspective, the most important sources of variability are tumor sampling and handling, specimen preparation, and biologic variation within and between different samples of the same tumor. The analytic validity of expression-based tests can therefore be assessed by asking the following questions:

1. How reproducible is the test when applied repeatedly to the same patient, either by examining the same specimen or a different specimen?

2. How reproducible is the test over time?

3. What are the factors that most affect the overall performance of the test?

Few existing studies directly address analytical issues involved with the assays, and additional information could only be collected from clinical studies. Overall, this evidence was heterogeneous, spanning technical aspects, reproducibility, the number of successfully performed assays, and the comparison of RNA and protein levels of individual genes. Table 1 describes the three assays: Oncotype DX, MammaPrint, and the H/I ratio (HOXB13 and IL17RB).

Oncotype DX™

Evidence about the analytic validity of Oncotype DX is available from two technical studies (Cronin et al., 2004,44 and Cronin et al., 200745) and from several clinical reports. Information about the overall success rate of the assay was documented in 9 studies (Chang, 2007,55 Cobleigh, 2005,47 Esteva, 2005,48 Gianni, 2005,49 Habel, 2006,50 Mina, 2006,51 Oratz, in press,52 Paik, 2004,28 and Paik, 200653). This success rate ranged from 78.9 percent to 98.9 percent, and only some of the studies provided detailed descriptions of the reasons for assay failure. Reported failures were mainly ascribed to an insufficient number of cancer cells in the specimens, to poor RNA quality, and in a few cases, to failure of the RT-PCR technique. A synopsis of this evidence is provided in Table 2.

Table 2. Successful assays, Oncotype DX™.


Data on assay variability and reproducibility were available from 3 studies (Cronin, 2007,45 Habel, 2006,50 and Paik, 200428). These studies assessed the variability of repeated gene expression measurements using RNA from either the same or different FFPE blocks at repeated time points, and across different instruments and operators. Data reported in these studies concerned the variability of individual genes in the assay as well as the reproducibility of the recurrence score (RS). Variability evidence was reported for 66 FFPE blocks from 22 distinct patients, and from repeated measurements of two aliquots of a pooled reference RNA. Overall, the standard deviation (SD) for the RS was below 3 RS units, although the authors did not discuss the impact on risk stratification. This evidence is reported in Table 3.

Table 3. Variability and reproducibility, Oncotype DX™.


Two studies addressed technical and operational aspects of analytic validity (Cronin, 2004,44 and Cronin, 200745). The first study presented data about the development of the assay procedures, comparing gene expression measurements between frozen tumor specimens and FFPE blocks. The optimization of the RT-PCR primers (see Glossary, Appendix B1) and the normalization strategy were discussed. The second study addressed relevant analytic components of the assay, such as detection and quantification limits (limit of detection (LOD) and limit of quantification (LOQ), respectively), amplification efficiency, linearity, dynamic range, accuracy, precision, and assay reproducibility. The available evidence is reported in Table 4.

Table 4. Analytic validity, Oncotype DX™.


Finally, eight studies (Chang, 2007,55 Cobleigh, 2005,47 Esteva, 2005,48 Gianni, 2005,49 Cronin, 2004,44 Habel, 2006,50 Mina, 2006,51 and Paik, 200428) compared gene expression measurements of specific individual genes (ER, progesterone receptor (PR), HER-2) to measurements of the corresponding proteins produced by those genes as obtained by other techniques, in particular immunohistochemistry (IHC). Such studies used various cycle thresholds (CTs) to define positivity for the genes (see Table 5). Overall agreement between RT-PCR and IHC proved generally good for ER (kappa (k) statistics ranging from 0.80 to 1). In one study (Habel, 200650), agreement was low (0.49), although RT-PCR measurements were comparable to data available in the clinical records. In general, agreement for PR and HER-2 was moderate or poor. Such evidence is reported here for completeness (see Table 5), although it does not contain any relevant information about the assay as a whole.
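The k statistics quoted above measure agreement between two binary calls (RT-PCR positive or negative versus IHC positive or negative) beyond what chance alone would produce. The following is a minimal sketch of that calculation for a 2x2 table. The function name and the counts are hypothetical and are not taken from any of the reviewed studies.

```python
def cohen_kappa(both_pos, pcr_pos_ihc_neg, pcr_neg_ihc_pos, both_neg):
    """Cohen's kappa for two binary methods summarized as a 2x2 table."""
    n = both_pos + pcr_pos_ihc_neg + pcr_neg_ihc_pos + both_neg
    # Observed agreement: fraction of samples on which both methods agree
    observed = (both_pos + both_neg) / n
    # Marginal proportion of positive calls for each method
    pcr_pos = (both_pos + pcr_pos_ihc_neg) / n
    ihc_pos = (both_pos + pcr_neg_ihc_pos) / n
    # Agreement expected by chance if the two methods were independent
    expected = pcr_pos * ihc_pos + (1 - pcr_pos) * (1 - ihc_pos)
    return (observed - expected) / (1 - expected)

# Hypothetical counts for 100 tumors (illustration only)
kappa = cohen_kappa(both_pos=40, pcr_pos_ihc_neg=5,
                    pcr_neg_ihc_pos=5, both_neg=50)
print(f"kappa = {kappa:.2f}")
```

With these illustrative counts, raw agreement is 90 percent while kappa is lower, because part of the raw agreement is expected by chance.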

Table 5. RT-PCR vs IHC comparison assays, Oncotype DX™.


Individual studies are briefly described below.

Cronin et al., 2004.44 In this study, the authors discussed the primer (see Glossary, Appendix B) design optimization and expression level normalization necessary to obtain reliable RT-PCR measurements from archival FFPE samples, with the goal of establishing the reliability of their results with partially degraded RNA samples. The authors compared gene expression levels in 62 matched FFPE and frozen tissue specimens prepared from the same breast tumor. They showed that the relative expression profiles obtained from the two analyses were similar (correlation = 0.91, P value < 0.0001), although the magnitude of the measurements differed. They successfully corrected the differences using normalization based on the expression of five reference genes. The study thus provided convincing evidence supporting the use of the implemented protocols for assessing gene expression levels in archival (i.e., formalin-fixed, paraffin-embedded) tumor specimens.

The authors also analyzed several genes reported in the literature20 to show similar co-expression patterns,54 and confirmed these correlations. Specifically, the expression levels of cytokeratin 5 and cytokeratin 17 (r = 0.85), LPL and RBP4 (r = 0.84), HER-2 and GRB7 (r = 0.71), and ER1 and GATA3 (r = 0.6) were highly correlated.

Additionally, the authors compared RT-PCR measurements of ER, PR, and HER-2 (all components of Oncotype DX) expression to IHC analysis of corresponding protein levels, and to fluorescent in situ hybridization (FISH) analysis for HER-2 for a subset of 17 samples. The concordance among the different assays detecting the protein products of the genes and the relative RNA levels as measured by RT-PCR was high (94 percent, 84 percent, and 100 percent, respectively, see Table 5).

In summary, this study provided a foundation for the use of the Oncotype DX assay in archival tissue, although it did not contain data about the development of the RS (Appendix I, Evidence Tables 1, 2 and 3).

Paik et al., 2004.28 In this clinical study, the authors reported data on the variability of the RS, and the overall success rate of the assay. The authors evaluated the reproducibility of the Oncotype DX assay within and between FFPE blocks from the same patient. The Oncotype DX assay was carried out on 5 serial sections from 6 different blocks from 2 distinct patients. Seventy-nine blocks out of 754 were not analyzed due to insufficient tumor content, but RT-PCR was successful in 668 of the remaining 675 (98.9 percent) tissue blocks.

For the 16 genes considered in the RS, the SD of expression ranged from 0.07 to 0.21 expression units across serial sections from the same block. The within-block SD of the combined RS proved to be 0.72 RS units (with 95 percent CI: 0.55–1.04), while the within-patient SD, which included both among-block and within-block variation, proved to be 2.2 RS units. The impact of this variation on the risk stratification provided by the RS was not discussed in the paper. The difference between the low- and high-risk groups is 14 RS units, far larger than the standard deviations reported. Although ER, PR and HER-2 were also assessed by other techniques, the agreement of the measurement obtained by the different technologies was not reported.
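Because the within-patient SD combines among-block and within-block variation, the among-block component can be approximated from the two reported figures. The sketch below assumes the two components are independent, so that their variances add. That assumption and the function name are illustrative and are not taken from Paik et al.

```python
import math

def among_block_sd(within_patient_sd, within_block_sd):
    """Among-block SD, assuming independent components whose variances add."""
    variance = within_patient_sd ** 2 - within_block_sd ** 2
    if variance < 0:
        raise ValueError("within-block SD exceeds within-patient SD")
    return math.sqrt(variance)

# SDs reported for the recurrence score, in RS units
sd_among = among_block_sd(within_patient_sd=2.2, within_block_sd=0.72)
print(f"among-block SD: {sd_among:.2f} RS units")
```

Under this assumption, most of the within-patient variation would come from differences between blocks rather than between serial sections of the same block.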

In summary, this paper reported evidence about the fraction of tissue blocks that can be successfully typed by the Oncotype DX assay, as well as limited data about the reproducibility of the RS between different sections and FFPE blocks from the same patient. The impact of such variability on the risk stratification was not examined (Table 3, Appendix I, Evidence Tables 1, 2 and 3).

Esteva et al., 2005.48 In this study, the authors evaluated the correlation of RS, both as a whole and broken into its components, with known standard prognostic markers in FFPE tumor specimens. Specifically, the relationship between RT-PCR and IHC for ER, PR, and HER-2 was examined. Concordance was high for ER (k = 0.81), moderate for HER-2 (k = 0.60), and poor for PR (k = 0.48).

A logistic model using IHC HER-2 measurement as a quantal response indicated a significant (P < 0.0001) degree of correlation between IHC and RT-PCR. Sensitivity and specificity for HER-2 were also measured, using different RT-PCR cutoff points to define positivity, and are reported in Table 5.

In summary, this paper reported evidence about the percentage of successfully analyzed samples (67.7 percent, 149/220) in a large population from a single institution (M.D. Anderson Cancer Center) (Table 5, Appendix I, Evidence Tables 1, 2 and 3).

Cobleigh et al., 2005.47 This study reports on the development of the 21-gene Recurrence Score assay (Oncotype DX). Duplicate gene expression measurements were obtained by RT-PCR in archival FFPE tumor tissue blocks. An initial set of 192 genes (187 cancer-related and 5 controls) was analyzed, and 16 additional candidate genes were added at a later time. Of the samples, 91.6 percent (78/85) were successfully analyzed.

IHC-measured protein levels and RT-PCR mRNA levels for ER, PR, HER-2, and Ki-67/MIB-1 (a proliferation marker of cancer cells) were compared. The concordance was high for both ER (k = 0.83) and HER-2 (k = 0.67), somewhat lower for PR (k = 0.40), and poor for Ki-67 (k = 0.22) (Table 5, Appendix I, Evidence Tables 1, 2 and 3).

Gianni et al., 2005.49 The authors of this paper evaluated the correlation of IHC-measured protein levels with RT-PCR mRNA measurements of ER and PR expression in tumors. The concordance was high for ER (k = 0.84; 95 percent CI, 0.71 to 0.96) and moderate for PR (k = 0.71; 95 percent CI, 0.56 to 0.86). This paper also reports preliminary evidence about the use of the Oncotype DX assay in fixed core biopsies from breast cancer patients. The percentage of successfully analyzed samples was 93.6 percent (89/95) (Table 5, Appendix I, Evidence Tables 1, 2 and 3).

Mina et al., 2006.51 In this study, the authors evaluated the usefulness of FFPE core biopsies from a completed phase II trial in identifying genes that correlated with a response to primary chemotherapy. Out of the 70 patients enrolled in the study, 67 gave their consent, and specimens from 57 patients were available to perform gene expression analysis by RT-PCR. Out of these 57 patients, gene expression levels could be accurately measured in 45 patients. Failures were due either to low RNA yield (9 patients) or low tumor content in the biopsies (3 patients).

In this study the authors compared the expression levels of ER mRNA obtained by RT-PCR to ER protein expression as measured by IHC. Using a pre-defined cutoff of 6.5 CT, 64 percent of the 45 tumors were ER positive, while 36 percent were considered ER negative. ER expression by IHC correlated well with ER mRNA expression by RT-PCR (see Table 5), and only four of the 45 samples did not show agreement. The authors concluded that gene expression analysis on core biopsy samples was feasible. Data for PR, HER-2 and Ki-67 were not reported.

In summary, this paper reported preliminary evidence about the expression of some of the Oncotype DX assay genes in fixed core biopsies from breast cancer patients. The percentage of successfully analyzed samples was about 79 percent (45/57), raising concerns about the real feasibility in clinical settings (Table 5, Appendix I, Evidence Tables 1, 2 and 3).

Habel et al., 2006.50 This study contains several results that are relevant for the overall analytic validity of the Oncotype DX assay. The authors cited two unpublished studies with data concerning the reproducibility of the RS. These studies analyzed, respectively, 60 blocks from a total of 20 distinct patients, and 49 core biopsies or resections from advanced breast cancer patients. In the first study the RS SD between different blocks from the same patient was 3.0 RS units, and less than 2.5 for 16 out of 20 patients. Similar results were claimed for the second study, although the actual data were not shown.

Finally, the authors compared the agreement of ER status, as obtained by RT-PCR, to the ER status reported in the medical records. A positive or negative classification was based on a CT cutoff point of 6.5. The RT-PCR failure rate was about 1 percent for specimens available after pathological review, and 7.9 percent of the samples were not assessable due to low tumor content. In this study population, the concordance between RT-PCR and the medical chart information was only moderate (k = 0.49, 95 percent CI 0.41–0.56). In the multivariate models used in the subsequent statistical analyses, the RT-PCR-based ER status was used.

In summary, this paper reported a high percentage of successfully analyzed samples in a large population from a single institution and the reproducibility of the RS between different blocks from the same patient. The impact of such variability on the risk stratification was not addressed (Table 5, Appendix I, Evidence Tables 1, 2 and 3).

Paik et al., 2006.53 In this clinical study the authors reported several results that can be used as indirect evidence for the overall analytic validity of the Oncotype DX assay. Of particular relevance, FFPE blocks with sufficient tumor content were available from 670 of the 2,299 eligible patients in the NSABP B-20 trial, and the RT-PCR assay was successful on 651 of the 670 patients (97.2 percent) (Appendix I, Evidence Tables 1, 2 and 3).

Cronin et al., 2007.45 This study is the most extensive analysis to date of the analytic components of the Oncotype DX assay. Detection and quantification limits of the RT-PCR reactions, amplification efficiency, linearity, dynamic range, accuracy, precision, and assay reproducibility were investigated in serial dilution experiments, using a common RNA obtained by pooling 15 distinct RNA samples.

Detection and quantification limits proved to be well within the instrument's pre-specified CT unit limits for all the genes. Amplification efficiencies (100 percent efficiency means that the RT-PCR reaction products achieved perfect doubling) for the 16 cancer-related genes ranged from 75 percent to 112 percent, with an average of 96 percent, while the mean efficiency proved to be 88 percent for the reference genes, with a range from 75 percent to 101 percent.
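In serial dilution experiments, amplification efficiency is conventionally estimated from the slope of CT against the base-10 logarithm of input quantity, using E = 10^(-1/slope) - 1, so that a slope near -3.32 corresponds to perfect doubling. The sketch below applies that standard relationship to a hypothetical dilution series. The function names and data are assumptions for illustration, not the authors' analysis code.

```python
import math

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def amplification_efficiency(quantities_ng, ct_values):
    """Efficiency as a fraction; 1.0 means product doubles every cycle."""
    log_q = [math.log10(q) for q in quantities_ng]
    slope = fit_slope(log_q, ct_values)
    return 10 ** (-1.0 / slope) - 1.0

# Hypothetical two-fold dilution series: each halving of input adds one cycle
quantities = [8.0, 4.0, 2.0, 1.0, 0.5]
cts = [25.0, 26.0, 27.0, 28.0, 29.0]
efficiency = amplification_efficiency(quantities, cts)
print(f"efficiency: {efficiency:.0%}")
```

A shallower slope than the ideal yields an estimate above 100 percent, which is how values such as the 112 percent upper bound reported above can arise.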

Accuracy and precision studies were conducted at the target RNA concentration of 2 ng per assay well, which is what is used in the Oncotype DX assay. The mean percent bias from each gene target was -0.3 percent (ranging from -10 percent to 6 percent) for cancer-related genes, and 0.7 percent for reference genes (-1.5 percent to 3.3 percent), indicating 99 percent mean quantitative correctness at this assay condition. The CV averaged 5.7 percent for the cancer-related genes and 3.2 percent for reference genes. The implications of such variability for RS were not discussed.

Finally, individual gene and RS reproducibility were measured by performing repeated analyses across multiple days, operators, RT-PCR plates, RT-PCR instruments, and liquid-handling robots. Two operators obtained replicate CT measurements on two aliquots of a single RNA sample over the course of five days with three real-time PCR instruments (7900HT instruments) and two liquid-handling robots. The study design allowed the estimation of all main effects, including operator, RT-PCR instrument, and liquid-handling robot. Total SD in CT measurements varied from 0.06 to 0.15 CT units across the 21 genes, and the upper bounds on 2-sided 95 percent confidence intervals for the CV were all within 10 percent. The authors reported that a maximum SD of 0.15 at a CT of 30 translates into a CV of 0.5 percent, allowing a 15 percent change in gene expression to be distinguished. The day-to-day SD for all 21 genes ranged from 0 to 0.055, the between-plate SD ranged from 0 to 0.09, while the within-plate SD ranged from 0.057 to 0.147. The standard deviation for the overall RS (total and within-plate) was 0.8 RS unit. The largest differences between operators, as well as between liquid-handling robots and 7900HT instruments, were 0.5 CT units for each of the 21 Oncotype DX genes, while SD and CV for the RS were not reported.
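The conversion from SD to CV quoted above is a plain coefficient of variation, that is, the SD divided by the mean. A minimal check using the reported figures follows; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV expressed as a percentage of the mean."""
    return 100.0 * sd / mean

# Maximum total SD (0.15 CT units) at a CT of 30, as reported
cv = coefficient_of_variation(sd=0.15, mean=30.0)
print(f"CV = {cv:.1f} percent")
```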

In summary, this study presented extensively detailed results about several relevant analytic components of the assay (Table 4, Appendix I, Evidence Tables 1, 2 and 3).

Chang et al., 2007.55 This clinical study reported several results that can be used as indirect evidence for the overall analytic validity of the Oncotype DX assay. Ninety-seven FFPE blocks from core biopsies were analyzed by the standard assay protocols, and the percentage of successfully analyzed samples was 82.4 percent.

In summary, this paper provides preliminary evidence about the use of the Oncotype DX assay in fixed core biopsies from breast cancer patients (Table 2, Appendix I, Evidence Tables 1, 2 and 3).

Oratz et al., in press.56 This clinical study evaluated the impact of the Oncotype DX assay on clinical management, and also provided indirect evidence for the assay's overall analytic validity. Seventy-four FFPE blocks were analyzed by the standard assay protocols, and the percentage of successfully analyzed samples was 97.3 percent. No explicit eligibility criteria were used. The samples were included based on the request for analysis from the patient's clinician.

In summary, this paper contains evidence about the use of the Oncotype DX assay on FFPE blocks from breast cancer patients (Table 2, Appendix I, Evidence Tables 1, 2 and 3).

MammaPrint®

Analytic validity and variability evidence for MammaPrint was available from two technical studies (Ach, 2007,57 and Glas, 200658), and information on the overall success rate of the assay was documented in just one study, Buyse, 200659 (80.9 percent).

Data about variability and reproducibility were obtained in these studies using repeated gene expression measurements over time, within and across individual microarrays, and across different laboratories, protocols, instruments, and operators (see Tables 6, 7, 8). No comparisons were made between expression measurements of individual genes and their corresponding protein levels by IHC.

Table 6. Successful assays, MammaPrint®.


Table 7. Reproducibility, MammaPrint®.


Table 8. Analytic validity, MammaPrint®.


The following is a brief description of each study.

Glas et al., 2006.58 In this study the authors reported a summary of the results obtained during the development of the commercially marketed version of the 70-gene prognostic signature,21,25 the expression array-based test known as MammaPrint. The authors evaluated and compared both technical aspects and the clinical validity of the assay using the originally published data (see Key Question 3).

MammaPrint uses a microarray accounting for 1,900 features (individual microarray locations where the probes are positioned), containing each of the 70 genes in the signature spotted in triplicate. In this paper the authors re-analyzed the data from the original series21,25 using the new array, a dye-swap hybridization design, a different reference RNA, and a different approach to computing gene expression levels. Triplicate measurements were obtained for each gene of the 70-gene signature and summarized by an error-weighted average, rather than the approach proposed by Hughes et al., 2000,60 which was used in the original studies.
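An error-weighted average gives noisier replicate spots less influence on the summarized value. The sketch below assumes inverse-variance weighting, a common form of error weighting. The text above does not specify the exact weights used by Glas et al., and the replicate values and function name are hypothetical.

```python
def error_weighted_mean(values, errors):
    """Inverse-variance weighted mean of replicate measurements."""
    weights = [1.0 / (e * e) for e in errors]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# Hypothetical log-ratios and measurement errors for one gene's three spots
log_ratios = [0.50, 0.60, 0.90]
spot_errors = [0.10, 0.10, 0.40]
summary = error_weighted_mean(log_ratios, spot_errors)
plain_mean = sum(log_ratios) / len(log_ratios)
print(f"weighted: {summary:.3f}, unweighted: {plain_mean:.3f}")
```

In this illustration the third spot, which carries the largest error, pulls the unweighted mean upward but has little effect on the weighted summary.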

The results obtained with the new signature were comparable to the original results. Briefly, MammaPrint proved reproducible on the original development series21 (Pearson's correlation coefficient = 0.92, P value < 0.0001), and in a subset of the van de Vijver cohort25 (145 of 151 lymph node-negative patients; Pearson's correlation coefficient = 0.88, P value < 0.0001). The replication of the experiment within patients and over time suggested high reproducibility as well. In particular, the Pearson's correlation coefficient on 49 patients analyzed twice was 0.995, and no significant variability within individuals was found by an analysis of variance (ANOVA) for the 70-gene signature (P value = 0.96).

Risk classification by MammaPrint is obtained by measuring the cosine correlation of individual patients' gene expression profiles to the mean gene expression profile obtained in the van't Veer21 series. The variability of such correlation was measured by repeated analysis of 3 patients over time and showed very small SDs (0.028, 0.028, and 0.027, respectively).
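The cosine correlation used for classification can be sketched as follows. The five-value profiles stand in for the 70-gene profile, and the threshold, the labels, and the direction of the classification rule are hypothetical placeholders chosen for illustration; the actual template and cutoff are not given in this section.

```python
import math

def cosine_correlation(profile, template):
    """Cosine of the angle between a patient profile and a template profile."""
    dot = sum(p * t for p, t in zip(profile, template))
    norm_p = math.sqrt(sum(p * p for p in profile))
    norm_t = math.sqrt(sum(t * t for t in template))
    return dot / (norm_p * norm_t)

def classify(profile, template, threshold):
    """Label a sample by comparing its correlation to a cutoff (assumed rule)."""
    r = cosine_correlation(profile, template)
    label = "above cutoff" if r > threshold else "at or below cutoff"
    return label, r

# Hypothetical log-ratios (illustration only)
template = [1.0, -0.5, 0.8, -1.2, 0.3]
patient = [0.9, -0.4, 0.7, -1.0, 0.2]
label, r = classify(patient, template, threshold=0.4)
print(label, round(r, 3))
```

Because the classification depends on which side of the cutoff the correlation falls, an SD of roughly 0.03 in the correlation matters mainly for patients whose values lie close to that cutoff.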

In summary, this study reported detailed data about the development of the MammaPrint assay as it is offered in clinical settings, as well as data about the reproducibility of the assay within a single laboratory (Tables 7 and 8, Appendix I, Evidence Tables 6, 7 and 8).

Buyse et al., 2006.59 In this clinical study the authors reported several results that can be used as indirect preliminary evidence for the overall analytic validity of the MammaPrint® assay. Fresh frozen blocks from primary breast cancer patients collected in 5 distinct institutions were shipped for analysis to Agendia, and the percentage of successfully analyzed samples was 80.9 percent (326/403 patients) (Appendix I, Evidence Tables 6, 7 and 8).

Ach et al., 2007.57 The inter-laboratory reproducibility of the MammaPrint assay was assessed in this paper. Results for the same set of four patients were obtained at three different sites and compared in order to assess the variation resulting from several important phases of analysis, including RNA amplification and labeling, hybridization and wash, and slide scanning. The same input RNA was used for all experiments.

In the first phase of the analysis, two laboratories, one in Amsterdam and one in California, amplified and labeled the RNA samples, then exchanged aliquots of the templates. Hybridization and slide scanning were performed at both locations and the scanned slides were then exchanged for re-analysis by the other laboratory. The same lots of labeling kits and microarrays were used at both sites. Technical replication variability was assessed by analyzing two separate slides on two different days. This experimental design allowed examination of both intra- and inter-laboratory variation.

The Pearson correlation coefficient across all technical replicates for all tumors analyzed proved to be above 0.983, indicating that the signals from replicate hybridizations correlated extremely well for genes expressed at all the measured intensity levels.
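For reference, the Pearson correlation coefficient between two replicate hybridizations can be computed as in the minimal sketch below. The log-intensity values are hypothetical and are not data from Ach et al.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log intensities for six probes on two replicate arrays
replicate_a = [6.1, 7.4, 8.9, 10.2, 12.5, 13.8]
replicate_b = [6.3, 7.2, 9.1, 10.0, 12.7, 13.6]
r = pearson_r(replicate_a, replicate_b)
print(f"r = {r:.3f}")
```

A high coefficient shows that the two replicates rank and scale probes similarly across the intensity range. It does not by itself show that patient-level classifications agree.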

The reproducibility of laboratory scanning procedures was evaluated by scanning each of the 16 microarray slides at both sites. Signals for green fluorescent dye proved extremely reproducible, irrespective of the site of first hybridization and scan (Pearson correlation coefficient > 0.995, slope = 0.97), while signals for the red dye correlated less well and were always lower on the rescanned slide. The correlation of the 70-gene expression profile to the previously developed59 mean signature58 was computed for each dye-swapped pair of arrays, and ANOVA was used to evaluate the variability by hybridization site, labeling site, and hybridization day. No significant differences were found between hybridization sites or hybridization days (regardless of site), but two tumors showed a statistically significant difference (P value < 0.05) between labeling sites. Variability due to the RNA labeling site was further confirmed for expression measurements of individual genes of the 70-gene expression profile, as well as on the 182 most highly expressed genes.

In the second phase of the study, the assay performance was evaluated by a third laboratory in Paris, France, using a different batch of arrays, reagents, and labeling kits, on the same four tumor RNAs, several months after the initial comparison. The 70-gene signature correlation values for each of the four tumors were compared by ANOVA, and significant differences were found for two of the tumors when stratified by labeling site (P values of 0.0004 and 0.01, respectively), whereas one tumor proved to be significantly different (P value = 0.016) by hybridization site. The authors predicted, but did not provide supporting data, that if variations in the washing protocols were introduced between laboratories, significant discrepancies in the 70-gene signature results would emerge. They concluded that while some sources of variation have measurable influence on individual microarray measurements, the overall impact on the 70-gene signature is low.

In summary, this study thoroughly investigated factors that could affect the reproducibility of the 70-gene signature within and across different laboratories. RNA labeling proved to be the largest contributor to inter-laboratory variation, but the authors did not address the impact of such factors on the classification of individual patients into different risk groups. The data, although from only four distinct patients, imply that results from MammaPrint testing cannot be compared across laboratories and that the test must be centralized (Tables 7 and 8, Appendix I, Evidence Tables 6, 7 and 8).

H/I Ratio

None of the studies reviewed here explicitly referred to the marketed H/I ratio (BCP assay). However, one publication (Ma, 200661) described the analytic procedures involved in such a test. The rest of the available analytic validity and variability evidence was specific to the way in which the two-gene ratio profile was computed in each clinical study, and did not contain direct information about the marketed test.

Three studies (Goetz 2006,62 Jerevall 2007,63 Ma 200661) reported the overall success rate of the analyses; one report (Jerevall 200763) assessed the reproducibility between two different institutions; one assessed the correlation between RT-PCR and microarray-based gene expression measurements for the two genes (HOXB13 and IL17RB); and one study (Ma 200464) compared ER status by RT-PCR and IHC (see Tables 9, 10, and 11). No comparisons were made between expression measurements of HOXB13 and IL17RB transcripts and the corresponding proteins by IHC. For completeness, a brief description of individual studies follows.

Table 9. Successful assays, two-gene signature and H/I ratio assays.


Table 10. Reproducibility, two-gene signature and H/I ratio assay.


Table 11. RT-PCR vs IHC comparison assays, two-gene signature and H/I ratio assays.


Ma et al., 2004.64 In this study the authors developed the HOXB13/IL17BR two-gene ratio signature. They identified differentially expressed genes associated with breast cancer recurrence in patients who were treated with tamoxifen, using gene expression arrays on whole mount as well as on laser capture micro-dissected (LCM) specimens. From a total of 5,475 genes selected because of their high variability across tumors, three differentially expressed genes proved to be common between the two analyses (macro-dissected specimens vs. LCM). These genes were HOXB13 (identified twice as AI700363 and BC007092), the interleukin 17B receptor IL17BR (AF208111), and EST AI240933.

HOXB13 was found to be over-expressed in tamoxifen recurrence cases, whereas IL17BR and AI240933 were over-expressed in tamoxifen non-recurrence cases. The authors confirmed the microarray-derived relative gene expression by RT-PCR analysis on 59 out of the 60 original patients. The Pearson correlation coefficient between array and RT-PCR results was 0.83 for HOXB13 and 0.93 for IL17BR. The RT-PCR-derived HOXB13/IL17BR ratios also correlated highly with their microarray-derived counterparts (0.83). The authors also evaluated by RT-PCR 20 additional ER-positive early-stage primary breast tumors from women treated with adjuvant tamoxifen monotherapy between 1991 and 2000. These were used as a validation set (see Key Question 3).

In summary, this study provides a foundation for the use of the H/I ratio signature in LCM FFPE specimens (Table 10, Appendix I, Evidence Tables 10, 11 and 12).

Ma et al., 2006.61 The authors developed the two-gene index concept in this study, based on the two-gene ratio they originally published in Ma et al., 2004.64 New RT-PCR primers/probes for HOXB13 and IL17BR were used, and four reference genes were introduced for normalization. Total RNA was isolated from two 7-micrometer thick tissue sections for each sample, reverse transcribed into cDNA using a pool of gene-specific primers, and quantitated by TaqMan RT-PCR in duplicate in a 384-well plate. For each sample, CT values for the four reference genes were averaged and the relative expression level of each target gene was expressed as the difference from mean reference CT after Z-transformation. This resulting value is no longer a simple ratio, and is thus referred to as the two-gene index.
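The normalization steps described above can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: the sign convention (mean reference CT minus target CT, so that higher values mean higher expression), the use of the sample standard deviation for the Z-transformation across the cohort, the final subtraction of the IL17BR value from the HOXB13 value, and all CT values are hypothetical.

```python
import statistics

def relative_expression(target_ct, reference_cts):
    """Mean reference CT minus target CT (assumed sign convention)."""
    return statistics.mean(reference_cts) - target_ct

def z_transform(values):
    """Center and scale values across the cohort (sample SD, assumed)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def two_gene_index(samples):
    """Assumed index: z(HOXB13) minus z(IL17BR) for each sample."""
    hox = [relative_expression(s["HOXB13"], s["refs"]) for s in samples]
    il = [relative_expression(s["IL17BR"], s["refs"]) for s in samples]
    return [h - i for h, i in zip(z_transform(hox), z_transform(il))]

# Hypothetical CT values for three samples, four reference genes each
cohort = [
    {"HOXB13": 30.0, "IL17BR": 27.0, "refs": [24.0, 25.0, 26.0, 25.0]},
    {"HOXB13": 27.0, "IL17BR": 30.0, "refs": [24.5, 25.5, 25.0, 25.0]},
    {"HOXB13": 28.5, "IL17BR": 28.5, "refs": [25.0, 25.0, 25.0, 25.0]},
]
index = two_gene_index(cohort)
print([round(v, 2) for v in index])
```

Because the Z-transformation depends on the cohort's mean and spread, an index computed this way is tied to the reference population used for scaling.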

RNA for this study was prepared from cancer cells isolated by LCM from FFPE tissue microarray sections (see Glossary, Appendix B) of originally frozen tumor specimens. From 870 patients, 98.0 percent of samples were successfully processed (Table 9).

In this study the authors evaluated the concordance between ER and PR protein levels assessed by IHC and the corresponding gene expression measured by RT-PCR. Since the distributions were found to be bimodal for both genes, the midpoints between the two populations were used as cutoff points (2.5 CT for ER and 5.9 for PR). Both ER (91 percent concordance; kappa = 0.83; P value = .0001) and PR (85 percent concordance; kappa = 0.70; P value = .0001) status proved to be highly concordant. According to the authors, this confirms the significant correlations between mRNA and protein levels for ER and PR and provides validation of their gene expression analysis.
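For readers less familiar with the two agreement statistics reported above, the following sketch shows how percent concordance and Cohen's kappa are obtained from a 2×2 cross-classification of IHC status against RT-PCR status. The counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def cohens_kappa(both_pos, ihc_pos_pcr_neg, ihc_neg_pcr_pos, both_neg):
    """Return (observed agreement, Cohen's kappa) for a 2x2 agreement table."""
    n = both_pos + ihc_pos_pcr_neg + ihc_neg_pcr_pos + both_neg
    observed = (both_pos + both_neg) / n
    # Marginal proportions of positives by each method.
    ihc_pos = (both_pos + ihc_pos_pcr_neg) / n
    pcr_pos = (both_pos + ihc_neg_pcr_pos) / n
    # Agreement expected by chance if the two methods were independent.
    expected = ihc_pos * pcr_pos + (1 - ihc_pos) * (1 - pcr_pos)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical counts for 100 tumors (not study data).
observed, kappa = cohens_kappa(both_pos=50, ihc_pos_pcr_neg=5,
                               ihc_neg_pcr_pos=4, both_neg=41)
print(f"concordance = {observed:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw concordance because it discounts the agreement that would be expected by chance given how common positive results are with each method.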

In summary, this clinical study, in which the HOXB13-to-IL17BR index was developed, represents the foundation for using the two-gene ratio signature in tissue microarray FFPE specimens analysis (Table 11, Appendix I, Evidence Tables 10, 11 and 12).

Goetz et al., 2006.62 In this clinical study, FFPE tumor samples from 206 of 211 primary breast cancer patients were successfully processed by laser-capture microdissection (LCM) prior to total RNA preparation. This study provides generic evidence about the analytic validity of the two-gene signature in primary breast cancer patients, as computed from LCM-processed FFPE blocks (Table 9, Appendix I, Evidence Tables 10, 11 and 12).

Jerevall et al., 2007.63 This paper quantified expression of HOXB13 and IL17BR (normalized to beta-actin) by RT-PCR in fresh frozen specimens from two distinct institutions in Sweden. RT-PCR reactions at the two institutions were performed using the same sets of primers and fluorescent probes, and two distinct instruments. Ninety-six percent of the 373 samples were successfully analyzed.

In summary, good reproducibility of the measurement between institutions was documented for each individual gene and the ratio (Pearson's correlation coefficient = 0.99, P value < 0.001) (Table 10, Appendix I, Evidence Tables 10, 11 and 12).

Key Question 3. What is the clinical validity of gene expression profiling tests in women diagnosed with breast cancer?

A synopsis of the clinical validity evidence presented in the following section is reported in Table 12.

Table 12. Clinical validity, Oncotype DX™.

Oncotype DX

Paik et al., 2004.28 This study was the first to evaluate the prognostic validity of Oncotype DX in a population independent from that used to develop the test. The population consisted of a sample of 668 (out of 2,617) lymph node-negative, ER positive breast cancer patients from the tamoxifen-treated arm of the National Surgical Adjuvant Breast and Bowel Project (NSABP) Trial B-14. This 668-patient subset had enough analyzable tissue in paraffin blocks to be evaluated using the Oncotype DX assay, and was reported to be similar in baseline characteristics to the overall sample. A more complete sample was impossible because of sample unavailability or processing problems. In this study, the overall 10-year distant recurrence rate was 15 percent and the RS was significantly correlated with disease-free survival and overall survival (P<0.001 for both). The authors reported that RS alone was a better predictor of the distant recurrence risk than traditionally used predictors. In a multivariate model including age, tumor size, grade, ER, PR, and HER-2, the RS hazard ratio was 2.81 (95 percent CI, 1.70–4.64, P<0.001, per 50 unit increase). Forty-four of the 109 patients with small tumors (diameter less than 1 cm) were classified using Oncotype DX into the intermediate or high risk groups (Table 12, Appendix I, Evidence Tables 1, 2 and 4).

Esteva et al., 2005.48 In this study the Oncotype DX assay was evaluated in a population of 149 patients treated at the MD Anderson Cancer Center between 1978 and 1995. These patients had been diagnosed with node-negative breast cancer, did not receive tamoxifen or chemotherapy, and had a median followup of 18 years. The number of recurrences was not reported, and this study failed to find a correlation between RS and distant breast cancer recurrence. ER, PR, and HER-2 showed no prognostic value, and well-differentiated tumors were correlated with worse survival than higher grade tumors, the reverse of what would be expected. The population was unusual in that it received neither tamoxifen nor chemotherapy, and was different from the one used by Paik et al.28 (Table 12, Appendix I, Evidence Tables 1, 2, and 4).

Cobleigh et al., 2005.47 This report is the only study among the three used to develop the 21-gene Recurrence Score assay (Oncotype DX) to be published in a peer-reviewed journal. Seventy-eight breast cancer patients with more than 10 positive nodes from Rush University Cancer Center were studied, and 55 had recurred. Two hundred and fifty-five candidate genes were amplified with RT-PCR from FFPE tumor tissue obtained as long as 24 years ago. Twenty-two genes were significantly correlated with distant recurrence-free survival (DRFS) (unadjusted P value < 0.05). An RS was developed using these genes which very strongly predicted disease-free survival, but as this was training and not validation data, it has minimal evidential value in assessing Oncotype DX predictive properties (Table 12, Appendix I, Evidence Tables 1, 2 and 4).

Habel et al., 2006.50 The Oncotype DX assay was used to assess the risk of breast cancer-specific mortality among women in a large case-control study population derived from fourteen Northern California Kaiser community hospitals with ER positive, node-negative breast cancer.

There were a total of 4,964 eligible patients; 220 had died and 570 were living controls. All were younger than 75 years old, diagnosed between 1985 and 1994, and had not been treated with adjuvant chemotherapy. For ER positive tamoxifen-treated patients, RS risk groups (as defined by pre-specified thresholds chosen by the test developers) showed similar 10-year risks of death from breast cancer (3 percent, 12 percent, and 27 percent, respectively, for the low, intermediate, and high risk groups) as Paik28 reported for the NSABP B-14 patients. Multivariate analysis showed that RS and tumor size were significant and independent risk predictors of breast cancer death in both ER positive, tamoxifen-treated patients (RS hazard ratio per 50 units = 7.6, P<0.001) and untreated patients (RS hazard ratio per 50 units = 4.1, P<0.001). Tamoxifen-treated patients were shown to have a higher risk of death, and tumor grade proved to be a significant, independent predictor as well. The RS showed some prognostic value in ER negative patients, although this group was too small to perform a reliable analysis.

ER status was missing from the medical record for a substantial proportion of patients in this study, and therefore ER status based on gene expression was used in the analysis. Cases and controls were matched with respect to tamoxifen treatment, so it was not possible to assess whether the RS was able to identify patients who are likely to respond to tamoxifen therapy. The performance of the Oncotype DX assay RS was not compared to standard risk stratification methods (e.g., St. Gallen, NIH criteria, or Adjuvant! Online) (Table 12, Appendix I, Evidence Tables 1, 2, and 4).

Paik et al., 2004,65 Bryant 2005,66 and Hornberger et al., 2005.67 These posters showed the cross-classified risk predictions of the Oncotype DX assay compared to the risk stratifications using the 2004 NCCN and 2003 St. Gallen criteria, with the observed 10-year risks of relapse in the cross-classified strata. NCCN guidelines have since been modified, and the St. Gallen criteria did not account for HER-2. Patients came from the Paik65 NSABP B-14 validation cohort, N=668. Using the 2004 NCCN guidelines, the study indicated that of the 92 percent who were in the high-risk NCCN category, about half were reclassified as low-risk by RS, with a 10-year relapse risk of 7 percent (95 percent CI, 4–11 percent), which is similar to the risk observed in the low risk RS group without the NCCN information.65 Finally, against the Adjuvant! Online criteria, roughly 40 percent of those assessed to be at high risk (22 percent relapsed) were reclassified as having an 8 percent risk if they had a low RS score. These data indicate that optimal predictions may come from a combination of expression predictors and standardized indices, although the latter contribute less than the RS to the risk estimate (Tables 13, 14, and 15).

Table 13. Risk classification of Oncotype DX™ against the St. Gallen criteria.

Table 14. Risk classification of Oncotype DX™ against the 2004 NCCN guidelines.

Table 15. Risk classification of Oncotype DX™ against the Adjuvant! guidelines.

MammaPrint

A synopsis of the clinical validity evidence presented in the following section is reported in Table 16. In the following section we will be distinguishing between MammaPrint, the marketed assay, and the gene expression profile which is the 70-gene signature originally published by van't Veer et al., in 2002.21

Table 16. Clinical Validity, MammaPrint® and 70-gene signature.

van't Veer et al., 2002.21 This study reported the development data for the 70-gene panel that is the basis for the MammaPrint test. A gene expression array containing 25,000 features was used to select genes associated with metastasis-free survival at 5 years from surgery in 78 node negative patients, including 34 patients who had recurred within 5 years and 44 who had not. Using the development of metastasis within 5 years as the first relapse event, 65 of the 78 patients were correctly classified into good and poor prognosis groups by the 70-gene signature. Among the 13 misclassified patients, 5 patients with poor prognosis were in the good prognosis group, while 8 patients with good prognosis were classified in the poor prognosis group. Seventeen of 19 were correctly classified in the validation set.

The odds ratio (OR) to develop metastases within 5 years was 28 (95 percent CI, 7–107), while after leave-one-out cross-validation it was 15 (95 percent CI, 4–56). Using univariate analysis, the 70-gene signature performed better than tumor grade, size, patient age (less than 40 years), ER status, and angioinvasion. Using multivariate analysis, the 70-gene signature was an independent predictor of metastases within 5 years, OR = 18 (95 percent CI, 3.3–94) (Tables 16 and 17, Appendix I, Evidence Tables 6, 7 and 9).
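As an aid to interpreting these figures, the classification counts given for the development set (34 patients with metastasis within 5 years, 5 of whom fell in the good prognosis group; 44 without, 8 of whom fell in the poor prognosis group) imply a 2×2 table from which a crude odds ratio can be computed. The sketch below uses the standard cross-product odds ratio with a Woolf (log-scale) confidence interval. That choice of method is an assumption, and the result is not expected to reproduce the published value of 28 exactly, which may reflect a different threshold or estimation method.

```python
import math

def odds_ratio_woolf(a, b, c, d, z=1.96):
    """Cross-product odds ratio with a Woolf (log-scale) confidence interval.

    a: poor signature, metastasis      b: poor signature, no metastasis
    c: good signature, metastasis      d: good signature, no metastasis
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Counts derived from the text: 34 - 5 = 29 and 44 - 8 = 36 correctly classified.
or_, lo, hi = odds_ratio_woolf(a=29, b=8, c=5, d=36)
print(f"OR = {or_:.1f} (95 percent CI, {lo:.1f} to {hi:.1f})")
```

With these counts the crude odds ratio is (29 × 36) / (8 × 5) = 26.1, of the same order as the reported estimate, with a similarly wide interval that reflects the small number of misclassified patients.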

Table 17. MammaPrint® compared with traditional composite risk markers.

van de Vijver et al., 2002.25 This was the first major validation of the 70-gene signature as reported in van't Veer 2002 using the same protocol and approach. Banked tumor specimens from the Netherlands Cancer Institute were used from a consecutive series of 295 women with breast cancer, with a mix of lymph node positivity, ER status, chemotherapy, and tamoxifen treatment. Time to metastases, as well as overall survival (OS) were used as primary end points in survival models, and 61 patients in this cohort had been in van't Veer's21 original 78 patient training set.

Patients were young (less than 52 years) with small tumors (less than 5 cm). The 70-gene signature was shown to be associated with grade, size and ER positivity, with almost all of ER positive patients falling into the good prognosis category. Those with “good prognosis” 70-gene expression signatures had dramatically better 5-year (95 percent vs. 61 percent) and 10-year (85 percent vs. 51 percent) DRFS and OS (95 percent vs. 55 percent at 10 years) than the “poor prognosis” group. Multivariate analysis showed that the prognosis group, tumor size, and adjuvant chemotherapy were the strongest predictors of distant metastases. The “poor prognosis” signature had the largest hazard ratio = 4.6 (95 percent CI, 2.3–9.2). Analyses excluding the 61 previously-included patients produced similar results. Fourteen of the 115 “good signature” patients experienced a recurrence by 10 years, demonstrating that the “good prognosis” group may not be at low enough long-term risk to justify forgoing chemotherapy when the 70-gene signature is used alone.

The authors did not compare a regression-based predictor using only conventional variables with one including the 70-gene panel. However, the authors demonstrated the prognostic value of the 70-gene index using survival curves stratified by the NIH and St. Gallen criteria, which showed substantial separation between 70-gene prognostic groups that were either low or high risk by those conventional indices. These stratified survival curves also showed that optimal prediction was achieved when the gene index and conventional predictors were combined (Table 16, Appendix I, Evidence Tables 6, 7, and 9).

Buyse et al., 2006.59 This study compared the MammaPrint assay with the conventional combination risk predictors Adjuvant! Online, Nottingham Prognostic Index, and St. Gallen. Patients were drawn from five distinct European institutions, in the context of an independent, multicenter validation study performed by the TRANS-BIG consortium. Gene expression was characterized using the MammaPrint® assay in frozen tumor specimens from node negative patients younger than 60 years old who did not receive systemic adjuvant chemotherapy and were diagnosed between 1980 and 1998. Final results were obtained for 302 out of 402 eligible patients. The median followup was 13.6 years, and the overall rate of distant metastasis was 25 percent.

The three primary end points of the study were time to distant metastases (TTM), DFS, and OS. The hazard ratios of the MammaPrint assay for TTM and OS were statistically significant after adjustment for St. Gallen, NPI and Adjuvant! Online, but were generally far below (in the 1.5–2.5 range) those seen in the original validation cohort.25,58 The partial explanation offered by the authors was that this study had a longer median followup time than the van de Vijver25 cohort. Additionally, the authors introduced an interesting analysis showing the marked (3–6 fold) lowering of the hazard ratio for various endpoints when patients were artificially censored at increasing times, up to 10 years. Also, none of the ER positive patients in this study received hormonal therapy, whereas some of the original van de Vijver25 cohort did.

Specificity and sensitivity of the MammaPrint assay and the Adjuvant! algorithm were compared for distant metastases within 5 years and for death within 10 years. Similar sensitivities were found, but a higher specificity was demonstrated for MammaPrint. The areas under the receiver operating characteristic (ROC) curves were comparable between MammaPrint and Adjuvant! (0.68 vs. 0.66 for distant metastases at 5 years). The use of alternative thresholds for the Adjuvant! Online results did not change the overall results, and Adjuvant! hazard ratios were greater than unity but not statistically significant when adjusted for the gene signature. Finally, there was no statistical heterogeneity in any outcomes between centers, suggesting that this prediction model has transportability across populations with possibly different genotypic patterns.

This study is particularly important in that it provided the first evidence for the degree of clinical validity of the MammaPrint assay distinct from the 70-gene signature. It provided insight into the impact of differing lengths of followup in validation cohorts, and concluded that the prognostic contribution was sizable. However, this study's predictions were made in the context of no adjuvant hormonal or chemotherapy treatment, thus its applicability to women over 60 years old and treated with tamoxifen is unknown68 (Table 16, Appendix I, Evidence Tables 6, 7, and 9).

Glas et al., 2006.58 This study used the same patients as in the van't Veer21 and van de Vijver25 studies and compared the currently offered MammaPrint assay results to the results of the previous studies. RNA was available for all 78 patients in the van't Veer series, but only 145 lymph node negative patients were available for reanalysis from the van de Vijver series. A different reference RNA was used, as well as a different quantification method; however, odds ratios and hazard ratios were very similar. A total of 15 patients were classified into discrepant risk categories. The results of the 70-gene signature used in the original cohorts therefore apply equally to the MammaPrint assay based on that signature (Table 16, Appendix I, Evidence Tables 6, 7 and 9).

H/I Ratio

A synopsis of the clinical validity evidence presented in the following section is reported in Table 18.

Table 18. Clinical Validity, two-gene signature and H/I ratio assays.

Ma et al., 2004.64 This study reported the development of the two-gene ratio predictor. The authors generated gene expression profiles with gene chips from whole and laser-capture microdissected (LCM) frozen tumor specimens from 60 ER positive, node positive or negative breast cancer patients all treated with adjuvant tamoxifen monotherapy. Twenty-eight of the cohort (46 percent) experienced a distant recurrence within 4 years and 54 percent had no recurrence by 10 years. Twenty-two thousand genes were screened in the whole tissue sections and in LCM samples for their ability to predict DFS. Only three genes were highly predictive of DFS in both tissue sets, with over-expression of HOXB13 predicting recurrence and over-expression of IL17BR predicting non-recurrence. These expression values were combined in the form of a ratio, which outperformed both existing biomarkers and either gene alone. The univariate OR (interquartile) was 10.2 (95 percent CI, 2.9–36), and the multivariate OR was 7.3 (95 percent CI, 2.1–26.3) with adjustment for tumor size, PR and ERBB2 (none statistically significant) in a logistic regression. Areas under the receiver operating characteristic curve (AUCs) for the ratio were reported in the 0.8 range.

Next, the above analysis was repeated using just the two-gene ratio calculated by RT-PCR on 59 fresh-frozen samples from the training set along with 20 additional FFPE specimens to independently validate the ratio. Sixteen of these 20 were accurately predicted. The RT-PCR-measured expression was reported to have similar predictive power to that measured via gene arrays. No comparison with the full array of clinical predictors (e.g. tumor grade) or with standard combination predictors (e.g., Adjuvant!) was performed (Table 18, Appendix I, Evidence Tables 10, 11, and 13).

Reid et al., 2005.69 In this paper the authors attempted to validate the two-gene ratio on an independent cohort of 58 patients with ER positive breast cancer. These patients had been treated with tamoxifen monotherapy, had larger tumors, a higher frequency of lymph node metastases (78 percent vs. 47 percent), and a higher HER-2 positivity (21 percent vs. 5 percent) than those in the Ma et al., 2004 study. Eighteen patients had distant recurrences within a median time of 31 months, and 40 had no recurrence after a median of 93 months (range 70–125). The expression of the genes HOXB13 and IL17BR was measured by RT-PCR and the association between their expression and outcome was assessed by use of univariate logistic regression, AUC, a two-sample t test, and a Mann-Whitney test. None of these analyses revealed any statistical relationship with outcome.

The authors then took the original data of Ma et al.64 and applied standard supervised methods to this and to another independent data set with 99 similar patients.70 They tried to estimate the classification accuracy obtainable by using two or more genes in a microarray-based predictive model, using linear discriminant analysis and extensive cross-validation. The authors failed to validate the two-gene ratio and found high error rates with two-gene predictors.

Overall, findings from this paper argued against the prognostic utility of the two-gene ratio in ER positive breast cancer patients treated with tamoxifen. However, it must be noted that different parts of the transcripts were assayed in the two studies, and that the discrepancy could be due to the documented differences between the populations used, which were neither clinically nor therapeutically homogeneous, and to the small size of the validation sets71 (Table 18, Appendix I, Evidence Tables 10, 11 and 13).

Goetz et al., 2006.62 To investigate the prognostic performance of the two-gene ratio, this study analyzed FFPE samples from 206 ER-positive patients treated in the tamoxifen-only arm of a Phase III randomized trial of tamoxifen alone versus tamoxifen plus fluoxymesterone conducted through the NCCTG (North Central Cancer Treatment Group).64 RT-PCR expression values for each gene were normalized using a standard curve (Appendix D) obtained by analyzing human universal total RNA (Stratagene, La Jolla, CA), rather than the standard reference gene method, although the authors stated that control genes were not necessary to assess the expression ratio. The following end points were considered: RFS (time from randomization to any event of recurrence, contralateral breast cancer or death), DFS (time from randomization to any event of recurrence, or contralateral breast cancer, or other cancer, or death), and OS (time from randomization to death).

Cutoff points that best predicted RFS, DFS and OS were identified: the optimal cutoff for the entire cohort was -1.85, corresponding to the 58th percentile, whereas the 59th percentile (-1.34) was used for the node-negative group (n = 130), and the 90th percentile (4.4) best discriminated in the node positive group (n = 86).

The ratio showed modest outcome prediction value in the entire cohort, with cross-validated hazard ratios near 1.5 and P values around 0.05, with the predictive value being restricted to the node-negative subset of patients (hazard ratios 1.7 to 2, P values = 0.04–0.06). In the node-positive group the ratio had no relationship to relapse or survival. The authors concluded that a high 2-gene expression ratio is associated with increased relapse and death in patients with node-negative, ER positive breast cancer treated with tamoxifen.

Overall this study provided some support of the two-gene ratio signature's prognostic value in ER positive, lymph node negative patients, but both the magnitude of that effect and the statistical support were modest, and the relevant cutoffs used for discrimination between high and low risk were optimized for each endpoint and patient subgroup. Hence, this is closer to a training than validation exercise (Table 18, Appendix I, Evidence Tables 10, 11 and 13).

Ma et al., 2006.61 This study examined a consecutive series of patients from Baylor University diagnosed between 1973 and 1993 with stage I or II breast cancer. The patients did not have distant spread, and non-relapsed cases had a median followup of 6.8 years. The authors reported data on the clinical validity of the two-gene expression index (HOXB13:IL17BR), which is the basis of the H/I assay. A different normalization strategy (Table 1) from that of Ma et al., 200464 was applied to obtain this index. FFPE samples yielded only 852 analyzable cases out of 1,002 patients.

This population had 72 percent node negative, 73 percent ER positive, and 16 percent HER-2 positive patients, with an overall recurrence rate of 31 percent. A higher HOXB13:IL17BR index was associated with a higher risk of relapse (hazard ratio = 1.5, P<0.001). In a stratified analysis, univariate Cox regression indicated that the HOXB13:IL17BR index was only significant in node-negative patients (hazard ratio = 1.6, P<0.001, vs. hazard ratio = 1.2, P = 0.1 in node-positive patients), and further subsetting indicated that the interaction with node status was statistically significant for the HOXB13:IL17BR index (P = 0.02) only in ER positive patients. The HOXB13:IL17BR index correlated significantly with predictors of poor prognosis (i.e., HER-2 amplification, S-phase fraction, and number of positive lymph nodes) and correlated inversely with ER and PR expression.

The authors identified the optimal cutoff point for the index by analyzing a training set of ER-positive untreated patients (n=205), in order to obtain the smallest P value from a log-rank test in Kaplan-Meier survival analysis. The selected threshold (of about 1.0) was validated in a separate test set of untreated patients (n=103), and was also applied in the analysis of the tamoxifen-treated group of patients (n=122). Kaplan-Meier curves and univariate Cox regression analysis indicated that this cut point stratified patients into significantly different risk groups. Results from the Kaplan-Meier plots suggested that the prognostic power of the two-gene index was independent of tamoxifen therapy. The hazard ratio obtained in multivariate Cox proportional hazards regression, incorporating age, tumor size, S-phase fraction, PR status, and tamoxifen therapy, confirmed the prognostic role of the HOXB13:IL17BR index (hazard ratio = 3.9, 95 percent CI = 1.5 to 10.3, P value = 0.007) in ER positive, node negative patients irrespective of tamoxifen treatment. The index was also demonstrated to be a continuous predictor of DFS in untreated patients. The authors concluded that the two-gene index was a significant predictor of clinical outcome in ER positive, node-negative patients regardless of tamoxifen therapy.

This study validated the two-gene ratio gene expression profile, developed the two-gene index, and provided preliminary evidence for its prognostic value. Classification probabilities were not presented, and its incremental value over conventional predictors was not reported, although some components of such predictors were included in the multivariate analyses (Table 18, Appendix I, Evidence Tables 10, 11 and 13).

Jansen et al., 2007.72 This clinical study evaluated the ability of the HOXB13-to-IL17BR expression ratio to predict DFS in breast cancer patients treated with tamoxifen. The HOXB13 and IL17BR expression levels were measured by RT-PCR in 1,252 primary breast tumor patients and normalized with respect to 3 housekeeping genes73. The study population was a mix of ER-positive (73 percent), lymph node-positive (52 percent), tamoxifen-treated (14 percent), and chemotherapy-treated (17 percent) patients, with additional patients treated with tamoxifen or chemotherapy after relapse (55 percent). Patients with ER-positive tumors with node negative primary breast cancer (N = 468) were followed for DFS. Patients with recurrent breast cancer treated with first-line tamoxifen monotherapy (N = 193) were followed for progression free survival (PFS). This study used different populations, protocols, normalization strategy, and ratio thresholds than Ma et al. 2006.61

The study evaluated the relation between the HOXB13-to-IL17BR ratio and tumor aggressiveness in lymph node negative, ER positive patients who did not receive adjuvant systemic chemotherapy (N=468). Of these patients, 46 percent had a relapse during the followup period. The HOXB13-to-IL17BR ratio, as a univariate continuous variable, was significantly associated with a poor DFS (hazard ratio=1.6, P=0.02) and a poor OS (P<0.001, data not reported). When traditional factors were added to the model, the HOXB13-to-IL17BR ratio continued to contribute significantly to DFS and OS prediction, either as a continuous variable or after dichotomization according to published pre-specified thresholds61 (Table 18).

The same analysis was performed on ER-positive, lymph node-positive tumors from untreated patients, who were mainly enrolled in the early 1980's (n=151). Univariate analysis of the continuous HOXB13-to-IL17BR ratio was associated with a poor DFS and a poor OS. In the multivariate model for this population, the index was significantly associated with OS (P value = 0.001), but less strongly with DFS (P value = 0.065). The dichotomized index was not related to DFS (data not shown).

Finally, the authors evaluated the prognostic performance of the HOXB13-to-IL17BR ratio in 193 ER-positive primary breast tumors in relapsed patients treated with first-line tamoxifen monotherapy. Both univariate and multivariate analyses revealed that the ratio, continuous and dichotomized, was strongly associated with PFS (Table 18).

This study is by far the largest to date concerning the potential value of the 2-gene ratio. It provided evidence of the clinical validity of the HOXB13-to-IL17BR ratio in ER positive, node negative patients who did not receive systemic adjuvant therapy, and also in ER positive relapsing patients whose relapse was treated with tamoxifen. However, the ratio was calculated and dichotomized in a somewhat different manner than in Ma et al., 2006.61 Additionally, comparisons were not provided with conventional combination risk indices, nor were classification probabilities provided for the models with and without the ratio. Therefore, incremental predictive values could not be accurately assessed. Although qualitative conclusions are not affected, there are some differences between the quantitative results reported in the text and tables (Table 18, Appendix I, Evidence Tables 10, 11 and 13).

Jerevall et al., 2007.63 In this paper the authors investigated whether the two-gene ratio can predict the benefit of 2 versus 5 years of tamoxifen treatment in postmenopausal breast cancer patients, and also assessed the ratio's prognostic value in systemically untreated premenopausal patients. Expression of HOXB13 and IL17BR was quantified by RT-PCR in tumors from 264 randomized postmenopausal patients and 93 systemically untreated premenopausal patients. The two study populations were collected as part of a collaborative study between two centers in Sweden, and 72 percent of the randomized patients were lymph node positive and 74 percent ER positive. To stratify the patients into risk groups the authors dichotomized the ratio using the median, a procedure and dichotomization differing from the approach used by Ma 2006.61 The results from the prediction of prolonged treatment benefit are reported under Key Question 4, Clinical Utility.

The ratio proved to be significantly correlated to tumor size, ER, PR, HER-2, Nottingham histologic grade (NHG), ploidy, and S-phase. ER, HER-2, S-phase and NHG correlations were mostly due to IL17BR, while PR and ploidy correlations showed contribution from both genes. The authors concluded that a lower expression of IL17BR, but not HOXB13, was correlated to several factors related to poor prognosis, and thus IL17BR might be an independent prognostic factor in breast cancer, and that HOXB13 may be correlated with tamoxifen resistance. However, the ratio had no prognostic value in ER negative postmenopausal patients and they were excluded from subsequent analyses.

In summary, this study produced additional developmental evidence of the prognostic value of the HOXB13-to-IL17BR ratio, and of the two individual genes, in ER positive breast cancer patients who received systemic adjuvant therapy. However, neither the patient profile nor the mode of calculation of the ratio was identical to previous studies, and the results differed from previous reports, as the ratio predicted worse outcome in lymph node positive patients (Table 18, Appendix I, Evidence Tables 10, 11 and 13).

Key Question 4. What is the clinical utility of these tests?

The clinical utility of a test tells us whether the test helps discriminate between those who will have more or less benefit from a therapeutic intervention. This can only be assessed in the context of randomized clinical trials, where benefit can be measured in terms of an improvement of clinical outcomes such as overall survival, disease-free survival, chemotherapy toxicity, or quality of life.

The prognostic estimates provided in the previous section, however, do have a relationship to clinical utility: they provide an upper limit on the degree of clinical benefit that can be provided by chemotherapy for a given endpoint. For example, if the 10-year cancer recurrence rate without adjuvant chemotherapy is estimated to be 5 percent, the maximum absolute benefit to be derived from chemotherapy cannot exceed 5 percent. Furthermore, knowledge that chemotherapy generally prevents only a minority of recurrences tells us that the absolute benefit in terms of recurrence in that situation will likely be less than 2 percent. So while prognostic estimates are not direct estimates of benefit per se, they provide enough information to estimate benefit crudely and can sometimes be relevant for patient decision-making.
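The bound described above can be illustrated with a short calculation. This is a minimal sketch: the function name is hypothetical, and the 35 percent relative risk reduction is an assumed value chosen only to show how a figure below 2 percent can arise; it is not a value reported for this example.

```python
def absolute_benefit(baseline_risk: float, relative_risk_reduction: float) -> float:
    """Absolute risk reduction = baseline risk x relative risk reduction."""
    if not 0.0 <= baseline_risk <= 1.0:
        raise ValueError("baseline_risk must be between 0 and 1")
    if not 0.0 <= relative_risk_reduction <= 1.0:
        raise ValueError("relative_risk_reduction must be between 0 and 1")
    return baseline_risk * relative_risk_reduction


# 10-year recurrence risk without chemotherapy, as in the example in the text.
baseline = 0.05

# Upper limit: chemotherapy would have to prevent every recurrence.
upper_bound = absolute_benefit(baseline, 1.0)

# ASSUMPTION: chemotherapy prevents 35 percent of recurrences (illustrative only).
assumed_benefit = absolute_benefit(baseline, 0.35)

print(f"Upper bound on absolute benefit: {upper_bound:.2%}")
print(f"Benefit under the assumed reduction: {assumed_benefit:.2%}")
```

Under any relative risk reduction below 40 percent, the absolute benefit for a 5 percent baseline risk stays under 2 percentage points.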

Oncotype DX

Currently a prospective randomized clinical trial, TAILORx, is underway with the goal of assessing the value of adjuvant chemotherapy among patients with mid-range RS results. However, one other published study does address the potential value of the RS in predicting chemotherapy benefit.

A synopsis of the clinical utility evidence presented in the following section is reported in Table 19.

Table 19. Clinical Utility, Oncotype DX™.


Paik et al., 2006.53 The authors used the Oncotype DX assay to investigate whether the RS was a predictor of the benefit from chemotherapy in ER-positive, lymph node negative, breast cancer patients. This study used 651 patients from the NSABP B-20 randomized trial and compared a group treated with both tamoxifen and chemotherapy with a group of patients who were randomized to tamoxifen only. Gene expression analysis was found to be correlated with chemotherapy benefit, defined in terms of 10-year distant recurrence-free survival (DRFS).

Kaplan-Meier analysis on all patients showed a significant benefit from the use of chemotherapy (P value = 0.02); however, when the data was stratified by RS risk groups, only the high RS risk group of patients benefited from using chemotherapy (P value = 0.001).

When the authors used multivariate Cox proportional hazard analysis, findings about the benefit from chemotherapy use were unclear due to wide confidence intervals in the low and intermediate RS risk groups (low RS risk group, RR = 1.31; 95 percent CI: 0.46–3.78; intermediate RS risk group, RR = 0.61; 95 percent CI: 0.24–1.59). Patients classified in the high RS risk group, however, showed a significant benefit from the use of chemotherapy (RR = 0.26; 95 percent CI: 0.13–0.53).

The authors also looked for interaction between each variable and chemotherapy treatment using separate likelihood ratio tests. The RS was the only significant interaction (P=0.038), with only slight statistical weakening when age, tumor size, tumor grade and site were added to the model individually (P values from 0.035 to 0.068). When RS was fit as a continuous score, there was not a clear threshold that predicted no benefit for chemotherapy.53

Overall, this study produced preliminary, high-quality evidence that the RS from the Oncotype DX assay has clinical utility, i.e. predictive power in assessing the benefit of chemotherapy usage in ER-positive, lymph node negative breast cancer patients. The embedding of this study within a large, well conducted RCT was a strength. However, some patients from the tamoxifen-only arm of the NSABP B-20 trial were in the training data sets for the Oncotype DX assay. While the algorithm was trained for the outcome of recurrence and not chemotherapy benefit, optimization of recurrence prediction in one arm of this study could translate into a somewhat enhanced estimate of chemotherapy benefit, although it is unlikely to account for the large effect seen here. Finally, while the models could not sustain the inclusion of all possible clinical variables, they could have included a composite score, either standard risk predictors, or one tailored for the data set (Table 19, Appendix I, Evidence Tables 1, 2 and 4).

Correlation between RS and chemotherapy response

Gianni et al., 2005.49 This study focused on the pathological complete response (pCR) to preoperative chemotherapy in node negative and node positive patients, looking at the correlation between pCR and RS. Two independent cohorts of patients were used, one from the Italian National Cancer Institute of Milan, Italy, and one from the M.D. Anderson Cancer Center of Houston, U.S. (Appendix I, Evidence Table 2), and they were evaluated by two different technologies (RT-PCR and the Affymetrix hgu133a array). The study also identified additional genes that are associated with pCR and allowed the development of a new gene panel associated with pCR, as well as the evaluation of the association of Oncotype DX RS with pCR.

Results of the Oncotype DX assay in the Milan cohort. Three hundred and eighty-four genes were analyzed by RT-PCR in the Milan cohort of patients, including the 21 genes assessed by the Oncotype DX assay. Data showed good discrimination of pCR by RS. Probit regression-based models with and without the incorporation of the RS resulted in a P value of 0.005 in a global likelihood ratio test.

This study produced preliminary evidence that the RS from the Oncotype DX assay has predictive power in assessing the likelihood of pCR after pre-operative chemotherapy (Table 19, Appendix I, Evidence Tables 1, 2 and 4).

Mina et al., 2006.51 In this study, paraffin-embedded pre-treatment core biopsies from a completed phase II trial of 70 patients with newly diagnosed stage II or III breast cancer who were treated with sequential doxorubicin and docetaxel were used to identify genes that correlate with pCR. Gene expression was investigated by RT-PCR in 45 patients, using the same procedures as the Oncotype DX assay. A total of 192 genes (187 candidate genes and 5 reference genes) were tested, including those used to compute the Oncotype DX Recurrence Score.

Individual genes, as well as groups of biologically related genes, were found to be associated with pCR; however, no correlation between Oncotype DX RS and pCR was found (P = 0.67). A total of 22 individual genes had an uncorrected P value of less than 0.05 in a likelihood ratio test derived from logistic regression models; however, 13 genes would be expected to correlate with pCR at the 0.05 P value level by chance alone.

This study provides preliminary evidence that the RS from the Oncotype DX assay cannot predict pCR after primary chemotherapy in advanced breast cancer patients (with variable ER and HER-2 status, lymph node involvement, tumor size, and tumor grade) (Table 19, Appendix I, Evidence Tables 1, 2 and 4).

Chang et al., 2007.55 This study is currently in press for Breast Cancer Research and Treatment. The authors investigated whether expression of the 21 genes of the Oncotype DX assay and other candidate genes in locally advanced breast cancer tumors could be used to predict response to docetaxel treatment. The 97 women in this study were enrolled in three phase II studies of neoadjuvant docetaxel at Baylor College of Medicine, Houston, U.S. Clinical response was assessed by the Response Evaluation Criteria in Solid Tumors (RECIST): clinical complete response (CR) was defined as complete disappearance of the tumor, while partial response (PR) was defined as at least a 30 percent decrease in unidimensional size. An increase of more than 25 percent was defined as clinical progressive disease (PD). Any response that did not meet the definition of CR, PR, or PD was defined as stable disease (SD). All patients received primary surgery and standard adjuvant therapy. Core biopsies from 97 patients were obtained before treatment and RNA levels of expression for the selected genes were studied by RT-PCR, following the specified protocols for the Oncotype DX assay.
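The response categories defined above can be written as a small decision rule. This is only a sketch of the definitions as summarized in this paragraph; the function name and the use of a single unidimensional measurement in centimeters are assumptions made for illustration, and the study's actual assessment procedure may have involved more detail.

```python
def classify_clinical_response(baseline_size_cm: float, followup_size_cm: float) -> str:
    """Classify clinical response using the thresholds summarized in the text.

    CR: complete disappearance of the tumor
    PR: at least a 30 percent decrease in unidimensional size
    PD: an increase of more than 25 percent
    SD: anything that is not CR, PR, or PD
    """
    if baseline_size_cm <= 0:
        raise ValueError("baseline_size_cm must be positive")
    if followup_size_cm < 0:
        raise ValueError("followup_size_cm cannot be negative")

    if followup_size_cm == 0:
        return "CR"

    relative_change = (followup_size_cm - baseline_size_cm) / baseline_size_cm
    if relative_change <= -0.30:
        return "PR"
    if relative_change > 0.25:
        return "PD"
    return "SD"


# Hypothetical measurements for a tumor that was 6 cm at baseline.
for followup in (0.0, 3.0, 5.5, 8.0):
    print(followup, classify_clinical_response(6.0, followup))
```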

Of the selected 97 patients, 81 (84 percent) had sufficient invasive cancer, 80 (82 percent) had sufficient RNA to perform the RT-PCR based assay, and 72 (74 percent) had known clinical response data. The mean age was 48.5 years, while the median tumor size was 6 cm. A clinical CR was observed in 12 patients (16.7 percent), a partial response in 41 (56.9 percent), and stable disease in 17 (23.6 percent), while progressive disease was present in 2 patients (2.8 percent). Pathologically, pCR was observed in 2 patients (3.2 percent), ‘incomplete’ responses were observed in 61 patients (96.8 percent), and pathologic response was unknown for 9 patients.

The authors found that a clinical CR was more likely with a high RS (P = 0.008). When the RS was used as a continuous variable, a 50-unit increase in the RS was associated with a five-fold increase in the odds of achieving clinical CR (95 percent CI 1.3, 6.0). Moreover, the logistic model for the RS indicated that a 14-unit increase in the RS (the difference between low and high risk groups, as defined by the standard thresholds) was associated with an odds ratio for clinical complete response of 1.7 (95 percent CI 1.15, 2.60). The authors concluded that a high risk patient is at least 1.7 times more likely to achieve a clinical CR with neoadjuvant chemotherapy compared to a low risk patient. Finally, the accuracy of the Oncotype DX RS in predicting the response to neoadjuvant chemotherapy with docetaxel throughout the range of RS values was judged to be at least moderate, with AUC of 0.73.

Overall, this study provided preliminary evidence that the RS from the Oncotype DX assay has predictive value in assessing the likelihood of a clinical CR to primary chemotherapy with docetaxel. However, the small patient cohort points to the need for further confirmation (Table 19, Appendix I, Evidence Tables 1, 2 and 4).

Oncotype influence on decisionmaking

Oratz et al., 2007 (in press).56 This study investigated whether the Oncotype DX RS had influenced both clinicians' treatment recommendations and the actual treatment administered in patients with ER positive, lymph node negative, early (stage I or II) breast cancer. A retrospective analysis was performed on 74 patients from a community-based oncology practice for whom RS was determined. Treatment recommendations prior to RS knowledge were compared with treatment recommendations after RS knowledge, and to the treatment eventually administered.

Knowledge of RS changed the clinicians' treatment recommendations in 21 percent of patients, and the actual administered treatment in 25 percent of the patients. In particular, the decision to add chemotherapy to the hormonal therapy was generally associated with the high-risk group, whereas the decision to change from chemotherapy to hormonal therapy was associated, in general, with low RS.

While this study produced preliminary evidence that knowledge of the RS from the Oncotype DX assay can have an impact on the clinical management of patients diagnosed with ER positive, lymph node negative, early breast cancer, it did not report specifically what the patients (or doctors) were told or understood about their risk of recurrence. Because it is unknown whether absolute risks were a factor in decision-making, the study is minimally informative as to the actual risk thresholds used by women and their treating physicians (Appendix I, Evidence Tables 1, 2 and 4).

Economic studies

Hornberger et al., 2005.67 The objectives of this study were twofold. First, the authors sought to estimate the incremental benefits, costs, and cost-effectiveness of using Oncotype DX to better assign risk of distant recurrence-free survival associated with early stage breast cancer. Secondly, the authors wanted to assess the factors that most influence potential benefits and efficient use of the 21-gene RT-PCR recurrence score. The outcomes of interest to the study included overall survival, relevant costs of breast cancer care, and distant recurrence-free survival.

Cost-utility analyses used a Markov model to forecast overall survival, quality of life, costs, and cost-effectiveness. Two scenarios were considered, based on NCCN classification of patients with lymph node negative, ER positive, early stage breast cancer who were expected to receive 5 years of hormonal therapy into a low risk (T1a N0-1mi) group that did not receive chemotherapy versus a high risk (T1b with unfavorable features or T1c) group that did receive chemotherapy. Patients were then reclassified using the RS. Annual risks of recurrence and survival were obtained from published meta-analyses of clinical trials, and the study model included costs of the assay and drugs, including chemotherapy (Table 20, Appendix I, Evidence Tables 1, 2, and 5).

Table 20. Comparison of economic studies.


Summary of study findings. The analysis reported that using the 21-gene RT-PCR assay to reclassify patients who were defined by NCCN criteria as low risk (to intermediate or high risk) would lead to an average gain in overall survival per reclassified patient of 1.86 years. Total cost estimates increased by about $25,000. This amount included $12,190 to identify intermediate- or high-risk patients and at least $15,000 for chemotherapy, and was offset by savings of $2,344 because of the lower risk of recurrence. The cost-utility of RS testing for this cohort was $31,452 per quality-adjusted life-year (QALY) gained.

The authors also reported that reclassifying patients defined as high risk (by 2005 NCCN criteria) to low risk (using the 21-gene RT-PCR assay) was cost saving. The added cost of testing ($7,073) to identify 1 reclassified patient was offset by an estimated $15,000 in savings for eliminating the need for chemotherapy.
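The cost figures in the two reclassification scenarios can be checked with simple arithmetic. The sketch below uses only the dollar amounts quoted in the summary above; the function and variable names are hypothetical, and because chemotherapy is described as costing "at least" $15,000, the first result should be read as approximate.

```python
def net_cost(testing_cost: int, chemotherapy_cost_change: int, recurrence_savings: int = 0) -> int:
    """Net cost per reclassified patient, in dollars (negative means cost saving)."""
    return testing_cost + chemotherapy_cost_change - recurrence_savings


# NCCN low risk reclassified to intermediate or high risk: chemotherapy is added.
low_to_higher = net_cost(testing_cost=12_190, chemotherapy_cost_change=15_000,
                         recurrence_savings=2_344)

# NCCN high risk reclassified to low risk: chemotherapy is avoided.
high_to_low = net_cost(testing_cost=7_073, chemotherapy_cost_change=-15_000)

print(f"Low risk reclassified upward: ${low_to_higher:,}")
print(f"High risk reclassified to low risk: ${high_to_low:,}")
```

The first figure is consistent with the reported increase of about $25,000, and the negative second figure corresponds to the reported cost saving.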

Using the 21-gene RT-PCR assay was expected to improve quality-adjusted survival by a mean of 8.6 QALYs and reduce overall costs by about $203,000 in a hypothetical population of 100 patients with characteristics similar to those of the NSABP B-14 participants, more than 90 percent of whom were NCCN-defined as high risk. The estimated cost-effectiveness was most influenced by the propensity to administer chemotherapy based on the RS, and by the very small proportion of patients at low risk as defined by 2005 NCCN guidelines. The 2007 NCCN guideline indicates that the use of chemotherapy in these patients is now considered optional, thereby diminishing the utility of this model.

Critical appraisal of the analysis. The EPC team appraised the analysis using published guidelines for good practice in decision-analytic modeling in health technology assessment (Philips et al., 2004).41 The appraisal took into consideration the domains of structure, data, and consistency (Table 20, Appendix I, Evidence Table 5).

Structure and Data. The authors provided a clear description of many aspects of the structure of the analysis, including the decision problem, objectives of the evaluation, perspective of the analysis, rationale for the model structure, and structural assumptions. However, the model inputs were not entirely consistent with the stated perspective of the analysis. For instance, the model did not include all costs that are relevant from a societal perspective, such as decreased productivity and days lost from work. The authors also did not address the limitations in how utility estimates were derived. This is an important limitation because utility estimates can vary considerably depending on the methods used to derive them. Nor did the authors justify extrapolating beyond the 10-year followup period for which recurrence data is available. Finally, the authors did not report much information about their assessment of methodological and structural uncertainties. Without such information it is difficult to determine how their projections might differ if different assumptions were made in the decision model.

The authors correctly pointed out that the 2005 version of the NCCN breast cancer guideline recommends chemotherapy for all node-negative tumors greater than 1 cm (T1c).74 Since 84 percent of the patients included in the Paik study28 had tumors larger than 1 cm (T1c), it is unsurprising that a very large proportion of patients overall would be spared chemotherapy (gene expression profiling data would be expected to identify approximately half of these patients as having a low RS). However, by 2007 the NCCN panel had refined its criteria for recommending chemotherapy,6 which is now considered optional (adjuvant hormonal therapy ± chemotherapy) for those with ER-positive, HER-2-negative disease and tumors greater than 1 cm (T1c). Therefore, it is reasonable to speculate that approximately half of these patients might opt for no chemotherapy. This is similar to the proportion of patients that would be found to have a low RS, although these two groups of patients may not necessarily be the same.

Consistency. Appendix I, Evidence Table 5 notes that the authors did not report information about the internal and external consistency of their analysis. The analysis would be more convincing if it gave more information on whether the mathematical logic of the model had been tested (internal consistency) or if results from other models were available for comparison (external consistency). Nevertheless, the results of the model make intuitive sense and seem to be consistent with published data on the performance characteristics of the 21-gene RT-PCR recurrence score.

Summary of critical appraisal. Overall, the EPC team concluded that this economic analysis met most of the standards set by the rigorous guidelines of Philips et al., 2004.41 It is not clear whether the limitations noted above biased the results for or against the 21-gene RT-PCR assay, but extension of the timeframe beyond 10 years could overstate the benefits of using the assay. Given that this study was sponsored in part by the manufacturer of the 21-gene RT-PCR assay (Genomic Health, Inc., Redwood City, California), the EPC team would have had more confidence in the results if the authors had provided more information about methodological and structural uncertainties as well as other potential sources of bias such as the derivation of the utility estimates. The generalizability of these results to patients in 2007 is also limited, as the 2005 NCCN guidelines have since been updated. Thus, the team has only moderate confidence that the results of the economic analysis provide reasonable estimates of the potential cost-effectiveness of using the 21-gene RT-PCR assay to guide treatment of early stage breast cancer.

Lyman et al., 2007.75 The main objective of the second study7 was to estimate the cost-effectiveness of 21-gene RT-PCR assay-guided treatment of patients with ER positive, lymph node-negative, early-stage breast cancer with either tamoxifen alone or the combination of chemotherapy and tamoxifen.

This analysis incorporated data that validated the prognostic accuracy for distant RFS using a 21-gene RT-PCR assay in 668 lymph node-negative, ER positive women with early-stage breast cancer receiving tamoxifen on NSABP B-14. The analysis also incorporated data that validated the predictive accuracy for treatment efficacy in 651 patients randomized in NSABP B-20, and 645 patients in NSABP B-14.

The study design involved cost-utility analyses using a “clinical decision model” designed to compare clinical, economic, and quality of life outcomes for three adjuvant treatment strategies: 1) tamoxifen alone, 2) chemotherapy followed by tamoxifen, or 3) therapy based on the results of the 21-gene RT-PCR assay. Using the RS from the 21-gene RT-PCR assay, patients were classified as high risk (RS ≥ 31), intermediate risk (RS 18–30), or low risk (RS < 18) for distant recurrence at 10 years. The third strategy assumed that low-risk patients would receive tamoxifen, and intermediate or high-risk patients would receive chemotherapy and tamoxifen. Clinical outcomes were estimated in terms of life expectancy or life-years saved as derived from NSABP B-20 and B-14 data. Economic outcomes included selected costs of cancer care, including the costs of chemotherapy, surveillance without recurrence, use of the 21-gene RT-PCR assay, and treatment of recurrence. Quality of life outcomes were estimated based on the utility associated with use of chemotherapy. The treatment strategies were compared in terms of the additional cost of one strategy over another (marginal cost), the additional clinical benefit (marginal efficacy), and the additional quality-adjusted clinical benefit (marginal utility) (Table 20, Appendix I, Evidence Tables 1, 2 and 5).
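The third, assay-guided strategy amounts to a two-step rule: classify the patient by RS, then assign treatment by risk group. The sketch below encodes the cutoffs and the treatment assumption stated above; the function names are hypothetical, and the RS is treated as an integer because the text gives integer cutoffs.

```python
def rs_risk_group(recurrence_score: int) -> str:
    """Risk group by Recurrence Score: low (< 18), intermediate (18-30), high (>= 31)."""
    if recurrence_score < 18:
        return "low"
    if recurrence_score <= 30:
        return "intermediate"
    return "high"


def rs_guided_treatment(recurrence_score: int) -> str:
    """Treatment under the RS-guided strategy as described in the text."""
    if rs_risk_group(recurrence_score) == "low":
        return "tamoxifen alone"
    # Intermediate- and high-risk patients were assumed to receive both.
    return "chemotherapy followed by tamoxifen"


# Hypothetical scores on either side of each cutoff.
for score in (10, 17, 18, 30, 31, 45):
    print(score, rs_risk_group(score), "->", rs_guided_treatment(score))
```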

Summary of study findings. The lowest expected mean cost per life-year saved was associated with treatment with tamoxifen alone ($11,890), whereas the greatest expected mean cost was associated with treatment with both chemotherapy and tamoxifen ($18,418). The expected cost of each strategy increased as the assumed cost of treating distant recurrence increased. Above a cost of $100,759 for treating recurrence, therapy guided by the RS provided a net cost savings compared with other strategies and was always cost-saving compared with the chemotherapy and tamoxifen strategy. The tamoxifen strategy was associated with the lowest costs for all reasonable followup cost assumptions among those without recurrence. Therapy guided by the RS was favored over chemotherapy and tamoxifen for total chemotherapy costs exceeding $5,822. The use of therapy guided by the RS was more costly for low-cost chemotherapy regimens not requiring additional supportive care, whereas a net cost savings between $500 and $10,000 was estimated with RS guided therapy for other commonly used and higher-cost adjuvant chemotherapy regimens.

Compared to tamoxifen alone, the expected incremental cost associated with RS-guided therapy was $4,272. The expected incremental cost associated with chemotherapy and tamoxifen was $6,527. The incremental cost-effectiveness ratio compared with tamoxifen alone favored the use of RS-guided therapy ($1,944 per life-year saved) over the use of chemotherapy and tamoxifen ($3,385 per life-year saved). When the analysis considered increases in healthy life expectancy, the incremental life-years saved increased for the RS-guided therapy compared with tamoxifen alone, and the corresponding marginal cost-effectiveness decreased.

Expected QALYs favored RS-guided therapy over chemotherapy and tamoxifen for all health utility values, with increasing incremental QALYs as the impact of chemotherapy on measured utility increased. Recurrence-score-guided therapy had greater expected QALYs compared with tamoxifen alone, until the utility associated with chemotherapy fell below 0.80. At a utility of 0.90 for adjuvant chemotherapy, RS-guided therapy was associated with a gain of 0.97 QALYs, a cost-utility ratio of $4,432 per QALY compared with tamoxifen alone, and a gain of 1.71 QALYs with net cost savings when compared with the chemotherapy and tamoxifen combination.
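A cost-utility ratio is the incremental cost of a strategy divided by its incremental quality-adjusted benefit. As a rough consistency check, the sketch below divides the reported incremental cost of RS-guided therapy versus tamoxifen alone ($4,272) by the reported gain of 0.97 QALYs. It assumes that the same incremental cost applies to the utility analysis, which the summary does not state explicitly; the function name is hypothetical.

```python
def incremental_ratio(incremental_cost: float, incremental_effect: float) -> float:
    """Incremental cost per unit of effect (for example, dollars per QALY gained)."""
    if incremental_effect <= 0:
        raise ValueError("incremental_effect must be positive for a meaningful ratio")
    return incremental_cost / incremental_effect


# Reported values for RS-guided therapy compared with tamoxifen alone.
incremental_cost_usd = 4272.0
incremental_qalys = 0.97

ratio = incremental_ratio(incremental_cost_usd, incremental_qalys)
print(f"Approximate cost per QALY gained: ${ratio:,.0f}")
```

The result is close to, but not identical with, the reported $4,432 per QALY; a difference of this size could plausibly come from rounding of the published inputs, although the source does not say so.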

Critical appraisal of the analysis. The EPC team appraised the analysis using published guidelines for good practice in decision-analytic modeling in health technology assessment (Philips et al., 2004),41 taking into consideration the domains of structure, data, and consistency (Table 20, Appendix I, Evidence Table 5).

Structure. Although the authors provided a clear description of the decision problem, they did not state the perspective of the model. Moreover, the authors did not provide enough information about the structure of the model to allow an evaluation of the appropriateness of the model type or of the causal relationships described by the model. The authors also did not justify extrapolating beyond the 10-year period for which recurrence data is available.

Data. The authors provided some explanation and justification of the data used in the analysis, citing previous work for some of the details. However, the authors did not include all relevant costs. They included the costs of adjuvant chemotherapy, surveillance, use of the Oncotype DX assay, and treatment of recurrence, but they did not include other treatment-related direct costs (e.g., costs of administration, associated testing, and transportation) or indirect costs (e.g., decreased productivity). Although indirect costs may be implicitly included in utility values assigned to relevant health states, the authors did not provide enough information to determine whether that was done. The analysis would have been stronger if it had estimated cost-effectiveness with and without inclusion of indirect costs and other treatment-related costs. The authors did not mention any health-state utilities other than the utility with chemotherapy, and did not give sufficient detail about how they estimated the utility with chemotherapy. In addition, the authors did not report on the quality of the data. A single study was used as the source of estimates for the relative effects of the treatment strategies. The authors also did not report sufficient information about the sensitivity analysis and alternative assumptions. Finally, the authors did not report much information about their assessment of methodological and structural uncertainties.

Consistency. The authors did not report information about the internal and external consistency of their analysis, but the results of the model make intuitive sense. Generally, the results seem to be consistent with the cited data on the performance characteristics of the 21-gene RT-PCR RS.

Summary of critical appraisal. Overall, the EPC team concluded that this economic analysis did not meet many of the standards set by the rigorous guidelines of Philips et al., 2004.41 These limitations are particularly serious because the authors received research support from the manufacturer of the 21-gene RT-PCR assay. Consequently, the EPC team has little confidence in the results of this analysis.

Summary of available studies. Based on the evidence from the stronger of the two available studies, the EPC team concluded that the 21-gene assay, when used to guide treatment for patients previously classified as low risk by NCCN-defined criteria, may be cost-effective compared to standard treatment approaches in women with lymph node-negative, ER positive early-stage breast cancer. Similarly, the EPC team concluded that the 21-gene assay, when used to guide treatment for patients previously classified as high risk by NCCN criteria, may be cost-saving compared to standard treatment. The overall body of evidence on economic outcomes is weak because of the limitations of the two available studies.

MammaPrint

No published studies evaluated the ability of the 70-gene signature for the main MammaPrint assay to predict chemotherapy benefit.

Economic studies

Oestreicher et al., 2005.76 The main objective of this study was to compare the cost-effectiveness of the Netherlands Cancer Institute gene expression profiling (GEP) assay to the NIH guidelines for the identification of early stage breast cancer patients who would benefit from adjuvant chemotherapy based on risk of distant recurrence. Although the references cited for the performance characteristics of the GEP assay indicate that the investigators were using data on MammaPrint, the article does not clearly state that they were analyzing MammaPrint.

The study design involved a cost-utility analysis. Using a Markov model, the investigators estimated the incremental cost and QALYs associated with use of the GEP assay as compared to use of the NIH guidelines in a hypothetical cohort of premenopausal women averaging 44 years of age newly diagnosed with stage I/II breast cancer. The performance characteristics of the tests were based on data from the Netherlands Cancer Institute cohort.25 In the Markov model, the investigators assumed that the results of the GEP assay would be used to classify patients as having a “good prognosis” or a “poor prognosis” based on a test cutoff derived from the first validation study of the GEP assay.21 They also assumed that the NIH guidelines would be used to classify patients as having a “good prognosis” or a “poor prognosis,” that women with a “poor prognosis” would receive adjuvant chemotherapy, and that women with a “good prognosis” would not receive chemotherapy. The model considered the following clinical events: distant recurrence of breast cancer, mortality due to distant recurrence, and mortality from other causes. The economic outcomes included the cost of the GEP assay, the cost of adjuvant chemotherapy, and the cost of managing distant recurrence of breast cancer. Quality of life outcomes were estimated in terms of QALYs, with utility estimates for specific health states derived from previous publications. The two strategies were compared in terms of the number of cases of distant recurrence prevented, costs, and QALYs (Table 20, Appendix I, Evidence Table 5).
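For readers unfamiliar with Markov cohort models, the sketch below shows the general mechanics with the three clinical events named above (distant recurrence, death due to distant recurrence, death from other causes). It is not the investigators' model: the state structure, the annual cycle, the discount rate, and every numeric input are hypothetical values chosen only to make the example run.

```python
from dataclasses import dataclass


@dataclass
class MarkovInputs:
    p_recurrence: float        # annual probability of distant recurrence (hypothetical)
    p_death_recurrence: float  # annual probability of death due to recurrence (hypothetical)
    p_death_other: float       # annual probability of death from other causes (hypothetical)
    utility_free: float        # utility weight while recurrence-free (hypothetical)
    utility_recurrence: float  # utility weight after distant recurrence (hypothetical)
    cost_upfront: float        # one-time cost of testing and/or chemotherapy (hypothetical)
    cost_recurrence: float     # annual cost of managing distant recurrence (hypothetical)


def run_cohort(inputs: MarkovInputs, years: int, discount_rate: float = 0.03):
    """Return (discounted QALYs, discounted cost, final state shares) per patient."""
    if inputs.p_recurrence + inputs.p_death_other > 1.0:
        raise ValueError("exit probabilities from the recurrence-free state exceed 1")
    if inputs.p_death_recurrence + inputs.p_death_other > 1.0:
        raise ValueError("exit probabilities from the recurrence state exceed 1")

    # The whole cohort starts in the recurrence-free state.
    free, recurred, dead = 1.0, 0.0, 0.0
    qalys, cost = 0.0, inputs.cost_upfront

    for year in range(years):
        weight = 1.0 / (1.0 + discount_rate) ** year
        # Accrue quality-adjusted time and costs for the current state shares.
        qalys += weight * (free * inputs.utility_free + recurred * inputs.utility_recurrence)
        cost += weight * recurred * inputs.cost_recurrence

        # Move the cohort between states for the next annual cycle.
        newly_recurred = free * inputs.p_recurrence
        free_deaths = free * inputs.p_death_other
        recurred_deaths = recurred * (inputs.p_death_recurrence + inputs.p_death_other)
        free -= newly_recurred + free_deaths
        recurred += newly_recurred - recurred_deaths
        dead += free_deaths + recurred_deaths

    return qalys, cost, (free, recurred, dead)


# Two hypothetical strategies: one treats more women (lower recurrence risk,
# higher upfront cost, lower utility from chemotherapy), one treats fewer.
treat_more = MarkovInputs(0.02, 0.20, 0.01, 0.88, 0.60, 15_000.0, 20_000.0)
treat_fewer = MarkovInputs(0.03, 0.20, 0.01, 0.92, 0.60, 9_000.0, 20_000.0)

for label, inputs in (("treat more", treat_more), ("treat fewer", treat_fewer)):
    qalys, cost, _ = run_cohort(inputs, years=20)
    print(f"{label}: {qalys:.2f} QALYs, ${cost:,.0f}")
```

Comparing the two printed rows gives the incremental QALYs and incremental cost of one strategy over the other, which is the form in which cost-utility results are reported.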

Summary of study findings. The NIH guidelines identified 96 percent of the cohort as high risk, whereas the GEP assay identified 61 percent of patients as high risk, with sensitivities of 98 percent for the NIH guidelines and 84 percent for GEP. Specificities were 51 percent for GEP and 5 percent for the NIH guidelines. Since there is a 35 percent risk reduction in distant recurrence from use of chemotherapy, using NIH guidelines to identify high-risk women and treat with chemotherapy prevented 34 percent of distant recurrences compared to 29 percent for GEP. After including the negative impact on life expectancy and quality of life from chemotherapy and distant recurrence, the NIH guidelines and GEP yielded 10.08 and 9.86 QALYs, respectively. Total costs were $32,636 for the NIH guidelines and $29,754 for GEP.
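The reported shares of distant recurrences prevented are consistent with multiplying each classifier's sensitivity by the 35 percent risk reduction from chemotherapy. The sketch below reproduces that arithmetic and the differences in QALYs and costs; it is an interpretation of the reported figures rather than the investigators' stated formula, and the function name is hypothetical.

```python
def share_of_recurrences_prevented(sensitivity: float, risk_reduction: float) -> float:
    """Share of all distant recurrences prevented when every test-positive woman is treated.

    Sensitivity is the share of women destined to recur who are flagged as
    high risk; only those women can benefit from the risk reduction.
    """
    return sensitivity * risk_reduction


RISK_REDUCTION = 0.35  # reported relative reduction in distant recurrence with chemotherapy

nih_prevented = share_of_recurrences_prevented(0.98, RISK_REDUCTION)
gep_prevented = share_of_recurrences_prevented(0.84, RISK_REDUCTION)

# Differences between strategies, from the reported totals (GEP minus NIH).
qaly_difference = round(9.86 - 10.08, 2)
cost_difference = 29_754 - 32_636

print(f"NIH guidelines: {nih_prevented:.0%} of distant recurrences prevented")
print(f"GEP assay: {gep_prevented:.0%} of distant recurrences prevented")
print(f"GEP versus NIH: {qaly_difference} QALYs, ${cost_difference:,}")
```

In this comparison the GEP strategy costs less but yields fewer QALYs than the NIH guidelines.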

Although the GEP assay was projected to identify fewer women for chemotherapy than the NIH guidelines (61 versus 96 percent of the cohort), quality of life benefits in the women who did not need chemotherapy were outweighed by the decrease in life expectancy in the women who needed chemotherapy but did not receive it because of GEP's lower sensitivity.

The authors concluded that, in order to improve quality of life by allowing women to safely avoid chemotherapy while not missing women whose survival is compromised by avoiding therapy, GEP's sensitivity would have to increase to at least 95 percent while maintaining a specificity of 51 percent. The GEP assay did not attain a sensitivity of 95 percent regardless of the test cutoff used in the analysis.

Critical appraisal of the analysis. The EPC team appraised the analysis using published guidelines for good practice in decision-analytic modeling in health technology assessment,41 taking into consideration the domains of structure, data, and consistency (Table 20, Appendix I, Evidence Table 5).

Structure. As indicated in Table 20, the authors provided a clear description of most aspects of the structure of the analysis, including the decision problem, objectives of the evaluation, perspective of the analysis, rationale for the model structure, and structural assumptions. The model inputs were consistent with the stated perspective of the analysis. The authors did not justify using a timeframe beyond the 6.7-year period for which recurrence data is available.

Data. The article was very strong in providing explanation and justification of the data used in the analysis. Limitations were that the authors did not justify extrapolation of data beyond 6.7 years of followup and that they only compared their model to the NIH guideline. In addition, although the authors listed a number of references for their use of utilities, they did not provide any explanation of how they derived specific utility estimates from these references. They also did not provide any explanation of the methods or scaling techniques that were used to derive the utility estimates. Thus, we cannot determine whether the utilities were based on the standard gamble technique, which is the gold standard, or on other scaling techniques. This is important because the standard gamble technique generally yields utility values that are higher than the values derived using other techniques. The estimates used in this study seem low compared to the values assigned to most serious health conditions.77,78 Also, these references for the utility estimates are significantly more dated than some of the references used to obtain cost data.

Consistency. The authors discussed the internal and external consistency of their analysis, and the results of the model make intuitive sense.

Summary of critical appraisal. Overall, the EPC team concluded that this economic analysis met most of the rigorous standards set by Phillips et al., 2004.41 The EPC team therefore has confidence in the results of this analysis. Although we had some uncertainty about the utilities used in the analysis, the EPC team believes that this limitation is unlikely to have changed the overall conclusion of the authors, which is based on the lack of sensitivity of the GEP assay.

H/I Ratio

Jerevall et al., 2007.63 This paper investigated whether the two-gene ratio can predict the benefit of 2 years versus 5 years of tamoxifen treatment in postmenopausal breast cancer patients, and also assessed its prognostic value in systemically untreated premenopausal patients. Expression of HOXB13 and IL17BR was quantified by RT-PCR in tumors from 264 randomized postmenopausal patients and 93 systemically untreated premenopausal patients. The two study populations were collected as part of a collaborative study between two centers in Sweden; 72 percent of the randomized patients were lymph node positive and 74 percent were ER positive. To stratify the patients into risk groups, the authors dichotomized the ratio at the median. Thus the normalization procedure and dichotomization differed from the approach used by Ma.61 The prognostic results from this study are reported under Key Question 3 (clinical validity).
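As an illustration of median dichotomization, the following minimal Python sketch splits a set of ratio values into "high" and "low" groups at the sample median. The values are hypothetical, and the rule that a value exactly equal to the median is assigned to the "low" group is an assumption; the handling of ties in the study is not described here.

```python
from statistics import median


def dichotomize_by_median(ratios):
    """Label each ratio 'high' if it exceeds the sample median, else 'low'.

    Assumption: a value equal to the median is labeled 'low'.
    """
    cutoff = median(ratios)
    return ["high" if r > cutoff else "low" for r in ratios]


# Hypothetical two-gene ratio values, for illustration only.
example_ratios = [0.4, 1.7, 0.9, 2.3, 1.1, 0.6]
print(dichotomize_by_median(example_ratios))
```

Because the cutoff is the median of the cohort being analyzed, the same tumor could fall into a different group in a cohort with a different distribution of ratios, which is one reason results from differently dichotomized studies are hard to compare.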

Kaplan-Meier analysis of data from postmenopausal ER-positive patients showed that a low HOXB13-to-IL17BR ratio was associated with a benefit from 5 versus 2 years of tamoxifen treatment (univariate P=0.021), whereas patients with a high ratio showed no benefit (P=0.9). The benefit in the low-ratio group appeared to be due mainly to low expression of HOXB13 (P=0.010, Kaplan-Meier analysis). The predictive significance of both the two-gene ratio and HOXB13 alone was maintained in Cox proportional hazards modeling that adjusted for tumor size, PR status, and lymph node status.

The authors concluded that the ratio, or even HOXB13 alone, could predict the benefit of prolonged endocrine therapy, and that lower expression of IL17BR, given its correlation with poor prognosis, could be an independent prognostic factor.

Neither the patient profile nor the mode of calculation of the ratio was identical to those of previous studies (Table 21, Appendix I, Evidence Tables 10, 11 and 13). However, this study produced additional developmental evidence about the prognostic utility of the HOXB13-to-IL17BR ratio, and of the two individual genes, in ER positive breast cancer patients who received systemic adjuvant therapy.

Table 21. Clinical Utility, two-gene signature and H/I ratio.

Ongoing Studies

TAILORx (Trial Assigning IndividuaLized Options for Treatment (Rx))

The primary objective of TAILORx is to compare the DFS of women with previously-resected axillary-node-negative breast cancer who have an Oncotype DX RS of between 11 and 25 when treated with both adjuvant chemotherapy and hormonal therapy versus hormonal therapy alone. It should be noted that this range is lower on both ends than the standard “Intermediate” RS range, viz. 18–30. This represents a more conservative approach to the use of the RS than is suggested by current categories, in that subjects who agree to forego chemotherapy in this trial will be at lower risk than those in the current “low risk” RS group. The secondary objective is to determine if adjuvant hormonal therapy alone is sufficient treatment (i.e., 10-year distant DFS of at least 95 percent) for patients with an RS of less than or equal to 10.
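The difference between the trial's range and the standard categories can be sketched as follows. This is a minimal Python illustration using only the cutoffs quoted above (standard "Intermediate" range 18 to 30; trial range 11 to 25; hormonal therapy alone at 10 or below). The function names and group labels are hypothetical, and treatment for scores above 25 is not described in this section, so it is labeled as such.

```python
def standard_rs_category(rs):
    """Standard Recurrence Score category, using the 18-30 intermediate range."""
    if rs < 18:
        return "low"
    if rs <= 30:
        return "intermediate"
    return "high"


def tailorx_group(rs):
    """Trial grouping based only on the ranges quoted in the text."""
    if rs <= 10:
        return "hormonal therapy alone"
    if rs <= 25:
        return "randomized"
    return "above trial range (assignment not described here)"


# A score of 15 is "low" by the standard categories but is randomized in the trial;
# a score of 28 is "intermediate" by the standard categories but above the trial range.
for rs in (8, 15, 22, 28, 35):
    print(rs, standard_rs_category(rs), "|", tailorx_group(rs))
```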

This study will not provide direct evidence for the value of Oncotype DX, as all patients in the trial will receive the test. The trial results will indicate whether adjuvant chemotherapy is of value within the trial's intermediate RS range, and will serve as further validation of the absolute risk of recurrence in subjects with scores above and below the range. This will provide better estimates of the degree of benefit from utilization of the test, but will not directly examine what therapeutic choices would have been made and clinical outcomes incurred if only standard risk prediction tools were used. However, since standard risk prediction indices will be calculable, that information may be inferred. First results from this trial are expected in approximately 2013.

MINDACT (Microarray for Node-Negative Disease may Avoid Chemotherapy)

MINDACT is a multi-center, prospective, phase III randomized study comparing use of the MammaPrint assay with a common clinical-pathological prognostic tool, Adjuvant! Online, to select patients for adjuvant chemotherapy in node-negative breast cancer. Patients at low risk by both MammaPrint and standard clinical-pathological criteria will not receive chemotherapy, patients at high risk by both criteria will receive chemotherapy, and patients with discordant results will be randomized to have either MammaPrint alone or the standard criteria decide treatment (i.e., randomized to receive adjuvant chemotherapy or not). This will directly test whether the choice of chemotherapy guided by MammaPrint provides benefit over that guided by the Adjuvant! criteria.
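The allocation logic described above can be sketched in a few lines of Python. This is only an illustration of the design as summarized in this section; the function name, argument names, and return strings are hypothetical and do not come from the trial protocol.

```python
def mindact_allocation(genomic_risk, clinical_risk):
    """Return the treatment pathway for a pair of 'low'/'high' risk calls.

    genomic_risk: risk call from the MammaPrint assay
    clinical_risk: risk call from clinical-pathological criteria
    """
    valid = {"low", "high"}
    if genomic_risk not in valid or clinical_risk not in valid:
        raise ValueError("risk calls must be 'low' or 'high'")
    if genomic_risk == "low" and clinical_risk == "low":
        return "no chemotherapy"
    if genomic_risk == "high" and clinical_risk == "high":
        return "chemotherapy"
    # Discordant calls: randomization decides which tool guides treatment.
    return "randomize: follow genomic call or clinical call"


for g in ("low", "high"):
    for c in ("low", "high"):
        print(f"genomic={g}, clinical={c}: {mindact_allocation(g, c)}")
```

Only the two discordant combinations contribute to the randomized comparison, so the information the trial yields about the assay depends on how often the two tools disagree.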

Other Relevant Studies

Fan et al., 2006.79 This study did not directly address any key question relevant to the evaluation of gene expression-based prognostic estimators, but it evaluated the agreement between gene-expression tests and other predictors, as well as their individual performance on a common dataset. In particular, the 70-gene signature, the gene panel used in Oncotype DX, the 2-gene ratio, and other gene expression signatures were considered. This investigation was carried out on the 295 samples from stage I–II breast cancer patients that had been used to develop the 70-gene test.21 The Oncotype DX RS and the 2-gene ratio were estimated from microarray gene expression data (i.e., not RT-PCR), and thus were not obtained according to the protocols and methods used in the corresponding marketed assays. These are therefore described as “derived” scores below.

All tests except the 2-gene ratio (hazard ratio of about 1) were highly significant predictors of OS and DFS. The agreement between MammaPrint and derived RS was 81 percent (239/295). However, the intermediate and high risk groups, as defined by the RS gene panel, were treated as one group in this paper and compared to the poor prognosis group of patients, as defined by the MammaPrint signature. ER status, tumor grade, tumor size, and lymph node involvement also proved to be significant univariate predictors. The coefficients of clinical predictors were allowed to vary between models in this analysis. All the analyses were repeated for the ER positive (N=225) subset with qualitatively similar results. Good, but not perfect, correlation between predictions was found. This was surprising since the classifications were obtained using different gene sets. The degree of prediction over and above “standard” clinical stratifiers was not clear in the paper, and reclassification of samples was not done.
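The agreement figure quoted above is a simple proportion of concordant classifications. A minimal Python check, using only the two counts given in the text (the function name is illustrative):

```python
def percent_agreement(concordant, total):
    """Percentage of samples assigned to the same risk group by both tests."""
    return 100.0 * concordant / total


# 239 concordant classifications out of 295 samples, as reported in the text.
print(f"{percent_agreement(239, 295):.1f} percent")
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic would require the full cross-tabulation, which is not given here.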

This study is of interest since it compared 5 different classifiers. However, it should not be regarded as a validation of either the Oncotype DX or the H/I ratio assays, since the actual tests were not used on these patients and the RS and two-gene index estimates were obtained from microarray data. In addition, since this was the same dataset used in the development of the 70-gene signature, that signature would be expected to perform better than the RS, for which this was a true test set.

Espinosa et al., 2005.80 In this paper the authors developed an RT-PCR based version of the 70-gene expression signature.21,25 RT-PCR was used to measure, in breast cancer biopsy specimens, the expression of the 70-gene signature, as well as four additional genes (HER-2, EGFR, PLAT, and MUC-1) related to prognosis. The study population was 96 patients diagnosed between 1991 and 1997 for whom samples and followup were available and who were seen in a single Madrid hospital. Half of the patients were lymph node positive, 75 percent were ER positive, and 25 relapses were observed after a median of 70 months of followup. Eighty percent of ER positive patients received tamoxifen, and 74 percent of patients overall received adjuvant chemotherapy.

The objective of the authors was to reproduce the results obtained with the 70-gene profile through an alternative technology. However, for technical reasons only 60 of the 70 genes could be investigated. For this reason, the study cannot be considered a validation of the 70-gene signature. According to the results obtained by RT-PCR, Kaplan-Meier estimates for RFS and OS in the good-profile and poor-profile patient groups were as follows:

  • RFS for Good vs. Poor prognosis profile 70 months after surgery: 85 percent vs. 62 percent.
  • OS for Good vs. Poor prognosis profile 70 months after surgery: 97 percent vs. 72 percent.

Univariate and multivariate Cox proportional hazards regression analyses were performed to compute a hazard ratio for the risk groups for both endpoints. Only the lymph node status (hazard ratio, 1.2; 95 percent CI, 1.09 to 1.36) and the gene profile (hazard ratio, 6.3; 95 percent CI, 1.28 to 31.07) proved to be independent prognostic variables for OS. Only the number of positive lymph nodes (≤3 versus >3) (hazard ratio, 1.13; 95 percent CI, 1.05 to 1.25) and again the gene profile (hazard ratio, 2.74; 95 percent CI, 1.13 to 6.61) were independent prognostic variables for RFS.
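The width of these intervals can be put on a common footing with a short calculation. The following Python sketch assumes the intervals are Wald-type intervals that are symmetric on the log scale, which is common for Cox models but is not stated in the source. Under that assumption, the geometric mean of the two limits should approximately recover the reported hazard ratio, and the log-scale standard error follows from the interval width. The function name and labels are illustrative.

```python
import math


def log_scale_summary(lower, upper, z=1.96):
    """Return (implied point estimate, SE of log hazard ratio) from a 95% CI.

    Assumes a Wald interval: exp(log(HR) +/- z * SE).
    """
    implied_hr = math.sqrt(lower * upper)
    se_log_hr = (math.log(upper) - math.log(lower)) / (2 * z)
    return implied_hr, se_log_hr


# (reported hazard ratio, lower limit, upper limit), as quoted in the text.
reported = {
    "OS, lymph node status": (1.2, 1.09, 1.36),
    "OS, gene profile": (6.3, 1.28, 31.07),
    "RFS, positive nodes": (1.13, 1.05, 1.25),
    "RFS, gene profile": (2.74, 1.13, 6.61),
}

for label, (hr, lower, upper) in reported.items():
    implied, se = log_scale_summary(lower, upper)
    print(f"{label}: reported {hr}, implied {implied:.2f}, SE(log HR) {se:.2f}")
```

A larger log-scale standard error means a less precise estimate, so the gene profile estimate for OS, with an upper limit roughly 24 times its lower limit, is far less precise than the lymph node estimates.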

In subgroup analyses, the signature was not a significant predictor in lymph node negative patients (many of whom received adjuvant chemotherapy) or in women older than 52 years of age.

The profile predicted both local and distant relapses in the general population of women with breast cancer. In the poor-prognosis group, most patients survived less than 2 years after relapse, regardless of the site of first relapse. In contrast, patients in the good prognosis group usually had low-risk relapses and survived longer than 2 years after relapse.

This study cannot be considered an independent validation of the MammaPrint assay, since only 60 out of 70 genes were considered, the genes were assessed by a different technology (RT-PCR rather than microarray), and the population was far more heavily treated with adjuvant chemotherapy than previously-tested populations. It therefore did not test a population in whom these results would have a clear implication for therapeutic decisions.

Studies Excluded Upon Complete Review

Eden et al., 2004.81 This paper was excluded because it did not provide new information on the assays investigated. The gene expression markers identified by van't Veer and colleagues21 were compared to both conventional markers and newly constructed indices to predict distant metastases. However, the analysis was conducted in the same van't Veer patient cohort, and therefore was not a new validation of the 70-gene signature.

Weigelt et al., 2005.82 This paper was excluded because it does not include prognostic information for the investigated assays, although it does provide some useful biologic insights. These authors showed that distant metastases display both the same molecular breast cancer subtype and the same 70-gene prognosis signature as their primary tumors. These results suggest that the capacity to metastasize is an inherent feature of most breast cancers, implying that poor-prognosis breast carcinomas, as classified by the intrinsic gene set or the 70-gene profile, represent distinct disease entities. These findings support the hypothesis that molecular subtypes might originate from different cell types within the breast, and therefore reflect different biological entities that are maintained throughout the multistep metastatic process. Indeed, the metastatic nature of poor-prognosis breast carcinomas, as identified by the 70-gene profile or by the luminal B, HER-2 positive, or basal-like molecular subtype, is an inherent feature of these cancers that remains stable over time and across distinct tumor outgrowth locations within the same individual.

Nuyten et al., 2006.83 This paper was excluded because the authors used a subset of the van de Vijver25 data set and looked at local recurrence.

This group searched for gene expression signatures that predict the risk of local recurrence after breast-conserving therapy (BCT) in a series of 161 early-stage breast cancer patients who were a subset of the original van de Vijver25 cohort. The 70-gene signature, originally designed to predict metastasis, failed to predict local recurrence after BCT.

Other gene signatures were also evaluated in this paper. The supervised wound-response signature22,84 was the only gene expression profile that could predict a local recurrence after BCT, while both the 70-gene and the primary hypoxia signatures85 failed to predict local recurrence.

Naderi et al., 2007.86 This study was excluded because it was not related to the assays investigated for this review. The authors developed a Cox-ranked 70-gene signature, which is a new signature unrelated to the MammaPrint test.

Sun et al., 2007.87 This paper was also excluded because it is not related to the assays investigated for this review. The authors developed a new predictor of recurrence (with only 3 genes from the 70-gene profile) based on the van't Veer data set and used the 70-gene signature for comparison; the new signature performed better than the 70-gene signature.

Appendixes cited in this report are provided electronically at: http://www.ahrq.gov/clinic/tp/brcgenetp.htm
