NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Bast RC Jr, Kufe DW, Pollock RE, et al., editors. Holland-Frei Cancer Medicine. 5th edition. Hamilton (ON): BC Decker; 2000.

The modern era of therapeutics in cancer is dominated by clinical data arising from cancer clinical trials. This reliance on clinical trial methodology to generate scientific data on the value of therapies has been adopted not only by the oncology community but also by those working with all chronic diseases. In the United States, efforts to find therapies for acquired immunodeficiency syndrome (AIDS) rely principally on clinical trials. Applications for drug approval to the U.S. Food and Drug Administration (FDA) can only be made on the basis of scientific evidence generated by clinical trials. The development and widespread acceptance of clinical trials is one of the major conceptual advances in experimental therapeutics made during the latter half of the 20th century.

A clinical trial is defined as an experiment on humans being carried out in order to evaluate one or more potentially beneficial therapies. The clinical investigator is assumed to have control of both the therapies being evaluated and the patient population to which these therapies are administered.

The basic ideas that are associated with clinical trials have been discussed for at least 150 years. An important intellectual landmark is the treatise *Essays in Clinical Instruction*, written by the French physician P.C.A. Louis in 1834.^{1} He advocated the use of the “numerical method” to study the benefits of therapy. His view was that only with “counting” is it possible to learn about the scientific basis of medicine; however, “counting is not easy. It is necessary to account for the different circumstances of age, sex, temperament, physical condition, natural history of the disease, and errors in giving therapy.” Louis wrote: “The only reproach which can be made to the Numerical Method is that it offers real difficulties in its execution . . . this method requires much more labor and time than the most distinguished members of our profession can dedicate to it.” Louis’ comments are as appropriate today as when he wrote them.

## Types of Clinical Trials

Ordinarily clinical trials are characterized by three phases and are referred to as phase I, II, or III trials. The characterization of these trials has arisen from drug trials, but the language has been used for radiation therapy and surgical trials as well.

A phase I trial refers to a new treatment (usually a drug) that is to be tried on humans for the first time. The aim is to find an acceptable dose and schedule with respect to toxicity. Use of the term *acceptable* is particularly important. Therapies for life-threatening illnesses generally will allow for greater risks of serious side effects than those targeted at less serious illnesses. In cancer, patients who are refractory to therapies that are believed to be beneficial usually are the patients who are entered in phase I trials. As a result, the side effects evaluated in this population with very advanced disease may not reflect the side effects in the patients who ultimately receive the therapy for an evaluation of its benefit.

Phase II cancer trials are initiated after the completion of phase I trials. The goal is to determine if the therapy has any beneficial effect. The patient population in phase II trials sometimes is composed of newly diagnosed patients with advanced cancer. Entering such patients may be justified in non–small cell lung cancer trials, but it may not be appropriate for cancer sites for which therapies with proven benefit do exist. As a result, most patients entering phase II trials are those who no longer benefit from therapies that are believed to be beneficial. The dilemma of phase II trials is that the trial may not be a satisfactory test of an experimental therapy if the patient population used has failed or been found to be unresponsive to therapies with proven benefit. Another criticism is that some trials are designed to investigate a single dose and schedule, while others test combinations of drugs. The particular dose–schedule combination of a drug may be far from optimal. Scientific considerations dictate that tests of drugs in phase II trials should include a spectrum of doses and schedules that still have acceptable toxicity. In some circumstances it may be appropriate to combine phase I and II trials into a single phase I-II trial.

Phase III studies always are comparative trials; one or more experimental therapies are compared with the best standard therapy or competitive therapies. They tend to have many more patients than phase II trials, and they often require patients from many cooperating hospitals.

### Randomized Versus Nonrandomized Clinical Trials

The fundamental scientific principle underlying the comparison of patient groups receiving different therapies is that these groups must be alike in all important aspects and differ only in the treatment that each receives. Otherwise, differences between groups may not be caused by the treatments under study but may be attributed to the particular characteristics of the group. In clinical experimentation, patients may vary widely in their ability to respond to therapy. Furthermore, therapies cannot be reproduced exactly from occasion to occasion, in contrast to the physical sciences, in which the treatments applied to experimental units are exactly reproducible and the experimental units homogeneous. Variability in clinical experimentation arises from the heterogeneity of the patient populations and the lack of exact reproducibility of the treatment, whereas in the physical sciences, variability often is a secondary factor and arises from slight changes in the ambient environment and the variability of the measuring instrument.

The use of randomization refers to the process used to generate comparable patient groups. The term *randomization* refers to allocating the treatments to patients using a chance mechanism; it is equivalent to tossing a coin to assign therapies when only two treatments are under investigation. Classic randomized clinical trials require that neither the physician nor patient knows in advance the treatment to be given before entering a trial. Randomization makes the treatment groups “alike on the average” with respect to all factors that are likely to affect the principal end points of a trial. Randomization ensures that each patient has the same opportunity of being assigned to any of the therapies in the trial. In actual practice, a randomization schedule is generated by a computer or from a table of random numbers.^{2}
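In practice, the chance mechanism is usually a computer-generated schedule rather than a literal coin toss. The sketch below illustrates one common scheme, permuted blocks, which keeps the two arms balanced throughout accrual; the block scheme and parameters are illustrative assumptions, since the text does not prescribe a particular algorithm.

```python
import random

def permuted_block_schedule(n_patients, treatments=("A", "B"), block_size=4, seed=None):
    """Generate a randomization schedule using permuted blocks.

    Within each block, every treatment appears equally often, so the
    arms stay balanced as patients accrue.  (Illustrative sketch only;
    other schemes, such as simple randomization, are equally valid.)
    """
    rng = random.Random(seed)
    per_arm = block_size // len(treatments)
    schedule = []
    while len(schedule) < n_patients:
        block = list(treatments) * per_arm
        rng.shuffle(block)  # the chance mechanism: a shuffled block
        schedule.extend(block)
    return schedule[:n_patients]

# Example: assign 12 patients to two arms in balanced blocks of 4
print(permuted_block_schedule(12, seed=1))
```

Note that with simple (unblocked) randomization the arms are balanced only "on the average"; blocking additionally guarantees near-equal arm sizes at any interim point.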

Randomized clinical trials (RCTs) are regarded by many investigators as the “ideal” scientific standard for comparing therapies. Randomization creates balanced patient subgroups with the same average baseline characteristics. This “balance” not only applies to known but to unknown prognostic factors as well, and randomization eliminates both physician and patient selection biases. The former refers to the physician creating a bias by only putting a special class of patients in one of the treatment arms (e.g., assigning patients in the poorest physical condition to the least toxic treatment). The patient selection bias refers to a comparable bias but is induced by the patient.

Another implicit advantage of an RCT is that the experimental therapy is compared with a concurrent control group. Hence, every group in the trial will have the same criteria for diagnosing and staging of disease, patient management, supportive care, and the same data quality and methods of evaluation.

Despite widespread acceptance of the scientific merits of randomization, many physicians are reluctant to participate in RCTs.^{3} The principal reason for nonparticipation is that physicians feel that the patient–physician relationship is compromised if the physician must explain to the patient that the treatment for their cancer would be chosen by a “coin toss” or “computer.” The United States Code of Federal Regulations governing human experimentation has been interpreted to imply that a physician must tell the patient about the use of randomization. Thus, as a result of the nonparticipation by subpopulations with disease, results of an RCT may not necessarily apply to the entire patient population. Caution must be exercised when extrapolating the inference from a clinical trial to the entire population with disease.

An interesting example of biases that arise in physician and patient selection is illustrated in the trial reported by Antman and colleagues.^{4} An RCT was carried out jointly by the Dana Farber Cancer Institute and the Massachusetts General Hospital for the treatment of sarcoma (intermediate/high grade). The trial compared Adriamycin against observation (i.e., no active treatment). Over a period of time, there were 84 eligible patients seen between both institutions, of whom only 36 were entered in the RCT. Among the 48 patients who did not go into the trial, patient or physician refusal each accounted for 50%. Of these 48 patients, 29 did not receive any active treatment and for all practical purposes received the same treatment as the control treatment of the RCT. Thus, the control arm and a portion of the patient population (nonrandomized) can be compared. The 20-month disease-free survival for the control patients in the RCT was 64% compared with 16% for the nonrandomized patients receiving no treatment; the 30-month survival was 68% for the RCT controls, compared with 29% for the nonrandomized controls. Even after adjustment for differences in prognostic factors, the differences still persisted. This example illustrates the need for concurrent control groups.

Because of the unpopularity of RCTs with many physicians, many nonrandomized trials attempt to draw conclusions about the value of an experimental treatment. Generally, data on an experimental treatment are generated prospectively and compared with a historic “control” group of patients. Of course, if the value of the treatment is overwhelmingly beneficial, no comparison may be necessary. For example, if patients with pancreatic cancer are living long periods of time (e.g., 5 years) without evidence of disease, no formal comparison is necessary because we know that the prognosis of this disease is uniformly dismal. Unfortunately, the available therapies for cancer are not likely to result in dramatic benefits. Consequently, the benefit of an active therapy is likely to be of moderate magnitude, requiring care in its evaluation. Moderate benefits are, however, of real clinical importance (e.g., increasing the cure rate of breast cancer by 10 to 20% would result in saving thousands of lives).

The use of historical controls for evaluating the benefits of an experimental therapy is fraught with many problems. There are ample opportunities for serious biases to distort the conclusions, and even when known biases are considered, there may be other unknown biases that can distort the conclusions of a clinical trial. Nevertheless, nonrandomized trials may be important as part of the overall scientific process in evaluating an experimental treatment. They have a role in pilot and exploratory studies as well as in phase II trials. Consider a phase II study to identify an active drug or drug combination. Good scientific strategy would dictate that the study be carried out in the most desirable conditions possible to identify an active therapy (e.g., selection of patient population). Comparison of the magnitude of the effect with other available therapies, however, is best done with an RCT.

The reporting of a nonrandomized trial requires special care, especially when claims are made about efficacy. Reporting should address the potential biases that could affect the conclusions, such as the six discussed below that arise in all nonrandomized trials that employ a comparison with a historical control group:

1. Physician selection bias: Selection of patients for the experimental treatment may be biased; this is not true in RCTs.

2. Patient selection bias: Patients self-select themselves for the experimental treatment. There is no self-selection in the historical control group, and this leads to potential biases in comparing outcome with a historical control group. RCTs have patient self-selection only for those who enter the trial. However, because the assignment to treatment is randomized, the self-selection does not bias comparisons between therapies within the same trial.

3. Diagnosis and staging: Methods of diagnosis and staging must be the same for the experimental therapy and the historical groups. If methods have improved during recent years, this may not be reflected in the historical control group. For example, a significant number of newly diagnosed cases of breast cancer are found by mammography. This precludes using a historical group for comparisons of adjuvant treatment unless one accounts for the method of diagnosis. This problem does not exist with a concurrent control group.

4. Patient management and supportive care: This must be the same for both groups.

5. Evaluation methods: This factor reflects on the quality of data. If the historical control group has significant missing or unknown data for important variables, then unbiased comparisons may be impossible.

6. Prognostic factors: The key prognostic factors must be the same for both groups. Statistical adjustments often can be used to make the groups comparable when the prognostic factors are known. It is not possible to adjust when data are missing or if there are unknown prognostic factors.

Some investigators have suggested that complex staging is a strong argument against the use of historical controls. For example, at one time, metastatic colon cancer was considered to be incurable. Early stages had variable prognoses with a high death rate. When adjuvant therapy became available through clinical trials, staging was pursued much more vigorously. As a result, more lymph nodes are now examined, multiple step sections made, and liver biopsies carried out. Consequently, a higher proportion of more recent patients may be in a better prognostic state than controls, even though they may both be classified in the same stage. Because they have better prognoses, their survival will be greater than that of the historical controls.

Use of consecutive patients for generating data on the new treatment and use of “matched” controls are two common methods for investigating new treatments without resorting to randomization. Both have drawbacks. Entering consecutive patients in a study eliminates the opportunity for physician bias that is associated with the selection of patients. If patient consent is necessary, however, a patient selection bias will still be present. If the mix of patients has not changed over time, the prognostic variables associated with each group may be comparable; however, issues such as patient bias, diagnosis and staging, patient management and supportive care, and different methods of evaluation still must be considered. The consecutive patient experimental design is targeted mainly at eliminating bias arising from physician selection of patients.

Employment of matched controls is another method often used to compare a new treatment with a historical control. This involves forming a group of one or more control patients for each patient receiving the new treatment. Patients are selected from the historical group so that they are comparable with the new treatment group on a patient-by-patient basis for known prognostic variables. This method is limited in that only a few key variables can be matched on any practical basis. For example, in matching, one would perhaps aim to have patients who are comparable with respect to anatomic staging, pathology, performance status, and characteristics beyond the disease itself, such as demographic factors and prior treatment history.

Statistical modeling is a generalization of matching that enables one to adjust for several factors simultaneously. However, this method also has limitations because the statistical adjustment for bias introduces additional variability in the analysis from the “uncertainty” of the adjustment. Such adjustments can be made only for known prognostic factors. Patient and physician self-selection cannot be factored into the adjustments; neither can questions about different criteria for diagnosis and staging, different methods of patient support and management, and different methods of evaluation.

In summary, the nonrandomized methods for the evaluation of new therapies are useful for exploratory and pilot studies. They are not to be relied on for generating credible conclusions, unless the issues of potential biases are carefully discussed or the therapy outcome is so dramatic that it cannot be credited to the aggregate effect of the potential biases.

Another type of nonrandomized study that will likely gain more popularity in the future is the study attempting to correlate disease markers with survival. The simplest example is to relate tumor response for measurable disease with eventual survival. The logic is that patients enter a trial with a positive disease marker (e.g., tumor), receive treatment, and the tumor becomes smaller or disappears. Survival comparisons are made between patient groups with a positive marker versus a negative marker (i.e., tumor has significant reduction in volume or disappears). A straightforward comparison as described is invalid,^{5} the reason being that the longer a patient lives, the greater is the opportunity to observe a change in disease marker status. This means that even if a change in marker status is not related to increased survival, a direct comparison of survival data on negative versus positive disease markers will show a positive relationship. This relationship is spurious; however, statistical techniques of analysis have been developed that can overcome this problem.^{6}
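The spurious relationship described above can be demonstrated with a small simulation. In the sketch below, response status is generated completely independently of survival, yet a naive comparison still favors "responders," simply because a patient must live long enough for the response to be observed. The distributions and parameters are illustrative assumptions, not taken from the text.

```python
import random

# Illustrative simulation: survival and response time are drawn
# independently, so a change in marker status is unrelated to survival.
rng = random.Random(42)
n = 20_000
naive_resp, naive_non = [], []
for _ in range(n):
    survival = rng.expovariate(1 / 12.0)       # months; same law for all patients
    response_time = rng.expovariate(1 / 6.0)   # unrelated to survival
    if response_time < survival:               # response observed only if still alive
        naive_resp.append(survival)
    else:
        naive_non.append(survival)

mean_resp = sum(naive_resp) / len(naive_resp)
mean_non = sum(naive_non) / len(naive_non)
# The naive comparison shows "responders" living longer, even though
# response carries no survival information here.
print(f"responders: {mean_resp:.1f} mo, nonresponders: {mean_non:.1f} mo")
```

This is exactly the bias the corrected statistical techniques (e.g., treating response as a time-dependent covariate, or a landmark analysis) are designed to remove.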

### Multi-center Versus Single-center Trials

The available experimental therapies for cancer are likely to have only moderate benefit. Nevertheless, a moderate benefit can be important. For example, the number of cases of breast cancer diagnosed in a year is estimated to be more than 190,000 in the United States. Approximately 50% of these new cases have positive axillary nodal involvement. If the cure rate were increased by 10%, then at least 9,000 more women would be cured every year.

The end points for nearly all definitive phase III studies are survival or the disease-free period. Furthermore, the clinical course of the disease is complicated and highly variable. As a consequence of both the need for making an inference on survival and the variability of data, it is necessary to have a relatively long follow-up time with a large number of patients. Because few hospitals have enough patients to meet this need, it is necessary to carry out these phase III trials using many cooperating hospitals. Consequently, nearly all phase III trials are multi-center trials in which patients are pooled into a common study. Carrying out a multi-center trial results in increased administrative difficulties and quality assurance problems; it is one of the most difficult and complicated experiments in science.

Multi-center studies are used not only in phase III clinical trials but in phase II trials as well. The use of multiple centers enables patient-accrual goals to be achieved much more rapidly. Table 24.1 is a summary of current practice regarding multi-center studies and their use in phase I, II, and III trials.

## Planning Clinical Trials

### Overall Considerations

The overall planning for a clinical trial depends critically on whether the trial is an “exploratory” or a “management” trial. A management trial seeks to determine whether a therapy is beneficial under conditions as close to clinical circumstances as possible. (Sometimes the term *demonstration trial* is used to describe a management trial.) A management trial should be carried out by a large number of hospitals; when the trial is collecting data from many representative hospitals, it will be possible to determine if a therapy is beneficial. An exploratory trial seeks to determine whether a therapy is efficacious under ideal or restricted circumstances, which may not necessarily correspond to practical clinical situations.

Objectives in clinical trials may vary greatly. Possible objectives are to (1) find the best overall treatment, (2) find the best treatment by prognostic subgroup, (3) determine the relationship between the natural history of the disease and the treatment, (4) identify an active treatment, and (5) evaluate the effects of augmenting a beneficial therapy.

The choice of the eligible patient population in a trial is crucial to reaching accrual goals in a reasonable time. It is necessary to decide whether to have narrow eligibility requirements so that patients are relatively homogeneous with regard to baseline prognostic variables or to have broad eligibility requirements that will accelerate accrual. The pros and cons about the choice of the population depend on whether it is a management or exploratory trial. If the trial is exploratory, then having a relatively homogeneous patient population will result in less variability in the end points of the trial and make the trial more sensitive at showing real differences among treatments. Alternatively, if the trial is a management one, there is some advantage in having broad eligibility criteria because one will be able to explore how therapy benefit varies among subgroups of patients; using post-hoc stratification and statistical modeling in the analysis will reduce the statistical fluctuations generated from having heterogeneous patient groups. An operational problem with defining narrow eligibility criteria is that accrual may take a long time.

Another basic decision in the choice of population is to determine if the patient population should be newly diagnosed patients or should include those who have been shown to be refractory to beneficial therapies. A newly diagnosed patient represents the most promising “patient material” for study. On the other hand, it is necessary to consider the ethics of withholding therapies of proven benefit in favor of an experimental treatment of unknown benefit. If one chooses to use a patient population that is refractory to beneficial therapies, then it may not be possible to suitably evaluate the experimental therapy. This decision on choice of patient population must balance the ethics of denying a patient a beneficial (but still noncurative) therapy with the opportunity of a patient receiving an experimental therapy that has the potential of significantly better benefit.

Another consideration in planning a study is to determine the treatment plan if a patient has failed or does not appear to benefit from the treatment. Does one continue the treatment unchanged, or should a new therapy plan be prescribed? If the end point is survival, then introducing a new therapy may complicate interpretation of the survival data. If the protocol does not specify what to do after failure, however, the attending physicians may introduce a large number of new therapies, which will complicate the interpretation of the survival data even further.

It is generally accepted that phase III trials should be randomized. However, the question remains, should phase II trials be randomized? Because the object of a phase II trial is to determine if there is any activity against the disease rather than to make comparisons with other therapies, most are nonrandomized. One reason for evaluating several therapies in the same phase II trial, however, is to evaluate them simultaneously with the same clinical trial process and for the same patient population. Another strategy is to include a treatment with proven benefit but which has not yet been used on the phase II patient population as one of the therapies. If the proven therapy cannot demonstrate benefit, then the phase II trial may not have a suitable patient population to permit the evaluation of experimental therapies. Thus, if several therapies are to be evaluated in the same phase II trial, randomization could be used for the treatment assignment.

What end points should be chosen to evaluate the therapies? There is widespread agreement that adjuvant trials should use survival as an end point; however, patients with recurrent disease will receive additional or alternative therapies. This will certainly be true for placebo or no-treatment control patients. Those on therapy also will receive alternative therapies on recurrence. Thus, survival data may not be clear-cut. Other end points could be the disease-free period and time to “progression” (progression must be carefully defined). Phase II trials often use tumor response or other disease markers. In practice, there will be a number of end points in any trial.

During the past few years, use of surrogate markers as the major end point of phase III studies has increased. Some disease markers have a high correlation with survival. The idea is that conclusions about treatment benefit can be made in a shorter time frame. For example, many phase III trials in AIDS use the virus load as the major end point. In cancer, complete responses may have a high correlation with survival. One problem with using a gross measurement like complete response in cancer versus a survival end point is that in many cancers, the anticipated frequency of complete responses is too low to provide meaningful comparisons.

Quality-of-life issues are being widely recognized as important end points in cancer (see Chapter 20). A patient cured of leukemia by bone marrow transplantation but who has chronic graft-versus-host disease (GVHD) has a severely compromised quality of life. The paper by Goldhirsch and colleagues^{7} discusses new methods for objectively evaluating quality of life and represents an important advance. They call their method Q-TWiST (Quality-adjusted Time Without Symptoms and Toxicity of Treatment).

### Statistical Tests and Probabilities of Reaching Incorrect Conclusions

Almost all clinical trials are analyzed using statistical procedures that are based on the frequency theory of probability. This simply means that if, for example, an outcome in an experiment has a 10% probability of happening, then in an infinite number of repetitions of that experiment, one would observe the outcome 10% of the time. It will be assumed in the following discussion that probability statements refer to the relative frequency notion of probability.
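The frequency interpretation can be made concrete by simulation: repeat the experiment many times and the observed relative frequency approaches the stated probability. A minimal sketch, using an assumed 10% event probability:

```python
import random

# Simulate many repetitions of an experiment whose outcome has a
# 10% probability; the observed relative frequency should settle
# near 0.10 as the number of repetitions grows.
rng = random.Random(0)
trials = 100_000
hits = sum(rng.random() < 0.10 for _ in range(trials))
print(hits / trials)  # close to 0.10
```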

Consider a trial in which two treatments are being evaluated. After the trial is completed and an analysis made, the two main conclusions are that (1) the treatments are equivalent, or (2) the treatments differ. These conclusions are referred to as the *null* and *alternate* hypotheses, respectively. The statistical procedures (called tests) chosen for analysis enable the specification of the error probabilities:

- *α*: the probability of concluding the treatments are different when they are actually the same;
- *β*: the probability of concluding the treatments are the same when they are actually different.

The two probabilities often are referred to as type I and II errors, respectively. These error probabilities are the false-positive and false-negative rates, respectively (i.e., calling a result positive when it should be negative, and calling a result negative when it should be positive).

The practical application of these tests is that using observed data from the trial, one calculates a probability or significance level. This is the probability of observing the same or more extreme differences among therapies than those observed if the two treatments actually are equivalent. The reasoning behind the probability calculations is that if the investigator is willing to accept the differences observed in the trial as being scientific evidence in favor of a difference between the two treatments, then the scientific evidence would be stronger if the differences were larger than those observed. This probability is called a tail area, significance level, or simply the *p* value. If the *p* value is less than .05, then the practice is to accept the hypothesis that the treatments differ. Thus, if the treatments truly are equivalent, the probability would be less than 5% that the observed differences could have arisen from statistical fluctuations in the data.
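As a concrete sketch of such a calculation, the function below computes a two-sided *p* value for comparing two response proportions using the standard large-sample normal approximation. The chapter does not name a specific test, and the trial figures in the example are hypothetical.

```python
from math import sqrt, erfc

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p value for comparing two response proportions
    (pooled normal approximation; a common large-sample test)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return erfc(abs(z) / sqrt(2))  # P(|Z| >= |z|) under the null

# Hypothetical trial: 30/100 responses on arm A versus 45/100 on arm B
print(round(two_proportion_p_value(30, 100, 45, 100), 3))
```

Here the *p* value falls below .05, so by the convention described above one would accept the hypothesis that the treatments differ.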

The power or sensitivity of a test is equal to 1-β and denotes the probability that the clinical trial will be able to detect differences among the treatments, when, in fact, the differences are real. The power is fundamental in planning all clinical trials. It depends on the significance level chosen, the number of patients in the trial, and the magnitude of the difference between the treatments. Large numbers of patients and large differences between treatments increase the power. For studies in which survival is an end point, longer follow-up time also increases the power. When a trial concludes with “no difference between the treatments” or the result is not “statistically significant,” it may be because (1) there is no difference between the treatments or (2) the power of the trial was so low that the trial could not detect a difference.

We illustrate here the concept of power and its relation to sample size in a phase I trial. Suppose that a phase I trial is being conducted to determine the probability of life-threatening toxicity from an experimental drug. Suppose that if no life-threatening toxicity is observed, the drug will be declared to be “free of life-threatening toxicity.” Table 24.2 shows the relationship between true toxicity rate, number of patients, and the power (i.e., probability of observing at least one life-threatening toxic event). For example, if only 5 patients are in the trial and the true toxicity rate is 10%, there is only a 41% probability of observing one or more toxic reactions. As the sample size increases, this probability goes to unity. Similarly, if the true toxic rate is high, it is easy to show that the power is increased (e.g., if the toxic rate is 40% and only 5 patients are in the trial, the power is 0.92). This same table also can be used to determine the probability of observing one or more responses, or for any problem in which the occurrence of events is being observed.
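The entries quoted from Table 24.2 follow from the elementary relation power = 1 − (1 − *p*)^*n*, the probability of at least one event among *n* independent patients when the true event rate is *p*:

```python
# Probability of observing at least one life-threatening toxic event
# when the true toxicity rate is p and n patients are treated.
def toxicity_power(p, n):
    return 1 - (1 - p) ** n

print(round(toxicity_power(0.10, 5), 2))  # → 0.41
print(round(toxicity_power(0.40, 5), 2))  # → 0.92
```

Both values match the examples quoted in the text.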

As another example of the use of power in a clinical trial, consider a trial in which the proportion of recurrences is the principal end point of interest. Suppose that one desires to calculate the sample size necessary to detect a difference between two groups when one has a recurrence rate of 60% and another may have a possible recurrence rate within the range 30 to 55%. Table 24.3 shows the number of patients per treatment for different levels of power comparing the two recurrence rates. For example, one needs more than 2,000 patients for each treatment group to have a power of 0.9 to detect a difference between 55% versus 60%. However, the sample size is reduced to 130 patients per group to test 40% versus 60% for the same power (0.9). It is clear that increasing the number of patients results in an increase in power. Also, as the differences between groups become large, a high level of power can be attained for the same sample size.
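Sample sizes of the magnitude shown in Table 24.3 can be reproduced with the standard normal-approximation formula for comparing two proportions. This is a sketch under the assumption of a two-sided 5% significance level; Table 24.3 may use a slightly different method (e.g., a continuity correction), so the figures can differ somewhat.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.90):
    """Approximate patients per treatment arm to distinguish two
    recurrence rates p1 and p2 (normal-approximation formula)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z * z * var / (p1 - p2) ** 2)

print(n_per_arm(0.60, 0.55))  # roughly 2,000 or more per arm
print(n_per_arm(0.60, 0.40))  # on the order of 130 per arm
```

The calculation shows the point made in the text: shrinking the difference to be detected from 20 percentage points to 5 inflates the required sample size by more than an order of magnitude.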

In practice, one does not know the recurrence rates to compare. Therefore, a range of comparisons is chosen on the basis of past data that are of clinical interest, and a sample size is then chosen that will give high power. Generally, a power less than 0.8 for clinically important differences is unacceptable in planning a trial.

### Data Collection and Forms Design

There is a great deal of misunderstanding over the amount of data required to evaluate a clinical trial properly. Ordinarily, clinical trials performed by the pharmaceutical industry with the intent of submitting the trial to the FDA for approval collect very large amounts of data. A great deal of these data may be unnecessary, and it is not uncommon to have data forms exceeding 100 pages for each patient. One reason motivating the large collection of data items is the “antagonistic” relationship between the FDA and industry. Essentially, industry-sponsored data collection plans have the theme “to leave no stone unturned.” Another motivating factor is that many trials are planned to show that two treatments are equivalent (“me-too trials”). Usually, a new drug is compared with a competitor’s drug that already has received FDA approval. Large amounts of data are collected in these trials to answer unanticipated questions that might be raised by the FDA.

Disregarding the special problems between the FDA and industry, data chosen for collection in a clinical trial must supply information to determine (1) eligibility of the patient, (2) whether the protocol was followed, and (3) objective measures of the study end points. In general, the more the data collected, the greater is the opportunity for data degradation. A nonrandomized study attempting to make treatment comparisons ordinarily would require a greater amount of data to check for biases than a randomized study would. A randomized study, by definition, has comparable patient groups that differ only in the treatment assigned to those groups.

In general, a cancer clinical trial is likely to have six different types of forms for collecting data as well as special forms. The data forms and types of information collected are outlined in Table 24.4. Data forms should be designed such that they are self-coding. As much as possible, boxes should only need to be checked to supply information. Care should be taken to prevent the person filling in the form from making interpretive decisions (e.g., calling a toxicity life-threatening). Space should be provided to allow the physician to note or comment on special features of the patient that were not collected by the data form or that require further comment.

### Experimental Design

This section briefly discusses experimental designs for randomized studies. Generally, phase III cancer clinical trials have two to four treatment groups. The reason for keeping the number of treatments small is that the patient consent process requires the physician to discuss all treatment options with the patient. Having more than four treatment programs likely will confuse the patient. The logistics of having a large number of treatments in a multi-center trial also may be overwhelming.

Clinical trials ordinarily are designed using stratified randomization.^{2} After a patient is found to be eligible for a trial, he or she is classified into one of two or more subgroups, or strata, defined by the data available at the time the patient is registered into the trial. The strata should be defined so that patients in the same stratum have a more similar prognosis than patients from different strata. For example, a study for advanced breast cancer may have eight strata defined by (1) performance status (0, 1 versus 2, 3), (2) number of recurrent sites (two or less versus more than two), and (3) disease-free interval (1 year or less versus more than 1 year). All possible combinations of these three factors result in eight distinct strata. In each stratum, the treatments are assigned at random. Stratification tends to balance the treatment assignments so that treatment groups are equally balanced among the different strata. This is especially important in the early stages of a study, in which only a small number of patients have been registered. If unexpected events are observed (e.g., unusual toxicities) within a treatment group, one would be able to analyze whether the events arose from an aggregate of prognostic factors in one treatment group that was not present in other groups. As the number of patients becomes large, the patient groups tend to be comparable on the average, and the need for stratification diminishes or even disappears. Relatively small clinical trials should always be stratified, however.
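One common mechanism for stratified randomization is to shuffle small "permuted blocks" of treatment labels separately within each stratum, so that the arms stay nearly balanced stratum by stratum. A sketch (the block size and all names are our assumptions, not from the text):

```python
import random

def stratified_assigner(treatments, block_size=4, seed=0):
    """Assign treatments within each stratum from shuffled blocks so
    that the arms stay nearly balanced stratum by stratum."""
    rng = random.Random(seed)
    remaining = {}  # stratum -> assignments left in its current block

    def assign(stratum):
        # Refill this stratum's block when it runs out.
        if not remaining.get(stratum):
            block = treatments * (block_size // len(treatments))
            rng.shuffle(block)
            remaining[stratum] = block
        return remaining[stratum].pop()

    return assign

# Two of the eight advanced-breast-cancer strata described above:
assign = stratified_assigner(["A", "B"])
for stratum in ["PS 0-1 / <=2 sites / <=1 yr",
                "PS 2-3 / >2 sites / >1 yr"] * 10:
    arm = assign(stratum)
```

Within any stratum, the treatment counts can never differ by more than a partial block, which is the balancing property stratification is meant to provide.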

There are obvious practical limits to the number of strata that can be used. If there is a large number of strata relative to the number of patients, then some of the strata will not have any patients. Having empty strata does not cause a loss of efficiency. There will be a loss of efficiency, however, if the number of patients in a single stratum is less than the number of treatments. For example, if there are two treatments, then all strata containing only one patient will not contribute to the analysis unless additional modeling assumptions are made. In practice, the maximum number of patient/disease variables in a trial is approximately 12 to 15 for trials involving several hundred patients. A rough “rule of thumb” is that the

Table 24.5 shows a variety of experimental designs that are useful in phase III trials. It is assumed that the only patients who are randomized are those who have given consent. As the designs become more complicated, the statistical methods for analysis also become more complex.

Another design that should be used more often is the factorial experiment. Suppose that the class of therapies can be characterized by two factors, which will be designated as A and B (e.g., A might refer to drug A and B to drug B). Suppose that drugs A and B are given at two doses, A_{0} and A_{1} for drug A and B_{0} and B_{1} for drug B. Then, there will be four drug combinations, given by A_{0}B_{0}, A_{0}B_{1}, A_{1}B_{0}, and A_{1}B_{1}. If the clinical trial is carried out with these four groups, then one can determine (1) if a low dose is different from a higher dose of A or B, and (2) if there is an interaction (i.e., synergy) between A and B. In general, the factors may refer to different modalities of treatment or to dose or schedule. The advantage of this factorial design is that it enables two questions to be answered for the same patient. Factorial experiments need not be restricted to two factors, however, and the number of conditions for each factor need not be two. Even so, because of the need to keep the number of treatment groups small, the factorial design with two factors each under two different conditions appears to be the most practical in clinical trials.
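With hypothetical response rates for the four arms, the two questions can be read off as a main effect and an interaction (all numbers below are invented for illustration only):

```python
# Invented response proportions for the four arms of a 2x2 factorial.
means = {("A0", "B0"): 0.20, ("A0", "B1"): 0.30,
         ("A1", "B0"): 0.35, ("A1", "B1"): 0.60}

# Main effect of raising drug A's dose, averaged over the doses of B.
effect_a = ((means[("A1", "B0")] - means[("A0", "B0")]) +
            (means[("A1", "B1")] - means[("A0", "B1")])) / 2

# Interaction: is A's effect larger when B is at its higher dose
# (i.e., synergy between the two drugs)?
interaction = ((means[("A1", "B1")] - means[("A0", "B1")]) -
               (means[("A1", "B0")] - means[("A0", "B0")]))
```

Here raising A's dose improves response by 0.15 at the low dose of B but by 0.30 at the high dose, so the interaction term is positive, suggesting synergy.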

In some instances, it may be possible to investigate simultaneously three factors, each at two conditions, by having only four treatments. Consider the case where an investigation is planned to explore three drugs (denoted by A, B, and C) each at two different doses. All possible combinations result in eight distinct treatments; however, by choosing a special set of four treatments, it is possible to investigate the contribution to outcome for each drug by changing the dose. There are two sets of four treatments that can be chosen for this purpose. If the dose levels are designated by (A_{0}, A_{1}), (B_{0}, B_{1}), and (C_{0}, C_{1}), then any treatment combination will be made up of three letters where the subscript 0 or 1 denotes the dose. The two sets of four treatment combinations, each of which is suitable for a trial, are

Set 1: A_{1}B_{0}C_{0}, A_{0}B_{1}C_{0}, A_{0}B_{0}C_{1}, A_{1}B_{1}C_{1}

Set 2: A_{0}B_{0}C_{0}, A_{1}B_{1}C_{0}, A_{1}B_{0}C_{1}, A_{0}B_{1}C_{1}

Note that the two sets together comprise the eight possible treatment combinations. This experimental design is called a Latin square or, equivalently, a “1/2 replicate of a 2^{3} factorial design.”

Some care must be exercised in making these comparisons. If it is necessary to change the doses of A and B when C is added, then the comparison of A_{1}B_{1}C_{1 } with A_{1}B_{1}C_{0} may not properly reflect the change in outcome with the addition of C. In other words, the two drugs A and B have different doses when combined with C compared with the two-drug combination A and B without C.

### Role of Compliance

One of the key problems in interpreting the results of a clinical trial is the effect of compliance on the conclusions. If the conclusions of a trial result in no difference between the therapies under investigation, it may have been caused by a lack of compliance. The effect of noncompliance is to lower the sensitivity of a trial at finding differences as well as to create possible biases.

As an example of the potential for bias, consider a randomized clinical trial comparing two treatment programs for head and neck cancer. One therapy is radiation followed by surgery; the other is surgery followed by radiation. Patients in whom the disease has disappeared after radiation may refuse surgery, whereas patients doing poorly after surgery may refuse radiation. For the first treatment program, the better-prognosis patients do not comply, while in the other, the poorer-prognosis patients do not comply.

To understand more fully the role of bias and loss of efficiency, we use a simple mathematical model. Consider a trial comparing two treatments A and B. Let the proportion of noncompliers be *P*_{a} and *P*_{b} for the two treatments, respectively. Also let the outcome for each treatment be m_{a} and m_{b} for compliers and m_{a}' and m_{b}' for noncompliers. (The outcome could be the proportion of responders, the median survival, or whatever is appropriate.) Then, the average outcome for each treatment group will consist of a mixture of outcomes of compliers and noncompliers. Define *M*_{a} and *M*_{b} as the aggregate outcome for each group. We can write *M*_{a} and *M*_{b} as

*M*_{a} = (1 – *P*_{a})*m*_{a} + *P*_{a}*m*_{a}' and *M*_{b} = (1 – *P*_{b})*m*_{b} + *P*_{b}*m*_{b}'

Note that the effect of noncompliance is to dilute the outcomes for each group. The comparison (*M*_{a}- *M*_{b}), which compares treatment A with treatment B, is

*M*_{a} – *M*_{b} = (1 – *P*_{a})*m*_{a} – (1 – *P*_{b})*m*_{b} + *P*_{a}*m*_{a}' – *P*_{b}*m*_{b}'    (1)

Now consider a case in which there is no difference between treatments for compliers (*m*_{a}= m_{b } = *m*) and noncompliers (m_{a}'= m_{b}' =* m*’). Then, the value of (*M*_{a} – *M*_{b}) is

*M*_{a} – *M*_{b} = (*P*_{b} – *P*_{a})(*m* – *m*')

which will result in a bias if the noncompliance rates *P*_{a}, *P*_{b} are different and if *m* ≠ *m*'. As a result, the analysis could show a difference between treatments when, in truth, there is none.

As another example, suppose that treatment A is an observation group having complete compliance (*P*_{a} = 0). Suppose that treatment B is an experimental therapy and noncompliance is simply not taking the medication. Then, the noncompliers on the intervention arm are likely to have the same outcome as the compliers on the observation treatment arm (*m*_{b}' = *m*_{a}). Hence, substituting *P*_{a} = 0 and *m*_{b}' = *m*_{a} in equation 1 gives

*M*_{a} – *M*_{b} = (1 – *P*_{b})(*m*_{a} – *m*_{b})

Thus, the effect of noncompliance is to make the treatment difference smaller. (The multiplier [1 – *P*_{b}] is always less than 1 unless *P*_{b} = 0, in which case it is unity.) The net effect of noncompliance for patients assigned to B is to lower the statistical efficiency. The statistical efficiency is (1 – *P*_{b})^{2}. Table 24.6 is instructive about statistical efficiencies for various values of *P*_{b}. The statistical efficiency means that if, for example, the proportion not complying is 10%, it is equivalent to using effectively only 81% of the accrual. In other words, 100 patients having a 10% noncompliance rate is equivalent to having 81 patients who completely comply.

Compliance issues are fundamentally important in cancer prevention trials. Consider a clinical trial in which A is an observation arm and B an intervention aimed at preventing cancer. Contemplated interventions may be to reduce smoking, reduce intake of dietary fats, or a similar activity. However, it is quite possible that individuals in the control group will not comply because they eliminated smoking or changed their diet. In this case, *m*_{a}' = *m*_{b} and *m*_{b}' = *m*_{a}. (Those in the control group who adopt the intervention are noncompliers but will have the same expected outcome as compliers in the intervention group; similarly, those in the intervention group who do not comply will have the same outcome as the compliers in the control groups.) Substituting *m*_{a}' = *m*_{b} and *m*_{b}' = *m*_{a} in equation 1 results in

*M*_{a} – *M*_{b} = (1 – *P*_{a} – *P*_{b})(*m*_{a} – *m*_{b})

Note that it is necessary for the sum of the two noncompliance rates to be less than unity (*P*_{a} + *P*_{b} < 1); otherwise, the multiplier will be negative, and even though (*m*_{a} – *m*_{b}) may be positive, (*M*_{a} – *M*_{b}) will be negative. Thus, trials with very large noncompliance rates will be worthless. The statistical efficiency in this case is (1 – *P*_{a} – *P*_{b})^{2}. Table 24.7 shows how the statistical efficiency changes with the average noncompliance rate *P* = (*P*_{a} + *P*_{b})/2. Thus, a trial with a 10% noncompliance rate is only 64% efficient and is losing approximately one-third of its effective number of patients.

As an example of the reality of noncompliance, one can use the experience of the Multiple Risk Factor Intervention Trial, often referred to as the MRFIT Trial.^{8} The intervention consisted of an educational program to change the lifestyle habits thought to be risk factors for coronary heart disease. Smoking was the most important factor, and it is interesting to note that 30% of the control group gave up smoking compared with 46% of the intervention group. Thus, *P*_{a} = 0.30 and *P*_{b} = 0.54. Hence, the statistical efficiency of this trial is (1 – 0.30 – 0.54)^{2} = 0.0256. The study enrolled 12,866 men. Therefore, with a statistical efficiency of 2.56%, this trial was equivalent to having (0.0256) × (12,866) ≈ 329 men who complied 100%.
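The efficiency arithmetic in the last few paragraphs can be checked directly (a sketch; the function names are ours):

```python
def efficiency_one_arm(p_b):
    """Statistical efficiency when only arm B has noncompliance rate p_b
    and the control arm complies completely: (1 - p_b)**2."""
    return (1 - p_b) ** 2

def efficiency_prevention(p_a, p_b):
    """Statistical efficiency of a prevention trial with noncompliance
    rates p_a (controls adopting the intervention) and p_b
    (intervention subjects dropping it): (1 - p_a - p_b)**2."""
    return (1 - p_a - p_b) ** 2

# 10% noncompliance on a single arm gives 81% efficiency, as in the text.
# MRFIT-like rates P_a = 0.30, P_b = 0.54 applied to 12,866 men:
effective_n = efficiency_prevention(0.30, 0.54) * 12866
```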

### Interim Analyses, Multiple Looks at Data, and Early Stopping

Nearly all ongoing clinical trials are monitored at periodic intervals in both the accrual and follow-up phases. A common time period for such monitoring is every 6 months. At these times, an interim analysis is performed to review the toxicity and end-point data. If the toxicity is unexpectedly high, the trial is likely to be modified. Also, if one treatment appears to be significantly superior or inferior to the other under investigation, ethical considerations would dictate that the trial be terminated or modified. Results of these interim analyses may be reviewed by a Data Safety Monitoring Committee. This Committee is usually composed of individuals not directly involved in the trial who are charged with reviewing interim analyses. In a multi-center trial, detailed interim analyses ordinarily are not made available to trial participants unless the outcome information is blinded with respect to treatment identification. The reason for masking treatment identification with respect to end-point data is to avoid accrual to a study being influenced by statistical fluctuations in the end-point data. Toxicity information usually is identified, but even this can influence the subsequent conduct of a trial.

The decision to terminate a trial early because of the apparent inferiority or superiority of one or more therapies is a difficult problem. The difficulty arises because the false-positive rate (i.e., the probability of concluding a therapy is beneficial, when it is not) increases as the number of interim analyses increases. For example, if a clinical trial is planned to have a false-positive rate of 5% at every interim analysis, then that rate would be changed to 14% if there were five interim analyses. The false-positive rate is changed because the more occasions the data are reviewed, the greater the opportunity that a large statistical fluctuation may be mistaken for a real effect. Table 24.8 shows how the false-positive rate changes with a varying number of multiple looks at the data.
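The inflation shown in Table 24.8 can be reproduced with a small Monte Carlo simulation (ours, not from the text): apply a two-sided test at the nominal 5% level to accumulating data at each look, and count how often at least one look rejects when there is truly no treatment effect.

```python
import random
from math import sqrt

def overall_false_positive(looks, per_look=10, reps=10000, seed=1):
    """Estimate the overall false-positive rate when a two-sided z-test
    at the nominal 5% level is applied at each of several equally
    spaced interim looks, with no true treatment effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        total, n = 0.0, 0
        for _ in range(looks):
            # Accrue another batch of standard-normal observations.
            total += sum(rng.gauss(0.0, 1.0) for _ in range(per_look))
            n += per_look
            if abs(total) / sqrt(n) > 1.96:  # nominal p < .05 at this look
                hits += 1
                break
    return hits / reps

# A single look stays near 5%; five looks inflate the overall rate
# to roughly 14%, as stated in the text.
```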

In recent years, new statistical techniques have been developed to aid decision making on the early stopping of clinical trials. The idea behind these methods is that one specifies not only the overall false-positive and false-negative rates, as in conventional clinical trials, but also the number of “looks” at data and the maximal sample size of the trial. This results in objective rules for early stopping. These methods are called “sequential methods,” and they are modifications of concepts from the era of World War II, when sequential methods were developed for the acceptance sampling of manufactured products. Essentially, these sequential methods are derived so that at each interim analysis, the trial may be stopped if the significance level of the statistical tests comparing the treatments is very low. For example, an early stopping rule for a trial with five interim analyses may stop the trial if the first analysis was significant at the *p* = .00001 level; stopping at the second analysis would be done if the results were significant at the *p* = .001 level, with subsequent stopping rules if the significance levels were *p* = .008, *p* = .023, and *p* = .041 for the third, fourth, and fifth interim analyses, respectively. This set of rules preserves an overall 5% false-positive rate for the trial. In essence, the trial results would have to be very dramatic to result in early stopping of the trial.

In practice, these rules should only serve as a guide to aid investigators. It is especially important in using these rules that the data are current, recently reviewed, and that prognostic subgroups are comparable between the various treatments being compared.

If a trial using sequential methods continues to the last interim analysis, the design generally results in an approximately 5 to 20% increase in sample size, compared with a study design with a preassigned, fixed number of patients. The potential gain in using sequential methods is to have the option of terminating the trial in the accrual phase of the study. This option may be realized in trials in which the end point is the proportion of responders, but it is not likely to happen when the end point is survival. An experiment carried out by Rosner and Tsiatis^{9} is particularly instructive. They reviewed 72 completed studies from the Eastern Cooperative Oncology Group (ECOG) in which survival or some other time metric was an end point. Various sequential experimental plans were superimposed to determine what would have happened if the studies had been originally designed to have early stopping. They found that among the 72 studies, 66 (92%) would have terminated earlier using the best sequential plan. (They simulated four different sequential plans.) Among these, 26 (36%) would have been terminated in the accrual phase. It is particularly important to note that all conclusions made from the sequential analysis simulation agreed with those made by the clinical investigators using the full data set. This study shows that the use of sequential methods in clinical trials can result in a positive gain. A fuller discussion of sequential methods in the context of cancer trials can be found in Geller.^{10}

### Strategy of Experimentation

There are a very large number of cancer clinical trials being carried out throughout the world, and a positive outcome is likely to affect clinical practice. One question that arises is, if a trial concludes that a therapy is beneficial, what is the probability that the therapy is truly beneficial? The answer is important in deciding about when a practicing physician should adopt a new therapy.

To discuss this problem, it is necessary to understand how conclusions from a trial are made. All analyses of clinical trials use statistical methods that are based on concepts of the probabilities of making incorrect decisions. The previous section discussed statistical tests and the role of false-positive and false-negative rates (i.e., the false-positive rate refers to the probability of concluding positive benefit when there is no benefit; the false-negative rate is the probability of concluding there is no benefit when a treatment is beneficial). In addition to these concepts, we need another, which is the “prior probability of success.” This depends on the level of clinical innovation and basic science that motivates the trial. Prior probability of success is subjective and cannot be measured objectively; however, it increases with knowledge of successful exploratory or pilot studies. Phase III studies should only be initiated on the basis of successful exploratory and phase II studies; conversely, if a trial tests a drug combination in which each individual drug is without benefit, the prior probability of success will be low. What values of the prior probability of success should one adopt for cancer trials? Because the concept is subjective, it is difficult to be precise, but prior probabilities in the range of 5 to 15% seem to be reasonable for most cancer trials. Define α, β, and θ to be

α = false-positive rate,

β = false-negative rate, and

θ = prior probability of success.

Let us adopt the values α = .05, β = .7, and θ = .10. The value of *α*= .05 is commonly chosen as a false-positive rate in most studies; a false-negative rate of *β*= .7 arises if one has 50 patients in each of two groups and is attempting to determine if there is a 50% difference in the median survivals of the two groups. Figure 24.1 illustrates this process for 1,000 trials. Because the false-negative rate is *β*= .7, the true-positive rate is 1 – β = 0.30. With θ = .10, one can expect 100 true-positive trials; however, only 30% of these will be reported as positive. In addition, among the true-negative trials, 5% or 45 trials will be reported as false positives. Thus, there will be a total of 75 reported positive trials from among the 1,000 trials. The true-positive trials are indistinguishable from the false-positive trials. Thus, the proportion of true positives among the reported 75 positives is 30/75 = .40. Hence, with these parameters, we would expect (on the average) 4 among every 10 reported positive therapies to be true positives.

If the trials had very large patient numbers, the false-negative rate β would be close to zero. If β= 0, then all 100 true-positive trials will be reported to be positive, and the proportion of true positives would be 100/145 = .69 (i.e., approximately 7 of 10 reported positive trials are true positives). Thus, we have shown that with a prior probability of success of .10, the probability that a treatment reported to be beneficial is in truth beneficial may range from .40 to .69.

If *p*(+) denotes the probability that a reported positive trial is a true positive, then we can write

*p*(+) = θ(1 – β)/[θ(1 – β) + (1 – θ)α]

which shows that *p*(+) depends on θ, α, and β. If θ is close to unity, then *p*(+) is close to unity.

Table 24.9 is a summary of *p*(+) for various values of θ and β with α = .05. Note that as θ goes toward unity, *p*(+) will approach unity. These considerations indicate that in assessing the conclusions of a trial reported to be positive, it is necessary to review the prior scientific evidence that led up to the trial. Also, one should avoid initiating phase III trials that are not preceded by positive pilot and phase II studies.
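The figures in this section follow directly from the formula for *p*(+); a minimal sketch:

```python
def prob_true_positive(theta, alpha, beta):
    """p(+): probability that a trial reported positive is a true
    positive, given prior probability of success theta, false-positive
    rate alpha, and false-negative rate beta."""
    true_positives = theta * (1 - beta)
    false_positives = (1 - theta) * alpha
    return true_positives / (true_positives + false_positives)

# theta=.10, alpha=.05, beta=.7 -> p(+) = .40 (the 30/75 example)
# theta=.10, alpha=.05, beta=0  -> p(+) of about .69 (very large trials)
```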

## Reporting of Clinical Trials

The practicing oncologist must rely heavily on the published literature to make decisions about therapy. Unfortunately, there are too many cancer sites, and current views on the systemic treatment of disease change too quickly, for most oncologists to have personal experience with the “latest treatments.” This section outlines guidelines for assessing the quality of reporting for a clinical trial. These guidelines should be useful both to readers of the literature and to authors of clinical trial manuscripts.^{11}

### General Guidelines

#### Population Under Study

There should be clear statements describing the population under study. Major subgroups of patients who are excluded should be mentioned (e.g., “patients over age 65 years were not eligible for the study”).

#### Therapy

Reporting of the protocol therapy (especially chemotherapy) should be outlined in sufficient detail so that the therapy can be duplicated by another physician. Not only the contents of the written protocol but the therapy actually received by patients must be stated. This is especially important for chemotherapy, for which full doses as written in a protocol often may not have been given to patients. Summary measures, such as average dose per course, proportion of patients receiving incomplete courses, proportion of patients receiving full doses, and average number of courses, should be provided as well, and their effect on outcome analyzed. If the written protocol provided for a de-escalation or escalation of dose(s) as a function of toxicity, those details should be given. In addition, information should be given on the extent to which changes in dose followed protocol criteria.

#### Study Design

The study design should be outlined. A schema, which is a pictorial display of the study design, is helpful to the reader. If the study is randomized, it is not sufficient simply to state that it was a randomized study; a statement should indicate how the randomization was carried out (e.g., central randomization, closed envelope, or other methods). The actual randomization scheme should be described. Occasionally, a randomization schedule or procedure may be changed during the course of the study. If so, details should be given regarding the reasons for the change. If there is institutional balancing or other kinds of stratification, this should be stated as well.

#### Patient Accounting

There should be a detailed accounting of all patients registered for the study, and registration should be carefully defined. How is a patient officially registered? Are all patients officially registered before the first day of treatment or after treatment has begun? It is disappointing to learn that in many single-institution, nonrandomized studies, registration may take place months after the first day of treatment. This leaves open the possibility that not all patients on a protocol are registered. In a randomized study, patients are registered from the moment of randomization. Nonrandomized studies should have similarly precise rules for registration.

The number of patients who are classified as “canceled” or “evaluable” by treatment received should be given. A canceled patient is defined as a registered patient who withdrew from the study before the first day of treatment. An unevaluable patient may be one who has incomplete information. Some studies classify an unevaluable patient as one who has major deviations from the protocol. If the reasons for patients being classified as canceled or unevaluable relate to the treatment assignment, then it is mandatory that all patients be included in the treatment comparisons. Otherwise, the selective inclusion of patients may result in wrong conclusions being drawn from the study.

#### Follow-Up

The follow-up period for patients should be given separately for each treatment. Statistics should be included on the average follow-up time, the number followed up for each time period (1 year, 2 years), and maximum and minimum follow-up times. The number of patients lost to follow-up and the reasons should be reported for each treatment. If a relatively large number of patients is lost to follow-up (e.g., 10% or more), then statements about long-term effects may not be correct.

#### Data Quality

There should be a discussion of the quality control methods used for the data. Was there “Second-Party Review?” A Second-Party Review is defined as a patient data review by individuals other than the investigator who generated the patient record. This could be carried out by the Study Chair or a special committee. If there was central data management, it should be mentioned. The review should be centered on answering three major questions for each patient: (1) was the patient eligible? (2) was the protocol followed? and (3) was there objective documentation of the major end points? There should be statements about the quality control of radiotherapy and surgery if these treatment modalities were involved in the study, and similar remarks hold for pathology quality control.

#### End Points and Censored Data

Trials in which the end point for evaluating therapy is a time metric, such as overall survival or disease-free survival, often may have patients with incomplete data. This happens if patients are still alive or in the disease-free state at the time of analysis. Such observations are called censored observations. Several situations arise in defining censored observations that could seriously skew the results. We mention only two here which are widely prevalent and could lead to incorrect conclusions.

The first occurs when a patient dies from a cause other than cancer (e.g., cardiovascular disease and suicide). Appreciable numbers of patients dying from competing causes of death could seriously alter the conclusions of the study if these patients were treated as censored observations. The cancer may have been an important contributing factor in the death. The other reporting problem arises when a patient is taken off the protocol treatment because of lack of response or progression of disease and receives some other therapy that may be more beneficial. If the survival time is classified as censored (still alive) at the time the patient ceased to be on protocol therapy, then the statistical analysis will be biased (being purged of an imminent death). This bias will make a poor therapy appear to be better. It is unfortunate that such practices are widespread. For this reason, the report of a clinical trial should indicate the reasons for classifying patients as censored when the classification arises, other than the usual situation where not enough follow-up time has elapsed to have a complete observation.

#### Statistical Analysis

The report on therapeutic benefit should be presented so that there is no ambiguity as to whether a treatment difference refers to the entire patient population or to special subgroups of patients. It is necessary that the analysis consider all known major prognostic factors that can affect the outcome. Otherwise, there may be disappointment when the therapy is applied in practice. The comparison of response proportions and disease-free and overall survival curves must be made using objective statistical procedures. If a complicated statistical model is used, it should be described in the paper itself or in an appendix. The description of the statistical methods must be adequate for another statistician to reproduce the analysis, if the source data were available.

The outcome of statistical tests depends both on the existence of a true difference and the number of patients in the study. If the number of patients is small, then the study will have low sensitivity (i.e., power) to detect small or moderate treatment differences. Failure to find statistical significance may result from small numbers, rather than lack of benefit. For this reason, every paper reporting a null effect should have a discussion of statistical power and how it can influence the conclusions of the paper.

The analysis also should contain a discussion relating to ending patient entry to the study. For example, was the trial (or part of the trial) stopped because of an unusual outcome associated with a treatment (i.e., a very good or poor result)? Was patient accrual terminated after a predetermined number of patients entered the trial? Was an early stopping rule used? All of these affect the reader’s interpretation of the study conclusions.

### Statistical Techniques

The most common end points in cancer clinical trials are “success” (defined in the context of the trial), response (complete or partial), toxicity (lethal, life-threatening, severe, moderate, or mild), overall and disease-free survival, and duration of response. These end points fall into two general classes, which often are called categorical data (success, response, toxicity) and time metric or survival data (overall survival, disease-free survival, and duration of response). Categorical data are characterized by having outcomes that belong in a category and can be counted (e.g., the number of successes, failures, or other events). The *survival data* (this is the term most often used to describe time metric data, even though the data may not actually refer to survival) are characterized by two events (beginning and end); the time between these two events is the time measurement.

### Categorical Data

Suppose a trial evaluating objective tumor response observed 20 complete or partial responses in 100 patients. The reported response rate is 20%. The statistical model for this study envisions a true or theoretical response rate that could only be calculated if the experiment enrolled the entire population of patients with the particular disease characteristics. Theoretically, this number would be very large. The clinical trial enrolling 100 patients is a sample from this population. The 20% is only an estimate of the true proportion, as it is based on a sample of patients. How close is the reported value to the true value? To judge how close the reported or sample value is to the true value, one uses a statistical technique called a confidence interval. The formula for the confidence interval is
p̂ ± 1.96 √[p̂(1 − p̂)/*n*]    (2)
where *n* is the sample size and p̂ = (number of successes)/(sample size). The caret (^) is used as a reminder that the proportion is based on a sample of observations; the true value would be designated by *p*. More precisely, the formula given by equation 2 is an approximate 95% confidence limit. The confidence interval for our example is calculated to be 0.20 ± 0.08. The operational interpretation of the confidence interval is that the true value of response is within the interval (12%, 28%). The reason it is referred to as a 95% confidence interval is that, on average, 95% of such confidence intervals will be correct (i.e., the true value of response will be within the interval). It is possible to raise the “confidence” to 99% or even higher at the expense of widening the interval, but in practice, most scientists use 95%.
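The interval in equation 2 is easy to compute directly. A minimal Python sketch (the function name is ours, for illustration) reproducing the 20-responses-in-100-patients example:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval for a true proportion p:
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# The example from the text: 20 responses among 100 patients.
low, high = proportion_ci(20, 100)
print(f"95% CI: ({low:.2f}, {high:.2f})")  # (0.12, 0.28)
```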

Another common statistical problem arising in the analysis of a clinical trial is comparing two proportions. The comparison can be made by calculating a confidence interval for the difference between the two proportions or by carrying out a statistical test of significance. To illustrate the problem, suppose that outcome is measured by success and failure and the proportions of successes for two treatments, designated as A and B, are p̂_{a} = 50/90 = .56 and p̂_{b} = 40/100 = .40, respectively. The formula for calculating an approximate 95% confidence interval for the (true) difference (*p*_{a} - *p*_{b}) is
(p̂_{a} − p̂_{b}) ± 1.96 √[p̂_{a}(1 − p̂_{a})/*n*_{a} + p̂_{b}(1 − p̂_{b})/*n*_{b}]    (3)
where *n*_{a} and *n*_{b} are the respective sample sizes. Carrying out the calculations results in 0.16 ± 0.14. The interpretation is that the true value of the difference can be as low as 0.02 or as high as 0.30. The interval (0.02, 0.30) is referred to as a 95% confidence interval for the difference between two proportions. The formula in equation 3 is only an approximation for the 95% confidence interval, but it is accurate enough for sample sizes above 20. The interpretation of the 95% confidence interval is that, on average, 95 of every 100 intervals so calculated will have the true difference within the interval. In this particular example, we conclude there is a real difference between the success proportions because a difference of 0 is not a possible value of the true difference.
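Equation 3 can be checked the same way; the sketch below (function name ours) reproduces the 0.16 ± 0.14 result:

```python
import math

def diff_proportions_ci(s_a, n_a, s_b, n_b, z=1.96):
    """Approximate 95% CI for the true difference p_a - p_b (equation 3):
    returns the point estimate and the half-width of the interval."""
    p_a, p_b = s_a / n_a, s_b / n_b
    half = z * math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_a - p_b, half

# The example from the text: 50/90 successes on A versus 40/100 on B.
diff, half = diff_proportions_ci(50, 90, 40, 100)
print(f"{diff:.2f} +/- {half:.2f}")  # 0.16 +/- 0.14
```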

Another common way to compare proportions is to carry out a statistical test of significance. Usually the data are put in the form of a 2 × 2 table, as shown in Table 24.10.

The statistical test for a 2 × 2 table calculates the probability of obtaining the result that was observed, as well as more extreme outcomes, if there actually is no difference between the treatments. The calculation is based on the following reasoning: if the observed outcome is regarded as scientific evidence in favor of a treatment difference, then outcomes having a greater difference would constitute even stronger evidence of a real difference between treatments. Essentially, the probabilities of all the more extreme tables are calculated, where the totals in the margins are kept constant. The probabilities are then summed to form a *p* value. For example, a more extreme table is depicted in Table 24.11; those data would be even stronger evidence in favor of a difference.

If the *p* value is small, then the probability of the observed table, or more extreme tables, arising by chance (i.e., no difference between treatments) is unlikely. Hence, we would conclude that the premise on which the calculation is made (i.e., no treatment difference) is incorrect and the treatments differ. Usually, a *p* value of less than .05 is declared to be “significant,” resulting in a conclusion that treatments differ.

The statistical test is based on a hypothesis, called the null hypothesis, in which the true values are equal. This is usually designated as H_{0}: *p*_{a} = *p*_{b}. The alternative hypothesis is that the true proportions are different (usually specified by H_{1}: *p*_{a} ≠ *p*_{b}). This alternative hypothesis is called a two-sided alternative, because it refers either to *p*_{a} < *p*_{b} or *p*_{a} > *p*_{b}. Occasionally, the alternative hypothesis is a one-sided hypothesis (e.g., H_{1}: *p*_{a} > *p*_{b}). As a working rule, one should routinely use two-sided alternative hypotheses. A one-sided hypothesis is appropriate only when one treatment can never have a less beneficial effect than the other treatment. In some instances, investigators have reasoned that in comparing a potentially beneficial treatment against a placebo or observation group, a one-sided test would be suitable (i.e., the therapy will be no different from no treatment or will be better). This excludes the possibility that the active treatment may adversely affect the patient; there have been instances in which one-sided tests were used to evaluate the outcome of clinical trials and, on further follow-up, the active treatment was found to be detrimental.

The statistical test for calculating the significance level is often referred to as Fisher’s exact test, after R. A. Fisher, the statistician who derived it. The numeric procedure for the test is a complex one; however, it is available in almost all computer software programs for statistical analyses. Computing Fisher’s exact test for the two proportions in our example (p̂_{a} = 50/90 and p̂_{b} = 40/100) results in a *p* value of .0415.
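Fisher’s exact test sums hypergeometric probabilities over all tables with the observed margins. A standard-library Python sketch (the two-sided rule shown — summing tables no more probable than the observed one — is the common convention used by statistical packages):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].
    Sums the hypergeometric probabilities of all tables with the same
    margins whose probability does not exceed that of the observed table."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):  # probability of a table with x in the upper-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# The example from the text: 50/90 successes on A versus 40/100 on B.
p = fisher_exact_two_sided(50, 40, 40, 60)
print(round(p, 4))
```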

Because any value less than .05 is significant, we would conclude that the proportions differ. An approximate test for comparing the two proportions can be carried out by calculating the chi-square test for comparing two proportions. That formula is
*X*^{2} = *n*(*ad* − *bc*)^{2}/[(*a* + *b*)(*c* + *d*)(*a* + *c*)(*b* + *d*)], where *a*, *b*, *c*, and *d* are the four cell counts of the 2 × 2 table and *n* = *a* + *b* + *c* + *d*,
and then comparing the calculated value with a table of the chi-square distribution. Large values of *X*^{2} reflect evidence of a treatment difference. Table 24.12 is a short table of the chi-square distribution. Note that if *X*^{2} > 3.8, the *p* value is certainly less than .05. Using the same data from which Fisher’s exact test was carried out results in a value of *X*^{2} = 4.60; this gives a *p* value of .03. If the sample sizes are of moderate size (i.e., at least 20 for each group), the chi-square test will give an answer that is quite close to that of Fisher’s exact test.
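The chi-square statistic for a 2 × 2 table is simple enough to compute by hand or in a few lines of code; this sketch reproduces the value for the 50/90 versus 40/100 example:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (no continuity
    correction): X^2 = n(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# The example from the text: 50/90 successes on A versus 40/100 on B.
x2 = chi_square_2x2(50, 40, 40, 60)
print(round(x2, 2))  # 4.6
```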

The difference between the confidence interval approach and the test of significance is that the significance test does not indicate the magnitude of the difference between two proportions. The significance test refers to the probability of observing the given difference, or larger differences, between the two observed proportions if the two theoretical proportions actually are the same. In other words, it calculates the probability of these differences arising from chance fluctuations.

### Survival Data

A characteristic of survival data is that at the time of analysis, some patients may still be alive. These observations are called censored observations and represent incomplete data; however, they do contain important information by providing a lower bound on survival. Both complete and censored observations must be included in any analysis of survival-type data. Censored data arise from a variety of different circumstances. The two chief reasons for observing a censored observation are (1) the period of follow-up is short, and (2) the patient may have been lost to follow-up. The first reason for censoring is referred to as “noninformative censoring” because apart from providing a lower bound on survival, the fact that the patient is censored conveys no further information about the treatment. Patients may be lost to follow-up because they have moved, leaving no trace, or may have died without the investigator being aware of it. In some cases, loss of contact may have arisen because the treatment was unsuccessful or too toxic. Alternatively, the loss to follow-up may be unrelated to the patient’s progress. The latter represents noninformative censoring; however, the other reasons contain information about the treatment and may be informative. Because information on the reasons for patients being lost to follow-up is not generally available, clinical trials with a significant number of such patients could be seriously biased. A rule of thumb is that if more than 10% of the patients are lost to follow-up, then care must be taken in the interpretation of data. One way to assess the importance of these patients is to carry out the analysis in two separate ways: (1) regarding all the lost patients as censored, and (2) assuming the observations on the lost patients are complete and represent the survival time. If the general conclusions of both analyses are the same, then these patients do not constitute a source of bias.

A theoretical survival distribution exists for any defined population of patients, and it may be altered by treatment. The theoretical survival distribution is the probability distribution of the different survival times if a (conceptually) infinite number of patients has received the same therapy. Figure 24.2 plots a theoretical survival function. (It is a plot of a probability versus time.) Denoting the survival function by S(*t*), it represents the proportion of patients who will have a longer survival than time *t*. For example, if *t*_{m} is the time for which the survival time is exceeded by half the patients, then S(*t*_{m}) = 0.5. The quantity *t*_{m} is called the median survival, and the median survival in Figure 24.2 is *t*_{m} = 2 years. In general, one can define the survival time *t*_{p} such that S(*t*_{p}) = *p*. The survival time *t*_{p} represents that point on the theoretical survival curve such that a proportion *p* of patients will have longer survival times. For example, if *p* = .25, then 25% of the patients are expected to have longer survival times than *t*_{.25}. The value *t*_{p} is called the *p*th percentile or upper *p*th percentile.

The theoretical survival distribution is never really known. Instead, as in any real-life situation, we have a limited amount of data, which can be used to estimate the theoretical survival distribution. The estimate of the theoretical survival distribution can be considered as being a summary or condensation of the data.

There are two principal ways of estimating the survival curve from actual data. These are called the life-table or actuarial method and the Kaplan-Meier or maximum-likelihood method. The life-table method generally is used with a large number of observations, whereas the maximum-likelihood method is used with a small number of observations. Many different computer programs automatically calculate these estimates. We illustrate the calculations for the life-table method here, as this is the more common method. Calculations for data on *n* = 118 patients with advanced adenocarcinoma of the lung are outlined in Table 24.13. The starting point for the calculations is to select a time interval to summarize the survival times. In Table 24.13, this interval is 1 month. A summary of the data is given in columns 2 to 4; the calculations in this table are self-explanatory. Figure 24.3 plots the survival function; note that it is plotted as a step function. The last column of Table 24.13 refers to the survival probability. For example, the probability is .897 of surviving 1 month and .809 of surviving 2 months. The number at risk within any interval is the number of patients who are “candidates” for dying within that interval. For example, for the first interval, 118 patients were alive at the beginning, but 3 were censored within the interval. The number at risk is calculated by assuming that each censored patient was at risk for half the interval; hence, the number at risk is 118 – 3/2 = 116.5. The larger the number of patients at risk, the greater is the reliability of the survival probability. As a result, survival probabilities in the “tails” of the distribution do not have the same reliability as those at the beginning of the distribution.
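The life-table arithmetic can be expressed compactly. In the sketch below the function is ours; the monthly deaths and censored counts are illustrative values chosen to reproduce the survival probabilities .897 and .809 quoted from Table 24.13:

```python
def life_table(n, intervals):
    """Actuarial (life-table) survival estimate.
    `intervals` is a list of (deaths, censored) counts per interval,
    in time order. Each censored patient is assumed to be at risk
    for half the interval, so the number at risk is alive - censored/2."""
    alive = n
    surv = 1.0
    out = []
    for deaths, censored in intervals:
        at_risk = alive - censored / 2
        surv *= 1 - deaths / at_risk          # cumulative survival probability
        out.append(round(surv, 3))
        alive -= deaths + censored
    return out

# First two monthly intervals of the lung cancer example (n = 118).
# The counts (12 deaths, 3 censored; then 10 deaths, 2 censored) are
# hypothetical values consistent with the probabilities in the text.
surv = life_table(118, [(12, 3), (10, 2)])
print(surv)  # [0.897, 0.809]
```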

### Test of Significance for Comparing Two Survival Distributions

There are several ways to perform a statistical test of significance for comparing two survival distributions. The most widely used test is the “log rank test.” The calculation of the test is relatively complex; however, it is widely available on computer systems. The key assumption in using the log rank test is that if the two survival distributions are denoted by S_{1}(*t*) and S_{2}(*t*), then the ratio of their logarithms is always a constant, that is, log S_{1}(*t*)/log S_{2}(*t*) =e^{β} (constant independent of time). The log rank procedure tests the null hypothesis that β = 0. If the assumption is that the ratio of the logarithms of the survival functions is not constant, then the log rank test would be inappropriate to use for comparing two survival distributions. This assumption sometimes is referred to as the “proportional hazard” assumption. (See the section on Statistical Models for a discussion of hazard functions.) One situation in which this assumption does not hold is when the two survival distributions are observed to cross or intersect; this corresponds to the case in which one therapy appears to be better during early follow-up but, as time progresses, a higher proportion of patients on the other therapy live for longer periods of time.
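The log rank statistic compares, at each distinct death time, the observed number of deaths in one group with the number expected if the two groups shared a common survival distribution. A simplified observed-versus-expected Python sketch (illustrative data; the full log rank test also divides O − E by its variance, which statistical packages handle):

```python
def logrank_oe(group1, group2):
    """Observed and expected deaths for group 1 under the null hypothesis.
    Each group is a list of (time, event) pairs; event = 1 is a death,
    event = 0 a censored observation."""
    all_pts = ([(t, e, 1) for t, e in group1] +
               [(t, e, 2) for t, e in group2])
    death_times = sorted({t for t, e, g in all_pts if e == 1})
    O1 = sum(e for t, e in group1)
    E1 = 0.0
    for dt in death_times:
        # risk sets: patients still under observation at time dt
        n1 = sum(1 for t, e, g in all_pts if g == 1 and t >= dt)
        n2 = sum(1 for t, e, g in all_pts if g == 2 and t >= dt)
        d = sum(1 for t, e, g in all_pts if e == 1 and t == dt)
        E1 += d * n1 / (n1 + n2)   # expected deaths in group 1 at time dt
    return O1, E1

# Illustrative data only.
g1 = [(1, 1), (2, 0), (3, 1), (4, 1)]
g2 = [(1, 0), (2, 1), (3, 1), (5, 0)]
O1, E1 = logrank_oe(g1, g2)
print(O1 - E1)  # observed minus expected deaths in group 1
```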

## Statistical Models

Evaluation of any therapy in a clinical trial should consider all factors that influence outcome. In addition to a potential for the therapy under investigation to influence outcome, features associated with the natural history of the disease also influence outcome. For example, it is well known that the probability of observing a response for advanced lung cancer depends both on performance status and weight loss. Another example is that the survival of women participating in adjuvant breast cancer trials is affected by menopausal status, nodal involvement, tumor size, and estrogen receptor (ER) status. Incorporation of these baseline variables into the statistical analysis ordinarily results in a more precise analysis.

The way in which these covariates are incorporated into a statistical analysis is through statistical models. General statistical models have been developed for both categorical and survival data; the models commonly used for these two kinds of end points are referred to as logistic and proportional hazards models, respectively. Very often, the proportional hazards models are called “Cox models” in honor of D. R. Cox, who first proposed them. The computations for using these models are extensive and ordinarily would be impossible to carry out on a hand calculator; however, calculations for both types of models are widely available in many statistical software packages. Therefore, it is appropriate to give only the basic concepts here.

The basic idea of these models is presented as if one had two therapies being compared and a single covariate. The generalization of these ideas to many covariates is straightforward.

### Logistic Models

Suppose there are two therapies (labeled A and B) having theoretical response probabilities *p*_{a} and *p*_{b}, which are unknown. The odds of response for each therapy are defined as the ratio of the probability of a response to the probability of no response, that is, *p*_{a}/*q*_{a} and *p*_{b}/*q*_{b}, where *q*_{a} = 1 – *p*_{a} and *q*_{b} = 1 – *p*_{b}. The model for comparing the two therapies would be to write the logarithms of the odds as
log(*p*_{a}/*q*_{a}) = α and log(*p*_{b}/*q*_{b}) = α + β, so that the logarithm of the odds ratio is log[(*p*_{b}/*q*_{b})/(*p*_{a}/*q*_{a})] = β.
The quantities α and β are unknown parameters in the model. If β = 0, then the two treatments are the same. This formulation of the modeling is equivalent to writing
*p*_{a} = *e*^{α}/(1 + *e*^{α}) and *p*_{b} = *e*^{α+β}/(1 + *e*^{α+β}).
The quantities *p*_{a} and *p*_{b} are expressed in terms of parameters (α,*β*), respectively. The functional form is the logistic function; hence, the name logistic models. Often, the logarithms of the odds ratios log(*p*_{a}/*q*_{a}) and log(*p*_{b}/*q*_{b}) are called logits.

Now suppose that the response is affected by gender. This is incorporated into the model by writing the logit for response as
log(*p*/*q*) = α + β*t* + γ*g*, with *t* = 1 for treatment B (0 for treatment A) and *g* = 1 for males (0 for females),
where the parameters (α, β, γ) are unknown. The above model assumes that the effect of gender is additive and is independent of treatment. A more complex model would be to allow the effect of treatment to depend on gender. This can be done by defining a new parameter δ and writing the model for males receiving treatment B as log(p_{b}/q_{b}) = α + β + γ + δ. The quantity δ is called an *interaction term*. The parameters for the logit model with interaction can be summarized as in Table 24.14.

Statistical methods exist for finding the numerical values of the parameters and for making tests of significance. The numerical calculations are readily carried out on computers. The model is very flexible and allows an organized way for interpreting the data. Table 24.15 summarizes the various possibilities for drawing conclusions from this data set. The extension to many more covariates can be done.
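The logit parameterization with an interaction term can be illustrated numerically. In the sketch below the cell response proportions are hypothetical; the code recovers (α, β, γ, δ) from the four observed logits, in the spirit of Table 24.14:

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

# Hypothetical response proportions by (treatment, gender).
p = {("A", "F"): 0.30, ("B", "F"): 0.50, ("A", "M"): 0.20, ("B", "M"): 0.55}

alpha = logit(p[("A", "F")])                         # baseline: females on A
beta = logit(p[("B", "F")]) - alpha                  # treatment effect
gamma = logit(p[("A", "M")]) - alpha                 # gender effect
delta = logit(p[("B", "M")]) - alpha - beta - gamma  # interaction term

# delta = 0 would mean the treatment effect is the same for both genders.
print(round(delta, 3))
```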

In using a statistical model to analyze data, there are many opportunities to draw incorrect conclusions because the model was incorrect. For example, if the interaction term (δ) is omitted from the model but the effect of treatment depends on gender, then conclusions from the analysis may be wrong. Unfortunately, details of the goodness of fit of mathematical models are often omitted from scientific papers, and so it is difficult for even the most experienced reader to verify the adequacy of a model.

### Proportional Hazard Models for Survival (Cox Survival Models)

Modeling of survival data to take account of other factors that influence survival is carried out by modeling the “hazard function” or “failure rate.” To illustrate the ideas, consider the survival times (in years) of ten patients: .2+, .5+, .5, 1.2+, 1.2, 1.8, 2.0, 2.1, 3.5, 5.0+, where a + denotes a censored observation. Out of 10 observations, there are 6 deaths. Hence, the proportion of deaths is *p* = 6/10 = .6. However, the proportion of deaths depends on the length of the follow-up time. The longer the follow-up time, the greater is the proportion of deaths. A more useful summary of the data is provided by the ratio of the proportion of deaths to the average follow-up time. This results in an expression that has units of “deaths per unit of follow-up time” or, equivalently, “proportion of failures per average follow-up time.” This quantity is called the “failure rate” (FR) and is calculated by
FR = (proportion of deaths)/(average follow-up time) = (number of deaths)/(total follow-up time) = 6/18 = 0.33 deaths per person-year.
Sometimes, it is convenient to report the deaths as per 100 patients or 1,000 patients. In our example, the failure rate could be reported as 1 death per 3 patients per year, or 33 deaths per 100 patients per year, or 333 deaths per 1,000 patients per year.
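The failure-rate calculation for the ten patients above can be verified directly:

```python
# Survival times in years for the ten patients from the text
# (.2+, .5+, .5, 1.2+, 1.2, 1.8, 2.0, 2.1, 3.5, 5.0+);
# True marks a death, False a censored observation.
data = [(0.2, False), (0.5, False), (0.5, True), (1.2, False), (1.2, True),
        (1.8, True), (2.0, True), (2.1, True), (3.5, True), (5.0, False)]

deaths = sum(1 for t, died in data if died)
follow_up = sum(t for t, died in data)   # total person-years of follow-up
fr = deaths / follow_up                  # deaths per person-year
print(f"{deaths} deaths / {follow_up:.1f} person-years = {fr:.3f}")
```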

Alternatively, the time units may be changed to deaths per month, in which case the death rate would be 2.75 deaths per 100 patients per month. The FR = 0.33 represents an average failure rate over the entire set of data. One could have calculated separate failure rates for the first year, second year, and so on; these calculations are shown in Table 24.16.

Clearly, the failure rate keeps dropping with time, and the FR = 0.33 is an average failure rate. With a very large number of patients, the failure rate can be calculated for smaller and smaller time intervals (e.g., monthly, weekly, daily). One can envision an interval of time that gets progressively smaller as it shrinks to a point. With each of these smaller and smaller intervals, a failure rate can be calculated, provided the number of patients is very large (theoretically infinite). This limiting process defines the “instantaneous failure rate” or the “hazard function,” which is directly related to the survival function. If one knows the hazard function, then the survival function is completely defined, and vice versa. Letting *S*(*t*) denote the survival function and *h*(*t*) the hazard function, the relationship between the two is given by
S(*t*) = *e*^{−H(t)}, where H(*t*) = ∫_{0}^{t} *h*(*u*)*du* is the cumulative hazard.
Suppose a clinical trial is comparing two treatments, denoted by A and B. The proportional hazards model for making the comparison is to specify that h_{B}(t)=e^{β}h_{A}(t) where h_{A}(t), h_{B}(t) are the hazard functions for the two treatments and β is an unknown constant. Clearly, if β = 0, the two treatments have the same hazard function and consequently the same survival functions. A more formal way of writing this model is
*h*_{B}(*t*) = *e*^{β} *h*_{A}(*t*) for all *t* ≥ 0.
Thus, the hazard function of treatment B is proportional to the hazard function of treatment A. This model leads to the log rank test which has been discussed earlier. Note that
log S_{B}(*t*) = −∫_{0}^{t} *h*_{B}(*u*)*du* = −*e*^{β} ∫_{0}^{t} *h*_{A}(*u*)*du* = *e*^{β} log S_{A}(*t*).
Therefore, log S_{B}(t)/log S_{A}(t) = e^{β}, which is the assumption for the log rank test discussed earlier.
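The constancy of log S_{B}(t)/log S_{A}(t) under proportional hazards can be checked numerically. The sketch below uses exponential survival (a constant baseline hazard) as an illustrative special case, with arbitrary values of β and the hazard:

```python
import math

beta = 0.5                    # log hazard ratio (illustrative value)
h_a = 0.2                     # constant hazard for treatment A (illustrative)
h_b = math.exp(beta) * h_a    # proportional hazards: h_B(t) = e^beta * h_A(t)

def surv(h, t):
    """Exponential survival: S(t) = exp(-h*t) for a constant hazard h."""
    return math.exp(-h * t)

# The ratio log S_B(t) / log S_A(t) equals e^beta at every time point.
for t in (0.5, 1.0, 2.0, 5.0):
    ratio = math.log(surv(h_b, t)) / math.log(surv(h_a, t))
    print(t, round(ratio, 6))  # always e^0.5 ≈ 1.648721
```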

Now suppose that survival not only depends on the treatment but also on gender. The hazard function can then be modeled in a similar way as the logistic function. Explicitly, we can model the hazard function for each treatment by
h(*t*) = *h*_{0}(*t*)*e*^{βx_{1} + γx_{2} + δx_{1}x_{2}}, where *x*_{1} = 1 for treatment B (0 for treatment A) and *x*_{2} = 1 for males (0 for females).
The quantities β, γ, δ can be estimated from the data. Statistical tests are available for making inferences on these parameters. Note that this is a parallel model to the logistic regression model. The only difference is that *h*_{0}*(t)*, which is the baseline hazard rate, replaces α in the logistic model. The inferences from this proportional hazard model regarding the parameters are the same as in the logistic model (see Table 24.15).

Many applications of this type of model fail to verify whether the proportional hazards assumption is correct. Furthermore, it is rare that an interaction term is included in the model. Investigators should be wary of a statistical analysis that neither includes interaction terms (to determine whether the treatment effect depends on one or more prognostic or other baseline factors) nor addresses the correctness of the proportional hazards assumption. To illustrate further how modeling with proportional hazards is carried out, a detailed example is presented.

The ECOG performed a randomized clinical trial on recurrent head and neck cancer with three treatment groups: (1) low-dose (40 mg/m^{2}) methotrexate (M), (2) high-dose (240 mg/m^{2}) methotrexate plus leucovorin rescue (ML), and (3) high-dose (240 mg/m^{2}) methotrexate plus leucovorin rescue plus cyclophosphamide (500 mg/m^{2}) plus cytosine arabinoside (300 mg/m^{2}) (MLCC). It is known that survival depends on performance status, time since first symptoms, disease site, and weight loss. The trial registered 237 patients, and Table 24.17 summarizes the medians.

It is clear that this trial is complex and a candidate for statistical modeling. The particular variables chosen for modeling and their associated levels are summarized in Table 24.18. This table represents a condensation of the data because performance status is measured on a five-point scale (0, 1, 2, 3, 4). An ambulatory patient is someone with a performance status of 2 or less. Similarly, the time from first symptoms has been condensed to three levels. Even with this condensation, however, the number of possible combinations is 3 × 2 × 3 × 4 × 4 = 288. The entire trial only registered 237 patients, so there are more experimental combinations than patients.

In setting up the statistical model, each variable will generate parameters in the model. Ordinarily, the number of parameters is one less than the number of levels. For example, treatment will have two parameters because it has three groups. This can be modeled by using the parameters β_{1} and β_{2}, that is, h(M) = *h*_{0}(*t*), h(ML) = *h*_{0}(*t*)*e*^{β1}, h(MLCC) = *h*_{0}(*t*)*e*^{β2}. Because h(ML)/h(M) = *e*^{β1}, positive values of β_{1} imply that treatment ML has a higher failure rate than treatment M. A similar interpretation holds for h(MLCC)/h(M) = *e*^{β2}. The comparison of ML to MLCC results in h(MLCC)/h(ML) = *e*^{β2-β1}. Hence, if β_{2} > β_{1}, then MLCC has a higher failure rate than ML. The same scheme is set up for the other variables.

Table 24.19 summarizes the elements of this model. Note that weight loss is represented by only one parameter (δ), even though there are four levels. This variable has been modeled as a continuous variable, where a new variable *x* is introduced that takes on values (0, 2, 3, 4) corresponding to increasing levels of weight loss. The estimates of β_{1} and *β*_{2} are significantly greater than zero. Hence, M (low-dose methotrexate) has better survival (smaller hazard function) than the other two treatments. To compare the hazard ratios of h(ML) to h(MLCC), we have h(ML)/h(MLCC) = *e*^{β1-β2}. Making the comparison, we have

Because zero is a possible value of the difference between β_{1} and β_{2}, we would conclude that the two high-dose methotrexate arms have the same survival. Reviewing the other parameters, we note that performance status, time since first symptoms, weight loss, and disease site are all significant.

This model is an example of an additive model. There are no interaction terms with treatment. It concludes that low-dose methotrexate is superior, but it does not explore, for example, how this superiority relates to ambulatory status (i.e., does the superiority hold in the same way for both ambulatory and nonambulatory patients?). Similar remarks can be made about the other prognostic factors. Another potential criticism is that the way in which the weight loss is modeled requires further documentation.

## Meta-Analysis

The last decade has seen increasing use of the statistical technique termed *meta-analysis*, which refers to the use of formal statistical techniques to sum up a collection of separate studies attempting to investigate the same hypothesis. Its purpose is the same as the scientific review of independent studies aimed at studying the same hypothesis. The difference between meta-analysis and an ordinary scientific review is that a scientific review tends to be somewhat personal, reflecting the views of the reviewer. Meta-analysis, however, attempts to synthesize data in a quantitative way, and the end product is a numeric estimate of a quantity that usually reflects the advantage of a treatment or a method.

The impetus for carrying out meta-analyses of cancer clinical trials is that individual trials may be too small to find small but important therapeutic effects. For example, there are approximately 40,000 deaths from breast cancer every year in the United States. If the cure rate were increased by 10%, then one would expect 4,000 fewer deaths. A clinical trial to compare a treatment having a very low cure rate, in the neighborhood of 10 to 20%, with one producing a 10% relative increase would require approximately 10,000 to 25,000 patients. No breast cancer trials have ever been designed with this large a sample size.
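Sample sizes of this magnitude can be checked with the standard normal-approximation formula for comparing two proportions. In the sketch below, the 80% power and two-sided 5% significance level are illustrative assumptions, not design parameters taken from the text:

```python
import math

def n_per_arm(p1, p2):
    """Approximate patients per arm to detect p1 versus p2 with a
    two-sided 5% test at 80% power (normal approximation)."""
    z_a, z_b = 1.96, 0.8416   # z values for alpha/2 = .025 and power = .80
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# A 10% relative improvement at cure rates of 20% and of 10%:
print(2 * n_per_arm(0.20, 0.22))  # total patients needed, 20% -> 22%
print(2 * n_per_arm(0.10, 0.11))  # total patients needed, 10% -> 11%
```

The smaller the baseline cure rate, the larger the trial needed to detect the same relative improvement, which is why such trials have never been mounted.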

Meta-analytic methods have been used in a wide variety of fields. They have been applied to observational as well as randomized studies. The initial applications were made in education research, but these ideas now are being applied to a large number of scientific fields. Although performing a meta-analysis is relatively straightforward, carrying out a good analysis is difficult.

The principal difficulties in applying meta-analysis to clinical trials are that (1) the therapies may be different, (2) the patient populations may be different, (3) the follow-up times may vary, and (4) the quality of the studies may vary. To illustrate these ideas, suppose that a meta-analysis is to be performed to determine if adjuvant chemotherapy prolongs survival for patients with breast cancer. In fact, a meta-analysis of this kind has been carried out by the Early Breast Cancer Trialists’ Collaborative Group.^{12} Which trials should be included in this analysis? Both randomized and nonrandomized trials exist; including nonrandomized studies would introduce all the well-known biases that are associated with them; hence, the meta-analysis should be restricted to randomized studies only. Among the randomized trials, some compare chemotherapy versus a placebo or observation group, whereas others compare chemotherapy plus postoperative radiation versus postoperative radiation alone. Should the latter trials be included in a meta-analysis? Hypothetically, if the radiation therapy is of no benefit, then such trials should be included. Alternatively, if the radiation therapy does improve survival, then it will attenuate the apparent effect of chemotherapy on survival. Among the randomized chemotherapy trials, there are a variety of treatment regimens. Some have used tamoxifen, cyclophosphamide, combination therapy [cyclophosphamide, methotrexate, 5-fluorouracil (CMF)], CMF with prednisone, or melphalan. The schedules have ranged from short intensive courses to long courses of therapy. Doses also may have differed among the studies. Some studies have been made on node-negative patients only, and some have been on node-positive patients. Still others have included both. The eligibility requirements for the trials may differ in other substantial ways as well.

Should the meta-analysis be restricted to published studies only, or should it include both published and unpublished studies? Published studies tend to be positive, whereas unpublished studies tend to be negative. The data quality of unpublished studies may not be the same as that of published studies. How does one find unpublished studies? Finally, the methods used to carry out the meta-analysis give more weight to studies with larger sample sizes and do not give any weight to the quality of a study. Nevertheless, proponents of meta-analysis believe that despite these problems, meta-analysis is worthwhile.

To discuss the basic issues, the meta-analysis performed by the Early Breast Cancer Trialists’ Collaborative Group of cytotoxic therapy for patients with early breast cancer will be reviewed.^{12} The meta-analysis includes almost all randomized clinical trials (both published and unpublished) that were made available for analysis. The only exclusions were trials in Japan and the former Soviet Union. The number of trials in the analysis totaled 35, which were divided into four major subgroups: (1) trials of CMF or CMF and prednisone (CMFP); (2) trials of CMF and extra cytotoxic agents; (3) trials of combination therapy that include some C, M, or F; and (4) trials of single agents. The analysis was divided into two sets, corresponding to women younger than 50 years of age at entry to the trial and those 50 years or older. We only consider here the analysis for the younger women.

The essential summary of the analysis has been put in graphic form by the authors in Figure 24.4. The columns are self-explanatory, except for the last two. The authors have calculated the difference between the observed number of deaths (O) and the expected number of deaths (E) for the treatment, assuming there is no difference between treatment and control. This difference is written as O – E, and the results are given for each trial. A negative value reflects that the treatment group had fewer deaths than expected. The graphic portion of the figure plots the ratio of treatment to control mortality rates with a 99% confidence interval for each trial. A value of less than unity indicates that mortality is less for the treatment group than for the control group. The diamond symbol is centered on the average ratio of mortality rates; its length represents a 95% confidence interval. The figure contains the average mortality ratios for each of the four subgroups of trials as well as an overall ratio, which appears at the bottom. The overall conclusions are that (1) trials including CMF (group A) have a significant reduction in the annual mortality rate (37 ± 9%); (2) none of the other three clinical trial groups indicates a significant reduction in mortality; and (3) all four groups combined show a 22 ± 6% reduction in overall mortality. This last conclusion mainly reflects the inclusion of the CMF trials in the overall average.

If we examine the CMF (group A) trials, 3 of the 11 trials listed had fewer than five patients per treatment group and did not warrant inclusion. Of the remaining 8 trials, all had a no-treatment control group except Glasgow, whose control group received radiotherapy. Both the ECOG (6177) and the Ludwig III trials added prednisone to the CMF, with the Ludwig trial also adding tamoxifen. With the exception of the Leiden and UK/Asia trials, all had a 12-month course of therapy; the Leiden trial used a 24-month course. Thus, the trials were not all comparable, but they were reasonably close with respect to therapy. It is questionable whether the Glasgow study should be included, because it does not have a no-treatment control group. In any event, excluding the trials with small numbers, all of the remaining 8 trials produced a negative O – E, which indicates an excess of deaths in the control group. The chance of this happening if treatment is not beneficial is the same as tossing a fair coin eight times and observing all heads or all tails. This probability is *p* = .008, so the result is unlikely to have arisen by chance. Hence, one could readily have concluded that the aggregate of trials having CMF as their therapy reduces mortality. Among the 8 mature trials, 5 are individually significant at the .05 level (INT Milan 7205, Glasgow, Leiden, Guy’s/Manchester II, and INT Milan 8004). Thus, it is no surprise that the meta-analysis reached a similar conclusion. The use of a 99% confidence interval for the individual trials obscures those trials significant at the conventional 5% level; it is not clear why this was done.
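The coin-tossing argument is a sign test, and the arithmetic is easy to verify. A minimal sketch (the seven-toss value is included because it corresponds to setting aside one further trial, such as Glasgow):

```python
def sign_test_p(n):
    """Two-sided probability that all n trials favor the same arm
    when treatment and control are truly equivalent (a fair coin)."""
    return 2 * 0.5 ** n

print(sign_test_p(8))  # 0.0078125: eight trials all with negative O - E
print(sign_test_p(7))  # 0.015625: seven trials, e.g., with Glasgow set aside
```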

The overall value of O – E essentially gives more weight to trials with larger numbers of patients. There is no attempt to weight or judge the quality of these studies. However, one clue to quality is that there should be equal numbers of patients in each treatment group of every trial, except for chance fluctuations. Note that the total numbers of patients in the treatment and control groups are 635 and 554, respectively, in the group A analysis. The probability that such a split could arise by chance is *p* = .02; in other words, the split is not random and probably reflects differential quality among these trials. The major contributors to this imbalance are INT Milan 7205, Glasgow, and UK/Asia. These trials represent 40% of the total number of patients in group A.
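The balance check on the 635 versus 554 split can be reproduced with an exact binomial calculation. This is a hedged sketch of one way to compute it; the function below is illustrative, not the authors' computation.

```python
from math import comb

def two_sided_split_p(n_a, n_b):
    """Exact two-sided binomial probability of a treatment/control split
    at least as unbalanced as (n_a, n_b) under 1:1 randomization."""
    n, hi = n_a + n_b, max(n_a, n_b)
    # Python's arbitrary-precision integers keep the tail sum exact
    # before the final division.
    tail = sum(comb(n, k) for k in range(hi, n + 1))
    return min(1.0, 2 * tail / 2 ** n)

# Group A totals reported in the overview: 635 treated vs. 554 controls.
print(round(two_sided_split_p(635, 554), 3))
```

The result agrees with the *p* = .02 quoted in the text.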

There has been a virtual “explosion” in the application of meta-analysis to cancer clinical trials. Some examples are meta-analyses of tamoxifen in early-stage breast cancer, uveal melanoma, advanced cancer of the uterine cervix, prophylactic cranial irradiation in small-cell lung cancer, postoperative radiotherapy in non–small cell lung cancer, chronic lymphocytic leukemia, colorectal cancer, prophylactic node dissection in breast cancer, astrocytoma, hormone replacement therapy in relation to ovarian and colon carcinoma, and preoperative radiation in invasive bladder cancer.

It is not at all certain that many poor trials, considered as a constellation in a meta-analysis, will shed more light than a few high-quality trials that reach similar conclusions. The strength of meta-analysis is numbers; its weakness is the failure to consider the inherent quality of the design and execution of the individual trials.

## Falsification of Data

The veracity of cancer clinical trials was seriously questioned when it was discovered in 1994 that false patient data had been submitted by at least two institutions participating in the breast cancer clinical trials carried out by the National Surgical Adjuvant Breast and Bowel Project (NSABP).^{13} The NSABP is a multicenter clinical trials group mainly performing trials in breast cancer. The public announcement of the data falsification not only became a political issue but also raised questions about the way cancer clinical trials are carried out. Among the issues raised were (1) why the falsification had not been discovered earlier and (2) what effect the false data had on the conclusions drawn from these clinical trials. This section discusses these two issues.

Quality control of data from cancer clinical trials usually is carried out by a combination of data managers, automated computer data checks, and record reviews by study chairmen and possibly other senior physicians. This program is supplemented by periodic audits at each clinical site, which consist of a full audit, an audit of all significant events, or an audit of a random sample of hospital records for patients entered in trials. The audit compares the hospital record data with the data submitted to the coordinating center for the trial.

One of the NSABP clinical investigators had been submitting falsified data over a period of 15 years. A complete audit of this investigator's records resulted in the discovery of discrepancies for 99 of 1,511 patients. All but one of these discrepancies involved eligibility rather than toxicity or relapse. The falsified data included changed dates of surgery performed before patients enrolled in studies, altered dates of biopsies, changed or fabricated estrogen receptor (ER) values, altered dates of chemotherapy, and failure to obtain appropriate informed consent. Falsifications of this nature could be discovered only by an audit of the records at the hospital. In fact, the NSABP audit process did discover that the investigator had submitted some false data; nevertheless, the United States Congress and National Cancer Institute officials questioned why the data discrepancies had not been uncovered earlier by the NSABP.

It should be noted that an audit of a random sample of patient records can only assess the quality-control system in place at the hospital. It cannot certify that all submitted data are correct. One point of view is that data fraud lies at the end of a road marked by a careless and sloppy data collection system. Demands have been made to increase the frequency and scope of data audits in order to have a better chance of detecting data falsification. However, one must be realistic in proposing such a program. Our experience is that data falsification is relatively rare among clinical investigators.

It is unfortunate that this much-publicized case of data fabrication led only to punitive actions rather than an investigation of why it occurred and how to reduce the incentives for such behavior. The motivation for submitting false data may arise from a variety of sources. Among these are: (1) the eligibility criteria may appear to be arbitrary (e.g., the patient must be entered in the study within 28 days of surgery); (2) required laboratory tests may be nonroutine, expensive, too frequent, or inconvenient for the patient, serving an ancillary study rather than patient care; or (3) investigators may receive funds on a per-patient basis (which may mean additional income for a physician in private practice).

The other issue to be addressed is the effect of data fabrication on the principal conclusions of a study. Clearly, if the amount of falsified data is small relative to patient accrual, it is unlikely to alter previous conclusions. It also is necessary to account for the nature of the data fabrication. All but one of the known discrepancies in the NSABP trials consisted of altered eligibility criteria. Furthermore, all of these trials were randomized. As a result, the ineligible patients were randomly assigned to the treatment groups, so each of the treatments under study had the same opportunity to be tested on these patients. Consequently, the comparisons among treatments will still be unbiased, although the inclusion of these patients may require a modified interpretation of the study's conclusions because of the altered patient population. The randomization process ensures that unknown factors (e.g., undetected ineligibility) affect all treatment groups in the same way, on average.

In any event, discovery of falsified data submitted by an investigator is a serious matter. The prevailing view is that all data from that investigator (as well as his or her hospital) should be expunged from the database of the clinical trial. This will result in reduced power to detect real differences among therapies in a trial. Table 24.20 shows the loss of power relative to the percentage of data expunged from the database when a trial is designed to have 80% power at a 5% level of significance. For example, if 10% of the patients are dropped, it will result in a 5% loss of power. Hence, the power changes from 0.80 to 0.76. A rule of thumb is that there is a loss of 0.5% power for every 1% removal of patients when less than 20% of the patients are dropped.
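The rule of thumb can be checked with a normal-approximation power calculation. This is a sketch under the assumption that the trial is sized for a two-sided z-test; it is not the exact computation behind Table 24.20.

```python
from statistics import NormalDist

def power_after_dropping(drop_frac, power0=0.80, alpha=0.05):
    """Approximate power of a two-sided z-test after removing a fraction
    of patients, with the effect size held fixed so the noncentrality
    parameter shrinks by sqrt(1 - drop_frac)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # critical value, two-sided 5% level
    z_beta = nd.inv_cdf(power0)           # quantile corresponding to design power
    return nd.cdf((z_alpha + z_beta) * (1 - drop_frac) ** 0.5 - z_alpha)

print(round(power_after_dropping(0.10), 2))  # 0.76, as in the text's example
```

For drops under 20%, each additional 1% of patients removed costs roughly half a percentage point of power, consistent with the stated rule of thumb.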

Our general conclusion is that if the original trial showed a statistical difference between treatments, then a removal of fewer than 10% of patients from the original analysis is unlikely to change the conclusions with a reanalysis. Also, submission of fraudulent eligibility data does not affect the unbiasedness of a randomized trial.

## References

- 1.
- Louis PCA. Essays in clinical instruction. London, U.K.: P. Martin; 1834.
- 2.
- Zelen M. The randomization and stratification of patients to clinical trials. J Chronic Dis. 1974;27:365–375. [PubMed: 4612056]
- 3.
- Taylor K M, Margolese R G, Saskolne C L. Physicians’ reasons for not entering eligible patients in a randomized clinical trial of surgery for breast cancer. N Engl J Med. 1984;310:1363–1367. [PubMed: 6717508]
- 4.
- Antman K, Amato D, Wood W. et al. Selection bias in clinical trials. J Clin Oncol. 1985;3:1142–1147. [PubMed: 4020412]
- 5.
- Anderson J R, Cain K C, Gelber R D. Analysis of survival by tumor response. J Clin Oncol. 1983;1:710–719. [PubMed: 6668489]
- 6.
- Lefkopoulou M, Zelen M. Intermediate clinical events, surrogate markers and survival. Lifetime Data Analysis. 1995;1:73–85. [PubMed: 9385094]
- 7.
- Goldhirsch A, Gelber R D, Simes R J. et al. Costs and benefits of adjuvant therapy in breast cancer: a quality adjusted survival analysis. J Clin Oncol. 1989;7:36–44. [PubMed: 2642538]
- 8.
- Neaton J D, Brose S, Fishman E L. et al. The Multiple Risk Factor Interventions Trial (MRFIT). VII. A comparison of risk factor changes between two study groups. Prev Med. 1981;10:519–543. [PubMed: 7027241]
- 9.
- Rosner G L, Tsiatis A A. The impact that group sequential tests would have made on ECOG clinical trials. Stat Med. 1989;8:505–516. [PubMed: 2657957]
- 10.
- Geller N L. Planned interim analysis and its role in cancer clinical trials. J Clin Oncol. 1987;5:1485–1490. [PubMed: 3625263]
- 11.
- Zelen M. Guidelines for publishing papers on cancer clinical trials: responsibilities of editors and authors. J Clin Oncol. 1983;1:164–169. [PubMed: 6668497]
- 12.
- Early Breast Cancer Trialists’ Collaborative Group. Effects of adjuvant tamoxifen and of cytotoxic therapy on mortality in early breast cancer. An overview of 61 randomized trials among 28,896 women. N Engl J Med. 1981;319:1687–1692. [PubMed: 3205265]
- 13.
- Nowak R. Problems in clinical trials go far beyond misconduct. Science. 1994;264:1538–1541. [PubMed: 8202708]

## Additional Reading

- [The following books contain good overall discussions of clinical trials.]
- Buyse ME, Staquet MJ, Sylvester RJ. Cancer clinical trials: methods and practice. New York, NY: Oxford University Press; 1984.
- Chow SC, Liu JP. Design and analysis of clinical trials: concepts and methodologies. New York, NY: Wiley; 1998.
- Friedman LM, Furberg CD, DeMets DL. Fundamentals of clinical trials. 3rd ed. New York, NY: Wiley; 1998.
- Meinert CL. Clinical trials: design, conduct, and analysis. New York, NY: Oxford University Press; 1986.
- Piantadosi S. Clinical trials: a methodologic perspective. New York, NY: Wiley; 1997.
- Pocock SJ. Clinical trials: a practical approach. New York, NY: Wiley; 1983.
- Spilker B. Guide to clinical trials. New York, NY: Raven Press; 1991.
- The planning of sequential trials requires complex calculations and specialized software that is not widely available. A software package, suitable for personal computers, that is easy to use is called EAST and is available from Cytel Inc., Cambridge, MA.

Theory and Practice of Clinical Trials - Holland-Frei Cancer Medicine
