NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Institute of Medicine (US) Committee on the Use of Complementary and Alternative Medicine by the American Public. Complementary and Alternative Medicine in the United States. Washington (DC): National Academies Press (US); 2005.


3 Contemporary Approaches to Evidence of Treatment Effectiveness: A Context for CAM Research

Evidence of treatment effectiveness from clinical research has become integral to effective clinical care. This chapter provides a context for the committee's recommendations about research on complementary and alternative medicine (CAM) that appear in Chapters 4 and 5. A brief account of the development of present approaches to evidence-driven clinical and public policy (which includes practice guidelines and coverage policy) is presented. This is followed by a description of the basic ideas of clinical research design, including a taxonomy of study design and a taxonomy of outcome measurements. An account of some features of contemporary data analysis follows. The chapter concludes with an overview of the applicability of contemporary clinical research methods to some CAM therapies.


As noted in Chapter 1, over the past twenty years practitioners of conventional medicine have made a marked shift from a reliance on experience (directly observed or as recorded by others in medical journals) to a reliance on more rigorous research to evaluate the effectiveness of treatments. The concept of formal evaluation of therapies through randomized controlled trials is certainly not new (Kaptchuk and Kerr, 2004), but it has been applied regularly in Western medicine only since World War II (Byar, 1980; Cochrane, 1972).

Some notable exceptions to reliance on experience exist, however. In the middle of the nineteenth century, Florence Nightingale pioneered the application of epidemiological and statistical methods to the study of hospital deaths; her discovery of a plausible causal relationship between processes of care and outcomes challenged then-prevalent ideas about mechanisms of disease, changed clinical practice, and reduced mortality rates. In the early twentieth century, Ernest Amory Codman, a Boston surgeon, argued strongly for the formal study of surgical outcomes in an effort to understand which surgeons, hospitals, and surgical procedures produced good versus bad outcomes (Neuhauser, 2002). This effort did not take root and grow (in fact, it provoked significant hostility among Codman's colleagues), but it raised the question of the need for formal analysis of treatment outcomes, a question that was picked up again more than 50 years later.

The need for formal evidence of effectiveness for common medical and surgical interventions was recognized much more broadly beginning in the 1970s. Passage of Medicare and Medicaid legislation, together with apparent advances in medical and surgical care, contributed to a surge in health care spending. Policy makers and payers asked increasingly pointed questions about the “value” of health care, questions that required more fundamental questions about the effectiveness of interventions to be answered.

Even more disquieting questions emerged from a body of work that described striking variations in the rates of common surgical procedures such as surgery for benign prostate disease, hysterectomy, and tonsillectomy, among seemingly similar geographic regions. This work began in the late 1960s in northern New England, where isolated hospital market areas could easily be defined (Wennberg and Gittelsohn, 1982).

International differences in the rates of medical procedures were also observed; and when the variations within countries were adjusted for the overall rate of variation among countries, a consistent pattern was detected: a high degree of variation was a marker for a high degree of discretion, and a high degree of discretion was often explained by professional uncertainty about effectiveness (McPherson et al., 1982). At the time, few of the procedures in question had been subjected to randomized controlled clinical trials or other credible clinical studies. Subsequently, variations in the rates of medical admission, physician visits, and diagnostic tests that could not be explained by clinical variables were also found. Taken together, these findings raised new questions about the science base of clinical practice. If decisions were based on science, how could it be that treatment depended more on where one lived than on what was wrong or what one cared about? Policy makers wondered if high rates meant overuse and economic waste and if low rates meant underuse and deprivation. “Which rate was right?” became the pressing policy question, and the answer required a new investment in clinical research to better define the outcomes of common interventions for common conditions. Thus, the practice variation phenomenon provided the motivation and the rationale for the development of “outcomes research.”

The goal of outcomes research was to determine what was known and what was not known about common interventions, thereby setting research agendas for common conditions. Existing evidence was systematically reviewed by using techniques for combining data from different studies previously described in the social sciences. Claims data linked to Social Security Administration mortality and other administrative data were used to glean outcomes information to fill the gaps in knowledge that existed at the time. Patient surveys were conducted to capture patients' subjective responses to treatments and outcomes. Variations in these responses highlighted the importance of patients' preferences as a source of warranted variation in clinical decisions. Decision models (a model is a representation of reality) were constructed to test the relative sensitivities of decisions to key probabilities of good or bad outcomes and to patient preferences. Decision support tools were developed; and trials were designed to help patients and doctors choose among treatment options, including randomization in a trial when a well-informed patient was at equipoise, that is, finding each treatment equally acceptable.

Other investigators used consensus methods to develop appropriateness criteria that were then applied to the medical records of patients who had undergone procedures with high rates of variation by geographic region. It was common for 30 percent or more of the procedures to be judged inappropriate for the patients' indications. The proportion of procedures deemed inappropriate was essentially the same in high- and low-volume settings, so the low-volume providers were not simply doing a better job of selecting only appropriate cases.

This work was extended with a focus on guideline development. Professional organizations such as the American College of Physicians instituted rigorous guideline development processes, increasing recognition of the severe limitations of the evidence base for the treatment of common conditions.

Evidence of Effectiveness for Prescription Drugs

The limited evidence base for surgery and other procedures for the common conditions targeted by outcomes research contrasted sharply with the richer body of evidence for medical therapies. This difference can best be understood in the context of the regulation of medications that began in 1906 with the Pure Food and Drug Act, which made misrepresentation of ingredients illegal and which recognized the standardized drug formulae registered in a national formulary or pharmacopoeia. The 1906 act was silent on drug safety and efficacy.

The Federal Food, Drug, and Cosmetic Act of 1938 extended safeguards by introducing the distinction between prescription and over-the-counter drugs and by requiring pharmaceutical manufacturers to demonstrate a drug's safety before it could be released for use. This was a direct response to the elixir sulfanilamide tragedy, in which 107 people, most of them children, died when a new sulfa preparation was distributed without safety testing.

In 1951, the Durham-Humphrey Amendment (U.S. Statutes at Large, 1951) made it clear that the classification of drugs as prescription was up to the Food and Drug Administration (FDA), not manufacturers. In its initial form, the amendment authorized FDA to test drugs for efficacy as well as safety. However, the efficacy requirement was removed before passage of the amendment. The next significant change in legislation followed the thalidomide disaster in Europe, which was narrowly averted in the United States and which prompted passage of the Kefauver-Harris Amendment (U.S. Statutes at Large, 1962) in 1962. All clinical testing procedures now had to be approved by FDA and had to demonstrate efficacy as well as safety. The pharmaceutical industry resisted the efficacy requirement, especially the retroactive evaluation of drugs. However, in 1970, in the court case Upjohn Co. v. Finch (422 F.2d 944, 955 [6th Cir. 1970]), the Court of Appeals ruled that commercial success alone did not constitute substantial evidence of efficacy in the case of the Upjohn drug Panalba. Evidence of efficacy as well as of safety had become an enforceable standard for prescription drugs. Over the ensuing decades, the pharmaceutical industry and clinical research organizations (CROs) rapidly built the capacity to conduct the clinical trials necessary to meet the standards set by FDA.

More Recent Developments: Evidence-Based Medicine

As the need for evidence became more evident and funding for clinical research became more available, many academic settings emphasized the development and use of methods for gathering clinical evidence. The multidisciplinary collaborations that formed with federally funded “patient outcomes research teams” matured as methodological expertise in clinical epidemiology, decision theory, and other domains of quantitative and qualitative research was honed. The “clinimetrics” work of Alvan Feinstein at Yale attracted more attention. David Sackett and colleagues in Canada and later England defined clinical epidemiology as the “basic science for clinicians.” Courses were devised to teach students and physicians how to critically appraise the medical literature. Fellowship and clinical scholar programs turned out clinical scientists with the skills necessary to generate and accurately interpret evidence. The Canadian Task Force on the Periodic Health Examination and later the U.S. Preventive Services Task Force developed ratings for “levels of evidence,” which are described in detail later in this chapter. The Cochrane Collaboration was formed in Oxford, England, to systematically examine evidence for the full breadth of medical practice and quickly attracted an extensive international following. Journals with the titles Evidence-Based Medicine and Evidence-Based Health Care appeared. By the mid-1990s, the concept of formal, scientific evidence of treatment effectiveness had arrived, at least in some circles.

The goal of evidence-based medicine is to ensure that, to the extent possible, individual clinical decisions and broader health policy decisions about tests and treatments be based on the published results of rigorous studies of efficacy and effectiveness. Because not all treatments have been subjected to formal study and because some treatments cannot be studied without investment in massive clinical trials, it will not be possible to base all treatment decisions on published evidence. Nevertheless, the concept of evidence-based medicine asks that decisions be based on published scientific evidence when it is available and that investments be made to gather evidence in as many areas of medical care as possible.

In conventional medicine, there is now a general acceptance of the need to carefully study the effectiveness of tests and treatments, even those that have already become frequently used. Just in the past 3 years, prominent studies have challenged the effectiveness of bone marrow transplantation and high-dose chemotherapy for breast cancer (Farquhar et al., 2003), arthroscopic surgery for osteoarthritis of the knee (Moseley et al., 2002), and the use of estrogen replacement therapy during menopause (Rossouw et al., 2002).

There are clearly some treatments for which evidence of effectiveness is immediate and compelling. It may be unnecessary or even unethical to conduct formal effectiveness trials in a variety of situations, for example, when a treatment results in a combination of a clear reversal or elimination of a disease process, has a short latency of noticeable effect, is nearly universally effective in all patients treated, and eliminates clinical symptoms. The use of penicillin in the mid-1940s, surgery for appendicitis, and resection of localized cancers all stand as examples of this sort of undisputed effectiveness. Even in these examples, there may be value in conducting long-term surveillance studies to detect rare or late complications or side effects, and it may be appropriate to conduct formal cost-effectiveness or cost-benefit studies.

At the other end of the spectrum are those interventions that have modest effects, if any at all. It is these interventions that require studies with rigorous design and of rigorous execution to determine whether an effect does indeed exist and to estimate its size.

The next section examines a variety of research methods available for use in conducting clinical effectiveness research.


A Taxonomy of Clinical Research Methods

Many factors can influence the outcome of treatment. These include the treatment itself, characteristics of the patient (such as age, gender, and comorbid conditions), other treatments, access to care, adherence to treatment plans, socioeconomic status and education, and the skill of the practitioner. In treatment effectiveness research, the goal is to evaluate the contribution of one of these factors, treatment, to determine whether treatment makes a difference. Doing so can be difficult if other factors are at play, as they often are. The goal of study designs is usually to make it possible to assess the contribution of the treatment after the other influences on outcome are taken into account.

In a study comparing two clinical interventions, the goal is to be sure that any difference observed is due to the differences in the two interventions rather than some other factor. The “some other factor” is a “confounder,” because it confounds one's efforts to draw the conclusion that differences between the interventions are responsible for the differences in outcomes.

Random Assignment to Treatment or Control

The best way to be sure that one can draw a strong conclusion from a difference in outcomes is to assign subjects randomly to receive one intervention or the other. If the randomization is successful and the number of patients is large enough, the two study populations will be essentially identical except for the different interventions. If one conducts the study so that, except for the intervention, the study populations are also identical at the end of the study, the researchers can make a very strong inference that the cause of the differences in outcomes is the difference in the interventions. Randomization is powerful because it ensures that the two populations are similar in every respect except for the intervention to which the researchers randomly assigned the patients. This claim means that if the study groups are large enough and the randomization was successful, the frequencies of all known factors (e.g., age, gender, and comorbid conditions) are similar in both groups; in addition, and perhaps more importantly, the frequency of any unknown or unmeasured factor will be the same in both groups. Randomized trials stand at the top of the hierarchy of evidence because they make it possible to infer a cause-and-effect relationship between an intervention and an outcome. Because the study groups are identical except for the intervention, any effect on outcomes must be due to the intervention.
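The balancing effect of randomization described above can be illustrated with a small simulation (all numbers hypothetical). With a simple coin-flip assignment of 10,000 simulated patients, both a measured factor (age) and a deliberately "unmeasured" factor end up with nearly identical means in the two groups:

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical cohort: each patient has one measured factor (age) and one
# unmeasured factor that might also influence outcome.
patients = [{"age": random.gauss(70, 8), "unmeasured": random.random()}
            for _ in range(10_000)]

# Random assignment: a coin flip decides each patient's group.
treatment, control = [], []
for p in patients:
    (treatment if random.random() < 0.5 else control).append(p)

# With large enough groups, measured AND unmeasured factors balance.
for factor in ("age", "unmeasured"):
    t_mean = statistics.mean(p[factor] for p in treatment)
    c_mean = statistics.mean(p[factor] for p in control)
    print(f"{factor}: treatment mean {t_mean:.3f}, control mean {c_mean:.3f}")
```

The key point is that nothing about the unmeasured factor was used in the assignment, yet its distribution is balanced anyway; that is what no amount of statistical adjustment in an observational study can guarantee.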

Observational Studies

Other methods for studying the effects of two interventions rely on data derived from the observation of care. In contrast to a randomized trial, no one intervenes in an observational study. Instead, researchers use information about the patients to try to make inferences about the relationship of clinical factors (including treatment) to clinical outcomes. Sometimes, researchers collect the information systematically (a prospective study); other times, the data represent patient care as it happened in the past (retrospective study). In either case, the crucial distinguishing feature of an observational study is that receipt of the intervention depends on clinical circumstances and preferences rather than deliberate assignment, as in a randomized clinical trial. These circumstances that influence choice of treatment may also influence the outcomes. Differences in outcomes may therefore be due to the intervention or other circumstances, or a combination of both.

Observational studies have many advantages. They are much less costly than randomized trials, they can have huge study populations, and the results are more likely to represent actual practice (Benson and Hartz, 2000). However, the circumstances that influence the choice of treatment often confound the interpretation of differences in outcomes. The frequencies of potential confounders may differ between those who receive the intervention and those who do not. To evaluate differences in outcomes independently of the influence of possible confounders, researchers apply multivariable regression techniques to the data. The variables in the statistical model are of two kinds: the candidate predictor variables (the treatment itself and other potential confounders, such as demographic characteristics, clinical characteristics of the patient, comorbid conditions, and socioeconomic factors) and the dependent variable (the outcome that the model is trying to predict). These techniques effectively adjust the frequencies of the measured confounders so that they occur at the same rate in both the treatment and the no-treatment groups. If the treatment is still a statistically significant predictor of an outcome, researchers can infer an association between the outcome and the treatment. However, they cannot infer that the treatment causes the outcome, because the statistical techniques can adjust only for the confounders that the researchers measured. Unmeasured confounders are thus the bane of researchers who conduct observational studies.
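Regression is the usual adjustment tool, but the underlying idea can be shown more simply by stratifying on a single confounder. In this deliberately constructed, hypothetical table, older patients are both more likely to receive the treatment and more likely to do poorly, so the crude (pooled) comparison makes a treatment with no real effect look harmful, while the within-stratum comparison reveals the truth:

```python
# Counts of (good outcome, total) by age stratum and treatment group.
# Constructed so that, WITHIN each stratum, outcomes are identical for
# treated and untreated patients: the treatment has no real effect.
strata = {
    "younger": {"treated": (18, 20), "untreated": (72, 80)},  # 90% good
    "older":   {"treated": (40, 80), "untreated": (10, 20)},  # 50% good
}

def rate(good, total):
    return good / total

# Crude (unadjusted) comparison pools the strata together.
crude_treated = rate(18 + 40, 20 + 80)    # 58/100 = 0.58
crude_untreated = rate(72 + 10, 80 + 20)  # 82/100 = 0.82
print(f"crude: treated {crude_treated:.2f} vs untreated {crude_untreated:.2f}")

# Adjusted comparison looks within each age stratum: no difference.
for name, arms in strata.items():
    t = rate(*arms["treated"])
    u = rate(*arms["untreated"])
    print(f"{name}: treated {t:.2f} vs untreated {u:.2f}")
```

Multivariable regression generalizes this stratification to many confounders at once, but it shares the same limitation: it can adjust only for confounders that were measured.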

Therefore, the possibility that a confounding variable is responsible for observed differences means that one must express conclusions in terms of association rather than causation. Even then, researchers must be cautious in their conclusions because it is possible that the apparent association between two variables is actually the result of a third variable (the confounder) that is affecting the two variables at the same time so that they change in concert. In observational studies, the researchers must guard against concluding that the change in one variable is the consequence of the change in the other variable (cause and effect).

Types of Observational Studies. Observational studies come in several forms: cohort studies, case-control studies, case series, and cross-sectional and longitudinal studies. Each of these is described below.

Cohort Studies. A cohort study (in the context of treatment effectiveness research) is the formal collection and analysis of data on treatments and outcomes for a defined set of patients with similar clinical characteristics. For example, a researcher might study pain and disability levels as outcomes in a cohort of patients older than 70 years of age who received lumbar fusion surgery for severe sciatica. The distinguishing feature of cohort studies is that researchers gather data on treatment and possible confounders at one point in time and measure outcomes at a later point in time. Cohort studies are a relatively powerful form of study design because researchers can often statistically adjust the final outcomes (e.g., levels of pain) for differences in the outcome variable at the beginning of the study (pain levels before surgery) and because they can measure the outcome variable at many points in time (e.g., from monthly pain reports for up to 2 years after surgery).

The assembly of a cohort is the first step. It may take place in the present as a deliberate, planned activity in which the researchers gather data on the present state of the participants (prospective cohort), or it may rely upon data gathered in the past (retrospective cohort). In either case, the investigators use specific inclusion and exclusion criteria to define a group of people with many similarities. Even though members of the cohort are similar in terms of the inclusion criteria (in the example cited above, all patients will be older than age 70 years, all will have had fusion surgery, and all will have had severe sciatica before surgery), they will inevitably differ in many other predictors of outcome. For example, some members of the spine surgery cohort may be 70 years old and others may be 85 years old. Some may be overweight and others may be thin. Some will be engaged in regular physical activity, and others will be sedentary. Some will have a spouse or caregiver available to help with work at home; others will be on their own. All of these factors, and countless others, may have influences on treatment outcomes. Researchers try to identify and record as many of these factors as possible, but inevitably some potentially important factors are not measured.

Outcome measurement is the second major step. The researchers measure outcomes at a future time relative to the date of cohort assembly. With a prospective cohort, measurement of outcomes occurs in the future at specific time points relative to the date of treatment. With a retrospective cohort, the outcomes may have occurred in the past relative to the date of treatment and may have to be abstracted from existing data systems or may still occur in the future if members of the cohort are still alive and available for follow-up.

Case-Control Studies. The study population in a case-control study in the domain of treatment effectiveness consists of the cases (those with the target outcome, such as complete pain relief) and the controls (those without the target outcome, for example, those with continued pain). Case-control studies are especially well suited to studies of rare events, because cases (those experiencing the event) are oversampled relative to the controls. Case-control studies are typically retrospective, in that the researchers assemble the study population after the measurable outcome events have occurred. If adequate numbers of patients are available, researchers choose the controls by matching each control patient (or several control patients) to one case patient for variables such as age, sex, and date of entry into the population from which the researchers identify cases and controls. The next step is to measure rates of exposure to a treatment (e.g., a surgical procedure) for the cases and the controls. The ratio of the odds of exposure to the intervention among those who experienced the outcome (cases) to the odds among those who did not (controls) is mathematically equivalent to the ratio of the odds of the outcome among those exposed to the intervention to the odds among those not exposed. Thus, the result of a case-control study is an odds ratio for the target condition in exposed versus unexposed patients. Researchers may use regression techniques to adjust the cases and controls for differences in potential confounders. The Achilles heel of a case-control study is confounding, and the researchers' greatest challenge is assembling a control group that minimizes it. One way to accomplish this task is to choose cases and controls from a cohort that the researchers assembled using the same inclusion and exclusion criteria (a so-called nested case-control study).
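The equivalence of the two odds ratios can be verified with a small worked example (all counts hypothetical). The odds ratio computed in the "backwards" direction that a case-control study actually measures (odds of exposure, cases versus controls) equals the odds ratio computed in the "forwards" clinical direction (odds of the outcome, exposed versus unexposed), because both reduce algebraically to ad/bc:

```python
# Hypothetical 2x2 table from a case-control study of one treatment exposure:
#                 cases (outcome)   controls (no outcome)
#   exposed            a = 30             b = 70
#   unexposed          c = 10             d = 90
a, b, c, d = 30, 70, 10, 90

# Direction the study measures: odds of EXPOSURE among cases vs controls.
or_exposure = (a / c) / (b / d)

# Direction of clinical interest: odds of the OUTCOME, exposed vs unexposed.
or_outcome = (a / b) / (c / d)

# Both reduce to a*d / (b*c), which is why a retrospective case-control
# design can estimate the association between exposure and outcome.
print(f"exposure OR = {or_exposure:.3f}, outcome OR = {or_outcome:.3f}")
```

This symmetry is the mathematical fact that makes the case-control design workable: the study never has to follow exposed and unexposed groups forward in time to estimate the odds ratio.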

Case Series. A case series is simply a serial collection of patients with some defining characteristic. A typical case series is a group of patients who have a rare diagnosis or who have undergone a new surgical procedure. In the context of treatment effectiveness research, a case series would be a consecutive set of patients who received a particular treatment. Case series do not have controls, so it is very difficult to make any inferences about whether an intervention (a treatment or surgical procedure) had any effect. A notable exception was the first group of patients with pneumonia who received penicillin and experienced rates of survival that were unprecedented in the era before penicillin. In surgical research, it is reasonably common to publish the results of case series and to compare the outcomes to those of other published case series of patients with the same underlying condition. These “historical controls” provide a basis for comparison of outcomes for the new treatment, but it is even more difficult to draw inferences from a case series than from either a prospective or a retrospective cohort study: because the patients in the comparison group were treated at a different place and at a different time, confounders related to place and time are added to the confounders, present in cohort studies as well, related to the patients' clinical and personal characteristics.

Cross-Sectional and Longitudinal Studies. Cross-sectional studies measure the relationship between variables at a single point in time. They are a relatively weak study design for testing hypotheses about treatment-outcome relationships because they rely upon a single measurement of each variable. A survey is a typical cross-sectional study design. Longitudinal studies measure the relationship between variables at two or more points in time. In effectiveness studies, longitudinal studies would typically involve the measurement of outcomes at several points after treatment. They are a relatively powerful method for testing hypotheses because repeating a measurement for an individual (even once) reduces statistical variation and narrows confidence intervals.
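The statistical payoff of repeated measurement can be sketched with a small simulation (all parameters hypothetical): averaging k noisy measurements of the same underlying quantity shrinks the spread of the estimate by a factor of the square root of k, which is what narrows the confidence interval:

```python
import math
import random
import statistics

random.seed(1)

TRUE_SCORE = 5.0  # a patient's underlying "true" pain score (hypothetical)
NOISE_SD = 2.0    # variability of any single measurement around that truth

def averaged_estimate(k):
    """Mean of k noisy measurements of the same underlying quantity."""
    return statistics.mean(random.gauss(TRUE_SCORE, NOISE_SD) for _ in range(k))

# Simulate many patients; compare the spread of the estimate when each
# patient is measured once versus twelve times (e.g., monthly for a year).
for k in (1, 12):
    estimates = [averaged_estimate(k) for _ in range(2000)]
    observed = statistics.stdev(estimates)
    expected = NOISE_SD / math.sqrt(k)  # standard error shrinks as 1/sqrt(k)
    print(f"k={k:2d}: observed spread {observed:.2f}, theory {expected:.2f}")
```

With twelve measurements per patient, the spread of the per-patient estimate is roughly a third of the single-measurement spread, so a longitudinal design can detect smaller treatment effects with the same number of patients.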

Clinical Outcomes: A Taxonomy

Treatment outcomes can be objective or subjective. Objective outcomes are visible or measurable to people other than the patient, and subjective outcomes can be felt or reported only by the patient. One of the major contributions of the outcomes management movement in the late 1980s and early 1990s was to raise the status of subjective measures as valid scientific endpoints in clinical trials and other forms of research studies. Advances in the technology of subjective measurement made that change possible, so that it is now common to find a mix of subjective and objective endpoints in many clinical trials.

Subjective Outcomes

Subjective outcomes include those symptoms and other aspects of a patient's experience that are not directly observable by others, but that represent the goals of treatment. Pain, sensations of nausea or dizziness, functional status, ability to perform activities of daily living, and experience of moods or emotional states are examples of subjective outcomes for which well-developed and widely used measures exist (Bowling, 1997; Frank-Stromborg and Olsen, 1997; McDowell and Newell, 1996). Because there is no direct way to validate a patient's report of pain level or mood state, the development of valid measures requires careful attention to issues of reliability (i.e., whether measures taken at two adjacent points in time yield the same result or whether two closely related versions of the same scale yield the same result) and convergent validity (i.e., whether the results of two presumably related, but different, measures actually yield similar results). Because patients' responses to single questions or item formats may be affected by idiosyncrasies of wording and interpretation, it is common for measures of subjective outcomes to be based on multi-item scales with different wordings and response formats. Patient reports may also be sensitive to context or contrast effects (for example, a relatively modest “absolute” level of pain may feel uncomfortable if it is new but may feel very minor if it has been preceded by a long period of excruciating, severe pain).
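Test-retest reliability of the kind described here is commonly quantified with a correlation coefficient between two administrations of the same scale. A minimal sketch, using entirely simulated scores (the patient counts, score range, and noise level are all hypothetical):

```python
import math
import random
import statistics

random.seed(7)

# Hypothetical: 500 patients complete the same pain scale twice. Each patient
# has a stable true score; each administration adds independent noise.
true_scores = [random.uniform(0, 10) for _ in range(500)]
time1 = [t + random.gauss(0, 1) for t in true_scores]
time2 = [t + random.gauss(0, 1) for t in true_scores]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A high correlation between the two administrations indicates that the
# scale yields consistent results at two adjacent points in time.
print(f"test-retest reliability r = {pearson(time1, time2):.2f}")
```

The same correlation machinery serves for convergent validity: instead of correlating two administrations of one scale, one correlates two different scales that are presumed to measure related constructs.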

The subjective domains for which well-established measures exist cover many of the endpoints of CAM treatments. Existing measures can be and have been used in studies of the effectiveness of treatments involving CAM. Some subjective domains are specific to particular CAM modalities (e.g., feelings of “centeredness” or “wholeness”), and some additional measurement work may be required for these modalities; but in principle, virtually any subjective experience can be captured, either as present or absent or as present as a matter of degree.

Because subjective experiences cannot be independently validated, and because they can be significantly affected by context, contrast, and expectation effects, it is particularly important to try to build in features of the study design that minimize these kinds of biases. “Blinding” the patient to the specific treatment that he or she has received, for example, is a way to minimize the effects of expectations on reports of subjective outcomes. Careful selection of patients who are all similar in terms of the level of pain or disability at the time of treatment is a way to minimize contrast effects. Having the outcome assessment done by a person other than the treating clinician is a way to minimize the biasing effects from the desire of a patient to please the clinician.

Objective Outcomes

Objective outcomes include those things that can be felt, seen, heard, or measured in some way by someone other than the patient. Status as alive or dead, weight, blood pressure, tumor size, white blood cell counts, and levels of blood sugar all represent examples of objective measures used regularly in treatment effectiveness studies. In conventional medicine, the vast array of laboratory tests, radiologic and nuclear imaging studies, physical examination findings, and electrical recordings (electrocardiograms, electroencephalograms, and electromyograms) provides potential objective outcome measures for specific treatments.

Even though the data for objective outcomes can often be stored and made available for repeat testing by other observers (e.g., two independent radiologists reading the same X-ray image), there is still some potential for bias, particularly if the person providing a treatment or studying its effects is the person performing a test or interpreting the results of a test. Many objective tests still involve some judgment or interpretation on the part of the person performing or interpreting the test, so it is important to design effectiveness studies in ways that minimize any biasing effects of those judgments or interpretations. The most common way to minimize bias is to make sure that the person reading an X-ray image, taking and recording a blood pressure, or interpreting an electroencephalogram is not someone with a vested interest in the results of the study and to try, whenever possible, to blind the person doing the test to any knowledge of the study treatment that the patient has received. Doing these two things minimizes the opportunity for any systematic bias either for or against a particular treatment being studied.

To the extent that treatments involving CAM are designed to influence objective endpoints like survival, blood pressure, tumor size, or the alignment and spacing of vertebrae, standard measures of these endpoints, with appropriate blinding and other controls for bias, should be suitable for treatment effectiveness research in CAM.


In the last 10 to 15 years, a consensus has emerged about the types of scientific evidence required to establish the efficacy or the effectiveness of a treatment. Doctors and patients adopt health interventions, such as a CAM therapy, because someone will pay the cost (such as an insurance company or Medicare) and because the benefits of the intervention sufficiently outweigh its harms. These benefits and harms are the outcomes of the use of the intervention. The outcomes movement or the evidence-based medicine movement in health care is simply the application of the principle that societal and individual acceptance of a health care intervention should depend on the balance of its benefits, harms, and costs. The measurement of those benefits and harms, the balance of harms and benefits, and how that balance compares with the balance of harms and benefits of an established treatment are at the heart of most clinical research. This section discusses some core principles in comparisons and evaluations of health care interventions.

Many possible solutions or interventions exist for every human complaint and every affliction. Some interventions gain general acceptance and become standard treatment. Some of these have strong scientific evidence to support their use. That is, investigators have tested formal hypotheses about the interventions according to established principles and have found that they are clearly superior. Others have little or no scientific evidence to support their use but have become accepted as effective because of their long-term use. For practicing clinicians, clinical research often addresses the question asked by many patients, “This treatment has worked for me in the past, so why should I switch to another, less established treatment?” The way to answer this question is a head-to-head comparison of the two treatments. In designing a study to this end and analyzing the results, a number of considerations can be important. This section describes some of these considerations, with particular emphasis on how they apply to the problem of predicting the response to an intervention.

Predicting the Response to an Intervention

The data set from a randomized study of two treatments will contain many different variables. Among them is a measure of outcome (e.g., blood pressure at the end of the study); this variable would be the dependent variable in a multivariable model (the goal of the model would be to predict the end-of-study blood pressure). Another variable would be treatment assignment (Drug A or Drug B); treatment assignment would be a predictor variable (or an independent variable). Other predictor variables might include age, sex, ethnicity, pretreatment blood pressure, and dietary salt intake. One form of a multivariable model would include all of the predictor variables and might show that several predictor variables were significant predictors of the end-of-study blood pressure. One of them would be Drug A and another might be salt intake. In a multivariable regression model, these two would be independent predictors of outcome. This result would not prove that salt intake was a mediator of the response to Drug A, in the sense that Drug A had a greater effect in the presence of a low-salt diet versus high-salt diet. However, the model could be set up in a different way, with so-called interaction terms reflecting the extent to which the effects of Drug A and salt intake vary as the levels of the other variables vary. If the regression coefficient for this interaction term was significantly different from 0, a researcher could say that the interaction between Drug A and salt intake was a predictor of end-of-study blood pressure. In other words, the effect of Drug A on blood pressure was not the same for all levels of salt intake, and the effect of salt intake on blood pressure differed as a function of Drug A's presence or absence. The search for predictors of response would involve a search for significant interaction terms between the intervention and the predictor variables.
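To make the idea of an interaction term concrete, the blood pressure example above can be sketched in a few lines of code. The group means below are invented for illustration; the difference of differences computed here is what the coefficient on the Drug A by salt intake interaction term captures in a regression model.

```python
# Hypothetical mean end-of-study systolic blood pressures (mmHg) in four
# groups defined by treatment assignment and salt intake (invented numbers).
mean_bp = {
    ("drug_a", "low_salt"): 128.0,
    ("drug_a", "high_salt"): 142.0,
    ("drug_b", "low_salt"): 138.0,
    ("drug_b", "high_salt"): 144.0,
}

# Effect of Drug A (relative to Drug B) at each level of salt intake.
effect_low = mean_bp[("drug_a", "low_salt")] - mean_bp[("drug_b", "low_salt")]    # -10.0
effect_high = mean_bp[("drug_a", "high_salt")] - mean_bp[("drug_b", "high_salt")]  # -2.0

# The interaction is the difference of these differences; in a multivariable
# regression it corresponds to the coefficient on the Drug A x salt term.
interaction = effect_high - effect_low  # 8.0

print(effect_low, effect_high, interaction)
```

Because the effect of Drug A is larger on a low-salt diet (10 mmHg) than on a high-salt diet (2 mmHg), the interaction term is nonzero, which is exactly the pattern the text describes a researcher searching for.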

Stability of Predictive Models

A key element of a predictive model is its stability (its ability to give the same result if the study were repeated with a new sample of patients). One way to predict the stability of a model is to count the number of outcome events. If the number of outcome events is small relative to the number of predictor variables in the model (a common rule of thumb requires at least 10 outcome events per predictor variable), the model is likely to be unstable. Thus, a model for the prediction of mortality after a surgical procedure would likely be unstable if the study recorded 15 deaths and the predictive model tested 8 variables. For this model to be stable, at least 80 deaths should have occurred. The reason that a model's stability depends on the ratio of the number of outcome events to the number of predictor variables is that small samples are likely to differ from the parent population to a greater degree than large samples would (the law of small numbers). One needs substantial numbers of people who experience the outcome, because statistical models for the identification of good predictors rely upon differences in the frequencies of the potential predictors between those who experience the outcome and those who do not.
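The rule-of-thumb arithmetic above reduces to a simple calculation; a sketch follows, using the surgical-mortality numbers from the text (the 10-events-per-variable threshold is the conventional rule of thumb, not a law).

```python
def max_predictors(n_outcome_events, events_per_variable=10):
    """Rule-of-thumb ceiling on how many predictor variables a model
    can reliably test, given the number of outcome events observed."""
    return n_outcome_events // events_per_variable

def events_needed(n_predictors, events_per_variable=10):
    """Minimum number of outcome events needed to test n_predictors variables."""
    return n_predictors * events_per_variable

# The example from the text: 15 deaths recorded, 8 candidate variables tested.
print(max_predictors(15))  # 1: only one predictor is supportable
print(events_needed(8))    # 80: deaths needed to test all 8 variables
```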

Distinguishing Between Intermediate and Distal Outcomes

It is important for researchers to decide whether to be satisfied with measuring intermediate outcomes (outcomes that predict death or long-term disability but that are not death or long-term disability themselves) as a result of an intervention or whether to measure the effects of the intervention on death and disability rates directly (distal outcomes). The treatment of hypertension is an informative example. One can compare two antihypertensive drugs by their effects on blood pressure, or one can compare their effects by determining the rates of strokes, congestive heart failure, or myocardial infarction in the two groups of patients taking the two drugs. Blood pressure is an intermediate outcome, of interest mostly because it predicts outcomes that frighten patients, such as stroke. It is much easier and much less costly to measure blood pressure than it is to measure stroke rates, which are typically low and which require many patient-years of monitoring for accurate measurement. Once several studies show, however, that lower blood pressure predicts fewer strokes, the decision to pay for a new antihypertensive drug might well rest on its effect on blood pressure, with the implicit assumption that the lower blood pressures achieved with the new drug relative to those achieved with standard therapy imply that patients will experience fewer strokes. Therefore, although distal outcomes mean more to patients, it may not be necessary to measure them to determine whether a new drug is equal to or better than the standard therapy. The usefulness of intermediate outcomes in clinical research rests on a scientifically established connection between the intermediate outcome and the distal outcome. It also involves the assumption that the effect of a drug on blood pressure is the only determinant of stroke rates; a drug that affected both blood pressure and blood clotting might have a greater effect on stroke rates than its effect on blood pressure alone would predict.

Standard Outcome Instruments Versus Customized Instruments

The rationale for standardizing measures of outcome rests upon the value placed on being able to compare the effects of different interventions. Compartmentalization of health care policy is shortsighted in the long run. For example, Medicare could be concerned about whether to pay for left ventricular assist devices for damaged hearts. If resources are limited, Medicare should consider alternative uses of the money expended on left ventricular assist devices. If decision makers ask, "What must we give up in order to pay for X?" they need a method that they can use to compare the interventions for one disease with the interventions for another disease.

The growth of standardized outcome measurements is a response to the need to compare the effects of treating diseases that have different outcomes. How does one compare a treatment for diabetes with a treatment for back pain? One uses some measure that both diseases affect, such as ability to function in daily life (the SF-36 questionnaire or activities of daily living). Instruments such as the SF-36 have been used in so many studies that scores serve as a way to compare one study population with another. Many disease-specific instruments are also available.


Cost-effectiveness analysis is an important tool for evaluating health care interventions. According to Garber et al. (1997), cost-effectiveness is “a method designed to assess the comparative impact of expenditures on different health outcomes.” The importance of cost-effectiveness is its salience to setting policies in a setting in which resources are constrained and people must make choices that involve trading off costs against effectiveness.

The salient word in describing cost-effectiveness analysis is “comparative.” The cost-effectiveness of one intervention is always in relation to the cost-effectiveness of something else, even if the alternative is doing nothing. The cost-effectiveness of Intervention A compared with Intervention B is

Cost-effectiveness of A relative to B = (Cost of A - Cost of B) / (Effectiveness of A - Effectiveness of B)

where effectiveness is measured in clinical units and where QALY represents a quality-adjusted life year. A QALY is a year living in a specific state of health relative to the most desirable health state, usually perfect health. The relationship between a desirable health state and a less desirable one is a number, called a utility, which expresses the ratio of the patient's preference for a specific health state compared with that for perfect health. Thus, if the patient says a year with heart failure is equivalent to 80 percent of a year spent in perfect health, the utility for heart failure is 0.8. Thus, a year spent by one person living with heart failure counts as “0.8 QALY” and a year spent in perfect health counts as “1.0 QALY.” In large populations of people, one can calculate the total number of QALYs that the population experiences over a period of time by adding the QALY values accumulated by each individual in the population over the time period being considered.
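The QALY accumulation described above can be sketched as follows. The durations are invented for illustration; the 0.8 utility for heart failure is the one used in the text's example.

```python
# Each entry: (years spent in a health state, utility of that state).
# 1.0 = perfect health; 0.8 = heart failure, as in the example above.
patient_history = [
    (3.0, 1.0),  # 3 years in perfect health
    (5.0, 0.8),  # 5 years living with heart failure
]

# QALYs accumulate as time-in-state weighted by the utility of that state.
qalys = sum(years * utility for years, utility in patient_history)
print(qalys)  # 3 * 1.0 + 5 * 0.8 = 7.0
```

For a population, the same sum is simply taken over every individual's history for the period being considered.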

When one expresses effectiveness in QALYs, it is possible to compare several different interventions directly (e.g., the treatment of hypertension and screening for lung cancer). This advantage is very important for health care planning, in which those who provide or pay for health care assemble a package of health care interventions, because one can choose interventions on the basis of cost-effectiveness and obtain a package of services whose components have values—expressed as the health benefits obtained for the money expended—that are consistent with one another.

The analyst could express cost-effectiveness differently (e.g., the cost per patient death postponed or the cost per case of lung cancer detected). These methods are easier, but their use means that the power to compare interventions for two entirely different health conditions is given up. Estimation of the effect of an intervention on life expectancy, especially when life expectancy is adjusted for quality of life, requires a great deal more work, but most cost-effectiveness analyses calculate the cost per QALY.

In a typical cost-effectiveness analysis, the analyst uses a mathematical model to represent the alternative actions (e.g., a treatment, a clinical test, or observation) and their health consequences. The analyst represents the uncertainty of future events (e.g., death or survival after surgery) as a probability ranging from 0 to 1, and the outcome of various sets of events as a health state (e.g., death or survival with heart failure). The value of a given health state at a point in time is expressed on a 0-1 scale as a "utility," and the cumulative values of health states over a fixed period of time are expressed as quality-adjusted life years (QALYs) (time spent in a given health state times the value placed on life in that health state).

Measurement of Preferences

The QALY is a key element of cost-effectiveness analysis because it expresses length of life in a common unit: healthy years. By using a common unit, one can compare the desirabilities of two different health states. To compare two outcome states quantitatively, one must characterize each by a number that reflects the desirability of the outcome state. This number is the utility of the health state, usually expressed on a scale from 0 to 1. Several methods exist for eliciting a person's utility for a health state, although in many cost-effectiveness analyses the researchers adopt utilities obtained from large population-based surveys (Torrance, 1986, 1987). The first method, the standard reference gamble, is the most theoretically sound but is also the most difficult for a patient to do. The second method, the time trade-off, in which the patient is asked how many years in a particular health state are equivalent to the patient's life expectancy in perfect health, is easier. With this method a patient is asked, "How many of your 20 years of expected life in your current health state would you give up to have perfect health for the rest of your life?" The third method is the easiest but the least theoretically sound: "Point to a place on a scale from 0 to 10 that characterizes your feeling about a health state."
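The time trade-off question quoted above translates directly into arithmetic. A minimal sketch follows, using the 20-year example from the text and assuming, for illustration, that the patient answers "4 years."

```python
def tto_utility(life_expectancy_years, years_given_up):
    """Time trade-off utility: the fraction of remaining life expectancy
    the patient would keep in exchange for perfect health."""
    return (life_expectancy_years - years_given_up) / life_expectancy_years

# A patient with 20 expected years in the current health state who would
# give up 4 of them to have perfect health implies a utility of 0.8.
print(tto_utility(20, 4))  # 0.8
```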

Do Outcomes Differ? Statistical Analysis

Study populations are necessarily samples of the universe of people who are eligible for a study. If the study population is small, the outcomes are more likely to differ by chance from the outcomes that would occur in the universe from which the sample was drawn. Because the results from any one sample can be atypical, scientists use the concept of the 95 percent confidence interval, which is the range of outcomes that would occur in 95 of 100 samples from the universe. One can calculate the 95 percent confidence interval for the difference between two outcomes. If the 95 percent confidence interval for the difference includes 0 (no difference), the results of the study that gave rise to the difference are consistent with no difference. Statistical tests estimate the probability that a difference in outcomes is consistent with chance, usually expressed as the "p value." Outcome studies increasingly report the confidence interval of the absolute difference in outcomes related to two interventions. This method provides a graphic measure of the uncertainty in a conclusion. If the confidence interval of the difference includes 0 or a value that is very close to 0, the difference is not statistically significant or is of borderline statistical significance, respectively. If the lower limit of the confidence interval of the difference is far from 0, one can be confident that the difference is unlikely to be the product of chance variation in samples drawn from the same universe.
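The confidence-interval reasoning above can be sketched with the usual normal approximation for a difference in proportions. The event counts below are invented for illustration.

```python
import math

def diff_ci_95(events_a, n_a, events_b, n_b):
    """Approximate 95 percent confidence interval for the absolute
    difference in outcome proportions between two groups (normal
    approximation to the binomial)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

# Hypothetical trial: 30/200 events with treatment vs. 50/200 with control.
diff, (lo, hi) = diff_ci_95(30, 200, 50, 200)
print(round(diff, 3), round(lo, 3), round(hi, 3))
```

Here the interval (about -0.178 to -0.022) lies entirely below 0, so the difference would be judged statistically significant at the 5 percent level; an interval that straddled 0 would be consistent with no difference.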

Confidence intervals enter into the interpretation of predictive models designed to identify clinical predictors of a response to a treatment, such as an element of a package of CAM interventions for a clinical problem. The coefficient of an interaction term has a confidence interval. If it includes 0, the interaction term is not a statistically significant predictor of the dependent variable (e.g., Drug A, salt intake, and end-of-study blood pressure, as in the earlier example).

Measurement Error

Measurement error adds uncertainty: it widens the 95 percent confidence interval. Failure to take measurement error into account will lead one to overestimate precision and to draw incorrect conclusions about differences in outcomes.

Effectiveness Versus Efficacy Studies

Efficacy Studies

By common agreement, an efficacy study compares two technologies under strictly controlled conditions designed to show a difference if a difference is truly present. Typically, an efficacy study will exclude patients who are likely to die of diseases other than the target disease of the technology under study, to maximize the information value of each death in characterizing the two technologies. The study population of an efficacy study is typically narrowly defined (and therefore relatively small), which means that the patients are very similar to one another and, therefore, that the results may not apply to a wider population. All measurements take place under optimum conditions, and the doctors interpreting the test results undergo special training so that they give the same interpretation, for example, to the same computed tomography scan, eliminating one source of measurement error. Typically, efficacy studies precede effectiveness studies, and their results serve as a "proof of principle." After proof of principle, other studies may explore the size of the effect in different study populations, at different clinical sites, and under different conditions of practice. These are effectiveness studies.

Effectiveness Studies

Effectiveness studies evaluate the technology under real-world conditions of actual medical practice. The study population resembles that which one would see in day-to-day clinical practice, which means that any results are likely to apply to real-world clinical practice. Effectiveness studies have a greater chance of measurement error if the researchers have taken few precautions to standardize the measurements. Although measurement error reduces the precision of effectiveness studies, study populations are typically large, which increases the statistical precision.

Noninferiority and Superiority Trials

Some researchers want to prove that their product is better than the standard product. If the new product is very effective, relatively small study populations may suffice to prove that the product is superior to the standard product. Often, however, researchers are content to show that their product is equivalent to the standard product. A typical situation is a minor chemical variation of a standard drug. The minor variation means that the patent on the standard drug does not apply, and the company making the new product can market it, as long as it is as good as the standard product. Thus, some studies are designed to prove noninferiority (the product is highly likely to be as good as the standard product). The designer of the trial tries to estimate the number of patients required to show that the two products do not result in clinically important differences in outcomes. One means to this end is to include enough subjects to be sure that the upper limit of the 95 percent confidence interval of the difference in outcomes (the largest difference that is reasonably possible) is slightly smaller than the minimum clinically important difference in outcomes. This technique almost guarantees that if the two products are truly equivalent, any difference actually observed between them in a particular study will be less than the smallest difference that clinicians would find meaningful. For all practical purposes, the two interventions have equivalent effects.
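The noninferiority criterion described above reduces to a single comparison of the upper confidence limit against the noninferiority margin; a sketch with invented numbers:

```python
def noninferior(ci_upper_for_difference, margin):
    """Noninferiority criterion as described in the text: the new product
    is judged noninferior if the upper limit of the 95 percent confidence
    interval for the (standard minus new) difference in outcomes falls
    below the minimum clinically important difference (the margin)."""
    return ci_upper_for_difference < margin

# Hypothetical numbers: the largest plausible shortfall of the new drug is
# 1.2 percentage points; clinicians consider 2 points the smallest
# difference that matters.
print(noninferior(ci_upper_for_difference=1.2, margin=2.0))  # True
print(noninferior(ci_upper_for_difference=2.5, margin=2.0))  # False
```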

Co-Morbidity and Cointerventions as Confounders

Suppose one is investigating the relationship between eating a particular brand of breakfast cereal and subsequent myocardial infarction and observes that the greater the intake of the cereal, the greater the incidence of myocardial infarction. Does eating the cereal cause myocardial infarction? Now, suppose that cereal eaters and noneaters also differ in the prevalence of diabetes, with more cases of diabetes occurring in the cereal-eating population. The presence of diabetes confounds the relationship between cereal eating and myocardial infarction: diabetes, rather than the cereal, may explain the higher infarction rate. Diabetes is thus a "co-morbid condition," a form of confounder.

Medications can also confound a relationship. Suppose that one studies the effects of two blood pressure medications on heart failure. Since high blood pressure is a cause of heart failure, the study will be stronger if the blood pressure in the two study groups is the same. So, the researchers allow the doctors caring for the patients to use medications in addition to the study medications to get a patient's blood pressure to a target level. The other blood pressure medications are “cointerventions.” Cointerventions can be confounders if they affect the outcome state, which is heart failure in this example. If the researchers do not adjust for differences in the cointervention medications, which may vary throughout the study, they may form an incorrect conclusion about the relationship of the study medication to heart failure. That is, what appears to be an effect of the study medication may be an effect of the cointervention.

Single-Center Versus Multicenter Studies

Studies that take place in many different clinical sites are increasingly the norm for the testing of major hypotheses about treatments for disease. One reason is sample size. More sites mean more patients, which means greater statistical precision and the ability to make strong statistical inferences about relatively small (and even clinically unimportant) differences (of course, one may lose statistical precision if the clustering of outcomes occurs, but a good study design will allow a larger sample size so that clustering does not reduce the statistical power of a study). The use of more study sites means greater variability in the patients and in clinical care and less risk that the differences between two interventions are due to idiosyncrasies of practice at a single site rather than the intervention itself. The use of more study sites also means that more investigators are talking among themselves and finding ways to strengthen the study that a single investigator might miss. A study at more sites also means greater costs, which often make a study infeasible without a corporate sponsor. Alternatively, the greater costs may mean less thorough data collection and a greater risk that the findings from the study will not be interpretable at its conclusion. Despite the costs, multicenter studies are the norm for the testing of important hypotheses. Relatively few studies of CAM interventions have been performed at multiple sites, so this form of research is an untapped opportunity for CAM researchers.

Clustering of Outcomes

Conventional statistical methods assume that the outcomes for individual patients are independent of one another, so that each patient contributes new additional information about the relationship of an outcome to the two interventions. When the care given by different providers (either institutions or doctors) results in different outcomes, outcomes are said to be “clustered.” When a study takes place in several different institutions, which is common practice, it is possible that care provided at each of the institutions differs, so that knowledge of what institution is providing the care allows one to predict the outcomes for the patients. Under these circumstances, the outcome for each study patient is not independent of those for other patients at that institution, and the assumptions of conventional statistical methods do not hold. The assumption of independence when outcomes are related means that measures of variability, such as the 95 percent confidence interval, appear to be more precise than they really are. The true 95 percent confidence interval is wider than it appears to be from the findings of the study, which means that an apparent true difference may be consistent with random variation between the study patients who receive the intervention and those who do not. The use of an appropriate statistical design can account for the effects of clustering, so that the statistical power of the study and the widths of resulting confidence intervals are accurately known. Widening of the 95 percent confidence interval after this statistical adjustment is made means that clustering of outcomes is present. Clustering of outcomes makes it more difficult to conclude that a difference between two interventions is due to the interventions rather than to chance variation.
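The widening of confidence intervals under clustering is often summarized by a design effect. The sketch below uses the standard formula for equal cluster sizes; the cluster size and intracluster correlation are invented for illustration.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustered outcomes: DEFF = 1 + (m - 1) * ICC,
    where m is the average cluster size and ICC is the intracluster
    correlation (how similar outcomes are within one site or provider)."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Number of truly independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)

# Hypothetical multicenter study: 1,000 patients enrolled in clusters of 50
# per site, with a modest within-site correlation of 0.05.
deff = design_effect(50, 0.05)  # 1 + 49 * 0.05 = 3.45
print(round(deff, 2), round(effective_sample_size(1000, 50, 0.05)))
```

Even a modest intracluster correlation cuts the 1,000 enrolled patients down to roughly 290 independent observations, which is why ignoring clustering makes confidence intervals look narrower than they really are.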

Clustering of outcomes is especially important in studies in which the deployment of an intervention may vary from practitioner to practitioner or from study site to study site. CAM experts commonly cite the special role of the practitioner as a characteristic of CAM interventions, so it is important to know when outcomes vary in this way. If adjustment for clustering widens the confidence interval, the clustering of outcomes by provider or by site may be occurring. If some providers or sites are doing better (or worse), researchers have an opportunity to discover what makes certain providers or sites more effective.


Hierarchies of Evidence

The U.S. Preventive Services Task Force (1996) and groups organized to develop treatment guidelines have adopted a concept of "levels of evidence" or a "hierarchy of evidence" that they use to rate the strength of the body of published data on a specific test or treatment. The Task Force's approach to rating evidence appears in Table 3-1. Note that it does not use a hierarchy of study designs ranging from the most powerful (randomized clinical trials) to the weakest (case series). Rather, it uses generic characteristics of a study and of a group of studies. In effect, the term "well designed" reflects a hierarchy of study designs, but a hierarchy of designs is not an explicit part of the Task Force's rating system.

TABLE 3-1. The Evidence Rating System of the U.S. Preventive Services Task Force.


The principal product of the U.S. Preventive Services Task Force is recommendations for using preventive interventions in office-based clinical practice. The Task Force has a hierarchy of rating of the strengths of recommendations, which it has refined over the two decades of its existence. The hierarchy of the strengths of recommendations is important because practitioners, health care organizations, and payers pay attention to the Task Force's recommendations. An explicit hierarchy of recommendations with definitions that are tied to the strength of evidence makes the Task Force accountable for the strength of its recommendations. A system of accountability reduces the chance that the Task Force will make an arbitrary recommendation. The hierarchy of strengths of recommendations appears in Table 3-2.

TABLE 3-2. Strength of Recommendation and Strength of Evidence, U.S. Preventive Services Task Force.


Another hierarchy of evidence, from the National Health Service Centre for Evidence-Based Medicine, appears in Table 3-3. In contrast to the U.S. Preventive Services Task Force hierarchy, this hierarchy depends on the study design, the number of studies in the body of evidence, and the consistency of study results. In this hierarchy, the combined results of several randomized controlled clinical trials (RCTs) receive the greatest weight in evaluating treatment effectiveness. The results of a single, well-designed RCT are given the next greatest weight. The combined results of observational studies or other non-RCT study designs come next, followed by case series or anecdotal reports, and then professional judgment or consensus.

TABLE 3-3. Example of a Hierarchy of Evidence from the National Health Service Centre for Evidence-Based Medicine, 2002.


A recent IOM report (2001) proposed a slightly different approach to levels of evidence for research when the question considered is one of treatment effectiveness rather than efficacy. First, that report describes using an “effectiveness RCT.” Such a study would have the following characteristics:

  • light patient exclusion criteria;
  • conducted in a range of treatment settings;
  • treatment provided by the kinds of providers who would provide treatment in non-study settings;
  • no elaborate data collection (e.g., extra laboratory tests or imaging studies);
  • analysis done on “intention to treat” basis; and
  • random assignment with one or more control groups.

Further, that report takes the position that when evaluating treatment effectiveness, “the results of a single well-designed outcomes study should be considered to be as compelling as the results of a single well-controlled randomized trial” (IOM, 2001) and lays out a hierarchy of evidence as shown in Table 3-4.

Table 3-4. Hierarchy of Evidence.


In this report about CAM, the committee has chosen not to recommend one particular hierarchy; however, it does emphasize the following points:

  • In general, an RCT is the preferred study design if the issue is establishing treatment efficacy.
  • More studies are better than fewer studies; therefore, a meta-analysis of multiple good RCTs is better than one good RCT.
  • Other study designs can provide evidence of efficacy or effectiveness.
  • Meta-analysis of multiple non-RCT studies is better than one non-RCT study. Meta-analysis of multiple non-RCT studies may or may not be better than one good RCT; it depends on the details of the studies and the specific question being asked.
  • If the question is treatment effectiveness, then some features of the typical RCT (stringent inclusion/exclusion criteria; treatment given in high-quality, high-volume clinical sites; detailed, frequent patient follow-up; etc.) create problems in generalizing findings to routine practice settings.
  • Other study designs, including observational studies or “effectiveness RCTs,” may provide evidence that is at least equally compelling as that provided by an “efficacy RCT.”

Effect size is another consideration that must be taken into account along with features of study design when one weighs the strength of evidence for a particular therapy. Treatments with clear, dramatic, positive effects in small or less well-controlled studies may be deemed “efficacious” sooner than treatments with more modest effects.


The remainder of this chapter discusses the context in which researchers will apply these established research methods, including the idea that CAM users may present particular needs for research, that CAM interventions may pose particular problems in applying research methods that have worked well for conventional medicine, and that such interventions may also expose some of the weaknesses of applying contemporary research practices to conventional medicine.

Decision Makers and Sources of Evidence

Lewith and colleagues (2001) have described the different decisions that various participants in health care make about treatments and how they use different kinds of information to make those decisions. Patients, providers, insurers, government policy makers, and others typically require different types of evidence and different amounts of certainty to decide for or against a particular treatment or treatment modality. The committee recognized that a discussion of evidence of CAM treatment effectiveness must be set in the context of the differences among users of information about CAM in terms of the decisions that they make, the information that they need to make those decisions, and the way(s) in which they think about treatment effectiveness.


Researchers

Researchers are typically interested in understanding cause-and-effect relationships between underlying mechanisms of illness, treatments designed to alter those mechanisms, and patient outcomes. Researchers trained in Western cultures and scientific traditions generally think in terms of linear cause-and-effect and try to identify the simplest possible causal models (i.e., the fewest explanatory variables and the simplest relationships among those variables) that account for the observed associations (Nisbett, 2003). Scientists from other cultures, however, may be more likely to think in terms of more complex “system” models that involve multiple factors and multiple levels of relationships and highly interactive and iterative, rather than linear, relationships (Nisbett, 2003).

The results of a given study are taken as evidence of cause-and-effect relationships to the extent that certain criteria are met. These criteria typically include

  • Features of the study design that allow strong inferences to be made about cause-and-effect relationships:

    a well-defined population to whom the conclusions apply;

    a well-defined, sufficiently large, and representative sample drawn from that population;

    well-defined and controlled administration of the treatment(s);

    a concurrent control or comparison group(s), when possible, that receives either no treatment or some different form or dose of the study treatment;

    well-defined study endpoints (objectively defined and measured outcome variables); and

    statistical analysis to assess the likelihood that the findings are produced by chance.

  • Plausible biological mechanisms, that is, the ability to fit the observed relationships into some larger body of theory and evidence on how the body works.
  • Consistency of findings from study to study. A single study is rarely definitive, although some large, well-designed clinical trials may produce evidence that is treated by the scientific community as definitive. Confidence in the existence of cause-and-effect relationships grows with the ability to see them in multiple studies over time. Confidence diminishes when results vary from study to study.
  • Dose-response relationships. In most biological processes, the introduction of a larger amount of a substance produces a larger subsequent effect. There is almost always some upper limit at which no further effect is found or some different or counterbalancing biological process begins to take over. For the most part, however, within a reasonable range of doses, more “cause” produces more “effect.” Clear dose-response relationships typically increase the confidence in the underlying causal relationships between the treatment and the outcome.
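
The statistical criterion in the design list above — assessing the likelihood that findings were produced by chance — can be made concrete with a simple permutation test. The sketch below uses invented outcome data (not from any study cited in this chapter) purely for illustration:

```python
import random

def permutation_p_value(treated, control, n_perm=10000, seed=0):
    """Estimate the probability that a difference in group means at least
    as large as the observed one would arise by chance alone, by repeatedly
    reshuffling the group labels and recomputing the difference."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = treated + control
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_t = pooled[:len(treated)]
        perm_c = pooled[len(treated):]
        diff = sum(perm_t) / len(perm_t) - sum(perm_c) / len(perm_c)
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm

# Hypothetical symptom-improvement scores, invented for illustration
treated = [5.1, 6.3, 4.8, 7.2, 5.9, 6.1, 5.5, 6.8]
control = [4.2, 3.9, 5.0, 4.5, 4.1, 4.8, 3.7, 4.4]
p = permutation_p_value(treated, control)
```

A small value of `p` means a between-group difference this large would rarely occur by chance, which is the sense in which a study's statistical analysis supports (but never by itself proves) a causal inference.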

Teachers Training New Practitioners

Medical school, nursing school, and allied health school faculty require evidence of treatment effectiveness to determine how to train students. The standards of evidence for specific treatments are not necessarily the same as those used by researchers, but they are similar. They include

  • The criteria for researchers listed above. Faculty have the responsibility to stay current with the published literature and generally to apply the same criteria to published studies that researchers apply.
  • Personal experience. In addition, however, clinical faculty draw heavily on their own experiences in determining which treatments are effective and which ones are not. This may be particularly true in the context of clinical rotations and residency training, in which much teaching is done on the basis of an apprenticeship model in a specific clinical environment. In this setting, both faculty and students have a chance to observe, directly and together, the effectiveness of specific treatments.
  • The extent to which the treatment in question is a “standard of practice” in the medical community or is moving toward that standing. Students entering a profession become part of a professional community, and part of their learning involves knowing what the standards and typical practices of that community are. There is often a time lag between the publication of scientific evidence of the effectiveness of a new treatment and its widespread adoption by most or all members of a professional community, reflecting in part appropriate caution and skepticism about new findings that seem to run counter to daily experience. Teachers train students in what the members of the professional community typically do on a daily basis as well as what the published literature says that they could or should do.

Practicing Clinicians

Clinicians treating patients have a somewhat more complex set of information requirements about treatment effectiveness, because they must know not only what has worked or what should have been effective in the abstract but also what they are actually able to do in the context of their own training and skills, their own practice settings, and their own sets of patients. Their requirements for information on treatment effectiveness include

  • All of the preceding criteria, although many active clinicians will not have the same amount of time as their researcher or faculty colleagues do to monitor developments in the published literature.
  • Consistency of a new practice with other aspects of current practice. A psychotherapist may accept the published evidence about the effectiveness of a specific herb for the treatment of depression but may be unwilling to incorporate the use of the herb into his or her own practice because of a professional commitment to therapies based on a different theory and conceptual model of mental illness.
  • The availability of essential equipment, trained staff, supplies, and anything else necessary to provide a treatment safely and effectively. Many treatments require specialized equipment, training, or support staff that are not readily available to all clinicians.
  • Difficulty in learning new skills (e.g., for new surgical procedures).
  • The acceptability of a new treatment to patients and others in the community. Health care is usually a two-way human interaction; and potentially effective treatments will not be used if they conflict with the beliefs, cultural values, or expectations of large numbers of patients in a practice.
  • Opinions of professional peers. In an environment in which it is impossible to keep up with all new advances in treatment, the opinions and practices of respected colleagues are a kind of evidence of treatment effectiveness that is often dominant.
  • Reimbursement policies affecting a new treatment. Even when all other criteria have been met, a new treatment may not be adopted if the provision of it will not be adequately reimbursed.
  • The extent to which the patient population is similar to those studied in clinical trials or other studies of treatment effectiveness. There are always variations in published studies of treatment effectiveness, and clinicians may legitimately believe that what works for many or most patients will not necessarily work for their own patients, particularly if they share some clinically relevant characteristic (Park, 2002).

Employers or Purchasers and Insurers

Those who pay for health care through insurance care about effectiveness, but also about cost-effectiveness, since they have at least some responsibility to use the dollars available for insurance to produce the best possible health benefit for covered employees. Evidence of treatment effectiveness relevant to employers and insurers, then, includes

  • The scientific evidence listed above for researchers.
  • The preferences, expectations, and experiences of employees and their families. Employers are not insuring passive and uninformed people. Employees who have positive experiences with specific therapies will ask for such therapies to be covered by insurance plans and may use coverage for those therapies as the basis for choosing one plan over another at open enrollment or even changing jobs.
  • Published cost-effectiveness studies (when available). Employers and insurers may legitimately refuse to cover treatments that are effective but that are so costly that their inclusion prevents the coverage of less costly treatments that provide more health benefit to larger numbers of people.
  • Internal cost-effectiveness analyses (for some larger employers). Large companies with many thousands of employees may be able to use their own databases to study relationships between treatments and work attendance, productivity, or the costs of illness. This information may be more compelling than information in published studies because there is no question about the generalizability of the findings to that employer's population.

Patients and Consumers

Individual patients generally do not have direct access to peer-reviewed journals, and most patients do not have the technical background to interpret the results of published treatment-effectiveness studies. This information tends to be filtered through someone else before it reaches the individual patient. In addition, patients (particularly those with chronic conditions) have their own experiences to draw on and can judge treatment effectiveness by the extent to which their own symptoms or functional status improve with treatment. Information on treatment effectiveness for individual patients, then, comes mainly from

  • Information provided by a clinician(s) in one-on-one treatment encounters,
  • Word of mouth from friends and relatives,
  • The lay press or media,
  • Direct-to-consumer advertising,
  • The Internet,
  • Direct personal experience (particularly for patients with chronic conditions), and
  • Communications from illness advocacy groups.

The Application of Contemporary Clinical Research Methods to CAM: Some Cautions

Although the concept of levels of evidence has generally been accepted and widely used in many domains of conventional medicine, some question its applicability to CAM therapies or to individual treatment decisions for specific patients. These questions particularly relate to the use of RCTs as the “gold standard” of evidence. Given the broad array of modalities that are included within the definition of CAM, it may be that some CAM therapies are more amenable to evaluation than others. Questions about the applicability of clinical research methods to CAM are described and discussed below.

Emphasis on Efficacy Rather Than Effectiveness

As noted above, the distinction between efficacy and effectiveness refers to the extent to which a treatment has a measurable positive effect in highly controlled clinical trial contexts (efficacy) versus whether the treatment has a measurable positive effect in routine daily clinical practice with unselected clinicians and patients (effectiveness). Efficacy refers to what a treatment can do under ideal circumstances; effectiveness refers to what a treatment does do in routine daily use. Because the highest level of evidence in most evidence hierarchies is the combined results of several RCTs, the resulting recommendations will inevitably be based on evidence of efficacy rather than evidence of effectiveness.

Difficult to Apply to Therapies for Which RCTs Are Difficult, Expensive, or Unethical

It may be impossible to organize RCTs in situations in which the effects to be observed occur rarely, take many years to develop, or are relatively subtle. It is also difficult to conduct RCTs in situations in which the treatment is already in wide use and is generally accepted as effective. It may also be difficult or impossible to randomize patients to CAM modalities or specific therapies that inherently depend on patients' belief, faith, or confidence in or relationship with a particular modality or provider. (See the discussion of “preference trials” in Chapter 4 for one way to address this problem.)

Hard to Apply to Treatments That Become Popular and Widely Used Very Quickly

Study participants may not accept random assignment to a placebo or some other type of control group if the general public believes that the treatment being studied is effective. Likewise, institutional review boards may be unwilling to approve randomization to a placebo or another control group if the professional community believes that the therapy being studied is effective. Beyond the difficulty of organizing RCTs for widely used treatments, a similar problem arises for any other study design whose control condition involves administering a possibly ineffective treatment.

Relatively Long Delay from First Development of a Treatment to Assembly of Large Body of Evidence

The FDA has requirements for research on new drugs before they can be prescribed, but there are no similar requirements for surgical procedures and most CAM modalities. In both cases, there may be a long time lag (several years, in some instances) between the development and the first use of a treatment and the assembly of a body of scientific evidence of effectiveness. For drugs, this lag is invisible to most of the general public, and some evidence from RCTs must have been assembled before a drug is allowed on the market. For other treatments, however, the time required to organize an RCT or collect the results of other types of studies means that a large body of anecdotal experience will have been developed before more formal scientific evidence appears. For many CAM therapies based on traditional cultural beliefs, this time lag may be measured in hundreds of years.

Emphasis on What's Best for Largest Number Rather Than Search for What's Best for Unique, Individual Patients

A treatment is judged effective in an RCT if it is better than a placebo or an alternative form of treatment. “Better” means that the average outcome for the experimental group is superior to that for a control group, as determined by statistical tests that relate the difference in average outcomes to the variation in outcomes in the two groups. Unless the differences between the experimental group and the control group are dramatic, however, there are usually some patients in the experimental group who do worse than some patients in the control group (Park, 2002). What is best, then, for the “typical” or “average” patient is not necessarily best for every patient. This approach to identifying effective treatments is fundamentally different from the approach that emphasizes individual tailoring of treatments found in CAM modalities like homeopathy or traditional Chinese medicine.
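
The distinction between a superior group average and uniformly better individual outcomes can be sketched numerically. The data below are invented for illustration, and the statistic shown is Welch's t, one common choice for comparing two group means:

```python
import math

def two_sample_t(a, b):
    """Welch's t statistic: the difference in group means scaled by the
    variability within each group -- the kind of comparison on which an
    RCT's primary analysis typically rests."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Hypothetical outcome scores (higher = better); illustrative only
experimental = [62, 70, 55, 80, 67, 74, 59, 77]
control = [58, 61, 49, 66, 72, 54, 63, 57]

t = two_sample_t(experimental, control)

# The average favors the experimental arm, yet some experimental
# patients still score below the best-scoring control patients:
worse = sum(1 for e in experimental if e < max(control))
```

Here the experimental group's mean is higher (so `t` is positive), yet several experimental patients do worse than some control patients — the overlap that makes a group-level verdict an imperfect guide to any individual patient.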

Emphasis on Objective Rather Than Subjective Endpoints

The desire to have objective, well-defined study endpoints in RCTs can lead to a focus on health outcomes like mortality, tumor shrinkage, or change in a measurable physiological parameter like temperature or blood pressure. An exclusive focus on objective endpoints can lead researchers to miss or ignore other effects in the realm of subjective symptoms (e.g., pain, fatigue, and cognitive function) and general well-being. For many CAM therapies, the treatment goals include feelings of well-being and mastery of the illness (Jonas and Linde, 2002); these will not be captured in studies with more objectively defined primary endpoints.

Wellness Versus Treatment Effectiveness as a Research Objective

Recent national surveys (see Chapter 2; Astin, 1998; Astin et al., 2000) have highlighted the fact that many CAM “treatments” are not used to treat a specific current problem or disease but, rather, are used either to prevent disease or to promote a more general state of health and well-being. RCTs may still be used to assess the effects of CAM on general health or well-being, but such RCTs may be even more difficult to conduct than RCTs of the effectiveness of treatments for specific diseases. RCTs in the domain of disease prevention or wellness enhancement may require much longer time lines (e.g., 10 to 20 years or more), very large sample sizes because of the relatively low incidence of specific medical problems being prevented, or even larger sample sizes because of the potential for loss to follow-up or switching of treatment arms over the course of the study (i.e., patients randomized to the presumed active treatment quit taking or doing it, and patients randomized to the control arm begin to take or do the active treatment on their own). Some outcome variables may be hard to define and measure (e.g., “I just feel better”), and effect sizes may be small, again adding to the sample size required for a trial to have a reasonable chance of detecting an effect if it is truly present. Finally, patients will inevitably be doing several things that contribute to wellness (or lack of it) over a multiyear study period, and it will be difficult to isolate the effects of a CAM therapy or modality from the effects of a larger package of lifestyle factors.
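
The sample-size pressure described above can be illustrated with a back-of-the-envelope calculation. The sketch below uses the standard normal-approximation formula for a two-arm comparison of means, assuming a two-sided 5 percent significance level and 80 percent power; the effect sizes are hypothetical:

```python
import math

def n_per_arm(effect_size):
    """Approximate participants needed per arm to detect a standardized
    effect size d (difference in means divided by the standard deviation),
    assuming two-sided alpha = 0.05 and 80 percent power:
        n = 2 * (z_alpha + z_beta)^2 / d^2
    """
    z_alpha = 1.96  # two-sided 5 percent significance level
    z_beta = 0.84   # 80 percent power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A dramatic effect (d = 0.8) versus a subtle wellness effect (d = 0.2):
# the subtler effect requires roughly sixteen times as many participants.
n_large_effect = n_per_arm(0.8)
n_small_effect = n_per_arm(0.2)
```

In practice, researchers would use exact power calculations and then inflate these numbers further to allow for the attrition and treatment-arm crossover described above.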


References

  • Astin JA. Why patients use alternative medicine: Results of a national study. JAMA. 1998;279(19):1548–1553. [PubMed: 9605899]
  • Astin JA, Pelletier KR, Marie A, Haskell WL. Complementary and alternative medicine use among elderly persons: One-year analysis of a Blue Shield Medicare supplement. J Gerontol A Biol Sci Med Sci. 2000;55(1):M4–M9. [PubMed: 10719766]
  • Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials. N Engl J Med. 2000;342(25):1878–1886. [PubMed: 10861324]
  • Bowling A. Measuring Health: A Review of Quality of Life Measurement Scales. Philadelphia, PA: Open University Press; 1997.
  • Byar DP. Why data bases should not replace randomized clinical trials. Biometrics. 1980 June;36:337–342. [PubMed: 7407321]
  • Cochrane AL. Effectiveness and Efficiency: Random Reflections on Health Services. London: Nuffield Provincial Hospitals Trust; 1972. [PubMed: 2691208]
  • Farquhar C, Basser R, Hetrick S, Lethaby A, Marjoribanks J. High dose chemotherapy and autologous bone marrow or stem cell transplantation versus conventional chemotherapy for women with metastatic breast cancer. Cochrane Database Syst Rev. 2003;(1):CD003142. [PubMed: 12535458]
  • Frank-Stromborg M, Olsen SJ. Instruments for Clinical Health Care Research. Sudbury, MA: Jones and Bartlett; 1997.
  • Garber AM, Phelps CE. Economic foundations of cost-effectiveness analysis. J Health Econ. 1997;16:1–31. [PubMed: 10167341]
  • IOM (Institute of Medicine). Gulf War Veterans: Treating Symptoms and Syndromes. Washington, DC: National Academy Press; 2001.
  • Jonas WB, Linde K. Conducting and Evaluating Clinical Research on Complementary and Alternative Medicine. In: Gallin JI, editor. Principles and Practice of Clinical Research. San Diego, CA: Academic Press; 2002. pp. 401–426.
  • Kaptchuk TJ, Kerr CE. Commentary: Unbiased divination, unbiased evidence, and the patulin clinical trial. Int J Epidemiol. 2004;33(2):247–251. [PubMed: 15082621]
  • Lewith GT, Hyland M, Gray SF. Attitudes to and use of complementary medicine among physicians in the United Kingdom. Complement Ther Med. 2001;9(3):167–172. [PubMed: 11926430]
  • McDowell I, Newell C. Measuring Health: A Guide to Rating Scales and Questionnaires. New York: Oxford University Press; 1996.
  • McPherson K, Wennberg JE, Hovind OB, Clifford P. Small-area variations in the use of common surgical procedures: An international comparison of New England, England, and Norway. N Engl J Med. 1982;307(21):1310–1314. [PubMed: 7133068]
  • Moseley JB, O'Malley K, Petersen NJ, Menke TJ, Brody BA, Kuykendall DH, Hollingsworth JC, Ashton CM, Wray NP. A controlled trial of arthroscopic surgery for osteoarthritis of the knee. N Engl J Med. 2002;347(2):81–88. [PubMed: 12110735]
  • Neuhauser D. Heroes and martyrs of quality and safety: Ernest Amory Codman, MD. Qual Saf Health Care. 2002;11:104–105. [PMC free article: PMC1743579] [PubMed: 12078360]
  • Nisbett RE. The Geography of Thought: How Asians and Westerners Think Differently and Why. New York: Free Press; 2003.
  • Park CM. Diversity, the individual, and proof of efficacy: Complementary and alternative medicine in medical education. Am J Public Health. 2002;92(10):1568–1572. [PMC free article: PMC1447280] [PubMed: 12356593]
  • Phillips R, Ball C, Sackett D, Badenoch D, Straus S, Haynes B, Dawes M, McAlister FA. 2004. [May 2004]. [Online]. Available: http://www​​/Oxford_CEBM_Levels_5.rtf.
  • Rossouw JE, Anderson GL, Prentice RL, LaCroix AZ, Kooperberg C, Stefanick ML, Jackson RD, Beresford SA, Howard BV, Johnson KC, Kotchen JM, Ockene J. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the Women's Health Initiative randomized controlled trial. JAMA. 2002;288(3):321–333. [PubMed: 12117397]
  • Torrance GW. Measurement of health state utilities for economic appraisal: A review. J Health Econ. 1986;5(1):1–30. [PubMed: 10311607]
  • Upjohn Co. v. Finch. 422 F.2d 944, 955 (6th Cir. 1970).
  • U.S. Preventive Services Task Force. Guide to Clinical Preventive Services. Baltimore, MD: Williams & Wilkins; 1996.
  • U.S. Statutes at Large. 1951. p. 648.
  • U.S. Statutes at Large. Vol. 65. 1962. pp. 788–789.
  • Wennberg J, Gittelsohn A. Variations in medical care among small areas. Sci Am. 1982;246(4):120–134. [PubMed: 7079718]



CROs provide a wide range of research and development services. CROs assist pharmaceutical, biotechnology, and medical device companies to produce new medicines and new treatments (www​

Copyright © 2005, National Academy of Sciences.
Bookshelf ID: NBK83795

