NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Institute of Medicine (US) Roundtable on Evidence-Based Medicine; Olsen LA, Aisner D, McGinnis JM, editors. The Learning Healthcare System: Workshop Summary. Washington (DC): National Academies Press (US); 2007.

The Learning Healthcare System: Workshop Summary.

2 The Evolving Evidence Base—Methodologic and Policy Challenges


An essential component of the learning healthcare system is the capacity to continually improve approaches to gathering and evaluating evidence, taking advantage of new tools and methods. As technology advances and our ability to accumulate large quantities of clinical data increases, new challenges and opportunities to develop evidence on the effectiveness of interventions will emerge. With these expansions comes the possibility of significant improvements in multiple facets of the information that underlies healthcare decision making, including the potential to develop additional insights on risk and effectiveness; an improved understanding of increasingly complex patterns of comorbidity; insights on the effect of genetic variation and heterogeneity on diagnosis and treatment outcomes; and evaluation of interventions in a rapid state of flux such as devices and procedures. A significant challenge will be in piecing together evidence from the full scope of this information to determine what is best for individual patients. This chapter offers an overview of some of the key methodologic and policy challenges that must be addressed as evidence evolves.

In the first paper in this chapter, Robert M. Califf presents an overview of the alternatives to large randomized controlled trials (RCTs), and Telba Irony and David Eddy present three methods that have been developed to augment and improve current approaches to generating evidence. Califf suggests that, while the RCT is a valuable tool, the sheer volume of clinical decisions requires that we understand the best alternative methods to use when RCTs are inapplicable, infeasible, or impractical. He outlines the potential benefits and pitfalls of practical clinical trials (PCTs), cluster randomized trials, observational treatment comparisons, interrupted time series, and instrumental variables analysis. He notes that advances in methodology are important, but that increasing the evidence base will also require expanding our capacity to do clinical research. That need is exemplified by the calls for increased organization, for clinical trials embedded in a nodal network of health systems with electronic health records, and for the development of a critical mass of experts to guide us through study methodologies.

Another issue complicating evaluation of medical devices is their rapid rate of turnover and improvement, which makes their appraisal especially complicated. Telba Irony discusses the work of the Food and Drug Administration (FDA) in this area through the agency’s Critical Path Initiative and its Medical Device Innovation Initiative. The latter emphasizes the need for improved statistical approaches and techniques to learn about the safety and effectiveness of medical device interventions in an efficient way, which can also adapt to changes in technology during evaluation periods. Several examples were discussed of the utilization of Bayesian analysis to accelerate the approval process of medical devices. David M. Eddy presented his work with Archimedes to demonstrate how the use of mathematical models is a promising approach to help answer clinical questions, particularly to fill the gaps in empirical evidence. Many current gaps in evidence relate to unresolved questions posed at the conclusion of clinical trials; however most of these unanswered questions do not get specifically addressed in subsequent trials, due to a number of factors including cost, feasibility, and clinical interest. Eddy suggests that models can be particularly useful in utilizing the existing clinical trial data to address issues such as head-to-head comparisons, combination therapy or dosing, extension of trial results to different settings, longer follow-up times, and heterogeneous populations. Recent work on diabetes prevention in high-risk patients illustrates how the mathematical modeling approach allowed investigators to extend trials in directions that were otherwise not feasible and provided much needed evidence for truly informed decision making. Access to needed data will increase with the spread of electronic health records (EHRs) as long as person-specific data from existing trials are widely accessible.

As we accumulate increasing amounts of data and pioneer new ways to utilize information for patient benefit, we are also developing an improved understanding of increasingly complex patterns of comorbidity and insights into the effect of genetic variation and heterogeneity on diagnosis and treatment outcomes. Sheldon Greenfield outlines the many factors that lead to heterogeneity of treatment effects (HTE), that is, variations in results produced by the same treatment in different patients. These factors include genetic and environmental influences, adherence, polypharmacy, and competing risk. To improve the specificity of treatment recommendations, Greenfield suggests that prevailing approaches to study design and data analysis in clinical research must change. The authors propose two major strategies to decrease the impact of HTE in clinical research: (1) the use of composite risk scores derived from multivariate models should be considered in both the design of a priori risk stratification groups and data analysis of clinical research studies; and (2) the full range of sources of HTE, many of which arise for members of the general population not eligible for trials, should be addressed by integrating the multiple existing phases of clinical research, both before and after an RCT.

In a related paper, David Goldstein gives several examples that illustrate the mounting challenges and opportunities posed by genomics in tailoring treatment appropriately. He highlights recent work on the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE), which compared the effectiveness of atypical antipsychotics and one typical antipsychotic in the treatment of schizophrenia and Alzheimer’s disease. Results indicated no difference between typical and atypical antipsychotics with respect to discontinuation of treatment. In terms of adverse reactions, however, such as increased body weight or development of the irreversible condition known as tardive dyskinesia, these medications were actually quite distinct. Pharmacogenetics thus offers the potential ability to identify subpopulations of risk or benefit through the development of clinically useful diagnostics, but only if we begin to amass the data, methods, and resources needed to support pharmacogenetics research.

The final cluster of papers in this chapter engages some of the policy issues in expanding sources of evidence, such as those related to the interoperability of electronic health records, expanding post-market surveillance and the use of registries, and mediating an appropriate balance between patient privacy and access to clinical data. Weisman et al. comment on the rich opportunities presented by interoperable EHRs for post-marketing surveillance data and the development of additional insights on risk and effectiveness. Again, methodologies outside of the RCT will be increasingly instrumental in filling gaps in evidence that arise from the use of data related to interventions in clinical practice because the full value of an intervention cannot truly be appreciated without real-world usage. Expanded systems for post-marketing surveillance offer substantial opportunities to generate evidence; and in defining the approach, we also have an opportunity to align the interests of many healthcare stakeholders. Consumers will have access to technologies as well as information on appropriate use; manufacturers and regulatory agencies might recognize significant benefit from streamlined or harmonized data collection requirements; and decision makers might acquire means to accumulate much-needed data for comparative effectiveness studies or recognition of safety signals. Steve Teutsch and Mark Berger comment on the obvious utility of clinical studies for informing the decisions of patients, providers, and policy makers, particularly comparative effectiveness studies that demonstrate which technology is more effective, safer, or beneficial for subpopulations or clinical situations. However, they also note several of the inherent difficulties of our current approach to generating needed information, including a lack of consensus on evidence standards and how they might vary depending on circumstance, and a needed advancement in the utilization, improvement, and validation of study methodologies.

An underlying theme in many of the workshop papers is the effect of HIPAA (Health Insurance Portability and Accountability Act) regulation on current research and the possible implications for utilizing data collected at the point of care for generation of evidence on effectiveness of interventions. In light of the substantial gains in quality of care and advances in research possible by linking health information systems and aggregating and sharing data, consideration must be given to how to provide access while maintaining appropriate levels of privacy and security for personal health information. Janlori Goldman and Beth Tossell give an overview of some of the issues that have emerged in response to privacy concerns about shared medical information. While linking medical information offers clear benefits for improving health care, public participation is necessary and will hinge on privacy and security being built in from the outset. The authors suggest a set of first principles regarding identifiers, access, data integrity, and participation that help move the discussion toward a workable solution. This issue has been central to many discussions of how to better streamline the healthcare system and facilitate the process of clinical research, while maximizing the ability to provide privacy and security for patients. A recent Institute of Medicine (IOM) workshop, sponsored by the National Cancer Policy Forum, examined some of the issues surrounding HIPAA and its effect on research, and a formal IOM study on the topic is anticipated in the near future.


Robert M. Califf, M.D.

Duke Translational Medicine Institute and the Duke University Medical Center

Researchers and policy makers have used observational analyses to support medical decision making since the beginning of organized medical practice. However, recent advances in information technology have allowed researchers access to huge amounts of tantalizing data in the form of administrative and clinical databases, fueling increased interest in the question of whether alternative analytical methods might offer sufficient validity to elevate observational analysis in the hierarchy of medical knowledge. In fact, 25 years ago, my academic career was initiated with access to one of the first prospective clinical databases, an experience that led to several papers on the use of data from practice and the application of clinical experience to the evaluation and treatment of patients with coronary artery disease (Califf et al. 1983). However, this experience led me to conclude that no amount of statistical analysis can substitute for randomization in ensuring internal validity when comparing alternative approaches to diagnosis or treatment.

Nevertheless, the sheer volume of clinical decisions made in the absence of support from randomized controlled trials requires that we understand the best alternative methods when classical RCTs are unavailable, impractical, or inapplicable. This discussion elaborates upon some of the alternatives to large RCTs, including practical clinical trials, cluster randomized trials, observational treatment comparisons, interrupted time series, and instrumental variables analysis, and reviews some of the potential benefits and pitfalls of each approach.

Practical Clinical Trials

The term “large clinical trial” or “megatrial” conjures an image of a gargantuan undertaking capable of addressing only a few critical questions. The term “practical clinical trial” is greatly preferred because the size of a PCT need be no larger than that required to answer the question posed in terms of health outcomes—whether patients live longer, feel better, or incur fewer medical costs. Such issues are the relevant outcomes that drive patients to use a medical intervention.

Unfortunately, not enough RCTs employ the large knowledge base that was used in developing the principles relevant to conducting a PCT (Tunis et al. 2003). A PCT must include the comparison or alternative therapy that is relevant to the choices that patients and providers will make; all too often, RCTs pick a “weak” comparator or placebo. The populations studied should be representative; that is, they should include patients who would be likely to receive the treatment, rather than including low-risk or narrow populations selected in hopes of optimizing the efficacy or safety profile of the experimental therapy. The time period of the study should include the period relevant to the treatment decision, unlike short-term studies that require hypothetical extrapolation to justify continuous use.

Also, the background therapy should be appropriate for the disease, an issue increasingly relevant in the setting of international trials that include populations from developing countries. Such populations may be comprised of “treatment-naïve” patients, who will not offer the kind of therapeutic challenge presented by patients awaiting the new therapy in countries where active treatments are already available. Moreover, patients in developing countries usually do not have access to the treatment after it is marketed. Well-designed PCTs offer a solution to the “outsourcing” of clinical trials to populations of questionable relevance to therapeutic questions better addressed in settings where the treatments are intended to be used. Of course, the growth of clinical trials remains important for therapies that will actually be used in developing countries, and appropriate trials in these countries should be encouraged (Califf 2006a).

Therefore, the first alternative to a “classical” RCT is a properly designed and executed PCT. Research questions should be framed by the clinicians who will use the resulting information, rather than by companies aiming to create an advantage for their products through clever design. Similarly, a common fundamental mistake occurs when scientific experts without current knowledge of clinical circumstances are allowed to design trials. Instead, we need to involve clinical decision makers in the design of trials to ensure they are feasible and attractive to practice, as well as making certain that they include elements critical to providing generalizable knowledge for decision making.

Another fundamental problem is the clinical research enterprise’s lack of organization. In many ways, the venue for the conduct of clinical trials is hardly a system at all, but rather a series of singular experiences in which researchers must deal with hundreds of clinics, health systems, and companies (and their respective data systems). Infrastructure for performing trials should be supported by both the clinical care system and the National Institutes of Health (NIH), with continuous learning about the conduct of trials and constant improvements in their efficiency. However, the way trials are currently conducted is an engineering disaster. We hope that eventually trials will be embedded in a nodal network of health systems with electronic health records combined with specialty registries that cut across health systems (Califf et al. [in press]). Before this can happen, however, not only must EHRs be in place, but common data standards and nomenclature must be developed, and there must be coordination among numerous federal agencies (FDA, NIH, the Centers for Disease Control and Prevention [CDC], the Centers for Medicare and Medicaid Services [CMS]) and private industry to develop regulations that will not only allow, but encourage, use of interoperable data.

Alternatives to Randomized Comparisons

The fundamental need for randomization arises from the existence of treatment biases in practice. Recognizing that random assignment is essential to ensuring the internal validity of a study when the likely effects of an intervention are modest (and therefore subject to confounding by indication), we cannot escape the fact that nonrandomized comparisons will have less internal validity. However, nonrandomized analyses are nonetheless needed, because not every question can be answered by a classical RCT or a PCT, and a high-quality observational study is likely to be more informative than relying solely on clinical experience. For example, interventions come in many forms—drugs, devices, behavioral interventions, and organizational changes. All interventions carry a balance of potential benefit and potential risk; gathering important information on these interventions through an RCT or PCT might not always be feasible.

As an example of organizational changes requiring evaluation, consider the question: How many nurses, attendants, and doctors are needed for an inpatient unit in a hospital? Although standards for staffing have been developed for some environments relatively recently, in the era of computerized entry, EHRs, double-checking for medical errors, and bar coding, the proper allocation of personnel remains uncertain. Yet every day, executives make decisions based on data and trends, usually without a sophisticated understanding of their multivariable and time-oriented nature.

In other words, there is a disassociation between the experts in analysis of observational clinical data and the decision makers. There are also an increasing number of sources of data for decision making, with more and more healthcare systems and multispecialty practices developing data repositories. Instruments to extract data from such systems are also readily available. While these data are potentially useful, questionable data analyses and gluts of information (not all of it necessarily valid or useful) may create problems for decision makers.

Since PCTs are not feasible for answering the questions that underlie a good portion of the decisions made every day by administrators and clinicians, the question is not really whether we should look beyond the PCT. Instead, we should examine how best to integrate various modes of decision making, including both PCTs and other approaches to data analysis, in addition to opinion based on personal experience. We must ask ourselves: Is it better to combine evidence from PCTs with opinion, or is it better to use a layered approach using PCTs for critical questions and nonrandomized analyses to fill in gaps between clear evidence and opinion?

For the latter approach, we must think carefully about the levels of decision making that we must inform every day, the speed required for this, how to adapt the methodology to the level of certainty needed, and ways to organize growing data repositories and the researchers who will analyze them to better develop evidence to support these decisions. Much of the work in this arena is being conducted by the Centers for Education and Research in Therapeutics (CERTs) (Califf 2006b). The Agency for Healthcare Research and Quality (AHRQ) is a primary source of funding for these efforts, although significant increases in support will be needed to permit adequate progress in overcoming methodological and logistical hurdles.

Cluster Randomized Trials

If a PCT is not practical, the second alternative to large RCTs is cluster randomized trials. There is growing interest in this approach among trialists, because health systems increasingly provide venues in which practices vary and large numbers of patients are seen in environments that have good data collection capabilities. A cluster randomized trial performs randomization on the level of a practice rather than the individual patient. For example, certain sites are assigned to intervention A, others use intervention B, and a third group serves as a control. In large regional quality improvement projects, factorial designs can be used to test more than one intervention. This type of approach can yield clear and pragmatic answers, but as with any method, there are limitations that must be considered. Although methods have been developed to adjust for the nonindependence of observations within a practice, these methods are poorly understood and difficult to explain to clinical audiences. Another persistent problem is contamination that occurs when practices are aware of the experiment and alter their practices regardless of the randomized assignment. A further practical issue is obtaining informed consent from patients entering a health system where the practice has been randomized, recognizing that individual patient choice for interventions often enters the equation.
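The cost of nonindependence within a practice can be made concrete with the standard design effect, 1 + (m - 1) x ICC, where m is the number of patients per practice and ICC is the intracluster correlation coefficient. The sketch below is illustrative only: it assumes equal cluster sizes, and the numbers (40 practices, 50 patients each, ICC of 0.05) are hypothetical rather than taken from any trial discussed here.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from randomizing whole practices rather than patients.

    Assumes equal cluster sizes; icc is the intracluster correlation coefficient.
    """
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_patients: int, cluster_size: float, icc: float) -> float:
    """Number of independent patients that would carry the same information."""
    return n_patients / design_effect(cluster_size, icc)


# Hypothetical trial: 40 practices, 50 patients per practice, ICC = 0.05.
deff = design_effect(50, 0.05)
ess = effective_sample_size(40 * 50, 50, 0.05)
print(f"design effect = {deff:.2f}, effective sample size = {ess:.0f}")
```

Under these assumed values, 2,000 enrolled patients carry roughly the information of 580 independently randomized patients, which is one way to explain to a clinical audience why cluster trials need more participants than their headline enrollment suggests.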

There are many examples of well-conducted cluster randomized trials. The Society of Thoracic Surgeons (STS), one of the premier learning organizations in the United States, has a single database containing data on more than 80 percent of all operations performed (Welke et al. 2004). Ferguson and colleagues (Ferguson et al. 2002) performed randomization at the level of surgical practices to test a behavioral intervention to improve use of postoperative beta blockers and the use of the internal thoracic artery as the main conduit for myocardial revascularization. Embedding this study into the ongoing STS registry proved advantageous, because investigators could examine what happened before and what happened after the experiment. They were able to show that both interventions work, that the use of this practice improved surgical outcomes, and that national practice improved after the study was completed.

Variations of this methodologic approach have also been quite successful, such as the amalgamation of different methods described in a recent study by Schneeweiss et al. (2004). This study used both cluster randomization and time sequencing embedded in a single trial to examine nebulized respiratory therapy in adults and the effects of a policy change. Both approaches were found to yield similar results with regard to healthcare utilization, cost, and outcomes.

Observational Treatment Comparisons

A third alternative to RCTs is the observational treatment comparison. This is a potentially powerful technique requiring extensive experience with multiple methodological issues. Unfortunately, the somewhat delicate art of observational treatment comparison is mostly in the hands of naïve practitioners, administrators, and academic investigators who obtain access to databases without the skills to analyze them properly. The underlying assumption of the observational treatment comparison is that if the record includes information on which patients received which treatment, and outcomes have been measured, a simple analysis can evaluate which treatment is better. However, in using observational treatment comparisons, one must always consider not only the possibility of confounding by indication and inception time bias, but also the possibility of missing baseline data needed to adjust for differences, missing follow-up data, and poor characterization of outcomes due to a lack of prespecification. In order to deal with confounding, observational treatment comparisons must include adjustment for known prognostic factors, adjustment for propensity (including consideration of inverse probability weighted estimators for chronic treatments), and employment of time-adjusted covariates when inception time is variable.
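As a rough illustration of confounding by indication and of propensity adjustment, the following sketch simulates a registry in which sicker patients are more likely to be treated and the treatment has no true effect on mortality. Every name and probability in it is hypothetical. The propensity is estimated as the treated fraction within each severity stratum, and the weighted comparison uses normalized inverse probability weights.

```python
import random


def simulate(n, seed=0):
    """Synthetic registry rows of (severe, treated, died). All values are made up."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5
        # Confounding by indication: severe patients are treated more often.
        treated = rng.random() < (0.8 if severe else 0.2)
        # Death depends on severity only; the treatment has no true effect.
        died = rng.random() < (0.4 if severe else 0.1)
        rows.append((severe, treated, died))
    return rows


def naive_difference(rows):
    """Crude risk difference, treated minus untreated, with no adjustment."""
    treated = [d for s, t, d in rows if t]
    control = [d for s, t, d in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_difference(rows):
    """Risk difference using normalized inverse probability of treatment weights."""
    propensity = {}
    for level in (True, False):
        stratum = [t for s, t, d in rows if s == level]
        propensity[level] = sum(stratum) / len(stratum)
    num_t = den_t = num_c = den_c = 0.0
    for s, t, d in rows:
        if t:
            w = 1.0 / propensity[s]
            num_t += w * d
            den_t += w
        else:
            w = 1.0 / (1.0 - propensity[s])
            num_c += w * d
            den_c += w
    return num_t / den_t - num_c / den_c


rows = simulate(40000)
print(f"naive risk difference:    {naive_difference(rows):+.3f}")
print(f"weighted risk difference: {ipw_difference(rows):+.3f}")
```

The weighting removes the spurious harm here only because severity, the single confounder, was recorded. It offers no protection against prognostic factors that are missing from the record.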

Resolving some of these issues with definitions of outcomes and missing data will be greatly aided by development of interoperable clinical research networks that work together over time with support from government agencies. One example is the National Electronic Clinical Trials and Research (NECTAR) network—a planned NIH network that will link practices in the United States to academic medical centers by means of interoperable data systems. Unfortunately, NECTAR remains years away from actual implementation.

Despite the promise of observational studies, there are limitations that cannot be overcome even by the most experienced of researchers. For example, SUPPORT (Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatment) (Connors et al. 1996; Cowley and Hager 1996) examined use of a right heart catheter (RHC) using prospectively collected data, so there were almost no missing data. After adjusting for all known prognostic factors and using a carefully developed propensity score, this study found an association between use of RHC in critically ill patients and an increased risk of death. Thirty other observational studies came to the same conclusion, even when looking within patient subgroups to ensure that comparisons were being made between comparable groups. None of the credible observational studies showed a benefit associated with RHC, yet more than a billion dollars’ worth of RHCs were being inserted in the United States every year.

Eventually, five years after publication of the SUPPORT RHC study, the NIH funded a pair of RCTs. One focused on heart disease and the other on medical intensive care. The heart disease study (Binanay et al. 2005; Shah et al. 2005) was a very simple trial in which patients were selected on the basis of admission to a hospital with acute decompensated heart failure. These patients were randomly assigned to receive either an RHC or standard care without an RHC. This trial found no evidence of harm or of benefit attributable to RHC. Moreover, other trials were being conducted around the world; when all the randomized data were in, the point estimate comparing the two treatments was 1.003: as close to “no effect” as we are likely ever to see. In this instance, even with some of the most skillful and experienced researchers in the world working to address the question of whether RHC is a harmful intervention, the observational data clearly pointed to harm, whereas RCTs indicated no particular harm or benefit.

Another example is drawn from the question of the association between hemoglobin and renal dysfunction. It is known that as renal function declines, there is a corresponding decrease in hemoglobin levels; therefore, worse renal function is associated with anemia. Patients with renal dysfunction and anemia have a significantly higher risk of dying, compared to patients with the same degree of renal dysfunction but without anemia. Dozens of different databases all showed the same relationship: the greater the decrease in hemoglobin level, the worse the outcome.

Based on these findings, many clinicians and policy makers assumed that by giving a drug to manage the anemia and improve hematocrit levels, outcomes would also be improved. Thus, erythropoietin treatment was developed and, on the basis of observational studies and very short-term RCTs, has become a national practice standard. There are performance indicators that identify aggressive hemoglobin correction as a best practice; CMS pays for it; and nephrologists have responded by giving billions of dollars’ worth of erythropoietin to individuals with renal failure, with resulting measurable increases in average hemoglobin.

To investigate effects on outcome, the Duke Clinical Research Institute (DCRI) coordinated a PCT in patients who had renal dysfunction but did not require dialysis (Singh et al. 2006). Subjects were randomly assigned to one of two different target levels of hematocrit, normal or below normal. We could not use placebo, because most nephrologists were absolutely convinced of the benefit of erythropoietin therapy. However, when an independent data monitoring committee stopped the study for futility, a trend toward worse outcomes (death, stroke, heart attack, or heart failure) was seen in patients randomized to the more “normal” hematocrit target; when the final data were tallied, patients randomized to the more aggressive target had a significant increase in the composite of death, heart attack, stroke and heart failure. Thus the conclusions drawn from observational comparisons were simply incorrect.

These examples of highly touted observational studies that were ultimately seen to have provided incorrect answers (both positive and negative for different interventions) highlight the need to improve methods aimed at mitigating these methodological pitfalls. We must also consider how best to develop a critical mass of experts to guide us through these study methodologies, and what criteria should be applied to different types of decisions to ensure that the appropriate methods have been used.

Interrupted Time Series and Instrumental Variables

A fourth alternative to large RCTs is the interrupted time series. This study design requires significant expertise because it includes all the potential difficulties of observational treatment comparisons, plus uncertainties about temporal trends. One example is drawn from an analysis of administrative data, in which data were used to assess retrospective drug utilization review and its effects on the rate of prescribing errors and on clinical outcomes (Hennessy et al. 2003). Although retrospective drug utilization review is required of all state Medicaid programs, the authors of this study were unable to identify an effect on the rate of exceptions or on clinical outcomes.
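A minimal sketch of the basic interrupted time series calculation follows. It fits one straight line to the observations before the policy change and another to those after it, then reports the jump in level at the interruption and the change in slope, which is equivalent to a segmented regression with level and slope change terms. The series is synthetic and noise-free so the recovered values can be checked by eye; the function names and numbers are hypothetical, and the sketch ignores autocorrelation, seasonality, and concurrent interventions, which are exactly the temporal uncertainties that make real analyses difficult.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interrupted_time_series(times, values, t0):
    """Return (level change at t0, slope change) between pre and post periods."""
    pre = [(t, v) for t, v in zip(times, values) if t < t0]
    post = [(t, v) for t, v in zip(times, values) if t >= t0]
    slope_pre, icpt_pre = fit_line(*zip(*pre))
    slope_post, icpt_post = fit_line(*zip(*post))
    # Compare the post-period line with the extrapolated pre-period line at t0.
    level_change = (icpt_post + slope_post * t0) - (icpt_pre + slope_pre * t0)
    return level_change, slope_post - slope_pre


# Synthetic monthly error rates: rising trend, then a policy change at month 12
# that drops the level by 3 and flattens the slope by 0.2 per month.
months = list(range(24))
rates = [
    10 + 0.5 * t if t < 12 else 10 + 0.5 * t - 3 - 0.2 * (t - 12)
    for t in months
]
level_change, slope_change = interrupted_time_series(months, rates, 12)
print(f"level change = {level_change:.2f}, slope change = {slope_change:.2f}")
```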

The final alternative to RCTs is the use of instrumental variables, which are variables unrelated to biology that produce a contrast in treatment that can be characterized. A national quality improvement registry of patients with acute coronary syndromes evaluated the outcomes of use of early versus delayed cardiac catheterization using instrumental variable analysis (Ryan et al. 2005). The instrumental variable in this case was whether the patient was admitted to the hospital on the weekend (when catheterization delays were longer) or on a weekday (when time to catheterization is shorter). Results indicated a trend toward greater benefit of early invasive intervention in this high-risk condition. One benefit of this approach is that variables can be embedded in an ongoing registry (e.g., population characteristics in a particular zip code can be used to create an approximation of the socioeconomic status of a group of patients). However, results often are not definitive, and it is common for this type of study design to raise many more questions than it answers.
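The weekend admission example can be sketched with the simplest instrumental variable estimator for a binary instrument, the Wald ratio: the difference in outcome between weekend and weekday admissions divided by the difference in the probability of early catheterization. The simulation below is entirely hypothetical. It assumes that weekend admission affects mortality only through catheterization timing, and it includes an unrecorded severity variable so that the crude comparison is biased. The true effect of early catheterization is set to a 0.05 reduction in the risk of death.

```python
import random


def simulate_registry(n, seed=1):
    """Synthetic rows of (weekend, early_cath, died). All probabilities are made up."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        weekend = rng.random() < 0.3
        sick = rng.random() < 0.5  # unmeasured severity, not stored in the rows
        # Weekend admission and severity both make early catheterization less likely.
        early = rng.random() < 0.7 - 0.4 * weekend - 0.2 * sick
        # True effect of early catheterization on risk of death is -0.05.
        died = rng.random() < 0.10 + 0.10 * sick - 0.05 * early
        rows.append((weekend, early, died))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def naive_estimate(rows):
    """Crude risk difference, early minus delayed catheterization."""
    return mean(d for w, e, d in rows if e) - mean(d for w, e, d in rows if not e)


def wald_estimate(rows):
    """Instrumental variable (Wald) estimate using weekend admission as instrument."""
    reduced_form = mean(d for w, e, d in rows if w) - mean(d for w, e, d in rows if not w)
    first_stage = mean(e for w, e, d in rows if w) - mean(e for w, e, d in rows if not w)
    return reduced_form / first_stage


rows = simulate_registry(200000)
print(f"naive estimate: {naive_estimate(rows):+.3f}")
print(f"Wald estimate:  {wald_estimate(rows):+.3f}")
```

Because the estimate is a ratio with a modest denominator, it is much noisier than the crude comparison even with 200,000 simulated patients, which is consistent with the observation that instrumental variable results are often not definitive.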

Future Directions: Analytical Synthesis

A national network funded by the AHRQ demonstrates a concerted, systematic approach to addressing all these issues in the context of clinical questions that require a synthesis of many types of analysis. The Developing Evidence to Inform Decisions about Effectiveness (DEcIDE) network seeks to inform the decisions that patients, healthcare providers, and administrators make about therapeutic choices. The DCRI’s first project as part of the DEcIDE Network examines the issue of choice of coronary stents. This is a particularly interesting study because, while there are dozens of RCTs addressing this question, new evidence continues to emerge. Briefly, when drug-eluting stents (DES) first became available to clinicians, there was a radical shift in practice from bare metal stents (BMS) to DES. Observational data from our database—now covering about 30 years of practice—are very similar to those reported in RCTs and indicate reduced need for repeat procedures with DES, because they prevent restenosis in the stented area.

The problem, however, is that only one trial has examined long-term outcomes among patients who were systematically instructed to discontinue dual platelet aggregation inhibitors (i.e., aspirin and clopidogrel). This study (Pfisterer et al. In press; Harrington and Califf 2006) was funded by the Swiss government and shows a dramatic increase in abrupt thrombosis in people with DES compared with BMS when clopidogrel was discontinued per the package insert instructions, leaving the patients receiving only aspirin to prevent platelet aggregation. In the year following discontinuation of clopidogrel therapy, the primary composite end point of cardiac death or myocardial infarction occurred significantly more frequently among patients with DES than in the BMS group. This was a small study, but it raises an interesting question: If you could prevent restenosis in 10 out of 100 patients but had 1 case of acute heart attack per 100, how would you make that trade-off? This is precisely the question we are addressing with the DEcIDE project.

Despite all these complex issues, the bottom line is that when evidence is applied systematically to practice improvement, there is a continuous improvement in patient outcomes (Mehta et al. 2002; Bhatt et al. 2004). Thus the application of clinical practice guidelines and performance measures seems to be working, but all of us continue to dream about improving the available evidence base and using this evidence on a continuous basis. However, this can only come to pass when we use informatics to integrate networks, not just within health systems but across the nation. We will need continuous national registries (of which we now have examples), but we also need to link existing networks so that clinical studies can be conducted more effectively. This will help ensure that patients, physicians, and scientists form true “communities of research” as we move from typical networks of academic health center sites linked only by a single data coordinating center to networks where interoperable sites can share data.

A promising example of this kind of integration exists in the developing network for child psychiatry. This is a field that historically has lacked evidence to guide treatments; however, there are currently 200 psychiatrists participating in the continuous collection of data that will help answer important questions, using both randomized and nonrandomized trials (March et al. 2004).

The classical RCT remains an important component of our evidence-generating system. However, it needs to be replaced in many situations by the PCT, which has a distinctly different methodology but includes the critical element of randomization. Given the enormous number of decisions that could be improved by appropriate decision support, however, alternative methods for assessing the relationships between input variables and clinical outcomes must be used. We now have the technology in place in many health systems and government agencies to incorporate decision support into practice, and methods will evolve with use. An appreciation of both the pitfalls and the advantages of these methods, together with the contributions of experienced analysts, will be critical to avoiding errant conclusions drawn from these complex datasets, in which confounding and nonintuitive answers are the rule rather than the exception.


Telba Irony, Ph.D.

Center for Devices and Radiological Health, Food and Drug Administration

Methodological obstacles slow down the straightforward use of clinical data and experience to assess the safety and effectiveness of new medical device interventions in a rapid state of flux. This paper discusses current and future technology trends, the FDA’s Critical Path Initiative, the Center for Devices and Radiological Health (CDRH) Medical Device Innovation Initiative and, in particular, statistical methodology currently being implemented by CDRH to take better advantage of data generated by clinical studies designed to assess safety and effectiveness of medical device interventions.

The Critical Path is the FDA’s premier initiative aiming to identify and prioritize the most pressing medical product development problems and the greatest opportunities for rapid improvement in public health benefits. As a major source of breakthrough technology, medical devices are becoming more critical to the delivery of health care in the United States. In addition, they are becoming more and more diverse and complex as improvements are seen in devices ranging from surgical sutures and contact lenses to prosthetic heart valves and diagnostic imaging systems.

There are exciting emerging technology trends on the horizon, and our objective is to obtain evidence on the safety and effectiveness of new medical device products as soon as possible to ensure their quick approval and a short time to market. New trends comprise computer-related technology and molecular medicine including genomics, proteomics, gene therapy, bioinformatics, and personalized medicine. We will also see new developments in wireless systems and robotics to be applied in superhigh-spatial-precision surgery, in vitro sample handling, and prosthetics. We foresee an increase in the development and use of minimally invasive technologies, nanotechnology (extreme miniaturization), new diagnostic procedures (genetic, in vitro, or superhigh-resolution sensors), artificial organ replacements, decentralized health care (home or self-care, closed-loop home systems, and telemedicine), and products that are a combination of devices and drugs.

CDRH’s mission is to establish a reasonable assurance of the safety and effectiveness of medical devices and the safety of radiation-emitting electronic products marketed in the United States. It also includes monitoring medical devices and radiological products for continued safety after they are in use, as well as helping the public receive accurate, evidence-based information needed to improve health. To accomplish its mission, CDRH must perform a balancing act to get safe and effective devices to the market as quickly as possible while ensuring that devices currently on the market remain safe and effective. To better maintain this balance and confront the challenge of evaluating new medical device interventions in a rapid state of flux, CDRH is promoting the Medical Device Innovation Initiative. Through this initiative, CDRH is expanding current efforts to promote scientific innovation in product development, focusing device research on cutting-edge science, modernizing the review of innovative devices, and facilitating a least burdensome approach to clinical trials. Ongoing efforts include the development of guidance documents to improve clinical trials and to maximize the information gathered by such trials, the expansion of laboratory research, a program to improve the quality of the review of submissions to the CDRH, and expansion of the clinical and scientific expertise at the FDA. The Medical Device Critical Path Opportunities report (FDA 2004) identified key opportunities in the development of biomarkers, improvement in clinical trial design, and advances in bioinformatics, device manufacturing, public health needs, and pediatrics.

The “virtual family” is an example of a project encompassed by this initiative. It consists of the development of anatomically and physiologically accurate adult and pediatric virtual circulatory systems to help assess the safety and effectiveness of new stent designs prior to fabrication, physical testing, animal testing, and human trials. This project is based on a computer simulation model that is designed to mimic all physical and physiological responses of a human being to a medical device. It is the first step toward a virtual clinical trial subject. Another example is the development of a new statistical model for predicting the effectiveness of implanted cardiac stents through surrogate outcomes, to measure and improve the long-term safety of these products.

To better generate evidence on which to base clinical decisions, the Medical Device Innovation Initiative emphasizes the need for improved statistical approaches and techniques to learn about the safety and effectiveness of medical device interventions in an efficient way. It seeks to conduct smaller and possibly shorter trials, and to create a better decision-making process.

Well-designed and conducted clinical trials are at the center of clinical decision making today and the clinical trial gold standard is the prospectively planned, randomized, controlled clinical trial. However, it is not always feasible to conduct such a trial, and in many cases, conclusions and decisions must be based on controlled, but not randomized, clinical trials, comparisons of an intervention to a historical control or registry, observational studies, meta-analyses based on publications, and post-market surveillance. There is a crucial need to improve assessment and inference methods to extract as much information as possible from such studies and to deal with different types of evidence.

Statistical methods are evolving as we move to an era of large volumes of data on platforms conducive to analyses. However, being able to easily analyze data can also be dangerous because it can lead to false discoveries, resources wasted chasing false positives, wrong conclusions, and suboptimal or even bad decisions. CDRH is therefore investigating new statistical technology that can help avoid misleading conclusions, provide efficient and faster ways to learn from evidence, and enable better and faster medical decision making. Examples include new methods to adjust for multiplicity to ensure that study findings will be reproduced in practice as well as new methods to deal with subgroup analysis.
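To make the multiplicity problem concrete, the sketch below applies Holm's step-down procedure, a long-established adjustment that is used here purely for illustration. It is not one of the new methods under investigation at CDRH, which the text does not describe, and the p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's step-down procedure rejects."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return reject

# Four hypothetical subgroup comparisons, each "significant" at 0.05 on its own.
p_values = [0.01, 0.04, 0.03, 0.005]
unadjusted = [p <= 0.05 for p in p_values]
adjusted = holm_reject(p_values)
```

In this made-up case all four comparisons pass an unadjusted 0.05 threshold, but only two survive the adjustment, which is the kind of false discovery the paragraph above warns about.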

A relatively new statistical method that is being used to reduce bias in the comparison of an intervention to a nonrandomized control group is propensity score analysis. It matches patients in the treatment group with patients in the control group who have a similar estimated probability of receiving the treatment, given their observed covariates. This statistical method may be used in nonrandomized controlled trials, and the control group may be a registry or a historical control. The use of this technique in observational studies attempts to balance the observed covariates. However, unlike trials in which there is random assignment of treatments, this technique cannot balance the unobserved covariates.
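The matching step can be sketched as follows. This is a minimal example that assumes the propensity scores have already been estimated (for instance, by a logistic regression of treatment on observed covariates); it uses greedy one-to-one nearest-neighbor matching with a caliper, which is only one of several matching schemes. The patient identifiers, scores, and caliper value are hypothetical.

```python
def match_on_propensity(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping patient id -> estimated propensity score.
    Returns a list of (treated_id, control_id) pairs; treated patients with no
    control within the caliper are left unmatched.
    """
    available = dict(controls)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control on the propensity score scale.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

# Hypothetical scores: device patients versus registry (control) patients.
treated = {"T1": 0.30, "T2": 0.62, "T3": 0.90}
controls = {"C1": 0.28, "C2": 0.60, "C3": 0.35, "C4": 0.10}
pairs = match_on_propensity(treated, controls)
```

Here patient T3 has no comparable control and stays unmatched. Matching of this kind balances only the covariates that went into the score, so unobserved differences between the groups remain.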

One of the new statistical methods being used to design and analyze clinical trials is the Bayesian approach, which has been implemented and used at CDRH for the last seven years, giving excellent results. The Bayesian approach is a statistical theory and approach to data analysis that provides a coherent method for learning from evidence as it accumulates. Traditional (also called frequentist) statistical methods formally use prior information only in the design of a clinical trial. In the data analysis stage, prior information is not part of the analysis. In contrast, the Bayesian approach uses a consistent, mathematically formal method called Bayes’ Theorem for combining prior information with current information on a quantity of interest. When good prior information on a clinical use of a medical device exists, the Bayesian approach may enable the FDA to reach the same decision on a device with a smaller-sized or shorter-duration pivotal trial. Good prior information is often available for medical devices. The sources of prior information include the company’s own previous studies, previous generations of the same device, data registries, data on similar products that are available to the public, pilot studies, literature controls, and legally available previous experience using performance characteristics of similar products. The payoff of this approach is the ability to conduct smaller and shorter trials, and to use more information for decision making. Medical device trials are amenable to the use of prior information because the mechanism of action of medical devices is typically physical, making the effects local and not systemic. Local effects are often predictable from prior information when modifications to a device are minor.
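A minimal sketch of how prior information combines with new data is the conjugate beta-binomial update for a success rate. The counts below are hypothetical, the prior is taken at full weight for simplicity (in practice prior data are often discounted), and nothing here reproduces an actual CDRH analysis.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Conjugate update: a Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    return a / (a + b)

def beta_variance(a, b):
    return (a * b) / ((a + b) ** 2 * (a + b + 1))

# Hypothetical prior study of an earlier device generation: 40 successes,
# 10 failures, folded into a flat Beta(1, 1) starting point.
informative_prior = beta_binomial_update(1, 1, 40, 10)

# Hypothetical new pivotal trial: 24 successes, 6 failures.
with_prior = beta_binomial_update(*informative_prior, 24, 6)
flat_only = beta_binomial_update(1, 1, 24, 6)
```

With the same new data, the posterior that borrows from the earlier study has a smaller variance than the one starting from a flat prior, which is the mechanism behind the smaller or shorter trials described above.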

Bayesian methods may be controversial when the prior information is based mainly on personal opinion (often derived by elicitation methods). They are often not controversial when the prior information is based on empirical evidence such as prior clinical trials. Since sample sizes are typically small for device trials, good prior information can have greater impact on the analysis of the trial and thus on the FDA decision process.

The Bayesian approach may also be useful in the absence of informative prior information. First, the approach can provide flexible methods for handling interim analyses and other modifications to trials in midcourse (e.g., changes to the sample size). Conducting an interim analysis during a Bayesian clinical trial and being able to predict the outcome at midcourse enables early stopping either for early success or for futility. Another advantage of the Bayesian approach is that it allows for changing the randomization ratio at mid-trial. This can ensure that more patients in the trial receive the intervention with the highest probability of success, and it is not only ethically preferable but also encourages clinical trial participation. Finally, the Bayesian approach can be useful in complex modeling situations where a frequentist analysis is difficult to implement or does not exist.
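Predicting the outcome at midcourse can be sketched with a beta-binomial predictive probability. The example assumes a single-arm trial with a binary endpoint, a flat Beta(1, 1) prior by default, and a simplified success rule (at least `min_successes` successes by the end of the trial); real Bayesian designs typically define success through a posterior probability criterion, and all numbers are hypothetical.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, m, a, b):
    """Probability of k successes among m future patients under a Beta(a, b) posterior."""
    return comb(m, k) * exp(log_beta(a + k, b + m - k) - log_beta(a, b))

def predictive_prob_success(x, n, n_total, min_successes, prior_a=1.0, prior_b=1.0):
    """Predictive probability that the completed trial reaches min_successes.

    x successes have been observed among the first n of n_total patients.
    """
    a, b = prior_a + x, prior_b + (n - x)
    m = n_total - n                      # patients still to be observed
    needed = max(min_successes - x, 0)   # further successes required
    return sum(beta_binomial_pmf(k, m, a, b) for k in range(needed, m + 1))

# Hypothetical interim look after 20 of 40 patients; 30 successes are needed overall.
promising = predictive_prob_success(x=18, n=20, n_total=40, min_successes=30)
struggling = predictive_prob_success(x=12, n=20, n_total=40, min_successes=30)
```

A design might stop early for success when this probability is very high and for futility when it is very low; the specific cutoffs would be fixed in the statistical plan and are not given in the text.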

Several devices have been approved through the use of the Bayesian approach. The first example was the INTER FIX™ Threaded Fusion Device by Medtronic Sofamor Danek, which was approved in 1999. That device is indicated for spinal fusion in patients with degenerative disc disease. In that case, a Bayesian predictive analysis was used to stop the trial early. The statistical plan combined data from 12-month visits with partial data from 24-month visits to predict the results for patients who had not yet reached 24 months in the study. After approval, the sponsor completed the follow-up requirements for the patients enrolled in the study. The final results validated the Bayesian predictive analysis, which significantly reduced the time needed to complete the trial (FDA 1999).

Another example is the clinical trial designed to assess the safety and effectiveness of the LT-CAGE™ Tapered Fusion Device, by Medtronic Sofamor Danek, approved in September 2000. This device is also indicated for spinal fusion in patients with degenerative disc disease. The trial to assess safety and effectiveness of the device was planned as Bayesian, and Bayesian statistical methods were used to analyze the results. Data from patients evaluated at both 12 and 24 months were combined with data from patients evaluated only at 12 months to make predictions and comparisons for success rates at 24 months. The Bayesian predictions performed during the interim analyses significantly reduced the sample size and the time that was needed for completion of the trial. Again, the results were later confirmed (Lipscomb et al. 2005; FDA 2002).

A third example, where prior information was used along with interim analyses, is the Bayesian trial for the St. Jude Medical Regent heart valve, which was a modification of the previously approved St. Jude Standard heart valve. The objective of this trial was to assess the safety and effectiveness of the Regent heart valve. The trial used prior information from the St. Jude Standard heart valve by borrowing the information via Bayesian hierarchical models. In addition, the Bayesian experimental design provided a method to determine the stopping time based on the amount of information gathered during the trial and the prediction of what future results would be. The trial stopped early for success (FDA 2006).

In 2006, the FDA issued a draft guidance for industry and FDA staff that elaborates on the use of Bayesian methods. It covers Bayesian statistics, planning a Bayesian clinical trial, analyzing a Bayesian clinical trial, and post-market surveillance. A public meeting to discuss the guidance took place in July 2006; information about it can be found at http://www.fda.gov/cdrh/meetings/072706-bayesian.html.

In general, adaptive trial designs, either Bayesian or frequentist, constitute an emerging field that seems to hold promise for more ethical and efficient development of medical interventions by allowing fuller integration of available knowledge as trials proceed. However, all aspects and trade-offs of such designs need to be understood before they are widely used. Clearly there are major logistic, procedural, and operational challenges in using adaptive clinical trial designs, not all of them as yet resolved. However, they have the potential to play a large role and be beneficial in the future. The Pharmaceutical Research and Manufacturers of America (PhRMA) and the FDA organized a workshop that took place on November 13 and 14, 2006, in Bethesda, Maryland, to discuss challenges, opportunities, and scope of adaptive trial designs in the development of medical interventions. PhRMA has formed a working group on adaptive designs that aims to facilitate a constructive dialogue on the topic by engaging statisticians, clinicians, and other stakeholders in academia, regulatory agencies, and industry to facilitate broader consideration and implementation of such designs. PhRMA produced a series of articles that have been published in the Drug Information Journal, Volume 40, 2006.

Finally, formal decision analysis is a mathematical tool that should be used when making decisions on whether or not to approve a device. This methodology has the potential to enhance the decision-making process and make it more transparent by better accounting for the magnitude of the benefits as compared with the risks of a medical intervention.
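In its simplest form, a formal decision analysis weighs each possible outcome of an action by its probability and its utility and then compares expected utilities across actions. The sketch below shows that calculation; the probabilities, utility values, and action names are hypothetical and are not drawn from any FDA analysis.

```python
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs whose probabilities sum to 1."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * u for p, u in outcomes)

def best_action(actions):
    """actions: dict mapping action name -> list of (probability, utility) pairs."""
    return max(actions, key=lambda name: expected_utility(actions[name]))

def approval_decision(harm_utility):
    # Hypothetical device: 10% of patients benefit (+1.0), 1% suffer a serious
    # adverse event (harm_utility), and 89% see no change (0.0).
    actions = {
        "approve": [(0.10, 1.0), (0.01, harm_utility), (0.89, 0.0)],
        "do not approve": [(1.0, 0.0)],
    }
    return best_action(actions)

mild_harm = approval_decision(harm_utility=-5.0)
severe_harm = approval_decision(harm_utility=-20.0)
```

The same probabilities lead to opposite decisions depending on how heavily the adverse event is weighted, which is why making the utilities explicit can make the decision process more transparent.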

CDRH is also committed to achieving a seamless approach to regulation of medical devices in which the pre-market activities are integrated with continued post-market surveillance and enforcement. In addition, appropriate and timely information is fed back to the public. This regulatory approach encompasses the entire life cycle of a medical device. The “total product life cycle” approach enhances CDRH’s ability to fulfill its mission to protect and promote public health. CDRH’s pre-market review program cannot guarantee that all legally marketed devices will function perfectly in the post-market setting. Pre-market data provide a reasonable estimate of device performance but may not be large enough to detect the occurrence of rare adverse events. Moreover, devices can produce unanticipated outcomes in post-market use, when the environment is not as controlled as in the pre-market setting. Efforts are made to forecast post-market performance based on pre-market data, but the dynamics of the post-market environment create unpredictable conditions that are impossible to investigate during the pre-market phase. As a consequence, CDRH is committed to a Post-market Transformation Initiative and recently published two documents on the post-market safety of medical devices. One describes CDRH’s post-market tools and the approaches used to monitor and address adverse events and risks associated with the use of medical devices that are currently on the market (see “Ensuring the Safety of Marketed Medical Devices: CDRH’s Medical Device Post-market Safety Framework”). The second document provides a number of recommendations for improving the post-market program (see “Report of the Post-market Transformation Leadership Team: Strengthening FDA’s Post-market Program for Medical Devices”). Both of these documents are available at http://www.fda.gov/cdrh/postmarket/mdpi.html. Notably, one of the recommended actions for transforming the way CDRH handles post-market information to assess the performance of marketed medical device products is to design a pilot study to investigate quantitative decision-making techniques to evaluate medical devices throughout the “total product life cycle.”

In conclusion, as the world of medical devices becomes more complex, the Center for Devices and Radiological Health is developing tools to collect information, make decisions, and manage risk in the twenty-first century. Emerging medical device technology will fundamentally transform the healthcare and delivery system, provide new and cutting-edge solutions, challenge existing paradigms, and revolutionize the way treatments are administered.


David M. Eddy, M.D., Ph.D., and David C. Kendrick, M.D., M.P.H.

Archimedes, Inc.

A commitment to evidence-based medicine makes excellent sense. It helps ensure that decisions are founded on empirical observations. It helps ensure that recommended treatments are in fact effective and that ineffective treatments are not recommended. It also helps reduce the burden, uncertainty, and variations that plague decisions based on subjective judgments. Ideally, we would answer every important question with a clinical trial or other equally valid source of empirical observations.

Unfortunately, this is not feasible. Reasons include high costs, long durations, large sample sizes, difficulty getting physicians and patients to participate, large number of options to be studied, speed of technological innovation, and the fact that the questions can change before the trials are completed. For these reasons we need to find alternative ways to answer questions—to fill the gaps in the empirical evidence.

One of these is to use mathematical models. The concept is straightforward. Mathematical models use observations of real events (data) to derive equations that represent the relationships between variables. These equations can then be used to calculate events that have never been directly observed. For a simple example, data on the distances traveled when moving at particular speeds for particular lengths of time can be used to derive the equation “distance = rate × time” (D = RT). Then, that equation can be used to calculate the distance traveled at any other speeds for any other times. Mathematical models have proven themselves enormously valuable in other fields, from calculating mortgage payments, to designing budgets, to flying airplanes, to taking photos of Mars, to e-mail. They have also been successful in medicine, examples being computed tomography (CT) scans and magnetic resonance imaging (MRI), radiation therapy, and electronic health records. Surely there must be a way they can help us improve the evidence base for clinical medicine.
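The distance example can be sketched in a few lines: fit a coefficient from observed trips, then use the fitted equation for a trip that was never observed. The observations are hypothetical, and the functional form (distance proportional to rate × time) is assumed rather than discovered from the data.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope c for the no-intercept model y = c * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical observations: (speed in km/h, time in h, distance in km).
observations = [(50, 1.0, 50.0), (60, 2.0, 120.0), (80, 0.5, 40.0), (100, 1.5, 150.0)]

products = [rate * time for rate, time, _ in observations]
distances = [distance for _, _, distance in observations]

# Fit D = c * (R * T); for data that obey D = RT, the coefficient comes out as 1.
c = fit_through_origin(products, distances)

def predict_distance(rate, time, coefficient=c):
    """Apply the fitted equation to a speed and time that were not observed."""
    return coefficient * rate * time

unobserved_trip = predict_distance(70, 3.0)
```

Clinical models follow the same pattern on a much larger scale: equations are derived from observed data and then used to calculate outcomes for settings that no trial has covered.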

There is very good reason to believe they can, provided some conditions are met. First, we must understand that models will never be able to completely replace clinical trials. There are several reasons. Most fundamentally, trials are our anchor to reality; they are observations of real events. Models are not directly connected to reality. Indeed, models are built from trials and other sources of empirical observations. They are simplified representations of reality, filtered by observations and constrained by equations, and will never be as accurate as reality. Not only are they one step removed from empirical observations, but they cannot exist without them. Thus, if it is feasible to answer a question with a clinical trial, then that is the preferred approach. Models should be used to fill the gaps in evidence only when clinical trials are not feasible.

The second condition is that the model should be validated against the clinical trials that do exist. More specifically, before we rely on a model to answer a question we should ensure that it accurately reproduces or predicts the most important clinical trials that are adjacent to or surround that question. The terms “adjacent to” and “surround” are intended to identify the trials that involve similar populations, interventions, and outcomes. For example, suppose we want to compare the effects of atorvastatin, simvastatin, and pravastatin on the 10-year rate of myocardial infarctions (MIs) in people with coronary artery disease (previous MI, angina, history of percutaneous transluminal coronary angioplasty [PTCA], or bypass). These head-to-head comparisons have never been performed, and it would be extraordinarily difficult to do so, given the long time period (10 years), very large sample sizes required (tens of thousands), and very high costs (hundreds of millions of dollars). However, a mathematical model could help answer these questions if it had already been shown to reproduce or predict the existing trials of these drugs versus placebos in similar populations. In this case the major adjacent trials would include 4S, the Scandinavian Simvastatin Survival Study (Randomised trial of cholesterol lowering in 4,444 patients with coronary heart disease [4S] 1994); WOSCOPS (Shepherd et al. 1995); CARE (Flaker et al. 1999); LIPID (Prevention of cardiovascular events and death with pravastatin in patients with coronary heart disease and a broad range of initial cholesterol levels [LIPID] 1998); PROSPER (Shepherd et al. 2002); CARDS (Colhoun et al. 2004); TNT (LaRosa et al. 2005); and IDEAL (Pedersen et al. 2005).

The methods for selecting the surrounding trials and performing the validations are beyond the scope of this paper, but four important elements are that (1) the trials should be identified or at least reviewed by a third party, (2) the validations should be performed at the highest level of clinical detail of which the model is capable, (3) all the validations should be performed with the same version of the model, and (4) to the greatest extent possible, the validations should be independent in the sense that they were not used to help build the model. On the third point, it would be meaningless if a model were tweaked or parameters were refitted to match the results of each trial. On the fourth point, it is almost inevitable that some trials will have been used to help build a model. In those cases we say that the validation is “dependent”; these validations ensure that the model can faithfully reproduce the assumptions used to build it. If no information from a trial was used to help build the model, we say that a validation against that trial is “independent.” These validations provide insights into the model’s ability to simulate events in new areas, such as new settings, target populations, interventions, outcomes, and durations.

If these conditions are met for a question (it is not feasible to conduct a new trial to answer the question, and there is a model that can reproduce or predict the major trials that are most pertinent to the question), then it is reasonable to use the model to fill in the gaps between the existing trials. While that approach will not be as desirable as conducting a new clinical trial, one can certainly argue that it is better than the alternative, which is clinical or expert judgment.

If a model is used, then the degree of confidence we can place in its results will depend on the number of adjacent trials against which it has been validated, on the “distance” between the questions being asked and the real trials, and on how well the model’s results matched the real results. For example, one could have a fairly high degree of confidence in a model’s results if the question is about a subpopulation of an existing trial whose overall results the model has already predicted. Other examples of analyses about which we could be fairly confident are the following:

  • Head-to-head comparisons of different drugs, all of which have been studied in their own placebo-controlled trials, such as comparing atorvastatin, simvastatin, and pravastatin;
  • Extension of a trial’s results to settings with different levels of physician performance and patient compliance;
  • Studies of different doses of drugs, or combinations of drugs, for which there are good data from phase II trials on biomarkers, and there are other trials connecting the biomarkers to clinical outcomes;
  • Extensions of a trial’s results to longer follow-up times; and
  • Analyses of different mixtures of patients, such as different proportions of people with CAD, particular race/ethnicities, comorbidities, or use of tobacco, provided the model’s accuracy for these groups has been tested in other trials.

As one moves further from the existing trials and validations, the degree of confidence in the model’s results will decrease. At the extreme, a model that is well validated for, say, Type 2 diabetes cannot be considered valid for a different disease, such as coronary artery disease (CAD), congestive heart failure (CHF), cancer, or even Type 1 diabetes. A corollary of this is that a model is never “validated” in a general sense, as though that were a property of the model that carries with it to every new question. Models are validated for specific purposes, and as each new question is raised, their accuracy in predicting the trials that surround that question needs to be examined.

Example: Prevention of Diabetes in High-Risk People

We can illustrate these concepts with an example. Several studies have indicated that individuals at high risk for developing diabetes can be identified from the general population and that with proper management the onset of diabetes can be delayed, or perhaps even prevented altogether (Tuomilehto et al. 2001; Knowler et al. 2002; Snitker et al. 2004; Chiasson et al. 1998; Gerstein et al. 2006). Although these results indicate the potential value of treating high-risk people, the short durations and limited number of interventions studied in these trials leave many important questions unanswered.

The Diabetes Prevention Program (DPP), for example, showed that in people at high risk of developing diabetes, about 35 percent developed diabetes over a follow-up period of four years (the control arm). Metformin decreased this to about 29 percent, for a relative reduction of about 17 percent. Lifestyle modification decreased it to about 18 percent, for a relative reduction of about 48 percent. Over the mean follow-up period of 2.8 years, the relative reduction with lifestyle modification was about 58 percent. This is certainly an encouraging finding and is sufficient to stimulate interest in diabetes prevention. However, 2.8 years or even 4 years is far too short to determine the effects of these interventions on the long-term progression of diabetes or any of its complications; for example:

  • Do the prevention programs just postpone diabetes or do they prevent it altogether?
  • What are the long-term effects of the prevention programs on the probabilities of micro- and macrovascular complications of diabetes, such as cardiovascular disease, retinopathy, and nephropathy?
  • What are the effects on long-term costs, and what are the cost-effectiveness ratios of the prevention programs?
  • Are there any other programs that might be more cost effective?
  • What would a program have to cost in order to break even—no increase in net cost?
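The four-year relative reductions quoted above for the DPP can be checked with simple arithmetic. The sketch below uses the rounded percentages from the text, so its results agree with the quoted figures only to within rounding (the lifestyle figure comes out slightly above the quoted 48 percent).

```python
def relative_risk_reduction(control_risk, treated_risk):
    """Proportional drop in risk relative to the control arm."""
    return (control_risk - treated_risk) / control_risk

def absolute_risk_reduction(control_risk, treated_risk):
    """Difference in risk between the control and treated arms."""
    return control_risk - treated_risk

# Rounded four-year cumulative incidences of diabetes quoted in the text.
control, metformin, lifestyle = 0.35, 0.29, 0.18

rrr_metformin = relative_risk_reduction(control, metformin)
rrr_lifestyle = relative_risk_reduction(control, lifestyle)
arr_lifestyle = absolute_risk_reduction(control, lifestyle)
```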

These new questions need to be answered if we are to plan diabetes prevention programs rationally. Ideally, we would answer them by continuing the DPP for another 20 to 30 years. But that is not possible for obvious reasons. The only possible method is to use a mathematical model to extend the trial. Specifically, if a model contains all the important variables and can demonstrate that it is capable of reproducing the DPP, along with other trials that document the outcomes of diabetes, then we could use it to run a simulated version of the DPP for a much longer period of time. This approach would also enable us to explore other types of prevention activities and see how they compare with metformin and the lifestyle modification program used in the DPP.

An example of such a model is the Archimedes model. Descriptions of the model have been published elsewhere (Schlessinger and Eddy 2002; Eddy and Schlessinger 2003a, 2003b). Basically, the core of the model is a set of ordinary and differential equations that represent human physiology at roughly the level of detail found in general medical textbooks, patient charts, and clinical trials. It is continuous in time, with clinical events occurring at any time. Biological variables are continuous and relate to one another in ways that they are understood to interact in vivo. Building out from this core, the Archimedes model includes the development of signs and symptoms, patient behaviors in seeking care, clinical events such as visits and admissions, protocols, provider behaviors and performance, patient compliance, logistics and utilization, health outcomes, quality of life, and costs. Thus the model simulates a comprehensive health system in which virtual people get virtual diseases, seek care at virtual hospitals and clinics, are seen by virtual healthcare providers, who have virtual behaviors, use virtual equipment and supplies, generate virtual costs, and so forth. An analogy is Electronic Arts’ SimCity game, but starting at the level of detail of the underlying physiologies of each of the people in the simulation rather than city streets and utility systems. This relatively high level of physiological detail enables the model to simulate diseases such as diabetes and their treatments. For example, in the model people have livers, which produce glucose, which is affected by insulin resistance and can be affected by metformin. Similarly, people in the model can change their lifestyles and lose weight, which affects the progression of many things including insulin resistance, blood pressure, cholesterol levels, and so forth. Thus the Archimedes model is well positioned to study the effects of activities to prevent diabetes.

The Archimedes model is validated by using the simulated healthcare system to conduct simulated clinical trials that correspond to real clinical trials (Eddy et al. 2005). This provides the opportunity to compare the outcomes calculated in the model with outcomes seen in the real trials. Thus far the model has been validated against more than 30 trials. The first 18 trials, with seventy-four separate treatment arms and outcomes, were selected by an independent committee appointed by the American Diabetes Association (ADA) and have been published (Eddy et al. 2005). The overall correlation coefficient between the model’s results and those of the actual trials is 0.98. Ten of the eighteen trials in the ADA-chosen validations provided independent validations; they were not used to build the model itself. The correlation coefficient for these independent validations was 0.96. An example of an independent validation that is particularly important for this application is a prospective, independent validation of the DPP trial itself; the published results matched the predicted results quite closely (Figure 2-1). The Archimedes model also accurately simulated several trials that observed the progression of diabetes, development of complications, and effects of treatment.

FIGURE 2-1. Model’s predictions of outcomes in Diabetes Prevention Program.


Comparison of proportions of people progressing to diabetes in the control group observed in the real Diabetes Prevention Program (DPP) (solid lines) and in the simulation of the …

An important example is the progression of diabetes and development of coronary artery disease in the United Kingdom Prospective Diabetes Study (Figure 2-2). The ability of the model to simulate or predict a large number of trials relating to diabetes and its complications builds confidence in its results.

FIGURE 2-2. Comparison of model’s calculations and results of the United Kingdom Prospective Diabetes Study (UKPDS): Rates of myocardial infarctions in control and treated groups.


SOURCE: Eddy et al. Annals of Internal Medicine 2005; 143:251–264.

Thus the prevention of diabetes in high-risk people meets the criteria outlined above—it is impractical or impossible to answer the important questions with real clinical trials, there is a model capable of addressing the questions at the appropriate level of physiological detail, and the model has been successfully validated against a wide range of adjacent clinical trials.


Use of the Archimedes model to analyze the prevention of diabetes in high-risk people has been reported in detail elsewhere (Eddy et al. 2005). To summarize, the first step was to create a simulated population that corresponds to the population used in the DPP trial. This was done by starting with a representative sample of the U.S. population from the National Health and Nutrition Examination Survey (NHANES; National Health and Nutrition Examination Survey 1998–2002) and then applying the inclusion and exclusion criteria for the DPP to select a sample that matched the DPP population. Specifically, the DPP defined individuals to be at high risk for developing diabetes and included them in the trial if they had all of the following: body mass index (BMI) > 24, fasting plasma glucose (FPG) 90–125 mg/dL, and oral glucose tolerance test (OGTT) of 140–199 mg/dL. We then created copies or clones of the selected people from NHANES by matching them on approximately 35 variables. A total of 10,000 people were selected and copied. This group was then exposed to three different interventions, corresponding to the arms of the real trial (baseline or control, metformin begun immediately, and the DPP lifestyle program begun immediately). The three groups were then followed for 30 years and observed for progression of diabetes and development of major complications such as myocardial infarction, stroke, end-stage renal disease, and retinopathy. Cost-generating events as well as symptoms and outcomes that affect the quality of life were also measured. The results could then be used to answer the questions about the long-term effects of diabetes prevention.

Do the Prevention Programs Just Postpone Diabetes or Do They Prevent It Altogether?

This can be answered by comparing the effects of metformin and lifestyle on the proportion of people who developed diabetes over the 30-year period. The results are shown in Figure 2-3. The natural rate of progression to diabetes, seen in the control group, was 72 percent over the 30-year follow-up period. Lifestyle modification, as offered in the DPP and continued until a person develops diabetes, would reduce the incidence of diabetes to about 61 percent, for a relative reduction of 15 percent. Thus, over a 30-year horizon the DPP lifestyle modification would actually prevent diabetes in about 11 percent of cases, while delaying it in the remaining 61 percent. In the metformin arm, about 4 percent of cases of diabetes would be prevented, for a 5.5 percent relative reduction in the 30-year incidence of diabetes.
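The arithmetic behind these relative reductions can be checked with a short sketch. It is only an illustration of the calculation, not part of the Archimedes model; the function name is hypothetical, and the 30-year metformin incidence is inferred here as the control rate minus the "about 4 percent" of cases prevented, since the text does not quote it directly.

```python
def relative_reduction(control_rate: float, treated_rate: float) -> float:
    """Relative reduction in incidence: (control - treated) / control."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (control_rate - treated_rate) / control_rate


# 30-year incidence of diabetes quoted in the text, as fractions.
control = 0.72
lifestyle = 0.61
# Assumption: metformin incidence inferred as control minus the
# "about 4 percent" of cases prevented; it is not stated in the text.
metformin = control - 0.04

for name, rate in [("lifestyle", lifestyle), ("metformin", metformin)]:
    absolute = control - rate  # cases prevented, as a fraction of the cohort
    relative = relative_reduction(control, rate)
    print(f"{name}: absolute {absolute:.1%}, relative {relative:.1%}")
```

With these rounded inputs the relative reductions come out near 15 percent and 5.6 percent, which agrees with the reported 15 and 5.5 percent to within rounding of the published figures.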

FIGURE 2-3. Model’s calculation of progression to diabetes in four programs.


SOURCE: Eddy et al. Annals of Internal Medicine 2005; 143:251–264.

What Are the Long-Term Effects of the Prevention Programs on the Probabilities of Micro- and Macrovascular Complications of Diabetes, like Cardiovascular Disease, Retinopathy, and Nephropathy?

This question is also readily answered, in this case by counting the number of clinical outcomes that occur in the control and lifestyle groups. The effects of the DPP lifestyle program on long-term complications of diabetes are shown in Table 2-1. The 30-year rate of serious complications (including myocardial infarctions, congestive heart failure, retinopathy, stroke, nephropathy, and neuropathy) was reduced by an absolute 8.4 percentage points, from about 38.2 percent to about 29.8 percent, or a relative decrease of about 22 percent. Table 2-1 also shows the effects on other outcomes.

TABLE 2-1. Expected Outcomes Over Various Time Horizons for Typical Person with DPP Characteristics.



What Are the Effects on Long-Term Costs, and What Are the Cost-Effectiveness Ratios of the Prevention Programs?

The effects of the prevention activities on these outcomes can be determined by tracking all the clinical events and conditions that affect quality of life or that generate costs. Over 30 years, the aggregate per-person cost of providing care for diabetes and its complications in the control group was $37,171. The analogous costs in the metformin and lifestyle groups were $4,081 and $9,969 higher, respectively. The average cost-effectiveness ratios for the metformin and lifestyle groups (both compared to no intervention, or the control group), measured in terms of dollars per quality adjusted life year (QALY) gained, were $35,523 and $62,602, respectively.

Are There Any Other Programs That Might Be More Cost-Effective?

The DPP had three arms: control, metformin begun immediately (i.e., when the patient is at risk of developing diabetes, but has not yet developed diabetes), and lifestyle modification begun immediately. Given the high cost of the lifestyle intervention as it was implemented in the DPP, it is reasonable to ask what the effect would be of waiting until a person progressed to diabetes and then beginning the lifestyle intervention. It is clearly not possible to go back and restart the DPP with this new treatment arm, but it is fairly easy to add it to a simulated trial. The results are summarized in Table 2-2. Compared to beginning the lifestyle modification immediately, waiting until a person develops diabetes gives up about 0.034 QALY, or about 21 percent of the effectiveness seen with immediate lifestyle modification. However, the delayed lifestyle program increases costs about $3,066, or about one-third as much as the immediate lifestyle program. Thus the delayed program is more cost-effective in the sense that it delivers a quality-adjusted life year at a lower cost than beginning the lifestyle modification immediately—$24,523 versus $62,602. If the immediate lifestyle program is compared to the delayed lifestyle program, the marginal cost per QALY of the immediate program is about $201,818.
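The cost-effectiveness comparison above can be reproduced approximately from the figures quoted in the text. The sketch below is a back-of-the-envelope check, not the published analysis: the QALY gains are not quoted directly, so they are backed out from each program's added cost and its cost per QALY, and the function and variable names are hypothetical.

```python
def icer(delta_cost: float, delta_qaly: float) -> float:
    """Incremental cost-effectiveness ratio, in dollars per QALY gained."""
    if delta_qaly <= 0:
        raise ValueError("delta_qaly must be positive")
    return delta_cost / delta_qaly


# Figures quoted in the text, each measured against the control arm.
immediate_cost, immediate_ratio = 9_969, 62_602  # added cost ($), $ per QALY
delayed_cost, delayed_ratio = 3_066, 24_523

# Assumption: QALYs gained are recovered as added cost / cost per QALY.
immediate_qaly = immediate_cost / immediate_ratio
delayed_qaly = delayed_cost / delayed_ratio

qaly_given_up = immediate_qaly - delayed_qaly
share_given_up = qaly_given_up / immediate_qaly
marginal = icer(immediate_cost - delayed_cost, qaly_given_up)

print(f"QALYs given up by delaying: {qaly_given_up:.3f} ({share_given_up:.0%})")
print(f"Marginal cost per QALY of starting immediately: ${marginal:,.0f}")
```

Because the inputs are rounded, the marginal ratio computed this way lands close to, but not exactly on, the reported figure of about $201,818. The point of the sketch is the structure of the comparison: the immediate program must be judged against the delayed program, not only against no intervention.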

TABLE 2-2. 30-Year Costs, QALYs, and Incremental Costs/QALY for Four Programs from Societal Perspective (Discounted 3%).



What Would a Program Have to Cost in Order to Break Even—No Increase in Net Cost?

This can be addressed by conducting a sensitivity analysis on the cost of the intervention. Figure 2-4 shows the relationship between the cost of the DPP lifestyle program and the net financial costs. In order to break even, the DPP lifestyle program would have to cost $100 if begun immediately and about $225 if delayed until after a person develops diabetes. In the DPP trial itself, the lifestyle modification program cost $1,356 in the first year and about $672 in subsequent years.

FIGURE 2-4. Costs of two programs for diabetes prevention.


SOURCE: Eddy et al. Annals of Internal Medicine 2005; 143:251–264.

Discussion and Conclusions

This example illustrates how models can be used to help fill the gaps in the evidence provided by clinical trials and other well-designed empirical studies. Each of the questions addressed above is undeniably important, but each is also impossible to answer empirically. One way or another we are going to have to develop methods for filling the gaps between trials. The solution we describe here is to use trials to establish the effectiveness of interventions, but then use models to extend the trials in directions that are otherwise not feasible. In this case the DPP established that both metformin and intensive lifestyle modification decrease the rate of progression to diabetes in high-risk people. Its results suggest that either of those interventions, plus variations on them such as different doses or timing, should reduce downstream events such as the complications of diabetes and their costs. However, the trial results cannot by themselves determine the actual magnitudes of the downstream effects: the actual probabilities of the complications with and without the interventions, and the actual effects on costs. The DPP trial itself is reported to have cost on the order of $175 million. Continuing it for another 30 years, or even another 10 years, is clearly not possible. Furthermore, once the beneficial effects of the interventions have been established it would be unethical to continue the trial as originally designed. Thus, if we are limited to the clinical trial by itself, we will never know the long-term health and economic outcomes that are crucially needed for rational planning.

There are three main ways to proceed. One is to ignore the lack of information about the magnitudes of the effects and promote the prevention activities on the general principle that their benefits have been shown. Since this option is not deterred by a lack of information about actual health or economic effects, it might as well promote the most expensive and effective intervention—in this case intensive lifestyle modification begun immediately. This type of nonquantitative reasoning has been the mainstay of medical decision making for decades and might still be considered viable except for two facts. First, it provides no basis for truly informed decision making; if a patient or physician wants to know what can be expected to occur or wants to consider other options, this approach is useless. Second, this approach almost uniformly drives up costs. While that might have been acceptable in the past, it is not acceptable or maintainable today.

The second approach is to rely on expert opinion to estimate the long-term effects. The difficulty here is that the size of the problem far exceeds the capacity of the human mind. When we can barely multiply 17 × 23 in our heads, there is no hope that we can mentally process all the variables that affect the outcomes of preventing diabetes with any degree of accuracy. As a result, different experts come up with different estimates, and there is no way to determine if any of them is even close to being correct.

The third approach is the one taken here—to use mathematical models to keep all the variables straight and perform all the calculations. In a sense, this is the logical extension of using expert opinion; use the human mind for what it is best at—storing and retrieving information, finding patterns, raising hypotheses, designing trials—and then call on formal analytical methods and the power of computers (all human-made, by the way) to perform the quantitative parts of the analysis. This approach can also be viewed as the logical extension of the clinical trial and other empirical research. Trials produce raw data. We already use quantitative methods to interpret the data—classical statistics if nothing else. The types of models we are talking about here are in the same vein, but they extend the methods to encompass information from a wider range of clinical trials and other types of research (to build and validate the models) and then extend the analyses in time to estimate long-term outcomes.

With all this said, however, it is also important to note that in the same ways that not all experts are equal and not all trial designs are equal, not all models are equal. Our proposal that models can be used to help fill the gaps in trials carries a qualification that this should be done only if the ability of the model to simulate real trials has been demonstrated. One way to put this is that if a model is to be used to fill a gap in the existing evidence, it should first be shown to accurately simulate the evidence that exists on either side of the gap. In this example, the model should be shown to accurately simulate (or as in this case, prospectively predict) the DPP trial of the prevention of diabetes (Figure 2-1) as well as other trials of outcomes that have studied the development of complications and their treatments (e.g., Figure 2-2). The demonstration of a model’s ability to simulate existing trials, together with the condition that additional trials are not feasible, forms the set of conditions we would propose for using models to fill the gaps in evidence.

This example has demonstrated that there are problems, and models, that meet these conditions today. In addition there are good reasons to believe that the power and accuracy of models will improve considerably in the near future. The main factor that will determine the pace of improvement is the availability of person-specific data. Access to such data should increase with the spread of EHRs, as more clinical trials are conducted, as the person-specific data from existing trials are made more widely accessible, as models push deeper into the underlying physiology, and as modelers focus more on validating their models against the data that do exist.


Sheldon Greenfield, M.D., University of California at Irvine, and Richard L. Kravitz, M.D., M.S.P.H., University of California at Davis

Three evolving phenomena indicate that results generated by randomized controlled trials are increasingly inadequate for the development of guidelines, for payment, and for creating quality-of-care measures. First, patients now eligible for trials have a broader spectrum of illness severity than previously. Patients at the lower end of disease severity, who are less likely to benefit from a drug or intervention, are now being included in RCTs. The recent null results from trials of calcium and of clopidogrel are examples of this phenomenon. Second, due to the changing nature of chronic disease along with increased patient longevity, more patients now suffer from multiple comorbidities. These patients are frequently excluded from clinical trials. Both of these phenomena make the results from RCTs generalizable to an increasingly small percentage of patients. Third, powerful new genetic and phenotypic markers that can predict patients’ responsiveness to therapy and vulnerability to adverse effects of treatment are now being discovered. Within clinical trials, these markers could be used to identify the patients most likely to respond to the treatment under investigation.

The current research paradigm underlying evidence-based medicine, and therefore guideline development and quality assessment, is consequently flawed in two ways. First, the “evidence” includes patients who may benefit only minimally from the treatment being tested, resulting in negative trials and potential undertreatment. Second, attempts to generalize the results from positive trials to patients who have been excluded from those trials (e.g., for presence of multiple comorbidities) have resulted in potential overtreatment or ineffective treatment.

The major concern for clinical/health services researchers and policy makers is the identification of appropriate “inference groups.” To whom are the results of trials being applied and for what purpose? Patients with multiple comorbidities are commonly excluded from clinical trials. Some of these conditions can mediate the effects of treatment and increase heterogeneity of response through (1) altered metabolism or excretion of treatment; (2) polypharmacy leading to drug interactions; (3) nonadherence resulting from polypharmacy; or (4) increasing overall morbidity and reducing life expectancy. Research in Type 2 diabetes has shown that comorbidities producing early mortality or poor health status reduce the effectiveness of long-term reduction of plasma glucose. In the United Kingdom Prospective Diabetes Study (UKPDS), reducing the level of coexistent hypertension had considerably greater impact on subsequent morbidity and mortality than did reducing hyperglycemia to near-normal levels. Two decision analytic models have shown that there is very little reduction in microvascular complications based on reductions in hyperglycemia among older patients with diabetes. Similarly, the effectiveness of aggressive treatment for early prostate cancer is much reduced among patients with moderate to major amounts of coexistent disease. This decreased effectiveness must be balanced against mortality from and complications of aggressive therapy to inform patient choice, to improve guidelines for treatment, and to develop measures of quality of care. Several recent national meetings have focused on how guidelines and quality measures need to be altered in “complex” patients, those with more than one major medical condition for whom attention to the heterogeneity of treatment effects (HTE) is so important.

Although the problem of HTE is increasingly recognized, solutions have been slow to appear. Proposed strategies have included exploratory subgroup analysis followed by trials that stratify on promising subgroups. Some have argued for expanded use of experimental designs (n of 1 trials, multiple time series crossover studies, matched pair analyses) that, unlike parallel group clinical trials, can examine individual treatment effects directly. Still others have championed observational studies prior to trials to form relevant subgroups and after trials, as has been done in prostate cancer, to assess the prognosis in subgroups of patients excluded from trials. These strategies could lead to less overtreatment and less undertreatment, and to the tailoring of treatment for maximum effectiveness and minimum cost. The following paper, by the Heterogeneity of Treatment Effects Research Agenda Consortium,1 reviews these issues in greater detail.

Heterogeneity of Treatment Effects

Heterogeneity of Treatment Effects (HTE) has been defined by Kravitz et al. (Kravitz 2004) as variation in results produced by the same treatment in different patients. HTE has always been present; however, two contemporary trends have created an urgency to address the implications of HTE. One is the inclusion of a broader spectrum of illness or risk of outcome in some clinical trials. The other is mounting pressure from payers and patients to follow guidelines, pay according to evidence, and identify indicators of quality of care not only for patients in trials, but for the majority of the population that was not eligible for trials and to which the results of trials may not apply. This latter problem has been exacerbated in recent years by large proportions of the patient population living longer and acquiring other medical conditions that have an impact on the effectiveness of the treatment under study.

With respect to clinical trials, the literature and clinical experience suggest that the problem of identifying subgroups that may be differentially affected by the same treatment is critical, both when the trial results are small or negative and when the trials demonstrate a positive average treatment effect. It has been assumed in devising guidelines, paying for treatments, and setting up quality measures that subgroups behave similarly to the population average. After a trial showing a negative average treatment effect, guideline recommendations may not call for introducing a treatment to a subgroup that would benefit from it. Similarly, when a trial demonstrates a positive average treatment effect across the population, this assumption may encourage the introduction of the added costs, risks, and burdens of a treatment to individuals who may receive no or only a small benefit from it.

The causes of HTE, such as genetic disposition, ethnicity, site differences in care, adherence, polypharmacy, and competing risk (Kravitz 2004), can be classified according to four distinct categories of risk: (1) baseline outcome risk, (2) responsiveness to treatment, (3) iatrogenic risk, and (4) competing risk.

Baseline outcome risk is the rate of occurrence of unfavorable outcomes in a patient population in the absence of the study treatment. Responsiveness to treatment reflects the change in patient outcome risk attributable to the treatment under study. If a sample’s baseline outcome risk of myocardial infarction without treatment is 10 percent and the treatment was 20 percent effective, there would be a 2 percent absolute treatment effect, whereas the same level of effectiveness (20 percent) in a patient sample with a baseline outcome risk of 40 percent would yield an 8 percent absolute decrease in myocardial infarction.
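The worked example in this paragraph amounts to multiplying baseline outcome risk by relative effectiveness. A minimal sketch of that calculation follows; the function name is hypothetical and the two cases are the ones given in the text.

```python
def absolute_treatment_effect(baseline_risk: float, relative_effectiveness: float) -> float:
    """Absolute reduction in outcome risk for a given baseline risk.

    Both arguments are fractions between 0 and 1.
    """
    return baseline_risk * relative_effectiveness


# The two patient samples described in the text: same 20 percent relative
# effectiveness, different baseline risks of myocardial infarction.
for baseline in (0.10, 0.40):
    effect = absolute_treatment_effect(baseline, 0.20)
    print(f"baseline {baseline:.0%} -> absolute effect {effect:.0%}")
```

The same relative effectiveness therefore produces a fourfold difference in absolute benefit between the two samples, which is why baseline outcome risk matters when trial results are applied to individual patients.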

The third type of risk, iatrogenic risk, is the likelihood of experiencing an adverse event related to the treatment under study. Finally, competing risk is the likelihood of experiencing unfavorable outcomes unrelated to the disease and treatment under study, such as death or disability due to comorbid conditions. The causes and implications of each of these sources of HTE are summarized below.

Baseline Outcome Risk

Variation in outcome risk is the best understood source of HTE. Figure 2-5, adapted from Kent and Hayward (Kent 2007), demonstrates how, even if the relative benefit of a treatment (responsiveness) and the risk of adverse events (iatrogenic risk) are constant across a population, patients with the highest risk for the outcome targeted by a treatment often enjoy the greatest absolute benefit, while low-outcome risk patients derive little or no absolute benefit. The lines A, B, and C in this figure depict the expected outcomes with treatment (Y-axis) for a given baseline risk. Line A shows the expected result when the treatment has no effect on outcome. Line B shows the expected result if the treatment reduces the risk of the outcome by 25 percent. Line C shows the expected outcome if the treatment reduces the risk of the outcome by 25 percent but also causes a treatment-related harm independent of baseline risk of 1 percent (i.e., this line is parallel to B, but 1 percent higher). By comparing lines A and C, it is clear that for patients with baseline risks of less than 4 percent, the risks of Treatment C would outweigh its benefits (albeit slightly). The curve shows a hypothetical, unitless baseline risk distribution. High-outcome-risk patients (right of the dashed line in Figure 2-5) may derive sufficient treatment benefit to offset possible harmful side effects, but patients with lower baseline outcome risk, who may be the majority of the sample (left of the dashed line), may fail to derive sufficient benefit from the treatment to justify exposure to possible treatment-related harm. Therefore, with a skewed distribution of this kind, it is possible to have an average risk that yields an overall positive trial with Treatment C, even though most patients in the trial have a risk profile that makes Treatment C unattractive. This phenomenon occurred in the PROWESS trial examining the use of drotrecogin alfa (activated) (Xigris) in sepsis (Vincent et al. 2003), where the overall treatment effect was positive but driven solely by its effect on the sickest patients (APACHE2 > 25). A second RCT focusing on patients with low baseline outcome risk (APACHE < 25) showed no net benefit for this subpopulation (Abraham et al. 2005), demonstrating the wide possible variations due to HTE. A similar phenomenon may have occurred in a clopidogrel-aspirin trial (Bhatt 2006) where the sickest patients benefited and the least sick patients showed a trend toward harm. In this case, there was an overall null effect. In both sets of studies, the overall effect was misleading.
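The way a skewed risk distribution can yield a positive trial that is unfavorable for most participants can be illustrated numerically. The sketch below uses the 25 percent relative reduction and 1 percent treatment-related harm of line C; the two-group population (70 patients at 2 percent baseline risk, 30 at 20 percent) is entirely hypothetical and chosen only to make the effect visible.

```python
RELATIVE_REDUCTION = 0.25  # relative risk reduction of lines B and C
TREATMENT_HARM = 0.01      # fixed treatment-related harm of line C


def net_benefit(baseline_risk: float) -> float:
    """Absolute benefit of Treatment C minus its fixed harm."""
    return RELATIVE_REDUCTION * baseline_risk - TREATMENT_HARM


# Baseline risk at which benefit and harm cancel (4 percent in the text).
break_even_risk = TREATMENT_HARM / RELATIVE_REDUCTION

# Hypothetical skewed trial population: (baseline risk, number of patients).
population = [(0.02, 70), (0.20, 30)]
total = sum(n for _, n in population)

mean_net_benefit = sum(net_benefit(r) * n for r, n in population) / total
share_net_harmed = sum(n for r, _n in [(r, n) for r, n in population]
                       if net_benefit(r) < 0 for n in [_n]) / total

print(f"break-even baseline risk: {break_even_risk:.0%}")
print(f"average net benefit across the trial: {mean_net_benefit:.2%}")
print(f"share of patients with net harm: {share_net_harmed:.0%}")
```

In this hypothetical population the average net benefit is positive, so the trial would read as a success, yet the majority of participants sit below the break-even risk and would on average be slightly harmed.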

FIGURE 2-5. Wide variation of patients’ baseline risk (their risk of suffering a bad outcome in the absence of treatment) is one reason trial results don’t apply equally to all patients.


SOURCE: Adapted from Kent, D, and R Hayward, When averages hid …

Iatrogenic Risk

The causes of side effects include both genetic and environmental components. An environmental example would be a hospital’s failure to recognize and promptly treat side effects (Kravitz 2004).

Responsiveness to Treatment

Responsiveness describes the probability of clinical benefit for the individual from the treatment based on drug absorption, distribution, metabolism or elimination, drug concentration at the target site, or number and functionality of target receptors. Most of the reasons for variations in responsiveness are of genetic origin. Other causes of different trial responses in individual patients include behavioral and environmental variables (Kravitz 2004).

Competing Risk

The demographics of the majority of the population (or the average patient) are changing. The aging of the population, along with the advent of life-prolonging therapies, has caused an increase in the proportion of the patients within a diagnostic category who have multiple comorbidities, are treated with multiple medications, and therefore, have multiple competing sources of mortality. Compared with 20 years ago, patients with prostate cancer are much more likely to have coexisting congestive heart failure, chronic obstructive pulmonary disease (COPD), and diabetes and are much more likely to die from these conditions than from prostate cancer over the next few years. These patients are therefore much less likely to be able to benefit from arduous, long-term treatment. Two decision analyses have shown that intensive treatment of blood sugar in patients with diabetes who are older than 65 years has little impact on the reductions of complications in such patients (Vijan et al. 1997; CDC Diabetes Cost-Effectiveness Group 2002). Therefore, in these patients the effectiveness of aggressive treatment is substantially lower than that observed in clinical trials because these trials were conducted with younger patients without such comorbidities (Greenfield et al. [unpublished]).

Current Study Design and Analytical Approaches

Hayward and colleagues have noted that attention to HTE and its impact has been limited and focused almost exclusively on outcome risk (Hayward et al. 2005). In most trials only one subgroup was investigated at a time. These subgroups were usually not a priori specified.

A priori risk stratification, especially with multivariate risk groups, is almost never done. Comparisons of treatment effects across patients with varying outcome risk or illness severity usually involve post hoc subgroup analyses, which are not well suited for identifying the often multiple patient characteristics associated with differential treatment effects (Lagakos 2006).

The usefulness of multiple subgroup analyses is limited by at least two shortcomings. First, many subgroup analyses with multiple subgroup comparisons are performed using one variable at a time, increasing the likelihood of type I error (false positives), and requiring the allowable error rate (alpha) of each comparison to be set below 0.05 to ensure the overall alpha does not exceed 0.05. This reduces the likelihood of detecting significant differences between subgroups. This problem is compounded when subgroup analyses are conducted post hoc because subgroups are often underpowered when not prespecified in the study design. Even when subgroups are prespecified, however, one-variable-at-a-time analyses are still problematic. Second, one-at-a-time subgroup analysis treats risk categories dichotomously, which constrains the power of the analysis and increases the likelihood of type II error (false negatives).
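The first shortcoming can be made concrete with a small calculation. The sketch below assumes independent comparisons and uses the Bonferroni correction as one common way of lowering the per-comparison alpha; neither assumption comes from the text, and the choice of ten subgroup comparisons is purely illustrative.

```python
def familywise_error(alpha_per_test: float, n_tests: int) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests


def bonferroni_alpha(overall_alpha: float, n_tests: int) -> float:
    """Per-test alpha that keeps the overall alpha at or below the target."""
    return overall_alpha / n_tests


n_subgroups = 10  # illustrative number of one-variable-at-a-time comparisons

uncorrected = familywise_error(0.05, n_subgroups)
per_test = bonferroni_alpha(0.05, n_subgroups)
corrected = familywise_error(per_test, n_subgroups)

print(f"uncorrected overall alpha: {uncorrected:.3f}")
print(f"per-test alpha after correction: {per_test:.4f}")
print(f"overall alpha after correction: {corrected:.3f}")
```

Under these assumptions, ten uncorrected comparisons carry roughly a 40 percent chance of at least one false positive, while the corrected per-comparison threshold is so strict that real subgroup differences become harder to detect, which is the loss of power described above.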

To understand the distribution and impact of HTE in a way that improves the specificity of treatment recommendations, prevailing approaches to study design and data analysis in clinical research must change.

Recommendations and a New Research Agenda

Two major strategies can decrease the negative impact of HTE in clinical research: (1) The use of composite risk scores derived from multivariate models should be considered in both the design of a priori risk stratification groups and data analysis of clinical research studies; and (2) the full range of sources of HTE, many of which arise for members of the general population not eligible for trials, should be addressed by integrating multiple phases of clinical research, both before and after an RCT.

Addressing Power Limitations in Trials Using Composite Risk Scores

Multivariate risk models address the issues of both multiple comparisons and reliance on dichotomous subgroup definitions by combining multiple, hypothesized risk factors into a single continuous independent variable. A simulation study by Hayward et al. (Hayward et al. 2006) demonstrated that a composite risk score derived from a multivariate model predicting outcome risk alone significantly increased statistical power when assessing HTE compared to doing individual comparisons.
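As a rough illustration of what "combining multiple risk factors into a single continuous variable" means, the sketch below builds a composite score as the linear predictor of a logistic outcome-risk model. It is not the model used by Hayward et al.; the risk factors, the coefficient values, and the function names are all invented for illustration. In practice the coefficients would be estimated from pre-trial observational data or from trial control group data.

```python
import math

# Hypothetical coefficients of a multivariate (logistic) outcome-risk model.
# These values are made up for illustration, not taken from any study.
INTERCEPT = -4.0
COEFFICIENTS = {
    "age_decades": 0.45,
    "diabetes": 0.60,
    "heart_failure": 0.90,
    "copd": 0.50,
}


def composite_risk_score(patient: dict) -> float:
    """Linear predictor: one continuous score that combines all risk factors."""
    return INTERCEPT + sum(COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS)


def predicted_outcome_risk(patient: dict) -> float:
    """Convert the score to a predicted probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-composite_risk_score(patient)))


# Three hypothetical patients with increasing comorbidity burden.
patients = [
    {"age_decades": 5.5, "diabetes": 0, "heart_failure": 0, "copd": 0},
    {"age_decades": 7.0, "diabetes": 1, "heart_failure": 0, "copd": 1},
    {"age_decades": 8.0, "diabetes": 1, "heart_failure": 1, "copd": 1},
]
for p in patients:
    print(round(composite_risk_score(p), 2), round(predicted_outcome_risk(p), 3))
```

With a score of this kind, heterogeneity can be examined through a single treatment-by-score comparison instead of one test per risk factor, and patients are placed on a continuum rather than split into two groups.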

Despite these analytic advantages, using continuous risk scores derived from multivariate models may have drawbacks. One challenge of using continuous composite risk scores, for example, is that they do not provide a definitive cut point to determine when a treatment should or should not be applied. Because the decision to apply a specific treatment regimen is dichotomous, a specific cutoff distinguishing good versus bad candidates for a treatment must be determined. Relative unfamiliarity of the medical community with these multivariate approaches coupled with potential ambiguity in treatment recommendations from these methods are barriers to acceptance of their use.

The confusion introduced by ambiguous cut points for continuous risk scores is compounded by the methods used to develop risk models. Different risk models may predict different levels of risk for the same patient. That is, a patient with a given set of risk factors may be placed in different risk categories depending on the model used. Even continuous risk scores that correlate very highly may put the same patient in different risk groups. For patients at the highest and lowest levels of risk, there should be little ambiguity in treatment decisions regardless of the risk model used. Many other patients, however, will fall in a “gray area” where risk models that differ only slightly may assign them to different risk categories.

To help promote understanding and acceptance of these methods by the medical community, demonstrations comparing the performance of different types of continuous composite risk scores to the performance of traditional one-at-a-time risk factor assessment in informing treatment decisions would be beneficial (Rothwell and Warlow 1999; Zimmerman et al. 1998; Kent et al. 2002; Selker et al. 1997; Fiaccadori et al. 2000; Teno et al. 2000; Slotman 2000; Stier et al. 1999; Pocock et al. 2001).

A final important point is that current multivariate approaches focus exclusively on targeted-outcome risk, but other sources of HTE remain unaddressed. Risk, in the context of a treatment decision, was defined as the sum of targeted-outcome risk, iatrogenic risk, and competing risk. If iatrogenic risk and competing risk are distributed heterogeneously in a patient population, methods to account for them alongside targeted-outcome risk should also be incorporated in the analysis of trial results. The advantages, drawbacks, and methodologic complexities of composite measures have recently been reviewed (Kaplan and Normand 2006).

Integrating Multiple Phases of Research

Clinical researchers can address the various sources of HTE across at least six phases of research: (1) observational studies performed before RCT and aimed at the trial outcome, (2) the primary RCT itself, (3) post-trial analysis, (4) Phase IV clinical studies, (5) observational studies following trials, and (6) focused RCTs. The recommended applications of these phases for studying each of the four sources of HTE are outlined in Table 2-3 and described below.

TABLE 2-3. Studying Sources of HTE Across Multiple Phases of Research.



Baseline outcome risk in clinical trials. To address both design-related and analysis-related issues, outcome risk variation as a source of HTE should be addressed in two phases: (1) risk stratification of the trial sample based on data from pre-trial observational studies (cell a) and (2) risk adjustment in the analysis of treatment effects based on pre-trial observational studies when available, or post hoc analysis of the trial control group data when observational studies are not feasible (cells a and b in Table 2-3).

As noted in the paper by Hayward et al. (2006), modeling HTE requires that risk groups, continuous or discrete, be prespecified and powered in the study design. Data from prior observational research such as those collected in the PROWESS trial (Vincent et al. 2003) can be modeled to identify predictors of baseline outcome risk for pre-trial subgroup specification at a much lower cost per subject than a second RCT. To date, however, prespecifying risk groups is not common practice outside of cardiology and a small number of studies in other fields.

Even in studies where comparisons across risk factors are prespecified, those risk factors are seldom collapsed into a small number of continuous composite risk scores to maximize power and minimize the number of multiple comparisons needed. In the clopidogrel trial (Bhatt et al. 2006), for example, a possibly meaningful difference in treatment effects between symptomatic and asymptomatic cardiac patients may have been masked by looking at these groups (those with and without symptoms) dichotomously and alongside 11 subgroup comparisons.

A priori risk stratification may not be supported by clinical researchers because it either delays the primary RCT while pre-trial observational data are collected or requires that a second, expensive, focused RCT be conducted after a model for baseline outcome risk is developed from the primary trial and follow-up studies. Because of the additional time and costs required, a priori risk stratification cannot become a viable component of clinical research unless the stakeholders that fund, conduct, and use clinical research sufficiently value its benefits. A comprehensive demonstration of the costs and benefits of this approach is needed to stimulate discussion of this possibility.

Post hoc analysis of control group data from the trial itself may also be used to identify risk factors when observational studies are not feasible. By identifying characteristics predicting positive and negative outcomes in the absence of treatment, control group analysis works as a small-scale observational study to produce a composite risk score that can be used to adjust estimates of treatment effectiveness from the primary RCT. Even though the results from a model of one group’s risk may not necessarily apply to another group such as the treatment group, these same data provide a reasonable starting point to select and risk-stratify a clinically meaningful subsample for a future focused RCT.

To maximize the statistical power of treatment effectiveness models for a given RCT sample size, composite risk scores generated from multivariate analysis of observational or control group data should be introduced when feasible. Whether or not a priori risk stratification is feasible, continuous composite risk scores should be generated from observational data or trial control group analysis to provide a risk-adjusted estimate of treatment effectiveness.

In exploratory cases where the stratification factors that lead to treatment heterogeneity are not known, the latent class trajectory models of Muthen and Shedden (1999) may be used to identify latent subgroups or classes of patients across which treatment effects vary. Predictors of these classes can then be identified and form the basis of composite risk scores for future studies of treatment heterogeneity. Leiby et al. (in review) have applied such an approach to a randomized trial of a medical intervention for interstitial cystitis. Overall, the treatment was not effective. However, latent class trajectory models did identify a subgroup of responder patients for whom the treatment was effective in improving the primary end point, the Global Response Assessment. Additional work is needed to identify baseline factors that are associated with this treatment heterogeneity and can form the basis of a composite score for a future trial.

Responsiveness. Responsiveness to an agent or procedure is studied in the trial. It also needs to be studied in a Phase IV study among those not included in the trial, to see how the agent performs in unselected populations where side effects, preference-driven adherence, polypharmacy, competing risk, or disease-modifying effects are in play (cells c, d), or in a focused second trial (cell e).

Iatrogenic risk. Vulnerability to adverse effects needs to be studied in two phases, in the trials (cell f) and in Phase IV studies (cell g).

Competing risk. The effects of polypharmacy, adherence in the general population, and utility can be best studied in observational studies among populations that would not be included in trials, especially those who are elderly and/or have multiple comorbidities (cell h). The most critical issue for understanding competing risk is deciding when a clinical quality threshold measure shown to be “effective” in clinical trials (e.g., HbA1c < 7.0 percent for diabetes) is recommended for populations not studied in the trials. Cholesterol reduction, blood pressure control, and glucose control in diabetes are examples of measures with quality thresholds that are not required for all populations. Even when, as in the case of cholesterol, they have been studied in the elderly, they have not been studied in non-trial populations where the patients have other medical conditions, economic problems, polypharmacy, or genetics that may alter the effectiveness of the treatment. For glucose in patients with diabetes and for treatment of patients with prostate cancer, there is an additional issue, that of “competing comorbidities” or competing risk, where the other conditions may shorten life span such that years of treatment (in the case of diabetes) or aggressive treatment with serious complications (prostate cancer) may not allow the level of effectiveness achieved in the trials (Litwin et al. [in press]).

Clinicians intuitively decide how to address competing risk. For example, in a 90-year-old patient with diabetes and end-stage cancer, it is obvious to the clinician that blood pressures of 130/80 need not be achieved. In most cases, however, the criteria by which clinical decisions are made should be based explicitly on research data. This evidence base should begin with observational studies to identify key patient subgroups that derive more or less net benefit from a treatment in the presence of competing risk. Failing to account for competing risks may overestimate the value of a treatment in certain subgroups. If 100 patients are treated, for example, and 90 die from causes unrelated to the treated condition, even if the treatment effect is 50 percent, only 5 people benefit (number needed to treat = 20). If the original 100 patients are not affected by other diseases, however, 50 will benefit from the treatment (number needed to treat = 2).
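The arithmetic in this example can be written out as a short calculation. The sketch below simply restates the numbers given above; the function name and its simplifying assumption (patients who die of unrelated causes derive no benefit, and the treatment effect applies uniformly to everyone else) are illustrative, not a formal competing-risk model.

```python
def number_needed_to_treat(n_treated: int, competing_deaths: int, treatment_effect: float):
    """Return (patients who benefit, number needed to treat).

    Simplifying assumption: patients who die from unrelated causes get no
    benefit, and treatment_effect is the fraction of the remaining patients
    who benefit.
    """
    able_to_benefit = n_treated - competing_deaths
    benefited = able_to_benefit * treatment_effect
    nnt = n_treated / benefited if benefited > 0 else float("inf")
    return benefited, nnt


# 90 of 100 treated patients die from unrelated causes.
print(number_needed_to_treat(100, 90, 0.5))  # (5.0, 20.0)

# No competing mortality among the 100 treated patients.
print(number_needed_to_treat(100, 0, 0.5))   # (50.0, 2.0)
```

The same 50 percent treatment effect therefore yields a tenfold difference in number needed to treat, which is why trial results obtained in patients without comorbidities can overstate the benefit for patients with high competing risk.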

When treatment effects are underestimated for a subgroup, treatments that are arduous, have multiple side effects, or are burdensome over long periods of time may be rejected by doctors and patients, even if the patient belongs to a subgroup likely to benefit from treatment. For this reason, research methodologies supplemental to RCTs should be introduced to predict likelihood to benefit for key patient subgroups not included in trials.

Observational studies, because of their lower cost and less restrictive inclusion criteria, can include larger and more diverse samples of patients to address important research questions that RCTs cannot. Unlike RCTs, observational studies can include high-risk patients who would not be eligible for trials, such as elderly patients and individuals with multiple, complex medical conditions, and would provide insight in predicting likelihood to benefit for this large and rapidly growing fraction of chronic disease patients. The generalizability of RCTs can be addressed in observational studies if they are designed properly, based on principles for good observational studies (Mamdani et al. 2005; Normand et al. 2005; Rochon et al. 2005).

There are possible solutions to the problems of HTE, the principal one being multivariate pre-trial risk stratification based on observational studies. For patients not eligible for trials, mainly elderly patients with multiple comorbidities, observational studies are recommended to determine mortality risk from the competing comorbidities so that positive results could be applied judiciously to non-trial patients. Government and nongovernmental funders of research will have to be provided incentives to expand current research paradigms.


David Goldstein, Ph.D.

Duke Institute for Genome Sciences and Policy

Many clinical challenges remain in most therapeutic areas. This paper discusses the potential role of pharmacogenetics in helping to address some of these challenges, focusing particular attention on the treatment of epilepsy and schizophrenia. Several points are emphasized: (1) progress in pharmacogenetics will likely require pathway-based approaches in which many variants can be combined to predict treatment response; (2) pharmacodynamic determinants of treatment response are likely of greater significance than pharmacokinetic determinants; and (3) the allele frequency differences of functional variants among human populations will have to be taken into account in using sets of gene variants to predict treatment response.

Pharmacogenetics has previously focused on describing variation in a handful of proteins and genes, but it is now possible to assess entire pathways that might be relevant to disease or to drug responses. The clinical relevance will come as we identify genetic predictors of a patient’s response to treatment. Polymorphisms can have big effects on such responses, and identifying these effects can offer significant diagnostic value: predicting how patients respond to a medicine, avoiding rare adverse drug reactions (ADRs), or selecting which of several alternative drugs has the highest efficacy.

The CATIE trial (Clinical Antipsychotic Trials of Intervention Effectiveness) compared the effectiveness of atypical antipsychotics (olanzapine, quetiapine, risperidone, ziprasidone) with a typical antipsychotic, perphenazine. The end point was discontinuation of treatment, and in this respect there is really no difference between typical and atypical antipsychotics. Results such as these signal an end to the blockbuster drug era, in that often no drug, or even drug class, is much better than the others. However, certain drugs were better or worse for certain patients. For example, olanzapine increases body weight by more than 7 percent in about 30 percent of patients, and with many of the older, typical antipsychotics a similar proportion of patients will develop tardive dyskinesia (TD) as an adverse reaction (AR). TD consists of involuntary movements of the tongue, lips, face, trunk, and extremities and is precisely the type of AR one would want to avoid, because when the medicine is removed the AR continues without much amelioration over time. So perhaps as many as 30 percent of patients exposed to a typical antipsychotic will develop this AR, whereas the majority will not.

This means that in deciding which medications to use at the level of the individual patient, these medications are quite different from each other. A significant problem is that there is very little information on which patients might experience an adverse reaction to a particular type of drug. In this particular case, there is virtually no information on who will get TD or who will suffer severe weight gain such that they would not continue medication. These types of results unambiguously constitute a call to arms to the genetics community because this is an area in which we can truly add value by helping clinicians to distinguish patients and guide clinical decision making. Having a predictor for TD on typicals would be an unambiguous, clinically useful diagnostic. Currently, however, within the field of pharmacogenetics, we have very few examples of such utility. We have examples that may be of some relevance in some context, but you would have to do an RCT to determine how to utilize this information. In general, at the germline (as opposed to the somatic) level, the current set of known genetic differences among patients is not that clinically important, particularly when you contrast them with something like a predictor of weight gain or TD in the use of atypicals or typicals. These are the types of things we are working toward in the pharmacogenetics community, and it looks as though some of these problems are quite crackable and there will be genetic diagnostics of significant relevance and impact to clinical decisions.

The idea of moving toward doing more genetic studies in the context of a trial is quite exciting because the data will be quite rich. However, there is a real concern that the sheer amount of genetic information and the complexity of the clinical data will allow almost any kind of association to be identified, and that as a result the pharmacogenetics community is going to flood the literature with claims about a particular polymorphism’s relevance to a specific disease or drug use within a specific subgroup. Therefore, as we move toward these types of analyses, it is very important to specify hypotheses in advance and to be quite careful about which results to pay attention to. For CATIE, the project design included hypotheses that were specified in advance of any genetic analyses to allow appropriate correction for planned comparisons. This trial is ongoing, and preliminary results are discussed here to give a flavor of what kinds of information might be derived from these types of analyses in the future.

We have delineated two broad categories of analyses. One is to look at determinants of pharmacokinetics (PK) to see if functional variation in dopaminergic genes related to dose and discontinuation has any effect. These analyses focus on how the drug is moved around the body and factors that influence how the drug is metabolized. The second category is pharmacodynamic (PD) polymorphisms, or the genetic differences among people that might affect how the drug works. Here we are looking at differences in the target of the drug, the target pathways, or genetic differences that influence the etiology of the condition as related to specified measures of responses and to adverse drug reactions. To perform a comprehensive pharmacogenetic analysis of drug effectiveness and adverse drug reactions, we looked at PK through dosing variation and asked whether genetic differences influence the eventual decision the clinician makes about what dosage to use. We looked at enzymes that metabolize the drug and common polymorphisms in these enzymes. Our early results indicated that PK variation had no impact on dosing decisions by clinicians.

Pharmacodynamic analysis, on the other hand, looks more promising in terms of clinical utility. CATIE examined the relatively obvious pathways that might influence how the drug acts and how patients might respond to antipsychotics. All antipsychotics have dopaminergic activities, so the dopaminergic system is an obvious place to start. This pathway approach included looking at neurotransmitters (their synthesis, metabolism, transporters, receptors, et cetera) for dopamine, serotonin, glutamate, gamma-aminobutyric acid (GABA), acetylcholine, and histamine. In addition, memory- and neurocognition-related genes and genes previously implicated in drug response were examined. We scanned through these pathways for polymorphisms and tried to relate these to key aspects of how patients respond to treatment. We also considered other classes of candidate genes, in particular those genes that might influence the cognitive impairments associated with schizophrenia that are not currently well treated by antipsychotics. Ultimately we selected about 118 candidate genes and a total of 3,072 single nucleotide polymorphisms (SNPs) and looked at neurocognitive phenotypes, optimized dose, overall effectiveness, and occurrence of TD, weight gain, and anticholinergic adverse events. This study emphasizes the importance of clearly specifying the hypothesis in advance. If a study does not clearly articulate what the opportunity space for associations was prior to undertaking the study, ignore it, because there likely will have been arbitrary amounts of data mining and you cannot trust that approach. Both of these studies are in the process of being completed and submitted for publication along with collaborators from the CATIE study and others.
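A back-of-the-envelope calculation shows why the size of the "opportunity space" matters. The sketch below is illustrative only and does not describe the CATIE analysis plan: it assumes that every SNP is tested once against each of the six phenotype categories listed above, that no true associations exist, and that each test uses an uncorrected alpha of 0.05. The narrower prespecified comparison (50 SNPs, one phenotype) and the function name are hypothetical.

```python
N_SNPS = 3072  # number of SNPs given in the text
PHENOTYPES = [  # phenotype categories listed in the text
    "neurocognitive phenotypes",
    "optimized dose",
    "overall effectiveness",
    "tardive dyskinesia",
    "weight gain",
    "anticholinergic adverse events",
]


def expected_chance_hits(n_snps: int, n_phenotypes: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results if no true association exists.

    Assumes one test per SNP-phenotype pair (an illustrative assumption).
    """
    return n_snps * n_phenotypes * alpha


# Unrestricted search across the whole opportunity space.
full_space = expected_chance_hits(N_SNPS, len(PHENOTYPES))

# Hypothetical prespecified hypothesis: 50 SNPs in one pathway, one phenotype.
prespecified = expected_chance_hits(50, 1)

print(N_SNPS * len(PHENOTYPES), "tests, expected chance hits:", round(full_space, 1))
print("prespecified subset, expected chance hits:", round(prespecified, 1))
```

Under these assumptions, an unrestricted search would be expected to produce hundreds of nominally significant associations by chance alone, which is why a report that does not state its hypothesis space up front cannot be interpreted.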

By helping to subgroup diseases genetically and providing pointers toward the genetic and physiological cause of variable and adverse reactions, pharmacogenetics will also have indirect benefits for future drug development. In addition, some drugs that work well generally are rejected, withdrawn, or limited in use because of rare but serious ADRs. Examples include the antiepileptic felbamate, the atypical antipsychotic clozapine, and most drug withdrawals owing to QT-interval-associated arrhythmias. If pharmacogenetic predictors of adverse events could prevent the exposure of genetically vulnerable patients and so preserve even a single drug, the costs of any large-scale research effort in pharmacogenetics could be fully recovered. An example of this is vigabatrin, which is a good antiepilepsy drug in terms of efficacy, and for some types of seizures (e.g., infantile spasms) it is clinically essential. Unfortunately, in some cases it also has a devastating effect on the visual field and can constrain the visual field to the point of almost eliminating peripheral vision. This adverse reaction has dramatically restricted the use of this medicine and, in fact, it was never licensed in the United States. We have done a study to try to identify genetic differences that might predict this reaction and have potentially identified a polymorphism whose minor allele is found in all of the patients in whom we see severe reduction in visual field, and which could therefore be used to predict this reaction to vigabatrin. Again, this is the kind of pharmacogenetics result that provides an opportunity for improving clinical treatment of epilepsy in that this medication, which might not otherwise be used broadly, can be prescribed to the appropriate population. The wrinkle here is that we are using very large tertiary referral centers and we have already used all of the vigabatrin-exposed patients for whom we have DNA samples. We think we see an association and would like to develop results like this, but we need data. Our results with vigabatrin need to be confirmed in a larger sample. Since we do not have most of the exposure data available to study, it is possible that we will never be able to reach a conclusion either way.

The current work in the field gives grounds for real optimism that careful pharmacogenetic studies, in virtually all classes of medications, will identify genetic differences that are relevant to how patients respond to treatment and therefore impact clinical decision making. These will not be pharmacokinetic but rather pharmacodynamic. The examples presented illustrate several of the challenges and opportunities for pharmacogenetics. These types of information will be increasingly generated, but we need to think about how such information will be useful and utilized for clinical decision making. For example, despite the fact that we have no evidence that variation in the genes being studied actually influences decisions of clinicians in a useful way, devices such as Amplichip are being pushed as a useful diagnostic. Because variations will increasingly be investigated for use in clinical diagnostics, we need to think about how such diagnostics should be evaluated and what kinds of evidence are needed before they are widely utilized. The preliminary results of the vigabatrin study make an extremely strong argument that what we want to be doing as we go forward is setting up the framework to do these types of studies, because it is entirely possible that once a medication is introduced and generates huge numbers of exposures, if it generates a rare adverse event and is withdrawn, a pharmacogenetics study could resurrect the use of that medication in the appropriate population.

Two overriding priorities stand out in pharmacogenetics research. The first is the establishment of appropriate cohorts to study the most important variable responses to medicines, both in terms of variable efficacy and in terms of common, or rarer but severe, adverse reactions. It must be appreciated that larger randomized trials are not always the most appropriate settings for specific pharmacogenetic questions and it will often be necessary to recruit patients specifically for pharmacogenetics projects. For example, in the case of weight gain and atypical antipsychotics, the ideal dataset would track weight in patients not previously exposed to an atypical. The second is the development of a framework for evaluating the clinical utility of pharmacogenetic diagnostics.


Harlan Weisman, M.D., Christina Farup, M.D., Adrian Thomas, M.D., Peter Juhn, M.D., M.P.H., and Kathy Buto, M.P.A.

Johnson & Johnson

The establishment of electronic medical records linked to a learning healthcare system has enormous potential to accelerate the development of real-world data on the benefits and risks of new innovative therapies. When integrated appropriately with physician expertise and patient preferences and variation, data from novel sources of post-marketing surveillance will further enable various stakeholders to distinguish among clinical approaches on how much they improve care and their overall value to the healthcare system. To ensure these goals are achieved without jeopardizing patient benefit or medical innovation, it is necessary to establish a road map, charting a course toward a common framework for post-marketing surveillance, initial evidence evaluation, appropriate and timely reevaluations, and application to real-world use with all key stakeholders involved in the process. Continuous improvement requires policy makers to address accountability for decisions they make based on this common framework that impacts patient health. Where possible, post-marketing data requirements of different global agencies or authorities should be harmonized to enhance the efficiency and quality of safety data from these sources, and to reduce the burden on governments and industry due to costs of collection and unnecessary duplication of efforts. In addition, policy development should strive to find the right alignment of incentives and controls to accelerate adoption of evidence on medicine, technology, and services that advance the standard of care.

The current landscape of health care in the United States is one of organized chaos where providers, payers, employers, patients, and manufacturers often have different vantage points and objectives that can result in inadequate patient access, poor care delivery, inconsistent quality, and increasing costs. A recent study on the quality of health care in the United States found that adults receive only half of the recommended care for their conditions (McGlynn et al. 2003). It is important to remember that although these multiple stakeholders approach health care from different angles, they all share the same objective: to improve patient health. To move to a system that delivers effective and high-quality care, which optimally balances benefits and risks, health care must be transformed to a knowledge-based learning network, focused on the patient and aligned on data systems, evaluation, and treatment guidelines, without sacrificing the human elements of empathy, caring, and healing. An interoperable electronic health record that provides longitudinal, real-time, clinical and economic outcomes at the patient level will be a critical enabler to allow for a wealth of new information to drive fact-based treatment decisions that are transparent and shared between patients and physicians, as well as with other stakeholders including payers. The resultant improvement in efficiencies, cost savings, and most important, clinical outcomes should permit physicians and other healthcare providers to restore not only the scientific basis of modern medicine, but also the humanity of traditional medicine through active engagement and dialogue between patients and healthcare providers.

Post-marketing surveillance of new technologies will be a key component of this system because it will provide needed real-world information on unique populations not evaluated in clinical trials and a better characterization of the full benefit-risk profile over the life cycle of product use. Because the benefits and risks of a new technology are never fully known at launch, ongoing evaluation of a product based on real-world use in broader populations of patients, with comorbidities and concomitantly prescribed therapies, is important for generating new insights. In addition, the full value of innovative new therapies may only be appreciated with real-world usage and comparative economic evaluation based on observed outcomes; this information will enable decision makers to continue to assess the overall value and appropriate use of a product in the healthcare system. However, the scope of what we need to know to assess value in an era of information overload and complex healthcare systems changes rapidly and continuously. To properly evaluate new products we need to acknowledge the advantages and limitations of the methods we have historically used for regulatory approval. Randomized clinical trials with blinding are currently used in the approval of drugs and higher-risk devices to ensure high internal validity of findings. However, RCTs may have limited validity for broader use in diverse populations (e.g., old versus young, urban versus rural). Observational studies conducted within an interoperable electronic medical record can be utilized to lend additional insights beyond efficacy, including real-world effectiveness and long-term outcomes.

Methodological challenges for conducting real-world observational studies can be daunting, but the opportunities for evidence development are substantial. The study of an intervention within a dynamic healthcare system adds a level of significant complexity and raises many questions. How can we deal with confounding by indication and the increasing variation of health states? How can we apply comparative effectiveness studies conducted in broad populations and allow for individual variation in treatment outcomes? When an adverse event occurs, is it due to the progression of underlying patient pathology or therapeutic interventions? How do we apply comparative effectiveness (average patient) to individual patients, each of whom brings his or her own specific variability (whether due to genetics, nutrition, environment, or risk tolerance)?

Selection of research methods and decisions about the level of evidence required must also take into consideration the type of technology. For example, devices can vary from a simple gauze bandage to a complex implant with a drug component. For many devices used in surgical procedures, patient outcomes are highly dependent on operator skill and can also depend on the hospital’s volume of procedures. In a review of the literature by the Institute of Medicine, more than two-thirds of published studies reported an association between hospital volume and patient outcomes for certain diagnoses and surgical procedures (IOM 2000). Randomized clinical trials with blinding are the gold standard for drug evaluations of safety and efficacy but may not be possible in device studies. For example, the comparators and procedures of the new device and control may be so different (e.g., open vs. minimally invasive) that it may not be possible to blind the trial. The timing of evidence in a device’s development is also an important consideration because technical development often occurs in parallel to efficacy and safety evaluations. Evaluations that are premature can lead to inaccurate conclusions about the benefit-risk profile, and evaluations that are too late may be irrelevant because iterative improvements may have been introduced to the market in the interim. Moreover, we face an expanding universe of treatment opportunities. Regenerative medicine and stem cell therapies look promising and potentially revolutionary, but realizing their substantial benefits will depend on our ability to develop the capacity necessary to answer the kinds of questions that these new therapies raise at the appropriate level of detail.

Although all stakeholders seem to be aligned on the need to define evidence requirements, there is no alignment on what evidence is needed under specific circumstances. For every drug or device the number of potential questions to answer about appropriate use is limitless; thus there is a need to prioritize what new evidence is needed to close the critical gaps of knowledge so that quality decisions can be made. We also need to think carefully about what evidence we need to make good decisions for healthcare policy. Additional issues to consider include the level of certainty required for the evidence gaps and the urgency of the information. Once we determine that evidence is needed and has been generated, how will the evidence be reviewed and assessed? As outlined by Teutsch and Berger, the integration of evidence-based medicine into decision making requires a deliberative process with two key components: (1) evidence review and synthesis and (2) evidence-based decision making (Teutsch and Berger 2005). We need to develop transparent methods and guidelines to gather, analyze, and integrate evidence to get true alignment of the manufacturers, payers, and providers. One major consideration that needs to be anticipated and managed is how this new form of clinical data will be integrated into policies and treatment paradigms to ensure that sufficient evidence drives these changes and that inappropriate use of exploratory data does not lead to premature policy decisions, or to partially informed clinical decisions. Finally, an efficient process for new evidence to undergo a timely evaluation with peer review and synthesis into the existing evidence base needs to be further developed with appropriate perspective and communication for patients.

There are many issues to consider as we build toward this learning system. We have a unique opportunity to begin to align the many interests of healthcare stakeholders by not only providing consumers earlier access to these technologies but also generating the evidence necessary to make better decisions about the appropriate application of new technologies. It is critical that a nonproprietary (open source) approach be encouraged to ensure commonality of data structure and interoperability of EHRs, providing for the ability to appropriately combine data from different EHR populations and allow patients to be followed across treatment networks. Where possible, post-marketing data requirements of different global agencies or authorities should be harmonized to reduce costs of collection and unnecessary duplication. An example of where this seems particularly feasible is in meeting post-marketing requirements of the Centers for Medicare and Medicaid Services and the Food and Drug Administration. In some circumstances, CMS is requiring “real-world” data collection in order to assess the benefit and risk of technologies in the Medicare population, while FDA is requiring post-market surveillance studies to further evaluate the benefit-risk equation in broader populations than required for market entry. Having a single set of data used for both regulatory approval (i.e., FDA’s “safe and effective”) and coverage determination (i.e., CMS’s “reasonable and necessary”) has the potential to bring new innovations to market faster and may reduce costly data collection efforts. As coverage determinations become increasingly “dynamic” (i.e., contingent upon further evidence generation), this may create an opportunity to collect data that can be used for ongoing coverage determinations as well as for post-market safety surveillance. 
Multiple uses of common datasets would require the following: (1) agreement among the interested agencies (i.e., FDA and CMS) that a common dataset would qualify for both safety assessments and coverage determinations; (2) input from both agencies for specific data collection requirements to inform the design of the data collection tools and the management of the data; (3) clear accountabilities for funding the data collection and explicit rules for accessing the data; and (4) clarification of how collection of data on off-label use for coverage determinations will be reconciled with regulatory status. If these steps are taken, manufacturers may be able to bring innovative products to patients more quickly while continuing to fund and direct future innovations.

Establishing incentives for evidence generation by manufacturers within an integrated EHR will foster even greater acceleration. Examples of incentives for drug and device manufacturers should include those that reward evidence generation in the marketplace. In general, expansion of a drug’s use with new indications or claims requires the provision of two RCTs to the FDA. With the acceleration of data from reliable sources such as EHRs and the enhancement of methods for retrospective analyses of these data sources, alternative levels of evidence could be considered for FDA review for expanded claims or promotional uses. Insurance coverage policies could support uses for which post-market surveillance studies are being done, rather than restrict coverage until the post-market data are collected and analyzed. It is a widely recognized fact that completion of post-approval commitment studies in the United States is problematic, often for pragmatic reasons related to willingness to participate by patients and physicians when products are available clinically. The potential for EHRs to provide information that could replace certain of these studies would be a major advance and would provide a framework to continue the collection of relevant real-world and high-quality information on benefits and risks. When real-world, comparative studies are required, shared funding of data collection should be considered; the CMS program of “coverage with evidence development” is one prototype of this approach. Alternative incentives include extension of patent protections for the period of post-marketing data collection.

Currently for device manufacturers, there are even fewer incentives for evidence generation because of their typically narrower patent claim scope and shorter product life cycles. Often an innovative breakthrough device generates substantial evidence on value, only to be quickly followed by a cheaper “me-too” device (a device that creatively worked around intellectual property issues) that is accepted in the marketplace with little or no evidence of its own. The manufacturer of the innovative breakthrough device, as the first market entrant, is therefore less likely to develop all of the needed evidence before the next competitor enters the market, because the competitor will likely capitalize on the same data with no investment. Providing device manufacturers with extensions of their exclusivity periods during the period of data collection (for example, via patent term extensions), similar to those for pharmaceuticals, could help rectify this situation.

Opportunities for collaboration across pharmaceutical and device companies to advance development of electronic medical records and evidence generation as a whole should be encouraged. For example, the data contained within the placebo arm of randomized controlled trials could provide a wealth of information when pooled together and would provide a larger cohort of populations for comparative analyses. Another example of potential collaboration is in the design of these new EHR systems, especially for the identification of safety signals. The industry could play a role in working with payers, regulators, and healthcare delivery networks to explore ways to access EHR and claims databases to form a proactive longitudinal framework of automated surveillance within the context of ensuring patient and practitioner privacy. In addition, no current scientific standards exist for signal detection and surveillance of safety; industry could play an important role to develop these standards through a collaboration in which algorithms, methods, and best scientific practices are shared. Finally, in most cases, there is no unique identification of medical devices in EHRs. Industry and other stakeholders will have to collaborate to determine which devices need unique identification, the appropriate technology, and a reasonable strategy for investments in the necessary infrastructure.

Many manufacturers place great emphasis on the importance of the value added by their products. In our case (Johnson & Johnson) the company is guided by a set of core principles to ensure the safe use of medicines and devices: (1) patients and doctors need timely and accurate information about the benefits and risks of a product so they can make well-informed choices; (2) the FDA and other global authorities are, and should remain, the principal arbiters of benefits and risks, determining whether to approve and maintain availability of products through transparent and aligned regulations; and (3) the best government policies and actions are those that continue to enhance patient health and safety and to promote innovation.

With these principles in mind, we propose a model for benefit-risk evaluation characterized by early and continuous learning enabled by EHRs. In this model, once an indication is approved and the product moves into real-world use, high-quality data, infrastructure, and analysis capability will enable benefit-risk monitoring and lead to a refined understanding of the elements underlying risk and to an expansion of possible benefits, either through appropriate application of an intervention or through further innovation based on qualities of benefit or risk. This should be a system that understands the need for, and the appropriate methods to generate, the right evidence for the right questions. It addresses decision-maker needs by looking at safety and efficacy, real-world effectiveness and risk, surrogate end points, and long-term outcomes for the right patient, accounting for genetics, comorbidities, and patient preference. The critical success factors for such a learning system going forward will be (1) establishing appropriate methods for research with the EHR as a data source; (2) prioritizing the need for evidence on safety and quality and not just intervention cost; (3) establishing appropriate peer review processes to ensure rigor and timeliness; (4) requiring intensive education and training of health professionals on evidence generation in clinical practice; and (5) using this new information as an adjunct to, not a replacement for, RCTs for any purpose, and ensuring agreed-upon standards for when such data are sufficient to drive policy and treatment decisions. There are many technical and policy issues, including privacy, that need to be addressed to create this EHR-enabled learning framework. We believe a collaborative effort supported by patients, physicians, payers, industry, and regulators can accomplish the goal of a learning healthcare system with an open and transparent process toward developing standards and interoperable capabilities.


Steven M. Teutsch, M.D., M.P.H., and Marc L. Berger, M.D.

Merck & Co., Inc.

With new technologies introduced fast on the heels of effective older technologies, the demand for high-quality and timely comparative effectiveness studies is exploding. Well-done comparative effectiveness studies tell us which technology is more effective or safer, or for which subpopulation or clinical situation a therapy is superior. Clinicians need to understand the incremental benefits and harms of newer treatments compared to standard regimens, particularly with regard to the needs of specific patients; in addition, payers need to know the incremental cost so that value can be ascertained. Understanding the magnitude of impact should also guide priorities for quality improvement initiatives.

Systematic evidence reviews of comparative effectiveness are constrained by the limited availability of head-to-head randomized controlled trials of health outcomes for alternative therapies. Such trials are usually costly because they must be large and, for most chronic conditions, long in duration. Indeed, the greatest need for such information is near the time of introduction of a new therapy, before a technology is in widespread use, precisely the time when such information is least likely to be available. Moreover, as challenging as it is to show efficacy of treatments compared to placebos, it is much more difficult and costly to demonstrate differences compared to active comparators and best-available alternatives. Thus, well-conducted head-to-head trials will remain uncommon. In the absence of head-to-head outcomes trials, however, insights may be obtained from trials using surrogate markers, comparisons of placebo controlled trials, or observational studies. The validity and generalizability of these studies for comparative purposes remains a topic of controversy. Observational studies can provide perspective on these issues, although the value of the information gleaned must be balanced against potential threats to validity and uncertainty around estimates of benefit.

While the limitations of different study designs are well known and ways to minimize bias well established, methods for quality assessments and data syntheses need to be refined and standards established to enhance confidence in the information generated. In addition, well-done models can synthesize the information, harness uncertainty to identify critical information gaps, focus attention on core outcomes and surrogates, and provide insights into the relative and absolute differences of therapeutic options. Because of their complexity and potential for bias, we need better processes to reduce the bias and enhance the credibility of models. These include systematic and transparent processes for identifying the key questions to be answered, the options to be evaluated, the structure of models, the parameter estimates, and sensitivity analyses. Mechanisms for validation and accreditation of models would enhance our confidence in their results. Investment in transparent development processes would go a long way to maximizing the value of existing information and identifying critical information needs. All along the way, important stakeholders, including payers and patients, need to participate in deliberative processes to ensure relevance and legitimacy.

For comparative effectiveness studies, there is an additional methodologic issue. Even when available, they typically do not directly provide estimates of absolute benefit or harms applicable to relevant populations. Indeed, most RCTs present results primarily in terms of relative risk reduction. In part this is related to a sense that the relative risk reduction is less variable across a broad range of patients than is the absolute risk reduction. Yet from a clinician’s or payer’s perspective, what is most important is the absolute change in benefit and harms; accurately estimating this change for specific patient groups is critical to informing bedside choices among alternative therapies. Moreover, estimation of benefits from RCTs may differ substantially from what is achieved for typical patients in real-world practice.
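
To make the distinction between relative and absolute measures concrete, the following minimal sketch converts a single relative risk reduction into absolute risk reductions for patient groups with different baseline risks. The 25 percent relative risk reduction, the three baseline risks, and the function names are hypothetical values chosen for illustration, not data from any trial. The sketch also assumes that the relative risk reduction is constant across groups, which is the common belief described above rather than an established fact.

```python
# Illustrative only: all numbers below are hypothetical, not trial results.


def absolute_risk_reduction(baseline_risk: float, rrr: float) -> float:
    """Absolute risk reduction = baseline risk x relative risk reduction.

    Assumes the relative risk reduction (rrr) is the same in every group.
    """
    if not (0.0 <= baseline_risk <= 1.0 and 0.0 <= rrr <= 1.0):
        raise ValueError("risks must be proportions between 0 and 1")
    return baseline_risk * rrr


def number_needed_to_treat(arr: float) -> float:
    """Number of patients treated to prevent one event (1 / ARR)."""
    if arr <= 0.0:
        raise ValueError("ARR must be positive to compute NNT")
    return 1.0 / arr


HYPOTHETICAL_RRR = 0.25
HYPOTHETICAL_BASELINE_RISKS = {
    "low-risk patients": 0.02,
    "moderate-risk patients": 0.10,
    "high-risk patients": 0.40,
}

for group, risk in HYPOTHETICAL_BASELINE_RISKS.items():
    arr = absolute_risk_reduction(risk, HYPOTHETICAL_RRR)
    nnt = number_needed_to_treat(arr)
    print(f"{group}: baseline {risk:.0%}, ARR {arr:.1%}, NNT about {nnt:.0f}")
```

Under these assumptions the same 25 percent relative reduction corresponds to treating roughly 200 low-risk patients, but only about 10 high-risk patients, to prevent one event, which is why a relative measure alone says little about benefit for a specific patient group.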

How then should these different sources of information (e.g., randomized clinical trials, effectiveness studies, observational studies, models) be evaluated and what weight should they be given in health policy decisions? We have previously discussed the need to consider the nature of the clinical question to determine the value to be gained from having additional certainty (Teutsch et al. 2005). Djulbegovic et al. (Djulbegovic et al. 2005) have emphasized the “potential for decision regret” to help frame how to deal with uncertainty in different clinical contexts. Some examples may illustrate potential approaches. For prevention, we generally require a high level of confidence that benefits, which generally accrue to a modest proportion of the population, exceed harms, which however uncommon can potentially occur to a large proportion of the population, most of whom will have little or no health benefit from the treatment. The magnitudes of benefits are bounded, of course, by the occurrence of the outcome, the success of identification and treatment of cases when they become clinically apparent, and the effectiveness of treatment. On the other hand, we may accept only small uncertain benefits and the occurrence of real harms for treatment of a fatal condition, such as cancer, for which other therapeutic alternatives have been exhausted.
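
The prevention case can be sketched with simple arithmetic. The function below is a hypothetical illustration, not a method from the cited papers: it assumes that the three bounding factors named above combine multiplicatively, that everyone who receives the preventive service is exposed to the same harm rate, and that benefits and harms can be counted as numbers of people. All inputs are invented.

```python
# Illustrative only: a simplified, hypothetical model with invented inputs.
from typing import Tuple


def prevention_balance(
    population: int,
    outcome_incidence: float,         # share who would develop the outcome
    later_treatment_success: float,   # share of cases treated successfully once apparent
    preventive_effectiveness: float,  # share of the remaining cases the service prevents
    harm_rate: float,                 # share of all recipients harmed by the service
) -> Tuple[float, float]:
    """Return (people who benefit, people harmed) under a multiplicative model."""
    cases = population * outcome_incidence
    # Cases that later treatment would have resolved anyway add no benefit here.
    cases_not_rescued = cases * (1.0 - later_treatment_success)
    benefited = cases_not_rescued * preventive_effectiveness
    # Harms are spread over everyone who receives the service.
    harmed = population * harm_rate
    return benefited, harmed


benefited, harmed = prevention_balance(
    population=10_000,
    outcome_incidence=0.03,
    later_treatment_success=0.5,
    preventive_effectiveness=0.30,
    harm_rate=0.005,
)
print(f"benefited: about {benefited:.0f}, harmed: about {harmed:.0f}")
```

With these invented inputs about 45 people benefit while about 50 are harmed, even though the harm rate (0.5 percent) is much smaller than the incidence of the outcome (3 percent). This is the sense in which an uncommon harm applied to a large population can outweigh a benefit that accrues to only a modest share of it.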

The magnitude and certainty of net benefit is critical to optimizing the use of healthcare resources. Even for recommended preventive services, the estimated net benefit can vary by orders of magnitude (Maciosek et al. 2006). Understanding which services may provide the greatest benefit can guide providers and health systems to ensure that the greatest attention is paid to those underutilized services with the potential for the greatest health improvement. The same principle applies to diagnostic tests and procedures. Although many technologies may provide some level of benefit, clinical improvement strategies should focus on those technologies that provide the greatest net benefit and should be tailored to the extent possible to the specific subpopulations that have the most to gain.

Currently, there is no general consensus as to what standard—based upon the absolute benefits and harms and the certainty surrounding these estimates—should apply to different clinical recommendations. We have argued that it would be helpful to develop a taxonomy of clinical decision making ex ante to ensure that recommendations are made in a just and equitable manner. Such taxonomy alone would not be sufficient, but combined with procedures that ensure “accountability for reasonableness,” it would enhance public trust in such recommendations.

Coverage decisions are perhaps the most critical arbiters of those choices. We need guidance for the information that must be available to warrant positive coverage decisions for the range of medical services. A rational and acceptable process for developing the “rules of the road for coverage decisions” needs to engage all important stakeholders, and a public debate about the trade-offs and consequences is needed to legitimize decisions. It is absolutely plausible that different payers will establish different rules leading to very different coverage decisions at substantially different costs, which will be attractive to different constituencies. However, potential elements in a taxonomy of decisions may include the quality of evidence, the magnitude of effect, the level of uncertainty regarding harms and benefits, the existence or absence of good treatment alternatives, the potential for decision regret, precedent, and acceptability.

We need a marketplace of health technology assessments providing comparative effectiveness information to provide checks and balances of different methods, assumptions, and perspectives. Regardless, a forum needs to be created whereby the methods and results of reports on similar topics are discussed so that methods can be refined and conclusions vetted. In the United States, the Agency for Healthcare Research and Quality has been pivotal to moving the field as far as it has and should continue to play a leadership role. It has spearheaded methods development, fostered the research, and established priorities. It can also capitalize on the extensive experience of groups around the world, such as the Cochrane and Campbell Collaborations, and the National Institute for Health and Clinical Excellence (NICE), among many others. We can also benefit from participation by nontraditional professionals, such as actuaries and operations research scientists, who use different methodologic approaches.

To create a taxonomy of decisions and identify the level of certainty required for each, it will be necessary to convene payers, health professionals, and patients along with methodologic experts. The Institute of Medicine is well positioned to fulfill this role. More specific guidance can be developed by professional organizations with recognized expertise, such as the Society for Medical Decision Making (SMDM), the International Society for Pharmacoeconomics and Outcomes Research (ISPOR), and AcademyHealth, as well as AHRQ-sponsored consortia including the Evidence-based Practice Centers (EPCs), the Centers for Education and Research on Therapeutics (CERTs), and the Developing Evidence to Inform Decisions about Effectiveness (DEcIDE) Network. Because of the complexity and need for transparency and legitimacy, the standards have to be the product of a public, deliberative process that includes all important stakeholders, including patients, providers, plans, employers, government, and industry. The taxonomy will need to walk a fine line between clarity and overspecification. There is a real risk that when criteria for coverage are too explicit, it may be possible to game the system. This is very apparent with economic evaluations, where having a fixed cutoff for coverage, such as $50,000 per quality-adjusted life year, may lead manufacturers to price products as high as possible while still being below the threshold. Of course, a strict threshold might also work in the other direction, where prices might be reduced to ensure cost-effectiveness criteria are met. Whatever the decision criteria, they need to leave room for decision makers to exercise their judgment based on contextual factors such as budget constraints, preferences, and acceptability. Since there is no completely free market for health care, it is important to recognize that decision makers are acting as a proxy for social decision making. Thus, their decisions must be based on criteria recognized as fair and reasonable.
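
The incentive created by a fixed threshold can be illustrated with a short calculation. The sketch below uses the standard incremental cost-effectiveness ratio (incremental cost divided by incremental quality-adjusted life years) and the $50,000 figure mentioned above. The comparator cost, the QALY gain, the two intended prices, and the function names are hypothetical, and the model ignores every cost other than the price of the therapy.

```python
# Illustrative only: prices, QALY gain, and comparator cost are hypothetical.

THRESHOLD_PER_QALY = 50_000.0  # the example cutoff cited in the text


def icer(cost_new: float, cost_comparator: float, incremental_qalys: float) -> float:
    """Incremental cost-effectiveness ratio in dollars per QALY gained."""
    if incremental_qalys <= 0.0:
        raise ValueError("this sketch only handles a positive QALY gain")
    return (cost_new - cost_comparator) / incremental_qalys


def highest_cost_meeting_threshold(
    cost_comparator: float,
    incremental_qalys: float,
    threshold: float = THRESHOLD_PER_QALY,
) -> float:
    """Largest cost of the new therapy whose ICER does not exceed the threshold."""
    return cost_comparator + threshold * incremental_qalys


COMPARATOR_COST = 4_000.0
QALY_GAIN = 0.2

ceiling = highest_cost_meeting_threshold(COMPARATOR_COST, QALY_GAIN)
for label, intended_price in [("product A", 9_000.0), ("product B", 20_000.0)]:
    ratio = icer(intended_price, COMPARATOR_COST, QALY_GAIN)
    if intended_price <= ceiling:
        note = f"within the threshold; could be priced up to ${ceiling:,.0f}"
    else:
        note = f"over the threshold; would need to fall to ${ceiling:,.0f}"
    print(f"{label}: ${intended_price:,.0f} gives ${ratio:,.0f} per QALY, {note}")
```

In this invented case any price up to $14,000 meets the cutoff, so a product that would otherwise have sold for $9,000 ($25,000 per QALY) has room to raise its price, while one intended to sell for $20,000 ($80,000 per QALY) would have to cut it. Both movements follow from the same explicit rule.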

Criteria should not be varied on a decision-by-decision basis for at least two reasons: fairness and efficiency. Fairness requires that the “rules of the road” be understood by all stakeholders and determined behind a “veil of ignorance” to ensure equity and justice (Rawls 1971). Efficiency requires that we spend time up front agreeing on the methodology and not delay each decision by revisiting criteria. Thus significant investment in the process must be made up front. Moreover, transparency will force disclosure of considerations that heretofore have been only implicit (including preferences, acceptability, and precedent) and neither apparent nor explicitly disclosed as critical considerations to the broad range of stakeholders. Other groups are moving ahead with such efforts, including America’s Health Insurance Plans (AHIP); the specific goal of the AHIP project is to develop explicit guidance on how the certainty of evidence and magnitude of effect should be integrated with the contextual and clinical issues in making coverage decisions. Much as a marketplace of ideas represents a healthy situation for health technology assessments, here too it will be constructive in the development of a taxonomy acceptable to a broad range of decision makers and stakeholders.


A theme underlying many workshop discussions of the need for better use of clinical data to assess interventions was the effect of privacy regulations, especially HIPAA (see Box 2-1), on current and future efforts to maximize learning from the healthcare system. The potential to collect and combine large quantities of data, including information derived from the point of care, has broad implications for research on clinical effectiveness, as well as for privacy. As we extend our capacity to collect and aggregate data on medical care, researchers are increasingly confronted with limited access to data or burdensome regulatory requirements to conduct research, and privacy regulations are frequently cited as a significant constraint in clinical research.

BOX 2-1

HIPAA Privacy Provisions.

Concerns cited during the workshop about these HIPAA regulations revealed broad implications for the notion of a learning healthcare system. The prospect of learning as a by-product of everyday care rests on the notion of collecting the data that result from these everyday clinical interactions. Many of the workshop participants cited HIPAA regulations as limiting research efforts and expressed further concern that these limitations would be magnified with efforts to bring together clinical data to generate insights on clinical effectiveness. Throughout this workshop, there was a common concern that protection of privacy is crucial, yet the likely gains for public health need to be taken into account and privacy regulations should be implemented in a manner that is compatible with research. Ensuring that this goal is met will require collaboration between patients and the research community and careful consideration of the concerns of patients and the public at large. In the following piece, Janlori Goldman and Beth Tossell highlight many of the key issues from the patient perspective on privacy issues.


Janlori Goldman, J.D., and Beth Tossell

Health Privacy Project

Critical medical information is often nearly impossible to access both in emergencies and during routine medical encounters, leading to lost time, increased expenses, adverse outcomes and medical errors. Imagine the following scenarios:

  • You are rushed to the emergency room, unable to give the paramedics your medical history.
  • Your young child gets sick on a school field trip, and you are not there to tell the doctor that your child has a life-threatening allergy to penicillin.
  • As you are being wheeled into major surgery, your surgeon realizes she must first look at an MRI taken two weeks earlier at another hospital.

If health information were easily available electronically, many of the nightmare scenarios above could be prevented.

But, to many, the potential benefits of a linked health information system are matched in significance by the potential drawbacks. The ability to enhance medical care coexists with the possibility of undermining the privacy and security of people’s most sensitive information. In fact, privacy fears have been a substantial barrier to the development of a national health information network. A 1999 survey by the California HealthCare Foundation showed that even when people understood the huge health advantages that could result from linking their health records, a majority believed that the risks—of lost privacy and discrimination—outweighed the benefits.

The issue does not split along partisan lines; prominent politicians from both parties have taken positions both for and against electronically linking medical records. During speeches to Congress and the public in 1993, former President Bill Clinton touted a prototype “health security card” that would allow Americans to carry summaries of their medical records in their wallets. In response, former Senate Minority Leader Bob Dole decried the health plan as “a compromise of privacy none of us can accept.” And yet, in his State of the Union address last month, President Bush advocated “computerizing health records [in order to] avoid dangerous medical mistakes, reduce costs, and improve care.”

History of Medical Record Linkage

But since the HIPAA privacy rule went into effect last April, the issue of unique health identifiers has resurfaced in the political debate. In November, the Institute of Medicine issued a report urging legislators to revisit the question of how to link patient data across organizations. “Being able to link a patient’s health care data from one department location or site to another unambiguously is important for maintaining the integrity of patient data and delivering safe care,” the report concluded. In fact, the Markle Foundation’s Information Technologies for Better Health program recently announced that the second phase of its Connecting for Health initiative will be aimed at recommending policy and technical options for accurately and securely linking patient records. Decision makers in the health arena are once again grappling with the questions of whether and how to develop a national system of linking health information.

Is It Linkage? Or Is It a Unique Health Identifier?

The initial debate over linking medical records foundered over concern that any identifier created for health care purposes would become as ubiquitous and vulnerable as the Social Security number. At a hearing of the National Committee on Vital and Health Statistics in 1998, one speaker argued that “any identifier issued for use in health care will become a single national identifier . . . used for every purpose under the sun including driver’s licenses, voter registration, welfare, employment and tax.”

Using a health care identifier for non-health purposes would make people’s information more vulnerable to abuse and misuse because the identifier would act as a key that could unlock many databases of sensitive information. To break this impasse, a more expansive approach is needed, focusing on the overarching goal of securely and reliably linking medical information. An identifier is one way to facilitate linkage, but not necessarily the only one. A 1998 NCVHS (National Committee on Vital and Health Statistics) white paper identified a number of possible approaches to linkage, some of which did not involve unique identifiers. At this stage, we should consider as many options as possible. It is simplistic to suggest that creating linkages is impossible simply because some initial proposals were faulty.

Linkage Will Improve Health Care

A reliable, confidential and secure means of linking medical records is necessary to provide the highest quality health care. In this era of health care fragmentation, most people see many different providers, in many different locations, throughout their lives. To get a full picture of each patient, a provider must request medical records from other providers or the patient, a burdensome process that rarely produces a thorough and accurate patient history, and sometimes produces disastrous errors. According to the Institute of Medicine, more than 500,000 people annually are injured due to avoidable adverse drug events in the United States. Linking medical records is, literally, a matter of life and death.

The question, then, is not whether we need to link medical records but what method of linking records will best facilitate health care while also protecting privacy and ensuring security. The time is long overdue for politicians, technical specialists, and members of the health care industry to find a workable solution.

Privacy Must Be Built in from the Outset

If privacy and security are not built in at the outset, linkage will make medical information more vulnerable to misuse, both within health care and for purposes unrelated to care. Even when most records are kept in paper files in individual doctors’ offices, privacy violations occur. People have lost jobs and suffered stigma and embarrassment when details about their medical treatment were made public. Putting health information in electronic form, and creating the technical capacity to merge it with the push of a button, only magnifies the risk. Recently, computers containing the medical records of more than 500,000 retired and current military personnel were stolen from a Department of Defense contractor. If those computers had been linked to an external network, the thieves might have been able to break into the records without even entering the office. We must therefore make sure that any system we implement is as secure as possible.

Similar Obstacles Have Been Overcome in Other Areas

The fields of law enforcement and banking have succeeded in linking personal information across sectors, companies and locations. Like health care, these fields are decentralized, with many points of entry for data and many organizations with proprietary and jurisdictional differences. Yet the urgent need to link information has motivated them to implement feasible and relatively secure systems. Law enforcement, for example, uses the Interstate Identification Index, which includes names and personal identification information for most people who have been arrested or indicted for a serious criminal offense anywhere in the country. In the banking industry, automated teller machines use a common operating platform that allows information to pass between multiple banks, giving people instant access to their money, anytime, almost anywhere in the world with an ATM card and a PIN.

Although the health care field is particularly diverse, complex, and disjointed, these examples show that, with dedication and creativity, it is possible to surmount both technical and privacy barriers to linking large quantities of sensitive information. One caveat: no information system, regardless of the safeguards built in, can be 100 percent secure. But appropriate levels of protection, coupled with tough remedies and enforcement measures for breaches, can strike a fair balance.

First Principles

In resolving the conjoined dilemmas of linking personal health information and maintaining confidentiality, the Health Privacy Project urges adherence to the following first principles:

  • Any system of linkage or identification must be secure, limiting disclosures from within and preventing unauthorized outside access.
  • An effective system of remedies and penalties must be implemented and enforced. Misuse of the identifier, as well as misuse of the information to which it links, must be penalized.
  • Any system of linkage or identifiers must be unique to health care.
  • Patients must have electronic access to their own records.
  • A mechanism for correcting—or supplementing—the record must be in place.
  • Patients must have the ability to opt out of the system.
  • Consideration should be given to making only core encounter data (e.g., blood type and drug allergies) accessible in emergencies and developing the capacity for a more complete record to be available with patient consent in other circumstances, such as to another provider.

With these privacy protections built in at the outset, a system of linking medical records may ultimately gain the public’s approval.





Naihua Duan, Ph.D., Sheldon Greenfield, M.D., Sherrie H. Kaplan, Ph.D., M.P.H., David Kent, M.D., Richard Kravitz, M.D., M.S.P.H., Sharon-Lise Normand, Ph.D., Jose Selby, M.D., M.P.H., Paul Shekelle, M.D., Ph.D., Hal Stern, Ph.D., Thomas R. Tenhave, Ph.D., M.P.H.; paper developed for a research conference sponsored by Pfizer.


Acute Physiology and Chronic Health Evaluation (APACHE) scores are often utilized to stratify patients according to risk.


Text reprinted from iHealthBeat, February 2004, with permission from the California HealthCare Foundation, 2007.

Copyright © 2007, National Academy of Sciences.
Bookshelf ID: NBK53488

