The following section describes some of the most common weaknesses in study design seen by medical technology evaluators.
Poorly Described Patient Populations. Unless the criteria used to determine patients' eligibility for a study are clearly outlined, and the characteristics of the patients who are actually enrolled are clearly described, it is impossible to know to which populations of patients the results of the study can be confidently applied, or whether the results of two different studies are truly comparable. If the experience of particular subgroups of patients is important and likely to vary, then enrollment should be stratified on the basis of those subgroup characteristics.
Too Narrow a Patient Population. The patients enrolled in a study of a new technology should be similar to the patients in whom the technology is most likely to be used. The enrolled population should include individuals without the target disease of interest, such as patients with risk factors for the disease but without the disease itself, patients with different but commonly confused conditions, and patients with other types of pathology in the same organ systems. Failure to enroll the appropriate spectrum of patients in a study of a new diagnostic technology can lead to overestimates of both the sensitivity and specificity of the new technology. Similarly, failure to enroll an appropriate spectrum of patients in a study of a new therapy can lead to an overestimate of the effectiveness of that therapy. This particular failure lies at the root of countless headlines announcing breakthrough procedures or therapies that kindle excitement but deliver only false hopes, leaving the public wondering why there are so few breakthroughs in their own treatment.
Failure to Use Appropriate Controls or Comparison Groups. The purpose of a control group is to allow the observer to conclude that any change observed in the “active treatment group” is due to the treatment being studied, rather than to other factors. Control groups are particularly important when factors in addition to the intervention under study can affect the outcome of interest, when the new technology of interest and some established technology are both effective, and when the natural course of untreated disease is not clear or consistent, as is the case with breast cancer. Failure to use a control group, or use of an inappropriate control group, can make it impossible to draw meaningful conclusions from a study.
Failure to Demonstrate the Comparability of Patients in Treatment and Control Groups. Given the purpose of a control group, it is important that patients in the treatment and control groups be similar in terms of baseline characteristics that can influence the outcome of the intervention under study. For example, if one study group included more women at high risk for breast cancer than another group, then a detection technology tested in the high-risk group would likely detect more cancer cases than a technology tested in the low-risk group, leading to the perception that the detection system was more sensitive than a system tested in lower risk patients.
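To make the point concrete, here is a small Python sketch with invented numbers: a detection technology with identical true sensitivity finds very different numbers of cancers in groups with different underlying risk, which can create the impression of a performance difference between technologies tested in non-comparable groups.

```python
# Hypothetical illustration: identical sensitivity, different baseline risk.
# All numbers are invented for this sketch.
sensitivity = 0.85   # assumed true sensitivity of the detection technology
n_screened = 1000

for label, prevalence in (("high-risk group", 0.05), ("low-risk group", 0.01)):
    cancers_present = round(n_screened * prevalence)
    cancers_detected = sensitivity * cancers_present
    print(f"{label}: {cancers_detected:.1f} cancers detected per {n_screened} screened")
```

The technology performs identically in both groups, yet the raw count of detected cancers is several times higher in the high-risk group.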
Unclear Definition of Study Endpoints. Medical technologies can be assessed at multiple levels, depending upon whether they are diagnostic or therapeutic. The most basic level at which a diagnostic technology can be assessed is definition of its performance characteristics: sensitivity and specificity. Even this basic level of assessment is not easy to perform, because it requires comparison of the performance of the new technology with that of a gold standard, and true gold standards (such as tissue obtained during surgery) are not always available.
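As a minimal sketch of what computing these performance characteristics involves, the following Python fragment tabulates a hypothetical test against a gold standard; all counts are invented for illustration.

```python
# Hypothetical 2x2 counts comparing a new diagnostic test against a gold
# standard (e.g., tissue obtained during surgery). All numbers are invented.
true_pos = 90    # test positive, disease present per gold standard
false_neg = 10   # test negative, disease present
true_neg = 160   # test negative, disease absent
false_pos = 40   # test positive, disease absent

sensitivity = true_pos / (true_pos + false_neg)   # fraction of diseased correctly detected
specificity = true_neg / (true_neg + false_pos)   # fraction of non-diseased correctly cleared
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Note that every cell of this table depends on the gold standard being trustworthy; when the reference standard is itself imperfect, both estimates are distorted.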
Bias. The confidence you can have that the results of using a technology described in a study are the same results you would get if you used the technology in a similar fashion depends on the absence of bias. Bias refers to systematic sources of variation that distort the results of a study in one direction or another. Many types of bias have been described, including some that are especially problematic in cancer screening. The most common general sources of bias in clinical trials are:
Confounding. A confounding variable is one that falsely obscures or accentuates the relationship between two factors, such as the effect of a treatment on patient outcome. Confounding occurs when a factor other than the interventions being compared is not distributed equally in the study groups being assessed and affects the outcome of interest.
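The way an unevenly distributed factor can distort a comparison is easiest to see with numbers. The following Python sketch uses invented counts, arranged so that severe cases are concentrated in one treatment arm: treatment A then looks worse overall even though it performs better within each severity stratum (the pattern known as Simpson's paradox).

```python
# Invented (success, patient) counts per (treatment, severity) stratum.
# Severe cases were disproportionately assigned to treatment A, so severity
# confounds the overall comparison of A versus B.
data = {
    ("A", "mild"):   (81, 87),
    ("A", "severe"): (192, 263),
    ("B", "mild"):   (234, 270),
    ("B", "severe"): (55, 80),
}

def rate(pairs):
    """Pooled success rate across a list of (successes, patients) pairs."""
    successes = sum(p[0] for p in pairs)
    patients = sum(p[1] for p in pairs)
    return successes / patients

overall_A = rate([data[("A", "mild")], data[("A", "severe")]])
overall_B = rate([data[("B", "mild")], data[("B", "severe")]])
print(f"overall: A {overall_A:.2f} vs B {overall_B:.2f}")   # A looks worse
for severity in ("mild", "severe"):
    a = rate([data[("A", severity)]])
    b = rate([data[("B", severity)]])
    print(f"{severity}: A {a:.2f} vs B {b:.2f}")            # A better in each stratum
```

Randomization and stratified analysis exist precisely to prevent, or at least expose, this kind of distortion.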
Systematic Errors or Differences in Measurement. Selection bias can occur inadvertently if there are systematic errors or differences in the way particular patient characteristics (e.g., eligibility criteria) are measured or in the way a determination is made of the intervention to which a patient was exposed. (The latter could be a problem when exposure to the intervention is ascertained from insurance claims data, which may or may not be comprehensive or accurate.) The most common sources of bias due to measurement error, however, arise in evaluation of the outcomes of patients in the two arms of a study. Ascertainment of patient outcomes by an “unblinded” investigator who knows what intervention each patient received poses a serious risk of bias. An unblinded investigator, for example, may interpret particular findings differently, or look for particular findings with varying effort, if she or he has preconceived notions about the comparative effects of the two technologies under study. Finally, although it may seem obvious that measurements of the outcomes of patients in the two arms of a study should be performed in an identical manner and at the same point in time (relative to the interventions under study), this important aspect of study design is not always followed.
Loss of Patients to Follow-Up. Anyone who has conducted an observational study knows how difficult it is to follow patients over time. Loss of patients to follow-up becomes a threat to the internal validity of a study when it occurs in a substantial proportion of patients and at differential rates in the various arms of a study. Failure to account for all patients who were initially enrolled in a study is particularly problematic. In one study submitted to the Food and Drug Administration (FDA), for example, data on patients who had received a new device were reported only for those patients who were followed for at least one year. Many patients dropped out of the study prior to the one-year endpoint, however, due either to side effects or to ineffectiveness of the device. Consequently, the results reported to the FDA exaggerated the effectiveness and tolerability of the device. All enrolled patients thus must be accounted for. If some patients withdraw or are lost to follow-up, the number of withdrawals and losses in each arm should be reported with specification of the reasons for withdrawal.
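A small Python sketch, with invented numbers loosely echoing the FDA example above, shows how reporting outcomes only for patients who completed follow-up overstates effectiveness relative to counting every enrolled patient.

```python
# Invented numbers: dropouts (due to side effects or ineffectiveness)
# are excluded from a completers-only analysis, inflating the result.
enrolled = 200
dropped_out = 80                      # left before the one-year endpoint
completers = enrolled - dropped_out   # 120 followed for a full year
improved = 90                         # improved among completers

completers_only_rate = improved / completers  # analysis restricted to completers
all_enrolled_rate = improved / enrolled       # counting dropouts as failures
print(f"completers only: {completers_only_rate:.2f}")
print(f"all enrolled:    {all_enrolled_rate:.2f}")
```

The completers-only figure (75 percent) is substantially more flattering than the rate among all enrolled patients (45 percent), which is why intention-to-treat accounting of every enrolled patient matters.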
Inappropriate Statistical Analysis and Planning. On occasion, statistical analyses reported in published studies are not performed correctly, or the most appropriate statistical analyses are not performed. In other instances, statistical issues, such as statistical power to detect a difference between two arms of a study if one really existed, do not seem to have been adequately considered in planning the study, or had to be compromised for practical reasons (such as study cost or patient availability). As a result, the results reported in some studies are misleading and have a significant probability of being wrong. Investigators should report the statistical significance of their results, and provide 95 percent confidence intervals around group differences or main effects. In addition, if relative risks or odds ratios are reported (such as reporting that a particular outcome is twice as likely to occur with treatment A as with treatment B), the absolute rate with which the outcome occurs also should be reported.
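The last point about relative versus absolute measures can be sketched in a few lines of Python; the trial counts below are invented, and the confidence interval uses a simple normal approximation for the risk difference.

```python
import math

# Hypothetical trial counts: a relative risk of 2 ("twice as likely")
# sounds alarming, but the absolute rates show how rare the outcome is.
events_A, n_A = 4, 1000   # 4 events per 1,000 patients on treatment A
events_B, n_B = 2, 1000   # 2 events per 1,000 patients on treatment B

p_A = events_A / n_A
p_B = events_B / n_B
relative_risk = p_A / p_B        # 2-fold relative increase
risk_difference = p_A - p_B      # only 2 extra events per 1,000 patients

# Approximate 95 percent confidence interval for the risk difference
# (normal approximation; a rough sketch, not a full statistical analysis).
se = math.sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)
ci = (risk_difference - 1.96 * se, risk_difference + 1.96 * se)
print(f"relative risk = {relative_risk:.1f}, risk difference = {risk_difference:.3f}")
print(f"95% CI for risk difference: ({ci[0]:.4f}, {ci[1]:.4f})")
```

With counts this small the interval for the risk difference spans zero, illustrating why both the absolute rates and the uncertainty around group differences should accompany any reported relative risk.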
Poorly Described Techniques. Diagnostic and therapeutic interventions are often delivered using very specific protocols or techniques that affect their effectiveness or safety. For example, different pulse sequences can be used in magnetic resonance imaging studies, and different software might base comparisons of digitized mammography images on different calculations. Unless the technology under study, and the technologies to which it is being compared, are clearly described, it is not possible to meaningfully compare the results of one study to those of other studies of what appears to be the same technology. Without such descriptions it also may be difficult or even impossible to judge the relevance of the study results.
For example, different breast cancer detection technologies vary in their ability to detect microcalcifications, which are not cancerous lesions but are significant breast cancer risk factors. Ultrasound, for instance, is highly sensitive to lesions but does a poor job of detecting microcalcifications.
Institute of Medicine (US) and National Research Council (US) Committee on New Approaches to Early Detection and Diagnosis of Breast Cancer; Joy JE, Penhoet EE, Petitti DB, editors. Saving Women's Lives: Strategies for Improving Breast Cancer Detection and Diagnosis. Washington (DC): National Academies Press (US); 2005. Appendix D, Common Weaknesses in Study Designs.