Methods

Topic Development

The current report is designed to update Efficacy and Comparative Effectiveness of Atypical Antipsychotics for Off-label Use, which the Agency for Healthcare Research and Quality (AHRQ) published in 2006. Since this is an update, we tried to be as consistent as possible with regard to the general topics, scope of work, and analytical methods, but made revisions to reflect the important changes mentioned in the introduction. The key questions were posted on the AHRQ Effective Health Care Program Web site to obtain public comments which were considered when focusing the scope of this report. The present evidence report focuses on eight Food and Drug Administration (FDA)-approved atypical antipsychotics (clozapine was excluded because of its documented severe or life-threatening side effects) used for the following psychiatric conditions: anxiety, attention-deficit hyperactivity disorder (ADHD), dementia and severe geriatric agitation, depression, eating disorder, insomnia, obsessive-compulsive disorder (OCD), post-traumatic stress disorder (PTSD), personality disorders, substance abuse, and Tourette's syndrome. We reviewed all conditions among adults (defined as 18 years old and older); for ADHD, eating disorders, insomnia, and Tourette's syndrome, children (younger than 12 years old) and adolescents (12–17 years old) were also included. Autism, which was included in the original study, is included in a report on the comparative effectiveness of typical and atypical antipsychotics for on-label indications conducted by another Evidence-based Practice Center. Thus, autism is excluded from the present review.

Analytic Framework

Figure 1 presents the analytic framework for the update of this Comparative Effectiveness Review, with the five Key Questions depicted. First, by reviewing utilization data, surveys on prescribing patterns, and general information about the leading off-label uses, new off-label uses and trends in utilization in the target populations are summarized. Next, by using data from clinical trials and large cohort studies, evidence of benefits and harms in treating the mental health conditions is documented. The evidence of benefits—efficacy and comparative effectiveness (vs. placebo, vs. other atypicals, or vs. conventional therapy) for the off-label indications—is evaluated separately for each of the atypical antipsychotics within condition (dementia, OCD, PTSD, depression, etc.) via the examination of selected outcome measures, mainly symptom response rates measured by recognized psychometric tools.

Figure 1

Analytic framework for comparative effectiveness review: off-label uses of atypical antipsychotics.

Benefits and harms for specific subpopulations (by gender, age, and race/ethnicity) or related to other important factors (setting, severity of condition, length of use, and dosage) are documented. Special attention is given to identify the efficacious dose and time limit for off-label indications. The evidence of risks—adverse events associated with off-label indications—is summarized, first within individual drugs across condition, and then compared within the class and with other drugs used for the conditions.

Search Strategy

We conducted an initial update search on June 1, 2008, as part of a project to determine if Comparative Effectiveness Reviews (CERs) funded by AHRQ needed updating; this search included only the drugs aripiprazole, olanzapine, quetiapine, risperidone, and ziprasidone. Regular update searches continued through May 2011. The search for off-label use of the newly approved atypicals (iloperidone, paliperidone and asenapine) included all years available in the electronic databases through May 2011. Searches for utilization data were conducted, as were searches for use for new conditions (anxiety, ADHD, eating disorders, insomnia, and substance abuse). Databases searched include: DARE (Database of Abstracts of Reviews of Effects), Cochrane Database of Systematic Reviews, CENTRAL (Cochrane Central Register of Controlled Trials), PubMed (National Library of Medicine, includes MEDLINE), Embase (biomedical and pharmacological bibliographic database), CINAHL (Cumulative Index to Nursing and Allied Health Literature), and PsycINFO. A summary of detailed search strategies is available in Appendix A. Other sources of literature include clinicaltrials.gov, references of included studies, references of relevant reviews, and personal files from related topic projects. In addition, the AHRQ Effective Health Care Program Scientific Resource Center (SRC) at Oregon Health Sciences University requested unpublished studies from pharmaceutical manufacturers and searched the FDA and Health Canada databases.

Technical Expert Panel

A Technical Expert Panel (TEP) provided expertise and different perspectives on the topic of this review. We invited a distinguished group of scientists and clinicians to participate in the TEP. We aimed to have at least one expert on each psychiatric condition on our TEP. TEP conference calls were held in November 2009 and February 2010. TEP members and their affiliations are listed in the front matter.

The TEP provided valuable information throughout the entire study. It provided information to identify literature search strategies; helped to decide appropriate outcome measures for specific psychiatric conditions and to identify recently published or ongoing clinical trials; and recommended approaches to specific issues raised from the public posting.

Study Selection

Two trained researchers reviewed the list of titles resulting from our electronic searches and selected articles to obtain. Each article retrieved was reviewed with a brief screening form (see Appendix B: screener) that collected data on medication, psychiatric condition, study design, population, sample size, and study duration. Only studies on humans were included. Studies that did not report any outcomes of efficacy, effectiveness, safety/adverse events, or utilization patterns were excluded. As single-dose or short-term trials (less than 6 weeks in length) are common for several of the new uses, we decided, at the TEP's suggestion, not to limit inclusion by study duration. Clinical trials were used to review efficacy outcomes. When no clinical trials were found for a given condition or drug of interest, we turned to observational studies.

All reported side effects and adverse events were abstracted from clinical trials, even if the trial did not report efficacy or effectiveness results. We also included large observational studies of adverse events. Reports of utilization and prescribing patterns were accepted if they discussed use in the United States since 1995.

Data Extraction

Data were independently abstracted by a health services researcher and a psychiatrist trained in the critical assessment of evidence. The following data were abstracted from included trials: trial name, setting, population characteristics (including sex, age, ethnicity, and diagnosis), eligibility and exclusion criteria, interventions (dose, frequency, and duration), any co-interventions, other allowed medication, comparisons, and results for each outcome. Data abstraction forms are provided in Appendix B.

For efficacy and effectiveness outcomes, a statistician extracted data. Published summary data for each treatment or placebo arm within a trial were collected. For outcomes that reported count data, event counts and sample sizes by group were extracted. For continuous outcomes, sample size, mean difference, and standard deviations were extracted. If a study did not report a mean difference by outcome or if a mean difference could not be calculated from the given data, the study was excluded from analysis. For those trials that did not report a followup standard deviation, we imputed one by assigning the weighted mean standard deviation from other trials that reported the standard deviation for the same outcome.
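The imputation rule for missing standard deviations can be illustrated with a minimal Python sketch. The report does not state which weights were used, so weighting by sample size is an assumption here, and the function name and example numbers are hypothetical.

```python
from typing import List, Optional, Sequence


def impute_missing_sds(
    sample_sizes: Sequence[int], sds: Sequence[Optional[float]]
) -> List[float]:
    """Replace missing SDs (None) with the weighted mean of the reported SDs.

    Assumption: weights are trial sample sizes; the report only says
    "weighted mean" without naming the weights.
    """
    reported = [(n, sd) for n, sd in zip(sample_sizes, sds) if sd is not None]
    if not reported:
        raise ValueError("no trial reported a standard deviation for this outcome")
    total_n = sum(n for n, _ in reported)
    weighted_mean_sd = sum(n * sd for n, sd in reported) / total_n
    return [sd if sd is not None else weighted_mean_sd for sd in sds]


# Hypothetical trials measuring the same outcome; the second reports no SD.
sizes = [50, 100, 50]
reported_sds = [4.0, None, 8.0]
imputed = impute_missing_sds(sizes, reported_sds)
print(imputed)
```

Only trials that reported a standard deviation contribute to the weighted mean; the trial with the missing value receives that mean unchanged.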

Data from publications reporting adverse events were extracted by two reviewers and reconciled by a third. Since the most common type of data reported across adverse event publications were sample size and number of people with each event, we collected this information by treatment. Each event was counted as if it represented a unique individual. Because a single individual might have experienced more than one event, this assumption may have overestimated the number of people having an adverse event. A trial needed to report at least one instance of an adverse event in order to be included in the analysis of that adverse event. This decision may over- or underestimate the number of patients with that adverse event, but seems the only logical choice.

Quality Assessment

To assess internal validity, we abstracted data on the adequacy of the randomization method; the adequacy of allocation concealment; maintenance of blinding; similarity of compared groups at baseline and the author's explanation of the effect of any between-group differences in important confounders or prognostic characteristics; specification of eligibility criteria; maintenance of comparable groups (i.e., reporting of dropouts, attrition, crossover, adherence, and contamination); the overall proportion of subjects lost to followup and important differences between treatments; use of intent-to-treat analysis; post-randomization exclusions, and source of funding. We defined loss to followup as the number of patients excluded from efficacy analyses, expressed as a proportion of the number of patients randomized.

To assess external validity, we recorded the number screened, eligible, and enrolled; the use of run-in and washout periods or highly selective criteria; the use of standard care in the control group; and overall relevance. Funding source was also abstracted.

To arrive at a quantitative measure, we used the Jadad scale, which was developed for drug trials. This method measures quality on a scale that ranges from 0 to 5, assigning points for randomization, blinding, and accounting for withdrawals and dropouts.17 Across a broad array of meta-analyses, an evaluation found that trials scoring 0–2 report exaggerated results compared with trials scoring 3–5.18 The latter have been called “good” quality and the former called “poor” quality.

The Newcastle-Ottawa Scale19 was used to assess internal validity of observational studies of adverse events.

Applicability

The terms “efficacy” and “effectiveness” are often used interchangeably, but they differ in important ways. CERs assess both the internal validity and the external validity (i.e., applicability or generalizability) of included studies. Internal validity is emphasized in efficacy studies, while applicability is emphasized in effectiveness studies. The efficacy of an intervention measures the extent to which the intervention works under ideal circumstances, and the effectiveness of an intervention measures the extent to which the intervention works under real world conditions.20 Therefore, designs of effectiveness trials are based on conditions of routine clinical practice, and outcomes of effectiveness trials are more essential for real world clinical decisions.

The fundamental distinction between efficacy and effectiveness studies lies in the populations and control over the intervention(s).21 Efficacy studies tend to be performed on referred patients and in specialty settings, and enrolled populations are highly selected (patients with comorbidities may be excluded); effectiveness studies are usually conducted on populations in primary care settings, which reflect the heterogeneity of the general population and thus are more representative. The vast majority of studies included in our report are efficacy studies as there are few effectiveness studies reporting health outcomes of interest. However, effectiveness studies are included in our analyses of adverse events.

Rating the Body of Evidence

We assessed the overall strength of evidence for intervention efficacy using guidance suggested by AHRQ for its Effective Health Care Program.22 This method is based loosely on one developed by the GRADE Working Group,23 and classifies the grade of evidence according to the following criteria:

  • High = High confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect.
  • Moderate = Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.
  • Low = Low confidence that the evidence reflects the true effect. Further research is likely to change our confidence in the estimate of effect and is likely to change the estimate.

The evidence grade is based on four primary domains (required) and four optional domains. The required domains are risk of bias, consistency, directness, and precision; the additional domains are dose-response, plausible confounders that would decrease the observed effect, strength of association, and publication bias. A brief description of the required domains is displayed in Table 1 below. For this report, we used both this scoring scheme and the global implicit judgment about “confidence” in the result. Where the two disagreed, we went with the lower classification.

Table 1

Grading the strength of a body of evidence: required domains and their definitions.

Data Synthesis

We constructed evidence tables displaying the study characteristics and results for all included trials (Appendix D). Trials that evaluated one atypical antipsychotic against another and provided direct evidence were classified as “head-to-head” trials. “Active” controlled trials compared an atypical antipsychotic with another class of medication. Trials that compared atypical antipsychotics with a placebo were referred to as “Placebo” controlled trials. Finally, trials that compared an antipsychotic taken with another medication with the other medication alone were examined (referred to as augmentation trials). We provided four separate evidence tables, one for each type of study (head-to-head, active control, placebo control, and augmentation).

Efficacy

For the efficacy analyses, we focused on controlled trials. Effect sizes were calculated for each comparison, for studies reporting a continuous outcome. If all trials within a condition and subgroup used the same scale, then the effect size did not need to be standardized and a mean difference was calculated. For subgroups where pooling was done across several scales, we calculated a standardized mean difference using the Hedges' g effect size.24 A positive effect size indicates that the atypical drug has a higher efficacy than does the comparison arm (active control or placebo arm). Effect sizes of 0.20 or smaller were considered small, sizes of 0.50 and greater were considered large, and those between were considered moderate.25
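As an illustration of the standardized mean difference described above, the following Python sketch computes Hedges' g with the usual small-sample correction and applies the size thresholds given in the text. The function names and example values are hypothetical, and the sketch assumes the outcome is oriented so that a larger mean favors the atypical arm.

```python
import math


def hedges_g(mean_t: float, sd_t: float, n_t: int,
             mean_c: float, sd_c: float, n_c: int) -> float:
    """Standardized mean difference (treatment minus comparison), Hedges' g."""
    df = n_t + n_c - 2
    # Pooled standard deviation across the two arms.
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    cohen_d = (mean_t - mean_c) / pooled_sd
    # Common approximation to the small-sample bias correction factor.
    correction = 1 - 3 / (4 * df - 1)
    return correction * cohen_d


def label_effect_size(g: float) -> str:
    """Thresholds as stated in the text: <= 0.20 small, >= 0.50 large."""
    size = abs(g)
    if size <= 0.20:
        return "small"
    if size >= 0.50:
        return "large"
    return "moderate"


# Hypothetical trial: mean improvement 12 vs. 10, SD 5 in both arms, 30 per arm.
g = hedges_g(12.0, 5.0, 30, 10.0, 5.0, 30)
print(round(g, 3), label_effect_size(g))
```

When every trial in a subgroup uses the same scale, the unstandardized mean difference (the numerator alone) is sufficient, as the text notes.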

For outcomes that reported count data (number of events), relative risks (RR) were calculated. An RR greater than one indicates that the atypical has higher efficacy than does the comparison arm.
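A relative risk for a single trial can be sketched as follows. The confidence interval uses the standard log-scale normal approximation, which is an assumption; the report does not describe how intervals for individual trials were computed. The counts are hypothetical.

```python
import math
from typing import Tuple


def relative_risk(events_t: int, n_t: int,
                  events_c: int, n_c: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Return (RR, lower, upper) for treatment vs. comparison.

    Assumes both arms have at least one event, so the log is defined.
    """
    if events_t == 0 or events_c == 0:
        raise ValueError("log-based interval needs events in both arms")
    rr = (events_t / n_t) / (events_c / n_c)
    # Standard error of log(RR) under the usual large-sample approximation.
    se_log_rr = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical trial: 30 of 100 responders on the atypical, 20 of 100 on placebo.
rr, lower, upper = relative_risk(30, 100, 20, 100)
print(round(rr, 2), round(lower, 2), round(upper, 2))
```

In this hypothetical example the RR exceeds one, but the interval spans one, so the apparent benefit would not be statistically significant.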

Based on important outcomes suggested by the TEP, a psychiatrist chose which outcomes were most appropriate to pool. Poolability across studies was also important; the psychiatrist, the statistician, and the project team jointly made the selection based on their professional knowledge and also considering the frequency of an outcome measure being reported by the trials. A minimum of three studies was required for meta-analysis. An effect size or relative risk was calculated for studies that reported data but did not contribute to a pooled analysis.

For trials that were judged sufficiently clinically similar to warrant meta-analysis, we estimated a pooled random-effects estimate26 of the overall effect size or RR in outcome measures. The individual trial outcomes were weighted by both within-study variation and between-study variation in this synthesis.
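The weighting by within-study and between-study variation can be illustrated with a DerSimonian-Laird estimator, one widely used random-effects method. Its use here is an assumption for illustration; this passage does not name the estimator, and the report's analyses were run in Stata, not Python. The inputs are hypothetical effect sizes and their variances.

```python
import math
from typing import Sequence, Tuple


def dersimonian_laird(effects: Sequence[float],
                      variances: Sequence[float]) -> Tuple[float, float, float]:
    """Return (pooled effect, standard error, tau^2) under a random-effects model."""
    k = len(effects)
    if k < 3:
        # The report required a minimum of three studies for meta-analysis.
        raise ValueError("at least three studies are required")
    w = [1 / v for v in variances]  # inverse within-study variance
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Method-of-moments estimate of between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]  # within + between variance
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se, tau2


# Hypothetical effect sizes from three trials with equal variances.
pooled, se, tau2 = dersimonian_laird([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(round(pooled, 3), round(se, 3), round(tau2, 3))
```

Adding the between-study variance to each trial's own variance widens the pooled standard error relative to a fixed-effect analysis whenever the trials disagree more than chance would predict.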

We assessed publication bias for each condition that was pooled. Tests were conducted using the Begg adjusted rank correlation test27 and the Egger regression asymmetry test.28 Heterogeneity was assessed using the Q test and I-squared29 test. All meta-analyses were conducted with Stata statistical software, version 10.0 (Stata Corp., College Station, Texas).30

We reviewed and when appropriate included studies used in the 2006 CER. For efficacy outcomes, pooled analysis included both new studies and those included in the 2006 CER when clinically similar.

Adverse Events

All adverse-event data from the prior report were combined with adverse event data extracted from new studies, as long as there was no overlap. We identified mutually exclusive groups of similar events, based on clinical expertise. For example, events that affected the head, ear, eye, nose, or throat were grouped together as HEENT. For each adverse-event group, we report the number of trials that provided data for any event in the subgroup. We also report the total number of individuals in the treatment group as well as the number who were observed to have experienced the event. We then report the analogous counts for the control groups.

Adverse events were analyzed based on three comparison types: atypical antipsychotics versus placebo; atypical antipsychotics versus other atypical antipsychotics; and atypical antipsychotics versus another active drug.

For reporting the data on adverse events, we treated each atypical antipsychotic separately and (in general) did not group them together as a class. However, we did summarize the findings of other published analyses that treated these drugs as a class. For our own analyses, we divided the study populations into three groups to make them more clinically homogeneous with respect to adverse events: children and adolescents, adults, and the elderly (i.e., the dementia trials).

For subgroups of events that occurred in two or more trials, we performed a meta-analysis to estimate the pooled odds ratio and its associated 95 percent confidence interval. Given that many of the events were rare, we used exact conditional inference to perform the pooling rather than applying the usual asymptotic methods that assume normality. Asymptotic methods require corrections if zero events are observed; generally, half an event is added to all cells in the outcome-by-treatment (two-by-two) table in order to allow estimation, because these methods are based on assuming continuity. Such corrections can have a major impact on the results when the outcome event is rare. Exact methods do not require such corrections. We conducted the meta-analyses using the statistical software package StatXact Procs v6.1 (Cytel Software, Cambridge, MA).
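The sensitivity of rare-event results to the half-event correction can be seen in a small Python sketch. It does not implement the exact conditional method used in the report; it only compares an odds ratio computed with and without adding 0.5 to every cell of the two-by-two table. All counts are hypothetical.

```python
def odds_ratio(events_t: int, n_t: int, events_c: int, n_c: int,
               correction: float = 0.0) -> float:
    """Odds ratio from a two-by-two table, with an optional per-cell correction."""
    a = events_t + correction           # treated, event
    b = (n_t - events_t) + correction   # treated, no event
    c = events_c + correction           # comparison, event
    d = (n_c - events_c) + correction   # comparison, no event
    return (a * d) / (b * c)


def relative_change(events_t: int, n_t: int, events_c: int, n_c: int) -> float:
    """Proportional shift in the odds ratio caused by adding 0.5 to each cell."""
    raw = odds_ratio(events_t, n_t, events_c, n_c)
    corrected = odds_ratio(events_t, n_t, events_c, n_c, correction=0.5)
    return abs(raw - corrected) / raw


# Rare event: 2 of 200 vs. 1 of 200. Common event: 40 of 200 vs. 20 of 200.
rare_shift = relative_change(2, 200, 1, 200)
common_shift = relative_change(40, 200, 20, 200)
# With a zero cell, only the corrected estimate can be computed at all.
zero_cell_or = odds_ratio(3, 100, 0, 100, correction=0.5)
print(round(rare_shift, 3), round(common_shift, 3), round(zero_cell_or, 2))
```

In these hypothetical tables the correction moves the rare-event odds ratio by roughly 17 percent but the common-event odds ratio by only about 1 percent, which is the concern that motivates exact methods.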

Any significant pooled odds ratio greater than one indicates the odds of the adverse event associated with the atypical antipsychotic is larger than the odds associated with the comparison (placebo, active control, or other antipsychotic) group. We calculated number needed to harm (NNH) where this occurred. We note that if no events were observed in the comparison group, but events were observed in the intervention group, the odds ratio is infinity and the associated confidence interval is bounded only from below. In such a case, we report the lower bound of the confidence interval. If no events were observed in either group, the odds ratio is undefined, which we denote as “Not calculated (NC)” in the results tables.
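The report does not state the formula used for NNH. One common approach, shown here as an assumption, converts the pooled odds ratio and an assumed comparison-group risk into an absolute risk increase and takes its reciprocal. The function name and input values are hypothetical.

```python
import math


def number_needed_to_harm(odds_ratio: float, control_risk: float) -> float:
    """NNH from an odds ratio and the comparison-group event risk.

    Assumption: NNH = 1 / (treated risk - control risk), with the treated
    risk derived from the odds ratio. This is a common convention, not
    necessarily the calculation used in the report.
    """
    if not 0 < control_risk < 1:
        raise ValueError("control_risk must be strictly between 0 and 1")
    if odds_ratio <= 1:
        raise ValueError("NNH is only meaningful for an odds ratio above one")
    treated_risk = (odds_ratio * control_risk) / (
        1 - control_risk + odds_ratio * control_risk
    )
    return 1 / (treated_risk - control_risk)


# Hypothetical: pooled odds ratio of 2.0, 10 percent event risk on placebo.
nnh = number_needed_to_harm(2.0, 0.10)
nnh_rounded = math.ceil(nnh)  # conventionally rounded up to a whole patient
print(round(nnh, 2), nnh_rounded)
```

Because the result depends on the comparison-group risk, the same odds ratio yields a much larger NNH when the adverse event is rare in the comparison group.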

Peer Review and Public Commentary

Experts on the various psychiatric conditions and various stakeholder communities (listed in the Acknowledgements section) performed an external peer review of this CER. The AHRQ Effective Health Care Program SRC located at Oregon Health Sciences University oversaw the peer review process. Peer reviewers were charged with commenting on the content, structure, and format of the evidence report and encouraged to suggest any relevant studies we may have missed. We compiled all comments and addressed each one individually, revising the text as appropriate. AHRQ and the SRC also requested review from its own staff. The SRC placed the draft report on the AHRQ Effective Health Care Program Web site (http://effectivehealthcare.ahrq.gov/) for public comment and compiled the comments for our review. We also requested review from each member of our TEP.