NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Abou-Setta AM, Mousavi SS, Spooner C, et al. First-Generation Versus Second-Generation Antipsychotics in Adults: Comparative Effectiveness [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2012 Aug. (Comparative Effectiveness Reviews, No. 63.)

First-Generation Versus Second-Generation Antipsychotics in Adults: Comparative Effectiveness [Internet].


This chapter describes the a priori methods we used to synthesize the evidence on the comparative effectiveness of first-generation antipsychotics (FGAs) and second-generation antipsychotics (SGAs) in the adult population. We describe the topic refinement process for developing the Key Questions (KQs). We outline the literature search strategy, the selection process for identifying relevant articles, the process for extracting data from eligible studies, the methods for assessing the methodological quality of individual studies and for grading the strength of evidence of the overall body of evidence, and our approach to data analysis and synthesis. In general, we followed rigorous methods for systematic reviews as described in recent standards documents27,28 and the EPC Methods Guide (

Topic Refinement and Technical Expert Panel

Our EPC was commissioned to conduct a preliminary literature review to gauge the availability of evidence and to draft the key research questions for a full comparative effectiveness review (CER). Investigators from our EPC developed the KQs in consultation with AHRQ, the Scientific Resource Center, and a Technical Expert Panel. AHRQ posted the initial questions on their Web site for public comment for a period of 1 month. After reviewing the public comments, we revised the KQs, and AHRQ approved the final questions.

We invited the Technical Expert Panel to provide content and methodological expertise throughout the development of the CER.

Literature Search Strategy

Our research librarian conducted comprehensive searches in the following electronic databases from 1950 to July 2011: Ovid’s MEDLINE®, Embase, PsycINFO, International Pharmaceutical Abstracts, Ebscohost CINAHL, ProQuest® Dissertations and Theses–Full Text, Cochrane Central Register of Controlled Trials (CENTRAL), and Scopus™ (Appendix A-1 to A-5). We also searched the U.S. National Library of Medicine’s TOXLINE® database and the MedEffect™ Canada Adverse Drug Reaction Database from 1950 to July 2010 in order to identify additional data on adverse events (AEs). We restricted the searches to English-language randomized controlled trials (RCTs), nonrandomized controlled trials (nRCTs), cohort studies, and review articles examining adults.

We selected search terms by scanning search strategies of systematic reviews on similar topics and by examining index terms of potentially relevant studies. The detailed search strategies for each database are presented in Appendix A. We conducted the original searches between July 15 and July 22, 2010, with periodic updates of the searches up to July 2011.

We hand searched conference proceedings of the American Psychiatric Association (APA) (2008–2010), the International College of Neuropsychopharmacology (2008–2010), and the International Society for Bipolar Disorders (2008–2010). To identify unpublished studies and studies in progress, we searched clinical trials registers, contacted experts in the field, and contacted authors of relevant studies. We reviewed the reference lists of reviews and guidelines to identify potential studies for inclusion. We searched for articles citing the studies that met the inclusion criteria for this review using Scopus™ Citation Tracker. We searched the grey literature by reviewing the U.S. Food and Drug Administration (FDA) Web site for relevant documents and by soliciting “Scientific Information Packets” from manufacturers of the FGAs and SGAs through the Scientific Resource Center. We collected these materials by asking the manufacturers for any material (published or unpublished) related to the KQs of the review. We made manufacturers aware that any materials submitted may become public through the Freedom of Information Act. We reviewed the materials received from several manufacturers for potential inclusion.

We used a Reference Manager® 11.0.1 (Thomson Reuters, Carlsbad, CA) bibliographic database to manage the results of our literature searches.

Criteria for Study Selection

Study selection was based on an a priori set of inclusion and exclusion criteria for study design, patient population, interventions, comparators, and outcomes (Table 2). We screened the results of our searches using a two-step process. First, two reviewers independently screened the titles and abstracts (level 1 screening) to determine if an article met the broad inclusion or exclusion criteria for study design, population, interventions, and comparators. We rated each citation as: “include,” “exclude,” or “unclear.” Records rated as “include” or “unclear” were advanced to level 2 screening. For full-text screening (level 2 screening), two reviewers independently reviewed each retrieved study using a standardized screening form (Appendix B) that was developed and piloted by the review team. We resolved discrepancies through discussion and consensus or by third-party adjudication. Reviewers were not masked to the study authors, institution, or journal.29

Table 2. Inclusion and exclusion criteria.

We included studies that included at least 80 percent of patients from the adult population (18–64 years). Polypharmacy is common in clinical practice; therefore, we did not exclude studies examining patients taking other medications from the CER. Studies that included both patients with schizophrenia and patients with bipolar disorder, but did not provide separate results for these two conditions, were included only for the AEs section (KQ3). To be included, cohort studies were required to have a followup period of at least 2 years and present data on at least one serious adverse event (SAE), as determined by the Technical Expert Panel (i.e., type II diabetes mellitus, mortality, tardive dyskinesia, and major metabolic syndromes).

Assessment of Methodological Quality

We assessed the risk of bias of RCTs and nRCTs using the Cochrane Collaboration’s Risk of Bias tool.27 We assessed the methodological quality of cohort studies using the Newcastle-Ottawa Scale.30 A priori, the research team developed decision rules regarding application of the tools.

For RCTs and nRCTs, we performed a domain-based risk of bias assessment according to the principles of the Risk of Bias tool. The domains were: (1) sequence generation (i.e., was the allocation sequence adequately generated?); (2) allocation concealment (i.e., was allocation adequately concealed?); (3) blinding of participants, personnel, and outcome assessors (i.e., was knowledge of the allocated intervention adequately prevented during the study?); (4) incomplete outcome data (i.e., were incomplete outcome data adequately addressed?); (5) selective outcome reporting (i.e., were reports of the study free of suggestion of selective outcome reporting?); and (6) other sources of bias (i.e., was the study apparently free of other problems that could put it at a high risk of bias?). Other sources of bias included baseline imbalances and appropriateness of crossover design. Each domain was rated as having “low,” “unclear,” or “high” risk of bias.

The overall assessment was based on the responses to individual domains.27 In accordance with the guidance from the Cochrane Handbook for Systematic Reviews of Interventions, if one or more of the individual domains had a high risk of bias, we rated the overall risk of bias as high. We rated the overall risk of bias as low only if all domains were assessed as having a low risk of bias. The overall risk of bias was unclear for all other situations.
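As an illustration only, the overall rating rule described above can be sketched in Python. The function name and the string labels are hypothetical and are not part of the review's tooling; the sketch simply encodes the three stated conditions.

```python
from typing import Iterable

ALLOWED_RATINGS = {"low", "unclear", "high"}


def overall_risk_of_bias(domain_ratings: Iterable[str]) -> str:
    """Combine per-domain ratings into one overall rating.

    Rule taken from the text: any "high" domain gives "high";
    all "low" domains give "low"; everything else is "unclear".
    """
    ratings = [r.strip().lower() for r in domain_ratings]
    if not ratings:
        raise ValueError("at least one domain rating is required")
    unknown = set(ratings) - ALLOWED_RATINGS
    if unknown:
        raise ValueError(f"unrecognized ratings: {sorted(unknown)}")
    if "high" in ratings:
        return "high"
    if all(r == "low" for r in ratings):
        return "low"
    return "unclear"


if __name__ == "__main__":
    # Six domains, one of them unclear, none high.
    print(overall_risk_of_bias(["low", "low", "unclear", "low", "low", "low"]))
```

Checking for "high" first means that a single high-risk domain determines the overall rating even when other domains are unclear.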

The Newcastle-Ottawa Scale, used to assess the quality of cohort studies, comprises eight items that evaluate three broad domains: (1) the selection of the study groups; (2) the comparability of the groups; and (3) the assessment of study outcomes. Each item that is adequately addressed is awarded one star, except for the “comparability of cohorts” item, for which a maximum of two stars can be given. The overall score is calculated by tallying the stars, for a maximum of 9. We considered a total score of 7 to 9 stars to indicate high quality, 4 to 6 stars to indicate moderate quality, and 3 or fewer stars to indicate poor quality.
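The star thresholds above translate directly into a small lookup. The following Python sketch is illustrative; the function name is hypothetical, and the 0 to 9 range follows from eight items with one item worth up to two stars.

```python
def nos_quality_category(total_stars: int) -> str:
    """Map a Newcastle-Ottawa Scale star total to the quality label
    used in the text: 7-9 high, 4-6 moderate, 0-3 poor."""
    if not 0 <= total_stars <= 9:
        raise ValueError("total_stars must be between 0 and 9")
    if total_stars >= 7:
        return "high"
    if total_stars >= 4:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    for stars in (3, 4, 6, 7, 9):
        print(stars, nos_quality_category(stars))
```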

Two reviewers independently performed quality assessment of the included studies and resolved disagreements through discussion and consensus or third-party adjudication, as needed.

Data Extraction

Two reviewers independently extracted published data using standardized data extraction forms in Microsoft Word and Excel (Microsoft Corporation, Redmond, WA; Appendix B). We resolved discrepancies through discussion and consensus or by third-party adjudication. We piloted the data extraction forms with three studies31–33 and resolved any identified issues.

We extracted data on the following: general study characteristics (e.g., study design, inclusion and exclusion criteria, length of followup); population characteristics (e.g., age and sex); interventions and dosing regimens; numbers of patients allocated to relevant treatment groups; outcomes measured, and the results of each outcome, including measures of variability by relevant intervention arm. We also recorded the funding source, if reported. When relevant data for multiple followup or observation periods were reported, we extracted only the longest followup data. When studies incorporated multiple relevant treatment arms, we extracted data from all groups. We noted the specific intervention, dosage, and intervals of each intervention to determine if arms were clinically appropriate for pooling.

When there were multiple reports of the same study, we referenced the primary or most relevant study and extracted only additional data from companion reports. We contacted corresponding authors for data clarification and missing data. We imported all data into Microsoft Excel (Microsoft Corporation, Redmond, WA) for data management.

For dichotomous data, we extracted the number of participants with events and the total number of participants. For continuous outcomes, we extracted the mean with the accompanying measure of variance for each treatment group. We analyzed continuous data as post-treatment score or absolute difference (or change score) from baseline.34 Because final scores and change scores can be combined in a meta-analysis, we did not calculate change scores but extracted them when the authors presented them. Because many studies used multiple scales and scoring systems to measure the outcomes, we extracted the scale and the type of analysis used in the study in addition to the summary data and measure of variance. For all outcomes, we used the definitions as reported by the authors of individual studies. For response rates, when multiple definitions were provided by authors, we chose the lower percentage reduction levels in order to standardize data extraction across all studies.

For AEs, we extracted the number of participants experiencing events and the total number of participants. We did not extract continuous measures (e.g., severity of AEs or plasma levels) because the primary concern was to define the comparative differences in AE incidence rather than severity. We counted each event as if it corresponded to a unique individual. Because an individual patient may have experienced more than one event during the course of the study, this assumption may have overestimated the number of patients that experienced an AE. We did not extract count data. We extracted all AEs reported in the primary publication and companion papers to allow comparison of the AE profiles of FGAs and SGAs.

When data were available only in a graphical format, we extracted data from the available graphs using the distance measurement tool in Adobe Acrobat 8 Professional (Adobe Systems Inc., San Jose, CA). When data were not available for the measure of variability for continuous outcomes, we calculated the variability from the computed p-value; if not available, we imputed the variability from other studies in the same analysis.
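The text does not state which formula was used to derive variability from a p-value. One common approach for a two-group comparison of means is to convert the two-sided p-value to a test statistic, divide the mean difference by it to obtain a standard error, and then rescale by the group sizes to obtain a common standard deviation. The Python sketch below follows that approach under two labeled assumptions: a normal approximation replaces the t distribution (reasonable only for larger samples), and both groups share one SD. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sd_from_p_value(mean_difference: float, p_value: float,
                    n1: int, n2: int) -> float:
    """Approximate the common within-group SD from a two-sided p-value.

    Assumption: the test statistic is treated as a z value (normal
    approximation), not a t value, so small samples are handled only
    approximately.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must be strictly between 0 and 1")
    if mean_difference == 0:
        raise ValueError("mean_difference must be nonzero")
    if n1 < 1 or n2 < 1:
        raise ValueError("group sizes must be positive")
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)  # two-sided p to z
    standard_error = abs(mean_difference) / z
    return standard_error / math.sqrt(1.0 / n1 + 1.0 / n2)


if __name__ == "__main__":
    # Hypothetical numbers: difference of 3 points, p = 0.05, 50 per arm.
    print(round(sd_from_p_value(3.0, 0.05, 50, 50), 2))
```

When only an inequality such as p<0.05 is reported, using the threshold itself gives a conservative (larger) SD.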

Data Analysis

We present evidence tables for all included studies and a qualitative description of results. We conducted meta-analyses to answer the KQs using Review Manager 5.01 (The Cochrane Collaboration, Copenhagen, Denmark).

We pooled binary data using the Mantel-Haenszel method and a random-effects model (DerSimonian and Laird).35 For continuous outcomes, we used the inverse variance method and a random-effects model (DerSimonian and Laird).35 We used Chi-square to test for significant heterogeneity reduction in partitioned subgroups; p<0.1 was considered to be significant. We generated forest plots for KQ1 when at least two trials provided evidence. For all other outcomes, we presented forest plots only if there were at least five included studies.
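For readers unfamiliar with the DerSimonian and Laird estimator, the following Python sketch shows the inverse variance random-effects calculation for a set of study effects (for example, mean differences) and their variances. It is a minimal illustration, not the Review Manager implementation, and it does not cover the Mantel-Haenszel weighting used for binary data. The function name and the returned dictionary keys are hypothetical.

```python
import math
from typing import Dict, Sequence


def dersimonian_laird(effects: Sequence[float],
                      variances: Sequence[float],
                      z: float = 1.96) -> Dict[str, float]:
    """Random-effects pooled estimate using the DerSimonian-Laird
    moment estimator of the between-study variance (tau squared)."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Fixed-effect (inverse variance) step, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Moment estimator of tau squared, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau squared to each study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)

    return {"pooled": pooled, "se": se, "tau2": tau2, "q": q,
            "ci_low": pooled - z * se, "ci_high": pooled + z * se}


if __name__ == "__main__":
    # Hypothetical mean differences and variances from three trials.
    print(dersimonian_laird([-2.0, -0.5, -1.2], [0.40, 0.25, 0.60]))
```

When the estimated between-study variance is zero, the random-effects result coincides with the fixed-effect inverse variance result.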

We combined RCTs and nRCTs in the meta-analyses. We synthesized cohort studies separately, as meta-analysis including both trials and cohort studies is controversial.36 For continuous summary estimates where the same measure of analysis was used, we calculated the mean difference (MD) with 95% confidence intervals (CI). We reported dichotomous summary estimates as relative risk with accompanying 95% CI.
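To make the dichotomous summary measure concrete, the sketch below computes a relative risk and its 95% CI for a single two-arm comparison, using the standard large-sample standard error of the log relative risk. It is an illustration with hypothetical names and numbers; it handles one study only and is not the pooled Mantel-Haenszel estimate.

```python
import math
from typing import Tuple


def relative_risk(events_a: int, total_a: int,
                  events_b: int, total_b: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Return (relative risk, CI lower bound, CI upper bound) for arm A
    versus arm B, with the interval computed on the log scale."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("group totals must be positive")
    if not (0 < events_a <= total_a and 0 < events_b <= total_b):
        # With zero events in an arm the log relative risk is undefined.
        raise ValueError("each arm needs at least one event")
    rr = (events_a / total_a) / (events_b / total_b)
    se_log_rr = math.sqrt(1.0 / events_a - 1.0 / total_a
                          + 1.0 / events_b - 1.0 / total_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


if __name__ == "__main__":
    # Hypothetical trial: 10 of 100 events versus 20 of 100 events.
    rr, lower, upper = relative_risk(10, 100, 20, 100)
    print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```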

For KQ3, data are not presented separately for schizophrenia and related psychoses and bipolar disorder because AEs associated with an antipsychotic are likely to be consistent regardless of the indication for which a drug is being administered.

We tested for heterogeneity using an I-squared (I2) statistic and accompanying 95% uncertainty intervals.37 Heterogeneity could not be estimated when only one study provided evidence for an outcome. We did not calculate uncertainty intervals around the I2 statistic when fewer than three studies were pooled. If the lower uncertainty boundary for the I2 had a value of 75 percent or greater, we considered this to represent substantial heterogeneity, thereby precluding pooling of studies. When there was substantial statistical heterogeneity in a meta-analysis, we explored heterogeneity through subgroup analyses, sensitivity analyses, and removal of outliers. The I2 statistic was interpreted based on the guidance in the Cochrane Handbook for Systematic Reviews of Interventions.27
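The text does not specify how the uncertainty interval around I2 was computed. One published option is the test-based interval of Higgins and Thompson, which works on the log of H (the square root of Q divided by its degrees of freedom) and needs at least three studies. The Python sketch below implements that option as an assumption and then applies the 75 percent lower-bound rule stated above. The function name and return keys are hypothetical.

```python
import math
from typing import Dict, Union


def i_squared_with_interval(q: float, k: int,
                            z: float = 1.96) -> Dict[str, Union[float, bool]]:
    """I-squared (percent) with a test-based uncertainty interval.

    q is Cochran's Q and k the number of pooled studies. Assumes the
    Higgins-Thompson standard error for ln(H); the source may have
    used a different method.
    """
    if k < 3:
        raise ValueError("interval is not defined for fewer than three studies")
    if q < 0:
        raise ValueError("q must be non-negative")
    df = k - 1
    h = max(1.0, math.sqrt(q / df))  # H is truncated at 1
    if q > k:
        se_ln_h = 0.5 * (math.log(q) - math.log(df)) / (
            math.sqrt(2.0 * q) - math.sqrt(2.0 * k - 3.0))
    else:
        se_ln_h = math.sqrt(
            (1.0 / (2.0 * (k - 2))) * (1.0 - 1.0 / (3.0 * (k - 2) ** 2)))
    h_low = max(1.0, math.exp(math.log(h) - z * se_ln_h))
    h_high = math.exp(math.log(h) + z * se_ln_h)

    def to_i2(h_value: float) -> float:
        return 100.0 * (h_value ** 2 - 1.0) / h_value ** 2

    lower = to_i2(h_low)
    return {"i2": to_i2(h), "lower": lower, "upper": to_i2(h_high),
            "substantial": lower >= 75.0}


if __name__ == "__main__":
    # Hypothetical meta-analysis: Q = 40 across 5 studies.
    print(i_squared_with_interval(40.0, 5))
```

Basing the decision on the lower bound rather than the point estimate is deliberately conservative: a high I2 from a handful of studies often has an interval that extends well below 75 percent.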

Variables that we considered important to explain heterogeneity included specific intervention details (e.g., type and quantity), study design, funding source, and risk of bias. In addition, we conducted sensitivity analyses on studies with imputed data to determine if the imputations had any effect on the effect estimate (i.e., the measure used to estimate the differences in effect of an intervention against a comparator) or heterogeneity. A priori subgroup analyses included disorder subtypes, sex, age group (18–35 years, 36–54 years, and 55–64 years), race, comorbidities, drug dosage, followup period, previous exposure to antipsychotics, treatment of a first episode versus treatment in the context of prior episodes, and treatment resistance.

When appropriate, we combined data across the available dosing arms before conducting the meta-analysis. We combined dichotomous arms by simple addition and combined continuous arms by calculating the pooled mean and standard deviation.
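A worked sketch may help here. The text does not give the formula for the pooled standard deviation; the Python code below assumes the usual formula for merging two groups into one, which accounts for both the within-arm spread and the gap between the arm means. Dichotomous arms are merged by adding events and totals. Function names are hypothetical.

```python
import math
from typing import Tuple


def combine_continuous_arms(n1: int, mean1: float, sd1: float,
                            n2: int, mean2: float, sd2: float
                            ) -> Tuple[int, float, float]:
    """Merge two arms into one group; returns (n, mean, sd)."""
    if n1 < 1 or n2 < 1:
        raise ValueError("each arm needs at least one participant")
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Within-arm sums of squares plus the between-arm component.
    between = (n1 * n2 / n) * (mean1 - mean2) ** 2
    variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2 + between) / (n - 1)
    return n, mean, math.sqrt(variance)


def combine_dichotomous_arms(events1: int, total1: int,
                             events2: int, total2: int) -> Tuple[int, int]:
    """Merge two arms by simple addition; returns (events, total)."""
    return events1 + events2, total1 + total2


if __name__ == "__main__":
    # Hypothetical low-dose and high-dose arms of the same drug.
    print(combine_continuous_arms(10, 0.0, 1.0, 10, 2.0, 1.0))
    print(combine_dichotomous_arms(4, 50, 7, 48))
```

The combined SD is larger than either arm's SD whenever the arm means differ, which is why averaging the two SDs would understate the variability.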

We did not include dichotomous data with zero values (i.e., no participant experienced an event) in meta-analyses because summary trial results were not estimable. However, we reported the results from these studies in the narrative synthesis for the relevant intervention.

We explored potential publication bias graphically through funnel plots for comparisons with at least 10 studies. Additionally, we quantitatively assessed publication bias using the Begg adjusted rank correlation test and Egger regression asymmetry test.38

When pooled estimates were available, we considered clinical significance to be at least a 20 percent improvement between interventions on an individual scale.

Grading the Strength of a Body of Evidence

We evaluated the overall strength of evidence (SoE) for key outcomes identified a priori by the clinical experts (i.e., core illness symptoms in the categories of positive symptoms, negative symptoms, general psychopathology, global ratings and total scores, and clinically important serious AEs: diabetes mellitus, mortality, tardive dyskinesia, and major metabolic syndrome). We used the EPC GRADE39 approach, which is based on the standard GRADE approach developed by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group.40 We assessed the SoE for the key core symptom scales and AEs (Table 3) by examining four major domains: risk of bias (low, medium, or high), consistency (inconsistency not present, inconsistency present, unknown, or not applicable), directness (direct or indirect), and precision (precise or imprecise).

Table 3. Outcomes assessed by GRADE.

For each key outcome for each comparison of interest, we assigned an overall evidence grade based on the ratings for the individual domains. We graded the overall SoE as “high” (i.e., high confidence that the evidence reflects the true effect, and further research is very unlikely to change our confidence in the estimate of effect); “moderate” (i.e., moderate confidence that the evidence reflects the true effect, and further research may change our confidence in the estimate of effect and may change the estimate); “low” (i.e., low confidence that the evidence reflects the true effect, and further research is likely to change our confidence in the estimate of effect and is likely to change the estimate); or “insufficient” (i.e., evidence is either unavailable or does not permit estimation of an effect). When no studies were available for an outcome or comparison of interest, we graded the evidence as insufficient. We used the GRADEprofiler software (GRADE Working Group) and modified the results in accordance with the EPC GRADE. Two reviewers independently graded the body of evidence and resolved disagreements through discussion.


Applicability of evidence distinguishes between effectiveness studies, conducted in primary care or office-based settings that use less stringent eligibility criteria, assess health outcomes, and have longer followup periods, and efficacy studies.41 The results of effectiveness studies are more applicable to the spectrum of patients in the community than efficacy studies, which usually involve highly selected populations. We assessed the applicability of the body of evidence following the PICOTS (population, intervention, comparator, outcomes, timing of outcome measurement, and setting) format used to assess study characteristics. Specific characteristics we examined included those related to patients (e.g., age, diagnostic criteria, severity of illness, comorbidities, concomitant medications, inpatient or outpatient status) and those related to study design (e.g., length of followup). We reported clinically important outcomes and participant characteristics in the results.

