NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.
Butler M, Urosevic S, Desai P, et al. Treatment for Bipolar Disorder in Adults: A Systematic Review [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2018 Aug. (Comparative Effectiveness Review, No. 208.)
The methods for this Comparative Effectiveness Review (CER) follow the methods suggested in the Agency for Healthcare Research and Quality (AHRQ) Methods Guide for Effectiveness and Comparative Effectiveness Reviews (available at http://www.effectivehealthcare.ahrq.gov/methodsguide.cfm); certain methods map to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist.10 This section summarizes the methods used.
Topic Refinement and Review Protocol
This report topic and preliminary Key Questions arose through a public process. Initially, a panel of key informants, including psychiatrists, psychologists, researchers, consumer advocates, and consumers, gave input on the Key Questions and the population, interventions, comparators, outcomes, and timing (PICOT) to be examined. Key Questions, PICOT, and the analytic framework were posted for public comment from December 19, 2013 to January 10, 2014. In response to comments provided, we made several changes. We then drafted a protocol for the CER and recruited a technical expert panel to provide high-level content and methodological feedback on the review protocol. The protocol was posted June 23, 2014 at https://effectivehealthcare.ahrq.gov/search-for-guides-reviews-and-reports/?pageaction=displayproduct&productid=1926.
Literature Search Strategy
We searched Ovid Medline, Ovid PsycInfo, Ovid Embase, and the Cochrane Central Register of Controlled Trials (CENTRAL) to identify previous randomized controlled trials and prospective cohort studies published and indexed in bibliographic databases. Our search strategy, which appears in Appendix A, included relevant medical subject headings and natural language terms for the concept of bipolar disorder. This concept was combined with filters to select randomized controlled trials (RCTs), observational studies, and systematic reviews. Dates for the search algorithm were 1994 to May 2017. We anticipated that older, established treatments would be covered by prior reviews, and we supplemented our searches with backward citation searches of relevant systematic reviews.
We conducted additional grey literature searching to identify relevant completed and ongoing studies. Relevant grey literature resources include trial registries and Food and Drug Administration databases. We searched ClinicalTrials.gov and the International Controlled Trials Registry Platform (ICTRP) for ongoing studies. We also reviewed Scientific Information Packets (SIPs) sent by manufacturers of relevant interventions. Grey literature search results were used to identify studies, outcomes, and analyses not reported in the published literature to assess publication and reporting bias and inform future research needs.
Studies were included in the review based on the PICOTS framework outlined in Table 2 and the study-specific inclusion criteria described in Table 3.
Table 3
Study inclusion criteria.
We reviewed bibliographic database search results for studies relevant to our PICOTS framework and study-specific criteria. All studies identified as relevant at the title and abstract stage by either of two independent investigators underwent full-text screening. Two investigators independently performed full-text screening to determine whether inclusion criteria were met. Differences in screening decisions were resolved by consultation between the investigators and, if necessary, consultation with a third investigator.
Risk of Bias Assessment of Individual Studies
Risk of bias of eligible studies was assessed by two independent investigators using instruments specific to each study design. For RCTs, questionnaires developed from the Cochrane Risk of Bias tool11 were used. We developed an instrument for assessing risk of bias in observational studies based on the RTI Observational Studies Risk of Bias and Precision Item Bank12 (Appendix B). We selected the items most relevant to assessing risk of bias for this topic, including participant selection, attrition/incomplete outcome data, ascertainment of group assignment, and appropriateness of analytic methods. Study power was assessed under 'other sources of bias' in studies with data that were not eligible for pooling. For psychosocial interventions, treatment fidelity, that is, treatment definition and implementation, was also evaluated. Overall summary risk of bias assessments for each study were classified as low, moderate, or high based upon the collective risk of bias inherent in each domain and confidence that the results were believable given the study's limitations. When the two investigators disagreed, a third party was consulted to reconcile the summary judgment.
Data Extraction
For studies meeting inclusion criteria, one investigator abstracted relevant data into extraction forms created in Excel. Evidence tables were reviewed and verified for accuracy by a second investigator. Data fields included author; year of publication; setting; subject inclusion and exclusion criteria; intervention and control characteristics (intervention components, timing, frequency, duration); followup duration; participant baseline demographics, comorbidities, method of diagnosis, enrollment, and severity; descriptions and results of primary outcomes; adverse effects; study withdrawals; and study funding source.
For outcomes, only overall scale scores were reported for all measurement scales; subscales or individual items from scales were not abstracted. Abstracted outcomes included:
- Responders and/or remitters (for acute states), and number of and/or time to relapse (for maintenance), including definitions used in the studies,
- Symptom scales; only one scale per state per study, following a “most reported” hierarchy,
- Global functioning (including social performance and quality of life for psychosocial studies),
- Utilization, such as emergency department use,
- Change in self-harm behaviors, including suicidality,
- Withdrawals; overall, due to lack of effect, and due to side effects,
- Serious adverse events; rates of extrapyramidal symptoms, switching, and weight gain of > 7 percent.
Adverse events were treatment emergent, not treatment-related events. Harms were chosen based on an informal prioritization process with the help of the Technical Expert Panel (TEP). We focused on patient-centered harms and not on those that were already well-established.
For maintenance studies reporting time to relapse as the primary outcome but with greater than 50 percent attrition, only summary measures of time to relapse, overall withdrawal, withdrawal due to adverse events, and adverse events were abstracted. We did not abstract symptom scales due to the loss of participants over time. Time to relapse for any mood episode was primary unless the study was designed for a specific episode type; for example, time to next depressive episode was the primary outcome for bipolar II patients stabilized from depression.
As a courtesy to readers, we also abstracted limited information on studies excluded for greater than 50 percent attrition: study design, enrollment, intervention, and comparison (available in Appendix D).
Data Synthesis
We summarized the results into evidence tables and synthesized evidence for each unique population, comparison, and outcome combination. We emphasized patient-centered outcomes in the evidence synthesis. Results are organized by bipolar type and state (such as acute mania, acute depression, or euthymia). Where available, results by population subgroups were also provided. We used statistical differences between groups to assess effects. For outcomes with well-established minimum important differences (MIDs), we used the MID to aid interpretation. Appendix C provides a list of outcomes used in the available literature, with associated MIDs where available.
Decisions for pooling were based on the homogeneity of study populations using inclusion criteria, specific interventions, and the ability to treat outcome measures as similar. When pooling was possible, we conducted meta-analyses using the random-effects modeling approach. Continuous outcomes were summarized with precision-weighted mean differences (WMD) and/or standardized mean differences (SMD) and 95 percent confidence intervals (CIs). In our context, these were generally difference-in-differences estimates from each study. If a study did not report a standard error for the difference-in-differences estimate, we calculated it from a P-value or CI and the appropriate degrees of freedom. If neither a CI nor an exact P-value was given but an upper bound for the P-value was (e.g., < 0.05), we used that to calculate an upper bound of the standard error. If the degrees of freedom of the relevant t-distribution were not given, we attempted to back them out of the study based upon the statistical methods used, as long as we could confidently conclude that they were greater than 25. Binary outcomes were summarized with precision-weighted log odds ratios (OR) and 95 percent CIs.
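As an illustration, the back-calculation of a standard error from a reported CI or exact P-value can be sketched as follows. This is a hypothetical helper, not the review's actual code; it assumes a two-sided t-test with known degrees of freedom:

```python
from scipy import stats

def se_from_ci(lower, upper, df, level=0.95):
    """Back-calculate a standard error from a reported two-sided CI."""
    t_crit = stats.t.ppf(1.0 - (1.0 - level) / 2.0, df)
    return (upper - lower) / (2.0 * t_crit)

def se_from_p(estimate, p_value, df):
    """Back-calculate a standard error from an exact two-sided P-value."""
    t_stat = stats.t.ppf(1.0 - p_value / 2.0, df)
    return abs(estimate) / t_stat

# A reported bound such as "P < 0.05" yields only an upper bound on the
# standard error: plugging in p_value=0.05 understates the t-statistic,
# so the returned SE is at least as large as the true one.
```

A smaller exact P-value implies a larger t-statistic and hence a smaller standard error, which is why a P-value bound translates into an SE upper bound.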
We used the restricted maximum likelihood estimator (REML) of the heterogeneity variance because, although simulation studies have shown it to suffer from negative bias13, it has generally performed comparatively well with regards to mean-square error14. We also used the Knapp-Hartung adjustment in order to avoid the potentially high inflation of the type-I error rate that can arise when dealing with small numbers of even moderately heterogeneous studies.15, 16 We chose not to perform meta-analyses when only two studies were available to pool as, in this context, application of the Knapp-Hartung adjustment can diminish power to trivial levels and standard approaches can easily suffer from extreme inflation of type-I error.17
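The REML and Knapp-Hartung steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions (study-level estimates `y` with within-study variances `v`); the review itself used R's metafor package, not this code:

```python
import math
from scipy import stats

def reml_tau2(y, v, tol=1e-10, max_iter=200):
    """REML estimate of the between-study variance via fixed-point iteration."""
    tau2 = 0.0
    for _ in range(max_iter):
        w = [1.0 / (vi + tau2) for vi in v]
        sw = sum(w)
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sw
        num = sum(wi ** 2 * ((yi - mu) ** 2 - vi) for wi, yi, vi in zip(w, y, v))
        new = max(0.0, num / sum(wi ** 2 for wi in w) + 1.0 / sw)
        if abs(new - tau2) < tol:
            return new
        tau2 = new
    return tau2

def random_effects_hk(y, v, level=0.95):
    """Random-effects pooled estimate with Knapp-Hartung SE and CI."""
    k = len(y)
    tau2 = reml_tau2(y, v)
    w = [1.0 / (vi + tau2) for vi in v]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Knapp-Hartung: rescale the variance by the weighted residual mean
    # square and use a t-distribution with k - 1 degrees of freedom.
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y)) / (k - 1)
    se = math.sqrt(q / sw)
    t_crit = stats.t.ppf(1.0 - (1.0 - level) / 2.0, k - 1)
    return mu, tau2, se, (mu - t_crit * se, mu + t_crit * se)
```

The t-distribution with k − 1 degrees of freedom is what guards the type-I error rate when few studies are pooled; with very small k its critical values grow large, which is consistent with the decision above not to pool only two studies.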
As a sensitivity analysis, we also performed all meta-analyses using fixed-effect models. These results are charitably interpreted as providing an estimate of the true average effect among completed trials and are presented alongside the results derived from random-effects models.18 However, we base our main conclusions on the random-effects results. All analyses were performed with R software19, using the metafor package.18
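For comparison, the fixed-effect model weights each study by the inverse of its within-study variance alone, with no between-study variance term. A minimal sketch (illustrative only, not the review's code):

```python
import math

def fixed_effect(y, v):
    """Inverse-variance fixed-effect pooled estimate and standard error."""
    w = [1.0 / vi for vi in v]   # weight = 1 / within-study variance
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sw
    return mu, math.sqrt(1.0 / sw)
```

Because the weights omit the between-study variance, more precise studies dominate the pooled estimate, and the confidence interval is typically narrower than the corresponding random-effects interval.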
We assessed the clinical and methodological heterogeneity to determine appropriateness of pooling data.20 When pooling was not appropriate due to lack of comparable studies or heterogeneity, qualitative synthesis was conducted.
Studies were grouped by treatment, bipolar type and/or bipolar state. Phases were grouped as: (1) acute mania or hypomania, including mixed, (2) acute depression, (3) any acute state (often for psychosocial maintenance studies), (4) euthymic or subsyndromal (generally for maintenance studies), and (5) nonspecific, that is, either euthymic, acute in any episode, or post-hospitalization (these studies stated essentially any patient with bipolar disorder except acute mania). For drug studies treating patients for residual symptoms, patients were classified as nonresponders to standard treatment (usually noted in adjunctive drug studies). Studies were categorized as maintenance studies if the study inclusion criteria did not specify an acute episode at study entry.
Study outcomes were grouped by treatment duration or followup period. For acute mania treatment, outcomes were grouped at 3-4 weeks and then at the final measurement (generally 6 to 12 weeks) if available. Depression treatment studies are reported at 3 months and at the final endpoint. Maintenance study outcomes are reported at 6 months, 8-12 months, and “prolonged followup” at the final endpoint.
Comparators for psychosocial studies were grouped as inactive (usual care or standardized care) or active (head-to-head comparisons of psychosocial therapies, including supportive therapy).
We conducted several sensitivity analyses where possible. In forest plots, outcomes in studies assessed as having a high risk of bias, or low to moderate risk of bias but at least 40 percent attrition, were presented in grey scale.
Strength of Evidence for Major Comparisons and Outcomes
The overall strength of evidence for primary outcomes within each comparison was evaluated based on four required domains: (1) study limitations (risk of bias); (2) directness (a single, direct link between intervention and outcome); (3) consistency (similarity of effect direction and size); and (4) precision (degree of certainty around an estimate).21 A fifth domain, reporting bias, was assessed when the strength of evidence based upon the first four domains was moderate or high.21 Based on study design and conduct, study limitations were rated as low, moderate, or high. Consistency was rated as consistent, inconsistent, or unknown/not applicable (e.g., single study). Directness was rated as either direct or indirect. Precision was rated as precise or imprecise. Assessing strength of evidence is especially challenging for null findings (i.e., intervention and comparison yielded results that were not statistically different from each other) because several domains are designed to address differences: it is hard to judge effect size when a superiority trial finds no effect, and it is difficult to establish a level of precision that provides confidence of no effect. This is especially true when populations, interventions, and comparators are not consistent, as is the case with much of the nondrug literature. We also downgraded precision when considerable attrition was addressed through last-observation-carried-forward methods. Given the large number of comparisons with findings of no effect, we assessed strength of evidence and formulated results cautiously. Based on these factors, the overall evidence for each outcome was rated as:21
High: Very confident that estimate of effect lies close to true effect. Few or no deficiencies in body of evidence, findings believed to be stable.
Moderate: Moderately confident that estimate of effect lies close to true effect. Some deficiencies in body of evidence; findings likely to be stable, but some doubt.
Low: Limited confidence that estimate of effect lies close to true effect; major or numerous deficiencies in body of evidence. Additional evidence necessary before concluding that findings are stable or that estimate of effect is close to true effect.
Insufficient: No evidence, unable to estimate an effect, or no confidence in estimate of effect. No evidence is available or the body of evidence precludes judgment.
We assessed strength of evidence for validated scales (such as the Beck Depression Inventory, Young Mania Rating Scale, Hamilton Depression Rating Scale, Clinical Global Improvement Scale) and commonly used items that examine improved function (such as the Functional Assessment Short Test). We did not assess strength of evidence for less commonly measured items such as increased time between episodes or hospitalizations. Attempted suicide and other self-harming behaviors were also not assessed for strength of evidence due to the difficulty of defining and measuring such behaviors.
Applicability
Applicability of studies was determined according to the PICOTS framework. Bipolar research generally draws from highly defined populations, resulting in samples that are often drawn from subpopulations rather than the bipolar population at large. Thus, the ability to infer generalizability can be compromised. Applicability also deals with the transportability of evidence for the type of treatment—level of treatment, treatment fidelity, skills of treatment agent, setting (and measurement)—and its fit to a particular treatment setting. Study characteristics that may affect applicability include, but are not limited to, the population from which the study participants are enrolled, diagnostic assessment processes, narrow eligibility criteria, and patient and intervention characteristics different from those described by population studies of bipolar disorder.22 These applicability issues are present in the synthesis frameworks and sensitivity analyses described in more detail in the data synthesis section.
Peer Review and Public Commentary
Experts in bipolar disorder and systematic reviews were invited to provide external peer review of this systematic review; AHRQ and an associate editor also provided comments. The draft report was posted on the AHRQ website for 4 weeks to elicit public comment. We addressed all reviewer comments, revised the text as appropriate, and documented all responses in a disposition of comments report made available within 3 months of the Agency posting the final systematic review on the Effective Health Care website.