This publication is provided for historical reference only and the information may be out of date.
Methods described below were suggested in the Agency for Healthcare Research and Quality (AHRQ) “Methods Guide for Effectiveness and Comparative Effectiveness Reviews.”48 The structure of this Methods chapter is aligned with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist.49 Unless otherwise specified, all methods and analyses were determined a priori.
Topic Refinement and Review Protocol
For all Evidence-based Practice Center (EPC) reviews, Key Questions (KQs) were reviewed and refined as needed by the EPC with input from Key Informants and the Technical Expert Panel (TEP) to ensure that the questions were specific and explicit about what information was being reviewed. In addition, for the comparative effectiveness review, the KQs were posted for public comment and finalized by the EPC after review of the comments.
Key Informants are the end-users of research, including patients and caregivers, practicing clinicians, relevant professional and consumer organizations, purchasers of health care, and others with experience in making health care decisions. Within the EPC program, the Key Informant role is to provide input into identifying the KQs for research that will inform health care decisions. The EPC solicits input from Key Informants when developing questions for systematic review or when identifying high-priority research gaps and needed new research. Key Informants are not involved in analyzing the evidence or writing the report and have not reviewed the report, except as given the opportunity to do so through the peer or public review mechanism.
Key Informants must disclose any financial conflicts of interest greater than $10,000 and any other relevant business or professional conflicts of interest. Because of their role as end-users, individuals are invited to serve as Key Informants and those who present with potential conflicts may be retained. The Task Order Officer (TOO) and the EPC work to balance, manage, or mitigate any potential conflicts of interest identified.
Technical Experts comprise a multidisciplinary group of clinical, content, and methodological experts who provide input in defining populations, interventions, comparisons, or outcomes as well as identifying particular studies or databases to search. They are selected to provide broad expertise and perspectives specific to the topic under development. Divergent and conflicting opinions are common and perceived as producing healthy scientific discourse that results in a thoughtful, relevant systematic review. Therefore, study questions, design and/or methodological approaches do not necessarily represent the views of individual technical and content experts. Technical Experts provide information to the EPC to identify literature search strategies and recommend approaches to specific issues as requested by the EPC. Technical Experts do not conduct analysis of any kind or contribute to the writing of the report; they do not review the report, except as given the opportunity to do so through the public review mechanism. In addition to methodologists, the Technical Experts represented the diversity of practitioners whose care is sought for the treatment of seasonal allergies. They included allergists, family practitioners, pharmacists, and otolaryngologists.
Technical Experts must disclose any financial conflicts of interest greater than $10,000 and any other relevant business or professional conflicts of interest. Because of their unique clinical or content expertise, individuals are invited to serve as Technical Experts and those who present with potential conflicts may be retained. The TOO and the EPC work to balance, manage, or mitigate any potential conflicts of interest identified.
Literature Search Strategy
To identify relevant studies for the four KQs, literature search strategies were developed by an expert librarian in collaboration with the project team and were peer reviewed by a second librarian. The searches were developed on MEDLINE® (PubMed®) and adapted for the other databases. Methodological search filters were added to the disease and intervention terms to identify randomized controlled trials (RCTs), quasi-randomized trials, observational studies, and systematic reviews. The databases searched for primary studies were MEDLINE® (PubMed® and Ovid®), Embase® (Ovid®), and the Cochrane Central Register of Controlled Trials (CENTRAL). For systematic reviews, the databases searched were the Cochrane Database of Systematic Reviews, Database of Abstracts and Reviews of Effects (DARE), and the Health Technology Assessment (HTA) databases of the Centre for Reviews and Dissemination (all through the Wiley InterScience platform). Articles were limited to those published in the English language; Technical Experts advised that the majority of the literature on this topic is published in English. Although the search was not limited by date, only systematic reviews published after 2010 were considered for potential incorporation of results into this review. Full details of the search strategies are given in Appendix A. All databases were searched on July 18, 2012.
Grey literature was sought by searching the United States (U.S.) Food and Drug Administration (FDA) Web site; electronic conference abstracts of relevant professional organizations via Scopus; and the Web sites of two professional societies: The American Academy of Allergy, Asthma & Immunology (AAAAI) and the British Society for Allergy and Clinical Immunology (BSACI). In addition, the following Web sites were searched: the clinical trial registries of the U.S. National Institutes of Health (NIH) (ClinicalTrials.gov and NIH Reporter) and the World Health Organization (WHO); AHRQ Effective Health Care Program and AHRQ Home Page; and Current Controlled Trials. Scientific Information Packets provided by product manufacturers were evaluated to identify unpublished trials that met inclusion criteria. The grey literature searching was carried out between April 5 and September 26, 2012. Details of the Web sites and dates accessed are given in Appendix A.
We scanned the bibliographies of relevant systematic reviews and meta-analyses and of the final list of included studies to identify any additional studies not retrieved by the electronic database or grey literature searches.
Inclusion and Exclusion Criteria
Key Question 1. Comparative Effectiveness of Treatments in Adults 12 Years of Age or Older
The focus of this KQ is the comparison of effectiveness of six pharmacologic classes of treatments for seasonal allergic rhinitis (SAR) and nasal saline. Drug classes, routes of administration, and specific drugs within each class are shown in Table 1. Only drugs approved by the FDA for the treatment of SAR were included. Antihistamines were classified into nonselective and selective subclasses based on their specificity for peripheral H1 histamine receptors.
Within a pharmacologic class, previous comparative effectiveness reviews (CERs) did not find sufficient evidence to support superior effectiveness of any single drug.3, 28, 38, 41-47 Thus, the focus of the review was across-class treatment comparisons, except when multiple routes of administration were available for a single drug class (e.g., intranasal versus oral selective antihistamines, intranasal versus oral sympathomimetic decongestants).
We sought expert guidance to identify drug class comparisons most relevant for treatment decisionmaking. The checked boxes in Table 2 indicate the treatment comparisons identified. Reasons most often cited for not including a specific comparison were differential efficacy for specific SAR symptoms (e.g., intranasal anticholinergic [ipratropium] treats rhinorrhea versus intranasal sympathomimetic decongestant treats nasal congestion) and noncomparable indications (e.g., nasal antihistamine for long-term use versus intranasal sympathomimetic decongestant for short-term use).
We sought trials comprising the highest level of evidence for treatment effectiveness and applied the following inclusion and exclusion criteria:
- Head-to-head RCTs were preferred; the risk of bias in uncontrolled and noncomparative studies is magnified due to the subjective reporting of both efficacy outcomes and adverse events in SAR research.
- Trials of less than 2 weeks' duration were excluded; this is the minimum treatment duration recommended in draft FDA guidance for industry.50
- Patients had to be symptomatic at the time of the intervention.
- Trials that involved exposure chambers or allergen challenge interventions were excluded.
- Only FDA-approved drugs administered at FDA-approved doses for SAR treatment were considered.
- To be most inclusive, a minimum number of trial participants was not required.
For comparisons that did not have data from RCTs, nonrandomized trials and observational study designs were considered. Inclusion criteria for these studies were:
- Any of the following designs:
- Quasi-RCTs (crossover trials, before/after trials, open-label extensions, etc.)
- Controlled (nonrandomized) clinical trials
- Population-based comparative cohort studies
- Case-control studies
- Each study must have compared two drug classes directly.
- Control of confounders, such as baseline comorbidities, baseline symptom severity, and pollen counts, was necessary.
- Detection bias was addressed through blinding of outcome assessors or clinicians to drug exposure.
For all studies, disease was limited to SAR. Studies that reported both SAR and perennial allergic rhinitis (PAR) were included if SAR outcomes were reported separately. Outcomes had to include patient-reported symptom scores and/or validated quality of life instruments; for comorbid asthma symptoms, pulmonary function tests also were required. Definitions of symptom severity were adapted from the Allergic Rhinitis and its Impact on Asthma (ARIA) guidelines.6 The ARIA definition of mild SAR excluded individuals with sleep disturbance, impairment of daily or leisure activities, impairment of school or work, or troublesome symptoms. Moderate/severe SAR is characterized by one or more of these disturbances. The following symptom rating scale is commonly used in SAR clinical trials.50
- 0 = Absent symptoms (no sign/symptom evident)
- 1 = Mild symptoms (sign/symptom clearly present, but minimal awareness; easily tolerated)
- 2 = Moderate symptoms (definite awareness of sign/symptom that is bothersome but tolerable)
- 3 = Severe symptoms (sign/symptom that is hard to tolerate; causes interference with activities of daily living and/or sleeping)
We examined results of existing systematic reviews and meta-analyses published after 2010 for potential incorporation into the report when they assessed relevant treatment comparisons, reported at least one outcome of interest, and were of high quality. Quality was assessed by two independent reviewers with criteria derived from the AHRQ “Methods Guide for Effectiveness and Comparative Effectiveness Reviews” and the Assessment of Multiple Systematic Reviews (AMSTAR) tool.51 Narrative reviews were excluded, but their bibliographies were searched if they were thought to have relevant references. In addition, reference lists of RCTs, systematic reviews, and other reviews were hand searched to confirm that all relevant RCTs had been identified. These selection criteria are summarized in Table 3. References obtained through grey literature searching were excluded if the study was not published in a peer-reviewed journal or if the full-text of the study could not be obtained.
Key Question 2. Comparative Adverse Effects of Treatments in Adults 12 Years of Age or Older
Comparative adverse effects reported in the RCTs, systematic reviews, meta-analyses, and observational studies identified for KQ1 were included. Additionally, systematic reviews and meta-analyses that specifically assessed adverse events associated with treatment comparisons of interest were sought. Table 4 lists systemic and local adverse effects of interest for making treatment decisions. Of particular interest were adverse effects associated with long-term treatment exposures in locations where allergen seasons are of longer duration (e.g., certain parts of the U.S.). For these adverse effects, comparative clinical trials of at least 300 patients evaluated for 6 months or 100 patients evaluated for at least 1 year were sought, according to FDA draft guidance for industry.50
Key Question 3. Comparative Effectiveness and Adverse Effects of Treatments in Pregnant Women
Treatment comparisons of interest included Pregnancy Category B oral and topical (intranasal) preparations and nasal saline, which is considered safe for use in pregnancy. These are presented in Table 5. Adverse effects of interest were the same as those listed for KQ2. Adverse fetal effects associated with SAR treatments in pregnant women were not specifically identified as a target adverse event because we restricted the drugs of interest to Pregnancy Category B only. Thus, we expected reporting of common treatment-related adverse events and adverse events associated with the physiologic changes of pregnancy, rather than teratogenic effects.
Oral sympathomimetic decongestants are Pregnancy Category C and were not included in this KQ.
Because pregnancy is commonly an exclusion criterion for participation in pharmaceutical RCTs, additional study designs in pregnant women with SAR (i.e., observational data, systematic reviews, and meta-analyses) were considered for KQ3. The inclusion criteria for these study designs were the same as for KQ1.
Key Question 4. Comparative Effectiveness and Adverse Effects of Treatments in Children Younger Than 12 Years of Age
The population of interest was children younger than 12 years of age who have SAR. Identified treatment comparisons of interest for KQ4 are presented in Table 6. Because of concerns about the use of sympathomimetic decongestants in children, comparisons of oral and nasal preparations as monotherapy were not included. Similarly, intranasal anticholinergic (ipratropium) was not included because Technical Experts indicated that this drug is rarely used in children younger than 12 years of age. Potential comparative harms of intranasal corticosteroids in this population (reduced bone growth and height) were of particular interest. Comparative effect on school performance in school-age children was an additional key outcome.
Selection criteria are the same as in KQ1, that is, RCTs were the preferred study type. For comparisons of interest that did not have RCT data, observational study designs were considered. Inclusion criteria for RCTs, observational studies, systematic reviews, and meta-analyses were those outlined in Table 3, with the exception that the study population was younger than 12 years old. For comparisons with sparse bodies of evidence, we considered inclusion of studies that mixed results for adults and children together.
Study Selection
Figure 2 shows the flow of data from article screening to data synthesis. Search results were transferred to EndNote®52 and subsequently into DistillerSR53 for selection. Using the study selection criteria for screening titles and abstracts, each citation was marked as: (1) eligible for review as a full-text article; (2) ineligible for full-text review; or (3) uncertain. A training set of 25 to 50 abstracts was initially examined by all team members to ensure uniform application of screening criteria. A first-level title screen was performed by one senior and one junior team member. Discrepancies were resolved through discussion and consensus. A second-level abstract screen was conducted in duplicate by senior and junior team members according to defined criteria. When abstracts were not available, the full-text papers were obtained wherever possible and reviewed in the same way to determine whether selection criteria had been satisfied. For additional citations identified through subsequent literature searches, combined title and abstract screening was performed by senior and junior team members as described. Inclusion and exclusion were decided by consensus.
Full-text articles were reviewed in the same fashion to determine their inclusion in the systematic review. Records of the reason for exclusion for each paper retrieved in full text but excluded from the review were kept in the DistillerSR database. A list of excluded studies is available in Appendix B.
The complete set of data to be extracted was developed during the abstraction phase, beginning from anticipated elements. The final set of abstracted data included the following: general study characteristics (e.g., author, study year, enrollment dates, center[s], and funding agency), eligibility criteria, blinding, numbers of patients enrolled, baseline characteristics of patients enrolled (e.g., age and disease severity and duration), intervention, outcome instrument(s), adverse events and method of ascertainment, and results.
Data were abstracted directly into tables created in DistillerSR with elements defined in an accompanying data dictionary. A training set of five articles was abstracted by all team members who were abstracting data. From this process, an abstraction guide was created and used by all abstractors to ensure consistency. Two team members abstracted data from each article, and discrepancies were reconciled during daily team discussions. Abstracted data were transferred from DistillerSR to Microsoft Excel54 for construction of the study-level evidence tables and summary tables included in this report.
Data abstraction form elements are located in Appendix D.
Quality (Risk of Bias) Assessment of Individual Studies
In accordance with the AHRQ Methods Guide,48 individual RCTs were assessed using the United States Preventive Services Task Force (USPSTF) criteria,55 shown in Appendix E. Two independent reviewers assigned ratings of good, fair, or poor to each study, with discordant ratings resolved with input from a third reviewer. Trials that did not use an intention-to-treat (ITT) analysis were rated poor quality, per USPSTF criteria.55 Trials that did not specify the type of analysis performed and did not provide sufficient patient flow data to determine whether an ITT analysis was done also were rated poor quality. Additionally, because all outcomes of interest were patient-reported, particular care was taken to ascertain whether patients were properly blinded to treatment. Open-label trials and trials in which patient blinding was deemed inadequate based on the description provided received a quality rating of poor.
The quality of harms reporting was assessed using the USPSTF rating, with specific attention to both patient and assessor blinding, and the McMaster Quality Assessment Scale of Harms (McHarm) for primary studies,56 shown in Appendix F. In particular, the process of harms ascertainment was noted and characterized as either an active process, if structured questionnaires were used; a passive process, if only spontaneous patient reports were collected; or intermediate, if active surveillance for at least one adverse event was reported. Trials using only passive harms ascertainment were considered to have a high risk of bias, specifically, underreporting or inconsistent reporting of harms.
For populations, comparisons, and interventions that were not adequately represented in RCTs, we sought nonrandomized comparative studies (observational, case-control, and cohort studies). We planned to assess studies of these designs using a selection of items proposed by Deeks and colleagues.57 However, we found no such studies. Therefore, quality rating was not applicable.
Two reviewers independently assessed the risk of bias of relevant systematic reviews and meta-analyses using the following criteria derived from the AMSTAR tool and AHRQ guidance:51
1. Details of the literature search were provided.
2. Study inclusion and exclusion criteria were stated.
3. The quality assessment of included studies was described and documented.
These were considered the minimum criteria for assessing potential bias of any summary results and conclusions. Criteria 1 and 2 address the potential for selection bias. Criterion 3 is necessary to assess potential bias of included studies.
Evidence for effectiveness and safety provided by each treatment comparison was summarized in narrative text. The decision to incorporate formal data synthesis into this review was made after completing data abstraction.
Overall Approaches and Meta-Analyses for Direct Comparisons
Pooling of treatment effects was considered for each treatment comparison according to AHRQ guidance.58 Three or more clinically and methodologically similar studies (i.e., studies designed to ask similar questions about treatments in similar populations and to report similarly defined outcomes) were required for pooling. Three was an arbitrary number used as an operational criterion for meta-analyses. Only trials that reported variance estimates (standard error, standard deviation, or 95 percent confidence interval) for group-level treatment effects could be pooled. The measure of the pooled effect was the mean difference or the standardized mean difference, depending on how treatment effects were reported in pooled trials. Some trials reported mean changes from baseline, and others reported mean final symptom scores. When these trials were pooled together, the measure of the pooled effect was the mean difference.48 Trials also used different symptom rating scales (e.g., 4-point integer scales or 10 cm visual analog scales [VAS]). When these trials were pooled together, the standardized mean difference (SMD) was calculated. Otherwise, the mean difference was the preferred measure for pooled effects. Trials that used both different calculations for treatment effects and different symptom rating scales could not be pooled together.
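Under these rules, trials reporting outcomes on different rating scales were combined using the standardized mean difference. A minimal illustration of that calculation follows; the function name and inputs are hypothetical and do not reproduce the review's actual software:

```python
import math

def standardized_mean_difference(m1, m2, sd1, sd2, n1, n2):
    """Standardized mean difference (Cohen's d): the difference in group
    means divided by the pooled standard deviation, putting trials that
    used different symptom scales on a common, unitless scale."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# A 1-point mean advantage on a scale whose pooled SD is 1.0
# gives an SMD of magnitude 1.0.
example = standardized_mean_difference(5.0, 6.0, 1.0, 1.0, 20, 20)
```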
We used RevMan59 to conduct meta-analyses using inverse variance weighting and random-effects models. For any meta-analysis performed, we identified the presence of statistical heterogeneity by using Cochran's Q statistic (chi-squared test) and assessed the magnitude of heterogeneity using the I2 statistic.60 For Cochran's Q statistic, a p-value less than or equal to 0.10 was considered statistically significant. An approximate guide for the interpretation of I2 was:61
- 0 percent to 40 percent: may not be important
- 30 percent to 60 percent: may represent moderate heterogeneity
- 50 percent to 90 percent: may represent substantial heterogeneity
- 75 percent to 100 percent: considerable heterogeneity
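The pooling and heterogeneity statistics described above can be sketched as follows. This is a minimal, illustrative implementation of the DerSimonian-Laird random-effects method with Cochran's Q and I2, not the RevMan code itself; the function name and inputs are hypothetical:

```python
import math

def random_effects_pool(effects, ses):
    """DerSimonian-Laird random-effects pooling with inverse-variance
    weights. Returns the pooled effect, its 95% CI, Cochran's Q, and
    I^2 (percent), from per-trial effect estimates and standard errors."""
    w = [1.0 / se ** 2 for se in ses]                   # fixed-effect weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                       # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]       # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se_p = math.sqrt(1.0 / sum(w_re))
    return pooled, (pooled - 1.96 * se_p, pooled + 1.96 * se_p), q, i2
```

A p-value for Q is obtained from a chi-squared distribution with df degrees of freedom; per the text, p ≤ 0.10 was treated as statistically significant heterogeneity.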
When present, we explored statistical heterogeneity as well as clinical diversity by performing subgroup analyses, sensitivity analyses, and meta-regression when possible.48 Statistical heterogeneity and clinical diversity are related concepts: Statistical heterogeneity describes variability in observed treatment effects that is due to clinical and/or methodological diversity, biases, or chance. Clinical diversity describes variability across trial study populations, interventions, and outcome assessments. In exploratory analyses, study level variables included study quality (risk of bias assessment), specific drugs studied, and covariates, such as inclusion of asthma patients or use of rescue or ancillary medications. Meta-analysis was planned for adverse events that investigators reported as severe or that led to discontinuation of treatment. Three or more trials reporting the adverse event were required for pooling. Adverse events of unspecified severity were considered not comparable across trials.
In this review, we formed conclusions about treatment classes based on meta-analyses of studies that compared single treatments. Methodological approaches for this type of analysis have not been published. However, we proceeded with this analysis with support from the TEP. For class comparisons that were poorly represented (i.e., a small proportion of drugs in a class were assessed in included studies), we applied conclusions to the specific drugs studied; how well such conclusions generalize to other drugs in the same class is uncertain. Previous comparative effectiveness reviews in allergic rhinitis3, 28, 38, 41-47 have found insufficient evidence to support superior effectiveness of any single drug within a drug class.
To assess the magnitude of treatment effects, we searched the published literature for minimal clinically important differences (MCIDs) derived from anchor-based or distribution-based methods. Anchor-based methods correlate observed changes on an investigational outcome assessment instrument with those on a known, validated instrument. Distribution-based MCIDs are obtained from the pooled variance in a clinical trial, for example, 20 percent or 50 percent of the pooled baseline standard deviation.62, 63 Anchor-based MCIDs are considered more robust than distribution-based MCIDs. FDA Guidance for patient-reported outcomes in clinical research supports the use of anchor-based MCIDs.50, 64
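As an illustration of the distribution-based approach described above, the calculation reduces to taking a fraction of the pooled baseline standard deviation. The helper below is hypothetical, with 20 percent as the default fraction:

```python
import math

def distribution_based_mcid(baseline_sds, ns, fraction=0.2):
    """Distribution-based MCID: a fraction (e.g., 0.2 or 0.5) of the
    pooled baseline standard deviation across trial arms."""
    num = sum((n - 1) * sd ** 2 for sd, n in zip(baseline_sds, ns))
    den = sum(ns) - len(ns)
    return fraction * math.sqrt(num / den)
```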
Anchor-based MCIDs have been published for quality of life measures commonly used in clinical research on rhinitis.65, 66 For the Rhinitis Quality of Life Questionnaire (RQLQ) and the mini-RQLQ, anchor-based MCIDs are 0.5 and 0.7, respectively, on a 0-6 point scale. Another validated quality of life questionnaire, the Nocturnal RQLQ, does not have a well-defined (i.e., anchor-based preferably or distribution-based) MCID.67
For asthma outcomes, anchor-based MCIDs have been defined for rescue medication use68 (1 puff per day) and forced expiratory volume in 1 second (FEV1; 100-200 mL).68, 69 A Health Canada Advisory Committee70 proposed definitions of the MCID for FEV1 using percent change from baseline (10 percent in adults [greater than age 11 years] and 7 percent in children [age 6 to 11 years]). For asthma symptoms,71 asthma exacerbations, and morning peak expiratory flow (PEF), MCIDs have not been well-defined. Definitions of “asthma exacerbation” vary; it has been proposed that any reduction in severe exacerbations (e.g., requiring treatment with systemic corticosteroids) is clinically significant.70 For PEF, a change of 25 L/min from baseline values is commonly considered clinically significant.69 It is unclear how this value was derived.
For nasal and eye symptom scales, anchor-based MCIDs have not been published. We identified three published attempts to assess clinically important changes in these scales.
- A study of responsiveness yielded a minimum clinically meaningful change of 30% of the maximum score: Bousquet and colleagues73 conducted a trial (n=839) that included a sub-study (n=796) comparing the responsiveness of VAS scores to changes in the total nasal symptom score (TNSS) and RQLQ. Responsiveness and MCID are overlapping but not identical concepts. Responsiveness, defined as the ability of an instrument to measure change in a clinical state, ideally includes the ability to measure a clinically meaningful change,74 but may overestimate the minimal meaningful change. Bousquet and colleagues found that patients with a “clinically relevant improvement” in TNSS had a reduction of 2.9 cm on a 10 cm VAS. In this study, “clinically relevant improvement” was defined a priori as a decrease of at least 3 points on a 0-12 point TNSS scale. This threshold was based on placebo- and active-controlled trials of intranasal corticosteroids in patients with SAR and PAR, which showed improvements in TNSS of 40 to 50 percent from baseline in the active treatment groups. Because baseline TNSS in the Bousquet trial was approximately 7 on a 0-12 point scale, a 40 percent improvement correlated to a 3-point reduction in TNSS (7 × 0.40 = 2.8 ≈ 3).
- In allergen-specific immunotherapy (SIT) trials, a minimum 30 percent greater improvement in composite scores compared to placebo is considered clinically meaningful:75 The WHO currently recommends use of a composite outcome measure (symptoms plus rescue medication use) in SIT trials.76 Although “minimal clinically relevant efficacy” for this outcome is considered to be a 20 percent greater improvement compared to placebo, the cited reference for this threshold77 does not support the recommendation: It is a systematic review of pharmacologic (not immunologic) treatments in which only symptom scores (not combination scores) were assessed, and a difference between two treatments of 10 percent was assumed to be clinically relevant. In contrast, an earlier paper by a member of the WHO writing group75 asserted that a 30 percent reduction in symptom/medication scores compared to placebo is minimally clinically relevant. This threshold was based on an evaluation of 68 placebo-controlled, double-blind trials.
In the absence of gold-standard MCIDs for symptom rating scales in SAR patients, we sought input from our TEP as recommended in the AHRQ Methods Guide.48 Three of seven experts provided input.
- For individual symptoms rated on a 0-3 point scale, all three experts considered a 1-point change meaningful.
- For TNSS on a 0-12 point scale, two experts considered a 4-point change and one expert considered a 2-point change meaningful.
- For total ocular symptom score (TOSS) on a 0-9 point scale, two experts considered a 3-point change and one expert considered a 1-point change meaningful.
For TNSS, potential MCIDs obtained from the three sources listed above and from the TEP are summarized in Table 7. As shown, two sources (row 2 and row 4) converged around an MCID of 30 percent change of maximum TNSS score. This was supported by three TEP members who proposed a similar threshold for individual nasal symptoms (1 point on a 0-3 point scale) and two TEP members who proposed a similar threshold for TOSS (3 points on a 0-9 point scale). The concordance of these values increased our confidence that 30 percent of maximum score is a useful threshold for purposes of our analysis and could be applied across symptom scales. We therefore examined the strength of evidence for symptom outcomes using this MCID calculated for each scale used.
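The 30-percent-of-maximum threshold can be applied mechanically to each scale. The sketch below (hypothetical function; the per-scale mapping is illustrative, using the scale maxima named in the text) shows the resulting thresholds:

```python
def mcid_30pct(scale_max):
    """MCID taken as 30 percent of a scale's maximum score."""
    return 0.30 * scale_max

# Per-scale thresholds for instruments named in the text.
thresholds = {
    "TNSS (0-12)": mcid_30pct(12),          # 3.6 points
    "TOSS (0-9)": mcid_30pct(9),            # 2.7 points
    "single symptom (0-3)": mcid_30pct(3),  # 0.9 points
    "VAS (10 cm)": mcid_30pct(10),          # 3.0 cm
}
```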
A summary of MCIDs used in this report is presented in Table 8. As shown, three outcomes – asthma symptoms, quality of life assessed using the Nocturnal RQLQ, and harms – did not have MCIDs.
Two types of symptom scores were reported: reflective and instantaneous. Reflective scores represent a drug's effectiveness throughout the dosing interval. Instantaneous scores represent effectiveness at the end of the dosing interval. Instantaneous scores are recommended by the FDA for clinical development programs of SAR drugs.50 The FDA considers these scores a pharmacokinetic/pharmacodynamic feature of drugs in development, important for assessing dosing interval, but not important to patients. Consequently, only reflective symptom scores were abstracted for this review.
Symptom scores were reported at various time points, from 2 to 8 weeks. For treatment comparisons that involved intranasal corticosteroids, 2-week results were segregated from results at all other time points based on the pharmacodynamic profile of this class of drugs (onset of action occurs during the first 2 weeks of treatment). Results after 2 weeks were qualitatively synthesized. For all other drug classes, results from all time points were pooled. For trials that reported more than one time point, only results for the identified primary time point were included in meta-analysis. If a primary outcome (time point) was not identified, the latest outcome was included.
For adverse events, the measure of the pooled effect was the risk difference. Trials that reported adverse events as the proportion of patients experiencing the event were considered for pooling (meta-analysis or qualitative synthesis). Trials that reported adverse events as a proportion of all adverse events reported or did not report events by treatment group were not considered for pooling.
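The risk difference for a given adverse event is computed from arm-level proportions; a minimal sketch follows (hypothetical function, normal-approximation confidence interval):

```python
import math

def risk_difference(events1, n1, events2, n2):
    """Risk difference between two arms, with a 95% CI from the
    normal approximation to the binomial."""
    p1, p2 = events1 / n1, events2 / n2
    rd = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd, (rd - 1.96 * se, rd + 1.96 * se)
```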
We initially assessed the evidence to determine whether one treatment was therapeutically superior to another and found that, for many comparisons, the evidence suggested equivalence of the treatments compared. We therefore decided post hoc to adopt an equivalence approach to evidence assessment in accordance with the AHRQ Methods Guide.48 Equivalence assessments increased our ability to form conclusions about the comparative effectiveness of treatments. In contrast to superiority assessments, equivalence assessments aim to determine whether two treatments are therapeutically similar within a predefined margin of equivalence48 (discussed further below). Therefore, we assessed the body of evidence to support one of the following conclusions:
- Superiority: One treatment demonstrated greater effectiveness than the other, either for symptom improvement or harm avoidance.
- Equivalence: Treatments demonstrated comparable effectiveness, either for symptom improvement or harm avoidance.
- Insufficient evidence: The evidence supported neither a conclusion of superiority nor a conclusion of equivalence.
To form clinically relevant conclusions, we compared both individual and pooled treatment effects to the MCID for each outcome, if one existed. Conclusions that could be drawn depended on whether or not an MCID existed and whether or not we were able to conduct meta-analysis:
- If an MCID existed and meta-analysis was done, one of three conclusions could be made: superiority, equivalence, or insufficient evidence. This was based on examination of the 95 percent confidence interval of the pooled effect in relation to the MCID (described further below).
- If there was no MCID and meta-analysis was done, one of two conclusions could be made: superiority or insufficient evidence. This was based on examination of the 95 percent confidence interval of the pooled effect in relation to the “no effect” line (i.e., treatment difference of zero). In this instance, a margin of equivalence could not be identified.
- If meta-analysis was not done, one of two conclusions could be made regardless of whether an MCID existed: superiority or insufficient evidence. In this instance, we estimated qualitatively the magnitude of the overall treatment effect for the body of evidence by inspection of individual trial results. Because a 95 percent CI for the overall effect was not generated, equivalence could not be assessed.
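The three decision rules above can be summarized in a short sketch (positive differences favor the first treatment; this is an illustrative restatement of the rules, not code from the review):

```python
def classify_conclusion(ci_low, ci_high, mcid=None):
    """Classify a body of evidence from the 95 percent confidence interval
    of the pooled treatment effect, per the decision rules above."""
    if mcid is not None:
        # CI entirely above +MCID or below -MCID: one treatment is superior
        if ci_low > mcid or ci_high < -mcid:
            return "superiority"
        # CI entirely within (-MCID, +MCID): treatments are equivalent
        if ci_low > -mcid and ci_high < mcid:
            return "equivalence"
        return "insufficient evidence"
    # No MCID: only superiority (CI excludes zero) or insufficient evidence
    if ci_low > 0 or ci_high < 0:
        return "superiority"
    return "insufficient evidence"

# With an MCID of 3.6 points, a pooled 95% CI of (0.5, 2.0) supports equivalence
```

Note that without an MCID the equivalence branch disappears entirely, matching the second bullet above: no margin of equivalence can be identified.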
Strength of the Body of Evidence
The strength of the body of evidence for each outcome was determined in accordance with the AHRQ Methods Guide48 and is based on the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) system.58, 78 Two reviewers independently evaluated the strength of evidence; agreement was reached through discussion and consensus when necessary. Four main domains were assessed: risk of bias, consistency, directness, and precision. Additional domains (dose-response association, strength of association, and publication bias) were considered for assessment. The body of evidence was evaluated separately for each treatment comparison and each outcome of interest, to derive a single GRADE of high, moderate, low, or insufficient evidence.
The GRADE definitions are as follows:
- High: high confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect.
- Moderate: moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.
- Low: low confidence that the evidence reflects the true effect. Further research is likely to change the confidence in the estimate of effect and is likely to change the estimate.
- Insufficient: evidence either is unavailable or does not permit a conclusion.
We assessed the four strength of evidence domains using the following decision rules.
- Risk of bias: Ratings were based on USPSTF criteria applied to trials that reported on a given outcome, weighted by sample size using a semi-quantitative method. We used the following approximate cutoffs:
- Low risk of bias: Most patients were in good quality trials, and fewer than one-third were in poor quality trials.
- High risk of bias: Most patients were in poor quality trials, and fewer than two-fifths were in good quality trials.
- Medium risk of bias: The body of evidence falls between low and high risk of bias.
- For harms, active ascertainment of adverse events using structured questionnaires was considered to reduce the risk of bias, and passive ascertainment of adverse events by spontaneous patient report only, to increase the risk of bias.
- Consistency: We assessed consistency by comparing the direction of treatment effects. Because conclusions that could be drawn depended on whether or not an MCID existed and whether or not meta-analysis was done (as described above in “Evidence Synthesis”), the application of consistency assessments differed for outcomes with and without an MCID and for bodies of evidence with and without meta-analysis.
- If an MCID existed and meta-analysis was done, we determined consistency by visual inspection of forest plots. As shown in Figure 3:
- Point estimates and their 95 percent confidence intervals that fell completely above or below an interval bounded by the MCID (i.e., −MCID, +MCID) were considered consistent in support of a conclusion of superiority of the treatment favored. (See A and J in Figure 3.)
- Point estimates and their 95 percent confidence intervals that fell completely within an interval bounded by the MCID (i.e., −MCID, +MCID) were considered consistent in support of a conclusion of equivalence of the two treatments. (See C, D, and E in Figure 3.)
- Point estimates that fell on either side of the MCID (i.e., some greater than and some less than the MCID) or 95 percent confidence intervals that included the MCID were considered inconsistent. (See B, F, G, H, and I in Figure 3.)
- If an MCID existed and meta-analysis was not done, treatment effects in the same direction (i.e., all greater than or all less than the MCID) were considered consistent. Effects in opposite directions were considered inconsistent.
- If there was no MCID and meta-analysis was done, we also inspected forest plots.
- Point estimates and their 95 percent confidence intervals that fell completely on one side of the line of “no effect” (i.e., treatment difference of zero) were considered consistent.
- Point estimates that fell on either side of “no effect” (i.e., some treatment differences greater than zero and some less than zero) or 95 percent confidence intervals that included zero were considered inconsistent.
- If there was no MCID and meta-analysis was not done, treatment effects in the same direction (i.e., all greater than or all less than a treatment difference of zero) were considered consistent. Effects in opposite directions were considered inconsistent.
- A body of evidence that included both meta-analysis and additional trials reporting results that conflicted with the meta-analysis was considered consistent if 10 percent or less of patients reporting the outcome were in the additional trials.
- For meta-analyses that used the mean difference as the measure of the pooled effect, we also examined statistical heterogeneity (Cochran's Q statistic and the I² statistic) to support the consistency assessments described above.
- Low statistical heterogeneity supported consistency.
- We examined moderate and greater statistical heterogeneity using additional analyses (as described above in “Overall Approaches and Meta-Analyses for Direct Comparisons”) to determine an overall assessment of consistency.
- Directness: As displayed in the Analytic Framework (Figure 1), intermediate health outcomes and final health outcomes pertain directly to patients' experience of improvement in symptoms and quality of life. Therefore, all outcomes were considered direct.
- Precision: The assessment of precision depended on whether an MCID existed for the outcome and whether the body of evidence included meta-analysis.
- If an MCID existed and meta-analysis was done, precision of the pooled effect estimate was determined by the 95 percent confidence interval of the estimate. As shown in Figure 3:
- If both the point estimate and its 95 percent confidence interval fell completely above or below an interval bounded by the MCID (i.e., −MCID, +MCID), the body of evidence was considered precise in support of a conclusion of superiority of the favored treatment. (See A and J in Figure 3.)
- If both the point estimate and its 95 percent confidence interval fell completely within an interval bounded by the MCID (i.e., −MCID, +MCID), the body of evidence was considered precise in support of a conclusion of equivalence of the two treatments. (See C, D, and E in Figure 3.)
- If the 95 percent confidence interval included the MCID, the body of evidence was considered imprecise and insufficient to support a conclusion of either superiority or equivalence. (See B, F, G, H, and I in Figure 3.)
- If an MCID existed and meta-analysis was not done, effect estimates clearly exceeding the MCID (to accommodate unknown variance in the estimate) were considered precise in support of a conclusion of superiority of the favored treatment. Otherwise, effects were considered imprecise.
- If there was no MCID and meta-analysis was done, pooled effects were considered precise if their 95 percent confidence intervals excluded conflicting conclusions (i.e., did not include treatment differences of zero).
- If there was no MCID and meta-analysis was not done, statistically significant treatment effects were considered precise; statistically nonsignificant treatment effects were considered imprecise. Although conceptually different from precision, statistical significance of treatment effects is highly correlated with precision.
- For bodies of evidence with additional trials not included in meta-analysis, we assessed the impact of the additional treatment effects on the pooled estimate semi-quantitatively. We considered both the direction and magnitude of the additional treatment effects as well as trial size (i.e., the number of patients reporting the outcome):
- Effects that clearly would have little impact on the pooled estimate if included in the meta-analysis were noted (e.g., 5 percent of patients reporting the outcome in a trial with an effect estimate very close to the pooled estimate).
- Effects that would have uncertain impact on the pooled estimate were added to the meta-analysis with assumed standard deviations equal to half the mean change in outcome score in each treatment group. This assumption was based on the observation that reported group-level standard deviations were often approximately equal to group means (Appendix C). Because we used inverse variance weighting in our pooling method, larger assumed standard deviations would have given the additional trials less weight, leaving the pooled confidence interval nearly as narrow as before and increasing the risk of a Type I error (i.e., a 95 percent confidence interval that erroneously excluded the MCID would lead to an incorrect conclusion of equivalence or superiority; if there was no MCID, a 95 percent confidence interval that erroneously excluded zero would lead to an incorrect conclusion of superiority). Using a smaller standard deviation, which gave the additional trials more influence on the pooled estimate, was therefore the more conservative approach.
- For trials that did not report treatment effect magnitudes, a body of evidence could be considered precise if the trials represented less than 10 percent of patients reporting the outcome.
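The sensitivity analysis for trials with unreported variance can be sketched with fixed-effect inverse-variance pooling, as below. All numbers are hypothetical, and this is a sketch of the general approach rather than the review's actual code:

```python
import math

def pool_fixed_effect(trials):
    """Fixed-effect inverse-variance pooling of (mean difference, SE) pairs;
    returns the pooled estimate and its 95 percent confidence interval."""
    weights = [1 / se**2 for _, se in trials]
    pooled = sum(w * md for (md, _), w in zip(trials, weights)) / sum(weights)
    se_pooled = math.sqrt(1 / sum(weights))
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)

# Hypothetical additional trial with unreported variance: assume the SD in
# each group equals half that group's mean change (here 4.0 -> SD 2.0).
n_per_group = 50
assumed_sd = 4.0 / 2
# Standard error of the mean difference between two groups of equal size
se_assumed = math.sqrt(2 * assumed_sd**2 / n_per_group)

# Add the trial to two hypothetical trials already in the meta-analysis
pooled, ci = pool_fixed_effect([(1.0, 0.5), (2.0, 0.5), (1.8, se_assumed)])
```

A smaller assumed standard deviation yields a smaller standard error and hence a larger inverse-variance weight, so the added trial exerts more pull on the pooled estimate.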
We assigned overall strength of evidence grades using a semi-quantitative approach. Because our body of evidence comprised RCTs, we began with an overall rating of high strength of evidence, which assumed low risk of bias and consistent and precise effects. (All outcomes were considered direct as noted above). We downgraded the strength of evidence one level for each domain rating that differed from this starting assumption. For example, if the risk of bias was medium and the evidence was inconsistent, the strength of evidence was downgraded two levels, from high to low. The one exception to this approach was precision: Any imprecise body of evidence was considered insufficient to support a conclusion about the comparative effectiveness or harms of the treatments compared.
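The downgrading scheme can be summarized in a short sketch. The mapping below is our reading of the rules above (one level per departing domain, imprecision overriding everything) and is illustrative only:

```python
GRADES = ["high", "moderate", "low", "insufficient"]

def overall_grade(risk_of_bias, consistent, precise):
    """Start at 'high' (RCT evidence) and downgrade one level for each
    domain that departs from low risk of bias and consistency.
    Any imprecise body of evidence is rated insufficient.
    Directness is omitted because all outcomes were considered direct."""
    if not precise:
        return "insufficient"
    downgrades = 0
    if risk_of_bias != "low":   # medium or high risk of bias
        downgrades += 1
    if not consistent:
        downgrades += 1
    return GRADES[downgrades]

# Example from the text: medium risk of bias + inconsistent evidence -> "low"
```

This reproduces the worked example in the text: a medium risk of bias combined with inconsistency downgrades the evidence two levels, from high to low.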
Applicability
The objective of this review was to provide an evidence-based understanding of the comparative effectiveness of available treatments for SAR. Populations of interest were children, adolescents, and adults (including pregnant women) who experience mild or moderate/severe SAR symptoms. In this context, applicability is defined as the extent to which treatment effects observed in published studies reflect expected results when treatments are applied to these populations in the real world.79, 80
Potential factors that may affect the applicability of the evidence for the KQs include:
- Underrepresentation of populations of interest, especially pregnant women
- Selection of patients with predominantly severe symptoms
- Dosage of comparator interventions not reflective of current practice
- Effects of keeping a patient diary on treatment adherence
The applicability of the body of evidence for each KQ was assessed by two reviewers with agreement reached through discussion and consensus when necessary. Limitations to the applicability of individual studies are described in the Discussion chapter.
Peer Review and Public Commentary
Peer Reviewers were invited to provide written comments on the draft report based on their clinical, content, or methodological expertise. The EPC addressed Peer Review comments on the preliminary draft of the report when preparing the final draft of the report. Peer Reviewers did not participate in writing or editing the final report or other products. The synthesis of the scientific literature presented in the final report does not necessarily represent the views of individual reviewers. The dispositions of the Peer Review comments were documented and will be published three months after publication of the final report.
Potential reviewers disclosed any financial conflicts of interest greater than $10,000 and any other relevant business or professional conflicts of interest. Invited Peer Reviewers could not have any financial conflict of interest greater than $10,000. Peer reviewers who disclosed potential business or professional conflicts of interest could submit comments on draft reports through the public comment mechanism.
The Research Protocol was posted for public comment on March 8, 2012. The Draft Report was available for public comment from August 2, 2012 to August 30, 2012. No public comments were received.
Glacy J, Putnam K, Godfrey S, et al. Treatments for Seasonal Allergic Rhinitis [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2013 Jul. (Comparative Effectiveness Reviews, No. 120.) Methods.