NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Chou R, McDonagh MS, Nakamoto E, et al. Analgesics for Osteoarthritis: An Update of the 2006 Comparative Effectiveness Review [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2011 Oct. (Comparative Effectiveness Reviews, No. 38.)

Topic Development

The topic for the original 2006 report33 was nominated in a public process. The key questions for that report were developed by investigators from the Evidence-based Practice Center (EPC) with input from a Technical Expert Panel (TEP), which helped to refine key questions, identify important issues, and define parameters for the review of evidence.

For the present report update, the same scope and key questions were proposed to the EPC by the Agency for Healthcare Research and Quality (AHRQ). The key questions and list of included drugs were modified by the EPC after receiving input from a new TEP convened for this report update. The revised key questions were then posted to a public Web site for comment. AHRQ and the EPC agreed upon the final key questions after reviewing the public comments and receiving additional input from the TEP.

Search Strategy

We updated the search conducted for the original comparative effectiveness review (CER) to cover studies published from 2005 to the present. We searched the Cochrane Database of Systematic Reviews (through January 2011), the Cochrane Central Register of Controlled Trials (through fourth quarter 2010), and Ovid MEDLINE (2005–January 2011). We used relatively broad searches, combining terms for drug names with terms for relevant research designs, limiting to those studies that focused on osteoarthritis and rheumatoid arthritis (see Appendix C for the complete search strategy). Other sources included selected grey literature provided to the EPC by the Scientific Resource Center librarian, reference lists of review articles, and citations identified by public reviewers of the Key Questions. Pharmaceutical manufacturers were invited to submit scientific information packets, including citations and unpublished data.

All 1,184 citations from these sources and the original report were imported into an electronic database (EndNote X3) and considered for inclusion.

Study Selection

We developed criteria for inclusion and exclusion of studies based on the key questions and the populations, interventions, comparators, outcomes, timing, and setting (PICOTS) approach. Abstracts were reviewed using abstract screening criteria (Appendix D) and a two-pass process to identify potentially relevant studies. For the first pass, the abstracts were divided among three investigators. In the second pass, a fourth investigator reviewed all abstracts not selected for inclusion in the first pass. Two investigators then independently reviewed all potentially relevant full-text articles using a more stringent set of criteria for inclusion and exclusion (Appendix D).

Population and Condition of Interest

As specified in the Key Questions, this review focuses on adults with osteoarthritis. We included studies that evaluate the safety, efficacy, or effectiveness of the included medications in adults with any grade of osteoarthritis. We also included studies that report safety in patients with rheumatoid arthritis or who were taking the drug for cancer or Alzheimer’s prevention.

Interventions and Comparators of Interest

We considered studies that compared any of the oral and topical analgesics listed in Table 2 to another included drug, or placebo.

Oral agents include:

For this report, we defined the terms “selective nonsteroidal anti-inflammatory drug (NSAID)” or “cyclooxygenase (COX)-2 selective NSAID” as drugs in the “coxib” class (e.g., celecoxib, rofecoxib, and valdecoxib). We grouped etodolac, nabumetone, and meloxicam into a separate category that we referred to as “partially selective NSAIDs,” because their in vitro COX-2 selectivity is intermediate between that of COX-2 selective NSAIDs and nonselective NSAIDs. However, whether partially selective NSAIDs are truly different from nonselective NSAIDs is unclear because COX-2 selectivity may be lost at higher doses and the effects of in vitro COX-2 selectivity on clinical outcomes are uncertain.35 The salicylic acid derivatives aspirin and salsalate were also considered a separate subgroup. We defined “non-aspirin, nonselective NSAIDs” or simply “nonselective NSAIDs” as all other NSAIDs. We excluded evidence on NSAIDs unavailable in the United States, leaving celecoxib as the only COX-2 selective NSAID included in this update.

Outcomes of Interest

We included studies that evaluate the safety, efficacy, or effectiveness of the previously mentioned medications. Outcomes include:

  • Improvements in osteoarthritis symptoms
  • Adverse events, evaluated from studies of the drugs used for osteoarthritis, rheumatoid arthritis, or cancer treatment:
    • Cardiovascular (CV): stroke, myocardial infarction, congestive heart failure, hypertension, and angina
    • Gastrointestinal (GI): perforations, symptomatic gastroduodenal ulcers and upper GI bleeding (PUBs), obstructions, dyspepsia
    • Renal toxicity
    • Hepatotoxicity
  • Other outcomes of interest: quality of life, sudden death

We defined “benefits” as relief of pain and osteoarthritic symptoms and improved functional status. The main outcome measures for this review were pain, functional status, and discontinuations due to lack of efficacy. Frequently used outcome measures include visual and categorical pain scales.36

Patients use visual analog scales (VAS) to indicate their level of pain, function, or other outcome by marking a scale labeled with numbers (such as 0 to 100) or descriptions (such as “none” to “worst pain I’ve ever had”). One study found minimum clinically important improvement thresholds of an absolute improvement from baseline of 15 to 20 points on a 0 to 100 VAS, or a relative improvement of 30 percent to 40 percent.37
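As an illustration only, the thresholds above can be applied to a single patient's VAS scores as in the following sketch. The function names are hypothetical, the lower bounds of the reported ranges (15 points, 30 percent) are used as defaults by assumption, and treating either threshold as sufficient is also an assumption; the cited study should be consulted for the exact definitions.

```python
def vas_improvement(baseline, followup):
    """Return (absolute, relative) improvement on a 0-100 VAS.

    Assumes lower scores mean less pain, so improvement = baseline - followup.
    """
    for value in (baseline, followup):
        if not 0 <= value <= 100:
            raise ValueError("VAS scores must lie between 0 and 100")
    absolute = baseline - followup
    # Relative improvement is undefined at a baseline of 0; report 0.0 there.
    relative = absolute / baseline if baseline > 0 else 0.0
    return absolute, relative


def meets_mcii(baseline, followup, abs_threshold=15.0, rel_threshold=0.30):
    """True if either the absolute or the relative threshold is reached.

    Defaults are the lower bounds of the ranges quoted in the text
    (an assumption made for illustration).
    """
    absolute, relative = vas_improvement(baseline, followup)
    return absolute >= abs_threshold or relative >= rel_threshold


if __name__ == "__main__":
    print(meets_mcii(60, 40))  # 20-point drop from a baseline of 60
    print(meets_mcii(60, 50))  # 10-point drop from a baseline of 60
```

Passing stricter values (for example `abs_threshold=20`, `rel_threshold=0.40`) applies the upper bounds of the reported ranges instead.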

Categorical pain scales consist of several pain category options from which a patient must choose (e.g., no pain, mild, moderate, or severe). A disadvantage of categorical scales is that patients must choose among categories that may not accurately describe their pain. A variety of disease-specific and nonspecific scales are used to assess these outcomes in patients with osteoarthritis. Commonly used categorical pain scales include:

  • The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), a 24-item, disease-specific questionnaire used to assess the functional status of patients with osteoarthritis of the knee and hip. Separate scores can be calculated for pain (5 items, scored 0 to 20, 0 to 500, or 0 to 100), functional status (17 items, scored 0 to 68, 0 to 1700, or 0 to 100), and stiffness (2 items, scored 0 to 8, 0 to 200, or 0 to 100). For each subscale, the score is calculated by adding the scores for all the items together (in some cases translating to a 100 point scale). A lower score indicates better function.38 One study found minimum clinically important improvement thresholds of an absolute improvement from baseline in the WOMAC total score of about 10 points (on a 0 to 100 scale) or a relative improvement of 25 percent.37
  • The Medical Outcomes Short Form-36 (SF-36) health survey, a 36-item questionnaire with 8 subscales for measuring health-related quality of life across different diseases. Each subscale is scored from 0 to 100, with higher scores indicating better health. Physical and mental component summary scores can be calculated by combining results for several subscales.39
  • Patient Global Assessment of Disease Status and Investigator Global Assessment of Disease Status. The patient or investigator answers questions about the overall response to treatment, functional status, and pain response, using a VAS or categorical scale. Thresholds for minimum clinically important improvements for global assessment of disease status were similar to those for pain, based on a 0 to 100 VAS.37
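The WOMAC subscale arithmetic described in the list above can be sketched as follows. This is a minimal illustration, not the instrument's official scoring software: it assumes the Likert version with each item scored 0 to 4 (which is what the 0 to 20, 0 to 68, and 0 to 8 ranges imply), and the names `WOMAC_ITEMS` and `womac_subscale` are hypothetical.

```python
# Number of items per WOMAC subscale, as given in the text (5 + 17 + 2 = 24).
WOMAC_ITEMS = {"pain": 5, "function": 17, "stiffness": 2}
MAX_ITEM_SCORE = 4  # assumption: Likert version, each item scored 0-4


def womac_subscale(name, item_scores, normalize=False):
    """Sum the item scores for one subscale.

    With normalize=True the raw sum is rescaled to a 0-100 scale.
    Lower scores indicate better status.
    """
    expected = WOMAC_ITEMS[name]
    if len(item_scores) != expected:
        raise ValueError(f"{name} subscale needs {expected} items")
    if any(not 0 <= s <= MAX_ITEM_SCORE for s in item_scores):
        raise ValueError("item scores must lie between 0 and 4")
    raw = sum(item_scores)
    if normalize:
        return raw / (expected * MAX_ITEM_SCORE) * 100
    return raw


if __name__ == "__main__":
    pain_items = [2, 2, 2, 2, 2]
    print(womac_subscale("pain", pain_items))                  # raw sum
    print(womac_subscale("pain", pain_items, normalize=True))  # 0-100 scale
```

Normalizing to 0 to 100 makes subscales with different item counts comparable, which is the form in which the improvement threshold of about 10 points is stated.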

Another method for measuring outcomes is classifying patients dichotomously as “responders” or “nonresponders.” Responders are often defined as patients with at least a 50 percent improvement in pain or function. The Outcomes Measures in Arthritis Clinical Trials-Osteoarthritis Research Society International (OMERACT-OARSI) criteria, for example, were developed through a consensus process and classify patients as responders if they meet specific predefined criteria (≥50% improvement in pain or function that was ≥20 mm on a 100-mm VAS, or a ≥20% improvement in at least two of pain, function, or patient global assessment that was ≥10 mm on a 100-mm VAS).40
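The responder rule quoted above can be expressed as a short sketch. It is a reading of the criteria exactly as summarized in this section, not a validated implementation; the `Change` class and `is_responder` function are hypothetical names, and all three measures are assumed to be recorded on a 100-mm VAS where lower values are better.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Baseline and follow-up values on a 100-mm VAS (lower is better)."""
    baseline: float
    followup: float

    @property
    def absolute(self):
        return self.baseline - self.followup

    @property
    def relative(self):
        # Undefined at a baseline of 0; treated as no improvement.
        return self.absolute / self.baseline if self.baseline > 0 else 0.0


def _improved(change, min_relative, min_absolute_mm):
    """Both the relative and the absolute requirement must hold."""
    return (change.relative >= min_relative
            and change.absolute >= min_absolute_mm)


def is_responder(pain, function, patient_global):
    # Route 1: >=50% and >=20 mm improvement in pain or in function.
    high = _improved(pain, 0.50, 20) or _improved(function, 0.50, 20)
    # Route 2: >=20% and >=10 mm improvement in at least two of the three.
    moderate_count = sum(
        _improved(c, 0.20, 10) for c in (pain, function, patient_global)
    )
    return high or moderate_count >= 2


if __name__ == "__main__":
    print(is_responder(Change(60, 25), Change(50, 48), Change(50, 50)))
```

Requiring both a relative and an absolute change guards against classifying patients with very low baseline scores as responders on the basis of a small absolute shift.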

“Harms” include tolerability (not having to stop the drug due to adverse effects); CV, hepato-, renal, and GI toxicity; and increased risk for hospitalizations, drug interactions, and death. For GI toxicity, we focused on serious complications associated with NSAIDs including perforation, bleeding ulcer, and gastric outlet obstruction, though we also evaluated other GI side effects (such as nausea, dyspepsia, and GI tolerability). We only considered rates of endoscopic ulcers when data on clinical ulcer complications were incomplete or not available.


We did not apply a minimum threshold for duration of intervention.


Studies conducted in primary care and specialty settings were included.

Types of Studies

We included systematic reviews41 and controlled trials pertinent to the Key Questions. We retrieved and evaluated for inclusion and exclusion any blinded or open, parallel or crossover randomized controlled trial that compared one included drug to another, another active comparator, or placebo. We also included cohort and case-control studies with at least 1,000 cases or participants that evaluated serious GI and CV endpoints that were inadequately addressed by randomized controlled trials. We excluded non-English language studies unless they were included in an English-language systematic review, in which case we relied on the data abstraction and results as reported in the systematic review. A list of excluded studies can be found in Appendix E.

Figure 1 depicts the key questions within the context of the PICOTS described in the previous section. In general, the figure illustrates how the nonopioid oral medications, over-the-counter supplements, and topical agents may result in outcomes such as improvements in osteoarthritis symptoms. Also, adverse events (including, but not limited to, CV, GI, renal and hepatic events) may occur at any point after analgesics are received.

This analytic framework is a model linking the key questions, evidence, and the population related to the clinical outcomes. Here, the population of interest is patients with osteoarthritis who are appropriate candidates for analgesic medications (intervention) and the clinical outcomes are improvement in osteoarthritis symptoms. Another possible outcome is the adverse effects of these medications.

Figure 1

Analytic framework.

Data Extraction

After studies were selected for inclusion based on the key questions and PICOTS, the following data were abstracted and used to assess applicability and quality of the study: study design; inclusion and exclusion criteria; population and clinical characteristics (including sex, age, ethnicity, diagnosis, comorbidities, concomitant medications, GI bleeding risk, CV risk); interventions (dose and duration); method of outcome ascertainment, if available; the number of patients randomized relative to the number of patients enrolled, and how similar those patients were to the target population; whether a run-in period was used; the funding source; and results for each outcome, focusing on efficacy and safety. We recorded intention-to-treat results if available. Data abstraction for each study was completed by two investigators: the first abstracted the data, and the second reviewed the abstracted data for accuracy and completeness.

Quality Assessment

We assessed the quality of systematic reviews, randomized trials, and cohort and case-control studies based on the predefined criteria listed in Appendix F. We adapted criteria from the Assessment of Multiple Systematic Reviews (AMSTAR) tool (systematic reviews),42 methods proposed by Downs and Black (observational studies),43 and methods developed by the US Preventive Services Task Force.44 The criteria used are similar to the approach recommended by AHRQ in the Methods Guide for Comparative Effectiveness Reviews.45

We rated the quality of each controlled trial based on the methods used for randomization, allocation concealment, and blinding; the similarity of compared groups at baseline; maintenance of comparable groups; adequate reporting of dropouts, attrition, crossover, adherence, and contamination; loss to followup; the use of intention-to-treat analysis; and ascertainment of outcomes.44

Included systematic reviews also were rated for quality based on predefined criteria assessing whether they had a clear statement of the question(s), reported inclusion criteria, used an adequate search strategy, assessed validity, performed dual data abstraction, reported adequate detail of included studies, assessed for publication bias, and used appropriate methods to synthesize the evidence.42 We included systematic reviews and meta-analyses that included unpublished data inaccessible to the public, but because the results of such analyses are not verifiable, we considered this a methodological shortcoming. For each systematic review included in this report, we considered their relevance to the key questions and scope, their quality, and how new evidence might affect conclusions.41

Large observational studies on serious harms associated with selective and nonselective NSAIDs have primarily relied on claims data or other administrative databases or on electronic medical record data collected in practice networks to identify cases, and prescription claims to determine exposure. A strength of these studies is that they evaluated much larger populations than could be enrolled into clinical trials.46 In addition, they may reflect how NSAIDs are actually used in practice better than many clinical trials, which are usually short term, mandate rigid dosing regimens, limit the use of other drugs, and implement strategies to monitor and enhance compliance. Population- and practice-based studies may also better represent patients who would be excluded from randomized trials because of comorbidities, age, or other factors.

The most important weakness of observational studies is that patients are allocated treatment in a nonrandomized manner. This can lead to biased estimates of effects even when appropriate statistical adjustment for a variety of confounding variables is performed.47 In addition, data sources often cannot reliably assess over-the-counter aspirin, NSAIDs, or acid-suppressing medication use,46 and information on prescription fills may not always accurately correspond to the actual degree of exposure to the drugs.

For assessing the internal validity of cohort studies, we evaluated whether they used nonbiased selection methods to create an inception cohort; whether rates of loss to followup were reported and acceptable; whether they used accurate methods for ascertaining exposures, potential confounders, and outcomes; and whether they performed appropriate statistical analyses of potential confounders.43 For assessing the internal validity of case-control studies, we evaluated whether similar inclusion and exclusion criteria were applied to select cases and controls, whether they used accurate methods to identify cases, whether they used accurate methods for ascertaining exposures and potential confounders, and whether they performed appropriate statistical analyses of potential confounders.43 We only included studies that performed adjustment for important confounders (such as age, sex, and markers of underlying risk) and only reported adjusted risk estimates.

Individual studies were rated as “good,” “fair” or “poor” as defined below:44

Studies rated “good” have the least risk of bias and results are considered valid. Good-quality studies include clear descriptions of the population, setting, interventions, and comparison groups; a valid method for allocation of patients to treatment; low dropout rates and clear reporting of dropouts; appropriate means for preventing bias; and appropriate measurement of outcomes and reporting of results.

Studies rated “fair” are susceptible to some bias, but it is not sufficient to invalidate the results. These studies do not meet all the criteria for a rating of good quality because they have some deficiencies, but no flaw is likely to cause major bias. The study may be missing information, making it difficult to assess limitations and potential problems. The “fair” quality category is broad, and studies with this rating vary in their strengths and weaknesses: the results of some fair-quality studies are likely to be valid, while others are only probably valid.

Studies rated “poor” have significant flaws that imply biases of various types that may invalidate the results. They have a serious or “fatal” flaw in design, analysis, or reporting; large amounts of missing information; or discrepancies in reporting. The results of these studies are at least as likely to reflect flaws in the study design as the true difference between the compared drugs. We did not exclude studies rated poor quality a priori, but poor quality studies were considered to be less reliable than higher quality studies when synthesizing the evidence, particularly when discrepancies between studies were present.

Studies could receive one rating for assessment of efficacy and a different rating for assessment of harms. Study quality was assessed by two independent investigators, and disagreements were resolved by consensus. Quality assessments for individual studies can be found in Appendix G.

Assessing Research Applicability

The applicability of trials and other studies was assessed based on whether the publication adequately described the study population, how similar patients were to the target population in whom the intervention will be applied, whether differences in outcomes were clinically (as well as statistically) significant, and whether the treatment received by the control group was reasonably representative of standard practice.48 We also recorded the funding source and role of the sponsor. We did not assign a rating of applicability (such as “high” or “low”) because applicability may differ based on the user of this report.

Evidence Synthesis and Rating the Body of Evidence

We assessed the overall strength of evidence for a body of literature about a particular key question in accordance with AHRQ’s Methods Guide for Comparative Effectiveness Reviews,45 based on evidence included in the original CER,32 as well as new evidence identified for this update. We considered the risk of bias (based on the type and quality of studies); the consistency of results within and between study designs; the directness of the evidence linking the intervention and health outcomes; the precision of the estimate of effect (based on the number and size of studies and confidence intervals for the estimates); strength of association (magnitude of effect); and the possibility for publication bias. We did not perform original meta-analyses. Rather, we relied on the results of existing individual studies and systematic reviews (including meta-analyses when available).

We rated the strength of evidence for each Key Question using the four categories recommended in the AHRQ guide:45 A “high” grade indicates high confidence that the evidence reflects the true effect and that further research is very unlikely to change our confidence in the estimate of effect. A “moderate” grade indicates moderate confidence that the evidence reflects the true effect and further research may change our confidence in the estimate of effect and may change the estimate. A “low” grade indicates low confidence that the evidence reflects the true effect and further research is likely to change the confidence in the estimate of effect and is likely to change the estimate. An “insufficient” grade indicates evidence either is unavailable or does not permit a conclusion.
