In this chapter, we document the procedures that our Evidence-based Practice Center (EPC) used to conduct a comparative effectiveness review (CER) of the effectiveness, comparative effectiveness, and harms of local hepatic therapies for primary hepatocellular carcinoma (HCC). The methods for this CER follow the methods suggested in the Agency for Healthcare Research and Quality (AHRQ) “Methods Guide for Effectiveness and Comparative Effectiveness Reviews.”

The main sections in this chapter reflect the elements of the protocol established for the CER; certain methods map to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist.38 We first describe the topic refinement process and the construction of the review protocol. We then present our strategy for identifying articles relevant to our key questions (KQs), our inclusion and exclusion criteria, and the process we used to extract information from the included articles and to generate our evidence tables. In addition, we discuss our methods for grading the quality of individual articles, rating the strength of the evidence, and assessing the applicability of individual studies and the body of evidence for each KQ. Finally, we describe the peer review process. All methods and analyses were determined a priori and documented in a research protocol that was publicly posted by AHRQ.

Given the clinical complexity of this topic and the evolution of the scope and KQs, we sought the input of the Technical Expert Panel (TEP) throughout the process. In some cases, this was done through joint teleconferences; in other cases, we contacted TEP members individually to draw on each member's particular expertise (and availability).

Topic Refinement and Review Protocol

The topic for this report was nominated through a public process. With input from Key Informants, the EPC team drafted the initial KQs and posted them to a Web site for public comment for 4 weeks. Changes to the KQs and the PICOTS framework were made based on the public commentary and discussion with the TEP. In particular, the initial stratification of KQs and interventions by intent of treatment (palliative or curative) was deemed inappropriate and confusing: interventions could not be clearly classified as either curative or palliative, and the term “palliative” is often associated with end-of-life care, so applying that term to this population, who may have early-stage disease, would cause confusion.

The inability to translate disease stage from one classification system to another made it difficult to differentiate between patients with Barcelona Clinic Liver Cancer (BCLC) stage A and B liver disease across publications. Therefore, two KQs refer to the effectiveness and harms of liver-directed therapy for patients with unresectable disease without portal invasion or extrahepatic spread, with preserved liver function, and with an Eastern Cooperative Oncology Group (ECOG) performance status ≤1, or BCLC stage A or B, or equivalent. A third KQ was added to address potential differences in effectiveness by patient and tumor characteristics. Stereotactic body radiation therapy (SBRT) was added to the list of interventions. Increased alkaline phosphatase, increased bilirubin, increased transaminases, liver failure, and rare adverse events were added to the list of harms.

After reviewing the public commentary and TEP recommendations, the EPC drafted the final KQs and submitted them to AHRQ for approval. Members of the TEP and the Key Informants were not involved in the writing, analysis, or interpretation of the data. The views represented are solely those of the authors.

Literature Search Strategy

Search Strategy

Our search strategy used the National Library of Medicine's Medical Subject Headings (MeSH) keyword nomenclature developed for MEDLINE® and adapted for use in other databases. The searches were limited to the English language.39 The TEP noted that most of the pivotal studies are published in English-language journals; therefore, the exclusion of non–English-language articles from this review would not impact the conclusions. The search was further restricted to articles published between January 1, 2000, and July 27, 2012. With input from the TEP, the EPC investigators decided to limit the search to these dates to ensure the applicability of the interventions and outcomes data to current clinical practice. In 1999, the BCLC staging system, which links the stage of disease to specific treatment strategies, was published. In addition to the introduction of this new staging system, prior to the year 2000 some interventions were in their infancy and, by current standards, used outdated regimens.40,41,42 Thermal therapies were not used significantly until the late 1990s, and major changes in proton beam and stereotactic therapy occurred during that same period.43 Chemoembolization drugs and embolic mixtures have also changed a great deal over the last ten years and are more standardized now. For these reasons, which were strongly supported by the TEP, we excluded studies in which patient treatment preceded the year 2000. The texts of the major search strategies are given in Appendix A.

We searched for the following publication types: randomized controlled trials (RCTs), nonrandomized comparative studies, and case series. The TEP was given an opportunity to comment on the list of included articles and was invited to provide additional references if applicable.

Grey literature was sought by searching clinical trials registries, material published on the U.S. Food and Drug Administration Web site, and relevant conference abstracts (American Society of Clinical Oncology, Gastrointestinal Cancers Symposium, Society of Surgical Oncology, The Radiosurgery Society, and the American Association for the Study of Liver Diseases) for data pertaining to the interventions used to treat unresectable HCC that are under consideration in this review. Scientific Information Packets from the Scientific Resource Center were also reviewed. The original intent was to contact study authors only if the EPC staff believed the evidence could meaningfully impact results (i.e., alter the Grading of Recommendations Assessment, Development, and Evaluation [GRADE] strength of evidence). However, given the limited number of studies included in this report, authors were contacted for any article lacking complete information on patient characteristics, interventions, or outcomes. The list of contacted authors is in Appendix B.

Inclusion and Exclusion Criteria

Table 6 lists the inclusion/exclusion criteria we selected based on our understanding of the literature, key informant and public comment during the topic-refinement phase, input from the TEP, and established principles of systematic review methods.

Table 6. Inclusion and exclusion criteria.


Study Selection

Search results were transferred to EndNote® and subsequently into DistillerSR® (Evidence Partners Inc., Ottawa, Canada) for selection. Using the study selection criteria for screening titles and abstracts, each citation was marked as either (1) eligible for full-text review or (2) ineligible for full-text review. Reasons for exclusion at this level were not recorded. The first-level title-only screening was performed in duplicate: to be excluded, a study had to be independently excluded by both team members. In cases of disagreement, second-level abstract screening was completed by two independent reviewers.
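The title-screening decision rule above can be sketched as a simple function. This is an illustrative sketch under one reading of the workflow; the function name and return labels are our own and are not part of the review's software:

```python
def title_screen(reviewer_a_excludes, reviewer_b_excludes):
    """First-level title-only screening: a citation is excluded only when
    BOTH reviewers independently exclude it; a disagreement sends the
    citation to second-level abstract screening by two independent
    reviewers. (Illustrative labels; not the actual DistillerSR workflow.)"""
    if reviewer_a_excludes and reviewer_b_excludes:
        return "excluded"
    if reviewer_a_excludes != reviewer_b_excludes:
        return "second-level abstract screening"
    return "eligible for full-text review"
```

Note that the rule is deliberately conservative: a single reviewer cannot exclude a citation on their own.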

A total of four team members participated in the dual screening process. Discrepancies were resolved by consensus, and a third reviewer was consulted when necessary. All four team members were trained on a set of 50 abstracts to ensure uniform application of the screening criteria. Full-text review was performed when it was unclear whether the abstract met the study selection criteria.

Full-text articles were reviewed in the same fashion to determine their inclusion in the systematic review. Records of the reason for exclusion for each paper retrieved in full-text, but excluded from the review, were maintained in the DistillerSR database. While an article may have been excluded for multiple reasons, only the first reason identified was recorded.

Development of Evidence Tables and Data Extraction

The tables were designed to provide sufficient information to enable readers to understand the studies and to determine their quality. Emphasis was given to data elements essential to our KQs. Evidence table templates were identical for KQ1, KQ2, and KQ3. The format of our evidence tables was based on examples from prior systematic reviews.

Data extraction was performed directly into tables created in DistillerSR, with elements defined in an accompanying data dictionary. All team members extracted a training set of five articles into the evidence tables to ensure uniform extraction procedures and to test the utility of the table design. All data extractions were performed in duplicate, with discrepancies identified and resolved by consensus. If consensus could not be reached, the project lead arbitrated the dispute. The full research team met regularly during the period of article extraction to discuss any issues related to the extraction process. Extracted data included patient and treatment characteristics, outcomes related to the interventions' effectiveness, and data on harms. Harms included specific negative effects, including the narrower term of adverse effects. Data extraction forms used during this review are presented in Appendix C.

The final evidence tables are presented in their entirety in Appendix D. Studies are presented in the evidence tables by study design, then by year of publication, and then alphabetically by the last name of the first author. Abbreviations and acronyms used in the tables are listed as table notes and are presented in Appendix E.

Risk of Bias Assessment of Individual Studies

In the assessment of risk of bias in individual studies, we followed the Methods Guide.44 Quality assessment of each study was conducted by two independent reviewers, with discrepancies adjudicated by consensus. The United States Preventive Services Task Force (USPSTF) tool for RCTs and nonrandomized comparative studies45 and a set of study characteristics proposed by Carey and Boden for studies with a single-arm design46 were used to assess individual study quality. The USPSTF tool is designed for the assessment of studies with experimental designs and randomized participants. Fundamental domains include assembly and maintenance of comparable groups; loss to followup; equal, reliable, and valid measurements; clear definitions of interventions; consideration of all important outcomes; and analysis that adjusts for potential confounders and uses intention-to-treat analysis. The tool defines thresholds for good, fair, and poor quality as follows,45 which were applied to the RCTs and nonrandomized comparative studies:

  • Good: Meets all criteria; comparable groups are assembled initially and maintained throughout the study (follow up at least 80 percent); reliable and valid measurement instruments are used and applied equally to the groups; interventions are spelled out clearly; all important outcomes are considered; and appropriate attention is given to confounders in analysis. In addition, for RCTs, intention-to-treat analysis is used.
  • Fair: Studies are graded as “fair” if any or all of the following problems occur, without the fatal flaws noted in the “poor” category below: in general, comparable groups are assembled initially but some question remains as to whether some (although not major) differences occurred with follow up; measurement instruments are acceptable (although not the best) and generally applied equally; some but not all important outcomes are considered; and some but not all potential confounders are accounted for. Intention-to-treat analysis is done for RCTs.
  • Poor: Studies are graded as “poor” if any of the following fatal flaws exists: groups assembled initially are not close to being comparable or maintained throughout the study; unreliable or invalid measurement instruments are used or not applied equally among groups (including not masking outcome assessment); and key confounders are given little or no attention. For RCTs, intention-to-treat analysis is lacking.

The criteria by Carey and Boden46 for assessing single-arm studies evaluate: clearly defined study questions; a well-described study population; a well-described intervention; use of validated outcome measures; appropriate statistical analyses; well-described results; and discussion and conclusions supported by the data. These criteria do not produce an overall quality ranking; therefore, we created the following thresholds to convert these ratings into the AHRQ standard quality ratings (good, fair, and poor). A study was rated good quality if each of the Carey and Boden46 criteria listed above was met, fair quality if exactly one criterion was not met, and poor quality if more than one criterion was unmet.
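The conversion thresholds described above amount to a simple counting rule over the seven criteria. The following sketch is illustrative only; the criterion strings and function name are our own shorthand, not part of the review's tooling:

```python
# The seven Carey and Boden criteria for single-arm studies, as listed above.
CAREY_BODEN_CRITERIA = [
    "clearly defined study questions",
    "well-described study population",
    "well-described intervention",
    "use of validated outcome measures",
    "appropriate statistical analyses",
    "well-described results",
    "discussion and conclusions supported by data",
]

def overall_quality(criteria_met):
    """Convert per-criterion judgments into the AHRQ standard ratings:
    'good' if all criteria are met, 'fair' if exactly one is unmet,
    'poor' if more than one is unmet. `criteria_met` maps criterion
    name -> bool; a missing criterion counts as unmet."""
    unmet = sum(1 for criterion in CAREY_BODEN_CRITERIA
                if not criteria_met.get(criterion, False))
    if unmet == 0:
        return "good"
    return "fair" if unmet == 1 else "poor"
```

Treating a missing judgment as unmet keeps the rule conservative, which matches the intent of a fatal-flaw-style grading scheme.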

The classification of studies into categories of good, fair, and poor was used for differentiation within the group of studies of a specific study design, and not for the overall body of evidence described below. Each study design was evaluated according to its own strengths and weaknesses. These quality ranking forms and their conversion thresholds can be found in Appendix D.

Data Synthesis

Evidence tables were completed for all included studies, and data were presented in summary tables and analyzed qualitatively in the text. We considered whether formal data synthesis (e.g., meta-analysis) would be possible and appropriate from the set of included studies.

Overall Approaches and Meta-Analyses for Direct Comparisons

Pooling of treatment effects was considered for each treatment comparison according to AHRQ guidance.47 Three or more clinically and methodologically similar studies (i.e., studies designed to ask similar questions about treatments in similar populations and to report similarly defined outcomes) were required for pooling. Only trials that reported variance estimates (standard error, standard deviation, or 95% confidence interval [CI]) for group-level treatment effects could be pooled. The pooling method involved inverse variance weighting and a random-effects model. For any meta-analysis performed, we assessed statistical heterogeneity by using Cochran's Q statistic (chi-squared test) and the I2 statistic. A p value of 0.10 was used to determine statistical significance of Cochran's Q statistic. Thresholds for the interpretation of I2 were:

  • 0 percent to 40 percent, may not be important
  • 30 percent to 60 percent, may represent moderate heterogeneity
  • 50 percent to 90 percent, may represent substantial heterogeneity
  • 75 percent to 100 percent, represents considerable heterogeneity
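The pooling and heterogeneity calculations described above can be sketched as follows. This is an illustrative DerSimonian-Laird random-effects implementation with inverse-variance weighting, Cochran's Q, and the I2 statistic; it is not the review's actual analysis code, and the function name and inputs are our own:

```python
import math

def pool_random_effects(effects, variances):
    """Inverse-variance, random-effects (DerSimonian-Laird) pooling of
    group-level treatment effects, returning the pooled estimate, its
    95% CI, Cochran's Q, and I^2 (in percent)."""
    k = len(effects)
    w = [1.0 / v for v in variances]                     # fixed-effect weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # I^2 statistic
    # DerSimonian-Laird between-study variance tau^2
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]       # random-effects weights
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)        # 95% CI
    return pooled, ci, q, i2
```

In this scheme, Q would be compared against a chi-squared distribution with k−1 degrees of freedom at the p = 0.10 threshold, and I2 would be read against the overlapping ranges listed above.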

Strength of the Body of Evidence

We graded the strength of the overall body of evidence for overall survival, quality of life, and harms for the three KQs. We used the approach developed for the EPC program and referenced in the Methods Guide,24,48 which is based on a system developed by the GRADE Working Group.47 This system explicitly addresses four required domains: risk of bias, consistency, directness, and precision. Table 7 describes the criteria for selecting levels within each of the four required domains. Outcomes with no studies reporting data were assigned a level of “unknown” for each domain. Each domain was evaluated for each outcome of interest in this report.

Table 7. Strength of evidence rating domains.


The grade of evidence strength is classified into four categories, as shown in Table 8. Rules for the starting strength of evidence, and the factors that would raise or lower the strength, are also described in the table.

Table 8. Strength of evidence categories and rules.


Two independent reviewers rated all studies on domain scores and resolved disagreements by consensus discussion; the same reviewers also used the domain scores to assign an overall strength of evidence grade for the body of evidence for each outcome of interest.


Applicability

Applicability of the results presented in this review was assessed in a systematic manner using the PICOTS framework (Population, Intervention, Comparator, Outcomes, Timing, Setting). Assessment included both the design and execution of the studies and their relevance to the target populations, interventions, and outcomes of interest.

Peer Review and Public Commentary

This report received external peer review. Peer Reviewers were charged with commenting on the content, structure, and format of the evidence report, providing additional relevant citations, and pointing out issues related to how we conceptualized the topic and analyzed the evidence.

Our Peer Reviewers (listed in the front matter) gave us permission to acknowledge their review of the draft. In addition, the Eisenberg Center placed the draft report on the AHRQ Web site for public review.

No public comments were received. We compiled all peer review comments and addressed each one individually, revising the text as appropriate. Based on peer review, structure was added to the results section to clarify that all comparisons were made within each category of intervention. Additional language was added to the Comparator section of the PICOTS to restrict comparisons to the same intervention type. AHRQ staff and an associate editor also provided reviews. A disposition of comments from the public commentary and peer review will be posted on the AHRQ Effective Healthcare Web site 3 months after the final report is posted.