Lin JS, Perdue LA, Henrikson NB, et al. Screening for Colorectal Cancer: An Evidence Update for the U.S. Preventive Services Task Force [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2021 May. (Evidence Synthesis, No. 202.)

Chapter 2. Methods

Scope and Purpose

The USPSTF will use this evidence review in conjunction with microsimulation models from the Cancer Intervention and Surveillance Modeling Network (CISNET) to update its 2016 recommendation statement on screening for CRC.1 This review is an update of our prior work82, 83 and addresses the benefits and harms associated with CRC screening and the test accuracy of the individual screening tests currently available in U.S. clinical practice. The accompanying CISNET simulation models address how the benefits and harms of screening might vary by screening test, screening interval, age to start screening, and age to stop screening, as well as by sex, race/ethnicity, and comorbidities.

Key Questions and Analytic Framework

The analytic framework is presented in Figure 2.

Figure 2 is the analytic framework that depicts the three Key Questions to be addressed in the systematic review. The figure illustrates how screening for colorectal cancer in adults 40 years or older may result in a decrease in colorectal cancer incidence, colorectal cancer mortality, and/or all-cause mortality (Key Question 1). There is also a question related to the accuracy of screening tests used to detect colorectal cancer or adenomatous polyps (Key Question 2) and potential harms of screening (Key Question 3).

Figure 2

Analytic Framework: Screening for Colorectal Cancer. * Screening technologies with conditional approval from the U.S. Food and Drug Administration for screening for colorectal cancer. Abbreviations: CTC = computed tomography colonography; FIT = fecal (more...)

Key Questions

  1. What is the effectiveness or comparative effectiveness of screening programs in reducing colorectal cancer, mortality, or both?
    1. Does the effectiveness of screening programs vary by subgroups (e.g., age, sex, race/ethnicity)?
  2. What is the accuracy of direct visualization, stool-, serum-, or urine-based screening tests for detecting colorectal cancer, advanced adenomas, or adenomatous polyps based on size?
    1. Does the accuracy of the screening tests vary by subgroups (e.g., age, sex, race/ethnicity)?
  3. What are the serious harms of the different screening tests?
    1. Do the serious harms of screening tests vary by subgroups (e.g., age, sex, race/ethnicity)?

Data Sources and Searches

We searched the following databases to identify English-language literature published between January 1, 2015, and December 4, 2019: MEDLINE, PubMed, and the Cochrane Central Register of Controlled Trials. A research librarian developed and executed the search, which was peer-reviewed by a second research librarian (Appendix A). We also reviewed all included studies from the prior review,82, 83 which identified studies prior to 2015. We then supplemented our database searches with expert suggestions and by reviewing reference lists from other recent relevant systematic reviews.84–98 We also searched ClinicalTrials.gov for ongoing screening trials. We imported the literature from these sources directly into EndNote X9 (Thomson Reuters, New York, NY).

Study Selection

Using an online platform (DistillerSR), two investigators independently reviewed 11,306 newly identified titles and abstracts and 502 articles (Appendix A Figure 1) against specified inclusion criteria (Appendix A Table 1). We resolved discrepancies through consensus and consultation with a third investigator. We carried forward 126 studies (159 articles) from our prior review. Four studies from the previous review were not included in this review due to study design (screening effectiveness studies comparing multiple screening tests among the same group of participants99, 100), screening modality (early versions of sDNA tests101), or outcomes (no description of colonoscopy complications102). Additionally, we excluded articles that did not meet inclusion criteria or those we rated as poor quality (i.e., at high risk of bias). Appendix D contains a list of all excluded trials.

Eligible studies included asymptomatic screening populations of individuals age 40 years and older at average risk for CRC. We excluded symptomatic populations and populations selected for: personal history of CRC, high risk for CRC due to known genetic susceptibility syndromes (e.g., Lynch syndrome, familial adenomatous polyposis), first-degree relative younger than age 60 years with CRC, personal history of inflammatory bowel disease, previous abnormal screening test, iron deficiency anemia, or under surveillance for a previous colorectal lesion. In studies with mixed populations, we limited our inclusion to those with less than 50 percent surveillance and/or less than 10 percent with symptoms, abnormal gFOBT or FIT, or anemia. For studies of harms of screening, we allowed mixed populations (e.g., indications for colonoscopy or CTC not reported or detailed) if the sample was larger than 10,000 participants. This allowed us to include studies that might detect rare or uncommon harms. We arrived at the number 10,000 based on estimates derived from our 2008 systematic review.103, 104 Because many studies reporting extracolonic findings on CTC limited population descriptions to asymptomatic or symptomatic, we included any studies in asymptomatic people that could include people at high risk for CRC (e.g., anemia, abnormal FOBT result, personal history of CRC or colorectal lesions).

For the greatest applicability to U.S. practice, we focused on studies conducted in developed countries, as defined by “very high” development according to the United Nations Human Development Index.105 We included only studies that published their results in English because of resource constraints.

We included studies that evaluated direct visualization screening tests (i.e., colonoscopy, FS, CTC, capsule endoscopy) and currently available stool-, serum-, or urine-based screening tests. Although we reviewed the evidence for benefit of older-generation gFOBT (i.e., Hemoccult II) on cancer incidence and mortality (Key Question 1), we did not update the evidence of its test accuracy (Key Question 2) because it has been replaced with high-sensitivity gFOBT (HSgFOBT) and FIT in U.S. practice. We excluded stool testing based on in-office digital rectal examination, double-contrast barium enema, and magnetic resonance colonography, as none of these modalities are used or recommended for use in screening for CRC. We also excluded studies that primarily focused on evaluating technological improvements to colonoscopy or CTC. We excluded endoscopy studies conducted in primarily single-center research settings or those with a limited number of endoscopists (e.g., <5 to 10) in order to approximate test performance and harms of screening tests in community practice.

Key Question 1

We included randomized or controlled trials of CRC screening versus no screening or another screening test. For screening tests without trial-level evidence, we examined well-conducted prospective cohort studies. We included trials and prospective observational studies that reported outcomes of cancer incidence and/or CRC-specific or all-cause mortality. Included studies could report either intention to screen or ‘as screened’ results. We excluded retrospective cohort studies and population-based case control studies. We also excluded decision analyses because this review is paired with CISNET microsimulation models designed to compare the effectiveness and harms of different screening strategies.

Key Question 2

We included test accuracy studies that used colonoscopy as a reference standard. We generally excluded studies whose design was subject to a high risk of bias, including those that did not apply colonoscopy to at least a random subset of screen-negative people (verification bias),106 although we made an exception for otherwise well-conducted diagnostic accuracy studies of FITs in which screen-negative people received registry followup (instead of colonoscopy) to determine cancer outcomes. We excluded studies without an adequate representation of a full spectrum of patients (spectrum bias), such as case-control studies.106–110 Test accuracy studies had to include outcomes of test performance (i.e., sensitivity, specificity, and positive and negative predictive value) for the detection of CRC, AA, SSL, and/or adenomatous polyp by size (≥6 mm or ≥10 mm). We also captured test performance by location in the colon (i.e., proximal vs. distal), when reported.
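The test performance outcomes named above can all be derived from a 2x2 contingency table of screening result against the colonoscopy reference standard. The following is a minimal sketch of those standard definitions; the class name, field names, and counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContingencyTable:
    """2x2 table: screening test result versus colonoscopy (reference standard)."""
    true_pos: int   # screen positive, lesion found on colonoscopy
    false_pos: int  # screen positive, no lesion on colonoscopy
    false_neg: int  # screen negative, lesion found on colonoscopy
    true_neg: int   # screen negative, no lesion on colonoscopy

    def sensitivity(self) -> float:
        # Share of people with the lesion who screen positive.
        return self.true_pos / (self.true_pos + self.false_neg)

    def specificity(self) -> float:
        # Share of people without the lesion who screen negative.
        return self.true_neg / (self.true_neg + self.false_pos)

    def ppv(self) -> float:
        # Share of screen-positive people who have the lesion.
        return self.true_pos / (self.true_pos + self.false_pos)

    def npv(self) -> float:
        # Share of screen-negative people who do not have the lesion.
        return self.true_neg / (self.true_neg + self.false_neg)


# Hypothetical counts for illustration only.
example = ContingencyTable(true_pos=80, false_pos=90, false_neg=20, true_neg=810)
print(f"sensitivity={example.sensitivity():.3f} specificity={example.specificity():.3f}")
print(f"PPV={example.ppv():.3f} NPV={example.npv():.3f}")
```

Unlike sensitivity and specificity, the predictive values depend on how common the lesion is in the screened population, so they are less portable across study settings.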

Key Question 3

We included all trials or observational studies that reported serious adverse events requiring unexpected or unwanted medical attention and/or resulting in death. These events included, but were not limited to, perforation, major bleeding, severe abdominal symptoms, and cardiovascular events. We excluded studies whose reported harms were limited to minor adverse events that did not necessarily result in medical attention (e.g., patient dissatisfaction, worry, minor gastrointestinal complaints), physiologic outcomes only (e.g., hypoxia, renal or electrolyte disturbances), or harms of health certificate effect (i.e., people with negative screening results engaging in risky health behaviors or not pursuing future screening). Studies of harms did not have to include a comparator (i.e., people who did not receive any screening test). We also included studies designed to assess for extracolonic findings (incidental findings on CTC) and resultant diagnostic workup and harms of workup. We extracted extracolonic findings and radiation exposure per CTC examination from relevant diagnostic accuracy (Key Question 2) studies, when reported.

Quality Assessment and Data Abstraction

At least two reviewers critically appraised all articles that met inclusion criteria using the USPSTF’s design-specific quality criteria (Appendix A Table 2).111 We supplemented these criteria with the Newcastle-Ottawa Scales for cohort and case-control studies,112 and the Quality Assessment of Diagnostic Accuracy Studies for studies of test accuracy.113 We rated articles as good, fair, or poor quality. In general, a good-quality study met all criteria. A fair-quality study did not meet, or it was unclear whether it met, at least one criterion, but also had no known important limitations that could invalidate its results. A poor-quality study had a single fatal flaw or multiple important limitations. We excluded all poor-quality studies from this review. Disagreements about critical appraisal were resolved by consensus and, if needed, consultation with a third independent reviewer.

Only one RCT examining screening effectiveness was excluded for poor quality.114 This study had several limitations: it was a small pilot study not powered to detect a difference in CRC, it had variable adherence to each arm, and there was crossover between arms. The most common fatal flaw for test accuracy studies was application of the reference standard to only those with an abnormal screening result (screen positive), because verification of only screen-positive patients will generally lead to an overestimation of both sensitivity and specificity.106, 109, 110, 115 We also excluded test studies that did not provide a description of followup of screen-negative people for poor quality because of limitations in reporting. For cohorts examining harms of screening, the most common limitation was poor reporting (so uncertain risk of bias).
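The effect of verifying only screen-positive people can be illustrated with a small worked example. This sketch assumes one particular handling of the unverified group, namely that screen-negative people who never receive colonoscopy are counted as disease-free; the function names and counts are hypothetical.

```python
def sens_spec(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) for a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


def verify_positives_only(tp: int, fp: int, fn: int, tn: int) -> tuple[int, int, int, int]:
    """Table as it would appear if only screen-positive people got colonoscopy.

    Assumption: unverified screen-negative people are all treated as
    disease-free, so missed lesions (fn) are misclassified as true negatives.
    """
    return tp, fp, 0, tn + fn


# Hypothetical fully verified table: tp, fp, fn, tn.
full_table = (80, 90, 20, 810)

true_sens, true_spec = sens_spec(*full_table)
biased_sens, biased_spec = sens_spec(*verify_positives_only(*full_table))

print(f"full verification:      sensitivity={true_sens:.3f}, specificity={true_spec:.3f}")
print(f"positives-only checked: sensitivity={biased_sens:.3f}, specificity={biased_spec:.3f}")
```

Under this assumption both estimates move upward: missed lesions vanish from the sensitivity denominator and are added to the count of apparent true negatives.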

One reviewer extracted key elements of included studies into standardized evidence tables in DistillerSR. A second reviewer checked the data for accuracy. Evidence tables were tailored for each key question and to specific study designs and/or specific screening tests. Tables generally included details on: study design/quality, setting and population (e.g., country, inclusion criteria, age, sex, race/ethnicity, family history), screening test/protocol (e.g., who administered, how administered, definition of test positive/diagnostic threshold[s], frequency/interval), reference standard or comparator (if applicable), adherence to testing, length of followup, outcomes (e.g., CRC incidence, mortality, sensitivity/specificity, harms) and outcomes for a priori specified subgroups.

Data Synthesis and Analysis

We synthesized results by key question and type of screening test, incorporating those studies from our previous review that met our updated inclusion criteria.

Key Question 1

We organized the syntheses primarily by study design and separated them into three main categories: 1) trials designed to assess the effectiveness (intention to screen) of screening tests (either as a one-time application or in a screening program) compared with no screening on CRC-specific and/or all-cause mortality; 2) well-conducted observational studies designed to assess the effectiveness of receipt of a screening test (either as a one-time application or in a screening program) compared with no screening on CRC incidence and mortality; and 3) comparative effectiveness trials of one screening test (e.g., FIT) versus another screening test (e.g., colonoscopy). Many of the trials comparing screening tests that met our inclusion criteria, however, were designed to determine the differential uptake of tests and/or to determine the comparative yield between tests and were not powered to detect differences in CRC outcomes or mortality (i.e., comparative effectiveness). Primary outcomes of interest were: CRC incidence (by stage if reported), CRC mortality, and all-cause mortality, as well as CRC incidence and mortality by location of CRC (distal vs. proximal).

Because of the limited number of studies and/or clinical heterogeneity of studies, we primarily synthesized results qualitatively using summary tables and figures to allow for comparisons across different studies. We conducted quantitative analyses of incidence rate ratios (IRRs) for four large FS trials for the above-stated outcomes. We conducted random-effects meta-analyses using the restricted maximum likelihood (REML) method to estimate the pooled IRR in Stata version 16 (StataCorp LP, College Station, TX). We assessed the presence of statistical heterogeneity among the studies using the I² statistic.
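For readers unfamiliar with the approach, the following is a minimal sketch of random-effects pooling of log IRRs with a REML estimate of the between-study variance, plus the I² statistic computed from Cochran's Q. It uses a common Fisher-scoring iteration and is not necessarily the algorithm Stata implements; the function names, the Wald-type 95% interval, and the input values are illustrative assumptions and are not results from the included trials.

```python
import math


def pooled_irr_reml(log_irrs, variances, tol=1e-10, max_iter=500):
    """Pool log IRRs under a random-effects model; tau^2 estimated by REML.

    Returns (pooled IRR, (lower, upper) 95% Wald interval, tau^2).
    """
    tau2 = 0.0
    for _ in range(max_iter):
        w = [1.0 / (v + tau2) for v in variances]
        sw = sum(w)
        mu = sum(wi * yi for wi, yi in zip(w, log_irrs)) / sw
        sw2 = sum(wi * wi for wi in w)
        num = sum(wi * wi * ((yi - mu) ** 2 - vi)
                  for wi, yi, vi in zip(w, log_irrs, variances))
        # REML update; truncated at zero because a variance cannot be negative.
        new_tau2 = max(0.0, num / sw2 + 1.0 / sw)
        converged = abs(new_tau2 - tau2) < tol
        tau2 = new_tau2
        if converged:
            break
    w = [1.0 / (v + tau2) for v in variances]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, log_irrs)) / sw
    se = math.sqrt(1.0 / sw)
    return math.exp(mu), (math.exp(mu - 1.96 * se), math.exp(mu + 1.96 * se)), tau2


def i_squared(effects, variances):
    """I^2 (percent) from Cochran's Q using inverse-variance weights."""
    w = [1.0 / v for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical inputs for four trials: log IRRs and their variances.
log_irrs = [math.log(0.70), math.log(0.80), math.log(0.75), math.log(0.85)]
variances = [0.004, 0.003, 0.005, 0.002]

irr, (lower, upper), tau2 = pooled_irr_reml(log_irrs, variances)
print(f"pooled IRR={irr:.3f} (95% CI {lower:.3f} to {upper:.3f}), tau^2={tau2:.5f}")
print(f"I^2={i_squared(log_irrs, variances):.1f}%")
```

Pooling is done on the log scale because the sampling distribution of a log rate ratio is closer to normal; the result is exponentiated for reporting.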

Key Question 2

We organized our synthesis by type of screening test. Most commonly, these results are limited to a single application of a screening test. Our analyses primarily focused on per-person test sensitivity to detect CRC, AAs (as defined by the study), advanced neoplasia (a composite outcome of AA plus CRC), and adenomas by size (≥6 or ≥10 mm). SSLs were sometimes included in the definition of AA, and when possible, we report test sensitivity for SSL alone. If the per-person sensitivity was not reported and could not be calculated, we substituted per-lesion test performance. If per-person test accuracy was not reported for adenomas by size, we allowed for any lesion (i.e., polyp) regardless of histology. We calculated sensitivity and specificity for adenomas by size and AAs excluding CRC lesions (i.e., people who had CRC were removed from the contingency table for AA). Analyses were conducted in Stata version 16. Data from contingency tables were analyzed in Stata using a bivariate model, which modeled sensitivity and specificity simultaneously. If there were not enough studies to use the bivariate model, sensitivity and specificity were pooled separately. We did not quantitatively pool results when data were limited to fewer than three studies. When quantitative analyses were not possible, we used summary tables and forest plots, prepared using Stata, to provide a graphical summary of results. We assessed the presence of statistical heterogeneity among the studies using the I² statistic. When analyses found large statistical heterogeneity, we suggest using the 95% CI or range of estimates across the individual studies as opposed to point estimates. However, the high statistical heterogeneity for specificity is in part due to the high degree of precision around estimates from individual studies.

For test performance of CTC, we synthesized results for examinations with bowel preparation separately from those without bowel preparation. For studies of stool-based tests, we focused on designs that provided a colonoscopy to all patients (the reference standard) regardless of the screening test result. In this way we avoided potential test referral bias, which increases apparent test sensitivity and decreases specificity. We separately evaluated studies that employed differential followup (i.e., registry followup for screen-negative people and direct visualization for screen-positive people). For the FITs, we conducted random-effects meta-analyses by “family” (Appendix D Table 1). For example, tests produced by the same manufacturer, utilizing the same components and method, and compatible with different automated analyzers (and often reported by analyzer name) were placed in the same FIT family. We attempted to report test cutoff values in μg Hb/g feces because values expressed in these units are more comparable between tests.116
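FIT cutoffs are often reported by manufacturers as a hemoglobin concentration in the sampling buffer (ng Hb/mL buffer), which depends on how much stool the device collects and how much buffer it holds. As a hedged illustration of why μg Hb/g feces is more comparable, the sketch below applies the commonly used conversion from buffer concentration to stool concentration. The function name and the device parameters are hypothetical and do not describe any specific test in this review.

```python
def cutoff_ug_hb_per_g_feces(ng_hb_per_ml_buffer: float,
                             buffer_volume_ml: float,
                             feces_mass_mg: float) -> float:
    """Convert a FIT cutoff from ng Hb/mL buffer to ug Hb/g feces.

    Total Hb in the device (ng) = concentration x buffer volume.
    Dividing ng by mg of feces gives ng/mg, which equals ug/g.
    """
    if buffer_volume_ml <= 0 or feces_mass_mg <= 0:
        raise ValueError("buffer volume and feces mass must be positive")
    return ng_hb_per_ml_buffer * buffer_volume_ml / feces_mass_mg


# Hypothetical device: 2.0 mL of buffer and a probe that collects 10 mg of feces.
cutoff = cutoff_ug_hb_per_g_feces(100.0, buffer_volume_ml=2.0, feces_mass_mg=10.0)
print(f"{cutoff} ug Hb/g feces")
```

Two devices with the same ng/mL cutoff can therefore correspond to different stool hemoglobin concentrations when their buffer volumes or sample masses differ.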

In support of accompanying microsimulation models, we conducted additional pooled analyses. These pooled analyses are located in Appendix F and include studies identified at an interim phase of the review (literature identified through January 2019).

Key Question 3

We organized our synthesis into four main categories, all for direct visualization tests: 1) harms from screening FS and colonoscopy; 2) harms from colonoscopy following an abnormal screening test; 3) harms from CTC, including radiation exposure and extracolonic findings; and 4) harms from capsule endoscopy. We did not hypothesize any serious harms for stool- or blood/serum-based screening tests beyond those from followup testing (i.e., colonoscopy following an abnormal screening test).

We primarily synthesized results qualitatively using summary tables to allow for comparisons of studies. When possible, we conducted quantitative analyses for serious harms, including major bleeding and perforation, for colonoscopy or FS. We defined major bleeding as any bleeding that required medical attention or intervention (e.g., emergency visit, hospitalization, transfusion, endoscopic management, surgery), or defined/reported as “major” or “serious” by the individual study. Using Stata version 16, we conducted random-effects meta-analyses using the DerSimonian and Laird method to estimate rates of serious adverse events. We assessed the presence of statistical heterogeneity among the studies using the I² statistic. Quantitative analyses were not performed for other serious adverse events, as they were not routinely or consistently reported or defined.
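The DerSimonian and Laird method mentioned above can be sketched as follows. This illustration pools raw event proportions with a binomial variance, which is an assumption: the text does not state which scale or transformation was used for the rates. Function names and event counts are hypothetical, and studies with zero events would need separate handling because their estimated variance is zero.

```python
import math


def rates_and_variances(events, totals):
    """Per-study event rate and its binomial variance p(1 - p) / n."""
    rates = [e / n for e, n in zip(events, totals)]
    variances = [p * (1.0 - p) / n for p, n in zip(rates, totals)]
    return rates, variances


def dersimonian_laird(estimates, variances):
    """Random-effects pooled estimate using the DerSimonian-Laird tau^2.

    Returns (pooled estimate, standard error, tau^2).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sw - sum(wi * wi for wi in w) / sw
    # Method-of-moments estimate, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sw_star
    return pooled, math.sqrt(1.0 / sw_star), tau2


# Hypothetical perforation counts per study and number of procedures.
events = [4, 30, 12]
totals = [10_000, 20_000, 15_000]

rates, variances = rates_and_variances(events, totals)
pooled, se, tau2 = dersimonian_laird(rates, variances)
print(f"pooled rate per 10,000 procedures: {pooled * 10_000:.2f} (tau^2={tau2:.3e})")
```

When the studies agree closely, Q falls below its degrees of freedom, tau² is truncated to zero, and the result reduces to the fixed-effect inverse-variance estimate.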

Grading the Strength of the Body of Evidence

We graded the strength of the overall body of evidence for each KQ. We adapted the Evidence-based Practice Center (EPC) approach,117 which is based on a system developed by the Grading of Recommendations Assessment, Development and Evaluation Working Group.118 Our method explicitly addresses four of the five EPC-required domains: consistency (similarity of effect direction and size), precision (degree of certainty around an estimate), reporting bias (potential for bias related to publication, selective outcome reporting, or selective analysis reporting), and study quality (i.e., study limitations). We did not address the fifth required domain—directness—as it is implied in the structure of the KQs (i.e., pertains to whether the evidence links the interventions directly to a health outcome).

Consistency was rated as reasonably consistent, inconsistent, or not applicable (e.g., single study). Precision was rated as reasonably precise, imprecise, or not applicable (e.g., no evidence). The body-of-evidence limitations reflect potential reporting bias, study quality, and other important restrictions in answering the overall KQ (e.g., lack of replication of interventions, nonreporting of outcomes important to patients).

We graded the overall strength of evidence as high, moderate, or low. “High” indicates high confidence that the evidence reflects the true effect and that further research is very unlikely to change our confidence in the estimate of effects. “Moderate” indicates moderate confidence that the evidence reflects the true effect and that further research may change our confidence in the estimate of effect and may change the estimate. “Low” indicates low confidence that the evidence reflects the true effect and that further research is likely to change our confidence in the estimate of effect and is likely to change the estimate. A grade of “insufficient” indicates that evidence is either unavailable or does not permit estimation of an effect. We developed our overall strength-of-evidence grade based on consensus discussion involving at least two reviewers.

Expert Review and Public Comment

The draft Research Plan was posted on the USPSTF Web site for public comment from January 3 to January 30, 2019. In response to public comment, the USPSTF modified the analytic framework to be more consistent with USPSTF methodology and to indicate which screening tests have conditional approval from the U.S. Food and Drug Administration. The USPSTF also added urine-based tests as a screening method. Additionally, in the inclusion and exclusion criteria, the USPSTF revised the language to distinguish between the cancer location (proximal or distal colon or rectum) and added SSL as an outcome of interest for test accuracy studies. The USPSTF made no other substantive changes that altered the scope of the review.

A draft version of this report was reviewed by content experts, representatives of Federal partners, USPSTF members, and AHRQ Medical Officers. Reviewer comments were presented to the USPSTF during its deliberations and subsequently addressed in revisions of this report. Additionally, a draft of the full report was posted on the USPSTF Web site from October 27 through November 24, 2020. All comments and suggested citations were considered; minor editorial changes were made to the report based on these comments (e.g., inclusion of more detail on differences by race/ethnicity, provision of absolute numbers in addition to relative findings when possible, updated citations to the background and discussion sections) but no substantive changes were made to the included evidence, our interpretation of the evidence, or to our conclusions.

USPSTF Involvement

The authors worked with five USPSTF liaisons at key points throughout the review process to develop and refine the analytic framework and key questions and to resolve issues around scope for the final evidence synthesis.

This research was funded by the Agency for Healthcare Research and Quality (AHRQ) under a contract to support the work of the USPSTF. AHRQ staff provided oversight for the project, coordinated systematic review work with decision models, reviewed the draft report, and assisted in an external review of the draft evidence synthesis.
