Methodological practices followed in this review were derived from AHRQ “Methods Guide for Effectiveness and Comparative Effectiveness Reviews”70 (hereafter Methods Guide) and AHRQ “Methods Guide for Medical Test Reviews.”71

Topic Development and Refinement

Key Questions (KQs) were reviewed and refined as needed by the Evidence-based Practice Center (EPC) with input from Key Informants and the Technical Expert Panel (TEP) to ensure that the questions were specific and explicit about what information was being reviewed. In addition, for Comparative Effectiveness Reviews, the KQs were posted for public comment and finalized by the EPC after review of the comments.

Key Informants are the end users of research, including patients and caregivers, practicing clinicians, relevant professional and consumer organizations, purchasers of health care, and others with experience in making health care decisions. Within the EPC program, the Key Informant role is to provide input into identifying the KQs for research that will inform health care decisions. The EPC solicits input from Key Informants when developing questions and an analytic framework for the systematic review or when identifying high priority research gaps and needed new research. The Key Informants selected to work on PCA3 included individuals with expertise in urology, pathology, laboratory medicine, internal medicine, family medicine, clinical trial design, as well as a patient advocate. The experts selected for the Technical Expert Panel to provide expertise and perspectives specific to the topic included individuals with expertise in urology, pathology, laboratory medicine, internal medicine, family medicine, clinical trial design and statistics. Technical Experts provided information to the EPC to identify literature search strategies and recommended approaches to specific issues as requested by the EPC.

Key Informants and Technical Experts are not involved in analyzing the evidence or writing the report and have not reviewed the report, except as given the opportunity to do so through the peer or public review mechanism. Key Informants and Technical Experts were required to disclose any financial conflicts of interest greater than $10,000, and any other relevant business or professional conflicts of interest. Individuals were invited to serve as Key Informants or Technical Experts because of their role as end users or their unique clinical or content expertise; those who presented without potential conflicts were retained. The AHRQ Task Order Officer (TOO) and the EPC worked to balance, manage, or mitigate any potential conflicts of interest identified.

Literature Search Strategy

The research librarian, in collaboration with the review team, developed and implemented search strategies designed to identify articles relevant to each KQ. Abstracts from selected recent professional meetings were also identified and followed up to identify subsequent publications and provide insight into types of data relevant to gaps in knowledge; abstract review was not used to assess publication bias. Details on strategies with full search strings are presented in Appendix A. The search was limited to English-language articles or articles in other languages for which the journal provided an English translation. The rationale for this decision is that this EPC's experience demonstrated that non-English references did not yield information of sufficiently high quality to justify the resources needed for translation. In addition, studies have demonstrated that excluding non-English language studies has little impact on effect size estimates or conclusions relative to the resources required.72,73 Systematic reviews/meta-analyses were identified through the MEDLINE® searches and grey literature searches. Bibliographies of included articles were hand-searched to ensure complete identification of relevant studies. The timeframe for the search was limited to literature published after January 1, 1990 based on FDA approval of the tPSA test for early detection of prostate cancer in 1993.

  • MEDLINE® (January 1, 1990 to August 9, 2011)
  • Embase® (January 1, 1990 to August 15, 2011)
  • Cochrane Central Register of Controlled Trials (no date restriction)

Search results were stored in a project-specific EndNote® 9 database that was subsequently uploaded into DistillerSR (Evidence Partners Inc., Ottawa, Ontario, Canada), a web-based systematic review software application. Two independent reviewers used the DistillerSR software to determine study eligibility. Using selection criteria for screening abstracts, the two reviewers marked each abstract as: 1) yes (eligible for review of the full article); 2) no (ineligible for review); or 3) uncertain (review the full article to resolve eligibility). Reviewer discrepancies were resolved by discussion and consensus opinion; a third reviewer was consulted as needed. When abstracts were unavailable or unclear, full-text articles were obtained for review.

Using study selection criteria and the DistillerSR software, a single reviewer read each full-text article and determined eligibility of the study for data abstraction. A second reviewer audited a subset of articles, and reviewed all articles marked as uncertain. Discrepancies were resolved by discussion and consensus opinion; a third reviewer was consulted as needed. Key reasons for excluding studies were captured by DistillerSR and Excel® spreadsheet. Each paper retrieved in full-text, but excluded from the review, is listed in Appendix B with reasons for exclusion.

An updated search of the published literature through May 15, 2012 was conducted upon submission of the draft report to determine if new information had been published since completion of the previous search (see Appendix A, Addendum). In addition, the Technical Expert Panel and individuals and organizations providing peer review were asked to inform the project team of any studies relevant to the KQs that were not included in the draft list of selected studies.

Study Selection

Studies were included if they fulfilled the following criteria:

  • Study was a randomized controlled trial, a matched comparative study (e.g., prospective or retrospective cohort, diagnostic accuracy and case-control studies), or a systematic review of matched comparative studies. Matched studies were defined as those performed in comparable clinical settings that provided test results and estimates of diagnostic performance for PCA3 and at least one other comparator (e.g., %fPSA) from the same patient population. A study of PCA3 alone, or a comparator alone, would not be included. Note that systematic reviews of unmatched studies were initially retained in DistillerSR (but not extracted) based on potential usefulness in two areas: 1) providing references that might identify additional studies of PCA3; and 2) as sources of more broadly based unmatched data on performance characteristics of PCA3 and comparators (i.e., to compare with results based on smaller numbers of subjects in the primary matched studies of %fPSA, to determine if the results are consistent or inconsistent).
  • Study subjects were adult males with elevated total PSA tests and/or abnormal DRE who have not had a prostate biopsy or who have had one or more prostate biopsies (KQ 1 and 2), OR adult male patients with prostate cancer-positive biopsies (KQ 3).
  • Study intervention included testing for PCA3 and at least one designated pretreatment standard comparator test for prostate cancer, and a prostate biopsy (6 core minimum) or radical prostatectomy (KQ 3 only).
  • Study comparators for KQ 1 and 2 were standard validated tests for prostate cancer that included tPSA, %fPSA, PSA velocity and doubling time, PSA density, complexed PSA and externally validated nomograms/risk assessment programs. For KQ 3, comparators included Gleason score, pathological staging, other pathological tumor characteristics and tumor volume.
  • Study outcomes included intermediate outcomes (e.g., diagnostic accuracy for prostate cancer, impact on biopsy decisionmaking), long-term outcomes (e.g., mortality, morbidity, function, quality of life) and potential harms (e.g., adverse effects of biopsy, misdiagnosis) (KQ 1 and 2).
  • Study outcomes included the intermediate outcomes of diagnostic accuracy for tumor risk category (i.e., insignificant/low risk, aggressive/high risk) and impact on decisionmaking about active surveillance versus aggressive treatment, as well as long-term outcomes (e.g., mortality, morbidity, function, quality of life) and potential harms (e.g., adverse effects of treatment, misdiagnosis) (KQ 3).

Studies were excluded if they fulfilled at least one of the following criteria:

  • Did not study prostate cancer.
  • Did not address one or more of the KQs.
  • Were published in a non-English language for which the journal did not provide a translation.
  • Were published as a meeting abstract.
  • Did not use a relevant study design.
  • Did not report primary data.
  • Did not report relevant outcomes.

Search Strategies for Grey Literature

A systematic search of grey literature sources was undertaken to identify unpublished studies, or studies published in journals that are not indexed in major bibliographic citation databases, in accordance with guidance from the Effective Health Care Scientific Resource Center. The detailed search strategies and results can be found in Appendix C. Briefly, the searches included: regulatory information (i.e., FDA); clinical registries; abstracts and papers from professional annual meetings and conferences; organizations publishing guidance or review documents (e.g., National Guideline Clearinghouse, Cochrane, National Institute for Clinical Excellence); grants and federally funded research; and manufacturer web sites.

Search strategies were similar to those used in bibliographic databases, except for the following:

  • Regulatory information: The FDA website was searched for PMA and 510(k) decision summary documents related to urine PCA3 mRNA assays.
  • Clinical registries, NIH RePORTER, HSRPROJ, and AHRQ GOLD: Searches were limited to completed studies only.
  • Abstracts and conference articles published prior to 2009 were excluded.

Data Extraction and Management

The data elements from included studies were extracted using DistillerSR software into standard data formats and tables by one reviewer, and were subject to a full quality review for accuracy and completeness by a second reviewer. Data extraction question formats and tables were pilot-tested for completeness on a group of selected studies, and revised as necessary before full data extraction began. Project staff met regularly to discuss the results at each phase, review studies that were difficult to classify and/or abstract, and to address any questions raised by team members.

Data Elements

Data elements extracted from the selected studies were defined in consultation with the TEP. A detailed list can be found in Appendix B, and the corresponding database fields in the DistillerSR Data Extraction Forms in Appendix I.

Evidence Tables

DistillerSR reports were created that contained content for specific evidence tables and downloaded into Excel® spreadsheets for editing. Final tables were formatted in Microsoft Word®. Primary reporting of DistillerSR data elements for each evidence table was done by one person; a second person reviewed articles and evidence tables for accuracy. Disagreements were resolved by discussion and, if necessary, by consultation with a third reviewer. When small differences occurred in quantitative estimates of data from published figures, the values were obtained by averaging the two reviewers' estimates.

Individual Study Quality Assessment

Definition of Ratings for Individual Studies and Reviews

In adherence with the Methods Guide,70 the methodological quality of individual comparative studies was graded using study design-specific criteria. In all cases, quality of individual studies and the overall body of evidence was assessed by two independent reviewers. Discordant decisions were resolved through discussion or third-party adjudication. Quality assessments were summarized for each study and recorded in tables. Criteria for assessing quality of nonrandomized comparative intervention studies and quality rating definitions74,75 can be found in Appendix E. The quality of diagnostic accuracy studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool76 that included the following 14 questions:

  1. Was the spectrum of patients representative of the patients who will receive the test in practice?
  2. Were the selection criteria clearly described?
  3. Is the reference standard likely to classify the target condition correctly?
  4. Is the period between the reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests?
  5. Did the whole sample or a random selection of the sample receive verification by using a reference standard of diagnosis?
  6. Did patients receive the same reference standard regardless of the index test result?
  7. Was the reference standard independent of the index test (i.e., the index test did not form part of the reference standard)?
  8. Was the execution of the index test described in sufficient detail to permit replication of the test?
  9. Was the execution of the reference standard described in sufficient detail to permit replication of the reference standard?
  10. Were the index test results interpreted without knowledge of the results of the reference standard?
  11. Were the reference standard results interpreted without knowledge of the results of the index test?
  12. Were the same clinical data available when the test results were interpreted as would be available when the test is used in practice?
  13. Were uninterpretable/intermediate test results reported?
  14. Were withdrawals from the study explained?

For KQ 1 and 2, the index test was PCA3 and the reference standard was biopsy. However, because selection of the screening positive populations was largely based on levels of tPSA, it was necessary to also consider QUADAS question 11 for tPSA to assess the potential for verification bias. This additional criterion was added at the end of the QUADAS questions (Table F-1, Appendix F), along with an entry to indicate whether verification bias was identified or suspected (response, Yes or No). Because measurement of specific clinical outcomes was needed to assess diagnostic accuracy for KQ 3, the additional criterion of clinical followup was added to the QUADAS questions (Table F-2, Appendix F).

The QUADAS ratings were summarized into general quality classes (from Paper 5, Table 5-4, AHRQ Test Review Guide71):

  • Good - No major features that risk biased results.
  • Fair - Susceptible to some bias, but flaws not sufficient to invalidate the results.
  • Poor - Significant flaws that imply bias of various types which may invalidate the results.

Measuring Outcomes of Interest

Several factors supported the likelihood that most included studies would focus on the predictive performance (e.g., clinical sensitivity and specificity, positive and negative predictive values for positive biopsy) of the PCA3 test. These factors included: the relatively short length of time that the PCA3 test, particularly the latest generation test, has been available; the comparative ease of conducting studies in which the end point is biopsy; and the length of time needed to collect long-term clinical outcomes related to the subsequent impact of interventions (e.g., active surveillance, treatment) related to the use of the test (as compared with no PCA3 testing or testing for other biomarkers). We expected that studies would provide a 2×2 table for PCA3 and other comparators, both for those subjects with positive biopsies and for those with negative biopsies (i.e., a matched analysis). In this way, one could evaluate not only the total performance of each test, but how the performance of the two tests varied in the population. For example, two tests could be shown to have equal sensitivity, but a matched analysis would indicate how often the two tests identified the same men with positive and negative biopsy results, and how often (and in what cases) they disagreed.
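
The distinction between pooled and matched comparisons can be sketched as follows. All data and function names below are hypothetical, constructed only to illustrate why a matched analysis carries more information than two separate 2×2 tables:

```python
# Illustrative sketch (hypothetical data): a matched analysis of two tests
# (e.g., PCA3 and a comparator) applied to the same patients, with biopsy
# as the reference standard. 1 = test positive; True = biopsy positive.

def sensitivity(results, truth):
    """Fraction of biopsy-positive patients the test calls positive."""
    pos = [r for r, t in zip(results, truth) if t]
    return sum(pos) / len(pos)

def specificity(results, truth):
    """Fraction of biopsy-negative patients the test calls negative."""
    neg = [r for r, t in zip(results, truth) if not t]
    return sum(1 - r for r in neg) / len(neg)

def discordance(a, b, truth, positive=True):
    """Among patients with the given biopsy status, how often the two tests
    disagree -- information that pooled (unmatched) comparisons lose."""
    pairs = [(x, y) for x, y, t in zip(a, b, truth) if t == positive]
    return sum(1 for x, y in pairs if x != y) / len(pairs)

# Hypothetical results for 8 men (first 4 biopsy positive).
biopsy = [True, True, True, True, False, False, False, False]
pca3   = [1, 1, 0, 1, 0, 1, 0, 0]
tpsa   = [1, 0, 1, 1, 1, 0, 0, 0]

print(sensitivity(pca3, biopsy), sensitivity(tpsa, biopsy))  # 0.75 0.75
print(discordance(pca3, tpsa, biopsy))  # 0.5: equal sensitivity, different men
```

In this toy example the two tests have identical sensitivity, yet among biopsy-positive men they disagree half the time, which is exactly the kind of information a matched 2×2 analysis would reveal.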

Two other intermediate outcomes for which data were sought were the impact of testing on physician and patient decisionmaking regarding biopsy and its potential harms (e.g., pain, bleeding, infection) and active surveillance versus treatment. Such data could be collected as followup to biopsy via records review or by conducting surveys of physicians and patients. Use of surveys requires particular attention to uptake rates and the reliability, validity and disease-specificity of survey instruments.

Long-term outcomes or study endpoints (e.g., 7-15 years) of interest include mortality and survival, morbidity and clinical and biochemical failure.3 All-cause mortality at different timeframes is reliable, but not a sensitive measure because it is dependent on age distribution and because most prostate cancer patients do not die of the disease. More sensitive measures are prostate cancer-specific 10-year survival or mortality if the cause of death is clear. Clinical failure may be measured as development of symptomatic disease, local disease progression or metastatic disease. Biochemical failure relates to increasing levels of total PSA (e.g., greater than 0.2 ng/mL) that may indicate disease recurrence. Morbidity also includes treatment-related adverse events (e.g., urinary incontinence, impotence) and other harms, as well as quality of life (QOL). Again, measuring QOL and the personal impact of symptoms related both to the cancer and to therapy requires the use of reliable and validated survey instruments. Minimally, assessment of QOL involves the use of a generic instrument to measure overall wellbeing, and a disease-specific instrument that focuses on specific symptoms and functions (e.g., incontinence, impotence).

Data Synthesis

After initial review of the extracted data from included studies, the analysis plan was finalized. Although only matched studies were included, no true matched analyses were reported; studies instead used a wide variety of methods for comparing results. We therefore summarized the pair-wise relative performance of PCA3 scores versus comparator results by computing the difference between the paired estimates within each study and summarizing these differences across studies. Five separate analyses were designed:

  1. A comparison of area under the ROC curve (or AUC);
  2. Estimates of parameters defining the positive versus the negative biopsy populations;
  3. Performance of PCA3 at a common cutoff score of 35;
  4. Comparison of the ROC curves over a wide range of specificities/sensitivities; and
  5. Results from logistic regression analysis.

As an example, consider a study reporting on a cohort of men age 50 or older who have a prostate biopsy and tPSA and PCA3 testing. For the first analysis, the AUC for tPSA was subtracted from the AUC for PCA3, resulting in the “difference of AUCs.” This comparison is an unbiased estimate of effect size differences. The next retrieved study is analyzed in the same way, and the two differences are then compared for consistency across studies. This is repeated for all relevant studies, and then repeated for each of the five analyses. The entire process is then repeated for each comparator.
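
The first analysis can be sketched numerically. The code below is a hypothetical illustration (all scores invented), using the Mann-Whitney interpretation of the AUC, that is, the probability that a randomly chosen biopsy-positive man scores higher than a randomly chosen biopsy-negative man:

```python
# Sketch of the "difference of AUCs" computation, with hypothetical data.

def auc(cases, controls):
    """Empirical AUC: P(case score > control score); ties count as 1/2."""
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical PCA3 scores and tPSA values for the same men,
# split by biopsy result (positive vs. negative).
pca3_pos, pca3_neg = [55, 40, 62, 30], [20, 38, 15, 25]
tpsa_pos, tpsa_neg = [6.1, 4.8, 9.0, 4.2], [5.5, 4.5, 3.9, 6.8]

# "Difference of AUCs" for this one study; the same quantity would be
# computed for each retrieved study and compared for consistency.
diff = auc(pca3_pos, pca3_neg) - auc(tpsa_pos, tpsa_neg)
print(diff)
```

Because both AUCs come from the same men, each study contributes a single paired difference, which is what is then examined for consistency across studies.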

Due to the small number of relevant matched studies for most comparators, heterogeneity of results could only be explored for the PCA3/tPSA comparison. This included stratification by studies including men with all elevations of tPSA versus those focusing on the “grey zone” of borderline tPSA elevations. The analysis of tPSA was complicated by the presence of partial verification bias in all of the studies. We relied on published results and in-house modeling in an attempt to account for this bias, as original data were not available to use published correction methods.77,78
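
To make the bias concrete, the sketch below illustrates one standard style of correction for partial verification (in the spirit of the published methods cited above): when only a fraction of test-negative men are biopsied, naive sensitivity is inflated, and reweighting verified counts by the inverse of the verification rate recovers a corrected estimate. All counts and verification fractions here are invented for illustration:

```python
# Hypothetical sketch of an inverse-verification-rate correction for
# partial verification bias. Counts are invented, not from any study.

def corrected_sens_spec(tp, fp, fn, tn, verif_pos, verif_neg):
    """tp/fp/fn/tn: counts among *verified* (biopsied) men.
    verif_pos/verif_neg: fraction of test-positive/test-negative men verified.
    Assumes verification depends only on the test result."""
    TP, FP = tp / verif_pos, fp / verif_pos
    FN, TN = fn / verif_neg, tn / verif_neg
    return TP / (TP + FN), TN / (TN + FP)

# All test-positive men biopsied, but only 20% of test-negative men:
sens, spec = corrected_sens_spec(tp=80, fp=120, fn=10, tn=90,
                                 verif_pos=1.0, verif_neg=0.2)
naive_sens = 80 / (80 + 10)   # ~0.889, inflated by verification bias
print(round(sens, 3), round(naive_sens, 3))  # corrected ~0.615 vs. naive 0.889
```

Applying such a correction requires the verification fractions by test result, which is why the absence of original study data forced reliance on published results and in-house modeling.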

Modeling of PCA3 and tPSA performance could provide: 1) sensitivities of PCA3 and tPSA at set false positive rates and for a range of cutoffs, as well as a comparison of the number of additional cases of prostate cancer detected by the better performing marker; and 2) specificities of PCA3 and tPSA at set sensitivities and for a range of cutoffs, as well as a comparison of the number of false positives avoided by the better performing marker. The model would need to be anchored by two important findings: first, that the ROC curves for tPSA (and for PCA3) were not influenced by the partial verification bias; and second, that PCA3 and tPSA are essentially independent markers. Sets of parameters (distribution descriptors such as means and standard deviations for PCA3 and comparators in both biopsy negative and positive men) derived from studies would need to fit the relevant ROC curve. A more detailed explanation of the methods used for performance modeling can be found in Appendix J. One aim of such modeling would be to more reliably explore the comparison of prostate cancer markers and assist in providing methods to more fully inform decisionmaking by men and their health care providers.
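
A minimal version of such distribution-based modeling can be sketched as follows, assuming (purely for illustration) a binormal model in which log-transformed marker scores are normally distributed in biopsy-negative and biopsy-positive men. All parameter values below are hypothetical, not those fit in Appendix J:

```python
# Minimal binormal sketch: given hypothetical distribution parameters for
# biopsy-negative and biopsy-positive men, derive sensitivity at fixed
# false positive rates and compare the two markers.

from statistics import NormalDist

def sensitivity_at_fpr(neg, pos, fpr):
    """Set the cutoff so that `fpr` of biopsy-negative men test positive;
    return the fraction of biopsy-positive men above that cutoff."""
    cutoff = neg.inv_cdf(1 - fpr)   # threshold on the (log) score scale
    return 1 - pos.cdf(cutoff)

# Hypothetical (mean, SD) of log-transformed scores in each population.
pca3_neg, pca3_pos = NormalDist(3.0, 0.8), NormalDist(4.0, 0.9)
tpsa_neg, tpsa_pos = NormalDist(1.5, 0.5), NormalDist(1.8, 0.6)

for fpr in (0.10, 0.20, 0.30):
    gain = (sensitivity_at_fpr(pca3_neg, pca3_pos, fpr)
            - sensitivity_at_fpr(tpsa_neg, tpsa_pos, fpr))
    print(f"FPR {fpr:.2f}: sensitivity gain for PCA3 = {gain:+.1%}")
```

At each fixed false positive rate the sensitivity gain translates directly into additional cancers detected per 100 biopsy-positive men, which is the kind of output the modeling described above is meant to provide.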

Based on the limited number of studies identified that addressed KQ 3, we anticipated focusing on a qualitative analysis (e.g., descriptive narrative, summary tables, identification of themes in content). Identification of more than one matched study in comparable populations, tested for PCA3 and one or more selected comparators, and reporting on the same intermediate or long-term clinical outcomes appeared to be unlikely.

Grading the Body of Evidence

The strength of evidence for primary outcomes was graded by using the standard process of the Evidence-based Practice Centers as outlined in the Methods Guide.70 The method is based on a system developed by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group, and addresses four specific domains: risk of bias, consistency of effect sizes and direction of effect, directness of the link between evidence and health outcomes, and precision (or degree of certainty) of an effect estimate for a given outcome.79 Additional domains (e.g., strength of association, dose-response relationship, plausible confounding, publication bias) can be assessed and reported if applicable based on the results of the evidence review. For this CER, grading was not limited to the KQs, but was applied to each outcome for each PCA3 comparator.

Based on the four required domains, each of these bodies of evidence was classified into one of four grade categories:

  • High - High confidence that the evidence reflects the true effect. Further research is unlikely to change our confidence in the estimate of effect.
  • Moderate - Moderate confidence that evidence reflects the true effect. Further research may change our confidence in the estimate of effect, or could change the estimate of effect.
  • Low - Low confidence that the evidence reflects the true effect. Further research is likely to change our confidence in the estimate of effect and is likely to change the estimate.
  • Insufficient - Evidence either is unavailable or does not permit a conclusion.

The GRADE ratings were determined by independent reviewers, and disagreements were resolved by consensus as necessary.

Assessment of Applicability

Applicability of the results presented in this review was assessed in a systematic manner using the PICOTS framework (Population, Intervention, Comparison, Outcome, Timing, Setting). Assessment included both the design and execution of the studies, and their relevance with regard to target populations, interventions and outcomes of interest.

Peer Review and Public Commentary

Peer reviewers and the public were invited to provide written comments on the draft report content based on their clinical and methodological expertise. Peer review comments on the preliminary draft were considered by the EPC in preparation of the final draft of the report. Peer reviewers did not participate in writing or editing of the final report or other products. The synthesis of the scientific literature presented in the final report did not necessarily represent the views of individual reviewers. The dispositions of the peer review comments will be documented and published three months after the publication of the evidence report. Potential reviewers were required to disclose any financial conflicts of interest greater than $10,000, and any other relevant business or professional conflicts of interest. Peer reviewers who disclosed potential business or professional conflicts of interest were able to submit comments on draft reports through the public comment mechanism.