Sun F, Bruening W, Erinoff E, et al. Addressing Challenges in Genetic Test Evaluation: Evaluation Frameworks and Assessment of Analytic Validity [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2011 Jun.


Results

As described in the Introduction, the four objectives of the report fall into two categories: (1) evaluation frameworks and (2) analytic validity. The first category (objective 1) overarches all levels of genetic test evaluation, including analytic validity, clinical validity, clinical utility, and societal impact. The second category (objectives 2, 3, and 4) focuses only on analytic validity issues. We have organized this chapter according to these two categories of objectives.

Evaluation Frameworks

Key Question 1. Is it Feasible to Clarify a Comprehensive Framework or a Set of Frameworks for Evaluating Genetic Tests?

To answer Key Question 1, we addressed a series of related issues in a sequential fashion. These issues include:

  1. Define evaluation frameworks.
  2. Identify major evaluation frameworks already developed.
  3. Identify the unique needs of different stakeholders for evaluation frameworks.
  4. Determine whether it is feasible to clarify or develop a comprehensive framework or set of frameworks that would meet the needs of all stakeholders.
    This determination was made by consensus of the panel experts and the ECRI Institute research team. Key factors considered included a thorough evaluation of the different needs of the key stakeholders.
  5. Determine whether it is feasible to clarify a comprehensive framework or a set of frameworks by modifying existing frameworks that would fit different testing scenarios (e.g., diagnosis, prognostic evaluation, screening for heritable conditions, and pharmacogenetics).
    This determination was made by the consensus of the panel experts and ECRI Institute research team. We considered whether some common principles are shared when tests are evaluated for different clinical scenarios and what other groups (e.g., the U.S. Preventive Services Task Force [USPSTF] and the Evaluation of Genomic Applications in Practice and Prevention [EGAPP] Working Group) had previously achieved in this area. The ECRI Institute research team presented a set of frameworks adapted from existing frameworks during the exploratory process and examined how well these frameworks applied to the common testing scenarios.

What are Evaluation Frameworks?

It is common practice in health technology assessment to lay out a framework for evaluating evidence regarding the intervention of interest. An evaluation (or “organizing”) framework for medical test assessment serves the purpose of clarifying the scope of the assessment and the types of evidence necessary for addressing various aspects of test performance and their consequences. Some evaluation frameworks (e.g., the Fryback-Thornbury model discussed in the following section) only provide general conceptual guidance to the evaluators or reviewers. Other types of evaluation frameworks (often referred to as analytic frameworks) provide additional detail on the key questions (e.g., the relevant populations, interventions, comparators, outcomes, time points and settings [PICOTS]) and depict the evaluation process graphically. Examples of analytic frameworks include the USPSTF framework and the EGAPP frameworks, which will be discussed in the following section.
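
To illustrate what a PICOTS specification for a key question might look like, here is a minimal Python sketch (ours, not part of the original report); the population echoes the EGAPP pharmacogenetic example discussed later in this chapter, and every value is hypothetical.

    # Hypothetical PICOTS specification for one key question about a
    # pharmacogenetic test; every value below is illustrative only.
    picots = {
        "population":   "Adults with non-psychotic depression entering SSRI therapy",
        "intervention": "CYP450 genotyping to guide drug and dose selection",
        "comparator":   "Usual prescribing without genotyping",
        "outcomes":     ["remission rate", "adverse drug events", "quality of life"],
        "timing":       "Outcomes assessed at 8 weeks and 6 months",
        "settings":     "Outpatient primary care and psychiatry clinics",
    }

    for element, value in picots.items():
        print(f"{element:>12}: {value}")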

Evaluation frameworks represent systematic thinking about the evaluation of a health care technology and provide guidance to the evaluators for specifying key research questions and for collecting, evaluating and organizing the relevant evidence. In this report, when we discuss the feasibility of proposing a set of evaluation frameworks for genetic tests, we are focused on analytic frameworks. However, we first performed a review of conceptually-oriented evaluation frameworks, particularly in the historical overview section, since these frameworks provided conceptual foundations for practice-driven analytic frameworks.

Existing Evaluation Frameworks

The ECRI Institute Evidence-based Practice Center (EPC) team performed a comprehensive literature search to identify existing evaluation frameworks that had been developed or used for evaluating laboratory tests. This review built on a White Paper by Jeroen G. Lijmer, M.D., Ph.D., Mariska Leeflang, Ph.D., and Patrick M.M. Bossuyt, Ph.D., that was presented at a meeting held at the Agency for Healthcare Research and Quality (AHRQ) on May 28 and 29, 2008.14 The detailed search strategy is provided in Appendix A. Our search identified multiple evaluation frameworks for clinical tests. Many of these frameworks are conceptually similar and were based on other frameworks that were developed earlier. The project team summarized these frameworks and provided them to the Workgroup with a historical overview of the different approaches to laboratory test evaluation.

A Historical Overview

Our current approaches to evaluating diagnostic tests have evolved from work done in the mid-twentieth century. Writing in 1947, Yerushalmy presented a paper comparing the “effectiveness for tuberculosis case finding” of different x-ray imaging devices.15 In this paper, Dr. Yerushalmy drew attention to the concepts of sensitivity and specificity for evaluation of diagnostic tests.

In 1959, Ledley and Lusted applied probability theory to diagnostic problems, using Bayes's theorem to elucidate the utility of tests in clinical decisionmaking.16 Green and Swets applied signal detection theory (developed in the study of radar systems used in World War II) to medical diagnostic tests. These radar systems required interpretation of output from a receiver that potentially indicated the presence of an incoming missile. Just as the radar screen displayed both true signals of incoming missiles as well as “noise,” the diagnostic test presents both disease signals and noise.17
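
As a minimal sketch of this use of Bayes's theorem (our illustration, not drawn from the cited works), the following Python function updates an assumed pre-test probability of disease into a post-test probability, given an assumed test sensitivity and specificity; all numbers are hypothetical.

    def post_test_probability(pre_test, sensitivity, specificity, positive_result=True):
        """Update the probability of disease after a test result using Bayes's theorem."""
        if positive_result:
            true_pos = sensitivity * pre_test
            false_pos = (1 - specificity) * (1 - pre_test)
            return true_pos / (true_pos + false_pos)
        false_neg = (1 - sensitivity) * pre_test
        true_neg = specificity * (1 - pre_test)
        return false_neg / (false_neg + true_neg)

    # Hypothetical numbers: 10% pre-test probability, 95% sensitivity, 90% specificity.
    print(round(post_test_probability(0.10, 0.95, 0.90, positive_result=True), 3))   # 0.514
    print(round(post_test_probability(0.10, 0.95, 0.90, positive_result=False), 3))  # 0.006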

Swets noted that “the two kinds of correct outcome are, respectively, hits and correct rejections; the two incorrect outcomes are, respectively, false alarms and misses.”18 This work led to the use of “receiver operating characteristic” curves to describe the relationship between sensitivity and specificity across different thresholds for deciding whether a given signal represented “truth” or “noise.” Swets pointed out that the “fidelity” of the system in representing the signals and the “consistency” across repeat judgments by a single interpreter or across interpreters would impact the value of test information in practice, that is, its “efficacy.”18
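
To make this threshold-dependent trade-off between sensitivity and specificity concrete, the short sketch below (ours, with invented scores and disease labels) tabulates the points of a receiver operating characteristic curve:

    # Minimal ROC illustration using made-up continuous test scores.
    # Each point on the curve is (1 - specificity, sensitivity) at one threshold.
    scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
    truth  = [0,   0,   0,    1,   0,    1,   1,   1,   1,   1]   # 1 = disease present

    def roc_points(scores, truth):
        points = []
        for threshold in sorted(set(scores), reverse=True):
            calls = [s >= threshold for s in scores]
            tp = sum(c and t for c, t in zip(calls, truth))
            fp = sum(c and not t for c, t in zip(calls, truth))
            fn = sum((not c) and t for c, t in zip(calls, truth))
            tn = sum((not c) and (not t) for c, t in zip(calls, truth))
            sensitivity = tp / (tp + fn)
            specificity = tn / (tn + fp)
            points.append((1 - specificity, sensitivity))
        return points

    for fpr, tpr in roc_points(scores, truth):
        print(f"1 - specificity {fpr:.2f}  sensitivity {tpr:.2f}")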

These concepts were applied most readily to the fields of diagnostic imaging, and were further expanded as questions were asked about the value of new expensive imaging technologies in the 1970s and 1980s. Loop and Lusted, writing in 1978, described the American College of Radiology Diagnostic Efficacy Studies.19 While the investigators started with the intent of addressing efficacy of imaging tests in terms of patient outcomes (“outcome efficacy”), the difficulty of funding and the complexity of conducting long-term randomized studies examining the outcomes of multiple treatment alternatives resulting from imaging-derived diagnoses led to a more limited approach. The next approach, termed “therapeutic efficacy,” focused on determining the extent to which patient management actually changed following an imaging study. This also proved difficult to implement, and was abandoned in favor of studying the “diagnostic efficacy” of the radiologic procedure by measuring its influence on the clinician's diagnostic thinking. Physicians were asked to estimate probabilities of diagnoses prior to the imaging studies, and then to revise those estimates once they were given the results of the examinations. The impact of a test result on diagnostic thinking was interpreted as a useful proxy for studies of actual change in management or patient outcomes.

Guyatt and colleagues at McMaster University responded to this approach to evaluating diagnostic tests and stressed the need for randomized controlled trials (RCTs) to answer questions of therapeutic impact and patient outcomes.20 They recommended that once technical efficacy had been demonstrated, an efficient approach would be to design a single trial to assess diagnostic accuracy, impact on clinician decision making, therapeutic impact, and patient outcomes.20

In 1991, Fryback and Thornbury proposed an evaluation framework that synthesized these approaches.21 Their framework has been the most widely used and well known of all the evaluation frameworks. It describes six levels of medical test impact (see Table 1). Fryback and Thornbury suggested that the lower levels in this hierarchy should be verified prior to the higher levels. They advocated randomized controlled trials for tests with greater risk of harm, greater expense, or wider utilization, but suggested that decision modeling could be helpful for giving provisional answers or for focusing research efforts on the most important questions. The proposed use for their framework was to classify the published evidence on a diagnostic test, and to draw attention to the different “vantage points” from which a test could be evaluated.

Table 1. Fryback and Thornbury hierarchical model of efficacy.

Kent and Larson proposed a modification of the Fryback and Thornbury framework that they refer to as an “organizational framework” for use in assessment of diagnostic technologies. They recommended classifying studies along three dimensions: quality of individual studies, the spectrum of diseases to which the technology is applicable, and the levels of efficacy, such as those described above (technical, diagnostic accuracy, diagnostic thinking, therapeutic impact, and patient outcomes). They suggested that claims made for a new test could be compared with the available studies demonstrating each level of the efficacy hierarchy, noting both the study quality and the test's applicability to the severity or stage of disease.22

Other authors have described applications of the Fryback and Thornbury framework to the evaluation of screening and diagnostic laboratory tests. Issues specific to studies of technical efficacy or analytic validity of laboratory tests are discussed by van der Schouw et al.23 and Pearl.24 Several writers have suggested that an evaluation of a diagnostic test needs to account for the phase of development of the test, analogous to phases of drug development.25-34 Gatsonis pointed out that the evaluation of diagnostic imaging modalities is essentially an examination of the value of information.29 He proposed a matrix in which the value of the information is paired with the “developmental age” of the technology, which he categorized into four stages:

  • Stage I (discovery): establishment of technical parameters and diagnostic criteria
  • Stage II (introductory): early quantification of performance in clinical cohorts, usually in single institution studies
  • Stage III (mature): comparison to other modalities in large, prospective, multi-institutional clinical studies (“efficacy”)
  • Stage IV (disseminated): assessment of performance of the procedure as utilized in the community at large (“effectiveness”)29

Gatsonis commented that the outcomes of importance at these stages would vary according to the evaluator's perspective. For example, a test developer might be most interested in a Stage II study, whereas a payer would likely be most interested in a Stage III or IV study. He suggested the use of “adaptive statistical methods” (such as Bayesian statistical approaches) to account for the rapid evolution of diagnostic technology. Gatsonis also discussed the value of modeling studies as an alternative to “unrealistically complex and resource intensive” empirical studies of health outcomes (e.g., mortality reduction from screening examinations for malignancy).29 Lumbreras et al. urge that systematic reviews of diagnostic tests should analyze studies from these phases separately, because the nature of the relevant questions and the appropriate study designs are typically quite different.35

The USPSTF was first organized by the U.S. Public Health Service in 1984, and now is sponsored by AHRQ. Its mission is to assess the evidence for clinical preventive services to be delivered in the primary care setting. The services evaluated include screening tests, counseling interventions, and medications used to prevent disease. The Task Force Procedure Manual (July 2008) indicates a strong preference for systematic reviews of data from RCTs, and for data on “health outcomes,” which it defines as “symptoms and conditions that patients can feel or experience, such as visual impairment, pain, dyspnea, impaired functional status or quality of life, and death.” It contrasts these with “intermediate outcomes,” such as pathologic or physiologic measures which cannot be directly perceived by patients.36

The U.S. Centers for Disease Control and Prevention's (CDC's) National Office of Public Health Genomics (NOPHG) worked with the Foundation for Blood Research beginning in 2000 to develop a model for “assembling, analyzing, disseminating and updating existing data on the safety and effectiveness of DNA-based genetic tests and testing algorithms.” The ACCE model (Analytic validity; Clinical validity; Clinical utility; and Ethical, legal and social implications) specified 44 questions within this framework for use in the evaluation of DNA-based tests.37 In 2004, the NOPHG initiated the Evaluation of Genomic Applications in Practice and Prevention (EGAPP) project, which is focused on the review and synthesis of genomic applications to facilitate translation and dissemination into practice. The EGAPP Working Group, established in 2005, is charged with making recommendations based on EGAPP-sponsored reviews. The methods used by this group are described by Teutsch et al.3 and in a report from the Secretary's Advisory Committee on Genetics, Health and Society.1

In the sections below, we describe the frameworks utilized in recent systematic reviews of genetic tests, and compare them with the Fryback and Thornbury framework described previously.

Key Frameworks Used for Evaluation of Genetic Tests

To identify key frameworks that have been used for evaluation of genetic tests, the project team reviewed evidence reports or other government-sponsored reports on genetic testing topics. We decided to focus on these reports because the evaluation frameworks used in the reports had already been piloted in a real evaluation project and had considered the needs of some key stakeholders (e.g., patients, payers, regulators, and professional societies). We believe that these frameworks can be used as a foundation for building future evaluation frameworks.

Table 2 is a summary of the evaluation frameworks used in the selected reports. Four evaluation frameworks were identified in the reports, including the ACCE model,38 the Fryback-Thornbury model,21 the USPSTF framework for screening topics,39 and the EGAPP frameworks.3,5,40,41 The CDC-sponsored EGAPP frameworks consist of a set of frameworks for different testing purposes (e.g., pharmacogenetics, diagnosis of a disease, and risk assessment for a heritable condition) and were used in all but one EGAPP-initiated report. The CDC-sponsored ACCE model was used in one published report38 and five draft reports42-46 posted on the CDC's Web site. The Fryback-Thornbury model was used in an early EGAPP-initiated report published in 2006.47 The USPSTF framework was used in an evidence report requested by the USPSTF.48

Table 2. Evaluation frameworks used in completed evidence reports or other government-sponsored reports on genetic testing topics.

Figure 1 is a comparison of the four frameworks. All four frameworks cover three common domains of evaluation: analytic validity, clinical validity, and clinical utility of the test. The ACCE and the Fryback-Thornbury model also cover another domain of evaluation: societal impact of the test. Three evidence reports that are included in Table 2 (all published in 2008) did not explicitly specify what evaluation framework was used. However, all three reports used a structured approach to evaluating key issues in the domains of analytic validity, clinical validity, or clinical utility.4,7

Figure 1 is a graphical comparison of four key evaluation frameworks for clinical tests, presented in four columns (from left to right): the ACCE framework, the Fryback-Thornbury framework, the USPSTF framework for screening topics, and the EGAPP framework for a pharmacogenetic topic. ACCE stands for analytic validity, clinical validity, clinical utility, and ethical, legal, and social implications; USPSTF is the acronym for the U.S. Preventive Services Task Force; and EGAPP is the acronym for the Evaluation of Genomic Applications in Practice and Prevention initiative.

  • The ACCE column (first from the left) consists of four boxes, from bottom to top: analytic validity, clinical validity, clinical utility, and ethical, legal, and societal implications.
  • The Fryback-Thornbury column (second from the left) consists of six boxes, from bottom to top: technical efficacy, diagnostic accuracy efficacy, diagnostic thinking efficacy, therapeutic efficacy, patient outcome efficacy, and societal efficacy.
  • The USPSTF column for screening topics (third from the left) consists of a series of headers sequentially linked by arrows from bottom to top: patients at risk, screening, early detection of target condition, treatment, intermediate outcomes, and reduced morbidity and/or mortality. Two additional headers appear on the right, “adverse effects of screening” and “adverse effects of treatment”; an arrow from the header “screening” points to the former, and an arrow from the header “treatment” points to the latter.
  • The EGAPP column for a pharmacogenetic topic (fourth from the left) consists of a series of headers sequentially linked by arrows from bottom to top: adult with non-psychotic depression entering therapy with SSRIs, CYP450 genotype, metabolizer status, predicted drug efficacy or risk for drug adverse effects, treatment, and improved outcomes. One additional header appears on the right, “harms of subsequent management options,” pointed to by an arrow from the header “treatment.”

Three dashed lines running across all four columns divide the figure into four parallel areas (from bottom to top): domain 1 (analytic validity), domain 2 (clinical validity), domain 3 (clinical utility), and domain 4 (ethical, legal, and societal implications). A note underneath the figure states: “This figure was created by ECRI Institute based on the specified evaluation frameworks. For a detailed description of each included framework, refer to the original references.” The main point of Figure 1 is that all four frameworks cover three common domains of evaluation (analytic validity, clinical validity, and clinical utility of the test) and that the ACCE and Fryback-Thornbury models also cover another domain: societal impacts of the test.

Figure 1

A comparison of key evaluation frameworks for clinical tests. Note: This figure was created by ECRI Institute based on the specified evaluation frameworks. For a detailed description of each included framework, refer to the original references.

Note that Table 2 does not include any of the genetic-testing-related horizon scan reports prepared by an AHRQ EPC.2,49-51 Although these reports provide important information regarding the overall landscape of the genetic testing area, none of them evaluated any individual test using a formalized approach.

Unique Needs of Different Stakeholders for Evaluation Frameworks

The project team presented the findings of the targeted review to the Workgroup, including the historical overview of existing evaluation frameworks for laboratory tests, the key frameworks used in completed evidence reports, and the comparison of the key frameworks. The team invited the experts to identify common stakeholders who may use a framework in evaluating genetic tests and discuss the potentially unique needs of these different users for evaluation frameworks. The purpose of this activity was to determine whether one comprehensive framework (or one set of frameworks) would meet the needs of all stakeholders.

During the discussion, the following potential users of evaluation frameworks were identified: patients, providers, payers (e.g., Centers for Medicare and Medicaid Services [CMS], private health plans), regulators (e.g., U.S. Food and Drug Administration [FDA] and New York State Clinical Laboratory Evaluation Program [CLEP]), and test developers (clinical laboratories and test kit manufacturers). Technology assessment groups including EPCs are also users of evaluation frameworks, but their needs for evaluation frameworks generally reflect the needs of the stakeholders for whom the evaluation is being performed, including all stakeholders identified previously.

Unless other references are specified, the opinions provided in the remainder of the Evaluation Frameworks section are based on the discussions among the Workgroup and the ECRI Institute EPC project team.55,56

Evaluating Genetic Tests From Patients' Perspectives

The evaluation needs of patients were the emphasis of the discussion among the Workgroup, given that the ultimate reason for any test to be developed and adopted for clinical practice is that the test has potential to benefit patients. The needs of patients should also provide important guidance to the evaluation activities initiated by other stakeholders (e.g., providers, payers, regulators, and test developers).

From individual patients' perspectives, the test's impact on health outcomes (i.e., clinical utility) is typically the ultimate interest of evaluation. However, as pointed out by many Workgroup experts and the authors of some published reports,1,2 clinical utility studies that directly correlate health outcomes with a clinical test are often unavailable. As a result, analytic validity, clinical validity, and potential impacts of the testing on medical decision making will, in most cases, need to be evaluated in order to establish a chain of evidence to evaluate clinical utility indirectly.

Several Workgroup members suggested that there appears to be a hierarchy of evidence among analytic validity, clinical validity and clinical utility (i.e., Domains 1, 2, and 3 in Figure 1). That is, if the analytic validity of a test is poor, the clinical validity will inevitably be poor, and subsequently, the clinical utility will also be poor. If the performance of a genetic test in detecting the target mutation is poor, the test will definitely not be able to assist clinicians in reaching an accurate clinical diagnosis and will not have any positive impact on patient outcomes. Generally, the experts agreed that, when clinical utility studies (e.g., RCTs that correlate patient outcomes with testing) are missing, the evaluation of analytic or clinical validity studies could be helpful to establish an indirect chain of evidence supporting potential utility of the test. Even when clinical utility studies are available, evaluation of analytic or clinical validity might still be needed. In particular, when the number of clinical utility studies is small or the findings of the studies are contradictory, evaluation of analytic and clinical validity could be helpful in reducing the uncertainty about the conclusions.

One question that was raised during the panel discussion is: if clinical utility studies and clinical validity studies (i.e., diagnostic accuracy studies) are available, is there a need to evaluate analytic validity at all? Several experts suggested that analytic validity might still need to be evaluated in this situation. One suggestion from the Workgroup is that analytic validity studies evaluate a broad range of testing performance aspects. Some of these aspects, such as testing repeatability and reproducibility, are typically not evaluated in diagnostic accuracy studies but may have significant implications for how well the test performs in real-world laboratory settings (i.e., the generalizability or applicability of the evidence). For example, if data from a proficiency testing program suggest that the interlaboratory reproducibility of a test is poor, the test may perform poorly in predicting the clinical condition in the real-world setting, even though landmark clinical validity studies conducted in a single institution yielded a high diagnostic accuracy in a particular testing setting.
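
As a purely illustrative sketch of the kind of summary a proficiency testing program might support (the laboratories, samples, and genotype calls below are invented, and the metrics are simple percent agreement rather than any program's official statistics):

    # Hypothetical proficiency-testing results: genotype calls from three
    # laboratories on the same five challenge samples (all data invented).
    calls_by_lab = {
        "lab_A": ["wild-type", "heterozygous", "homozygous", "wild-type", "heterozygous"],
        "lab_B": ["wild-type", "heterozygous", "homozygous", "wild-type", "wild-type"],
        "lab_C": ["wild-type", "heterozygous", "heterozygous", "wild-type", "heterozygous"],
    }
    reference = ["wild-type", "heterozygous", "homozygous", "wild-type", "heterozygous"]

    # Per-laboratory concordance with the reference result.
    for lab, calls in calls_by_lab.items():
        correct = sum(c == r for c, r in zip(calls, reference))
        print(f"{lab}: {correct}/{len(reference)} concordant with reference")

    # Crude interlaboratory reproducibility: fraction of samples on which all labs agree.
    n_samples = len(reference)
    all_agree = sum(len({calls_by_lab[lab][i] for lab in calls_by_lab}) == 1
                    for i in range(n_samples))
    print(f"samples with full interlaboratory agreement: {all_agree}/{n_samples}")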

During the discussion, the experts acknowledged that, although evaluation of analytic validity is important, there are significant technical barriers to performing such evaluations. One major challenge is lack of published analytic validity data. Locating unpublished data can be difficult and time-consuming. Meanwhile, even if data—published or unpublished—are identified, no widely accepted guidance is available for judging the quality of these types of data. These challenges will be further addressed in the Analytic Validity section of this chapter.

For society as a whole, the ethical, legal, and social implications of testing might also need to be evaluated at certain times. However, from an individual patient's perspective, clinical utility would typically be the most important aspect of test evaluation.

Evaluating Genetic Tests From Other Stakeholders' Perspectives

The needs of other stakeholders (e.g., providers, payers, regulators, and test developers) for evaluation frameworks were also discussed among the Workgroup and the ECRI Institute EPC team. While the needs of patients provide important guidance to the evaluation activities initiated by these other stakeholders, each stakeholder may place more, less, or different emphasis on particular aspects of the evaluation than patients would, because of the unique regulatory requirements or agendas that they need to meet.

Clinicians normally act as agents of patients in making key clinical decisions. The issues that concern clinicians would thus be addressed in the evaluation in much the same way as they are for patients. Institutional providers (e.g., hospitals) and payers, including public programs such as CMS and private insurance plans, should also be interested in evaluating analytic validity, clinical validity, and particularly, clinical utility of the tests. These providers and payers may need to use evaluation frameworks that are similar to the frameworks preferred by patients. However, these stakeholders may also have additional issues that need to be addressed in the evaluation, such as financial and operational concerns. For payers, cost-effectiveness of the test could be an important aspect of evaluation. In addition, when evaluating clinical utility, payers might be less willing than patients/clinicians to consider indirect chains of evidence linking patient outcomes to testing.

For regulators, the issues that need to be addressed in evaluation are largely delineated by the regulatory responsibilities mandated by law. For example, the Federal Food, Drug, and Cosmetic Act authorizes FDA to regulate medical devices, including commercially marketed test kits.57 FDA is charged with assessing the safety and effectiveness of the test. FDA reviews the analytic and clinical performance of the test kit to ensure that the performance data support manufacturer claims.1 In New York State, the Department of Health evaluates all clinical tests before they may be offered to patients whose specimens are collected in New York. Neither FDA nor New York State requires the evaluation of clinical utility.

For test developers (e.g., clinical laboratories and test kit manufacturers), the goal of the evaluation might vary across different phases of the test development cycle. In the early phases of the cycle, the focus of evaluation might be on technical feasibility and analytic validity. As the test development progresses, the emphasis of evaluation may shift to clinical validity then to clinical utility.

Is it Feasible to Clarify a Set of Evaluation Frameworks for Genetic Tests?

Based on the findings of the targeted review and the input from the Workgroup, it became clear that a single comprehensive evaluation framework would not meet the needs of all stakeholders without being too general to be useful. The consensus decision was to explore the possibility of proposing a framework or a set of frameworks for each group of stakeholders. The ECRI Institute EPC team decided to start the exploratory effort by first looking at evaluation frameworks for the most important group of stakeholders: patients. As discussed previously, the evaluation frameworks for patients are most likely to form the basis for frameworks used by other stakeholders (e.g., providers, payers, and regulators).

Even when focusing only on evaluation frameworks for patients, most Workgroup members thought a single framework might be too general to apply to different testing scenarios (e.g., diagnosis, prognostic evaluation, screening for heritable conditions, and pharmacogenetics). The experts suggested proposing a different framework for each general type of test usage. The EGAPP Working Group had previously done work in this area. The draft frameworks discussed by EGAPP cover four clinical settings: screening in asymptomatic populations for genetic susceptibility, genetic screening for acquired disease, diagnostic testing for symptomatic disease, and genetic testing to alter therapeutic approaches (e.g., pharmacogenetics).58 After reviewing the draft frameworks, the project team decided to use these draft frameworks and the frameworks used in published EGAPP reports5,8,40,41,52 as a foundation to present a set of analytic frameworks for common clinical scenarios.

During early discussion by the Workgroup, a few experts expressed a preference for the ACCE model as the basis for framework development. The ACCE model was considered to have two major advantages over alternatives. First, the ACCE concept (i.e., analytic validity, clinical validity, clinical utility, and ethical, legal and social impacts) has been widely accepted in the area of genetic testing evaluation. Second, the ACCE approach (i.e., evaluating the test by answering a fixed set of questions) is generally straightforward.

However, after a closer examination of the ACCE model, the Workgroup also identified some apparent disadvantages. First, the model does not have a visual representation of the relationship between the application of the test and the outcomes of importance to decision making. That visual representation was considered by most experts to be a desirable feature of analytic frameworks. Second, the ACCE model is somewhat cumbersome. Using the full model requires the evidence evaluator to address 44 different questions. Third, as a CDC-funded initiative, the ACCE project was discontinued and replaced in 2004 by another CDC-funded initiative, the EGAPP project.

After a discussion and comparison of the possible approaches, the team decided to use the EGAPP draft frameworks as the basis for framework development. The EGAPP frameworks have already incorporated input from multiple stakeholders and reflected some recent thinking of experts in genetic testing evaluation. Since the project began in 2004, the EGAPP frameworks have been used in several evidence reports for different testing topics, which can be considered as a pilot test process for framework development.

In addition, the EGAPP frameworks included the key concepts from other major evaluation models. As a sequel to the ACCE model, the EGAPP Working Group adopted the concepts of analytic validity, clinical validity, clinical utility, and ethical, legal and social implications. Similar to the USPSTF evaluation model for screening topics, the EGAPP analytic frameworks provide a visual presentation of the relationships among testing, intermediate outcomes, and health outcomes. The EGAPP frameworks also incorporated some of the components of the widely used Fryback-Thornbury model (e.g., asking whether use of the test has an impact on clinical decision-making).

The Workgroup agreed that some enhancements would need to be made to the EGAPP draft frameworks. Suggestions to enhance the frameworks included the addition of, when appropriate, a comparative question that compared the performance of the index test with that of the current standard-of-care diagnostic/screening approach. Another suggestion was to better represent the balance between potential benefits and harms of the testing. The Workgroup also felt there was a need to add additional frameworks to cover the testing scenarios that were not covered by the existing draft EGAPP frameworks, such as treatment monitoring, prenatal screening, and susceptibility assessment involving detection of germline mutations.

Analytic Frameworks for Genetic Tests: From Patients' Perspectives

Based on the findings from the targeted review and the input from the Workgroup, the ECRI EPC team presented a set of analytic frameworks by modifying the EGAPP frameworks (including both draft and published frameworks).5,8,40,41,52,58 One framework was presented for each of the following testing scenarios, depicted in Figures 2 through 8:

  • Diagnosis in symptomatic patients
  • Disease screening in asymptomatic patients
  • Prognosis assessment
  • Treatment monitoring
  • Pharmacogenetics
  • Risk/susceptibility assessment
  • Germline-mutation-related testing scenarios

Figure 2 is a proposed evaluation framework for genetic testing for diagnosis in symptomatic patients. The framework includes a series of headers sequentially linked by arrows going from left to right: symptomatic patients, testing, diagnosis of disease, treatment, intermediate outcomes, and health outcomes. An overarching arc line links the header “testing” and the header “health outcomes.” Below the headers “intermediate outcomes” and “health outcomes” are two additional headers, “harms caused by the testing” and “harms caused by the treatment”; an arrow from the header “testing” points to the former, and an arrow from the header “treatment” points to the latter. The four headers “intermediate outcomes,” “health outcomes,” “harms caused by the testing,” and “harms caused by the treatment” are also enclosed in one bigger box labeled “balance of benefits and harms.” This figure demonstrates a clinical path for individuals who receive this type of testing. Under this evaluation framework, eight key research questions can be generated:

  1. Overarching question: Does use of the test lead to improved health outcomes compared to the standard-of-care diagnostic strategy that does not include the test? The test being evaluated may be used to substitute for an existing diagnostic test, as a triage test, or as an add-on test (i.e., a test added to an existing testing protocol). This overarching key question involves comparison of use of the test with the standard-of-care diagnostic strategy that uses other tests or no test at all.
  2. Does the test have adequate analytic validity?
  3. How accurate is the test for detecting the target disease or condition? Is the test more accurate than the standard-of-care test for detecting the target disease or condition? Or, when the test is used as part of a diagnostic strategy (e.g., being used as a triage or add-on test), how accurate is the diagnostic strategy as a whole for detecting the disease or condition? Is the diagnostic strategy including the test more accurate than a standard-of-care diagnostic strategy for detecting the disease or condition?
  4. Does use of the test have any impact on treatment decision making by clinicians or patients?
  5. Does the treatment lead to improved intermediate outcomes in comparison with no treatment?
  6. Does the treatment lead to improved health outcomes in comparison with no treatment?
  7. What harms does the testing cause? Does the testing cause more harms than alternative testing strategies?
  8. What harms does the treatment cause? Does the treatment cause more harms than alternative treatments?

Figure 2

Analytic framework for diagnosis in symptomatic patients.

Figure 3 is a proposed evaluation framework for genetic testing for screening in asymptomatic patients. The framework includes a series of headers sequentially linked by arrows going from left to right: asymptomatic individuals at risk, testing, detection of target condition, early intervention, intermediate outcomes, and health outcomes. An overarching arc line links the header “testing” and the header “health outcomes.” Below the headers “intermediate outcomes” and “health outcomes” are two additional headers, “harms caused by the testing” and “harms caused by the intervention”; an arrow from the header “testing” points to the former, and an arrow from the header “early intervention” points to the latter. The four headers “intermediate outcomes,” “health outcomes,” “harms caused by the testing,” and “harms caused by the intervention” are also enclosed in one bigger box labeled “balance of benefits and harms.” This figure demonstrates a clinical path for individuals who receive this type of testing. Under this evaluation framework, eight key research questions can be generated:

  1. Overarching question: Does use of the test lead to improved health outcomes compared to the standard-of-care screening strategy or no screening? The screening test being evaluated may be used to substitute for an existing test, as a triage test, or as an add-on test (i.e., a test added to an existing screening protocol). This overarching key question involves comparison of use of the test with no screening or the standard-of-care screening strategy that uses other tests.
  2. Does the test have adequate analytic validity?
  3. How accurate is the test for detecting the target condition? Is the test more accurate than a standard-of-care screening test (if any) for detecting the condition? Or, when the test is used as part of a screening strategy (e.g., being used as a triage or add-on test), how accurate is the screening strategy as a whole for detecting the target condition? Is the screening strategy using the test more accurate than a standard-of-care screening strategy for detecting the condition?
  4. Does use of the test have any impact on the decision making by clinicians or patients regarding early intervention (if any)?
  5. Does the early intervention (if any) lead to improved intermediate outcomes in comparison with no intervention?
  6. Does the early intervention (if any) lead to improved health outcomes in comparison with no intervention?
  7. What harms does the testing cause? Does the testing cause more harms than alternative testing strategies?
  8. What harms does the early intervention cause? Does the intervention cause more harms than alternative interventions?

Figure 3

Analytic framework for screening in asymptomatic patients.

Figure 4 is a proposed evaluation framework for genetic testing for prognosis assessment. The framework includes a series of headers sequentially linked by arrows going from left to right: patients with the disease, testing, differential outcomes or natural history, disease management strategies, intermediate outcomes, and health outcomes. An overarching arc line links the header “testing” and the header “health outcomes.” Below the headers “intermediate outcomes” and “health outcomes” are two additional headers, “harms caused by the testing” and “harms caused by the treatment decisions”; an arrow from the header “testing” points to the former, and an arrow from the header “disease management strategies” points to the latter. The four headers “intermediate outcomes,” “health outcomes,” “harms caused by the testing,” and “harms caused by the treatment decisions” are also enclosed in one bigger box labeled “balance of benefits and harms.” This figure demonstrates a clinical path for individuals who receive this type of testing. Under this evaluation framework, eight key research questions can be generated:

  1. Overarching question: Does use of the test lead to improved health outcomes compared to the standard-of-care prognosis assessment strategy or not doing the assessment? The test being evaluated may be used to substitute for an existing prognosis assessment test or as an add-on test (i.e., a test added to an existing testing protocol for prognosis assessment). This overarching key question involves comparison of use of the test with the standard-of-care prognosis assessment or not doing prognosis assessment at all.
  2. Does the test have adequate analytic validity?
  3. How accurate is the test for predicting prognosis? Is the test more accurate than a standard-of-care test for predicting prognosis? Or, when the test is used as part of a prognosis assessment strategy (e.g., being used as an add-on test), how accurate is the assessment strategy as a whole for predicting prognosis? Is the prognosis assessment strategy using the test more accurate than a standard-of-care prognosis assessment strategy?
  4. Does use of the test have any impact on disease-management decisions?
  5. Does the disease management strategy chosen based on the testing result lead to improved intermediate outcomes in comparison with alternative disease management strategies?
  6. Does the disease management strategy chosen based on the testing result lead to improved health outcomes in comparison with alternative disease management strategies?
  7. What harms does the testing cause? Does the testing cause more harms than alternative testing strategies?
  8. What harms does the disease management strategy chosen based on the testing result cause? Does the strategy cause more harms than alternative disease management strategies?

Figure 4

Analytic framework for prognosis assessment.

Figure 5 is a proposed evaluation framework for genetic testing for treatment monitoring. The framework includes a series of headers sequentially linked by arrows going from left to right: patients with the disease, testing, treatment effectiveness assessment, treatment adjustment, intermediate outcomes, and health outcomes. An overarching arc line links the header “testing” and the header “health outcomes.” Below the headers “intermediate outcomes” and “health outcomes” are two additional headers, “harms caused by the testing” and “harms caused by treatment adjustment”; an arrow from the header “testing” points to the former, and an arrow from the header “treatment adjustment” points to the latter. The four headers “intermediate outcomes,” “health outcomes,” “harms caused by the testing,” and “harms caused by treatment adjustment” are also enclosed in one bigger box labeled “balance of benefits and harms.” This figure demonstrates a clinical path for individuals who receive this type of testing. Under this evaluation framework, eight key research questions can be generated:

  1. Overarching question: Does use of the test lead to improved health outcomes compared to the standard-of-care treatment monitoring strategy or no monitoring? The test being evaluated may be used to substitute for an existing monitoring test or as an add-on test (i.e., a test added to an existing treatment monitoring protocol). This overarching key question involves comparison of use of the test with no monitoring or the standard-of-care monitoring strategy that uses other tests.
  2. Does the test have adequate analytic validity?
  3. How accurate is the test for indicating the effectiveness of the treatment? Is the test more accurate than a standard-of-care test for evaluating the effectiveness of the treatment? Or, when the test is used as part of a treatment monitoring strategy (e.g., being used as an add-on test), how accurate is the monitoring strategy as a whole for indicating the effectiveness of the treatment? Is the monitoring strategy using the test more accurate than a standard-of-care monitoring strategy for evaluating the effectiveness of the treatment?
  4. Does use of the test have any impact on disease-management decisions (such as adjustment of treatment plans)?
  5. Do the disease management decisions lead to improved intermediate outcomes?
  6. Do the disease management decisions lead to improved health outcomes?
  7. What harms does the testing cause? Does the testing cause more harms than alternative testing strategies?
  8. What harms does the disease management strategy chosen based on the testing result cause? Does the strategy cause more harms than alternative disease management strategies?

Figure 5

Analytic framework for treatment monitoring.

Figure 6 is a proposed evaluation framework for genetic testing for pharmacogenetics. The framework includes a series of headers sequentially linked by arrows going from left to right: patients being considered for a medicine, testing, different drug response, personalized treatment, intermediate outcomes, and health outcomes. An overarching arc line links the header “testing” and the header “health outcomes.” Below the headers “intermediate outcomes” and “health outcomes” are two additional headers, “harms caused by the testing” and “harms caused by the treatment decision”; an arrow from the header “testing” points to the former, and an arrow from the header “personalized treatment” points to the latter. The four headers “intermediate outcomes,” “health outcomes,” “harms caused by the testing,” and “harms caused by the treatment decision” are also enclosed in one bigger box labeled “balance of benefits and harms.” This figure demonstrates a clinical path for individuals who receive this type of testing. Under this evaluation framework, ten key research questions can be generated:

  1. Overarching question: Does use of the test lead to improved health outcomes compared to no testing or the standard-of-care test for predicting the response to the drug?
  2. Does the test have adequate analytic validity?
  3. Do testing results effectively predict patients' response to the drug? Is the test more accurate than other methods for predicting patients' response to the drug?
    3a. How well do the testing results predict the drug's efficacy?
    3b. How well do the testing results predict drug-related adverse reactions?
  4. Do testing results have any impact on treatment decision making?
  5. Do the personalized treatment decisions based on the testing results lead to improved intermediate outcomes?
  6. Do the treatment decisions lead to improved health outcomes?
  7. What harms does the testing cause? Does the testing cause more harms than alternative testing strategies?
  8. What harms does the treatment strategy chosen based on the testing result cause? Does the strategy cause more harms than alternative treatment strategies?

Figure 6

Analytic framework for pharmacogenetics.

Figure 7 is a proposed evaluation framework for genetic testing for risk/susceptibility assessment. The framework includes a series of headers sequentially linked by arrows going from left to right: general or at-risk population, testing, likelihood to develop a condition, clinical or personal decisions, intermediate outcomes, and health outcomes. An overarching arc line links the header “testing” and the header “health outcomes.” Below the headers “intermediate outcomes” and “health outcomes” are two additional headers, “harms caused by the testing” and “harms caused by the decisions”; an arrow from the header “testing” points to the former, and an arrow from the header “clinical or personal decisions” points to the latter. The four headers “intermediate outcomes,” “health outcomes,” “harms caused by the testing,” and “harms caused by the decisions” are also enclosed in one bigger box labeled “balance of benefits and harms.” This figure demonstrates a clinical path for individuals who receive this type of testing. Under this evaluation framework, eight key research questions can be generated:

  1. Overarching question: Does use of the test lead to improved health outcomes compared to the standard-of-care risk assessment strategy or no assessment? The test being evaluated may be used to substitute for an existing risk assessment test or as an add-on test (i.e., a test added to an existing risk assessment strategy). This overarching key question involves comparison of use of the test with no risk assessment being performed or the standard-of-care assessment strategy that uses other tests.
  2. Does the test have adequate analytic validity?
  3. How accurate is the test for predicting the likelihood of a patient developing the target condition in the future? Is the test more accurate than a standard-of-care method for predicting the likelihood of a patient developing the target condition in the future? Or, when the test is used as part of a risk assessment strategy (e.g., being used as an add-on test), how accurate is the assessment strategy as a whole for predicting the likelihood of a patient developing the target condition in the future? Is the risk assessment strategy using the test more accurate than a standard-of-care risk assessment strategy in predicting the likelihood of a patient developing the target condition in the future?
  4. Does use of the test have any impact on clinical or personal decision making?
  5. Do the clinical or personal decisions lead to improved intermediate outcomes?
  6. Do the clinical or personal decisions lead to improved health outcomes?
  7. What harms does the testing cause? Does the testing cause more harms than alternative testing strategies?
  8. Do the clinical or personal decisions cause any harm? Does the action taken by the patient or clinician based on the testing result cause more harms than alternative actions?

Figure 7

Analytic framework for risk/susceptibility assessment.

Figure 8 is a proposed evaluation framework for genetic testing for germline-mutation-related risk/susceptibility assessment. The figure consists of two sections. The section at the top of the figure demonstrates a clinical path for individuals who receive this type of testing. This section includes a series of headers sequentially linked by arrows going from left to right: general or at-risk population, testing, likelihood to develop a condition, clinical or personal decisions, intermediate outcomes, and health outcomes. An overarching arc line links the header “testing” and the header “health outcomes.” Below the headers “intermediate outcomes” and “health outcomes” are two additional headers, “harms caused by the testing” and “harms caused by the decisions”; an arrow from the header “testing” points to the former, and an arrow from the header “clinical or personal decisions” points to the latter. The four headers “intermediate outcomes,” “health outcomes,” “harms caused by the testing,” and “harms caused by the decisions” are also enclosed in one bigger box labeled “balance of benefits and harms.” The section at the bottom of the figure demonstrates a clinical path for family members of test-positive individuals. The two sections are parallel and almost identical in structure, except that the bottom section starts from the header “family members of test-positive individuals” instead of the header “general or at-risk population.” An arrow from the header “testing” in the top section points to the header “family members of test-positive individuals,” indicating that a positive testing finding in an individual will trigger the testing process for his or her family members. Under this evaluation framework, eight key research questions can be generated:

  1. Overarching question: Does use of the test lead to improved health outcomes compared to the standard-of-care risk assessment strategy or no assessment?
  2. Does the test have adequate analytic validity?
  3. How accurate is the test for predicting the likelihood of a patient or family member developing the target condition in the future? Is the test more accurate than the standard-of-care assessment method in making the prediction? Or, when the test is used as part of a risk assessment strategy (e.g., when used as an add-on test), how accurate is the assessment strategy as a whole for predicting the likelihood of a patient or family member developing the target condition in the future? Is the assessment strategy using the test more accurate than the standard-of-care assessment strategy in making the prediction?
  4. Does use of the test have any impact on clinical or personal decision making?
  5. Do the clinical or personal decisions lead to improved intermediate outcomes?
  6. Do the clinical or personal decisions lead to improved health outcomes?
  7. What harms does the testing cause? Does the testing cause more harms than alternative testing strategies?
  8. Do the clinical or personal decisions cause any harm? Does the action taken by the patient or clinician based on the testing result cause more harms than alternative actions?

Figure 8

Analytic framework for germline-mutation-related risk/susceptibility assessment.

Each framework includes a graphical depiction of the relationship between the population, the test under consideration, subsequent interventions, and outcomes (including intermediate outcomes, patient outcomes, and potential harms). Each framework also includes a set of research questions that need to be addressed. The numbers shown in the diagram of the framework represent corresponding research questions.

While differences exist among the presented frameworks, the frameworks also share the following commonalities:

  1. Under each framework, an overarching question (Key Question 1) needs to be addressed about whether use of the test will lead to an incremental change in health outcomes among the patients being tested compared to using standard-of-care testing or no testing. In some instances, the new test may be evaluated as an “add-on” to testing currently in use, or as a “triage” step prior to use of a more invasive test.
  2. Under each framework, a research question (Key Question 2) is asked regarding the analytic validity of the test. This question addresses issues such as analytic accuracy, analytic sensitivity, analytic specificity, precision, reproducibility, and robustness of the test.
  3. Under each framework, potential harms that might be caused by the testing or the subsequent interventions based on the testing results are required to be evaluated. While these potential harms could be reflected by incremental health outcomes (e.g., mortality and quality of life), it is still important to ask the harm-related questions separately, particularly when evidence on incremental health outcomes is not available for evaluation.
  4. Under each framework, both health outcomes and intermediate outcomes are included for evaluation of the clinical utility of the test. Health outcomes (or patient outcomes) are symptoms and conditions that patients can feel or experience, such as visual impairment, pain, dyspnea, impaired functional status or quality of life, and death.36 Intermediate outcomes (or surrogate outcomes) are pathologic and physiologic measures that may precede or lead to health outcomes.36 For example, elevated blood cholesterol level is an intermediate outcome for coronary artery disease. While health outcomes are what ultimately matter to patients, it could still be important to evaluate the testing's impact on intermediate outcomes, particularly when direct evidence on health outcomes is not available.
  5. Under each framework, a question is asked regarding whether use of the test would have any impact on decision making by clinicians or patients. Addressing this question could help to address the clinical utility issue, particularly when evidence on health or intermediate outcomes is not available. Tests whose results have no impact on decision making by clinicians or patients will certainly not lead to any changes—positive or negative—in health outcomes.

This set of frameworks inherits the concept of “chain of evidence” from the EGAPP framework.3 Key Question 1 (i.e., the overarching question) determines whether a single body of evidence exists that directly establishes the connection between the use of the genetic test and health outcomes. However, for genetic tests, such direct evidence is rarely available.1,3 Even when direct evidence exists, it could be low in quality, quantity, or consistency.3 Therefore, constructing a chain of evidence by addressing a series of key questions (i.e., the other key questions specified in the frameworks) is commonly necessary for evaluating the clinical utility of the tests.

To connect the use of the test with health outcomes, the remaining key questions specified in the frameworks need to be addressed. These key questions evaluate analytic validity, clinical validity, medical or personal decisionmaking, and balance of benefits and harms associated with the tests. Determining whether this chain of evidence is adequate for answering the overarching question requires consideration of the adequacy of evidence for each link in the evidence chain, the certainty of findings based on the quantity (i.e., number and size) and quality (i.e., internal validity) of studies, the consistency and generalizability of results, and understanding of other factors or contextual issues that might influence the conclusions.3,59

Before entering the evaluation process using the presented frameworks, two issues need to be addressed. First, the patient population for whom the test is intended to apply should be clearly defined. For example, for screening tests, whether the test is intended for the general population or for a population at high risk should be explicitly stated. If a test is for an “at-risk” population, whether the “at-risk” population can be identified reliably should be assessed.

Second, the testing purpose should be defined explicitly (e.g., diagnosis, prognosis, screening, or even multiple purposes), as well as the testing techniques employed. In some cases, several different techniques can be used to analyze the status of the same gene. For example, immunohistochemistry (IHC) assays, extracellular domain assays, and in situ hybridization (ISH) techniques are all used for ERBB2/Neu testing for breast cancer and other solid tumors.7 In other cases, testing the status of the same gene can be used for multiple clinical purposes. For example, testing of cystic fibrosis mutations can be used for diagnosis in symptomatic patients, screening for asymptomatic patients, or prenatal screening via carrier testing. If different testing purposes or techniques are within the scope of work of an evaluation project, multiple “tests” are actually being evaluated. For this type of project, several analytic frameworks may be needed.

In the following section, we present a set of analytic frameworks for common clinical scenarios. Unless otherwise specified, these frameworks apply either to nonheritable conditions (i.e., those caused by somatic mutations), or to heritable conditions (i.e., those caused by germline mutations) when the evaluator is only concerned with the impact of tests on probands. The frameworks for tests for heritable conditions involving evaluation for both the probands and relatives are more complicated and are thus described in a separate subsection.

We investigated the usability of the frameworks that we presented for seven real-world sample testing scenarios. We generated research questions for the sample tests using the frameworks. The sample tests, as well as the hypothetical research questions generated, are described in Appendix B of this report.

We acknowledge that the frameworks presented in this report may not meet every need an assessor may have in evaluating a particular test. However, we believe that assessors should be able to readily adjust these frameworks to meet their needs. For example, some assessors may need to evaluate the effectiveness of a test in different subpopulations (e.g., by age, gender, or ethnicity); others may need to evaluate potential interactions between comorbidities and the effectiveness of the test. In those cases, the frameworks presented in this report can still be used as the basis of the evaluation; the assessors need only perform subgroup analyses or add research questions.

Scenario 1. Diagnosis in Symptomatic Patients
Scenario 2. Screening in Asymptomatic Patients
Scenario 3. Prognosis Assessment
Scenario 4. Treatment Monitoring
Scenario 5. Pharmacogenetics
Scenario 6. Risk/Susceptibility Assessment
Scenario 7. Germline-Mutation-Related Testing Scenarios

All the frameworks presented so far in this section were intended for testing scenarios involving a nonheritable condition (i.e., a condition caused by somatic mutations). Testing scenarios for germline-mutation-related heritable conditions can be more complex to evaluate because the potential benefits and harms realized among the family members of test-positive individuals may also need to be considered in the evaluation process. Figure 8 is a suggested analytic framework for germline-mutation-related risk/susceptibility assessment.

This framework was used in an EPC report published in 2009, DNA Testing for Factor V Leiden Mutations for the Assessment of Venous Thromboembolism Recurrence Risk.40

The framework consists of two almost parallel branches. The upper branch (see Figure 8) focuses on the utility of the test for the general or high-risk population, while the lower branch focuses on the utility of the test for the family members of the test-positive individuals. The two branches are put under one framework because the potential benefits and harms of the test in both those who are screened originally and family members of test-positive individuals are of interest to the assessor. However, if the assessor is primarily concerned with the effectiveness of the test either in those who are screened originally or in those family members of test-positive individuals, the single-branch analytic framework (Figure 7) presented previously could be used instead.

In addition to risk/susceptibility assessment, germline-mutation-related testing may also be used for other clinical purposes (e.g., diagnosis in symptomatic patients and disease screening in asymptomatic patients). For those testing scenarios, similar two-branch frameworks can be constructed based on relevant frameworks presented previously in this chapter (e.g., Figure 2 and Figure 3).

Analytic Frameworks for Genetic Tests: From Other Stakeholders' Perspectives

As discussed previously, the issues that providers, payers, regulators, and test developers need to address in evaluation of laboratory tests could be somewhat different from those for patients (refer to the section, Unique needs of different stakeholders for evaluation frameworks). As a result, evaluation frameworks that are appropriate from patients' perspectives may not meet the needs of those other stakeholders.

For providers and payers, most issues that are addressed under the frameworks for patients (e.g., analytic validity, clinical validity, and clinical utility) are still relevant. Therefore, evaluation frameworks preferred by providers and payers should be largely similar to the frameworks for patients. The frameworks for providers and payers incorporate some additional pieces that address the issues of concern to these stakeholders. As discussed previously, these may include operational, economic, legal and other societal implications of the test.

Figure 9 is a provider-perspective analytic framework for evaluation of a diagnostic test. As the diagram depicts, the framework is similar to the framework for patients (Figure 2), except that the provider may wish to ask about the operational and financial impact of the test. The framework shows that whether the test has any operational and financial impact largely depends on patients' preferences for the test, the cost of providing the testing service and subsequent treatments, and the clinical utility (benefits and harms) of the test. Similar provider-oriented analytic frameworks for other testing scenarios (e.g., disease screening in asymptomatic patients, treatment monitoring, and drug selection) can also be constructed based on the patient-oriented frameworks (Figures 2–8).

Figure 9 is an evaluation framework for genetic testing for diagnosis in symptomatic patients, proposed from providers' perspectives. The framework includes a series of headers sequentially linked by left-to-right arrows: symptomatic patients, testing, diagnosis of disease, treatment, intermediate outcomes, and health outcomes. An overarching arc links the header “testing” and the header “health outcomes.” Below the headers “intermediate outcomes” and “health outcomes” are two additional headers, “harms caused by the testing” and “harms caused by the treatment”; an arrow from “testing” points to “harms caused by the testing,” and an arrow from “treatment” points to “harms caused by the treatment.” The four headers “intermediate outcomes,” “health outcomes,” “harms caused by the testing,” and “harms caused by the treatment” are enclosed in one larger box labeled “balance of benefits and harms.” At the bottom of the figure is another header, “operational and financial impacts of the test.” Large arrows from the headers “symptomatic patients,” “testing,” and “treatment,” as well as from the box “balance of benefits and harms,” all point to this header.

Under this evaluation framework, nine key research questions can be generated:

  1. Overarching question: Does use of the test lead to improved health outcomes compared to the standard-of-care diagnostic strategy that does not include the test?
  2. Does the test have adequate analytic validity?
  3. How accurate is the test for detecting the target disease or condition? Is the test more accurate than the standard-of-care test for detecting the target disease or condition? Or, when the test is used as part of a diagnostic strategy (e.g., as a triage or add-on test), how accurate is the diagnostic strategy as a whole for detecting the disease or condition, and is the diagnostic strategy including the test more accurate than a standard-of-care diagnostic strategy?
  4. Does use of the test have any impact on treatment decision making by clinicians or patients?
  5. Does the treatment lead to improved intermediate outcomes in comparison with no treatment?
  6. Does the treatment lead to improved health outcomes in comparison with no treatment?
  7. What harms does the testing cause? Does the testing cause more harms than alternative testing strategies?
  8. What harms does the treatment cause? Does the treatment cause more harms than alternative treatments?
  9. What operational and/or financial impact does the testing have?

Figure 9. A sample analytic framework from providers' perspectives (for diagnostic tests).

Figure 10 is a sample analytic framework for payers for evaluation of a screening test for asymptomatic patients. As the diagram depicts, the framework is similar to the framework for patients (Figure 3), except that a component is added to address potential legal, ethical, operational, financial, and societal impact (including cost-effectiveness) of the test. Similar payer-oriented analytic frameworks for other testing scenarios (e.g., diagnosis in symptomatic patients, treatment monitoring, and drug selection) can also be built based on the patient-oriented frameworks (Figures 2–8).

Figure 10 is an evaluation framework for genetic testing for screening in asymptomatic patients, proposed from payers' perspectives. The framework includes a series of headers sequentially linked by left-to-right arrows: asymptomatic individuals at risk, testing, detection of target condition, early intervention, intermediate outcomes, and health outcomes. An overarching arc links the header “testing” and the header “health outcomes.” Below the headers “intermediate outcomes” and “health outcomes” are two additional headers, “harms caused by the testing” and “harms caused by the intervention”; an arrow from “testing” points to “harms caused by the testing,” and an arrow from “early intervention” points to “harms caused by the intervention.” The four headers “intermediate outcomes,” “health outcomes,” “harms caused by the testing,” and “harms caused by the intervention” are enclosed in one larger box labeled “balance of benefits and harms.” At the bottom of the figure is another header, “any operational, financial, ethical, legal, societal impacts (including cost-effectiveness).” Large arrows from the headers “asymptomatic individuals at risk,” “testing,” and “early intervention,” as well as from the box “balance of benefits and harms,” all point to this header.

Under this evaluation framework, nine key research questions can be generated:

  1. Does use of the test lead to improved health outcomes compared to the standard-of-care screening strategy or no screening?
  2. Does the test have adequate analytic validity?
  3. How accurate is the test for detecting the target condition? Is the test more accurate than a standard-of-care screening test (if any) for detecting the condition? Or, when the test is used as part of a screening strategy (e.g., as a triage or add-on test), how accurate is the screening strategy as a whole for detecting the target condition, and is the screening strategy using the test more accurate than a standard-of-care screening strategy?
  4. Does use of the test have any impact on decision making by clinicians or patients regarding early intervention (if any)?
  5. Does the early intervention (if any) lead to improved intermediate outcomes in comparison with no intervention?
  6. Does the early intervention (if any) lead to improved health outcomes in comparison with no intervention?
  7. What harms does the testing cause? Does the testing cause more harms than alternative testing strategies?
  8. What harms does the early intervention cause? Does the intervention cause more harms than alternative interventions?
  9. What operational, financial, legal, ethical, and societal implications (including cost-effectiveness) does the testing have?

Figure 10. A sample analytic framework from payers' perspectives (for screening tests).

In this report, we have not attempted to clarify any evaluation frameworks for regulators. As discussed previously, the evaluation issues that a regulator needs to address are largely defined by the laws which mandate their responsibilities.

We also have not presented any evaluation frameworks specific to test developers. As previously discussed, test developers' evaluation goals could vary across the phases of the development cycle. A dynamic approach to evaluation (such as the models based on the drug development process reviewed in a previous section of this report)25-34 would provide some practical guidance to test developers on the types of evaluation that need to be performed at each phase of the development cycle. Meanwhile, the patient-oriented evaluation frameworks introduced previously in this chapter would provide test developers with some useful insights about how to develop tests that meet the needs of patients, providers, and payers.

Analytic Validity

Key Question 2. What are the Strengths and Limitations of Different Approaches to Literature Searching to Assess Evidence on Variability in Genetic Testing? Is There an Optimal Approach to Literature Search?

Findings of the Targeted Review

To address Key Question 2, we first conducted a targeted review of existing literature search strategies for analytic validity of genetic tests to facilitate the discussion among the Workgroup members. As mentioned previously, given the broad scope of the work and the limited timeframe for the study, AHRQ and the ECRI Institute EPC team agreed that it would be important to be efficient in the targeted search and review. Therefore, although we had searched the major medical databases as well as the Web sites of government agencies and technology assessment groups, our targeted review was primarily focused on relevant published systematic reviews, particularly the landmark evidence reports on genetic testing topics performed by the CDC and the AHRQ EPC program.

As observed by the authors of several evidence reports being reviewed, lack of published data remains a major challenge to evaluating analytic validity of genetic (or other laboratory) tests.4-8 During the course of preparing this report, the project team had the same observation (refer to the results section for Key Question 4). Often, technology assessment groups needed to search for gray literature for analytic validity data.

Table 3 summarizes the gray literature sources searched for analytic validity studies in the examined evidence reports on genetic testing topics. The following were among the common gray literature sources searched by the reports' authors:

Table 3. Gray literature sources searched for analytic validity studies in evidence reports on genetic testing topics.

  • FDA's Web site, particularly FDA's PMA or 510(k) summaries and committee reports
  • Laboratories or manufacturers offering the tests being evaluated

    The information published on their Web sites

    Information released on the tests by laboratories or manufacturers, including press releases, lay magazine/newspaper articles, and package inserts for tests

    Direct contact with the laboratories or manufacturers

  • Conference publications from professional societies (e.g., the American Association for Clinical Chemistry, American Society of Clinical Oncology, College of American Pathologists [CAP], the American College of Medical Genetics [ACMG])
  • The ACMG/CAP external proficiency testing program
  • International external proficiency testing programs
  • The GeneTests Web site (available at: http://www.genetests.org)
  • Direct contact with individuals who were likely to have access to the relevant information.

From previous research we have done in the area of genetic testing and through our consultations with experts in the field, we identified the following additional resources as potentially useful sources of data for analytic validity:

  • The Clinical Laboratory Improvement Amendments (CLIA) program administered by the Centers for Medicare & Medicaid Services (CMS)
  • State-based regulatory programs, such as the Clinical Laboratory Evaluation Program (CLEP) of New York State
  • Laboratory accreditation organizations, such as CAP and the Joint Commission
  • The National Institutes of Health (NIH)
  • The Centers for Disease Control and Prevention (CDC)
  • The United States Patent and Trademark Office and the World Intellectual Property Organization
  • International agencies or collaborations.

Input From the Workgroup

The potential sources of data identified through the targeted review were presented to the Workgroup. The experts were then invited to comment on literature search strategies, particularly the utility of the potential gray literature sources identified previously. The following is a summary of the Workgroup's discussions:55,58

  • CMS regulates all laboratories (except research laboratories) performing tests on humans in the U.S. through CLIA and has responsibility for implementing the CLIA program.60 Laboratories that perform tests of moderate and/or high complexity (most, if not all, genetic tests) are required to be surveyed (inspected) by a CLIA-authorized State agency or an accrediting organization. However, most data from the individual laboratories that the CLIA program surveys are proprietary and not open to the public.
  • The CLEP program in New York State requires submission of laboratory validation data for laboratory-developed tests (LDTs). If the data are marked proprietary by the submitting laboratory, CLEP redacts the proprietary information before releasing any information in response to a request under New York State's Freedom of Information Law; an exception applies if release of the information could have a potential adverse impact on a business interest.
  • Analytic validity information for some tests may be available from NIH by contacting the principal investigators involved in developing the test. Principal investigators of NIH-funded studies are required to share data and respond to inquiries if the annual costs of their research in any given year are $500,000 or greater. The Research Portfolio Online Reporting Tools Expenditures and Results (RePORTER) query tool (formerly known as the CRISP system) can help identify particular studies on a test or the names of specific principal investigators. Several specific NIH programs were identified by the Workgroup as potentially useful sources of analytic validity data, including the Office of Rare Diseases' Collaboration, Education and Test Translation Program (which conducts mainly sequence-based tests), the Early Detection Research Network at NCI, the Pharmacogenetics Research Network (which is NIH-wide), and the Biomarkers Consortium (which covers a broad spectrum of diseases).
  • CDC could be a valuable resource for analytic validity data for screening tests on newborns. CDC operates the Newborn Screening Quality Assurance Program (NSQAP) in partnership with the Association of Public Health Laboratories. NSQAP provides various services, including proficiency testing, to more than 73 domestic newborn screening laboratories, 29 manufacturers of diagnostic products, and laboratories in 58 countries.61 NSQAP has been the only comprehensive source of essential quality assurance services for dried-blood-spot testing for more than 29 years. NSQAP publishes quarterly reports on the performance of participating laboratories in proficiency testing. CDC's Genetic Testing Reference Materials Coordination Program is also a potential source of analytic validity data. The goal of the program is to improve the availability of appropriate and characterized reference materials for: quality control, proficiency testing (PT), test development and validation, and research.
  • Some Workgroup members suggested looking into international resources as a means to obtain data, given the limited amount of money available for funding studies in the United States. Two international resources, EuroGentest and Orphanet, were mentioned as being of particular interest in the panel discussion. EuroGentest is a European Union-funded Network of Excellence that looks at all aspects of genetic testing: quality management, information databases, public health, new technologies, and education (more information about the network is available at: http://www.eurogentest.org/). Orphanet is a public database of information on rare diseases and orphan drugs; its aim is to contribute to improving the diagnosis, care, and treatment of patients with rare diseases (more information is available at: http://www.orpha.net/consor/cgi-bin/index.php?lng=EN). Orphanet has a Directory of Expert Services, which includes information on relevant clinics, clinical laboratories, research activities, and patient organizations. A Workgroup member commented that some international laboratories may not have any federal or regulatory bodies governing them; thus, data from these laboratories should be used with extra caution.
  • Another possible source of analytic validity data could be a professional society database (such as one maintained by CAP) to which laboratories submit data, with the data de-identified prior to release. Posting a query on CHAMP, the Association for Molecular Pathology (AMP) members-only listserv, may also be helpful in identifying such data. Many of the larger clinical laboratories are represented in AMP.
  • Review summaries from test kit manufacturers are available on the FDA Web site, as are summaries written by the FDA on those tests; the FDA-written summaries tend to be more detailed than the manufacturers' 510(k) summaries.
  • Proficiency testing programs could be a valuable source of analytic validity data. For example, subscribers to the CAP proficiency testing programs may request data from the program, and the data are sent to the requestor in summary form. Other proficiency testing programs (e.g., the European Molecular Genetics Quality Network's External Quality Assessment program and the New York State CLEP's PT program) may also be helpful to technology evaluators.
  • Several Workgroup members advocated directly contacting the laboratories or manufacturers that provide the test for the data needed. These members commented that test validation data are generated on a regular basis at laboratories but are rarely published in peer-reviewed journals. A laboratory that focuses more on public health (such as a newborn screening laboratory) than on for-profit testing might be more willing to share its data. Searching the GeneTests and AMP Web sites may help identify relevant laboratories providing such testing services.

A Comprehensive Approach to Search of Analytic Validity Data

Summarizing the comments of the Workgroup, the findings of the targeted review and ECRI Institute EPC's experience from previous work on genetic testing evaluation, we recommend a systematic approach to search for analytic validity data. At the outset, a comprehensive search of published analytic validity data should be performed. Major internal and external databases (e.g., PubMed and Embase) need to be searched using a list of controlled vocabulary terms (e.g., MeSH [Medical Subject Headings] and Emtree), publication types, and textword combinations. The development of the search strategy should be guided by the key research questions, needs of the stakeholder who commissioned the study, and input from technical experts. For this task, experienced search specialists who are familiar with online thesauri for controlled vocabularies (e.g., MeSH Browser, Emtree, and PsycINFO Thesaurus) and specialized syntaxes can be helpful. Refer to Appendix A of this report for a sample list of the databases that might need to be searched and the search strategy used to identify studies. In addition, hand searches of journals as well as the bibliographies of retrieved articles also need to be performed to obtain articles not retrieved by the database searches.
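As an illustration of this step, the short sketch below shows one way a search specialist might combine a controlled vocabulary heading with free-text synonyms when querying PubMed programmatically through NCBI's E-utilities (via Biopython). The query terms, contact address, and retrieval limit are illustrative assumptions, not the search strategy actually used for this report (that strategy appears in Appendix A).

    # A minimal sketch of a MeSH-plus-textword PubMed query; terms are illustrative.
    from Bio import Entrez

    Entrez.email = "reviewer@example.org"   # NCBI asks for a contact address

    # Combine a MeSH heading with free-text synonyms, then intersect with
    # analytic-validity-related controlled-vocabulary and textword terms.
    query = (
        '("Genetic Testing"[MeSH] OR "genetic test"[tiab] OR "genetic tests"[tiab]) '
        'AND ("Reproducibility of Results"[MeSH] OR "analytic validity"[tiab] '
        'OR precision[tiab])'
    )

    handle = Entrez.esearch(db="pubmed", term=query, retmax=100)
    record = Entrez.read(handle)
    handle.close()

    print(record["Count"])    # total number of matching citations
    print(record["IdList"])   # up to 100 PubMed IDs for retrieval and screening

The same query string can also be pasted directly into the PubMed Web interface; the programmatic form simply makes the strategy reproducible and easy to rerun as the review is updated.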

Unless the published data identified provide a sufficient evidence base for analytic validity assessment, an extensive search for unpublished data sources would enhance the thoroughness of the assessment and decrease the uncertainty associated with its findings. Such an extensive, systematic search of unpublished data can be an extremely time- and resource-consuming endeavor. To improve the efficiency and effectiveness of the search, it is important to seek input from experts who are familiar with the testing area at an early stage.

Table 4 is a summary of common sources of unpublished data for analytic validity. The summary was developed based on the comments from the Workgroup and the findings of the targeted review of the project team that were previously discussed. The table is intended to provide a brief checklist of the potentially useful resources for identifying unpublished data. Brief comments on the strengths and limitations of the resources are also provided. Depending on the particular tests being assessed, some of the resources listed in the table may not be relevant, while other resources could be more valuable. For example, FDA's test kit review summaries could be a valuable source of data for commercial test kits but may not be relevant to laboratory-developed tests (also known as “in-house tests” or “homebrew tests”) at this time. CDC's NSQAP could be a valuable data source for dried-blood-spot testing for newborn screening, but may not be useful for other tests.

Table 4. Common sources of unpublished data for analytic validity assessment.

A systematic search of peer-reviewed literature and unpublished data sources, such as those listed above and in Table 4, would increase the chance of identifying information helpful for addressing analytic validity issues. Whether the data identified ultimately meet the inclusion criteria for the assessment will be determined by a critical evaluation of the data, particularly of data quality. In the following section, issues regarding quality rating criteria for analytic validity studies are addressed.

Key Question 3. Is it Feasible to Apply Existing Quality Rating Criteria to Analytic Validity Studies on Genetic Tests? Is There an Optimal Quality Rating Instrument for These Studies?

Quality of individual studies has been defined differently by different authors. The Cochrane Collaboration defines study quality as “a vague notion of the methodological strength of a study, usually indicating the extent of bias prevention.”62 In this definition, bias refers to a systematic error or deviation in results or inferences from the truth. Some other authors use the term to refer to “the extent to which all aspects of a study's design and conduct can be shown to protect against systematic bias, nonsystematic bias, and inferential error.”63 The term “quality” has also been used in an even broader sense to measure the study's potential for bias (internal validity), applicability (or generalizability or external validity) of the findings, and reporting quality.64

How authors define quality of individual studies might depend on their views about what methodological issues are more likely to cause study results to potentially deviate from the truth, as well as their thinking about the appropriate ways to incorporate various “quality elements” (e.g., systematic bias, nonsystematic bias, inferential error, external validity, and reporting quality) into the assessment of the overall quality or strength of evidence (a concept discussed later in this section). In recent years, the AHRQ EPC program has focused on “risk of bias” when evaluating the quality of individual studies.65 However, when we reviewed published EPC reports that evaluated analytic validity of genetic tests (discussed later in this section), we found that most of these reviews used a broader definition of study quality (i.e., including systematic bias, generalizability, reporting adequacy, and validity of statistical analysis). As evidenced by the discussions among the Workgroup experts, how best to determine the quality of individual studies examining analytic validity of genetic tests is far from settled. The Workgroup participants favored using a more inclusive, multi-dimensional definition of quality including systematic bias, generalizability, reporting adequacy, and validity of statistical analysis.

It is worth noting that some authors or groups use the term “quality of evidence” to refer to the overall strength of the evidence base (consisting of one or multiple studies).66,67 Assessment of the overall strength of evidence is a complex matter, involving consideration of the limitations (or “risk of bias”) of individual studies, the quantity of data (or “precision” of summary estimates), the consistency of the evidence, and the directness of the evidence.66,67 The methodological issues regarding how to grade the overall strength of evidence are beyond the scope of this section on analytic validity. The goal of this portion of the project was to examine whether there was consensus about how to assess the quality (primarily in terms of risk of bias or study limitations) of individual studies of analytic validity for genetic tests.

The aim of analytic validity studies is to determine how good a particular test is at detecting the target analyte (e.g., a particular gene or biomarker in the specimen). Analytic validity studies evaluate a broad range of testing performance characteristics, such as analytic sensitivity or specificity (for qualitative tests), analytic accuracy (for quantitative tests), precision, reproducibility, and robustness (see Acronyms/Abbreviations and Glossary for definitions of the terms). Analytic validity studies of laboratory tests are unique in design compared with other types of studies, such as diagnostic accuracy studies and studies evaluating therapeutic interventions. These design features mean that the criteria needed to assess the quality of analytic validity studies differ from those needed to assess studies of diagnostic accuracy or therapeutic interventions. To address Key Question 3, we first conducted a targeted review of quality criteria that have been developed specifically for assessing the quality of analytic validity studies.

Findings of the Targeted Review

To identify existing criteria for assessing the quality of analytic validity studies, we first searched multiple electronic databases of peer-reviewed publications (the search strategy is provided in Appendix A) and queried the Workgroup for other relevant resources. Our search of the electronic databases identified one set of criteria that was specifically designed to assess the quality of analytic validity studies. This list was first published in 2008 by the EGAPP Working Group for evaluation of the quality of analytic validity of genetic tests.3 While the ACCE framework does include ten questions regarding analytic validity, its primary purpose is to organize analytic validity information rather than to assess its quality.37

The EGAPP Approach to Assessing Quality of Analytic Validity Studies

Table 5 is a summary of the EGAPP approach to assessing the quality of analytic validity studies. This approach includes a method for judging the quality of individual studies and a method for reaching a conclusion about the overall quality of the evidence base. EGAPP judges the quality of individual studies using a hierarchy of data sources and study designs (column 1 of Table 5) and a set of additional criteria for assessing the internal validity of studies (column 2 of Table 5). Based on the assessment of individual studies, EGAPP grades the overall quality of evidence as convincing, adequate, or inadequate (column 3 of Table 5).

Table 5. The EGAPP approach to assessment of the quality of analytic validity studies.

While the EGAPP approach provides a structure for assessing the quality of analytic validity studies, some technical issues restrict its applicability. Detailed guidance does not exist about how to judge some of the quality criteria (e.g., how to judge whether an external proficiency testing scheme is “well-designed”). In addition, some of the criteria for judging studies' internal validity are concerned only with the reporting quality of the study (e.g., “adequate descriptions of index test”).

Sources of Quality Rating Criteria That are Potentially Helpful in Assessing Analytic Validity Studies

Our search of the electronic databases of peer-reviewed publications also identified a multitude of instruments developed for assessing the quality of diagnostic accuracy (or clinical validity) studies or of studies evaluating therapeutic interventions. These instruments were not developed specifically for assessing the quality of analytic validity studies, and some focus only on reporting quality without addressing other quality elements (e.g., internal and external validity). However, some components of these instruments may be useful for proposing quality assessment criteria for analytic validity studies. These instruments include:

  • The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool64,68
  • Standards for Reporting of Diagnostic Accuracy (STARD) checklist for the reporting of studies of diagnostic accuracy69
  • REporting recommendations for tumor MARKer prognostic studies (REMARK)70
  • Checklist for reporting and appraising studies of genotype prevalence and gene-disease associations proposed by CDC71
  • QUADOMICS tool (adapted from QUADAS) for the evaluation of the quality of studies on the diagnostic accuracy of ‘-omics’-based technologies35
  • The Newcastle-Ottawa Scale for assessing the quality of case control studies72
  • USPSTF criteria for assessing internal validity of individual studies73
  • USPSTF criteria for assessing external validity (generalizability) of individual studies74

In addition, via querying the Workgroup, we identified various guidance documents used by regulatory agencies for evaluating the quality of the materials submitted by test developers to support their applications for the approval of new tests. For example, FDA published guidance documents (or draft guidance documents) for industry and the agency's staff on subjects such as pharmacogenetic tests and genetic tests for heritable markers, nucleic acid based in vitro diagnostic devices for detection of microbial pathogens, and in vitro diagnostic multivariate index assays, respectively.75-77 The New York State CLEP has similar guidance (e.g., the Checklist for Genetic Testing Validation Packages).78 Although some of the criteria specified in the documents are relevant to the goal of this report, the purpose of the regulatory guidance is not to evaluate all aspects of quality of analytic validity studies (e.g., internal validity, external validity, nonsystematic errors, and reporting quality). Similarly, the guidelines and standards for laboratories published by professional societies (e.g., CAP and ACMG) or the Clinical Laboratory Standards Institute could provide useful input for this report, but do not evaluate all quality aspects of analytic validity studies.

Quality Assessment Criteria Used in Completed Evidence Reports

Table 6 is a summary of the information we identified in our targeted review. We summarized the quality-rating criteria for analytic validity studies used in completed evidence reports on genetic testing topics. As the summary reveals, there was no consensus among the authors of these evidence reports on what criteria should be used for judging the quality of analytic validity studies. Some authors used the EGAPP approach; some authors used criteria from the REMARK and STARD guidelines; other authors used criteria developed by CDC, and some developed their own criteria. In some reports, only reporting quality of the studies was assessed, while, in other reports, additional quality components (e.g., internal or external validity) were also assessed.

Table 6. Quality assessment criteria for analytic validity studies used in evidence reports on genetic testing topics.

Input From the Workgroup

After examining the findings of the targeted review, the Workgroup reached a consensus that a comprehensive, easier-to-use list of quality assessment criteria would be beneficial to the practice of analytic validity assessment. Several experts suggested that an ideal set of quality rating criteria should not only include items that measure the internal validity of the studies, but also need to include those measuring external validity and reporting quality.

To propose a draft analytic validity quality criteria list, the project team first synthesized the EGAPP criteria3 and other criteria that had been used in completed evidence reports for assessing quality of analytic validity studies (refer to Table 6). Relevant items from other published quality assessment instruments such as QUADAS,79 REMARK,70 and STARD,69 as well as those from FDA or CLEP review guidance were also incorporated into the draft list.75-78 We provided the draft list to the Workgroup for comments and suggestions. After we received feedback from the experts, we further revised the list of criteria. Some new quality items were added, some items were removed, and other items were combined.

A Quality Criteria List for Individual Studies of Analytic Validity

Table 7 is the finalized list for assessing the quality of analytic validity studies. The list consists of 17 items that cover various quality aspects, including internal validity, reporting quality, and other factors potentially causing bias. Some of the quality items may not be applicable to all tests being evaluated; for example, item 8 is relevant only to quantitative tests. Therefore, users should customize the list to meet their assessment needs, ideally before examining the studies. The answer to each item (except item 1) is “Yes,” “No,” or “Unclear.” “Unclear” is provided as a response option primarily to address reporting quality issues: if a quality item cannot be addressed because of a lack of reported information, the response to that item would be “Unclear.”

Table 7. Quality assessment criteria for analytic validity studies.

The purpose of this list is to provide a method for systematically and consistently evaluating the key quality aspects of analytic validity studies. This checklist is intended to apply to the studies that evaluate the performance characteristics that are of primary concern to systematic reviewers, including sensitivity, specificity, and precision (including repeatability and reproducibility). These performance characteristics are commonly reported within the same study (e.g., test validation studies), although they reflect different aspects of analytic validity.

To ensure that the list is flexible and customizable, we have not provided detailed instructions for making the judgment about each quality item. Some quality items on the list include wording such as “appropriate” and “appropriately.” Our philosophy is that the users of the list should determine a priori what criteria should be used to answer “Yes,” “No,” or “Unclear” for the quality items. The criteria used need to be based on the topics being evaluated and the needs of the stakeholders of the evaluation.
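To illustrate how such a priori decisions might be operationalized, the sketch below shows one way an assessor could record customized applicability decisions, decision rules, and “Yes,” “No,” or “Unclear” responses. The item wording and decision rules in the sketch are hypothetical paraphrases, not the verbatim Table 7 items.

    # A minimal sketch of recording responses against a customized quality checklist.
    # Item wording is illustrative; the actual Table 7 items and a priori decision
    # rules should be substituted before data extraction begins.
    from dataclasses import dataclass

    ANSWERS = {"Yes", "No", "Unclear"}

    @dataclass
    class ChecklistItem:
        number: int
        text: str                   # quality item (paraphrased placeholder wording)
        applicable: bool = True     # e.g., quantitative-only items may be dropped
        decision_rule: str = ""     # a priori rule agreed on before data extraction
        answer: str = "Unclear"     # "Unclear" when the study does not report enough

        def record(self, answer):
            if answer not in ANSWERS:
                raise ValueError(f"answer must be one of {ANSWERS}")
            self.answer = answer

    # Hypothetical, paraphrased items -- not the verbatim Table 7 wording.
    checklist = [
        ChecklistItem(2, "Was an appropriate reference method used?",
                      decision_rule="'Yes' only if bidirectional sequencing or equivalent"),
        ChecklistItem(8, "Was performance assessed across the reportable range?",
                      applicable=False),   # dropped a priori for a qualitative test
    ]

    checklist[0].record("Yes")
    summary = {item.number: item.answer for item in checklist if item.applicable}
    print(summary)   # {2: 'Yes'}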

We acknowledge that empirically validating a quality assessment instrument is a time-consuming matter. Given the time frame for this report, it was not feasible for us to empirically validate this list. However, all items on the list have been applied in previous evidence reports (refer to Table 6). We also tested this set of criteria on several sample analytic validity studies to ensure applicability of the quality items on the list.

Key Question 4. What are Existing Gaps in Evidence on Sources and Contributors of Variability Common to all Genetic Tests, or to Specific Categories of Genetic Tests? What Approaches Will Lead to Generating Data to Fill These Gaps?

In this section, we used three case studies to demonstrate the issues that test evaluators may encounter when attempting to evaluate the analytic validity of tests. We chose three tests of different types for this purpose. For each test, we searched for literature and information on analytic validity to investigate the gaps in the evidence sources and discussed possible sources of and contributors to variability in testing results.

Case Study 1. Biochemistry Test for Cancer Antigen-125

Cancer Antigen 125 (CA-125) is the term used to refer to the measurement of serum levels of mucin 16 for clinical oncology indications. Mucin 16 is a glycoprotein expressed by many different types of cells. A variety of CA-125 tests are commercially available today, but all depend on a monoclonal antibody first created in 1981, the OC 125 antibody.80 Practically all tests in use today are “second generation” tests that use a combination of OC 125 and another mucin 16-recognizing antibody called M11.81 The various commercially available tests differ only in the methodology used to measure the amount of bound monoclonal antibody.

Measurements of serum CA-125 levels are used for a variety of medical reasons. Normal levels of CA-125 are 35 U/ml or lower; a number of conditions, including cancers, pregnancy, and inflammation, can cause elevated serum levels of CA-125. The indication focused on in this Case Study is the monitoring of ovarian cancer response to treatment. Serum CA-125 levels are elevated in approximately 80 percent of women with ovarian cancer. For women with CA-125-overexpressing ovarian tumors, the relative level of the antigen over time can be used to track response to treatment, since it tends to decrease or increase in proportion to tumor load.

The majority of current generation CA-125 tests (CA-125 II) work in the following manner. The monoclonal antibody M11 is affixed to a solid phase, such as a microtiter dish or microparticles. The sample, generally serum collected from patients, is washed over the solid phase and mucin 16 protein binds to the antibody M11. The solid phase is then washed to remove the parts of the patient sample that have not bound to the antibody. The antibody OC-125, usually attached to an enzyme for later detection, is then applied to the solid phase. The labeled OC-125 binds to the captured mucin 16. Labeled OC-125 that has not bound is then washed away. The amount of bound OC-125 is then measured. The measurement step is the point at which various assays deviate in methodology, but all are similar in principle.

For example, the VIDAS CA-125 II test (Fujirebio Diagnostics, Inc.) uses OC-125 that has been attached to the enzyme alkaline phosphatase. A substrate (4-methyl-umbelliferyl phosphate) is washed over the solid phase, and bound alkaline phosphatase cleaves the substrate into a fluorescent chemical (4-methyl-umbelliferone). The intensity of the fluorescence is proportional to the concentration of mucin 16 present in the original serum sample.82

Most commercially available CA-125 assays are almost completely automated and come with pre-packaged assay reagents. The use of pre-packaged assay reagents eliminates variation in assay components, assuming the reagents are prepared according to good quality control and good manufacturing practices as defined by FDA. Most assays come with “standards” that laboratories can use to calibrate the kits to their working conditions. Errors in or failure to calibrate the instruments and kits could contribute to variability in results. However, the most likely source of variability in results is variation in methods of collection of, storage of, and preparation of the serum samples.
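To illustrate the calibration step, the sketch below reads a hypothetical unknown sample's fluorescence off a calibration curve built from kit standards. The standard concentrations and signal values are invented for illustration; real assay software typically fits a four-parameter logistic curve rather than the simple monotone interpolation used here.

    # A minimal sketch of signal-to-concentration calibration from kit standards.
    # All values are hypothetical and are not taken from any kit's package insert.
    import numpy as np

    # Hypothetical kit standards: concentration (U/ml) vs. measured fluorescence.
    standard_conc   = np.array([0.0, 15.0, 35.0, 100.0, 300.0, 600.0])
    standard_signal = np.array([2.0, 48.0, 110.0, 290.0, 720.0, 1050.0])

    def concentration_from_signal(signal):
        """Interpolate an unknown's signal onto the standard curve (within range)."""
        return float(np.interp(signal, standard_signal, standard_conc))

    # An unknown sample whose fluorescence falls between the 100 and 300 U/ml standards:
    print(concentration_from_signal(400.0))   # approximately 151 U/ml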

Tso et al. attempted to measure the “real life” variability in CA-125 testing by collecting multiple samples from each patient and submitting them for analysis to a laboratory unaware of the experiment.83 The variability in results was found to dramatically increase as the amount of CA-125 in the samples increased; there was practically no variation from test to test for samples with less than 100 U/ml, but a high degree of variability from test to test for samples with more than 600 U/ml. The clinical implications of these findings are unclear, considering that 35 U/ml is generally considered the “high” normal level.
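As a simple illustration of how such variability can be summarized, the sketch below computes the coefficient of variation (CV) for replicate measurements at a low and a high concentration. The replicate values are hypothetical and are not taken from Tso et al.

    # A minimal sketch with made-up replicate CA-125 values showing how
    # within-patient variability can be summarized as a coefficient of variation.
    import statistics

    def cv_percent(values):
        """Coefficient of variation: standard deviation as a percentage of the mean."""
        return 100.0 * statistics.stdev(values) / statistics.mean(values)

    # Hypothetical triplicate results (U/ml) from single blood draws.
    low_level_replicates  = [32.0, 33.5, 31.8]      # near the 35 U/ml cutoff
    high_level_replicates = [640.0, 710.0, 820.0]   # well above the cutoff

    print(f"Low-level CV:  {cv_percent(low_level_replicates):.1f}%")
    print(f"High-level CV: {cv_percent(high_level_replicates):.1f}%")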

We identified a systematic review, “Genomic Tests for Ovarian Cancer Detection and Management,” a report produced by AHRQ's EPC program.47 This report was finalized in October 2006. One section of the systematic review was devoted to locating evidence on the analytic performance of CA-125 tests in the laboratory. The authors of the review searched Medline and FDA databases and located six articles about the analytic validity of CA-125 tests. The authors of the systematic review reported that all six compared the performance of the tests to either earlier-generation CA-125 tests or to other similar types of tests. Outcomes reported were reproducibility of the tests, precision of the tests, impact of analyte concentration on the sensitivity of the tests, and the correlation of results with earlier generation tests. The authors of the systematic review concluded that:

The published data on clinical laboratory performance suggests that currently available radioimmunoassays for single-gene products have acceptable reproducibility and reliability, although even this level of variability may have some impact on clinical interpretation of results, especially when comparing relatively small serial changes, or levels close to the discriminatory threshold.47

This conclusion was based on a narrative review and visual inspection of the included data.

The majority of the articles discussed in the 2006 EPC report compared the results of second-generation CA-125 tests to first-generation CA-125 tests (Kenemans et al. 1995).84 However, some of the articles selected by the 2006 EPC report do not appear to strictly meet the definition of “analytic validity.” For example, Tamakoshi et al. 1996 studied the sensitivity of five tumor markers (including CA-125) for diagnosing patients with various types of ovarian cancer in the clinic, a purpose most would refer to as “establishing clinical validity.”85

We searched Embase and MEDLINE from 1980 through July 2009 for articles relevant to the analytic validity of CA-125. The search strategies are summarized in Appendix A. After reviewing the articles identified, we selected the articles listed in Table 8 as relevant to the analytic validity of CA-125 tests.

Table 8. Published studies of the analytic validity of CA-125.

In addition to the published articles, eleven 510(k) clearances for commercial CA-125 test kits were identified by searching the FDA database (available at: http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm) for CA-125. Each clearance summary contained detailed information about the analytic validity of the test and how that validity was established. A substantial number of manufacturers/vendors of commercial CA-125 kits were also identified. The product labeling for each kit typically contained some information about analytic validity; however, all of these commercially available test kits also have 510(k) summaries with additional details about analytic validity.

Searches of the gray literature did not identify additional relevant information. For example, U.S. patent 4,921,790 (issued May 1, 1990; expired May 1, 2007) describes an ELISA test kit for CA-125 but presents no analytic validity information; it provides only references to the discovery of CA-125 and its possible clinical uses in the management of ovarian cancer. See Appendix A for a link to the patent description.

Case Study 2. Establishing the Analytic Validity of Cytochrome p450 Polymorphism Testing

The CYP450 family of enzymes is found in the liver and is responsible for metabolizing a large number of molecules, including many commonly administered pharmacologic agents. Polymorphisms of some of the genes within this system are known to affect enzymatic activity of the cytochrome p450 complex, which affects the half-life and therapeutic dosage of pharmacologic agents. Genetic tests, such as the recently FDA-approved Roche AmpliChip CYP450 Test, are now available to test for CYP450 polymorphisms. The AmpliChip delivers the results of testing for polymorphisms in the form of “predicted phenotypes”—poor metabolizers, intermediate metabolizers, extensive metabolizers, and ultra-rapid metabolizers.
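To illustrate the general idea of translating a genotype into a predicted metabolizer phenotype, the sketch below uses an activity-score approach of the kind described in pharmacogenetics dosing guidelines. It is not the AmpliChip's algorithm; the allele activity values and category cutoffs shown are illustrative assumptions only.

    # A minimal sketch of mapping a CYP2D6 genotype to a predicted phenotype.
    # NOT the AmpliChip algorithm; activity values and cutoffs are illustrative.
    ALLELE_ACTIVITY = {
        "*1": 1.0, "*2": 1.0,      # normal-function alleles
        "*10": 0.25, "*41": 0.5,   # reduced-function alleles
        "*3": 0.0, "*4": 0.0,      # no-function alleles
    }

    def predicted_phenotype(allele1, allele2, duplications=0):
        """Sum illustrative allele activity scores and map them to a phenotype label."""
        score = ALLELE_ACTIVITY[allele1] + ALLELE_ACTIVITY[allele2]
        score += duplications * ALLELE_ACTIVITY[allele1]   # crude handling of duplication
        if score == 0:
            return "poor metabolizer"
        if score < 1.0:
            return "intermediate metabolizer"
        if score <= 2.0:
            return "extensive (normal) metabolizer"
        return "ultra-rapid metabolizer"

    print(predicted_phenotype("*4", "*4"))                    # poor metabolizer
    print(predicted_phenotype("*10", "*41"))                  # intermediate metabolizer
    print(predicted_phenotype("*1", "*1", duplications=1))    # ultra-rapid metabolizer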

Warfarin is an oral anticoagulant prescribed to treat a variety of health conditions. Warfarin acts by interfering with the synthesis of clotting factors in the liver, and bleeding is a common adverse event associated with taking the drug. Establishing the safe and effective dose of warfarin for each patient can be difficult. Certain polymorphisms in the genes CYP2C9 (which encodes the protein cytochrome P450 2C9) and VKORC1 (which encodes vitamin K epoxide reductase complex subunit 1) affect the metabolism and action of warfarin. In August 2007, the FDA updated the product label for warfarin (Coumadin) to include genetic variations in CYP2C9 and VKORC1 as factors to consider for more precise initial dosing.91

Matchar et al. prepared a technology assessment for AHRQ (as part of the EGAPP program) on testing for cytochrome p450 polymorphisms in adults with depression in 2006.92 As part of the assessment the authors addressed the analytic validity of such tests. The authors defined the “gold standard” reference for these tests as bidirectional sequencing. They identified 12 published articles and 2 documents from the FDA Web site (on performance of the Roche AmpliChip) that described methods for genotyping various CYP450 enzymes. Only four of the studies used the “gold standard” reference of DNA sequencing; the others compared their results to other methods of genotyping, or to published allele frequencies in populations similar to the ones employed in the study. Sensitivity and specificity were generally high (in the range of 94 to 100 percent) for the various tests. Sample sizes used in the validation studies ranged from approximately 50 to approximately 400, of which most were negative for any of the target polymorphisms; the numbers of positive samples were generally very low, in the single digits for most of the tests and polymorphisms. Some of the validation studies also reported on the reproducibility and repeatability of the tests. Repeatability assays varied, and were performed on one to four samples anywhere from twice to 12 times. Reproducibility assays also varied, and may have incorporated between-laboratory, between-operator, and day-to-day assays; however, few studies reported performing all three types of reproducibility assays.
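The practical consequence of having so few variant-positive samples can be illustrated with exact binomial (Clopper-Pearson) confidence intervals, as in the sketch below. The counts are hypothetical and are not drawn from the studies reviewed by Matchar et al.; they simply show that a test detecting 6 of 6 positive samples still carries a wide confidence interval around its estimated analytic sensitivity.

    # A minimal sketch (hypothetical counts) of why single-digit numbers of
    # variant-positive samples leave wide uncertainty around sensitivity.
    from scipy.stats import beta

    def clopper_pearson(successes, n, alpha=0.05):
        """Exact (Clopper-Pearson) two-sided confidence interval for a proportion."""
        lower = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, n - successes + 1)
        upper = 1.0 if successes == n else beta.ppf(1 - alpha / 2, successes + 1, n - successes)
        return lower, upper

    # Hypothetical validation run: 6 sequencing-confirmed carriers, all detected;
    # 394 confirmed non-carriers, all correctly reported as negative.
    sens_lo, sens_hi = clopper_pearson(6, 6)
    spec_lo, spec_hi = clopper_pearson(394, 394)

    print(f"Sensitivity 100% (95% CI {sens_lo:.0%} to {sens_hi:.0%})")   # roughly 54% to 100%
    print(f"Specificity 100% (95% CI {spec_lo:.0%} to {spec_hi:.0%})")   # roughly 99% to 100%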

We searched Embase and MEDLINE for relevant studies published since 2007 using the strategy described in Appendix A. Our searches identified 12 potentially relevant articles. However, review of the abstracts indicated that none of these articles studied the analytic validity or mechanisms of performing testing for cytochrome p450 polymorphisms.

Case Study 3. Establishment of the Analytic Validity of FISH Assays for ERBB-2 (Also Called HER2/neu)

The gene encoding the human epidermal growth factor receptor 2 (ERBB2), commonly referred to as HER2/neu, is overexpressed in approximately 20 percent of breast tumors. Overexpression can be the result of gene amplification, enhanced RNA transcription, or enhanced protein synthesis. In approximately 90 percent of tumors with overexpression, the overexpression is thought to result from amplification of the ERBB2 gene (i.e., more than the normal two copies of the gene per tumor cell).93,94 Cells that overexpress ERBB2 have an enhanced responsiveness to growth factors.95,96

A monoclonal antibody that binds ERBB2, trastuzumab (Herceptin; Genentech, South San Francisco, CA), is used clinically to treat women with breast cancer whose tumors overexpress ERBB2. Because trastuzumab is active only against tumors that overexpress ERBB2, testing tumors for ERBB2 expression levels is important for treatment planning.

Fluorescent in situ hybridization (FISH) is a general testing method used to identify the number of copies of a genetic sequence in cells. The test is performed on fixed tissue that has been sectioned and mounted on a slide. The sections are then hybridized with a fluorescent-labeled DNA probe that recognizes the ERBB2 gene. Unbound probe is washed away and the slide is mounted. The slide is then viewed under a fluorescent microscope. The number of ERBB2 signals per cell is counted. A cell that has amplified the ERBB2 gene will have multiple ERBB2 signals per cell nucleus. The results of FISH tests for ERBB2 are commonly reported as negative (no amplification) or positive (amplification).

Immunohistochemistry (IHC) is a general testing method for identifying and quantifying protein in biopsy or surgery specimens that have been fixed, sectioned, and mounted on microscope slides. The section is incubated with an antibody that recognizes the ERBB2 protein. Excess antibody is washed away, and the bound antibody is detected by a labeled secondary antibody. The secondary antibody is usually labeled with an enzyme (a peroxidase) that breaks down a chromogenic substrate (diaminobenzidine) into an insoluble brown stain. Sometimes the initial antibody is detected by a secondary antibody labeled with biotin that is then detected by avidin labeled with the peroxidase enzyme. After incubating the slide with the chromogenic substrate, the cells are usually stained with nonspecific dyes to allow visualization of the cellular structure. The slide is then mounted and examined under a microscope. The degree of staining is estimated by the technician by comparing the slide to control slides with known degrees of staining. IHC tests of ERBB2 expression are commonly reported on a scale of 0 to 3, with 3 indicating a high degree of overexpression and 0 indicating normal levels of ERBB2 expression, as compared to levels found in normal breast epithelium. With this simple qualitative method of estimating the relative amount of stained ERBB2, there is no way to systematically adjust the threshold for each of the four categories (0–3); however, each observer may have a unique conscious or subconscious threshold. Observers with a high threshold will minimize sensitivity while maximizing specificity, whereas observers with a low threshold will maximize sensitivity while minimizing specificity. Further, the overall threshold can be adjusted by different choices for combining the categories to produce a dichotomous positive or negative test result. For example, the 0 category could be considered negative and 1–3 considered positive (maximizing sensitivity and minimizing specificity), or 0–2 could be considered negative and 3 considered positive (minimizing sensitivity and maximizing specificity).
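The effect of choosing different cutoffs can be made concrete with a short Python sketch. The paired results below are entirely hypothetical, and FISH amplification status is used only as a convenient comparator for the calculation (the report notes later that there is no recognized gold standard for ERBB2 status); the point is simply that moving the cutoff up or down the 0–3 scale trades sensitivity against specificity.

    # Hypothetical paired results: (IHC score 0-3, comparator classification, e.g., FISH amplification).
    paired_results = [
        (0, False), (0, False), (1, False), (1, True),
        (2, False), (2, True), (2, True), (3, True),
        (3, True), (3, False), (1, False), (0, False),
    ]

    def dichotomize(ihc_score: int, cutoff: int) -> bool:
        """Call the IHC result positive when the score is at or above the cutoff."""
        return ihc_score >= cutoff

    for cutoff in (1, 2, 3):
        tp = sum(1 for s, pos in paired_results if dichotomize(s, cutoff) and pos)
        fn = sum(1 for s, pos in paired_results if not dichotomize(s, cutoff) and pos)
        tn = sum(1 for s, pos in paired_results if not dichotomize(s, cutoff) and not pos)
        fp = sum(1 for s, pos in paired_results if dichotomize(s, cutoff) and not pos)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        print(f"Positive if score >= {cutoff}: sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")

In this hypothetical data set, counting any staining (score 1 or higher) as positive maximizes sensitivity at the cost of specificity, while requiring a score of 3 does the reverse, mirroring the observer-threshold effect described above.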

In 2007 the American Society of Clinical Oncology (ASCO) and the College of American Pathologists (CAP) jointly conducted a systematic review of the literature and developed recommendations for ERBB2 testing.97 The panel concluded that as much as 20 percent of current ERBB2 testing may have been inaccurate and that the data did not clearly demonstrate the superiority of any particular testing method. The panel went on to define criteria for specimen handling, assay interpretation, and reporting, in the hope that standardization of methods would reduce the variability and inaccuracy of testing.97

Middleton et al. published an article in 2009 exploring the impact of the ASCO/CAP guidelines on ERBB2 testing.94 The authors reported that prior to implementation of the guidelines, concordance between FISH-based and IHC-based testing was 98 percent, and 10.8 percent of cases had inconclusive FISH results. After implementation of the guidelines, the authors reported that the concordance between FISH-based and IHC-based testing was 98.5 percent, and only 3.4 percent of cases had inconclusive FISH results.94

A 2008 evidence report prepared for the Agency for Healthcare Research and Quality (AHRQ) explored the analytic validity of assays for ERBB2.7 Seidenfeld et al. systematically searched the medical literature through April 2008. Key Question 1 of the review focused on discrepancies between results provided by different types of assays for ERBB2, especially discrepancies between FISH-based and IHC-based assays. The authors of the review noted that “Notably, there is no recognized gold standard to determine the HER2 status of tumor tissue, which also precludes consensus on one ‘best’ HER2 assay.” The authors' conclusion for Key Question 1 is quoted below:

A narrative review was conducted on Key Question 1, which addressed concordance and discrepancy among HER2 assays in breast cancer. HER2 assay results are influenced by multiple biologic, technical, and performance factors. Since many aspects of HER2 assays were standardized only recently, we could not isolate effects of these disparate influences on assay results and patient classification. This challenged the validity of using systematic review methods to compare available assay technologies.7

We searched Embase and MEDLINE for articles published since April 2008 using the search strategy in Appendix A. The search strategy identified 36 articles of possible relevance. Review of the abstracts identified six articles studying alternative (non-IHC, non-FISH based) methods of testing for ERBB2,98-103 five articles comparing different IHC-based and FISH-based methods of testing for ERBB2,104-108 and two articles exploring methods to reduce variability of testing for ERBB2.109,110 In the latter category, Masmoudi et al. studied automation of interpretation of IHC tests,109 and Theodosiou et al. studied automation of interpretation of FISH tests.110

In addition to the above articles, the searches identified an article by Xiao et al. proposing the use of two well-characterized cell lines as “gold standard” reference materials for the validation and standardization of ERBB2 testing.111 None of the published articles can be characterized as studies of the analytic validity of ERBB2 testing.

Existing Gaps in Evidence

As discussed in this report and other studies, many preanalytic, analytic, and postanalytic factors may contribute to variability in genetic testing results.1,2,112 These factors include the collection, preservation, and storage of samples prior to analysis; the type of assay used and its reliability; the types of samples being tested; the type of analyte investigated (e.g., SNPs, alleles, genes, or biochemical analytes); genotyping methods; the timing of sample analysis; interpretation of the test result; variability among different laboratories or their staff members; and quality control processes. Currently, genetic tests are performed either as FDA-cleared tests or as LDTs. For many conditions (e.g., cystic fibrosis), testing can be performed with both FDA-cleared systems and various laboratory-developed methods, and these different testing options are potentially associated with differences in test performance. Validating genetic tests is often challenging due to the lack of appropriately validated samples for test validation, the lack of “gold-standard” reference methods, and the constant emergence of new genetic techniques.1,2,9 As a result, data on the analytic validity of genetic tests may be lacking or inconsistent.

In addition, systematic reviewers must overcome two other barriers to conduct effective systematic reviews of genetic tests: obtaining analytic validity data that already exist but are scattered across various sources, and analyzing those data once they are obtained. The first barrier, obtaining unbiased and detailed information, is illustrated by the case studies above: searches of the published medical literature using standard electronic database methods yielded, with few exceptions, little information.

We found that 510(k) summaries filed with the FDA were a fruitful source of information about the analytic validity of tests. However, LDTs are not required to file information with the FDA (although the agency is considering reviewing them for analytic validity data).113 We also attempted to obtain information from test manufacturers, but the information available from these sources was variable: some provided detailed information about the analytic validity of their tests and how it was established, while others provided limited or no information. Searches of the gray literature revealed that patents and theses and dissertations did not appear to be useful sources of information.

In our meetings with the Workgroup, we discussed what would need to change to make analytic validity information more accessible, particularly for LDTs. Most agreed that at present there is no incentive for laboratories to disclose the data, which are generally considered proprietary. Either a regulatory mandate for release of the data or another type of incentive would be necessary for the situation to change. NIH recently initiated the development of the Genetic Testing Registry (GTR), an online resource that will provide a centralized location for test developers and manufacturers to voluntarily submit test information, including validity data.114 It remains unclear how effective this voluntary submission mechanism will be and whether the GTR will be a valuable data source for evaluating analytic validity. Alternatively, professional societies such as AMP or CAP may be able to create databases of analytic validity information that are de-identified with respect to the submitting laboratory. These de-identified data could then, in theory, be analyzed to assess the range of variation for a particular test method.

Other members of the Workgroup argued that analytic validity data are only meaningful for a particular laboratory performing a particular test. The New York CLEP, for example, has observed variation in analytic validity data across laboratories performing the same test method and the same types of validation studies. Experts from that program argued that it could be difficult to generalize analytic performance from one laboratory to another. Our own opinion is that, although generalizability is an important consideration in test evaluation, it would be useful to have a sense of the overall experience with analytic validation of a given test method, and analysis of such data could help provide clues to the reasons for variation in performance. This would not obviate the need to know how well a test performs in a given laboratory.

If information about analytic validity can be obtained, the issue of what to do with the data remains. Some systematic reviewers have attempted to “pool” analytic validity data from different tests that purport to measure the same analyte and come to a global conclusion about the analytic validity of an entire class of similar tests. As mentioned above, Seidenfeld et al. systematically reviewed the literature on assays for ERBB2 and concluded that their results “challenged the validity of using systematic review methods to compare available assay technologies.”7 Indeed, establishing the “analytic validity” of an entire class of tests may be inappropriate. For example, one company's test for CA-125 may be highly reproducible from run to run, while another company's very similar test for CA-125 may exhibit substantial variability from run to run. Therefore, reviews of analytic validity may need to treat each specific test as a unique technology.
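A brief sketch illustrates why such pooling can be misleading. The measurements below are entirely hypothetical and are meant only to show how two products measuring the same analyte can differ markedly in run-to-run reproducibility; the between-run coefficient of variation is used here as one common summary statistic, not as a method prescribed by any source cited in this report.

    import statistics

    def coefficient_of_variation(measurements: list) -> float:
        """Between-run coefficient of variation (%) for repeated measurements of one control sample."""
        return 100 * statistics.stdev(measurements) / statistics.mean(measurements)

    # Hypothetical repeated measurements (U/mL) of the same CA-125 control material on two
    # different commercial kits, one run per day over ten days.
    kit_a_runs = [34.8, 35.1, 35.0, 34.6, 35.3, 34.9, 35.2, 34.7, 35.0, 35.1]
    kit_b_runs = [31.2, 37.9, 33.5, 39.1, 30.8, 36.4, 32.7, 38.2, 34.0, 35.6]

    print(f"Kit A between-run CV: {coefficient_of_variation(kit_a_runs):.1f}%")  # well under 1%
    print(f"Kit B between-run CV: {coefficient_of_variation(kit_b_runs):.1f}%")  # roughly 8-9%

Pooling these two hypothetical kits into a single class-level estimate would obscure the fact that one performs consistently while the other does not, which is the practical argument for treating each specific test as a unique technology.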

In summary, numerous gaps in evidence exist for measuring the analytic validity of genetic tests. These gaps arise from multiple factors, including the difficulty of generating data for test validation, barriers to accessing existing data that are not published in peer-reviewed sources, and the use of inappropriate methods for synthesizing the existing data. There is no single solution to fill these gaps. To facilitate the generation of scientifically sound data on analytic validity, a higher level of collaboration among the research community, professional societies, and test developers is needed in efforts such as increasing the availability of appropriately validated samples for test validation, developing effective reference methods, and building sample-splitting or sample-sharing programs. Meanwhile, as discussed previously, laboratories, research funders, test developers or manufacturers, regulatory agencies, and professional societies should play a more active role in developing infrastructures that make the data more accessible.
