The utility of toxicogenomic technologies ultimately depends on how reliable, reproducible, and generalizable the results of a particular study or individual method of analysis are. Moving beyond laboratory assays to more widespread use requires some level of validation. Validation can be defined as the process of ensuring that a test reliably measures and reports the determined end point(s); it encompasses technical and platform qualification as well as biologic qualification. Distinct issues arise from the use of any novel technology in a regulatory context. As discussed in this chapter, validation is an integral part of the more general process of developing and applying toxicogenomic methodology.


Validation must be carried out at various levels as described in Box 9-1. First, technology platforms must be shown to provide consistent, reliable results, which includes assessment of device stability and determination of analytical sensitivity and assay limits of detection, interference, and precision (reproducibility and repeatability). Second, the software used to collect and analyze data for an application must provide valid results. Third, the application, consisting of both hardware and software, must be tested and validated in the context of the biologic system to which it will be applied. Fourth, the application, or a related application based on the original, must be shown to be generalizable to a broader population or to be highly specific for a smaller, target population. Finally, one must consider how these technologies and applications based on them can be validated for regulatory use. These five levels of validation are discussed in this chapter.

BOX 9-1

Validation of Toxicogenomic Applications. Platform validation: Does the particular technology provide reproducible and reliable measurements? Software/data analysis validation: Is the software used for analysis (more...)

Platform Validation

Any toxicogenomic study is predicated on the assumption that the technologies provide accurate and relevant measures of the biologic processes underlying what is being assayed. For transcriptomic profiles with microarrays, for which we have the most data, there have been many successful applications, often with high rates of validation using an alternative technology such as Northern analysis or quantitative reverse transcriptase polymerase chain reaction (qRT-PCR); however, it should be noted that each of these techniques has experimental biases. The issue of concordance between different microarray platforms was discussed in Chapter 2. However, recent reports suggest that adherence to good, standard laboratory practices and careful analysis of data can lead to high-quality, reproducible results in which the biology of the system under study drives the gene expression profiles that are observed (Bammler et al. 2005; Dobbin et al. 2005; Irizarry et al. 2005; Larkin et al. 2005). Similar efforts must accompany the adoption of various genomic, proteomic, and metabolomic technology platforms for toxicogenomics.

This process of technology platform assessment is an essential step in the overall validation process and indicates whether a system provides a reliable and reproducible measure of the biology under study. Two often-confused measures of system performance are repeatability and reproducibility. Repeatability describes the agreement of successive measurements when controllable sources of variation are held constant. If one or more of these sources of variation is allowed to have its typical effect, the agreement is called reproducibility.

Repeatability can be assessed by consecutive measurements of a single sample at one time under identical conditions. Repeated measurements with the same method but conducted on different days, with different batches of reagents, or with different operators provide a measure of reproducibility. Because the latter scenario best describes the routine application of toxicogenomic technology platforms, reproducibility is the most relevant measure of system performance.
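
The distinction can be made concrete with a variance-component simulation. In this hypothetical sketch (all effect sizes are invented for illustration), each day of operation contributes its own systematic offset, standing in for reagent batch, operator, and calibration differences, so the spread of measurements across days (reproducibility) exceeds the spread within a single day (repeatability):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical variance components: each day adds a systematic offset
# (reagent batch, operator, calibration drift) on top of within-day noise.
n_days, n_per_day = 20, 5
day_effect = rng.normal(0.0, 2.0, n_days)                  # between-day variation
within_noise = rng.normal(0.0, 0.5, (n_days, n_per_day))   # within-day noise
measurements = 100.0 + day_effect[:, None] + within_noise

# Repeatability: average spread of repeated measurements within one day
repeatability_sd = measurements.std(axis=1, ddof=1).mean()
# Reproducibility: spread over all measurements on all days
reproducibility_sd = measurements.std(ddof=1)

print(f"repeatability SD:   {repeatability_sd:.2f}")
print(f"reproducibility SD: {reproducibility_sd:.2f}")
```

Because the between-day component dominates here, the reproducibility standard deviation is several times the repeatability standard deviation, which is why reproducibility is the more demanding (and more relevant) criterion.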

Assays of reproducibility involve analyzing the same biologic sample multiple times to determine whether the platform provides consistent results with a small coefficient of variation. Although this may seem straightforward, toxicogenomic technologies do not measure single quantities but represent hundreds or thousands of measurements—one each for many genes, proteins, or metabolites. Assays optimized for one range of expression level or type of analyte may not perform as well with other samples. For example, a technology that performs well for genes expressed at high levels may not be sensitive to low levels of expression. The reproducibility of measurements and the relative signal-to-noise ratio in the assay must therefore be carefully evaluated, with an emphasis on the levels of gene, protein, or metabolite expression relevant to a specific application. This is particularly true of proteomics and metabonomics, in which the range of analyte concentrations may vary by more than a millionfold (Figure 9-1).
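
A replicate analysis of this kind can be sketched in a few lines. The data below are simulated (not from any real platform): the same sample is "measured" four times, a per-gene coefficient of variation is computed, and the genes are stratified by abundance. An additive noise floor, a common feature of fluorescence-based assays, makes low-abundance genes proportionally noisier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated replicate assays: 4 measurements of the same sample, 1,000 genes.
# An additive noise floor makes low-abundance genes proportionally noisier.
n_genes, n_reps = 1000, 4
true_level = rng.lognormal(mean=7.0, sigma=1.5, size=n_genes)
noise_sd = 30.0 + 0.05 * true_level
replicates = (true_level[:, None]
              + rng.normal(0.0, 1.0, (n_genes, n_reps)) * noise_sd[:, None])

# Per-gene coefficient of variation across replicates
cv = replicates.std(axis=1, ddof=1) / replicates.mean(axis=1)

# Reproducibility is typically worse for low-abundance analytes
low = true_level < np.median(true_level)
print(f"median CV, low-abundance genes:  {np.median(cv[low]):.2f}")
print(f"median CV, high-abundance genes: {np.median(cv[~low]):.2f}")
```

Stratifying the CV by abundance in this way is one simple means of checking whether a platform's precision holds up at the expression levels that matter for the intended application.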

FIGURE 9-1. Human plasma proteome. The large range of protein concentrations in the human proteome represents a significant experimental challenge, as technologies must be sensitive across nearly 12 orders of magnitude (a 1 trillionfold range) for comprehensive analysis.

Types of Calibration Standards

Another approach that can provide some level of quality assessment and quality control is the use of calibration standards that consist of complex mixtures of analytes spanning the dynamic range normally surveyed in a particular application. In the context of microarray gene expression analysis, the development of “universal” RNA reference samples (Cronin et al. 2004) is under way and the External RNA Control Consortium (ERCC), led by the National Institute of Standards and Technology, is moving toward defining such a standard. The ERCC is composed of representatives from the public, private, and academic sectors working together in a consensus fashion to develop tools for experiment control and performance evaluation for gene expression analysis, including “spike-in” controls, protocols, and informatic tools—all intended to be useful for one- and two-color microarray platforms and qRT-PCR.

Ideally, such an RNA "standard" would consist of multiple samples. The first would consist of one or more RNA mixtures that could be used for regular quality control assessment. This approach can be used to monitor the performance of a particular laboratory or platform, documenting that results remain consistent both in the ability to detect expression measures for any one sample and in the ability to detect differential expression among samples. A second useful control would consist of exogenous spike-in RNAs (van de Peppel et al. 2003), which correspond to probes on the microarray surface that are not from the species being analyzed; such controls measure system performance independent of the quality of the RNA sample being analyzed. Objective measures of RNA quality may provide an additional means of assessing the performance and establishing the credibility of a particular microarray assay, as poor-quality RNA samples yield unreliable results. Finally, because the primary measurement in microarray assays is the fluorescence intensity of individual hybridized probes, work is under way to establish quantitative standards for assessing the performance of microarray scanning devices.

Efforts at Standards Development

Consortia approaches to standardization and validation have played an important role in working toward platform validation. As the field of toxicogenomics has matured, there has been a realization that groups working together can better understand and define the limitations of any technology and the potential solutions to any problems. Examples include the International Life Sciences Institute’s Health and Environmental Sciences Institute (ILSI-HESI 2006), a consortium of industry, government, and academic groups examining applications of toxicogenomics, and the Toxicogenomics Research Consortium sponsored by the National Institute of Environmental Health Sciences (NIEHS) (TRC 2005). The value of these consortium efforts is that they capture the state of the art across multiple groups simultaneously and therefore have the potential to advance an adoptable standard much more quickly than can individual research groups.

In response to the growing need for objective standards to assess the quality of microarray assays, the Microarray Gene Expression Data Society (MGED) hosted a workshop on microarray quality standards in 2005. This workshop and its findings are described in Box 9-2.

BOX 9-2

Microarray Gene Expression Data Society Workshop, September 2005, Bergen, Norway. The purpose of this workshop was to examine quality metrics that cut across technologies and laboratories: defined standards that can be used to evaluate various aspects (more...)

While early, informal efforts such as those offered by MGED are important in helping to define the scope of the problem and to identify potential approaches, systematic development of objective standards for quality assessment would greatly facilitate the advancement and establishment of toxicogenomics as a discipline. Ideally, further efforts to establish objective and quantitative quality measures for microarray data and other toxicogenomic data will help to advance the field in much the same way that “phred quality scores” characterizing the quality of DNA sequences accelerated genome sequencing.

Software/Data Analysis Validation

The software used in analyzing data from toxicogenomic studies can play as significant a role in determining the final outcome of an experiment as the technology platform. Consequently, considerable attention must be paid to validating the computational approaches. For each study, the data collection and processing algorithms must be appropriately selected, and validated in combination with the technology platform, for the biologic system to which they are applied.

Data Collection and Normalization

Most genomic technology platforms do not perform absolute quantitative measurements. For microarray data collected on an Affymetrix GeneChip, the data from the multiple probe pairs for each gene are combined in various ways to assess an expression level for each gene. However, these analyses do not measure quantities of particular molecules directly; instead, they measure surrogates such as fluorescence, which is subject to unknown sources of variation. These fluorescent signals are used to estimate expression levels. Similar processing of the raw data is an element of all genomic technologies. For gene expression-based assays, these initial measurements are often followed by a “normalization” process that adjusts the individual measurements for each gene in each sample to facilitate intersample comparison. Normalization attempts to remove systematic experimental variation in each measurement and to adjust the data to allow direct comparison of the levels of a single gene, protein, or metabolite across samples. Despite widespread use of image processing and data normalization in microarray analyses, the sources of background signals and how to best estimate their levels are not fully understood; thus, there is no universally accepted standard for this process. Any image processing and normalization approach changes the data and affects the results of further analysis.
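
As a concrete illustration of one common choice, the sketch below implements quantile normalization on simulated intensities. This is only one of several defensible schemes (median centering and loess adjustment are others), and, as noted above, whichever scheme is chosen changes the data that reach every downstream analysis:

```python
import numpy as np

def quantile_normalize(x):
    """Force every sample (column) to share the same empirical
    intensity distribution (ties handled naively)."""
    ranks = x.argsort(axis=0).argsort(axis=0)   # rank of each value within its sample
    target = np.sort(x, axis=0).mean(axis=1)    # mean distribution across samples
    return target[ranks]

rng = np.random.default_rng(1)
# Simulated raw intensities: 2,000 genes x 6 arrays, with a different
# multiplicative scale bias on each array (e.g., labeling efficiency).
raw = rng.lognormal(5.0, 1.0, (2000, 6)) * rng.uniform(0.5, 2.0, 6)
normalized = quantile_normalize(raw)

# After normalization, every array has an identical set of intensity values,
# so single-gene comparisons across arrays are no longer dominated by
# array-wide scale differences.
print(np.allclose(np.sort(normalized, axis=0),
                  np.sort(normalized, axis=0)[:, :1]))  # True
```

The point of the sketch is not that quantile normalization is correct in any absolute sense, but that it (like any normalization) rewrites every measurement before analysis begins.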

Appropriate and consistently applied processing methods deserve particular emphasis. If the results of multiple studies are to be combined, every effort should be made to apply consistent methods to all data. As the number of microarray experiments grows, there is increasing interest in meta-analyses, which may provide more broadly based information than can be obtained from a single experiment. It is therefore important that repositories for toxicogenomic experiments make all “raw” data available, so the data can be analyzed with consistent methodologies to extract maximum information.

Data Analysis

Once the data from a particular experiment have been collected and normalized, they are often further analyzed by methods described in Chapter 3. Class discovery experiments typically use approaches such as hierarchical clustering to determine whether relevant subgroups exist in the data.
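
A minimal class-discovery sketch of this kind, using simulated profiles and SciPy's hierarchical clustering, is shown below; the group structure and effect sizes are invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)

# Simulated expression profiles: 12 samples x 200 genes. Each group of
# 6 samples shares an induced signature on a different set of 100 genes.
profiles = rng.normal(0.0, 1.0, (12, 200))
profiles[:6, :100] += 3.0    # group 1 signature genes
profiles[6:, 100:] += 3.0    # group 2 signature genes

# Average-linkage clustering on correlation distance, a common choice
# for expression data; cut the tree into two clusters.
tree = linkage(profiles, method="average", metric="correlation")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # the two simulated groups fall into separate clusters
```

In a real class-discovery analysis, of course, the number of clusters is not known in advance, which is precisely why the subgroups found must subsequently be validated biologically.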

Class prediction and classification studies, which link toxicogenomic profiles to specific phenotypic outcomes, represent a somewhat different validation challenge. Ideally, to validate a classification method, it is most useful to have an initial collection of samples (the training set) that can be analyzed to arrive at a profile and an appropriate classification algorithm, as well as an independent group of samples (the test set) that can be used to verify the approach. In practice, most toxicogenomic studies have a limited number of samples, and all of them are generally needed to identify an appropriate classification algorithm (or classifier). An alternative to an independent test set, albeit less powerful and less reliable, is leave-k-out cross-validation (LKOCV) (Simon et al. 2003). This approach leaves out a subset k of the initial collection of N samples, develops a classifier using the (N − k) samples that remain, and then applies the classifier to the k samples that were left out. The process is then repeated with a new subset of k samples, and so on. The simplest and most widely used variant is leave-one-out cross-validation (LOOCV).

This cross-validation can be extremely useful when an independent test set is not available, but it is often applied inappropriately as a partial rather than a full cross-validation, the distinction being the stage in the process at which one leaves k out. Many published microarray studies have used the entire dataset to select a set of classification genes and only then divided the samples into k test and (N − k) training samples; the (N − k) training samples are used to train the algorithm, which is tested on the k held-out samples. The problem is that using all the samples to select the classification genes can bias any classifier, because the test and training sets are then not independent. Such partial cross-validation should never be performed. The proper approach is full LKOCV, in which the sample data are divided into training and test sets before each round of gene selection, algorithm training, and testing. When iterated over multiple rounds, LKOCV can estimate the accuracy of the classification system by averaging performance over the complete set of held-out samples. Even so, optimal validation of any classifier requires a truly independent test set.
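
The selection-bias pitfall can be demonstrated on pure noise. The sketch below is a simplified, hypothetical analysis (a crude difference-of-means gene filter and a nearest-centroid classifier, not any published method): partial cross-validation, which selects genes once using all samples, reports deceptively high accuracy on data that contain no signal at all, whereas full cross-validation, which repeats gene selection inside every fold, stays near chance:

```python
import numpy as np

rng = np.random.default_rng(3)

# Pure-noise data: 20 samples x 1,000 genes, with arbitrary class labels.
# No honest classifier should beat ~50% accuracy here.
X = rng.normal(size=(20, 1000))
y = np.array([0] * 10 + [1] * 10)

def select_genes(X, y, k=10):
    # Crude filter: top-k genes by absolute difference in class means
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(diff)[-k:]

def nearest_centroid(train_X, train_y, test_x):
    c0 = train_X[train_y == 0].mean(axis=0)
    c1 = train_X[train_y == 1].mean(axis=0)
    return int(np.linalg.norm(test_x - c1) < np.linalg.norm(test_x - c0))

def loocv_accuracy(X, y, select_inside_fold):
    biased_genes = select_genes(X, y)  # selection uses ALL samples (partial CV)
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        genes = select_genes(X[mask], y[mask]) if select_inside_fold else biased_genes
        pred = nearest_centroid(X[mask][:, genes], y[mask], X[i, genes])
        correct += int(pred == y[i])
    return correct / len(y)

print(f"partial LOOCV (genes chosen once, biased): {loocv_accuracy(X, y, False):.2f}")
print(f"full LOOCV (genes chosen per fold):        {loocv_accuracy(X, y, True):.2f}")
```

The only difference between the two estimates is whether gene selection sees the held-out sample; that single leak is enough to manufacture an apparently accurate classifier from random numbers.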

The choice of samples for training is an important, but often neglected, element in developing a classifier. It is important to balance the representation of sample classes and to ensure that other factors do not confound the analysis. With unbalanced classes, raw accuracy is a deceptive measure of performance. For example, if the data represent two classes, A and B, with eight samples in class A and two in class B, the trivial classifier that assigns everything to class A achieves 80% accuracy while learning nothing, a result that clearly is not acceptable. Samples should also be selected so that there are no confounding factors. For example, an experiment may be conducted to develop a classifier for hepatotoxic compounds. If all toxicant-treated animals received one vehicle whereas the control animals received another, differences between the treated and control groups may be confounded by differences in vehicle response. The solution to this problem is a more careful experimental design that limits confounding factors.

Selecting a sample of sufficient size to resolve classes is also an important consideration (Churchill 2002; Simon et al. 2002; Mukherjee et al. 2003). Radich and colleagues recently illustrated one reason why this is so important: analyzing gene expression levels in peripheral blood, they demonstrated significant, but reproducible, interindividual variation in expression for a relatively large number of genes (Radich et al. 2004). Their study suggests that a small sample size may bias the gene, protein, or metabolite selection set because of random effects in assigning samples to classes.

Biologic Validation and Generalizability

Regardless of the goal of a particular experiment, its utility depends on whether its results can be biologically validated. Biologic validation is the process of confirming that a biologic change underlies whatever is detected with the technology platform and of assigning a biologic context or explanation to an observed characteristic of a system. The results from any analysis of toxicogenomic data are generally considered a hypothesis that must be validated by better-established, lower-throughput “standard” laboratory methods. A first step is often to verify the expression of a select set of genes in the original samples by an independent technique. If the results are consistent across techniques, further detailed study is often warranted. For example, upregulation of a gene transcript detected with a microarray may suggest activation of a specific signaling pathway; that activation, in addition to being confirmed by a change in the level of a corresponding protein, can be confirmed by measuring a change in another output regulated by the pathway. In a toxicogenomic experiment, in which thousands of genes, proteins, or metabolites are examined in a single assay, biologic validation is important because there is a significant likelihood that some changes in genes, proteins, or metabolites are associated with a particular outcome by chance.

For mechanistic studies, biologic validation also requires clearly demonstrating a causative role for any proposed mechanism. For class discovery studies in which new subgroups of compounds are identified, biologic validation also typically requires demonstrating some tangible difference, however subtle, among the newly discovered subgroups of compounds. For example, new paths to neurotoxicity may be inferred through transcriptome profiling when neurotoxic compounds separate into groups based on the severity of the phenotype they cause, the time of onset of the phenotype, or the mechanism that produces the phenotype. Finding such differences is important to establish the existence of any new classes.

Generalizability addresses whether an observation from a single study can be extended to a broader population or, in the case of studies with animal models, whether the results are similar across species. For mechanistic studies, generalization requires verifying that the mechanism exists in a broader population. In class discovery, if results are generalizable, newly discovered classes should also be found when the studies are extended to include a broader, more heterogeneous population of humans or other species than those used in the initial study. For classification studies, generalization requires a demonstration that, at the least, the set of classification genes and associated algorithms retain their predictive power in a larger, independent population and that, within certain specific populations, the classification approach retains its accuracy (for example, see Box 9-3).

BOX 9-3

Clinical Validation of Transcriptome Profiling in Breast Cancer. Although toxicogenomic technology applications are still in their infancy, they are being explored in clinical medicine. A notable example that illustrates the path from genomic discovery (more...)

Validation in a Regulatory Setting

Toxicogenomic technologies will likely play key roles in safety evaluation and risk assessment of new compounds. For these applications, the technologies must be validated by regulatory agencies, such as the Environmental Protection Agency (EPA), the Food and Drug Administration (FDA), and the Occupational Safety and Health Administration (OSHA), before they can be used in the decision-making process. The procedures by which regulatory validation occurs have traditionally been informal and ad hoc, varying by agency, program, and purpose. This flexible validation process has been guided by general scientific principles and, when applied to a specific test, usually involves a review of the available experience with the test and an official or unofficial interlaboratory collaboration or “round robin” to evaluate the performance and reproducibility of the test (Zeiger 2003).

Deciding whether to accept a particular type of data for regulatory purposes depends on more than such technical validation, however, and is also affected by a regulatory agency’s statutory mandate, regulatory precedents and procedures, and the agency’s priorities and resources. Because these factors are agency specific, validation is necessarily agency specific as well. The scientific development and regulatory use of toxicogenomic data will be facilitated by harmonization, to the extent possible, of data and method validation both among U.S. regulatory agencies and at the international level. However, harmonization should be a long-term goal and should not prevent individual agencies from exploring their own validation procedures and criteria in the shorter term.

Toxicogenomic data present unique regulatory validation challenges both because such data have not previously been used in a regulatory setting and because of the rapid pace at which toxicogenomic technologies and data are developing. Therefore, regulatory agencies must balance the need to provide criteria and standardization for the submission of toxicogenomic data with the need to avoid prematurely “locking-in” transitory technologies that may soon be replaced with the next generation of products or methods. Regulatory agencies have been criticized for being too conservative in adopting new toxicologic methods and data (NRC 1994). Consistent with this pattern, such regulatory agencies as EPA and FDA to date have been relatively conservative in using toxicogenomic data, partly due to the lack of validation and standardization (Schechtman 2005).

Although some caution against premature reliance on unvalidated methods and data is appropriate, agencies can play a critical role and must actively encourage the deployment of toxicogenomic data and methods if toxicogenomic approaches are to be used to their fullest advantage. For example, FDA has issued a white paper describing a “critical path to new medical products” (FDA 2005b) that acknowledges the role of toxicogenomic technologies in providing a more sensitive assessment of new compounds and suggests that new methods be developed to improve the process of evaluation and approval.

EPA and FDA have adopted initial regulatory guidances that seek to encourage toxicogenomic data submissions (see Table 9-1 and Chapter 11). In March 2005, FDA issued a guidance for industry on submission of pharmacogenomic data (FDA 2005a). In that guidance, FDA states that “[b]ecause the field of pharmacogenomics is rapidly evolving, in many circumstances, the experimental results may not be well enough established scientifically to be suitable for regulatory decision making. For example: Laboratory techniques and test procedures may not be well validated” (FDA 2005a, p. 2).

TABLE 9-1. Worldwide Regulatory Policies and Guidelines Related to Toxicogenomics and Pharmacogenomics.



The FDA Guidance describes reporting requirements for “known valid” and “probable valid” biomarkers. A known valid biomarker is defined as “[a] biomarker that is measured in an analytical test system with well-established performance characteristics and for which there is widespread agreement in the medical or scientific community about the physiologic, toxicologic, pharmacologic, or clinical significance of the results.” A probable valid biomarker is defined as “[a] biomarker that is measured in an analytical test system with well-established performance characteristics and for which there is a scientific framework or body of evidence that appears to elucidate the physiologic, toxicologic, pharmacologic, or clinical significance of the test results” (FDA 2005a, p. 17). The Guidance provides that “validation of a biomarker is context-specific and the criteria for validation will vary with the intended use of the biomarker. The clinical utility (for example, ability to predict toxicity, effectiveness or dosing) and use of epidemiology/population data (for example, strength of genotype-phenotype associations) are examples of approaches that can be used to determine the specific context and the necessary criteria for validation” (FDA 2005a, p. 17).

FDA also lists possible reasons why a probable valid biomarker may not have reached the status of a known valid marker, including the following: “(i) the data elucidating its significance may have been generated within a single company and may not be available for public scientific scrutiny; (ii) the data elucidating its significance, although highly suggestive, may not be conclusive; and (iii) independent verification of the results may not have occurred” (FDA 2005a, pp. 17-18). Although FDA outlines clear steps for sponsors to follow with regard to regulatory expectations for each type of biomarker, these classifications are not officially recognized outside FDA.

In addition to this FDA guidance, a number of other regulatory policies and guidelines have been issued worldwide that cover topics related to pharmacogenomics and toxicogenomics (see Table 9-1). Future efforts to harmonize the use and expectations for genomic data will provide value in reducing the current challenge pharmaceutical companies face in addressing guidances for different countries.

EPA issued an Interim Policy on Genomics in 2002 to allow consideration of genomic data in regulatory decision making but stated that these data alone would be “insufficient as a basis for decisions” (EPA 2002, p. 2). The Interim Policy states that EPA “will consider genomics information on a case-by-case basis” and that “[b]efore such information can be accepted and used, agency review will be needed to determine adequacy regarding the quality, representativeness, and reproducibility of the data” (EPA 2002, pp. 2-3). The EPA is also in the process of standardizing data-reporting elements for new in vitro and in silico test methods, including microarrays, using the Minimum Information About a Microarray Experiment (MIAME) criteria as a starting point.

At the interagency level, Congress established a permanent Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) in 2000 to require that new and revised test methods be validated to meet the needs of federal agencies. The NIEHS, EPA, FDA, and OSHA are 4 of the 15 federal regulatory and research agencies participating in ICCVAM. Europe has created a similar validation organization, the European Centre for the Validation of Alternative Methods (ECVAM). At the international level, the Organization for Economic Co-operation and Development (OECD) has also adopted formal guidelines for validating test methods used in regulatory decision making (OECD 2001).

The ICCVAM criteria are useful guides for regulatory validation of toxicogenomic data and methods, but toxicogenomic technologies will require unique and more flexible approaches to validation given their rapid pace of change and other distinctive characteristics (Corvi et al. 2006). To that end, ICCVAM and ECVAM are developing such an approach for validating toxicogenomic test methods for regulatory use and have convened a series of workshops on the topic (Corvi et al. 2006). One approach being investigated is a “modular” model, in which different steps in the validation process are undertaken independently, in contrast to the traditional stepwise “linear” model, which may unduly delay the validation of rapidly evolving toxicogenomic technologies (Corvi et al. 2006). Agencies such as the EPA are carefully tracking and participating in this initiative and anticipate applying the output of the ICCVAM process in their own regulatory programs (EPA 2004).


Conclusions

Toxicogenomics has reached the stage where many of the initial technical questions have been resolved, at least for the more mature approaches such as gene expression analysis with microarrays. The community has learned that careful experiments using genomic approaches can provide results that are comparable among laboratories and that reveal insight into the biology of the system under study (Bammler et al. 2005; Irizarry et al. 2005; Larkin et al. 2005). However, the need for standards for assessing the quality of particular experiments remains, and it will affect the utility of datasets that are and will be generated. The work of the ERCC (and other groups) to develop RNA standards (Cronin et al. 2004) is a potentially important component of this effort and should be encouraged and continued, but additional work and development are necessary if truly useful quality assessment standards are to be created. Standard development efforts should not be limited to gene expression microarray analysis, as similar standards will be necessary if other toxicogenomic technologies are to be widely used and trusted to give reliable results.

Beyond quality control of individual experiments, more extensive validation is needed to use toxicogenomic data for the applications discussed in this report. Most toxicogenomic projects have focused on limited numbers of samples with technologies, such as DNA microarrays, that may not be practical for large-scale applications that go beyond the laboratory. Consequently, validation of toxicogenomic signatures should focus not only on the primary toxicogenomic technology (such as DNA microarrays) but also on assays that can be widely deployed (such as qRT-PCR) at relatively low cost. Many issues associated with validation of toxicogenomic signatures will rely on the availability of large, accessible, high-quality datasets to evaluate the specificity and sensitivity of the assays. Those datasets must include not only the primary data from the assays but also the ancillary data about treatments and other factors necessary for analysis. This argues for the creation and population of a public data repository for toxicogenomic data.

A means of regulatory validation of toxicogenomic applications is needed—for example, for toxicogenomic data accompanying new submissions of drug candidates for approval. Specifically, the development of new standards and guidelines that will provide clear, dynamic, and flexible criteria for the approval and use of toxicogenomic technologies is needed at this point. Development of these standards and guidelines requires suitable datasets. For example, the use of toxicogenomics to classify new compounds for their potential to produce a specific deleterious phenotype requires a useful body of high-quality, well annotated data. The existing ICCVAM approaches do not provide guidance for the large-scale toxicogenomic approaches being developed, as evidenced by the different guidelines the FDA and the EPA are developing, and toxicogenomic tools need not be subject to ICCVAM protocols before they are considered replacement technologies. Although multiagency initiatives, such as the one ICCVAM is spearheading, may serve as a basis for establishing the needed standards and criteria in the longer term, the overall ICCVAM approach does not seem well suited for validating new technologies and as such will need to be significantly revised to accommodate the tools of toxicogenomics. Consequently, regulatory agencies such as the EPA and the FDA should move forward expeditiously in continuing to develop and expand their validation criteria to encourage submission and use of toxicogenomic data in regulatory contexts.

In summary, the following are needed to move forward in validation:

  • Objective standards for assessing quality and implementing quality control measures for the various toxicogenomic technologies;
  • Guidelines for extending technologies from the laboratory to broader applications, including guidance for implementing related but more easily deployable technologies such as qRT-PCR and ELISAs;
  • A clear and unified approach to regulatory validation of “-omic” technologies that aligns the potentially diverse standards being developed by various federal agencies, including the EPA, FDA, NIEHS, and OSHA, and that attempts to coordinate standards with the relevant European and Asian regulatory agencies;
  • A well-annotated, freely accessible database providing access to high-quality “-omic” data.


Recommendations

The following specific actions are recommended to facilitate technical validation of toxicogenomic technologies:

  1. Develop objective standards for assessing sample and data quality from different technology platforms, including standardized materials such as those developed by the ERCC.
  2. Develop appropriate criteria for using toxicogenomic technologies for different applications, such as hazard screening and exposure assessment.
  3. Regulatory agencies should establish clear, transparent, and flexible criteria for the regulatory validation of toxicogenomic technologies. Whereas the use of toxicogenomic data will be facilitated by harmonization of data and method validation criteria among U.S. regulatory agencies and at the international level, harmonization should be a long-term goal and should not prevent individual agencies from developing their own validation procedures and criteria in the shorter term.



“Pharmacogenomic” data and guidances are included in this table and discussion because the term pharmacogenomic is often used to include data about the toxicity and safety of pharmaceutical compounds (referred to in this report as toxicogenomics) and because regulatory use and validation of other types of genomic data are relevant to toxicogenomics.