The hen’s egg test for micronucleus induction (HET-MN): validation data set

Abstract The classical in vitro genotoxicity test battery is known to be sensitive for indicating genotoxicity. However, a high rate of ‘misleading positives’ was reported when three assays were combined as required by several legislations. Despite the recent optimisations of the standard in vitro tests, two gaps could hardly be addressed with assays based on 2D monolayer cell cultures: the route of exposure and a relevant intrinsic metabolic capacity to transform pro-mutagens into reactive metabolites. Following these considerations, fertilised chicken eggs have been introduced into genotoxicity testing and were combined with a classical read-out parameter, the micronucleus frequency in circulating erythrocytes, to develop the hen’s egg test for micronucleus induction (HET-MN). As a major advantage, the test mirrors the systemic availability of compounds after oral exposure by reflecting certain steps of Absorption, Distribution, Metabolism, Excretion (ADME) without being considered as an animal experiment. The assay is supposed to add to a toolbox of assays to follow up on positive findings from initial testing with classical in vitro assays. We here report on a validation exercise, in which >30 chemicals were tested double-blinded in three laboratories. The specificity and sensitivity of the HET-MN were calculated to be 98 and 84%, respectively, corresponding to an overall accuracy of 91%. A detailed protocol, which includes a picture atlas detailing the cell and micronuclei analysis, is published in parallel (Maul et al. Validation of the hen’s egg test for micronucleus induction (HET-MN): detailed protocol including scoring atlas, historical control data and statistical analysis).


Introduction
The in vitro micronucleus assay (MNvit) (2) is an essential part of genotoxicity test batteries recommended by regulatory agencies in the field of, e.g., cosmetics (3), industrial chemicals (4) or plant protection products (5). It allows the detection of both chromosomal breakage and interference with chromosomal segregation during interphase by easily scoring micronuclei (MNs) in different cell types, making it a scientifically valid alternative to the in vitro chromosomal aberration test (6). A retrospective validation confirmed its good sensitivity (6). However, when the assay is combined with other genotoxicity assays in an in vitro test battery as requested by different legislations e.g. (4), the overall outcome had a rather low specificity (7). Subsequent improvements in the experimental protocol were recently implemented in a revised OECD Testing Guideline (2). However, two aspects can hardly be addressed with assays that are based on two-dimensional cell cultures: an efficient metabolic capacity to identify pro-mutagens, as acknowledged in the OECD Test Guideline (TG) for the MNvit (OECD TG 487) (2), and the route of exposure, an aspect referred to in current OECD TGs on in vivo genotoxicity testing (8,9).
To overcome these limitations, complex three-dimensional test systems were introduced into genotoxicity testing and combined with established read-out parameters (10). These assays are intended to complement the existing in vitro genotoxicity toolbox by broadening the spectrum of assays for following up on positive results from initial testing with classical methods.
The Hen’s Egg Test for Micronucleus Induction (HET-MN) is one such example, as it combines the analysis of MN frequencies in circulating erythrocytes with standardized, fertilized chicken eggs, which are routinely used for vaccine production (11,12). As a unique characteristic, the HET-MN is able to mirror certain steps of ADME: on day 8 of egg development, the test compound is applied through a small hole in the eggshell at the blunt end (where the air cell is located) onto the inner shell membrane. During the following three days, the compound passes this membrane and is taken up by the highly vascularized chorioallantoic membrane (CAM) before being distributed via the blood vessel system. The compound is metabolized by the respective enzymes in the yolk sac membrane and the developing liver. Finally, the compound and/or its metabolites are actively excreted into the allantois, a bladder equivalent accessible to sampling. In summary, the HET-MN allows for toxicokinetic and toxicodynamic investigations, thereby closing a major gap in in vitro genotoxicity testing.
There is ample evidence that xenobiotic metabolism is well established in the developing chicken egg, see e.g. (11). In consequence, liver S9 mix, which needs to be added as an external source of metabolizing enzymes to two-dimensional cell cultures, is not required to correctly identify promutagens with the HET-MN (11)(12)(13)(14)(15). Recent studies (K. Reisinger, in preparation) provided evidence that the intrinsic metabolic capacity between days 8 and 11 of egg development is located in the developing liver and the yolk sac membrane. During this time, the yolk sac membrane also serves as a focal point for erythropoiesis. Thus, test compounds are metabolized in close vicinity to the repository of cells that are used to analyze the chemical’s genotoxic potential. Therefore, a pre-systemic metabolic elimination of a test compound, as described for some orally administered drugs in the form of the intestinal and hepatic first-pass effect, is not expected.
The period between days 8 and 11 of egg development, when the HET-MN is performed, is a highly proliferative state, during which both the blood volume and the number of erythrocytes per blood volume increase exponentially (16,17). Erythrocytes bearing MNs accumulate in the blood because the spleen is not yet functional and cannot eliminate damaged cells (18), while the background MN frequency is low in the standardized chicken eggs used (1), which are genetically defined by their local suppliers.
As mentioned above, eggs are only used in an early developmental stage, in which no brain activity could be detected (19)(20)(21). This premature state is reflected by legislations around the globe which do not consider the assay an animal experiment, e.g. (22)(23)(24)(25)(26). Thus, the assay can be used to meet legislations which demand or support in vitro methods for regulatory decision making.
Taken together, the HET-MN provides a complex study type exhibiting a liver-like xenobiotic metabolism. Together with the intrinsic characteristics of chicken eggs as summarized above, the HET-MN combines the advantages of an in vitro approach with the ability to mirror the systemic availability of chemicals, which is otherwise associated with in vivo experiments, while the assay is in line with animal protection regulations and ethical aspects.
The HET-MN protocol as used in the present study is the result of a thorough method development (12)(13)(14)(15) after which the assay was transferred to and further optimized together with a second laboratory (11). Until 2012, up to 21 compounds were tested in two laboratories and were all predicted correctly (see discussion for details, Table 4). Subsequently, three laboratories entered into a cooperation to further investigate the performance of the HET-MN in a validation exercise (after a transfer phase), the results of which are reported here. The validation study included the investigation of more than 30 chemicals being tested double-blinded as well as the evaluation of two prediction models. Finally, the validation data were used to calculate the predictivity of the assay.

Selection and allocation of coded chemicals
The test chemicals were selected (independently of the study authors) by experts of the genotoxicity group of Cosmetics Europe. The substances were grouped into three categories based on literature data: true negative (TN) and true positive (TP) chemicals, with concordant in vitro and in vivo genotoxicity and/or carcinogenicity data (Table S1) information. In addition, sealed envelopes with the codes and the entire hazard profile were available to safety officers of the three facilities for emergency cases. The envelopes remained sealed and were sent back to the BfR after the experimental phase to prove that the substance identities were not disclosed before unblinding.

Chemicals
In order to keep a high level of standardization, the same batches of each of the following chemicals

Chicken eggs
White Leghorn chicken eggs (Gallus gallus domesticus) of a defined health status, i.e., specific-pathogen-free (SPF) eggs, were obtained from Valo Biomedia GmbH (www.valobiomedia.com) within one day after egg deposition. Care was taken during transport to avoid major temperature variations. After storage at 4-8 °C for a maximum of four days, eggs were cultivated in the incubator at 37.5 ± 0.5 °C and a humidity of approximately 70% (40-80%) in horizontal position and automatically rotated to simulate natural incubation conditions.

HET-MN protocol
The validation followed the HET-MN protocol that has recently been published (27) as well as submitted for publication (1), including study design and criteria used for the evaluation of results.
The protocol is therefore only briefly summarized here whereas the study design and the evaluation criteria are described in more detail to support the understanding of the validation results.
After checking for viability and egg weight, intact and appropriately developed eggs were exposed to the test chemicals on day 8 of egg development. In rare cases, chemicals were applied on day 9 of egg development (see section 3.4.1). In general, more than six eggs were allocated to each dose or control group at the beginning of experiments to ensure that a sufficiently high number of viable eggs was available for micronucleus analysis at the end of experiments. In case of unknown or high toxicity, up to 18 eggs were allocated to the respective dose groups; in case of known and low toxicity, 8-10 eggs were used per control or dose group (for details please refer to (1)). Chemicals were freshly prepared and applied via a small hole in the eggshell at the blunt end (where the air cell is located) onto the inner shell membrane. Blood samples were always taken on day 11 of egg development.
Immediately prior to blood sampling, the viability of eggs was checked by candling them under a cold light lamp, and only viable eggs were subjected to sampling. Further, the viability within treatment and control groups was determined, i.e., the number of viable eggs of a treatment/control group at the end of an experiment was compared to the number of viable eggs at the beginning of the experiment and given as a percentage. For sampling, eggs were opened widely around the small hole used for application. Subsequently, the single large blood vessel that became visible was identified, and a loop was pulled out and positioned across a plastic strip, which lay on the rim of the opened eggshell. A sample of 3-5 µL blood was taken and spread onto a glass slide. Three slides were prepared per egg (one for analysis, two as back-up) and air-dried. Afterwards, slides were stained with a modified Pappenheim staining. Before analysis under a bright field microscope at 100× magnification, slides were randomized and coded to prevent operator bias during evaluation.
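The viability read-out described above is a simple ratio. The following minimal Python sketch illustrates it; the function names are hypothetical, and the 40% cut-off is taken from the validity criterion quoted later in this paper (viability of ≥ 40%), not from an implementation used by the laboratories.

```python
def viability_percent(viable_at_end: int, viable_at_start: int) -> float:
    """Viable eggs at the end of an experiment as a percentage of those at the start."""
    if viable_at_start <= 0:
        raise ValueError("a group must start with at least one viable egg")
    return 100.0 * viable_at_end / viable_at_start


def meets_viability_criterion(viability: float, minimum: float = 40.0) -> bool:
    """Assumed check: a group is evaluable if its viability is at least `minimum` percent."""
    return viability >= minimum


# Hypothetical group: 10 viable eggs at application, 6 viable at sampling.
group_viability = viability_percent(6, 10)
group_ok = meets_viability_criterion(group_viability)
```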
For analysis, 1000 polychromatic erythrocytes (PCE) and normochromatic erythrocytes (NCE) per egg in total were investigated for the presence of MNs. Other cellular effects such as binucleated cells were only recorded.

Study design
The HET-MN followed the standard design of in vitro genotoxicity studies comprising a solubility study, a recommended pre-test, a dose range-finding experiment, and for validation purposes at least two valid main experiments, while for regulatory testing laboratories may finalize testing after one valid and positive experiment (2).

Solubility study
Based on the results of the development and optimisation phases of the HET-MN protocol, four solvents have been recommended. Deionised water (aqua DI, 300 µL standard volume to be applied on egg membranes, maximum 1500 µL) and isopropyl myristate (IPM, 50 µL) were used with first priority. In case of low solubility, ethanol (10%, 100 µL) as well as 1% and 10% DMSO (300 µL and 100 µL, respectively) were used to identify the solvent in which the maximum concentration of the test chemical could be applied. The maximum dose was limited to 100 mg per egg (acceptable weight range: 65 ± 4 g), which corresponds to the top dose in the mammalian in vivo MN test, i.e., 2000 mg/kg body weight/day (8).

Pre-test
This short-time test was used to narrow down the dose range for the subsequent dose range-finding experiment, especially for well soluble compounds. For this purpose, a limited number of eggs, e.g. two per dose group, was exposed to a limited number of concentrations, e.g. the highest soluble dose and several dilutions, for 0.5 h up to 48 h. The viability of dose groups was recorded and used to design the subsequent experiment.

Dose range-finding experiment
The dose range-finding experiment was designed to define the maximum dose for main experiments, which could be limited by the solubility, if it is less than 100 mg/egg, or by the chemical’s general toxicity (for details on toxicity please refer to section 2.6). If the dose range-finding experiment met all validity criteria (see section 2.6), it was accepted as a main experiment. Eggs were exposed in line with the schedule of main experiments. Egg viability was the read-out of first priority; most of the laboratories also prepared slides to investigate the MN frequency.

Main experiment
Main experiments comprised a solvent control (SC), a positive control (PC), and at least three doses of the test chemical. As the SC groups showed the same low background in DNA damage compared to untreated eggs, a negative control group was omitted. Cyclophosphamide (CP; 0.05 mg CP/egg in aqua DI) was used as PC, in a concentration to induce a moderate increase in MN rate without causing remarkable general toxicity. In phase I, 7,12-dimethyl-benz[a]anthracene was used instead of CP as PC in few experiments, which all fulfilled the respective validity criteria. Each control or dose group comprised six viable eggs at the end of experiments to be subjected to the analysis of MN frequency.
For validation purposes, at least two main experiments were performed to obtain information on the intra-laboratory reproducibility. For routine testing, a study can already be terminated after the first experiment if a clear positive call is obtained, i.e., if all criteria for a positive call are fulfilled as delineated in section 2.6. Generally, when a second main experiment is performed, the dose spacing is modified, usually by using a tighter spacing, depending on the outcome of the first main experiment.
frequency. The appearance of alert parameters (e.g. binucleated cells) could also serve as an indication of chemical exposure but was not sufficient to fulfil the validity criteria. If none of these parameters proves the bioavailability of a test compound, its distribution within the egg has to be shown with analytical measurements of samples taken from blood, allantois or other compartments of the egg (proof of exposure). During the validation with more than 30 coded test compounds, these additional analyses were outside the scope of the exercise.

Statistical evaluation
Data of valid experiments were analysed by two prediction models (PM). The first one (PM1) checked for the exceedance of a pre-defined threshold, i.e., the mean of the historical SC (m_hSC) plus four times the standard deviation (sd_hSC), that is, m_hSC + 4 × sd_hSC. In addition, the Jonckheere-Terpstra (JT) test was used to check for a dose-dependent monotonic increase below this strict threshold, using a significance level (p) of 0.025. The outcome of PM1 was positive if the threshold was exceeded and/or if the JT test indicated a statistically significant increase. PM2 used the one-sided Umbrella-Williams (UW) test (29), which detects additional shapes of dose-response curves as it compares single as well as pooled dose groups against the SC (p < 0.05) (1).
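To make the logic of PM1 concrete, the following Python sketch combines the threshold check with a Jonckheere-Terpstra trend test. It is an illustration under stated assumptions, not the evaluation code used in the validation: the function names are hypothetical, the sample standard deviation is assumed for sd_hSC, the group mean is assumed to be the value compared against the threshold, the SC group is assumed to enter the trend test as the lowest dose, and the JT p-value uses the large-sample normal approximation without tie correction, which is only a rough guide for groups of six eggs.

```python
import math
from statistics import mean, stdev


def pm1_threshold(historical_sc):
    """Mean of the historical solvent controls plus four standard deviations."""
    return mean(historical_sc) + 4 * stdev(historical_sc)


def jonckheere_terpstra(groups):
    """One-sided JT test for an increasing trend across dose-ordered groups.

    Returns (J, p). Normal approximation, no tie correction (assumption).
    """
    j_stat = 0.0
    for i in range(len(groups)):
        for k in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[k]:
                    # Count pairs where the higher dose shows the higher value;
                    # ties contribute one half.
                    j_stat += 1.0 if x < y else 0.5 if x == y else 0.0
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    expected = (n * n - sum(s * s for s in sizes)) / 4
    variance = (n * n * (2 * n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72
    z = (j_stat - expected) / math.sqrt(variance)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return j_stat, p


def pm1_call(historical_sc, groups, alpha=0.025):
    """True if any dose group mean exceeds the threshold and/or the trend is significant.

    `groups[0]` is the concurrent solvent control; the remaining groups are
    ordered by increasing dose.
    """
    threshold = pm1_threshold(historical_sc)
    exceeds = any(mean(g) > threshold for g in groups[1:])
    _, p = jonckheere_terpstra(groups)
    return exceeds or p < alpha


# Hypothetical MN counts per 1000 cells (not data from this study).
historical = [1.0, 2.0, 3.0]
experiment = [[1, 2], [3, 4], [5, 6]]
positive = pm1_call(historical, experiment)
```

In this toy example no group mean exceeds the threshold, so the positive call rests on the trend test alone, which mirrors the "and/or" rule described for PM1.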

Consideration of biological relevance
In addition to statistical significance, the biological relevance of effects was analyzed in line with OECD TG 487 (2). By expert judgement, it was checked (a) whether the observed MN frequency exceeded the historical control range (mean of historical SC plus two times the standard deviation) in case data were below the PM1 threshold. Further, (b) the reproducibility of positive findings was evaluated.
If one experiment showed a statistically significant, dose-dependent increase in MN frequency (thus demonstrating a reproducible effect across the treatment groups) which exceeded the PM1 threshold, this experiment would be sufficient to call the entire study as positive (even if the second experiment was negative). The positive call for the entire study would also apply in case of a statistically significant increase in one dose only (with exceedance of the PM1 threshold) if reproduced in a second experiment. In case none of the criteria applied, and the bioavailability of the test compound was proven, the study was considered negative. If only one (but not both) of the criteria (a) and (b) were fulfilled, the study was considered equivocal, i.e., further investigation would have been needed to conclude in a positive or negative call.
Please note that after the validation exercise, the performance of both PMs was analysed, and the threshold of PM1 and the UW test of PM2 were combined into the final PM (1). None of the calls presented in this publication would change when applying this final PM. For reasons of transparency, the outcomes of both PM1 and PM2 are shown in the graphs presented in the Results and Discussion section.

Results and discussion
Three laboratories (Labs A, B, C) participated in the validation of the HET-MN. The validation exercise was preceded by a transfer phase, in which the HET-MN protocol was implemented in Labs A and B by investigating cyclophosphamide and 7,12-dimethylbenz[a]anthracene; Lab C was not involved in this phase as it had already participated in the preceding optimization phase of the method (11). Subsequently, three chemicals that had already been tested with the HET-MN were distributed blinded to all three laboratories to expand the historical control databases in Labs A and B (data not shown). In addition, the transfer phase was used to verify the implementation of standards linked to validation exercises (30,31), such as the shipping of coded chemicals and proper dose-range finding, and to conclude the studies with coded chemicals.

Coded testing
The subsequent validation exercise was structured into four phases following a lean design (31). In phase I each chemical was investigated by all three laboratories to obtain information on within-laboratory and between-laboratory reproducibility. In phases II and III each chemical was tested in two laboratories, whereas in phase IV each chemical was analyzed in one laboratory only to expand the number of chemicals investigated with the HET-MN. In total, 34 chemicals were tested double- studies are portrayed in more detail, whose results deviated from in vivo genotoxicity or carcinogenicity data (see Supplementary Table S1). The description of results starts with 2-aminoanthracene to delineate both the study design and the evaluation criteria. Validity criteria (Section 2.6) were all met: (1) the pre-defined experimental design was used, (2) control groups and a minimum of three dose groups showed a sufficiently high viability of ≥ 40%, (3) acceptance criteria for SC and PC (see short dotted lines in Figure 1) were met, and (4) the bioavailability of the chemical was demonstrated by the decrease in viability and the increase in MN frequency (one of these signs would have been sufficient). The evaluation of data with prediction model 1 (PM1) showed an increase in MN frequency exceeding the pre-defined threshold (see long dotted lines in Figure 1). This threshold was calculated as the mean of the historical SC plus four times the standard deviation. Please note that the criterion which is often used as the upper bound of the historical control range, i.e., the mean of the historical SC plus two times the standard deviation, is used here as the validity criterion for the concurrent SC. Therefore, the exceedance of the PM1 threshold is considered a clear indication of a genotoxic effect. In addition, the trend test for a monotonic increase, i.e., the Jonckheere-Terpstra test (JT), was positive as well.
A statistically significant increase was also indicated by the Umbrella-Williams test (UW) of PM2. In addition to the statistical evaluation, the laboratory evaluated the biological relevance of the observed effects in an expert judgement (EJ, section 2.6, in accordance with OECD TG 487 (2)); all relevance criteria were met so that the statistically based test outcome could be confirmed. A second main experiment was performed to obtain information on the within-laboratory reproducibility (WLR) during the validation exercise, which resulted in the same positive call.
Table S1). Lab C considered IPM instead of DMSO (which was used by Lab A) as the most suitable solvent but applied lower doses. To maximize the applied dose, eggs were not only treated on day 8 but received the same dose also on days 9 and 10, followed by the usual sampling on day 11. This so-called "repeated-dose regimen" (14) induced a dose-dependent increase in MN frequency above the PM1 threshold in experiment 1. This outcome was reproduced in experiment 3, which was performed to follow up on the discordant result in the second experiment. In summary, 2-AAF was correctly classified as positive by Lab C. Details on the "repeated-dose regimen" are given in Section 3.4.1.
Table S1). Figure S3) was tested by Lab C up to doses producing signs of strong toxicity. In the second experiment the MN rate at the second mid-dose was flagged by both PMs. As this effect was not reproducible, neither in the third main experiment using a tight dose range nor in the dose-range-finding experiment, the study was considered negative in line with historical in vivo data (Supplementary Table S1).

2-Ethyl-1,3-hexandiol (MP; Supplementary
Already in the dose-range finding experiment (not shown), Lab A observed strong toxicity when   Table S1).  Table S1).  Table S1).  Table S1). The studies in both laboratories were in line with in vitro genotoxicity data for 8HQ (32,33). In vivo genotoxicity and carcinogenicity studies with oral administration showed discordant results (Supplementary Table S1). However, when using a single intraperitoneal (i.p.) injection and analyzing PCE/NCEs in the bone marrow of CD1 mice, a clear increase in MN frequency was seen (34). In addition, several rodent lifetime studies have been published in which 8HQ was applied i.p., via the vagina, or as a bladder implant (35). In all these studies, the treated animals developed tumors at the site of application or in other organs at a rate exceeding that in the solvent-control group. In line with these application regimens, the HET-MN requires an application of chemicals onto the inner shell membrane, which can easily be permeated, allowing the chemical to penetrate the CAM, which is pervaded by fenestrated blood vessels that facilitate systemic uptake. In consequence, we consider the administration procedure in the HET-MN studies to be more closely related to an i.v. administration than to application via the oral route. Thus, the two positive HET-MN studies for 8HQ were considered consistent with published in vivo data.

5-Fluorouracil
Table S1). The two other laboratories reproduced the dose-dependent effect on egg viability starting at 0.04 mg/egg. In addition, a dose-dependent sub-threshold increase in MN frequency could be observed in both laboratories in the first experiments, which was flagged by the JT trend test of PM1 while one dose group of each experiment was outside the HC but below the PM1 threshold. As these effects were
It should be noted that Lab C tested cadmium chloride in a repeated-dose regimen to maximize the overall dose by three applications. After the validation exercise, the laboratory re-tested the chemical in a single-dose regimen, which revealed a clear increase in MN frequency (for details, see Section 3.4.1).
Figure S12) showed a limited solubility in all recommended solvents. IPM was eventually selected by Lab C as the most suitable one to produce a homogeneous suspension at ≥ 0.075 mg/egg. According to standards established for determining the maximum concentration for poorly soluble test chemicals (OECD TG 487, MNvit), curcumin was tested up to the first precipitating dose, i.e., 0.1 mg/egg, without any impact on MN frequency or egg viability. As the precipitates on the egg membrane did not interfere with the test system’s integrity and therefore not with the experimental outcome, Lab B applied suspensions up to the highest manageable dose which could be applied on eggs (20 mg/egg). Again, MN frequency was equal to or below the SC values while viability remained high. Consequently, neither study could be regarded as valid since the bioavailability of the chemical was not proven.
As analytical methods to prove the test chemical’s distribution within the biological test system were not foreseen for the validation exercise, it was decided to present and discuss the study results without including them in the calculation of predictivity.
Figure S14) was tested by Lab C with doses spanning from low to strong toxicity. As none of the experiments showed genotoxic effects, the study was considered negative in concordance with historical in vivo data (Supplementary Table S1).
Table S1). Figure S16) was investigated by Lab B up to strong toxicity in the first main experiment. As this experiment and the second one involving a modified dose range did not show a significant increase in MN frequency, the study was considered negative in line with historical in vivo data (Supplementary Table S1).
Table S1). Figure S18
In Lab A several slight effects were detected. In experiment 1 the viability decreased to 86%, which is within the normal range of solvent and positive controls (Supplementary Figure S34). In both experiments the MN rate of one dose group was slightly outside the HC, but clearly below the PM1 threshold. These slight effects were not considered sufficient by the laboratory to prove the chemical’s bioavailability. In consequence, the study was considered not valid, and in line with the process used for curcumin and phenanthrene, the griseofulvin study of Lab A was not included in the predictivity calculation.
Figure S21) was investigated in aqua DI in all three laboratories.

Labs A and C tested up to the maximum dose of 100 mg/egg and observed a decrease in viability which was sufficient to prove the chemical’s bioavailability. In contrast, Lab B already observed at 15 mg/egg (first experiment) a general toxicity reaching the threshold defining strong toxicity. Table S1).
Figure S22) Figure S23) was tested in Lab C at dose groups causing responses from low to strong toxicity (viability below 40%). As the viability declined steeply (without any changes in MN rate) in the first experiment at the highest dose, the dose range was modified in the second experiment, which proved the absence of genotoxic effects. The negative call was in line with historical in vivo data (Table S1).
Figure S24) was tested in all laboratories up to the maximum solubility of 7 mg/egg, while 11 mg/egg was identified as the maximum applicable suspension. Labs A and C did not observe any relevant impact on viability, even when Lab C used the repeated-dose regimen to facilitate the application of 30 mg/egg in total, i.e., three times 10 mg/egg on days 8, 9 and 10. The reduction of viability in Lab B was considered less relevant because the viability seemed to be generally impacted in these studies: the viability in the PC of both experiments was also close to 60%, although this treatment condition normally does not affect viability. No indications of genotoxic effects were observed. Similar to curcumin, phenanthrene could not be appropriately investigated as the bioavailability of the test compound could not be demonstrated, neither by an increase in MN frequency nor by a decrease in viability. As analytical methods to prove its distribution in the test system were not planned to be used in this validation exercise, it was decided to show and discuss the studies but not to include the results in the calculation of the predictivity.
Figure S25) was tested up to strong toxicity in Lab C without evidence for genotoxicity in the first main experiment.
In the second experiment, the MN frequency increased close to the PM1 threshold at 7 mg/egg, an effect accompanied by strong toxicity (MN data not shown in the graph due to the viability of < 40%). The following experiment, conducted with a narrowed dose range, confirmed the absence of genotoxic effects also in the two highest doses, which were accompanied by strong toxicity. The study was concluded negative, concordant with published in vivo data (Supplementary Table S1).
Figure S26) was investigated by Labs A and C up to strong toxicity, proving the bioavailability of the test chemical. Whereas Lab C did not observe indications of genotoxicity, Lab A detected a slight increase in MN frequency in a mid-dose (without dose-dependency) in the first main experiment, which was flagged by PM2. As this effect was not reproduced in any of the dose groups tested in the second main experiment, this study was also considered negative in line with historical in vivo data (Supplementary Table S1).
Figure S27 Table S1).
Tertiary-butylhydroquinone (TP; Supplementary Figure S31) was investigated up to strong toxicity. In the first experiment the highest dose induced an increase in MN frequency above the PM1 threshold, which was flagged by both PMs. However, as none of the dose groups in the following two experiments were flagged by the PMs, the study was considered negative.

Assessment of intra- and inter-laboratory reproducibility
In order to assess the intra- and inter-laboratory reproducibility, all data generated within the validation effort under blinded conditions (Supplementary Figures S1-S31) were tabulated (Table 1, Supplementary Table S2).
The reproducibility of the HET-MN assay within a laboratory over time was assessed by comparing the concordance of experiments performed in duplicate or triplicate in the same laboratory. Among the 48 studies performed across all three laboratories, 101 experiments could be identified and counted towards assessing the concordance of classification (Table 2A, Supplementary Table S2).
The overall within-laboratory reproducibility for the validation exercise was 92% (Table 2A), with values between 88% and 94% for the individual laboratories that participated in the validation.
Reproducibility between laboratories was calculated based on the final overall call within laboratories for each chemical obtained when tested in three or two laboratories during phases I-III. Of these 15 chemicals, 87% obtained concordant calls (see Table 2B, Supplementary Table S2).
Both the intra- and inter-laboratory reproducibility were found to be in a similar range to those of other in vitro genotoxicity assays tested in a coded fashion and were therefore considered acceptable; for example, the intra-laboratory reproducibility of the in vitro MN test was reported to range from 83% to 100% (6).
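The between-laboratory figure can be read as the share of chemicals for which all participating laboratories reached the same final call. The short Python sketch below illustrates this reading; the function name and the placeholder data are hypothetical, and the count of 13 concordant chemicals is only inferred from the reported 87% of 15 chemicals, not taken from Table 2B.

```python
def between_lab_concordance(calls_by_chemical):
    """Fraction of chemicals for which all laboratories reached the same final call."""
    concordant = sum(
        1 for calls in calls_by_chemical.values() if len(set(calls)) == 1
    )
    return concordant / len(calls_by_chemical)


# Placeholder data: 15 coded chemicals, 13 with concordant calls (inferred value).
calls = {f"chemical_{i}": ["positive", "positive"] for i in range(13)}
calls["chemical_13"] = ["positive", "negative"]
calls["chemical_14"] = ["negative", "equivocal", "negative"]

concordance_percent = round(100 * between_lab_concordance(calls))
```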

Predictive capacity of the HET-MN
The predictive capacity of the HET-MN was calculated using the data from 29 chemicals from all phases of the validation exercise (Table 1). Where the call for a chemical unequivocally agreed with the expected classification, the chemical entered the calculation with a value of 1.0. If the call unequivocally disagreed with the expected classification, the chemical was assigned a value of 0, while equivocal calls counted as 0.5. Discordant calls for one chemical among laboratories were weighted accordingly, e.g., if a chemical was tested in three labs, of which two found the expected result and one an unexpected result, it would be assigned a value of 0.66.
Applying these principles revealed an overall specificity of 98%. Only eugenol produced one equivocal experiment and, in consequence, an equivocal study, while the remaining studies with TN and MP concluded in correct negative predictions (Table 2C).
The overall sensitivity of the HET-MN was calculated to be 84% (Table 2C). While Lab B observed a sensitivity of 89%, it was 67% in Labs A and C. The incorrect calls in the latter two laboratories originate from four chemicals. Three of them were further investigated after the validation, revealing a dose-dependent increase in MN frequency: BaP and potassium dichromate were retested with the day 9 protocol after strong toxicity had been observed at dose groups below 0.03 mg/egg and 0.18 mg/egg, respectively. In addition, CdSO4 was retested with the standard protocol after it had been evaluated with the "repeated-dose" regimen during coded testing, a dosing regimen which was deprioritized after the validation (for details please refer also to section 3.4.1. Dosing regimen). The fourth chemical was griseofulvin, whose limited solubility has been highlighted above. Lab B did not test all four of these chemicals.
The overall accuracy of the HET-MN was calculated to be 91%.
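The scoring scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' statistical code: it assumes that a chemical's score is the mean of its per-laboratory scores (1.0 expected, 0.5 equivocal, 0 unexpected), that sensitivity and specificity are the mean scores over expected positives and expected negatives, and that accuracy is the mean over all chemicals. All names and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

# Score per laboratory call relative to the expected classification.
CALL_SCORE = {"expected": 1.0, "equivocal": 0.5, "unexpected": 0.0}


def chemical_score(lab_outcomes: List[str]) -> float:
    """Mean score over all laboratories that tested the chemical."""
    return sum(CALL_SCORE[outcome] for outcome in lab_outcomes) / len(lab_outcomes)


def predictive_capacity(chemicals: Dict[str, Tuple[bool, List[str]]]) -> Dict[str, float]:
    """Compute sensitivity, specificity and accuracy from weighted scores.

    `chemicals` maps a name to (expected_positive, per-lab outcomes).
    """
    pos = [chemical_score(o) for expected_pos, o in chemicals.values() if expected_pos]
    neg = [chemical_score(o) for expected_pos, o in chemicals.values() if not expected_pos]
    return {
        "sensitivity": sum(pos) / len(pos),
        "specificity": sum(neg) / len(neg),
        "accuracy": (sum(pos) + sum(neg)) / (len(pos) + len(neg)),
    }


# Hypothetical toy data, not the validation data set.
toy = {
    "positive_A": (True, ["expected", "expected", "unexpected"]),  # 2 of 3 labs
    "positive_B": (True, ["expected"]),
    "negative_C": (False, ["equivocal"]),
    "negative_D": (False, ["expected", "expected"]),
}
toy_result = predictive_capacity(toy)
print({name: round(value, 3) for name, value in toy_result.items()})
```

In the toy data, the chemical tested in three laboratories with one unexpected result contributes two thirds of a correct call, matching the weighting example given in the text.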
To put the predictivity of the HET-MN into context, the validation outcome was compared to the predictivity of the MNvit, for which two data sets were available. The first is a retrospective analysis of MNvit data published in 2008 (6) to support the establishment of OECD TG 487. This data set is, however, not discussed in further detail here, because its predictivity was calculated with reference to data from the in vitro chromosomal aberration test, for which the MNvit was intended to serve as an alternative. Generally, validation data are set in reference to in vivo data, which are considered of higher biological relevance than in vitro results. Therefore, the second data set was used to evaluate the HET-MN data. Specifically, an analysis of MNvit data referenced to in vivo data (7) revealed a sensitivity of the classical MNvit of 78.7%, while specificity was 30.8% (or 53.8% when the chemicals classified as equivocal in vivo were considered negative). It should be noted that the MNvit results used for the calculation were obtained with different cell lines and not with one test system as used for the current validation.

Protocol improvements
Apart from providing key information on the predictive capacity of the HET-MN, the comprehensive validation data set was additionally used to investigate specific protocol aspects, which are addressed in the following.

Dosing regimen
In the development and optimization phases of the assay, three different dosing regimens were used (11): the standard protocol involving a single application on day 8, the "repeated-dose" regimen with repeated dosing on days 8, 9 and 10, and a single-dose regimen with application on day 9. All regimens foresee sampling on day 11. The usefulness of the two non-standard regimens is discussed in the following.
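For orientation, the three regimens can be written down as a small data structure. This is only a sketch of the schedule as stated in the text; the class, field and key names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DosingRegimen:
    name: str
    dosing_days: Tuple[int, ...]  # day(s) on which the test chemical is applied
    sampling_day: int = 11        # all regimens foresee sampling on day 11

    def days_from_first_dose_to_sampling(self) -> int:
        """Time between the first application and blood sampling, in days."""
        return self.sampling_day - self.dosing_days[0]


REGIMENS = {
    "standard": DosingRegimen("standard protocol", (8,)),
    "repeated_dose": DosingRegimen('"repeated-dose" regimen', (8, 9, 10)),
    "day_9": DosingRegimen("day 9 protocol", (9,)),
}

for regimen in REGIMENS.values():
    print(regimen.name, regimen.dosing_days, regimen.days_from_first_dose_to_sampling())
```

The sketch makes one difference explicit: the day 9 protocol shortens the interval between application and sampling from three days to two.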

Repeated dose regimen
During the validation phase, Labs A and B employed exclusively the standard protocol. Lab C, which had already participated in the optimization phase, additionally used the "repeated-dose" regimen, which foresees a repeated administration of the same dose on three consecutive days. This treatment procedure was developed to maximize the applicable dose in comparison to a single exposure in the case of test chemicals with low solubility, while an increase in viability could often be observed in parallel (11). In the validation study, the "repeated-dose" regimen was applied for cadmium sulfate and phenanthrene in phase I, and for 2-AAF in phase III.
In the case of cadmium sulfate, Lab C was able to double the dose when using the "repeated-dose" vivo data. After coded testing, the laboratory re-tested the chemical and correctly predicted it using the standard design (Figure 2A). Thus, as the "repeated-dose" regimen was shown to be of limited value in supporting correct calls, it is no longer described in the HET-MN protocol (1).

Day 9 protocol
The third dosing regimen was conceived during the development and optimization phases of the assay (11) and came into play in response to the effects observed with BaP, i.e., strong toxicity in the absence of genotoxic effects already at very low doses (0.03 mg/egg; Supplementary Figure S10 (2)). BaP was therefore re-tested after the validation phase with a slightly modified protocol in which single doses were applied on day 9 (instead of day 8 as in the standard protocol), whereas sampling remained on day 11. With this modification, 10-fold higher BaP doses could be applied without inducing strong toxicity, while a clear increase in MN frequency was observed (Figure 3, Table 3).
To investigate whether this approach is of broader relevance, further chemicals were tested with the day 9 protocol after the validation. Similar to BaP, potassium dichromate had also produced strong  Table 3). Therefore, the HET-MN protocol was amended with the recommendation to further investigate compounds that induce strong toxicity already at low doses of < 1 mg/egg, without having an impact on MN frequency, with the "day 9 protocol".
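The amended recommendation amounts to a simple decision rule, sketched below. The function and parameter names are hypothetical; only the < 1 mg/egg limit and the condition "strong toxicity without an impact on MN frequency" come from the text, and what counts as strong toxicity is left to the published protocol.

```python
from typing import Optional

LOW_DOSE_LIMIT_MG_PER_EGG = 1.0  # "< 1 mg/egg" as stated in the text


def recommend_day9_followup(
    lowest_strongly_toxic_dose: Optional[float],
    mn_increase_observed: bool,
) -> bool:
    """Return True if a follow-up with the day 9 protocol is indicated.

    `lowest_strongly_toxic_dose` is the lowest dose (mg/egg) at which strong
    toxicity was seen in the standard protocol, or None if none was seen.
    """
    if lowest_strongly_toxic_dose is None:
        return False
    return (
        lowest_strongly_toxic_dose < LOW_DOSE_LIMIT_MG_PER_EGG
        and not mn_increase_observed
    )


# BaP-like case from the text: strong toxicity at 0.03 mg/egg, no MN increase.
print(recommend_day9_followup(0.03, mn_increase_observed=False))
```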

Proof of test chemical's bioavailability
The PCE/NCE ratio had been introduced at an early stage of the HET-MN development as an additional indicator for the test chemical's bioavailability (12). A systematic analysis of the validation data set showed this parameter to be quite stable across all three laboratories, even when accompanied by clear indications of genotoxicity or general toxicity (1). In consequence, the PCE/NCE ratio was not considered sufficiently sensitive to prove the bioavailability of a test chemical and is therefore no longer included in the HET-MN protocol.
Supplementary Figures S1-S33). None of the calls presented in this publication would change when applying the final PM.

5 Strategic use of the HET-MN assay
The presence of MN in cultured cells was reported as early as the 1960s (58) as an indicator of clastogenic and aneugenic effects (59). Meanwhile, the mechanistic relevance of micronucleus formation for toxicological assessment is widely accepted, as documented in the respective OECD TGs (2,8), supporting the assessment of chemicals in different regulatory sectors such as industrial chemicals (4), plant protection products (5), pharmaceuticals (60) and cosmetics (3).
The MNvit holds a central position in in vitro test batteries e.g. (3,4). Its position is supported by the assay's good sensitivity (6). However, when the MNvit is combined in a battery approach, positive findings were observed that disagreed with negative in vivo findings obtained with the same chemical (7). Despite their optimization (see revised OECD Test Guidelines (2)), classical in vitro genotoxicity assays based on 2D cell cultures remain limited in mirroring the route of exposure and in showing an intrinsic xenobiotic metabolism, necessitating the use of an external metabolizing system. These are two crucial aspects specified by current OECD TGs (8,9). In consequence, follow-up testing is often performed using animal experiments, which are prohibited or restricted by a growing number of legislations across the globe e.g. (4,61,62,63). Therefore, three-dimensional test systems have been introduced into genotoxicity testing (10), including the HET-MN (11), to fill a toolbox for further investigating positive findings from initial testing without animal experiments.
With the new assays, which utilize test systems with clear intrinsic metabolic capacity, the three routes of exposure can be addressed. For the dermal route, reconstructed skin (RS) tissues have been employed to develop the RS Comet assay (64) and the RS Micronucleus assay (65), which both successfully passed validation exercises recently (66,67). In addition, proof-of-concept studies have been presented to address the inhalative route by combining EpiAirway™ tissues (MatTek) with the comet assay (68), while spheroids from a human liver cancer cell line, HepG2 cells, were used for the evaluation of micronuclei to reflect genotoxic effects following exposures via the oral route (69).
The HET-MN is considered a good candidate to complement the in vitro genotoxicity toolbox. In contrast to 2D cell cultures, chicken eggs are characterized by a clear metabolic capacity, which is mediated by functional cell units in the yolk sac membrane, which in turn are in close vicinity to focal points of erythrocyte maturation, and in the developing liver. The metabolic capacity of the chicken eggs has been demonstrated by the correct prediction of 12 pro-mutagens during the development and validation phases. Further, the developing chicken egg is a fast-cycling test system during the developmental stage at which the HET-MN is performed, i.e., the number of erythrocytes per blood volume increases exponentially, and the same holds true for the blood volume. Moreover, erythrocytes bearing micronuclei are not eliminated, as the spleen is not yet functioning at this early developmental stage, and erythrocytes are almost the only cell type circulating in the blood at this stage. These aspects are thought to form the basis for the very good predictivity of the HET-MN in the validation exercise (specificity 98%, sensitivity 84%, overall accuracy 91%). In addition, during the development and optimization phases of the assay (11-15), 21 chemicals had been tested and predicted correctly (Table 4).
Since 2018, the HET-MN has been mentioned in the Notes of Guidance of the EU Scientific Committee on Consumer Safety (3). This independent expert panel of the European Commission, mandated to ensure the safe use of consumer products, suggested the HET-MN as one assay within a toolbox for the further evaluation of positive outcomes from initial testing with the MNvit (2) in a weight-of-evidence approach. The validation data set is intended to form the basis for further regulatory acceptance.

Conclusion
1. The performance of the assay in correctly predicting the expected genotoxic effects of a difficult set of coded chemicals was very good, providing a sensitivity of 84% and a specificity of 98%. The overall accuracy was 91%.
2. The within-laboratory reproducibility was very good at 92%, as was the between-laboratory reproducibility at 87%, which was based on the final calls.
3. The validation proved the suitability of fertilized chicken eggs for genotoxicity assessment, as shown by the reproducibly low background DNA damage and the intrinsic metabolic capacity being sufficient to activate pro-mutagens.
4. The HET-MN has gained regulatory acceptance from the EU Scientific Committee on Consumer Safety, which now suggests the assay as a follow-up to help address positive findings from the initial testing with the classical in vitro test battery.

Table 1. Overview of the validation outcome. Chemicals were each tested in a blind-coded manner in two to three laboratories in phases I-III, and in one laboratory only in phase IV. Study outcome: equiv = equivocal; neg = negative study (i.e., no increase in MN frequency); nv = not valid; pos = positive study; i.p. = intraperitoneal. Classification of chemicals into MP (misleading positive), TN (true negative) and TP (true positive) is based on historical in vitro and in vivo genotoxicity or carcinogenicity data as provided in Supplementary Table S1.

No. Figure
Chemical