Use of historical controls for animal experiments.

Statistical methods for the use of historical control data in testing for a trend in proportions in rodent carcinogenicity bioassays are reviewed. Asymptotic properties of the Hoel-Yanagawa exact conditional tests are developed and compared with those of the Tarone test. The comparison indicates that the Hoel-Yanagawa test is more powerful than the Tarone test. These tests depend on the beta-binomial parameters, which are estimated from historical data. The goodness of fit of beta-binomial distributions to historical data is illustrated by application to the historical control database of the National Toxicology Program. Finally, the sensitivity of the exact conditional test to the historical information is discussed and a conservative use of the test is considered.


Introduction
To begin, we consider Table 1, which summarizes the data from an experiment involving $r + 1$ groups of animals. One group serves as a control group and the remaining $r$ groups are administered a test compound at increasing dose levels, $d_1 < d_2 < \dots < d_r$. The control group is associated with $i = 0$, so that $d_0 = 0$. Let $n_i$ denote the number of animals in the $i$-th group. We assume for $i = 0, 1, \dots, r$ that at experimental dose $d_i$ there are $x_i$ animals with tumors observed, and that these counts are binomially distributed with parameters $p_i$ and $n_i$. We define $p = p_0$.
To test for an increase in the observed proportions $x_i/n_i$ with increasing dose level, Cochran (1) and Armitage (2) proposed the trend test that now bears their names. Cox (3) showed that this statistic gives the uniformly most powerful unbiased test against logistic alternatives, and Tarone and Gart (4) showed that it is asymptotically locally optimal against any alternative that can be expressed as a smooth increasing function of dose. Most rodent carcinogenicity bioassays involve three experimental groups of animals: a control group, a low-dose group, and a high-dose group, each with 50 animals. The probability that an animal in the control group has a specific type of tumor ranges from less than 1% to 20%, depending on the type of tumor.
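As an illustration of the trend test discussed above, the following is a minimal sketch of the one-sided Cochran-Armitage test in its usual normal-approximation form. The function name `cochran_armitage` and the example counts are chosen here for illustration only; the variance used is the standard pooled-binomial variance, which may differ in small details (for example a factor of $N/(N-1)$) from the version used by the authors.

```python
import math

def cochran_armitage(doses, tumors, sizes):
    """One-sided Cochran-Armitage trend test (normal approximation).

    doses, tumors, sizes: sequences of d_i, x_i, n_i for i = 0..r.
    Returns (z, p_value) for the alternative of an increasing trend.
    """
    n_total = sum(sizes)
    p_hat = sum(tumors) / n_total                      # pooled tumor rate
    # Numerator: sum of d_i * (x_i - n_i * p_hat)
    t_stat = sum(d * (x - n * p_hat) for d, x, n in zip(doses, tumors, sizes))
    s1 = sum(n * d for d, n in zip(doses, sizes))
    s2 = sum(n * d * d for d, n in zip(doses, sizes))
    var = p_hat * (1.0 - p_hat) * (s2 - s1 * s1 / n_total)
    if var <= 0.0:                                     # degenerate table
        return 0.0, 1.0
    z = t_stat / math.sqrt(var)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))      # upper tail of N(0, 1)
    return z, p_value

# Illustrative counts for a control, low-dose and high-dose group of 50 animals.
z, p = cochran_armitage((0, 1, 2), (2, 2, 9), (50, 50, 50))
print(round(z, 3), round(p, 4))
```

The test uses only the current experiment; no historical control information enters the calculation.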
When the Cochran-Armitage test is applied to these bioassays, two problems arise: the problem of false positives (Type I error) and that of false negatives (Type II error). For the first problem, Portier and Hoel (5) showed that when the Cochran-Armitage test is used the false positives can be considerable, depending markedly [...] Yanagawa (7). The tests developed are exact tests rather than the asymptotic procedures of the previous authors. They assume, as did Tarone (8), that the historical rates are distributed as beta-binomial and construct tests conditional on the number of outcomes in the control group.
In all of this work, the parameters of the beta-binomial distribution which must be estimated from the historical control data are assumed to be known. Considering the conditional test for logistic response by Hoel and Yanagawa (7), Yanagawa, Hoel, and Brooks (6) discuss the sensitivity of the test to this source of variability and develop a conservative use of the test. Haseman, Huff, and Boorman (11) have reviewed the data stored in the Carcinogenesis Bioassay Data System and discuss those issues which must be adequately addressed before historical control data can be used in a formal testing framework.

Formulation
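We assume for $i = 0, 1, \dots, r$ that at experimental dose $d_i$ there are $X_i$ animals with observed tumors, and that $X_i$ is binomially distributed with parameters $p_i$ and $n_i$. Following Tarone and Gart (4), we formulate the problem of testing for a trend in proportions by assuming that
$$p_i = H(a + \xi d_i), \quad i = 0, 1, \dots, r,$$
where $H$ is a twice differentiable and monotone increasing function over $[0, \infty]$. The test of the hypothesis of an increasing trend in proportions is then given by $H_0: \xi = 0$ vs. $H_1: \xi > 0$.
Following the development of Tarone (8) and Hoel (10), we assume that $p_0$ (denoted by $p$) is a random variable following a beta distribution with density
$$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1} q^{\beta-1}, \quad 0 < p < 1,$$
with $q = 1 - p$ and with $\alpha$ and $\beta$ known. We write $\theta = \alpha/(\alpha+\beta)$ for its mean. We define $p = H(a)$, so that $a$ is distributed as
$$g(a) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, H(a)^{\alpha-1}\,[1-H(a)]^{\beta-1}\, H'(a).$$
Since $X_0, X_1, \dots, X_r$ are independent conditioned on $p$, the joint distribution of $a$ and $X = (X_0, X_1, \dots, X_r)$ is given by
$$f_\xi(x, a) = \prod_{i=0}^{r} \binom{n_i}{x_i} H(a+\xi d_i)^{x_i}\,[1-H(a+\xi d_i)]^{n_i-x_i}\; g(a).$$
Thus the marginal distribution of $X$ is
$$f_\xi(x) = \int f_\xi(x, a)\, da.$$
In particular, the marginal distribution of $X_0$ is the beta-binomial distribution
$$f_0(x_0) = \binom{n_0}{x_0} \frac{\Gamma(\alpha+\beta)\,\Gamma(x_0+\alpha)\,\Gamma(n_0-x_0+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)\,\Gamma(n_0+\alpha+\beta)},$$
which does not depend on $\xi$ because $d_0 = 0$.

Unconditional Tests
The locally most powerful test (12) for $H_0: \xi = 0$ vs. $H_1: \xi > 0$ is given by
$$T = \left.\frac{\partial}{\partial \xi} \log f_\xi(x)\right|_{\xi=0}.$$
After some simple calculation we find that
$$T = \frac{\Gamma(n+\alpha+\beta)}{\Gamma(x_\cdot+\alpha)\,\Gamma(n-x_\cdot+\beta)} \int \sum_{i=1}^{r} d_i\,[x_i - n_i H(a)]\, H(a)^{x_\cdot+\alpha-2}\,[1-H(a)]^{n-x_\cdot+\beta-2}\,[H'(a)]^2\, da, \qquad (1)$$
where $x_\cdot = \sum_{i=0}^{r} x_i$ and $n = \sum_{i=0}^{r} n_i$. This statistic depends on the response function $H$. Two cases are of particular interest.
(1) Logistic response: Set $H(a) = e^{a}/(1 + e^{a})$; then from Eq. (1) we have
$$T_\ell(\alpha, \beta) = \sum_{i=1}^{r} d_i \left( X_i - n_i\, \frac{X_\cdot + \alpha}{n + \alpha + \beta} \right). \qquad (2)$$
(2) Exponential response: Set $H(a) = 1 - e^{-a}$; then
$$T_e(\alpha, \beta) = \frac{n+\alpha+\beta-1}{X_\cdot+\alpha-1} \sum_{i=1}^{r} d_i X_i - \sum_{i=1}^{r} d_i n_i.$$
Suppose that the response function is a distribution function which is third-order differentiable. Then, applying the formula by Hald (13), we may show that $T$ and $T_\ell$, the statistic given in Eq. (2), have the same asymptotic distribution as $n \to \infty$ when they are appropriately normalized; that is, $T$ is asymptotically free of the shape of the response function and is equivalent to $T_\ell$.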
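The two special cases can be evaluated directly. The sketch below assumes that the locally most powerful statistic reduces, for the logistic response, to $\sum_i d_i\,[x_i - n_i (x_\cdot+\alpha)/(n+\alpha+\beta)]$ and, for the exponential response, to $\frac{n+\alpha+\beta-1}{x_\cdot+\alpha-1}\sum_i d_i x_i - \sum_i d_i n_i$; these closed forms are reconstructed from the model definitions rather than copied from the original equations, and the function names are hypothetical.

```python
def trend_statistic_logistic(doses, tumors, sizes, alpha, beta):
    """Assumed logistic-response statistic: sum d_i * (x_i - n_i * E[p | data])."""
    n = sum(sizes)
    x_dot = sum(tumors)
    p_post = (x_dot + alpha) / (n + alpha + beta)   # posterior mean of p under H0
    return sum(d * (x - m * p_post) for d, x, m in zip(doses, tumors, sizes))

def trend_statistic_exponential(doses, tumors, sizes, alpha, beta):
    """Assumed exponential-response statistic: E[1/p | data] * sum d_i x_i - sum d_i n_i."""
    n = sum(sizes)
    x_dot = sum(tumors)
    if x_dot + alpha <= 1:
        raise ValueError("E[1/p | data] is finite only when x_dot + alpha > 1")
    inv_p_post = (n + alpha + beta - 1) / (x_dot + alpha - 1)
    dx = sum(d * x for d, x in zip(doses, tumors))
    dn = sum(d * m for d, m in zip(doses, sizes))
    return inv_p_post * dx - dn

doses, sizes = (0, 1, 2), (50, 50, 50)
print(trend_statistic_logistic(doses, (1, 2, 3), sizes, 3.95, 391.0))
print(trend_statistic_exponential(doses, (1, 2, 3), sizes, 3.95, 391.0))
```

With $\alpha$ and $\beta$ close to zero the logistic statistic approaches the numerator of the Cochran-Armitage statistic, which is a convenient sanity check.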
To obtain an asymptotic test based on $T_\ell$, we note the following results, which follow from straightforward calculations. Under $H_0: \xi = 0$,
$$E(X_i) = n_i\alpha/(\alpha+\beta) = n_i\theta,$$
$$V(X_i) = n_i\theta(1-\theta)(n_i+\alpha+\beta)/(\alpha+\beta+1),$$
$$\mathrm{Cov}(X_i, X_j) = n_i n_j\theta(1-\theta)/(\alpha+\beta+1), \quad i \ne j.$$
From these moments the null variance of $T_\ell$ follows, and $T_\ell$ can be standardized in three ways, giving the statistics $S$, $S_c$, and $S_t$. Of these standardizations, the first statistic $S$ simply uses the unconditional variance of the statistic $T$, while the second, $S_c$, uses the estimated conditional variance. The Tarone standardization, $S_t$, results from treating the random variable $a$ as a parameter in the likelihood function and using the score test of $\xi = 0$.
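The null moments can be checked numerically. The following sketch compares the standard beta-binomial mean and variance, $n_i\theta$ and $n_i\theta(1-\theta)(n_i+\alpha+\beta)/(\alpha+\beta+1)$, with values obtained by enumerating the beta-binomial probability function; the covariance is returned from the closed form only. Function names are illustrative.

```python
import math

def betabinom_pmf(x, n, alpha, beta):
    """Beta-binomial probability Pr(X = x) for n animals, via log-gamma."""
    log_p = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
             + math.lgamma(x + alpha) + math.lgamma(n - x + beta)
             - math.lgamma(n + alpha + beta)
             + math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta))
    return math.exp(log_p)

def null_moments(n_i, n_j, alpha, beta):
    """Closed-form E(X_i), V(X_i) and Cov(X_i, X_j) under H0."""
    theta = alpha / (alpha + beta)
    var_p = theta * (1 - theta) / (alpha + beta + 1)   # variance of the beta-distributed p
    return n_i * theta, n_i * (n_i + alpha + beta) * var_p, n_i * n_j * var_p

def enumerated_mean_var(n, alpha, beta):
    """Mean and variance obtained by summing over the probability function."""
    probs = [betabinom_pmf(x, n, alpha, beta) for x in range(n + 1)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var

print(null_moments(50, 50, 3.0, 12.0))
print(enumerated_mean_var(50, 3.0, 12.0))
```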
In considering the asymptotic distributions of the test statistics $S$, $S_c$, and $S_t$ as $n \to \infty$, we assume that $\lambda_i = n_i/n$ is kept constant ($0 < \lambda_i < 1$) for each $i$. The asymptotic distributions are summarized as follows; the proofs can be found in, or obtained by following the arguments of, Hoel and Yanagawa (7).
TARONE'S STANDARDIZATION. For either $\alpha + \beta$ or $\theta = \alpha/(\alpha+\beta)$ fixed, and with $\Phi(\cdot)$ the standard normal distribution function, under $H_0$
$$\lim_{n\to\infty} \Pr\{S_t < x\} = \Phi(x).$$
UNCONDITIONAL STANDARDIZATION. For $\alpha + \beta$ fixed then [...] where $\rho_n = n/(n+\alpha+\beta) \to \rho$ ($0 < \rho < 1$) and $\lambda_i = n_i/n$ is fixed.
The above results show that Tarone's standardization is the best among the three, although the standardization is not easy to justify. Hoel and Yanagawa (7) show that when $\theta$ is small, $n$ must be quite large for the normal approximation underlying the asymptotic tests to be reasonable.

Exact Conditional Test
Since $f_0(x_0)$, the marginal distribution of $X_0$, does not depend on $\xi$, $X_0$ is an ancillary statistic. Fisher (14) suggested that for purposes of inference one should consider the family of conditional distributions given the observed value of the ancillary statistic in the sample. Denote by $f_\xi(x \mid x_0)$ the conditional probability density function of $X$ given $X_0 = x_0$. The conditional locally most powerful test for $H_0: \xi = 0$ vs. $H_1: \xi > 0$ is given by
$$T = \left.\frac{\partial}{\partial \xi} \log f_\xi(x \mid x_0)\right|_{\xi=0},$$
and it is easy to show that $T$ is given by Eq. (1).
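In general, let $t_0$ be the observed value of $T$; then the exact p-value of the conditional test is given by
$$\text{p-value} = \sum{}' \left[\prod_{i=1}^{r} \binom{n_i}{x_i}\right] \frac{\Gamma(x_\cdot+\alpha)\,\Gamma(n+\beta-x_\cdot)\,\Gamma(n_0+\alpha+\beta)}{\Gamma(x_0+\alpha)\,\Gamma(n_0+\beta-x_0)\,\Gamma(n+\alpha+\beta)},$$
where $x_\cdot = \sum_{i=0}^{r} x_i$, $n = \sum_{i=0}^{r} n_i$, and the summation $\sum'$ extends over all $(x_1, x_2, \dots, x_r)$ that satisfy $T \ge t_0$ for given $X_0 = x_0$. For the NTP data with $r = 2$, $n_0 = n_1 = n_2 = 50$ and $p$ ranging from 1% to 20%, computation of the p-value by computer is very quick.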
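The enumeration behind the exact p-value can be sketched as follows. This is an illustrative implementation under two assumptions: the statistic $T$ is taken to be the logistic-response form $\sum_i d_i\,[x_i - n_i (x_\cdot+\alpha)/(n+\alpha+\beta)]$, and the p-value sums the conditional null probabilities of all outcomes with $T \ge t_0$. Function and argument names are hypothetical; `doses` and `ns` refer to the treated groups $1, \dots, r$ only, since $d_0 = 0$.

```python
import math
from itertools import product

def log_binom(n, x):
    """Logarithm of the binomial coefficient C(n, x)."""
    return math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

def log_cond_prob(x0, xs, n0, ns, alpha, beta):
    """log Pr(X_1..X_r = xs | X_0 = x0) under H0 (gamma-function form)."""
    x_dot = x0 + sum(xs)
    n = n0 + sum(ns)
    return (sum(log_binom(m, x) for x, m in zip(xs, ns))
            + math.lgamma(x_dot + alpha) + math.lgamma(n - x_dot + beta)
            + math.lgamma(n0 + alpha + beta)
            - math.lgamma(x0 + alpha) - math.lgamma(n0 - x0 + beta)
            - math.lgamma(n + alpha + beta))

def logistic_stat(doses, x0, xs, n0, ns, alpha, beta):
    """Assumed logistic-response trend statistic (control term vanishes, d_0 = 0)."""
    x_dot = x0 + sum(xs)
    n = n0 + sum(ns)
    p_post = (x_dot + alpha) / (n + alpha + beta)
    return sum(d * (x - m * p_post) for d, x, m in zip(doses, xs, ns))

def exact_conditional_p_value(doses, x0, xs_obs, n0, ns, alpha, beta):
    """Sum conditional null probabilities over outcomes with T >= observed T."""
    t_obs = logistic_stat(doses, x0, xs_obs, n0, ns, alpha, beta)
    total = 0.0
    for xs in product(*(range(m + 1) for m in ns)):
        # Small tolerance guards against floating-point ties with t_obs.
        if logistic_stat(doses, x0, xs, n0, ns, alpha, beta) >= t_obs - 1e-9:
            total += math.exp(log_cond_prob(x0, xs, n0, ns, alpha, beta))
    return total

# r = 2 treated groups of 50 animals; (alpha, beta) = (3, 12) gives theta = 0.2.
print(exact_conditional_p_value((1, 2), 2, (2, 9), 50, (50, 50), 3.0, 12.0))
```

For $r = 2$ and groups of 50 animals the loop visits only $51^2 = 2601$ outcomes, which is why the computation is fast.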

Asymptotic Properties of the Exact Conditional Tests
Following the development in Hoel and Yanagawa (7), one may show the following.
(1) Under the null hypothesis $H_0$, $S$ has a limiting normal distribution with mean zero and variance one as $n \to \infty$. Thus the asymptotic conditional size-$\alpha$ test for an increasing trend in proportions (with $\alpha$ here denoting the significance level, not the beta parameter) rejects $H_0$ if $S \ge z_\alpha$, where $z_\alpha$ is the upper $100\alpha$% point of the standard normal distribution.
(2) For the sequence of alternative hypotheses $H_{1n}: \xi_n = \delta/\sqrt{n}$, the unconditional asymptotic power of the conditional test $S$ is the same as that of $S_t$, which is given in [...]
This efficiency formula can be used for assessing the saving in sample size achieved by incorporating historical controls. Suppose that historical control data are incorporated with the current experiment, which uses $n$ animals in total. Then the formula implies that approximately $n' = n[B(\rho)/B(1)]^2$ animals are needed by an analysis of the current experiment alone to achieve the same statistical power as the incorporated analysis. For example, an analysis incorporating historical controls with $\alpha + \beta = 400$ into a current experiment using $d_1 = 1$, $d_2 = 2$, and $n_0 = n_1 = n_2 = 50$ animals corresponds to an analysis of the current experiment alone with $n_0 = n_1 = n_2 = 105$ animals. In terms of false negatives, the use of this historical information decreases the false negative rate from 0.51 to 0.24 when $p = 0.01$, $p_1 = 0.049$, and $p_2 = 0.360$ (see Table 2). This of course assumes that $\alpha + \beta \to \infty$ as $n \to \infty$ and that $p = \theta$ in the limit. If this is not the case, the above finding need not hold.
For example, in comparing the exact p-values of the exact tests, Yanagawa, Hoel, and Brooks (6) show that when $\theta$ is large ($\theta = 0.2$), $\alpha + \beta$ is small ($\alpha + \beta = 15$), and $x_0/n_0$ is much smaller than $\theta$, the p-value of the Hoel-Yanagawa conditional test for $(x_0, x_1, x_2) = (2, 2, 9)$ is 0.048, whereas the corresponding p-value of the exact trend test that does not incorporate historical control data is 0.0096. Generally, Hoel and Yanagawa (7) find that the Cochran-Armitage test gives much higher p-values, especially when $x_0$ is larger than expected, and smaller p-values when $x_0$ is smaller than expected.

Comparisons with the Tarone Test
Suppose that the sample size $n$ and $p$ are moderately large and that the distributions of both test statistics $S$ and $S_t$ are approximated well by their asymptotic distributions. Under the alternative hypothesis of a positive dose-response, it is reasonable to expect that the observed sample point $(x_0, x_1, \dots, x_r)$ falls in the region $R$ defined by
$$R = \{(x_0, x_1, \dots, x_r) : x_0/n_0 \le x_i/n_i,\ i = 1, 2, \dots, r\}.$$
Furthermore, assume that $x_i/n_i < 1/2$, $i = 0, 1, \dots, r$, which is the case in many animal carcinogenicity experiments. Then it may be shown for $(x_0, x_1, \dots, x_r) \in R$ that $S$ is larger than the square root of Tarone's test statistic. This indicates that, for a moderate sample size, the asymptotic conditional test would have higher power than the Tarone test.
Generally, as stated above, for animal experiments where $n$ is small the asymptotic approximation of the trend tests is not good, especially when $\theta$ is very small and $\alpha + \beta$ is large. This is precisely the situation in which a good gain in power from incorporating historical control data is anticipated. Therefore, the exact conditional test is suggested rather than the asymptotic test.
Finally, we note one weak point of the Tarone test, as well as of the tests by the other authors. Suppose that $n_0 = n_1 = n_2 = 50$, $\theta = 0.01$ and $(\alpha, \beta) = (3.95, 391)$, and that $(x_0, x_1, x_2) = (3, 3, 3)$ is observed; then the p-value of the Tarone test is 0.007. Thus strong evidence of a positive dose-response is indicated, even though the three observed proportions are identical. This is because $\Pr[x_0 \ge 3] = 0.02$, so that the control count is itself exceptional relative to the historical data. This illustrates the necessity of dealing with exceptional values of $x_0$, which sometimes occur for the reasons discussed in the next section. Since the existence of a sound historical control database has been presumed for our statistical procedures, one should not attempt to incorporate the historical data when an exceptional value of $x_0$ is observed. We encourage the use of the ancillary information, i.e., $x_0$, in the conditional procedure to check the quality of the current experiment.
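The check on the control count can be sketched in a few lines: under the beta-binomial model with $(\alpha, \beta) = (3.95, 391)$ and $n_0 = 50$, the upper tail probability of $x_0$ is computed directly. The function names are illustrative, and the idea of flagging the control group when this tail probability is small is one possible reading of the recommendation above, not a rule stated by the authors.

```python
import math

def betabinom_pmf(x, n, alpha, beta):
    """Beta-binomial probability Pr(X = x) for n animals, via log-gamma."""
    log_p = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
             + math.lgamma(x + alpha) + math.lgamma(n - x + beta)
             - math.lgamma(n + alpha + beta)
             + math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta))
    return math.exp(log_p)

def control_upper_tail(x0, n0, alpha, beta):
    """Pr(X_0 >= x0) under the historical beta-binomial distribution."""
    return 1.0 - sum(betabinom_pmf(x, n0, alpha, beta) for x in range(x0))

# Control group of 50 animals with 3 tumors, historical (alpha, beta) = (3.95, 391).
tail = control_upper_tail(3, 50, 3.95, 391.0)
print(round(tail, 3))  # roughly 0.02, in line with the value quoted in the text
```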

Historical Control Database
Problems encountered in the historical control data are discussed by Haseman, Huff, and Boorman (11). Examining the NCI/NTP historical data carefully, these authors find that different terminologies are often used to describe the same tumor even for studies at the same laboratory carried out at approximately the same time. Also the use of different sets of criteria for diagnosing a lesion is revealed. Discussing the criteria that will aid in determining whether a particular study should be included in the database, Haseman, Huff, and Boorman (11) state "Certainly species, strain, sex, study duration, pathology protocols, nomenclature conventions, quality assurance and review procedures should be the same for each study in a particular control database. Ideally, diets, changing regiments, and various environmental parameters should also be comparable. Different types of control groups (e.g., untreated, corn oil gavage) should be dealt with separately. Other potential sources of variability (calendar year, laboratory, pathologist, supplier) should also be investigated, identified and controlled." The current database thus established in the NTP contains information beginning with those studies reported in Technical Report 193, 1981 through those studies whose pathology diagnoses were finalized in Carcinogenesis Bioassay Data System as of March, 1983. Most control groups have 50 animals/species/sex and all are from studies of two years duration. About 50 control groups/species/sex are contained in the database.
We fitted a beta-binomial distribution to the data for each tumor type in the database. Table 3 shows, for selected tumor sites in the Fischer 344 rat, the estimates of the beta-binomial parameters $\alpha$ and $\beta$, of $\alpha + \beta$, and of $\theta = \alpha/(\alpha + \beta)$, together with their standard deviations. These estimates are obtained by the method of maximum likelihood. The asterisk (*) in the table marks tumor sites whose data did not fit a beta-binomial distribution well. It is estimated that a little more than one third of the sites in the database are of this kind. For most of these, the variation between experiments is smaller than that of a binomial distribution. Methods for incorporating these historical control data are not yet developed.
Figures are given in Yanagawa, Hoel, and Brooks (6) to show visually the goodness of fit of beta-binomial distributions to several selected tumor sites.

Sensitivity
The exact conditional tests developed in the preceding sections depend on the beta-binomial parameters, $\alpha$ and $\beta$, which are estimated from historical control data. As seen in Table 3, the standard deviations of the estimated $\alpha$ and $\beta$ are fairly large. There is a need to consider the effect of this source of variability.
Yanagawa, Hoel, and Brooks (6) studied its effect on the p-value of the conditional test for a logistic response. They show that the p-value changes only slightly with a small change in $\alpha + \beta$; that when $\alpha + \beta$ is small the p-value is not sensitive to a change in $\theta$; whereas when $\alpha + \beta$ is large, slight changes in $\theta$ produce substantial changes in the p-value, in particular when the difference between $\theta$ and $x_0/n_0$ is large.
Developing methods for constructing 95% confidence intervals for $\theta$ and $\alpha + \beta$, Yanagawa, Hoel, and Brooks (6) considered the maximum and minimum of the p-values over the space formed by the Cartesian product of these confidence intervals (see the shaded area in Figure 1). They found numerically that these maxima and minima always seem to be attained at the four corner points, i.e., A, B, C, and D in Figure 1. Table 4 shows p-values at A, B, C, and D, and at the point O of the estimated $\alpha$ and $\beta$, for several configurations of $(x_0, x_1, x_2)$ for tumors of the thyroid and tumors of the hematopoietic system, using as usual $n_0 = n_1 = n_2 = 50$.
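To make the estimation step concrete, here is a rough sketch of a maximum likelihood fit of a beta-binomial distribution to historical control counts, parameterized by $\theta = \alpha/(\alpha+\beta)$ and $M = \alpha + \beta$. It uses a plain grid search rather than the authors' (unspecified) fitting procedure, and the counts below are made-up numbers for illustration, not NTP data.

```python
import math

def betabinom_loglik(theta, m, counts, sizes):
    """Beta-binomial log-likelihood with alpha = theta * m, beta = (1 - theta) * m."""
    alpha, beta = theta * m, (1.0 - theta) * m
    const = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    ll = 0.0
    for x, n in zip(counts, sizes):
        ll += (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
               + math.lgamma(x + alpha) + math.lgamma(n - x + beta)
               - math.lgamma(n + alpha + beta) + const)
    return ll

def fit_betabinom_grid(counts, sizes, theta_grid, m_grid):
    """Return (alpha_hat, beta_hat) maximizing the likelihood over the grid."""
    _, theta, m = max((betabinom_loglik(t, m, counts, sizes), t, m)
                      for t in theta_grid for m in m_grid)
    return theta * m, (1.0 - theta) * m

# Hypothetical tumor counts in ten historical control groups of 50 animals.
counts = [0, 0, 5, 1, 0, 7, 2, 0, 3, 1]
sizes = [50] * len(counts)
theta_grid = [k / 1000 for k in range(5, 101)]     # theta from 0.005 to 0.100
m_grid = [2 ** (k / 4) for k in range(41)]         # alpha + beta from 1 to 1024
alpha_hat, beta_hat = fit_betabinom_grid(counts, sizes, theta_grid, m_grid)
print(alpha_hat, beta_hat)
```

When the between-study variation is no larger than binomial, the fitted $\alpha + \beta$ runs to the upper edge of the grid, which is one practical symptom of the poorly fitting sites mentioned above.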

Conservative Use of the Conditional Test
Inspection of Table 4 leads to the following conservative rule for incorporating historical control data by the exact conditional test for testing positive dose-response:
(R1) Compute p-values at the five points A, B, C, D, and O.
(R2) Do not attempt to draw any inference when the maximum p-value of these five points exceeds the nominal level, e.g., 0.01 or 0.05.
This rule is very conservative, but it still works well in practice, especially for tumors with small spontaneous background rates. This is shown by comparing the maximum p-value with the p-value of the exact trend test, i.e., the extended version of Fisher's exact test, which does not incorporate historical data. For example, when $(\alpha, \beta) = (3.95, 391)$ and $(x_0, x_1, x_2) = (1, 2, 3)$, the maximum p-value is 0.020, whereas the p-value of the exact test is 0.226. The rule also works for many configurations of $(x_0, x_1, x_2)$ even when $\theta$ is large ($\theta = 0.2$); for example, when $(\alpha, \beta) = (3, 12)$ and $(x_0, x_1, x_2) = (21, 25, 29)$, the maximum p-value is 0.018 and the p-value of the exact test is 0.067. Note that the computing time required to obtain the p-values at the five points is rather short: for example, when $(\alpha, \beta) = (3.95, 391)$, a VAX 780 took less than 40 sec to compute the p-values for $(x_0, x_1, x_2) = (1, 2, 3)$.
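Rules (R1) and (R2) can be sketched in code as follows. This is an illustration only: the exact p-value is computed with the logistic-response statistic in an assumed closed form, the assignment of the labels A to D to particular corners is arbitrary because Figure 1 is not reproduced here, and the confidence limits in the example are hypothetical values, not those of Table 3.

```python
import math
from itertools import product

def _log_binom(n, x):
    return math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

def exact_p_value(doses, x0, xs_obs, n0, ns, alpha, beta):
    """Exact conditional p-value given X_0 = x0 (assumed logistic-response statistic)."""
    n = n0 + sum(ns)
    dn = sum(d * m for d, m in zip(doses, ns))

    def stat(xs):
        x_dot = x0 + sum(xs)
        return sum(d * x for d, x in zip(doses, xs)) - dn * (x_dot + alpha) / (n + alpha + beta)

    def log_prob(xs):
        x_dot = x0 + sum(xs)
        return (sum(_log_binom(m, x) for x, m in zip(xs, ns))
                + math.lgamma(x_dot + alpha) + math.lgamma(n - x_dot + beta)
                + math.lgamma(n0 + alpha + beta) - math.lgamma(x0 + alpha)
                - math.lgamma(n0 - x0 + beta) - math.lgamma(n + alpha + beta))

    t_obs = stat(xs_obs)
    return sum(math.exp(log_prob(xs))
               for xs in product(*(range(m + 1) for m in ns))
               if stat(xs) >= t_obs - 1e-9)

def conservative_rule(doses, x0, xs_obs, n0, ns, theta_hat, m_hat,
                      theta_ci, m_ci, level=0.05):
    """(R1) p-values at O and the four corners; (R2) infer only if the maximum <= level."""
    points = {"O": (theta_hat, m_hat),
              "A": (theta_ci[0], m_ci[0]), "B": (theta_ci[0], m_ci[1]),
              "C": (theta_ci[1], m_ci[1]), "D": (theta_ci[1], m_ci[0])}
    p_values = {name: exact_p_value(doses, x0, xs_obs, n0, ns, th * m, (1.0 - th) * m)
                for name, (th, m) in points.items()}
    max_p = max(p_values.values())
    return {"p_values": p_values, "max_p": max_p, "draw_inference": max_p <= level}

# Point O is (alpha, beta) = (3.95, 391); the interval limits are hypothetical.
result = conservative_rule((1, 2), 1, (2, 3), 50, (50, 50),
                           theta_hat=0.01, m_hat=394.95,
                           theta_ci=(0.006, 0.016), m_ci=(200.0, 800.0))
print(result["max_p"], result["draw_inference"])
```

Each of the five evaluations enumerates the same $51^2$ outcomes, so the whole rule costs about five times one exact p-value.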