Testing for goodness rather than lack of fit of continuous probability distributions

The vast majority of testing procedures presented in the literature as goodness-of-fit tests fail to accomplish what the term promises. In fact, a significant result of such a test indicates that the true distribution underlying the data differs substantially from the assumed model, whereas the true objective is usually to establish that the model fits the data sufficiently well. Meeting that objective requires carrying out a testing procedure for a problem in which the statement that the deviations between model and true distribution are small plays the role of the alternative hypothesis. Testing procedures of this kind, for which the term tests for equivalence has been coined in statistical usage, are available for establishing goodness-of-fit of discrete distributions. We show how this methodology can be extended to settings where interest is in establishing goodness-of-fit of distributions of the continuous type.


Introduction
Goodness-of-fit tests belong to the oldest and most frequently used methods of statistical inference. A chapter devoted to them can be found in almost every textbook for statisticians working in whatever area of application. Likewise, there are numerous authoritative expositions of the mathematical theory of these methods, beginning with Cramér's classical text [1], through all three editions of Lehmann's "Testing Statistical Hypotheses" [2][3][4], to Volume 2A of "Kendall's Advanced Theory of Statistics" [5], to mention just a few highly influential references of this category. Virtually all inferential procedures presented in the existing literature as tests of goodness-of-fit share one crucial feature: the statement that the model to be fitted coincides with the true distribution from which the data are taken plays the role of the null hypothesis, implying that a significant result actually indicates lack rather than goodness of fit of the model. This is clearly at variance with the fact that in the vast majority of applications, interest will be in proving rather than falsifying the model, so that such a test typically fails to serve the purpose of its user. For instance, many of the most widely used methods of statistical analysis rely on the assumption that the data follow a specific distributional law (namely the Gaussian), and it is widespread practice to check the adequacy of this assumption in a preliminary test. In such a test, the hypothesis one actually aims to establish states that the corresponding model holds at least approximately true. In the existing literature, the term lack-of-fit test occurs quite infrequently; the main exception is research on methods for detecting misspecifications of linear or generalized linear regression models (see, e.g., [6][7][8][9]).
Not surprisingly, the discrepancy between a test of goodness-of-fit and a procedure enabling one to establish the respective model is addressed, at least implicitly, in some of the classical expositions of the topic (see, e.g., [5], §§ 25.6-7). The usual recommendation for finding a way around that difficulty is to take steps to increase the power of the test. Since increasing the order of magnitude of the sample size will rarely be feasible, this preferably means checking the p-value against an increased threshold, e.g. 10 instead of 5 percent, and deciding in favor of goodness-of-fit if the test does not reject the null hypothesis even at that increased level of significance. However, it is a basic fact that "inverting" a test of a given null hypothesis H_0 against some alternative H_1, by declaring H_0 to be statistically proven if it cannot be rejected, fails to produce a test controlling the risk of taking an erroneous decision in favor of H_0. A procedure tailored for serving the latter purpose ("proof of the null hypothesis") is what has become quite popular in biostatistics over the last few decades under the name equivalence test. Construction of a test of that kind requires enlarging H_0, before defining it as the new alternative hypothesis, by introducing around the respective point in the parameter space an indifference zone consisting of models deviating from the model of interest by an amount considered still acceptable. The basic requisites for reformulating the testing problem in that way are a suitable measure of distance between the true and the hypothesized model (often called a metric, in a mathematically not fully precise terminology) and a numerical specification of the maximum tolerable value of that distance (called the equivalence margin in biostatistical contexts).
Up to now, equivalence tests for goodness-of-fit problems have been made available only for establishing models for discrete distributions (see [10], Ch. 9). In Section 2, we develop a framework for equivalence testing in settings where the objective is to establish goodness-of-fit of some hypothesized continuous distribution (like the standard normal law) to the true distribution underlying a dataset under analysis. In the proposed hypotheses formulation, the indifference zone around the model to be established consists of all Lehmann alternatives [11] to the corresponding cumulative distribution function (cdf) for which the ratio θ, say, between the value of the true and the hypothetical cdf at any point in the sample space falls in a sufficiently narrow interval around unity. Except for referring to cdf's rather than survivor functions, which give the probabilities in the right-hand tail of a distribution, the Lehmann parameter θ coincides with what plays, under the term 'hazard ratio', a prominent role in statistical survival analysis. In Section 3, a uniformly (in θ) most powerful, exact test for a hypothesis of this form is derived and shown to have a power function depending only on the Lehmann parameter θ, not on the cdf one aims to fit to the data. In Section 4, results of studying the UMP test by means of exact numerical methods are presented, focusing on comparisons with tests available for grouped data taken from the distribution to be fitted. The question of how to extend the approach to settings where the model to be established involves unknown parameters (like location and scale) is addressed in Section 5. An illustrating example is presented in Section 6.

Assumptions and hypotheses formulation
Throughout we assume that the assessment of goodness-of-fit of the distribution of interest can be based on a random sample X_1, . . ., X_n of size n ∈ N of mutually independent observations. The common distribution of these observations is assumed to be of the continuous type, with F as the cdf. The cdf of the distribution to be fitted will be denoted by F_0 and assumed to have support on some (maybe unbounded) interval on the real line. A reasonable basis for constructing a region in the space of all continuous cdf's on the real line which can be considered equivalent (i.e., sufficiently similar) to the cdf F_0 specified under the model of interest consists of including all Lehmann alternatives F(·) = [F_0(·)]^θ for which the maximum vertical distance between F and F_0 does not exceed some suitably chosen margin δ > 0. Making this idea precise, we start with defining equivalence of the hypothesized to the true distribution by the condition

||F − F_0|| ≡ sup_x |F(x) − F_0(x)| < δ .    (1)

Modifying in the straightforward way the proof given in [12] for survival rather than distribution functions, it can be shown that for Lehmann alternatives F = F_0^θ, condition (1) is equivalent to

|θ^{1/(1−θ)} − θ^{θ/(1−θ)}| < δ .    (2)

From basic properties of the expression on the left-hand side of (2) as a function of θ, we can eventually conclude that the goodness-of-fit criterion (1) is satisfied if and only if there holds

1/(1 + ε) < θ < 1 + ε    (3)

for a suitable ε > 0. Given the equivalence margin δ to ||F − F_0||, the corresponding value of ε is obtained by solving the equation

(1 + ε)^{−1/ε} − (1 + ε)^{−(1+ε)/ε} = δ .    (4)

Table 1 shows the values of ε determined from Eq (4) for a selection of customary specifications of δ, together with the right-hand limit of the equivalence range for log θ, the quantity one is used to considering as the parameter of interest in the proportional hazards model for survival distributions [13].
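Eq (4) has no closed-form solution for ε, but its left-hand side is increasing in ε, so bisection suffices. A minimal stdlib-only Python sketch (the paper's own scripts are in SAS/IML and R; the function names here are illustrative, not the paper's):

```python
import math

def delta_of_eps(eps):
    """Left-hand side of Eq (4): maximum vertical distance between F_0 and
    the Lehmann alternative F_0^theta at the boundary value theta = 1 + eps."""
    t = 1.0 + eps
    return t ** (-1.0 / eps) - t ** (-t / eps)

def eps_of_delta(delta, lo=1e-8, hi=50.0):
    """Solve Eq (4) for eps by bisection (delta_of_eps is increasing in eps)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if delta_of_eps(mid) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(eps_of_delta(0.15), 4))  # close to 0.5077, the value used later in the paper
```

For δ = 0.15 this yields ε ≈ 0.5077, matching the specification referred to in Sections 4-6.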

A uniformly most powerful test for establishing goodness-of-fit in the continuous case
Adopting the conceptualization of the notion of goodness-of-fit of a specified distribution with cdf F_0 to a sample from an unknown continuous univariate distribution proposed in the Introduction, and letting the indifference zone around the model consist of Lehmann alternatives to F_0, we need a test for the problem

H: 0 < θ ≤ 1/(1 + ε) or θ ≥ 1 + ε versus K: 1/(1 + ε) < θ < 1 + ε .    (5)

In deriving an optimum solution to this problem, we will make use of the following

Lemma 1. For i = 1, . . ., n, let X_i have the cdf F_0^θ, and define

θ̂_n(F_0) = −n / Σ_{i=1}^n log F_0(X_i) .    (6)

Then, the distribution of θ̂_n(F_0) does not depend on F_0, and its cdf is, for any θ > 0, exactly given by

P[θ̂_n(F_0) ≤ t] = 1 − X²_{2n}(2θn/t), t > 0 ,    (7)

where X²_{2n}(·) denotes the cdf of a central chi-square distribution with ν = 2n degrees of freedom.
Proof. Due to the well-known basic property of the probability-integral transform, we can write for any i ∈ {1, . . ., n}: [F_0(X_i)]^θ ~ U(0, 1), so that −θ log F_0(X_i) has cdf G_{1,1}(·), with G_{1,1}(·) denoting the cdf of a gamma distribution with parameters (1, 1). Consequently, −θ Σ_{i=1}^n log F_0(X_i) follows a gamma distribution with parameters (n, 1), or equivalently, −2θ Σ_{i=1}^n log F_0(X_i) ~ χ²_{2n}. This implies

P[θ̂_n(F_0) ≤ t] = P[−2θ Σ_{i=1}^n log F_0(X_i) ≥ 2θn/t] = 1 − X²_{2n}(2θn/t), t > 0 .

The following proposition states that there is an optimum test for (5) and describes the computational steps to be taken in order to carry out that procedure.

Proposition 1. Let the common distribution of the X_i under θ = 1 be absolutely continuous with density f_0(·). Then, there exists a uniformly most powerful level-α test for (5) which rejects the null hypothesis H of relevant deviations of the true from the hypothesized distribution if and only if it turns out that

C¹_n(α; ε) < θ̂_n(F_0) < C²_n(α; ε) .    (8)

The critical constants C¹_n(α; ε), C²_n(α; ε) have to be determined through solving the equations

X²_{2n}(2n/[(1 + ε)C¹_n]) − X²_{2n}(2n/[(1 + ε)C²_n]) = α = X²_{2n}(2n(1 + ε)/C¹_n) − X²_{2n}(2n(1 + ε)/C²_n) .    (9)

Proof. For absolutely continuous F_0, X_i ~ F_0^θ ∀ i implies that the joint density g_θ^{(n)}, say, of the sample (X_1, . . ., X_n) is given by

g_θ^{(n)}(x_1, . . ., x_n) = θ^n exp{(θ − 1) Σ_{i=1}^n log F_0(x_i)} Π_{i=1}^n f_0(x_i) .

Thus, g_θ^{(n)} is an element of a 1-parameter exponential family, with Σ_{i=1}^n log F_0(X_i) as a sufficient statistic for θ. Hence, a well-known theorem on the existence of UMP tests for equivalence hypotheses about parameters of families of that structure (cf. [10], Appendix A.1) applies, according to which a UMP level-α test for (H, K) has rejection region {c̃_1 < Σ_{i=1}^n log F_0(X_i) < c̃_2}, where (c̃_1, c̃_2) solves the equations

P_{θ_1}[c̃_1 < Σ_{i=1}^n log F_0(X_i) < c̃_2] = α = P_{θ_2}[c̃_1 < Σ_{i=1}^n log F_0(X_i) < c̃_2] ,    (10)

with θ_1 = 1/(1 + ε), θ_2 = 1 + ε. In view of θ̂_n ≡ −n/Σ_{i=1}^n log F_0(X_i), Lemma 1 implies that a unique solution of (10) is obtained by setting c̃_ν = −n/C^{3−ν}_n(α; ε), ν = 1, 2, with C^k_n(α; ε), k = 1, 2, being defined as in (8), (9).

Remark 1. The optimal critical constants C¹_n(α; ε), C²_n(α; ε) are not given explicitly but have to be calculated from (9) by means of a suitable numerical algorithm. All results to be presented in the subsequent sections were obtained by means of the program provided as Supplementary Material under the name UMPTestforGoF, both as a SAS/IML and an R script.
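Since [F_0(X_i)]^θ is uniformly distributed, the distribution-free claim of Lemma 1 can be checked by simulation without choosing any particular F_0. A small stdlib-only Python sketch (illustrative only, not part of the paper's supplementary code):

```python
import math, random

def chi2_cdf_even_df(x, df):
    # chi-square cdf for EVEN df, via the closed form
    # P(X <= x) = 1 - exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    # each term computed in log space for numerical stability
    s = sum(math.exp(k * math.log(x / 2) - math.lgamma(k + 1) - x / 2)
            for k in range(df // 2))
    return 1.0 - s

def simulate_theta_hat(n, theta, rng):
    # F_0(X_i)^theta ~ U(0,1)  =>  log F_0(X_i) = log(U_i) / theta,
    # so theta_hat can be drawn without specifying F_0 at all
    s = sum(math.log(rng.random()) for _ in range(n)) / theta
    return -n / s

rng = random.Random(1)
n, theta, t, reps = 10, 1.3, 1.5, 20000
emp = sum(simulate_theta_hat(n, theta, rng) <= t for _ in range(reps)) / reps
exact = 1.0 - chi2_cdf_even_df(2 * theta * n / t, 2 * n)
print(abs(emp - exact))  # small, consistent with Eq (7)
```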
Remark 2. In terms of both the algorithm for determining the critical constants and its power function, the UMP test given by (8) is completely distribution-free. The cdf F_0 defining the model whose goodness-of-fit one wants to establish is only used for computing the test statistic.
Remark 3. Transforming the sufficient statistic Σ_{i=1}^n log F_0(X_i) to the equivalent statistic θ̂_n(F_0) = −n/Σ_{i=1}^n log F_0(X_i) in writing down the decision rule of the UMP test is simply a matter of conceptual convenience: θ̂_n(F_0) is easily shown to be the ordinary ML estimator of the Lehmann parameter θ and thus a quantity straightforward to interpret.

Table 2 gives a tabulation of the critical values and the power against the null alternative θ = 1 of the exact UMP test derived in Section 3, for three different choices of the equivalence margin ε and sample sizes n ranging from 10 to 200. As usual, the significance level α is chosen to be 0.05 throughout. Comparing the entries in different lines of the same block of the table reveals the effect of increasing the sample size on basic characteristics of the test: the left- and right-hand limits of the critical interval, which has to be checked for inclusion of the observed value of the ML estimator θ̂_n(F_0), are monotonically decreasing and increasing, respectively, in n, in a way making the corresponding intervals a nested sequence of sets. Furthermore, as is mandatory for any reasonable test for the problem put forward in (5), the power likewise increases monotonically in n to unity. The effect of increasing the equivalence margin ε becomes obvious from comparisons between homologous entries in the different blocks of the table. As is clearly to be expected, the critical interval shrinks in length as ε decreases, and the power attainable with a given sample size declines fairly rapidly as the margin is tightened.
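The numerical determination of the critical constants from (9) can be sketched as follows, in a stdlib-only Python analogue of the UMPTestforGoF scripts (the nested bisection is one workable scheme, not necessarily the paper's; all names are illustrative). It exploits that ν = 2n is always even, so the chi-square cdf has a closed form:

```python
import math

def chi2_cdf(x, df):
    # chi-square cdf for EVEN df (here df = 2n), via the closed form
    # P(X <= x) = 1 - exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    # with each term computed in log space for numerical stability
    s = sum(math.exp(k * math.log(x / 2) - math.lgamma(k + 1) - x / 2)
            for k in range(df // 2))
    return 1.0 - s

def chi2_ppf(p, df, lo=1e-9, hi=1e7):
    # quantile function by bisection (chi2_cdf is increasing in x)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_constants(n, eps, alpha=0.05):
    """Solve Eq (9) for (C1, C2): by Eq (7), the rejection probability of
    {C1 < theta_hat < C2} under theta equals
    chi2_cdf(2*theta*n/C1) - chi2_cdf(2*theta*n/C2),
    which must equal alpha at theta = 1/(1+eps) and at theta = 1+eps."""
    df = 2 * n
    a = 2 * n / (1 + eps)   # 2 * theta_1 * n
    b = 2 * n * (1 + eps)   # 2 * theta_2 * n

    def c2_given_c1(c1):
        # first equation of (9) solved for C2 at fixed C1
        p = max(chi2_cdf(a / c1, df) - alpha, 1e-12)
        return a / chi2_ppf(p, df)

    def g(c1):
        # second equation of (9) as a root-finding problem in C1
        return chi2_cdf(b / c1, df) - chi2_cdf(b / c2_given_c1(c1), df) - alpha

    lo = 0.05                     # heuristic lower bound: g(lo) < 0 in the cases tried
    hi = a / chi2_ppf(alpha, df)  # keeps chi2_cdf(a/C1) >= alpha
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    c1 = 0.5 * (lo + hi)
    return c1, c2_given_c1(c1)

c1, c2 = critical_constants(100, 0.5077)
print(round(c1, 4), round(c2, 4))  # Table 2 lists (0.7883, 1.2887) for this case
```

For n = 100, ε = 0.5077 and α = 0.05, this scheme reproduces, up to rounding, the critical interval (0.7883, 1.2887) quoted from Table 2 in the example of Section 6.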

Numerical results on the UMP test for goodness-of-fit
Recalling the standard approaches to testing for lack-of-fit of a fully specified distribution to that underlying a given dataset, it seems natural to compare the new test with a goodness-of-fit testing procedure which uses grouped data. In the context of lack-of-fit testing, it is often recommended (see, e.g., [5], §25.22) to choose for grouping classes of equal probability in terms of the distribution F_0 under assessment. We focus on the best known and perhaps most interesting special case that the distribution one aims to fit is the standard normal N(0, 1), so that there holds F_0 = Φ, and that the number k, say, of classes to be formed equals 5. A partition of the range of X which is in line with that recommendation is given by the intervals (−∞, Φ^{−1}(1/5)], (Φ^{−1}(1/5), Φ^{−1}(2/5)], . . ., (Φ^{−1}(4/5), ∞). The goodness-of-fit test for multinomial distributions established in Ch. 9.1 of [10] defines equivalence between multinomial distributions of common dimension k in terms of the Euclidean distance d(π, π°) between the corresponding parameter vectors π and π°, where π refers to the unknown distribution underlying the data and π° to the model to be fitted. With the grouped-data problem under consideration, we have π°_j = 1/5 ∀ j = 1, . . ., 5, and, with ε = 0.5077, the Euclidean distance of the vector π^{(1+ε)} with components π^{(1+ε)}_j = (j/5)^{1+ε} − ((j − 1)/5)^{1+ε} from π° is computed to be d(π^{(1+ε)}, π°) = ε° = 0.1548. Here, π^{(1+ε)} is the parameter vector of the multinomial distribution into which the uniform distribution on {1, . . ., 5}, generated from N(0, 1) by means of the chosen partition, is mapped through moving θ to the (right-hand) boundary of its equivalence range. Thus, it seems reasonable to consider the problem

d(π, π°) ≥ ε° versus d(π, π°) < ε°, with ε° = 0.1548 ,    (11)

concerning the parameter π of a multinomial distribution over {1, . . ., 5}, as the grouped-data analogue of the testing problem to which the results shown in the middle block of Table 2 relate. As shown in Wellek, loc. cit., an asymptotically valid test of (11) is given by the following decision rule: reject the null hypothesis of (11) if and only if

√n [d²(p̂, π°) − (ε°)²] / v̂_n < z_α ,    (12)

where

v̂²_n = 4 [Σ_{j=1}^5 p̂_j (p̂_j − π°_j)² − (Σ_{j=1}^5 p̂_j (p̂_j − π°_j))²]    (13)

and p̂ denotes the vector of relative frequencies of observations falling in the different classes.
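The margin ε° = 0.1548 is easy to reproduce from the class probabilities under the Lehmann alternative θ = 1 + ε. A short stdlib-only Python sketch (illustrative):

```python
import math

eps = 0.5077
k = 5
# class probabilities of the equal-probability partition under theta = 1 + eps:
# p_j = (j/k)^(1+eps) - ((j-1)/k)^(1+eps), j = 1, ..., k
p = [(j / k) ** (1 + eps) - ((j - 1) / k) ** (1 + eps) for j in range(1, k + 1)]
# Euclidean distance from the uniform parameter vector (1/5, ..., 1/5)
d = math.sqrt(sum((pj - 1 / k) ** 2 for pj in p))
print(round(d, 4))  # the equivalence margin used in the grouped-data problem (11)
```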
For small to moderate sample sizes, the power of the grouped-data goodness-of-fit test given by (12) and (13) falls markedly short of that attained by the exact UMP test based on the ungrouped observations. Thus, the fact well known for the lack-of-fit case (see, e.g., [14], Ch. 27) that grouping entails substantial losses in efficacy has to be stated for problems of testing for goodness-of-fit as well.

A glance at options for testing for goodness-of-fit of distributions involving nuisance parameters
Upon noticing the results presented in Sections 3 and 4, it seems natural to ask whether the approach admits generalization to settings where one aims to fit a distribution involving unknown nuisance parameters rather than one being fully specified. The best known special case of such a problem is testing for normality, which is to say that the hypothesis of interest states that, except for differences one is willing to accept, the distribution underlying the data has cdf F = F_0((· − μ)/σ), with F_0 = Φ and arbitrary (μ, σ) ∈ R × R₊. Adapting the hypotheses formulation proposed in the case of a fully specified distributional model to this setting is straightforward, leading to consideration of the testing problem

H: (θ, μ, σ) ∈ ((−∞, 1/(1 + ε)] ∪ [1 + ε, ∞)) × R × R₊ versus K: (θ, μ, σ) ∈ (1/(1 + ε), 1 + ε) × R × R₊ .    (14)

As before, θ denotes the Lehmann parameter inducing potential deviations of the true distribution from the distribution of the form assumed under the model to be fitted. A promising and often successful approach to the construction of equivalence tests in multi-parameter families of distributions uses the maximum likelihood estimator of the parameter of interest as pivotal quantity (for the theoretical basis of that approach see [10], Ch. 3.4). The log-likelihood function associated with a sample (X_1, . . ., X_n) from the distribution with cdf [Φ((· − μ)/σ)]^θ is readily obtained to be

l(x_1, . . ., x_n; θ, μ, σ) = n log θ + (θ − 1) Σ_{i=1}^n log Φ((x_i − μ)/σ) + Σ_{i=1}^n log φ((x_i − μ)/σ) − n log σ ,    (15)

with φ denoting the standard normal density, and its first-order partial derivatives, writing z_i = (x_i − μ)/σ, are

∂l/∂θ = n/θ + Σ_{i=1}^n log Φ(z_i) ,
∂l/∂μ = (1/σ)[Σ_{i=1}^n z_i − (θ − 1) Σ_{i=1}^n φ(z_i)/Φ(z_i)] ,
∂l/∂σ = (1/σ)[Σ_{i=1}^n z_i² − n − (θ − 1) Σ_{i=1}^n z_i φ(z_i)/Φ(z_i)] .    (16)

Solving the corresponding system of score equations by means of the Newton-Raphson algorithm or an alternative numerical technique is an easy exercise, and the roots almost surely exist. The same cannot be said of the maximum likelihood estimator: examining the function (θ, μ, σ) ↦ l(x_1, . . ., x_n; θ, μ, σ) for various datasets revealed that it almost surely fails to attain a global maximum in the interior of the parameter space.
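A quick sanity check on Eq (15): at θ = 1 the (θ − 1)-term vanishes, so (15) must coincide with the ordinary N(μ, σ²) log-likelihood. A stdlib-only Python sketch (the sample values are arbitrary):

```python
import math

def loglik(x, theta, mu, sigma):
    """Log-likelihood (15) of the Lehmann-alternative-to-normality model."""
    n = len(x)
    z = [(xi - mu) / sigma for xi in x]
    log_Phi = [math.log(0.5 * (1 + math.erf(zi / math.sqrt(2)))) for zi in z]
    log_phi = [-0.5 * zi ** 2 - 0.5 * math.log(2 * math.pi) for zi in z]
    return (n * math.log(theta) + (theta - 1) * sum(log_Phi)
            + sum(log_phi) - n * math.log(sigma))

x = [-1.2, 0.3, 0.8, -0.5, 1.7]
mu, sigma = 0.1, 1.2
# ordinary normal log-likelihood of the same sample
normal_ll = sum(-0.5 * ((xi - mu) / sigma) ** 2
                - math.log(sigma * math.sqrt(2 * math.pi)) for xi in x)
print(abs(loglik(x, 1.0, mu, sigma) - normal_ll))  # ~0
```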
Hence, carrying out a construction requiring that the MLE of θ exists and is asymptotically normal is not practicable here. In contrast to the MLE, the score statistic

U(θ) ≡ (∂/∂θ) l(x_1, . . ., x_n; θ, μ̂(θ), σ̂(θ)) ,

with (μ̂(θ), σ̂(θ)) as the solution to ∂l/∂μ = 0, ∂l/∂σ = 0 for fixed θ, is almost surely well defined. However, no approach to basing tests for interval hypotheses on a statistic of this form is at hand. An option for making use of U(θ) anyway for the purpose in mind is to split up the equivalence testing problem (14) into the two one-sided testing problems

H_1l: (θ, μ, σ) ∈ (−∞, 1/(1 + ε)] × R × R₊ vs. K_1l: (θ, μ, σ) ∈ (1/(1 + ε), ∞) × R × R₊    (17.1)

and

H_1r: (θ, μ, σ) ∈ [1 + ε, ∞) × R × R₊ vs. K_1r: (θ, μ, σ) ∈ (−∞, 1 + ε) × R × R₊ .    (17.2)

Rejecting H_1l and H_1r when it turns out that there holds U(1/(1 + ε))/ṽ_l > z_{1−α} and U(1 + ε)/ṽ_r < z_α, respectively, yields asymptotically valid tests for (17.1) and (17.2), provided ṽ²_l consistently estimates the asymptotic variance of U(1/(1 + ε)) under θ = 1/(1 + ε) and ṽ²_r that of U(1 + ε) under θ = 1 + ε. As usual, z_γ stands for the γ-quantile of N(0, 1) for arbitrary γ ∈ (0, 1).
The last step of the construction consists in combining the two score tests for the one-sided problems (17.1) and (17.2) into a test for the two-sided equivalence problem (14) to be solved when interest is in establishing goodness-of-fit. The combined test rejects the null hypothesis H of (14) if and only if both of the critical inequalities U(1/(1 + ε))/ṽ_l > z_{1−α} and U(1 + ε)/ṽ_r < z_α are found to hold true. According to a well-known and frequently exploited principle from the theory of equivalence testing (cf. [10], Ch. 7.1; [16]), the asymptotic validity of the one-sided tests with rejection regions {U(1/(1 + ε))/ṽ_l > z_{1−α}} and {U(1 + ε)/ṽ_r < z_α} implies that the test for goodness-of-fit of a distribution from the family F_0((· − μ)/σ) carried out in this way is likewise asymptotically valid in terms of the significance level. The facts stated so far admit the conclusion that, in theory, testing for goodness-of-fit of distributional models involving unknown parameters is not an insurmountable challenge for statistical inference. However, upon studying the power of such a test, this judgement can hardly be maintained: Determining the rejection probability of the double one-sided score-test procedure described above by means of Monte Carlo simulation of normally distributed data reveals that even against the null alternative of perfect fit of the model [θ = 1] and for a choice of the equivalence margin of moderate strictness [ε = 0.5077, corresponding to a maximum acceptable vertical distance between the cdf's of δ = 0.15; recall Table 1], several thousands (!) of observations are required in order to attain a power of 80%. Thus, establishing goodness-of-fit of some distributional shape rather than of a specific element of the corresponding family of distributions (like N(μ, σ²)) by means of a testing procedure providing satisfactory control over both kinds of error risk is hardly an option for practice.
Illustrating example
Fig 2 shows the plot of the empirical cdf of a simulated random sample of size n = 100 from N(0, 1), together with the theoretical cdf Φ to be assessed for goodness-of-fit to these data.
In order to apply the UMP test of Section 3 with F_0 = Φ, the maximum likelihood estimate of the Lehmann parameter θ has to be determined. With the data behind the empirical cdf plotted above and F_0 = Φ, evaluating Eq (6) yields θ̂_n(F_0) = 1.0793. If the equivalence margin for θ is chosen as ε = 0.5077, it is seen from Table 2 that the observed value of θ̂_n(F_0) has to be checked for inclusion in the interval (0.7883, 1.2887). Since 1.0793 is an inner point of this interval, the conclusion is that the goodness-of-fit test for standard normality at level α = 0.05 leads, under the chosen specifications, to rejecting the null hypothesis of lack-of-fit.
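In code, the decision step of this example reduces to a simple interval check (a sketch; the constants are those read off Table 2, and the function name is illustrative):

```python
def ump_gof_decision(theta_hat, c1, c2):
    """Reject the null hypothesis of lack-of-fit iff C1 < theta_hat < C2 (Eq (8))."""
    return c1 < theta_hat < c2

# values from the example: theta_hat = 1.0793, critical interval from Table 2
reject_lack_of_fit = ump_gof_decision(1.0793, 0.7883, 1.2887)
print(reject_lack_of_fit)  # True: goodness-of-fit established at level 0.05
```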

Discussion
As major strengths of the procedure obtained in the core of this paper for testing for goodness-of-fit of a fully specified distribution of the continuous type to the distribution underlying a given random sample, the following facts can be adduced:

(i). The alternative hypothesis which can be declared established upon a positive result states that the model fits the data sufficiently well, rather than meriting rejection because of marked discrepancies from the true distribution.
(ii). The method is fully exact and satisfies the strongest of the optimality criteria that have been in use for hypothesis tests since the beginnings of classical frequentist inference.
(iii). Due to the availability of fairly compact source code both in SAS/IML and R, the practical implementation of the test, as well as of the algorithm for exact power and sample size computation, is fast and easy.

(iv). The primary metric, in terms of which the region of distribution functions defined as equivalent to the distribution function specified by the model is delimited, is fully intuitive also for applied research workers without advanced statistical training. It is the same metric in terms of which alternatives to the null hypothesis of perfect fit of the model are expressed in the Kolmogorov test.
Admittedly, the last of these advantages comes into play only as long as one is willing to rely on the semiparametric model which assumes that the true distribution function underlying the data differs from that specified by the model through raising all values of the latter to the θth power. Except for ignoring right-censoring and applying the transformation u ↦ u^θ to cumulative distribution rather than survivor functions, this model is the same as Cox's [13] well-known proportional hazards model. If the proposed test does not lead to a decision in favor of equivalence between the actual and the hypothesized distribution, one cannot rule out the possibility that the true distribution differs from F_0 nowhere by more than the given margin δ without satisfying the modified Cox model.