NIH Public Access Author Manuscript; accepted for publication in a peer-reviewed journal.
Psychol Sci. Author manuscript; available in PMC Jun 1, 2006.
Published in final edited form as:
PMCID: PMC1473027

An Alternative to Null-Hypothesis Significance Tests


The statistic prep estimates the probability of replicating an effect. It captures traditional publication criteria for signal-to-noise ratio, while avoiding parametric inference and the resulting Bayesian dilemma. In concert with effect size and replication intervals, prep provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference.

Psychologists, who rightly pride themselves on their methodological expertise, have become increasingly embarrassed by “the survival of a flawed method” (Krueger, 2001) at the heart of their inferential procedures. Null-hypothesis significance tests (NHSTs) provide criteria for separating signal from noise in the majority of published research. They are based on inferred sampling distributions, given a hypothetical value for a parameter such as a population mean (μ) or difference of means between an experimental group (μE) and a control group (μC; e.g., H0: μE − μC = 0). Analysis starts with a statistic on the obtained data, such as the difference in the sample means, D. D is a point on the real line, and so has a probability mass of zero. It is necessary to relate that point to some interval in order to engage probability theory. Neyman and Pearson (1933) introduced critical intervals over which the probability of observing a statistic is less than a stipulated significance level, α (e.g., z scores in [−∞, −2] or in [+2, +∞], over which α < .05). If a statistic falls within those intervals, it is deemed significantly different from that expected under the null hypothesis. Fisher (1959) preferred to calculate the probability of obtaining a statistic larger than |D| over the interval [|D|, ∞]. This probability, p(x ≥ D|H0), is called the p value of the statistic. Researchers typically hope to obtain a p value sufficiently small (viz., less than α) so that they can reject the null hypothesis.

This is where problems arise. Fisher (1959), who introduced NHST, knew that “such a test of significance does not authorize us to make any statement about the hypothesis in question in terms of mathematical probability” (p. 35). This is because such statements concern p(H0|x ≥ D), which does not generally equal p(x ≥ D|H0). The confusion of one conditional for the other is analogous to the conversion fallacy in propositional logic. Bayes showed that p(H|x ≥ D) = p(x ≥ D|H)p(H)/p(x ≥ D). The unconditional probabilities are the priors, and are largely unknowable. Fisher (1959) allowed that p(x ≥ D|H0) may “influence [the null’s] acceptability” (p. 43). Unfortunately, absent priors, “P values can be highly misleading measures of the evidence provided by the data against the null hypothesis” (Berger & Selke, 1987, p. 112; also see Nickerson, 2000, p. 248). This constitutes a dilemma: On the one hand, “a test of significance contains no criterion for ‘accepting’ a hypothesis” (Fisher, 1959, p. 42), and on the other, we cannot safely reject a hypothesis without knowing the priors. Significance tests without priors are the “flaw in our method.”

There have been numerous thoughtful reviews of this foundational issue (e.g., Nickerson, 2000), attempts to make the best of the situation (e.g., Trafimow, 2003), proposals for alternative statistics (e.g., Loftus, 1996), and defenses of significance tests and calls for their abolition alike (e.g., Harlow, Mulaik, & Steiger, 1997). When so many experts disagree on the solution, perhaps the problem itself is to blame. It was Fisher (1925) who focused the research community on parameter estimation “so convincingly that for the next 50 years or so almost all theoretical statisticians were completely parameter bound, paying little or no heed to inference about observables” (Geisser, 1992, p. 1). But it is rare for psychologists to need estimates of parameters; we are more typically interested in whether a causal relation exists between independent and dependent variables (but see Krantz, 1999; Steiger & Fouladi, 1997). Are women attracted more to men with symmetric faces than to men with asymmetric faces? Does variation in irrelevant dimensions of stimuli affect judgments on relevant dimensions? Does review of traumatic events facilitate recovery? Our unfortunate historical commitment to significance tests forces us to rephrase these good questions in the negative, attempt to reject those nullities, and be left with nothing we can logically say about the questions—whether p = .100 or p = .001. This article provides an alternative, one that shifts the argument by offering “a solution to the question of replicability” (Krueger, 2001, p. 16).


Consider an experiment in which the null hypothesis—no difference between experimental and control groups—can be rejected with a p value of .049. What is the probability that we can replicate this significance level? That depends on the state of nature. In this issue, as in most others, NHST requires us to take a stand on things that we cannot know. If the null is true, ceteris paribus we shall succeed—get a significant effect—5% of the time. If the null is false, replicability depends on the population effect size, δ. Power analysis varies the hypothetical discrepancy between the means of control and experimental populations, giving the probability of appropriately rejecting the null under those various assumptive states of nature. This awkward machinery is seldom invoked outside of grant proposals, whose review panels demand an n large enough to provide significant returns on funding.

Greenwald, Gonzalez, Guthrie, and Harris (1996) reviewed the NHST controversy and took the first clear steps toward a useful measure of replicability. They showed that p values predict the probability of getting significance in a replication attempt when the measured effect size, d′, equals the population effect size, δ. This postulate, δ = d′, complements NHST’s δ = 0, while making better use of the available data (i.e., the observed d′ > 0). But replicating “significance” replicates the dilemma of significance tests: Data can speak to the probability of H0 and the alternative, HA, only after we have made a commitment to values of the priors. Abandoning the vain and unnecessary quest for definitive statements about parameters frees us to consider statistics that predict replicability in its broadest sense, while avoiding the Bayesian dilemma.

The Framework

Consider an experimental group and an independent control group whose sample means, ME and MC, differ by a score of D. The corresponding dimensionless measure of effect size d′ (called d by Cohen, 1969; g by Hedges & Olkin, 1985; and d′ in signal detectability theory) is

$$d' = \frac{M_E - M_C}{s_p}, \tag{1}$$

where sp is the pooled within-group standard deviation. If the experimental and control populations are normal and the total sample size is greater than 20 (nE + nC = n > 20), the sampling distribution of d′ is approximately normal (Hedges & Olkin, 1985; see the top panel of Fig. 1 and the appendix):

$$d' \sim N(\delta, \sigma_d). \tag{2}$$

Fig. 1
Sampling distributions of effect size (d). The top panel shows a distribution for a population effect size of δ = 0.1; the experiment yielded an effect size of 0.3, and thus had a sampling error Δ = d1 ′ − δ = 0.2. ...

σd is the standard error of the estimate of effect size, the square root of

$$\sigma_d^2 \approx \frac{n^2}{n_E\, n_C\,(n - 4)}, \tag{3}$$

for n > 4. When nE = nC, Equation 3 reduces to σd² ≈ 4/(n − 4).

Define replication as an effect of the same sign as that found in the original experiment. The probability of a replication attempt having an effect d2′ greater than zero, given a population effect size of δ, is the area to the right of 0 in the sampling distribution centered at δ (middle panel of Fig. 1). Unfortunately, we do not know the value of the parameter δ and must therefore eliminate it.

Eliminating δ

Define the sampling error, Δ, as Δ = d′ − δ (Fig. 1, top panel). For the original experiment, this equation may be rewritten as δ = d1′ − Δ1. Replication requires that if d1′ is greater than 0, then d2′ is also greater than 0, that is, that d2′ = δ + Δ2 > 0. Substitute d1′ − Δ1 in place of δ in this equation. Replication thus requires that d2′ = d1′ − Δ1 + Δ2 > 0. The expectation of each sampling error is 0, with variance σd². For independent replications, the variances add, so that d2′ ~ N(d1′, σdR), with σdR = √2·σd. The probability of replication, prep, is the area of the distribution for which d′ is greater than 0, shaded in the bottom panel of Figure 1:

$$p_{\mathrm{rep}} = \int_{0}^{\infty} n(x;\, d_1',\, \sigma_{dR})\,dx. \tag{4}$$

Slide the distribution to the left by the distance d1′ to see that Equation 4 describes the same area as

$$p_{\mathrm{rep}} = \int_{-d_1'}^{\infty} n(x;\, 0,\, \sigma_{dR})\,dx = \int_{-\infty}^{d_1'} n(x;\, 0,\, \sigma_{dR})\,dx. \tag{5}$$

It is easiest to calculate prep from the right integral in Equation 5, by consulting a normal probability table for the cumulative probability up to

$$z = \frac{d_1'}{\sigma_{dR}}. \tag{6}$$

Suppose an experiment with nE = nC = 12 yields a difference between experimental and control groups of 5.0 with sp = 10.0. This gives an effect of d1′ = 0.5 (Equation 1) with a variance of σd1² ≈ 4/(24 − 4) = 0.20 (Equation 3), and a replication variance of σdR² = 2 · σd1² ≈ 0.40. From this, it follows that z = 0.5/√0.40 = 0.79 (Equation 6). A table of the normal distribution assigns a prep of .785.¹
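The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch rather than the author's spreadsheet: the function names are illustrative, and the general unequal-n form of the sampling variance is an assumption chosen so that it reduces to the 4/(n − 4) stated in the text when the groups are equal.

```python
from math import sqrt
from statistics import NormalDist


def sigma_d_squared(n_e, n_c):
    """Approximate sampling variance of d' (assumed general form of Equation 3).

    Reduces to 4 / (n - 4) when n_e == n_c, as stated in the text.
    """
    n = n_e + n_c
    if n <= 4:
        raise ValueError("total sample size must exceed 4")
    return n ** 2 / (n_e * n_c * (n - 4))


def p_rep(d1, n_e, n_c):
    """Probability that an equipotent replication yields an effect of the same sign."""
    var_rep = 2.0 * sigma_d_squared(n_e, n_c)  # variances of original and replicate add
    z = d1 / sqrt(var_rep)                     # Equation 6
    return NormalDist().cdf(z)                 # area of N(0, 1) up to z


d1 = 5.0 / 10.0                      # difference of means divided by pooled SD
print(round(p_rep(d1, 12, 12), 3))   # about .785, matching the table lookup
```

The standard normal cumulative distribution function stands in for the printed table, so the result agrees with the text to rounding.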

As the hypothetical number of observations in the replicate approaches infinity, the sampling variance of the replication goes to zero, and prep is the positive area of N(d1′, σd1). This is the sampling distribution of a standard power analysis at the maximum likelihood value for δ, and establishes an upper bound for replicability. It is unlikely, however, that the next investigator will have sufficient resources or interest to approach that upper bound. By default, then, prep is defined for equipotent replications, ones that employ the same number of subjects as the original experiment and experience similar levels of sampling error. The probability of replication may be calculated under other scenarios (as shown later), but for purposes of qualifying the data in hand, equipotency, which doubles the sampling variance, is assumed.

The left panel of Figure 2 shows the probability of replicating the results of an experiment whose measured effect size is d1′ = 0.1 (bottom curve), 0.2, . . . , 1.0, as a function of the number of observations in the original study. These results permit a comparison with traditional measures of significance. The dashed line connects the effect sizes necessary to reject the null under a two-tailed t test, with probability of a Type I error, α, less than .05. Satisfying this criterion is tantamount to establishing a prep of approximately .917.

Fig. 2
Probability of replication (prep) as a function of the number of observations and measured effect size, d1′. The functions in each panel show prep for values of d1′ increasing in steps of 0.1, from 0.10 (lowest curve) to 1.0 (highest curve). ...

Parametric Variance

The calculations presented thus far assume that the variance contributed by contextual variables in the replicate is negligible compared with the sampling error of d. This is the classic fixed-effects model of science. But every experiment is a sample from a population of possible experiments on the topic, and each of those, with its own differences in detail, has its own subspecies of effect size, δi. This is true a fortiori for correlational studies involving different instruments or moderators (Mosteller & Colditz, 1996). The population of effect sizes adds a realization variance, σδ², to the sampling distributions of the original and the replicate (Raudenbush, 1994; Rubin, 1981; van den Noortgate & Onghena, 2003), so that the standard error of effect size in replication becomes

$$\sigma_{dR} = \sqrt{2\left(\sigma_d^2 + \sigma_\delta^2\right)}. \tag{7}$$

In a recent meta-meta-analysis of more than 25,000 social science studies, Richard, Bond, and Stokes-Zoota (2003) reported a mean within-literature variance of σδ² = 0.092 (median = 0.08), corrected for sampling variance (Hedges & Vevea, 1998). The statistic σδ² places an upper limit on the probability of replication, one felt most severely by studies with small effect sizes. This is shown graphically in the right panel of Figure 2. The probability of replication no longer asymptotes at 1.0, but rather at

$$p_{\mathrm{rep(max)}} = \int_{-\infty}^{d_1'} n(x;\, 0,\, \sqrt{2}\,\sigma_\delta)\,dx.$$

At n = 100, the functions shown in the right panel of Figure 2 are no more than 5 points below their asymptotes. Given a representative σδ² of 0.08, for no value of n will a measured effect size of d′ less than 0.52 attain a prep greater than .90; but this standard comes within reach of a sample size of 40 for a d′ of 0.8.
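A short sketch may make the ceiling concrete. It assumes, following the paragraph above, that realization variance adds to the sampling variance of both the original and the replicate, so that the replication variance is 2(σd² + σδ²), and that groups are equal so that σd² ≈ 4/(n − 4). The function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def p_rep_random_effects(d1, n, var_delta):
    """p_rep for total sample size n (equal groups) with realization variance var_delta."""
    var_d = 4.0 / (n - 4)                       # equal-groups sampling variance of d'
    sd_rep = sqrt(2.0 * (var_d + var_delta))    # assumed replication standard error
    return NormalDist().cdf(d1 / sd_rep)


def p_rep_max(d1, var_delta):
    """Asymptote of p_rep as n grows without bound (sampling variance goes to 0)."""
    return NormalDist().cdf(d1 / sqrt(2.0 * var_delta))


# With the median realization variance of 0.08 reported in the text:
print(p_rep_max(0.50, 0.08) < 0.90)                 # True: d' = 0.50 cannot reach .90
print(p_rep_max(0.52, 0.08) > 0.90)                 # True: d' = 0.52 just clears it
print(p_rep_random_effects(0.8, 40, 0.08) > 0.90)   # True: n = 40 suffices for d' = 0.8
```

Under these assumptions the sketch reproduces both thresholds quoted in the text, which suggests that the form of the replication variance used here matches the one intended.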

Reliance on standard hypothesis-testing techniques that ignore realization variance may be one of the causes for the dismayingly common failures of replication. The standard t test will judge an effect of any size significant at a sufficiently large n, even though the odds for replication may be very close to chance. Figure 2 provides understanding, if no consolation, to investigators who have failed to replicate published findings of high significance but low effect size. The odds were never very much in their favor. Setting a replicability criterion for publication that includes an estimate of realization variance would filter the correlational background noise noted by Meehl (1997) and others.

Claiming replicability for an effect that would merely be of the same sign may seem too liberal, when the prior probability of that is 1/2, but traditional null-hypothesis tests are themselves at best merely directional. The proper metric of effect size is d or r, not p or prep. In the present analysis, replicability qualifies effect, not effect size: A d2′ of 2.0 constitutes a failure to replicate an effect size (d1′) of 0.3, but is a strong replication of the effect. Requiring a result to have a prep of .9 exacts a standard comparable to (Fig. 2, left panel) or exceeding (right panel) the standard of traditional significance tests.

Does prep really predict the probability of replication? In a meta-analysis of 37 studies of the psychophysiology of aggression, including unpublished nonsignificant data sets, Lorber (2004) found that 70% showed a negative relation between heart rate and aggressive behavior patterns. The median value of prep over those studies was .71 (.69 assuming σδ² = 0.08). In a meta-analysis of 37 studies of the effectiveness of massage therapy, Moyer, Rounds, and Hannum (2004) found that 83% reported positive effects on various dependent variables; including an estimate of publication bias against negative results reduced this value to 74%. The median value of prep over those studies was .75 (.73 assuming σδ² = 0.08). In a meta-analysis of 45 studies of transformational leadership, Eagly, Johannesen-Schmidt, and van Engen (2003) found that 82% showed an advantage for women, and argued against attenuation by publication bias. The median value of prep over these studies was .79 (dropping to .68 for σδ² = 0.08 because of the generally small effect sizes). Averaging values of prep and counting the proportion of positive results are both inefficient ways of aggregating and evaluating data (Cooper & Hedges, 1994), but such analyses provide face validity for prep, which is intended primarily as a measure of the robustness of studies taken singly.


Whenever an effect size can be calculated (see Rosenthal, 1994, for conversions among indices; Cortina & Nouri, 2000, for analysis of variance designs; Grissom & Kim, 2001, for caveats), so also can prep. Randomization tests, described in the appendix, facilitate computation of prep for complex designs or situations in which assumptions of normality are untenable. Calculation of the n required for a desired prep is straightforward. For a presumptive effect size of δ and realization variance of σδ², calculate the z score corresponding to prep, and employ an n = nE + nC no fewer than

$$n = \frac{8z^2}{\delta^2 - 2\sigma_\delta^2 z^2} + 4. \tag{8}$$

Negative results indicate that the desired prep is unobtainable for that σδ². For example, for δ = 0.8, σδ² = 0.08, and a desired prep = .9, z(.9)² = 1.64, and the minimum n is 40.
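The sample-size calculation can be sketched as follows. The expression comes from solving z = δ/√(2(4/(n − 4) + σδ²)) for n, which assumes equal group sizes. The function names, and the choice to round up to an even total so that the two groups can be equal, are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def min_total_n(delta, var_delta, target_p_rep):
    """Smallest total n (as a real number) giving the target p_rep, or None if unattainable.

    Solves z = delta / sqrt(2 * (4 / (n - 4) + var_delta)) for n.
    """
    if not 0.5 < target_p_rep < 1.0:
        raise ValueError("target p_rep must lie between .5 and 1")
    z_sq = NormalDist().inv_cdf(target_p_rep) ** 2
    denominator = delta ** 2 - 2.0 * var_delta * z_sq
    if denominator <= 0:
        return None                      # the negative-result case: target is out of reach
    return 8.0 * z_sq / denominator + 4.0


def round_up_to_even(n):
    """Round a real-valued n up to the next even integer (equal groups)."""
    return 2 * ceil(n / 2)


n_raw = min_total_n(0.8, 0.08, 0.90)
print(round_up_to_even(n_raw))           # 40, as in the text
print(min_total_n(0.3, 0.08, 0.90))      # None: no sample size is large enough
```

Returning None rather than a negative number makes the unattainable case explicit to the caller.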

Stronger claims than replication of a positive effect are sometimes warranted. An investigator may wish to claim that a new drug is more effective than a standard. The replicability of the data supporting that claim may be calculated by integrating Equation 4 not from 0, but from ds, the effect size of the standard bearer. Editors may prefer to call a result replicable only if it accounts for, say, at least 1% of the variance in the data, for which d′ must be greater than about 0.2 (d′² > 0.04). They may also require that it pass the Akaike criterion for adding a parameter (distinct means for experimental and control groups; Burnham & Anderson, 2002), for which r² must be greater than 1 − e^(−2/n). Together, these constraints define a lower limit for “replicable” at prep ≈ .55. However these minima are set, a fair assessment of σδ is necessary for prep to give investigators a fair assessment of replicability.

The replicability of differences among experimental conditions is calculated the same way as that between experimental and control conditions. Multiple comparisons are made by the conjunction or disjunction of prep: If treatments A and B are independent, each with prep of .80, the probability of replicating both effects is .80 × .80 = .64, and the probability of replicating at least one is 1 − .20 × .20 = .96. The probability of n independent attempts to replicate an experiment all succeeding is prep^n.

As is the case for all statistics, there is sampling variability associated with d′, so that any particular value of prep may be more or less representative of the values found by other studies executed under similar conditions. It is an estimate. Replication intervals (RIs) aid interpretation by reflecting prep onto the measurement axis. Their calculation is the same as for confidence intervals (CIs), but with variance doubled. RIs can be used as equivalence tests for evaluating point predictions. The standard error of estimate conveniently captures 52% of future replications (Cumming, Williams, & Fidler, 2004). This familiar error bar can therefore be interpreted as an approximate 50% RI. In the example given earlier, for σδ = 0, the 50% RI for D is approximately 5 ± √[2(10²/24)] ≈ [2.1, 7.9].
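The 52% figure follows from the doubling of variance. If a replicate statistic is normally distributed around the original with standard deviation √2 times the standard error, then a bar of ±1 standard error spans ±1/√2 replication standard deviations. A minimal check, with an illustrative function name and assuming normality and equipotent replication:

```python
from math import sqrt
from statistics import NormalDist


def replication_capture(k=1.0):
    """Proportion of equipotent replicates falling within +/- k standard errors
    of the original statistic, when replication variance is twice the sampling variance."""
    z = k / sqrt(2.0)                      # half-width in replication-SD units
    return 2.0 * NormalDist().cdf(z) - 1.0


print(round(replication_capture(1.0), 2))  # 0.52

# Half-width, in standard errors, of an exact 50% replication interval
k_50 = NormalDist().inv_cdf(0.75) * sqrt(2.0)
print(round(k_50, 2))                      # 0.95
```

An exact 50% interval is therefore only slightly narrower than the conventional standard error bar, which is why the bar serves as an approximate 50% RI.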


Sampling distributions for replicates involve two sources of variance, leading to a root-2 increase in the standard error over that used to calculate significance. Why incur that cost? Both p and prep are functions of effect size and n, and so convey similar information: The top panel in Figure 3 shows p as the area in the right tail of the sampling distribution of d1′, given the null, and prep as the area in the right tail of the prospective sampling distribution of d2′, given d1′. As d1′ or n varies, prep and p change in complement.

Fig. 3
Complementarity of prep and p. The top panel shows sampling distributions for d1′ given the null (left) and for d2′ given d1 (right). The small black area gives the probability of finding a statistic more extreme than d1 if the null were ...

Recapturing a familiar index of merit is reassuring, as are the familiar calculations involved; but these analyses are not equivalent. Consider the following contrasts:

Intuitive Sense

What is the difference between p values of .05 and .01, or between p values of .01 and .001? If you follow Neyman-Pearson and have set α to be .05, you must answer, “Nothing” (Meehl, 1978). If you follow Fisher, you can say, “The probability of finding a statistic more extreme than this under the null is p.” Now compare those p values, and the oblique responses they support, with their corresponding values of prep shown in the bottom panel of Figure 3. These steps in p values take us from prep of .88 to .95 to .99—increments that are clear, interpretable, and manifestly important to a practicing scientist.

Logical Authority

Under NHST, one can never accept a hypothesis, and is often left in the triple-negative no-man’s land of failure to reject the null. The prep statistic provides a graded measure of replicability that authorizes positive statements about results: “This effect will replicate 100(prep)% of the time” conveys useful information, whatever the value of prep.

Real Power

Traditionally, replication has been viewed as a second successful attainment of a significant effect. The probability of getting a significant effect in a replicate is found by integrating Equation 4 from a lower limit given by the critical value d* = σdRtα, ν 2 − 2. This calculation does not require that the original study achieved significance. Such analyses may help bridge to the new perspective; but once prep is determined, calculation of traditional significance is a step backward. The curves in Figure 2 predict the replicability of an effect given known results, not the probability of a statistic given the value of a parameter whose value is not given.

Elimination of Errors

Significance level is defined as the probability of rejecting the null when it is true (a Type I error of probability α); power is defined as the probability of rejecting the null when it is false, and not doing so is a Type II error. False premises lead to conclusions that may be logically consistent but empirically invalid, a Type III error. Calculations of p are contingent on the null being true. Because the null is almost always false (Cohen, 1994), investigators who imply that manipulations were effective on the basis of a p less than α are prone to Type III errors. Because prep is not conditional on the truth value of the null, it avoids all three types of error.

One might, of course, be misled by a value of prep that itself cannot be replicated. This can be caused by

  • sampling error: d1′ may deviate substantially from δ (RIs help interpret this risk)
  • failure to include an estimate of σδ² in the replication variance
  • publication bias against small or negative effects
  • the presence of confounds, biased data selection, and other missteps that plague all mapping of particular results to general claims

Because of these uncertainties, prep is only an estimate of the proportion of replication attempts that will be successful. It measures the robustness of a demonstration; its accuracy in predicting the proportion of positive replications depends on the factors just listed.

Greater Confidence

The American Psychological Association (Wilkinson & the Task Force on Statistical Inference, 1999) has called for the increased use of CIs. Unfortunately, few researchers know how to interpret them, and fewer still know where to put them (Cumming & Finch, 2001; Cumming et al., 2004; Estes, 1997; Smithson, 2003; Thompson, 2002). CIs are often drawn centered over the sample statistic, as though it were the parameter; when a CI does not subsume 0, it is often concluded that the null may be rejected. The first practice is misleading, and the second wrong. CIs are derived from sampling distributions of M around a hypostatized μ: |μ − M| will be less than the CI 100p% of the time. But as difference scores, CIs have lost their location. Situating them requires an implicit commitment to parameters—either to μ = 0 for NHST or to μ = M for the typical position of CIs flanking the statistic. Such a commitment, absent priors, runs afoul of the Bayesian dilemma. In contrast, RIs can be validly centered on the statistic to which they refer, and the replication level may be correctly interpreted as the probability that the statistics of future equipotent replications will fall within the interval.

Decision Readiness

Significance tests are said to provide decision criteria essential to science. But it is a poor decision theory that takes no account of prior information and no account of expected values, and in the end lets us decide only whether or not to reject a statistic as improbable under the null. As a graduated measure, prep provides a basis for a richer approach to decision making than the Neyman-Pearson strategy, currently the mode in psychology. Decision makers may compute expected value, E(v), by multiplying prep or its complement by the values they assign outcomes. Let v+(d′) be the value of positive action for an effect size d′, including potential costs for small or contrary effects. Then

$$E(v_+) = \int_{-\infty}^{+\infty} v_+(x)\, n(x;\, d_1',\, \sigma_{dR})\,dx.$$

Comparison with an analogous calculation for E(v−), the expected value of the alternative action, will inform the decision.
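As a rough illustration, the expected value can be approximated by numerical integration over the predicted distribution of the replicate effect. Everything specific here is hypothetical: the function name, the midpoint-rule integration, and the step-shaped value function (a gain of 1 for any positive effect, a loss of 0.5 otherwise) are chosen only to show the mechanics.

```python
from math import sqrt
from statistics import NormalDist


def expected_value(value_fn, d1, sd_rep, steps=20000, span=8.0):
    """Midpoint-rule approximation of the integral of value_fn(x) * n(x; d1, sd_rep)."""
    dist = NormalDist(mu=d1, sigma=sd_rep)
    low = d1 - span * sd_rep                 # integrate over +/- span SDs around d1
    width = 2.0 * span * sd_rep / steps
    total = 0.0
    for i in range(steps):
        x = low + (i + 0.5) * width          # midpoint of the i-th slice
        total += value_fn(x) * dist.pdf(x) * width
    return total


def step_value(x):
    """Hypothetical payoff: +1 if the effect is positive, -0.5 otherwise."""
    return 1.0 if x > 0 else -0.5


d1, sd_rep = 0.5, sqrt(0.40)                 # the worked example from the text
ev = expected_value(step_value, d1, sd_rep)

# For a step payoff the integral collapses to p_rep * gain + (1 - p_rep) * loss.
p = NormalDist().cdf(d1 / sd_rep)
print(abs(ev - (p * 1.0 + (1.0 - p) * -0.5)) < 1e-3)   # True
```

With a step payoff the calculation reduces to weighting prep and its complement by the assigned values, as the text describes; a payoff that grows with effect size requires the full integral.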

Congeniality With Bayes

Probability theory provides a unique basis for the logic of science (Cox, 1961), and Bayes’ theorem provides the machinery to make science cumulative (Jaynes & Bretthorst, 2003; see the appendix). Falsification of the null cannot contribute to the cumulation of knowledge (Stove, 1982); the use of Bayes to reduce σdR² can. NHST stipulates an arbitrary mean for the test statistic a priori (0) and a variance a posteriori (sp²/n). The statistic prep uses both moments of the observed data in a coherent fashion to predict the most likely posterior distribution of the replicate statistic. Information from replicates may be pooled to reduce σd² (Louis & Zelterman, 1994; Miller & Pollack, 1994). Systematic explorations of phenomena identify predictors or moderators that reduce σδ². The information contributed by an experiment, and thus its contribution to knowledge, is a direct function of this reduction in σdR².

Improved Communication

The classic definition of replicability can cause harmful confusion when weak but supportive results must be categorized as a “failure to replicate [at p < .05]” (Rossi, 1997). Consider an experiment involving memory for deep versus superficial encoding of target words. This experiment, conducted in an undergraduate methods class, yielded a highly significant effect for the pooled data of 124 students, t(122) = 5.46 (Parkinson, 2004). We can “power down” the effect estimated from the pooled data to predict the probability that each of the seven sections in which these data were collected would replicate this classic effect. All of the test materials and instructions were identical, so σδ² was approximately 0. The effect size from the pooled data, d′, was 0.49. Individual class sections, averaging n = 18, contributed the majority of variability to the replicate sampling distribution, whose variance is the sum of sampling variances for n = 124 (“original”) and again for n = 18 (replicates). Replacing σdR in Equation 4 with the root of this sum predicts a replicability of .81: Approximately six of the seven sections should get a positive effect. It happens that all seven did, although for one the effect size was a mere 0.06. Unfortunately, the instructor had to tell four of the seven sections that they had, by contemporary standards, failed to replicate a very reliable result, as their ps were greater than .05. It was a good opportunity to discuss sampling error. It was not a good opportunity to discuss careers in psychology.
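The “power down” calculation can be reproduced in a few lines. This sketch assumes equal groups within both the pooled sample and each section, so that each sampling variance is approximately 4/(n − 4), and it takes the realization variance to be 0 as the text does. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def p_rep_unequal(d1, n_original, n_replicate):
    """p_rep when the replicate sample differs in size from the original.

    Replication variance is the sum of the two sampling variances,
    each approximated by 4 / (n - 4) for equal groups.
    """
    var_rep = 4.0 / (n_original - 4) + 4.0 / (n_replicate - 4)
    return NormalDist().cdf(d1 / sqrt(var_rep))


p_section = p_rep_unequal(0.49, 124, 18)
print(round(p_section, 2))        # 0.81
print(round(7 * p_section))       # 6 of the 7 sections expected to show a positive effect
```

Nearly all of the replication variance (0.286 of 0.319) comes from the small sections, which is why a large pooled sample does not make any single section's outcome secure.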

“How odd it is that anyone should not see that all observation must be for or against some view if it is to be of any service!” (Darwin, 1994, p. 269). Significance tests can never be for: “Never use the unfortunate expression ‘accept the null hypothesis”’ (Wilkinson & the Task Force on Statistical Inference, 1999, p. 599). And without priors, there are no secure grounds for being against—rejecting— the null. It follows that if our observations are to be of any service, it will not be because we have used significance tests. All this may be hard news for small-effects research, in which significance attends any hypothesis given enough n, whether or not the results are replicable. But editors may lower the hurdle for potentially important research that comes with so precise a warning label as prep. When replicability becomes the criterion, researchers can gauge the risks they face in pursuing a line of study: An assistant professor may choose paradigms in which prep is typically greater than .8, whereas a tenured risk taker may hope to reduce σδ2 in a line of research having preps around .6. When replicability becomes the criterion, significance, shorn of its statistical duty, can once again become a synonym for the importance of a result, not for its improbability.


Colleagues whose comments have improved this article include Sandy Braver, Darlene Crone-Todd, James Cutting, Randy Grace, Tony Greenwald, Geoff Loftus, Armando Machado, Roger Milsap, Ray Nickerson, Morris Okun, Clark Presson, Anon Reviewer, Matt Sitomer, and François Tonneau. In particular, I thank Geoff Cumming, whose careful readings saved me from more than one error. The concept was presented at a meeting of the Society of Experimental Psychologists, March 2004, Cornell University. The research was supported by National Science Foundation Grant IBN 0236821 and National Institute of Mental Health Grant 1R01MH066860.


This back room contains equations, details, and generalizations.

Effect Size

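The denominator of effect size given by Equation 1 is the square root of the pooled variance, calculated as

$$s_p^2 = \frac{s_E^2\,(n_E - 1) + s_C^2\,(n_C - 1)}{n - 2}.$$

Hedges (1981) showed that an unbiased estimate of δ is

$$d = d'\left(1 - \frac{3}{4n - 9}\right).$$

The adjustment is small, however, and with suitable adjustments in σd, d′ suffices.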

Negative effects generate preps less than .5, indicating the unlikelihood of positive effects in replication. For consistency, if d′ is less than 0, use |d′| and report the result as the replicability of a negative effect. Useful conversions are d′ = 2r(1 − r²)^(−1/2) (Rosenthal, 1994) and d′ = t[1/nE + 1/nC]^(1/2) for the simple two-independent-group case and d′ = tr[(1 − r)/nE + (1 − r)/nC]^(1/2) for a repeated measures t, where r is the correlation between the measures (Cortina & Nouri, 2000).
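The three conversions translate directly into code. The function names below are illustrative, and each function implements only the formula quoted above without further validation of the design.

```python
from math import sqrt


def d_from_r(r):
    """Effect size d' from a correlation coefficient r (|r| < 1)."""
    return 2.0 * r / sqrt(1.0 - r * r)


def d_from_t_independent(t, n_e, n_c):
    """Effect size d' from t for two independent groups."""
    return t * sqrt(1.0 / n_e + 1.0 / n_c)


def d_from_t_repeated(t_r, r, n_e, n_c):
    """Effect size d' from a repeated measures t, where r is the
    correlation between the two measures."""
    return t_r * sqrt((1.0 - r) / n_e + (1.0 - r) / n_c)


print(round(d_from_r(0.1), 3))              # 0.201: r = .1 is 1% of the variance
print(d_from_t_independent(2.0, 8, 8))      # 1.0
```

When the correlation between measures is 0, the repeated measures conversion coincides with the independent-groups one, which offers a quick consistency check.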

The asymptotic variance of effect size (Hedges, 1981) is

$$\sigma_d^2 \approx \frac{n}{n_E\, n_C} + \frac{d^2}{2n}.$$

Equation 3 in the text is optimized for the use of d′, however, and delivers accurate values of prep for −1 ≤ d′ ≤ 1.

Variance of Replicates

The desired variance of replicates, σdR², equals the expectation E[(d2 − d1)²]. This may be expanded (Estes, 1997) as

$$E[(d_2 - d_1)^2] = E[(d_2 - \delta)^2] + E[(d_1 - \delta)^2] - 2E[(d_2 - \delta)(d_1 - \delta)].$$

The quantities E[(d2 − δ)²] and E[(d1 − δ)²] are the variances of d2 and d1, each equal to σd². For independent replications, the expectation of the cross product E[(d2 − δ)(d1 − δ)] is 0.

Therefore, σdR² = E[(d2 − d1)²] = σd² + σd². It follows that the standard error of effect size of equipotent replications is σdR = √2·σd.

When nE = nC > 2,

$$\sigma_{dR} \approx \sqrt{\frac{8}{n - 4}}.$$

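When the sizes of the original and replicate samples vary, replication variance should be based on

$$\sigma_{dR}^2 = \sigma_{d_1}^2 + \sigma_{d_2}^2,$$

with each sampling variance calculated from the size of its own sample.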

prep as a Function of p

We may approximate the normal distribution by the logistic and solve for prep as a function of p. This suggests the following equation:

$$p_{\mathrm{rep}} \approx \left[1 + \left(\frac{p}{1 - p}\right)^{2/3}\right]^{-1}.$$

The parenthetical converts a p value into a probability ratio appropriate for the logistic inverse. For two-tailed comparisons, halve p. Users of Excel can simply evaluate prep = NORMSDIST(NORMSINV(1 − p)/SQRT(2)) (G. Cumming, personal communication, October 24, 2004). This estimate is complementary to Rosenthal and Rubin’s (2003) estimate of effect size directly from p and n.
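Both routes from p to prep are easy to compare. The sketch below pairs the logistic approximation, a probability ratio raised to the 2/3 power, with the normal-distribution calculation given for Excel; the function names are illustrative, and p is taken to be one-tailed.

```python
from math import sqrt
from statistics import NormalDist


def p_rep_from_p_normal(p):
    """p_rep from a one-tailed p value via the normal distribution
    (the same calculation as the Excel formula in the text)."""
    std_normal = NormalDist()
    return std_normal.cdf(std_normal.inv_cdf(1.0 - p) / sqrt(2.0))


def p_rep_from_p_logistic(p):
    """Logistic approximation to p_rep from a one-tailed p value."""
    return 1.0 / (1.0 + (p / (1.0 - p)) ** (2.0 / 3.0))


for p in (0.05, 0.01, 0.001):
    # The two columns agree to within about .01 at each of these p values
    print(p, round(p_rep_from_p_normal(p), 3), round(p_rep_from_p_logistic(p), 3))

# Two-tailed p = .05: halve p before converting
print(round(p_rep_from_p_normal(0.05 / 2), 3))   # 0.917
```

The normal-based values for p = .05, .01, and .001 round to .88, .95, and .99, the steps discussed in the main text, and the two-tailed case reproduces the .917 criterion.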

Randomization Method

Randomization methods avoid assumptions of normality, are useful for small-n experiments, and are robust against heteroscedasticity. To employ them:

  • Bootstrap populations for the experimental and control samples independently, generating subsamples of half the size of the original samples, using software such as Resampling Stats© (Bruce, 2003). This half-sizing provides the √2 increase in the standard deviation intrinsic to calculation of prep.
  • Generate an empirical sampling distribution of the difference of the means of the subsamples, or of the mean of the differences for a matched-sample design.
  • The proportion of the means that are positive gives prep.

This robust approach does not take into account σδ², and so is accurate only for exact replications.
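The steps listed above can be sketched with the Python standard library in place of dedicated resampling software. This is one possible reading of the procedure for independent groups, not the author's implementation: drawing subsamples with replacement, the number of resamples, the fixed seed, and the function name are all assumptions.

```python
import random


def p_rep_bootstrap(experimental, control, resamples=10000, seed=0):
    """Estimate p_rep as the proportion of half-size bootstrap subsamples
    in which the experimental mean exceeds the control mean."""
    rng = random.Random(seed)                    # fixed seed for reproducibility
    half_e = max(1, len(experimental) // 2)      # half-size subsamples inflate the
    half_c = max(1, len(control) // 2)           # standard deviation by about sqrt(2)
    positive = 0
    for _ in range(resamples):
        sub_e = rng.choices(experimental, k=half_e)   # sampling with replacement
        sub_c = rng.choices(control, k=half_c)
        diff = sum(sub_e) / half_e - sum(sub_c) / half_c
        if diff > 0:
            positive += 1
    return positive / resamples


# Hypothetical scores, for illustration only
experimental_scores = [14, 9, 17, 12, 20, 11, 15, 8, 16, 13, 18, 10]
control_scores = [10, 7, 13, 9, 15, 6, 12, 8, 11, 5, 14, 9]
print(p_rep_bootstrap(experimental_scores, control_scores))
```

For a matched-sample design, the same loop would resample the paired differences and count the proportion of positive mean differences.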

A Cumulative Science

Falsification of the null, even when possible, provides no machinery for the cumulation of knowledge. Reduction of $\sigma_{d_R}$ does. Information is the reduction of entropy, which can be measured as the Fisher information content of the distribution of effect sizes. The difference of the entropies before and after an experiment, $I = \log_2(\sigma_{\text{before}}/\sigma_{\text{after}})$, measures its incremental contribution of information. The discovery of better theoretical structures, predictors, or moderators that convert within-group variance to between-group variance permits large reductions in $\sigma_\delta^2$, and thus $\sigma_{d_R}$; smaller reductions are effected by cumulative increases in n.
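The information measure can be illustrated with a short calculation. The sketch below reads the measure as the base-2 logarithm of the ratio of the standard deviations of effect size before and after an experiment, which is the entropy difference for approximately normal distributions; that reading, the function name, and the example values are assumptions made for illustration.

```python
import math


def information_gain_bits(sigma_before, sigma_after):
    """Return log2(sigma_before / sigma_after), the gain in bits.

    Assumption: the distributions of effect size before and after the
    experiment are approximately normal, so the entropy difference
    reduces to the log ratio of their standard deviations.
    """
    if sigma_before <= 0 or sigma_after <= 0:
        raise ValueError("standard deviations must be positive")
    return math.log2(sigma_before / sigma_after)


# Halving the standard deviation of replicates contributes one bit.
print(information_gain_bits(2.0, 1.0))
```

An experiment that leaves the standard deviation unchanged contributes zero bits under this reading, and one that increases it yields a negative value.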


1. Excel® spreadsheets with relevant calculations are available from http://www.asu.edu/clas/psych/research/sqab and from http://www.latrobe.edu.au/psy/esci/.


  • Berger JO, Sellke T. Testing a point null hypothesis: The irreconcilability of P values and evidence. Journal of the American Statistical Association. 1987;82:112–122.
  • Bruce, P. (2003). Resampling stats in Excel [Computer software]. Retrieved February 1, 2005, from http://www.resample.com.
  • Burnham, K.P., & Anderson, D.R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer-Verlag.
  • Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press.
  • Cohen J. The earth is round (p < .05). American Psychologist. 1994;49:997–1003.
  • Cooper, H., & Hedges, L.V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
  • Cortina, J.M., & Nouri, H. (2000). Effect size for ANOVA designs. Thousand Oaks, CA: Sage.
  • Cox, R.T. (1961). The algebra of probable inference. Baltimore: Johns Hopkins University Press.
  • Cumming G, Finch S. A primer on the understanding, use and calculation of confidence intervals based on central and noncentral distributions. Educational and Psychological Measurement. 2001;61:532–575.
  • Cumming G, Williams J, Fidler F. Replication, and researchers’ understanding of confidence intervals and standard error bars. Understanding Statistics. 2004;3:299–311.
  • Darwin, C. (1994). The correspondence of Charles Darwin (Vol. 9; F. Burkhardt, J. Browne, D.M. Porter, & M. Richmond, Eds.). Cambridge, England: Cambridge University Press.
  • Eagly AH, Johannesen-Schmidt MC, van Engen ML. Transformational, transactional, and laissez-faire leadership styles: A meta-analysis comparing men and women. Psychological Bulletin. 2003;129:569–591.
  • Estes WK. On the communication of information by displays of standard errors and confidence intervals. Psychonomic Bulletin & Review. 1997;4:330–341.
  • Fisher RA. Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society. 1925;22:700–725.
  • Fisher, R.A. (1959). Statistical methods and scientific inference (2nd ed.). New York: Hafner Publishing.
  • Geisser, S. (1992). Introduction to Fisher (1922): On the mathematical foundations of theoretical statistics. In S. Kotz & N.L. Johnson (Eds.), Breakthroughs in statistics (Vol. 1, pp. 1–10). New York: Springer-Verlag.
  • Greenwald AG, Gonzalez R, Guthrie DG, Harris RJ. Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology. 1996;33:175–183.
  • Grissom RJ, Kim JJ. Review of assumptions and problems in the appropriate conceptualization of effect size. Psychological Methods. 2001;6:135–146.
  • Harlow, L.L., Mulaik, S.A., & Steiger, J.H. (Eds.). (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.
  • Hedges LV. Distribution theory for Glass’s estimator of effect sizes and related estimators. Journal of Educational Statistics. 1981;6:107–128.
  • Hedges, L.V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.
  • Hedges LV, Vevea JL. Fixed- and random-effects models in meta-analysis. Psychological Methods. 1998;3:486–504.
  • Jaynes, E.T., & Bretthorst, G.L. (2003). Probability theory: The logic of science. Cambridge, England: Cambridge University Press.
  • Krantz DH. The null hypothesis testing controversy in psychology. Journal of the American Statistical Association. 1999;94:1372–1381.
  • Krueger J. Null hypothesis significance testing: On the survival of a flawed method. American Psychologist. 2001;56:16–26.
  • Loftus GR. Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science. 1996;5:161–171.
  • Lorber MF. Psychophysiology of aggression, psychopathy, and conduct problems: A meta-analysis. Psychological Bulletin. 2004;130:531–552.
  • Louis, T.A., & Zelterman, D. (1994). Bayesian approaches to research synthesis. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 411–422). New York: Russell Sage Foundation.
  • Meehl PE. Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology. 1978;46:806–834.
  • Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 393–425). Mahwah, NJ: Erlbaum.
  • Miller, N., & Pollock, V.E. (1994). Meta-analytic synthesis for theory development. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 457–484). New York: Russell Sage Foundation.
  • Mosteller F, Colditz GA. Understanding research synthesis (meta-analysis). Annual Review of Public Health. 1996;17:1–23.
  • Moyer CA, Rounds J, Hannum JW. A meta-analysis of massage therapy research. Psychological Bulletin. 2004;130:3–18.
  • Neyman J, Pearson ES. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A. 1933;231:289–337.
  • Nickerson RS. Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods. 2000;5:241–301.
  • Parkinson, S.R. (2004). [Levels of processing experiments in a methods class]. Unpublished raw data.
  • Raudenbush, S.W. (1994). Random effects models. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 301–321). New York: Russell Sage Foundation.
  • Richard FD, Bond CF, Jr, Stokes-Zoota JJ. One hundred years of social psychology quantitatively described. Review of General Psychology. 2003;7:331–363.
  • Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 231–244). New York: Russell Sage Foundation.
  • Rosenthal R, Rubin DB. $r_{\text{equivalent}}$: A simple effect size indicator. Psychological Methods. 2003;8:492–496.
  • Rossi, J.S. (1997). A case study in the failure of psychology as a cumulative science: The spontaneous recovery of verbal learning. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 175–197). Mahwah, NJ: Erlbaum.
  • Rubin DB. Estimation in parallel randomized experiments. Journal of Educational Statistics. 1981;6:377–400.
  • Smithson, M. (2003). Confidence intervals. Thousand Oaks, CA: Sage.
  • Steiger, J.H., & Fouladi, R.T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 221–257). Mahwah, NJ: Erlbaum.
  • Stove, D.C. (1982). Popper and after: Four modern irrationalists. New York: Pergamon Press. (Available from Krishna Kunchithapadam, http://www.geocities.com/ResearchTriangle/Facility/4118/dcs/popper)
  • Thompson B. What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher. 2002;31(3):25–32.
  • Trafimow D. Hypothesis testing and theory evaluation at the boundaries: Surprising insights from Bayes’s theorem. Psychological Review. 2003;110:526–535.
  • van den Noortgate W, Onghena P. Estimating the mean effect size in meta-analysis: Bias, precision, and mean squared error of different weighting methods. Behavior Research Methods, Instruments, & Computers. 2003;35:504–511.
  • Wilkinson L, Task Force on Statistical Inference. Statistical methods in psychology: Guidelines and explanations. American Psychologist. 1999;54:594–604.

