This report uses Monte Carlo simulation techniques to evaluate two controversial topics in meta-analytic methods. The first is which effect size metric should be used when trials assess an outcome on the same continuous measure. The second is how best to estimate the standardized mean difference effect size and its variance when the comparisons are derived from a repeated-measures or a between-groups design.

Choice of Metric When Meta-Analyzing Continuous Measures

Although several statistical methods exist to estimate comparisons of groups at one or more time points (Tables 1 to 5), none provides unbiased estimates, and before the current report the circumstances under which they produce optimal statistical inferences were unknown. In the current simulations, the standardized mean difference outperformed the unstandardized version across a broad set of the conditions described in our Methods section, in terms of both bias (Figures 1–5) and efficiency (Figures 6–10). The standardized mean difference performed better when differences in within-study variability were large, when parametric assumptions were poorly met, and when study sample sizes were small. When the underlying assumptions were better met, the choice between the standardized and unstandardized mean difference mattered little for estimates of the weighted mean effect size (Table 7). Table 8 summarizes which equations for operationalizing effect sizes and their variances performed best in the current research for particular types of designs and inferential circumstances.

Table 8. Findings relevant to meta-analytic practice (effect size and variance choice).

The fact that the current results support the use of the standardized mean difference even when it is possible to use the unstandardized version might, on the surface, imply that clinical interpretation will grow more difficult even as statistical inferences grow clearer and cleaner. Of course, most stakeholders can more easily interpret a 10 mmHg drop in blood pressure or a $100 reduction in the cost of care than the equivalent result on a standardized effect size metric. There are at least two solutions to this problem. The first solution is quantitative and entails converting final results from the standardized mean difference metric to their equivalent unstandardized mean differences: one simply multiplies the standardized mean difference by the relevant standard deviation. Naturally, standard deviations can and do vary widely between studies, which implies that it is valuable to meta-analyze the relevant standard deviations in order to determine which value or values are best used in such conversions. Many factors might affect which standard deviation is presumed to describe a particular inferential situation. Investigators may have selected participants within a narrow range on the dependent measure, which artificially restricts the standard deviation; presumably such standard deviations are of little use in setting a standard. Scaling issues are also a consideration: other factors being equal, standard deviations grow smaller as values approach the low or high extremes of a bounded measure (e.g., rating scales), and grow larger across levels of a measure that is unbounded at one end (e.g., mmHg in blood pressure studies).6 Understanding when the standard deviation is larger or smaller thus facilitates making accurate clinical inferences.
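To make this first solution concrete, the sketch below multiplies a pooled standardized mean difference by a typical standard deviation to recover a difference in the original units; the specific numbers (a pooled d of 0.40 and a standard deviation of 12 mmHg) are hypothetical, and in practice the standard deviation would itself come from a meta-analysis of the studies' standard deviations, as suggested above.

```python
# Minimal sketch of back-converting a standardized mean difference to raw units.
# The values of d_pooled and sd_typical are hypothetical placeholders; in practice
# sd_typical would be derived from a meta-analysis of the studies' standard deviations.

def to_raw_units(d, sd):
    """Convert a standardized mean difference d to the original measurement metric."""
    return d * sd

d_pooled = 0.40      # hypothetical pooled standardized mean difference
sd_typical = 12.0    # hypothetical typical standard deviation, in mmHg

raw_difference = to_raw_units(d_pooled, sd_typical)
print(f"Equivalent unstandardized difference: {raw_difference:.1f} mmHg")  # 4.8 mmHg
```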

The second solution for clinical interpretation hinges on effect size standards, which are made possible by using the standardized mean difference effect size. Specifically, Cohen30,35 tentatively proposed some guidelines for judging effect magnitude, suggesting “that medium represents an effect of a size likely to be visible to the naked eye of a careful observer” (Cohen, p. 156). Thus, if a standardized mean difference exceeds 0.50, it is likely to be readily noticeable to the careful practitioner; if it is smaller, it is unlikely to be noticeable without the aid of statistics. In other words, if at least a medium amount of improvement has occurred between two observations, it should be noticeable in practice. Similarly, if a trial yielded a medium effect size and one encountered individuals who had been in either the treatment group or the control group, one could notice differences between them. It is worth noting that these clinical interpretation suggestions also apply to meta-analyses in which individual studies take observations on different measures, for which the only conventional recourse is to use a standardized effect size. Finally, note that interventions with an average small effect can have very large public health effects if they apply to a large part of the population, even if they are not noticeable to clinicians.
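As a rough heuristic only, the sketch below labels a standardized mean difference using Cohen's conventional benchmarks of 0.20, 0.50, and 0.80; the function name and cutoffs are illustrative conventions, not a substitute for the clinical judgment discussed above.

```python
# Rough heuristic: label a standardized mean difference with Cohen's conventional
# benchmarks (0.20 = small, 0.50 = medium, 0.80 = large). These cutoffs are
# conventions for orientation, not clinical standards.

def cohen_label(d):
    magnitude = abs(d)
    if magnitude >= 0.80:
        return "large"
    if magnitude >= 0.50:
        return "medium (likely visible to a careful observer)"
    if magnitude >= 0.20:
        return "small"
    return "trivial"

print(cohen_label(0.55))  # medium (likely visible to a careful observer)
```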

Optimal Estimates of the Standardized Mean Difference Effect Size (and its Sampling Variance)

Tables 1 to 5 show current methods for obtaining an effect size and a sampling variance estimate for repeated-measures and two-group designs. These solutions either include the correlation between pre- and post-test11,21,26 or exclude it.3,19,23,26 Despite the disagreement about using the correlation in calculating the ES, all solutions except that of Gibbons et al.23 use the correlation in estimating the variance of the ES for subsequent weighted analyses. Finally, these solutions rarely if ever distinguish between the change-score and raw-score metrics; the latter implicitly assumes a 0.5 correlation between measures in estimating the ES and its variance. The effect size in the change-score metric can be defined as the mean change due to treatment relative to the variability of the change scores, and the effect size in the raw-score metric as the mean difference between conditions relative to the pooled variability of scores within each condition or to the variance of the original scores in the absence of any intervention.

The raw-score definition takes into account only the change, without considering the variability of that change, whereas the change-score definition considers both the change and its variability. If the variability of the change is high, the ES in the change-score formulation will be smaller than in the raw-score formulation, which considers only the between-conditions variability and thereby implies a correlation of 0.5 between the two conditions; thus, the raw-score ES can be misleading. Conversely, if the variability of the change is small, the change-score ES will be larger than the raw-score ES because of the consistency of the change, that is, because a similar change has occurred for all subjects. Therefore, the two metrics embody different definitions of the ES because of the different standard deviations they use. Different estimates of the sampling variance are available depending on study design (Tables 3 and 4); all approximate the theoretical variance well under most circumstances. Yet, for two-group designs with repeated measures, there is an advantage to using the equations that include the total effect size as a component (i.e., Table 4, equations 15 and 17). These performed better than versions that used separate variance estimates for the two compared groups to create the total sampling variance (i.e., Table 4, equations 16 and 18). In general, all ES equations behaved well for one-group repeated-measures designs, but Table 8 lists those that performed best under certain conditions.
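To sketch why the two metrics coincide only at a correlation of 0.5, assume, for simplicity, equal pre- and post-test variances \(\sigma^2\) and a pre–post correlation \(\rho\); the following is the standard textbook relation between the two parametric definitions, not one of the specific estimators in Tables 1 to 4:

\[
\sigma_D \;=\; \sigma\sqrt{2(1-\rho)}, \qquad
\delta_{\text{change}} \;=\; \frac{\mu_{\text{post}}-\mu_{\text{pre}}}{\sigma_D}
\;=\; \frac{\delta_{\text{raw}}}{\sqrt{2(1-\rho)}},
\]

so that \(\delta_{\text{change}} = \delta_{\text{raw}}\) exactly when \(\rho = 0.5\), \(\delta_{\text{change}} > \delta_{\text{raw}}\) when \(\rho > 0.5\) (consistent change), and \(\delta_{\text{change}} < \delta_{\text{raw}}\) when \(\rho < 0.5\) (highly variable change).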

Based on our results, the selection of a formula for repeated measures can have considerable effects on statistical inferences. The parametric repeated-measures ES is defined as the difference between the post-test and pre-test means divided by a standard deviation, and the particular standard deviation chosen in calculating the ES index will also create some differences. Those differences can be corrected by using the appropriate weights in each case, that is, the sampling variance estimate for the change- or raw-score metric; effect sizes from different designs can then be integrated. It is worth mentioning that solutions for repeated-measures effect sizes performed best when the correlation between repeated observations was 0.50; to the extent that actual observed correlations differ from this value, statistical inferences are likely to be sub-optimal, especially with some of the competing equations (Tables 1 and 2).
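To make the two metrics concrete, the following sketch computes a single-group repeated-measures effect size in both the raw-score and change-score metrics from hypothetical summary statistics; it illustrates the definitions discussed above and does not implement the specific estimators or small-sample corrections given in Tables 1 to 4.

```python
import math

# Minimal sketch: a repeated-measures standardized mean difference expressed in
# the raw-score and change-score metrics, computed from summary statistics.
# Illustration of the definitions only; not the specific estimators or
# small-sample corrections in Tables 1 to 4.

def repeated_measures_d(mean_pre, mean_post, sd_pre, sd_post, r):
    """Return (d_raw, d_change) for one group measured at pre- and post-test.

    r is the pre-post correlation. The raw-score metric standardizes the mean
    change by the pooled pre/post standard deviation; the change-score metric
    standardizes it by the standard deviation of the change scores.
    """
    mean_change = mean_post - mean_pre
    sd_pooled = math.sqrt((sd_pre ** 2 + sd_post ** 2) / 2)
    sd_change = math.sqrt(sd_pre ** 2 + sd_post ** 2 - 2 * r * sd_pre * sd_post)
    return mean_change / sd_pooled, mean_change / sd_change

# Hypothetical summary statistics (e.g., mmHg), for illustration only.
d_raw, d_change = repeated_measures_d(120.0, 112.0, 12.0, 12.0, r=0.7)
print(f"raw-score d = {d_raw:.2f}, change-score d = {d_change:.2f}")
# With r = 0.7 the change scores are less variable than the raw scores,
# so the change-score d is larger in absolute value than the raw-score d.
```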

Limitations and Future Directions

The present study examined the performance of numerous estimators of effect size across widely diverse circumstances, but it could not evaluate all possible circumstances. Although the methods were intended to reflect the conditions that most often appear in meta-analyses of health-related research, it is possible that important conditions have been omitted from the current simulations. For example, trials sometimes have far larger samples than the current simulations examined. Yet, because sample size played little role in the results, this concern seems minor. Moreover, because the simulations examined circumstances with heterogeneity and with unequal variances, the current findings would seem highly germane to many meta-analyses related to health.

It is also possible that our results favoring standardized mean differences over their unstandardized counterparts were in part determined by a simulation design that was better suited to the former than to the latter. A future simulation assuming an unstandardized parametric effect size would be a useful replication and check on this possibility. Our simulation also does not provide direct evidence about the advisability of mixing ESs from between- and within-group designs in the same meta-analysis. Some sources argue against the practice (see Lipsey and Wilson, 2001)18 and others suggest that it is acceptable (for example, see Morris and DeShon, 2002 and Johnson and Eagly, 2000).22,34 Future research should directly address these issues.

The current investigation also leaves some questions without complete answers. Future investigations could examine alternative solutions beyond those in Table 5 for gauging the magnitude of effect sizes in the original metric. For example, as implied in the preceding sub-section, it may be fruitful to model the standard deviations in trials. Once the population values are estimated, they could be used in place of the observed standard deviations to weight results in individual studies. This solution might correct many of the deficiencies the current study identified. (Alternatively, the population standard deviations could replace the observed standard deviations in calculating the standardized mean difference.) Another solution could be to take existing transformations of the unstandardized metric and evaluate which are the most unbiased and efficient under different simulated conditions.

Similarly, in comparing the unstandardized effect size to the standardized one, the current work examined only one version (see Table 5). One popular version that was not examined in the current analysis is the unstandardized mean gain score. The unstandardized difference’s relatively poor performance in the current analysis offers little reason to believe that it will fare any better in the gain-score arena, but only the requisite simulation work can confirm this possibility. Similarly, the current finding that the standardized mean difference performs better than the unstandardized one under unequal variances implies, but does not directly show, that differing variances of the measures across studies will make the unstandardized mean difference perform more poorly. Moreover, the current results showed that the standardized mean difference performs better under heterogeneity than its unstandardized counterpart; the implication is that moderator testing (viz., sub-group analysis or meta-regression) will also exhibit less bias and greater efficiency when the effect size is standardized rather than unstandardized. This possibility should be evaluated in a future simulation. Other important aspects to evaluate in a future study are different ratios of the mean difference to the pooled standard deviation; the conditions manipulated in the current study may statistically favor the standardized version over its unstandardized counterpart.