# Design and Analysis of Group-Randomized Trials: A Review of Recent Methodological Developments

*Requests for reprints should be sent to David M. Murray, PhD, 3693 Norriswood, 202 Psychology Bldg, Memphis, TN 38134 (e-mail: d.murray@mail.psyc.memphis.edu).*

## Abstract

We review recent developments in the design and analysis of group-randomized trials (GRTs). Regarding design, we summarize developments in estimates of intraclass correlation, power analysis, matched designs, designs involving one group per condition, and designs in which individuals are randomized to receive treatments in groups. Regarding analysis, we summarize developments in marginal and conditional models, the sandwich estimator, model-based estimators, binary data, survival analysis, randomization tests, survey methods, latent variable methods and nonlinear mixed models, time series methods, global tests for multiple endpoints, mediation effects, missing data, trial reporting, and software.

We encourage investigators who conduct GRTs to become familiar with these developments and to collaborate with methodologists who can strengthen the design and analysis of their trials.

Group-randomized trials (GRTs) are comparative studies designed to evaluate interventions that operate at a group level, manipulate the physical or social environment, or cannot be delivered to individuals.1 Examples include school-, worksite-, and community-based studies designed to improve the health of students, employees, and residents, respectively. Just as the randomized clinical trial (RCT) is the gold standard in public health and medicine when allocation of individual participants is possible, the GRT is the gold standard when allocation of identifiable groups is necessary.

There are 4 characteristics that distinguish the GRT from the more familiar RCT. First, the unit of assignment is an identifiable group; such groups are formed not at random but rather through some physical, social, geographic, or other connection among their members. Second, different groups are assigned to each condition, creating a nested or hierarchical structure for the design and the data. Third, the units of observation are members of those groups nested within both their condition and their group. Fourth, usually only a limited number of groups are assigned to each condition.

These characteristics create several problems in the design and analysis of GRTs.1 The major design problem is that a limited number of often heterogeneous groups makes it difficult for randomization to distribute potential sources of confounding evenly in any single realization of the experiment. This increases the need to use design strategies that will limit confounding and analytic strategies to deal with confounding when it is detected. The major analytic problem is that there is an expectation for a positive intraclass correlation (ICC) among observations of members of the same group.2 That ICC reflects an extra component of variance attributable to the group above and beyond the variance attributable to its members. This extra variation will increase the variance of any group-level statistic beyond what would be expected with random assignment of members to conditions. Moreover, with a limited number of groups, the degrees of freedom available to estimate group-level statistics are limited. Any test that ignores either the extra variation or the limited degrees of freedom will have a type I error rate that is inflated, and this effect will only worsen as the ICC increases.3

Cornfield4(p101–102) warned of this danger 25 years ago when he noted that ignoring these problems was “an exercise in self-deception . . . and should be discouraged.” That warning was followed by a gradual increase in the number of methods papers in this area. The first comprehensive text on the design and analysis of GRTs appeared in 1998.1 It detailed the design considerations for the development of GRTs, described the major approaches to their analysis both for Gaussian and binary data, and presented methods for power analysis applicable to most GRTs. We use that text as a point of departure for this review and assume that readers are familiar with its basic material.

Over the past 5 years, many articles have discussed the methodological issues involved in GRTs generally or in design papers describing new trials.5–28 The second textbook on design and analysis of GRTs appeared in 2000.29 That text provided a good history of GRTs and examined the role of informed consent and other ethical issues. It focused on extensions of classical methods, although it also included material on regression models for Gaussian, binary, count, and time-to-event data. Other textbooks on analysis methods germane to GRTs appeared during the same period,30–33 as well as a large number of articles on new methods relevant to the design and analysis of GRTs. In the sections that follow, we bring the reader up to date on many of these developments.

## DESIGN ISSUES

In 1998, Murray1 detailed the design considerations for a GRT, whether the study was to use a nested cohort or nested cross-sectional design; whether the study was to have a posttest-only design, a pretest–posttest design, or an extended design with multiple pretest/posttest measures; and whether the design was to be completely randomized or to include matching/stratification. At that time, investigators were limited by the paucity of ICC and other parameter estimates needed to select an efficient design and to ensure that the study would have adequate power (the probability of rejecting the null hypothesis when it is false). One of the important recent developments has been the publication of papers providing estimates for those parameters. Another has been the publication of important refinements in the methods used for power analysis. There have also been important developments in several specific designs, including matched designs, designs involving 1 group per condition, and designs in which individuals are randomized to receive treatments in groups.

### New Estimates of ICCs

Investigators planning a GRT should not proceed absent a good estimate of the extra variation likely to be present in their primary analysis. To do so is to risk a substantially underpowered or overpowered study. Table 1 lists articles published in the past 5 years that have reported ICC and related parameter estimates. Donner and Klar reported ICCs from a number of other studies,29 as did Murray and Blitstein.34 Collectively, these sources provide estimates for a wide variety of groups, members, and endpoints so that investigators now have a better opportunity of finding estimates that are well matched to the circumstances of the trial they are planning.

Murray and Blitstein34 also reported a pooled analysis of ICCs from worksite, school, and community studies. They confirmed that the adverse impact of a positive ICC can be reduced by regression adjustment for covariates1,35–38 or by taking advantage of over-time correlation in a repeated measures analysis.1,35,39 Janega et al. (unpublished data, 2003) have shown that standard errors for intervention effects from end-of-study analyses that reflect these strategies are often different from the standard errors estimated from baseline analyses. Because the ICC of concern in any GRT is the ICC as it operates in the primary analysis,1 these findings reinforce the need for investigators to use estimates in their power analyses that closely reflect the endpoints, target population, and primary analysis planned for the trial. And while the sources just cited will help considerably in this regard, we join others who have urged publication of such estimates as a routine part of reporting the results of GRTs.40

### Power Analysis

Most of the sources that reported ICCs also showed how they could be used to size a new GRT, as did many of the papers cited earlier as general reviews. We do not repeat the standard presentation here and instead refer readers to those sources, and especially to chapter 9 in the Murray text,1 including the examples offered at the end of that chapter. Even so, a few points bear repeating here. First, the increase in between-group variance due to the ICC in the simplest analysis is calculated as 1 + (*m* − 1)ICC, where *m* is the number of members per group; as such, ignoring even a small ICC can underestimate standard errors if *m* is large. Second, while the magnitude of the ICC is inversely related to the level of aggregation, it is independent of the number of group members who provide data. For both of these reasons, more power is available given more groups per condition with fewer members measured per group than given just a few groups per condition with many members measured per group, no matter the size of the ICC.
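The variance inflation factor and the allocation argument above can be illustrated with a short sketch. The function names and the numeric inputs (400 members per condition, an ICC of 0.02, unit total variance) are hypothetical choices for illustration, and the formula is the one for the simplest posttest-only analysis.

```python
def design_effect(m: int, icc: float) -> float:
    """Variance inflation factor 1 + (m - 1) * ICC for groups of m members."""
    return 1.0 + (m - 1) * icc


def variance_of_condition_mean(sigma2: float, groups: int, m: int, icc: float) -> float:
    """Variance of a condition mean built from `groups` groups of `m` members.

    sigma2 is the total (member plus group) variance of a single observation.
    """
    return sigma2 * design_effect(m, icc) / (groups * m)


# The same 400 members per condition, allocated two ways (illustrative numbers).
few_large = variance_of_condition_mean(sigma2=1.0, groups=4, m=100, icc=0.02)
many_small = variance_of_condition_mean(sigma2=1.0, groups=20, m=20, icc=0.02)

print(f"4 groups of 100: {few_large:.5f}")
print(f"20 groups of 20: {many_small:.5f}")
```

With the total number of members held fixed, the allocation with more groups and fewer members per group gives the smaller variance, which is the point made in the paragraph above.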

Third, the 2 factors that largely determine power in any GRT are the ICC and the number of groups per condition. For these reasons, there is no substitute for a good estimate of the ICC for the primary endpoint, the target population, and the primary analysis planned for the trial, and it is unusual for a GRT to have adequate power with fewer than 8 to 10 groups per condition. Finally, the formula for the standard error for the intervention effect depends on the primary analysis planned for the trial, and investigators should take care to calculate that standard error, and power, based on that analysis. Chapter 9 in the Murray text1 provides formulas for many of the common analyses, and generic formulas and examples are provided in recent work conducted by Janega et al. (unpublished data, 2003).
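As a rough sketch of how the number of groups per condition drives power, the following uses the standard error for the simplest posttest-only comparison of 2 conditions, with *g* groups of *m* members each. It relies on a normal approximation, which is an assumption made here for simplicity; a test with degrees of freedom based on the number of groups will have somewhat lower power, especially when *g* is small. The function name and the inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, sigma2, icc, m, g, alpha=0.05):
    """Approximate two-sided power to detect a difference `delta` between
    two condition means, each based on g groups of m members.

    Normal approximation only: it ignores the limited degrees of freedom
    and is therefore optimistic when g is small.
    """
    se = sqrt(2.0 * sigma2 * (1.0 + (m - 1) * icc) / (m * g))
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = delta / se
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Illustrative inputs: difference of 0.25 SD, ICC of 0.02, 50 members per group.
power_by_groups = {
    g: approx_power(delta=0.25, sigma2=1.0, icc=0.02, m=50, g=g)
    for g in (4, 8, 12, 16)
}
for g, power in power_by_groups.items():
    print(f"{g:2d} groups per condition: approximate power {power:.2f}")
```

The standard error should always be replaced by the one that matches the primary analysis planned for the trial.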

Several variations on the standard power analysis have appeared during the past 5 years. Slymen and Hovell presented a method that allows the investigator to compare sample size requirements for a GRT and an RCT based on the anticipated magnitude both of the ICC and of any contamination.41 They showed that for small groups, where contamination was likely to be substantial, GRTs were a natural choice, while for large groups, where contamination was likely to be modest, RCTs were a natural choice. Hayes and Bennett presented sample size formulas for pair-matched and pair-unmatched GRTs in terms of coefficients of variation rather than ICCs for investigators more familiar with the former than the latter.21 Murray et al. defined the design effect as it operates in a random coefficient model and presented methods for power analyses of such models.42

Kerry and Bland compared 3 methods for weighting group means in sample size calculations when those means are based on a variable number of observations; they reported that minimum variance weights were superior to uniform weights, particularly when clusters were small, and superior to cluster-size weights, particularly when the clusters were large.43 Lake et al. showed how power could be improved without increasing the type I error rate using a strategy in which sample size is reestimated after the start of recruitment using the initial data.44 This strategy has application in situations in which many groups are to be randomized and recruitment of those groups is to take place over a long period of time (e.g., some family studies). Liu et al. provided a technical discussion of sample size and power for analytic models involving differences between means, slopes, or proportions for GRTs involving repeated observations of the same groups and members45; less technical presentations are also available.1,42 Raudenbush discussed sample size in GRTs accounting for the cost of recruiting members and groups and provided formulas for optimal size with and without covariate adjustment.46

### Matched Designs

Almost half of the GRTs published in the *American Journal of Public Health* and *Preventive Medicine* during the period 1998 through 2002 involved matched designs.47 Even so, Klar and Donner suggested that stratification may be a better design choice to ensure balance on potential confounders.10,29,48 They argued that stratification exacted a lower price in terms of degrees of freedom, and certainly that is true. Klar and Donner also pointed out that estimation of the ICC in a matched design assumes homogeneity of effects across pairs, and they gave that as another reason to avoid a matched design. Others have argued that this assumption is often reasonable.49

Raab and Butcher proposed an alternative to matching50 based on a balancing criterion calculated as a weighted sum of squared differences between the condition means on any proposed covariates. Groups would be divided into 2 sets providing a small enough value on their criterion, followed by random assignment of sets to conditions. Raab and Butcher argued that this scheme would support model-based methods because it would fulfill the conditional independence criterion. To support a randomization test, they proposed that the criterion be calculated for all possible allocations of groups to conditions, that some subset of those allocations be identified as having a small enough value on the criterion to be acceptable, and that one such allocation be chosen at random, followed by random assignment of sets to conditions.
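The restricted allocation scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Raab and Butcher's implementation: it assumes an even number of groups split into 2 equal sets, and the covariate weights, the acceptance rule (`keep_fraction`), the function names, and the data are all hypothetical.

```python
import random
from itertools import combinations


def balance_criterion(covariates, set_one, weights):
    """Weighted sum of squared differences between the two sets' covariate means."""
    set_two = [g for g in range(len(covariates)) if g not in set_one]
    total = 0.0
    for k, weight in enumerate(weights):
        mean_one = sum(covariates[g][k] for g in set_one) / len(set_one)
        mean_two = sum(covariates[g][k] for g in set_two) / len(set_two)
        total += weight * (mean_one - mean_two) ** 2
    return total


def restricted_allocation(covariates, weights, keep_fraction=0.25, seed=None):
    """Pick one acceptable split at random, then assign the sets to conditions."""
    n = len(covariates)
    # Fix group 0 in the first set so each split of the groups is listed once.
    splits = [(0,) + rest for rest in combinations(range(1, n), n // 2 - 1)]
    splits.sort(key=lambda s: balance_criterion(covariates, s, weights))
    n_keep = max(1, int(len(splits) * keep_fraction))  # hypothetical acceptance rule
    rng = random.Random(seed)
    set_one = rng.choice(splits[:n_keep])
    set_two = tuple(g for g in range(n) if g not in set_one)
    arms = [set_one, set_two]
    rng.shuffle(arms)  # random assignment of sets to conditions
    return {"intervention": arms[0], "control": arms[1]}


# Hypothetical group-level covariates: (baseline mean, standardized group size).
group_covariates = [
    (12.0, 0.3), (15.5, -0.2), (11.2, 1.1), (14.8, 0.0),
    (13.1, -0.9), (16.0, 0.4), (12.7, -0.5), (14.0, 0.7),
]
allocation = restricted_allocation(group_covariates, weights=(1.0, 1.0), seed=1)
print(allocation)
```

Because the final allocation is drawn at random from the acceptable subset, the same subset defines the reference set for any later randomization test.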

### One Group per Condition

GRTs with 1 group assigned to each condition have been criticized as unable to support a valid analysis for an intervention effect, absent strong and untestable assumptions.1,29 Even so, these designs continue to appear, both in applications submitted to National Institutes of Health study sections and in articles in the peer-reviewed literature.47 Varnell et al. recently provided additional documentation of the dangers of this design and urged investigators to avoid it except in the case of pilot studies.51

### Individuals Randomized to Receive Treatments in Groups

A design intermediate between a GRT and an RCT exists in which individuals are randomized to study conditions but receive their treatment in small groups or from the same intervention team seen by other participants. Those shared experiences may result in correlated errors, just as they do in GRTs. While some may regard this as a type of “intervention effect,” it is instead a threat to the internal validity of the trial. This concern was raised nearly 20 years ago in the context of designs in which endpoints were determined jointly by patients and their providers.52 Several recent articles have echoed that concern.46,53–55

Most recently, Varnell et al. compared analyses for these studies in simulations, varying the number of groups per condition, the magnitude of the ICC, and the number of conditions that received an intervention in small groups while fixing the intervention effect at zero.56 Analyses that ignored the ICC had an inflated type I error rate, with the magnitude of the problem dependent on the size of the ICC, the number of members per group, and the number of conditions in which participants received treatment in groups. A mixed-model regression approach with the group included as a nested random effect and degrees of freedom based on the number of groups carried the nominal type I error rate. This finding confirms that allowing participants to interact with each other in small groups does not maintain the independence of observations required for the usual RCT analyses.
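A small simulation in the same spirit, though much simpler than the one reported by Varnell et al., shows the inflation. It is a sketch under stated assumptions: both conditions receive treatment in groups, the data are Gaussian with no intervention effect, and a 2-sample *t* test on group means stands in for the mixed-model analysis (a substitution that is reasonable here only because the groups are of equal size). The critical values are hard-coded for the default sizes, and all names and settings are hypothetical.

```python
import random
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic."""
    mean_a, mean_b = mean(a), mean(b)
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / sqrt(pooled * (1.0 / len(a) + 1.0 / len(b)))


def simulate_condition(rng, groups, m, icc):
    """Null data for one condition: total variance 1, ICC shared within groups."""
    data = []
    for _ in range(groups):
        group_effect = rng.gauss(0.0, sqrt(icc))
        data.append([group_effect + rng.gauss(0.0, sqrt(1.0 - icc)) for _ in range(m)])
    return data


def type_one_error(reps=1000, groups=10, m=20, icc=0.05, seed=2004):
    # Critical values are hard-coded for the default sizes:
    # t(0.975, df=398) for the member-level test, t(0.975, df=18) for group means.
    naive_crit, group_crit = 1.966, 2.101
    rng = random.Random(seed)
    naive_hits = group_hits = 0
    for _ in range(reps):
        arms = [simulate_condition(rng, groups, m, icc) for _ in range(2)]
        members = [[y for grp in arm for y in grp] for arm in arms]
        group_means = [[mean(grp) for grp in arm] for arm in arms]
        naive_hits += abs(two_sample_t(members[0], members[1])) > naive_crit
        group_hits += abs(two_sample_t(group_means[0], group_means[1])) > group_crit
    return naive_hits / reps, group_hits / reps


naive_rate, group_rate = type_one_error()
print(f"ignoring the group: {naive_rate:.3f}; group-level test: {group_rate:.3f}")
```

The member-level test rejects far more often than the nominal 5%, while the test referenced against between-group variation stays close to it.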

## ANALYSIS ISSUES

Murray1 identified several analytic approaches that can provide a valid analysis for GRTs. In each, the intervention effect is defined as a function of a condition-level statistic (e.g., difference in means, rates, or slopes) and assessed against the variation in the corresponding group-level statistic. These approaches included mixed-model analysis of variance (ANOVA)/analysis of covariance (ANCOVA) for designs having only 1 or 2 time intervals, random coefficient models for designs having 3 or more time intervals, and randomization tests as an alternative to the model-based methods. Murray1 identified other approaches as invalid for GRTs because they ignored or misrepresented a source of random variation. These included (1) analyses that assessed condition variation against individual variation and ignored the group, (2) analyses that assessed condition variation against individual variation and included the group as a fixed effect, (3) analyses that assessed the condition variation against subgroup variation, and (4) analyses that assessed condition variation against the wrong type of group variation.

Murray1 identified still other strategies as having limited application for GRTs. Application of fixed-effect models with post hoc correction for extra variation and limited degrees of freedom assumes that the correction is based on an appropriate ICC estimate, and in 1998 few estimates were available. Application of survey-based methods or generalized estimating equations (GEE) and the sandwich method for standard errors requires that a total of 40 or more groups be included in the study, and in 1998 most GRTs did not include 40 groups.

During the past 5 years, considerable attention has been focused on analytic issues germane to GRTs, including refinements for existing methods and development of new methods. Much of this work has occurred outside the context of GRTs but has application to GRTs, and so we include it in this review.

### Conditional versus Marginal Models

Conditional or subject-specific models are typified by mixed-model regression57 and incorporate random effects to reflect the correlation among observations made of members of the same group; the observations are considered independent conditional on those random effects. Marginal or population-averaged models are typified by GEE58,59 and define the marginal expectation of the dependent variable as a function of the predictor variables and assume that the variance is a known function of the mean; they separately specify a correlation structure for observations made of members of the same group. In the case of Gaussian data, interpretation of the condition coefficient is the same in conditional and marginal models; however, in the case of binary data, the condition coefficient from a marginal model is smaller than that from a conditional model and has a different interpretation.

In the marginal model, the condition coefficient is the between-person difference in the log odds of the outcome comparing the effects of the intervention and control conditions as if they had been delivered to 2 different individuals. In the conditional model, the condition coefficient is the within-person change in the log odds of the outcome comparing the effect of the intervention and control conditions as if they had been delivered to the same individual. Several recent papers have recommended conditional models for GRTs focused on change within participants (e.g., preintervention vs postintervention) and marginal models for GRTs focused on differences between participants (e.g., intervention condition vs control condition). Unfortunately, both approaches have problems in certain binary data situations; because these issues affect the remainder of our presentation, we consider them first.
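The difference in size between the 2 coefficients for binary data can be checked numerically. The sketch below assumes a random-intercept logistic model with a Gaussian group effect and averages the conditional probabilities over that effect to obtain the marginal log odds; the parameter values and function names are hypothetical.

```python
from math import exp, log, pi, sqrt


def expit(x):
    return 1.0 / (1.0 + exp(-x))


def logit(p):
    return log(p / (1.0 - p))


def marginal_probability(linear_predictor, sigma, steps=2001):
    """Average expit(linear_predictor + u) over u ~ N(0, sigma^2) (midpoint rule)."""
    lower, upper = -8.0 * sigma, 8.0 * sigma
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        u = lower + (i + 0.5) * width
        density = exp(-0.5 * (u / sigma) ** 2) / (sigma * sqrt(2.0 * pi))
        total += expit(linear_predictor + u) * density * width
    return total


# Hypothetical conditional (subject-specific) model:
# logit P(y = 1 | u) = beta0 + beta1 * condition + u
beta0, beta1, sigma = -1.0, 0.8, 1.5

marginal_beta1 = (
    logit(marginal_probability(beta0 + beta1, sigma))
    - logit(marginal_probability(beta0, sigma))
)
print(f"conditional coefficient: {beta1:.3f}")
print(f"marginal coefficient:    {marginal_beta1:.3f}")
```

The marginal coefficient shrinks toward zero as the variance of the group effect grows, and the two coincide when that variance is zero.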

### Limitations of the Sandwich Estimator Used in Marginal Models

One of the advantages of GEE is that it uses an estimator for variances of fixed effects that is asymptotically robust to misspecification of the correlation structure; the sandwich estimator is so named because the expression of this estimator “sandwiches” an approximate correlation matrix inside 2 outer layers of matrix algebra that otherwise define the variance of a weighted least squares estimator. Unfortunately, the sandwich estimator is biased downward when the number of groups is below 40, whether in GRTs60–62 or in other designs involving correlated binary data.63–65 This problem only increases as the number of groups becomes smaller.66–68 Many investigators working in GRTs appear to be unaware of this limitation, in that there have been many applications of GEE and the sandwich estimator in GRTs involving fewer than 40 groups.47 Thornquist and Anderson reported more than 10 years ago that this bias was corrected in a GRT by inflating the variance to reflect the uncertainty in the estimation of the fixed effects, much as restricted maximum likelihood (REML) estimation does relative to full maximum likelihood (ML) estimation. Paired with a *t* test and using degrees of freedom based on the number of groups, the size of their corrected test was at the nominal level.60
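For the simplest case, a difference between 2 condition means under working independence, the uncorrected sandwich calculation reduces to a few lines. The sketch below is illustrative only: it implements none of the small-sample corrections discussed in this section, the function name and data are hypothetical, and 2 groups per condition is far too few for the estimator to be trusted.

```python
from math import sqrt


def sandwich_se_difference(control_groups, intervention_groups):
    """Difference in condition means and its uncorrected sandwich standard error.

    Residuals are summed within each group before squaring, so any
    within-group correlation is carried into the variance estimate.
    """
    def mean_and_variance(groups):
        n = sum(len(g) for g in groups)
        cond_mean = sum(sum(g) for g in groups) / n
        meat = sum(sum(y - cond_mean for y in g) ** 2 for g in groups)
        return cond_mean, meat / n ** 2

    mean_c, var_c = mean_and_variance(control_groups)
    mean_i, var_i = mean_and_variance(intervention_groups)
    return mean_i - mean_c, sqrt(var_c + var_i)


# Toy data: two groups of three members in each condition.
control = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
intervention = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0]]
effect, robust_se = sandwich_se_difference(control, intervention)
print(effect, robust_se)
```

Because the estimate rests on only as many squared terms as there are groups, it is highly variable and tends to be too small when groups are few.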

More recent work has also focused on the development and evaluation of correction procedures, though usually not in the context of GRTs. Long and Ervin69 provided additional results for 3 corrections introduced earlier by MacKinnon and White65 and reported that a jackknife estimator (a nonparametric method to estimate standard errors based on repeated subsamples) was better than the alternatives. Mancl and DeRouen reported a corrected estimator that was of nominal size even with 10 groups per condition and only 16 observations per group67; they also offered an SAS macro. Corcoran et al.70 offered an exact test, but it has only narrow application to situations in which the groups represent ordered levels of an underlying factor such as dose. Fay and Graubard reported that the sandwich estimator worked well, even in small samples, so long as the usual Wald test was evaluated not as a χ² value but as an F ratio of the form F(1, *d*), where *d* is calculated as a function of the variance of the sandwich estimator.71

A similar correction provided by Kauermann and Carroll replaces the usual cutpoint in the *z* distribution with a cutpoint that is a function of the variance of the sandwich estimator; they demonstrated its utility even when the sample size was as small as 5.72 Pan and Wall offered a correction much like that of Fay and Graubard in the form of an approximate *t* or F test, with degrees of freedom defined as a function of the variance of the sandwich estimator.73 Bell and McCaffrey74 offered a correction and a Satterthwaite approach to degrees of freedom that seemed to involve less bias and a better type I error rate than the sandwich estimator or the corrected estimators recommended by Long and Ervin69 or Mancl and DeRouen.67 Preisser et al. suggested using a model-based variance estimator in GEE, rather than the sandwich estimator, as another solution.75

Unfortunately, none of these corrections appear in the standard software packages, so they are relatively unavailable to investigators who analyze GRTs. Absent an effective correction, the sandwich estimator will have an inflated type I error rate in GRTs involving fewer than 40 groups, and investigators who use this approach continue to risk overstating the significance of their findings.

### Limitations of Model-Based Estimators Used in Conditional Models

Rodriguez and Goldman76 reported that multilevel analyses of binary data underestimate both fixed effects and their variances when the ICC is large (0.231 in their data) and there are few observations per group (e.g., family-based studies). With a smaller ICC (0.041 in their data), underestimation is quite modest, even with few observations per group. Breslow and Clayton77 and Ten Have et al.78 reported a similar problem for models fit via penalized quasi-likelihood (PQL) estimation. This led some to question the use of conditional models for GRTs involving binary data. That appears to be an overreaction, because most GRTs involve many observations per group and small ICCs; under these conditions, there is little bias. In fact, the simulation study of Hannan and Murray79 indicated that a conditional model for Gaussian data carried the nominal type I error rate when applied to binary data with an ICC as large as 0.05, so long as there were at least 4 groups per condition and 25 observations per group.

### Methods for Binary Data

Gibbons and Hedeker proposed a random-effects probit and logistic regression model for data with 3 levels of nesting based on ML estimation using numerical integration.80 Their approach would be preferred over PQL procedures when the number of observations per group is quite small, but it is computationally intractable with more than 5 or 6 random effects; this is a problem common to methods that rely on numerical integration. Unfortunately, many models fit to longitudinal data in the context of GRTs have 5 random effects, and some stratified models described by Murray1 have 7; such models would be difficult to fit with these methods. Aitkin proposed a nonparametric method based on ML estimation81; he noted that this approach had been widely viewed as computationally intensive, but his method avoided that problem. The benefit of the nonparametric method is that it does not depend on correct specification of the distribution of random effects. Bellamy et al. reported a simulation study comparing mixed-model regression (using the SAS GLIMMIX macro) and GEE68 and confirmed earlier reports that GEE was liberal with fewer than 40 total groups, while GLIMMIX was conservative when the average cluster size was quite small.

Several Bayesian approaches have also been suggested. Kleinman and Ibrahim proposed a semiparametric Bayesian approach to generalized linear mixed models but provided no simulation results to evaluate their method.82 Ten Have and Localio83 proposed an empirical Bayes method based on numerical integration and incorporated an adjustment for the standard error; their method performed better than PQL estimation given many small groups (100 groups with 2 observations per group) but not as well as PQL estimation with a smaller number of larger groups (20 groups and 100 observations per group). As such, their method may be useful in family-based GRTs but not in school-, worksite-, or community-based GRTs. Turner et al. discussed a Bayesian approach involving specification of an informative prior ICC distribution based on values taken from the literature84; as published values for ICCs become increasingly available, their approach may prove useful. A much simpler approach for binary data was reported by Hannan and Murray,79 who indicated that the familiar conditional model for Gaussian data carried the nominal type I error rate even when applied to binary data with an ICC as large as 0.05, so long as there were at least 4 groups per condition and 25 observations per group.

### Methods for Survival Analysis

Hedeker et al. proposed a discrete-time survival model that allowed multiple random effects, operated under either the proportional hazards or proportional odds assumption, and relied on ML estimation using numerical integration.85 Hedeker et al. did not provide simulation results for their method. Donner and Klar29 described group-level methods that could be applied to either discrete-time or continuous-time survival data but did not allow for adjustment for individual-level factors; importantly, the unweighted form assumed that each group’s survival rate was equally precise. Frailty models allow the hazard rate to vary at random among groups,86 but their effect estimates may be difficult to interpret.29

Marginal survival models employ standard Cox regression methods to estimate the effect of the intervention and then use the sandwich estimator to obtain standard errors for the fixed effects87–89; their intervention effect estimates are readily interpretable, but caution is required if the total number of groups is less than 40. Sargent described an adaptation of the Cox model to incorporate random effects using Bayesian methods but provided no simulation data on the performance of the method.90 Vaida and Xu91 described a random-effects model for proportional hazards regression similar to that of Sargent, but they also did not provide simulation results.

Yau92 proposed a 3-level proportional hazards model estimated via REML. He reported results from a simulation study involving only 10 groups with just 3 members per group and 3 repeated observations for each member; censoring varied from 30% to 60%. Yau’s method provided unbiased estimates of fixed effects but slightly overestimated random effects; the overestimation of random effects was reduced with even slightly increased group size. Other advantages were that the baseline hazard function did not have to be specified and estimation did not rely on numerical integration. Cai et al.88 proposed a transformation model with random effects based on numerical integration and showed that it was less biased than some of the earlier parametric models. Lui et al. proposed several methods for confidence interval estimation for rate ratios based on the beta-binomial distribution93; they reported that an interval estimator based on a log transform performed best in simulations, but their smallest study included 20 groups per condition, so the small sample properties of the estimator are unknown.

Bennett et al. presented a 2-stage approach to analysis of incidence rates based on person-year data,94 estimating group-specific rates (for an unadjusted analysis) or residuals (for an adjusted analysis) in a first stage without regard to intervention status; these rates or residuals were used in a second stage to estimate the intervention effect and assessed via a *t* statistic with degrees of freedom based on the number of groups. Simulation studies showed this approach had nominal size even with as few as 3 groups per condition and perhaps 30 members per group. While these results are encouraging, it would be of interest to see how the method performs with smaller groups.

### Randomization Tests

In a randomization test for a GRT, the data are analyzed on the basis of the actual assignment of groups to conditions and then reanalyzed for every other possible assignment of groups to conditions given the design, including any limitations in randomization due to matching, stratification, and the like. The test statistic observed on the basis of the actual assignment is referenced against the distribution of such statistics calculated from the set of all possible assignments. The 2-tailed *P* value for the observed test statistic is defined as the proportion of the possible test statistics that are as large as or larger than the observed test statistic in terms of absolute value. Randomization tests were first used in GRTs in the context of the Community Intervention Trial for Smoking Cessation (COMMIT).95–97 Gail et al.98 later demonstrated that randomization tests carried the desired type I error rate for the null hypothesis of no treatment effect on average so long as the number of groups assigned to each condition was the same. Given balance at the group level, randomization tests also carried the desired type I error rate for dichotomous endpoints and for analyses that included regression adjustment for a covariate, even when the regression model was not correctly specified.98
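The procedure described above can be sketched for a completely randomized design with no matching or stratification. The test statistic (a difference between the means of group means), the function name, and the data are hypothetical choices made for illustration; a matched or stratified design would restrict the set of allocations that is enumerated.

```python
from itertools import combinations


def randomization_p_value(group_means, intervention_idx):
    """Two-tailed P value from all possible assignments of groups to conditions."""
    def difference(idx):
        treated = [group_means[i] for i in idx]
        control = [gm for i, gm in enumerate(group_means) if i not in idx]
        return sum(treated) / len(treated) - sum(control) / len(control)

    observed = abs(difference(intervention_idx))
    all_stats = [
        abs(difference(idx))
        for idx in combinations(range(len(group_means)), len(intervention_idx))
    ]
    # Small tolerance so that the observed allocation always counts itself.
    as_extreme = sum(1 for s in all_stats if s >= observed - 1e-12)
    return as_extreme / len(all_stats)


# Hypothetical group means; groups 0 to 3 were assigned to the intervention.
means = [5.1, 4.8, 5.5, 5.0, 3.9, 4.1, 3.7, 4.0]
p_value = randomization_p_value(means, intervention_idx=(0, 1, 2, 3))
print(f"two-tailed P = {p_value:.4f}")
```

With 8 groups split evenly there are only 70 possible assignments, so the smallest attainable 2-tailed *P* value is 2/70; this is one practical reason the number of groups limits what a randomization test can show.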

At the same time, randomization tests can have less power than model-based tests when the model is correct. To address that problem, Braun and Feng99 developed a weighted randomization test using the inverse of the total variance for each group as the weight; they showed this test to be the uniformly most powerful randomization test for Gaussian data. They also developed a locally most powerful randomization test based on a more complicated quasi-score method for non-Gaussian data. In a series of simulation studies, Braun and Feng showed that their optimal randomization test had nominal size and better power than alternative randomization tests or GEE, although it was still not as powerful as the model-based analysis when the model was specified correctly; additional research is needed to compare Braun and Feng’s optimal randomization test and model-based methods under model misspecification.

### Survey Methods

The clustering of data in GRTs has much in common with the clustering of data observed in complex surveys; as a result, analysis methods developed for complex surveys can have application in the analysis of data from GRTs.100,101 Since the introduction of GEE, there has been a convergence in methods used for survey applications and for many nonsurvey applications involving correlated data, including GRTs. LaVange100 showed that parameter estimates and standard errors from their survey logistic regression procedure were identical to those obtained with GEE under the assumption of working independence. LaVange also provided information on survey analysis procedures for proportional odds and proportional hazards regression models, which would be applicable to GRTs. The SUDAAN software package supports those models (http://www.rti.org/sudaan/home.cfm). As with other methods that are only asymptotically valid, caution is required when the total number of groups is below 40 unless special procedures are used to correct for underestimation of the standard errors; LaVange100 discussed this problem and proposed a correction.

### Latent Variable Methods and Nonlinear Models

Muthen102 presented a general latent variable modeling approach that encompassed a variety of techniques used in GRTs, including mixed-model ANOVA/ANCOVA and random coefficient models. Schulenberg and Maggs observed that mixed models and latent variable models gave identical results when set up to test equivalent models.103 Others have noted important differences between these approaches103–106; however, some of these differences may disappear with improvements in software.

Nonlinear mixed models are a type of mixed model in which both the fixed and random effects have a nonlinear relationship with the endpoint. They differ from the more familiar generalized linear mixed models, in which the fixed and random effects enter a linear predictor and that predictor is related to the endpoint through a nonlinear link function. Readers are referred to Davidian and Giltinan107 or Vonesh and Chinchilli108 for further information.
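The distinction between the two model classes can be written out explicitly. The notation here is an illustrative assumption, not that of the cited texts: $y_{ij}$ is the endpoint for member $i$ in group $j$, $\mathbf{x}_{ij}$ and $\mathbf{z}_{ij}$ are covariate vectors, $\boldsymbol{\beta}$ holds the fixed effects, $\mathbf{u}_j$ holds the random effects for group $j$, $g$ is a link function, and $f$ is a function that is nonlinear in $\boldsymbol{\beta}$ and $\mathbf{u}_j$.

$$
\text{Generalized linear mixed model:}\quad g\bigl(E[y_{ij}\mid \mathbf{u}_j]\bigr) = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + \mathbf{z}_{ij}^{\top}\mathbf{u}_j
$$

$$
\text{Nonlinear mixed model:}\quad y_{ij} = f(\mathbf{x}_{ij}, \boldsymbol{\beta}, \mathbf{u}_j) + e_{ij}
$$

In the first form the effects combine additively before the link is applied; in the second they may enter $f$ in any way, for example as the asymptote or rate parameter of a growth curve.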

### Interrupted Time Series

Gruenewald109 and Biglan et al.110 suggested interrupted time series methods for the evaluation of community-level interventions. The classic time series analysis compares data in a large geographic unit before and after an intervention and evaluates the intervention effect as a change from the preintervention trend, level, or variance. It draws its strength for estimating the preintervention and postintervention time patterns from many observations, thereby providing good precision. These methods would appear to be useful for within-community comparisons but, absent a reasonably large number of communities, not for between-community comparisons. If the number of communities is limited, degrees of freedom for between-community comparisons will be limited and power will be poor; nor would asymptotically valid tests be appropriate with limited degrees of freedom.
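As a rough illustration of evaluating an intervention effect as a change from the preintervention trend within a single community, the sketch below fits a straight line to the preintervention observations, projects it forward, and averages the postintervention departures from that projection. It is a deliberately simplified stand-in for a full time series analysis: it assumes a linear preintervention trend, ignores autocorrelation, and provides no standard error or test. The function names and data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean_departure_from_trend(series, intervention_index):
    """Average gap between postintervention values and the projected
    preintervention trend (point estimate only; no inference)."""
    pre_times = list(range(intervention_index))
    intercept, slope = fit_line(pre_times, series[:intervention_index])
    post = series[intervention_index:]
    departures = [
        y - (intercept + slope * t)
        for t, y in enumerate(post, start=intervention_index)
    ]
    return mean(departures)


# Hypothetical monthly rates: six preintervention, four postintervention.
series = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 13.0, 14.0, 15.0, 16.0]
effect = mean_departure_from_trend(series, intervention_index=6)
```

In practice the series would need to be much longer, and the error structure would need to be modeled, before such an estimate could be tested.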

### Global Tests for Multiple Endpoints

Many GRTs have more than 1 primary endpoint, raising the issue of how to adjust the type I error rate for multiple tests. One solution is to divide the nominal type I error rate evenly among the tests. Feng and Thompson offered as an alternative a global test that functions in much the same way as a multivariate test statistic.18
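Dividing the nominal type I error rate evenly among the tests is the Bonferroni correction, and it can be sketched as follows. The function name and the *P* values are hypothetical; the global test proposed by Feng and Thompson is not shown.

```python
def bonferroni_decisions(p_values, alpha=0.05):
    """Split the nominal type I error rate evenly across the tests and
    return the per-test threshold with a reject/retain flag for each test."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]


# Three hypothetical primary endpoints tested at an overall alpha of .05.
threshold, decisions = bonferroni_decisions([0.004, 0.03, 0.20], alpha=0.05)
```

Here each endpoint is tested at .05/3, so only the first endpoint is declared significant even though the second would pass an unadjusted test at .05.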

### Methods for Analysis of Mediation Effects

Krull and MacKinnon described methods for mediation analyses in GRTs using extensions of methods developed for RCTs.111 Simulation results indicated that the mediation estimators were unbiased and that estimation of standard errors via first-order Taylor series approximation was preferred. MacKinnon et al. expanded that discussion in an application to tobacco prevention research to include a discussion of a model with multiple mediators.112
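For a single mediator, the mediated effect and its first-order Taylor series standard error are commonly written as shown below. The symbols are assumptions introduced for illustration: $\hat{a}$ is the estimated effect of condition on the mediator, $\hat{b}$ is the estimated effect of the mediator on the endpoint adjusted for condition, and $s_{\hat{a}}$ and $s_{\hat{b}}$ are their standard errors, each taken from a model that accounts for the nesting of members within groups. Whether this form matches every estimator evaluated in the cited simulations is not established here.

$$
\widehat{ab} = \hat{a}\,\hat{b}, \qquad
s_{\widehat{ab}} \approx \sqrt{\hat{a}^{2}\, s_{\hat{b}}^{2} + \hat{b}^{2}\, s_{\hat{a}}^{2}}
$$

The approximation treats $\hat{a}$ and $\hat{b}$ as uncorrelated and drops the second-order term $s_{\hat{a}}^{2} s_{\hat{b}}^{2}$.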

### Missing Data

Missing data are as serious a problem in GRTs as they are in RCTs. Fortunately, methods developed for RCTs are easily adapted to GRTs. For example, Yi and Cook reported on marginal methods for missing data from clustered designs.113 Hunsberger et al. described strategies for missing data in GRTs and identified a multiple imputation method that carried acceptable type I and type II error rates in simulations.114

### Software

There has been substantial improvement over the past 5 years in the software available for analysis of GRTs. Zhou et al. reviewed many of these programs and reported that when they were used correctly to fit equivalent models, they gave the same results in simulation studies.115 HLM (http://www.ssicentral.com/hlm/hlm5all.htm) provides a flexible and powerful vehicle for a variety of analyses appropriate for GRTs.32,116 It can be used with Gaussian, binary, and Poisson data and can fit 2- and 3-level models. As such, it supports both nested cross-sectional and nested cohort designs. HLM also supports latent variable estimation, multiple imputation, GEE, and sandwich estimation for standard errors. HLM relies on REML for Gaussian endpoints and PQL for non-Gaussian endpoints. The Laplace approximation to ML is available for 2-level and 3-level Bernoulli models.

Several SAS (http://www.sas.com/) procedures support analyses for GRTs. PROC MIXED117,118 supports models and covariance structures for Gaussian endpoints.1,119 The GLIMMIX macro118 supports parallel models and structures for non-Gaussian endpoints and can perform mixed-model logistic and Poisson regression. Some have criticized GLIMMIX because it uses pseudo-likelihood estimation, which is similar to PQL and so underestimates fixed effects and their standard errors under the circumstances noted earlier. However, because most GRTs do not fit those circumstances, GLIMMIX remains a valid tool for most GRTs.

More recently, SAS introduced PROC NLMIXED, which is a nonlinear mixed-model regression procedure.117 NLMIXED uses numerical integration for ML estimation and so is more appropriate than GLIMMIX for GRTs that involve very small groups (e.g., family studies). NLMIXED can be used with Gaussian, binomial, and Poisson distributions for mixed-model linear, logistic, and Poisson regression; users can also construct their own log-likelihood function to perform, for example, a clustered ordinal logistic regression or frailty analysis (O. Schabenberger; written communication; April 9, 2003). NLMIXED can accommodate nested designs, although the procedure will encounter computational difficulties if the number of random terms exceeds 5 or 6.120

The NLMIXED procedure does not support the within-group repeated measures structures available in MIXED and GLIMMIX; instead, NLMIXED assumes that repeated observations within a member or group are uncorrelated. MIXED and GLIMMIX support model-based and sandwich estimation for standard errors, while NLMIXED provides only model-based estimation. PROC PHREG and PROC GENMOD support sandwich estimation for standard errors and so can be applied to GRTs to perform Cox regression and logistic and ordinal logistic regression, respectively121; however, caution is required when there are fewer than 40 groups, absent a correction for the bias in the sandwich estimator.

MIXOR (http://tigger.uic.edu/~hedeker/mix.html) and its related programs122–124 can be used with Gaussian, binary, and Poisson data to provide mixed-model linear, logistic, and Poisson regression. These programs also allow mixed-model grouped-time survival analysis,85 mixed-model logistic or probit analysis for ordinal endpoints,125 and mixed-model logistic regression for nominal endpoints.124,126

The MLwiN program (http://multilevel.ioe.ac.uk/index.html) can be used with Gaussian, Bernoulli, binomial, multinomial, and Poisson distributions and can also fit ordinal logistic models for clustered data.127 The SUDAAN software package (http://www.rti.org/sudaan/home.cfm) (Research Triangle Institute, Research Triangle Park, NC) supports models for analysis of survey data that are often applicable to GRTs. In addition, SPSS (http://www.spss.com) has introduced a mixed-model regression program that supports several covariance structures.128

None of the programs just mentioned incorporate a correction for the underestimation bias in the sandwich estimator when the data are binary and there are few groups per condition. As indicated earlier, the work in that area seems to be converging on a solution, and this may encourage the developers to add such a correction to their procedures.

### Recommendations for Trial Reporting

Investigators reporting on GRTs are encouraged to report their reasons for choosing group randomization; separate eligibility criteria, sampling schemes, and informed consent procedures for groups and members; justification for their sample size; ICC or variance component estimates from the analysis of intervention effects; and details of the analysis methods and software used.1,16,29,40

## CONCLUSION

The purpose of this article has been to review the methodological developments from the past 5 years regarding the design and analysis of GRTs. The sheer volume of work is quite remarkable, and while every effort was made to provide a thorough review based on extensive searches of electronic databases and other sources, there are no doubt relevant papers that we did not include. Nonetheless, this review makes clear that there are valid methods that are readily available and well documented for the design and analysis of GRTs. We hope that this review will help investigators familiarize themselves with these methods and encourage them to collaborate with methodologists who can use these developments to strengthen the design and analysis of their trials.

Certainly, the methods required for GRTs are not as simple as those required for RCTs, and this is unfortunate. As noted 5 years ago, however:

Whenever the investigator wants to evaluate an intervention that operates at a group level, manipulates the social or physical environment, or cannot be delivered to individuals, a group-randomized trial design is the best comparative design available.1(p15)

When that text appeared in 1998, it attempted to address the question of how to conduct GRTs well. Clearly the developments of the past 5 years have made it even easier to conduct GRTs well, and we simply must do a better job of taking advantage of these developments.

## Acknowledgments

We wish to acknowledge helpful comments from Ziding Feng, Fred Hutchinson Cancer Research Center; Barry Graubard, National Cancer Institute; Peter Hannan, University of Minnesota; Donald Hedeker, University of Illinois, Chicago; Stephen Raudenbush, University of Michigan; Oliver Schabenberger, SAS Institute Inc; and Alexander Wagenaar, University of Minnesota.

## Notes

**Contributors**

D. M. Murray wrote the first and final drafts of the article. S. P. Varnell and J. L. Blitstein located many of the reviewed articles and edited several versions of the article.

Peer Reviewed

## References

*Design and Analysis of Group-Randomized Trials*. New York, NY: Oxford University Press Inc; 1998.

*Survey Sampling*. New York, NY: John Wiley & Sons Inc; 1965.

*Advances in Clinical Trials Biostatistics*. New York, NY: Marcel Dekker Inc; 2003.

*Design and Analysis of Cluster Randomization Trials in Health Research*. London, England: Arnold; 2000.

*Generalized, Linear and Mixed Models*. New York, NY: John Wiley & Sons Inc; 2001.

*Applied Mixed Models in Medicine*. Chichester, England: John Wiley & Sons Inc; 1999.

*Hierarchical Linear Models*. 2nd ed. Thousand Oaks, Calif: Sage Publications; 2002.

*Introducing Multilevel Modeling*. London, England: Sage Publications; 1998.

*Analysis of Health Surveys*. New York, NY: John Wiley & Sons Inc; 1999.

*Nonlinear Models for Repeated Measurement Data*. London, England: Chapman & Hall; 1995.

*Linear and Nonlinear Models for the Analysis of Repeated Measurements*. New York, NY: Marcel Dekker; 1997.

*Hierarchical Linear Models: Applications and Data Analysis Methods*. Newbury Park, Calif: Sage Publications; 1992.

*SAS/STAT User’s Guide, Version 8*. Cary, NC: SAS Institute Inc; 1999.

*SAS System for MIXED Models*. Cary, NC: SAS Institute Inc; 1996.

*SAS/STAT Software: Changes and Enhancements, Release 8.1*. Cary, NC: SAS Institute Inc; 2000.

*SPSS 11.0 Syntax Reference Guide*. Chicago, Ill: SPSS Inc; 2002:136–151.

**American Public Health Association**
