Stat Med. Author manuscript; available in PMC Oct 7, 2009.
Published in final edited form as:
PMCID: PMC2758777

Sample size requirements to detect an intervention by time interaction in longitudinal cluster randomized clinical trials


In designing a longitudinal cluster randomized clinical trial (cluster-RCT), the interventions are randomly assigned to clusters such as clinics. Subjects within the same clinic will receive the identical intervention. Each will be assessed repeatedly over the course of the study. A mixed-effects linear regression model can be applied in a cluster-RCT with three level data to test the hypothesis that the intervention groups differ in the course of outcome over time. Using a test statistic based on maximum likelihood estimates, we derived closed form formulae for statistical power to detect the intervention by time interaction and the sample size requirements for each level. Importantly, the sample size does not depend on correlations among second level data units and the statistical power function depends on the number of second and third level data units through their product. A simulation study confirmed that theoretical power estimates based on the derived formulae are nearly identical to empirical estimates.

Keywords: longitudinal cluster RCT, three level data, power, sample size, intervention by time interaction, effect size

1. Introduction

A longitudinal cluster randomized trial (cluster-RCT) assumes a three level data structure in that the time-specific outcome assessments are nested within subjects who, in turn, are nested within the randomized clusters. For instance, consider a study designed to test the effect of an experimental intervention of physician training on the reduction in severity of patients' symptoms of depression over time. In this design, primary care clinics are randomly assigned to either the experimental or the control intervention, and each physician within an experimental clinic is trained to detect and treat depression. Each physician will treat multiple subjects who, in turn, are repeatedly measured on severity of depression symptoms over time.

The primary hypothesis in such a study would focus on the difference in declines of symptom severity over time between subjects who were treated by physicians with and without the experimental intervention. With the three level data from a longitudinal cluster-RCT, the significance of the intervention by time interaction can be tested using a mixed-effects linear regression model [1-3].

Sample size determination and power calculations are essential in designing a cluster-RCT. The number of clusters required for a target statistical power must be estimated at the experimental design stage. To this end, we build on sample size formulae for two level data structures [4-6] to derive explicit closed form power functions and sample size formulae for detecting a hypothesized interaction effect. The derivations are based on the distribution of a test statistic that uses the maximum likelihood estimate of the interaction effect. A simulation study was then conducted to verify the statistical power achieved with the estimated sample sizes.

2. Statistical Model

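A three level mixed-effects linear model for outcome Y can be written as follows:

$$Y_{ijk} = \beta_0 + \xi X_{ijk} + \tau T_{ijk} + \delta X_{ijk} T_{ijk} + u_i + u_{j(i)} + e_{ijk}, \tag{1}$$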

where $i = 1, 2, \ldots, 2N_3$ is the index for the level three unit (e.g., clinic); $j = 1, \ldots, N_2$ is the index for the level two unit (e.g., subject) nested within each $i$; and $k = 1, 2, \ldots, N_1$ is the index for the level one unit (e.g., repeated outcome observation) within each $j$. The intervention assignment indicator variable $X_{ijk} = 0$ if the $i$-th level three unit is assigned to the control intervention and $X_{ijk} = 1$ if it is assigned to the experimental intervention; therefore $X_{ijk} = X_i$ for all $j$ and $k$. A balanced design is assumed, so that $\sum_i X_i = N_3$. The time variable is denoted by $T_{ijk}$. It is assumed that $T_{ijk} = T_k$ for all $i$ and $j$, and that time increases from 0 (the baseline) to $T_{end} = N_1 - 1$ (the last time point) in equal steps of 1. Therefore, the parameter $\xi$ represents the intervention effect at baseline, and the parameter $\tau$ represents the slope of the time effect in the control group, that is, the decline in symptom severity over time. Finally, the intervention by time effect $\delta$ is of primary interest; it represents the slope difference in outcome $Y$ between the intervention groups, or the additional decline in the experimental group. The overall fixed intercept is denoted by $\beta_0$.

It is assumed that the error term $e_{ijk}$ is normally distributed as $N(0, \sigma_e^2)$, the level two random intercept $u_{j(i)} \sim N(0, \sigma_2^2)$ and the level three random intercept $u_i \sim N(0, \sigma_3^2)$. It is further assumed that $u_i \perp u_{j(i)} \perp e_{ijk}$, i.e., these three random components are mutually independent. In addition, conditional independence is assumed for all $u_{j(i)}$ and for all $e_{ijk}$, whereas the $u_i$ are unconditionally independent. That is, the $u_{j(i)}$ are independent conditional on $u_i$, and the $e_{ijk}$ are independent conditional on both $u_i$ and $u_{j(i)}$. In summary, $\beta_0$, $\xi$, $\tau$ and $\delta$ are fixed effect parameters and the last three terms in model (1) are random effects.

As the parameter $\delta$ is of primary interest, the null hypothesis to be tested is:

$$H_0: \delta = 0. \tag{2}$$

Under model (1), with its accompanying assumptions such as conditional independence among random components, it can be shown that the elements of the mean vector are

$$E(Y_{ijk}) = \beta_0 + \xi X_i + \tau T_k + \delta X_i T_k, \tag{3}$$

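and that the elements of the covariance matrix are:

$$\mathrm{Cov}(Y_{ijk}, Y_{i'j'k'}) = 1(i = i')\,\sigma_3^2 + 1(i = i',\, j = j')\,\sigma_2^2 + 1(i = i',\, j = j',\, k = k')\,\sigma_e^2, \tag{4}$$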

where $1(\cdot)$ is an indicator function. This yields in particular,

$$\mathrm{Var}(Y_{ijk}) \equiv \sigma^2 = \sigma_e^2 + \sigma_2^2 + \sigma_3^2.$$

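Therefore, the correlation among level two data can be written for $j \ne j'$ as follows:

$$\rho_2 = \mathrm{Corr}(Y_{ijk}, Y_{ij'k'}) = \frac{\sigma_3^2}{\sigma_e^2 + \sigma_2^2 + \sigma_3^2} = \frac{\sigma_3^2}{\sigma^2}. \tag{5}$$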

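Likewise, the correlation among level one data can be written for $k \ne k'$ as:

$$\rho_1 = \mathrm{Corr}(Y_{ijk}, Y_{ijk'}) = \frac{\sigma_2^2 + \sigma_3^2}{\sigma_e^2 + \sigma_2^2 + \sigma_3^2} = \frac{\sigma_2^2 + \sigma_3^2}{\sigma^2}. \tag{6}$$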

It can be easily seen that $\rho_1 \ge \rho_2$, with equality when $\sigma_2^2 = 0$.
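The mapping between the variance components and the two correlations in equations (5) and (6) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of the original paper, and the inverse mapping uses the relations stated later in the simulation steps.

```python
def intraclass_correlations(sigma_e2, sigma_2_2, sigma_3_2):
    """Return (rho1, rho2) implied by the three variance components."""
    total = sigma_e2 + sigma_2_2 + sigma_3_2   # total variance sigma^2
    rho2 = sigma_3_2 / total                   # equation (5): different subjects, same cluster
    rho1 = (sigma_2_2 + sigma_3_2) / total     # equation (6): same subject, different occasions
    return rho1, rho2


def variance_components(rho1, rho2, sigma2=1.0):
    """Inverse mapping: return (sigma_e2, sigma_2_2, sigma_3_2)."""
    if not 0.0 <= rho2 <= rho1 < 1.0:
        raise ValueError("need 0 <= rho2 <= rho1 < 1")
    sigma_3_2 = rho2 * sigma2                  # level three random intercept variance
    sigma_2_2 = (rho1 - rho2) * sigma2         # level two random intercept variance
    sigma_e2 = (1.0 - rho1) * sigma2           # residual variance
    return sigma_e2, sigma_2_2, sigma_3_2
```

Because the level two variance equals $(\rho_1 - \rho_2)\sigma^2$, the check `rho2 <= rho1` is just the inequality stated above.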

3. Maximum Likelihood Estimate and its Variance

The maximum likelihood estimate (MLE) $\hat\delta$ of the interaction effect is the slope difference between the two groups; that is,

$$\hat\delta = \hat\eta_1 - \hat\eta_0, \tag{7}$$

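where $\hat\eta_g$ ($g = 0, 1$) is the MLE of the slope for the outcome $Y$ in the $g$-th group, in which $X_i = g$. Specifically, for $i$ in the $g$-th group,

$$\hat\eta_g = \frac{\sum_{i=1}^{N_3}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1}(T_k - \bar T)(Y_{ijk} - \bar Y_g)}{N_3 N_2 N_1 \mathrm{Var}_p(T)}, \tag{8}$$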

where: 1) $\bar Y_g$ ($g = 0, 1$) is the overall group mean of the outcome $Y$ for the $g$-th group; 2) $\bar T = \sum_{k=1}^{N_1} T_k / N_1$ is the “mean” time point; and 3) $\mathrm{Var}_p(T) = \sum_{k=1}^{N_1} (T_k - \bar T)^2 / N_1$ is the “population variance” of the time variable $T$. In fact, the slope estimate (8), but not its variance, is the same as that from an ordinary linear regression with $u_i = u_{j(i)} = 0$ in model (1). The reason, on a heuristic level, is that the weights assigned to the data points $Y_{ijk}$ in estimating the slopes are identical and the slopes do not depend on the random intercepts at any data level. Indeed, the ordinary least squares estimate (8) is the MLE under the perfectly balanced design [2] considered in this paper.
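As a worked illustration under the time coding assumed here (equally spaced time points $T_k = k - 1$), the mean and population variance of time have simple closed forms. This is standard algebra for equally spaced integers rather than a formula quoted from the source:

$$\bar T = \frac{1}{N_1}\sum_{k=1}^{N_1}(k - 1) = \frac{N_1 - 1}{2}, \qquad \mathrm{Var}_p(T) = \frac{1}{N_1}\sum_{k=1}^{N_1}\left(k - 1 - \frac{N_1 - 1}{2}\right)^2 = \frac{N_1^2 - 1}{12}.$$

For example, $N_1 = 3$ gives $\mathrm{Var}_p(T) = 2/3$ and $N_1 = 6$ gives $\mathrm{Var}_p(T) = 35/12 \approx 2.92$.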

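Based on equations (3) and (8), it can easily be shown that the MLE $\hat\delta$ is unbiased, i.e., $E(\hat\delta) = E(\hat\eta_1 - \hat\eta_0) = (\tau + \delta) - \tau = \delta$. The variance of a slope MLE $\hat\eta_g$ can be obtained based on equation (4) as follows (see Appendix for a proof):

$$\mathrm{Var}(\hat\eta_g) = \frac{\sigma_e^2}{N_3 N_2 N_1 \mathrm{Var}_p(T)} = \frac{(1 - \rho_1)\sigma^2}{N_3 N_2 N_1 \mathrm{Var}_p(T)}. \tag{9}$$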

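Therefore, the variance of $\hat\delta$ is

$$\mathrm{Var}(\hat\delta) = \mathrm{Var}(\hat\eta_1) + \mathrm{Var}(\hat\eta_0) = \frac{2(1 - \rho_1)\sigma^2}{N_3 N_2 N_1 \mathrm{Var}_p(T)}. \tag{10}$$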

Observe that $\hat\eta_1$ and $\hat\eta_0$ are independent of each other. It is notable that the variance of $\hat\delta$ depends only on the residual variance $\sigma_e^2$, and on none of $\sigma_3^2$, $\sigma_2^2$, or $\rho_2$. Therefore, for a given total variance $\sigma^2$, it decreases with decreasing $\sigma_e^2$ or, equivalently, with increasing $\rho_1$, the correlation among the first level data.

4. Power and sample size

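The following test statistic $D$, based on (7) and (10), can be used to test the null hypothesis (2):

$$D = \frac{\hat\delta}{se(\hat\delta)} = \frac{\hat\delta\,\sqrt{N_3 N_2 N_1 \mathrm{Var}_p(T)}}{\sigma\sqrt{2(1 - \rho_1)}}. \tag{11}$$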

If the three variance components, $\sigma_2^2$, $\sigma_3^2$ and $\sigma_e^2$, are known, then the test statistic $D$ is normally distributed with mean $\delta / se(\hat\delta)$ and variance 1. When those three variance components are unknown and replaced by their MLEs, the test statistic $D$ becomes a Wald test statistic and its asymptotic distribution is normal based on large sample theory [7]. Thus, under the null hypothesis (2), $D \sim N(0, 1)$, and under an alternative hypothesis of $\delta \ne 0$, $D \sim N(\delta / se(\hat\delta), 1)$.
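A minimal sketch of the known-variance version of this test follows. The function name and argument names are hypothetical, and the total variance and $\rho_1$ are assumed to be known rather than estimated from a fitted mixed model.

```python
from math import sqrt
from statistics import NormalDist


def interaction_z_test(delta_hat, sigma2, rho1, n3, n2, n1, var_p_t):
    """Return (D, two-sided p-value) for H0: delta = 0 with known variances.

    n3 is the number of level three units PER GROUP; var_p_t is the
    population variance of the time points.
    """
    # Standard error of delta-hat: square root of equation (10).
    se = sqrt(2.0 * (1.0 - rho1) * sigma2 / (n3 * n2 * n1 * var_p_t))
    d = delta_hat / se                                  # equation (11)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(d)))    # D ~ N(0, 1) under H0
    return d, p_value
```

With estimated variance components the same ratio is a Wald statistic, and the normal reference distribution holds only asymptotically.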

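The power of the test statistic $D$, denoted by $\varphi$, can therefore be written as follows:

$$\varphi = 1 - \beta = \Phi\left\{\frac{\delta}{se(\hat\delta)} - \Phi^{-1}(1 - \alpha/2)\right\} = \Phi\left\{\delta\sqrt{\frac{N_3 N_2 N_1 \mathrm{Var}_p(T)}{2(1 - \rho_1)\sigma^2}} - \Phi^{-1}(1 - \alpha/2)\right\}, \tag{12}$$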

where $\alpha$ is a two-sided significance level; $\beta$ represents the probability of type II error; $\Phi$ is the cumulative distribution function (CDF) of a standard normal distribution and $\Phi^{-1}$ is its inverse. From now on, it is understood that: 1) $\delta = |\delta| > 0$; and 2) the probability of falling below the lower critical value, $\Phi^{-1}(\alpha/2)$, under the alternative hypothesis is negligible and thus assumed to be 0. When the slope difference is expressed in pooled within-group standard deviation (SD) units, i.e., in terms of the standardized effect size

$$\Delta_\delta = \delta / \sigma,$$

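the power function can be expressed as follows:

$$\varphi = 1 - \beta = \Phi\left\{\Delta_\delta\sqrt{\frac{N_3 N_2 N_1 \mathrm{Var}_p(T)}{2(1 - \rho_1)}} - \Phi^{-1}(1 - \alpha/2)\right\}. \tag{13}$$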

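It follows that when the hypothesis testing is based on $D$ with a two-sided significance level of $\alpha$, the third level unit sample size $N_3$ per group for a desired statistical power $\varphi = 1 - \beta$ can be calculated from equation (12) as:

$$N_3 = \frac{2(1 - \rho_1)\sigma^2\left\{\Phi^{-1}(1 - \alpha/2) + \Phi^{-1}(1 - \beta)\right\}^2}{N_2 N_1 \mathrm{Var}_p(T)\,\delta^2}, \tag{14}$$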

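or equivalently, in terms of the standardized effect size $\Delta_\delta$, from equation (13):

$$N_3 = \frac{2(1 - \rho_1)\left\{\Phi^{-1}(1 - \alpha/2) + \Phi^{-1}(1 - \beta)\right\}^2}{N_2 N_1 \mathrm{Var}_p(T)\,\Delta_\delta^2}. \tag{15}$$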

More precisely, $N_3$ is the smallest integer greater than the right hand side of equation (14) or (15). It can be observed that the level 3 sample size is a decreasing function of $\rho_1$ and of $\mathrm{Var}_p(T)$ in particular. Stated differently, longer follow-up with more consistent (as opposed to erratic) observations within subjects over time will increase the power (13) and, at the same time, will reduce the sample size $N_3$ or $N_2$ required for the same anticipated power.
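The power function (13) and the sample size formula (15) can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, time is assumed to be coded 0, 1, …, N1 − 1 as in model (1), and `effect` denotes the standardized slope difference $\Delta_\delta$.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def var_p_time(n1):
    """Population variance of the equally spaced time points 0, 1, ..., n1 - 1."""
    times = range(n1)
    mean_t = sum(times) / n1
    return sum((t - mean_t) ** 2 for t in times) / n1


def power_interaction(effect, n3, n2, n1, rho1, alpha=0.05):
    """Theoretical power, equation (13); n3 is the number of clusters per group."""
    shift = effect * sqrt(n3 * n2 * n1 * var_p_time(n1) / (2.0 * (1.0 - rho1)))
    return _STD_NORMAL.cdf(shift - _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0))


def clusters_per_group(effect, n2, n1, rho1, alpha=0.05, power=0.8):
    """Number of level three units per group, equation (15), rounded up."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) + _STD_NORMAL.inv_cdf(power)
    raw = 2.0 * (1.0 - rho1) * z ** 2 / (n2 * n1 * var_p_time(n1) * effect ** 2)
    # ceil() differs from "smallest integer greater than" only when raw is
    # exactly an integer.
    return ceil(raw)
```

Under these assumptions, `clusters_per_group(0.06, 5, 6, 0.4)` evaluates to 30, and the power depends on `n2` and `n3` only through their product.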

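The sample size $N_2$ has a reciprocal relationship with $N_3$ in the sense that the power depends on them only through the product $N_2 N_3$, because the two can be chosen freely of each other and of the other parameters. Therefore, the sample size $N_2$ for the level two data can immediately be determined from equation (15) as follows:

$$N_2 = \frac{2(1 - \rho_1)\left\{\Phi^{-1}(1 - \alpha/2) + \Phi^{-1}(1 - \beta)\right\}^2}{N_3 N_1 \mathrm{Var}_p(T)\,\Delta_\delta^2}. \tag{16}$$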

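The sample size $N_1$ for the level one data should, however, be determined in an iterative manner because $\mathrm{Var}_p(T)$ is a function of $N_1$. Specifically, an iterative solution for $N_1$ must satisfy the following equation:

$$N_1 = \frac{2(1 - \rho_1)\left\{\Phi^{-1}(1 - \alpha/2) + \Phi^{-1}(1 - \beta)\right\}^2}{N_3 N_2 \mathrm{Var}_p(T)\,\Delta_\delta^2}. \tag{17}$$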

5. Simulation study specification

We conducted simulation studies to verify the sample size $N_3$ (15) and the power function (13) using SAS PROC MIXED, which is suitable for fitting the three-level mixed-effects linear model (1). For a two-sided significance level $\alpha = 0.05$ and a desired power $\varphi = 0.8$, the following combinations of the simulation parameters were prespecified: $\Delta_\delta T_{end} = \Delta_\delta (N_1 - 1)$ = 0.3, 0.4, 0.5; $N_2$ = 5, 10, 20, 30; $N_1$ = 3, 6, 12; $\rho_1$ = 0.4, 0.5, 0.6, while, without loss of generality, $\sigma = 1$, $\rho_2 = 0.05$, $\beta_0 = \xi = 0$, and $\tau = -1$ (in model (1)) remained fixed. This 3×4×3×3 factorial design yielded a total of 108 combinations of those parameters. In particular, the effect size of the interaction, or the standardized between-group slope difference $\Delta_\delta$, is specified so that it yields a standardized between-group mean difference $\Delta_\delta T_{end}$ at the end of the trial, i.e., when $T = T_{end} = N_1 - 1$.

To generate simulated data, we first estimated N3 using equation (15) for a given combination (see step 2 below). Specifically, for each combination we carried out the following steps:

  1. Calculate the variance of time, $\mathrm{Var}_p(T)$, for the given $N_1$;
  2. Calculate $N_3$ from equation (15) with the computed $\mathrm{Var}_p(T)$ and the given $\alpha$, $\varphi$, $N_1$, $N_2$, and $\Delta_\delta$;
  3. Calculate the variance components $\sigma_2^2$ and $\sigma_3^2$ based on equations (5) and (6) for the given $\rho_1$, $\rho_2$ and $\sigma^2$; specifically, $\sigma_2^2 = (\rho_1 - \rho_2)\sigma^2$ and $\sigma_3^2 = \rho_2\sigma^2$;
  4. Calculate $\sigma_e^2 = \sigma^2 - (\sigma_3^2 + \sigma_2^2)$;
  5. Calculate $\delta = \sigma\Delta_\delta$ for the given $\sigma^2$ and $\Delta_\delta$;
  6. Generate the random intervention assignment indicator $X_i$ = 0 or 1 for each $i = 1, 2, \ldots, 2N_3$ in a balanced manner so that $\sum_i X_i = N_3$;
  7. Generate $u_i$ from $N(0, \sigma_3^2)$ independently for each $i = 1, 2, \ldots, 2N_3$ (unconditional independence assumption);
  8. For each $u_i$, generate $u_{j(i)}$ from $N(0, \sigma_2^2)$ independently for $j = 1, 2, \ldots, N_2$ (conditional independence assumption);
  9. For each combination of $u_i$ and $u_{j(i)}$, generate $e_{ijk}$ from $N(0, \sigma_e^2)$ independently for $k = 1, 2, \ldots, N_1$ (conditional independence assumption);
  10. Generate the outcome data set from $Y_{ijk} = \beta_0 + \xi X_i + \tau T_k + \delta X_i T_k + u_i + u_{j(i)} + e_{ijk}$, i.e., model (1);
  11. Fit the data set with the three-level linear mixed-effects model (1);
  12. Retain the p-value, denoted by $p_s(\delta)$ for the $s$-th simulated data set, obtained from testing the null hypothesis (2);
  13. Repeat steps 6-12 1000 times (i.e., $s = 1, 2, \ldots, 1000$) for each combination of the simulation parameters.
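The data-generating steps above can be sketched in plain Python. One substitution is made and should be kept in mind: instead of fitting model (1) with mixed-model software (steps 11-12), this sketch uses the closed-form slope estimate (8) and the known-variance statistic D (11), so it illustrates the procedure rather than reproducing the original analysis. The function name, defaults and seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_empirical_power(effect, n3, n2, n1, rho1, rho2=0.05, sigma2=1.0,
                             tau=-1.0, alpha=0.05, n_sims=1000, seed=12345):
    """Monte Carlo rejection rate for H0: delta = 0 (n3 = clusters per group)."""
    rng = random.Random(seed)
    # Step 1: centered time points and the population variance of time.
    times = list(range(n1))
    mean_t = sum(times) / n1
    w = [t - mean_t for t in times]
    var_p_t = sum(x * x for x in w) / n1
    # Steps 3-5: standard deviations of the random components and raw delta.
    sd3 = sqrt(rho2 * sigma2)
    sd2 = sqrt((rho1 - rho2) * sigma2)
    sde = sqrt((1.0 - rho1) * sigma2)
    delta = effect * sqrt(sigma2)
    denom = n3 * n2 * n1 * var_p_t
    se = sqrt(2.0 * sde ** 2 / denom)          # square root of equation (10)
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(n_sims):
        slopes = []
        # Step 6: balanced design. Clusters are exchangeable, so n3 clusters
        # are simply generated for each arm (x = 0 control, x = 1 experimental).
        for x in (0, 1):
            num = 0.0
            for _i in range(n3):
                u3 = rng.gauss(0.0, sd3)               # step 7
                for _j in range(n2):
                    u2 = rng.gauss(0.0, sd2)           # step 8
                    for k in range(n1):
                        e = rng.gauss(0.0, sde)        # step 9
                        # Step 10 with beta0 = xi = 0.
                        y = tau * times[k] + delta * x * times[k] + u3 + u2 + e
                        # Numerator of (8); the group mean drops out because
                        # the centered times sum to zero.
                        num += w[k] * y
            slopes.append(num / denom)
        d = (slopes[1] - slopes[0]) / se               # equation (11)
        if abs(d) > crit:                              # two-sided test of (2)
            rejections += 1
    return rejections / n_sims
```

Because the variances are treated as known here, the rejection rate should agree with equation (13) up to Monte Carlo error; the study itself retained p-values from the fitted mixed model.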

Let us denote by $\tilde\varphi$ the empirical power obtained from the 1000 simulations as follows:

$$\tilde\varphi = \frac{1}{1000}\sum_{s=1}^{1000} 1\{p_s(\delta) < \alpha\}. \tag{18}$$

This empirical power is compared with the theoretical power $\varphi$ that is computed based on the $N_3$ obtained in step 2 above, not with the prespecified power of 0.8. It should be noted that the theoretical power $\varphi$ obtained in that way is never less than the prespecified power of 0.8, since $N_3$ is the smallest integer greater than the right hand side of equation (15).

6. Simulation study results

Table 1 summarizes the specified ($N_2$ and $N_1$) and estimated ($N_3$) sample sizes, the empirical power $\tilde\varphi$ (18) and the theoretical power $\varphi$ (13) based on the estimated $N_3$. Although the empirical power is slightly lower on average, as reflected in the mean differences in the last row of Table 1, it is virtually identical to the theoretical power. For instance, among the 108 combinations (Table 1), the maximum absolute difference $|\varphi - \tilde\varphi|$ was 0.027, which is tolerable given that the 95% margin of error for simulation estimates is approximately $\pm 1.96\sqrt{0.8 \times 0.2 / 1000} = \pm 0.025$. Thus, the derived formulae for sample size and power are very accurate under the conditions that were examined. In each case, the theoretical power is no less than 0.8, since the power calculations were based on “integer” values of $N_3$.
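The Monte Carlo margin of error quoted above follows from the binomial standard error of a simulated proportion. A short sketch, with a hypothetical function name:

```python
from math import sqrt
from statistics import NormalDist


def monte_carlo_margin(power, n_sims, level=0.95):
    """Half-width of the normal-approximation interval for a simulated power."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)        # about 1.96 for level = 0.95
    return z * sqrt(power * (1.0 - power) / n_sims)    # binomial standard error times z
```

For a true power of 0.8 and 1000 replicates this gives roughly 0.025, the value used in the comparison.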

Table 1
Sample size $N_3$, theoretical power $\varphi$, and empirical power $\tilde\varphi$ for testing the intervention group by time interaction effect in a three level mixed-effects linear regression analysis, based on 1000 simulations.

As expected, the sample size N3 for the identical power decreases with increasing correlation ρ1 when the other design parameters are held the same. For example, when N2 = 5, N1 = 6, and ΔδTend = 0.3 (or Δδ = 0.3/5 = 0.06), the respective sample size requirements for 80% power for the level three data (N3) were 30, 25, and 20 for ρ1 = 0.4, 0.5, and 0.6. Furthermore, the theoretical power is identical for various combinations of N2 and N3 that yield an equivalent product, assuming the other design parameters are held constant. For instance, as shown in Table 1, each of the following pairs of N2 and N3 with a product of 210 yielded an identical power of 0.801 when N1 = 3, ρ1 = 0.4, ΔδTend = 0.3 (or Δδ = 0.3/2 = 0.15): N2 = 5 and N3 = 42; N2 = 10 and N3 = 21; N2 = 30 and N3 = 7.

7. Application

The results in Table 1 can be applied to designing a longitudinal cluster-RCT. Consider, for instance, a longitudinal cluster-RCT that compares an innovative primary care level intervention with usual primary care practice on the depression outcomes of subjects, as conducted in the PROSPECT [8,9] and the RESPECT [10] trials. To test whether the course of depressive symptoms over time depends on the care that the subjects receive, suppose that each primary care clinic can accommodate 20 subjects (N2) for research purposes and that each subject would be assessed 6 times (N1). The results presented in Table 1 can then be used to estimate the number of primary care clinics, i.e., level 3 units (N3), needed for 80% power. If ρ1 = 0.5, then four clinics (N3) for each of the two intervention groups, or a total of 160 subjects, would be needed to detect an effect size ΔδTend = 5Δδ = 0.4 (or Δδ = 0.4/5 = 0.08) with at least 80% statistical power (Table 1). Sample size requirements for other design parameters can be obtained from Table 1. For combinations of design specifications that are not presented in Table 1, the sample size formula (15) can be applied.
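For design specifications outside Table 1, equations (15) to (17) can be solved numerically. The sketch below treats the quantity $N_3 N_2 N_1 \mathrm{Var}_p(T)$ as a single requirement and then solves for whichever design parameter is left free. All names, including `required_information`, are hypothetical labels introduced here; time is assumed to be coded 0, …, N1 − 1, and `effect` is the per-time-unit standardized slope difference $\Delta_\delta$, held fixed while N1 varies.

```python
from math import ceil
from statistics import NormalDist


def var_p_time(n1):
    """Closed form of Var_p(T) for equally spaced times 0, 1, ..., n1 - 1."""
    return (n1 * n1 - 1) / 12.0


def required_information(effect, rho1, alpha=0.05, power=0.8):
    """Smallest admissible value of N3 * N2 * N1 * Var_p(T), from equation (15)."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)
    return 2.0 * (1.0 - rho1) * z * z / effect ** 2


def solve_n3(effect, n2, n1, rho1, **kwargs):
    """Clusters per group, equation (15)."""
    return ceil(required_information(effect, rho1, **kwargs) / (n2 * n1 * var_p_time(n1)))


def solve_n2(effect, n3, n1, rho1, **kwargs):
    """Subjects per cluster, equation (16)."""
    return ceil(required_information(effect, rho1, **kwargs) / (n3 * n1 * var_p_time(n1)))


def solve_n1(effect, n3, n2, rho1, max_n1=1000, **kwargs):
    """Smallest number of assessments satisfying (17), found by direct search."""
    target = required_information(effect, rho1, **kwargs)
    for n1 in range(2, max_n1 + 1):        # a slope needs at least 2 time points
        if n3 * n2 * n1 * var_p_time(n1) >= target:
            return n1
    raise ValueError("no n1 <= max_n1 reaches the requested power")
```

With the inputs of the example above, `solve_n3(0.08, 20, 6, 0.5)` gives 4, in agreement with the four clinics per group read from Table 1.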

8. Discussion

The derived power function (13) and the level 3 unit sample size formula (15) for detecting an intervention by time interaction were shown to be accurate when compared with empirical estimates from a simulation study. Therefore, the sample size formulae (16, 17) for the numbers of level 2 and level 1 data units are also accurate, because they are different expressions of equation (15). Importantly, the sample size did not depend on correlations among second level data units, and the statistical power function depends on the numbers of second and third level data units only through their product. Furthermore, when either $N_3$ or $N_2$ is equal to one, the three level data structure reduces to a two level data structure with the number of second level units being $N_2$ or $N_3$, correspondingly. In either case, the variance $\sigma_3^2$ of the level three random intercept can be considered to be 0 and thus $\rho_2$ can be assumed to be 0. This reduces the sample size formula (14) to equation (2.4.1) on page 29 of Diggle et al [6], as it should. In Diggle et al's formula too, the power function is increasing in $\rho_1$.

Collectively, therefore, as far as testing the intervention by time interaction is concerned, the design can be very flexible for the same statistical power, depending on feasibility. For example, when N3N2 = 200 subjects per group are needed for 80% power, N3 and N2 can be chosen according to the availability of level two and level three units, regardless of the anticipated ρ2. To this end, if recruitment of 10 subjects (N2) per clinic was feasible, then the investigators could try to enlist 20 clinics (N3) per intervention group. On the other hand, if only 5 clinics (N3) were available per intervention group, then recruitment of 40 subjects (N2) per clinic would be required. In an extreme case where only one clinic (N3 = 1) is available per group, one could recruit 200 subjects (N2) from the single clinic.

Although the empirical power was based on unknown variance components of random effects, it was virtually identical to the theoretical power derived with known variance components in the test statistic $D$ (11). Therefore, derivation of a power function with unknown variances may not be necessary even for small $N_3$, although it might be possible through application of the CDFs of central and non-central t distributions [11], replacing the standard normal CDF $\Phi$ and its inverse $\Phi^{-1}$ in equation (14) or (15).

It should be noted that the sample size formula is intended to detect a slope difference per se, not an expected between-group difference at $T_{end}$, the end of a study. In other words, the sample size formula (15) derived herein is not appropriate for detecting an intervention effect at a prespecified time point such as the end of a trial. This is because the variance of that effect is not equal to $T_{end}^2 \mathrm{Var}(\hat\eta_1 - \hat\eta_0)$, even if the estimated quantities are the same. Thus, this intervention effect, $\Delta_\delta T_{end}$, served only as the basis for specifying a hypothesized slope difference $\Delta_\delta$.

Other sample size formulae are available. For instance, Liu et al [12] derived sample size formulae for the slope difference using generalized estimating equations. Murray et al [13] presented detectable effect sizes based on expected mean square errors using random coefficients analysis for the nested cohort design. Roy et al [14] derived general-form sample size determinations using a mixed-effects linear model, taking into account potential attrition rates and more general correlation structures. Heo and Leon [15] derived an algorithm for sample size requirements to detect a main effect of group using a linear mixed effects model for three level data. Although comparisons of sample sizes under different modeling approaches would provide better insight into designing a cluster-RCT, the sample size equations presented above (15,16,17) are more readily implemented.

The sample size determinations derived here have limitations. First, the formulae were derived assuming fixed numbers of units at all levels, although the number of subjects per clinic will likely vary, i.e., $j = 1, 2, \ldots, n_i$, depending on the $i$-th clinic. Furthermore, the number of assessments per subject will also vary (i.e., $k = 1, 2, \ldots, n_{ij}$, depending on both clinics and subjects) because attrition of subjects during a trial is in reality the norm rather than the exception [16,17]. Nevertheless, our derivation based on non-varying cluster sizes provides a useful approximation and, further, can serve as a basis for deriving a sample size algorithm for varying cluster sizes. For instance, if the variation in the cluster sizes is completely at random in the missing data analysis framework [18], replacing the varying cluster sizes with an average cluster size has been shown to be effective for sample size and statistical power calculations under two level binary outcome data [19]. Second, for pragmatic reasons the covariance structure (4) considered here was based on the conditional independence assumption. Therefore, the robustness of the derived formulae under alternative covariance structures, such as an autocorrelated or unstructured covariance matrix, is unknown.

In conclusion, the derived formulae for sample sizes (15,16,17) and power functions (12,13) can be useful in designing community based longitudinal cluster-randomized clinical trials that compare slopes of outcomes over time between two intervention groups in a three level data structure.


Acknowledgments

We are grateful to Donald Hedeker, Ph.D., two anonymous referees and an Associate Editor for their valuable suggestions. This study was supported in part by NIMH grants P30MH068638 and R01MH060447.


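Appendix

Proof of equation (9), $\mathrm{Var}(\hat\eta_g) = \dfrac{\sigma_e^2}{N_3 N_2 N_1 \mathrm{Var}_p(T)} = \dfrac{(1 - \rho_1)\sigma^2}{N_3 N_2 N_1 \mathrm{Var}_p(T)}$. Let $W_k = T_k - \bar T$. Then we have: $\sum_{k=1}^{N_1} W_k^2 = N_1 \mathrm{Var}_p(T)$; $\sum_{k=1}^{N_1} W_k = 0$; $\sum_{k' \ne k}^{N_1} W_{k'} = -W_k$; and

$$\hat\eta_g = \frac{\sum_{i=1}^{N_3}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1} W_k (Y_{ijk} - \bar Y_g)}{N_3 N_2 N_1 \mathrm{Var}_p(T)} = \frac{\sum_{i=1}^{N_3}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1} W_k Y_{ijk}}{N_3 N_2 N_1 \mathrm{Var}_p(T)}.$$

Observing that $Y$ is independent over $i$, we decompose the variance of the numerator of $\hat\eta_g$ as follows:

$$\begin{aligned}
\mathrm{Var}\left(\sum_{i=1}^{N_3}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1} W_k Y_{ijk}\right) &= \sum_{i=1}^{N_3}\mathrm{Var}\left(\sum_{j=1}^{N_2}\sum_{k=1}^{N_1} W_k Y_{ijk}\right) = A + B + C, \\
A &= \sum_{i=1}^{N_3}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1} W_k^2\,\mathrm{Var}(Y_{ijk}), \\
B &= \sum_{i=1}^{N_3}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1}\sum_{k' \ne k}^{N_1} W_k W_{k'}\,\mathrm{Cov}(Y_{ijk}, Y_{ijk'}), \\
C &= \sum_{i=1}^{N_3}\sum_{j=1}^{N_2}\sum_{j' \ne j}^{N_2}\sum_{k=1}^{N_1}\sum_{k'=1}^{N_1} W_k W_{k'}\,\mathrm{Cov}(Y_{ijk}, Y_{ij'k'}).
\end{aligned}$$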

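Now, recall equation (4), that is,

$$\mathrm{Cov}(Y_{ijk}, Y_{i'j'k'}) = 1(i = i')\,\sigma_3^2 + 1(i = i',\, j = j')\,\sigma_2^2 + 1(i = i',\, j = j',\, k = k')\,\sigma_e^2.$$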

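It follows that $A = \sigma^2 N_3 N_2 N_1 \mathrm{Var}_p(T)$ since $\mathrm{Var}(Y_{ijk}) = \sigma^2 = \sigma_e^2 + \sigma_2^2 + \sigma_3^2$. Further, $\sum_{k=1}^{N_1}\sum_{k' \ne k}^{N_1} W_k W_{k'}\,\mathrm{Cov}(Y_{ijk}, Y_{ijk'}) = -(\sigma_2^2 + \sigma_3^2)\sum_{k=1}^{N_1} W_k^2$ since $\sum_{k' \ne k}^{N_1} W_{k'} = -W_k$. Therefore, $B = -(\sigma_2^2 + \sigma_3^2) N_3 N_2 N_1 \mathrm{Var}_p(T)$. It is easy to see that $C = 0$ since $\sum_{k=1}^{N_1} W_k = 0$. Hence, we have $\mathrm{Var}\left(\sum_{i=1}^{N_3}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1} W_k Y_{ijk}\right) = A + B = \sigma_e^2 N_3 N_2 N_1 \mathrm{Var}_p(T)$. Dividing by the squared denominator of $\hat\eta_g$, $\{N_3 N_2 N_1 \mathrm{Var}_p(T)\}^2$, shows that equation (9) holds.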
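The identity at the heart of this proof can also be checked numerically by summing the covariance structure (4) term by term. The brute-force sketch below uses a hypothetical function name and equally spaced times 0, …, N1 − 1; it is a sanity check under those assumptions, not part of the original proof.

```python
def numerator_variance(sigma_e2, sigma_2_2, sigma_3_2, n3, n2, n1):
    """Variance of sum_i sum_j sum_k W_k * Y_ijk under covariance structure (4)."""
    times = range(n1)
    mean_t = sum(times) / n1
    w = [t - mean_t for t in times]            # centered time points W_k

    def cov(j, k, j2, k2):
        """Cov(Y_ijk, Y_ij'k') for two observations in the SAME cluster i."""
        c = sigma_3_2                          # shared level three intercept
        if j == j2:
            c += sigma_2_2                     # shared level two intercept
            if k == k2:
                c += sigma_e2                  # same observation: residual variance
        return c

    per_cluster = sum(
        w[k] * w[k2] * cov(j, k, j2, k2)
        for j in range(n2) for k in range(n1)
        for j2 in range(n2) for k2 in range(n1)
    )
    return n3 * per_cluster                    # clusters are independent over i
```

For any choice of variance components the result equals $\sigma_e^2 N_3 N_2 N_1 \mathrm{Var}_p(T)$; the level two and level three variances cancel because the centered times sum to zero.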


References

1. Goldstein H. Multilevel Statistical Models. 2nd ed. Wiley; New York: 1996.
2. Raudenbush SW, Bryk AS. Hierarchical Linear Models: Applications and Data Analysis Methods. 2nd ed. SAGE; Thousand Oaks: 2002.
3. Hedeker D, Gibbons RD. Longitudinal Data Analysis. Wiley; Hoboken, NJ: 2006.
4. Donner A, Birkett N, Buck C. Randomization by cluster: Sample size requirements and analysis. American Journal of Epidemiology. 1981;114:906–914.
5. Donner A, Klar N. Statistical considerations in the design and analysis of community intervention trials. Journal of Clinical Epidemiology. 1996;49:435–439.
6. Diggle PJ, Heagerty P, Liang K-Y, Zeger SL. Analysis of Longitudinal Data. 2nd ed. Oxford University Press; New York: 2002.
7. Serfling RJ. Approximation Theorems of Mathematical Statistics. Wiley; New York: 1980.
8. Alexopoulos GS, Katz IR, Bruce ML, Heo M, Ten Have T, Raue PJ, Bogner HR, Schulberg HC, Mulsant BH, Reynolds CF III, the PROSPECT Group. Remission in depressed geriatric primary care patients: a report from the PROSPECT study. American Journal of Psychiatry. 2005;162:718–724.
9. Bruce ML, Ten Have TR, Reynolds CF III, Katz I, Schulberg HC, Mulsant BH, Brown GK, McAvay GJ, Pearson JL, Alexopoulos GS. Reducing suicidal ideation and depressive symptoms in depressed older primary care patients: a randomized controlled trial. JAMA. 2004;291:1081–1091.
10. Dietrich AJ, Oxman TE, Williams JW Jr, Schulberg HC, Bruce ML, Lee PW, Barry S, Raue PJ, Lefever JJ, Heo M, Rost K, Kroenke K, Gerrity M, Nutting PA. Re-engineering systems for the primary care treatment of depression: a randomized controlled trial. British Medical Journal. 2004;329:602–605.
11. Johnson NL, Kotz S. Distributions in Statistics: Continuous Univariate Distributions-2. Houghton Mifflin; New York: 1970.
12. Liu A, Shih WJ, Gehan E. Sample size and power determination for clustered repeated measurements. Statistics in Medicine. 2002;21:1787–1801.
13. Murray DM, Blitstein JL, Hannan PJ, Baker WL, Lytle LA. Sizing a trial to alter the trajectory of health behaviors: Methods, parameter estimates, and their application. Statistics in Medicine. 2007;26:2297–2316.
14. Roy A, Bhaumik DK, Aryal S, Gibbons RD. Sample size determination for hierarchical longitudinal designs with differential attrition rates. Biometrics. 2007;63:699–707.
15. Heo M, Leon AC. Statistical power and sample size requirements for three level hierarchical cluster randomized trials. Biometrics. in press.
16. Leon AC, Mallinckrodt CH, Chuang-Stein C, Archibald DG, Archer GE, Chartier K. Attrition in randomized controlled clinical trials: methodological issues in psychopharmacology. Biological Psychiatry. 2006;59:1001–1005.
17. Heo M, Leon AC, Meyers BS, Alexopoulos GS. Problems in statistical analysis of attrition in randomized controlled clinical trials of antidepressants for geriatric depression. Current Psychiatry Reviews. 2007;3:178–185.
18. Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
19. Heo M, Leon AC. Performance of a mixed effects logistic regression model with unequal cluster size. Journal of Biopharmaceutical Statistics. 2005;15:513–526.