Biometrika. Sep 2011; 98(3): 503–518.
Published online Jul 13, 2011. doi:  10.1093/biomet/asr019
PMCID: PMC3254237

Sample size formulae for two-stage randomized trials with survival outcomes

Zhiguo Li
Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, North Carolina 27710, U.S.A., zhiguo.li@duke.edu

Abstract

Two-stage randomized trials are growing in importance in developing adaptive treatment strategies, i.e. treatment policies or dynamic treatment regimes. Usually, the first stage involves randomization to one of several initial treatments. The second stage of treatment begins when an early nonresponse criterion or response criterion is met. In the second stage, nonresponding subjects are re-randomized among second-stage treatments. Sample size calculations for planning these two-stage randomized trials with failure time outcomes are challenging because the variances of common test statistics depend in a complex manner on the joint distribution of time to the early nonresponse criterion or response criterion and the primary failure time outcome. We produce simple, albeit conservative, sample size formulae by using upper bounds on the variances. The resulting formulae only require the working assumptions needed to size a standard single-stage randomized trial and, in common settings, are only mildly conservative. These sample size formulae are based on either a weighted Kaplan–Meier estimator of survival probabilities at a fixed time-point or a weighted version of the log-rank test.

Keywords: Dynamic treatment regime, Sample size calculation, Sequential multiple assignment randomized trial, Weighted Kaplan–Meier estimator, Weighted log-rank test

1. Introduction

Adaptive treatment strategies, also called treatment policies or dynamic treatment regimes (Robins, 1986; Lavori et al., 2000; Murphy et al., 2001), are growing in importance in the management of chronic disorders and in the treatment of disorders for which there are no universally effective treatments. In both cases, initial treatments may only be acutely effective for about 40–60% of the patients; thus multiple stages of treatment are frequently required to obtain good outcomes. Formally, an adaptive treatment strategy is a sequence of decision rules, one per stage of the treatment. Each decision rule inputs patient information and outputs a recommended treatment. Examples of adaptive treatment strategies abound. See McKay et al. (2004), Murphy et al. (2007) and Marlowe et al. (2008), for examples in the treatment of alcohol dependence, in continuing care for drug abuse, and in criminology, respectively. The second author is currently collaborating with scientists on improving the following simple adaptive treatment strategy for attention deficit hyperactivity disorder in children: provide behavioural modification therapy initially, and beginning at month two and every month thereafter, assess the child’s classroom behaviour; if the behaviour problems in the classroom exceed a prespecified criterion, then augment the behavioural therapy with methylphenidate.

With the increase in the use of adaptive treatment strategies, there has been an increase in calls for clinical trial designs for use in developing these treatment strategies. Sequential, multiple assignment, randomized trials (Lavori & Dawson, 2003; Murphy, 2005; Murphy et al., 2007) have been proposed. In these trials each subject can proceed through multiple stages of treatment and may be randomized at each stage among treatments. Thus each subject may be randomized multiple times through the course of the trial. Precursors of this design have been used in a variety of medical fields (Stone et al., 1995; Tummarello et al., 1997; Rush et al., 2004; Lieberman et al., 2005). In this paper we focus on a particularly common version of the sequential multiple assignment randomized design, a two-stage randomized trial: in the first stage, subjects are randomized to one of the two initial treatments. The second stage of treatment begins when and if early signs of nonresponse occur. In the second stage, nonresponding subjects are rerandomized among two subsequent treatments. Responding subjects stay on the initial treatment or are assigned another fixed second-stage treatment. Variants of this design are discussed in Supplementary Material.

Because sequential multiple assignment randomized trials are intended to provide data that can be used to assist in the development of an adaptive treatment strategy, they tend to be sized for comparing two subgroups involved in the trial (Murphy, 2005; Murphy et al., 2007; Oetting et al., 2007). In the two-stage randomized trial, the two subgroups might be the two treatments for the nonresponders or alternatively may be two of the adaptive treatment strategies implemented in the trial (Murphy et al., 2007). Here we consider sizing the study to compare two adaptive treatment strategies beginning with different initial treatments. In practice, these strategies might be the most intensive and the least intensive strategy or might represent opposing clinical approaches. In general, the sequential multiple assignment randomized design should be followed by a more standard randomized clinical trial, in which the developed strategy is compared with an appropriate alternative. See the above references for discussions of these issues.

We focus on sizing a two-stage randomized trial for a failure time outcome, for example, time until a first school disciplinary event in the case of the trial described below for attention deficit hyperactivity disorder, or time until treatment dropout in the case of some schizophrenia and substance abuse trials. This paper is motivated by our experience in designing several sequential multiple assignment randomized studies, one of which is a 36-week trial for attention-deficit hyperactivity disorder. The primary purpose of this trial is to assist clinical scientists in constructing an adaptive treatment strategy composed of behavioural and/or medication components. In this study, children with this disorder are first randomized to either a low intensity behavioural modification therapy or a low dose of medication. Beginning at two months and every month thereafter each child’s classroom behaviour is assessed and compared with a prespecified criterion. Exceeding the criterion is interpreted as an early sign of nonresponse; the nonresponding children are then rerandomized either to intensification of the current treatment or to a combined treatment. Children who do not show signs of nonresponse continue on their first-stage treatment. The trial design is shown in Fig. 1. Although this trial was originally sized using the outcome of end of school year child behaviour problems, one of the interesting outcomes was the time until a first school disciplinary event.

Fig. 1
Design of the trial in attention-deficit hyperactivity disorder, abbreviated ADHD. R, randomization; BT, behavioural therapy; Med, medication; BT+, intensified behavioural therapy; Med+, higher dose medication; BT&Med, combined behavioural therapy ...

As is clear from the study in attention-deficit hyperactivity disorder, there are two time-to-event outcomes, that is, the time to early nonresponse and the time to the primary outcome. To improve clarity we use the term failure time to refer to the time to the primary outcome. The failure time can occur before or after the time to early nonresponse, e.g., the time until a first school disciplinary event can occur before or after the child’s classroom behaviour meets the criterion for early nonresponse.

We develop relatively simple, easy to use, sample size formulae for comparing two adaptive treatment strategies that begin with different initial treatments. We provide sample size formulae based on two different tests: a test of the equality of survival probabilities at one time point using a weighted Kaplan–Meier estimator, and a test of the equality of hazard functions based on a weighted version of the log-rank test. The challenge in developing easy-to-use sample size formulae is that the variances involved in the test statistics are complex functionals of the joint distribution of time to early nonresponse and failure time; these two times are likely to be dependent. For example, in the study in attention-deficit hyperactivity disorder, the time to child behaviour exceeding the nonresponse criterion and the time to a first school disciplinary event are likely dependent.

To achieve the goal of easy-to-use sample size formulae, we simplify the formulae by replacing the variance terms with appropriate upper bounds. The resulting sample size formulae require information similar to that needed to size a two-group randomized trial for a failure time outcome; in particular, we need not make assumptions concerning the dependence between the time to early nonresponse and the failure time. To construct the upper bounds, we first express the variances in terms of potential outcomes. Second, we use time-independent weights instead of time-dependent weights to construct the test statistics underlying the sample size formulae. The resulting sample size formulae will be conservative, both because of the use of upper bounds on the variances and because the tests used in data analysis are potentially more efficient than those used to derive the formulae; see the following sections for details. However, as will be seen, in common settings in which about 40–60% of subjects experience early signs of nonresponse and are thus rerandomized, these sample sizes are only minimally conservative.

In addition to providing simple, easy to use sample size formulae, we provide the asymptotic theory for the weighted Kaplan–Meier estimator. Moreover, we provide missing theory justifying the use of the weighted version of the log-rank test. Details are provided in the Supplementary Material.

Sample size formulae for sizing two-stage randomized trials for failure time outcomes have been proposed and studied by Feng & Wahed (2008, 2009). Feng & Wahed (2009) developed a sample size formula based on a weighted sample proportion estimator of the survival function, whereas Feng & Wahed (2008) developed a sample size formula based on the supremum of a weighted version of the log-rank test. These formulae require working assumptions on the relationship between the time to early nonresponse and the failure time. As will be seen, such assumptions are unnecessary in the approach proposed here.

2. Test statistics

Suppose that n subjects are to be randomized to one of two first-stage treatments, denoted by A1 = 1, 2. For example, in the study on attention deficit hyperactivity disorder these correspond to low intensity behavioural modification therapy and low dose methylphenidate, respectively. Nonresponders are further randomized to one of two second-stage treatments, denoted by A2 = 1, 2; these could be intensification of the current treatment or augmentation of the current treatment, respectively. Recall that responding subjects are assigned a fixed second-stage treatment, which could be the same as the initial treatment. Let T denote the failure time, S the time to early nonresponse and C the censoring time. On each subject we observe A1, R, RA2, S′ = min(T, S, C), Δ = I(T ≤ C) and U = min(T, C), where R = I{S ≤ min(T, C)} is the nonresponse indicator or, equivalently in this case, the rerandomization indicator. It is worth mentioning that nonresponse, or alternatively response, is assessed differently in different trials; for example, in some trials it is assessed at a fixed time-point after the initial treatment, and the definition of R should then be changed accordingly. This does not affect the development in the sequel, as long as R is the indicator for rerandomization. As is customary, we assume throughout that the censoring time C is independent of all other variables, including A1 and A2. Note that S can be censored by either T or C; indeed, the failure time T may occur prior to the time to nonresponse, S. Denote the randomization probabilities by p = pr(A1 = 1) and q = pr(A2 = 1 | R = 1). Let jk denote the treatment strategy for which A1 = j and then A2 = k for nonresponders, for j = 1, 2 and k = 1, 2. Finally, the duration of the study is τ. Data from this trial design provide information on four adaptive treatment strategies, namely strategies 11, 12, 21 and 22.
The trial design described here is just a special case of general two-stage randomized trial designs. Sample size formulae for other types of trial designs are discussed in Supplementary Material.

A variety of statistics have been proposed for comparing two adaptive treatment strategies with data from two-stage randomized trials and can thus be used to construct test statistics and related sample size formulae. A first approach is to compare survival probabilities under different strategies at a certain time-point, for example, survival probabilities at the end of the study period. Estimators of survival functions can be found in Lunceford et al. (2002) who proposed several versions of a weighted sample proportion estimator, in Wahed & Tsiatis (2006) who derived both a semiparametric efficient estimator and a less efficient, but easier to implement estimator, in Guo & Tsiatis (2005) who proposed a weighted Nelson–Aalen estimator of the cumulative hazard function, and recently, in Miyahara & Wahed (2010) who proposed a weighted Kaplan–Meier estimator. A second approach is to compare survival functions of the two adaptive treatment strategies using a weighted version of the log-rank test as in a 2005 PhD thesis by Xiang Guo at the Department of Statistics, North Carolina State University, which can be found at: http://www.lib.ncsu.edu/resolver/1840.16/5768.

We utilize test statistics based on the two approaches discussed above to derive sample size formulae. In the following we consider powering the two-stage randomized trial to detect a difference, if any, between strategies 11 and 21. Comparisons between strategies 12 and 21, 11 and 22, and 12 and 22 are similar. In the context of the study on attention-deficit hyperactivity disorder, this means that the trial is sized to ensure power to compare two competing approaches to the clinical management of the disorder: using behavioural modification therapy with intensification if needed, which is favoured by clinical psychologists, versus using medication with intensification if needed, which is favoured by physicians. Since only strategies 11 and 21 are considered in the following, to simplify notation, from here on we denote the survival function, cumulative hazard function and hazard function of the failure time under strategy 11 as F̄1(t), Λ1(t) and λ1(t), and those under strategy 21 as F̄2(t), Λ2(t) and λ2(t).

All of the statistics discussed above involve weights, which come in two forms: time-independent and time-dependent. Weights are required because, in this design, for any given adaptive treatment strategy, responding subjects whose initial treatment is consistent with the strategy are overrepresented, while nonresponding subjects whose initial treatment is consistent with the strategy are underrepresented. For example, all subjects starting with A1 = 1 who responded have a treatment sequence that is automatically consistent with strategy 11, while only a proportion of the subjects starting with A1 = 1 who failed to respond, determined by the second randomization probability q, have a treatment sequence consistent with strategy 11. To adjust for this design characteristic we use inverse probability weights (Robins et al., 1994; Murphy et al., 2001):

$$W_j = \frac{I(A_1 = j)}{p_j}\left\{1 - R + \frac{I(A_2 = 1)}{q}\,R\right\} \quad (j = 1, 2),$$

for strategies 11 and 21, respectively, where I(·) denotes the indicator function, p1 = p and p2 = 1 − p. These weights are similar to the weights used in Lunceford et al. (2002) and Miyahara & Wahed (2010). The difference is that those authors consider strategies with the same first-stage treatment, and hence the factors I(A1 = 1)/p and I(A1 = 2)/(1 − p) are omitted.
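As an illustration, the time-independent weight can be computed directly from the observed (A1, R, A2). The sketch below is ours, not from the paper; `ipw_weights` is a hypothetical name, and p1 = p, p2 = 1 − p are the first-stage randomization probabilities.

```python
import numpy as np

def ipw_weights(a1, r, a2, p=0.5, q=0.5, j=1):
    """Time-independent inverse probability weight W_j for strategy j1:
    W_j = I(A1 = j)/p_j * {1 - R + I(A2 = 1) R / q}.

    a1 : first-stage treatments (1 or 2)
    r  : nonresponse (rerandomization) indicators (0 or 1)
    a2 : second-stage treatments (only meaningful when r == 1)
    """
    a1, r, a2 = map(np.asarray, (a1, r, a2))
    p_j = p if j == 1 else 1.0 - p            # pr(A1 = j)
    first = (a1 == j) / p_j                   # I(A1 = j) / p_j
    second = (1 - r) + (a2 == 1) * r / q      # 1 - R + I(A2 = 1) R / q
    return first * second
```

For example, with p = q = 1/2, a responder starting on A1 = 1 gets weight 2 for strategy 11, a nonresponder rerandomized to A2 = 1 gets weight 4, and subjects whose treatment sequence is inconsistent with strategy 11 get weight 0.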

Alternatively one could use time-dependent weights:

$$W_j(t) = \frac{I(A_1 = j)}{p_j}\left\{1 - R(t) + \frac{I(A_2 = 1)}{q}\,R(t)\right\} \quad (j = 1, 2;\ p_1 = p,\ p_2 = 1 - p),$$

where R(t) = I{S ≤ min(t, T, C)}. These time-dependent weights are used in Guo’s dissertation and by Feng & Wahed (2008). Time-dependent weights permit more subject information to be used in the estimation at each time t. Consider strategy 11 and subjects who have A1 = 1 as the initial treatment. If time-dependent weights are used, then all of these subjects have a nonzero time-dependent weight at time-points t less than min(S, T, C), i.e. before they meet the nonresponse criterion, regardless of the value of A2. However, if time-independent weights are used, then those subjects who have A2 = 2 will never have a nonzero weight for strategy 11. Hence, intuitively, the use of time-dependent weights results in efficiency gains over time-independent weights.

We first consider the test statistic based on the weighted Kaplan–Meier estimator. This test statistic is used for testing H0 : F̄1(t) = F̄2(t) versus H1 : F̄1(t) ≠ F̄2(t) for some t satisfying 0 < t ≤ τ. It is based on the weighted Kaplan–Meier estimator of F̄j(t) proposed by Miyahara & Wahed (2010), which is defined as

$$\hat{\bar F}_{Kj}(t) = \prod_{u \le t}\left\{1 - \frac{\sum_{i=1}^{n} W_{ji}(u)\,dN_i(u)}{\sum_{i=1}^{n} W_{ji}(u)\,Y_i(u)}\right\} \quad (j = 1, 2), \tag{1}$$

where the subscript i indicates the ith subject, for i = 1, …, n, N(u) = ΔI(U ≤ u) is the counting process for the failure time, and Y(u) = I(U ≥ u) is the at-risk process. The estimator in (1) uses time-dependent weights; however, we can also use time-independent weights by replacing Wji(u) with Wji in (1). As will be discussed in § 3, we will use the test statistic based on the weighted Kaplan–Meier estimator with time-independent weights for sample size calculation, while in data analysis for evaluating the sample size formula the test statistic with time-dependent weights will also be considered. It is worth mentioning that another estimator of F̄j(t) can be defined as exp{−Λ̂j(t)}, where Λ̂j(t) is the weighted Nelson–Aalen estimator of Λj(t) proposed in Guo & Tsiatis (2005). Theorem 1 in the Supplementary Material implies that the asymptotic distribution of the weighted Kaplan–Meier estimator is the same as that of this estimator.
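For concreteness, a minimal sketch of estimator (1) with time-independent weights (replacing Wji(u) by Wji) might look as follows. The function name `weighted_km` is ours; ties are handled by grouping events at equal observed times.

```python
import numpy as np

def weighted_km(u, delta, w, times):
    """Weighted Kaplan-Meier estimate of the survival function at `times`,
    as in (1) with time-independent weights: at each observed failure time,
    the factor is 1 - sum_i w_i dN_i(u) / sum_i w_i Y_i(u).

    u     : observed times min(T, C)
    delta : event indicators I(T <= C)
    w     : per-subject time-independent weights W_ji
    times : time-points at which to evaluate the survival estimate
    """
    u, delta, w = map(np.asarray, (u, delta, w))
    fail_times = np.unique(u[(delta == 1) & (w > 0)])
    surv = []
    for t in np.atleast_1d(times):
        s = 1.0
        for ft in fail_times[fail_times <= t]:
            at_risk = w[u >= ft].sum()                  # sum_i w_i Y_i(ft)
            events = w[(u == ft) & (delta == 1)].sum()  # sum_i w_i dN_i(ft)
            if at_risk > 0:
                s *= 1.0 - events / at_risk
        surv.append(s)
    return np.array(surv)
```

With all weights equal to one, this reduces to the standard Kaplan–Meier estimator.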

By Theorem 1 in the Supplementary Material, the asymptotic distribution of $n^{1/2}\{\hat{\bar F}_{Kj}(t) - \bar F_j(t)\}$ is $N\{0, \sigma^2_{Kj}(t)\}$, where

$$\sigma^2_{Kj}(t) = \bar F_j^{\,2}(t)\, E\left[\int_0^t \frac{W_j(u)}{\bar F_j(u)\bar F_C(u)}\{dN(u) - Y(u)\,d\Lambda_j(u)\}\right]^2.$$

Let $\hat{\bar F}_C(u)$ be the usual Kaplan–Meier estimator of F̄C(u). From this result, a test statistic for testing H0 : F̄1(t) = F̄2(t) for some 0 < t ≤ τ can be constructed as

$$T_K(t) = \frac{n^{1/2}\{\hat{\bar F}_{K1}(t) - \hat{\bar F}_{K2}(t)\}}{\{\hat\sigma^2_{K1}(t) + \hat\sigma^2_{K2}(t)\}^{1/2}},$$

where

$$\hat\sigma^2_{Kj}(t) = \frac{\hat{\bar F}_{Kj}^{\,2}(t)}{n}\sum_{i=1}^{n}\left[\int_0^t \frac{W_{ji}(u)}{\hat{\bar F}_{Kj}(u)\hat{\bar F}_C(u)}\{dN_i(u) - Y_i(u)\,d\hat\Lambda_j(u)\}\right]^2 \quad (j = 1, 2),$$

is a consistent estimator of σ²Kj(t). The variance of the numerator of the test statistic is the sum of the variances since $\hat{\bar F}_{K1}(t)$ and $\hat{\bar F}_{K2}(t)$ are independent. This independence occurs because the two strategies begin with different treatments; when the strategies begin with the same treatment, the Kaplan–Meier estimators are dependent. See the Supplementary Material, in the paragraph following the proof of Theorem 1, for details regarding this situation. If time-independent weights are used in the weighted Kaplan–Meier estimator, then Wj(u) and Wji(u) in the above expressions for σ²Kj(t) and $\hat\sigma^2_{Kj}(t)$ are replaced by Wj and Wji, respectively, and the resulting variance estimators remain consistent. Under the null hypothesis, the test statistic TK(t) has an asymptotic N(0, 1) distribution.

Now we consider the weighted log-rank statistic. This test statistic is used for testing H0 : F̄1 ≡ F̄2 versus H1 : F̄1(t) ≠ F̄2(t) for some t ≤ τ. The log-rank test is the most commonly used test for comparing the distributions of two failure times. It is also commonly used to calculate sample sizes in classical survival analysis. See, for example, Schoenfeld (1981). Weighting each subject as above, the following statistic is an analogue of the usual log-rank statistic:

$$L_n = \int_0^\tau \frac{\bar Y_{W2}(t)}{\bar Y_{W1}(t) + \bar Y_{W2}(t)}\, d\bar N_{W1}(t) - \int_0^\tau \frac{\bar Y_{W1}(t)}{\bar Y_{W1}(t) + \bar Y_{W2}(t)}\, d\bar N_{W2}(t),$$

where $\bar Y_{Wj}(t) = n^{-1}\sum_{i=1}^{n} W_{ji}(t)Y_i(t)$ and $d\bar N_{Wj}(t) = n^{-1}\sum_{i=1}^{n} W_{ji}(t)\,dN_i(t)$. This test statistic, which was proposed in Guo’s dissertation, uses time-dependent weights. One can use time-independent weights as well, by replacing Wji(t) in the definition with Wji. By Theorem 3 in the Supplementary Material, the asymptotic distribution of $n^{1/2}L_n$ under the null hypothesis is $N\{0, (\sigma^2_{L1} + \sigma^2_{L2})/4\}$, where

$$\sigma^2_{Lj} = E\left[\int_0^\tau W_j(t)\{dN(t) - Y(t)\,d\Lambda_1(t)\}\right]^2 \quad (j = 1, 2).$$

Again, see the Supplementary Material, the paragraph following the proof of Theorem 3, for testing two strategies starting with the same initial treatment. Let

$$\hat\sigma^2_{Lj} = \frac{1}{n}\sum_{i=1}^{n}\left[\int_0^\tau W_{ji}(t)\{dN_i(t) - Y_i(t)\,d\hat\Lambda_1(t)\}\right]^2 \quad (j = 1, 2),$$

where

$$d\hat\Lambda_1(t) = \frac{\sum_{i=1}^{n} W_{1i}\,dN_i(t) + \sum_{i=1}^{n} W_{2i}\,dN_i(t)}{\sum_{i=1}^{n} W_{1i}\,Y_i(t) + \sum_{i=1}^{n} W_{2i}\,Y_i(t)}$$

is obtained by pooling the two groups. We can use the following test statistic to test H0 : F̄1 ≡ F̄2:

$$T_L = \frac{2 n^{1/2} L_n}{(\hat\sigma^2_{L1} + \hat\sigma^2_{L2})^{1/2}}.$$

This test statistic also has an asymptotic N(0, 1) distribution under H0. Again, when time-independent weights are used, we simply replace Wj(t) and Wji(t) in the above expressions by Wj and Wji, respectively. We will use the test with time-independent weights for sample size calculation, but for data analysis we also consider the test with time-dependent weights. The weighted log-rank test here is different from the test with the same name in classical survival analysis; we use this name because it is consistent with the terminology used for the weighted Kaplan–Meier estimator and with the literature in this area. Lastly, Lokhnygina & Helterbrand (2007) proposed a pseudo-score test of the log hazard ratio in a Cox proportional hazards model for comparing two adaptive treatment strategies. In their pseudo-score function, each subject is weighted by a time-independent weight. The statistic Ln defined above with time-independent weights is equivalent to the pseudo-score function defined there, but their expression for its asymptotic variance is in a different form.
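To illustrate, the statistic Ln with time-independent weights can be sketched as below. This is an illustrative implementation of ours, not code from the paper, summing over the distinct observed failure times; w1 and w2 are the per-subject strategy-11 and strategy-21 weights.

```python
import numpy as np

def weighted_logrank(u, delta, w1, w2):
    """Weighted log-rank statistic L_n with time-independent weights.

    u     : observed times min(T, C)
    delta : event indicators I(T <= C)
    w1/w2 : strategy-11 and strategy-21 weights W_1i, W_2i
    Returns the unstandardized statistic L_n.
    """
    u, delta, w1, w2 = map(np.asarray, (u, delta, w1, w2))
    n = len(u)
    ln = 0.0
    for t in np.unique(u[delta == 1]):
        y1 = w1[u >= t].sum() / n                    # \bar Y_{W1}(t)
        y2 = w2[u >= t].sum() / n                    # \bar Y_{W2}(t)
        dn1 = w1[(u == t) & (delta == 1)].sum() / n  # d\bar N_{W1}(t)
        dn2 = w2[(u == t) & (delta == 1)].sum() / n  # d\bar N_{W2}(t)
        if y1 + y2 > 0:
            ln += (y2 * dn1 - y1 * dn2) / (y1 + y2)
    return ln
```

Standardizing as in TL additionally requires the variance estimators above; the sketch only computes the numerator statistic.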

3. Sample size calculation

As mentioned above, we propose to use time-independent weights in the test statistics for sample size calculation, because time-independent weights make it easier to obtain simple upper bounds on the variances involved in the test statistics. These upper bounds are crucial in obtaining comparatively simple sample size formulae, as will be seen below. On the other hand, we suggest that test statistics using time-dependent weights be used in data analysis, as these tests are potentially more powerful. We first derive sample size formulae using exact variances, and then replace the variances with their upper bounds to obtain our final sample size formulae.

First suppose that we wish to test H0 : F̄1(t) = F̄2(t) versus H1 : F̄1(t) ≠ F̄2(t) for some t satisfying 0 < t ≤ τ using the test based on the weighted Kaplan–Meier estimator with time-independent weights. For definiteness, we suppose that the survival probabilities at the end of the study are to be compared, i.e., t = τ, and write TK instead of TK(τ). Under significance level α, the rejection region of a two-sided test of H0 : F̄1(τ) = F̄2(τ) is {|TK| > Z1−α/2}. By Theorem 1 in the Supplementary Material, under the alternative hypothesis the distribution of TK is approximately normal with mean $n^{1/2}\{\bar F_1(\tau) - \bar F_2(\tau)\}/\{\sigma^2_{K1}(\tau) + \sigma^2_{K2}(\tau)\}^{1/2}$ and variance 1. To detect a difference in survival probabilities at time τ of size δK = F̄1(τ) − F̄2(τ) with power 1 − β, we set

$$\mathrm{pr}(T_K > Z_{1-\alpha/2}\ \text{or}\ T_K < -Z_{1-\alpha/2}) = 1 - \beta,$$

which yields the sample size formula

$$n_K \ge \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\{\sigma^2_{K1}(\tau) + \sigma^2_{K2}(\tau)\}}{\{\bar F_1(\tau) - \bar F_2(\tau)\}^2}.$$

Next suppose that we wish to test H0 : F̄1 ≡ F̄2 versus H1 : F̄1(t) ≠ F̄2(t) for some t ≤ τ, using the weighted log-rank test with time-independent weights. To construct a sample size formula based on the log-rank test or its variants, the asymptotic distributions of the test statistics are usually derived under a proportional hazards assumption with a local alternative for the log hazard ratio (Schoenfeld, 1981; Chow et al., 2005; Eng & Kosorok, 2005; Feng & Wahed, 2008). The proportional hazards assumption is λ2(t) = λ1(t)e^ξ, where ξ is the log hazard ratio, and the local alternative we use is ξ = γ/n^{1/2} for some constant γ. The use of a local alternative greatly simplifies the asymptotic means of the test statistics, which facilitates the sample size calculation. We use this approach here as well. Guo’s dissertation studied the asymptotic distribution of the weighted log-rank statistic Ln with time-dependent weights. While he provided an outline, we provide a complete proof of the asymptotic normality of Ln under the local proportional hazards alternative in the Supplementary Material. Based on the asymptotic distribution of Ln given in Theorem 3 in the Supplementary Material and a derivation similar to that above, the corresponding sample size formula is

$$n_L \ge \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2(\sigma^2_{L1} + \sigma^2_{L2})}{\xi^2\left\{\int_0^\tau \bar F_C(t)\,dF_1(t)\right\}^2}.$$

To use the above formulae to calculate sample sizes, we need values of σ²K1(τ) and σ²K2(τ), or of σ²L1 and σ²L2. A challenge is that, even with time-independent weights, these variances depend in a complex manner on the joint distribution of the failure time T and the time to early nonresponse S. In particular, R in the weight functions W1 and W2 depends on S; furthermore, as discussed in the Supplementary Material, the weights are not predictable, so one cannot simplify the variance formulae by the usual martingale arguments. Thus a direct approximation of these variances requires working assumptions on the joint distribution of T and S. We avoid making working assumptions on the joint distribution by using upper bounds on the variances; this then permits working assumptions similar to those used to size standard one-stage trials.

To obtain interpretable upper bounds on the variances, we use potential outcomes (Holland, 1986) notation. The use of potential outcomes permits us to form upper bounds that are easily interpretable to scientists and thus enables the scientist to more easily provide the information needed in the sample size calculation. Furthermore, the potential outcome notation allows us to make our arguments precise. Let Sj and Tj be the time to early nonresponse and the failure time, respectively, if a subject is assigned first-stage treatment j. The Tj correspond to failure times in a one-stage study, that is, a study in which there is no change in treatment assignment. If Sj < Tj, define Djk to be the interval of time between early nonresponse and the failure time if the subject is assigned second-stage treatment k. Intuitively, the failure time in response to the adaptive treatment strategy jk is a mixture of two times: one failure time in which there would have been no further treatment assignment, and another in which there would be a further treatment assignment once early nonresponse occurs. That is, the potential failure time under assignment of strategy jk is Tjk = Tj I(Sj > Tj) + (Sj + Djk) I(Sj ≤ Tj). On the event {Sj > Tj}, Tj1 = Tj2. We make the following consistency assumptions (Robins, 1997). If a subject is assigned first-stage treatment j, then the time to early nonresponse S is equal to Sj; similarly, if the assigned first-stage treatment is j, then T = Tjk for all nonresponding subjects assigned second-stage treatment k. The failure time for all responding subjects satisfies T = Tjk for k = 1, 2.
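To make the mixture representation concrete, the potential failure time Tjk can be simulated as below. The exponential distributions, rates and function name are purely illustrative assumptions of ours, not part of the paper.

```python
import numpy as np

def simulate_strategy_failure(n, rate_t=0.1, rate_s=0.2, rate_d=0.15, seed=0):
    """Simulate potential failure times under one strategy jk:
    T_jk = T_j I(S_j > T_j) + (S_j + D_jk) I(S_j <= T_j).
    All exponential rates are illustrative working assumptions only."""
    rng = np.random.default_rng(seed)
    t_j = rng.exponential(1 / rate_t, n)   # failure time with no treatment change
    s_j = rng.exponential(1 / rate_s, n)   # time to early nonresponse
    d_jk = rng.exponential(1 / rate_d, n)  # residual time after the second-stage switch
    # mixture: keep T_j when failure precedes nonresponse, else S_j + D_jk
    return np.where(s_j > t_j, t_j, s_j + d_jk)
```

Note that the simulation draws Sj and Tj independently here purely for brevity; the sample size formulae below deliberately avoid any such assumption about their joint distribution.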

Now consider the variance term σ²K1(τ). First, when W1 ≠ 0, N(t) = ΔI(T ≤ t) and Y(t) = I(U ≥ t) can be replaced by N1(t) = I(T11 ≤ t, T11 ≤ C) and Y1(t) = I{min(T11, C) ≥ t}, respectively. Thus

$$\sigma^2_{K1}(\tau) = \bar F_1^{\,2}(\tau)\, E\left[\int_0^\tau \frac{W_1}{\bar F_1(t)\bar F_C(t)}\{dN_1(t) - Y_1(t)\,d\Lambda_1(t)\}\right]^2.$$

Next, since N1(t) and Y1(t) are functions of T11 and C only, and the weight is time-independent, the above equals

$$\sigma^2_{K1}(\tau) = \bar F_1^{\,2}(\tau)\, E\left(E(W_1^2 \mid T_{11}, C)\left[\int_0^\tau \frac{1}{\bar F_1(t)\bar F_C(t)}\{dN_1(t) - Y_1(t)\,d\Lambda_1(t)\}\right]^2\right).$$

Because the randomization of first- and second-stage treatments ensures that A1 and A2 are independent of the potential outcomes Sj and Tjk, we can replace $E(W_1^2 \mid T_{11}, C)$ in the above display by E(1 − R + R/q | T11, C)/p. Next, replace R by 1 to produce the upper bound

$$\sigma^2_{K1}(\tau) \le \frac{\bar F_1^{\,2}(\tau)}{pq}\, E\left[\int_0^\tau \frac{1}{\bar F_1(t)\bar F_C(t)}\{dN_1(t) - Y_1(t)\,d\Lambda_1(t)\}\right]^2 = \frac{\bar F_1^{\,2}(\tau)}{pq}\int_0^\tau \frac{d\Lambda_1(t)}{\bar F_1(t)\bar F_C(t)}, \tag{2}$$

where martingale theory is used to obtain the integral in the last equality. An upper bound for σ²K2(τ) is obtained similarly: it is the bound obtained by replacing p with 1 − p and the subscript 1 with 2 in (2). A similar argument yields upper bounds for σ²L1 and σ²L2 as well; for σ²L1 the upper bound is

$$\sigma^2_{L1} \le \frac{1}{pq}\int_0^\tau \bar F_C(t)\,dF_1(t).$$

The upper bound on σ²L2 is similar, but with p replaced by 1 − p and the subscript 1 replaced by 2.

Since the upper bounds are obtained by replacing R with 1, the larger the proportion of subjects randomized at the second stage, or equivalently the larger the proportion of nonresponders, the sharper the upper bounds. If every subject is randomized at the second stage, that is, if all subjects are nonresponders, the upper bounds are exact.

Now replacing the variances in the above sample size formulae by their upper bounds, the sample size based on the weighted Kaplan–Meier estimator with time-independent weights and upper bounds on the variances is

$$n_K \ge \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\,\sigma_B^2}{\{\bar F_1(\tau) - \bar F_2(\tau)\}^2}, \tag{3}$$

where

$$\sigma_B^2 = \frac{\bar F_1^{\,2}(\tau)}{pq}\int_0^\tau \frac{d\Lambda_1(t)}{\bar F_1(t)\bar F_C(t)} + \frac{\bar F_2^{\,2}(\tau)}{(1-p)q}\int_0^\tau \frac{d\Lambda_2(t)}{\bar F_2(t)\bar F_C(t)}.$$

In a similar manner, the sample size based on the weighted log-rank test with time-independent weights and upper bounds on the variances is

$$n_L \ge \left\{\frac{1}{pq} + \frac{1}{(1-p)q}\right\}\frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2}{\xi^2 \int_0^\tau \bar F_C(t)\,dF_1(t)}. \tag{4}$$

In order to calculate the sample size using formula (4), we only need information about the hazard ratio and the integral $\int_0^\tau \bar F_C(t)\,dF_1(t)$, which is exactly the probability of observing an event before time τ when all subjects are assigned strategy 11. This is the same information used to size a two-arm one-stage trial using the log-rank test (Schoenfeld, 1981). In contrast, if we use (3) to calculate the sample size, we need working assumptions on the distribution function of the potential failure time under each strategy, i.e. F1(t) and F2(t), as well as FC(t), the distribution function of the censoring time. In practice, one could assume these functions have a parametric form, for example, exponential or Weibull distributions. For the censoring distribution, a uniform distribution over (0, τ) with a point mass at τ is often reasonable. Then one only needs to guess the parameters in these distributions to calculate the integrals in the formula. Alternatively, one could make guesses at these distribution functions at some fixed time-points before τ, and then approximate the integrals numerically.
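As a sketch of how formulae (3) and (4) might be evaluated in practice, the code below assumes administrative censoring only, i.e. F̄C(t) = 1 on [0, τ); under this assumption each integral in σB² collapses to a closed form, since F̄j²(τ)∫₀^τ dΛj(t)/F̄j(t) = F̄j(τ){1 − F̄j(τ)} for a continuous failure distribution. Function names and default values are ours, not the paper's.

```python
import math
from statistics import NormalDist

def _z(p):
    # standard normal quantile
    return NormalDist().inv_cdf(p)

def sample_size_km(surv1, surv2, p=0.5, q=0.5, alpha=0.05, power=0.8):
    """Formula (3) under administrative censoring only, where each
    integral in sigma_B^2 collapses to bar F_j(tau){1 - bar F_j(tau)}.
    surv1, surv2 : anticipated survival probabilities at tau under
                   strategies 11 and 21."""
    sigma_b2 = (surv1 * (1 - surv1) / (p * q)
                + surv2 * (1 - surv2) / ((1 - p) * q))
    return math.ceil((_z(1 - alpha / 2) + _z(power)) ** 2 * sigma_b2
                     / (surv1 - surv2) ** 2)

def sample_size_logrank(hazard_ratio, event_prob, p=0.5, q=0.5,
                        alpha=0.05, power=0.8):
    """Formula (4). hazard_ratio is the anticipated ratio exp(xi) of the
    strategy-21 to strategy-11 hazards; event_prob is the integral of
    bar F_C dF_1 over [0, tau], i.e. the probability of observing an
    event by tau under strategy 11."""
    xi = math.log(hazard_ratio)
    inflation = 1 / (p * q) + 1 / ((1 - p) * q)  # weight-inflation factor
    return math.ceil(inflation * (_z(1 - alpha / 2) + _z(power)) ** 2
                     / (xi ** 2 * event_prob))
```

With p = q = 1/2 the inflation factor in (4) equals 8, twice the familiar factor 4 for a standard two-arm trial, reflecting that only a fraction of randomized subjects follow each strategy.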

Although (3) needs more inputs than (4), it is made much simpler by using upper bounds on the variances. We illustrate this by comparing the working assumptions needed for (3) with those in a sample size formula developed by Feng & Wahed (2009). Their sample size formula is also based on testing the equality of survival probabilities at a fixed time-point, but using the second weighted sample proportion estimator with time-independent weights proposed in Lunceford et al. (2002). Instead of forming upper bounds on the variances, they simplify the variance terms by making working assumptions on the joint distribution of the potential outcomes. They use slightly different potential outcomes than used here. In their setting, $R_j$ denotes the nonresponse indicator under first-stage treatment j; $T_{j0}$ is the failure time if the subject responds and $T^*_{jk}$ is the failure time if the subject does not respond and is assigned treatment k in the second stage. Both times are durations beginning at the start of first-stage treatment. The failure time under strategy jk is then $T_{jk} = (1 - R_j)T_{j0} + R_j T^*_{jk}$. Using these potential outcomes, they assume that $E(R_j \mid T_{jk}) = E(R_j)$; see, for example, their derivation of display (12) and their simulation models. One might interpret this working assumption as saying that the chance that a subject exhibits early nonresponse is unrelated to the subject's failure time. Since they do not use upper bounds on the variances, they need more working assumptions than (3): besides the assumption just mentioned, they also need assumptions about the distribution functions of $T_{j0}$ and $T^*_{jk}$, and the nonresponse rates $\mathrm{pr}(R_j = 1)$ (j, k = 1, 2), to calculate the sample size.

We further compare the working assumptions used by the different sample size formulae in the context of the trial for attention deficit hyperactivity disorder. First, consider sizing the trial to test whether the chance of a school disciplinary event occurring by 36 weeks differs between strategies 11 and 21. In this example, the working assumption required to use the formula in Feng & Wahed (2009) is that, for both initial treatments, early nonresponse is independent of the time to a school disciplinary event. Moreover, to apply their formula, we need to know the distribution functions of the time to a school disciplinary event for those who respond to behavioural therapy and for those who respond to medication. We also need to know the distribution functions of the time to a school disciplinary event for those who do not respond to behavioural therapy and are then assigned more intensive behavioural therapy, and for those who do not respond to medication and are then assigned a higher dose of medication. Lastly, we need to know the proportion of subjects who respond to behavioural therapy and the proportion who respond to medication. In contrast, the sample size formula (3) does not require the independence assumption stated above, and we only need to know the distribution functions of the time to a school disciplinary event for subjects assigned strategies 11 and 21, respectively. If, instead, we wish to size the trial to test the difference between the distributions of the times to a school disciplinary event under strategies 11 and 21 using the weighted log-rank test, the working assumptions are even simpler: to use the sample size formula (4), we assume proportional hazards between strategies 11 and 21, and we need to specify only the hazard ratio and the probability of observing the first school disciplinary event before the end of the study in children assigned strategy 11.
The working assumptions in general settings for the three sample size formulae are provided in Table 1.

Table 1
Methods for sample size calculation, working assumptions made, and quantities needing to be guessed to calculate the sample size

4. Simulation

We conducted a simulation study to assess the performance of the sample size formulae. In evaluating the formulae we include tests based on time-independent weights and tests based on time-dependent weights. Moreover, for comparing the survival probabilities at a given time-point, we include two additional tests. The first is based on a slight generalization of the third weighted sample proportion estimator of $\bar F_j(t)$ in Lunceford et al. (2002), which is the most efficient of the three estimators proposed there; our generalization uses time-dependent weights $W_1(t)$ and $W_2(t)$ in place of the time-independent weights. The second is based on the pseudo-semiparametric efficient estimator of $\bar F_j(t)$ proposed by Wahed & Tsiatis (2006).

In the simulations, we suppose τ = 36. We generate $(T_{j1}, S_j)$ jointly from a Frank copula model (Nelsen, 1998) with association parameter 5 when j = 1 and 6 when j = 2, resulting in a positive correlation between the failure time and the time to early nonresponse. The marginal distributions of $T_{11}$ and $T_{21}$ are Weibull, with a common shape parameter equal to 2 to ensure proportional hazards. The scale parameter for $T_{11}$ is 50 and the scale parameter for $T_{21}$ is determined by the desired hazard ratio, which takes values 1.25, 1.5 or 2.0. The marginal distributions of the times to early nonresponse, $S_1$ and $S_2$, are also Weibull, with both scale and shape parameters varied so as to achieve varying percentages of subjects randomized in the second stage; this percentage ranges from about 25% to about 75%. The censoring time, C, has a point mass at τ, is otherwise uniformly distributed over (0, τ), and is independent of all other variables. The size of the point mass at τ is varied so that approximately 20% or 40% of the failure times are censored. The survival probabilities at the end of the study under the two strategies 11 and 21 lie in (0.4, 0.6) in all scenarios. The first- and second-stage randomization probabilities are p = q = 0.5. All simulations are based on 1000 simulated data sets. The significance level is α = 0.05 and the desired power is 80%.
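The data-generating scheme above can be sketched as follows. This is an illustrative reimplementation, not the authors' simulation code: the Frank copula is sampled by conditional inversion, the margins are Weibull, and all function names and default values are ours.

```python
import numpy as np

def frank_pair(n, theta, rng):
    """Draw n pairs (u, v) from a Frank copula by conditional inversion."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)  # conditional quantile of v given u
    a = w * (np.exp(-theta) - 1.0) / (np.exp(-theta * u) * (1.0 - w) + w)
    v = -np.log1p(a) / theta
    return u, v

def simulate_arm(n, theta, t_scale, t_shape, s_scale, s_shape, tau, c, seed=0):
    """One first-stage arm: failure time T, time to nonresponse S, censoring C."""
    rng = np.random.default_rng(seed)
    u, v = frank_pair(n, theta, rng)
    T = t_scale * (-np.log(1.0 - u)) ** (1.0 / t_shape)  # Weibull margin for T
    S = s_scale * (-np.log(1.0 - v)) ** (1.0 / s_shape)  # Weibull margin for S
    # Censoring: uniform on (0, tau) with probability c, else a point mass at tau.
    C = np.where(rng.uniform(size=n) < c, rng.uniform(0.0, tau, size=n), tau)
    return T, S, C
```

A positive association parameter (theta = 5 or 6, as above) yields the positive correlation between the failure time and the time to early nonresponse described in the text.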

Tables 2 and 3 provide the results of the simulation. From both tables, we observe that the desired power is achieved by all the tests except the one based on the third estimator in Lunceford et al. (2002) when comparing survival probabilities at the end of the study. The tests based on time-dependent weights are more efficient than the corresponding tests based on time-independent weights. From the results in Table 2, the pseudo-efficient estimator of Wahed & Tsiatis (2006) is slightly more efficient than the weighted Kaplan–Meier estimator with time-dependent weights when the sample sizes are large; in smaller samples, its efficiency is reduced by the variability introduced in estimating the probabilities of potentially rare events. Table 2 also includes sample sizes calculated from the formula in Feng & Wahed (2009). That formula assumes that the potential outcomes $R_j$ and $T_{jk}$ are independent, which does not hold here. Their sample sizes are a little larger than those calculated from our formula (3), possibly because the sample proportion estimator is less efficient than the weighted Kaplan–Meier estimator.

Table 2
Achieved powers (%) for the sample sizes based on the weighted Kaplan–Meier estimator with time-independent weights. The significance level of the test is 5% and the desired power is 80%
Table 3
Achieved powers (%) of the weighted log-rank tests under sample sizes obtained by cWLR. The significance level of the test is 5% and the desired power is 80%

From Tables 2 and 3 we also observe that the degree of conservatism of the sample size formulae depends on the percentage of subjects randomized at the second stage: the higher this percentage, the less conservative the sample sizes. Notably, when the percentage of subjects randomized at the second stage approaches or exceeds 70%, the achieved powers are close to the expected power. We can also use simulations to search for the nonconservative sample sizes and compare them with our conservative ones. For example, when the hazard ratio is 1.5 and 25, 50 and 75% of subjects are rerandomized, the sample sizes needed to guarantee 80% power using the test based on the weighted Kaplan–Meier estimator with time-independent weights are about 680, 710 and 770, respectively, compared with the 880 subjects required by our conservative sample size formula. When the weighted log-rank test with time-independent weights is used, the nonconservative sample sizes are about 620, 690 and 720 when 25, 50 and 75% of subjects are rerandomized, respectively, while our conservative sample size is 753.

Table 4 presents the results of another simulation, which illustrates the achieved powers when the working assumptions are incorrect, that is, when the quantities needed to calculate the sample sizes are misspecified. We consider cases where the hazard ratio is 1.25 or 1.5. For the sample size formula (3), the wrong scale or shape parameter is used in the specification of the Weibull distributions of $T_{11}$ and $T_{21}$; for the sample size formula (4), the probability of observing an event before the end of the study is misspecified. The results show that the desired powers for both tests are usually achieved, or nearly achieved, under a reasonable degree of misspecification. This robustness is a consequence of the conservatism of the sample size formulae.

Table 4
Achieved powers (%) of tests under misspecifications of the true model. In the true model, (T11, S1) and (T21, S2) both follow Frank copula models with association parameters 1 and 2, respectively. The marginal distributions of T11 and T21 are Weibull ...

In all these simulations we generated data in which the failure times and the times to early non-response are positively associated. We also conducted simulations in which they are negatively associated; the results are very similar to those shown here.

5. Discussion

Although the focus of this work is on trials intended for use in refining or developing adaptive treatment strategies, in some situations two fully developed adaptive treatment strategies exist and scientific interest rests primarily in contrasting the strategies. In this case, a traditional two-arm trial of the strategies is more appropriate than the sequential multiple assignment randomized trial discussed here and has the advantage of requiring a smaller sample size.

In this work we assume that the censoring time distribution does not depend on the initial treatment. In the case of the weighted log-rank test, removing this assumption would result in a more complicated sample size formula. For the sample size formula based on the weighted Kaplan–Meier estimator, however, removing this assumption requires only minimal change, as follows. Let $\bar F_{Cj}(t)$ denote the survival function of the censoring time for subjects starting with initial treatment $A_1 = j$ (j = 1, 2). Then it is easy to see that the asymptotic distribution of $\hat{\bar F}_j(t)$ is normal with the same mean, but with $\bar F_C(t)$ in the variance formula replaced by $\bar F_{Cj}(t)$. The corresponding sample size formula is therefore the same as (3), except that $\sigma_B^2$ becomes

$$\sigma_B^2 = \frac{\bar F_1^2(\tau)}{pq} \int_0^\tau \frac{d\Lambda_1(t)}{\bar F_1(t)\,\bar F_{C1}(t)} + \frac{\bar F_2^2(\tau)}{(1-p)q} \int_0^\tau \frac{d\Lambda_2(t)}{\bar F_2(t)\,\bar F_{C2}(t)}.$$
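This modified variance bound can be evaluated numerically in the same way as before. The sketch below again assumes (hypothetically) exponential working models for the failure times and arm-specific uniform censoring on (0, τ) with a point mass at τ; all names and values are illustrative.

```python
import numpy as np

def sigma_b2(lam1, lam2, c1, c2, tau, p=0.5, q=0.5, m=10001):
    """Variance bound with arm-specific censoring survival functions."""
    t = np.linspace(0.0, tau, m)

    def arm_term(lam, c, prob):
        s = np.exp(-lam * t)      # survival of the failure time (exponential model)
        sc = 1.0 - c * t / tau    # arm-specific censoring survival on [0, tau)
        integrand = lam / (s * sc)
        integral = float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(t)) / 2.0)
        return s[-1] ** 2 / prob * integral

    return arm_term(lam1, c1, p * q) + arm_term(lam2, c2, (1 - p) * q)
```

Heavier censoring in either arm inflates the corresponding integral, and hence the variance bound and the required sample size.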

The simulation results in the previous section indicate that the sample sizes obtained using the weighted log-rank test are usually considerably smaller than those obtained from comparing survival probabilities at a given time-point. Moreover, less information is needed to calculate the sample size based on the weighted log-rank test. Hence we have developed a web applet which can be used to size studies based on the weighted log-rank test. This web applet, along with the code for our simulations, can be found at http://methodologymedia.psu.edu/logranktest/samplesize. However, the log-rank test is designed to have highest power against proportional hazards alternatives (Peto & Peto, 1972), so the sample size formula based on the log-rank test may not provide the desired power if the hazards under the alternative hypothesis are likely to be nonproportional. The test for equality of survival probabilities at one time-point has a more modest goal and hence may be better able to ensure power.

Feng & Wahed (2008) derived a sample size formula for the same problem using a supremum weighted log-rank test comparing two strategies. Their sample size formula requires more detailed working assumptions than that required by the formula based on the weighted log-rank test given here. It would be desirable to develop a sample size formula based on the supremum weighted log-rank test that requires less detailed working assumptions.

Supplementary material

Supplementary material available at Biometrika online includes a discussion of sample size formulae for two variants of two-stage randomized trials and proofs of theoretical results.

Acknowledgments

The authors thank the editor, an associate editor and the referee for valuable comments. This research was supported by the National Institutes of Health, U.S.A.

REFERENCES

  • Chow SC, Shao J, Wang H. Sample Size Calculations in Clinical Research. London: Chapman and Hall; 2005.
  • Eng KH, Kosorok MR. A sample size formula for the supremum log-rank statistic. Biometrics. 2005;61:86–91. [PubMed]
  • Feng W, Wahed AS. A supremum log-rank test for comparing adaptive treatment strategies and corresponding sample size formula. Biometrika. 2008;95:695–707.
  • Feng W, Wahed AS. Sample size for two-stage studies with maintenance therapy. Statist. Med. 2009;28:2028–41. [PubMed]
  • Guo X, Tsiatis AA. A weighted risk set estimator for survival distributions in two-stage randomization designs with censored survival data. Int. J. Biostatistics. 2005;1:1–15.
  • Holland PW. Statistics and causal inference. J Am Statist Assoc. 1986;81:945–60.
  • Lavori PW, Dawson R. Dynamic treatment regimes: practical design considerations. Clin. Trials. 2003;1:9–20. [PubMed]
  • Lavori PW, Dawson R, Roth AJ. Flexible treatment strategies in chronic disease: clinical and research implications. Biol Psychiat. 2000;48:605–14. [PubMed]
  • Lieberman JA, Stroup TS, McEvoy JP, Swartz MS, Rosenheck RA, Perkins DO, Keefe RS, Davis SM, Davis CE, Lebowitz BD, Severe J, Hsiao JK. Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) Investigators. Effectiveness of antipsychotic drugs in patients with chronic schizophrenia. New Engl J Med. 2005;353:1209–23. [PubMed]
  • Lokhnygina Y, Helterbrand JD. Cox regression methods for two-stage randomization designs. Biometrics. 2007;63:422–8. [PubMed]
  • Lunceford JK, Davidian M, Tsiatis AA. Estimation of survival distributions of treatment strategies in two-stage randomization designs in clinical trials. Biometrics. 2002;58:48–57. [PubMed]
  • McKay JR, Lynch KG, Shepard DS, Ratichek S, Morrison R, Koppenhaver J, Pettinati HM. The effectiveness of telephone-based continuing care in the clinical management of alcohol and cocaine use disorders: 12-month outcomes. J Consult Clin Psychol. 2004;72:967–79. [PubMed]
  • Marlowe DB, Festinger DS, Arabia PL, Dugosh KL, Benasutti KM, Croft JR, McKay JR. Adaptive interventions in drug court: a pilot experiment. Criminal Justice Rev. 2008;33:343–60. [PMC free article] [PubMed]
  • Miyahara S, Wahed AS. Weighted Kaplan–Meier estimators for two-stage treatment regimes. Statist Med. 2010;29:2581–91. [PubMed]
  • Murphy SA. An experimental design for the development of adaptive treatment strategies. Statist Med. 2005;24:1455–81. [PubMed]
  • Murphy SA, Lynch KG, Oslin D, McKay JR, TenHave T. Developing adaptive treatment strategies in substance abuse research. Drug Alcohol Dependence. 2007;88:S24–S30. [PMC free article] [PubMed]
  • Murphy SA, van der Laan MJ, Robins JM, CPPRG Marginal mean models for dynamic regimes. J Am Statist Assoc. 2001;96:1410–23. [PMC free article] [PubMed]
  • Oetting AI, Levy JA, Weiss RD, Murphy SA. Statistical methodology for a SMART design in the development of adaptive treatment strategies. In: Shrout PE, editor. Causality and Psychopathology: Finding the Determinants of Disorders and their Cures. Arlington, VA: American Psychiatric Publishing, Inc; 2007.
  • Peto R, Peto J. Asymptotically efficient rank invariant test procedures (with discussion). J. R. Statist. Soc. A. 1972;135:185–207.
  • Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods: application to control of the healthy worker survivor effect. Comp Math Appl. 1986;14:1393–1512.
  • Robins JM. Causal Inference From Complex Longitudinal Data. In: Berkane M, editor. Lecture Notes in Statistics. Vol. 120. Berlin, Heidelberg, New York: Springer; 1997. pp. 69–117.
  • Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Statist Assoc. 1994;89:846–66.
  • Nelsen RB. An Introduction to Copulas. Berlin, Heidelberg, New York: Springer; 1998.
  • Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA, Thase ME, Nierenberg AA, Quitkin FM, Kashner TM. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Contr. Clin. Trials. 2004;25:119–42. [PubMed]
  • Schoenfeld DA. The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika. 1981;68:316–9.
  • Stone RM, Berg DT, George SL, Dodge RK, Paciucci PA, Schulman P, Lee EJ, Moore JO, Powell BL, Schiffer CA. Granulocyte-macrophage colony-stimulating factor after initial chemotherapy for elderly patients with primary acute myelogenous leukemia. New Engl J Med. 1995;332:1671–7. [PubMed]
  • Tummarello D, Mari D, Graziano F, Isidori P, Cetto G, Pasini F, Santo A, Cellerino R. A randomized, controlled phase III study of cyclophosphamide, doxorubicin, and vincristine with Etoposide (CAVE) or Teniposide (CAV-T), followed by recombinant interferon-α maintenance therapy or observation, in small cell lung carcinoma patients with complete responses. Cancer. 1997;80:2222–9. [PubMed]
  • Wahed AS, Tsiatis AA. Semiparametric efficient estimation of survival distribution in two-stage randomization designs in clinical trials with censored data. Biometrika. 2006;93:163–77.

Articles from Biometrika are provided here courtesy of Oxford University Press
