
# Screening Experiments for Developing Dynamic Treatment Regimes

## Abstract

Dynamic treatment regimes are time-varying treatments that individualize sequences of treatments to the patient. The construction of dynamic treatment regimes is challenging because a patient will be eligible for some treatment components only if he has not responded (or has responded) to other treatment components. In addition, there are usually a number of potentially useful treatment components and combinations thereof. In this article, we propose new methodology for identifying promising components and screening out negligible ones. First, we define causal factorial effects for treatment components that may be applied sequentially to a patient. Second, we propose experimental designs that can be used to study the treatment components. Surprisingly, modifications can be made to (fractional) factorial designs, more commonly found in the engineering statistics literature, for screening in this setting. Furthermore, we provide an analysis model that can be used to screen the factorial effects. We demonstrate the proposed methodology using examples motivated by the literature and also via a simulation study.

**Keywords:** Multi-stage Decisions, Experimental Design, Causal Inference

## 1. Introduction

*Dynamic treatment regimes* (Robins, 1986, 1987; Lavori et al., 2000; Murphy et al., 2001) are time-varying treatments increasingly employed in the treatment and management of chronic, relapsing disorders such as drug dependence, HIV infection and depression (Brooner and Kidorf, 2002; Rush et al., 2003; Lisziewicz and Lori, 2002). These regimes individualize the treatment level and type via decision rules that input patient outcomes collected during treatment and output recommended treatment alterations. A two-stage dynamic treatment regime specifies how to select the treatment combination via a sequence of decision rules, one per stage, {*d*_{1}, *d*_{2}}, where the decision rule *d*_{1} takes the information available initially, say *O*_{1}, and outputs stage 1 treatments *A*_{1}, and the decision rule *d*_{2} takes the information available at the beginning of stage 2, say {*O*_{1}, *A*_{1}, *O*_{2}}, and recommends stage 2 treatments, *A*_{2}. The time order is *O*_{1}, *A*_{1}, *O*_{2}, *A*_{2}, *O*_{3} where *O*_{3} is observed at the end of stage 2. The primary outcome is *Y* = *f*(*O*_{1}, *A*_{1}, *O*_{2}, *A*_{2}, *O*_{3}) for *f* known.

Dynamic treatment regimes are also often multi-component (i.e., multiple factor) treatments: components may include not only the treatments for the primary disorder but also adjunctive treatments for co-occurring problems, behavioral therapies to improve adherence, and delivery mechanisms. Presently dynamic treatment regimes are constructed via expert opinion and then evaluated in randomized two-group trials. Observational studies are typically used to assist experts in constructing the dynamic treatment regime. Observational analyses attempt to infer causal relations from non-randomized comparisons and are subject to bias when the compared groups differ in ways other than by type of treatment (Rubin, 1974, 1978). While there have been great advances in the epidemiological, statistical and other literatures in developing methods that precisely specify the assumptions under which this bias may be eliminated, randomized studies, when possible, remain the optimal approach to reducing bias (Rubin, 1974, 1978). The field of experimental design provides an approach to using randomization in the construction of multi-component treatments.

The focus of this article is on screening experiments for two stage dynamic treatment regimes. Screening experiments are used to whittle down the large number of treatment components resulting in a few promising dynamic treatment regimes. The formulation and analysis of screening experiments for multi-component dynamic treatment regimes requires the combination of ideas from causal inference and factorial experimental design.

Unfortunately, in settings such as substance abuse, mental health and HIV research, the classical design and analysis of screening experiments *cannot* be directly imported. This is because some stage 2 factors are only relevant for patients who responded (or did not respond) to prior stage 1 treatment factors. For this reason, in Section 2, we carefully define factorial effects so that these effects have a causal interpretation. In Section 3 we provide 2^{k} experimental designs and associated analysis methods that can be used to screen these effects. In Section 4, we consider 2^{k−m} designs. In this section we provide an approach to ascertain the aliasing of the factorial effects. Section 5 provides examples illustrating a variety of designs. In Section 6 a simulation study is used to evaluate the robustness of the proposed methods. Section 7 concludes with a summary and discussion of open problems.

## 2. Causal Effects of the Factors

Screening factors often occurs via the assessment of factorial effects in ANOVA decompositions (see Box, Hunter and Hunter, 1978; Wu and Hamada, 2000 or Montgomery and Jennings, 2006) for the conditional mean of the primary outcome. In our setting special care is required in defining effects. Consider a simple case where there is only one stage 1 factor and two stage 2 factors. Each factor takes on levels in {−1, 1}. Suppose *N* subjects are each randomized equally to the two levels of the stage one factor *A*_{1}. Then an indicator of early response is observed; *R* is 1 if the subject is responding following assignment of *A*_{1} and 0 otherwise (there is no *O*_{1} and *O*_{2} = *R*). For simplicity, we refer to individuals with *R* = 1, 0 as responders, non-responders, respectively. At stage 2, responders are randomized equally between the {−1, 1} levels of the factor
${A}_{2}^{(1)}$, and non-responders are randomized equally between the {−1, 1} levels of the factor
${A}_{2}^{(0)}$ (the superscripts, (1), (0), indicate stage 2 factors for responders, non-responders, respectively). At the end of the study the primary outcome *Y* (coded so that high values are preferable) is observed. Assuming subject observations are independent and identically distributed, a “factorial” decomposition is

$E[Y\mid {A}_{1},R,{A}_{2}^{(1)},{A}_{2}^{(0)}]={\eta}_{0}+{\eta}_{1}{A}_{1}+R({\eta}_{2}+{\eta}_{3}{A}_{1})+R({\beta}_{1}{A}_{2}^{(1)}+{\beta}_{2}{A}_{1}{A}_{2}^{(1)})+(1-R)({\alpha}_{1}{A}_{2}^{(0)}+{\alpha}_{2}{A}_{1}{A}_{2}^{(0)})\phantom{\rule{2em}{0ex}}(1)$

with the interpretation that the coefficients {*η*_{1}, *β*_{1}, *β*_{2}, *α*_{1}, *α*_{2}} represent factorial effects. *R* is included in the above decomposition since
${A}_{2}^{(1)}$ can only be assigned to subjects with *R* = 1 and similarly for
${A}_{2}^{(0)}$.

If we interpret the coefficients {*η*_{1}, *β*_{1}, *β*_{2}, *α*_{1}, *α*_{2}} as factorial effects then to screen factors *A*_{1},
${A}_{2}^{(1)},{A}_{2}^{(0)}$ we will base our inference on these coefficients. However, this can lead to erroneous conclusions concerning the usefulness of *A*_{1}, via *η*_{1}, for at least two reasons. First, in the medical/behavioral fields there is a plethora of both known and unknown common causes of *R* and *Y* (hence *R* is prognostic for *Y*). Consider the following example. Let *U*, a Bernoulli random variable with success probability
${\scriptstyle \frac{1}{2}}$, represent an unknown common cause of both the outcome *Y* and early response *R. U* might be an unknown genetic factor. Suppose that *Y* = *δ*_{0} + *δ*_{1}*U* + *ε* where *ε* (mean zero, finite variance) is independent of (*U*, *R*, *A*_{1},
${A}_{2}^{(0)},{A}_{2}^{(1)}$). Thus there is *no* effect of the stage 1 factor *A*_{1} or of the stage 2 factors (
${A}_{2}^{(0)},{A}_{2}^{(1)}$) on the outcome *Y*. Next suppose that
$P[R=1\mid U,{A}_{1}]=U\phantom{\rule{0.16667em}{0ex}}[{\scriptstyle \frac{{q}_{1}+{q}_{2}}{2}}+{\scriptstyle \frac{{q}_{1}-{q}_{2}}{2}}{A}_{1}]+(1-U)\phantom{\rule{0.16667em}{0ex}}[{\scriptstyle \frac{{q}_{3}+{q}_{4}}{2}}+{\scriptstyle \frac{{q}_{3}-{q}_{4}}{2}}{A}_{1}]$ where each *q*_{*j*} ∈ [0, 1]. Note that *A*_{1} may impact the early response (*q*_{1} − *q*_{2} ≠ 0 or *q*_{3} − *q*_{4} ≠ 0). We obtain

Thus
${\eta}_{1}={\scriptstyle \frac{{\delta}_{1}}{2}}\left({\scriptstyle \frac{1-{q}_{1}}{2-{q}_{1}-{q}_{3}}}-{\scriptstyle \frac{1-{q}_{2}}{2-{q}_{2}-{q}_{4}}}\right)$, which may differ from the true effect of zero. That is, *η*_{1} in (1) reflects biases that occur because we are conditioning on *R*, which is both an outcome of *A*_{1} and a prognostic variable for *Y*. Further discussion of this bias can be found in Robins and Greenland (1994) and Robins and Wasserman (1997).
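The size of this bias is easy to check numerically. The sketch below is a minimal simulation of the example above, with hypothetical values *q*_{1} = 0.9, *q*_{2} = 0.5, *q*_{3} = 0.1, *q*_{4} = 0.5 and *δ*_{1} = 2, for which the displayed expression gives *η*_{1} = −0.4; the slope of *Y* on *A*_{1} among non-responders (a regression that conditions on *R*) converges to this non-zero value even though *A*_{1} has no effect on *Y*.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400_000
delta0, delta1 = 1.0, 2.0
q1, q2, q3, q4 = 0.9, 0.5, 0.1, 0.5  # hypothetical response probabilities

U = rng.binomial(1, 0.5, N)      # unknown common cause of R and Y
A1 = rng.choice([-1, 1], N)      # randomized stage 1 factor
# P[R = 1 | U, A1] from the display in the text
pR = (U * ((q1 + q2) / 2 + (q1 - q2) / 2 * A1)
      + (1 - U) * ((q3 + q4) / 2 + (q3 - q4) / 2 * A1))
R = rng.binomial(1, pR)
Y = delta0 + delta1 * U + rng.normal(0, 0.5, N)  # A1 has NO effect on Y

# eta_1 from the closed form in the text
eta1 = (delta1 / 2) * ((1 - q1) / (2 - q1 - q3) - (1 - q2) / (2 - q2 - q4))

# OLS slope of Y on A1 among non-responders (i.e., conditioning on R = 0)
slope = np.polyfit(A1[R == 0], Y[R == 0], 1)[0]

print(eta1, slope)  # both near -0.4: conditioning on R manufactures an "effect"
```

The agreement of the two printed numbers illustrates that the bias is a property of conditioning on the post-randomization outcome *R*, not of sampling noise.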

Second, even if there are no unknown common causes of *R* and *Y*, *η*_{1} does not reflect the overall impact of *A*_{1}, as *A*_{1} may impact *Y* at least partially through its effect on *R* (Parmigiani, 2002; Murphy, 2005). For both reasons, clinical trial guidelines (ICH E9, 1999) warn against conditioning on outcomes such as side effects or other post-randomization covariates in comparing one treatment to another.

These two issues preclude the use of (1) for screening. To our knowledge the former issue does not arise in the traditional experimental design literature, even in clinical trials with multiple stages. For example, adaptive experimental designs (Hu and Rosenberger, 2006; Zacks, 1996; Patel, 1962) also involve multiple stages of (sequential) decision making, but each stage involves different subjects; that is, “sequential” there means decisions on a sequence of subjects, not sequential decisions on each subject. Since subjects can generally be assumed to respond independently (given membership in the same population of subjects), and the stages involve different subjects, the above issues do not occur.

### 2.1 Causal Effects in Terms of Potential Outcomes

In this section a model in terms of factorial effects for each fixed pattern of factor levels is developed; in the next subsection randomization of the factor levels is incorporated. This model will be in terms of potential outcomes (Rubin, 1978; Robins, 1986, 1987); the parameters in the model will be the causal effects. The idea is that each subject has multiple potential outcomes, one per factor level combination. However only one of the potential outcomes, that is, the outcome associated with the subject’s assigned factor levels, is observed; the remaining potential outcomes are missing. The potential outcome framework facilitates a precise definition of the causal factorial effects; as will be seen this is somewhat nuanced for stage 1 factorial effects. Also this framework facilitates the statement of the conditions under which estimators of the causal effects of a factor can be obtained.

Suppose there are *p*_{1} stage 1 factors, *p*_{12} stage 2 factors for responders and *p*_{02} stage 2 factors for non-responders. As is generally the case in screening, we assume each factor has two levels (see Section 3 for discussion), coded by values ±1. Let *a*_{1} denote a vector of stage 1 factor levels,
${a}_{2}^{(1)},{{a}_{2}^{(1)}}^{\prime}$ denote vectors of stage 2 factor levels for responders and
${a}_{2}^{(0)},{{a}_{2}^{(0)}}^{\prime}$ denote vectors of stage 2 factor levels for non-responders. Let
${R}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}\in \{0,1\}$ and
${Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}$ denote the stage 1 response and the primary outcome, respectively, that would be observed if the subject were assigned the dynamic treatment regime: “provide treatment *a*_{1} in stage 1 and then if an early response occurs provide
${a}_{2}^{(1)}$ otherwise provide
${a}_{2}^{(0)}$.” Note that there are 2^{*p*_{1}+*p*_{02}+*p*_{12}} different dynamic treatment regimes; thus, there are 2^{*p*_{1}+*p*_{02}+*p*_{12}} potential stage 1 outcomes and 2^{*p*_{1}+*p*_{02}+*p*_{12}} potential primary outcomes for each subject.
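As a small illustration of this count, the snippet below enumerates the dynamic treatment regimes for hypothetical factor counts *p*_{1} = 2 and *p*_{12} = *p*_{02} = 1 (the sizes used in the design example of Section 3), giving 2^{2+1+1} = 16 regimes.

```python
from itertools import product

p1, p12, p02 = 2, 1, 1  # hypothetical factor counts (Section 3 example sizes)

# one regime = stage 1 levels + stage 2 levels for responders and non-responders
regimes = [
    {"a1": a1, "a2_resp": a2r, "a2_nonresp": a2n}
    for a1 in product([-1, 1], repeat=p1)
    for a2r in product([-1, 1], repeat=p12)
    for a2n in product([-1, 1], repeat=p02)
]
print(len(regimes))  # 2 ** (p1 + p02 + p12) = 16
```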

Throughout we make two assumptions that hold if the stage 2 treatment is not revealed to the subject/clinical staff until after the occurrence of the stage 1 outcome. These assumptions will enable us to reduce the number of groups of subjects in the experimental design (see the next section).

#### Ignorability Assumption I

Assume that
${R}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}={R}_{{a}_{1},{{a}_{2}^{(0)}}^{\prime},{{a}_{2}^{(1)}}^{\prime}}$ for all {*a*_{1},
${a}_{2}^{(0)},{{a}_{2}^{(0)}}^{\prime},{a}_{2}^{(1)},{{a}_{2}^{(1)}}^{\prime}$}.

In words, this is the assumption that a subject’s stage 1 outcome remains the same regardless of the stage 2 treatment assignments. Below we use the notation *R*_{*a*_{1}} instead of ${R}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}$.

#### Ignorability Assumption II

Assume that
${Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}{R}_{{a}_{1}}={Y}_{{a}_{1},{{a}_{2}^{(0)}}^{\prime},{a}_{2}^{(1)}}{R}_{{a}_{1}}$ and
${Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}(1-{R}_{{a}_{1}})={Y}_{{a}_{1},{a}_{2}^{(0)},{{a}_{2}^{(1)}}^{\prime}}(1-{R}_{{a}_{1}})$ for all {*a*_{1},
${a}_{2}^{(0)},{{a}_{2}^{(0)}}^{\prime},{a}_{2}^{(1)},{{a}_{2}^{(1)}}^{\prime}$}.

For example, this assumption states that the primary outcome of a subject who does not respond in stage 1 does not depend on what this subject would have been assigned had he/she responded in stage 1. These assumptions may be violated if, as part of the protocol, subjects are informed in stage 1 of which stage 2 treatment they will be offered if they do not respond at stage 1. It may happen that subjects, who know that they will be assigned a rather burdensome stage 2 treatment upon non-response, will try to proactively avoid this burdensome treatment by adhering more faithfully to their assigned stage 1 treatment than would otherwise be the case.

In the sequel, stage 1 effects refer to all effects involving *only* stage 1 factors and stage 2 effects refer to all effects involving *at least* one stage 2 factor. The stage 2 causal effects are defined via a saturated linear regression model for

$E[{Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}\mid {R}_{{a}_{1}}]$

in *a*_{1},
${a}_{2}^{(1)}{R}_{{a}_{1}},{a}_{2}^{(0)}(1-{R}_{{a}_{1}})$ and their higher order interactions. For example suppose that there is one stage 1 factor and two stage 2 factors (*p*_{1} = *p*_{02} = *p*_{12} = 1); then similar to (1) we have

where *α*_{*j*}, *β*_{*j*}, *j* = 1, 2 are the stage 2 causal effects; in particular *α*_{1}, *β*_{1} are the stage 2 main effects and *α*_{2}, *β*_{2} are the stage 2 interaction effects. If we solve for *α*_{*j*}, *β*_{*j*}, *j* = 1, 2, we see that the causal effects are contrasts of conditional means involving potential outcomes; for example,

Recall that under ignorability assumption II, the quantities above do not depend on the value of ${a}_{2}^{(0)}$. The coefficients in the above linear regression are 1/2 the size of the usual factorial effects (Wu and Hamada, 2000, pg. 113). (Note there is some discrepancy in the literature in how factorial effects are defined; for example, Byar and Piantadosi’s (1985) definition of a main effect is one-half the size of Wu and Hamada’s definition.) Throughout this work we use the regression formulation and refer to the regression coefficients as factorial effects.

As discussed at the beginning of this section the usual definition of the stage 1 effect (for example, via *η*_{1} in (3)) can lead to erroneous conclusions. To avoid this problem we do not condition on *R*_{*a*_{1}}; instead we marginalize over *R*_{*a*_{1}} to obtain causal effects. Consistent with ignorability assumption II, we use the notation ${Y}_{{a}_{1},{a}_{2}^{(0)}}$ if *R*_{*a*_{1}} = 0 and ${Y}_{{a}_{1},{a}_{2}^{(1)}}$ if *R*_{*a*_{1}} = 1. We define the stage 1 causal factorial effects via a saturated linear regression model for

in *a*_{1}. Again suppose that *p*_{1} = *p*_{02} = *p*_{12} = 1; then the linear regression is simple:

Solving for *ϕ*_{2} we obtain,

*ϕ*_{2} is the causal main effect of *A*_{1}.

In general the causal main effect of one of the *p*_{1} stage 1 factors is the average of contrasts between two means of *Y* one for each level of the stage 1 factor, marginal over the distribution of *R* (and all other intermediate outcomes in *O*_{2} as well); the average is over the remaining factor levels (with all other factors taking levels ±1 with equal probability). The averaging/marginalization with respect to a uniform distribution over the other factor levels is consistent with the definition of factorial effects in experimental design (Wu and Hamada, 2000; Byar and Piantadosi, 1985). Note the definition of the stage 2 treatment effects also involves marginalization (except the stage 2 treatments are nested within the outcome *R* and thus the contrasts are between means conditional on *R*).

### 2.2 Effects in Terms of Conditional Expectations Involving Randomized Factors

Suppose we have experimental data in which the factors are randomized according to some distribution on {−1, 1}^{p}^{1+}^{p}^{02+}^{p}^{12}. The following provides definitions of the factorial effects in terms of the resulting data distribution. Let *A*_{1} denote the random vector of *p*_{1} stage 1 factor levels,
${A}_{2}^{(1)}$ denote the random vector of *p*_{12} stage 2 factor levels for responders and
${A}_{2}^{(0)}$ denote the random vector of *p*_{02} stage 2 factor levels for non-responders. Lower case letters denote realizations of these random vectors. In this case {*A*_{1},
${A}_{2}^{(0)},{A}_{2}^{(1)}$} is clearly independent of the collection {*R*_{*a*_{1}},
${Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}$ : *a*_{1} ∈ {−1, 1}^{*p*_{1}},
${a}_{2}^{(0)}\in {\{-1,1\}}^{{p}_{02}},{a}_{2}^{(1)}\in {\{-1,1\}}^{{p}_{12}}$}. On each subject we observe *A*_{1}, *R*,
${A}_{2}^{(0)},{A}_{2}^{(1)}$, *Y* (when *R* = 1, factor levels given by
${A}_{2}^{(1)}$ are assigned and when *R* = 0, factor levels given by
${A}_{2}^{(0)}$ are assigned). We connect the potential outcomes to the observed data via the consistency assumption (Robins, 1997): assume that for all values of {*a*_{1},
${a}_{2}^{(0)},{a}_{2}^{(1)}$}, if *A*_{1} = *a*_{1},
${A}_{2}^{(0)}={a}_{2}^{(0)},{A}_{2}^{(1)}={a}_{2}^{(1)}$ then
$Y={Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}$ and *R* = *R*_{*a*_{1}}.

The randomization of factor levels plus the consistency and ignorability assumptions imply that *E*[*R*_{*a*_{1}}] = *E*[*R*_{*a*_{1}} | *A*_{1} = *a*_{1}] = *E*[*R* | *A*_{1} = *a*_{1}]. Similarly for each *r* ∈ {0, 1}, *P*[*R*_{*a*_{1}} = *r*, *A*_{1} = *a*_{1}, ${A}_{2}^{(r)}={a}_{2}^{(r)}$] = *P*[*R* = *r*, *A*_{1} = *a*_{1}, ${A}_{2}^{(r)}={a}_{2}^{(r)}$]. Assuming this probability is nonzero, $E[{Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}\mid {R}_{{a}_{1}}=r]$ is equal to $E[{Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}\mid {A}_{1}={a}_{1},{R}_{{a}_{1}}=r,{A}_{2}^{(r)}={a}_{2}^{(r)}]$, which by the consistency and ignorability assumptions is in turn equal to $E[Y\mid {A}_{1}={a}_{1},R=r,{A}_{2}^{(r)}={a}_{2}^{(r)}]$. As a result, the stage 2 causal effects in (2) are simply the coefficients of terms involving ${a}_{2}^{(0)}$ in a saturated linear model for the conditional mean $E[Y\mid {A}_{1}={a}_{1},R=0,{A}_{2}^{(0)}={a}_{2}^{(0)}]$ and the coefficients of terms involving ${a}_{2}^{(1)}$ in a saturated linear model for the conditional mean $E[Y\mid {A}_{1}={a}_{1},R=1,{A}_{2}^{(1)}={a}_{2}^{(1)}]$.

The discussion above implies that in the simple setting where there is only one stage 1 factor and two stage 2 factors, the *β* (*α*) parameters in (1) represent the stage 2 causal effects for the responders (non-responders, respectively). A similar statement cannot be made for the stage 1 effects. Instead recall that the stage 1 causal effects are given by a saturated linear model for the mean in (4). This mean can be rewritten as

Again using randomization and the consistency assumption this mean can be reexpressed as

Thus the stage 1 causal effects are given by the coefficients in a saturated linear model for the above sum. The above formula (5) is a version of Robins’ “G-computation” formula (Robins, 1986); this formula frequently arises in causal inference for time-varying treatments. A more intuitive definition for the stage 1 effects is obtained if the stage 2 factors are independently randomized according to discrete uniform distributions on {−1, 1}. Then (5) simplifies to *E*[*Y* | *A*_{1} = *a*_{1}]; a saturated linear model for this conditional mean provides the definition of the stage 1 effects in this special setting.
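This simplification can be checked numerically. The sketch below uses hypothetical response and outcome models with a single stage 2 factor uniformly randomized on {−1, 1}; the G-computation average of the form in (5) (stratum means averaged uniformly over the stage 2 level, weighted by the estimated response rate) agrees with the plain conditional mean *E*[*Y* | *A*_{1} = *a*_{1}].

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
A1 = rng.choice([-1, 1], N)
A2 = rng.choice([-1, 1], N)          # stage 2 factor, uniform on {-1, 1}
R = rng.binomial(1, 0.5 + 0.2 * A1)  # hypothetical response model
# hypothetical outcome: A2 acts as A2^(1) for responders, as A2^(0) otherwise
Y = (1.0 + 0.3 * A1 + 0.5 * R
     + (0.4 * R + 0.2 * (1 - R)) * A2 + rng.normal(0, 1, N))

def g_formula(a1):
    """G-computation in the spirit of (5): average the stratum means over a
    uniform distribution on the stage 2 level, weighted by the response rate."""
    out = 0.0
    for r in (0, 1):
        p_r = np.mean(R[A1 == a1] == r)
        strat = np.mean([Y[(A1 == a1) & (R == r) & (A2 == a2)].mean()
                         for a2 in (-1, 1)])  # uniform (1/2, 1/2) average
        out += p_r * strat
    return out

for a1 in (-1, 1):
    print(g_formula(a1), Y[A1 == a1].mean())  # agree under uniform randomization
```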

Somewhat surprisingly the stage 1 and stage 2 causal effects are terms in one model for the conditional mean. Define

This definition is used to make the next formulae concise; note (5) is equal to $E\left[{E}_{{A}_{2}\sim U}[E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]\mid {A}_{1},R]\mid {A}_{1}={a}_{1}\right]$. Consider the telescoping sum for the conditional mean, $E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]$:

The first term (6) contains the stage 2 causal effects. The last term (7) is just another way to write (5) and thus contains all stage 1 causal effects. The middle term is composed entirely of nuisance parameters and can be rewritten as *R* − *E*[*R*|*A*_{1}] times

Thus in the example of one stage 1 factor and two stage 2 factors (one for responders, the other for non-responders), we replace the linear regression in (1) by

where the first row corresponds to (7), the second row to (8) times (*R* − *E*[*R*|*A*_{1}]) and the last row to (6). Note that even though we call the parameters *ψ*_{1}, *ψ*_{2} and the stage 1 response rate, *E*[*R*|*A*_{1}], nuisance parameters, these parameters may be of independent interest. This is certainly the case for the stage 1 response rate; additionally *ψ*_{1}, *ψ*_{2} reflect the ability of the stage 1 response to predict the primary outcome. Note that the nuisance parameters *ψ*_{1}, *ψ*_{2} would be absent from the above formula only if the centered response indicator, *R* − *E*[*R*|*A*_{1}], were zero, that is, only if the stage 1 response, *R*, were a deterministic function of *A*_{1}.

## 3. 2^{k} Factorial Two-Stage Designs

Our goal is to screen out inactive factors. Usually we consider two levels for each factor. The two levels should be selected to be sufficiently disparate that an effect can be detected, yet not so disparate as to be unethical. In settings where there is likely to be a downturn in the primary outcome at high levels of the factor, we select the higher level small enough so that the downturn is thought to be insufficient to eliminate the effect. If necessary, subsequent trials can be used to more fully explore the dose-response relationship.

Suppose that there are two stage 1 factors *A*_{1} = {*A*_{11}, *A*_{12}} and one stage 2 factor for responders
${A}_{2}^{(1)}$ and one for non-responders,
${A}_{2}^{(0)}$. Consider the experimental design in Table 1. Each row in the design of Table 1 provides the factor levels assigned to a group of subjects. The column labeled
${A}_{2}^{(1)}={A}_{2}^{(0)}$ designates the identical factor level settings for
${A}_{2}^{(1)}$ and
${A}_{2}^{(0)}$. Note that once responder (non-responder) status is known for each subject, we can view the design as two 2^{3} full factorial designs, one for responders to stage 1 treatment and the other for non-responders to stage 1 treatment.

Each row of the design corresponds to a group of subjects, all of whom are assigned the same dynamic treatment regime. For example, the subjects in the first group are assigned the dynamic treatment regime: “Provide *A*_{11} = +1, *A*_{12} = +1, if they respond, assign
${A}_{2}^{(1)}=+1$ in stage 2 and if they do not respond, assign
${A}_{2}^{(0)}=+1$ in stage 2” (note that a dynamic treatment regime must specify the stage 2 treatments for both the responders and the non-responders). For convenience we use the term “stacked” to indicate situations in which the factor level settings of two factors are identical across all rows of the experimental design; factors
${A}_{2}^{(1)}$ and
${A}_{2}^{(0)}$ are stacked. The stacking is advantageous because investigators only have to implement a 2^{3} design yet under the ignorability assumptions, we have information on all 2^{4} dynamic treatment regimes. For example even though no row of the design corresponds to “assign *A*_{11} = +1, *A*_{12} = +1 and if response assign
${A}_{2}^{(1)}=+1$ but if non-response assign
${A}_{2}^{(0)}=-1$,” we may combine the responders in row 1 and the non-responders in row 2 together to produce a new group of subjects assigned this dynamic treatment regime.

As in Table 1, all designs considered here possess the property that for each factor half of the rows in the design are set at the +1 level and half of the rows are set at the −1 level. This property will play a crucial role in ascertaining the aliasing between effects. Note however that due to the difficulties in recruiting subjects inherent in clinical trials, the groups of subjects may not be exactly equal in size; second, even if there are equal numbers of subjects in each group, the number of subjects assigned a particular level of ${A}_{2}^{(1)},{A}_{2}^{(0)}$ depends on the response rate.
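A design matrix of the kind shown in Table 1 can be generated mechanically. The sketch below builds the 2^{3} full factorial over (*A*_{11}, *A*_{12}) and the stacked stage 2 column, and verifies the balance property discussed above (each factor at +1 in exactly half the rows).

```python
from itertools import product

# 2^3 full factorial over (A11, A12, A2): each row is one group of subjects.
rows = list(product([1, -1], repeat=3))
print(f"{'A11':>4}{'A12':>5}{'A2(1)=A2(0)':>14}")
for a11, a12, a2 in rows:
    # the stage 2 column is "stacked": A2^(1) and A2^(0) share the same setting
    print(f"{a11:>4}{a12:>5}{a2:>14}")

# balance property: each factor is set at the +1 level in half of the rows
for j in range(3):
    assert sum(r[j] == 1 for r in rows) == len(rows) // 2
```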

As before suppose there are *p*_{1} stage 1 factors, *p*_{12} stage 2 factors for responders and *p*_{02} stage 2 factors for non-responders. Here we consider screening the stage 1 and stage 2 effects using data from a 2^{k} factorial two-stage experiment in which *k* = *p*_{1} + max(*p*_{12}, *p*_{02}). In contrast to classical analyses in experimental design (analyses that assume normality and are exact for small group sizes), we consider analyses that are justified by large sample theory (large group sizes) and thus are approximate for small group sizes. With this viewpoint we interpret a full factorial design as a design in which each subject is assigned with equal probability to one of the 2^{k} possible combinations of the factor levels. In large samples this randomization results in approximately equal sized groups assigned to each row of the design. Denote the collections of stage 1 and stage 2 factors by *A*_{1} and *A*_{2}, respectively. We have that (*A*_{1}, *A*_{2}) has a discrete uniform distribution on {−1, 1}^{k}, where ${A}_{2}^{(1)}$ is the first *p*_{12} entries in *A*_{2} (these will be the settings of the stage 2 factors for responders) and ${A}_{2}^{(0)}$ is the first *p*_{02} entries in *A*_{2} (these will correspond to the settings of the stage 2 factors for non-responders). The data consists of *N* i.i.d. copies of {*A*_{1}, *R*, *A*_{2}, *Y*}. There are 2^{k} unique values of (*A*_{1}, *A*_{2}).

Define **X**_{1} as the random vector composed of a 1, all stage 1 factors and their two-way and higher order products (2^{*p*_{1}} terms),
${\mathbf{X}}_{2}^{(1)}$ as the random vector composed of all stage 2 factors for responders, their two-way and higher order products, and all products of these combinations with members of **X**_{1} (2^{*p*_{1}+*p*_{12}} − 2^{*p*_{1}} terms).
${\mathbf{X}}_{2}^{(0)}$ is defined similarly. The decomposition in (6 – 8) can be written via a linear model

where *β* and *α* are the vectors of all stage 2 causal effects for responders and non-responders, respectively, *ψ* is the vector of nuisance effects, *ϕ* is the vector of stage 1 causal effects and *p*(**X**_{1}) = *E*[*R*|*A*_{1}] (note **X**_{1} is a function of *A*_{1}).

Recall that a parameter is identifiable if different values of the parameter lead to different distributions of {*A*_{1}, *R*, *A*_{2}, *Y*} (van der Vaart, 1998). Identifiability in this setting is straightforward.

### Lemma 1

Assume that *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1. The parameters *ϕ*, *ψ*, *β* and *α* in (9) are identifiable.

The above lemma is a direct consequence of two facts. First the conditional means,
$E[R\mid {A}_{1}={a}_{1}],E[Y\mid {A}_{1}={a}_{1},R=1,{A}_{2}^{(1)}={a}_{2}^{(1)}]$ and
$E[Y\mid {A}_{1}={a}_{1},R=0,{A}_{2}^{(0)}={a}_{2}^{(0)}]$ for all values of {*a*_{1},
${a}_{2}^{(1)},{a}_{2}^{(0)}$}, are identifiable. Second the expectation of
${\left[{\mathbf{X}}_{1}^{T},\phantom{\rule{0.38889em}{0ex}}(R-p({\mathbf{X}}_{1})){\mathbf{X}}_{1}^{T},\phantom{\rule{0.38889em}{0ex}}R{\mathbf{X}}_{2}^{{(1)}^{T}},\phantom{\rule{0.38889em}{0ex}}(1-R){\mathbf{X}}_{2}^{{(0)}^{T}}\right]}^{T}$ times its transpose is given by a block diagonal with blocks,
$E[{\mathbf{X}}_{1}{\mathbf{X}}_{1}^{T}],\phantom{\rule{0.38889em}{0ex}}E[{\mathbf{X}}_{1}p({\mathbf{X}}_{1})\phantom{\rule{0.16667em}{0ex}}(1-p({\mathbf{X}}_{1})){\mathbf{X}}_{1}^{T}],\phantom{\rule{0.38889em}{0ex}}E\left[{\mathbf{X}}_{2}^{(1)}p({\mathbf{X}}_{1}){\mathbf{X}}_{2}^{{(1)}^{T}}\right]$ and
$E\left[{\mathbf{X}}_{2}^{(0)}(1-p({\mathbf{X}}_{1})){\mathbf{X}}_{2}^{{(0)}^{T}}\right]$ (recall *p*(**X**_{1}) = *E*[*R*|*A*_{1}]). These blocks are invertible under the conditions specified in Lemma 1. The assumption of Lemma 1 is not necessary for identifiability of *β* and *α* and can be weakened to only *P*[*E*[*R*|*A*_{1}] > 0] = 1 for identifiability of *β* (with a similar statement for *α*). The details of the proof are omitted.

To utilize (9) in screening, estimate *p*(**X**_{1}) by forming the average of *R* for each value of **X**_{1} to obtain $\widehat{p}({\mathbf{X}}_{1})$. Conduct a linear regression of *Y* on {**X**_{1}, (*R* − $\widehat{p}({\mathbf{X}}_{1})$)**X**_{1},
$R{\mathbf{X}}_{2}^{(1)},(1-R){\mathbf{X}}_{2}^{(0)}$} to obtain $\widehat{\phi}$, $\widehat{\psi}$, $\widehat{\beta}$, $\widehat{\alpha}$. Note that in contrast to classical screening analyses, these estimators are not orthogonal. This is because homogeneous variance is not assumed (e.g. the conditional variance of *Y* may vary by factor levels), the group sizes may not be identical, and the response rates will likely vary by the stage 1 factor levels.
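The two-step procedure just described can be sketched for the simplest case, *p*_{1} = *p*_{12} = *p*_{02} = 1 with a single stacked stage 2 factor. The response and outcome models below are hypothetical, chosen so the true stage 2 effects are *β* = (0.4, 0.1) for responders and *α* = (0.2, −0.05) for non-responders; the regression recovers them along with the stage 1 effects *ϕ*.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
A1 = rng.choice([-1, 1], N)
A2 = rng.choice([-1, 1], N)           # stacked stage 2 factor
R = rng.binomial(1, 0.5 + 0.15 * A1)  # hypothetical response rates
Y = (1.0 + 0.25 * A1 + 0.6 * R
     + R * (0.4 * A2 + 0.1 * A1 * A2)          # responder effects (beta)
     + (1 - R) * (0.2 * A2 - 0.05 * A1 * A2)   # non-responder effects (alpha)
     + rng.normal(0, 1, N))

# Step 1: estimate p(X1) = E[R | A1] by the average of R at each level of A1
p_hat = np.where(A1 == 1, R[A1 == 1].mean(), R[A1 == -1].mean())

# Step 2: regress Y on {X1, (R - p_hat) X1, R X2^(1), (1 - R) X2^(0)}
X1 = np.column_stack([np.ones(N), A1])
D = np.column_stack([
    X1,                          # stage 1 effects (phi)
    (R - p_hat)[:, None] * X1,   # nuisance effects (psi)
    (R * A2)[:, None] * X1,      # responder stage 2 effects (beta)
    ((1 - R) * A2)[:, None] * X1,  # non-responder stage 2 effects (alpha)
])
coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
phi, psi, beta, alpha = coef[:2], coef[2:4], coef[4:6], coef[6:8]
print(beta, alpha)  # beta near (0.4, 0.1), alpha near (0.2, -0.05)
```

Here the stage 1 main effect estimate *ϕ̂*_{2} targets the marginal contrast (0.25 plus 0.6 times the change in response rate), illustrating that *ϕ* absorbs the effect of *A*_{1} that operates through *R*.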

### Lemma 2

Assume that the variance of *Y* is finite and that *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1. Then as *N* → ∞, {
$\sqrt{N}(\widehat{\phi}-\phi ),\sqrt{N}(\widehat{\beta}-\beta ),\sqrt{N}(\widehat{\alpha}-\alpha )$} converges in distribution to a multivariate normal. If
$P\left[\mathit{Var}(Y\mid {A}_{1},R,{A}_{2})=\mathit{Var}(Y\mid {A}_{1},R,{A}_{2}^{(R)})\right]=1$, then the estimators $\widehat{\phi}$, $\widehat{\beta}$ and $\widehat{\alpha}$ are locally semiparametric efficient.

In many settings all responders will be provided the same treatment. In this case there are no ${A}_{2}^{(1)}$ factors and no *β*, and *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1 can be replaced by *P*[*E*[*R*|*A*_{1}] > 0] = 1 (similar statements can be made if all non-responders are provided the same treatment). See Tsiatis (Chapter 4, 2006) for an introduction to and definition of local semiparametric efficiency. Note the assumption on the conditional variances holds under the consistency and ignorability assumptions. This lemma is a special case of Lemma 4 below. The formula for the variance-covariance matrix is provided in Appendix A.1 along with an asymptotically consistent estimator and the proof of Lemma 4.

## 4. 2^{k−m} Factorial Two-Stage Designs

A better use of resources for screening is to use a 2^{k−m} fractional factorial design, which uses only a 1/2^{m} fraction of the groups in a 2^{k} design. An example of a 2^{3–1} design, for the setting in which there are two stage 1 factors and one stage 2 factor each for responders/non-responders, is given in Table 2.

This design has one-half as many rows (groups of subjects) as the full factorial design in Table 1; see the examples in Section 6 for realistic designs. Usually a 2^{k−m} design is selected by first ascertaining plausible working assumptions concerning the factorial effects. Wu and Hamada (2000) provide principles that can be used to guide these working assumptions in the absence of scientific knowledge (e.g. often it is plausible that three-way and higher order effects are negligible). Next, the aliasing associated with candidate 2^{k−m} designs is ascertained; parameters, that is, effects, are aliased when only their sum can be identified. Note that the aliasing occurring here is different from the confounding that occurs in causal inference, as the latter is unplanned and due to potentially unknown variables whereas the former is the result of the planned study design and is known. Lastly the design with the aliasing structure that is most consistent with the working assumptions is selected. See Wu and Hamada (2000) for simple approaches to creating 2^{k−m} designs.

As was the case for full factorial designs, we interpret a fractional factorial design as a design in which each subject is assigned with equal probability to one of the 2^{k−m} possible combinations of the factor levels. Thus (*A*_{1}, *A*_{2}) has a discrete uniform distribution across the factor levels given by the 2^{k−m} rows of the design (recall that ${A}_{2}^{(1)}\phantom{\rule{0.16667em}{0ex}}({A}_{2}^{(0)})$ is set to the first *p*_{12} (*p*_{02}, respectively) entries in *A*_{2}). The experimental data consists of *N* i.i.d. copies of {*A*_{1}, *R*, *A*_{2}, *Y*}.

### 4.1 Identification of Effects

We use a large sample notion of aliasing. Effects will be aliased if only their sum can be identified (operationalized here to mean that only asymptotically consistent estimators of their sum are available). This is a weaker concept than the finite sample concept of aliasing as discussed, for example, in Wu and Hamada (2000); in that setting, effects are aliased if we are only able to obtain an unbiased estimator of their sum.

Defining words are generally used to ascertain finite sample aliasing. Consider the design in Table 2. The defining word for this design is
$1={A}_{11}{A}_{12}{A}_{2}^{(1)}$ (the product of the ±1’s across the three columns is equal to 1). Equivalently
$1={A}_{11}{A}_{12}{A}_{2}^{(0)}$ since the levels of
${A}_{2}^{(1)}$ are equal to the levels of
${A}_{2}^{(0)}$. Similarly
${A}_{2}^{(1)}={A}_{11}{A}_{12}$ (the product of the ±1’s in columns 1 and 2 is equal to the ±1’s in column 3). In a standard 2^{3–1} design, the defining word
$1={A}_{11}{A}_{12}{A}_{2}^{(0)}$ means that each main effect is aliased with a two way interaction (e.g. A_{11} is aliased with
${A}_{12}{A}_{2}^{(0)}$, etc.). It turns out that even though the screening analysis model for a dynamic treatment regime is based on a nonstandard model (9), we will, nonetheless, be able to use the defining words to ascertain the large sample aliasing. This means that we can use commonly available designs such as those provided by Wu and Hamada (2000) to design screening studies for dynamic treatment regimes. Lemma 3, below, provides conditions under which the defining words can be used to determine the aliasing of effects.
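The mod-2 arithmetic behind defining words can be sketched directly: representing an effect as a set of factor labels, multiplying by the defining word is a symmetric difference (any factor appearing twice squares to 1). This minimal sketch uses the Table 2 defining word; the label names are ours, not the paper's notation.

```python
# An effect is a frozenset of factor labels; multiplying by a defining word
# is symmetric difference, since any repeated factor squares to +1.
def alias(effect, word):
    return frozenset(effect) ^ frozenset(word)

# Defining word 1 = A11 * A12 * A2 for the 2^(3-1) design of Table 2:
word = {"A11", "A12", "A2"}

# Each main effect is aliased with the two way interaction of the others.
assert alias({"A11"}, word) == frozenset({"A12", "A2"})   # A11 ~ A12*A2
assert alias({"A12"}, word) == frozenset({"A11", "A2"})   # A12 ~ A11*A2
assert alias({"A2"}, word) == frozenset({"A11", "A12"})   # A2  ~ A11*A12
```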

As in Section 3 we construct the vector
$\left[{\mathbf{X}}_{1}^{T},{\left({\mathbf{X}}_{2}^{(1)}\right)}^{T},{\left({\mathbf{X}}_{2}^{(0)}\right)}^{T}\right]$ from (*A*_{1}, *A*_{2}). This vector takes on 2^{k−m} equally likely vector values. Construct the matrix
$\left[{\stackrel{\sim}{\mathbf{X}}}_{1},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\right]$ with each row corresponding to one of these vector values. The defining words identify the identical columns in
$\left[{\stackrel{\sim}{\mathbf{X}}}_{1},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}\right]$ and in
$\left[{\stackrel{\sim}{\mathbf{X}}}_{1},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\right]$. Consider the design in Table 2; here
$$\left[{\stackrel{\sim}{\mathbf{X}}}_{1},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\right]=\left[\begin{array}{cccccccccccc}1&1&1&1&1&1&1&1&1&1&1&1\\ 1&1&-1&-1&-1&-1&1&1&-1&-1&1&1\\ 1&-1&1&-1&-1&1&-1&1&-1&1&-1&1\\ 1&-1&-1&1&1&-1&-1&1&1&-1&-1&1\end{array}\right]$$(10)

Labeling the columns of ${\stackrel{\sim}{\mathbf{X}}}_{1}$ as {1, *A*_{11}, *A*_{12}, *A*_{11}*A*_{12}} and the columns of
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(j)}$ as {
${A}_{2}^{(j)},{A}_{11}{A}_{2}^{(j)},{A}_{12}{A}_{2}^{(j)},{A}_{11}{A}_{12}{A}_{2}^{(j)}$}, we see that the defining word
$1={A}_{11}{A}_{12}{A}_{2}^{(j)}$ (*j* = 0, 1) identifies the identical columns.

#### Lemma 3

Assume that *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1. Make the *formal* assumption:

For each identical column in both ${\stackrel{\sim}{\mathbf{X}}}_{1}$ and
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ and for each identical column in both ${\stackrel{\sim}{\mathbf{X}}}_{1}$ and
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$ either

- (3a) the associated nuisance effect(s) (
*ψ* parameters) in (9) are zero or

- (3b) the associated stage 2 effect(s) (
*β* or *α* parameters) in (9) are zero.

Then the defining words can be used to ascertain aliasing.

To make Lemma 3 concrete, consider the analog of (9) for the design in Table 2:
$$E[Y\mid {A}_{1},R,{A}_{2}]={\phi}_{1}+{\phi}_{2}{A}_{11}+{\phi}_{3}{A}_{12}+{\phi}_{4}{A}_{11}{A}_{12}+(R-p({A}_{1}))({\psi}_{1}+{\psi}_{2}{A}_{11}+{\psi}_{3}{A}_{12}+{\psi}_{4}{A}_{11}{A}_{12})+R({\beta}_{1}{A}_{2}^{(1)}+{\beta}_{2}{A}_{11}{A}_{2}^{(1)}+{\beta}_{3}{A}_{12}{A}_{2}^{(1)}+{\beta}_{4}{A}_{11}{A}_{12}{A}_{2}^{(1)})+(1-R)({\alpha}_{1}{A}_{2}^{(0)}+{\alpha}_{2}{A}_{11}{A}_{2}^{(0)}+{\alpha}_{3}{A}_{12}{A}_{2}^{(0)}+{\alpha}_{4}{A}_{11}{A}_{12}{A}_{2}^{(0)})$$(11)

The matrices in (10) permit a variety of formal assumptions. For example, we might make the formal assumption that *ψ*_{2} = *ψ*_{3} = *ψ*_{4} = 0 and *β*_{4} = *α*_{4} = 0. Then Lemma 3 allows us to read off the large sample aliasing from the defining word,
$1={A}_{11}{A}_{12}{A}_{2}^{(j)}$, *j* = 0, 1. That is, each main effect is aliased with a two-way interaction between the remaining predictors (e.g. *A*_{11} is aliased with
${A}_{12}{A}_{2}^{(0)}$ and so on). This design is most useful if, according to the working assumptions, there are no two-way interactions.

See Appendix A.1 for a proof of Lemma 3. As before, if there are only stage 2 factors for responders (non-responders) then we need only assume *P*[*E*[*R*|*A*_{1}] > 0] = 1 (respectively *P*[*E*[*R*|*A*_{1}] < 1] = 1). In the next section we provide an algorithm based on this proof for determining the aliasing and constructing the predictors in the screening analysis. We only use 2^{k−m} designs for which either (3a) or (3b) holds for any common stage 1 and stage 2 columns in ${\stackrel{\sim}{\mathbf{X}}}_{1}$ and ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ (or ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$) of Lemma 3. The use of designs in which this formal assumption is violated produces aliasing that is not easily discernable by the defining words (see Section A.2 in the Appendix for an example). Note that we can choose a design in which ${\stackrel{\sim}{\mathbf{X}}}_{1}$ and ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ (or ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$) do not share a column (see example 1 in Section 5) and thus formal assumptions are not required. The formal assumptions, in addition to the working assumptions, are used to select the appropriate 2^{k−m} design. The design is selected so that, according to the working assumptions, all but one of the effects in each set of aliased effects is thought to be negligible.

### 4.2 Screening Effects

Of course we cannot fit the model (9), or in the case of our example, (11), since the 2^{k−m} design does not provide *Y* outcomes for all 2^{k} values of {*A*_{1}, *A*_{2}} (the omitted outcomes correspond to the omitted rows in the design). To permit a regression model we rewrite (9) in terms of aliased effects. Using the results of Lemma 3, we reduce the number of columns in ${\stackrel{\sim}{\mathbf{X}}}_{1}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ and ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$. Note removing columns from these matrices is equivalent to eliminating predictors in **X**_{1}, ${\mathbf{X}}_{2}^{(1)}$ and ${\mathbf{X}}_{2}^{(0)}$. The construction of the regressors for the screening analysis is straightforward except when formal assumption (3b) is made; see step 4 below.

#### Algorithm for Constructing the Regressors

1. Construct ${\stackrel{\sim}{\mathbf{X}}}_{1}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ and ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$. Eliminate duplicate columns in each of ${\stackrel{\sim}{\mathbf{X}}}_{1}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ and ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$ and the associated predictors from ${\mathbf{X}}_{1}\phantom{\rule{0.16667em}{0ex}}({\mathbf{X}}_{2}^{(1)},{\mathbf{X}}_{2}^{(0)})$, retaining only the predictor that, according to the working assumptions, may have a non-negligible effect. Name the resulting vector ${\mathbf{U}}_{1}\phantom{\rule{0.16667em}{0ex}}({\mathbf{U}}_{2}^{(1)},{\mathbf{U}}_{2}^{(0)})$.

2. If any *β* (*α*) parameters are assumed zero (via assumption 3b), delete the predictors associated with these parameters in ${\mathbf{U}}_{2}^{(1)}$ (respectively ${\mathbf{U}}_{2}^{(0)}$); similarly delete the associated columns in ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$.

3. If some of the nuisance, *ψ*, parameters are assumed to be zero (via assumption 3a) then create **Z**_{3} from **U**_{1} by eliminating predictors in **U**_{1} with an assumed zero valued *ψ* parameter. Otherwise **Z**_{3} = **U**_{1}. At this point we can rewrite (9) as
$$E[Y\mid {A}_{1},R,{A}_{2}]={\mathbf{U}}_{1}^{T}{\phi}^{\u2033}+(R-p({\mathbf{U}}_{1})){\mathbf{Z}}_{3}^{T}{\psi}^{\u2033}+R{\mathbf{U}}_{2}^{{(1)}^{T}}{\beta}^{\u2033}+(1-R){\mathbf{U}}_{2}^{{(0)}^{T}}{\alpha}^{\u2033}$$
where the double primes on the regression coefficients indicate that each entry in *φ*″, *ψ*″, *β*″, *α*″ may be a sum of stage 1 effects, nuisance effects, stage 2 effects for responders and stage 2 effects for non-responders, respectively. Here the response rate, *E*[*R*|*A*_{1}], is written as *p*(**U**_{1}).

4. If assumption 3b is made (e.g. some stage 2 effects, that is, *β*, *α* parameters, are assumed to be zero) then the defining words have identified columns common to at least two of the three matrices ${\stackrel{\sim}{\mathbf{X}}}_{1}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$. If there are one or more columns common to all three matrices then for each common column delete an associated predictor from one of **U**_{1}, ${\mathbf{U}}_{2}^{(1)}$ or ${\mathbf{U}}_{2}^{(0)}$, resulting in **Z**_{1}, ${\mathbf{Z}}_{2}^{(1)}$ or ${\mathbf{Z}}_{2}^{(0)}$. The choice of which predictor to eliminate determines the aliasing as follows. If the omitted predictor is a stage 1 variable then the coefficient of the stage 2 predictor for responders (non-responders) in the regression (12) is an estimator of the sum of the stage 1 and stage 2 effects, *β*′ = *φ*″ + *β*″ (*α*′ = *φ*″ + *α*″, respectively). If the omitted predictor is a stage 2 variable for responders (non-responders) then the coefficient of the stage 1 predictor in the regression (12) is an estimator of the sum *φ*′ = *φ*″ + *β*″ (*φ*′ = *φ*″ + *α*″, respectively) and the coefficient of the stage 2 predictor for non-responders (responders) is an estimator of the difference between the two stage 2 effects, *α*′ = *α*″ − *β*″ (*β*′ = *β*″ − *α*″, respectively). We obtain
$$E[Y\mid {A}_{1},R,{A}_{2}]={\mathbf{Z}}_{1}^{T}{\phi}^{\prime}+(R-p({\mathbf{U}}_{1})){\mathbf{Z}}_{3}^{T}{\psi}^{\prime}+R{\mathbf{Z}}_{2}^{{(1)}^{T}}{\beta}^{\prime}+(1-R){\mathbf{Z}}_{2}^{{(0)}^{T}}{\alpha}^{\prime}.$$(12)

Using the working assumptions, the regression coefficients can be labeled as corresponding to particular stage 1 effects, nuisance effects, stage 2 effects for responders and stage 2 effects for non-responders.

Consider once again the example from Table 2 with the formal assumptions that *ψ*_{2} = *ψ*_{3} = *ψ*_{4} = 0 and *β*_{4} = *α*_{4} = 0 (see (11)). After step 3, we have **U**_{1} = {1, *A*_{11}, *A*_{12}, *A*_{11}*A*_{12}}, **Z**_{3} = {1},
${\mathbf{U}}_{2}^{(j)}=\{{A}_{2}^{(j)},{A}_{11}{A}_{2}^{(j)},{A}_{12}{A}_{2}^{(j)}\}$, *j* = 0, 1. In step 4 we select one of several analysis models (each corresponding to a different form of aliasing). Suppose we eliminate *A*_{11}*A*_{12} in **U**_{1} to form **Z**_{1} and eliminate
${A}_{11}{A}_{2}^{(1)},{A}_{12}{A}_{2}^{(1)}$ from
${\mathbf{U}}_{2}^{(1)}$ to form
${\mathbf{Z}}_{2}^{(1)}({\mathbf{Z}}_{2}^{(0)}={\mathbf{U}}_{2}^{(0)})$, then according to step 4 the conditional mean becomes

where ${\phi}_{1}^{\prime}={\phi}_{1},{\phi}_{2}^{\prime}={\phi}_{2}+{\beta}_{3},{\phi}_{3}^{\prime}={\phi}_{3}+{\beta}_{2},{\psi}_{1}^{\prime}={\psi}_{1},{\beta}_{1}^{\prime}={\beta}_{1}+{\phi}_{4},{\alpha}_{1}^{\prime}={\alpha}_{1}+{\phi}_{4},{\alpha}_{2}^{\prime}={\alpha}_{2}-{\beta}_{3}$ and ${\alpha}_{3}^{\prime}={\alpha}_{3}-{\beta}_{2}$ from (11).
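The aliasing map above (e.g. ${\phi}_{2}^{\prime}={\phi}_{2}+{\beta}_{3}$) comes from column identities in the fractional design; a minimal numerical check, assuming Table 2 sets the stage 2 level to *A*_{11}*A*_{12} per the defining word:

```python
from itertools import product

# The four rows of the 2^(3-1) design (assuming A2 = A11 * A12 as in Table 2;
# F2/G2 are stacked, so j = 0 and j = 1 share the same levels).
rows = [(a11, a12, a11 * a12) for a11, a12 in product((-1, 1), repeat=2)]

# Columns of the stage 1 matrix and the stage 2 matrix.
A11    = [r[0] for r in rows]
A12    = [r[1] for r in rows]
A11A12 = [r[0] * r[1] for r in rows]
A2     = [r[2] for r in rows]
A11A2  = [r[0] * r[2] for r in rows]
A12A2  = [r[1] * r[2] for r in rows]

# Column identities behind the aliasing in the worked example:
assert A12A2 == A11    # coefficient of A11 estimates phi2 + beta3
assert A11A2 == A12    # coefficient of A12 estimates phi3 + beta2
assert A2 == A11A12    # coefficient of R * A2 estimates beta1 + phi4
```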

The steps in the algorithm follow the steps in the proof of identifiability for Lemma 3; see the proof in Appendix A.1 for a detailed explanation. At this point the total number of entries in {**Z**_{1}, **Z**_{3},
${\mathbf{Z}}_{2}^{(1)},{\mathbf{Z}}_{2}^{(0)}$} is at most 2(2^{k−m}) and the total number of predictors in **U**_{1} is at most 2^{k−m}. To conduct the screening analysis we estimate *p*(**U**_{1}) for each value of **U**_{1} in the design by the proportion of responders assigned that value to obtain $\widehat{p}({\mathbf{U}}_{1})$ ($\widehat{p}({\mathbf{U}}_{1})$ can be formed from a saturated regression of *R* on **U**_{1}). The screening analysis is a regression of *Y* on $\left\{{\mathbf{Z}}_{1},(R-\widehat{p}({\mathbf{U}}_{1})){\mathbf{Z}}_{3},R{\mathbf{Z}}_{2}^{(1)},(1-R){\mathbf{Z}}_{2}^{(0)}\right\}$, thus obtaining ${\widehat{\phi}}^{\prime}$, ${\widehat{\psi}}^{\prime}$, ${\widehat{\beta}}^{\prime}$, ${\widehat{\alpha}}^{\prime}$. See Section 5 for further concrete examples.
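The two-step screening analysis (estimate $\widehat{p}({\mathbf{U}}_{1})$ by cell proportions, then least squares on the constructed regressors) can be sketched as follows. The data-generating values (effect sizes, response rate, sample size) are hypothetical, chosen only to illustrate the mechanics on the Table 2 design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from the 2^(3-1) design of Table 2 (A2 = A11 * A12);
# effect sizes below are illustrative, not from the paper.
n = 2000
a11 = rng.choice([-1, 1], size=n)
a12 = rng.choice([-1, 1], size=n)
a2 = a11 * a12                                # defining word 1 = A11*A12*A2
r = rng.binomial(1, 0.6, size=n)              # early response indicator
y = 1.0 + 0.5 * a11 + r * 0.3 * a2 + rng.normal(size=n)

# Step 1: estimate p(U1) by the proportion of responders in each stage 1 cell.
cells = list(zip(a11.tolist(), a12.tolist()))
phat = {c: r[[u == c for u in cells]].mean() for c in set(cells)}
p = np.array([phat[c] for c in cells])

# Step 2: regress Y on {Z1, (R - phat)Z3, R*Z2^(1), (1 - R)*Z2^(0)}.
Z1 = np.column_stack([np.ones(n), a11, a12])  # stage 1 predictors
Z3 = np.ones((n, 1))                          # nuisance predictor
Z21 = a2.reshape(-1, 1)                       # stage 2 predictor, responders
Z20 = a2.reshape(-1, 1)                       # stage 2 predictor, non-responders
X = np.column_stack([Z1, (r - p).reshape(-1, 1) * Z3,
                     r.reshape(-1, 1) * Z21, (1 - r).reshape(-1, 1) * Z20])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef = (phi1', phi2', phi3', psi1', beta1', alpha1'); beta1' estimates the
# (aliased) stage 2 main effect for responders, generated here as 0.3.
```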

##### Lemma 4

Assume that the variance of *Y* is finite and that *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1. Then as *N* → ∞, the multivariate distribution of {
$\sqrt{N}({\widehat{\phi}}^{\prime}-{\phi}^{\prime}),\sqrt{N}({\widehat{\beta}}^{\prime}-{\beta}^{\prime}),\sqrt{N}({\widehat{\alpha}}^{\prime}-{\alpha}^{\prime})$} converges to a multivariate normal distribution. Furthermore, if the model in (12) is saturated then the estimators ${\widehat{\phi}}^{\prime}$, ${\widehat{\beta}}^{\prime}$, ${\widehat{\alpha}}^{\prime}$ are semiparametric efficient.

The proof, and an asymptotically consistent estimator for the variance-covariance matrix, is provided in Appendix A.1. In general the model (12) will be saturated. Exceptions occur when more formal assumptions are made than is necessary (see example 3 in Section 5 for an illustration), or when, according to the design, there are one or more stage 2 factors (only for responders or only for non-responders) with levels completely crossed with the levels of all other factors (e.g. these stage 2 factors are randomized independently of the remaining factors and they are not stacked with other factors in the design). For completeness Appendix A.1 also supplies an estimating function yielding semiparametric efficient estimators for use in the case when (12) is not saturated. As was the case for Lemmas 2 and 3, if all responders are provided the same treatment then there are no stage 2 predictors for responders (no *β*′ parameters) and *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1 can be replaced by *P*[*E*[*R*|*A*_{1}] < 1] = 1 (similar statements can be made if all non-responders are provided the same treatment).

### 4.3 Sample Size Considerations

The primary goal is to assess the activity of all stage 1 and stage 2 main effects, that is, to test whether each main effect is zero. To choose the sample size *N*, we make the following rough approximations. We assume that our formal assumptions, if any, are correct and that the residual variance is equal across the 2^{k−m} rows in the design, say *σ*^{2}. If the stage 1 response rates were equal (say *p*) then the asymptotic variance-covariance matrix of the estimated effects would be a diagonal matrix with diagonal elements equal to ${\scriptstyle \frac{{\sigma}^{2}}{Np}}$ for the stage 2 effects for responders, ${\scriptstyle \frac{{\sigma}^{2}}{N(1-p)}}$ for the stage 2 effects for non-responders and ${\scriptstyle \frac{{\sigma}^{2}}{N}}$ for the stage 1 effects. We act as if this is roughly the case. If there are stage 2 factors for both responders and non-responders then the sample size is determined by the larger of ${\scriptstyle \frac{{\sigma}^{2}}{p}}$ and ${\scriptstyle \frac{{\sigma}^{2}}{(1-p)}}$. The sample size to detect a main effect of size Δ with power 1 − *β* using a two-sided test with Type one error rate *α* is
$$N=\frac{{({z}_{\alpha /2}+{z}_{\beta})}^{2}{\sigma}^{2}}{{\Delta}^{2}\phantom{\rule{0.16667em}{0ex}}min({p}_{min},1-{p}_{max})}$$
where *p*_{min}, *p*_{max} are the minimal and maximal response rates (across the 2^{k−m} rows) and *z*_{u} is the (1 − *u*)th standard normal percentile.
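The rough calculation above can be sketched in code. This assumes the approximation takes the form *N* = (*z*_{α/2} + *z*_{β})²*σ*²/(Δ² min(*p*_{min}, 1 − *p*_{max})), i.e. that the binding constraint is the worst-case responder rate *p*_{min} or non-responder rate 1 − *p*_{max}; the function name and interface are ours.

```python
from statistics import NormalDist

def sample_size(snr, p_min, p_max, power=0.9, alpha=0.1):
    """Rough total N for a two-sided level-alpha test of a main effect of
    size Delta = snr * sigma.  Sketch only: assumes
    N = (z_{alpha/2} + z_beta)^2 / (SNR^2 * min(p_min, 1 - p_max))."""
    z = NormalDist().inv_cdf
    numerator = (z(1 - alpha / 2) + z(power)) ** 2
    return numerator / (snr ** 2 * min(p_min, 1 - p_max))

# Simulation setting of Section 6: SNR = .25, response rates from .55 to .73.
N = sample_size(0.25, 0.55, 0.73)   # about 507 subjects before rounding up
```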

Suppose there are stage 2 components for both responders and non-responders. If the signal-to-noise ratio (*SNR* = Δ/*σ*) is very large then the recommended row size will be so small that the chance of a group (corresponding to a row in the design) occurring with no responders or no non-responders is too high. Given the response rate *p* and group size *n* = *N*/2^{k−m}, the chance that a group of this type will occur is 1 − (1 − *p*^{n} − (1 − *p*)^{n})^{2^{k−m}}. To reduce the chance of such groups we find the smallest *n* for which the above probability is smaller than a cutoff (in the simulations we use .01). To set the group size we use the maximum of the above number times 2^{k−m} and *N*. Note that if there are no stage 2 components for responders (respectively non-responders) then 1 − *p*^{n} − (1 − *p*)^{n} in the above formula may be replaced by 1 − *p*^{n} (respectively by 1 − (1 − *p*)^{n}).
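The group-size adjustment above is a one-line calculation plus a search; a minimal sketch (function names are ours):

```python
def prob_bad_group(p, n, n_groups):
    """Chance that at least one of n_groups groups of size n contains no
    responders or no non-responders, given response rate p: the formula
    1 - (1 - p^n - (1-p)^n)^(2^(k-m)) from the text."""
    return 1 - (1 - p ** n - (1 - p) ** n) ** n_groups

def smallest_group_size(p, n_groups, cutoff=0.01):
    # Smallest n for which the probability above falls below the cutoff.
    n = 1
    while prob_bad_group(p, n, n_groups) > cutoff:
        n += 1
    return n

# With 16 rows (e.g. a 2^(5-1) design) and response rate .6, each group
# needs at least 15 subjects to keep the probability below .01:
n_min = smallest_group_size(0.6, 16)   # 15
```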

## 5. Examples

In this section we illustrate several 2^{k−m} designs and associated working and formal assumptions. The examples below are 2^{k−m} generalizations of the clinical studies STAR*D (Fava et al., 2003) and ExTENd (Oslin, personal communication) in which the stage 2 factors vary by whether the patient exhibits an early response. Three different designs, associated with different formal assumptions, are illustrated.

Consider a study with four stage 1 factors: *S* (specialty or general practice clinic), *B* (adjunctive therapy to improve adherence: yes/no), *C* (counseling: intensive/less intensive) and *T* (level of staff training: high/low). All individuals are offered medication. There is one stage 2 factor for early non-responders *F*_{2} (switch or augment medication) and one stage 2 factor for early responders *G*_{2} (telephone disease management: yes/no). Suppose that the working assumptions are that the main effects of the factors and the two way interactions *CG*_{2}, *TG*_{2}, *BF*_{2}, *TF*_{2} and *BC* may be active, with the remaining effects assumed negligible. Recall, these working assumptions are used to guide the choice of the 2^{k−m} design, but are not necessarily assumed true in the screening analysis.

In Design 1, a 2^{4–1} design is constructed for the stage 1 factors and then the levels of the stage 2 factor are crossed with this design. In Design 2 the stage 2 factor is aliased with the four way interaction between the stage 1 factors. Both of these designs are 2^{5–1} designs. In Design 3 below we allow for an additional stage 2 factor which is applicable only to the non-responders. Here a 2^{6–2} design is discussed.

### Design 1

This design crosses a 2^{4–1} design with defining word *SBCT* = 1 with the stage 2 factor levels to produce a 2^{5–1} design; no formal assumptions are made. The defining word indicates that the stage 1 and stage 2 matrices ${\stackrel{\sim}{\mathbf{X}}}_{1}$ and
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ (or
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$) do not share columns and thus stage 2 effects are not aliased with nuisance effects (or stage 1 effects). As a result the aliasing follows directly from the defining word (e.g. the two way interaction *BC* is aliased with the two way interaction *ST*, and since *SBCTG*_{2} = *G*_{2}, the two way interaction *TG*_{2} is aliased with the interaction *SBCG*_{2}, and so on). Note that this design is consistent with the working assumptions in that all of the interesting effects are aliased with effects that are thought to be negligible.

The screening analysis uses the model in (12) with **Z**_{1} = [1, *S*, *B*, *C*, *T*, *BC*, *SB*, *SC*], **Z**_{3} = **Z**_{1},
${\mathbf{Z}}_{2}^{(1)}=[{G}_{2},S{G}_{2},B{G}_{2},C{G}_{2},T{G}_{2},SB{G}_{2},SC{G}_{2},BC{G}_{2}]$ and similarly
${\mathbf{Z}}_{2}^{(0)}=[{F}_{2},S{F}_{2},B{F}_{2},C{F}_{2},T{F}_{2},SB{F}_{2},SC{F}_{2},BC{F}_{2}]$. This is a saturated model. Recall that due to the nonorthogonality of the predictors, omitting predictors from this model is equivalent to making formal assumptions. An advantage of this design is that no negligibility assumptions on the nuisance parameters (*ψ*’s) need be made. However, this design aliases *BC* with *ST*; a better design would avoid aliasing two way interactions.
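Design 1's construction can be sketched mechanically: build the 2^{4–1} stage 1 fraction from the defining word, then cross it with the stacked stage 2 level. The encoding (setting *T* = *S*·*B*·*C*) is one of several equivalent ways to realize *SBCT* = 1.

```python
from itertools import product

# Sketch of Design 1: a 2^(4-1) in the stage 1 factors S, B, C, T with
# defining word SBCT = 1 (realized by setting T = S*B*C), crossed with the
# stacked stage 2 level (G2 for responders / F2 for non-responders).
stage1 = [(s, b, c, s * b * c) for s, b, c in product((-1, 1), repeat=3)]
design = [(s, b, c, t, a2) for (s, b, c, t) in stage1 for a2 in (-1, 1)]

assert len(design) == 2 ** (5 - 1)                          # 16 rows
assert all(s * b * c * t == 1 for s, b, c, t, _ in design)  # SBCT = 1
# Crossing makes the stage 2 column orthogonal to every stage 1 column, so
# stage 2 effects are not aliased with stage 1 or nuisance effects:
assert sum(a2 * s for s, b, c, t, a2 in design) == 0
assert sum(a2 * s * b for s, b, c, t, a2 in design) == 0
```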

### Design 2

Suppose that we are willing to make the formal assumption that there are no three way or higher order stage 2 interactions (*α* and *β* regression coefficients of interactions between a stage 2 factor and two or more stage 1 factors are 0), and that there are no four way or higher nuisance interactions involving *R* and stage 1 factors (*ψ* regression coefficients of interactions between *R* and three or more stage 1 factors are 0). Consider a 2^{5–1} design with defining word *SBCTF*_{2} = 1 with *F*_{2} and *G*_{2} stacked (so another expression for the defining word is *SBCTG*_{2} = 1). The design is provided in Table 3. The defining words indicate that under the formal assumptions, none of the stage 1 main effects or two-way interactions are aliased. Potentially active stage 2 effects, such as *CG*_{2}, can also be estimated since the nuisance effects associated with the same column in the stage 1 matrix ${\stackrel{\sim}{\mathbf{X}}}_{1}$ (here the four way interaction between *R* and *SBT*) are negligible. And potentially active nuisance effects such as the three way interaction between *R* and *SB* can be estimated since these nuisance effects are associated with the same column in the stage 2 matrices
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ (or
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$) as negligible stage 2 effects (here *CTG*_{2} and *CTF*_{2}).

To screen the factors using data from Design 2, we use the model in (12) with **Z**_{1} = [1, *S*, *B*, *C*, *T*, *BC*, *SB*, *SC*, *ST*, *BT*, *CT*], **Z**_{3} = **Z**_{1},
${\mathbf{Z}}_{2}^{(1)}=[{G}_{2},S{G}_{2},B{G}_{2},C{G}_{2},T{G}_{2}]$ and similarly
${\mathbf{Z}}_{2}^{(0)}=[{F}_{2},S{F}_{2},B{F}_{2},C{F}_{2},T{F}_{2}]$. This is a saturated model. Note that the defining word indicates the aliasing; for example the *β*′ coefficient of *G*_{2} is actually estimating the sum of the main effect of *G*_{2} and the four-way interaction *SBCT*.

### Design 3

In order to illustrate some of the more subtle considerations, consider the inclusion of an additional stage 2 factor *H*_{2} (additional behavioral contingency to improve long term medication adherence) that is only assigned to non-responders (*R* = 0). Further suppose the formal assumptions are that there are no three way or higher nuisance interactions involving *R* and stage 1 factors and that three way and higher stage 2 causal effects are negligible. The working assumptions are that all effects except the main effects of the factors and the two way interactions *TG*_{2}, *TF*_{2}, *CG*_{2}, *BF*_{2}, *F*_{2}*H*_{2} and *BC* are expected to be negligible. Consider the 2^{6–2} design, with defining words *SCTF*_{2} = 1 and *BTF*_{2}*H*_{2} = 1 and *F*_{2} and *G*_{2} stacked (so we could express the above with a *G*_{2} instead of *F*_{2}; note the product of these two defining words yields 1 = *SBCH*_{2}.)

In contrast to designs 1 and 2, not only the experimental design but also the choice of the regression model determines the aliasing (we use step 4 of the algorithm in Section 4.2 to determine the aliasing). Indeed we have a choice of several regression models that can be used to screen the factors each corresponding to different aliasing. Using the steps in Section 4.2 we see that one possibility is to use the model in (12) with **Z**_{1} = [1, *S, B, C, T, SB, BC, BT, SBC, SBT, BCT*], **Z**_{3} = [1, *S, B, C, T*],
${\mathbf{Z}}_{2}^{(1)}=[{G}_{2},S{G}_{2},B{G}_{2},C{G}_{2},T{G}_{2}]$ and
${\mathbf{Z}}_{2}^{(0)}=[{F}_{2},S{F}_{2},B{F}_{2},C{F}_{2},T{F}_{2},{H}_{2},S{H}_{2},C{H}_{2},{F}_{2}{H}_{2}]$. Using the defining words, we can deduce the aliasing. For example since the defining words indicate that *G*_{2} = *F*_{2} = *SCT* and *SCT* has been omitted from **X**_{1}, we have that the *β*′ coefficient of *G*_{2} is the sum of the main effect of *G*_{2} and the three way interaction *SCT*; similarly the *α*′ coefficient of *F*_{2} is the sum of the main effect of *F*_{2} and the three way interaction *SCT*. Note that the defining words indicate that *BF*_{2} = *BG*_{2} = *TH*_{2} = *SBCT* and we included only *BF*_{2}, *BG*_{2} in the regression thus the *α*′ coefficient of *BF*_{2} is the sum of the two way interactions *BF*_{2}, *TH*_{2} and the four way interaction *SBCT* whereas the *β*′ coefficient of *BG*_{2} is estimating the sum of the two way interaction *BG*_{2} and the four way interaction *SBCT*. In general, given the formal assumptions, the defining words along with the choice of **Z**_{1},
${\mathbf{Z}}_{2}^{(1)}$ and
${\mathbf{Z}}_{2}^{(0)}$ result in the aliasing:
${\beta}_{{G}_{2}}^{\prime}={\beta}_{{G}_{2}}+{\phi}_{\mathit{SCT}},{\beta}_{S{G}_{2}}^{\prime}={\beta}_{S{G}_{2}}+{\phi}_{CT},{\beta}_{B{G}_{2}}^{\prime}={\beta}_{B{G}_{2}}+{\phi}_{\mathit{SBCT}},{\beta}_{C{G}_{2}}^{\prime}={\beta}_{C{G}_{2}}+{\phi}_{ST},{\beta}_{T{G}_{2}}^{\prime}={\beta}_{T{G}_{2}}+{\phi}_{SC},{\alpha}_{{F}_{2}}^{\prime}={\alpha}_{{F}_{2}}+{\phi}_{\mathit{SCT}},{\alpha}_{S{F}_{2}}^{\prime}={\alpha}_{S{F}_{2}}+{\phi}_{CT},{\alpha}_{B{F}_{2}}^{\prime}={\alpha}_{B{F}_{2}}+{\alpha}_{T{H}_{2}}+{\phi}_{\mathit{SBCT}},{\alpha}_{C{F}_{2}}^{\prime}={\alpha}_{C{F}_{2}}+{\phi}_{ST},{\alpha}_{T{F}_{2}}^{\prime}={\alpha}_{T{F}_{2}}+{\alpha}_{B{H}_{2}}+{\phi}_{SC}$ (for clarity the associated predictor is given as the subscript on the parameter). The remaining parameters are unaliased (e.g. the (*φ*′, *β*′, *α*′) parameters are equal to the corresponding (*φ*, *β*, *α*) parameters).

Another possible screening analysis would use (12) with **Z**_{1} = [1, *S*, *B*, *C*, *T*, *SB*, *BC,BT, CT, SBC, SBT, BCT*] and
${\mathbf{Z}}_{2}^{(0)}=[{F}_{2},B{F}_{2},C{F}_{2},T{F}_{2},{H}_{2},S{H}_{2},C{H}_{2}]$ but leave **Z**_{3} and
${\mathbf{Z}}_{2}^{(1)}$ as is; that is the predictor *CT* is added to **Z**_{1} and the predictor *SF*_{2} is removed from
${\mathbf{Z}}_{2}^{(0)}$. In this case the aliasing is the same as before except
${\phi}_{CT}^{\prime}={\phi}_{CT}-{\alpha}_{S{F}_{2}},{\beta}_{S{G}_{2}}^{\prime}={\beta}_{S{G}_{2}}-{\alpha}_{S{F}_{2}}$ (there is no longer an
${\alpha}_{S{F}_{2}}^{\prime}$ parameter).

Both of the two screening analysis models discussed above have 30 parameters and hence are not saturated models. If desired we could fit a saturated model by adding two nuisance interactions to **Z**_{3}, *SBT* and *BCT*. These two nuisance interactions were needlessly assumed to be negligible in the formal assumptions. For example, from the defining words, note that the column associated with *SBT* is the same as the column associated with the three way stage 2 interactions *SF*_{2}*H*_{2}, *TCH*_{2} and *BCF*_{2} all of which were assumed to be negligible. That is we made both assumption 3a and 3b in Lemma 3 instead of one or the other. Estimation of these two effects acts as a check on the formal assumptions since under the formal assumptions these two nuisance effects are zero.
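The defining-word bookkeeping for Design 3 can be checked mechanically. This sketch realizes the 2^{6–2} design by setting *F*_{2} = *S*·*C*·*T* and *H*_{2} = *B*·*T*·*F*_{2} (one of several equivalent encodings of the two defining words).

```python
from itertools import product

# Sketch of Design 3: choose S, B, C, T freely; set F2 = S*C*T (so SCTF2 = 1)
# and H2 = B*T*F2 (so BTF2H2 = 1).  G2 is stacked with F2 and shares its column.
design = []
for s, b, c, t in product((-1, 1), repeat=4):
    f2 = s * c * t
    h2 = b * t * f2
    design.append((s, b, c, t, f2, h2))

assert len(design) == 2 ** (6 - 2)                               # 16 rows
assert all(s * c * t * f2 == 1 for s, b, c, t, f2, h2 in design)  # SCTF2 = 1
assert all(b * t * f2 * h2 == 1 for s, b, c, t, f2, h2 in design) # BTF2H2 = 1
# The product of the two defining words gives the third: 1 = SBCH2.
assert all(s * b * c * h2 == 1 for s, b, c, t, f2, h2 in design)
# An alias chain used in the text: BF2 = TH2 = SBCT in every row.
assert all(b * f2 == t * h2 == s * b * c * t for s, b, c, t, f2, h2 in design)
```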

## 6. Simulation Results

Extensive simulations were conducted to evaluate the proposed analysis, examine the impact of violations of the formal assumptions and examine the impact of rows with no responders (or no non-responders) on the analysis. We are particularly interested in mental health settings such as the treatment of major depression and substance abuse in which response (absence of symptoms) rates to initial treatment are around 50 to 70%; as a result the simulations below use initial response rates in this range. Note that in actual practice, when response (non-response) rates are low, usually only factors for non-responders (respectively, responders) are investigated.

The findings were as follows. First when the formal assumptions hold, the standard errors performed well and the Type 1 error is as planned. In general the sample size calculations depend on the group with the smallest proportion of non-responders (or responders). As a result the sample size calculations are conservative and the power to detect active stage 1 main effects and stage 2 main effects is higher than the nominal value. Simulations with both normal and non-normal error distributions such as a t-distribution with 3 degrees of freedom were also conducted; this did not substantially alter the results. When the formal assumptions are violated (e.g. effects assumed negligible in the formal assumptions are not negligible) there is a surprising degree of robustness; that is, the bias is of smaller magnitude than would be expected. Also when one or more rows contain only responders, fitting an analysis model that omitted some active effects led to relatively robust estimators of the remaining effects. This robustness is discussed at greater length following simulation 2. We present three simulations that exemplify the above findings.

The simulations below use Design 2 (see Table 3). In this example the formal assumptions are that there are no three way or higher order stage 2 interactions and there are no four way or higher order nuisance interactions involving *R* and stage 1 factors. In the simulations provided below, *R* is generated using a response rate given by *logit*(*E*[*R*|*A*_{1}]) = .6 + .1*S* + .1*B* + .1*C* + .1*T*. This results in response rates varying from .55 to .73 across the 16 rows; this wide range of early response rates is extreme for the mental health/substance abuse fields, however it allows us to illustrate the issues. *Y* is normally distributed with residual variance, *σ*^{2}, set to 1. We present results with the signal-to-noise ratio (SNR = Δ/*σ*) equal to .25 or .35 units per standard deviation. In the simulations, active main effects are equal to the SNR × *σ*, active two way interactions are equal to SNR × .5 × *σ* and active three way interactions are set equal to SNR × .25 × *σ*. These settings are consistent with the Hierarchical Ordering Principle (Wu and Hamada, 2000, pg. 112) which states that the lower order effects are more likely to be important than higher order effects and effects of the same order are equally likely to be important (this principle is used in the absence of domain knowledge indicating otherwise). Throughout all main effects and the interactions *SB*, *SC*, *SG*_{2}, *TG*_{2} and *TF*_{2} are active; thus the working assumptions are incorrect (the working assumptions are that only the main effects of the factors and the two way interactions *TG*_{2}, *TF*_{2}, *CG*_{2}, *BF*_{2} and *BC* are likely to be active).
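The generative model just described can be sketched as follows. The response model matches the stated logit; the outcome mean is illustrative (main effects only, plus one stage 2 main effect for each response status), not the exact simulation settings of Tables 4 to 6.

```python
import math, random

random.seed(1)

def simulate_subject():
    # Stage 1 factors coded +/-1; the stacked stage 2 level follows Design 2's
    # defining word 1 = SBCTG2 (= SBCTF2), so the level is S*B*C*T.
    s, b, c, t = (random.choice((-1, 1)) for _ in range(4))
    a2 = s * b * c * t          # G2 for responders, F2 for non-responders
    # Response model from the text: logit E[R|A1] = .6 + .1(S + B + C + T),
    # giving cell response rates from about .55 to .73.
    logit = 0.6 + 0.1 * (s + b + c + t)
    r = 1 if random.random() < 1.0 / (1.0 + math.exp(-logit)) else 0
    # Illustrative mean: active main effects of size SNR * sigma with
    # SNR = .25 and sigma = 1; interactions omitted for brevity.
    snr = 0.25
    mean = snr * (s + b + c + t + r * a2)
    return (s, b, c, t, r, a2, mean + random.gauss(0, 1))

data = [simulate_subject() for _ in range(1000)]
resp_rate = sum(d[4] for d in data) / len(data)   # overall, roughly .64
```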

We used the formula in Section 4.3 to determine the sample size required to detect a given SNR at 90% power with a 10% Type 1 error rate for *p*_{min} = .55 and *p*_{max} = .73. Following the recommendations in Collins et al. (2007), interesting effects are tested using a Type 1 error rate of .1 (no correction for multiple tests) and the remaining effects are tested using an overall Type 1 error rate of .1 (using a Bonferroni correction). In all simulations we estimate the variance-covariance matrix of the effects using the formula given in Appendix A.1 (near the beginning of the proof of Lemma 4). Table 4 provides the results from 1000 data sets; the formal assumptions hold for these data sets.

Note that there are 6 effects that according to the working model should be negligible and are actually negligible. Given an overall error rate of .1, the empirical error rate should be around .011 and this is the case, as can be seen from the rows labeled *ST*, *BT*, *CT*, *BG*_{2}, *SF*_{2} and *CF*_{2}. Similarly the empirical Type 1 error rate for the interesting effects should be around .1 and this can be seen from the rows labeled *BC*, *CG*_{2} and *BF*_{2}. The high power in detecting main effects is due to the conservatism used in selecting the group size. Recall that the response rates per group range from a low of .55 to a high of .73. The group sizes are chosen to achieve power .9 to detect stage 2 effects for non-responders when the non-response rate is .27. Since only one of the groups exhibits this low non-response rate, the group sizes are conservative, resulting in higher power. In simulations with a constant response rate across all 16 groups (not shown here) the power to detect a main effect for stage 2 non-responders varied from around .88 to .94 depending on the simulation.

In the next simulation, also of 1000 simulated data sets, the formal assumptions are violated; there are active three way stage 2 effects for responders. The results in Table 5 are surprisingly good; the results in this table would be expected *if* the interactions involving *R* were zero (no nuisance interactions). To see this, recall the defining words for this design are 1 = *SBCTG*_{2} (equivalently 1 = *SBCTF*_{2}). Thus, for example, the column in the stage 2 matrix
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$ associated with *CTF*_{2} is the same as the column associated with *SB* in ${\stackrel{\sim}{\mathbf{X}}}_{1}$. If the coefficient of the interaction (*R* − *E*[*R*|*A*_{1}])*SB* were zero (it is not) then we would expect that estimators of the stage 1 interactions such as *SB* will estimate this stage 1 interaction plus the product of the response rate times the effect of *CTG*_{2} plus the non-response rate times the effect of *CTF*_{2}. In this case, this is .175 + response rate × (.0875) + non-response rate × (0). If we use the average response rate (here .64), this yields .231 which is close to the average estimated effect for *SB*. Similar statements can be made for the remaining stage 1 two way interactions. This robustness is explained by the fact that the response rates vary only from .55 to .73 across the groups. As discussed in Section 4, the less the response rates vary, the closer the predictors are to being orthogonal. Thus the estimators of the effects are approximately uncorrelated (the off diagonal elements of the correlation matrix are small; in this simulation the maximum correlation in absolute value is .12 and the average of the absolute value of the correlations is .03). Thus as long as the response rates do not vary greatly (as is the case in many areas of mental health and substance abuse) we can expect this robustness. Similar results hold when the formal assumptions concerning the nuisance effects are violated (not shown here).

This simulation also, by chance, did not result in any samples with one or more groups containing only responders. However this can happen, and did happen in other simulations (not shown). Indeed, when SNR = .35 the group sizes had to be adjusted upwards to ensure that the probability that all groups have some non-responders is not too low ("low" is arbitrarily chosen to be .01; see Section 4.3). Even with this adjustment, a simulation of 1000 samples will sometimes include samples in which groups with no non-responders occur. If in a given data set some groups (e.g. corresponding to rows in the design) contain no non-responders, we fit a model using the working assumptions, that is, **Z**_{1} = [1, *S*, *B*, *C*, *T*, *BC*], **Z**_{3} = **Z**_{1},
${\mathbf{Z}}_{2}^{(1)}=[{F}_{2},B{F}_{2},T{F}_{2}]$ and
${\mathbf{Z}}_{2}^{(0)}=[{G}_{2},C{G}_{2},T{G}_{2}]$.

The working assumptions are incorrect; thus the omission of active effects biases the estimated effects. To examine the degree of bias we conducted a simulation in which the groups corresponding to the rows of the design were sufficiently small that one or more groups with all responders was likely; in effect we simulated many samples, saving only those that had one or more groups with all responders. Table 6 reports the results for 729 such samples.

As with simulation 2, this simulation demonstrates the robustness of the analysis method to the omission of active effects. Note that both the bias of the estimators and the quality of the standard errors are poorest for the stage 2 effects for non-responders. This is not surprising, as many rows will have few non-responders.

## 7. Discussion

An attractive alternative to the use of randomized clinical trials in constructing dynamic treatment regimes is the use of observational data in which the components are not randomized. Here the variation in treatment is usually due to patient/clinician preference, patient adherence, component availability, etc. Observational data have the great advantage that they often already exist and/or are less expensive to obtain than data from the screening designs discussed here. However, observational data suffer from two substantial drawbacks. First, inferences regarding causal relations in the absence of randomization require untestable assumptions (Rubin, 1978; Robins, 1992) concerning why individuals receive, or are offered, differing treatments. Second, even if we are comfortable with the required assumptions, then because we do not control the treatment levels in the observational study, we have uncontrolled aliasing. That is, when our modeling assumptions do not hold, the form of the aliasing may be complex and difficult to ascertain. In contrast, the screening designs considered here not only improve our ability to make causal inferences but also provide a greater understanding of the aliasing that occurs due to incorrect modeling assumptions. However, there are many open problems. Four such problems follow.

First, there is the question of how to construct optimal experimental designs. A first thought might be to utilize the maximum resolution criterion or its refinement, the minimum aberration criterion (see Wu and Hamada, 2000). Note that the design in example 1 has only resolution IV whereas the design in example 2 has higher resolution (V). It is not necessarily true that the example 2 design is to be preferred, since it requires formal assumptions whereas the example 1 design does not. More work is required to appropriately incorporate the role of the formal assumptions into methods for comparing these designs.

Second, as discussed in Step 4 of the Algorithm in Section 4.2, some designs permit multiple analysis models; because of the non-orthogonality of the predictors, different models result in tests with different powers. A strategy for choosing among these analysis models is needed.

Third, designs with a smaller number of groups (e.g. treatment combinations) might be considered. To do so, we would begin by eliminating negligible effects from a potential model via formal and working assumptions. Next, we could attempt to find, say, a D-optimal design (e.g., see Wu and Hamada, 2000) that would permit estimation of the remaining effects. This would yield designs that have fewer than 2^{k−m} groups. However, such designs rarely have the nice aliasing structure found with 2^{k−m} factorial designs. Furthermore, the designs proposed here frequently permit us to assess the working assumptions, unlike most D-optimal designs (see design 3 in Section 5). We view this as an important advantage.

Lastly, this paper has not discussed potential secondary analyses that might be used with data from a 2^{k−m} design. Such analyses might consider how best to utilize time-varying patient covariates in the construction of the decision rules. The methods of Murphy (2003) and Robins (2004) require generalization, as these methods require randomization or, at a minimum, stochastic variation in assigned stage 2 factors.

## Acknowledgments

We acknowledge support for this project from National Institutes of Health grants R01 MH080015 and P50 DA10075.

## Appendix

#### A.1 Lemmas

##### Proof of Lemma 3

First assume there are factors for both responders and non-responders. We have, for *R* = *r* ∈ {0, 1},

$E[\mathbf{Y}\mid R=r]={\stackrel{\sim}{\mathbf{X}}}_{1}\phi +(r\mathbf{I}-{\mathbf{D}}_{\mathbf{p}}){\stackrel{\sim}{\mathbf{X}}}_{1}\psi +r{\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}\beta +(1-r){\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\alpha $

where the defining words specify the common columns in ${\stackrel{\sim}{\mathbf{X}}}_{1}$,
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ and
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$ (${\mathbf{D}}_{\mathbf{p}}$ is a diagonal matrix with **p** on the diagonal). Note that the unique columns in {${\stackrel{\sim}{\mathbf{X}}}_{1}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$} and in {${\stackrel{\sim}{\mathbf{X}}}_{1}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$} each number at most 2^{k−m}. We group terms in the above display so that only unique columns remain. First, if a column occurs *k* times in ${\stackrel{\sim}{\mathbf{X}}}_{1}$, delete *k* − 1 of the copies and sum all terms in *ψ* and *φ* associated with that column. Each sum becomes the coefficient of the remaining column; all of the remaining entries in *ψ* and *φ* stay the same. This process is repeated until all columns are unique, resulting in a matrix ${\stackrel{\sim}{\mathbf{Z}}}_{1}$ and two vectors of coefficients *ψ*′, *φ*′ for which ${\stackrel{\sim}{\mathbf{X}}}_{1}\psi ={\stackrel{\sim}{\mathbf{Z}}}_{1}{\psi}^{\prime}$ and ${\stackrel{\sim}{\mathbf{X}}}_{1}\phi ={\stackrel{\sim}{\mathbf{Z}}}_{1}{\phi}^{\prime}$. We follow a similar procedure for the two stage 2 matrices, resulting in ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}\beta ={\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}{\beta}^{\prime}$ and ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\alpha ={\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}{\alpha}^{\prime}$.

Next, if assumption 3a was used then some of the entries in *ψ*′ are known to be zero. Delete the associated columns from ${\stackrel{\sim}{\mathbf{Z}}}_{1}$ to form ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ and delete these entries from *ψ*′. If assumption 3b was made then follow the same procedure for
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ and *β*′ and similarly for
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$ and *α*′.

At this point there are no duplicate columns in ${\stackrel{\sim}{\mathbf{Z}}}_{1}$, ${\stackrel{\sim}{\mathbf{Z}}}_{3}$,
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ or
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. Furthermore there is no column common to both ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ and
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ and no column common to both ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ and
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. This is because such a column would have been assumed to have either a *ψ* coefficient or *α*, *β* coefficients equal to zero by assumptions 3a), 3b) respectively. There may, however, be one or more columns common to all three of the matrices ${\stackrel{\sim}{\mathbf{Z}}}_{1}$,
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$; the above discussion implies such columns are not in ${\stackrel{\sim}{\mathbf{Z}}}_{3}$. To obtain regression coefficients that are identifiable, we need to eliminate each common column from one of these three matrices. To see this, consider one such common column **z**, which is, say, the *i*th column of ${\stackrel{\sim}{\mathbf{Z}}}_{1}$, the *j*th column of
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ and the *k*th column of
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. This column contributes
$\mathbf{z}{\phi}_{i}^{\prime}+R\mathbf{z}{\beta}_{j}^{\prime}+(1-R)\mathbf{z}{\alpha}_{k}^{\prime}$ to *E*[**Y**|*R*]. We can write this sum in three ways, corresponding to how we delete this column from one of the three matrices.

If we delete **z** from ${\stackrel{\sim}{\mathbf{Z}}}_{1}$ then we will be able to identify the sum of the associated stage 1 and stage 2 effects; if we delete **z** from
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ then we will be able to identify the sum of the associated stage 1 effect and the stage 2 effect for responders, and also the difference between the stage 2 effect for non-responders and the stage 2 effect for responders; if we delete **z** from
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$ then we will be able to identify the sum of the associated stage 1 effect and the stage 2 effect for non-responders, and also the difference between the stage 2 effect for responders and the stage 2 effect for non-responders.

Note the assumption that *P*[*E*[*R*|*A*_{1}] > 0] = 1 (*P*[*E*[*R*|*A*_{1}] < 1] = 1) implies that the 2^{k−m} × 1 vector *E*[**Y**|*R* = 1] (respectively, *E*[**Y**|*R* = 0]) is identifiable. Furthermore, the vector of response rates **p** is identifiable. We have

$E[\mathbf{Y}\mid R=1]=(\mathbf{I}-{\mathbf{D}}_{\mathbf{p}}){\stackrel{\sim}{\mathbf{Z}}}_{3}{\psi}^{\prime}+{\stackrel{\sim}{\mathbf{Z}}}_{1}{\phi}^{\prime}+{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}{\beta}^{\prime},\phantom{\rule{2em}{0ex}}E[\mathbf{Y}\mid R=0]=-{\mathbf{D}}_{\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}{\psi}^{\prime}+{\stackrel{\sim}{\mathbf{Z}}}_{1}{\phi}^{\prime}+{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}{\alpha}^{\prime}.$

Multiply all terms in both of these equations by
${\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}$ and subtract the lower equation from the upper. Note that ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ does not share a column with
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(i)}$, so
${\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(i)}=0$ for *i* = 0, 1, and that
${\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}{\stackrel{\sim}{\mathbf{Z}}}_{3}$ is the identity matrix. We obtain
${\psi}^{\prime}={\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}(E[\mathbf{Y}\mid R=1]-E[\mathbf{Y}\mid R=0])$. So *ψ*′ is identifiable. A little more matrix algebra yields identifiability of the remaining coefficients.

Identifiability of *φ*′, *α*′ and *β*′ follows from the fact that there are no columns common to all three of the matrices.

If there are no factors for responders (non-responders), remove ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}\beta $ (respectively ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\alpha $) from (13). Then follow the above arguments.
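As a toy numerical illustration of the key identity ${\psi}^{\prime}={\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}(E[\mathbf{Y}\mid R=1]-E[\mathbf{Y}\mid R=0])$, the sketch below is ours: the matrices and coefficient values are arbitrary stand-ins chosen only to satisfy the orthogonality conditions ${\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}{\stackrel{\sim}{\mathbf{Z}}}_{3}=\mathbf{I}$ and ${\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(i)}=0$ used in the proof, and the conditional means are built from a stage 1 term, a centered-response term, and the stage 2 terms for responders and non-responders:

```python
import numpy as np

# Orthonormal columns of a scaled 4x4 Hadamard matrix; Z3^T Z3 = I and the
# Z3 columns are orthogonal to the Z2 column, as the proof requires.
H = np.array([[1,  1,  1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1, -1, -1,  1]]) / 2.0
Z3, Z1 = H[:, 1:3], H[:, 0:2]          # stand-in design matrices
Z2_1 = Z2_0 = H[:, 3:4]
p = np.array([0.55, 0.60, 0.65, 0.73])  # group response rates
psi, phi = np.array([0.3, -0.2]), np.array([0.1, 0.4])
beta, alpha = np.array([0.25]), np.array([-0.15])

Dp = np.diag(p)
Ey1 = (np.eye(4) - Dp) @ Z3 @ psi + Z1 @ phi + Z2_1 @ beta   # E[Y | R = 1]
Ey0 = -Dp @ Z3 @ psi + Z1 @ phi + Z2_0 @ alpha               # E[Y | R = 0]

# The stage 1 terms and the Dp terms cancel in the difference, and the Z2
# terms vanish after multiplying by Z3^T, leaving exactly psi.
recovered = Z3.T @ (Ey1 - Ey0)
print(np.allclose(recovered, psi))      # True
```

The cancellation in the difference is exactly the step in the proof: the *φ*′ terms drop out, the two (*r* − *p*) weights sum to one, and orthogonality removes the stage 2 terms.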

##### Proof of Lemma 4

We assume there are factors for both responders and non-responders; similar arguments can be used when there are factors for only one or the other. Recall a 2^{k−m} design is used; that is, {*A*_{1}, *A*_{2}} have a discrete uniform distribution over the rows of the 2^{k−m} design. Recall that $\widehat{p}({\mathbf{U}}_{1})$ denotes the observed response proportion for each unique value of **U**_{1}. Let *η* be the parameter in the Bernoulli distribution of *R* given *A*_{1}, parameterized as $p({\mathbf{U}}_{1})={\mathbf{U}}_{1}^{T}\eta $. Let *γ*′ = [(*ψ*′)^{T}, (*φ*′)^{T}, (*β*′)^{T}, (*α*′)^{T}]^{T} (placing *ψ*′ first is for convenience in the invertibility proof below). Note that {**Z**_{3}, **Z**_{1}, ${\mathbf{Z}}_{2}^{(1)},{\mathbf{Z}}_{2}^{(0)}$} in (12) are simply functions of *A*_{1}, *A*_{2}; **Z**_{1} and **Z**_{3} are subvectors of **U**_{1}.

Define
$\mathbf{Z}(\eta )=\left[(R-{\mathbf{U}}_{1}^{T}\eta ){\mathbf{Z}}_{3}^{T},{\mathbf{Z}}_{1}^{T},R{\mathbf{Z}}_{2}^{{(1)}^{T}},(1-R){\mathbf{Z}}_{2}^{{(0)}^{T}}\right]$. Also let **E**_{N} denote expectation with respect to the empirical distribution of the data. Then solve

${\mathbf{E}}_{N}\left[\mathbf{Z}(\eta )(Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime})\right]=0,\phantom{\rule{2em}{0ex}}{\mathbf{E}}_{N}\left[{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\eta )\right]=0$

for *γ*′, *η* to obtain ${\widehat{\gamma}}^{\prime}$, $\widehat{\eta}$. The second equation is the normal equation for a regression of *R* on **U**_{1}. Recall the model used for the conditional distribution of *R* is saturated, so weighting by the variance of *R* would not alter the results; in fact the value of
${u}_{1}^{T}\widehat{\eta}$ will be equal to the observed combined response rate for the rows consistent with *u*_{1}.

The asymptotic arguments are standard exercises, so we provide only an outline. First, standard arguments are sufficient to show convergence in probability of ${\widehat{\gamma}}^{\prime}$ to *γ*′ and of $\widehat{\eta}$ to *η*. Furthermore define ${\mathrm{\sum}}_{\gamma}=E[\mathbf{Z}(\eta )\mathbf{Z}{(\eta )}^{T}]$, ${\mathrm{\sum}}_{\gamma ,\eta}=E\phantom{\rule{0.16667em}{0ex}}[\mathbf{Z}(\eta )\hspace{0.17em}({\mathbf{Z}}_{3}^{T}{\psi}^{\prime})\phantom{\rule{0.16667em}{0ex}}{\mathbf{U}}_{1}^{T}]$ and ${\mathrm{\sum}}_{\eta ,\eta}=E\phantom{\rule{0.16667em}{0ex}}[{\mathbf{U}}_{1}{\mathbf{U}}_{1}^{T}]$, with corresponding estimators ${\widehat{\mathrm{\sum}}}_{\gamma}={\mathbf{E}}_{N}[\mathbf{Z}(\widehat{\eta})\mathbf{Z}{(\widehat{\eta})}^{T}]$, ${\widehat{\mathrm{\sum}}}_{\gamma ,\eta}={\mathbf{E}}_{N}\left[\mathbf{Z}(\widehat{\eta})\phantom{\rule{0.16667em}{0ex}}({\mathbf{Z}}_{3}^{T}\widehat{\psi}){\mathbf{U}}_{1}^{T}\right]$ and ${\widehat{\mathrm{\sum}}}_{\eta ,\eta}={\mathbf{E}}_{N}\phantom{\rule{0.16667em}{0ex}}[{\mathbf{U}}_{1}{\mathbf{U}}_{1}^{T}]$. Note that Σ_{η,η} is the identity matrix due to the orthogonality of the design (${\widehat{\mathrm{\sum}}}_{\eta ,\eta}$ may not be the identity since, in finite samples, the group sizes may be unequal). It is easily shown that ${\widehat{\mathrm{\sum}}}_{\gamma}$, ${\widehat{\mathrm{\sum}}}_{\gamma ,\eta}$ and ${\widehat{\mathrm{\sum}}}_{\eta ,\eta}$ converge in probability (as the sample size *N* → ∞) to Σ_{γ}, Σ_{γ,η} and Σ_{η,η}.

Below we show that Σ_{γ} is invertible. Again, standard arguments can be used to show that

$\sqrt{N}({\widehat{\gamma}}^{\prime}-{\gamma}^{\prime})={\mathrm{\sum}}_{\gamma}^{-1}\frac{1}{\sqrt{N}}\sum _{i=1}^{N}\ell ({\gamma}^{\prime},\eta ,{\mathrm{\sum}}_{\gamma ,\eta},{\mathrm{\sum}}_{\eta ,\eta})+{o}_{P}(1)$

where $\ell ({\gamma}^{\prime},\eta ,{\mathrm{\sum}}_{\gamma ,\eta},{\mathrm{\sum}}_{\eta ,\eta})=\mathbf{Z}(\eta )\phantom{\rule{0.16667em}{0ex}}(Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime})-{\mathrm{\sum}}_{\gamma ,\eta}{\mathrm{\sum}}_{\eta ,\eta}^{-1}{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\eta )$, evaluated at observation *i* in the sum. Thus the asymptotic variance-covariance matrix of $\sqrt{N}({\widehat{\gamma}}^{\prime}-{\gamma}^{\prime})$ is

${\mathrm{\sum}}_{\gamma}^{-1}E[\ell \phantom{\rule{0.16667em}{0ex}}{\ell}^{T}]\phantom{\rule{0.16667em}{0ex}}{\mathrm{\sum}}_{\gamma}^{-1}.$

A consistent estimator of the asymptotic variance-covariance matrix is provided by replacing Σ_{γ} by ${\widehat{\mathrm{\sum}}}_{\gamma}$, Σ_{γ,η} by ${\widehat{\mathrm{\sum}}}_{\gamma ,\eta}$, Σ_{η,η} by ${\widehat{\mathrm{\sum}}}_{\eta ,\eta}$, *η* by $\widehat{\eta}$, *γ*′ by ${\widehat{\gamma}}^{\prime}$ and the operator *E* by **E**_{N} in the above. To improve the accuracy of the variance estimators when samples are small, we recommend adjusting for the number of estimated parameters; that is, multiply the estimated variance-covariance matrix by *N*/(*N* − *q*_{γ} − *q*_{η}), where *q*_{γ} is the dimension of *γ*′ and *q*_{η} is the dimension of *η*.

##### Proof that Σ_{γ} is invertible


Construct the matrices ${\stackrel{\sim}{\mathbf{Z}}}_{1}$, ${\stackrel{\sim}{\mathbf{Z}}}_{3}$,
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$ as in the proof of Lemma 3 above. Note each row is equal to one of the 2^{k−m} equally probable realizations of
$\left[{\mathbf{Z}}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\mathbf{Z}}_{3}^{T},{\left({\mathbf{Z}}_{2}^{(1)}\right)}^{T},{\left({\mathbf{Z}}_{2}^{(0)}\right)}^{T}\right]$. Define **p** to be the vector of the 2^{k−m} response probabilities (e.g. the *i*th entry is ${u}_{i1}^{T}\eta $, where *u*_{i1} is the value of **U**_{1} corresponding to the *i*th row of the 2^{k−m} design). Define ${\mathbf{D}}_{-\mathbf{p}}$ to be a diagonal matrix with diagonal elements equal to −**p**. Define

$\stackrel{\sim}{\mathbf{V}}=\left[\begin{array}{cccc}(\mathbf{I}+{\mathbf{D}}_{-\mathbf{p}}){\stackrel{\sim}{\mathbf{Z}}}_{3}& {\stackrel{\sim}{\mathbf{Z}}}_{1}& {\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}& 0\\ {\mathbf{D}}_{-\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}& {\stackrel{\sim}{\mathbf{Z}}}_{1}& 0& {\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}\end{array}\right]$

where the 0's denote conforming matrices with all entries equal to zero. Then 2^{k−m}Σ_{γ} can be written as

${\stackrel{\sim}{\mathbf{V}}}^{T}\left[\begin{array}{cc}{\mathbf{D}}_{\mathbf{p}}& 0\\ 0& \mathbf{I}-{\mathbf{D}}_{\mathbf{p}}\end{array}\right]\stackrel{\sim}{\mathbf{V}}.$

We show that $\stackrel{\sim}{\mathbf{V}}$ is of full rank. This, combined with the assumption that the response probabilities are bounded away from both 0 and 1, will imply that Σ_{γ} is invertible.

Note that $\stackrel{\sim}{\mathbf{V}}$ will have fewer than 2(2^{k−m}) columns (it has 2(2^{k−m}) rows) if there are unequal numbers of factors for responders and non-responders and/or unnecessary formal assumptions have been made. We can add columns to $\stackrel{\sim}{\mathbf{V}}$ by adding back in columns to ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ associated with nuisance effects that did not necessarily need to be assumed zero (see example 3 for an illustration). Suppose there are fewer stage 2 factors for responders than for non-responders; this means that some of the stage 2 factors for non-responders are not stacked with a stage 2 factor for responders. Add the columns in ${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$ associated with these non-stacked factors to ${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$. A similar procedure can be followed if there are fewer stage 2 factors for non-responders than for responders. If we prove this augmented 2(2^{k−m}) × 2(2^{k−m}) matrix is of full rank, then certainly any subset of its columns is also of full rank. Hence it suffices to prove that $\stackrel{\sim}{\mathbf{V}}$ is of full rank in the case when it is square (has 2(2^{k−m}) columns).

^{m}Next = _{1} + _{2} where

_{1} is easily shown to be of full rank, once one identifies common columns in _{3}, _{1},
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. Note by assumption there are no columns common to _{3} and
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ or _{3} and
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. And there are no columns common to all three _{(1)},
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. Thus _{3} is composed of columns common with _{1} and columns unique to _{3}. Similarly _{1} is composed of columns that are common with _{3}, columns common with
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ and columns common with
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. _{1} has no unique columns(if it did then this column should have been added to _{3} when was augmented). The inner product of a column with itself is 2^{k}^{−}* ^{m}* and the inner product of any other two columns is zero. These facts imply that the inverse of
${\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{1}$ is easily found. Let

*q*

_{3}be the number of columns in

_{3}. For the purposes of this proof we need only note that the first

*q*

_{3}rows of ${\left({\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{1}\right)}^{-1}$ are proportional to the matrix:

where *q*_{3}* _{c}* is the number of common columns between

_{3}and

_{1}and

*q*

_{3}

*is the number of columns unique to*

_{u}_{3}. In this we reordered the columns of

_{3}and

_{1}so that the common columns appear first.

Next, because ${\stackrel{\sim}{\mathbf{V}}}_{1}$ is of full rank and square (thus invertible), we can write

$\stackrel{\sim}{\mathbf{V}}={\stackrel{\sim}{\mathbf{V}}}_{1}\left(\mathbf{I}+{\left({\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{1}\right)}^{-1}{\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{2}\right).$

Using the formula for the first *q*_{3} rows of
${\left({\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{1}\right)}^{-1}$ we find that the term in parentheses is proportional to a matrix of the form

$\left[\begin{array}{cc}\mathbf{I}& 0\\ \mathbf{U}& \mathbf{I}\end{array}\right]$

where **U** is of dimension (2(2^{k−m}) − *q*_{3}) × *q*_{3}. This square matrix has nonzero determinant and is thus invertible; hence $\stackrel{\sim}{\mathbf{V}}$ is invertible.

##### Efficiency

First we derive the semiparametric efficient score. Then we discuss how this score can be used to produce a locally semiparametric efficient estimator of *γ*′. Lastly we prove that in the saturated model the estimators are semiparametric efficient and that under an additional assumption on the variances (this assumption is provided in Lemma 2) certain unsaturated models will also yield semiparametric efficient estimators.

##### Semiparametric Efficient Score

The semiparametric model is given by

$Y=\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime}+\epsilon $

where *E*[*ε*|*A*_{1}, *R*, *A*_{2}] = 0, *R* has a Bernoulli distribution with success rate
$E[R\mid {A}_{1},{A}_{2}]=E[R\mid {A}_{1}]={\mathbf{U}}_{1}^{T}\eta $, *γ*′ belongs to ${R}^{{q}_{\gamma}}$ (*q*_{γ} ≤ 2^{k−m}) and *η* belongs to ${R}^{{q}_{\eta}}$ (*q*_{η} is equal to the number of rows in the 2^{k−m} design corresponding to unique levels of the stage 1 factors). Put *q* = *q*_{γ} + *q*_{η}. Recall that $\mathbf{Z}(\eta )=\left[(R-{\mathbf{U}}_{1}^{T}\eta ){\mathbf{Z}}_{3}^{T},{\mathbf{Z}}_{1}^{T},R{\mathbf{Z}}_{2}^{{(1)}^{T}},(1-R){\mathbf{Z}}_{2}^{{(0)}^{T}}\right]$ and *γ*′ = [(*ψ*′)^{T}, (*φ*′)^{T}, (*β*′)^{T}, (*α*′)^{T}]^{T}. The density for a single observation is given by

${f}_{\epsilon}(y-\mathbf{z}{(\eta )}^{T}{\gamma}^{\prime}\mid {a}_{1},r,{a}_{2})\phantom{\rule{0.16667em}{0ex}}{({\mathbf{u}}_{1}^{T}\eta )}^{r}{(1-{\mathbf{u}}_{1}^{T}\eta )}^{1-r}\phantom{\rule{0.16667em}{0ex}}{p}_{i}.$

^{T}The nuisance parameters are the densities *f _{ε}*(·|

*a*

_{1},

*r*,

*a*

_{2}) where each is a mean zero density with respect to Lebesgue measure and the

*p*’s. Actually we know that

_{i}*p*= 1

_{i}*=*2

^{k}^{−}

*, however the efficient score is the same whether we know the*

^{m}*p*’s or not. This is because {

_{i}*A*

_{1},

*A*

_{2}} is ancillary (Newey, 1990). All of the expectations in this subsection are with respect to this density. As is customary, to derive the efficient score, we make regularity and smoothness conditions on the likelihood of the semiparametric model; these conditions are given in definition A.1 of the appendix in Newey (1990). This model is only slightly different from the semiparametric restricted moment model treated at great length by Tsiatis (Ch. 4, 2006); in particular it differs since a parameter in the Bernoulli distribution of

*R*(e.g.

*η*) also appears in

**Z**(

*η*). In the following we use Tsiatis’s (2006) terminology and where possible use his results.

To derive the efficient score function for {*γ*′, *η*} we calculate the score vector for {*γ*′, *η*} and then form the projection of this score vector on the nuisance tangent space (see Theorem 4.1 of Tsiatis (2006)); the residual is the efficient score. Let $\mathcal{H}$ be the Hilbert space of *q*-dimensional, mean zero, finite variance functions of {*A*_{1}, *R*, *A*_{2}, *Y*} equipped with the inner product
$<{h}_{1},{h}_{2}>=E[{h}_{1}^{T}{h}_{2}]$. The tangent space for a particular parameter is the subset of $\mathcal{H}$ formed by the mean square closure of all functions that are score functions for the parameter in a parametric submodel containing the true value of that parameter. Define

From Tsiatis (Section 4.5, 2006) we have that the tangent space for the densities *f*_{ε}(*ε*, *a*_{1}, *r*, *a*_{2}) is Λ_{1} ∩ Λ_{2} and that the tangent space for the distribution of {*A*_{1}, *A*_{2}} (e.g. the *p*_{i}'s) is Λ_{4}. Because the *p*_{i}'s are variationally independent of each of the *f*_{ε}(*ε*|*a*_{1}, *r*, *a*_{2}) densities, the nuisance tangent space is given by the direct sum of two orthogonal spaces,

$\mathrm{\Lambda}=({\mathrm{\Lambda}}_{1}\cap {\mathrm{\Lambda}}_{2})\oplus {\mathrm{\Lambda}}_{4}.$

Second, Tsiatis (Section 4.5, 2006) proves that

Since
$\mathcal{H}={\mathrm{\Lambda}}_{2}\oplus {\mathrm{\Lambda}}_{2}^{\perp}$, we have that the nuisance tangent space, Λ = (Λ_{1} Λ_{2}) Λ_{4} is the subset of Λ_{2} which is orthogonal to Λ_{3}, that is,
$\mathrm{\Lambda}={\mathrm{\Lambda}}_{2}\cap {\mathrm{\Lambda}}_{3}^{\perp}$. It follows that Λ^{} is the direct sum of
${\mathrm{\Lambda}}_{2}^{\perp}$ and the part of Λ_{2} that is contained in Λ_{3}. Since Λ_{3} Λ_{2} we have
${\mathrm{\Lambda}}^{\perp}={\mathrm{\Lambda}}_{2}^{\perp}\oplus {\mathrm{\Lambda}}_{3}$.

Recall the efficient score is the residual of the projection of the score vector for {*γ*′, *η*} onto the nuisance tangent space, Λ. Equivalently, the efficient score is the projection of the score vector for {*γ*′, *η*} onto Λ^{⊥}; in this case the latter is easier to compute. Tsiatis (Section 4.5, 2006) proves that

Because *R* is a binary random variable, members of Λ_{3} also have a simple form: *h*_{3}(*A*_{1}*, R, A*_{2}) = *g*′(*A*_{1}*, A*_{2})(*R*−*E*[*R*|*A*_{1}*, A*_{2}]) for *g*′, a *q*-dimension function of {*A*_{1}*, A*_{2}}.

We have that the efficient score must be of the form, *εg*(*A*_{1}, *R*, *A*_{2})+*g*′(*A*_{1}, *A*_{2})(*R*− *E*[*R*|*A*_{1}, *A*_{2}]) where both *g* and *g*′ are *q*-dimensional. Assume that *E*[*ε*^{2}|*A*_{1}, *R*, *A*_{2}] is positive. Then it is easy to see that the projection of an *h*(*ε*, *A*_{1}, *R*, *A*_{2}) onto
${\mathrm{\Lambda}}_{2}^{\perp}$ is

Similarly, assuming that *E*[*R*|*A*_{1}, *A*_{2}] ∈ (0, 1),

If we denote the scores for *γ*′ and *η* by *S*_{γ′} and *S*_{η} respectively, we see that we need only calculate *E*[*S*_{γ′}*ε*|*A*_{1}, *R*, *A*_{2}], *E*[*S*_{γ′}(*R* − *E*[*R*|*A*_{1}, *A*_{2}])|*A*_{1}, *A*_{2}] and similar quantities for *S*_{η}. Following Tsiatis (2006), we do this indirectly, without explicitly calculating *S*_{γ′} and *S*_{η}. We have

for all values of *a*_{1}, *r*, *a*_{2}, *η*, *γ*′ (recall *u* and *z* are functions of these). Differentiating both sides of the first two equations with respect to *γ*′ we obtain

That is, 0 = *E*[*S*_{γ′}(*ε*, *A*_{1}, *R*, *A*_{2})|*A*_{1}, *R*, *A*_{2}] and **Z**(*η*) = *E*[*εS*_{γ′}(*ε*, *A*_{1}, *R*, *A*_{2})|*A*_{1}, *R*, *A*_{2}]. The former implies that *E*[*S*_{γ′}(*ε*, *A*_{1}, *R*, *A*_{2})(*R* − *η*^{T}**U**_{1})|*A*_{1}, *A*_{2}] = 0. Differentiating both sides of each of the last two equations with respect to *η* results in:

That is,
$-({\mathbf{Z}}_{3}^{T}{\psi}^{\prime}){\mathbf{U}}_{1}=E[\epsilon {S}_{\eta}(\epsilon ,{A}_{1},R,{A}_{2})\mid {A}_{1},R,{A}_{2}]$ and
${\mathbf{U}}_{1}=E[(R-{\mathbf{U}}_{1}^{T}\eta ){S}_{\eta}({A}_{1},R,{A}_{2})\mid {A}_{1},{A}_{2}]$. Putting these results together we obtain the efficient score for (*γ*′, *η*):

where
${\sigma}_{{A}_{1},R,{A}_{2}}^{2}=E[{(Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime})}^{2}\mid {A}_{1},R,{A}_{2}]$. The variance-covariance matrix of *S*_{eff}, namely ${\mathrm{\sum}}_{\mathit{eff}}=E[{S}_{\mathit{eff}}(Y,{A}_{1},R,{A}_{2};{\gamma}^{\prime},\eta ,\sigma )\phantom{\rule{0.16667em}{0ex}}{S}_{\mathit{eff}}{(Y,{A}_{1},R,{A}_{2};{\gamma}^{\prime},\eta ,\sigma )}^{T}]$, is given by

Although we did not show this above, the smoothness assumptions imply that the estimators {${\widehat{\gamma}}^{\prime}$, $\widehat{\eta}$} are regular, asymptotically linear estimators (see Tsiatis (2006) for definitions). If Σ_{eff} were singular then no regular, asymptotically linear estimator could exist (Chamberlain, 1986). Thus Σ_{eff} must be invertible. Furthermore, for any regular and asymptotically linear estimator, say $({({\stackrel{\sim}{\gamma}}^{\prime})}^{T},{\stackrel{\sim}{\eta}}^{T})$, and any *q*-dimensional vector *a*, a lower bound on the asymptotic variance of $\sqrt{N}\left(({({\stackrel{\sim}{\gamma}}^{\prime})}^{T},{\stackrel{\sim}{\eta}}^{T})-({({\gamma}^{\prime})}^{T},{\eta}^{T})\right)a$ is ${a}^{T}{\mathrm{\sum}}_{\mathit{eff}}^{-1}a$.

##### A Semiparametric Efficient Estimator

To construct a semiparametric efficient estimator we employ the efficient score *S*_{eff} as follows. As before, we assume that *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1. We also assume that *P*[*Var*(*Y*|*A*_{1}, *R*, *A*_{2}) > 0] = 1. First use the method described in the body of the paper (and above in this Appendix) to produce consistent estimators ${\widehat{\gamma}}^{\prime}$, $\widehat{\eta}$. Next set ${\widehat{\sigma}}_{{A}_{1},1,{A}_{2}}^{2}$ (${\widehat{\sigma}}_{{A}_{1},0,{A}_{2}}^{2}$) to be the sample variance of the residual $(Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime})$ for responders (respectively, non-responders) for each group (each row) of the design. Solve

for *γ*′, *η* to obtain
${\widehat{\gamma}}_{\mathit{eff}}^{\prime}$, ${\widehat{\eta}}_{\mathit{eff}}$. Standard arguments can be used to show that

as *N* → ∞.
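The two-step construction can be sketched as follows. This is our simplified illustration of the *γ*′-update only, on synthetic data with hypothetical names (the full efficient-score system also updates *η* jointly): fit once unweighted, estimate the residual variance within each (group, *R*) cell, then re-solve by weighted least squares with weights $1/{\widehat{\sigma}}^{2}$:

```python
import numpy as np

# Simplified sketch (ours, not the paper's code) of the reweighting step.
rng = np.random.default_rng(1)
n = 600
S = rng.choice([-1.0, 1.0], n)                  # one stage 1 factor
R = rng.binomial(1, 0.6, n)                     # response indicator
sd = np.where(R == 1, 0.5, 1.5)                 # heteroscedastic in R
Y = 0.5 + 0.3 * S + rng.normal(0, 1, n) * sd
X = np.column_stack([np.ones(n), S])            # stand-in for Z(eta-hat)

gamma0 = np.linalg.lstsq(X, Y, rcond=None)[0]   # step 1: unweighted fit
resid = Y - X @ gamma0
var_hat = np.empty(n)                           # step 2: per-cell variances
for s in (-1.0, 1.0):
    for r in (0, 1):
        cell = (S == s) & (R == r)
        var_hat[cell] = resid[cell].var()

w = 1.0 / var_hat                               # step 3: weighted refit
Xw = X * w[:, None]
gamma_eff = np.linalg.solve(X.T @ Xw, Xw.T @ Y)  # solves X'WX g = X'WY
print(gamma_eff.round(2))
```

Downweighting the noisier non-responder observations is what the $1/{\sigma}_{{A}_{1},R,{A}_{2}}^{2}$ factor in the efficient score accomplishes; here the coefficients land near the generating values (.5, .3).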

##### Semiparametric Efficiency of {${\widehat{\gamma}}^{\prime}$, $\widehat{\eta}$}

In general the model (12) should be saturated; this is because each predictor associated with an effect (or one of its aliases) will occur in two vectors: either in **Z**_{1} and **Z**_{3}, or in **Z**_{1} and
${\mathbf{Z}}_{2}^{(1)}$, or in **Z**_{1} and
${\mathbf{Z}}_{2}^{(0)}$, or in
${\mathbf{Z}}_{2}^{(1)}$ and
${\mathbf{Z}}_{2}^{(0)}$ (similarly, (9) will generally be saturated as well). An exception to this pattern (in which case the model will not be saturated) occurs if there are unnecessary formal assumptions, that is, if both assumptions 3a) and 3b) are made concerning the effects associated with a particular column in Lemma 3 rather than just one of these. In this case the "Algorithm for Constructing the Regressors for Use in the Screening Analysis" of Section 4.2 will remove too many predictors from **Z**_{3}.

Assuming that no unnecessary formal assumptions are made, the only other reason why (12) (or (9)) will not be a saturated model is if there is a stage 2 factor for only responders (or only non-responders) and the design specifies that this factor's levels are crossed with the levels of the remaining factors (e.g. this factor is independent of the remaining factors). In this latter case, effects involving this stage 2 factor can only be aliased with other effects involving this stage 2 factor. As a result, each predictor associated with effects of this stage 2 factor (or one of its aliases) can only be included once and only in ${\mathbf{Z}}_{2}^{(1)}$. However, under the assumption that $P[\mathit{Var}(Y\mid {A}_{1},R,{A}_{2})=\mathit{Var}(Y\mid {A}_{1},R,{A}_{2}^{(R)})]=1$ we show that in this non-saturated case {${\widehat{\gamma}}^{\prime}$, $\widehat{\eta}$} is locally semiparametric efficient. The adjective "locally" is used because the semiparametric efficiency holds if this assumption holds and otherwise not.

First we prove that {${\widehat{\gamma}}^{\prime}$, $\widehat{\eta}$} is semiparametric efficient when (12) is a saturated model. For simplicity suppose there are stage 2 factors for both responders and non-responders. Assume *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1 and *P*[*Var*(*Y*|*A*_{1}, *R*, *A*_{2}) > 0] = 1. This means that
$P[\{{min}_{{a}_{1},r,{a}_{2}}{\widehat{\sigma}}_{{a}_{1},r,{a}_{2}}>0\}\cup \{{min}_{{u}_{1}}\mid {u}_{1}^{T}\widehat{\eta}\mid \phantom{\rule{0.16667em}{0ex}}\mid 1-{u}_{1}^{T}\widehat{\eta}\mid >0\}]\to 1$ as *N* → ∞. Thus we assume that in the sample the proportion of responders in each of the 2^{k−m} groups is not equal to one or zero. In this case we prove that
$\{{\widehat{\gamma}}^{\prime},\widehat{\eta}\}=\{{\widehat{\gamma}}_{\mathit{eff}}^{\prime},{\widehat{\eta}}_{\mathit{eff}}\}$.

First, since
${\mathbf{E}}_{N}[{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\widehat{\eta})]=0$ we have that
${\mathbf{E}}_{N}\left[{\mathbf{U}}_{1}{\scriptstyle \frac{R-{\mathbf{U}}_{1}^{T}\widehat{\eta}}{{\mathbf{U}}_{1}^{T}\widehat{\eta}(1-{\mathbf{U}}_{1}^{T}\widehat{\eta})}}\right]$ is zero as well. To see this, suppose that there are *ν* unique values of **U**_{1} in the design (*ν* ≤ 2* ^{k−m}*). Then,

where **Ũ**_{1} is a *ν* × *ν* matrix with columns corresponding to the *ν* realizations of **U**_{1}, **D** is a diagonal matrix with *i*th entry equal to ${\mathbf{E}}_{N}[R{1}_{{\mathbf{U}}_{1}={u}_{1i}}]$ and $\widehat{\mathbf{p}}$ is the *ν*-dimensional vector of observed response proportions. Since ${\mathbf{E}}_{N}[{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\widehat{\eta})]=0$ and ${\stackrel{\sim}{\mathbf{U}}}_{1}^{T}$ and **D** are invertible, we have that $\widehat{\eta}$ satisfies $\widehat{\mathbf{p}}={\stackrel{\sim}{\mathbf{U}}}_{1}^{T}\widehat{\eta}$. Since we can write ${\mathbf{E}}_{N}\left[{\mathbf{U}}_{1}{\scriptstyle \frac{R-{\mathbf{U}}_{1}^{T}\widehat{\eta}}{{\mathbf{U}}_{1}^{T}\widehat{\eta}(1-{\mathbf{U}}_{1}^{T}\widehat{\eta})}}\right]$ in a similar matrix formulation (only the diagonal matrix is altered), we see that this expression is equal to zero as well.

Next since **E**_{N} **[Z**() *Y* − **Z**()* ^{T}*′ = 0, both
${\mathbf{E}}_{N}\left[\mathbf{Z}(\widehat{\eta}){\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\sigma}_{{A}_{1},R,{A}_{2}}^{2}}}\right]$ and
${\mathbf{E}}_{N}\left[-({\mathbf{Z}}_{3}^{T}{\widehat{\psi}}^{\prime}){\mathbf{U}}_{1}{\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\sigma}_{{A}_{1},R,{A}_{2}}^{2}}}\right]$ will be zero as well. To see this we use a matrix formulation. First let

**E**_{N}[**Y**_{1}] be a 2^{k−m} dimensional vector in which the *i*th entry corresponds to the sample average of *Y R* for the *i*th group (row in design) divided by the sample average of *R* in the *i*th group. Similarly define **E**_{N}[**Y**_{0}] to be a 2^{k−m} dimensional vector in which the *i*th entry is the sample average of *Y*(1 − *R*) for the *i*th group (row in design) divided by the sample average of 1 − *R* in the *i*th group. Define a vector of response proportions of the same dimension (the *i*th entry corresponding to the *i*th unique value **u**_{1i} of **U**_{1}). With these definitions we construct an empirical version of $\stackrel{\sim}{\mathbf{Z}}$ defined in the proof of Lemma 4, in which the 0’s denote conforming matrices with all entries equal to zero and ${\stackrel{\sim}{\mathbf{Z}}}_{1}$, ${\stackrel{\sim}{\mathbf{Z}}}_{3}$, ${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$, ${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$ were defined in the proof of Lemma 4 above. Since the model is saturated, this is a square matrix of dimension 2(2^{k−m}). Using the same proof as used in Lemma 4 to show that $\stackrel{\sim}{\mathbf{Z}}$ is full rank, we can show that the empirical version is full rank as well (and hence invertible). Define **D**_{1} to be a diagonal matrix with *i*th diagonal entry equal to the proportion of responders in the *i*th group multiplied by the group size. Similarly define **D**_{0} to be a diagonal matrix with *i*th diagonal entry equal to the proportion of non-responders in the *i*th group multiplied by the group size.

We can write 0 = ${\mathbf{E}}_{N}\left[\mathbf{Z}(\widehat{\eta})(Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime})\right]$ in this matrix formulation; by the invertibility of the empirical $\stackrel{\sim}{\mathbf{Z}}$, the stacked vector of group means minus fitted values must equal zero. Thus in the saturated model, ${\widehat{\gamma}}^{\prime}$ reproduces the group means. Since we can write
${\mathbf{E}}_{N}\left[\mathbf{Z}(\widehat{\eta}){\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}}}\right]$ and
${\mathbf{E}}_{N}\left[-({\mathbf{Z}}_{3}^{T}{\widehat{\psi}}^{\prime}){\mathbf{U}}_{1}{\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}}}\right]$ as matrices times the above difference, these two terms are zero as well. Combining the above with the results for $\widehat{\eta}$, we have that when the model is saturated and the sample is sufficiently large so that the response rates in each of the 2^{k−m} groups are neither zero nor one,
${\widehat{\gamma}}^{\prime}={\widehat{\gamma}}_{\mathit{eff}}^{\prime}$ and ${\widehat{\psi}}^{\prime}={\widehat{\psi}}_{\mathit{eff}}^{\prime}$.

Next we prove that in the non-saturated model, $\{{\widehat{\gamma}}^{\prime},{\widehat{\psi}}^{\prime}\}$ will be locally semi-parametric efficient if 1) no unnecessary formal assumptions are made (assumptions 3a and 3b are made concerning the effects associated with a particular column in Lemma 3 rather than just one of these) and 2) there is a stage 2 factor for responders (or non-responders) that is independent of the remaining factors. Assume that
$\mathit{Var}(Y\mid {A}_{1},R=r,{A}_{2})=\mathit{Var}(Y\mid {A}_{1},R=r,{A}_{2}^{(r)})$ for *r* = 0, 1. Under these assumptions on the residual variance we estimate ${\sigma}_{{A}_{1},1,{A}_{2}}^{2}$ by the sample variance of the residuals $(Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime})$ among responders for each combination of groups (rows) in the design with a unique value of {*A*_{1}, ${A}_{2}^{(1)}$}, and similarly for non-responders. Denote the estimators by ${\widehat{\sigma}}_{{A}_{1},r,{A}_{2}^{(r)}}^{2}$ for *r* = 0, 1. A locally semiparametric efficient estimator can be found by solving (15) (with ${\widehat{\sigma}}_{{A}_{1},R,{A}_{2}^{(R)}}^{2}$ instead of ${\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}$).
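A minimal sketch of this pooled residual-variance estimator on simulated data (one stage 1 factor, one responder stage 2 factor; every name and value below is hypothetical, not from the paper):

```python
import numpy as np

# Pooled residual-variance estimates for responders: one estimate per
# unique {A1, A2^(1)} combination, computed from the responders' residuals.
rng = np.random.default_rng(1)
n = 200
A1 = rng.choice([-1, 1], size=n)      # a single stage 1 factor, coded +/-1
A2_1 = rng.choice([-1, 1], size=n)    # stage 2 factor offered to responders
R = rng.binomial(1, 0.5, size=n)      # response indicator
resid = rng.normal(size=n)            # stand-in for Y - Z(eta_hat)^T gamma_hat'

sigma2 = {}                           # keys: unique (a1, a2) combinations
for a1 in (-1, 1):
    for a2 in (-1, 1):
        idx = (A1 == a1) & (A2_1 == a2) & (R == 1)   # responders in this cell
        sigma2[(a1, a2)] = resid[idx].var(ddof=0)    # pooled sample variance
```

The same loop over non-responders and unique {A1, A2^(0)} cells would give the estimates for *r* = 0.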

As in the above both
${\mathbf{E}}_{N}\left[{\mathbf{U}}_{1}{\scriptstyle \frac{R-{\mathbf{U}}_{1}^{T}\widehat{\eta}}{{\mathbf{U}}_{1}^{T}\widehat{\eta}(1-{\mathbf{U}}_{1}^{T}\widehat{\eta})}}\right]$ and
${\mathbf{E}}_{N}[{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\widehat{\eta})]$ are identically zero. Next consider the first equation in (14) and (15) (in the latter ${\sigma}_{{A}_{1},R,{A}_{2}}^{2}$ is replaced by ${\sigma}_{{A}_{1},R,{A}_{2}^{(R)}}^{2}$).

For clarity, assume there is only one more stage 2 factor for responders than for non-responders and that this stage 2 factor is randomized independently of all other factors. This means that while there are 2^{k−m} unique realizations of {*A*_{1}, ${A}_{2}^{(1)}$} there are only 2^{k−m−1} unique realizations of {*A*_{1}, ${A}_{2}^{(0)}$}. Now (16) continues to hold but, although a full rank matrix, the empirical $\stackrel{\sim}{\mathbf{Z}}$ is no longer square. Its dimensions are 2(2^{k−m}) × (2^{k−m} + 2^{k−m−1}). A closer inspection reveals that among its lower 2^{k−m} rows there are only 2^{k−m−1} unique rows, each having one double. Thus if we remove these 2^{k−m−1} doubles (say to form ${\stackrel{\sim}{\mathbf{Z}}}^{\prime}$) we now have a square matrix which is full rank. We can re-express (16) as follows. Let **D**_{0} and **E**_{N}[**Y**_{0}] be defined as before. The latter vector is of dimension 2^{k−m}. Define **E**_{N}[**Y**_{0}]′ to be of dimension 2^{k−m−1}, with the *i*th entry corresponding to the sample average of *Y*(1 − *R*) for the *i*th unique realization of {*A*_{1}, ${A}_{2}^{(0)}$} divided by the sample average of 1 − *R* in the *i*th realization. Define ${\mathbf{D}}_{0}^{\prime}$ to be a diagonal matrix with *i*th diagonal entry equal to the proportion of non-responders in the group formed by the *i*th unique realization of {*A*_{1}, ${A}_{2}^{(0)}$} multiplied by the group size. Now we can re-express (16) in terms of **E**_{N}[**Y**_{0}]′, ${\mathbf{D}}_{0}^{\prime}$ and ${\stackrel{\sim}{\mathbf{Z}}}^{\prime}$.

As before, the invertibility of ${\stackrel{\sim}{\mathbf{Z}}}^{\prime}$ implies that the last term in parentheses is equal to zero. Because we can express both
${\mathbf{E}}_{N}\left[\mathbf{Z}(\widehat{\eta}){\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}^{(R)}}^{2}}}\right]$ and
${\mathbf{E}}_{N}\left[-({\mathbf{Z}}_{3}^{T}{\widehat{\psi}}^{\prime}){\mathbf{U}}_{1}{\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}^{(R)}}^{2}}}\right]$ as matrices times the above difference, these two terms are zero as well. Note we would not have been able to do this had we used the estimator
${\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}$ instead of
${\widehat{\sigma}}_{{A}_{1},R,{A}_{2}^{(R)}}^{2}$ in (15). Combining the above with the results for $\widehat{\eta}$, we have that when the sample is sufficiently large so that the response rates in each of the groups defined by unique values of {*A*_{1}, ${A}_{2}^{(1)}$} and {*A*_{1}, ${A}_{2}^{(0)}$} are neither zero nor one, then
${\widehat{\gamma}}^{\prime}={\widehat{\gamma}}_{\mathit{eff}}^{\prime}$ and ${\widehat{\psi}}^{\prime}={\widehat{\psi}}_{\mathit{eff}}^{\prime}$.
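The row-deduplication step used to form ${\stackrel{\sim}{\mathbf{Z}}}^{\prime}$ can be illustrated numerically. The matrix below is a toy stand-in, not the paper’s actual $\stackrel{\sim}{\mathbf{Z}}$: a full-rank but non-square matrix whose lower block contains a doubled row, so that dropping one copy of each double yields a square, invertible matrix.

```python
import numpy as np

# Toy illustration only: full column rank, more rows than columns,
# with a doubled row in the lower block.
Z = np.array([
    [1.0,  1.0, 0.0],
    [1.0, -1.0, 0.0],
    [0.0,  1.0, 1.0],   # doubled row
    [0.0,  1.0, 1.0],   # its double
])
assert np.linalg.matrix_rank(Z) == 3          # full column rank, not square

Z_prime = np.unique(Z, axis=0)                # drop duplicate rows
assert Z_prime.shape == (3, 3)                # now square ...
assert np.linalg.matrix_rank(Z_prime) == 3    # ... and hence invertible
```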

#### A.2 Aliasing and The Formal Assumptions

Here we provide a simple example of how the aliasing becomes difficult to interpret when the formal assumptions do not hold. Suppose there are two factors at stage 1 and one stage 2 factor for responders. We are interested in the main effects of these factors and we are willing to make the working assumption that there are no interactions. Suppose we are willing to make the formal assumption that all effects of *R* (including interactions between *R* and the stage 1 factors) are negligible. We use the experimental design in Table 2 with defining word *I* = *A*_{11}*A*_{12}*A*_{2} (*A*_{2} is
${A}_{2}^{(0)}$ here).

The conditional mean of *Y* given *A*_{1}, *R*, *A*_{2} is given by (12). Suppose that it is known that *ψ*_{j} = 0 for *j* = 2, 3, 4. However we do not know whether any of the remaining parameters are zero or not. The experimental design in Table 2 permits the identification of the *η* parameters in *p*(*A*_{1}) = *η*_{1} + *η*_{2}*A*_{11} + *η*_{3}*A*_{12} + *η*_{4}*A*_{11}*A*_{12} and the eight conditional means *E*[*Y* | *A*_{11} = *a*_{11}, *A*_{12} = *a*_{12}, *R* = *r*, *A*_{2} = *a*_{11}*a*_{12}] for *a*_{11} ∈ {−1, 1}, *a*_{12} ∈ {−1, 1} and *r* ∈ {0, 1} (recall that *A*_{2} = *A*_{11}*A*_{12} in this design). So from data resulting from the experimental design in Table 2 we can identify the eight quantities ${\stackrel{\sim}{\beta}}_{1}+{\psi}_{1}$, ${\stackrel{\sim}{\beta}}_{2}+{\eta}_{2}{\psi}_{1}$, ${\stackrel{\sim}{\beta}}_{3}+{\eta}_{3}{\psi}_{1}$, ${\stackrel{\sim}{\beta}}_{4}+{\eta}_{4}{\psi}_{1}$, *ψ*_{1} + *β*_{4}, *β*_{1}, *β*_{2} and *β*_{3}. The above nonlinearity in the parameters makes it difficult to interpret the identifiable quantities involving the ${\stackrel{\sim}{\beta}}_{j}$’s. More complex designs result in similarly uninterpretable quantities. In summary, to maintain interpretability, we consider experimental designs that *do not* alias a potentially active stage 2 effect with a potentially active nuisance effect.
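The aliasing in this example can be checked directly: under the defining word *I* = *A*_{11}*A*_{12}*A*_{2}, the *A*_{2} design column coincides elementwise with the *A*_{11} × *A*_{12} interaction column, so a linear fit cannot separate the two effects. A short sketch of the 2^{3−1} design:

```python
import numpy as np
from itertools import product

# Build the half fraction of the 2^3 design implied by I = A11*A12*A2:
# only the 4 factor combinations satisfying a2 = a11*a12 occur.
rows = []
for a11, a12 in product([-1, 1], repeat=2):
    a2 = a11 * a12                 # the defining relation forces this column
    rows.append((a11, a12, a2))
design = np.array(rows)

A11, A12, A2 = design.T
# The A2 column equals the A11 x A12 interaction column, so the main effect
# of A2 is aliased with the A11 x A12 interaction in any linear model fit.
assert np.array_equal(A2, A11 * A12)
```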

## Contributor Information

S. A. Murphy, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, Email: samurphy@umich.edu.

D. Bingham, Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6, Email: dbingham@stat.sfu.ca.

## References

Fries A, Hunter WG. Minimum Aberration 2^{k−p} Designs. Technometrics. 1980;22:601–608.

