
# Inference for Nonregular Parameters in Optimal Dynamic Treatment Regimes

^{1}Department of Statistics, University of Michigan

^{2}Center for Health Communications Research, University of Michigan

^{*}Address for correspondence: Bibhas Chakraborty, Department of Statistics, 439 West Hall, 1085 South University Avenue, Ann Arbor, MI 48109-1107, USA. E-mail: bibhas@umich.edu

## Abstract

A dynamic treatment regime is a set of decision rules, one per stage, each taking a patient’s treatment and covariate history as input, and outputting a recommended treatment. In the estimation of the optimal dynamic treatment regime from longitudinal data, the treatment effect parameters at any stage prior to the last can be nonregular under certain distributions of the data. This results in biased estimates and invalid confidence intervals for the treatment effect parameters. In this paper, we discuss both the problem of nonregularity, and available estimation methods. We provide an extensive simulation study to compare the estimators in terms of their ability to lead to valid confidence intervals under a variety of nonregular scenarios. Analysis of a data set from a smoking cessation trial is provided as an illustration.

**Keywords:** dynamic treatment regime, nonregularity, bias, hard-threshold, soft-threshold, empirical Bayes, bootstrap

## 1 Introduction

Many diseases such as mental illness, HIV infection, and substance abuse are clinically treated in multiple stages, adapting the treatment type and dosage to the ongoing measures of an individual patient’s response, adherence, burden, side effects, and preference. Dynamic treatment regimes represent one way to operationalize this sequential decision making. A dynamic treatment regime (DTR) is a sequence of decision rules, one per stage. Each decision rule takes a patient’s treatment and covariate history as input, and outputs a recommended treatment. The main motivations for considering sequences of treatments are high variability across patients in response to any one type of treatment, likely relapse, presence or emergence of co-morbidities, time-varying side effect severity, and reduction of costs and burden when intensive treatment is unnecessary^{1}.

A DTR is said to be optimal if it optimizes the mean outcome at the end of the final stage of treatment. Data for estimating the optimal DTR can come from either an observational longitudinal study or a sequential multiple assignment randomized trial (SMART)^{2}^{–}^{5}. In these designs, each patient is followed through stages of treatment and at each stage the patient is randomized to one of the possible treatment options. Experimental designs similar to SMART have been implemented in the treatments of schizophrenia^{6}, depression^{7}, and cancer^{8}^{,}^{9}.

Estimating the optimal DTR is a problem of sequential, multi-stage decision making. Murphy^{10} developed a semiparametric method for estimating the optimal DTR, an efficient version of which was provided by Robins^{11}. A nice discussion about the relationship between these two methods can be found in Moodie et al.^{12}. Other methods for estimating optimal DTRs in the literature include likelihood-based methods, both frequentist and Bayesian, developed by Thall and colleagues^{8}^{,}^{13}^{,}^{14}, and the semiparametric methods of Lunceford et al.^{15}, and Wahed and Tsiatis^{9}^{,}^{16}.

Robins^{11} considered the problem of inference for the parameters of the optimal DTR. As discussed by Robins, the treatment effect parameters at any stage prior to the last can be *nonregular* under certain longitudinal distributions of the data which he called *exceptional laws*. By nonregularity, we mean that the asymptotic distribution of the estimator of the treatment effect parameter does not converge uniformly over the parameter space (see section 2.4 for further details). This technical phenomenon of nonregularity has considerable practical consequences; it often causes bias in estimation, and leads to poor frequentist properties of the confidence intervals. Recently Moodie and Richardson^{17} provided a method called *Zeroing Instead of Plugging In* (ZIPI) for correcting the bias in the estimation of the optimal DTRs resulting under exceptional laws.

The main goals of this paper are to illustrate the problem of nonregularity, and to compare available estimation methods that attempt to address this problem. In section 2, we discuss the problem of nonregularity in detail. Section 3 provides a description of different methods that address the problem. We provide an extensive simulation study in section 4 to compare the estimators in terms of their ability to lead to valid confidence intervals using bootstrap. This is followed by an analysis of a data set from a longitudinal smoking cessation trial in section 5; the purpose is to demonstrate the applicability of the estimation methods in a real-life nonregular scenario. Finally an overall discussion is provided in section 6. Throughout this article, we assume that the data come from SMART designs. The main reason for this is to separate the issue of nonregularity from causal inference issues. However the problem of nonregularity also arises when observational data^{11}^{,}^{17} are used; and the estimators proposed in section 3 should be applicable to observational data as well.

## 2 Estimation and Inference via Q-learning

### 2.1 Notation and Data Structure

For simplicity, we focus on studies with two stages. Longitudinal data on a single patient are given by the trajectory $(O_1, A_1, O_2, A_2, O_3)$, where $O_j$ ($j = 1, 2$) denotes the covariates measured prior to treatment at the beginning of the $j$-th stage, $O_3$ is the observation at the end of stage 2, and $A_j$ ($j = 1, 2$) is the treatment assigned at the $j$-th stage subsequent to observing $O_j$. The data set consists of a random sample of $n$ patients. Define the history at each stage as: $H_1 = O_1$, $H_2 = (O_1, A_1, O_2)$. We consider a SMART design in which there are two possible treatments at each stage, $A_j \in \{-1, 1\}$; here we assume $P[A_j = -1 \mid H_j] = P[A_j = 1 \mid H_j] = \frac{1}{2}$. The study can have either a single primary outcome $Y$ observed at the end of stage 2, or two outcomes $Y_1$, $Y_2$ observed at the two stages. Note that the case of a single outcome $Y$ observed at the end can be viewed as a case with $Y_1 \equiv 0$ and $Y_2 = Y$. We assume $Y_1 = f_1(O_1, A_1, O_2)$ and $Y_2 = f_2(O_1, A_1, O_2, A_2, O_3)$, with known functions $f_1$, $f_2$. A two-stage DTR consists of two decision rules, say $(d_1, d_2)$, with $d_j(H_j) \in \mathcal{A}_j$, where $\mathcal{A}_j$ is the set of possible treatments at the $j$-th stage.

One simple method to construct (*d*_{1}, *d*_{2}) is Q-learning^{18}^{–}^{20}. Q-learning, like Robins’ *g-estimation of optimal structural nested mean models* (hereafter simply referred to as *Robins’ method*), suffers from nonregularity – the common reason being an underlying non-smooth maximization operation. Here we will illustrate the problem due to nonregularity using Q-learning, since it can be viewed as a generalization of the least squares regression to multistage decision problems, and hence simpler to explain than Robins’ semiparametric efficient method. In Lemma 1 below, we provide conditions under which Q-learning is equivalent to an inefficient version of Robins’ method.

### 2.2 Q-learning with Linear Models

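First let us define the Q-functions^{19}^{,}^{20} for the two stages as follows:

$$Q_2(H_2, A_2) = E[Y_2 \mid H_2, A_2], \qquad Q_1(H_1, A_1) = E\big[Y_1 + \max_{a_2} Q_2(H_2, a_2) \mid H_1, A_1\big].$$

If the two Q-functions were known, the optimal DTR (*d*_{1}, *d*_{2}), using a backward induction argument (as in dynamic programming), would be

$$d_j(h_j) = \arg\max_{a_j} Q_j(h_j, a_j), \quad j = 1, 2.$$

In practice, the true Q-functions are not known and hence must be estimated from the data. Consider a linear model for the Q-functions. Let the stage-*j* (*j* = 1, 2) Q-function be modeled as

$$Q_j(H_j, A_j; \beta_j, \psi_j) = \beta_j^T H_{j0} + (\psi_j^T H_{j1}) A_j, \quad j = 1, 2, \qquad (2)$$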

where $H_{j0}$ and $H_{j1}$ are two (possibly different) summaries of the history $H_j$, with $H_{j0}$ denoting the “main effect of history” and $H_{j1}$ denoting the part of history that interacts with treatment ($H_{j0}$ and $H_{j1}$ include the intercept term). The Q-learning algorithm is:

- Stage-2 regression: $(\widehat{\beta}_2, \widehat{\psi}_2) = \arg\min_{\beta_2, \psi_2} \frac{1}{n} \sum_{i=1}^{n} \left(Y_{2i} - Q_2(H_{2i}, A_{2i}; \beta_2, \psi_2)\right)^2$.
- Stage-2 optimal rule: $\widehat{d}_2(h_2) = \arg\max_{a_2} Q_2(h_2, a_2; \widehat{\beta}_2, \widehat{\psi}_2)$.
- Stage-1 pseudo-outcome: $\widehat{Y}_{1i} = Y_{1i} + \max_{a_2} Q_2(H_{2i}, a_2; \widehat{\beta}_2, \widehat{\psi}_2)$, $i = 1, \dots, n$.
- Stage-1 regression: $(\widehat{\beta}_1, \widehat{\psi}_1) = \arg\min_{\beta_1, \psi_1} \frac{1}{n} \sum_{i=1}^{n} \left(\widehat{Y}_{1i} - Q_1(H_{1i}, A_{1i}; \beta_1, \psi_1)\right)^2$.
- Stage-1 optimal rule: $\widehat{d}_1(h_1) = \arg\max_{a_1} Q_1(h_1, a_1; \widehat{\beta}_1, \widehat{\psi}_1)$.

The estimated optimal DTR using Q-learning is given by $(\widehat{d}_1, \widehat{d}_2)$.
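The five steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the working models ($H_{20} = (1, O_1, A_1, O_1 A_1)$, $H_{21} = (1, O_2, A_1)$, $H_{10} = H_{11} = (1, O_1)$), the data-generating function `simulate`, and the helpers `dot`, `ols`, and `q_learning` are all hypothetical choices made for this example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ols(X, y):
    """Least squares via the normal equations, solved by Gauss-Jordan elimination."""
    p = len(X[0])
    cols = list(zip(*X))
    # Augmented system [X'X | X'y].
    M = [[dot(cols[a], cols[b]) for b in range(p)] + [dot(cols[a], y)]
         for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(p):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[p] for row in M]

def simulate(n, gamma, delta, rng):
    """Hypothetical two-stage SMART data: rows of (o1, a1, o2, a2, y1, y2)."""
    g1, g2, g3, g4, g5, g6, g7 = gamma
    rows = []
    for _ in range(n):
        o1, a1, a2 = rng.choice((-1, 1)), rng.choice((-1, 1)), rng.choice((-1, 1))
        p_o2 = 1.0 / (1.0 + math.exp(-(delta[0] * o1 + delta[1] * a1)))
        o2 = 1 if rng.random() < p_o2 else -1
        # Assumed mean model; the stage-2 effect is g5 + g6*O2 + g7*A1.
        mu = g1 + g2 * o1 + g3 * a1 + g4 * o1 * a1 + (g5 + g6 * o2 + g7 * a1) * a2
        rows.append((o1, a1, o2, a2, 0.0, mu + rng.gauss(0.0, 1.0)))
    return rows

def q_learning(rows):
    """Two-stage Q-learning with linear working models; returns (psi1_hat, psi2_hat)."""
    h20 = [(1, o1, a1, o1 * a1) for o1, a1, o2, a2, y1, y2 in rows]
    h21 = [(1, o2, a1) for o1, a1, o2, a2, y1, y2 in rows]
    # Stage-2 regression of Y2 on (H20, H21 * A2).
    x2 = [list(m) + [h * r[3] for h in t] for m, t, r in zip(h20, h21, rows)]
    coef2 = ols(x2, [r[5] for r in rows])
    beta2, psi2 = coef2[:4], coef2[4:]
    # Stage-1 pseudo-outcome: Y1 + max over a2 of Q2 = Y1 + beta2'H20 + |psi2'H21|.
    pseudo = [r[4] + dot(beta2, m) + abs(dot(psi2, t))
              for r, m, t in zip(rows, h20, h21)]
    # Stage-1 regression of the pseudo-outcome on (H10, H11 * A1).
    x1 = [[1, r[0], r[1], r[0] * r[1]] for r in rows]
    coef1 = ols(x1, pseudo)
    return coef1[2:], psi2

def sign(x):
    return 1 if x > 0 else -1

rng = random.Random(0)
rows = simulate(4000, (0, 0, -0.5, 0, 0.25, 0.5, 0.5), (0.1, 0.1), rng)
psi1_hat, psi2_hat = q_learning(rows)
# Estimated rules: d2(h2) = sign(psi2'h21) and d1(h1) = sign(psi1'h11).
d2_example = sign(dot(psi2_hat, (1, 1, 1)))  # patient with O2 = 1, A1 = 1
d1_example = sign(dot(psi1_hat, (1, 1)))     # patient with O1 = 1
```

With ±1 coding for treatments, the maximum over $a_2$ of a linear Q-function reduces to an absolute value, which is why the pseudo-outcome line uses `abs`. That absolute value is the source of the non-smoothness discussed in the sections that follow.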

The following lemma gives a set of sufficient conditions under which Q-learning is equivalent to an inefficient version of Robins’ method.

#### Lemma 1

Consider linear models for the Q-functions as in (2). Assume that:

- the parameters in $Q_1$ and $Q_2$ are distinct;
- $A_j$ has zero conditional mean given the history $H_j$, $j = 1, 2$; and
- the covariates used in the model for $Q_1$ are nested within the covariates used in the model for $Q_2$, i.e., $(H_{10}^T, H_{11}^T A_1) \subset H_{20}^T$.

Then Q-learning is algebraically equivalent to an inefficient version of Robins’ method.

The proof is given in Appendix A.

### 2.3 The Inference Problem

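With (2) as the model for Q-functions, the optimal DTR is given by

$$d_j(H_j) = \arg\max_{a_j} Q_j(H_j, a_j; \beta_j, \psi_j) = \mathrm{sign}(\psi_j^T H_{j1}), \quad j = 1, 2,$$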

where *sign*(*x*) = 1 if *x* > 0, and −1 otherwise. Note that the term $\beta_j^T H_{j0}$ on the right side of (2) does not feature in the optimal DTR. Thus for estimating optimal DTRs, the $\psi_j$’s are the parameters of interest, while the $\beta_j$’s are nuisance parameters. We want to perform inference (e.g., construct confidence intervals) on the $\psi_j$’s.

Conducting inference on the $\psi_j$’s is important for the following reasons. First, if the confidence intervals (or hypothesis tests) for $\psi_j$ reveal that there is no evidence that some components of the vector $\psi_j$ are different from zero, then the corresponding components of the history vector $H_{j1}$ need not be collected to make decisions using the optimal DTR. This reduces the cost of data collection in a future implementation of the optimal DTR. Thus in the present context, confidence intervals (or hypothesis tests) can be viewed as a tool for doing variable selection. Second, it is important to know when there is insufficient support in the data to recommend one treatment over another, since in such cases treatment can be chosen according to other considerations like cost, familiarity, burden, preference etc. Third, as discussed by Robins^{11}, confidence intervals for $\psi_j$ can lead to confidence intervals for $d_j$. In the following, we discuss the problem of nonregularity in inference.

### 2.4 Nonregularity in Inference

Note that the stage-1 pseudo-outcome (in the Q-learning algorithm) is

$$\widehat{Y}_{1i} = Y_{1i} + \max_{a_2} Q_2(H_{2i}, a_2; \widehat{\beta}_2, \widehat{\psi}_2) = Y_{1i} + \widehat{\beta}_2^T H_{20,i} + |\widehat{\psi}_2^T H_{21,i}|, \quad i = 1, \dots, n,$$

which is a non-smooth (e.g., non-differentiable at $\widehat{\psi}_2^T H_{21,i} = 0$) function of $\widehat{\psi}_2$, because of the maximization operation. Since $\widehat{\psi}_1$ is a function of $\widehat{Y}_{1i}$, $i = 1, \dots, n$, it is in turn a non-smooth function of $\widehat{\psi}_2$. As a consequence, the asymptotic distribution of $\sqrt{n}(\widehat{\psi}_1 - \psi_1)$ does not converge uniformly^{11} over the parameter space of $\psi = (\psi_1, \psi_2)$. More specifically, the asymptotic distribution of $\sqrt{n}(\widehat{\psi}_1 - \psi_1)$ is normal if $\psi_2$ is such that $P[H_2 : \psi_2^T H_{21} = 0] = 0$, but is non-normal if $P[H_2 : \psi_2^T H_{21} = 0] > 0$. This change in the asymptotic distribution happens abruptly. The (vector) parameter $\psi_1$ is called a *nonregular* parameter and the estimator $\widehat{\psi}_1$ is called a *nonregular* estimator; see Bickel et al.^{21} for the precise definition of nonregularity. Because of this nonregularity, given the noise level present in small samples, the estimator $\widehat{\psi}_1$ oscillates between the two asymptotic distributions across samples. As a result, usual Wald type confidence intervals perform poorly^{11}^{,}^{17}.

The issue of nonregularity can be better understood with a toy example discussed by Robins^{11} (here is a slightly modified version). Consider the problem of estimating $|\mu|$ based on $n$ i.i.d. observations $X_1, \dots, X_n$ from $N(\mu, 1)$. Note that $|\overline{X}_n|$ is the maximum likelihood estimator of $|\mu|$, where $\overline{X}_n$ is the sample average. It can be shown that the asymptotic distribution of $\sqrt{n}(|\overline{X}_n| - |\mu|)$ for $\mu = 0$ is different from that for $\mu \neq 0$. Thus $|\overline{X}_n|$ is a nonregular estimator of $|\mu|$. Also, for $\mu = 0$, $\lim_{n \to \infty} E[\sqrt{n}(|\overline{X}_n| - |\mu|)] = \sqrt{\frac{2}{\pi}}$. Robins referred to this quantity as the *asymptotic bias* of the estimator $|\overline{X}_n|$. This asymptotic bias is one symptom of the underlying nonregularity, as discussed by Moodie and Richardson^{17}.
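The toy example can be checked numerically. The sketch below is a Monte Carlo illustration written for this section (the function name `scaled_bias`, the seed, and the replication counts are arbitrary choices). It estimates $E[\sqrt{n}(|\overline{X}_n| - |\mu|)]$ at $\mu = 0$ and at $\mu = 1$, drawing $\overline{X}_n$ directly from its exact $N(\mu, 1/n)$ distribution.

```python
import math
import random

def scaled_bias(mu, n, reps, rng):
    """Monte Carlo estimate of E[sqrt(n) * (|X_bar_n| - |mu|)] for X_i ~ N(mu, 1)."""
    total = 0.0
    for _ in range(reps):
        # The sample mean of n i.i.d. N(mu, 1) draws is exactly N(mu, 1/n).
        xbar = rng.gauss(mu, 1.0 / math.sqrt(n))
        total += math.sqrt(n) * (abs(xbar) - abs(mu))
    return total / reps

rng = random.Random(1)
bias_at_zero = scaled_bias(0.0, 100, 100_000, rng)   # nonregular point
bias_away = scaled_bias(1.0, 100, 100_000, rng)      # regular point
target = math.sqrt(2.0 / math.pi)                    # about 0.798
```

At $\mu = 0$ the scaled error is distributed as $|Z|$ with $Z \sim N(0, 1)$, whose mean is $\sqrt{2/\pi}$, so `bias_at_zero` should be close to `target`. Away from zero the absolute value is locally smooth and the scaled bias should be close to 0.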

In many situations where the asymptotic distribution of an estimator is unavailable, bootstrap is used as an alternative approach to conduct inference. But the success of bootstrap also hinges on the underlying smoothness of the estimator. When an estimator is nonsmooth, the ordinary (*n* out of *n*) bootstrap procedure produces an inconsistent bootstrap estimator^{22}. Inconsistency of bootstrap in the above simple normal theory example has been discussed by Andrews^{23}. As shown by Shao^{22}, an alternative resampling procedure called “*m* out of *n* bootstrap” is consistent in such nonsmooth scenarios. One concern regarding the use of this procedure is the slower rate of convergence than
$\sqrt{n}$ even in a regular setting (e.g., when
$P[{H}_{2}:{\psi}_{2}^{T}{H}_{21}=0]=0$). Moreover, a data-adaptive choice of the tuning parameter *m* in the present context of DTRs is not obvious; see however Bickel and Sakov^{24} and Hall et al.^{25} for data-adaptive choice of *m* in other contexts.

The above concerns regarding nonregularity led us to investigate possible regularizations of the estimation procedure, and then use bootstrap for inference. In the simulation study to follow, we will investigate the behavior of different types of bootstrap confidence intervals for the parameters *ψ*_{j} of the optimal DTR in both regular and nonregular settings.

## 3 Different Regularized Estimators

In this section, we will present two competing estimators to address the non-regularity problem described above. Limited theoretical results are available at this point, and consequently it is not clear which estimator is better. In this paper, we will study their relative merits and demerits in simulations.

From the discussion on nonregularity above, it is clear that $\widehat{\psi}_1$ is a nonregular estimator because the stage-1 pseudo-outcome $\widehat{Y}_1$ is a non-smooth function (e.g., absolute value) of $\widehat{\psi}_2$. The estimators presented in this section “regularize” the nonregular estimator (sometimes called the “hard-max” estimator because of the maximum operation used in the definition) by shrinking or thresholding the effect of the term involving the maximum, e.g., $|\widehat{\psi}_2^T H_{21}|$, towards zero.

### 3.1 Hard-threshold Estimator

Recall that the pseudo-outcome $\widehat{Y}_1 = Y_1 + \widehat{\beta}_2^T H_{20} + |\widehat{\psi}_2^T H_{21}|$ is non-differentiable in $\widehat{\psi}_2$ only when $\widehat{\psi}_2^T H_{21} = 0$, and so the corresponding estimator $\widehat{\psi}_1$ is problematic only when the true $\psi_2^T H_{21}$ is close to zero. The general form of the hard-threshold pseudo-outcome is

$$\widehat{Y}_{1i}^{HT} = Y_{1i} + \widehat{\beta}_2^T H_{20,i} + |\widehat{\psi}_2^T H_{21,i}| \cdot \mathbf{1}\{|\widehat{\psi}_2^T H_{21,i}| > \lambda_i\}, \quad i = 1, \dots, n,$$

where $\lambda_i$ (> 0) is the threshold for the $i$-th subject in the sample (possibly depending on the variability of the linear combination $\widehat{\psi}_2^T H_{21,i}$ for that subject). One way to operationalize this is to perform a preliminary test (for each subject in the sample) of the hypothesis $H_{0i}: \psi_2^T H_{21,i} = 0$ ($H_{21,i}$ is considered fixed in this test), set $\widehat{Y}_{1i}^{HT} = \widehat{Y}_{1i}$ if $H_{0i}$ is rejected, and replace $|\widehat{\psi}_2^T H_{21,i}|$ with the “better guess” 0 in case $H_{0i}$ is accepted. Thus the hard-threshold pseudo-outcome can be written as

$$\widehat{Y}_{1i}^{HT} = Y_{1i} + \widehat{\beta}_2^T H_{20,i} + |\widehat{\psi}_2^T H_{21,i}| \cdot \mathbf{1}\left\{ \frac{\sqrt{n}\,|\widehat{\psi}_2^T H_{21,i}|}{\sqrt{H_{21,i}^T \widehat{\Sigma}_2 H_{21,i}}} > z_{\alpha/2} \right\}, \quad i = 1, \dots, n,$$

where $\widehat{\Sigma}_2/n$ is the estimated covariance matrix of $\widehat{\psi}_2$. The corresponding estimator of $\psi_1$, denoted by $\widehat{\psi}_1^{HT}$, will be referred to as the hard-threshold estimator. The hard-threshold estimator is common in many areas like variable selection in linear regression and wavelet shrinkage^{26}. Moodie and Richardson^{17} proposed this estimator for bias correction in the context of Robins’ method, and called it *Zeroing Instead of Plugging In* (ZIPI) estimator.

Note that $\widehat{Y}_1^{HT}$ is still a non-smooth function of $\widehat{\psi}_2$ and hence $\widehat{\psi}_1^{HT}$ is a nonregular estimator of $\psi_1$. However, the problematic term
$\mid {\widehat{\psi}}_{2}^{T}{H}_{21}\mid $ is shrunk (thresholded) towards zero, and hence one might expect that the degree of nonregularity is somewhat reduced. Moodie and Richardson^{17} showed that this estimator reduces the bias occurring in Robins’ method (efficient version of Q-learning). In the simulation study to follow, we will explore if this estimator can be used to construct valid confidence intervals for *ψ*_{1}. An important issue regarding the use of this estimator is the choice of significance level *α* of the preliminary test, which is an unknown tuning parameter. As discussed by Moodie and Richardson^{17}, this is a difficult problem even in better-understood settings where preliminary test based estimators are used; and no widely applicable data-driven method for choosing *α* in this setting is currently available.
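The preliminary-test version of the hard-threshold rule can be sketched as follows. This is a minimal illustration, assuming a two-sided Wald-type test of $\psi_2^T H_{21,i} = 0$ at level $\alpha$; the function name `hard_threshold_term` and the argument `cov_psi2` (an estimated covariance matrix of $\widehat{\psi}_2$, given as nested lists) are hypothetical names introduced here.

```python
import math
from statistics import NormalDist

def hard_threshold_term(psi2_hat, cov_psi2, h21, alpha):
    """Return |psi2_hat'h21| if the level-alpha test rejects psi2'h21 = 0, else 0."""
    k = len(h21)
    lin = sum(p * h for p, h in zip(psi2_hat, h21))
    # Estimated variance of the linear combination psi2_hat'h21 (h21 held fixed).
    var = sum(h21[a] * cov_psi2[a][b] * h21[b] for a in range(k) for b in range(k))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return abs(lin) if abs(lin) > z * math.sqrt(var) else 0.0

cov = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
h = (1, 1, -1)
kept = hard_threshold_term((0.25, 0.1, 0.1), cov, h, alpha=0.20)    # test rejects
zeroed = hard_threshold_term((0.25, 0.1, 0.1), cov, h, alpha=0.08)  # test does not reject
```

In this example the linear combination equals 0.25 with standard error about 0.173. It is retained at $\alpha = 0.2$ and zeroed at $\alpha = 0.08$, which illustrates how strongly the pseudo-outcome can depend on the tuning parameter.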

### 3.2 Soft-threshold or Shrinkage Estimator

The general form of the soft-threshold pseudo-outcome considered here is

$$\widehat{Y}_{1i}^{ST} = Y_{1i} + \widehat{\beta}_2^T H_{20,i} + |\widehat{\psi}_2^T H_{21,i}| \left(1 - \frac{\lambda_i}{|\widehat{\psi}_2^T H_{21,i}|^2}\right)^{+}, \quad i = 1, \dots, n, \qquad (7)$$

where $x^{+} = x\mathbf{1}\{x > 0\}$ stands for the positive part of a function, and $\lambda_i$ (> 0) is a tuning parameter associated with the $i$-th subject in the sample (again possibly depending on the variability of the linear combination $\widehat{\psi}_2^T H_{21,i}$ for that subject). In the contexts of regression shrinkage^{27} and wavelet shrinkage^{28}, the third term in (7) is generally known as the *nonnegative garrote* estimator. As discussed by Zou^{29}, the nonnegative garrote estimator is a special case of the *adaptive lasso* estimator. As in the case of hard-threshold estimator, a crucial issue here is to choose a data-driven tuning parameter $\lambda_i$. Below we provide a choice following a Bayesian approach.

Like the hard-threshold pseudo-outcome, $\widehat{Y}_1^{ST}$ is also a non-smooth function of $\widehat{\psi}_2$ and hence $\widehat{\psi}_1^{ST}$ remains a nonregular estimator of $\psi_1$. However, the problematic term $|\widehat{\psi}_2^T H_{21}|$ is shrunk (or thresholded) towards zero, and hence one might expect that the degree of nonregularity is somewhat reduced. In the simulation study to follow, we will investigate how much improvement this estimator offers over the “hard-max” estimator, when it comes to constructing confidence intervals. Figure 1 presents the hard-max, the hard-threshold, and the soft-threshold pseudo-outcomes.

#### 3.2.1 Choice of Tuning Parameter

A hierarchical Bayesian formulation of the problem, inspired by the work of Figueiredo and Nowak^{30} in the area of wavelet-based image processing, can be used in the context of the soft-threshold estimator to choose the $\lambda_i$’s in a data-driven way. It turns out that the estimator (7) with $\lambda_i = 3 H_{21,i}^T \widehat{\Sigma}_2 H_{21,i}/n$, $i = 1, \dots, n$, where $\widehat{\Sigma}_2/n$ is the estimated covariance matrix of $\widehat{\psi}_2$, is an approximate empirical Bayes estimator. The following lemma will be used to derive the choice of $\lambda_i$.

##### Lemma 2

Let *X* be a random variable such that *X*|*μ* ~ *N*(*μ*, *σ*^{2}) with known variance *σ*^{2}. Let the prior distribution on *μ* be given by *μ*|*ϕ*^{2} ~ *N*(0, *ϕ*^{2}), with Jeffreys’ noninformative hyper-prior on *ϕ*^{2}, i.e., *p*(*ϕ*^{2}) ∝ 1/*ϕ*^{2}. Then an empirical Bayes estimator of |*μ*| is given by

where *Φ*(·) is the standard normal distribution function.

The proof is given in Appendix B.

Clearly, ${\widehat{\mid \mu \mid}}^{EB}$ is a thresholding rule, since ${\widehat{\mid \mu \mid}}^{EB}=0$ for $\mid X\mid <\sqrt{3}\sigma $. Moreover, when $\mid {\scriptstyle \frac{X}{\sigma}}\mid $ is large, the second term of (8) goes to zero exponentially fast, and

Consequently, the empirical Bayes estimator is approximated by

$$\widehat{|\mu|}^{EB} \approx |X| \left(1 - \frac{3\sigma^2}{X^2}\right)^{+}.$$

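Now for $i = 1, \dots, n$ separately, put $X = \widehat{\psi}_2^T H_{21,i}$, and $\mu = \psi_2^T H_{21,i}$ (for fixed $H_{21,i}$); and plug in $\widehat{\sigma}^2 = H_{21,i}^T \widehat{\Sigma}_2 H_{21,i}/n$ for $\sigma^2$. This leads to a choice of $\lambda_i$ in the soft-threshold pseudo-outcome (7):

$$\widehat{Y}_{1i}^{ST} = Y_{1i} + \widehat{\beta}_2^T H_{20,i} + |\widehat{\psi}_2^T H_{21,i}| \left(1 - \frac{3 H_{21,i}^T \widehat{\Sigma}_2 H_{21,i}}{n (\widehat{\psi}_2^T H_{21,i})^2}\right) \cdot \mathbf{1}\left\{ \frac{\sqrt{n}\,|\widehat{\psi}_2^T H_{21,i}|}{\sqrt{H_{21,i}^T \widehat{\Sigma}_2 H_{21,i}}} > \sqrt{3} \right\}, \quad i = 1, \dots, n. \qquad (11)$$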

The presence of the indicator function in (11) indicates that ${\widehat{Y}}_{1i}^{ST}$ is a thresholding rule for small values of $\mid {\widehat{\psi}}_{2}^{T}{H}_{21,i}\mid $, while the term just preceding the indicator function makes ${\widehat{Y}}_{1i}^{ST}$ a shrinkage rule for moderate to large values of $\mid {\widehat{\psi}}_{2}^{T}{H}_{21,i}\mid $ (for which the indicator function takes the value one). Thus the current Bayesian formulation gives us a data-driven choice of the tuning parameters.
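The thresholding-plus-shrinkage behavior just described can be seen in a short sketch. It assumes the soft-threshold term has the nonnegative garrote form $|x|(1 - \lambda/x^2)^{+}$ with $\lambda = 3\widehat{\sigma}^2$, where $x$ stands for $\widehat{\psi}_2^T H_{21,i}$ and $\widehat{\sigma}^2$ for its estimated variance; the function name `soft_threshold_term` and its arguments are hypothetical.

```python
import math

def soft_threshold_term(lin, var):
    """Return |lin| * (1 - 3*var / lin**2)^+, which is 0 whenever |lin| <= sqrt(3*var)."""
    if lin == 0.0:
        return 0.0
    return abs(lin) * max(0.0, 1.0 - 3.0 * var / lin ** 2)

var = 0.03                                # estimated variance of the linear combination
cutoff = math.sqrt(3.0 * var)             # 0.3 on the original scale
small = soft_threshold_term(0.25, var)    # below the cutoff, so set to 0
large = soft_threshold_term(0.60, var)    # above the cutoff, shrunk from 0.60 to 0.45
mirror = soft_threshold_term(-0.60, var)  # the term depends on lin only through |lin|
```

Unlike the hard-threshold rule, which returns either 0 or the full $|x|$, this term is continuous in $x$: it rises from 0 at the cutoff and approaches $|x|$ as $|x|$ grows.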

## 4 Simulation Study

In this section, we consider a simulation study to compare the performances of the hard-max, the hard-threshold, and the soft-threshold estimators under different nonregular scenarios. In this study, we vary the parameters of the generative model, the degree of nonregularity, and the type of bootstrap confidence interval.

### Generative Model

Recall that the data consist of *n* patient trajectories, each of the form (*O*_{1}, *A*_{1}, *O*_{2}, *A*_{2}, *O*_{3}). Without loss of generality, we assume *Y*_{1} ≡ 0 and *Y*_{2} ≡ *Y* = *O*_{3}. Let *μ*_{Y} = *E*[*Y* | *O*_{1}, *A*_{1}, *O*_{2}, *A*_{2}], and *ε* be the associated error term. Then *Y* = *μ*_{Y} + *ε*, where

and *ε* ~ *N*(0, 1). Next, we consider binary treatments randomized with probability 1/2, i.e., *P*[*A*_{j} = 1] = *P*[*A*_{j} = −1] = 1/2, *j* = 1, 2. Also, the binary covariates *O*_{j}’s are generated as

where *expit*(*x*) = exp(*x*)/(1 + exp(*x*)). Note that *γ*_{1},…,*γ*_{7} and *δ*_{1}, *δ*_{2} are the parameters that specify the generative model. These parameters will be varied in the examples to follow.

### Analysis Model

### Two dimensions of nonregularity: *p* and *ϕ*

Nonregularity in stage 1 parameters arises when the optimal stage 2 treatment is non-unique for at least some subjects in the population. With reference to the present generative model, a setting is nonregular if the linear combination *γ*_{5} + *γ*_{6}*O*_{2} + *γ*_{7}*A*_{1} = 0 with positive probability. Also one might expect some nonregular behavior as *γ*_{5} + *γ*_{6}*O*_{2} + *γ*_{7}*A*_{1} falls in a small neighborhood of zero (even though not exactly zero). In the following, we consider specific examples varying the “degree of nonregularity”, e.g., *p* = *P*[*γ*_{5} + *γ*_{6}*O*_{2} + *γ*_{7}*A*_{1} = 0] and the “standardized effect size” defined as
$\varphi =\mid E[{\gamma}_{5}+{\gamma}_{6}{O}_{2}+{\gamma}_{7}{A}_{1}]/\sqrt{\mathit{Var}[{\gamma}_{5}+{\gamma}_{6}{O}_{2}+{\gamma}_{7}{A}_{1}]}\mid $. The quantities *p* and *ϕ*, which depend on the distribution of the above linear combination, represent two dimensions of the nonregularity phenomenon. Note that the linear combination (*γ*_{5} + *γ*_{6}*O*_{2} + *γ*_{7}*A*_{1}) can take only four possible values corresponding to the four possible (*O*_{2}, *A*_{1}) cells. The cell probabilities can be easily calculated; the formulae are provided in Table 1.

It follows that *E*[*γ*_{5} + *γ*_{6}*O*_{2} + *γ*_{7}*A*_{1}] = *q*_{1}*f*_{1} + *q*_{2}*f*_{2} + *q*_{3}*f*_{3} + *q*_{4}*f*_{4}, and
$E[{({\gamma}_{5}+{\gamma}_{6}{O}_{2}+{\gamma}_{7}{A}_{1})}^{2}]={q}_{1}{f}_{1}^{2}+{q}_{2}{f}_{2}^{2}+{q}_{3}{f}_{3}^{2}+{q}_{4}{f}_{4}^{2}$, where *q*_{1},…,*q*_{4} are the cell probabilities given in Table 1. From these two, one can calculate *Var*[*γ*_{5} + *γ*_{6}*O*_{2} + *γ*_{7}*A*_{1}], and subsequently the effect size *ϕ*.

We want to conduct inference on *ψ*_{10} and *ψ* _{11}, the analysis model parameters associated with stage 1 treatment *A*_{1}. They can be expressed in terms of *γ*’s and *δ*’s, the parameters of the generative model, as follows. It turns out that

where
${q}_{1}^{\prime}={q}_{3}^{\prime}={\scriptstyle \frac{1}{4}}(\mathit{expit}({\delta}_{1}+{\delta}_{2})-\mathit{expit}(-{\delta}_{1}+{\delta}_{2}))$, and
${q}_{2}^{\prime}={q}_{4}^{\prime}={\scriptstyle \frac{1}{4}}(\mathit{expit}({\delta}_{1}-{\delta}_{2})-\mathit{expit}(-{\delta}_{1}-{\delta}_{2}))$. In the following, we consider specific examples for varying *p* and *ϕ*. In Examples 1 – 4 below, we use *δ*_{1} = *δ*_{2} = 0.5. For this choice, we get the following values of the cell probabilities: *q*_{1} = *q*_{4} = 0.3078 and *q*_{2} = *q*_{3} = 0.1922. This choice of the *δ*’s also makes
${q}_{1}^{\prime}={q}_{2}^{\prime}={q}_{3}^{\prime}={q}_{4}^{\prime}=.0578$.
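The quantities *p*, *ϕ*, and the cell probabilities can be computed directly from the generative parameters. The sketch below rests on assumptions that are not spelled out in the text above and should be read as a reconstruction: *O*_{1} and *A*_{1} are uniform on {−1, 1}, *P*[*O*_{2} = 1 | *O*_{1}, *A*_{1}] = *expit*(*δ*_{1}*O*_{1} + *δ*_{2}*A*_{1}), the cells are ordered as (*O*_{2}, *A*_{1}) = (1, 1), (1, −1), (−1, 1), (−1, −1), *γ*_{3} and *γ*_{4} are the coefficients of *A*_{1} and *O*_{1}*A*_{1} in *μ*_{Y}, and the stage-1 analysis model is saturated in (*O*_{1}, *A*_{1}). The function name `summarize` is hypothetical. These assumptions reproduce the numerical values quoted for the examples in this section.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def summarize(gamma, delta):
    """Cell probabilities q, degree of nonregularity p, effect size phi, psi10, psi11."""
    g1, g2, g3, g4, g5, g6, g7 = gamma
    d1, d2 = delta

    def f(o2, a1):              # stage-2 treatment effect in cell (O2, A1)
        return g5 + g6 * o2 + g7 * a1

    def p_o2(o2, o1, a1):       # assumed model for O2 given (O1, A1)
        p1 = expit(d1 * o1 + d2 * a1)
        return p1 if o2 == 1 else 1.0 - p1

    cells = [(1, 1), (1, -1), (-1, 1), (-1, -1)]   # assumed ordering of (O2, A1)
    # O1 and A1 are independent and uniform, so each (o1, a1) pair has mass 1/4.
    q = [sum(0.25 * p_o2(o2, o1, a1) for o1 in (-1, 1)) for o2, a1 in cells]
    fs = [f(o2, a1) for o2, a1 in cells]
    p = sum(qi for qi, fi in zip(q, fs) if abs(fi) < 1e-12)
    mean = sum(qi * fi for qi, fi in zip(q, fs))
    var = sum(qi * fi * fi for qi, fi in zip(q, fs)) - mean ** 2
    if var > 1e-12:
        phi = abs(mean) / math.sqrt(var)
    else:
        phi = math.inf if abs(mean) > 1e-12 else math.nan
    # Stage-1 effects: project E[|f| | O1, A1] onto (1, O1, A1, O1*A1).
    psi10, psi11 = g3, g4
    for o1 in (-1, 1):
        for a1 in (-1, 1):
            g = sum(p_o2(o2, o1, a1) * abs(f(o2, a1)) for o2 in (-1, 1))
            psi10 += 0.25 * a1 * g
            psi11 += 0.25 * o1 * a1 * g
    return {"q": q, "p": p, "phi": phi, "psi10": psi10, "psi11": psi11}

ex3 = summarize((0, 0, -0.5, 0, 0.5, 0, 0.5), (0.5, 0.5))      # Example 3 settings
ex6 = summarize((0, 0, -0.5, 0, 0.25, 0.5, 0.5), (0.1, 0.1))   # Example 6 settings
```

Calling `summarize` with the parameter values of a given example is a quick way to confirm the stated *p*, *ϕ*, *ψ*_{10}, and *ψ*_{11} before running a simulation.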

#### Example 1 (p = 1, ϕ undefined)

Consider a setting where there is no treatment effect for any subject (any history) in either stage. This is achieved by setting *γ*_{1} = … = *γ*_{7} = 0, and *δ*_{1} = *δ*_{2} = 0.5. Then *f*_{1} = *f*_{2} = *f*_{3} = *f*_{4} = 0, and hence *ψ* _{10} = *ψ* _{11} = 0, *p* = 1, and *ϕ* is undefined (0/0). This is a fully nonregular scenario.

#### Example 2 (p = 0, ϕ infinite)

Consider a setting similar to Example 1, where there is a very weak stage 2 treatment effect for every subject (all possible history). This is achieved by setting *γ*_{5} = 0.01 and *γ*_{j} = 0, ∀ *j* ≠ 5, and *δ*_{1} = *δ*_{2} = 0.5. Then *f*_{1} = *f*_{2} = *f*_{3} = *f*_{4} = 0.01; *ψ*_{10} = *ψ*_{11} = 0, *p* = 0, and *ϕ* is infinite (0.01/0). This is a regular scenario, but close to nonregularity (it is hard to detect the very weak effect given the noise level in the data).

#### Example 3 ( $p={\scriptstyle \frac{1}{2}}$, ϕ = 1)

Consider a setting where there is no stage 2 treatment effect for half the subjects in the population, but a reasonably large effect for the other half of subjects. This is achieved by setting *γ*_{1} = *γ*_{2} = *γ*_{4} = *γ*_{6} = 0, *γ*_{3} = −0.5, *γ*_{5} = *γ*_{7} = 0.5, and *δ*_{1} = *δ*_{2} = 0.5. Then *f*_{1} = *f*_{3} = 1, *f*_{2} = *f*_{4} = 0, *ψ* _{10} = *ψ* _{11} = 0,
$p={\scriptstyle \frac{1}{2}}$ and *ϕ* = 1. This is a nonregular setting.

#### Example 4 (p = 0, ϕ = 1.0204)

Consider a setting where there is a very weak stage 2 treatment effect for half the subjects in the population, but a reasonably large effect for the other half of subjects. This is achieved by setting *γ*_{1} = *γ*_{2} = *γ*_{4} = *γ*_{6} = 0, *γ*_{3} = −0.5, *γ*_{5} = 0.5, *γ*_{7} = 0.49, and *δ*_{1} = *δ*_{2} = 0.5. It follows that *f*_{1} = *f*_{3} = 0.99, *f*_{2} = *f*_{4} = 0.01, *ψ* _{10} = −0.0100, *ψ* _{11} = 0, *p* = 0, and *ϕ* = 1.0204. This regular example is close to the nonregular Example 3.

#### Example 5 ( $p={\scriptstyle \frac{1}{4}}$, ϕ = 1.4142)

Consider a setting where there is no stage 2 treatment effect for one-fourth of the subjects in the population, but others have a reasonably large effect. To achieve this, set *γ*_{1} = *γ*_{2} = *γ*_{4} = 0, *γ*_{3} = −0.5, *γ*_{5} = 1, *γ*_{6} = *γ*_{7} = 0.5, *δ*_{1} = 1, and *δ*_{2} = 0. Then *f*_{1} = 2, *f*_{2} = *f*_{3} = 1, *f*_{4} = 0; the cell probabilities are equal, i.e.,
${q}_{1}={q}_{2}={q}_{3}={q}_{4}={\scriptstyle \frac{1}{4}}$; and
${q}_{1}^{\prime}={q}_{2}^{\prime}={q}_{3}^{\prime}={q}_{4}^{\prime}=0.1155$. Consequently, *ψ*_{10} = *ψ*_{11} = 0,
$p={\scriptstyle \frac{1}{4}}$, and *ϕ* = 1.4142. This is a nonregular setting.

#### Example 6 (p = 0, ϕ = 0.3451)

Consider a completely regular setting where there is a reasonably large stage 2 treatment effect for every subject in the population. This can be achieved by setting *γ*_{1} = *γ*_{2} = *γ*_{4} = 0, *γ*_{3} = −0.5, *γ*_{5} = 0.25, *γ*_{6} = *γ*_{7} = 0.5, and *δ*_{1} = *δ*_{2} = 0.1. Then *f*_{1} = 1.25, *f*_{2} = *f*_{3} = 0.25, and *f*_{4} = −0.75; the cell probabilities are *q*_{1} = *q*_{4} = 0.2625, *q*_{2} = *q*_{3} = 0.2375; and
${q}_{1}^{\prime}={q}_{2}^{\prime}={q}_{3}^{\prime}={q}_{4}^{\prime}=0.0125$. It follows that *ψ*_{10} = −0.3688, *ψ*_{11} = 0.0187, *p* = 0 and *ϕ* = 0.3451.

Note that in Example 5, the effect size *ϕ* is greater than Cohen’s^{31} benchmark large effect size (=0.8). Such a high effect size can be criticized as being unrealistic, based on the *principle of clinical equipoise*^{32}, which provides the ethical basis for medical research involving randomization. This principle says that there must be an honest, professional disagreement (high variability) among expert clinicians about the preferred treatment (and thus the standardized effect size of treatment is likely small). Hence this example might be somewhat down-weighted for overall comparison of performance. Furthermore, Example 6 violates the *Hierarchical Ordering Principle*^{33} in that the coefficient of the interaction term *A*_{1}*A*_{2} (*γ*_{7}) is larger than the coefficient of the main effect *A*_{2} (*γ*_{5}). So this example might be given lower weight as well.

### Competing Estimators

In the simulation, we will consider four estimators: the hard-max estimator (original Q-learning), the soft-threshold estimator, and the hard-threshold estimator with two values of the tuning parameter *α*: 0.2, which was empirically found to be a good choice by Moodie and Richardson^{17}, and 0.08, which corresponds to the threshold used by the soft-threshold estimator proposed in this paper (from (11), the threshold used by the soft-threshold estimator is $\sqrt{3} = 1.7321$; equating this point to $z_{\alpha/2}$ and solving for $\alpha$, we get $\alpha = 0.0833$).
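The correspondence between the threshold $\sqrt{3}$ and the significance level can be verified with the standard normal distribution function. The snippet below is a small check, using `statistics.NormalDist` from the Python standard library and reading $z_{\alpha/2}$ as the upper $\alpha/2$ quantile.

```python
import math
from statistics import NormalDist

threshold = math.sqrt(3.0)
# Solve z_{alpha/2} = sqrt(3) for alpha: alpha = 2 * (1 - Phi(sqrt(3))).
alpha = 2.0 * (1.0 - NormalDist().cdf(threshold))
```

Rounded to four decimals, `threshold` is 1.7321 and `alpha` is 0.0833, in agreement with the values quoted above.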

### Different Bootstrap CIs

We consider three types of bootstrap CIs: percentile, hybrid, and double (percentile) bootstrap CIs. Let $\widehat{\theta}$ be an estimator of $\theta$ and $\widehat{\theta}^{*}$ be its bootstrap version. Then the 100(1 − *α*)% percentile bootstrap (PB) CI is given by $\left(\widehat{\theta}^{*}_{(\frac{\alpha}{2})}, \widehat{\theta}^{*}_{(1-\frac{\alpha}{2})}\right)$, and the 100(1 − *α*)% hybrid bootstrap (HB) CI is given by $\left(2\widehat{\theta} - \widehat{\theta}^{*}_{(1-\frac{\alpha}{2})}, 2\widehat{\theta} - \widehat{\theta}^{*}_{(\frac{\alpha}{2})}\right)$, where $\widehat{\theta}^{*}_{(\gamma)}$ is the 100*γ*-th percentile of the bootstrap distribution. The double bootstrap (DB) CI is calculated as follows:

- Draw $B_1$ first-stage bootstrap samples from the original data. For each first-stage bootstrap sample, calculate the bootstrap version of the estimator $\widehat{\theta}^{*b}$, $b = 1, \dots, B_1$.
- Conditional on each first-stage bootstrap sample, draw $B_2$ second-stage (nested) bootstrap samples and calculate the double bootstrap versions of the estimator, i.e., $\widehat{\theta}^{**bm}$, $b = 1, \dots, B_1$, $m = 1, \dots, B_2$.
- For $b = 1, \dots, B_1$, calculate $u^{*b} = \frac{1}{B_2} \sum_{m=1}^{B_2} \mathbf{1}\{\widehat{\theta}^{**bm} \le \widehat{\theta}\}$, where $\widehat{\theta}$ is the estimator based on the original data.
- The double bootstrap CI is given by $\left(\widehat{\theta}^{*}_{\widehat{q}(\frac{\alpha}{2})}, \widehat{\theta}^{*}_{\widehat{q}(1-\frac{\alpha}{2})}\right)$, where $\widehat{q}(\gamma) = u^{*}_{(\gamma)}$, the 100*γ*-th percentile of the distribution of $u^{*b}$, $b = 1, \dots, B_1$.

See Davison and Hinkley^{34} and Nankervis^{35} for details about double bootstrap CIs. One disadvantage of these CIs is that they are computationally very intensive.
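The three interval constructions described above can be sketched in a few lines of Python. This is only an illustrative sketch for a generic estimator applied to an i.i.d. sample, not the code used in the simulation study; the function names, the use of NumPy, and the small values of `B1` and `B2` in the demonstration are assumptions made here to keep the example fast.

```python
import numpy as np

def percentile_ci(theta_star, alpha):
    """Percentile bootstrap CI from bootstrap replicates theta_star."""
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def hybrid_ci(theta_hat, theta_star, alpha):
    """Hybrid bootstrap CI: percentile end points reflected around theta_hat."""
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    return 2 * theta_hat - hi, 2 * theta_hat - lo

def double_bootstrap_ci(data, estimator, alpha, B1, B2, rng):
    """Double (percentile) bootstrap CI following the four steps in the text."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = estimator(data)
    theta_star = np.empty(B1)
    u_star = np.empty(B1)
    for b in range(B1):
        # Step 1: first-stage bootstrap sample and its estimate.
        first = data[rng.integers(0, n, size=n)]
        theta_star[b] = estimator(first)
        # Step 2: nested bootstrap samples drawn from the first-stage sample.
        nested = np.array([estimator(first[rng.integers(0, n, size=n)])
                           for _ in range(B2)])
        # Step 3: proportion of nested estimates not exceeding theta_hat.
        u_star[b] = np.mean(nested <= theta_hat)
    # Step 4: calibrated percentile levels q_hat(alpha/2), q_hat(1 - alpha/2).
    q_lo, q_hi = np.quantile(u_star, [alpha / 2, 1 - alpha / 2])
    return np.quantile(theta_star, q_lo), np.quantile(theta_star, q_hi)

# Small demonstration on simulated data (hypothetical settings).
rng = np.random.default_rng(1)
sample = rng.normal(loc=1.0, scale=1.0, size=100)
db_ci = double_bootstrap_ci(sample, np.mean, alpha=0.05, B1=200, B2=50, rng=rng)
print("double bootstrap CI for the mean:", db_ci)
```

The nested loop makes the cost proportional to *B*_{1} × *B*_{2} evaluations of the estimator, which is why the double bootstrap is so much more expensive than the percentile or hybrid intervals.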

We use *B* = 1000 bootstrap iterations to calculate the percentile and the hybrid bootstrap CIs. However, the double bootstrap CIs are based on *B*_{1} = 500 first-stage and *B*_{2} = 100 second-stage bootstrap iterations (due to the increased computational burden). The results in Tables 2 – 3 are based on *N* = 1000 Monte Carlo iterations.

*ψ*_{10} using the hard-max (HM), the hard-threshold with *α* = 0.08 (HT_{0.08}) and *α* = 0.2 (HT_{0.20}), and the soft-threshold **...**

### 4.1 Results

The simulation study compares the competing estimators on a variety of settings represented by Examples 1 – 6. We considered estimation and inference for both *ψ*_{10} and *ψ*_{11}. However in the present examples, the effect of nonregularity turned out to be more pronounced for the parameter *ψ*_{10} (main effect of *A*_{1}) than *ψ*_{11} (interaction of *A*_{1} with *O*_{1}). Hence we included results on *ψ*_{10} only in Tables 2 and 3. Also in the following discussion, we will focus on *ψ*_{10}.

In Example 1 (top part of Table 2), where stage 2 effects for all possible histories are zero (i.e., the stage 2 optimal treatment is non-unique for every subject in the population), we see that there is no bias associated with the hard-max estimator; and the mean squared error (MSE) is essentially the same as the variance. However the percentile bootstrap CI (both 95% and 90%) has over-coverage (note that over-coverage translates to lower power of the corresponding hypothesis test), and the hybrid bootstrap CI (95%) has under-coverage compared to the nominal level. We have also studied the Wald-type CIs for this setting (not included in this paper) and observed over-coverage (the problem with Wald-type CIs in such nonregular settings is well-known^{11}^{,}^{17}). This suggests that the asymptotic distribution of the hard-max estimator has a lighter tail than a comparable normal distribution. However, the double bootstrap CIs have correct coverage. Note that both versions of the hard-threshold estimator fail to rectify the coverage rate, even though neither suffers from bias. However, the soft-threshold estimator offers correct coverage for both types of bootstrap CIs. Moreover, it gives the lowest MSE among the four estimators. Note that the soft-threshold estimator is also non-smooth (nonregular), and consequently the bootstrap distribution is inconsistent for the true asymptotic distribution of this estimator. But in this setting, it reduces the degree of nonregularity just enough so that the bootstrap CIs do not show the problem with coverage.

Even though Example 2 (middle part of Table 2) is a regular setting (*p* = 0), it is very close to Example 1 and hence affected by nonregularity. Results are similar to those in Example 1. Thus the presence of very small effects causes problems with coverage even in regular settings.

Example 3 (bottom part of Table 2) is a setting where the stage 2 optimal treatment is non-unique for half the subjects in the population ($p={\scriptstyle \frac{1}{2}}$) and is unique for the remaining half, but the overall standardized stage 2 effect size *ϕ* (= 1) is quite large. Here the hard-max estimator is biased, and hence both the percentile and the hybrid bootstrap CIs under-cover the true value. However the double bootstrap CI gives the correct coverage rate. Both versions of the hard-threshold estimator reduce bias and one of them (corresponding to *α* = 0.08) gives correct coverage, while the other also offers substantial improvement of the coverage rate. This is consistent with the findings of Moodie and Richardson^{17}. The soft-threshold estimator also reduces bias, gives the lowest MSE among the four estimators, and provides correct coverage with the hybrid bootstrap method but not with the percentile method (even though it offers substantial improvement). Thus in this example, the hard-threshold estimator with *α* = 0.08 emerges as the winner, with the soft-threshold estimator in second place. However, note that the value 0.08 of the tuning parameter *α* is not arbitrary; it corresponds to the threshold used by the soft-threshold estimator. If constructing confidence intervals is the main goal (so biased estimation is less of an issue), the double bootstrap CI along with the hard-max estimator can also be used in this setting, although it is computationally more expensive.

Example 4 (top part of Table 3) is a regular setting, very similar to the nonregular setting in Example 3. Results are quite similar to those in Example 3. This is consistent with our previous observation (Example 2) that the presence of very small effects causes problems with coverage even in regular settings.

In Example 5 (middle part of Table 3), the stage 2 optimal treatment is non-unique for one-fourth of the subjects in the population ($p={\scriptstyle \frac{1}{4}}$) and the standardized effect size *ϕ* is very large (=1.4142). Again, the hard-max estimator is biased, and has low coverage of the CIs (except for double bootstrap). The hard-threshold and the soft-threshold estimators offer improvement in terms of bias as well as coverage. The soft-threshold estimator emerges as the best (lowest MSE and correct coverage rate) in this example.

Example 6 (bottom part of Table 3) is a regular setting (*p* = 0, with no extremely tiny stage 2 effect as in Examples 2 and 4), with the standardized effect size 0.3451. The reason for investigating this setting is to check if the regularized estimators (hard and soft threshold) perform poorly in settings where there is no need to regularize. As expected, the hard-max estimator performs well here. The soft-threshold estimator introduces some bias when there is none in the hard-max estimator and increases MSE; but still manages to provide correct coverage for the percentile bootstrap method. The hard-threshold estimators also give correct coverage for percentile CIs.

To summarize, the hard-max estimator is problematic in nonregular scenarios, except when used with the computationally intensive double bootstrap method for constructing confidence intervals. The hard-threshold estimator, if properly tuned, addresses the problem of bias but not the problem of light tail. The soft-threshold estimator seems to address both problems to a large extent. In the simulation, the soft-threshold estimator consistently produced the lowest MSE among the competing methods across all the nonregular scenarios. Also in all the nonregular settings, either the soft-threshold estimator or the hard-threshold estimator with *α* = 0.08 (this *α* corresponds to the threshold used by the soft-threshold estimator) emerged as the winner in terms of providing correct coverage rate of the bootstrap CIs. Even though the soft-threshold estimator incurs some bias in regular settings, it manages to provide reasonable coverage rate for small to moderate standardized effect sizes (we have studied up to around 0.35). Across all the scenarios considered here (Examples 1–6), the soft-threshold estimator emerged as more robust than the hard-threshold estimator to the degree of regularity of the underlying data distribution, probably because of its “soft” nature (the soft-threshold estimator is continuous everywhere even though it has two points of non-differentiability, whereas the hard-threshold estimator has two points of discontinuity – see Figure 1). Furthermore, note that overall the hybrid bootstrap CIs performed slightly better than the percentile bootstrap CIs in this simulation study. Hence the hybrid bootstrap CIs will be used in the data analysis to follow.
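The contrast between the "soft" and "hard" rules can be made concrete with a small Python sketch. The soft-threshold form used here, |*x*|(1 − 3·Var/*x*²)^{+}, is the one that appears in the pseudo-outcome of Section 5; the hard-threshold form, |*x*|·1{|*x*|/√Var > *z*_{*α*/2}}, is an assumption made here on the basis of the description of the tuning parameter *α*, and all function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def hard_max(x):
    """Original Q-learning: no thresholding."""
    return abs(x)

def hard_threshold(x, var, alpha):
    """Assumed form: keep |x| only if the standardized value exceeds z_{alpha/2}."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(x) if abs(x) > z * sqrt(var) else 0.0

def soft_threshold(x, var, lam=3.0):
    """|x| * (1 - lam * var / x^2)^+, with lam = 3 as in the soft-threshold estimator."""
    if x == 0:
        return 0.0
    return abs(x) * max(1 - lam * var / x**2, 0.0)

# With var = 1, both rules switch on at |x| = sqrt(3) when alpha is about 0.0833.
alpha = 2 * (1 - NormalDist().cdf(sqrt(3)))
for x in (1.70, 1.74, 2.50):
    print(x, hard_max(x), hard_threshold(x, 1.0, alpha), round(soft_threshold(x, 1.0), 4))
```

Just above the cut-off the hard-threshold value jumps from 0 to roughly √3, whereas the soft-threshold value rises continuously from 0, which is the continuity property referred to in the summary above.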

## 5 Analysis of Smoking Cessation Data

To demonstrate the occurrence of nonregularity and the use of the soft-threshold method in a real application, here we present the analysis of a data set from a randomized, two-stage, longitudinal, internet-based smoking cessation study conducted by the Center for Health Communications Research at the University of Michigan. The stage 1 of this study (*Project Quit*) was conducted to find an optimal multi-factor behavioral intervention to help adult smokers quit smoking; and the stage 2 (*Forever Free*) was a follow-on study to help those (among the participants of *Project Quit*) who already quit stay quit, and help those who failed at the previous stage with a second chance. Details of the study design and primary analysis of the stage 1 data can be found in Strecher et al.^{36}

At stage 1, although there were five two-level treatment factors, only two, namely `source` (of online behavioral counseling message) and `story` (of a hypothetical character who succeeded in quitting smoking), were significant in the analysis reported in Strecher et al.^{36} For simplicity, we considered only these two treatment factors at stage 1 of our present analysis, which gave a total of 4 treatment combinations at stage 1 corresponding to the 2×2 design. The treatment factor `source` was varied at two levels, high vs. low personalization, coded 1 and −1; the factor `story` was also varied at two levels, high vs. low tailoring depth (degree to which the character in the story was tailored to the individual subject’s baseline characteristics), coded 1 and −1. Baseline variables at this stage included subjects’ `motivation` to quit (on a 1–10 scale), `selfefficacy` (on a 1–10 scale) and `education` (binary, ≤ high school vs. > high school, coded −1/1). At stage 2, originally there were 4 different treatment groups and a control group; however the 4 treatment groups were combined together for the present analysis because of very little difference between them. This resulted in only two choices of treatment at stage 2; this treatment variable was called `FFarm`, coded −1/1 (1 = treatment, −1 = control).

There were two outcomes at the two stages of this study. The stage 1 outcome was binary quit status called `PQ6Quitstatus` (1 = quit, 0 = not quit) at 6 months from the date of randomization. The stage 2 outcome was binary quit status `FF6Quitstatus` at 6 months from the date of stage 2 randomization (i.e., 12 months from the date of stage 1 randomization).

An example DTR can have the following form: “At stage 1, if a subject’s baseline `selfefficacy` is greater than a threshold value (say 7, on a 1–10 scale), then provide the highly-personalized level of the treatment component `source`, and if the subject is willing to continue treatment, then at stage 2 provide treatment if he/she continues to be a smoker at the end of stage 1”. Of course characteristics other than `selfefficacy` or a combination of more than one subject characteristic can be used to specify a DTR. To find the optimal DTR, we applied both the hard-max and the soft-threshold estimators within the Q-learning framework. This involved:

- a stage 2 regression (*n* = 281) of `FF6Quitstatus` using the model:

$$\begin{array}{l}FF6\mathit{Quitstatus}={\beta}_{20}+{\beta}_{21}\,\mathit{motivation}+{\beta}_{22}\,\mathit{source}+{\beta}_{23}\,\mathit{selfefficacy}\\ +{\beta}_{24}\,\mathit{story}+{\beta}_{25}\,\mathit{education}+{\beta}_{26}\,PQ6\mathit{Quitstatus}\\ +{\beta}_{27}\,\mathit{source}\ast \mathit{selfefficacy}+{\beta}_{28}\,\mathit{story}\ast \mathit{education}\\ +\left({\psi}_{20}+{\psi}_{21}\,PQ6\mathit{Quitstatus}\right)\ast \mathit{FFarm}+{\epsilon}_{2};\end{array}$$

- finding both the hard-max pseudo-outcome (*Ŷ*_{1}) and the soft-threshold pseudo-outcome (${\widehat{Y}}_{1}^{ST}$) for the stage 1 regression:

$$\begin{array}{l}{\widehat{Y}}_{1}=PQ6\mathit{Quitstatus}+{\widehat{\beta}}_{20}+{\widehat{\beta}}_{21}\,\mathit{motivation}+{\widehat{\beta}}_{22}\,\mathit{source}+{\widehat{\beta}}_{23}\,\mathit{selfefficacy}\\ +{\widehat{\beta}}_{24}\,\mathit{story}+{\widehat{\beta}}_{25}\,\mathit{education}+{\widehat{\beta}}_{26}\,PQ6\mathit{Quitstatus}\\ +{\widehat{\beta}}_{27}\,\mathit{source}\ast \mathit{selfefficacy}+{\widehat{\beta}}_{28}\,\mathit{story}\ast \mathit{education}\\ +\mid {\widehat{\psi}}_{20}+{\widehat{\psi}}_{21}\,PQ6\mathit{Quitstatus}\mid ;\end{array}$$

$$\begin{array}{l}{\widehat{Y}}_{1}^{ST}=PQ6\mathit{Quitstatus}+{\widehat{\beta}}_{20}+{\widehat{\beta}}_{21}\,\mathit{motivation}+{\widehat{\beta}}_{22}\,\mathit{source}+{\widehat{\beta}}_{23}\,\mathit{selfefficacy}\\ +{\widehat{\beta}}_{24}\,\mathit{story}+{\widehat{\beta}}_{25}\,\mathit{education}+{\widehat{\beta}}_{26}\,PQ6\mathit{Quitstatus}\\ +{\widehat{\beta}}_{27}\,\mathit{source}\ast \mathit{selfefficacy}+{\widehat{\beta}}_{28}\,\mathit{story}\ast \mathit{education}\\ +\mid {\widehat{\psi}}_{20}+{\widehat{\psi}}_{21}\,PQ6\mathit{Quitstatus}\mid \cdot {\left(1-\frac{3\,\text{Var}({\widehat{\psi}}_{20}+{\widehat{\psi}}_{21}\,PQ6\mathit{Quitstatus})}{\mid {\widehat{\psi}}_{20}+{\widehat{\psi}}_{21}\,PQ6\mathit{Quitstatus}{\mid}^{2}}\right)}^{+};\end{array}$$

- for each of the two pseudo-outcomes, a stage 1 regression (*n* = 1401) of the pseudo-outcome using a model of the form:

$$\begin{array}{l}{\widehat{Y}}_{1}\ \text{or}\ {\widehat{Y}}_{1}^{ST}={\beta}_{10}+{\beta}_{11}\,\mathit{motivation}+{\beta}_{12}\,\mathit{selfefficacy}+{\beta}_{13}\,\mathit{education}\\ +\left({\psi}_{10}^{(1)}+{\psi}_{11}^{(1)}\,\mathit{selfefficacy}\right)\ast \mathit{source}\\ +\left({\psi}_{10}^{(2)}+{\psi}_{11}^{(2)}\,\mathit{education}\right)\ast \mathit{story}+{\epsilon}_{1}.\end{array}$$
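The three steps above can be illustrated with a reduced Python sketch. It is not the analysis code for the smoking cessation data: it uses simulated data, a much smaller set of covariates (a single baseline covariate `o1`, a stage 1 treatment `a1`, a stage 1 outcome `y1`, and a stage 2 treatment `a2`, all hypothetical names), ordinary least squares via NumPy, and it assumes every subject continues to stage 2. The variance of the estimated stage 2 contrast is taken from the usual OLS covariance matrix, which is also an assumption of this sketch.

```python
import numpy as np

def ols(X, y):
    """Least squares fit; returns coefficients and their estimated covariance."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return coef, sigma2 * np.linalg.inv(X.T @ X)

def q_learning_two_stage(o1, a1, y1, a2, y2, lam=3.0):
    n = len(y2)
    ones = np.ones(n)
    # Step 1: stage 2 regression with main-effect and treatment-interaction terms.
    H20 = np.column_stack([ones, o1, a1, y1])
    H21 = np.column_stack([ones, y1])
    coef2, cov2 = ols(np.column_stack([H20, H21 * a2[:, None]]), y2)
    beta2, psi2 = coef2[:4], coef2[4:]
    # Step 2: hard-max and soft-threshold pseudo-outcomes.
    contrast = H21 @ psi2
    var_contrast = np.einsum("ij,jk,ik->i", H21, cov2[4:, 4:], H21)
    with np.errstate(divide="ignore", invalid="ignore"):
        shrink = np.where(contrast != 0,
                          1.0 - lam * var_contrast / contrast**2, 0.0)
    shrink = np.maximum(shrink, 0.0)
    base = y1 + H20 @ beta2
    pseudo = {"hard_max": base + np.abs(contrast),
              "soft_threshold": base + np.abs(contrast) * shrink}
    # Step 3: stage 1 regression of each pseudo-outcome; keep (psi_10, psi_11).
    X1 = np.column_stack([ones, o1, a1, o1 * a1])
    psi1 = {name: ols(X1, y)[0][2:] for name, y in pseudo.items()}
    return pseudo, psi1

# Simulated data with no stage 2 treatment effect (a nonregular setting).
rng = np.random.default_rng(0)
n = 500
o1 = rng.normal(size=n)
a1 = rng.choice([-1.0, 1.0], size=n)
y1 = (rng.random(n) < 0.4).astype(float)
a2 = rng.choice([-1.0, 1.0], size=n)
y2 = 0.2 + 0.1 * o1 + 0.1 * a1 + 0.3 * y1 + rng.normal(scale=0.5, size=n)

pseudo, psi1 = q_learning_two_stage(o1, a1, y1, a2, y2)
print(psi1)
```

Because the shrinkage factor lies between 0 and 1, the soft-threshold pseudo-outcome never exceeds the hard-max pseudo-outcome for any subject.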

Note that the sample sizes at the two stages differ because only 281 subjects were willing to continue treatment into stage 2 (as allowed by the study protocol). Our stage 2 analysis was a usual regression analysis. No significant treatment effect was found at this stage, indicating the likely existence of nonregularity. At stage 1, for either estimator, 95% confidence intervals were constructed by hybrid bootstrap using 1000 bootstrap replications. The stage 1 analysis summary is presented in Table 4. In this case, the hard-max and the soft-threshold estimators produced similar results.

The conclusions from the present data analysis can be summarized as follows. We did not find any significant stage 2 treatment effect. So this analysis suggests that the stage 2 behavioral intervention need not be adapted to the smoker’s individual characteristics, interventions previously received, or stage 1 outcome. More interesting results are found at stage 1. It is found that subjects with a higher level of `motivation` or `selfefficacy` are more likely to quit. The highly personalized level of `source` is more effective for subjects with a higher `selfefficacy` (≥ 7), and the deeply tailored level of `story` is more effective for subjects with lower `education` (≤ high school); these two conclusions can be drawn from the interaction plots (with confidence intervals) presented in Figure 2. Thus this secondary data analysis suggests that to maximize each individual’s chance of quitting over the two stages, the web-based smoking cessation intervention should be designed in the future such that: (1) smokers with high `selfefficacy` (≥ 7) are assigned to the highly personalized level of `source`, and (2) smokers with lower `education` are assigned to the deeply tailored level of `story`.

## 6 Discussion

In this paper, we have illustrated the problem of nonregularity that arises in the context of DTRs in the estimation of the optimal current treatment rule, when the optimal treatments at subsequent stages are non-unique for at least some proportion of subjects in the population. We have illustrated the phenomenon using Q-learning as the estimation procedure, which is a simpler yet inefficient version of Robins’ method; however the problem of nonregularity arises in Robins’ method as well^{11}^{,}^{17}.

For some underlying data-generating models (e.g., Examples 3, 4, 5 in the simulation study), this nonregularity induces bias in the point estimates of the parameters of the optimal DTRs, which in turn causes under-coverage of the bootstrap confidence intervals. In contrast, in case of Examples 1 and 2, this nonregularity causes lightness of tail of the asymptotic distribution but no bias, as seen from the over-coverage of the percentile bootstrap CIs (equivalently conservative tests leading to lower power). The coexistence of these two not-so-well-related issues (they work in opposite directions, e.g., bias tends to make the CIs under-cover, whereas lightness of tail tends to make the CIs over-cover) makes this problem a unique and challenging one.

As mentioned in section 2.4, the phenomenon of nonregularity can be understood more clearly with a simpler problem, namely estimating |*μ*| (note that *ψ*_{10} is a linear combination of terms like |*μ*|; see section 4) by $\mid {\overline{X}}_{n}\mid$ (similar to the hard-max estimator), where ${\overline{X}}_{n}$ is the sample average of *n* i.i.d. observations *X*_{1},…, *X*_{*n*} from *N*(*μ*, 1). From section 2.4, we know that when *μ* = 0, $\mid {\overline{X}}_{n}\mid$ is a biased estimator of |*μ*| = 0, with $\mathit{bias}=E(\mid {\overline{X}}_{n}\mid )=\sqrt{{\scriptstyle \frac{2}{n\pi}}}$. Because of this bias (wrong centering), bootstrap CIs exhibited gross under-coverage in a toy example. But once we used a bias-corrected estimate (corrected by the analytically calculated bias), the percentile bootstrap CI exhibited over-coverage. This suggests that the distribution of $\mid {\overline{X}}_{n}\mid$ is peaked around its mean, or to put it in another way, has light tails.

In the simulation study to compare the competing estimators of the optimal DTR, we considered estimation of *ψ*_{10}, which involves linear combinations of |*f*_{1}|, |*f*_{2}|, |*f*_{3}|, and |*f*_{4}| (terms like |*μ*|). Under the non-regular scenarios, some or all (depending on the degree of nonregularity *p*) of the *f*_{*i*}’s are zero; and hence a phenomenon similar to the one described above in the toy example happens for each |*f*_{*i*}| for which *f*_{*i*} = 0. Each such term has its associated bias, and each has its own lightness of tail, with bias being the dominant property. In some nonregular scenarios (Example 1), the biases associated with the individual |*f*_{*i*}|’s (in the expression for *ψ*_{10}) cancel each other out (note the opposite signs in front of the |*f*_{*i*}|’s), and hence the lightness of tail is revealed, resulting in a percentile bootstrap CI that over-covers. In other nonregular examples, however, bias is not canceled out, and hence dominates the property of the hard-max estimator. Hence under-coverage of the bootstrap CIs is observed.

Nonregularity is an issue in the estimation of the optimal DTRs because it arises when there is no treatment effect at subsequent stages (equivalently, there is no unique optimal treatment at subsequent stages). Unfortunately often there is no or very weak treatment effect in the settings we are interested in (e.g., randomized trials on mental illness or substance abuse). Thus we want our estimator to enjoy good statistical properties (e.g., less bias, lower risk or MSE, correct coverage rate of CIs, good power to detect “local” alternatives, etc.) when the optimal treatment at subsequent stages is non-unique. In case of the hard-max estimator, unfortunately the point of non-differentiability coincides with the parameter value such that ${\psi}_{2}^{T}{H}_{21}=0$ (non-unique optimal treatment at the subsequent stage), which causes nonregularity (bias, higher MSE, low power). But the soft-threshold estimator (also, hard-threshold estimator), in some sense, redistributes the nonregularity from this “null point” to two different points symmetrically placed on either side of the “null point” (see Figure 1). This is one reason why the soft-threshold (also, hard-threshold) estimator works well in nonregular settings.
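The bias in the toy problem of estimating |*μ*| by the absolute sample mean can be verified with a short Monte Carlo experiment. The Python sketch below is illustrative only; the sample size, number of replications, seed, and function name are arbitrary choices made here, not values from the paper.

```python
import numpy as np

def mc_bias_abs_mean(n, reps, rng):
    """Monte Carlo estimate of E|X_bar_n| when X_1,...,X_n are i.i.d. N(0, 1)."""
    samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
    return np.abs(samples.mean(axis=1)).mean()

rng = np.random.default_rng(2024)
n, reps = 50, 20000

mc_bias = mc_bias_abs_mean(n, reps, rng)
theory = np.sqrt(2.0 / (n * np.pi))  # analytic bias when mu = 0

print(f"Monte Carlo bias: {mc_bias:.4f}, sqrt(2/(n*pi)): {theory:.4f}")
```

The bias shrinks at rate 1/√*n*, the same rate as the standard error of the sample mean, which is why it does not become negligible relative to the width of a confidence interval as *n* grows.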

We have shown that using bootstrap confidence intervals along with the soft-threshold (also, hard-threshold in some cases) estimator reduces the degree of nonregularity, and gives correct coverage rate. Also, the double bootstrap method can be used along with the original hard-max estimator to address the nonregularity. But this method is highly computationally intensive and may be difficult to use in practice. An alternative method to construct CIs for *ψ*’s in nonregular settings is the score method due to Robins^{11}. We have not investigated this in our simulation study.

One can consider an alternative Bayesian approach to formulate an estimator similar to the soft-threshold estimator as follows. Let the data distribution be ${\widehat{\psi}}_{2}^{T}{H}_{21}\mid {\psi}_{2}^{T}{H}_{21}\sim N({\psi}_{2}^{T}{H}_{21},{\sigma}^{2})$ with known *σ*^{2}, and the prior distribution of ${\psi}_{2}^{T}{H}_{21}$ be a mixture of a point mass at 0 and *N*(0, 1), with mixing parameter *p* (0 < *p* < 1). Then the posterior distribution of ${\psi}_{2}^{T}{H}_{21}$ is a mixture distribution given by

One can use the median of this posterior distribution in place of ${\widehat{\psi}}_{2}^{T}{H}_{21}$ in the expression for *Ŷ*_{1}. Thus the Bayes estimator becomes

To use this, one has to replace *σ*^{2} by ${\widehat{\sigma}}^{2}={H}_{21}^{T}{\widehat{\mathrm{\Sigma}}}_{2}{H}_{21}/n$, and *p* by either some empirical estimate or a fixed value (e.g., ${\scriptstyle \frac{1}{2}}$). In place of the above mixture prior, Johnstone and Silverman^{37} suggest using the mixture of a point mass and a heavy-tailed distribution (e.g., double-exponential). This is a promising formulation that we want to investigate in the future.

In this paper, we have focused on randomized trials only to separate the issue of nonregularity from causal inference issues. However the problem of nonregularity also arises when observational data^{11}^{,}^{17} are used; and the hard-threshold and the soft-threshold estimators should be applicable in those settings as well. Also, here we have focused on only two stages for clarity. However, it should be understood that Q-learning can be used for studies with more than two stages as well. In case of many stages, one can think of a scenario where some parameters are shared across stages, in which case a simultaneous version of Q-learning (as opposed to the recursive version discussed in this paper) would be more appropriate. Unfortunately nonregularity does not go away if a simultaneous estimation procedure is used; see Moodie and Richardson^{17} for a discussion on this with reference to Robins’ method. However, unlike the case of recursive estimation, it is not well understood at this point whether the threshold estimators (hard or soft) can reduce the nonregularity in simultaneous estimation. Moodie and Richardson^{17} gave a simulated nonregular example showing that the hard-threshold or ZIPI estimator is not always better than the simultaneous estimator of Robins. We did not investigate this issue in the current paper, but we recognize this as an important avenue of future research.

To conclude, we think that in the estimation of optimal DTRs, the appropriately tuned hard-threshold estimator and the soft-threshold estimator should be seriously considered as improved versions of Q-learning (and Robins’ method of estimation).

## Acknowledgments

We acknowledge support for this project from National Institutes of Health grants RO1 MH080015, P50 DA10075, and P50 CA101451.

## Appendix A: Proof of Lemma 1

#### Proof

Define the *advantage* at stage *j* as

$${\mu}_{j}({H}_{j},{A}_{j})={Q}_{j}({H}_{j},{A}_{j})-\underset{{a}_{j}}{\text{max}}\,{Q}_{j}({H}_{j},{a}_{j}).$$

Note that *μ*_{*j*}(*H*_{*j*}, *A*_{*j*}) represents the expected difference in outcome when using *A*_{*j*} instead of the optimal treatment at stage *j*, for subjects with treatment and covariate history *H*_{*j*} who receive the optimal DTR at stages subsequent to *j*. According to Robins^{11} (p. 201), this is simply the *blip* function with $\text{arg}\,{\text{max}}_{{a}_{j}}{Q}_{j}({H}_{j},{a}_{j})$ as the reference treatment. Below we will establish the connection between Q-learning and Robins’ method using the advantage function; one can derive the connection using other blip functions (other choices of reference treatment) following similar steps. When Q-functions are modeled as in (2), the advantages become

$${\mu}_{j}({H}_{j},{A}_{j};{\psi}_{j})={\psi}_{j}^{T}{H}_{j1}{A}_{j}-\mid {\psi}_{j}^{T}{H}_{j1}\mid ,\quad j=1,2.\qquad (12)$$

Since by condition (i), no parameters are shared across stages, we will proceed stage by stage, starting with stage 2, doing recursive (rather than simultaneous) estimation. The notation ${\mathbb{P}}_{n}$ will be used below to denote the empirical average over a sample of size *n*. Also, define *m*_{1}(*H*_{1}) = *E*[*Q*_{1}(*H*_{1}, *A*_{1})|*H*_{1}] and *m*_{2}(*H*_{2}) = *E*[*Q*_{2}(*H*_{2}, *A*_{2})|*H*_{2}].

##### Stage 2

At stage 2, Q-learning is a usual least squares regression problem. Thus the estimating equations are given by

From (13), it follows that

where ${\widehat{\psi}}_{2}$ is the estimate of *ψ*_{2} satisfying (13). Thus ${\widehat{\psi}}_{2}$ satisfies the estimating equation

On the other hand, the stage 2 estimating equation for Robins’ method (Robins^{11}, p. 211) is given by

where the conditional variance *Var*(*Y*_{2} − *μ*_{2}(*H*_{2}, *A*_{2}; *ψ*_{2}) − *E*[*Y*_{2} − *μ*_{2}(*H*_{2}, *A*_{2}; *ψ*_{2})|*H*_{2}]|*H*_{2}, *A*_{2}) is omitted (this is one of the reasons why Q-learning is an inefficient version). Note that *E*[*H*_{21}*A*_{2}|*H*_{2}] = 0, by condition (ii) of the lemma. From (12), ${\mu}_{2}({H}_{2},{A}_{2};{\psi}_{2})={\psi}_{2}^{T}{H}_{21}{A}_{2}-\mid {\psi}_{2}^{T}{H}_{21}\mid $. Then $E[{\mu}_{2}({H}_{2},{A}_{2};{\psi}_{2})\mid {H}_{2}]=-\mid {\psi}_{2}^{T}{H}_{21}\mid $, again by condition (ii). Also,

Therefore, ${Y}_{2}-{\mu}_{2}({H}_{2},{A}_{2};{\psi}_{2})-E[{Y}_{2}-{\mu}_{2}({H}_{2},{A}_{2};{\psi}_{2})\mid {H}_{2}]={Y}_{2}-{m}_{2}({H}_{2})-{H}_{21}^{T}{A}_{2}{\psi}_{2}$. Thus, ${\widehat{\psi}}_{2}$ in Robins’ method solves the following reduced version of (15):

for any choice of *m*_{2}(*H*_{2}) (with the conditional variance omitted). In particular, for ${m}_{2}({H}_{2})={H}_{20}^{T}{\widehat{\beta}}_{2}$, where ${\widehat{\beta}}_{2}$ is given by (14), this estimating equation exactly matches that of Q-learning.

##### Stage 1

For Q-learning, the stage 1 pseudo-outcome is

and so the estimating equations are given by

Now from (13)

Since by condition (iii) of the lemma, $({H}_{10}^{T},{H}_{11}^{T}{A}_{1})\subset {H}_{20}^{T}$, it follows that

Solving for *β*_{1} gives,

Thus ${\widehat{\psi}}_{1}$ satisfies

On the other hand for Robins’ method, the stage 1 pseudo-outcome (Robins^{11}, p. 208; see also Moodie and Richardson^{17}) is *Ỹ*_{1} = *Y*_{1} + *Y*_{2} − *μ*_{2}(*H*_{2}, *A*_{2}), and so the stage 1 estimating equation (Robins^{11}, p. 211) is given by

where again the conditional variance *Var*(*Ỹ*_{1} − *μ*_{1}(*H*_{1}, *A*_{1}; *ψ*_{1}) − *E*[*Ỹ*_{1} − *μ*_{1}(*H*_{1}, *A*_{1}; *ψ*_{1})|*H*_{1}]|*H*_{1}, *A*_{1}) is omitted. Note that *E*[*H*_{11}*A*_{1}|*H*_{1}] = 0, by condition (ii) of the lemma. From (12), ${\mu}_{1}({H}_{1},{A}_{1};{\psi}_{1})={\psi}_{1}^{T}{H}_{11}{A}_{1}-\mid {\psi}_{1}^{T}{H}_{11}\mid $. Then $E[{\mu}_{1}({H}_{1},{A}_{1};{\psi}_{1})\mid {H}_{1}]=-\mid {\psi}_{1}^{T}{H}_{11}\mid $, again by condition (ii). Also,

Finally, plug in ${Y}_{1}+{Y}_{2}-{\mu}_{2}({H}_{2},{A}_{2};{\widehat{\psi}}_{2})$ for *Ỹ*_{1}. Thus, ${\widehat{\psi}}_{1}$ in Robins’ method solves the following reduced version of (21):

for any choice of *m*_{1}(*H*_{1}) (again omitting the conditional variance). In particular, for ${m}_{1}({H}_{1})={H}_{10}^{T}{\widehat{\beta}}_{1}$, where ${\widehat{\beta}}_{1}$ is given by (20), this estimating equation exactly matches that of Q-learning.

In summary, the Q-learning algorithm as presented here is inefficient because: (a) it sets the conditional variances to be constant over (*H*_{*j*}, *A*_{*j*}), and (b) it uses *H*_{*j*1}*A*_{*j*} instead of the “efficient choice” of the term *S*_{*eff*;*j*} (that attains the semiparametric variance bound) in Robins’ estimating equation (see Robins^{11}, p. 212; more details in Robins^{38}).

## Appendix B: Proof of Lemma 2

#### Proof

To estimate the hyper-parameter *ϕ*^{2}, first integrate out *μ* to get the marginal likelihood *X*|*ϕ*^{2} ~ *N*(0, *ϕ*^{2} + *σ*^{2}). The corresponding Jeffreys prior on the variance parameter is *p*(*ϕ*^{2}) ∝ 1/(*ϕ*^{2} + *σ*^{2}). Based on this formulation, the posterior distribution of *ϕ*^{2} is given by

Hence the posterior mode of *ϕ*^{2} is

Given ${\varphi}^{2}=\widehat{{\varphi}^{2}}$, now we will consider the data likelihood *X*|*μ* ~ *N*(*μ*, *σ*^{2}) along with the prior *μ*|*ϕ*^{2} ~ *N*(0, *ϕ*^{2}) to derive an empirical Bayes estimator for |*μ*|. It is easy to show that the posterior distribution of *μ* is given by

Now under the squared error loss, the Bayes estimator of |*μ*| is *E*_{*μ*|*X*}(|*μ*|), which can be calculated using (23). If *Y* ~ *N*(*θ*, *τ*^{2}), then *E*|*Y*| is given by:

In the present problem,

From (22), we get

Thus an empirical Bayes estimator of |*μ*| is given by

## References

*m* in the *m* out of *n* bootstrap and confidence bounds for extrema. Statistica Sinica. 2008;18(3):967–985.
