# Constructing evidence-based treatment strategies using methods from computer science

J. Pineau^{a,1}, Marc G. Bellemare^{a}, A. John Rush^{b}, Adrian Ghizaru^{a}, and Susan A. Murphy^{c}

^{a}McGill University, School of Computer Science 318 McConnell Eng., 3480 University st Montreal, QC H3A 2A7 CANADA

^{b}University of Texas, Southwestern Medical Center 323 Harry Hines Blvd Dallas, Texas 75390 USA

^{c}University of Michigan, Institute for Social Research 426 Thompson St. Ann Arbor, MI 48106 USA

^{1}Corresponding author: J. Pineau, School of Computer Science, McGill University, McConnell Eng Bldg Rm 318, 3480 University, Montreal QC H3A 2A7 CANADA. Tel.: +1-514-398-5432; fax: +1-514-398-3883. Email: jpineau@cs.mcgill.ca

## Abstract

This paper details a new methodology, *instance–based reinforcement learning,* for constructing adaptive treatment strategies from randomized trials. Adaptive treatment strategies are operationalized clinical guidelines which recommend the next best treatment for an individual based on his/her personal characteristics and response to earlier treatments. The instance-based reinforcement learning methodology comes from the computer science literature, where it was developed to optimize sequences of actions in an evolving, time varying system. When applied in the context of treatment design, this method provides the means to evaluate both the therapeutic and diagnostic effects of treatments in constructing an adaptive treatment strategy. The methodology is illustrated with data from the STAR*D trial, a multi-step randomized study of treatment alternatives for individuals with treatment-resistant major depressive disorder.

**Keywords:** clinical decision-making, methodology, treatment, sequential decisions, learning

## 1.0 Introduction

In the treatment of substance abuse and other mental disorders, there is frequently large heterogeneity in response to any one treatment. As a rule, clinicians must often try a series of treatments in order to obtain a response. Furthermore these disorders are often chronic, requiring clinical treatment over a long-term period with options for altering or switching treatment when side effects or relapse on treatment occur. Consequently, the best clinical care requires adaptive changes in the duration, dose, or type of treatment over time. Clinical guidelines individualize treatment by recommending treatment type, dosage, or duration depending on the patient’s history of treatment response, adherence, and burden. *Adaptive treatment strategies* (Lavori and Dawson, 1998, 2003; Lavori et al., 2000; Murphy, 2003; Murphy and McKay, 2004; Collins et al., 2004) are operationalized clinical guidelines. These strategies consist of a sequence of “decision rules” which individually tailor a sequence of treatments for individual patients. The decision rules are defined through two components: an input and an output. The input includes information about a patient (e.g., baseline features such as age, concurrent disorders), as well as the outcomes of present or prior treatments (e.g. severity of side effects, etc.). The output consists of one or more recommended treatment options.

Adaptive treatment strategies are different from standard treatments in two ways. First, such strategies consider treatment sequences (as opposed to a single treatment). Such a consideration is essential when the initial treatments lack sufficient efficacy or are not tolerated, or when relapse is common. Second, such strategies consider time-varying outcomes to determine which of several possible next treatments is best for which patients. The overall goal is to improve longer-term outcomes, as opposed to focusing on only short-term benefits, for patients with chronic disorders.

This paper introduces and describes a novel methodology, *instance-based reinforcement learning* (Ormoneit and Sen, 2002; Sutton and Barto, 1998), for constructing useful adaptive treatment strategies from data collected during randomized trials. Reinforcement learning was originally inspired by the trial-and-error learning studied in the psychology of animal learning (thus the term “learning”). In this setting, good actions by the animal are positively reinforced and poor actions are negatively reinforced (thus the term “reinforcement”). Reinforcement learning was formalized in computer science and operations research by researchers interested in sequential decision-making for artificial intelligence and robotics, where there is a need to estimate the usefulness of taking sequences of actions in an evolving, time varying, system (Sutton and Barto, 1998). Reinforcement learning methods differ from standard statistical methods in that these methods can be used to evaluate a given treatment based on the immediate and longer term effects of this treatment in a treatment sequence. Furthermore these methods provide the means to evaluate both the therapeutic and diagnostic effects of treatment. The diagnostic effect of a treatment is the treatment’s ability to elicit informative patient responses that permit the clinician to better match the subsequent treatment to the patient. Both diagnostic and therapeutic effects are crucial when evaluating the usefulness of a treatment in an adaptive treatment strategy.

We consider the use of reinforcement learning to analyze data from studies in which patients are randomized multiple, sequential, times (see Stone et al., 1995; Tummarello et al., 1997; Schneider et al., 2001, Fava et al., 2003; Stroup et al., 2003 for examples). Such studies are known as Sequential Multiple Assignment Randomized Trials (SMART) (Murphy, 2005). Readers unfamiliar with SMART studies are encouraged to initially read Murphy et al. (in this issue) for an introduction.

In the following we introduce reinforcement learning and illustrate the concepts using a simple hypothetical SMART study on alcohol dependence. We then provide early results from its use in constructing treatment strategies from a recently completed SMART study called the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial (www.star-d.org) (Fava et al. 2003; Rush et al. 2004). STAR*D is the largest study of major depressive disorder. It was designed as a sequenced multi-step randomized clinical trial of patients with major depressive disorder, with the specific goal of comparing treatments for depressions that have not remitted after one, two, or even three antidepressant treatments. Finally we discuss the methodological and practical challenges of applying similar techniques for constructing treatment strategies to a broader class of chronic disorders, including substance abuse.

## 2.0 Adaptive Treatment Strategies and Reinforcement Learning

To discuss reinforcement learning, we first review the definition of an adaptive treatment strategy. Throughout, we use terms likely to be familiar to clinician researchers. For those who would like to follow up this introduction in the more technical literature, we provide the analogous terms used by the computer science community in parentheses.

Adaptive treatment strategies (*policies* in computer science) are composed of a series of decision rules, one per treatment step. Decision rules are typically of the form:

- IF baseline assessments = Z_{0}
- AND prior treatments at Steps 1 through t = {A_{1}, …, A_{t}}
- AND clinical assessments at Steps 1 through t = {Z_{1}, …, Z_{t}}
- THEN at Step t+1 apply treatment A_{t+1}

To make the above more concrete consider the following hypothetical treatment strategy for alcohol dependence.

Example 1. Following graduation from an intensive outpatient program, alcohol dependent patients are provided naltrexone. In the ensuing two months patients are monitored; if at any time the patient experiences 5 or more heavy drinking days (a nonresponder), then the medication is switched to acamprosate. If the patient reports no more than 4 heavy drinking days during the two-month period (a responder), then the patient is continued on naltrexone and monitored monthly for signs of relapse.

Here there are two treatment steps; that is, the initial treatment (naltrexone) and the second treatment for responders/nonresponders.

The observations (the *state* in computer science) at each step are the baseline patient characteristics, clinical assessments (including outcomes observed during prior treatment) and prior treatments. The state need contain only those patient or prior treatment features that are most useful for predicting the individual’s time-dependent outcomes. In the second treatment step, the state corresponds to the responder/nonresponder status; in more complex decision rules, the state might also include the pre-treatment drinking score and a measure of adherence to the initial treatment, etc. Collins et al. (2004) call the variables entering the decision rules *tailoring variables*. That is, a decision rule takes the tailoring variables as input and outputs one or more treatment options (*actions* in computer science). Note that the second decision rule in Example 1 outputs the treatment acamprosate if the patient is a nonresponder and outputs the treatment “monthly monitoring for relapse” if the patient is a responder. That is, the second decision rule is:

- IF clinical assessment = {nonresponder}
- THEN at Step t+1 apply treatment {acamprosate}
- ELSE IF clinical assessment = {responder}
- THEN at Step t+1 apply treatment {naltrexone + monthly monitoring}.

In general, the *IF* and *ELSE IF* part of the rule contains patient tailoring variables, and the treatment option is expressed in the *THEN* portion of the rule.^{2} To construct high quality treatment strategies, we focus on selecting good decision rules, which can then be concatenated based on the patient’s outcome at each step of treatment.
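As a minimal sketch, the second decision rule of Example 1 can be written as a small function. The names `Step1Outcome`, `is_responder`, and `second_decision_rule` are illustrative assumptions, not part of the original study; only the 5-day threshold and the two treatment outputs come from the example.

```python
from dataclasses import dataclass

# Threshold taken from Example 1: 5 or more heavy drinking days means nonresponse.
NONRESPONSE_THRESHOLD = 5


@dataclass
class Step1Outcome:
    """Clinical assessment after the initial two months on naltrexone (hypothetical type)."""
    heavy_drinking_days: int


def is_responder(outcome: Step1Outcome) -> bool:
    """A responder reports no more than 4 heavy drinking days."""
    return outcome.heavy_drinking_days < NONRESPONSE_THRESHOLD


def second_decision_rule(outcome: Step1Outcome) -> str:
    """IF nonresponder THEN acamprosate; ELSE IF responder THEN continue naltrexone."""
    if is_responder(outcome):
        return "naltrexone + monthly monitoring"
    return "acamprosate"


print(second_decision_rule(Step1Outcome(heavy_drinking_days=7)))
print(second_decision_rule(Step1Outcome(heavy_drinking_days=2)))
```

Here the tailoring variable is the count of heavy drinking days, and the returned string plays the role of the treatment option (*action*).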

As always, one operationalizes the goal of the treatment via the selection of primary outcomes. In reinforcement learning this is achieved via the concept of *reward*, with higher rewards being more favorable. In example 2, a simple reward might be the percentage of non-drinking days. The rewards may incorporate symptom severity, side-effect burden, cost, clinician and patient preferences, etc. In general when there are multiple important outcomes, a natural approach to forming the rewards is to use preference elicitation techniques (Froberg and Kane, 1989). Since this paper focuses on the *methodology* for constructing adaptive treatment strategies, we utilize simple rewards (see Section 4.2 below for an example).

Murphy et al. (in this issue) provide a series of simple analyses that can be used with data from a SMART study to test for the main effect of the treatment at each treatment step and to test if patient outcomes while on initial treatment interact with the main effect of the treatment. Although these analyses provide insight into the effects of treatment sequences, they do not generally provide the necessary information for selecting the decision rules in the strategy. This is because the methods in Murphy et al. (in this issue) do not fully incorporate the usefulness of a treatment both in terms of its immediate effects on a patient and in terms of longer term effects. Methods from reinforcement learning, however, do exactly this in a principled way. Furthermore, reinforcement learning does not evaluate the main effect of treatment but rather is used to evaluate the effect of treatment when subsequent treatment is matched to the patient. To discuss these points, we consider the hypothetical SMART study given in Example 2; in this figure R denotes randomization to one of two treatment options.

This hypothetical SMART study includes two randomizations, one per step of treatment. In the first step, patients are randomly assigned to Medication A (MedA, e.g. naltrexone) or Medication A with cognitive behavioral therapy (CBT). After two months of this treatment, patient response is evaluated. The intermediate outcome considered is a summary of drinking behavior over time (e.g. Nonresponder if 5+ heavy drinking days within the two-month period). If the patient exhibits an early response (e.g. fewer than 5 heavy drinking days over two months), then the second randomization is to one of two maintenance treatments, either TDM or low level monitoring for relapse (Monitoring). Otherwise, as soon as the patient exhibits signs of nonresponse (5+ heavy drinking days within the two-month period), the patient is randomized between Medication B (MedB, e.g. acamprosate) and the combination: MedB plus a therapy designed to improve motivation to adhere to treatment (EM) plus CBT.

Reinforcement learning methods incorporate the usefulness of a treatment both in terms of its immediate effects on a patient and in terms of longer-term effects via the concept of the *value* of a treatment. The value of a treatment includes both the reward for applying the treatment and the rewards that will be obtained by subsequently applying the best possible treatment sequence. A given treatment can have high value if it has a good immediate reward, but it can also have high value if it produces outcomes such that subsequent treatments will have good rewards. In essence, the value of a treatment includes not only how the patient responds immediately to the treatment but also how well the treatment sets the patient up for future success with subsequent treatments. The following examples emphasize the need to consider the value as opposed to the immediate reward in constructing treatment strategies. Both use terms from Example 2.
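For a two-step study, the notion of value can be written compactly. The symbols below are our own shorthand, assumed for illustration: $S_t$ is the state at Step $t$ (which may include the earlier treatment), $A_t$ the treatment, and $R_t$ the reward.

```latex
\begin{align*}
Q_2(s_2, a_2) &= \mathbb{E}\left[ R_2 \mid S_2 = s_2,\ A_2 = a_2 \right], \\
V_2(s_2)      &= \max_{a_2} Q_2(s_2, a_2), \\
Q_1(s_1, a_1) &= \mathbb{E}\left[ R_1 + V_2(S_2) \mid S_1 = s_1,\ A_1 = a_1 \right].
\end{align*}
```

The first treatment is thus scored by its own reward plus the value of the state it leads to, which is why a treatment with a modest immediate reward can still have the highest value.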

Example 3. Suppose that MedA achieves a higher or equal proportion of responders as MedA+CBT initially, but MedA+CBT leads to higher overall response rates throughout the duration of the study. This is plausible if, for example, MedA+CBT results in improved (over MedA alone) adherence to subsequent treatment by nonresponders.
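The arithmetic behind Example 3 can be sketched with made-up numbers. Every rate below is invented for illustration and does not come from any trial; the sketch also assumes, for simplicity, that initial responders remain responders.

```python
# All numbers below are invented for illustration; none come from a real trial.
initial_response = {"MedA": 0.50, "MedA+CBT": 0.45}
# Response rate to the best second-step treatment among initial nonresponders.
later_response_among_nonresponders = {"MedA": 0.20, "MedA+CBT": 0.50}


def overall_response(treatment: str) -> float:
    """Probability of responding at Step 1 or, failing that, at Step 2."""
    p1 = initial_response[treatment]
    p2 = later_response_among_nonresponders[treatment]
    return p1 + (1.0 - p1) * p2


# Treatment with the best immediate reward versus the best overall value.
best_immediate = max(initial_response, key=initial_response.get)
best_overall = max(initial_response, key=overall_response)
print(best_immediate, best_overall)
```

With these hypothetical rates, MedA wins on the immediate outcome while MedA+CBT wins over the whole sequence, which is the situation Example 3 describes.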

Another, more subtle, example follows.

Example 4. Suppose, conversely, that the initial response rates due to MedA+CBT are higher than MedA alone. The greater initial response rates may be due to the fact that some patients need more support to be adherent to their medication and CBT provides this support. It is tempting to conclude that it is better to start patients off with MedA+CBT. This conclusion may be false if the information that patients need more support (signaled via the lack of adherence to initial medication) is crucial in pinpointing those initial responders for whom a more intensive maintenance treatment is more successful at preventing relapse.

The methods in reinforcement learning appropriately take into account the tradeoff between present and future rewards in constructing treatment strategies. And these methods select treatments not only based on their likelihood of resulting in a good immediate outcome, but also for their *diagnostic* value (i.e. as in Example 4 to provide information about patient type, thus allowing better selection of subsequent treatments). The necessity of evaluating a treatment (*action*) in terms of both its immediate effects and in terms of the treatment’s ability to enhance the effectiveness of future treatments is well known in the field of sequential decision making (Bellman, 1957).

## 3.0 Instance-Based Reinforcement Learning

To understand instance-based reinforcement learning, it is useful to conceptualize the data from the SMART trial as if the data were a databank. Then when a new patient presents, one searches the databank for similar patients and selects the decision rule that produced the highest value (i.e. worked best) for these similar patients (hence the term “instance-based;” one learns what to do by considering similar instances). The important issue is the choice of an appropriate measure of similarity between patients. Once a measure of similarity is selected the next step in constructing the decision rules in the strategy involves regression. There is one regression per randomization. In these regressions, the independent variables include the state variables and treatments (*actions*) available at each randomization. The dependent variable is the value. In order for this dependent variable to include the value of future treatments (see discussion of *value* above), the regressions (one at each treatment step) are conducted moving backwards through the treatment steps. The treatment options (*actions*) that maximize the regression function at a given step are the best treatments given an individual’s state. These treatment options are the output of the decision rule. The inputs to the decision rule are those state variables (e.g. tailoring variables) that in the regression function interact sufficiently with the treatment that different values of the tailoring variables correspond to differing best treatments. See Murphy, et al (2006) for an intuitive discussion using standard (linear) regression.

Kernel regression (Ormoneit and Sen, 2002) is used in instance-based reinforcement learning since kernel regression provides a natural way to incorporate measures of patient similarity (via the choice of the kernel called a similarity function in this context). The similarity function measures the similarity between patients via their values on the independent variables. In the analysis below we assume that similarity at Step *t* between two patients is measured by the similarity in their treatment history, as well as similarity in their state. If the patients received different treatments prior to Step *t* then their similarity is zero. If the patients received the same treatments prior to Step *t* then their similarity is the distance between their states (in the alcohol drinking example this might be the distance between patients’ drinking scores during prior steps)^{3}. As discussed earlier, other variables (e.g. demographics, pretreatment variables) might also be useful as state variables to meaningfully quantify similarity between patients. The goal of this similarity measure is to generalize from patients observed in the SMART study, to other patients. For example, consider a clinician presented with a patient who has had 12 heavy drinking days in the past two months and suppose that by our similarity measure patients with 11–13 heavy drinking days in the prior two months are considered similar. To decide which treatment sequence is best for this patient the kernel regression method incorporates information from both very similar and less similar patients by weighting the information from very similar patients more highly than the information from less similar patients.
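The weighting idea can be sketched as a similarity-weighted average, one common form of kernel regression (the Nadaraya-Watson estimator). The Gaussian kernel, the bandwidth, and the toy databank are assumptions made for illustration, not the similarity function used in the analysis.

```python
import math


def gaussian_similarity(x: float, y: float, bandwidth: float = 1.0) -> float:
    """Similarity in (0, 1]; equals 1 when the two scores coincide (assumed kernel)."""
    return math.exp(-0.5 * ((x - y) / bandwidth) ** 2)


def kernel_estimate(query: float, scores: list, outcomes: list,
                    bandwidth: float = 1.0) -> float:
    """Similarity-weighted average of outcomes for a new patient's score."""
    weights = [gaussian_similarity(query, s, bandwidth) for s in scores]
    total = sum(weights)
    return sum(w * o for w, o in zip(weights, outcomes)) / total


# Hypothetical databank: heavy drinking days and a 0/1 outcome under one treatment.
days = [4.0, 11.0, 12.0, 13.0, 25.0]
outcome = [1.0, 0.0, 1.0, 1.0, 0.0]

# A new patient with 12 heavy drinking days: patients with 11-13 days dominate.
print(kernel_estimate(12.0, days, outcome))
```

Patients with 4 or 25 heavy drinking days still enter the average, but with weights so small that they barely affect the estimate.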

In the context of Example 2, two decision rules would be considered (corresponding to the two randomizations/steps in the SMART study). We first analyze the data from the last step, Step 2. The kernel regression provides a comparison of the two possible treatment options and estimates which has the best overall value; this comparison more heavily weights patients in the SMART study with similar state variables (for example, pre-treatment drinking score, initial treatment, the responder/nonresponder status, and a measure of adherence to the initial treatment). So at the end of the analysis, we have one “best” decision rule at Step 2 valid for each possible type of individual (as summarized by an individual’s state variables). To summarize we use kernel regression with the dependent variable equal to the Step 2 reward and the independent variables equal to the state variables. The treatment option that maximizes the regression function is the best Step 2 treatment given an individual’s state. The regression function evaluated at the best Step 2 treatment is the value at Step 2.

Next consider Step 1: the goal is to find the best first treatment given an individual’s state prior to Step 1. We apply kernel regression with the dependent variable equaling the sum of the reward for Step 1 and the value from the Step 2 analysis (the value at Step 2 is the Step 2 regression function evaluated at the best Step 2 treatment). The independent variables are the pretreatment patient characteristics (e.g. in our hypothetical example, the pre-treatment drinking score). The treatment option that maximizes the regression function is the best initial treatment given an individual’s state.

In general, there may be more than two treatment steps. For example in the analysis presented in the next section, there are four treatment steps (and thus four decision rules). In such cases, the regression progresses as presented here, starting from the last step and moving backwards, such that at every step, the treatment that is best for the entire sequence is selected. A detailed technical presentation of the use of kernel regression in reinforcement learning is available in Ormoneit and Sen (2002).
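The backward procedure for two steps can be sketched as follows. This is a simplified illustration rather than the implementation used in the analysis: it assumes a one-dimensional state at each step, a Gaussian kernel, the same two Step 2 options for every patient, and synthetic data in which the "truth" is invented. All function names are hypothetical.

```python
import math
import random


def kernel_regress(x_query, xs, ys, bandwidth=1.0):
    """Similarity-weighted average of ys (Gaussian kernel, Nadaraya-Watson form)."""
    weights = [math.exp(-0.5 * ((x - x_query) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no similar patients in the databank
    return sum(w * y for w, y in zip(weights, ys)) / total


def q_value(state, action, rows, bandwidth=1.0):
    """rows: (state, action, target) triples; regress using rows with this action."""
    kept = [(s, y) for s, a, y in rows if a == action]
    return kernel_regress(state, [s for s, _ in kept], [y for _, y in kept], bandwidth)


def fit_two_step(data, actions1, actions2, bandwidth=1.0):
    """data: list of dicts with keys s1, a1, r1, s2, a2, r2. Returns (rule1, rule2)."""
    # Patients with a different treatment history have similarity zero,
    # so Step 2 regressions are run separately per Step 1 treatment.
    rows2 = {a: [(d["s2"], d["a2"], d["r2"]) for d in data if d["a1"] == a]
             for a in actions1}

    def q2(state, prior_action, action):
        return q_value(state, action, rows2[prior_action], bandwidth)

    def rule2(state, prior_action):
        return max(actions2, key=lambda b: q2(state, prior_action, b))

    def value2(state, prior_action):
        return max(q2(state, prior_action, b) for b in actions2)

    # Step 1 dependent variable: Step 1 reward plus the value at Step 2.
    rows1 = [(d["s1"], d["a1"], d["r1"] + value2(d["s2"], d["a1"])) for d in data]

    def rule1(state):
        return max(actions1, key=lambda a: q_value(state, a, rows1, bandwidth))

    return rule1, rule2


# Synthetic data with an invented truth: the right Step 2 choice depends on the
# Step 1 drinking score, and starting with MedA+CBT raises the final reward.
random.seed(0)
A1, A2 = ["MedA", "MedA+CBT"], ["MedB", "MedB+EM+CBT"]
data = []
for _ in range(300):
    d = {"s1": random.uniform(0, 20), "a1": random.choice(A1), "r1": 0.0,
         "s2": random.uniform(0, 20), "a2": random.choice(A2)}
    matched = (d["a2"] == "MedB+EM+CBT") == (d["s2"] > 10)
    d["r2"] = (1.0 if d["a1"] == "MedA+CBT" else 0.4) * matched
    data.append(d)

rule1, rule2 = fit_two_step(data, A1, A2)
print(rule1(10.0), rule2(15.0, "MedA"), rule2(5.0, "MedA"))
```

Note that `rule1` never looks at the Step 1 reward alone: the initial treatment is chosen by the sum of the Step 1 reward and the best achievable Step 2 value, mirroring the order of the regressions described above.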

## 4.0 Case study

We now perform a case-study of the concepts explored above using data from the STAR*D study.

### 4.1 Data description

STAR*D aimed to investigate the comparative effectiveness of different treatments provided in succession for patients not adequately benefiting from the initial or subsequent treatment steps. Overall, 4041 patients were enrolled at 41 clinical sites. Treatment options varied between the different treatment steps. For the initial treatment step (Step 1), all participants received up to 14 weeks of citalopram (CIT). All participants without a satisfactory response to CIT (non-remission or intolerance) were eligible for seven different treatment options at the second step (Step 2): continue on CIT and augment with bupropion-SR (CIT+BUP-SR), or buspirone (CIT+BUS), or cognitive therapy (CIT+CT), or discontinue CIT and switch to bupropion-SR (BUP-SR), sertraline (SER), venlafaxine-XR (VEN-XR), or cognitive therapy (CT). As part of the STAR*D design, patients were allowed to elect *Switch* or *Augmentation* treatment options prior to randomization. Those choosing the switch were then randomized to one of *{BUP-SR, SER, VEN-XR, CT}*. Those choosing the augmentation were then randomized to one of *{CIT+BUP-SR, CIT+BUS, CIT+CT}*^{4}. Participants in Step 3 were again given a choice of whether they would prefer a switch or augmentation of treatment. Those choosing the switch were then re-randomized at Step 3 to one of two switch options: mirtazapine (*MRT*) or nortriptyline (*NTP*); those choosing the augment were then re-randomized to one of two augment options: lithium (*Li*) or thyroid hormone (*THY*). Finally, patients without an adequate benefit at Step 3 were re-randomized in Step 4 to either: tranylcypromine (*TCP*) or a combination of *MRT+VEN-XR*. Participants with a satisfactory response at any step could enter a one-year naturalistic follow-up during which they were to continue the same treatment found to be effective acutely. For clarity of presentation, we omit some of the more subtle details of the randomization. See Fava et al. (2003) and Rush et al. (2004) for a full description of the experimental design.

The discussion and analysis presented below considers the use of *symptom severity* to tailor treatment to individuals by using this variable both as a tailoring variable and to form the rewards. Symptom severity is defined by the Quick Inventory of Depressive Symptomatology (QIDS-SR_{16}: Rush et al., 2000; Trivedi et al., 2004; Rush et al., 2003), which is a short and self-report version of the Inventory of Depressive Symptomatology (IDS: Rush et al. 1996). The QIDS-SR_{16} rates only the nine criterion symptom domains (range: 0–27) needed to diagnose a major depressive episode by DSM-IV. The QIDS-SR_{16} was collected at baseline (before treatment initiation), as well as during each clinic visit. Clinic visits occurred multiple times during each treatment step, usually at approximately 2-week intervals. QIDS-SR_{16} was also used at the end of each treatment step to assess remission (e.g., QIDS-SR_{16} ≤ 5).

### 4.2 Decision rules

The analysis focuses on optimizing the treatment choices at Steps 2, 3, and 4. Step 1 was excluded because all patients received the same treatment (citalopram). The reward for each patient is:

$$
\text{reward} =
\begin{cases}
1 & \text{if the patient's exit QIDS-SR}_{16} \text{ at the last treatment step} \le 5 \text{ (indicates remission)} \\
0 & \text{otherwise}
\end{cases}
$$

This reward is indicative of the overall long-term objective, namely that the primary goal is for patients to achieve remission. This reward does not differentiate between remitting at Steps 2, 3, or 4, even though clinicians and patients alike would clearly prefer an early remission. Other alternate rewards might also be considered, including rewards that differentiate between early and late remission and/or incorporate clinician and patient preferences, side effect burden, cost, etc. Using this definition for the reward, the average sum of rewards at Step *t* represents the proportion of patients remitting in Steps *t* through 4 out of the patients entering Step *t*.
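A minimal sketch of this reward and of the resulting remission proportion follows. The exit scores are hypothetical values chosen for illustration; only the cutoff of 5 comes from the text.

```python
REMISSION_CUTOFF = 5  # exit QIDS-SR16 <= 5 indicates remission (from the text)


def reward(exit_qids_last_step: int) -> int:
    """Return 1 if the patient remitted by the last treatment step, else 0."""
    return 1 if exit_qids_last_step <= REMISSION_CUTOFF else 0


# Hypothetical exit QIDS-SR16 scores for patients entering some Step t.
exit_scores = [3, 12, 5, 9, 6, 0, 17, 4]
rewards = [reward(q) for q in exit_scores]

# With a 0/1 reward, the average reward is the proportion of patients remitting.
remission_proportion = sum(rewards) / len(rewards)
print(remission_proportion)
```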

As will be seen below at most of the steps, there is a very large number of states (patient types), and STAR*D does not contain sufficient data to compare all actions (treatment options) for each possible state. As discussed above, kernel regression is used to pool information across patients by weighting patients with similar states more highly. As before, similarity between patients must be expressed. We assume that patients receiving different treatments prior to Step *t* have similarity zero. For patients who received the same treatments prior to Step *t*, similarity is measured as the distance between their states (here the distance between their baseline pre-citalopram QIDS-SR_{16} and slope over QIDS-SR_{16} during prior steps).
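The two-part similarity described here can be sketched as below. The Gaussian form, the bandwidth, and the `PatientHistory` type are assumptions for illustration; a real analysis would also need to decide how to scale the baseline score relative to the slopes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientHistory:
    """Hypothetical record of what is known about a patient before Step t."""
    prior_treatments: tuple  # e.g. ("CIT", "SER")
    baseline_qids: float     # pre-citalopram QIDS-SR16
    slopes: tuple            # one QIDS-SR16 slope per prior step


def similarity(p: PatientHistory, q: PatientHistory, bandwidth: float = 2.0) -> float:
    """Zero for different treatment histories; otherwise decays with state distance."""
    if p.prior_treatments != q.prior_treatments:
        return 0.0
    squared = (p.baseline_qids - q.baseline_qids) ** 2
    squared += sum((a - b) ** 2 for a, b in zip(p.slopes, q.slopes))
    return math.exp(-0.5 * squared / bandwidth ** 2)


a = PatientHistory(("CIT",), 15.0, (-0.5,))
b = PatientHistory(("CIT",), 16.0, (-0.4,))
c = PatientHistory(("CIT", "SER"), 15.0, (-0.5, 0.0))
print(similarity(a, b), similarity(a, c))
```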

For this application, instance-based reinforcement learning begins by performing a kernel regression at Step 4, where we consider decision rules of the form:

- IF baseline QIDS-SR_{16} = Z^{1}_{0}
- AND slope of QIDS-SR_{16} during Step 1 = Z_{1}
- AND treatment at Step 2 = A_{2}
- AND slope of QIDS-SR_{16} during Step 2 = Z_{2}
- AND treatment at Step 3 = A_{3}
- AND slope of QIDS-SR_{16} during Step 3 = Z_{3}
- THEN at Step 4 apply treatment A_{4}.

As before the state variables are listed in the IF part of the above decision rule; the state is the combination of the baseline QIDS-SR_{16} which can assume one of 22 numbers in {6, …, 27}, the slope^{5} over the QIDS-SR_{16} scores during Steps 1–3 (this is divided into 21 discrete categories) and the past treatments. For every possible state, the kernel regression provides a comparison of the two available decision options, *A_{4} = {TCP, MRT+VEN-XR}*, based on overall value. At the end of the analysis, we retain the “best” decision option for each possible type of individual (as summarized by an individual’s baseline QIDS-SR_{16}, and QIDS-SR_{16} slope within each step).
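To give a sense of the size of the state space, the sketch below discretizes a single slope variable and counts the resulting states. The least-squares definition of the slope, the slope range of [-2, 2], and the equal-width bins are all assumptions made for illustration; only the 22 baseline values and the 21 slope categories come from the text.

```python
def least_squares_slope(weeks, scores):
    """Ordinary least-squares slope of QIDS-SR16 scores against visit week (assumed)."""
    n = len(weeks)
    mean_w = sum(weeks) / n
    mean_s = sum(scores) / n
    num = sum((w - mean_w) * (s - mean_s) for w, s in zip(weeks, scores))
    den = sum((w - mean_w) ** 2 for w in weeks)
    return num / den


def slope_category(slope, low=-2.0, high=2.0, n_bins=21):
    """Map a slope to one of n_bins equal-width categories; the range is assumed."""
    clipped = min(max(slope, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)


BASELINE_VALUES = range(6, 28)  # {6, ..., 27}: 22 possible baseline scores
N_SLOPE_BINS = 21

# Number of (baseline, slope category) states for one slope and one treatment history.
states_per_history = len(BASELINE_VALUES) * N_SLOPE_BINS

# Hypothetical visits at weeks 0, 2, 4, 6 with steadily improving scores.
slope = least_squares_slope([0, 2, 4, 6], [18, 16, 14, 12])
print(slope, slope_category(slope), states_per_history)
```

Even with one slope variable there are several hundred states per treatment history, which is why pooling information across similar patients is needed.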

Next at Step 3, we consider decision rules of the form:

- IF baseline QIDS-SR_{16} = Z^{1}_{0}
- AND slope of QIDS-SR_{16} during Step 1 = Z_{1}
- AND treatment at Step 2 = A_{2}
- AND slope of QIDS-SR_{16} during Step 2 = Z_{2}
- AND Step 3 patient preference = P_{2}
- THEN at Step 3 apply treatment A_{3}.

Again for each possible state, we use kernel regression to compare the two available decision options for *A_{3}*: switch to *MRT* or *NTP* (if preference is for a switch) or augment treatment with either *Li* or *THY* (if preference is for augmentation).

Finally at Step 2, we consider decision rules of the form:

- IF baseline QIDS-SR_{16} = Z^{1}_{0}
- AND slope of QIDS-SR_{16} during Step 1 = Z_{1}
- AND Step 2 patient preference = P_{1}
- THEN at Step 2 apply treatment A_{2}.

For every possible state, the regression provides a comparison of the three or four decision options: *A_{2} = {CIT+BUP-SR, CIT+BUS, CIT+CT}* (if preference is for an augment) or *{BUP-SR, SER, VEN-XR, CT}* (if preference is for a switch in treatment).

### 4.3 Analysis

The reinforcement learning analysis requires considering decision rules at all treatment steps, to construct full treatment sequences. For brevity, we discuss only the rules for Step 2, understanding that they are part of a full treatment sequence. Recall that at Step 2 patients were given a choice of whether they would prefer a switch, or an augment of treatment; we present separate results for each preference group. The results presented herein ignore a host of other state variables, but nonetheless illustrate the types of treatment sequences that can be constructed from SMART studies and allow us to comment on the clinical relevance of such rules.

In Figure 2 we show the treatment that is prescribed by the instance-based reinforcement learning, both for the Switch group (Fig. 2a) and the Augment group (Fig. 2b). In both graphs, the horizontal axis is the patient’s pretreatment QIDS-SR_{16} (i.e. before citalopram) and the vertical axis is the slope of QIDS-SR_{16} scores during Step 1. Each square in the graph represents a state (patient type). The shade of a given square indicates which Step 2 treatment, for patients in that state, will maximize the sum of rewards over the treatment sequence.

Figure 2. *X-axis* is QIDS-SR_{16} before Step 1. *Y-axis* is slope of QIDS-SR_{16} scores during Step 1. (a) Optimal treatment at Step 2 for patients preferring Medication Switch. (b) Optimal treatment ...

For example:

- IF baseline QIDS-SR_{16} = 15
- AND Step 1 QIDS-SR_{16} slope = 0
- AND Step 2 Patient preference = Augment
- THEN at Step 2 apply treatment CIT+BUP-SR.

The squares also contain circles of varying sizes. The circle size indicates the density of patients in the database with this particular history (i.e. *x*−*y* value).

Consider Figure 2(a). The graph shows many small regions, with no single treatment dominating over a significant region. This suggests that for those who prefer a treatment switch, it is unclear which of the switch options is best. This result could be attributed to different factors: it is possible that we do not have sufficient data to distinguish between the switch options for certain types of patients; alternatively, it could be that the different switch treatment options are actually equivalent, and the variability in the data causes different treatment options to dominate from one small set (square in Figure 2(a)) of patients to another.

Now consider Figure 2(b). The graph shows fewer, but larger, regions of the same treatment. This suggests that for patients who prefer an augment, there are categories of individuals who should get *CIT+BUP-SR* and others who should receive *CIT+BUS*.

Next, we examine the effect of considering the full sequence of treatments when selecting an adaptive strategy, by comparing Figure 2(b) and Figure 2(c). Recall that in reinforcement learning the value of a treatment includes both the immediate reward for applying the treatment and the future rewards that will be obtained by subsequently applying the best possible treatment sequence. Figure 2(b) is obtained using the instance-based reinforcement learning method described in this paper. Figure 2(c) is obtained by using a slightly modified version of our instance-based reinforcement learning method, in which the regression at every step ignores future rewards and only considers the immediate reward. A comparison of the two figures shows the effect of considering subsequent treatments when choosing the best treatment at Step 2. The optimal treatment in both figures is the same for most patients. However, for some individuals (e.g. those with pre-citalopram *QIDS-SR_{16}* = 20, slope at Step 1 = 0), the optimal initial treatment to achieve remission in any of Steps 2–4, as shown in Figure 2(b), is citalopram augmented with buspirone (*CIT+BUS*). For those same individuals, the optimal treatment to achieve immediate remission at (and only at) Step 2, as shown in Figure 2(c), is citalopram augmented with bupropion (*CIT+BUP-SR*). The bold box in Figure 2(c) roughly shows the set of individuals for whom the treatment recommendation differs depending on whether we consider only remission at Step 2 or remission over the full sequence.
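The difference between the two analyses can be illustrated with a small numerical sketch. The probabilities below are hypothetical values chosen for illustration, not estimates from STAR*D, and the names `p_remit_step2` and `value_after_step2` are assumptions introduced here. The reward is taken to be 1 for remission and 0 otherwise, so the full-sequence value of a Step 2 treatment is its immediate remission probability plus the probability of not remitting times the value of the best subsequent treatment sequence.

```python
# Hypothetical numbers for one patient profile; NOT estimates from STAR*D.
# Probability of remission at Step 2 under each augment option.
p_remit_step2 = {"CIT+BUP": 0.30, "CIT+BUS": 0.27}
# Assumed probability of eventually remitting in Steps 3-4 under the best
# later treatments, given no remission at Step 2.
value_after_step2 = {"CIT+BUP": 0.20, "CIT+BUS": 0.30}


def q_myopic(treatment):
    """Value that counts only the immediate reward (remission at Step 2)."""
    return p_remit_step2[treatment]


def q_full(treatment):
    """Value that also counts future rewards after a Step 2 non-remission."""
    p = p_remit_step2[treatment]
    return p + (1.0 - p) * value_after_step2[treatment]


best_myopic = max(p_remit_step2, key=q_myopic)
best_full = max(p_remit_step2, key=q_full)
print("immediate reward only:", best_myopic)
print("full sequence:", best_full)
```

With these made-up numbers the two criteria disagree: the option with the higher immediate remission rate is not the one with the higher value over the full sequence, which is the kind of patient profile the bold box is meant to highlight.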

Note that all of the above analyses only use information from patients who did not remit during Step 1. All patients who remitted with CIT were moved to the follow-up phase and, therefore, were not given a Step 2 treatment.

## 5.0 Conclusion

Reinforcement learning provides a set of analysis tools that can be used with data to optimize the sequential decisions that must be made in clinical practice. The methodology is general, and can be applied to constructing adaptive treatment strategies for many chronic disorders, including drug and alcohol dependence. While the type of treatments and strategies considered in drug and alcohol dependence differ from those used in treating depression, the methodology is sufficiently general to be used across these domains.

The discussion and results presented above, while still preliminary, highlight the notion that a patient’s response to one treatment can provide useful predictive information concerning that patient’s response to another treatment. The novelty is that methods in reinforcement learning appropriately take into account the tradeoff between present and future responses in constructing treatment sequences. In particular, earlier treatments may be chosen not only for their immediate effects but also for their ability to set the patient up for an improved response to later treatments *and* for their diagnostic value (i.e. to reveal information about patient type, thus allowing better selection of subsequent treatments). This powerful new way of thinking about the design of sequential treatments requires further development.

The main challenge in applying these methods is that a close level of collaboration between clinical researchers and methodologists is required to define the appropriate set of state variables (baseline patient characteristics, clinical assessments including outcomes observed during prior treatment) and the appropriate rewards. In future work with the STAR*D dataset, we will include a larger set of state variables such as the side effect burden (as gauged by the Frequency, Intensity and Burden of Side Effects Rating (FIBSER) Scale (Wisniewski et al., 2006)) and comorbid general medical conditions (as defined by the 14-item Cumulative Illness Rating Scale (CIRS) (Linn et al., 1968; Miller et al., 1992)). We will also include a more detailed version of patient preference, assigned treatment, duration of treatment at each step, and the timing of response and remission at each step.

When constructing decision rules, it is important to identify the appropriate primary outcome (here, the reward); this choice plays a large role, in that reinforcement learning constructs strategies that maximize the average sum of rewards. The reward used in the analysis above is very simplistic, and does not capture the preferences of both the clinician and the patient for different outcomes (speed of remission, level of side effects, etc.). An important challenge is to devise more clinically relevant rewards. One feasible approach is to use preference elicitation (Froberg and Kane, 1989; Parmigiani, 2002; Rosenheck et al., 2005) to form the rewards.

There are also a number of methodological challenges. First, we would like to use reinforcement learning with data from observational studies (studies in which the treatments are not randomized). As is well known, the analysis of observational data, such as clinical databases, is subject to bias, since statistical methods cannot completely disentangle the unobserved reasons why a patient received a treatment from the effects of the treatment (e.g., unbeknownst to the data analyst, the patients given treatment A are much sicker than the patients given treatment B, and thus treatment A appears less effective than it actually is). It is crucial that methods from statistics be combined with methods from reinforcement learning to reduce the bias when observational data are used. This is the subject of extensive research in statistics (Murphy, 2003; Robins, 2004).
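The treatment A versus treatment B example can be made concrete with a small simulation. This is a sketch on synthetic data, not an analysis of any clinical database: the severity indicator `sick`, the assignment probabilities, and the remission rates are all assumptions chosen for illustration. By construction the two treatments are equally effective, yet the naive comparison makes treatment A look worse because sicker patients are more likely to receive it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Severity is hidden from the analyst in the scenario described above;
# it is visible here only because the data are simulated.
sick = rng.random(n) < 0.5
# Sicker patients are much more likely to be given treatment A.
gets_a = rng.random(n) < np.where(sick, 0.8, 0.2)
# Remission depends on severity only: A and B are equally effective.
remit = rng.random(n) < np.where(sick, 0.2, 0.6)

# Naive comparison that ignores severity.
naive_gap = remit[gets_a].mean() - remit[~gets_a].mean()

# Comparison within severity strata, possible only if severity is observed.
stratum_gaps = [
    remit[gets_a & (sick == s)].mean() - remit[~gets_a & (sick == s)].mean()
    for s in (True, False)
]
adjusted_gap = float(np.mean(stratum_gaps))

print(f"naive A - B remission gap:    {naive_gap:+.3f}")
print(f"adjusted A - B remission gap: {adjusted_gap:+.3f}")
```

The adjusted comparison is available in the simulation only because severity was generated and recorded. In real observational data the confounder may be unmeasured, which is why randomization, or statistical methods designed to reduce this bias, are needed.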

A second methodological challenge is that, to a large extent, present methods in reinforcement learning do not provide measures of confidence (e.g. standard errors, confidence sets, hypothesis tests). Thus, while results such as those provided here can be used to generate new hypotheses, they must be verified using traditional statistical analyses. This important issue with reinforcement learning is the subject of ongoing theoretical investigations by the statistics and computer science communities.

## Acknowledgments

We gratefully acknowledge the contribution of the STAR*D team, in particular investigators at the Texas Southwestern Medical Center and the University of Pittsburgh School of Public Health, who supplied the data necessary for this work. STAR*D was funded in part with Federal funds from the National Institute of Mental Health, National Institutes of Health, under Contract N01MH90003 to UT Southwestern Medical Center at Dallas (P.I.: A.J. Rush). Susan Murphy and Joelle Pineau were funded partially under NIH grant R21 DA019800. Susan Murphy received additional funding for this work from NIH grants K02-DA1515674 and P50 DA10075. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

## Footnotes

^{2}Sometimes a rule proposes a *class* of treatments, rather than a single treatment, in which case one uses clinical expertise to select one treatment from the set of options. For example, the second decision rule above may be altered to read: *If the patient is a responder, then provide either “monthly monitoring for relapse” or Telephone Disease Management (TDM;* Oslin et al., 2003).

^{3}We use a Gaussian kernel function with bandwidths selected by cross-validation (Hardle, 1990).
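As a rough illustration of this footnote, the sketch below fits a Gaussian-kernel (Nadaraya-Watson) regression and picks the bandwidth by leave-one-out cross-validation. It assumes a single synthetic state variable and a small grid of candidate bandwidths; the function names, the data, and the grid are hypothetical, and the analysis in the paper uses several state variables, so this is not the authors' implementation.

```python
import numpy as np


def gaussian_kernel_regression(x_train, y_train, x_query, bandwidth):
    """Nadaraya-Watson estimate: kernel-weighted average of training targets."""
    diffs = (x_query[:, None] - x_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    return (weights @ y_train) / weights.sum(axis=1)


def loo_cv_error(x, y, bandwidth):
    """Mean squared leave-one-out prediction error for a given bandwidth."""
    diffs = (x[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    np.fill_diagonal(weights, 0.0)  # each point is excluded from its own fit
    preds = (weights @ y) / weights.sum(axis=1)
    return float(np.mean((y - preds) ** 2))


# Synthetic one-dimensional data, for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 27.0, size=200)
y = np.sin(x / 4.0) + rng.normal(0.0, 0.3, size=200)

candidates = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
best_bandwidth = min(candidates, key=lambda h: loo_cv_error(x, y, h))
fitted = gaussian_kernel_regression(x, y, np.array([5.0, 20.0]), best_bandwidth)
print("selected bandwidth:", best_bandwidth)
print("fitted values at 5 and 20:", fitted)
```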

^{4}In fact, this is a simplified explanation of the STAR*D design. Participants could also choose to include/exclude psychotherapy, *{CT, CIT+CIT}*, as well as allow/disallow more than one class, e.g. *Switch or Augment, but not Psychotherapy*. The methodology we present can handle this more complicated case; however, for ease of presentation we focus on this simplified version.

^{5}The slope over the score during a treatment step is calculated as the gradient obtained when fitting a linear function to the QIDS-SR_{16} scores observed during the step versus time since beginning the treatment step. The slope is re-scaled to fit in the [−1, +1] interval. Higher values indicate a worsening in symptoms during the step, lower values indicate an improvement in symptoms during the step.
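The slope computation in this footnote can be sketched as a least-squares line fit followed by rescaling. The footnote does not state how the rescaling to [−1, +1] is done, so the sketch assumes division by a hypothetical constant `max_abs_slope` followed by clipping; the function name and the example scores are likewise illustrative.

```python
import numpy as np


def qids_slope(days, scores, max_abs_slope=1.0):
    """Slope of QIDS-SR16 score versus time within one treatment step.

    `max_abs_slope` (score points per day) is an assumed scaling constant;
    the rescaling rule actually used in the analysis is not specified.
    """
    slope, _intercept = np.polyfit(days, scores, deg=1)
    return float(np.clip(slope / max_abs_slope, -1.0, 1.0))


# Hypothetical visits: days since the start of the step and observed scores.
improving = qids_slope([0, 14, 28, 42], [20, 17, 15, 12])
worsening = qids_slope([0, 14, 28, 42], [12, 15, 17, 20])
print(improving, worsening)
```

A negative value indicates improving symptoms during the step and a positive value indicates worsening, matching the sign convention stated in the footnote.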

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

## References

- Bellman R. Dynamic Programming. Princeton University Press; Princeton NJ: 1957.
- Collins LM, Murphy SA, Bierman KL. A conceptual framework for adaptive preventive interventions. Prev Sci. 2004;5(3):185–96. [PMC free article] [PubMed]
- Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, Kupfer DJ. Background and rationale for the sequenced treatment alternatives to relieve depression (STAR*D) study. Psychiatr Clin North Am. 2003;26(2):457–494. [PubMed]
- Froberg DG, Kane RL. Methodology for measuring health-state preferences - II: Scaling methods. J Clin Epidemiol. 1989;42(5):459–471. [PubMed]
- Hardle W. Applied nonparametric regression. Cambridge University Press; Cambridge: 1990.
- Lavori PW, Dawson R. Developing and comparing treatment strategies: an annotated portfolio of designs. Psychopharmacol Bull. 1998;34(1):13–8. [PubMed]
- Lavori PW, Dawson R. Dynamic treatment regimes: practical design considerations. Clinical Trials. 2003;1(1):9–20. [PubMed]
- Lavori PW, Dawson R, Rush AJ. Flexible treatment strategies in chronic disease: clinical and research implications. Biol Psychiatry. 2000;48(6):605–14. [PubMed]
- Linn BS, Linn MW, Gurel L. Cumulative illness rating scale. J Am Geriatr Soc. 1968;16(5):622–626. [PubMed]
- Miller MD, Paradis CF, Houck PR, Mazumdar S, Stack JA, Rifai AH, Mulsant B, Reynolds CF. III: Rating chronic medical illness burden in geropsychiatric practice and research: application of the Cumulative Illness Rating Scale. Psychiatry Res. 1992;41(3):237–248. [PubMed]
- Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society, Series B (with discussion) 2003;65(2):331–366.
- Murphy SA. An experimental design for the development of Adaptive Treatment Strategies. Statistics in Medicine. 2005;24:1455–1481. [PubMed]
- Murphy SA, Lynch KG, McKay JR, Oslin DW, Ten Have TR. Developing Adaptive Treatment Strategies in Substance Abuse Research. Drug and Alcohol Dependence. In press, this issue. [PMC free article] [PubMed]
- Murphy SA, McKay JR. Adaptive Treatment Strategies: an emerging approach for improving treatment effectiveness. Clinical Science (Newsletter of the American Psychological Association Division 12, Section III: The Society for the Science of Clinical Psychology). Winter 2003/Spring 2004.
- Murphy SA, Oslin D, Rush AJ, Zhu J. for MCATS. Methodological challenges in constructing effective treatment sequences for chronic disorders. Neuropsychopharmacol. 2006 advance online publication, November 8 2006, DOI:10.1038/sj.npp.1301241. [PubMed]
- Ormoneit D, Sen S. Kernel based reinforcement learning. Machine Learning. 2002;49:161–178.
- Oslin DW, Sayers S, Ross J, Kane V, Ten Have T, Conigliaro J, Cornelius J. Disease management for depression and at-risk drinking via telephone in an older population of veterans. Psychosom Med. 2003;65(6):931–7. [PubMed]
- Parmigiani G. Modeling in Medical Decision Making: A Bayesian Approach. John Wiley & Sons; New York: 2002.
- Robins JM. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty P, editors. Proceedings of the Second Seattle Symposium on Biostatistics. Springer Verlag; New York: 2004. pp. 189–326.
- Rosenheck R, Stroup S, Keefe SE, McEvoy J, Swartz M, Perkins D, Hsiao J, Shumway M, Lieberman J. Measuring outcome priorities and preferences in people with schizophrenia. Br J Psychiatry. 2005;187:529–536. [PubMed]
- Rush AJ, Carmody TJ, Reimitz PE. The Inventory of Depressive Symptomatology (IDS): clinician (IDS-C) and self-report (IDS-SR) ratings of depressive symptoms. Int J Methods Psychiatr Res. 2000;9:45–59.
- Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA, Thase ME, Nierenberg AA, Quitkin FM, Kashner TM, Kupfer DJ, Rosenbaum JF, Alpert J, Stewart JW, McGrath PJ, Biggs MM, Shores-Wilson K, Lebowitz BD, Ritz L, Niederehe G. Sequenced Treatment Alternatives to Relieve Depression (STAR*D): rationale and design. Control Clin Trials. 2004;25(1):119–142. [PubMed]
- Rush AJ, Gullion CM, Basco MR, Jarrett RB, Trivedi MH. The Inventory of Depressive Symptomatology (IDS): psychometric properties. Psychol Med. 1996;26(3):477–486. [PubMed]
- Rush AJ, Trivedi MH, Ibrahim HM, Carmody TJ, Arnow B, Klein DN, Markowitz JC, Ninan PT, Kornstein S, Manber R, Thase ME, Kocsis JH, Keller MB. The 16-Item Quick Inventory of Depressive Symptomatology (QIDS), clinician rating (QIDS-C), and self-report (QIDS-SR): a psychometric evaluation in patients with chronic major depression. Biol Psychiatry. 2003;54(5):573–583. Erratum, p. 585. [PubMed]
- Schneider LS, Tariot PN, Lyketsos CG, Dagerman KS, Davis KL, Davis S, Hsiao JK, Jeste DV, Katz IR, Olin JT, Pollock BG, Rabins PV, Rosenheck RA, Small GW, Lebowitz B, Lieberman JA. National Institute of Mental Health Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE): Alzheimer disease trial methodology. Am J Geriatr Psychiatry. 2001;9(4):346–360. [PubMed]
- Stone RM, Berg DT, George SL, Dodge RK, Paciucci PA, Schulman P, Lee EJ, Moore JO, Powell BL, Schiffer CA. Granulocyte-macrophage colony-stimulating factor after initial chemotherapy for elderly patients with primary acute myelogenous leukemia. N Engl J Med. 1995;332:1671–1677. [PubMed]
- Stroup TS, McEvoy JP, Swartz MS, Byerly MJ, Glick ID, Canive JM, McGee MF, Simpson GM, Stevens MC, Lieberman JA. National Institute of Mental Health Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE): Schizophrenia trial design and protocol development. Schizophr Bull. 2003;29:15–31. [PubMed]
- Sutton R, Barto A. Reinforcement Learning: An Introduction. MIT Press; Cambridge: 1998.
- Trivedi MH, Rush AJ, Ibrahim HM, Carmody TJ, Biggs MM, Suppes T, Crismon ML, Shores-Wilson K, Toprac MG, Dennehy EB, Witte B, Kashner TM. The Inventory of Depressive Symptomatology, Clinician Rating (IDS-C) and Self-Report (IDS-SR), and the Quick Inventory of Depressive Symptomatology, Clinician Rating (QIDS-C) and Self-Report (QIDS-SR) in public sector patients with mood disorders: a psychometric evaluation. Psychol Med. 2004;34(1):73–82. [PubMed]
- Tummarello D, Mari D, Granziano F, Isidori P, Cetto G, Pasini F, Santo A, Cellerino R. A randomized, controlled, phase III study of cyclophosphamide, doxorubicin and vincristine with etoposide (CAV-E) or teniposide (CAV-T), followed by recombinant interferon-α maintenance therapy or observation, in small cell lung carcinoma patients with complete responses. Cancer. 1997;80:2222–2229. [PubMed]
- Wisniewski SR, Rush AJ, Balasubramani GK, Trivedi MH, Nierenberg AA. for the STAR*D Investigators. Self-rated global measure of the frequency, intensity, and burden of side effects. J Psychiatr Pract. 2006;12:71–79. [PubMed]
