Gottfried JA, editor. Neurobiology of Sensation and Reward. Boca Raton (FL): CRC Press/Taylor & Francis; 2011.
14.1. INTRODUCTION
The ability to predict when and where a reward will occur enables humans and other animals to initiate behavioral responses prospectively in order to maximize the probability of obtaining that reward. Reward predictions can take a number of distinct forms depending on the nature of the associative relationship underpinning them (Balleine et al. 2008; and see Chapters 13 and 15 in this volume). The simplest form of reward prediction is one based on an associative “Pavlovian” relationship between arbitrary stimuli and rewards, acquired following experience of repeated contingent pairing of the stimulus with the reward. Subsequent presentation of the stimulus elicits a predictive representation of the reward, by virtue of the learned stimulus-reward association. This form of prediction is purely passive: it signals when a reward might be expected to occur and elicits Pavlovian conditioned reflexes, but does not inform about the specific behavioral actions that should be initiated in order to obtain it. By contrast, other forms of reward prediction are grounded in learned instrumental associations between stimuli, responses, and rewards, thereby informing about the specific behavioral responses that, when performed by the animal, lead to a greater probability of obtaining that reward. Instrumental reward predictions can be either goal directed (based on response-outcome associations and therefore sensitive to the incentive value of the outcome), or habitual (based on stimulus-response associations and hence insensitive to changes in outcome value) (Balleine and Dickinson 1998). In this chapter, we will review evidence for the presence of multiple types of predictive reward signal in the brain. We will also outline some of the candidate computational mechanisms that might be responsible for the acquisition of these different forms of reward predictions and evaluate evidence for the presence of such mechanisms in the brain.
14.2. STIMULUS-BASED PREDICTIONS
Studies of the neural basis of stimulus-based Pavlovian reward-prediction signals have focused on the amygdala in the medial temporal lobes, orbitofrontal cortex (OFC) on the ventral surface of the frontal lobes, and the ventral striatum in the basal ganglia.
Single-unit recording studies in both rodents and non-human primates have implicated neurons in these areas in encoding stimulus-reward associations. Schoenbaum, Chiba, and Gallagher (1998) recorded from single neurons in rat amygdala and OFC while, on each trial, animals were presented with one of two different odor cues indicating whether a subsequent nose poke in a food well would deliver an appetitive sucrose solution or an aversive quinine solution. Neurons in both amygdala and OFC were found to discriminate between cues associated with the positive and negative outcomes, and some were also found to show an anticipatory response related to the expected outcome. Paton et al. (2006) found evidence for separate populations of monkey amygdala neurons responding to one of three Pavlovian cues predictive of the subsequent delivery of either a pleasant juice reward, no outcome, or an aversive air puff to the eye. Furthermore, activity in these neurons was found to change as a function of reversal of the associations: some neurons stopped responding to a specific cue following reversal, while others reversed their cue selectivity. Cue-related and anticipatory responses have also been found in monkey OFC that relate to the animal's behavioral preference for the associated outcomes, such that the responses of the neurons to the cue paired with a particular outcome depend on the relative preference of the monkey for that outcome compared to another outcome presented in the same block of trials (Tremblay and Schultz 1999).
Another brain region implicated in Pavlovian reward prediction is ventral striatum. Neurons in ventral striatum have also been found to reflect expected reward in relation to the onset of a stimulus presentation, and activity of neurons in this area increases as a function of the degree of progression through a task sequence ultimately leading to reward (Shidara, Aigner, and Richmond 1998; Cromwell and Schultz 2003; Day et al. 2006). Similar findings have emerged from functional neuroimaging studies in humans. Gottfried, O’Doherty, and Dolan (2002) reported activity in both amygdala and orbitofrontal cortex following presentation of visual stimuli predictive of the subsequent delivery of pleasant and unpleasant odors. Gottfried, O’Doherty, and Dolan (2003) subsequently showed that activity in these regions in response to Pavlovian cues tracks the current incentive value of an associated unconditioned stimulus (UCS). Subjects underwent conditioning in which food odors (the UCSs) were paired with visual conditioned stimuli (CSs), and subsequently the value of a food odor associated with one of the stimuli was decreased by feeding subjects to satiety on a food corresponding to that specific odor. Neural responses to presentation of the CS paired with the devalued odor were found to correspondingly decrease in OFC, amygdala, and ventral striatum from before to after the satiation procedure, whereas no such decrease was evident for the CS paired with the non-devalued odor. The above findings therefore implicate a network of brain regions involving the amygdala, ventral striatum, and OFC in the learning and expression of stimulus-based reward predictions.
14.3. ACQUISITION OF STIMULUS-BASED REWARD PREDICTIONS
14.3.1. Theoretical Models of Reward Prediction Learning
The finding of stimulus-related predictive reward signals in the brain raises the question of how such signals are acquired in the first place. Modern theories of reward learning suggest that such learning proceeds by means of a signal called a prediction error, which encodes the difference between the reward that is predicted and the reward that is actually delivered. This notion was originally instantiated in the Rescorla-Wagner (RW) model of classical conditioning (Rescorla and Wagner 1972) and more recently espoused in a class of models collectively known as reinforcement learning (Sutton and Barto 1998). According to the RW model, the process by which a CS comes to produce a conditioned response is represented by two variables: V_t, the strength of the conditioned response elicited on trial t (or, more abstractly, the value of the CS), and U, the mean value of the UCS. For conditioning to occur through repeated contingent presentations of the CS and UCS, V_t (which may initially be zero at t = 1) should converge toward U as trials progress. On any given trial we can set U_t = 1 if the reward is presented and U_t = 0 if it is not. At the core of the RW model is the aforementioned prediction error signal δ_t, which on each conditioning trial represents the difference between U_t and V_t, that is, δ_t = U_t − V_t. Early in training, on trials where the reward is delivered, δ_t will be positive because V_t < U_t. The value of V_t is then updated in proportion to δ_t: V_{t+1} = V_t + α·δ_t, where α is a learning rate between 0 and 1. Assuming that the reward is always delivered when the CS is presented, V_t eventually converges to U over the course of learning, δ_t tends to zero, and once this happens learning is complete. However, if on a particular trial the reward is suddenly omitted after the CS is presented, δ_t on that occasion will be negative because V_t > U_t, and the value of V_t will subsequently be reduced. The idea that prediction error signals can take on both positive and negative values is the central feature of this form of learning model and, as we shall see, is at the core of modern views on how such learning might be implemented in the brain.
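To make the update rule concrete, the following minimal Python sketch simulates RW conditioning for a single CS followed by extinction; the learning rate, trial count, and reward schedule are illustrative assumptions rather than values taken from the chapter.

```python
# Minimal Rescorla-Wagner sketch: acquisition followed by extinction.
alpha = 0.2            # learning rate (illustrative assumption)
V = 0.0                # value of the CS, V_t
history = []

for t in range(40):
    U = 1.0 if t < 30 else 0.0    # reward delivered on trials 0-29, omitted after
    delta = U - V                 # prediction error: delta_t = U_t - V_t
    V += alpha * delta            # update: V_{t+1} = V_t + alpha * delta_t
    history.append((t, round(delta, 3), round(V, 3)))

# V climbs toward 1 during acquisition; the omission trials at the end
# produce negative prediction errors that drive V back down (extinction).
```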
An important extension of the RW model is the Temporal Difference (TD) learning model (Sutton 1988). This model overcomes some of the limitations of the RW model, such as an inability to learn sequential stimulus-based predictions (e.g., when one stimulus predicts another stimulus which in turn predicts reward) and a lack of sensitivity to the timing between stimulus presentation and reward delivery, which is known to be a critical factor in modulating the efficacy of conditioning (Sutton 1988; Dayan and Abbott 2001). The key difference between the two models is that whereas the RW model is trial based and only concerned with estimating the reward predicted by a particular stimulus across trials, the TD model also estimates the future reward predicted from each discrete time-point within a trial until the end of the trial (Schultz, Dayan, and Montague 1997). As a consequence, the TD prediction error signal has a much richer temporal profile than its RW cousin. In particular, before learning is established the TD error generates a strong positive signal at the time of presentation of the UCS, but this positive prediction error then shifts back in time within a trial over the course of learning (Figure 14.1a). By the time learning is complete, the TD error is positive only at the time of presentation of the earliest predictive cue stimulus. Furthermore, on any occasion in the trial where greater reward is delivered than expected (or the reward is delivered sooner or later than expected), a positive prediction error is elicited. Similarly, if less reward is delivered than expected at the specific time that it has previously been found to occur, a negative prediction error is elicited. For more details of this model and its properties see Montague, Dayan, and Sejnowski (1996) and Dayan and Abbott (2001).

FIGURE 14.1
Temporal difference prediction error signals during Pavlovian reward conditioning in humans. (a) Properties of the temporal difference prediction error signal. This signal responds positively at the time of reward presentation before training, shifts …
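The shifting profile sketched in Figure 14.1a falls out of the TD update itself, as the following toy simulation illustrates, treating time steps within a trial as states; the step count, learning rate, and discount factor are illustrative assumptions.

```python
import numpy as np

# Tabular TD(0) across the time steps of a trial: the cue arrives
# unpredictably out of the inter-trial interval (value 0) and reward
# is delivered at the final step.
alpha, gamma = 0.1, 0.98
T = 10
V = np.zeros(T + 1)                       # V[T]: terminal post-reward state, held at 0

for trial in range(500):
    cue_delta = gamma * V[0] - 0.0        # error at the unpredicted cue onset
    step_deltas = []
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0    # reward at the last step of the trial
        delta = r + gamma * V[t + 1] - V[t]
        V[t] += alpha * delta
        step_deltas.append(delta)

# Before learning, the largest delta occurs at the reward step and cue_delta
# is near 0; after learning, the within-trial deltas vanish and the positive
# error has shifted back to the earliest predictive event (the cue).
```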
14.3.2. Prediction Error Signals in the Brain
Evidence for reward prediction error signals was found in the activity patterns of dopamine neurons recorded from awake, behaving, non-human primates undergoing simple instrumental or classical conditioning tasks (Schultz, Dayan, and Montague 1997; Schultz 1998). The response profile of these neurons does not correspond to a simple RW rule but rather has more in common with that predicted by TD learning, showing each of the within-trial temporal response properties described above. Just like the TD error signal, these neurons increase their firing when a reward is presented unexpectedly, decrease their firing below baseline when an expected reward is omitted, respond initially at the time of the UCS before learning is established, and shift back in time within a trial to respond instead at the time of cue presentation once learning has taken place. Further evidence in support of this hypothesis has been garnered from studies using fast-scan cyclic voltammetry assays of dopamine release in ventral striatum during Pavlovian reward conditioning: dopamine released into ventral striatum exhibited a similar shifting profile, occurring initially at the time of reward but subsequently shifting back to the time of presentation of the reward-predicting cue (Day and Carelli 2007). To test for evidence of a temporal difference prediction error signal in the human brain, O’Doherty et al. (2003) scanned human subjects while they underwent a classical conditioning paradigm in which associations were learned between arbitrary visual fractal stimuli and a pleasant sweet taste reward. This study found significant correlations between a TD model and activity in a number of brain regions, most notably ventral striatum (ventral putamen bilaterally) (Figure 14.1b) and OFC, both prominent target regions of dopamine neurons. These results suggest that prediction error signals are present in the human brain during reward learning and that these signals conform to a response profile consistent with a specific computational model: TD learning. Another study, by McClure, Berns, and Montague (2003), also revealed activity in ventral striatum consistent with a reward prediction error signal using an event-related trial-based analysis.
14.4. PREDICTIVE-REWARD SIGNALS OF INSTRUMENTAL ASSOCIATIONS
So far, we have considered reward predictions tied to the presentation of specific stimuli, which can occur even in situations where rewards are not in any way contingent on the performance of behavioral actions. But what about predictions related to the implementation of specific behavioral actions upon which rewards are contingent? A number of fMRI studies in humans have found evidence for expected reward signals in a specific brain region, the ventromedial prefrontal cortex, while subjects performed instrumental actions in order to obtain reward. Kim, Shimojo, and O’Doherty (2006) used a learning algorithm based on the RW rule to generate trial-by-trial predictions of expected reward as subjects chose between two possible actions in order to obtain monetary reward in one condition or to avoid losing money in another. Different actions were associated with distinct probabilities of winning or losing money, such that in the reward condition one action was associated with a 60% probability of winning money and the other action with only a 30% probability of winning. To maximize their cumulative reward, subjects should learn to choose the action associated with the higher reward probability. In the avoidance condition, subjects were presented with a choice between the same probabilities, except that in this context choosing one action avoided a monetary loss 60% of the time, whereas choosing the alternate action did so only 30% of the time. To minimize their losses, subjects should learn to choose the action associated with the 60% probability of loss avoidance. Model-generated expected value signals for the chosen action were found to correlate on a trial-by-trial basis with BOLD responses in medial OFC and adjacent medial prefrontal cortex, which collectively can be described as ventromedial prefrontal cortex (vmPFC), in both the reward and avoidance conditions. In other words, activity in these areas increased under situations where greater reward was expected for the action chosen (according to the learning algorithm) and decreased under conditions where less reward was expected for a given action (Figure 14.2). Similar results were obtained by Daw et al. (2006), who used a four-armed bandit task in which “points” (later converted into money) were paid out on each bandit. Again, activity in vmPFC correlated with the trial-by-trial estimate of expected reward attributable to the action (in this case the bandit) chosen on that trial.

FIGURE 14.2
Reward prediction signals in ventromedial prefrontal cortex during reward-based action selection in humans. (a) Regions of ventromedial prefrontal cortex (medial and central orbitofrontal cortex extending into medial prefrontal cortex) correlating with …
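As a sketch of how such trial-by-trial regressors can be generated, the following RW-style action-value learner with a softmax choice rule mirrors the 60%/30% reward condition described above; the learning rate and softmax temperature are illustrative assumptions, not the study's fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.3, 3.0       # learning rate, softmax inverse temperature (assumed)
p_win = [0.6, 0.3]           # reward probability of each action (reward condition)
Q = np.zeros(2)              # expected value of each action

for trial in range(150):
    p_choice = np.exp(beta * Q) / np.exp(beta * Q).sum()   # softmax policy
    a = rng.choice(2, p=p_choice)                          # choose an action
    r = float(rng.random() < p_win[a])                     # probabilistic payoff
    # Q[a] just before this update is the model-generated expected reward
    # for the chosen action, the quantity regressed against BOLD in vmPFC.
    Q[a] += alpha * (r - Q[a])                             # update chosen action only
```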
14.4.1. Goal-Directed Value Signals
However, although the above studies provide evidence of expected reward signals during reward-based action selection in vmPFC, they do not delineate the underlying associations upon which such signals depend. Such signals could be grounded in action-outcome associations, corresponding to the goal-directed component of instrumental conditioning; alternatively, they could reflect the encoding of habitual stimulus-response associations, which, by virtue of being incrementally strengthened by prior reinforcement, would be stronger in situations where a given action yields greater reward than in situations where it yields less. To address this question and determine whether human vmPFC is involved in encoding response-outcome (goal-directed) or stimulus-response (habitual) associations, Valentin, Dickinson, and O’Doherty (2007) used an experimental design inspired by the animal learning literature (Balleine and Dickinson 1998). A key behavioral manipulation allowing one to determine whether behavior is under goal-directed or habitual control is to selectively devalue the outcome associated with an action, for example by feeding the animal to satiety on that outcome, and then to test the degree to which the animal persists in choosing that action following the devaluation procedure. If action selection is goal directed (and therefore dependent on action-outcome associations), then the animal should immediately stop responding on the action associated with the devalued outcome. If, on the other hand, behavioral control is habitual and dependent on stimulus-response associations that do not link directly to the current value of the outcome, then the animal will persist in responding, as no representation of the outcome is elicited during performance. In the Valentin, Dickinson, and O’Doherty study, subjects were scanned while they learned to choose instrumental actions associated with the subsequent delivery of different food rewards. Following training, one of these foods was devalued by feeding the subject to satiety on that food (similar to the approach used in Gottfried, O’Doherty, and Dolan 2003). The subjects were then scanned again while being re-exposed to the instrumental choice procedure (in extinction). By testing for regions of the brain showing a change in activity during selection of the devalued action compared to the valued action from pre- to post-satiety, it is possible to identify regions sensitive to the learned action-outcome associations, thereby revealing candidate areas responsible for goal-directed instrumental learning. The regions found to show such a response profile included vmPFC as well as an additional region of central OFC (Figure 14.3). These findings suggest that learned representations in vmPFC are sensitive to the current incentive value of the reward, ruling out a contribution of this region to habitual stimulus-response processing, which by definition would not show such sensitivity.

FIGURE 14.3
(See Color Insert) Regions of vmPFC and OFC showing response properties consistent with action-outcome learning. Neural activity during action selection for reward shows a change in response properties as a function of the value of the outcome with each …
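The logic of the devaluation probe can be made concrete with a toy contrast between a goal-directed (response-outcome) controller and a habitual (cached stimulus-response) controller; the actions, outcomes, and numerical values below are entirely hypothetical.

```python
# Hypothetical learned structures after training on two actions.
outcome_value = {"food_A": 1.0, "food_B": 1.0}              # current incentive values
action_outcome = {"left": "food_A", "right": "food_B"}      # goal-directed R-O map
habit_strength = {"left": 0.8, "right": 0.8}                # cached S-R strengths

def goal_directed_value(action):
    # Recomputed from the outcome's current value on every query,
    # so it drops immediately after devaluation.
    return outcome_value[action_outcome[action]]

def habitual_value(action):
    # Reads out a cached strength, blind to the outcome's current value
    # until new prediction errors are actually experienced.
    return habit_strength[action]

outcome_value["food_A"] = 0.1       # selective devaluation (e.g., satiety)
print(goal_directed_value("left"))  # 0.1 -> responding should stop at once
print(habitual_value("left"))       # 0.8 -> responding persists in extinction
```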
The aforementioned results would therefore appear to implicate vmPFC in encoding action-outcome rather than stimulus-response associations. However, another possibility that cannot be ruled out on the basis of the Valentin, Dickinson, and O’Doherty study alone is that vmPFC may instead contain a representation of the discriminative stimuli (in this case the fractals) used to signal the different available actions, and that the learned associations in this region may be formed between these stimuli and the outcomes obtained (i.e., a Pavlovian rather than an instrumental association).
When comparing the results of the instrumental devaluation study by Valentin, Dickinson, and O’Doherty with those of the purely Pavlovian devaluation study by Gottfried, O’Doherty, and Dolan (2003) reviewed in Section 14.2, there is some suggestion that the brain regions involved in instrumental and Pavlovian associations may be at least partly dissociable. In the Gottfried, O’Doherty, and Dolan study, modulatory effects of reinforcer devaluation were found in central OFC but not vmPFC, whereas in the Valentin, Dickinson, and O’Doherty study evidence of instrumental devaluation effects was found in both central and medial OFC. This raises the possibility that medial OFC (part of vmPFC) may be more involved in the goal-directed component of instrumental conditioning, whereas central OFC may contribute more to Pavlovian stimulus-outcome learning (as central OFC was engaged both in the Valentin, Dickinson, and O’Doherty study, which did have a Pavlovian component, and in the earlier, purely Pavlovian devaluation study). This proposal is compatible with the known anatomical connectivity of these areas, in which central areas of OFC (Brodmann areas 11 and 13) receive input primarily from sensory areas, consistent with a role in stimulus-stimulus learning, whereas medial OFC (areas 14 and 25) receives input primarily from structures on the adjacent medial wall of prefrontal cortex such as cingulate cortex, an area often implicated in response selection and/or reward-based action choice (Carmichael and Price 1995, 1996).
More compelling evidence of a role for vmPFC specifically in encoding action-related value signals has come from an fMRI study by Glascher et al. (2009). In this study, subjects participated in two distinct tasks. In the “action-based” task, subjects had to choose between performing one of two different physical motor responses (rolling a trackball vs. pressing a button) in the absence of explicit discriminative stimuli signaling those actions. Monetary rewards or losses were delivered on a probabilistic basis according to the physical action chosen, and the rewards available on the different actions changed over time. Trial-by-trial model-predicted expected reward signals were generated for each action choice made by the subjects. Similar to the results found in studies where both discriminative stimulus information and action-selection components are present, activity in vmPFC was found to track the expected reward corresponding to the chosen action (Figure 14.4). These results suggest that activity in vmPFC does not necessarily depend on the presence of discriminative stimuli, indicating that this region contributes to the encoding of action-related value signals. In the other, “stimulus-based,” task, subjects performed action selection in which the decision options were denoted by specific discriminative stimuli; however, the physical actions implementing the different choice options were assigned randomly on each trial (depending on the random spatial positions of the two discriminative stimuli). In common with the action-based task, expected reward signals were also observed in vmPFC while subjects performed the stimulus-based task, consistent with a number of previous reports (Daw et al. 2006; Hampton, Bossaerts, and O’Doherty 2006; Kim, Shimojo, and O’Doherty 2006; Valentin, Dickinson, and O’Doherty 2007). Furthermore, a conjunction analysis testing for regions commonly activated in both the action-based and stimulus-based choice conditions revealed robust activity in vmPFC. Overall, these findings could be taken to indicate that vmPFC contributes to both stimulus-based and action-based processes. An alternative possibility is that activity in vmPFC during the stimulus-based condition is, in common with that in the action-based condition, also driven by goal-directed action-outcome associations. Although in the stimulus-based task the particular physical motor response required to implement a specific decision varied on a trial-by-trial basis (depending on where the stimuli were presented), it is possible for associations to be learned between a combination of visual stimulus locations, responses, and outcomes. Thus, the common involvement of vmPFC in both the action-based and stimulus-based tasks could be attributable to this region being generally involved in encoding the values of chosen actions, with those action-outcome relationships encoded in a more abstract and flexible manner than a concrete mapping of specific physical motor responses to outcomes. The more flexible encoding of “actions” that this framework would entail may have parallels with computational theories of goal-directed learning in which action selection is proposed to occur via a flexible forward-model system, which explicitly encodes the states of the world, the transition probabilities between those states, and the outcomes obtained in those states (Daw, Niv, and Dayan 2005).
Overall, therefore, these findings suggest that vmPFC may play a role in encoding the value of chosen actions irrespective of whether those actions denote physical motor responses or more abstract decision options.

FIGURE 14.4
(See Color Insert) Expected reward representations in vmPFC during action-based and stimulus-based decision making. (a) Illustration of action-based decision-making task in which subjects must choose between one of two different physical actions that yield …
Additional evidence in support of a contribution of vmPFC to goal-directed learning and to encoding action-based value signals has come from a study by Tanaka, Balleine, and O’Doherty (2008). Apart from devaluing the outcome, another way to distinguish goal-directed from habitual behavior in the animal learning literature is to degrade the action-outcome contingency. Contingency is the term used by behavioral psychologists to describe the causal relationship between an action and its outcome, defined as the difference between two probabilities: the probability of the outcome given that the action is performed and the probability of the outcome given that the action is not performed (Hammond 1980; Beckers, De Houwer, and Matute 2007). Animals whose behavior is under control of the goal-directed system, and who learn to perform an action for reward under a highly contingent relationship between actions and outcomes, will reduce their rate of responding on that action following degradation of that contingency, produced by making the reward available even if the action is not performed (Adams 1981; Dickinson and Balleine 1993). However, if the animals' behavior has become habitual, they will persist in responding on the action following contingency degradation, indicative of an insensitivity to action-outcome contingencies. Thus, another means apart from outcome devaluation to identify brain systems involved in goal-directed action-outcome learning, and to discriminate those regions from regions involved in habitual control, is to assess neural activity tracking the degree of contingency between actions and outcomes during instrumental responding. To study this process in humans, Tanaka, Balleine, and O’Doherty abandoned the traditional trial-based approach typically used in experiments with humans and non-human primates, in which subjects are cued to respond at particular times in a trial, in favor of the unsignaled, self-paced approach more often used in studies of associative learning in rodents, in which subjects themselves choose when to respond. Subjects were scanned with fMRI while, in different sessions, they responded on four different free-operant reinforcement schedules that varied in the degree of contingency between responses made and rewards obtained. Consistent with the findings from the outcome devaluation study of Valentin, Dickinson, and O’Doherty, activity in two sub-regions of vmPFC (medial OFC and medial prefrontal cortex), as well as in dorsomedial striatum, was higher on average across a session when subjects were performing on a high-contingency schedule than when they were performing on a low-contingency schedule (Figure 14.5). Moreover, in the sub-region of vmPFC identified on the medial wall, activity varied not only with the average degree of contingency across a schedule, but also with a locally computed estimate of the contingency between actions and outcomes that tracks rapid changes in contingency over time within a session, implicating this specific sub-region of medial prefrontal cortex in the on-line computation of action-outcome contingency.

FIGURE 14.5
Brain regions tracking objective action-outcome contingency in humans. (a) Schematic of task design used in the human fMRI study by Tanaka, Balleine, and O’Doherty (2008). Each experiment consisted of four sessions lasting 5 min each. A single …
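As an illustration, the objective contingency (often written ΔP) and one possible recency-weighted "local" estimate of it could be computed as follows; the leaky-count scheme is an assumption for illustration, not the specific measure used in the study.

```python
def delta_p(p_outcome_given_action, p_outcome_given_no_action):
    """Objective contingency: P(O | A) - P(O | ~A)."""
    return p_outcome_given_action - p_outcome_given_no_action

print(delta_p(0.5, 0.05))  # 0.45 -> highly contingent schedule
print(delta_p(0.5, 0.5))   # 0.0  -> degraded: reward arrives regardless of action

def local_delta_p(events, decay=0.9):
    """Recency-weighted running estimate over (acted, rewarded) observations."""
    n_act = n_no = r_act = r_no = 1e-6     # leaky counts (avoid division by zero)
    for acted, rewarded in events:
        n_act *= decay; n_no *= decay; r_act *= decay; r_no *= decay
        if acted:
            n_act += 1.0
            r_act += rewarded
        else:
            n_no += 1.0
            r_no += rewarded
    return r_act / n_act - r_no / n_no
```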
When taken together, all of the evidence described above implicates vmPFC in encoding reward predictions based on action-outcome associations, which suggests that predictive reward representations in these medial parts of prefrontal cortex may be distinct from those in more central and lateral parts of OFC, amygdala, and ventral striatum, which, as reviewed in Section 14.2, may be more involved in encoding Pavlovian stimulus-outcome as opposed to instrumental action-outcome associations.
14.4.2. Habit Value Signals
The finding that predictive reward representations in vmPFC are based on action-outcome associations leaves open the question of where and how reward predictions based on habitual S-R associations are encoded. A reasonable hypothesis based on the rodent literature is that such signals could be present in a part of the dorsolateral striatum (Yin, Knowlton, and Balleine 2004). However, perhaps surprisingly, very little is known about where or how habitual value signals are encoded in the human or primate brain more generally. Although a number of human fMRI studies have employed procedural learning paradigms, which are often assumed to invoke habit-like learning processes, these studies have not applied the outcome devaluation or contingency degradation probes described above, and as a consequence they cannot determine whether the responses acquired are habitual, goal-directed, or a combination of both. However, in those few paradigms that have used the appropriate manipulations in humans, such as the studies by Valentin, Dickinson, and O’Doherty and by Tanaka, Balleine, and O’Doherty, it is possible to test for the presence of habit-like value signals. In the Valentin study, such signals would take the form of a predictive reward signal that discriminates between rewarding and non-rewarding actions during acquisition but shows no difference in activity between the devalued and valued actions following the devaluation procedure, whereas in the Tanaka, Balleine, and O’Doherty study, candidate habit signals would have been expected to emerge particularly in situations where subjects were responding on a schedule with a low contingency between actions and outcomes. Notably, in neither of these studies was clear evidence found for predictive reward signals consistent with habitual associative processes. This failure does not, of course, imply that such signals are not present; a number of factors could contribute to the lack of evidence. First of all, habitual processes by their very nature might be expected to be much less metabolically demanding, such that the degree of blood oxygenation required to sustain them might be considerably less than for goal-directed representations. The neural correlates of habit signals may therefore be much weaker and more difficult to detect using standard BOLD imaging protocols than the goal-directed component. Alternatively, perhaps the BOLD correlates of these different signals are only present when behavior is being controlled by the corresponding system. In the Valentin study, behavior is known to have been under the control of the goal-directed system, because subjects showed significant reductions in responding on the action associated with the devalued outcome compared to the action associated with the still-valued outcome. In animal studies, one of the key behavioral manipulations required to demonstrate habitual control in rodents is to overtrain animals on a particular instrumental action (Balleine and Dickinson 1998); actions exposed to only little or moderate training appear to be predominantly under goal-directed control.
Thus, it is possible that in the Valentin, Dickinson, and O’Doherty study the failure to observe clear evidence of habitual value signals reflects the fact that subjects received only moderate training on the instrumental actions. Similarly, in Tanaka, Balleine, and O’Doherty, subjects were exposed to only brief 5 min sessions for each contingency, and therefore it is likely that behavior was also under goal-directed and not habitual control. A final possibility is that because humans probably possess enhanced cognitive control mechanisms compared to rodents, human behavior might remain under goal-directed control even after extensive training. In order to address these questions, Tricomi, Balleine, and O’Doherty (2009) scanned subjects with fMRI while they performed on a variable-interval schedule for food rewards. Subjects completed multiple training sessions, each of 8 min duration, during which they performed one of two different actions in particular trial blocks, leading to the delivery of one of two specific food outcomes (stored and consumed after the session). One group of subjects was overtrained by receiving four training sessions per day on three separate days, while another group received only moderate training consisting of two sessions. Following the training sessions, subjects were fed to satiety on one of the foods, thereby selectively devaluing that outcome. When tested in extinction, the overtrained group showed a tendency to maintain responding on the action associated with the now devalued outcome, remarkably analogous to behavior shown in rodents under similar circumstances, whereas the moderately trained group quickly reduced their responding on the action associated with the devalued outcome. Moreover, activity in a posterior region of lateral striatum was found to increase over the course of training and was maximal on the final day of training in the overtrained group. These findings therefore indicate that humans do show behavioral evidence of habitual control following overtraining, just as rodents do, and that a region of posterior lateral striatum may be involved in this process. This supports the idea that BOLD correlates of habit learning become increasingly detectable over the course of learning, perhaps reflecting the greater involvement of this system in controlling behavior as a function of training. It is also notable that the involvement of human posterolateral striatum in the expression of habitual behavior resonates with findings implicating a similar part of the striatum in this process in rodents (Yin, Knowlton, and Balleine 2004).
14.5. COMPUTATIONAL SIGNALS UNDERLYING LEARNING OF INSTRUMENTAL ASSOCIATIONS
We now consider computational theories about how instrumental associations, be they goal directed or habitual, might be acquired. In Section 14.3 we reviewed the RW model and its real-time extensions and showed how these models appear to provide a good characterization of neural signals underlying learning of stimulus-reward associations. For instrumental conditioning, a related class of models can be invoked collectively known as reinforcement learning (Sutton and Barto 1998). The core feature of reinforcement learning models is that in order to choose optimally between different actions, an agent needs to maintain internal representations of the expected reward available on each action, and then subsequently choose the action with the highest expected value. Also central to these algorithms is the notion of a prediction error signal that is used to learn and update expected values for each action through experience, just as in the RW learning model for Pavlovian conditioning described earlier. In one such model—the actor/critic—action selection is conceived as involving two distinct components: a critic, which learns to predict future reward associated with particular states in the environment, and an actor, which chooses specific actions in order to move the agent from state to state according to a learned policy (Barto 1992, 1995). The critic encodes the value of particular states in the world and as such has the characteristics of a Pavlovian reward prediction signal described above. The actor stores a set of probabilities for each action in each state of the world and chooses actions according to those probabilities. The goal of the model is to modify the policy stored in the actor such that over time those actions associated with the highest predicted reward are selected more often. This is accomplished by means of the aforementioned prediction error signal that computes the difference in predicted reward as the agent moves from state to state. This signal is then used to update value predictions stored in the critic for each state, but also to update action probabilities stored in the actor such that if the agent moves to a state associated with greater reward (and thus generates a positive prediction error), then the probability of choosing that action in future is increased. Conversely, if the agent moves to a state associated with less reward, this generates a negative prediction error and the probability of choosing that action again is decreased.
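The division of labor between the two modules can be sketched in a few lines; the toy chain environment, learning rates, and softmax policy below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2
V = np.zeros(n_states)                      # critic: state values
prefs = np.zeros((n_states, n_actions))     # actor: action preferences (policy)
alpha_v, alpha_p, gamma = 0.1, 0.1, 0.95

def step(s, a):
    # Toy environment: action 0 advances toward a rewarded terminal
    # state; action 1 resets the agent to the start.
    if a == 0:
        s_next = s + 1
        return s_next, (1.0 if s_next == n_states else 0.0)
    return 0, 0.0

for episode in range(500):
    s = 0
    while s < n_states:
        p = np.exp(prefs[s]) / np.exp(prefs[s]).sum()   # softmax over preferences
        a = rng.choice(n_actions, p=p)
        s_next, r = step(s, a)
        v_next = 0.0 if s_next == n_states else V[s_next]
        delta = r + gamma * v_next - V[s]    # one TD error trains both modules
        V[s] += alpha_v * delta              # critic update (state-value prediction)
        prefs[s, a] += alpha_p * delta       # actor update (action probabilities)
        s = s_next
```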
In order to distinguish regions of the brain mediating the actor from those mediating the critic, O’Doherty et al. (2004) scanned hungry human subjects with fMRI while they performed a simple instrumental conditioning task in which they were required to choose between two actions leading to juice reward with either a high or a low probability (Figure 14.6a). Neural responses corresponding to the generation of prediction error signals during performance of the instrumental task were compared to those elicited during a control Pavlovian task in which subjects experienced the same stimulus-reward contingencies but did not actively choose which action to select. This comparison was designed to isolate the actor (hypothesized to be engaged only in the instrumental task) from the critic (hypothesized to be engaged in both the instrumental and Pavlovian control tasks). Whereas dorsal striatum correlated with prediction errors in the instrumental task only, ventral striatum correlated with prediction errors in both the instrumental and Pavlovian tasks (Figure 14.6b). These findings provided evidence of a ventral/dorsal trend within the striatum, such that ventral striatum is more concerned with implementing the critic while dorsal striatum is more involved in implementing the actor, confirming an initial proposal along these lines by Montague, Dayan, and Sejnowski (1996). However, while this study provides evidence in support of the presence of an actor/critic-type mechanism in the brain, it does not establish the causal role of such signals in learning these associations and in subsequently controlling behavior.

FIGURE 14.6
Prediction error signals underlying action selection for reward. (a) Schematic of instrumental choice task used by O’Doherty et al. (2004). On each trial of the reward condition, the subject chooses between two possible actions, one associated …
To address this issue, Schonberg et al. (2007) made use of an instrumental reward-conditioning task that human subjects vary widely in their ability to learn. Approximately 50% of subjects fail, over 150 trials, to converge in their choices toward the two options (out of four available) that yield the greatest probability of reward, whereas the other 50% tend to converge quite rapidly on the optimal choices. This property of the task provides a useful means of testing whether reward prediction error signals in dorsal striatum differentiate those subjects who successfully learn the instrumental associations from those who do not. To test this, both “learner” and “non-learner” groups were scanned while performing the task. Consistent with the possibility that reward prediction errors in dorsal striatum are causally related to the acquisition of instrumental reward associations, activity in dorsal striatum was significantly better correlated with reward prediction error signals in learners than in non-learners. On the other hand, consistent with the actor/critic proposal, although reward prediction error activity in ventral striatum was also weaker in the non-learners, this ventral striatal activity did not differ significantly between groups.
These results suggest a dorsal/ventral distinction within the striatum whereby ventral striatum is more concerned with Pavlovian or stimulus-outcome learning, while dorsal striatum is more engaged during learning of stimulus-response or stimulus-response-outcome associations. The suggestion that human dorsal striatum is specifically involved in situations when subjects need to select actions in order to obtain reward has received support from a number of other fMRI studies, both model based and trial based (Haruno et al. 2004; Tricomi, Delgado, and Fiez 2004).
14.6. COMPUTATIONAL MODELS OF INSTRUMENTAL LEARNING AND GOALS VERSUS HABITS
Another outstanding issue is the extent to which the distinction between goal-directed and habitual reward predictions described earlier maps onto theories of computational reinforcement learning such as the actor/critic. Daw, Niv, and Dayan (2005) proposed that a reinforcement learning model such as the actor/critic is concerned purely with the learning of habitual S-R value signals (see also Balleine, Daw, and O’Doherty 2008). According to this interpretation, action-value signals learned by an actor/critic would not be immediately updated following a change in the value of the reward outcome (such as by devaluation). Instead, such an update would occur only after the model re-experiences the reward in its now devalued state and generates prediction errors that incrementally modulate the action values. Alternatively, action values learned by the actor/critic could reflect the strength of an association not only between stimuli and responses, but also between actions and outcomes. In that event, such a representation would show devaluation effects and hence meet the criteria for being goal directed. There are currently insufficient empirical data to distinguish between these possibilities. One potential way to address this is to determine whether the reward prediction error signal generated by dopamine neurons (which should be at the core of any reinforcement learning-like process) is sensitive to devaluation effects. If not, this would support the idea that reinforcement learning models and the associated error signal are involved in learning habitual S-R associations; otherwise, the idea that dopamine neurons may contribute to the goal-directed component of learning will have to be entertained.
Daw et al. (2005) also proposed an alternative model to the actor/critic to account for the goal-directed component of instrumental learning: a forward model. Unlike reinforcement learning, which develops approximate or “cached” values for particular actions based on prior experience with those actions, the forward model computes values for different actions on-line, taking into account knowledge of the rewards available in each state and the transition probabilities between states, and iteratively working out the value of each available option, analogous to how a chess player or computer chess algorithm might work out which move to make next by explicitly thinking through the consequences of the possible moves. One property of this model is that value representations should be updated not merely incrementally but instantaneously following detection of a change in the underlying state-space or structure of the decision problem.
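A minimal sketch of this on-line, tree-search flavor of valuation, over a hypothetical two-option state space; the states, rewards, and depth limit are assumptions for illustration.

```python
# Known model of the world: state -> action -> [(next_state, probability)].
transitions = {
    "start": {"left": [("snack", 1.0)], "right": [("toy", 1.0)]},
}
reward = {"start": 0.0, "snack": 1.0, "toy": 0.4}

def plan_value(state, depth=2):
    """Depth-limited look-ahead: best value achievable from this state."""
    if depth == 0 or state not in transitions:
        return reward[state]
    best = float("-inf")
    for action, outcomes in transitions[state].items():
        v = sum(p * plan_value(s_next, depth - 1) for s_next, p in outcomes)
        best = max(best, v)
    return reward[state] + best

print(plan_value("start"))   # prefers the snack branch (value 1.0)
reward["snack"] = 0.0        # devalue the snack outcome
print(plan_value("start"))   # preference flips to the toy branch immediately
```

Because values are recomputed from the world model at choice time, a change in outcome value or task structure alters behavior instantly, with no need for incremental relearning.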
Evidence that predictive reward representations in the brain are sensitive to changes in state has emerged from a study by Hampton, Bossaerts, and O’Doherty (2006). In this study subjects participated in a decision problem called probabilistic reversal learning, in which two different actions yield reward with distinct probabilities. The key feature of the task is that the reward probabilities on the two actions are anti-correlated (when one action has a high value, the other has a low value), and that after an unpredictable series of trials the reward probabilities on the two actions reverse. Hampton et al. compared two computational models in terms of how well they could account for human choice behavior on the task and for the pattern of neural activity in vmPFC during task performance. One of these algorithms incorporated the rules or structure of the decision problem, as would be expected for a state-based inference mechanism, such that following a reversal the values of the two actions were instantly updated to reflect knowledge of the abstract structure. The other model was a simple reinforcement learning algorithm that did not incorporate the structure and thus would only slowly and incrementally update values over successive reinforcements. Consistent with a role for vmPFC in state-based inference, predicted reward signals in this region were found to reflect the structure of the decision problem, such that activity was updated instantly following a reversal, rather than incrementally as would be expected of a simple, non-state-based reinforcement learning mechanism.
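A simplified sketch of such a state-based inference mechanism (not the exact model used by Hampton et al.): the agent maintains a posterior belief over which action is currently the "good" one, and because the task structure is anti-correlated, evidence for a reversal flips both action values at once. All probabilities below are illustrative assumptions.

```python
p_good = {"A": 0.5, "B": 0.5}          # belief over the hidden state
p_r_good, p_r_bad = 0.7, 0.3           # reward probability of good/bad action
p_reversal = 0.1                       # per-trial chance the hidden state flips

def update_belief(choice, rewarded):
    # Likelihood of the observation under each hidden state.
    lik_A = p_r_good if choice == "A" else p_r_bad     # if "A is good"
    lik_B = p_r_bad if choice == "A" else p_r_good     # if "B is good"
    if not rewarded:
        lik_A, lik_B = 1 - lik_A, 1 - lik_B
    post_A = lik_A * p_good["A"]
    post_B = lik_B * p_good["B"]
    post_A, post_B = post_A / (post_A + post_B), post_B / (post_A + post_B)
    # Allow for a reversal before the next trial.
    p_good["A"] = post_A * (1 - p_reversal) + post_B * p_reversal
    p_good["B"] = 1 - p_good["A"]

def expected_value(action):
    # Anti-correlated structure: one action's gain is the other's loss,
    # so both values swap together as soon as the belief flips.
    return p_good[action] * p_r_good + (1 - p_good[action]) * p_r_bad
```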
Although the forward-model approach is appealing as a possible computational explanation for goal-directed learning, evidence in support of such a model is still rather preliminary. For now, the main conclusion available on the basis of the current evidence is that prediction error signals certainly seem to play a role in the acquisition of instrumental associations, particularly through input into dorsal striatum, and that changes in state representations can result in the direct modulation of value signals in vmPFC. Further studies, both at the level of human imaging and of single-unit neurophysiology in other animals, will be needed to establish the extent to which such signals contribute to the learning of goal-directed values, habitual values, or both.
14.7. CONCLUSIONS
In this chapter, we have reviewed evidence for the existence of multiple types of predictive reward signals in the human brain and for a number of different learning mechanisms that might underlie their acquisition. Pavlovian or stimulus-bound reward predictions appear to be present in three principal brain regions: OFC (particularly central and lateral areas), amygdala, and ventral striatum. Such signals may be learned via a stimulus-based reward prediction error signal originating in the phasic activity of dopamine neurons and projecting to ventral striatum and elsewhere. In addition, reward predictions based on instrumental action-outcome associations appear to be encoded in vmPFC, which incorporates medial OFC and adjacent medial prefrontal cortex, as well as in anterior medial striatum. Furthermore, a region of more posterior lateral striatum appears to be engaged once behavior has come under habitual rather than goal-directed control following repetitive performance of a particular action, that is, when individuals no longer take into account the incentive value of outcomes while performing actions previously associated with those outcomes. There is also now considerable evidence to suggest that prediction error signals projecting into dorsal striatum may play a direct role in the acquisition of instrumental reward associations.
An important future direction will be to establish clearly the extent to which stimulus-bound and action-bound predictions are neurally dissociable. Another pressing issue, not touched on in the present chapter, is that once the neural systems responsible for each type of reward prediction have been delineated, it will be important to understand how these different reward-prediction systems interact in order ultimately to control behavior. Although this question has already been studied extensively in the animal literature, it has as yet received only preliminary treatment in humans (Bray et al. 2008; Talmi et al. 2008).
REFERENCES
- Adams C. D. Variations in the sensitivity of instrumental responding to reinforcer devaluation. Q J Exp Psychol. 1981;34B:77–98.
- Balleine B. W., Daw N. D., O’Doherty J. P. Multiple forms of value learning and the function of dopamine. In: Glimcher P. W., Camerer C. F., Poldrack R. A., Fehr E., editors. Neuroeconomics: Decision Making and the Brain. New York: Academic Press; 2008. pp. 367–87.
- Balleine B. W., Dickinson A. Goal-directed instrumental action: Contingency and incentive learning and their cortical substrates. Neuropharmacology. 1998;37:407–19. [PubMed: 9704982]
- Barto A. G. Reinforcement learning and adaptive critic methods. In: White D. A., Sofge D. A., editors. Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. New York: Van Nostrand Reinhold; 1992. pp. 469–91.
- Barto A. G. Adaptive critics and the basal ganglia. In: Houk J. C., Davis J. L., Beiser B. G., editors. Models of Information Processing in the Basal Ganglia. Cambridge, MA: MIT Press; 1995. pp. 215–32.
- Beckers T., De Houwer J., Matute H. Human Contingency Learning: Recent Trends in Research and Theory. London: Psychology Press; 2007.
- Bray S., Rangel A., Shimojo S., Balleine B., O’Doherty J.P. The neural mechanisms underlying the influence of pavlovian cues on human decision making. J Neurosci. 2008;28:5861–66. [PMC free article: PMC6670800] [PubMed: 18509047]
- Carmichael S. T., Price J. L. Sensory and premotor connections of the orbital and medial prefrontal cortex of macaque monkeys. J Compar Neurol. 1995;363:642–64. [PubMed: 8847422]
- Carmichael S. T., Price J. L. Connectional networks within the orbital and medial prefrontal cortex of macaque monkeys. J Compar Neurol. 1996;371:179–207. [PubMed: 8835726]
- Cromwell H. C., Schultz W. Effects of expectations for different reward magnitudes on neuronal activity in primate striatum. J Neurophysiol. 2003;89:2823–38. [PubMed: 12611937]
- Daw N. D., Niv Y., Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci. 2005;8:1704–11. [PubMed: 16286932]
- Daw N. D., O’Doherty J. P., Dayan P., Seymour B., Dolan R. J. Cortical substrates for exploratory decisions in humans. Nature. 2006;441:876–79. [PMC free article: PMC2635947] [PubMed: 16778890]
- Day J. J., Wheeler R. A., Roitman M. F., Carelli R. M. Nucleus accumbens neurons encode Pavlovian approach behaviors: Evidence from an autoshaping paradigm. Eur J Neurosci. 2006;23:1341–51. [PubMed: 16553795]
- Day J. J., Carelli R. M. The nucleus accumbens and Pavlovian reward learning. Neuroscientist. 2007;13:148–59. [PMC free article: PMC3130622] [PubMed: 17404375]
- Dayan P., Abbott L. Theoretical Neuroscience. Cambridge, MA: MIT Press; 2001.
- Dickinson A., Balleine B. W. Actions and responses: The dual psychology of behaviour. In: Eilan N., McCarthy R., Brewer M. W., editors. Spatial Representation. Oxford: Basil Blackwell; 1993. pp. 277–93.
- Glascher J., Hampton A. N., O’Doherty J. P. Determining a role for ventromedial prefrontal cortex in encoding action-based value signals during reward-related decision making. Cereb Cortex. 2009;19:483–95. [PMC free article: PMC2626172] [PubMed: 18550593]
- Gottfried J. A., O’Doherty J., Dolan R. J. Appetitive and aversive olfactory learning in humans studied using event-related functional magnetic resonance imaging. J Neurosci. 2002;22:10829–37. [PMC free article: PMC6758414] [PubMed: 12486176]
- Gottfried J. A., O’Doherty J., Dolan R. J. Encoding predictive reward value in human amygdala and orbitofrontal cortex. Science. 2003;301:1104–7. [PubMed: 12934011]
- Hammond L. J. The effect of contingency upon appetitive conditioning of free operant behavior. J Exp Anal Behav. 1980;34:297–304. [PMC free article: PMC1333008] [PubMed: 16812191]
- Hampton A. N., Bossaerts P., O’Doherty J. P. The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J Neurosci. 2006;26:8360–67. [PMC free article: PMC6673813] [PubMed: 16899731]
- Haruno M., Kuroda T., Doya K., Toyama K., Kimura M., Samejima K., Imamizu H., Kawato M. A neural correlate of reward-based behavioral learning in caudate nucleus: A functional magnetic resonance imaging study of a stochastic decision task. J Neurosci. 2004;24:1660–65. [PMC free article: PMC6730455] [PubMed: 14973239]
- Kim H., Shimojo S., O’Doherty J. P. Is avoiding an aversive outcome rewarding? Neural substrates of avoidance learning in the human brain. PLoS Biol. 2006;4 [PMC free article: PMC1484497] [PubMed: 16802856]
- McClure S. M., Berns G. S., Montague P. R. Temporal prediction errors in a passive learning task activate human striatum. Neuron. 2003;38:339–46. [PubMed: 12718866]
- Montague P. R., Dayan P., Sejnowski T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci. 1996;16:1936–47. [PMC free article: PMC6578666] [PubMed: 8774460]
- O’Doherty J., Dayan P., Friston K., Critchley H., Dolan R. J. Temporal difference models and reward-related learning in the human brain. Neuron. 2003;38:329–37. [PubMed: 12718865]
- O’Doherty J., Dayan P., Schultz J., Deichmann R., Friston K., Dolan R. J. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science. 2004;304:452–54. [PubMed: 15087550]
- Paton J. J., Belova M. A., Morrison S. E., Salzman C. D. The primate amygdala represents the positive and negative value of visual stimuli during learning. Nature. 2006;439:865–70. [PMC free article: PMC2396495] [PubMed: 16482160]
- Rescorla R. A., Wagner A. R. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In: Black A. H., Prokasy W. F., editors. Classical Conditioning II: Current Research and Theory. New York: Appleton-Century-Crofts; 1972. pp. 64–99.
- Schoenbaum G., Chiba A. A., Gallagher M. Orbitofrontal cortex and basolateral amygdala encode expected outcomes during learning. Nat Neurosci. 1998;1:155–59. [PubMed: 10195132]
- Schonberg T., Daw N. D., Joel D., O’Doherty J. P. Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. J Neurosci. 2007;27:12860–67. [PMC free article: PMC6673291] [PubMed: 18032658]
- Schultz W. Predictive reward signal of dopamine neurons. J Neurophysiol. 1998;80:1–27. [PubMed: 9658025]
- Schultz W., Dayan P., Montague P. R. A neural substrate of prediction and reward. Science. 1997;275:1593–99. [PubMed: 9054347]
- Shidara M., Aigner T. G., Richmond B. J. Neuronal signals in the monkey ventral striatum related to progress through a predictable series of trials. J Neurosci. 1998;18:2613–25. [PMC free article: PMC6793099] [PubMed: 9502820]
- Sutton R. S. Learning to predict by the methods of temporal differences. Mach Learn. 1988;3:9–44.
- Sutton R. S., Barto A. G. Reinforcement Learning. Cambridge, MA: MIT Press; 1998.
- Talmi D., Seymour B., Dayan P., Dolan R. J. Human pavlovian-instrumental transfer. J Neurosci. 2008;28:360–68. [PMC free article: PMC2636904] [PubMed: 18184778]
- Tanaka S. C., Balleine B. W., O’Doherty J. P. Calculating consequences: Brain systems that encode the causal effects of actions. J Neurosci. 2008;28:6750–55. [PMC free article: PMC3071565] [PubMed: 18579749]
- Tremblay L., Schultz W. Relative reward preference in primate orbitofrontal cortex. Nature. 1999;398:704–8. [PubMed: 10227292]
- Tricomi E., Balleine B. W., O’Doherty J. P. A specific role for posterior dorsolateral striatum in human habit learning. Eur J Neurosci. 2009;29:2225–32. [PMC free article: PMC2758609] [PubMed: 19490086]
- Tricomi E. M., Delgado M. R., Fiez J. A. Modulation of caudate activity by action contingency. Neuron. 2004;41:281–92. [PubMed: 14741108]
- Valentin V. V., Dickinson A., O’Doherty J. P. Determining the neural substrates of goal-directed learning in the human brain. J Neurosci. 2007;27:4019–26. [PMC free article: PMC6672546] [PubMed: 17428979]
- Yin H. H., Knowlton B. J., Balleine B. W. Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning. Eur J Neurosci. 2004;19:181–89. [PubMed: 14750976]