Nat Neurosci. Author manuscript; available in PMC Nov 1, 2012.
PMCID: PMC3378641
EMSID: UKMS41108

Mapping value based planning and extensively trained choice in the human brain

Abstract

Investigations of the underlying mechanisms of choice in humans have focused on learning from prediction errors, and so the computational structure of value based planning is comparatively underexplored. Using behavioural and neuroimaging analyses of a minimax decision task, we show that the computational processes underlying forward planning are expressed in the anterior caudate nucleus as values of individual branching steps in a decision tree. In contrast, values represented in the putamen pertain solely to values learnt during extensive training. During actual choice, both striatal areas show a functional coupling to ventromedial prefrontal cortex, consistent with this region acting as a value comparator. Our findings point towards an architecture of choice in which segregated value systems operate in parallel in the striatum for planning and extensively trained choices, with medial prefrontal cortex integrating their outputs.

An overarching view of adaptive behaviour is that humans and animals seek to maximise reward and minimise punishment in their choices. Solutions to such value based decision problems fall along a crude spectrum. At one end, on-the-fly planning, based on a model of the relevant domain, can determine which of the available actions lead to a desired outcome. Finding optimal actions in this type of choice context, for instance by searching the branches of a decision tree for the best outcome, poses severe demands on computation and memory, and becomes rapidly intractable with growing complexity. This planning end of the spectrum is of particular importance when we have relatively little experience in an environment or where its aspects change quickly.

By contrast, when subjects have extensive practice in a relatively stable domain, they can directly learn from experience about the affective consequences of different actions. Decision-making at this end of the spectrum can become highly automatized, and indeed need no longer be based on a complex representational model of the world. One of the main results in the field of reinforcement learning 1, and indeed one of the earliest insights in artificial intelligence 2 is that it is possible to learn optimal actions in complex, but stable, domains by making and measuring errors in predictions over the course of extended experience.

A rich body of work on value-based decision making in humans has focused on learning based on prediction errors 3, 4. Although there has been extensive investigation of various tasks involving planning, such as the Tower of London 5, these tasks have typically not focused on value, and have not been designed to compare the two ends of the spectrum referred to above. More recent investigations targeting this spectrum 6-8 have been revealing, but have not directly addressed the computational mechanisms or neural encoding of value based planning, or the integration of extensively trained and planning based evaluation and choice.

We designed a novel value based choice task for human subjects, and used fMRI to examine the neural mechanisms underlying forward planning on the one hand, and choices based on learnt values after extensive behavioural training on the other. Our task allowed us to index planning and extensively trained contexts separately, enabling us to investigate the value representations in the brain associated with the computational processes of each type of choice. We found that medial striatum was more strongly engaged during planning, and lateral striatum during choices in extensively trained contexts. Importantly, the BOLD signals in caudate pertained to individual computational components of planned choice values while, by contrast, signals in posterior putamen selectively fluctuated with the values during extensively trained responses. The results provide direct evidence in humans for multiple decision systems that operate independently and in parallel and recruit neural structures along a mediolateral axis in basal ganglia. Furthermore, prefrontal cortex, specifically ventromedial prefrontal cortex (vmPFC), represented the value of the chosen option across systems, highlighting its possible role as a value comparator across both decision systems.

Results

Twenty-one subjects participated in a decision task in which decision values derived either from forward planning or from extensive training could be distinguished on a trial-by-trial basis.

One component of the task required subjects to navigate a tree-shaped branching maze to reach one of several available terminal states. Each state was associated with distinct probabilities of getting reward, thus rendering the value components of individual branches in the decision tree computationally transparent. In pure planning trials (Fig. 1a), probabilities of reward were visually displayed, but could change on each trial. Three consecutive choices led from the start state to the terminal state. Subjects planned the first and last choices; the middle choice was made by a predictable computer agent acting according to a fully disclosed rule (minimax: namely selecting the tree branch having the lower maximum value). This latter step induced a tree search strategy for calculating planned values, whereas, for instance, a mere requirement to compare displayed values might fail to invoke sufficient forward planning. Thus, in our task, the best possible choice required a form of dynamic programming, involving the estimation of values at distinct stages in the tree (Fig. 1b).
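
As an illustration of this computation (not part of the analysis pipeline itself, which was implemented in MATLAB and SPM), the following minimal Python sketch rolls values back through the three-layer binary tree for a hypothetical display of eight reward probabilities: the subject maximizes at layers 1 and 3, whereas the computer agent minimizes at layer 2.

```python
import numpy as np

# Hypothetical reward probabilities for the eight terminal rooms of one planning trial
# (the numbers displayed to subjects changed on every trial).
leaf_p = np.array([0.2, 0.7, 0.4, 0.1, 0.9, 0.3, 0.6, 0.5])

# Layer 3: the subject chooses the better of the two leaves below each node (maximize).
layer3_values = leaf_p.reshape(4, 2).max(axis=1)

# Layer 2: the computer agent picks the branch with the LOWER maximum value (minimax).
layer2_values = layer3_values.reshape(2, 2).min(axis=1)

# Layer 1 (root): the subject chooses the branch with the higher rolled-back value.
best_root_action = int(np.argmax(layer2_values))   # 0 = left branch, 1 = right branch
v_target, v_alt_root = layer2_values.max(), layer2_values.min()
print(best_root_action, round(v_target, 2), round(v_alt_root, 2))
```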

Figure 1
Task and behavioural results

A second component of the overall task design involved trials which did not require forward planning and were instead extensively exercised during three days of behavioural training. In these, subjects had to make single choices between two available actions after having learnt values from samples of a probabilistic reward delivery process (Fig. 1c). The inclusion of separate planning and extensively trained trials allowed us to investigate neural computations unique to each decision system. Subsequent to this we examined mixed trials, involving choices between a planning branch and a trained branch (Fig. 1d), and trials involving choices between two trained branches. From a normative perspective, the combination of both components within the same trial entails a direct comparison of planned values from one branch with values derived from the extensively trained task on the other branch.

Behavioural results

Subjects’ choices were largely consistent with choice values over all trial types (Table S1), confirming that planning was cognitively tractable, and consistent with subjects having learnt values and action mappings in the trained trials.

On average, subjects chose optimally in 94% of planning trials, and chose the rewarding door in 98% of extensively trained trials. Our task was designed such that only a tree search strategy would yield good performance in planning trials. To test whether participants indeed used a tree search strategy we compared individual subjects’ choices to the optimal minimax strategy, and to other simpler heuristics, such as picking the path with the largest maximum value later in the tree or picking the path with the highest average value in the leaf nodes. Subjects’ behaviour was better explained by the minimax planning strategy than by any of the alternative heuristics (Fig. 1e; p < 10⁻⁸, Wilcoxon rank sum test). Moreover, choices in every individual subject matched choices predicted by the planning strategy more closely than those predicted by the heuristics (Table S2).

Before undergoing fMRI, subjects were trained for three days on extensively trained trials to ensure that associated values had stabilized. Over the course of that training, subjects’ responses converged to the optimal action in each context; there was no difference between the rate of correct responses in higher and lower valued contexts from day 2 onwards (Fig. S1).

Categorical neural differences: planning - trained choices

We first compared activity at the time of initial choice during planning trials versus activity in trials involving extensively trained choices. Activity dissociated along an anteromedio-posterolateral axis in basal ganglia (Fig. 2a). Structures preferentially activated during planning included caudate and medial striatum, thalamus, bilateral anterior insula (AIC), dorsomedial prefrontal cortex (dmPFC), dorsolateral prefrontal cortex (dlPFC), bilateral medial frontal gyrus (MFG) and parietal cortex (precuneus extending into IPS). In contrast, lateral posterior putamen, posterior insula extending into the medial temporal gyrus, vmPFC, and somatosensory postcentral gyrus were more strongly activated in trained trials (all p < 0.05 FWE; Table S3). Anatomically defined region of interest (ROI) analyses confirmed that BOLD responses in caudate increased significantly only during planning trials, while posterior putamen activity was selective to trained trials.

Figure 2
Neural correlates of planning versus extensively trained choices

Neural correlates of choice relevant values

As choice crucially depends on value, we next investigated neural responses pertaining to valuations of available choices in two striatal regions strongly linked to decision making, namely anterior caudate nucleus, implicated in goal directed choices 9, 10, and posterior putamen, which has been associated with overtrained choices 8. In addition we examined responses in vmPFC, a region also widely implicated in value based choice 11-13. We delineated regions of interest a priori (see Fig. S2 for location details) based on previous research and anatomical criteria, and regressed various values against fMRI signals in these regions. One set of values were those of the target (the choice leading to the best reachable outcome, taking account of the computer’s minimax strategy), and of the alternative choices along the traversed maze path. This was motivated by the fact that these are the values that need to be compared during tree search. Indeed, consistent with this hypothesis, fMRI signals in the caudate covaried with the difference between target and alternative values, as evident in significant positive effects for target and negative effects for the alternatives (Fig. 2b). Strikingly, during the root choice, caudate activity related to several values relevant for a given choice including those at the present (Vtarget − Valt.root), but also to the consecutive choice deeper in the tree (Vtarget − Valt.deep). Note that successful forward search of the decision tree required a consideration of the latter values even whilst at the root state. During subjects’ second choice (layer 3), caudate activity was still associated with the values of both alternatives at the now current choice, but no longer with the value of the previously rejected root branch. These value difference signals are likely to reflect the output of value comparisons, a predicted hallmark of a cognitive implementation of tree search. Critically, the effects seen in caudate for planning value components were not evident in posterior putamen. Instead, the putamen solely encoded values on extensively trained trials at the time of choice (Fig. 2c).

Motivated by those results we next examined how the two networks interact in decisions that require a comparison of the respective values represented in these two distinct clusters. Thus, we presented subjects with a choice between a planning branch and a trained branch (see Fig. 1d), i.e., a task where subjects need to access both planned and trained values. By design, the value of the trained branch was uncorrelated with the values of the planning branch, allowing us to distinguish the influence of both value types. The caudate consistently represented the planned target value and negatively the value of the alternative option on the planning branch, in keeping with its performing the same value difference computations as in pure planning trials. Importantly, the caudate represented these planned values, and not values of the trained branch, regardless of which branch was later chosen (Fig. 3a). By contrast, activity in putamen pertained solely to values of the available trained branches, also irrespective of later choice (Fig. 3b). Note also that putamen represented the stimulus values of both available actions in trials comparing values from two trained branches.

Figure 3
Comparing values from planning and values from extensively trained mazes

The finding that activity in caudate and putamen covaried with planned values and values from the trained trials respectively, even for actions that were not chosen, provides direct evidence for a parallel and independent operation of two separate controllers. In turn, this parallel operation afforded us the opportunity to examine how these systems compete at the time of choice. Similar to action values 14, 15, the striatal correlates of planned values and values of the trained branches fulfil the criteria for pre-choice values and are likely to serve as inputs to a final decision comparator. The region most commonly implicated in comparative valuation is the vmPFC 16-19. Here, we observed that vmPFC activity covaried with the value of the chosen branch, a post choice signal, irrespective of whether it was planned or trained (Fig. 3c). Importantly, we found no evidence for the representation of mere stimulus values in the vmPFC cluster: if vmPFC activity had related to some form of representation of both value options (or their sum), then we would have expected to see a positive effect for both chosen and unchosen values in this contrast. Furthermore, we ruled out that the vmPFC signals represented the best option (maximum value) rather than the chosen option by re-estimating our GLM with a maximum choice value regressor and performing Bayesian model comparison 20 between both GLMs. This analysis provided strong evidence for a choice related signal in both mixed and trained/trained trials (exceedance probability > 0.99).
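
For readers unfamiliar with exceedance probabilities, the sketch below shows how such a probability can be obtained by Monte Carlo sampling once posterior Dirichlet parameters over model frequencies have been estimated with the variational procedure of ref. 20; the parameter values here are hypothetical placeholders, not the values estimated from our data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior Dirichlet parameters for the two competing GLMs
# (chosen-value GLM vs maximum-value GLM); the real values come from the
# variational Bayes procedure of Stephan et al. (ref. 20).
alpha = np.array([18.0, 4.0])
samples = rng.dirichlet(alpha, size=200_000)
exceedance_p = float((samples[:, 0] > samples[:, 1]).mean())
print(f"exceedance probability (model 1 > model 2): {exceedance_p:.3f}")
```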

Functional coupling between basal ganglia and vmPFC

In mixed trials, signals in caudate and putamen consistently pertained to the value of the same system, independent of choice, whereas vmPFC pertained to the value that was modulated by choice. In other words, activity in vmPFC depended on choice while activity in putamen and caudate did not. This suggests the caudate and putamen are at an input stage to a value comparison process, while vmPFC is at an output stage.

To discriminate between alternative mechanisms for how choice values from both systems are compared we used a connectivity analysis, derived from a psychophysiological interaction (PPI), and examined the functional relationship between vmPFC, caudate and putamen during mixed choices. One possibility is that the competition between the planning and extensively trained systems is resolved within the basal ganglia and the outcome is transferred to vmPFC. This predicts a PPI showing increased coupling of only the winning area (caudate or putamen) with vmPFC. The alternative hypothesis is that values from both areas are transferred to vmPFC and competition is resolved within vmPFC. This predicts increased coupling between both precursor areas and vmPFC regardless of choice.

Data from the PPI analysis support the latter hypothesis, revealing a significant increase in the strength of coupling of both putamen and caudate with vmPFC during the time of choice, independent of the action that was finally chosen (Fig. 4). In contrast, we did not find areas that showed a differentially increased coupling with caudate on trials where subjects chose the planning branch but not on trials where they chose the trained branch, or a differentially increased coupling with posterior putamen on trained branch choices but not on planning branch choices.

Figure 4
Functional coupling between caudate-vmPFC and putamen-vmPFC is significantly increased during choice in mixed trials

Discussion

We have shown that trials invoking forward planning and trials with extensively trained options evoke activity in distinct neural systems during computations associated with choice. BOLD signals in the caudate pertained to values of the individual branches in a decision tree, whereas BOLD signals in the posterior putamen fluctuated with values associated with responses in an extensively trained context. Importantly, during choices requiring a simultaneous comparison of values from both choice types, the individual striatal subsystems consistently represented their respective values regardless of final choice. These findings suggest that two independent systems represent the two choice types in our task. In contrast, activity in prefrontal cortex pertained to a value signal that depended on the actual decision that was made.

Converging evidence from animal and human studies has long suggested that two different learning processes govern behaviour: one controlling the acquisition of goal-directed actions, and the other the acquisition of habits 21,22. According to this dissociation, an association between actions and outcomes governs action selection in the goal-directed case, whereas in the habitual case, it is controlled through learned stimulus-response associations without any direct assessment of the outcome of those actions. As such, goal-directed control is performed with regard to the consequence of actions, whereas habits are determined by the predicting stimuli rather than the outcomes. Accounts suggesting a plurality of control are also supported by theoretical considerations of the computational mechanisms underlying different forms of reinforcement learning 23, 24. The defining criterion in the more computationally centered literature has been a functional one, focused on the differences in the computational mechanisms underlying different types of learning. The dissociation here is between model-free temporal difference learning of cached values and model-based choice, which predicts, on the fly, the immediate consequences of each action in a sequence. Our planning task, which was designed to be solvable only by searching the decision tree, typifies model-based control. The absence of a devaluation or contingency degradation test means we cannot prove definitively that our extensive training had created a true habit. Likewise, we cannot exclude that subjects derived values in the extensively trained mazes by solving a decision tree during training and memorizing the result so that it could be retrieved at the time of choice. However, similar paradigms in previous studies have demonstrated that learning through numerous repetitions in stable contexts is solved by a prediction error based mechanism 3, 25-28.

In the caudate we observed value differences, which are likely correlates of the choice values during the planning process. The existence of planning value representations in anterior caudate is consistent with evidence for goal-directed impairment after caudate lesions in rodents 29. In addition, a human imaging study has demonstrated elevated activity in anterior caudate when subjects were performing on a high-contingency schedule compared with when they were performing on a low contingency schedule 10.

Although most of our results are consistent with a literature implicating the caudate in explicit planning, it is notable that the BOLD signal in this structure also correlated with the values of the relevant options in extensively trained trials. There are a number of possible explanations for this. The simplest is that this activity is epiphenomenal for choice. That is, the main claim of dual systems accounts is not that redundant systems do not calculate (if they have the information to do so), but rather that their calculations do not influence behaviour. Thus, in extensively trained trials, a planning system might estimate values but with an absent or only a modest effect on behavior, or perhaps at most improving the prediction errors available to the other system 6. When the planning system is engaged in its own unique computations, these calculations are no longer possible. We consider the mixed trials as showing this, although it would be interesting to design a more explicit test, for instance engaging the planning system with a distractor task whilst subjects make extensively trained choices. Under such a scenario, we would expect value-associated signals in caudate to vanish, or rather to pertain to the concurrent planning task, while leaving choice performance on the extensively trained task essentially unimpaired. Diametrically opposed to this interpretation is the possibility that the caudate actually controlled choice even in trials that we consider to be non-planning. We believe this is unlikely since the value of the trained branch was conspicuously absent from caudate in mixed trials, whereas if subjects based all choices on planning then we should have seen in the caudate a value difference, similar to the pattern of activity observed in pure planning trials. A third, and more radical, possibility is that the values from the trained branches are used to ground evaluations in the planning system. This interpretation would be most appropriate for trials involving two trained branches, as these values could then be compared by the planning system. Such an integration of values across systems has been widely predicted from the very earliest days of planning 2, 30 , but has not previously been observed. The present task is not ideal for testing this possibility, but it points to an important avenue for future work. On a similar note, while we show that prediction error based learning of action values does not impact choice in planning trials, we cannot exclude some form of concomitant model-free learning even in planning trials. However, in the absence of an overt expression of behavior from the model-free system, this is not trivial to dissect in the present study. These questions are nevertheless ripe issues for future research.

The putamen encoded values associated with the extensively trained trials throughout our study. A recent imaging study 8 showed that cue-driven activation in dorsolateral posterior putamen increases with prolonged habitization and concluded that this region may contribute to the habitual control of behaviour in humans. While that study did not investigate value related parametric effects in this region, we found neural representations of values for extensively trained choices in the same area. However, it is less clear whether there is a process of consolidation by which values migrate in the striatum over the course of overtraining. Our ROI (which was based on the coordinates in 8) was posterior to the location of many previous studies reporting prediction error signals in putamen during basic learning tasks 4, 27,31. When we tested for value signals in an ROI in anterior putamen we did not find a reliable significant representation for values of the trained branches or planned values. This resonates with evidence from studies on procedural sequence learning 32, 33 suggesting a transfer of activity from rostral to more caudal parts of putamen with increasing learning.

Our data (including the behavioural effect that higher values in those trials reduce response times; Table S1) also suggest that even extensively trained responses can still be influenced by learnt values of the associated actions, rather than depending only on the sort of more arbitrary action propensities found in certain reinforcement learning models (notably the actor-critic 34). It is interesting to note that in neither of the tasks did we observe a value difference for the extensively trained choices in posterior putamen, which might function as reinforcement learning cached memory. This is particularly clear in trials with two trained branches, where the values of both available options were simultaneously represented in putamen. This pattern of option values, but not a value difference, suggests the putamen does not compare values, but needs the vmPFC or caudate (where we see such a difference between the chosen and unchosen option) to perform this task. Note also that posterior putamen did not reflect a prediction error at the time of outcome (unlike caudate), which might underlie the persistence of extensively trained habits.

The findings that vmPFC increased its coupling with both caudate and putamen during choice, and encoded the winning outcome of a choice process (chosen value), are in line with its putative role as a value comparator. These findings challenge a view that prefrontal cortex is largely sensitive to model derived values 10, 11, 35, and instead suggest that the vmPFC is engaged whenever values are compared, in order to prepare an action, regardless of whether this derives from a planning computation or from extensive training. The absence of vmPFC value representations during E-trials, which do not require a comparison, implies that subjects immediately initiate the action in these trials. This interpretation would also explain why vmPFC does not represent choice values at the third stage of P-trials, as subjects might have precomputed and stored deep choices already at the root stage and then only executed the appropriate response at the deep stage. Our behavioural findings support this interpretation: subjects’ response times increase with decision difficulty (measured as the negative absolute value difference) only at the 1st stage but no longer at the 2nd stage (Table S1). In contrast, caudate represented values of the second stage choice both at the time when they were computed (root choice) and at the time when the associated action was activated (deep choice), in line with its proposed role of organizing and representing the forward planning process. In summary, vmPFC might facilitate actual value comparisons while caudate represents the planned actions (together with the planned values) as long as they are task relevant and until the required actions are initiated. Further, it is an interesting observation that vmPFC pertained to the value difference between chosen and unchosen options during choices requiring a comparison of values from only one system (such as pure planning trials and trials involving a comparison between two extensively trained branches), but only to the chosen value in mixed trials. We cannot rule out that our test was too insensitive to pick up the negative effect of the unchosen option during those trials. However, an alternative explanation is that the brain employs different mechanisms for the value comparison in both conditions. This hypothesis requires further investigation, as previous studies have reported vmPFC sensitivity to both chosen values 11, 13, 31, 36, 37 and to value differences between chosen and unchosen options 16, 19, but did not explicitly control for whether behaviour was guided by planning or non-planning computations. In addition to chosen values, a number of previous studies found evidence for goods or option values in medial PFC 12, 38. Note that our overall interpretation suggests a value comparison role for vmPFC, after pre-choice values are transferred there from other structures such as the basal ganglia. It is possible that vmPFC also plays a separate role in the valuation of economic goods 39, 40, in which case it may also reflect stimulus values 41. Here we used abstract monetary rewards and therefore our task did not require such an appraisal of real world items in common value space.

We note four caveats to our findings. First, non-significant results do not prove the absence of an effect. However, it should be mentioned that neural signals in caudate and putamen did not just correlate with a singular value signal but instead pertained to a set of specific computationally meaningful patterns across several different tasks. Second, we concentrated on BOLD signals at the time of choice rather than at outcome. This was because we had no expectation for the computation at this time for either system. The outcome is irrelevant for planning since values change on a trial-by-trial basis and the computer opponent’s strategy is instructed. For the choices in the extensively trained context, substantial experience with fixed outcome probabilities (unlike the case in 6) should render nugatory any prediction error. Third, although the results in relation to categorical differences between trial types in figure 2a might be influenced by variations in difficulty between conditions, this would not impact on the parametric analysis of values because those potential confounds are encompassed by the associated categorical regressor. Finally, all value signals are relative to the reference frame of the choosing agent 38, 42 and any neural representation of values should ultimately reflect subjective values. In the present study we assumed that our subjects employed a linear transformation of reward probabilities to value, consistent both with subjects’ choices and neural data showing a linear relationship between reward probability and BOLD within our ROIs (Fig. S3).

Our findings add to recent investigations of value based choices 6-8, 11, suggesting conserved processes in basal ganglia across species. Previous studies were limited with respect to the questions we pose by either not dissociating value representations of multiple controllers, or not involving actual planning 43. Further, we designed our task to minimize the possible indirect interactions between the two forms of control – for instance, even if the planning system were to calculate temporal difference prediction errors on planning trials (unnecessarily for it) 6, there would be little to do with them, since the values change on a trial-by-trial basis. Perhaps the most pressing possibility furnished by our results is to embed values derived from extensive training deeper in the tree. This would require those learnt values to be assessed as part of a planning choice in a more thoroughgoing way than in our trained/trained trials. As mentioned above, that this actually happens is a critical prediction of theories and practice in planning in extended domains, but has never been experimentally tested.

Methods

Subjects

Twenty-one healthy subjects (9 female; 18–35 years old) with no history of neurological or psychiatric illness participated in the study. None showed colour vision deficiency in the Ishihara test. All subjects completed three days of training on the trained mazes and scan session 1 (see below). Twenty of the 21 subjects participated in scan session 2. The Institute of Neurology (University College London) Research Ethics Committee approved the study.

Task

Our experiment consisted of four conditions: pure planning (P-trials), extensively trained choices (E-trials), choices requiring a comparison of planned and extensively trained values (PE-trials), and choice between two trained branches (EE-trials).

Planning (P-trials)

Subjects navigated through a tree-shaped maze in search of maximal reward. Each state in the decision tree corresponded to a unique room in the maze, with state transitions implemented through left and right forward exit doors (backtracking was not possible). Depending on the chosen doors, subjects progressed along different branches in the tree maze until they reached a reward room at the end of each branch (Fig. 1b). All participants acquired correct mappings between room transitions and maze positions before the functional imaging experiment. Each reward room contained probabilistic reward, shown to subjects as a chest full of gold coins or an empty chest. The reward probabilities of all terminal rooms were clearly available to subjects throughout the trial as a display of eight numbers at the top of the screen. The reward probabilities fluctuated in discrete steps of 0.1 between 0 and 1 and were shown to subjects as integer percentages (in the range [0, 100]). Transitions from state to state within the maze (the spatial layout of the maze) were deterministic and constant throughout the entire experiment. Importantly, however, the reward probabilities for the 8 terminal states changed completely on every planning trial, thereby effectively preventing successful application of model free learning strategies.

To engage subjects in forward planning over and above a mere comparison of instructed values, the choice at layer 2 in the tree was made by a deterministic value minimizing computer agent. Before the experiment subjects were explicitly instructed about the computer agent’s choice rule and, to prevent subjects from treating the computer as a social agent, we emphasized that its choice rule would remain deterministic and predictable throughout the experiment. The only rational strategy in this task was to plan the best possible transit through the maze using a minimax strategy 44 to roll back state values. This involves, already at the root choice, consideration of the choice at the 3rd layer and of the computer’s choice in each of the two possible rooms in layer 2.

Choices in extensively trained contexts (E-trials)

Each of the four mazes consisted of one choice room with two doors and a reward room behind each door. Only one door led to probabilistic reward and those contingencies never changed throughout the experiment. Different wall colouring (red, yellow, green, blue) provided distinct contexts in which subjects acquired separate value associations, with reward probabilities drawn from the set of 15, 40, 65, or 90 percent 45. We counterbalanced mappings between colour, reward contingencies, and actions across subjects.

Choices between planning and trained branches (PE-trials)

Half of the decision tree was a planning branch with the same rules as in the planning maze; the doorframe in the root node of the other branch was coloured and choosing it led into the trained maze of that colour. This required subjects to directly compare a planned target value from one branch with a trained value. We matched transitions in mixed trials to equate effort and time for traversing either branch. Note that this trial type did not provide subjects with an option to choose whether they would prefer to engage in planning or a choice based on the previously trained mazes. Instead, rational choice always required both a planning computation to calculate action values for the planning branch and retrieval of a value for the coloured branch, followed by a direct comparison between values from both systems.

Choice between two extensively trained branches (EE-trials)

Finally, trials in a fourth condition involved a comparison between two learned values. The root room contained two coloured doors, and choosing either door led into the corresponding coloured maze.

Training of values in coloured mazes

To induce stable values in the coloured mazes, we informed subjects that each colour corresponded to a different maze with its own stable reward probabilities and then trained them on three consecutive days (720 trials in interleaved ordering) before the fMRI scan (Fig. S1). We did not perform functional imaging during this training phase, but it is well established in numerous animal and human studies 3, 25-28 that such a task induces prediction error mediated learning.
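
As an illustration of this kind of learning (the training phase itself was not modelled trial-by-trial in our analyses; see Model predicted choice values below), a simple delta-rule update with a small learning rate converges on the true reward probability of each coloured maze over roughly 720 interleaved trials. The colour-to-probability assignment in this sketch is arbitrary, as these mappings were counterbalanced across subjects.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = {"red": 0.15, "yellow": 0.40, "green": 0.65, "blue": 0.90}  # arbitrary assignment
alpha = 0.1                                  # small learning rate, appropriate for a stable environment
V = {c: 0.5 for c in true_p}                 # initial value estimates

for _ in range(180):                         # 180 visits per maze, ~720 training trials in total
    for colour, p in true_p.items():
        reward = rng.random() < p            # probabilistic outcome behind the rewarding door
        V[colour] += alpha * (float(reward) - V[colour])   # prediction-error (delta-rule) update

print({c: round(v, 2) for c, v in V.items()})   # estimates approach the true probabilities
```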

FMRI experiment

To prevent a deterioration of responses in trained mazes (PE-trials might stimulate formation of a new explicit value representation for each coloured maze, inducing a strategy change on subsequent E-trials), we divided our experiment into two blocks, presenting E- and P-trials in the first block and PE- and EE-trials in a subsequent block.

In scan session 1 we presented subjects with 96 P- and 96 E-trials, randomly intermixed, to measure choice related brain activity unique to either planning or decisions in extensively trained contexts. After a 15 min break outside the scanner, subjects participated in scan session 2, which contained 100 PE- and 50 EE-trials, intermixed. Subjects’ payout was tied to the rewards they earned (£0.20 during the fMRI sessions and £0.05 during training). In total, subjects accumulated approximately £60 in rewards (range £55–64).

Model predicted choice values

Extensively trained mazes

We used constant values of the true reward probabilities throughout the study. Due to the large number of training trials, and because subjects universally chose the better option towards the end of training, we can assume that subjects acquired learned values for the coloured mazes during the training period and that those values had converged to the true value at the time of the fMRI study (trial-by-trial fluctuations in value would then be minimal due to a very small learning rate, adapted to the stable environment 46).

Forward planning

We assumed that subjects would unroll values from the reward rooms (instructed on the screen) to every prior state and then plan in the root state the optimal transit through the maze. We modeled this forward search 47 for rewards R and calculated planned values for action a in each state s of Layer L(s) using a maximizing strategy over available choices in states under subjects’ control (layer 1 and 3), and a minimizing strategy in states under the computer’s control (layer 2).

$$V(s,a) \leftarrow R(s') + \sum_{s'} \Big( \max_{a'} V(s',a')\,[L(s') \neq 2] + \min_{a'} V(s',a')\,[L(s') = 2] \Big)$$

where s′ denotes the successor state reached by taking action a in state s (transitions were deterministic), and [·] is the indicator function.

Behavioural analysis

To investigate potential motivational (caused by a high target value) and difficulty based influences (originating from small differences between target and alternative values) on choice time, we regressed logarithmic RT on Vtarget, the negative absolute value difference (−|Vchosen − Vunchosen|), and trial number, separately for each trial type. Note that we neither instructed subjects to respond quickly nor was it the case that fast responses had any monetary benefit to subjects (except for finishing the experiment slightly sooner). We similarly analyzed the influence of these parameters on correct choice (Table S1).
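
A minimal sketch of this regression on synthetic data is shown below (Python; the actual behavioural analyses were run in MATLAB). A positive coefficient on the negative absolute value difference indexes slowing with difficulty, and a negative coefficient on Vtarget indexes speeding with motivation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 96                                              # trials of one type for one subject
v_chosen = rng.uniform(0, 1, n)
v_unchosen = rng.uniform(0, 1, n)
trial = np.arange(n)
v_target = np.maximum(v_chosen, v_unchosen)
difficulty = -np.abs(v_chosen - v_unchosen)         # closer to zero when values are similar (hard)

# Synthetic log RTs: slower on difficult trials, slightly faster for high target values.
log_rt = 0.2 - 0.3 * v_target + 0.4 * difficulty + 0.001 * trial + 0.05 * rng.standard_normal(n)

X = np.column_stack([np.ones(n), v_target, difficulty, trial])   # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, log_rt, rcond=None)
print(dict(zip(["intercept", "V_target", "-|dV|", "trial"], np.round(beta, 3))))
```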

Stimuli

We programmed stimulus presentation in MATLAB using Cogent 2000 (www.vislab.ucl.ac.uk/cogent.php).

FMRI data acquisition

Data were acquired with a 3T scanner (Trio, Siemens, Erlangen, Germany) using a 12-channel phased array head coil. Functional images were taken with a gradient echo T2*-weighted echo-planar sequence (TR = 3.128 s, flip angle = 90°, TE = 30 ms, 64 × 64 matrix). Whole brain coverage was achieved by taking 46 slices in ascending order (2 mm thickness, 1 mm gap, in-plane resolution 3 × 3 mm), tilted in an oblique orientation at −30° to minimize signal dropout in ventrolateral and medial frontal cortex. We also acquired a B0-fieldmap (double-echo FLASH, TE1 = 10 ms, TE2 = 12.46 ms, 3 × 3 × 2 mm resolution) and a high-resolution T1-weighted anatomical scan of the whole brain (MDEFT sequence, 1 × 1 × 1 mm resolution).

FMRI data analysis

We used SPM8 (rev. 4068; www.fil.ion.ucl.ac.uk/spm) for image analysis and applied standard preprocessing procedures (EPI realignment and unwarping using field maps, segmentation of T1 images into grey matter, white matter, and cerebrospinal fluid, use of the segmentation parameters to warp T1 images to the SPM Montreal Neurological Institute (MNI) template, and spatial smoothing of normalized functional data using an isotropic 8 mm full-width at half-maximum Gaussian kernel).

We regressed functional MRI (fMRI) time series onto a composite general linear model (GLM) containing individual regressors representing the presentation of the root, second choice, computer choice/transition, and outcome. We modeled choice trials in all four conditions separately and further divided choices in the PE-condition into planning and trained chosen trials. Additional regressors captured button presses and the motion correction parameters estimated from the realignment procedure. Regressors at the choice time and outcome were parametrically modulated by task relevant decision variables as described in the separate section below. We did not apply orthogonalization when we entered regressors and modulators into the design matrix, ensuring that the regressors of interest were not confounded by spurious correlations from signals pertaining to any of the other value signals 48. We assessed statistical significance with a second-level random-effects analysis using a one-sample t-test against zero on the effect sizes in individual subjects.
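
The logic of entering unorthogonalized parametric modulators can be sketched as follows (Python, with synthetic onsets, a simplified single-gamma HRF, and simulated data; the actual analysis used SPM8's canonical HRF and estimation routines): each modulator is mean-centred, convolved with the HRF, and entered alongside the unmodulated onset regressor, so that correlated value regressors compete for variance in a single least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(3)
tr, n_scans = 3.128, 400

# Simplified haemodynamic response function (single gamma; a stand-in for SPM8's canonical HRF).
t = np.arange(0, 30, tr)
hrf = t ** 5 * np.exp(-t) / 120.0

# Synthetic root-choice onsets (in scans) and two parametric modulators.
onsets = np.sort(rng.choice(n_scans - 20, size=60, replace=False))
v_target = rng.uniform(0, 1, size=60)
v_alt = rng.uniform(0, 1, size=60)

def modulated_regressor(weights):
    """Convolve a weighted stick function at the onsets with the HRF."""
    sticks = np.zeros(n_scans)
    sticks[onsets] = weights
    return np.convolve(sticks, hrf)[:n_scans]

X = np.column_stack([
    modulated_regressor(np.ones(60)),                   # unmodulated onset regressor
    modulated_regressor(v_target - v_target.mean()),    # mean-centred modulator, NOT orthogonalized
    modulated_regressor(v_alt - v_alt.mean()),          # second modulator competes for shared variance
    np.ones(n_scans),                                   # constant
])

y = X @ np.array([1.0, 0.8, -0.5, 0.1]) + 0.3 * rng.standard_normal(n_scans)  # simulated BOLD
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))   # recovers a positive V_target and a negative V_alt effect
```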

Value modulated parametric analysis

Scan session 1, P-trials

We hypothesized that the most salient value signals would be the value of the optimal path (target choice) and the values of the two alternative decision branches that subjects follow along their way through the maze. We therefore expected to find neural value representations of the optimal target action, the alternative tree branch at the root node and the alternative value at the second choice, and in response to the outcome in the reward rooms. In the example shown in Figure 1b, Vtarget = 40, Vroot_alternative = 20, and Vdeep_alternative = 30, reward outcome = 100 on rewarded and 0 on non-rewarded trials. To investigate the temporal dynamics of value representations during planning over the entire trial we modulated regressors at three timepoints: (1) during the root choice, (2) during the 2nd choice in layer 3, and (3) during presentation of the outcome. The regressor during outcome presentation was additionally modulated by actual reward. Although the time of third choice and outcome were fixed (to avoid confounding effects of any potential prediction errors), the effects of expected value during choice (on a continuous scale) and response to the actual outcome (either 1 or 0) are still dissociable through the principle of competing variances in unorthogonalized regressors.

There was a significant positive effect for the target value and negative effect for the alternative value in this analysis, indicating a value difference between the two components in the overall signal. Separate testing of minuend (a) and subtrahend (b) is a more thorough test for a difference representation than a direct regression of the difference value a − b: if a alone had a very strong effect, then the latter test might still be significant despite the fact that the signal was actually better explained by a than by a − b. Importantly, if there is a significant positive effect for a and a significant negative effect for b, then a contrast testing for the difference a − b is necessarily also significant.
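
A small simulation illustrates this point: when a synthetic signal depends on a alone, regressing on the difference a − b still yields a sizeable coefficient, whereas entering a and b as separate regressors correctly attributes the effect to a and leaves b near zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 96
a = rng.uniform(0, 1, n)                           # e.g. V_target
b = rng.uniform(0, 1, n)                           # e.g. V_alternative
signal = 2.0 * a + 0.2 * rng.standard_normal(n)    # signal depends on a only, not on b

# Test 1: regress on the difference a - b (plus intercept).
X_diff = np.column_stack([np.ones(n), a - b])
beta_diff, *_ = np.linalg.lstsq(X_diff, signal, rcond=None)

# Test 2: enter a and b as separate regressors.
X_sep = np.column_stack([np.ones(n), a, b])
beta_sep, *_ = np.linalg.lstsq(X_sep, signal, rcond=None)

print("difference regressor beta:", round(beta_diff[1], 2))   # spuriously positive
print("separate betas (a, b):", np.round(beta_sep[1:], 2))    # a positive, b near zero
```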

Scan session 1, E-trials

We modulated regressors during presentation of the choice screen and at outcome with the true reward probability of the rewarding action (Vtrained). The regressor at the time of the outcome was also modulated by the experienced reward.

Scan session 2, PE-trials

We split trials according to subjects’ choices and modeled separately plan chosen and train chosen trials. Regressors during choice were parametrically modulated with the target value in the planning branch (Vtarget), the alternative value at the second choice of the planning branch (Vdeep.alternative) and the value of the coloured trained branch (Vtrained). The regressor at the time of the outcome was modulated by the experienced reward.

Scan session 2, EE-trials

The regressor during presentation of the choice screen was modulated by the value of the chosen (Vchosen) and unchosen (Vunchosen) branch; the regressor at the time of the outcome was modulated by the experienced reward.

Psychophysiological interaction (PPI) analysis

We performed a PPI analysis 49 to examine the functional coupling between vmPFC and caudate and putamen BOLD during mixed choices. The PPI term was Y × P, with Y being the BOLD timecourse in either the caudate or the putamen ROI, and P being an indicator variable for the times during which mixed choices were made. We entered the seed region BOLD Y and the PPI interaction term, along with all regressors from our model based parametric analysis (containing P and all value regressors), into a new GLM. Importantly, this GLM also contained the parametric value signals for both branches, so any effect on the PPI interaction would reveal increased coupling that could not be explained from the mutual correlation of seed and target region with the choice values. We computed this PPI both for a seed in caudate and for a seed in putamen, thereby separately identifying areas that showed a significant increase in coupling with each area. The conjunction highlights common regions that played a role in mediating between both choice systems. Alternatively we tested for choice-dependent changes in coupling, i.e. areas that would differentially increase coupling with caudate on plan chosen trials but not on train chosen trials, and vice-versa for putamen. This analysis did not reveal significant results anywhere in the brain, even at a lenient threshold of p < 0.005 uncorrected.
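
The construction of the interaction term can be sketched as follows (Python, on synthetic timecourses and onsets; the actual analysis used the ROI timecourses and the full parametric design described above):

```python
import numpy as np

rng = np.random.default_rng(5)
n_scans = 400
y_seed = rng.standard_normal(n_scans)          # synthetic seed (caudate or putamen) timecourse

# Psychological variable P: 1 during scans in which a mixed (PE) choice is on screen, 0 otherwise.
p_mixed = np.zeros(n_scans)
p_mixed[rng.choice(n_scans, size=80, replace=False)] = 1.0

ppi = y_seed * p_mixed                         # interaction term Y x P

# The PPI GLM contains the seed timecourse, the psychological regressor, and the interaction
# (plus, in the real analysis, all value regressors from the parametric model).
X = np.column_stack([np.ones(n_scans), y_seed, p_mixed, ppi])
y_target = 0.5 * y_seed + 0.8 * ppi + 0.3 * rng.standard_normal(n_scans)   # synthetic vmPFC signal
beta, *_ = np.linalg.lstsq(X, y_target, rcond=None)
print("interaction (coupling change) beta:", round(beta[3], 2))
```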

We also tested the possibility that vmPFC correlated with the choice dependent difference timecourse 18 between activity in caudate and putamen by estimating a GLM on the PPI = Y x P, where Y = tcaudate – tputamen, and P = 1 on plan chosen trials and −1 on trained chosen trials. However, when we added the parametric choice values Vplan and Vtrained as covariates of no interest to this PPI GLM (to rule out the possibility that effects on this interaction were solely due to mutual correlations of seed and target areas with the choice values), we did not find significant remaining interactions (p < 0.005 uncorrected).

Whole brain analysis

A whole brain parametric analysis confirmed a selective representation of planned target values during P-trials in anterior caudate (Fig. S4a) and of cached values during E-trials within posterior putamen (Fig. S4b). Besides precentral gyrus (putatively reflecting motivational or motor preparatory processes) we did not observe any other significant correlation (p < 0.05 FWE corrected) with value signals outside of our a priori brain regions in any trial type (Table S4).

Region of interest (ROI) analysis

We analysed value signals (results in Table S5) within a priori anatomically defined ROIs (Fig. S2). For each region we regressed a representative timecourse, calculated as the first eigenvariate 50, on our design matrix. This provides a sensitive analysis, as only a single regression is performed per region (no correction for multiple comparisons across voxels is required).

Anterior caudate (xyz mm): right: 9, 15, 3; left: −9, 15, 3; size: 6 mm radius, 66 voxels. Sphere centered in the anterior caudate nucleus. Dorsolateral posterior putamen: right: 33, −24, 0; left: −33, −24, 0; 4 mm radius, 20 voxels. Location based on the habit learning study by Tricomi et al. 8. VmPFC: 0, 32, −13; 8 mm sphere, 65 voxels. Radii chosen to fit anatomical boundaries.
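
For reference, one standard way to compute such a representative (first-eigenvariate) timecourse is via a singular value decomposition of the ROI voxel-by-time data; the sketch below uses synthetic data, with the 66-voxel caudate sphere size purely as a placeholder.

```python
import numpy as np

rng = np.random.default_rng(6)
n_scans, n_voxels = 400, 66                      # e.g. the 66-voxel anterior caudate sphere
roi_data = rng.standard_normal((n_scans, n_voxels))

# First eigenvariate: the single timecourse capturing the largest share of variance across voxels.
roi_centered = roi_data - roi_data.mean(axis=0)
u, s, vt = np.linalg.svd(roi_centered, full_matrices=False)
eigenvariate = u[:, 0]                           # unit-norm first left singular vector (scaling conventions vary)

# This timecourse is then regressed on the full design matrix X, one regression per ROI, e.g.:
# beta, *_ = np.linalg.lstsq(X, eigenvariate, rcond=None)
print(eigenvariate.shape)
```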

Supplementary Material

Acknowledgements

This study was supported by a Wellcome Trust Program Grant and Max Planck Award (RJD and KW) and the Gatsby Charitable Foundation (PD). We thank Wako Yoshida and Jenny Oberg for help with data acquisition, and Nathaniel Daw and Marc Guitart Masip for their valuable and insightful comments on the manuscript.

The Wellcome Trust Centre for Neuroimaging is supported by core funding from the Wellcome Trust 091593/Z/10/Z. RJD is supported by a Wellcome Trust Programme Grant.

References

1. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT Press; Cambridge, MA: 1998.
2. Samuel AL. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development. 1959;3:210–229.
3. O’Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ. Temporal difference models and reward-related learning in the human brain. Neuron. 2003;38:329–337. [PubMed]
4. Seymour B, et al. Temporal difference models describe higher-order learning in humans. Nature. 2004;429:664–667. [PubMed]
5. Shallice T. Specific impairments of planning. Philos Trans R Soc Lond B Biol Sci. 1982;298:199–209. [PubMed]
6. Daw ND, Gershman SJ, Dayan P, Seymour B, Dolan RJ. Model-Based Influences on Humans’ Choices and Striatal Prediction Errors. Neuron. 2011;69:1204–1215. [PMC free article] [PubMed]
7. Glascher J, Daw N, Dayan P, O’Doherty JP. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron. 2010;66:585–595. [PMC free article] [PubMed]
8. Tricomi E, Balleine BW, O’Doherty JP. A specific role for posterior dorsolateral striatum in human habit learning. Eur J Neurosci. 2009;29:2225–2232. [PMC free article] [PubMed]
9. Tricomi EM, Delgado MR, Fiez JA. Modulation of caudate activity by action contingency. Neuron. 2004;41:281–292. [PubMed]
10. Tanaka SC, Balleine BW, O’Doherty JP. Calculating consequences: brain systems that encode the causal effects of actions. J Neurosci. 2008;28:6750–6755. [PMC free article] [PubMed]
11. Hampton AN, Bossaerts P, O’Doherty JP. The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J Neurosci. 2006;26:8360–8367. [PubMed]
12. Hare TA, O’Doherty J, Camerer CF, Schultz W, Rangel A. Dissociating the role of the orbitofrontal cortex and the striatum in the computation of goal values and prediction errors. J Neurosci. 2008;28:5623–5630. [PubMed]
13. Daw ND, O’Doherty JP, Dayan P, Seymour B, Dolan RJ. Cortical substrates for exploratory decisions in humans. Nature. 2006;441:876–879. [PMC free article] [PubMed]
14. Lau B, Glimcher PW. Action and outcome encoding in the primate caudate nucleus. J Neurosci. 2007;27:14502–14514. [PubMed]
15. Samejima K, Ueda Y, Doya K, Kimura M. Representation of action-specific reward values in the striatum. Science. 2005;310:1337–1340. [PubMed]
16. Boorman ED, Behrens TE, Woolrich MW, Rushworth MF. How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron. 2009;62:733–743. [PubMed]
17. Noonan MP, et al. Separate value comparison and learning mechanisms in macaque medial and lateral orbitofrontal cortex. Proc Natl Acad Sci U S A. 2010;107:20547–20552. [PMC free article] [PubMed]
18. Basten U, Biele G, Heekeren HR, Fiebach CJ. How the brain integrates costs and benefits during decision making. Proc Natl Acad Sci U S A. 2010 [PMC free article] [PubMed]
19. FitzGerald TH, Seymour B, Dolan RJ. The role of human orbitofrontal cortex in value comparison for incommensurable objects. J Neurosci. 2009;29:8388–8395. [PMC free article] [PubMed]
20. Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ. Bayesian model selection for group studies. Neuroimage. 2009;46:1004–1017. [PMC free article] [PubMed]
21. Balleine BW, O’Doherty JP. Human and rodent homologies in action control: corticostriatal determinants of goal-directed and habitual action. Neuropsychopharmacology. 2010;35:48–69. [PMC free article] [PubMed]
22. Redgrave P, et al. Goal-directed and habitual control in the basal ganglia: implications for Parkinson’s disease. Nat Rev Neurosci. 2010;11:760–772. [PMC free article] [PubMed]
23. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci. 2005;8:1704–1711. [PubMed]
24. Doya K. What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Netw. 1999;12:961–974. [PubMed]
25. Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275:1593–1599. [PubMed]
26. Knutson B, Cooper JC. Functional magnetic resonance imaging of reward prediction. Curr Opin Neurol. 2005;18:411–417. [PubMed]
27. O’Doherty J, et al. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science. 2004;304:452–454. [PubMed]
28. Berns GS, McClure SM, Pagnoni G, Montague PR. Predictability modulates human brain response to reward. J Neurosci. 2001;21:2793–2798. [PubMed]
29. Yin HH, Ostlund SB, Knowlton BJ, Balleine BW. The role of the dorsomedial striatum in instrumental conditioning. European Journal of Neuroscience. 2005;22:513–523. [PubMed]
30. Sutton RS. First results with Dyna, an interesting architecture for learning, planning, and reacting. In: Miller T, Sutton RS, Werbos P, editors. Neural Networks for Control. MIT Press; 1990. pp. 179–189.
31. Knutson B, Taylor J, Kaufman M, Peterson R, Glover G. Distributed Neural Representation of Expected Value. J. Neurosci. 2005;25:4806–4812. [PubMed]
32. Jueptner M, Frith CD, Brooks DJ, Frackowiak RS, Passingham RE. Anatomy of motor learning. II. Subcortical structures and learning by trial and error. J Neurophysiol. 1997;77:1325–1337. [PubMed]
33. Lehericy S, et al. Distinct basal ganglia territories are engaged in early and advanced motor sequence learning. Proc Natl Acad Sci U S A. 2005;102:12566–12571. [PMC free article] [PubMed]
34. Barto AG. Adaptive critic and the basal ganglia. In: Houk JC, Davis JL, Beiser DG, editors. Models of information processing in the basal ganglia. MIT Press; Cambridge: 1995. pp. 215–232.
35. Valentin VV, Dickinson A, O’Doherty JP. Determining the neural substrates of goal-directed learning in the human brain. J Neurosci. 2007;27:4019–4026. [PubMed]
36. Wunderlich K, Rangel A, O’Doherty JP. Neural computations underlying action-based decision making in the human brain. Proceedings of the National Academy of Sciences. 2009;106:17199–17204. [PMC free article] [PubMed]
37. Tanaka SC, et al. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nat Neurosci. 2004;7:887–893. [PubMed]
38. Padoa-Schioppa C, Assad JA. Neurons in the orbitofrontal cortex encode economic value. Nature. 2006;441:223–226. [PMC free article] [PubMed]
39. Chib VS, Rangel A, Shimojo S, O’Doherty JP. Evidence for a common representation of decision values for dissimilar goods in human ventromedial prefrontal cortex. J Neurosci. 2009;29:12315–12320. [PubMed]
40. Plassmann H, O’Doherty J, Rangel A. Orbitofrontal cortex encodes willingness to pay in everyday economic transactions. J Neurosci. 2007;27:9984–9988. [PubMed]
41. Wunderlich K, Rangel A, O’Doherty JP. Economic choices can be made using only stimulus values. Proc Natl Acad Sci U S A. 2010;107:15005–15010. [PMC free article] [PubMed]
42. Kable JW, Glimcher PW. The neural correlates of subjective value during intertemporal choice. Nat Neurosci. 2007;10:1625–1633. [PMC free article] [PubMed]
43. FitzGerald TH, Seymour B, Bach DR, Dolan RJ. Differentiable neural substrates for learned and described value and risk. Curr Biol. 2010;20:1823–1829. [PMC free article] [PubMed]
44. von Neumann J, Morgenstern O. Theory of Games and Economic Behavior. Princeton University Press; 1944.
45. Dickinson A, Balleine BW. The role of learning in the operation of motivational systems. In: Pashler H, Gallistel R, editors. Stevens’ handbook of Experimental Psychology. John Wiley & Sons; New York: 2002. pp. 497–533.
46. Behrens TE, Woolrich MW, Walton ME, Rushworth MF. Learning the value of information in an uncertain world. Nat Neurosci. 2007;10:1214–1221. [PubMed]
47. Bellman RE. On the Theory of Dynamic Programming. Proc Natl Acad Sci U S A. 1952;38:716–719. [PMC free article] [PubMed]
48. Andrade A, Paradis AL, Rouquette S, Poline JB. Ambiguous results in functional neuroimaging data analysis due to covariate correlation. Neuroimage. 1999;10:483–486. [PubMed]
49. Friston KJ, et al. Psychophysiological and modulatory interactions in neuroimaging. Neuroimage. 1997;6:218–229. [PubMed]
50. Friston KJ, Rotshtein P, Geng JJ, Sterzer P, Henson RN. A critique of functional localisers. Neuroimage. 2006;30:1077–1087. [PubMed]