Parallel Excitatory and Inhibitory Neural Circuit Pathways Underlie Reward-Based Phasic Neural Responses

1School of Systems Science and National Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing 100875, China 2Intelligent Systems Research Centre, School of Computing, Engineering, and Intelligent Systems, University of Ulster, Magee Campus, Northland Road, Derry BT48 7JL, UK 3Beijing Key Laboratory of Brain Imaging and Connectomics, Beijing Normal University, Beijing 100875, China


Introduction
The ability to adapt to uncertainty is critical for survival and key to wellbeing. To investigate the underlying neural correlates and mechanisms, many experimental and computational studies using stochastic scheduling of reward have been carried out [1-9]. Experimental studies have demonstrated that dopaminergic (DA) neurons in the ventral tegmental area or substantia nigra pars compacta (VTA/SNc) and neurons in the lateral habenula (LHb) play important roles in encoding uncertainty of reward and punishment [5,8].
As illustrated schematically in Figure 1 (top row), given some unexpected reward (the presence of an unconditioned stimulus, US, such as food), DA (LHb) neurons exhibit a phasic peak (dip) upon presentation of the US [5,8]. After several trials of learning in the presence of a cue/stimulus, conditioning takes place. The (expected) conditioned cue/stimulus (CS) becomes associated with reward, and the DA (LHb) neurons exhibit a phasic peak (dip) in activity at the onset of the CS (Figure 1, second row) [5,8]. Note that the DA and LHb neurons now no longer respond to the US with a rewarding outcome [5,8]. One can view this as post-reinforcement learning: the agent has learned to completely associate the CS with the US (e.g., an auditory tone with food), and the latter is no longer needed for further learning. However, if the reward is omitted (e.g., absence of food), there is an additional dip (peak) in activity of the DA (LHb) neurons (Figure 1, third row) [5,8]. If we now replace the unexpected rewarding US with an unexpected nonrewarding or aversive US (e.g., no food or a mild electric shock), a phasic dip (peak) in DA (LHb) neuron activity is observed during the initial phase of reinforcement learning [5,8] (Figure 1, fourth row). After learning, this information is transferred to the CS: the DA (LHb) neurons exhibit a phasic dip (peak) upon CS presentation while staying at baseline activity during the US (Figure 1, fifth row). When there is a sudden unexpected omission of such a US, or when the US becomes rewarding, there is a peak (dip) in the activity of the DA (LHb) neurons [8,10,11] (Figure 1, bottom row). In summary, the phasic activities of DA and LHb neurons signal uncertainty in reward and punishment. Such signaling is also reflected in other brain regions, such as the border region of the globus pallidus internal segment (GPb), the internal segment of the globus pallidus (GPi), and the rostral medial tegmentum (RMTg) [2,3]. However, it is not clear how this information is transmitted within a larger neural circuit.
To understand the underlying computation, previous theoretical and computational studies have applied temporal-difference learning [8,12-15] and neural circuit modeling [16-18] to the phasic activity of DA neurons, on the basis that this phasic activity acts as a form of reward-prediction-error signal [8]. In particular, the model by Brown et al. [16] contains parallel pathways: one from the cortex through the striosome to the VTA/SNc, and another from the cortex through the ventral striatum (VS) to the pedunculopontine nucleus (PPTN) and VTA/SNc. These two pathways cooperatively control the activity of DA neurons (Figure 2). However, the phasic activity of LHb neurons has not been taken into consideration, even though the LHb projects substantially to DA neurons in the VTA/SNc [5].
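As a minimal illustration of how a temporal-difference account produces such a reward-prediction-error signal, the sketch below is a generic tabular TD(0) toy (not the circuit model of this paper): over trials, the prediction error moves from the US time to the CS time, mirroring the DA phasic response. All parameter values are illustrative assumptions.

```python
# Generic tabular TD(0) sketch: CS at state 0, reward delivered on leaving
# the last state. The per-transition TD error is the putative phasic signal.
def run_trial(V, alpha=0.3, gamma=1.0, reward=1.0):
    """One conditioning trial; returns the TD error at every transition."""
    deltas = []
    # CS onset: the cue itself is unpredicted, so the prior prediction is 0
    # (this fixed pre-CS value is never updated).
    deltas.append(gamma * V[0] - 0.0)
    for t in range(1, len(V)):          # transitions between in-trial states
        delta = gamma * V[t] - V[t - 1]
        V[t - 1] += alpha * delta
        deltas.append(delta)
    delta = reward - V[-1]              # US: reward on leaving the final state
    V[-1] += alpha * delta
    deltas.append(delta)
    return deltas

V = [0.0] * 4                           # value estimates for 4 in-trial states
first = run_trial(V)
for _ in range(300):
    last = run_trial(V)

print(first)  # error only at the US: [0.0, 0.0, 0.0, 0.0, 1.0]
print(last)   # error has shifted to the CS; the US error is near zero
```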
In this work, we propose a large-scale neural circuit model that extends Brown et al.'s [16] model to investigate the phasic activity not only of DA and LHb neurons but also of the extended parts of the network, namely the GPb, GPi, and RMTg. In addition to the neural circuit pathways in Brown et al. [16] that control DA signaling (see above), our model includes pathways from the striosome and the VS to the LHb, and a pathway from the LHb to the VTA/SNc via the RMTg. These additional pathways are necessary to account for the observed phasic activity of LHb neurons (Figure 2). Further, the pathway from the LHb to the VTA/SNc via the RMTg provides inhibition of DA neural activity when an expected reward is omitted or when the outcome is aversive. This interareal connectivity is constrained by currently available knowledge from physiological studies (see below for supporting evidence).
Based on simulation results, our model can account for various experimental observations of phasic activation with rewarding or nonrewarding CS, with or without reward outcomes. Specifically, the model accounts for a shift of VTA/SNc and LHb neuron responses from outcome to CS, in agreement with experiments. In addition, the model accounts for the phasic activity of GPb and RMTg neurons, whose responses are similar to those of LHb neurons. Our model sheds light on the mechanism of VTA/SNc and LHb phasic activity at the neural circuit level, with important roles for the parallel excitatory and inhibitory pathways in the learned responses; namely, (i) the VS-PPTN-VTA/SNc pathway excites DA, while the striosome-VTA/SNc pathway inhibits DA; (ii) the VS-VP-GPb-LHb pathway inhibits LHb, while the striosome-GPi-GPb-LHb pathway excites LHb; and (iii) the LHb-RMTg-VTA/SNc pathway magnifies the phasic activity of VTA/SNc. The model is also rather resilient to overall changes in the interregional connections. Finally, our model predicts that the striosome is important, since it may remember the timing of the previous reward and provide the comparison signal for the present reward.

Figure 2 caption: Hong and Hikosaka [21] show that the ventral striatum (VS) excites the PPTN and the ventral pallidum (VP). Striosome neurons project to GPi neurons, which in turn project to the GPb. Dopaminergic (DA) neurons are excited by cortical inputs encoding conditioned stimuli and lateral hypothalamus inputs encoding unconditioned stimuli via the VS-VP-GPb-LHb-RMTg-VTA/SNc path and the VS-PPTN-VTA/SNc path, and are inhibited via the striosome-VTA/SNc path. Note that the striosome contains an adaptive spectral timing mechanism and can learn to generate lagged, adaptively timed signals [16]. LHb neurons are excited via the striosome-GPi-GPb-LHb path and inhibited via the VS-VP-GPb-LHb path.

2.1. Model Architecture.
Our proposed neural circuit model is schematically shown in Figure 2; it is an extended version of the model proposed by Brown et al. [16]. Namely, we include the GPb, LHb, and RMTg neural populations based on more recent experimental findings [2,3,19,20]. The details of each part of the model are described as follows.
2.1.1. LHb Inhibits SNc/VTA via RMTg. Most LHb neurons are glutamatergic [22], but experiments show that the LHb inhibits DA neurons. First, in vivo recordings demonstrate that most LHb neurons are excited by a nonreward-predicting cue and inhibited by a reward-predicting cue when rhesus monkeys perform a visually guided saccade task [5]. The phasic activity of LHb neurons is opposite to that of DA neurons in terms of outcome valence: LHb (DA) neurons are excited (inhibited) by a nonreward/punishment outcome or cue and inhibited (excited) by a reward outcome or cue [5,8]. Second, LHb neurons respond to cues earlier than DA neurons in unrewarded trials [5]. Third, stimulating LHb neurons inhibits DA neurons [21]. The inhibition of the LHb on DA neurons may arise from a direct projection from LHb neurons to inhibitory interneurons in the VTA/SNc [23], or indirectly through an inhibitory nucleus. In fact, experiments have revealed a path from the LHb to DA neurons through the RMTg, and neurons in the RMTg seem to encode aversive stimuli [19,20]. The RMTg thereby converts the negative reward-prediction-error signal of LHb neurons into the positive reward-prediction-error signal of DA neurons [3]. For simplicity, we only include the indirect path from the LHb to DA neurons via the GABAergic RMTg.
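The net sign inversion along this indirect path can be illustrated with a toy steady-state rate sketch (all weights and the baseline are arbitrary illustrative values, not the model's parameters):

```python
def relu(x):
    """Firing rates cannot be negative."""
    return max(x, 0.0)

def da_rate(lhb, excitation=0.0, baseline=0.5, w_lhb_rmtg=1.0, w_rmtg_da=1.0):
    """Steady-state sketch: glutamatergic LHb drives the GABAergic RMTg,
    which inhibits DA neurons - a net sign inversion from LHb to DA."""
    rmtg = relu(w_lhb_rmtg * lhb)                          # RMTg excited by LHb
    return relu(baseline + excitation - w_rmtg_da * rmtg)  # DA inhibited by RMTg

print(da_rate(lhb=0.0))   # baseline DA activity: 0.5
print(da_rate(lhb=1.0))   # LHb burst -> DA dip: 0.0
```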

2.1.2. GPb Excites LHb. Low-intensity electrical stimulation in the GPb can evoke a short-latency excitatory response in LHb neurons [21]. The excitation of GPb neurons on LHb neurons may be mediated by acetylcholine or glutamate [2], or by disinhibition through intra-LHb interneurons, considering the complex microcircuitry within the GP [2,24]. In addition, glutamatergic projections to the LHb from the rat's entopeduncular nucleus or the primate's GPb have been observed experimentally [25,26]. In brief, there are excitatory projections from GPb to LHb, which form a pathway from GPb to VTA/SNc via LHb and RMTg [19,27].

2.1.3. Striosome Projects to GPb via GPi. Hong and Hikosaka [21] observed that typical neurons in the external and internal segments of the globus pallidus (GPe and GPi) are first inhibited by striatal stimulation, but GPb neurons are often (though not always) excited or disinhibited by striatal stimulation. They proposed that signals to the GPb are mediated through inhibitory axon collaterals within the striatum [28] or the GPe [24]. Based on these observations, we conjecture that the striosome projects to the LHb through the GPi.
2.1.4. VP Inputs to GPb. In Brown et al.'s [16] model, VP neurons are inhibited by the expectation of reward. However, recent experiments show that the majority of VP neurons are excited by the expectation of a large reward [21]. Since LHb neurons are instead inhibited by reward expectation, the VP-to-LHb connections could well be inhibitory [21]. Therefore, we assume that reward-related signals are transmitted to the LHb through excitatory connections from the GPb and inhibitory connections from the VP.

2.1.5. Excitatory Inputs from VS to VP and PPTN.
Although VS neurons are usually identified as GABAergic and inhibit downstream neurons, Hong and Hikosaka [21] showed that the striatal (GABAergic) neurons excite PPTN and VP neurons. The excitation by VS neurons can be mediated by substance P [29,30]. Thus, we assume that the VS directly excites the PPTN and VP.

Dynamical Equations
We give the model circuit different inputs to simulate different conditions as follows. First, we simulate trials 1 to 99 with a rewarding CS and a rewarding US: the learning trials, in which the network learns to associate the rewarding CS with the rewarding US. The 100th trial is a "test" trial in which the network receives the rewarding CS but a nonrewarding US.
We then simulate the unexpected-reward condition, that is, a nonrewarding CS followed by a rewarding US. From the 101st to the 199th trial, the network receives a nonrewarding CS and a nonrewarding US and learns to associate the two. At the 200th trial, the network receives the nonrewarding CS but a rewarding US. See Figure 3(a) for a summary of the learning protocol.
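The four-phase protocol above can be written as a small schedule function (a sketch; the function name and condition labels are mine, not from the simulation code):

```python
def trial_condition(trial):
    """Stimulus schedule used in the simulations (trials numbered from 1):
    returns (cs, us) labels for the given trial."""
    if 1 <= trial <= 99:
        return ("reward CS", "reward US")        # learning: CS-US association
    if trial == 100:
        return ("reward CS", "nonreward US")     # test: reward omission
    if 101 <= trial <= 199:
        return ("nonreward CS", "nonreward US")  # relearning with nonreward
    if trial == 200:
        return ("nonreward CS", "reward US")     # test: unexpected reward
    raise ValueError("trial out of range")

print(trial_condition(100))  # ('reward CS', 'nonreward US')
```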
We implement different inputs from the cortex to the VS and striosome for four conditions: rewarding CS, nonrewarding CS, rewarding US, and nonrewarding US. The rewarding/nonrewarding CS and US are shown in Figure 3, and their mathematical expressions are given in the Mathematics and Equations section. Note that the inputs from the cortex are always nonnegative (firing-rate activity cannot be negative).
The motivation for such an implementation is based on several lines of evidence. First, neurons in the orbitofrontal cortex fire most strongly for cues that predict large reward (with small penalty) and least strongly for cues that predict large penalty (with small reward), relative to neutral conditions (small reward and small penalty) [32,33]. Second, cortical neurons, including those in the frontal cortex, are known to exhibit flexible and mixed response properties; that is, different cortical neurons can respond differently to identical stimuli [34,35]. For instance, an identical tone can evoke different responses from different cortical neurons, which can in turn separately transmit information to the same downstream neurons (e.g., in the midbrain). Third, the expectation values of cue signaling are stored in the cortex, not in the basal ganglia or LHb [36,37]. The phasic activity of DA neurons can induce plasticity in the cortex and change the representation of cue signaling [38]. In fact, the activity profiles in Figures 3(d) and 3(e) resemble those of DA release or nonrelease (as measured, e.g., by voltammetry [39]). Also, the sustained or persistent activity in Figure 3(b) could represent (working) memory of the cue, a commonly observed phenomenon in frontal cortical neurons [36,37,40], while the suppressed activity in Figure 3(c) can be thought of as an inhibitory effect relative to the response in Figure 3(b).

Shift of Phasic Response from US to CS.
Many experimental and theoretical studies have reported the shift of DA neuron responses from US to CS [41-43]. As discussed previously, in the initial phase of learning, DA neurons are phasically activated above baseline upon the presentation of an unpredicted reward. An accompanying cue becomes associated with the rewarding outcome through learning. After learning, the phasic activity at reward outcome decreases to baseline, while phasic activity now appears at cue onset (Figure 1).
Our simulations replicate this trend (Figure 4). When the network receives the rewarding CS and rewarding US (during the first 99 trials), DA neurons exhibit phasic activity upon the US in the first trial (Figure 4(a)). In the second and subsequent trials, the peak appears at CS onset and the earlier peak at US onset disappears (Figures 4(b) and 4(c)).
The parallel pathways in our model account for the shift in neural response from US to CS. At the beginning of the learning phase, the CS-to-VS synaptic weights W and the CS-to-striosome synaptic weights Z are very small or near zero. Thus, the activity of the striosome remains at baseline, while the activity of the VS peaks at US onset. The peak activity of the VS then propagates to the LHb through the VS-VP-GPb-LHb pathway, producing a dip in LHb activity at the US. Meanwhile, phasic input to DA neurons through the VS-VP-GPb-LHb-RMTg-VTA/SNc and VS-PPTN-VTA/SNc pathways produces phasic DA activity at the rewarding US. This phasic DA activity in turn enhances the positive reinforcement-learning signal N+ (see (7)), which strengthens the afferent inputs to the VS and striosome from the cortex: the increased weights W and Z enhance the CS signal along the pathways from VS to DA via the PPTN (VS-PPTN-VTA/SNc) and the VP (VS-VP-GPb-LHb-RMTg-VTA/SNc), the pathway from striosome to DA (striosome-VTA/SNc), and the pathway from striosome to DA via the GPb (striosome-GPi-GPb-LHb-RMTg-VTA/SNc).
The striosome in the model has an adaptive timing spectrum, encoding the timing and amount of reward associated with the CS [16,44,45] (see (10)-(14)). Therefore, through the VS-PPTN-VTA/SNc pathway, a rewarding CS can trigger phasic activity of DA neurons (Figures 4(a)-4(c)), while a nonrewarding CS can trigger a dip in activity (Figures 5(c)-5(d)). The signal of the rewarding CS through the striosome inhibits DA neurons at the time when the rewarding US is expected to be present, but the excitation from the rewarding US through the VS-PPTN-VTA/SNc pathway cancels this CS-driven inhibition, leading to baseline DA activity at the rewarding US (Figures 4(c) and 5(a)). On the contrary, a nonrewarding US cannot trigger enough excitation to cancel the CS-driven inhibition, leading to a dip in activity at nonrewarding US onset (Figure 5(b)).
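As an illustration of spectral timing in general (not the model's exact equations (10)-(14)), the sketch below uses alpha-function kernels with a spectrum of buildup rates; after training, weighting the channel whose rate matches the CS-US delay yields an output that peaks at the expected reward time. The kernel form, rates, and weights are all assumptions for illustration.

```python
import math

def alpha_kernel(t, rate):
    """Alpha-function response peaking at t = 1/rate: a stand-in for one
    striosomal second-messenger timing channel."""
    s = rate * t
    return s * math.exp(1.0 - s) if t >= 0 else 0.0

rates = [1.0 / d for d in (0.4, 0.8, 1.6, 3.2)]  # spectrum of buildup rates

def timed_signal(t, weights):
    """Learned striosomal output: weighted sum over the timing spectrum."""
    return sum(w * alpha_kernel(t, r) for w, r in zip(weights, rates))

# Suppose training with a 1.6 s CS-US delay left weight only on the matching
# channel; the output then peaks at the expected reward time.
weights = [0.0, 0.0, 1.0, 0.0]
peak_t = max((t / 100 for t in range(0, 500)), key=lambda t: timed_signal(t, weights))
print(round(peak_t, 2))  # peaks at the trained 1.6 s delay
```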
Experimental studies have shown that the phasic activity of the LHb is opposite to that of DA neurons in terms of response to reward valence, but with a similar shift from US to CS. In our model, LHb neurons are inhibited and show a dip in activity upon rewarding US onset (Figure 4(d)). The dip in LHb activity shifts from the US to the rewarding CS in subsequent trials (Figures 4(e)-4(f)). As mentioned previously, an unexpected rewarding US can switch on the striosome-GPi-GPb-LHb and VS-VP-GPb-LHb pathways. Before they are switched on, the rewarding US inhibits LHb neurons through the VS-VP-GPb-LHb pathway (Figure 4(d)). Once the striosome-LHb and VS-LHb pathways are switched on, the rewarding CS effectively inhibits LHb neurons through the VS-VP-GPb-LHb pathway, leading to a dip at the time of the rewarding CS, while the inhibition caused by the rewarding US is canceled by excitation from the striosome-GPi-GPb-LHb pathway, leading to baseline LHb activity at the time of the rewarding US (Figure 4(f)).

Neural Pathways Underlying Learned Phasic Activity of DA Neurons.
The phasic activity of DA neurons has been suggested to encode reward-prediction error and to play a pivotal role in reinforcement learning [8,46,47]. DA neural activity in our model exhibits reward-prediction errors consistent with experimental observations (Figure 5(f)). For instance, after 99 trials of training, the network can already associate the rewarding CS with the rewarding US.

Complexity
The DA neurons show phasic activity upon CS onset (at 2 s in Figure 5(a)). At the 100th trial, we simulate the condition in which the expected reward is omitted: DA neurons are excited right after CS onset (2 s) and inhibited at US presentation (3.6 s) (Figure 5(b)). The network then reassociates the CS with the nonrewarding US over the training from the 101st to the 199th trial. The activity of DA neurons then shows a dip when the nonrewarding CS is presented at 2 s and baseline activity when the nonrewarding US is presented at 3.6 s (Figure 5(c)). Finally, at the 200th trial, we present the nonrewarding CS and a rewarding US to simulate an unexpected-reward condition: DA neurons are inhibited upon CS presentation (2 s) but excited when the rewarding US is presented once again (3.6 s) (Figure 5(d)). The overall activity profile of DA neurons is summarized in Figure 5(e) and is consistent with experimental observations (Figure 5(f)).
The above phasic responses of DA neural activity associated with the learned stimuli can be understood in terms of the two parallel pathways in the circuit: the VS-PPTN-VTA/SNc and the striosome-VTA/SNc pathways. Note that, after the first trial, the synaptic strengths W and Z are no longer zero, so the VS responds to both the rewarding CS and the rewarding US. The DA neurons are then excited by the rewarding CS through the VS-PPTN-VTA/SNc pathway. When the rewarding US is presented, the signal of the rewarding CS triggers the activity of striosomal neurons, which directly inhibit DA neurons. However, this inhibition is canceled out by the excitation from the rewarding US through the VS-PPTN-VTA/SNc pathway. Thus, the activity of DA neurons is effectively maintained at baseline (Figure 5(a)). By the 99th trial, the network has fully associated the rewarding CS with the rewarding US. Now, if the rewarding US is omitted (at the 100th trial), no excitation counterbalances the direct inhibition from the striosome, leading to a dip in the activity of DA neurons (Figure 5(b)). This continues until the 199th trial. When the network is presented with a nonrewarding CS followed by a nonrewarding US, the direct inhibitory pathway from the striosome to DA neurons has been turned off; DA neurons show a phasic dip upon nonrewarding CS onset, and their activity is maintained at baseline at the time of the nonrewarding US (Figure 5(c)). With a subsequent unexpected rewarding US in trial 200, DA neurons are excited through the VS-PPTN-VTA/SNc pathway, while the nonrewarding CS still causes a dip in activity (Figure 5(d)).
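The cancellation logic of the two DA-controlling pathways can be caricatured as a steady-state sum (all values are illustrative assumptions, not the model's fitted parameters):

```python
def relu(x):
    """Firing rates cannot be negative."""
    return max(x, 0.0)

def da_response(cs_learned, us_present, baseline=0.2):
    """Toy steady-state account of DA activity at the US time: excitation
    via VS-PPTN-VTA/SNc vs. timed striosomal inhibition gated by a
    learned CS."""
    excitation = 1.0 if us_present else 0.0  # via VS-PPTN-VTA/SNc
    inhibition = 1.0 if cs_learned else 0.0  # timed striosome-VTA/SNc
    return relu(baseline + excitation - inhibition)

print(da_response(cs_learned=False, us_present=True))   # unexpected reward: burst
print(da_response(cs_learned=True,  us_present=True))   # predicted reward: ~baseline
print(da_response(cs_learned=True,  us_present=False))  # omission: dip (0.0)
```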

Neural Pathways Underlying Learned Phasic Activity of LHb Neurons.
Experimental studies have shown that the phasic activity of the LHb behaves in an opposite way to that of DA neurons [5]. Hence, it has been suggested that LHb neurons play a key role in coding aversive/negative signals [48,49]. Experiments have also investigated the activity of several related brain nuclei, such as the GPb [2] and RMTg [3], to explore their possible functional relationship with these regions.
Here, we simulate the activity of these nuclei, and the results are consistent with the experimental observations. Our simulations show that the phasic responses of LHb neurons shift from US to CS. LHb neurons show a phasic dip when the unexpected rewarding US is presented in the first trial (Figure 4(d)). In the following trials, the dip shifts to the time when the rewarding CS is presented (Figures 4(e)-4(f)), with baseline activity at the rewarding US (Figure 6(a)) and a small phasic activation upon the nonrewarding US (Figure 6(b)). After training with the nonrewarding CS from the 101st to the 199th trial, LHb neurons show phasic activation upon the nonrewarding CS (2 s) while maintaining baseline activity at the time of the nonrewarding US (Figure 6(c)). At the 200th trial, LHb neurons show peak activity at the nonrewarding CS but a large dip in activity given the unexpected rewarding US (Figure 6(d)). The overall activity profile of LHb neurons (Figure 6(e)) agrees with the experimental observations (Figure 6(f)).
The above-mentioned learned phasic activity of LHb neurons can be explained by two parallel pathways: the striosome-to-LHb pathway via the GPi and GPb, and the VS-to-LHb pathway via the VP and GPb. For instance, at the 99th trial, the synaptic strengths W and Z are not zero, meaning that the network has completely associated the rewarding CS with the rewarding US. The rewarding CS can then inhibit LHb neurons through the inhibitory VS-VP-GPb-LHb pathway. When the rewarding US appears, the inhibition through this pathway is canceled out by the excitation from the striosome-GPi-GPb-LHb pathway, resulting in baseline LHb activity at the time of the rewarding US. At the 100th trial, LHb neurons show a dip in the presence of the rewarding CS, but the omission of reward means that the excitation through the striosome-GPi-GPb-LHb pathway is not canceled out, which leads to a small phasic activation of LHb neurons upon reward omission. At the same time, the synaptic strength Z from the cortex to the striosome decreases to zero. When the nonrewarding CS is next paired with a nonrewarding US (from the 101st to the 200th trial), LHb neurons show phasic activation at nonrewarding CS onset because the suppressed drive along the VS-VP-GPb-LHb pathway disinhibits the LHb. In the 200th trial, the unexpected rewarding US drives the inhibitory VS-VP-GPb-LHb pathway, which leads to a dip in the activity of the LHb neurons.

Learned Phasic Activity of GPb and RMTg.
Experiments have shown that GPb and RMTg neurons display phasic responses to the CS and US. In our model, the interaction between the striosome-GPi-GPb and VS-VP-GPb pathways produces the phasic activity of GPb neurons upon CS and US presentation. In particular, the GPb, LHb, and RMTg are connected with effectively excitatory synapses (Figure 2), and hence their phasic activities are correlated with that of the LHb, with the same explanations of the activity profiles as for the LHb (Figures 7 and 8). Moreover, the LHb-RMTg-VTA/SNc pathway only magnifies the phasic activity of DA neurons and does not qualitatively change their activity profile. We next investigate the robustness of the phasic activities in our model with respect to variation in connectivity strength. Specifically, we increase or decrease all synaptic weights by 10% and monitor how the phasic activities change.

Robustness Analysis of the Two Parallel Pathways
First, we found that the phasic activities of DA and LHb neurons did not change substantially when we increased or decreased most individual synaptic weights by 10% (data not shown). Second, the weights of synapses on the VP-GPb-LHb-RMTg-VTA/SNc pathway were found to influence the tonic baseline activity of DA neurons. Hence, when we increase or decrease the weights along this pathway, we adjust the DA baseline so as to maintain the phasic activity of DA and LHb neurons (see Table 3). In Figures 9 and 10, we show the activity of DA and LHb neurons for three different sets of synaptic weights from VP to GPb and the corresponding baseline activities. DA and LHb neurons continue to demonstrate their characteristic phasic activity profiles. In brief, our neural circuit model is robust to variation of the synaptic weights.
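The spirit of this robustness check can be sketched with a toy stand-in for the full simulation (illustrative values only): when excitatory and inhibitory weights are scaled by the same factor, their cancellation at the US is preserved, so the qualitative profile survives a global ±10% change.

```python
def phasic_profile(weights):
    """Toy stand-in for a full simulation run: DA response at CS time and
    at US time for a trained rewarded trial (illustrative parameters)."""
    baseline = 0.2
    da_cs = max(baseline + weights["exc"], 0.0)                   # learned CS burst
    da_us = max(baseline + weights["exc"] - weights["inh"], 0.0)  # cancellation at US
    return da_cs, da_us

base = {"exc": 1.0, "inh": 1.0}
profiles = []
for scale in (0.9, 1.0, 1.1):
    scaled = {k: v * scale for k, v in base.items()}
    da_cs, da_us = phasic_profile(scaled)
    profiles.append((scale, da_cs, da_us))
    # the qualitative profile (CS burst, ~baseline US) survives +-10% scaling
    print(scale, round(da_cs, 2), round(da_us, 2))
```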

Discussion
We extended a previous neural circuit model [16] by incorporating the GPb, LHb, and RMTg, and the resulting model accounts for various experimental data from separate studies. Specifically, the model exhibits the shift of DA and LHb neural responses from US to CS presentation times. Our simulations also replicated the phasic activity of DA, LHb, GPb, and RMTg neurons observed in experiments. The DA (LHb) neurons exhibited a phasic peak (dip) upon a rewarding CS and maintained baseline activity in response to a rewarding outcome, but showed a phasic dip (peak) if the reward was omitted. By contrast, the DA (LHb) neurons exhibited a phasic dip (peak) in response to a nonrewarding or punishment-predicting CS and maintained baseline activity in response to the nonrewarding US, but showed a phasic peak (dip) if a reward occurred or the aversive US was omitted.

Our model provides insights into the neural circuit mechanism of DA and LHb phasic activity. In particular, parallel excitatory and inhibitory pathways underlie the learned responses: the VS-PPTN-VTA/SNc pathway excites DA, while the striosome-VTA/SNc pathway inhibits DA; the VS-VP-GPb-LHb pathway inhibits LHb, while the striosome-GPi-GPb-LHb pathway excites LHb; and the LHb-RMTg-VTA/SNc pathway magnifies the phasic activity of DA. During learning, DA modulates the corticostriatal and corticostriosomal synapses, which in turn affects the DA responses, closing the loop. After learning, the weights of these synapses stabilize and remain unchanged. This leads to the emergent phasic activity profiles of the nuclei in the circuit, with the parallel pathways balancing one another out. In addition, we found the striosome to be a key brain nucleus, which remembers the timing of previous rewards and encodes the predicted rewards. In fact, recent experimental work [51] supports this model prediction.

In our model, we predict that neurons in the striosome encode expected reward, but there are alternative theories. For example, Cohen et al. [52] found three types of VTA neurons and suggested that VTA GABAergic neurons may signal expected reward, which could be a key variable for dopaminergic neurons in calculating reward-prediction error. Recent works [53-55] highlight the importance of VTA GABAergic neurons. Averbeck and Costa [56] proposed that the amygdala can learn and represent expected values like the striatum, and predicted that the amygdala may play a central role in reinforcement learning, with the ventral striatum playing a less primary role. Wagner et al. [57] suggested that cerebellar granule cells may encode the expectation of reward. Luo et al. [58], Li et al. [59], and Hayashi et al. [60] found that serotonin neurons in the dorsal raphe nucleus can encode reward signals. Some physiological and theoretical works [17,18,61-63] focus on D1 and D2 receptors in the ventral striatum and suggest that they play an important role in computing reward-prediction error. Future neural circuit modeling efforts would need to incorporate such findings.
To obtain results consistent with experiments, we adopted several assumptions. First, we assumed that striatal neurons excite the PPTN and the ventral pallidum. Striatal neurons are usually identified as GABAergic and inhibitory, but they may excite downstream neurons through a disinhibitory effect or through substance P released by striatal neurons [29,30]. In fact, it has been demonstrated that substance P mediates the excitatory interaction between striatal neurons and VP neurons [29] and between striatal projection neurons [30]. Second, we hypothesized that the striosome projects to the GPi, which in turn projects to the GPb. Although we have no direct evidence, Hong and Hikosaka [21] observed that typical GPe and GPi neurons are first inhibited by striatal stimulation, whereas GPb neurons are often (but not always) excited by it. They proposed that inputs to the GPb are mediated through inhibitory axon collaterals within the striatum [28] or GPe [24].
While developing the model, we tried to add minimal features to the previous model of Brown et al. [16]. Hence, it is worth noting that we ignored several factors to simplify the model. Specifically, we ignored the connections between some brain nuclei, such as the cortex-to-GPb [2], VP-to-RMTg [3], LHb-to-LHb, cortex-to-LHb [48], and DA-to-striatum [64] pathways. We also did not consider the direct LHb-to-VTA [65] and VTA-to-LHb [66] connections in our simulation, but we mimicked the overall inhibition of the LHb on the VTA. We also ignored the different response types within many brain nuclei. For instance, studies have suggested three types of GPb neurons: reward-positive, reward-negative, and direction-selective [2]. Our model only considers the reward-negative type, since the majority of GPb neurons are of this type and they may play a key role in reward-related information transmission.
Despite these assumptions, our neural circuit model can still implement the computation for reward-based phasic signaling and reinforcement learning, as observed in a variety of experiments. The phasic activities in multiple brain regions represent prediction-error signals, which not only associate the cue with the outcome but also memorize the specific time interval between the two. This requires the neural system to hold the information predicted by the cue, compare it with the outcome, and report the result of the comparison. In our model, the timing spectrum of the striosome and the parallel excitatory and inhibitory pathways provide the platform for such computation. The peak activities of DA and LHb neurons play complementary roles, encoding reward and nonreward/punishment information separately and alleviating any flooring (limiting) effect of the dip in activity of either neuron type. Our neural circuit model with parallel pathways provides an instantiation of such complex neural computation.

Mathematics and Equations
This section lists the mathematical equations of the model (Figure 2).We give the model circuit different inputs to simulate different conditions.We use differential equations to simulate the firing rates (or the activity levels) of the neurons in different brain nuclei.The model variables are summarized in Table 1, the fixed parameters are summarized in Table 2, and the mathematical expressions are below.
(i) Different Inputs in Each Trial (Figure 2). The cortex, especially the orbitofrontal cortex (OFC), encodes the expectation of future outcomes, and its responses reflect the value conveyed by the combination of reward and punishment of the cue [36,37]. Furthermore, OFC neurons fire most strongly for cues that predict large reward or small penalty and least strongly for cues that predict large penalty or small reward, relative to neutral conditions [32,33]. Therefore, we set a larger value for the rewarding cue and a smaller but positive value for the nonrewarding cue, as follows.
When the network receives a reward CS, the inputs from the cortex increase abruptly and remain elevated until the time at which the expected reward should be given. The inputs then decay exponentially to the baseline activity level.
For the nonreward CS input, we set the background input level to 0.20.
When the network receives a reward US, the inputs from the lateral hypothalamus increase abruptly and last for a very short duration. The inputs then decay exponentially to the baseline activity level.
For the nonreward US input: if the network does not receive a reward, or receives a nonreward (aversive or punishing) outcome, we assume the inputs in that trial do not change and remain at the baseline level.
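The input time courses described above can be sketched as follows. The CS onset (2.0 s), reward time (3.4 s), and background level of 0.20 come from the text; the amplitudes, decay constants, and US pulse width are illustrative assumptions:

```python
import numpy as np

DT = 0.001              # integration step (s); illustrative
BASELINE = 0.20         # background input level given in the text
T_CS, T_US = 2.0, 3.4   # CS onset and reward time from the simulation protocol

def reward_cs_input(t_end=5.0, amplitude=1.0, tau_decay=0.2):
    """Reward-CS drive: step up at CS onset, hold until the expected
    reward time, then decay exponentially back to baseline."""
    t = np.arange(0.0, t_end, DT)
    drive = np.full_like(t, BASELINE)
    on = (t >= T_CS) & (t < T_US)
    drive[on] = BASELINE + amplitude
    after = t >= T_US
    drive[after] = BASELINE + amplitude * np.exp(-(t[after] - T_US) / tau_decay)
    return t, drive

def reward_us_input(t_end=5.0, amplitude=1.0, pulse=0.05, tau_decay=0.1):
    """Reward-US drive: brief pulse at the reward time, then exponential decay."""
    t = np.arange(0.0, t_end, DT)
    drive = np.full_like(t, BASELINE)
    on = (t >= T_US) & (t < T_US + pulse)
    drive[on] = BASELINE + amplitude
    after = t >= T_US + pulse
    drive[after] = BASELINE + amplitude * np.exp(-(t[after] - T_US - pulse) / tau_decay)
    return t, drive
```

The nonreward US input needs no function of its own: it is simply a constant array at `BASELINE`.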
(ii) Differential Equations. First, the changes in the activation level of the ventral striatal cells are governed by [16] the balance between passive decay and excitation from the CS inputs and the US inputs.
The synaptic weight of the US input is fixed, while the synaptic weights of the CS inputs are plastic.
These synaptic weight changes are induced by the phasic dopamine burst and dip signals (defined in (7) and (8)).
Learning is gated by the delayed release of a second messenger, and the calcium signal is governed by (9) and (11) at a rate of 12.5.
The positive reinforcement-learning signal derives from excitatory phasic fluctuations of the dopamine signal above baseline; the complementary negative reinforcement-learning signal derives from inhibitory phasic fluctuations of the dopamine signal below baseline. Second, striosomes play an important role in the phasic activities of DA and LHb neurons because of their timing spectrum mechanism: a spectrum of striosomal MSPN second-messenger activities responds to each input at its own buildup rate. In (11), the bracketed term is a step function: during the brief interval when the calcium concentration at a particular spine exceeds a threshold Γ, the CS-striosomal weight at that spine becomes eligible for changes that may be induced by dopaminergic bursts or dips. The pre-excitatory and pre-inhibitory terms can be regarded as the effects of substance P and GABA on the PPTN: ventral striatal neurons can secrete both substance P and GABA, and substance P excites PPTN neurons whereas GABA inhibits them.
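The dopamine-gated, calcium-eligibility weight update described above can be sketched as below. The threshold `gamma`, the learning rate, and the soft weight bounds are hypothetical placeholder values, and the burst/dip signals are modeled simply as rectified deviations of DA activity from its baseline:

```python
import numpy as np

def update_cs_weight(w, calcium, da, da_baseline=0.2,
                     gamma=0.5, lr=0.05, w_max=1.0):
    """Sketch of the three-factor update: the CS-striosomal weight is
    eligible for change only while the spine calcium signal exceeds the
    threshold gamma; a dopamine burst (activity above baseline) then
    potentiates the weight and a dip (below baseline) depresses it.
    All parameter values here are illustrative assumptions."""
    if calcium <= gamma:
        return w                           # synapse not eligible: no change
    burst = max(da - da_baseline, 0.0)     # positive reinforcement signal
    dip = max(da_baseline - da, 0.0)       # negative reinforcement signal
    # Soft bounds keep the weight in [0, w_max].
    w = w + lr * (burst * (w_max - w) - dip * w)
    return float(np.clip(w, 0.0, w_max))
```

Because the calcium trace peaks at a cue-dependent delay, only the spine whose timing matches the reward interval is eligible when the dopamine burst or dip arrives, which is how the model ties learning to the specific CS-US interval.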

Figure 1: Schematic diagram of the phasic activity of DA neurons (left, orange) and LHb neurons (right, yellow) given a rewarding CS (top) and a nonrewarding/aversive CS (bottom). Each row denotes one outcome situation.

Figure 2: Model circuit. Orange arrowheads denote excitatory pathways, blue circles denote inhibitory pathways, and hemidisks denote synapses at which learning occurs. Black dashed lines denote dopaminergic signals. Evidence [21] shows that the ventral striatum (VS) excites the PPTN and the ventral pallidum (VP). Striosome neurons project to GPi neurons, which in turn project to the GPb. Dopaminergic (DA) neurons are excited by cortical inputs encoding conditioned stimuli and lateral hypothalamus inputs encoding unconditioned stimuli via the path VS-VP-GPb-LHb-RMTg-VTA/SNc and the path VS-PPTN-VTA/SNc. DA neurons are inhibited by the CS inputs via the path striosome-VTA/SNc. Note that the striosome contains an adaptive spectral timing mechanism and can learn to generate lagged, adaptively timed signals [16]. LHb neurons are excited by the CS inputs via the path striosome-GPi-GPb-LHb and inhibited by the CS and US inputs via the path VS-VP-GPb-LHb.

Figure 3: Model simulation protocol. (a) Different inputs are applied to simulate different conditions. We simulated a total of 200 trials. In the first 99 trials, we present the reward CS input and the reward US input to simulate the learning process that associates the reward CS with the reward US. In the 100th trial, we present the reward CS input but the nonreward US input; thus, one predicts a reward but does not receive it. In the next 99 trials, we present the nonreward CS input and the nonreward US input to simulate the learning process that associates the nonreward CS with the nonreward US. In the 200th trial, we present the nonreward CS input but the reward US input, simulating the situation in which one predicts nonreward but receives reward. (b)-(e) Different inputs. The yellow dashed line indicates the time at which the CS appears (2.0 s), and the green dashed line indicates the time at which the reward is released or not (3.4 s). (b) Reward CS input. (c) Nonreward CS input. (d) Reward US input. (e) Nonreward US input.
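The 200-trial schedule in (a) can be summarized programmatically; the function below simply encodes the protocol stated in the caption:

```python
def trial_inputs(trial):
    """Return the (CS, US) input types presented on a given 1-indexed
    trial of the 200-trial simulation protocol."""
    if 1 <= trial <= 99:
        return ("reward CS", "reward US")        # acquire the reward pairing
    if trial == 100:
        return ("reward CS", "nonreward US")     # unexpected reward omission
    if 101 <= trial <= 199:
        return ("nonreward CS", "nonreward US")  # acquire the nonreward pairing
    if trial == 200:
        return ("nonreward CS", "reward US")     # unexpected reward delivery
    raise ValueError("trial must be in 1..200")
```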

Figure 4: The shift of DA and LHb neurons' responses from US to CS. At the beginning of our simulation, the model circuit receives a reward CS and a reward US. FR: neural firing-rate activity. (a) Response of DA neurons in the 1st trial: DA neurons exhibit a phasic peak upon the US before learning and do not respond to the CS. (b) Response of DA neurons in the 2nd trial: the activity of DA neurons shows a peak upon the CS and a peak upon the US. The response upon the US is weaker than in the 1st trial. The responses of DA neurons in the 3rd to 98th trials are similar to (b), but the peak upon the US gets weaker over trials. (c) Response of DA neurons in the 99th trial: the activity of DA neurons shows a peak upon the CS but baseline responding to the US after learning. (d) Response of LHb neurons in the 1st trial: LHb neurons exhibit a phasic dip upon the US before learning and do not respond to the CS. (e) Response of LHb neurons in the 2nd trial: the activity of LHb neurons shows a dip upon the CS and a dip upon the US. The response upon the US is weaker than in the 1st trial. The responses of LHb neurons in the 3rd to 98th trials are similar to (e), but the dip upon the US gets weaker trial by trial. (f) Response of LHb neurons in the 99th trial: the activity of LHb neurons shows a dip upon the CS but baseline responding to the US after learning. (a), (b), and (c) show the shift of the DA neural response from US to CS after learning, while (d), (e), and (f) show the shift of the LHb neural response. The yellow dashed line indicates the time at which the CS appears, and the green dashed line indicates the time at which the reward is released or not.

Figure 5: Acquired response of DA neurons. (a) The 99th trial: from the 1st to the 99th trial, the model circuit receives a rewarding CS and a rewarding US. The result shows that, after learning, DA neurons exhibit a phasic peak upon the rewarding CS and baseline activity in response to the reward outcome. (b) The 100th trial: the model circuit receives a rewarding CS and a nonrewarding US. The result shows that DA neurons exhibit a phasic peak when the rewarding CS appears and a phasic dip at the time when the reward is expected. (c) The 199th trial: from the 101st to the 199th trial, the model circuit receives a nonrewarding CS and a nonrewarding US. The result shows that, after learning, DA neurons exhibit a phasic dip upon the nonrewarding CS and baseline activity when no reward is released in this trial. (d) The 200th trial: the model circuit receives a nonrewarding CS and a rewarding US. The result shows that DA neurons exhibit a phasic dip when the nonrewarding CS appears and a phasic peak upon the rewarding US. (e) The phasic activity of DA neurons under different situations. The thick red line indicates the activity of DA neurons in the 99th trial, the narrow blue line indicates the activity in the 100th trial, the thick blue line indicates the activity in the 199th trial, and the narrow red line indicates the activity in the 200th trial. The yellow dashed line indicates the time at which the CS appears, and the green dashed line indicates the time at which the reward is released or not. (f) The physiological experimental result reprinted from Matsumoto and Hikosaka [5]. Red lines indicate reward trials, and blue lines indicate no-reward trials. Full lines indicate reward CS-to-reward US (red) and nonreward CS-to-nonreward US (blue), while dashed lines indicate reward CS-to-nonreward US (blue) and nonreward CS-to-reward US (red).

Figure 6: Acquired response of LHb neurons. (a) The 99th trial: from the 1st to the 99th trial, the model circuit receives a rewarding CS and a rewarding US. The result shows that, after learning, LHb neurons exhibit a phasic dip upon the rewarding CS and baseline activity in response to the rewarding outcome. (b) The 100th trial: the model circuit receives a rewarding CS and a nonrewarding US. The result shows that LHb neurons exhibit a phasic dip when the rewarding CS appears and a phasic peak at the time when the reward should be released. (c) The 199th trial: from the 101st to the 199th trial, the model circuit receives a nonrewarding CS and a nonrewarding US. The result shows that, after learning, LHb neurons exhibit a phasic peak upon the nonrewarding CS and baseline activity given the omission of reward in this trial. (d) The 200th trial: the model circuit receives a nonrewarding CS and a rewarding US. The result shows that LHb neurons exhibit a phasic peak when the nonrewarding CS appears and a phasic dip upon the rewarding US. (e) The phasic activity of LHb neurons under different situations. The thick red line indicates the activity of LHb neurons in the 99th trial, the narrow blue line indicates the activity in the 100th trial, the thick blue line indicates the activity in the 199th trial, and the narrow red line indicates the activity in the 200th trial. The yellow dashed line indicates the time at which the CS appears, and the green dashed line indicates the time at which the reward is released or not. (f) The physiological experimental results reprinted from Hong and Hikosaka [2]. Red lines indicate reward trials, and blue lines indicate no-reward trials. Thick lines indicate reward CS-to-reward US (red) and nonreward CS-to-nonreward US (blue), while narrow lines indicate reward CS-to-nonreward US (blue) and nonreward CS-to-reward US (red).

Figure 9: The phasic activity of DA neurons given different weights of the synapses from the VP to the GPb. Yellow lines indicate the activity of DA neurons when the VP-to-GPb weight equals 1.00 and the corresponding baseline activity equals 0.19431, blue lines indicate the activity when the weight equals 1.10 and the baseline equals 0.20307, and red lines indicate the activity when the weight equals 0.90 and the baseline equals 0.18608. (a) Trial 1: phasic peak activity in response to the unconditioned reward. (b) Trial 2: the phasic activity shifts to the cue. (c) Trial 99: phasic activity upon the cue and baseline activity upon the reward. (d) Trial 100: dip activity upon reward omission. (e) Trial 199: dip activity upon the nonrewarding cue. (f) Trial 200: peak activity upon the unexpected reward.

Figure 10: The phasic activity of LHb neurons given different weights of the synapses from the VP to the GPb. Yellow lines indicate the activity of LHb neurons when the VP-to-GPb weight equals 1.00 and the corresponding baseline activity equals 0.19431, blue lines indicate the activity when the weight equals 1.10 and the baseline equals 0.20307, and red lines indicate the activity when the weight equals 0.90 and the baseline equals 0.18608. (a) Trial 1: phasic dip activity in response to the unconditioned reward. (b) Trial 2: the phasic activity shifts to the cue. (c) Trial 99: phasic activity upon the cue and baseline activity upon the reward. (d) Trial 100: peak activity upon reward omission. (e) Trial 199: peak activity upon the nonrewarding cue. (f) Trial 200: dip activity upon the unexpected reward.

Table 3: Baseline activity of DA neurons given increased or decreased synaptic weights.