Previously, reinforcement learning theory has inspired and informed neuroscience:
Recently, there has been exciting progress in understanding the mechanisms involved in reward-driven learning. This progress is partly due to importing ideas from the field of reinforcement learning (RL). Most importantly, this input has led to a dopamine-based functional theory grounded in RL, in which phasic dopamine (DA) release is interpreted as conveying a reward prediction error (RPE) signal, the surprise index computed centrally in temporal-difference RL algorithms. According to this theory, the RPE drives synaptic plasticity in the striatum, transforming experienced action-reward associations into optimized behavioral policies. Over the past two decades, evidence for this proposal has steadily accumulated, establishing it as the standard model of reward-driven learning.
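For reference, the RPE in a temporal-difference algorithm takes the following form. This is a generic textbook sketch in Python; the tabular value function, discount, and learning rate are illustrative, not taken from the paper:

```python
import numpy as np

def td_update(V, s, r, s_next, gamma=0.9, alpha=0.1):
    """One temporal-difference update of a state-value table.

    The prediction error delta is the quantity that, under the standard
    theory, phasic dopamine release is thought to report.
    """
    delta = r + gamma * V[s_next] - V[s]   # reward prediction error (RPE)
    V[s] += alpha * delta                  # value update driven by the RPE
    return delta

# toy usage: two states, a reward of 1 when moving from state 0 to state 1
V = np.zeros(2)
rpe = td_update(V, s=0, r=1.0, s_next=1)
print(rpe)  # positive error: the reward was better than predicted
```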
Recent discoveries in neuroscience regarding the prefrontal cortex (PFC) can, in turn, guide RL:
Research on the prefrontal cortex (PFC) has raised a dilemma. Increasing evidence suggests that the PFC itself implements reward-based learning, carrying out computations very similar to those in DA-based RL. It has long been established that different parts of the PFC represent the expected value of actions, objects, and states. More recently, recent actions and the history of rewards have also been found to be encoded in the PFC. This set of encoded variables, together with observations about the temporal profile of PFC activity, leads to the conclusion that "PFC neurons dynamically [encode] transitions from the history of rewards and choices to object values, and from object values to object choices." In short, neural activity in the PFC appears to reflect a set of operations that together constitute a self-contained RL algorithm. Placing the PFC alongside DA, we obtain a picture with two complete RL systems: one learning in activity-based representations, the other learning in synaptic weights.
Next, let's see how these two RL systems, one DA-based and one PFC-based, fit together:
What is the relationship between the two systems? If both support RL, are their functions redundant? One suggestion is that DA and PFC support different forms of learning: DA implements model-free RL, based on direct stimulus-response associations, while the PFC performs model-based RL, drawing on internal representations of task structure. However, a clear problem with this dual-system view is that the DA prediction-error signal has repeatedly been observed to be informed by task structure, reflecting "inferred" or "model-based" value estimates that are difficult to reconcile with the standard model-free theory.
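To make the distinction concrete, here is a minimal, hypothetical sketch of the two kinds of computation the dual-system view attributes to DA and PFC; the tabular setting and the numbers are purely illustrative, not the paper's implementation:

```python
import numpy as np

# Model-free (the role the dual-system view gives to DA): cache action
# values directly from experienced rewards, blind to task structure.
def model_free_update(Q, s, a, r, alpha=0.1):
    Q[s, a] += alpha * (r - Q[s, a])

# Model-based (the role it gives to PFC): evaluate an action by pushing
# expected reward through an internal model of the transitions.
def model_based_value(T, R, s, a):
    return T[s, a] @ R   # expectation over successor states s'

# toy numbers: 1 first-stage state, 2 actions, 2 successor states
Q = np.zeros((1, 2))
T = np.array([[[0.7, 0.3], [0.3, 0.7]]])   # T[s, a, s']
R = np.array([1.0, 0.0])                    # expected reward in each s'
model_free_update(Q, s=0, a=0, r=1.0)
print(Q[0, 0], model_based_value(T, R, s=0, a=0))  # 0.1, 0.7
```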
Several experiments in the paper achieve their respective task objectives through this framework. Below, I explain in detail which task Simulation 4 of the paper addresses and how the framework solves it:
Now, let’s discuss what framework it uses:
What problem does this training solve:
Given a new input sequence, the trained model can discern which task it is facing and decide what action to take next. More concretely: suppose the previous reward is 1 and the action was A1. Through this model, the agent can tell whether that reward was obtained through a common or an uncommon transition (going from s0 to s1 and then receiving reward 1 is called a common transition when the jump has high probability, while obtaining the reward through a low-probability jump is called an uncommon transition). When the model knows that reward 1 came through a common transition, it knows that repeating the action will yield more reward (the probability of repeating is called the stay probability in the paper). Conversely, if executing action A2 also yields reward 1 but the model learns that this was an uncommon transition, it will not repeat A2. Even though both actions yielded reward 1, only through this model can the agent know whether following that action again will bring more reward. This is the behavior the model ultimately aims to learn, and viewed from another angle, the point of the task is to achieve model-based behavior with an RNN, even though no explicit world model is trained.
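The task described above is the two-step task. Here is a minimal sketch of its structure, assuming illustrative transition and reward probabilities; the repository linked below contains the actual environment used for training:

```python
import numpy as np

class TwoStepTask:
    """Minimal two-step task: a first-stage choice (A1 or A2) leads to one
    of two second-stage states via a common (p=0.7) or uncommon (p=0.3)
    transition, and each second-stage state pays reward stochastically."""

    def __init__(self, common_prob=0.7, reward_probs=(0.8, 0.2), seed=0):
        self.common_prob = common_prob
        self.reward_probs = reward_probs   # P(reward) in states s1, s2
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        # action 0 (A1) commonly leads to s1, action 1 (A2) commonly to s2
        common = self.rng.random() < self.common_prob
        state = action if common else 1 - action
        reward = int(self.rng.random() < self.reward_probs[state])
        return state, reward, common

# An agent that has internalized the transition structure "stays" after a
# rewarded common transition or an unrewarded uncommon one; plotting the
# stay probability against these reward/transition combinations is the
# standard signature of model-based behavior in this task.
env = TwoStepTask()
state, reward, common = env.step(action=0)
```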
Original text: Another important setting where structure-sensitive DA signaling has been observed is in tasks designed to probe for model-based control.
Summarizing and emphasizing the essence of this new model:
This meta-RL model, built on the mechanisms of the prefrontal cortex (PFC), primarily exploits the functions discovered in the PFC: it absorbs the history of rewards and actions and encodes that sequence into an internal representation of the task structure. Therefore, (1) the same model can be trained on a whole family of tasks that share the same structure; and (2) after training is complete, when you feed it a new sequence, the model recovers the corresponding internal representation of the task from that sequence, so it knows exactly which task it is facing and can supply the appropriate policy for that task.
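As a rough illustration of that idea, here is a sketch of a recurrent core that receives the previous action and previous reward along with the current observation, so that its hidden state can build an internal representation of the task from the history. This uses PyTorch and invented sizes purely for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class MetaRLAgent(nn.Module):
    """LSTM core playing the role of PFC in the meta-RL account: its input
    at each step is the current observation concatenated with the previous
    action (one-hot) and previous reward, and its hidden state carries the
    task representation inferred from that history."""

    def __init__(self, obs_dim, n_actions, hidden_dim=48):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + n_actions + 1, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)  # action logits
        self.value_head = nn.Linear(hidden_dim, 1)            # state-value estimate

    def forward(self, obs, prev_action_onehot, prev_reward, hidden=None):
        x = torch.cat([obs, prev_action_onehot, prev_reward], dim=-1)
        out, hidden = self.lstm(x, hidden)
        return self.policy_head(out), self.value_head(out), hidden

# toy usage: batch of 1, one time step, 3-dim observation, 2 actions
agent = MetaRLAgent(obs_dim=3, n_actions=2)
obs = torch.zeros(1, 1, 3)
prev_a = torch.zeros(1, 1, 2)
prev_r = torch.zeros(1, 1, 1)
logits, value, hidden = agent(obs, prev_a, prev_r)
```

In the paper's account, the recurrent weights are trained slowly (playing the role of DA-driven synaptic learning), and once they are frozen the fast within-episode adaptation happens purely in the network's activity.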
The link to the code is as follows:
https://github.com/mtrazzi/two-step-task
P.S.: Please read the paper before reading this interpretation. If you have thoughts or questions, feel free to get in touch. My WeChat ID: Leslie27ch