About the story of dopamine and the prefrontal cortex:
Looking at the human brain, two parts are important for this story: the basal ganglia and the prefrontal cortex.
1. The basal ganglia (or lizard brain), which includes the VTA and substantia nigra, where dopamine is produced.
This area is activated when the received reward is greater than expected;
This area remains at baseline when the received reward is as expected;
This area is suppressed when the received reward is less than predicted. This mechanism is known as the dopamine reward prediction error.
The algorithm this mechanism maps onto is model-free reinforcement learning; in the paper, the concrete algorithm whose error signal plays the role of the reward prediction error is A2C (advantage actor-critic).
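To make that analogy concrete: in A2C, the quantity that corresponds to the reward prediction error is the advantage, the difference between the return actually obtained and the value the critic predicted. Below is a minimal, illustrative NumPy sketch; the function name and discount factor are my own choices, not the paper's.

```python
import numpy as np

def advantage_estimates(rewards, values, gamma=0.9):
    """Compute discounted returns and advantages for one episode.
    The advantage is the A2C analogue of a dopamine-like reward prediction error."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # discounted return G_t
        returns[t] = running
    advantages = returns - np.asarray(values, dtype=float)
    # advantage > 0: outcome better than predicted (dopamine burst)
    # advantage = 0: outcome as predicted (baseline firing)
    # advantage < 0: outcome worse than predicted (dopamine dip)
    return returns, advantages

# Toy usage: the critic predicted 0.5 at every step, but a reward of 1 arrived at the end.
returns, adv = advantage_estimates(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
print(adv)
```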
The new learning algorithm that emerges in the prefrontal network is independent of the original dopamine-based algorithm and differs from it in how it fits the task environment.
The main function of the slow, dopamine-driven learning is to shape the dynamics of the prefrontal network by adjusting its recurrent connections, giving the network memory and inference capabilities.
After extensive training, the network gradually shifts from exploration to exploitation; the more challenging the problem, the slower this transition.
Through the cooperation of these two parts, the overall effect lets us understand so-called model-based algorithms from another dimension or perspective: they are nothing more than the product of meta-learning, in which the dopamine mechanism's model-free algorithm tunes the functions of the prefrontal cortex.
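To visualize the two interacting loops, here is a schematic PyTorch sketch. The class name, sizes, and optimizer are illustrative assumptions rather than the paper's exact setup; the point is only that the slow, dopamine-like model-free learner is the gradient optimizer acting on the recurrent weights, while the fast, emergent learner lives entirely in the LSTM's hidden state within an episode.

```python
import torch
import torch.nn as nn

class PrefrontalNet(nn.Module):
    """Stand-in for the prefrontal recurrent network: an LSTM cell whose input
    is the current observation plus the previous action (one-hot) and reward,
    with linear policy and value heads on top."""
    def __init__(self, obs_dim, n_actions, hidden=48):
        super().__init__()
        self.core = nn.LSTMCell(obs_dim + n_actions + 1, hidden)
        self.policy = nn.Linear(hidden, n_actions)   # actor head
        self.value = nn.Linear(hidden, 1)            # critic head

    def forward(self, obs, prev_action_onehot, prev_reward, state):
        x = torch.cat([obs, prev_action_onehot, prev_reward], dim=-1)
        h, c = self.core(x, state)
        return self.policy(h), self.value(h), (h, c)

agent = PrefrontalNet(obs_dim=2, n_actions=2)

# Outer, slow loop (the dopamine-like model-free learner): gradient updates
# to the recurrent weights, applied across many episodes.
optimizer = torch.optim.RMSprop(agent.parameters(), lr=5e-5)

# Inner, fast loop (the emergent learner): within a single episode the weights
# are held fixed, and all trial-to-trial adaptation is carried by the hidden
# state (h, c) that the LSTM passes from one time step to the next.
```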
Connecting this to the previously introduced fourth experiment, the two-step task:
This task essentially places the agent in a situation that randomly switches between two tasks with the same structure, and the aim is to view it through the lens of the dopamine-prefrontal-cortex relationship described in the paper. Through this meta-learning setup, a model-free algorithmic framework acquires model-based functionality: a plain model-free algorithm only cares about the size of the rewards obtained from interacting with the environment, whereas the algorithm in the paper also learns whether a reward came from a common or an uncommon transition.
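As a reminder of what "common" versus "uncommon" transitions means here, the following is a simplified sketch of a two-step-style environment; the transition and reward probabilities are illustrative, not the paper's exact values.

```python
import random

class TwoStepTask:
    """Minimal sketch of a two-step task environment (simplified)."""
    COMMON = 0.8  # probability that a first-stage action leads to its 'common' second-stage state

    def __init__(self):
        # Reward probabilities of the two second-stage states; in the full task
        # these change over time, which is what the agent must track.
        self.reward_probs = [0.9, 0.1]

    def step(self, action):
        """action in {0, 1}: the first-stage choice."""
        common = random.random() < self.COMMON
        # Action 0 commonly leads to state 0, action 1 to state 1; rare transitions flip this.
        state2 = action if common else 1 - action
        reward = 1.0 if random.random() < self.reward_probs[state2] else 0.0
        return state2, reward, common

env = TwoStepTask()
print(env.step(0))  # e.g. (0, 1.0, True): common transition, rewarded
```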
The effective learning rate of this emergent algorithm is an order of magnitude larger than that of the DA-based RL algorithm, i.e., the one that adjusts the weights of the prefrontal network during training (its learning rate was set to 0.00005). The results therefore give a concrete indication that the learning algorithm produced by meta-RL can differ from the original algorithm that generated it.
The results also let us emphasize an important implication of this principle: meta-RL produces prefrontal learning algorithms that are suited to the task environment. In this case, the adaptation shows up in how the learning rate responds to task fluctuations. Previous studies have proposed specialized mechanisms to explain such dynamic changes in learning rate; meta-RL instead interprets them as an emergent effect arising from a set of very general conditions. Moreover, a dynamic learning rate is only one possible form of specialization: when meta-RL is placed in environments with different structures, fundamentally different learning rules emerge.
Key Point: Now we begin to discuss the fifth experiment: the Harlow task.
Adapted from this psychological experiment:
1. A monkey faces two covered objects, one on the left and one on the right.
2. The monkey selects one; one object hides food and the other does not. This constitutes one trial.
3. The selection is repeated six times, i.e., six trials, referred to as one episode. On each trial of the episode the two objects randomly swap left/right positions, but which physical object hides the food stays the same.
4. After experiencing several hundred episodes, the monkey learns that at the start of an episode it can only choose at random, but even if that first choice is wrong, it chooses correctly on every subsequent trial, because it has learned this higher-level pattern and has associated the food with the object itself rather than with its left or right position (see the sketch after this list).
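A minimal sketch of the episode structure just described; the object ids are drawn from the 40 mentioned below, and the reward placement is illustrative.

```python
import random

def harlow_episode(n_trials=6):
    """One Harlow-task episode: two objects, one of which always hides the food;
    their left/right placement is re-randomized on every trial."""
    obj_a, obj_b = random.sample(range(1, 41), 2)    # two of the 40 object ids used below
    rewarded = random.choice([obj_a, obj_b])         # reward follows the object, not the side
    trials = []
    for _ in range(n_trials):
        left, right = random.sample([obj_a, obj_b], 2)   # random left/right swap
        trials.append({"left": left, "right": right, "rewarded_object": rewarded})
    return trials

for trial in harlow_episode():
    print(trial)
```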
In fact, the algorithm architecture is basically the same as for the two-step task, with one key difference, and here is its magical aspect: the same RNN structure, the same mathematical formula, viewed from different perspectives, can be mapped onto many different things and thus solve or describe seemingly unrelated problems. It is a framework with potentially infinite connotations; the black box is the magic.
The architecture is as shown in the figure above:
1. We use 40 images to represent the objects and index them with the ids 1 to 40. The input observation therefore becomes o_t = [id_object1, id_object2].
2. On the previous trial, we selected action = 1 (the right side) and received a reward of 5.
3. Feeding these inputs through the trained weights of the recurrent network, the model estimates that the best action on this trial is still 1, because the rewarded object is again on the right (the input construction is sketched after this list).
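Putting steps 1-3 together, the per-trial network input might look like the following. The exact encoding (e.g. raw ids versus one-hot vectors) is a guess on my part; the description above only fixes which quantities go in.

```python
import numpy as np

# Illustrative construction of one trial's input to the recurrent network.
id_object1, id_object2 = 17, 32   # ids of the objects shown on the left and right
prev_action = 1                   # previous choice: 1 = right
prev_reward = 5.0                 # reward received on the previous trial

x_t = np.array([id_object1, id_object2, prev_action, prev_reward], dtype=np.float32)
print(x_t)  # this vector, together with the LSTM's carried hidden state, drives the next choice
```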
Insights from this experiment:
1. Unlike traditional machine learning algorithms, with meta reinforcement learning the algorithm stays at 0% performance for the first 8-12k episodes. You may therefore spend countless hours (the two-step task took more than 4 hours on CPU) stuck at 0% performance, with no idea whether the model is learning at all. My suggestions: a) keep asking yourself, "Did I make a mistake in my experiment, and where is the error?"; b) record the details of each run as precisely as possible (date, time, goals, reasons it may not be working, estimated training time, etc.).
2. It is just like a child learning something: either they do not understand, or they suddenly understand. It closely resembles the emergence of consciousness, an emergence from unconsciousness to consciousness, built on very ordinary conditions.
3. Adjust hyperparameters carefully. We used only one thread, a single 48-unit LSTM, and a very simple input for our experiments, and ultimately reached an average reward of about 10 (roughly half of the best performance, see "Results"). Our best guess is that 48 units are not enough to learn the task, but it could also be a single-thread issue in our code or something else. In short, always change one hyperparameter at a time, otherwise you will never know why your model is not learning. The settings just quoted are collected in the sketch below.
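For reference, here are the settings mentioned above gathered into a single configuration; any value not stated in the text, such as the discount factor, is marked as a guess.

```python
# Hyperparameters mentioned above, gathered for one-at-a-time experimentation.
config = {
    "n_threads": 1,         # single worker thread
    "lstm_units": 48,       # single 48-unit LSTM
    "learning_rate": 5e-5,  # the weight-update rate quoted earlier (0.00005)
    "gamma": 0.9,           # discount factor (a guess; not stated in the text)
}
print(config)
```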
Additionally: the paper also mentions an understanding of LSTMs that I find quite good:
In a standard non-gated recurrent neural network, the state at time step t is a linear projection of the state at time step t-1, followed by a non-linearity. Such a "vanilla" RNN can struggle with long-range temporal dependencies because it must learn a very precise mapping just to copy information unchanged from one time step to the next. An LSTM, by contrast, copies its internal state (the "cell state") from each time step to the next by default. It does not have to learn how to remember; it remembers by default. However, it can also choose to forget, using a "forget" (or maintenance) gate, and it can choose which new information to let in, using an "input" gate. Since it may not want to expose its entire memory content at every time step, an "output" gate controls what is read out. Each of these gates is modulated by learned functions of the network's state.
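To pin the gating story down, here is a small NumPy sketch of one LSTM step using a standard formulation; the variable names and sizes are mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step, matching the gating story above.
    W, U, b hold stacked parameters for the input (i), forget (f),
    output (o) gates and the candidate cell update (g)."""
    z = W @ x + U @ h_prev + b                     # all four pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates are learned functions of the state
    g = np.tanh(g)                                 # candidate new information
    c = f * c_prev + i * g   # cell state is copied forward by default, modulated by forget/input gates
    h = o * np.tanh(c)       # output gate decides how much of the memory is exposed
    return h, c

# Tiny usage example with random parameters (hidden size 3, input size 2).
rng = np.random.default_rng(0)
n_h, n_x = 3, 2
h, c = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h),
                 rng.standard_normal((4 * n_h, n_x)), rng.standard_normal((4 * n_h, n_h)),
                 np.zeros(4 * n_h))
print(h, c)
```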
That concludes this section. Feel free to ask questions and exchange ideas, wechat: Leslie27ch