Every Monday I present 4 publications from my research area. Let’s talk about them!


Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is enough. Artificial Intelligence, 103535.

The hypothesis formulated in this article is that maximizing the reward in a sufficiently complex environment is a sufficient condition for the emergence of intelligence. The example they take is a squirrel that wants to collect as many nuts as possible. To achieve this goal, it has to master several subtasks: observing the environment, moving, climbing trees, communicating with other squirrels, understanding the cycle of the seasons… Under this hypothesis, all of these subtasks are learned implicitly, simply by maximizing a single reward: the number of nuts obtained.

They take an example from their previous work on chess and Go. The agent was trained only by maximizing the reward; it was never taught the openings or tactics normally taught to young players. So it created its own openings and tactics, which proved to be very innovative and sometimes very different from what the experts were used to. AlphaGo's famous "move 37" is a perfect illustration: it completely stunned the experts of the game.

Is this hypothesis correct? Can complex intelligence arise from maximizing a single value? Everyone has their own opinion. Some see it as a new sign of the necessary simplicity of scientific explanations, while others say that such explanations are often tautological. And you, what do you think?

Everything should be made as simple as possible, but not simpler. – A. Einstein

Toyama, D., Hamel, P., Gergely, A., Comanici, G., Glaese, A., Ahmed, Z., … & Precup, D. (2021). AndroidEnv: A reinforcement learning platform for Android. arXiv preprint arXiv:2105.13231.

You know Gridworld, you know Atari, you know the MuJoCo control environments. Today, discover AndroidEnv, an RL environment that interfaces with the Android operating system. The OS is fully emulated and wrapped in an OpenAI Gym-style environment, so you can interact with it just as with any other environment in this framework. The observation is the pixel matrix, and the actions are the touches you make on the touchscreen.

This paves the way for a very large number of possible tasks. The environment is all the more interesting because two key features are built in:
(1) Real time. As you know, it can take some time for an agent to choose its action. Usually the environment is paused, waiting for the agent to act. That is not the case here: the environment keeps advancing while the agent deliberates. In addition, Android rendering can take a while; for example, when you scroll through a webpage, there is a small sliding animation that makes it feel natural. This is included too, so the agent has to learn to adapt to such effects and to its own thinking time, which makes learning harder.
(2) Raw actions. The action space is low-level: a screen position and touch/lift. That's it. The agent must therefore learn complex gestures, such as drag-and-drop, tapping, or swiping, from these primitives.
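To make the raw action space concrete, here is a minimal sketch of how a high-level gesture decomposes into such primitives. The action format and constant names are illustrative assumptions, not the library's actual API:

```python
# Hypothetical AndroidEnv-style raw actions: each step carries only a
# touch type (TOUCH or LIFT) and a screen position in [0, 1] x [0, 1].
TOUCH, LIFT = 0, 1

def swipe(start, end, n_steps=5):
    """Decompose a swipe gesture into low-level actions:
    press at `start`, slide toward `end`, then lift the finger."""
    x0, y0 = start
    x1, y1 = end
    actions = []
    for i in range(n_steps):
        t = i / (n_steps - 1)  # interpolation factor from 0 to 1
        pos = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        actions.append({"action_type": TOUCH, "touch_position": pos})
    actions.append({"action_type": LIFT, "touch_position": (x1, y1)})
    return actions

# An upward swipe in the middle of the screen: 5 touches, then a lift.
gesture = swipe((0.5, 0.8), (0.5, 0.2))
```

The agent, of course, gets no such helper; it must discover these multi-step gestures on its own.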

Now let me address the question you have probably been asking from the beginning: how is the task defined? What is the reward function? In fact, the user must configure it from the operating system logs. Let's take an example: I want my agent to add John Doe to my contacts. If this task is completed, the system log will likely contain a line of the form "[INFO][2021/06/07] John Doe added as a new contact". By detecting the presence of such a log line, I can decide when to reward my agent. Yes, it's a bit convoluted, but I think it's the cleanest way they found to specify the reward.
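A log-based reward function could be sketched as follows. The log line follows the example above; the function and variable names are my own, not AndroidEnv's API:

```python
import re

# Pattern matching the success line from the contact-creation example.
SUCCESS_PATTERN = re.compile(r"John Doe added as a new contact")

def reward_from_logs(log_lines, seen):
    """Return 1.0 the first time the success line appears, else 0.0.
    `seen` tracks already-rewarded lines so the agent is paid only once."""
    for line in log_lines:
        if SUCCESS_PATTERN.search(line) and line not in seen:
            seen.add(line)
            return 1.0
    return 0.0

logs = ["[INFO][2021/06/07] John Doe added as a new contact"]
seen = set()
r1 = reward_from_logs(logs, seen)  # first detection: reward
r2 = reward_from_logs(logs, seen)  # same line again: no reward
```

Any task whose completion leaves a trace in the system logs can be rewarded this way, which is what makes the approach so general.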

Bjorck, J., Gomes, C. P., & Weinberger, K. Q. (2021). Towards deeper deep reinforcement learning. arXiv preprint arXiv:2106.01151.

As you may have noticed, in deep reinforcement learning the question of the best neural network architecture is rarely asked. Basically, if the observation is an image, we use a CNN; if not, an MLP. RL research focuses mainly on algorithms: value-based or policy-based? Exploration or exploitation? Regularization or not? Model-based or not? Clearly, the question of the type of neural network is almost never raised.

In this article, the authors study the implications of equipping RL agents with modern network tricks, such as skip connections. The goal is to increase learning performance while keeping the same algorithm, here Soft Actor-Critic (SAC). So, how important do you think the choice of neural network is?

The authors investigate the effects of normalization and of adding residual connections to SAC. Implemented naively, these are likely to cause instability during training. They therefore propose a smoothing based on spectral normalization ("Smooth" in the figure).
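As a reminder of the generic technique, here is a minimal NumPy sketch of spectral normalization: the weight matrix is divided by an estimate of its largest singular value, obtained by power iteration. This illustrates the standard method only; the paper's exact "smoothed" variant may differ:

```python
import numpy as np

def spectral_normalize(W, n_iters=30):
    """Rescale W so its largest singular value is (approximately) 1."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        # Power iteration on W W^T / W^T W to find the top singular pair.
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimated largest singular value
    return W / sigma

W = np.array([[3.0, 0.0], [0.0, 1.0]])  # largest singular value is 3
W_sn = spectral_normalize(W)            # now has largest singular value 1
```

Bounding each layer's spectral norm bounds the network's Lipschitz constant, which is the intuition behind using it to tame training instabilities.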

With this smoothing, learning is stabilized. The results obtained on continuous control tasks are often much better than those obtained when no work is done on the network architecture.

The conclusion is clear: we need to take the time to work on the architecture of RL networks. It can have a significant impact on learning outcomes. This is an intuition many of us shared; thanks to the authors for demonstrating it once and for all.

Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., & Mordatch, I. (2021). Decision Transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345.

Reinforcement learning can take advantage of the huge advances made in language modeling. In particular, it can leverage the power of Transformers, a recent neural network architecture built around the attention mechanism that has greatly improved learning performance.

The authors present the Decision Transformer, an architecture that casts the RL problem as conditional sequence modeling. Forget what you know about policy gradients or value-based learning: the approach is different. It is no longer a matter of maximizing the reward function, but rather of determining the action to take in order to achieve a desired return.

Basically, the Transformer agent learns, from a dataset of interactions, which actions led to which returns. Given the current state and a desired return as input, the agent predicts the action that maximizes the likelihood of obtaining that return. So if we want the maximum reward, we just have to feed the maximum return as input.
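The conditioning quantity is the return-to-go: at each timestep, the sum of the rewards from that step to the end of the trajectory. A minimal sketch of its computation (the reward trajectory is made up for illustration):

```python
def returns_to_go(rewards):
    """For each timestep, the sum of rewards from that step to the end.
    These values, paired with states and actions, form the sequence a
    Decision Transformer is trained on."""
    rtg, total = [], 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        total += r
        rtg.append(total)
    return list(reversed(rtg))

# Sparse rewards at steps 2 and 4 of a 5-step episode.
rtg = returns_to_go([0.0, 0.0, 1.0, 0.0, 2.0])
```

At inference time, the user picks the initial return-to-go (e.g. the best return seen in the dataset), and the model decrements it by each reward actually received.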

Why would this approach work better than the classical RL approaches? The main reason is that the Transformer is able to associate a reward with the actions that produced it. As you know, a reward is the result of all the actions that preceded it, but each action contributed more or less to that reward; some may even have reduced it. Classical RL approaches struggle in environments with such characteristics (actions far removed from their consequences), while the Transformer handles this aspect very well.

However, some things remain unclear to me at this stage. Why would we want an agent that can achieve a specific return that is not the highest? And how does the agent know, in a given state, the maximum return it can aim for? (No, asking for a return of a billion every time doesn't work.) I hope more explanations will be published soon.

