Model-Based Reinforcement Learning for Atari

Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with fewer interactions than model-free methods. In one case we observe an order of magnitude improvement in sample efficiency, and in most cases the improvement is at least two-fold.

We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games and achieve competitive results with only 100K interactions between the agent and the environment (400K frames), which corresponds to about two hours of real-time play.

Model-based learning algorithm

In our method the agent learns using imaginary experience generated by a predictive model. As such, it is critical to gather data about the environment that is diverse enough to ensure that the learned model correctly reproduces the dynamics of the environment in all key situations. In most Atari games, random exploration is not sufficient to achieve that goal. To explore in a more directed way, we use an iterative process consisting of alternating phases of data collection, model training and policy training, so that as the policy gets better, we collect more meaningful data and so can learn a progressively better model. We train the policy using the PPO algorithm.
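The outer loop can be sketched as follows; the three callables are hypothetical placeholders for real-environment data collection, world-model training and PPO training inside the learned model, not our actual implementation.

```python
def simple_loop(collect_experience, train_world_model, train_policy_ppo,
                num_iterations=15):
    """Sketch of the alternating SimPLe loop. The three callables are
    placeholders: gathering real experience with the current policy,
    (re-)training the video prediction model, and training the policy
    with PPO purely inside the learned model."""
    real_data = []
    for _ in range(num_iterations):
        # 1. Collect real experience with the current (improving) policy.
        real_data.extend(collect_experience())
        # 2. Re-train the world model on all real data gathered so far.
        train_world_model(real_data)
        # 3. Train the policy with PPO on imagined rollouts from the model.
        train_policy_ppo()
```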

Stochastic discrete model

Our agent learns from raw pixel observations generated by a video prediction model. We experimented with a few architectures. Our best model is a feedforward convolutional neural network that encodes a sequence of input frames using a stack of convolutions and, given an action performed by the agent, decodes the next frame using a stack of deconvolutions. The reward is predicted from the bottleneck representation.
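As a rough illustration, a PyTorch-style sketch of this kind of architecture is shown below; the layer sizes, names and the way the action is injected are illustrative assumptions, not the exact model used in the paper.

```python
import torch.nn as nn

class FramePredictor(nn.Module):
    """Illustrative sketch: conv encoder -> action-conditioned bottleneck
    -> deconv decoder for the next frame, plus a reward head."""

    def __init__(self, in_channels, num_actions, hidden=256):
        super().__init__()
        # Encoder: a stack of strided convolutions over the stacked input frames.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.action_embedding = nn.Embedding(num_actions, hidden)
        # Decoder: a stack of deconvolutions producing the next frame.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )
        # Reward head operating on the pooled bottleneck representation
        # (e.g. predicting a clipped reward in {-1, 0, 1}).
        self.reward_head = nn.Linear(hidden, 3)

    def forward(self, frames, action):
        bottleneck = self.encoder(frames)                      # (B, hidden, H', W')
        act = self.action_embedding(action)[:, :, None, None]  # (B, hidden, 1, 1)
        bottleneck = bottleneck * act                          # condition on the action
        next_frame = self.decoder(bottleneck)
        reward_logits = self.reward_head(bottleneck.mean(dim=(2, 3)))
        return next_frame, reward_logits
```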

We found that introducing stochasticity into the model has a beneficial effect, allowing the policy to experience a more diverse set of scenarios during training. We do this by adding a latent variable, samples from which are added to the bottleneck representation. Discrete latent variables, encoded as sequences of bits, worked best in our setting. The whole architecture is reminiscent of a variational autoencoder: the posterior over the latent variable is approximated from the whole sequence (input frames + target frame), a value is sampled from that posterior and used, along with the input frames and the action, to predict the next frame. During inference, latent codes are instead generated by an autoregressive LSTM network.
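The training-time path for the latent variable could be sketched as follows (again PyTorch-style and purely illustrative; the real model's posterior network, bit count and discretization details differ, and at inference time the bits come from the autoregressive LSTM prior instead of this posterior).

```python
import torch
import torch.nn as nn

class DiscreteLatentBottleneck(nn.Module):
    """Illustrative sketch of the training-time discrete latent path:
    infer bits from the full sequence (input frames + target frame),
    then project them so they can be added to the bottleneck."""

    def __init__(self, in_channels, num_bits=128, hidden=256):
        super().__init__()
        # Approximate posterior: sees the input frames and the target frame.
        self.posterior = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_bits),
        )
        self.project = nn.Linear(num_bits, hidden)

    def forward(self, inputs_and_target):
        logits = self.posterior(inputs_and_target)
        probs = torch.sigmoid(logits)
        bits = torch.bernoulli(probs)
        # Straight-through estimator: discrete bits on the forward pass,
        # sigmoid gradients on the backward pass.
        bits = bits + probs - probs.detach()
        # The result is added to the model's bottleneck representation.
        return self.project(bits)
```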

Results

The primary goal of our paper was to use model-based methods to achieve state-of-the-art sample efficiency. We framed this as the following question: what score can we achieve within the modest budget of 100K interactions (approximately 2 hours of real-time play)?
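For concreteness, the conversion between interactions, frames and wall-clock time, assuming the standard Atari frame skip of 4 and the emulator's 60 frames per second:

```python
interactions = 100_000   # agent-environment interactions (the budget)
frame_skip = 4           # standard Atari frame skip: one interaction = 4 frames
fps = 60                 # the Atari emulator runs at 60 frames per second

frames = interactions * frame_skip   # 400,000 emulator frames
hours = frames / fps / 3600          # ~1.85 hours of real-time play
print(frames, round(hours, 2))       # 400000 1.85
```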

We compared our method to Rainbow, the state-of-the-art model-free algorithm for Atari games, re-tuned for optimal performance using 1M interactions with the environment, as well as to the PPO implementation used in our training. The results are shown in the graphs below, which report the number of interactions the respective model-free algorithms need to match our score. The red line indicates the number of interactions our method uses. Using our approach, we improve sample efficiency by more than a factor of two on most of the games.
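The quantity shown in the graphs can be computed from a model-free learning curve roughly as in the sketch below; representing the curve as a sorted list of (interactions, score) pairs is an assumption for illustration.

```python
def interactions_to_match(learning_curve, target_score):
    """Return the first number of interactions at which the model-free
    agent's score reaches the score obtained by SimPLe, or None if it
    never does within the curve. `learning_curve` is a list of
    (num_interactions, score) pairs sorted by num_interactions."""
    for num_interactions, score in learning_curve:
        if score >= target_score:
            return num_interactions
    return None
```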

Comparison with Rainbow

Comparison with PPO

Solved games

To our pleasant surprise, for two games, Pong and Freeway, agents trained purely in the simulated environment excelled in the real game, achieving the maximum possible scores. We emphasize that we did not adjust the method or hyperparameters individually for each game.

In the video below we present a rollout from Pong, in which our learned policy achieves a perfect score of 21.


In all videos below, the top pane shows the cumulative reward (c:), the momentary reward in the current frame (r:) and, in some videos, the frame number (f:).
pong_solved.avi

Freeway

Freeway is a particularly interesting game. Though simple, it presents a substantial exploration challenge. The chicken, controlled by the agent, ascends very slowly under random exploration, as it constantly gets bumped down by the cars (see the video on the left). This makes it very unlikely to fully cross the road and obtain a non-zero reward. Nevertheless, SimPLe is able to capture such rare events, internalize them into the predictive model and then learn a successful policy (see the video on the right).

However, this good performance did not occur on every run. We conjecture the following scenario in the failing cases: if the entropy of the policy decayed too rapidly at early stages, the collected experience remained limited, leading to a poor world model that was not good enough to support exploration (e.g. the chicken disappears when moving too high). In one of our experiments, the final policy moved the chicken up only to the second lane, where it waited to be hit by a car, over and over again.

freeway_random_trim.avi
freeway_solved.avi

Pixel-perfect games

In some cases our models are able to predict the future perfectly, down to a single pixel! As far as we know, this is the first time this has been achieved for Atari environments. The property holds only over rather short time intervals; we observed episodes of up to 50 time-steps. Extending it to long sequences would be a very exciting research direction.
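Checking this property is straightforward once predicted and ground-truth frames are available as arrays; a minimal sketch, assuming equally shaped frame sequences:

```python
import numpy as np

def pixel_perfect_prefix(predicted_frames, true_frames):
    """Number of initial time-steps for which the model's prediction
    matches the ground truth exactly, pixel by pixel."""
    steps = 0
    for pred, true in zip(predicted_frames, true_frames):
        if not np.array_equal(pred, true):
            break
        steps += 1
    return steps
```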


In all the videos below, the screen is split into three panes: the left shows the simulated environment, the middle the ground-truth data, and the right their difference. The videos show the games Freeway, Pong and Breakout.
freeway_perfect.avi
pong_perfect.avi
breakout_perfect.avi

Benign errors

Despite the positive examples above, perfect models are difficult to acquire for some games, especially at early stages of learning. However, model-based RL should be tolerant of modest model errors. Interestingly, in some cases our models differed from the original games in a way that was harmless, or only mildly harmful, for policy training.

For example, in Bowling and Pong the ball sometimes splits into two. While physically implausible, these errors did not seem to distort the objective of the game much.

In Kung Fu Master our model's predictions deviate from the real game by spawning a different number of opponents, and in Crazy Climber we observed the bird appearing earlier in the game. These cases can probably be attributed to the stochasticity in the model. Though not aligned with the true environment, the predicted behaviors are plausible, and the resulting policy can still play the original game.

bowling_two_balls_1.avi
pong_ball_turns_back.avi
kung_fu_master_different_opponents.avi
crazy_cliber_bird.avi

Failures on hard games

On some of the games, our models simply failed to produce useful predictions. We believe that listing such errors may be helpful in designing better training protocols and building better models.

The most common failure was due to the presence of very small but highly relevant objects. For example, in Atlantis and Battle Zone the bullets are so small that they tend to disappear. To spot them, we recommend watching the videos at full screen and reduced speed. Interestingly, Battle Zone has pseudo-3D graphics, which may have added to the difficulty.

Another interesting example comes from Private Eye in which the agent traverses different scenes, teleporting from one to the other. We found that our model generally struggled to capture such large global changes.

atlantis_missing_bullet.avi
battle_zone_missing_bullet.avi
private_eye_changing_env.avi

Sticky actions

We found that our world model can learn to account for the stochasticity introduced into Atari by sticky actions. The figures below show the results of an experiment in which both the world model and the policy were trained in the presence of sticky actions. The results are already similar to those in the deterministic environment, and we believe that hyperparameter tuning would improve them further.
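For reference, sticky actions repeat the previously executed action with probability 0.25 regardless of the agent's choice; a minimal wrapper sketch, assuming a Gym-style environment interface:

```python
import random

class StickyActionsWrapper:
    """Minimal sketch of sticky actions: with probability `stickiness`
    the environment repeats the previously executed action instead of
    the newly chosen one."""

    def __init__(self, env, stickiness=0.25):
        self.env = env
        self.stickiness = stickiness
        self.last_action = 0

    def reset(self):
        self.last_action = 0
        return self.env.reset()

    def step(self, action):
        if random.random() < self.stickiness:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```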

Comparison of experiments with and without sticky actions. In both cases we use the stochastic discrete world model. To maintain comparability, we normalize both results against Rainbow scores; same format and scale as the ablations in the paper.

Fraction of scores achieved by SimPLe in experiments with sticky actions compared to non-sticky actions.

The bars are clipped at 2.

Numerical data

For the convenience of other researchers we provide the raw results obtained in our experiments. They can be downloaded here and read with pandas.read_pickle.

The file contains a number of experiments with various parameter settings:

  • eval_{x} - experiments with x samples collected from the real environment (all other experiments fix this number at 100K)
  • eval_long, eval_longmodel - experiments with longer training of the model
  • eval_rainbow1, eval_rainbow2m - experiments using Rainbow instead of PPO for policy training
  • eval_sd - an experiment with the standard parameter settings
  • eval_sd_g90, eval_sd_g95 - experiments with the PPO gamma set to 0.9 and 0.95
  • eval_sd_s100, eval_sd_s25 - experiments with the model unrolled for 100 and 25 steps (all other experiments fix this number at 50)
  • eval_sticky - an experiment with sticky actions

We ran a number of evaluations of the models with various settings: temp_{0.0|0.5|1.0}_max_noops_{0|8}_{clipped|unclipped}, where 0.0|0.5|1.0 denotes the temperature applied to the policy (e.g. 0.0 denotes the argmax policy), 0|8 is the maximal number of no-ops injected before applying the policy, clipped means rewards clipped to {-1, 0, 1} (as used in training), and unclipped means the original scores reported by the games.
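A minimal sketch of loading and inspecting the data; the file name below is a placeholder for the downloaded file, and we make no assumption about its structure beyond it being readable with pandas.read_pickle.

```python
import pandas as pd

# "atari_results.pkl" is a placeholder path for the downloaded results file.
results = pd.read_pickle("atari_results.pkl")

# Inspect what was loaded before drilling into specific experiments.
print(type(results))
print(results)
```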