MAMBA: an Effective World Model Approach for Meta-Reinforcement Learning

Zohar Rimon     Tom Jurgenson     Orr Krupnik   Gilad Adler     Aviv Tamar


Technion - Israel Institute of Technology  &  Ford Research Center Israel


ICLR 2024

Code is available at: https://github.com/zoharri/mamba

Visual Results of Trained Agent Behavior

Below, we present the behavior of policies trained with MAMBA.

Point Robot 2D Navigation

A goal is randomly selected on the half-circle. The Bayes-optimal policy should search along the half-circle until reaching the goal and then stay there. In subsequent sub-episodes the agent should go directly to the goal.
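
As an illustration, the task distribution and sparse reward could be parameterized as in the following sketch; the radius, goal tolerance, and reward value are assumptions made for illustration, not the exact values used in the benchmark.

```python
import numpy as np

def sample_point_robot_goal(radius=1.0, rng=None):
    # Sample a goal uniformly on the upper half-circle of the given radius (illustrative).
    rng = np.random.default_rng() if rng is None else rng
    angle = rng.uniform(0.0, np.pi)
    return radius * np.array([np.cos(angle), np.sin(angle)])

def goal_reward(pos, goal, goal_radius=0.1):
    # Sparse reward: positive only when the agent is within goal_radius of the goal.
    return 1.0 if np.linalg.norm(pos - goal) <= goal_radius else 0.0
```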

Point Robot 2D Navigation + Wind

Identical to the Point Robot 2D Navigation environment, except that an external stochastic wind force (sampled from a Gaussian distribution) is added at each transition. The optimal behavior is similar to that in the deterministic environment.
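
A minimal sketch of such a perturbed transition, assuming a simple point-mass integrator; the step size and wind scale are illustrative, not the benchmark's values.

```python
import numpy as np

def windy_step(pos, action, rng, dt=0.1, wind_std=0.05):
    # Point-mass transition with an additive Gaussian wind force, resampled at every step.
    wind = rng.normal(0.0, wind_std, size=2)
    return pos + dt * np.asarray(action) + wind
```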

Escape Room

A door is randomly placed on the top half-circle; it is the only way out of the room. The reward is positive only outside the circle. The Bayes-optimal policy should search for the door by covering the top half-circle and then leave the room (outside, the agent always receives a positive reward). In subsequent sub-episodes the agent should go directly towards the door.
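
The reward and door constraint could look roughly as follows; the room radius, door width, and the specific door-crossing check are assumptions made for illustration.

```python
import numpy as np

def escape_room_reward(pos, room_radius=1.0):
    # Reward is positive only once the agent is outside the circular room.
    return 1.0 if np.linalg.norm(pos) > room_radius else 0.0

def crosses_door(pos, next_pos, door_angle, door_halfwidth=0.15, room_radius=1.0):
    # The wall can only be crossed near the (randomly sampled) door angle on the top half-circle.
    leaving = np.linalg.norm(pos) <= room_radius < np.linalg.norm(next_pos)
    heading = np.arctan2(next_pos[1], next_pos[0])
    return leaving and abs(heading - door_angle) <= door_halfwidth
```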


Panda Vision Reach

A Panda robotic arm explores a 3D space for a spherical goal. The inputs are image pixels and a sparse reward (1 near the goal and 0 otherwise). The overlaid text is only for visualization and is not visible to the agent during learning.

Humanoid Direction

A humanoid robot needs to run as fast as possible towards an unknown direction in the 2D plane, sampled uniformly from [0, 2π]. This task has challenging dynamics and was studied in previous meta-RL work.
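
The task sampling and reward could be sketched as below; rewarding the center-of-mass velocity component along the hidden direction is our assumption of the standard formulation, and any control or alive bonuses are omitted.

```python
import numpy as np

def sample_direction(rng=None):
    # Target direction sampled uniformly from [0, 2*pi), kept hidden from the agent.
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(0.0, 2.0 * np.pi)
    return np.array([np.cos(theta), np.sin(theta)])

def direction_reward(com_velocity_xy, direction):
    # Reward the velocity component along the hidden target direction.
    return float(np.dot(com_velocity_xy, direction))
```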

Reacher

Reacher - Vision

In this variant the agent receives an image of the environment. For visualization purposes, the sub-episode number and the goal are displayed; they are not visible to the agent.

Reacher - Proprioceptive

This is the common two-link robot environment, but here we randomly select goals that the agent should reach. The agent state is a 4-dimensional vector describing the position and velocity of the end-effector. The goals are ordered and must be reached in the correct order to collect the reward. All goals but the last emit only a single positive reward; the last goal continues to provide positive reward until the episode ends. Thus, the Bayes-optimal policy should explore the scene until it has reached all goals in the correct order. In subsequent sub-episodes, motions towards the goals should be more direct. Note that before settling on the last goal, the agent correctly visits the preceding goals in order. This scenario is a good example of a task that can be decomposed into sub-tasks, requiring the agent to recall information over long horizons.
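
The ordered-goal reward logic can be sketched as follows; the tolerance and the bookkeeping via a next-goal index are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def ordered_goal_reward(ee_pos, goals, next_idx, tol=0.05):
    # Goals must be collected in order; `next_idx` is the index of the next uncollected goal.
    # Returns (reward, updated next_idx).
    if next_idx < len(goals) and np.linalg.norm(ee_pos - goals[next_idx]) <= tol:
        if next_idx == len(goals) - 1:
            return 1.0, next_idx      # last goal keeps paying while the agent stays on it
        return 1.0, next_idx + 1      # intermediate goals pay once, then the index advances
    return 0.0, next_idx
```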

In the following, goals are green circles (with a number marking the order). The Reacher end-effector is a blue circle. The red number indicates the index of the sub-episode in the current meta-episode. 

Reacher - 1 Goal

Reacher - 2 Goals

Reacher - 3 Goals

Reacher - 4 Goals

Rooms

Similarly to Reacher, in this scenario N goals are selected at the start of every meta-episode (one at a random corner of each room). The goals must be visited in order, from the leftmost room to the rightmost; otherwise no reward is obtained. The Bayes-optimal policy should explore the corners until a goal is found, and then progress to the next room. In the second sub-episode it should go directly to the correct corner. In all sub-episodes, once the goal in the rightmost room has been reached, the agent should remain on it to collect more rewards. This scenario is a good example of a task that can be decomposed into sub-tasks, requiring the agent to recall information over long horizons.

In the visualization below, we plot the different sub-episodes with different colors: red for the first, teal for the second.

Rooms - 3

Rooms - 4

Rooms - 5

Rooms - 6

Local vs Global Reconstruction in VariBAD

An in-depth analysis of the differences between global and local reconstruction. See Section 3.2.2 and Figure 2 in the paper.

We can clearly see that although the overall reconstruction of the second variant is worse (it does not reconstruct previous rooms well), its local reconstruction is better, and the agent displays near-optimal behavior.
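
To make the distinction concrete, the sketch below contrasts the two objectives in a schematic way: restricting the reconstruction targets to a recent window is our illustrative reading of the local variant, and the decode signature is a placeholder rather than the actual model interface (see Section 3.2.2 of the paper for the precise formulation).

```python
def reconstruction_loss(decode, beliefs, rewards, window=None):
    # decode(belief, k) predicts the reward of transition k from the belief at time t.
    # window=None -> global: each belief reconstructs the entire trajectory so far.
    # window=W    -> local: each belief reconstructs only the last W transitions.
    total, count = 0.0, 0
    for t in range(len(rewards)):
        start = 0 if window is None else max(0, t - window + 1)
        for k in range(start, t + 1):
            total += (decode(beliefs[t], k) - rewards[k]) ** 2
            count += 1
    return total / max(count, 1)
```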

VariBAD Vanilla - Global Reconstruction

VariBAD - Local Reconstruction