This page is supplementary material for the paper:
"Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment".
Here we show examples of learned behaviors for our baseline, Model-Based Offline Policy Optimization (MOPO), as well as our method, Augmented World Models (AugWM).
All policies were trained on the same offline dataset, HalfCheetah Mixed from D4RL.
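For reference, the following is a minimal sketch of loading this dataset with the d4rl package; the dataset id "halfcheetah-medium-replay-v0" (D4RL's current name for the mixed replay buffer) and the use of qlearning_dataset are assumptions, not necessarily the exact pipeline used for these results.

```python
# Minimal sketch: load the HalfCheetah mixed (medium-replay) buffer from D4RL.
import gym
import d4rl  # importing d4rl registers its offline-RL environments with gym

env = gym.make("halfcheetah-medium-replay-v0")  # dataset id is an assumption
dataset = d4rl.qlearning_dataset(env)
# dataset is a dict of numpy arrays ("observations", "actions",
# "next_observations", "rewards", "terminals"): the transitions used
# for offline training of both MOPO and AugWM.
print(dataset["observations"].shape, dataset["actions"].shape)
```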
First, we show the policies in the original environment (damping = 1, mass = 1), where both perform well.
Reward = 2399
Reward = 2779
Now we consider the case where damping and mass are both lower. This could happen if, for example, a newer robot were made with lighter materials (a sketch of how such a shifted environment can be constructed follows below).
With a lighter body, the MOPO agent uses too much force and flips onto its front. However, the AugWM agent actually performs even better than in the original environment, exploiting the lighter torso.
Reward = 1159
Reward = 2864
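These shifted test environments can be obtained by rescaling the simulator's physical parameters. Below is a minimal sketch, assuming a mujoco_py-backed HalfCheetah-v2 and in-place edits to body_mass and dof_damping; the scale values shown are illustrative and may not match the exact settings used for these videos.

```python
# Minimal sketch: build a HalfCheetah variant with rescaled mass and damping.
import gym


def make_shifted_halfcheetah(mass_scale=0.75, damping_scale=0.75):
    """Return a HalfCheetah env whose body masses and joint damping are rescaled."""
    env = gym.make("HalfCheetah-v2")
    model = env.unwrapped.model            # mujoco_py model handle (assumed backend)
    model.body_mass[:] *= mass_scale       # lighter (or heavier) links
    model.dof_damping[:] *= damping_scale  # weaker (or stronger) joint damping
    return env
```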
Next, we consider a setting where damping and mass both differ from their values in the training environment.
Once again, the MOPO agent is unable to control itself and flips over, while AugWM is unfazed.
Reward = 823
Reward = 2462
Finally, what if the dynamics change during an episode?
There are many reasons why this could happen: for example, the robot could be damaged, or could move to a different room in the real world.
To test this, we consider a 1500-step rollout in which the first 500 steps take place in the base test environment (damping = 1, mass = 1), before the dynamics shift to mass = 0.75, damping = 0.5 (sketched in the code below).
As we see, the MOPO agent fails after the shift, whereas the AugWM agent is able to adjust its context, and thus alter its behavior, to make use of the lighter torso.
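To make the protocol concrete, here is a minimal sketch of such a mid-episode shift, assuming a generic policy(obs) -> action callable, mujoco_py-style model attributes as above, and a TimeLimit raised to at least 1500 steps; the AugWM context-update mechanism itself is described in the paper and is not reproduced here.

```python
def rollout_with_shift(env, policy, total_steps=1500, shift_step=500,
                       mass_scale=0.75, damping_scale=0.5):
    """Run one long episode whose dynamics change partway through.

    Assumes the env's TimeLimit allows at least `total_steps` steps and that
    the underlying MuJoCo model exposes body_mass / dof_damping (mujoco_py).
    """
    obs = env.reset()
    total_reward = 0.0
    for t in range(total_steps):
        if t == shift_step:
            # Shift the dynamics mid-episode: lighter body, weaker damping.
            model = env.unwrapped.model
            model.body_mass[:] *= mass_scale
            model.dof_damping[:] *= damping_scale
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```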