Differentiable Physics Models for Real-world Offline Model-based Reinforcement Learning

Abstract:

A limitation of model-based reinforcement learning (MBRL) is the exploitation of errors in the learned dynamics model during policy optimization. Powerful black-box models can locally describe complex system dynamics with high fidelity, but are undefined outside the data distribution. Physics-based models may underfit the real world due to unmodeled phenomena, yet they extrapolate better because of the general validity of their underlying structure. In this work, we demonstrate experimentally that, in the offline model-based reinforcement learning setting, physics-based models that extrapolate globally are preferable to high-capacity local models when the mechanical structure is known. In addition, we generalize the approach of differentiable physics from modeling holonomic multi-body systems to systems with nonholonomic dynamics, using end-to-end automatic differentiation. To demonstrate the effectiveness of physics-based models for offline MBRL, we learn to perform the ball-in-a-cup task on a physical manipulator using only 4 minutes of sampled data. We find that black-box models consistently produce unviable policies because they fail to extrapolate the dynamics, despite having access to more data than the physics-based model.
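The core idea of the physics-based (whitebox) approach is to keep the analytic structure of the dynamics and fit only the physical parameters by backpropagating through a simulated rollout. Below is a minimal sketch of this technique in JAX on a toy damped pendulum; the pendulum model, constants, and function names are our own illustration for exposition, not the paper's Barrett WAM or string-and-ball implementation.

```python
import jax
import jax.numpy as jnp

DT, STEPS, G = 0.01, 200, 9.81  # integration step [s], horizon, gravity [m/s^2]

def step(state, params):
    # Explicit-Euler step of a damped pendulum with unknown length and damping.
    theta, theta_dot = state
    length, damping = params
    theta_ddot = -(G / length) * jnp.sin(theta) - damping * theta_dot
    return jnp.array([theta + DT * theta_dot, theta_dot + DT * theta_ddot])

def simulate(params, state0):
    # Roll out the differentiable dynamics for a fixed horizon.
    def body(state, _):
        nxt = step(state, params)
        return nxt, nxt
    _, traj = jax.lax.scan(body, state0, None, length=STEPS)
    return traj

def loss(params, state0, observed):
    # Mean squared error between simulated and observed state trajectories.
    return jnp.mean((simulate(params, state0) - observed) ** 2)

# Synthetic "measurements" from ground-truth parameters; in practice these
# would be the recorded identification trajectories.
true_params = jnp.array([0.40, 0.05])   # 40 cm pendulum, light damping
state0 = jnp.array([1.0, 0.0])
observed = simulate(true_params, state0)

params = jnp.array([0.60, 0.20])        # initial guess
grad_fn = jax.jit(jax.grad(loss))
for _ in range(2000):
    params = params - 1e-2 * grad_fn(params, state0, observed)
    params = jnp.maximum(params, jnp.array([0.05, 0.0]))  # keep physically plausible
print(params)  # should approach the true values [0.40, 0.05]
```

Because the learned quantities are physical parameters rather than network weights, the fitted model remains valid far outside the identification data, which is what enables the extrapolation discussed above.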

Data Acquisition

Trajectory used to identify the system parameters of the physical Barrett WAM.

Trajectory used to identify the system parameters of the string and ball dynamics.

Learned Whitebox Model - 40 cm String Length

The videos are replayed at 0.5x speed. Green trajectories highlight successful Ball in a Cup movements; red trajectories mark failure cases. Each video shows the optimal policy trained with a different reinforcement learning seed but the identical learned model.

Learned Whitebox Model - 35 cm String Length

The videos are replayed at 0.5x speed. Green trajectories highlight successful Ball in a Cup movements; red trajectories mark failure cases. Each video shows the optimal policy trained with a different reinforcement learning seed but the identical learned model.

Learned Whitebox Model - 45 cm String Length

The videos are replayed at 0.5x speed. Green trajectories highlight successful Ball in a Cup movements; red trajectories mark failure cases. Each video shows the optimal policy trained with a different reinforcement learning seed but the identical learned model.

Baseline Models - 40 cm String Length

Measured Whitebox Model

LSTM Network

Feed-Forward Neural Network