Model-based reinforcement learning (RL) methods can be broadly categorized as global model methods, which depend on learning models that provide sensible predictions in a wide range of states, or local model methods, which iteratively refit simple models that are used for policy improvement. While predicting future states that will result from the current actions is difficult, local model methods only attempt to understand system dynamics in the neighborhood of the current policy, making it possible to produce local improvements without ever learning to predict accurately far into the future. The main idea in this paper is that we can learn representations that make it easy to retrospectively infer simple dynamics given the data from the current policy, thus enabling local models to be used for policy learning in complex systems. To that end, we focus on learning representations with probabilistic graphical model (PGM) structure, which allows us to devise an efficient local model method that infers dynamics from real-world rollouts with the PGM as a global prior. We compare our method to other model-based and model-free RL methods on a suite of robotics tasks, including manipulation tasks on a real Sawyer robotic arm directly from camera images.
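To make the idea of refitting a simple local model concrete, here is a minimal sketch that fits linear dynamics to transitions collected under the current policy. It assumes plain least squares, not the PGM-based inference with a global prior described above, and the names `fit_linear_dynamics`, `A_true`, `B_true`, and `c_true` are hypothetical, chosen only for illustration.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of next_state ~ A @ state + B @ action + c.

    Illustrative stand-in only: it ignores any prior over dynamics.
    """
    n, state_dim = states.shape
    action_dim = actions.shape[1]
    # Regressors are [state, action, 1]; the constant column yields the offset c.
    X = np.hstack([states, actions, np.ones((n, 1))])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    A = W[:state_dim].T
    B = W[state_dim:state_dim + action_dim].T
    c = W[-1]
    return A, B, c

# Synthetic, noise-free transitions from a known linear system (assumed values).
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
c_true = np.array([0.0, -0.01])
states = rng.normal(size=(200, 2))
actions = rng.normal(size=(200, 1))
next_states = states @ A_true.T + actions @ B_true.T + c_true

A_hat, B_hat, c_hat = fit_linear_dynamics(states, actions, next_states)
```

Because the model only has to explain data near the current policy, it can stay this simple even when the global dynamics are not linear; it would be refit after each round of data collection.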
Top left: Visualizing a trajectory in the car navigation environment, with the target denoted by the black dot, and the corresponding image observation. Bottom left: An illustration of the 2-DoF arm environment, with the target denoted by the red dot, and the corresponding image observation. Note that we use sliding windows of past observations when learning both tasks. Top right: Illustration of the architecture we use for learning Lego block stacking. Bottom right: Example trajectory from our learned policy stacking the yellow Lego block on top of the blue block.
Sawyer Lego block stacking. We tested our method on an image-based Lego block stacking task on a real 7-DoF Sawyer robotic arm, where the controller receives only images as observations, without joint angles or other information. The observations are raw 84-by-84-by-3 images from a camera pointed at the robot. Our method solves this task within 250 episodes, corresponding to under an hour of interaction time, and successfully handles the complex, contact-rich dynamics of block stacking.
Nonholonomic car. The nonholonomic car starts in the bottom right of the 2-dimensional space and controls its forward acceleration and steering velocity in order to reach the target in the top left. We evaluate our method using image observations, with a sliding window of four 64-by-64 images to capture velocity information. We compare to a global model ablation of our method, in which we replace the local linear dynamics in our model with a neural network dynamics function and use this model for forward prediction and MPC in the latent space.
Global model ablation
Reacher. We experiment with the reacher environment from OpenAI Gym, where a 2-DoF arm has to reach a target denoted by a red dot, which we specify to be in the bottom left. For observations, we directly use 64-by-64-by-3 images of the rendered environment, which provide a top-down view of the reacher and target, and we use a sliding window of four images to encode velocity information. In this domain, we evaluate our method against TRPO.
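A single image carries no velocity information, which is why the image-based tasks use a sliding window of four frames. The sketch below shows one way such a window could be maintained. The class name `FrameWindow`, the channel-wise concatenation, and the padding of the window with copies of the first frame at the start of an episode are assumptions made for illustration, not details given in the text.

```python
from collections import deque

import numpy as np

class FrameWindow:
    """Keeps the most recent `num_frames` images and stacks them along channels."""

    def __init__(self, num_frames=4):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: fill the window with copies of the first frame.
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Oldest frame first, newest last, joined on the channel axis.
        return np.concatenate(list(self.frames), axis=-1)

window = FrameWindow(num_frames=4)
obs = window.reset(np.zeros((64, 64, 3), dtype=np.uint8))
obs = window.step(np.full((64, 64, 3), 255, dtype=np.uint8))
```

With four 64-by-64-by-3 frames, the stacked observation has 12 channels, and differences between consecutive frames in the stack expose the motion of the arm.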