Model-based reinforcement learning (RL) has proven to be a data-efficient approach for learning control tasks but is difficult to apply in domains with complex observations such as images. In this paper, we present a method for learning representations that are suitable for iterative model-based policy improvement, even when the underlying dynamical system has complex dynamics and image observations: these representations are optimized for inferring simple dynamics and cost models given data from the current policy. This enables a model-based RL method based on the linear-quadratic regulator (LQR) to be used for systems with image observations. We evaluate our approach on a range of robotics tasks, including manipulation with a real-world robotic arm directly from images. We find that our method produces substantially better final performance than other model-based RL methods while being significantly more efficient than model-free RL.
Illustrations of the environments we test on in the top row, with example image observations in the bottom row. Left to right: visualizing a trajectory in the nonholonomic car environment, with the target denoted by the black dot; an illustration of the 2-DoF reacher environment, with the target denoted by the red dot; the different tasks that we test for block stacking, where the rightmost task is the most difficult as the policy must learn to first lift the yellow block before stacking it; a depiction of our pushing setup, where a human provides the sparse reward that indicates whether the robot successfully pushed the mug onto the coaster.
Sawyer Lego block stacking. We use our method to learn Lego block stacking with a real 7-DoF Sawyer robotic arm. The observations are 64-by-64-by-3 images from a camera pointed at the robot, and the controller receives only images as the observation, without joint angles or other information. As shown above, we define different block stacking tasks as different initial positions of the Sawyer arm.
Our method is successful on all tasks, where we define success as achieving an average distance of 0.02 m, which generally corresponds to successful stacking, whereas the VAE ablation is only successful on the easiest task in the middle plot. The MPC baseline starts off better and learns more quickly on the two easier tasks. However, MPC is limited to short-horizon planning, which causes it to fail on the most difficult task in the right plot, as it simply greedily reduces the distance between the two blocks rather than lifting the block off the table.
As a comparison to a state-of-the-art model-based method that has been successful in real-world image-based domains, we evaluate deep visual foresight, which learns pixel-space models and does not utilize representation learning. We find that this method can make progress but ultimately is not able to solve the two harder tasks, even with more data than our method uses and even with a much smaller model. This highlights our method's data efficiency, as we use about two hours of robot data compared to days or weeks of data in this prior work. The x-axes in the plots show that we further reduce the total data requirements of our method by about a factor of two by pretraining and transferring a shared representation and global model.
Sawyer pushing. We also experiment with the Sawyer arm learning to push a mug onto a white coaster, where we again use 64-by-64-by-3 images with no auxiliary information. Furthermore, we set up this task with only sparse binary rewards that indicate whether the mug is on top of the coaster, which are provided by a human labeler. Despite the additional challenge of sparse rewards, our method learns a successful policy in about an hour of interaction time as detailed below. Deep visual foresight performs worse than our method with a comparable amount of data, again even when using a downsized model.
Left: our method is successful at learning this task from sparse rewards with about an hour of data collection. Deep visual foresight trained with a comparable amount of data can sometimes solve the task as shown in the video on the right, but the final performance is much more variable. Center: visualizing example end states from rolling out our policy after 200 (top), 230 (middle) and 260 (bottom) trajectories.
Nonholonomic car. The nonholonomic car starts in the bottom right of the 2-dimensional space and controls its acceleration and steering velocity in order to reach the target in the top left. We use 64-by-64 images as the observation. Our method and the MPC baseline are able to learn with about 1500 episodes of experience, whereas the VAE ablation's performance is less consistent. PPO eventually learns a successful policy for this task that performs better than our method; however, it requires over 25 times more data to reach this performance.
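For concreteness, the car's state transition can be sketched with the standard kinematic nonholonomic car model, where the action directly sets acceleration and steering (heading) velocity; this is a generic textbook model and may differ in detail from the exact simulator used in our experiments:

```python
import numpy as np

def car_step(state, action, dt=0.05):
    """One Euler step of a standard kinematic nonholonomic car model.

    state  = (x, y, heading, speed)
    action = (acceleration, steering velocity)
    The car cannot move sideways: position only changes along the heading,
    which is what makes the system nonholonomic.
    """
    x, y, theta, v = state
    a, omega = action
    x += dt * v * np.cos(theta)
    y += dt * v * np.sin(theta)
    theta += dt * omega
    v += dt * a
    return np.array([x, y, theta, v])
```

Note that the policy never observes this state directly; it only sees the 64-by-64 rendered image.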
Reacher. We experiment with the reacher environment from OpenAI Gym, where a 2-DoF arm in a 2-dimensional plane has to reach a fixed target denoted by a red dot. For observations, we directly use 64-by-64-by-3 images of the rendered environment, which provide a top-down view of the reacher and target. Our method is outperformed by the final PPO policy; however, PPO requires about 40 times more data to learn. The VAE ablation and MPC variant also make progress toward the target, though their performance is noticeably worse than our method's. MPC often has better initial behavior than LQR-FLM, as it uses the pretrained models right away for planning, highlighting one benefit of planning-based methods; however, the MPC baseline barely improves past this behavior. Forward prediction with this learned model deteriorates quickly as the horizon increases, which makes long-horizon planning impossible. MPC is thus limited to short-horizon planning, and this limitation has been noted in prior work. SOLAR does not suffer from this as we do not use our models for forward prediction.
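The short-horizon limitation can be illustrated with a minimal random-shooting MPC sketch, which is one common way such a baseline is implemented (we do not claim this is the exact planner used in our experiments). The `dynamics` and `cost` callables stand in for learned models; the key point is that `horizon` must stay small, since rollout error compounds with each prediction step:

```python
import numpy as np

def random_shooting_mpc(state, dynamics, cost, horizon=5, n_samples=100,
                        action_dim=2, rng=None):
    """Return the first action of the lowest-cost sampled action sequence.

    dynamics(state, action) -> next state, cost(state, action) -> scalar
    are assumed to be learned models. Because prediction error compounds
    over the horizon, short horizons are typically used, which makes the
    planner greedy on tasks that require long-term reasoning.
    """
    rng = np.random.default_rng() if rng is None else rng
    best_cost, best_action = np.inf, None
    for _ in range(n_samples):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in seq:
            total += cost(s, a)
            s = dynamics(s, a)
        if total < best_cost:
            best_cost, best_action = total, seq[0]
    return best_action
```

At each control step the planner replans from the current state and executes only the first action of the best sequence, so it never commits to a long plan through the unreliable parts of the learned model.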