MoDem-V2

Visuo-Motor World Models for Real-World Robot Manipulation


Patrick Lancaster, Nicklas Hansen, Aravind Rajeswaran, Vikash Kumar

We use MoDem-V2 to train the robot on four contact-rich manipulation tasks. These tasks cover a wide range of manipulation skills, namely non-prehensile pushing, object picking, and in-hand manipulation. In recognition of the difficulty of robust pose tracking and dense reward specification in the real world, the robot performs these tasks using only raw visual feedback, proprioceptive signals, and sparse rewards.

Supplementary Video

Challenges

The capability to safely and efficiently learn complex manipulation tasks is the foundation of physical robots that operate effectively in the real world. Model-Based Reinforcement Learning (MBRL) provides a framework for building robot manipulation skills from real-world data without the need for specialized sensor instrumentation or extensive state-estimation pipelines, by learning dynamics models directly from visual observations and then using them for Model Predictive Control (MPC). Yet visual MBRL for real-world robot learning still faces the challenge of exploring a high-dimensional observation space in which only sparse rewards may be available. The original MoDem algorithm bootstrapped visual MBRL with a small number of expert demonstrations to achieve sample-efficient learning from sparse rewards in simulated settings, but it is infeasible for real-world use due to its aggressive exploration strategy, in which actions are sampled from across the entire action space.
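To make this planning loop concrete, below is a minimal sketch of one step of sampling-based MPC with a learned world model, in which candidate action sequences are drawn uniformly over the entire action space as in MoDem's exploration strategy. The `world_model.rollout` and `value_fn` interfaces are hypothetical stand-ins for illustration, not the actual MoDem implementation.

```python
import numpy as np

def plan_step_uniform(world_model, value_fn, latent, horizon=5, num_samples=512,
                      action_dim=4, action_low=-1.0, action_high=1.0):
    """One receding-horizon MPC step with candidates drawn uniformly from the
    entire action space (the aggressive exploration strategy described above).

    `world_model.rollout(latent, actions)` is an assumed interface returning the
    predicted cumulative reward and final latent state; `value_fn` scores that
    final state with a learned value function.
    """
    # Sample candidate action sequences uniformly over the full action range.
    candidates = np.random.uniform(action_low, action_high,
                                   size=(num_samples, horizon, action_dim))
    scores = np.empty(num_samples)
    for i, seq in enumerate(candidates):
        reward_sum, final_latent = world_model.rollout(latent, seq)
        scores[i] = reward_sum + value_fn(final_latent)
    # Execute only the first action of the highest-scoring candidate sequence.
    return candidates[scores.argmax()][0]
```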



Approach

Our key insight is that conservative exploration can respect the safety constraints of real-world environments while still enabling the robot to learn quickly and efficiently. We turn this insight into three enhancements to MoDem, yielding MoDem-V2, a real-world-ready visual MBRL algorithm:

Policy-centered actions: Rather than sampling actions from across the entire action space, we propose to sample actions from our learned policy. This more conservative exploration strategy reduces the likelihood of evaluating the world model and value function over unseen regions of the state-action space, enabling them to better discriminate the quality of the generated actions.
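A minimal sketch of this idea is below, assuming hypothetical `policy(z)` and `world_model.next(z, a)` interfaces; the noise scale is illustrative rather than a MoDem-V2 hyperparameter.

```python
import torch

def sample_policy_centered_candidates(policy, world_model, latent, horizon=5,
                                      num_samples=512, noise_std=0.1):
    """Draw candidate action sequences by rolling the learned policy forward in
    the world model's latent space and adding a small perturbation, rather than
    sampling uniformly over the whole action space."""
    candidates = []
    for _ in range(num_samples):
        z, actions = latent, []
        for _ in range(horizon):
            a = policy(z)                                   # mean action from the learned policy
            a = (a + noise_std * torch.randn_like(a)).clamp(-1.0, 1.0)  # small local perturbation
            actions.append(a)
            z = world_model.next(z, a)                      # predicted next latent state (assumed interface)
        candidates.append(torch.stack(actions))
    return torch.stack(candidates)                          # (num_samples, horizon, action_dim)
```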

Smooth transition from BC actions to MPC: At the beginning of the online interaction phase, MoDem immediately begins using its learned world model and value function to perform MPC. Yet both components have seen only limited data near the BC policy, so relying on them to choose actions for multiple consecutive timesteps at the start of interaction can quickly lead the agent into an unexplored region of the observation-action space from which it cannot recover. Our remedy is to gradually shift from executing actions sampled from the BC policy to actions computed by short-horizon planning.
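One simple way to realize such a hand-off is a decaying probability of executing the BC action, sketched below. The linear schedule and `anneal_steps` value are illustrative assumptions, not MoDem-V2's exact schedule, and `planner.plan(obs)` stands in for short-horizon planning with the learned world model.

```python
import random

def choose_action(bc_policy, planner, obs, step, anneal_steps=5000):
    """Gradually hand control from the BC policy to short-horizon MPC."""
    bc_fraction = max(0.0, 1.0 - step / anneal_steps)  # decays from 1.0 to 0.0 over annealing
    if random.random() < bc_fraction:
        return bc_policy(obs)   # early on: stay near demonstrated behavior
    return planner.plan(obs)    # later: rely on the world model and value function
```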

Actor-critic ensembles for uncertainty-aware planning: The use of actor-critic ensembles improves the agent's value estimation in two primary ways. First, each actor is trained to maximize its corresponding critic. While this provides an efficient way to find the maximum of Q over actions, it is subject to significant overestimation bias. We mitigate this by only evaluating a critic with final trajectory actions produced by policies that were not directly optimized to maximize that particular critic. Second, actor-critic ensembles provide the agent with a pool of independently trained value functions, each of which computes its own value estimate. By estimating the epistemic uncertainty of a trajectory, the agent can make uncertainty-aware decisions.
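The sketch below illustrates this kind of ensemble value estimate: each critic scores an action produced by a different actor, and the spread across ensemble members is subtracted as an uncertainty penalty. The actor-critic pairing scheme and `uncertainty_coef` are assumptions for illustration, not the exact MoDem-V2 formulation.

```python
import torch

def ensemble_terminal_value(actors, critics, final_latent, uncertainty_coef=1.0):
    """Score a trajectory's final latent state with an actor-critic ensemble."""
    num_members = len(actors)
    values = []
    for i, critic in enumerate(critics):
        # Use an action from a different actor, so no critic scores an action
        # that was optimized directly against it (reduces overestimation bias).
        action = actors[(i + 1) % num_members](final_latent)
        values.append(critic(final_latent, action))
    values = torch.stack(values)                       # (num_members,)
    # Uncertainty-aware estimate: penalize disagreement across ensemble members.
    return values.mean() - uncertainty_coef * values.std()
```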

Evaluation

To evaluate the effectiveness of our approach, we study the four robot manipulation tasks shown above, learned from visual feedback, as well as their simulated counterparts. These tasks encompass a variety of manipulation skills, namely pushing, picking, and in-hand manipulation.

Planar Pushing: This task requires the robot to push an oblong object towards a fixed goal position on a tabletop. It is likely the easiest of the four tasks, and we view it as a base case against which to compare the others.

Incline Pushing: This task requires the robot to push an object up an incline to reach a fixed goal position. During execution of the task, the robot must raise its gripper so that it can progress up the incline while also making sufficiently precise contact with the block to prevent it from slipping beneath or around the side of the gripper.

Bin Picking: To complete this task, the robot must grasp a juice container and then raise it out of the bin. This task requires accurate positioning of the gripper because the (mostly) non-deformable container has a primary width that is approximately 65% of the gripper's maximum aperture. This task also requires the robot to disambiguate spatially similar states; for example, when the gripper is above the bin, the robot must determine whether the object is in its grasp in order to decide between descending toward the bin to pick up the object or staying in place to receive reward.

In-Hand Reorientation: This task requires the robot to grasp a water bottle lying on its side and then manipulate it in-hand to an upright position. Using the multi-fingered D’Manus hand more than doubles the dimensionality of the action space relative to the previous tasks. The high-level strategy that the robot uses to achieve the task is to initially grasp the object near the bottle cap with its pinky and thumb. Once it has lifted the object, it must strike a balance between applying enough force to avoid dropping the object and applying little enough force that the bottle can pivot around the contact axis as the index finger pushes down on it.

Results

To evaluate the safety of both MoDem and MoDem-V2, we measured the joint torques exerted by the robot's actuators and the contact forces it applied in simulation (see below). Here we consider a violation to be exceeding the manufacturer's torque safety limits or applying excessive force (>100 N) to the environment. While both methods initially exert low forces and torques, since their BC policies were trained from the same demonstrations, the number of safety violations committed by MoDem sharply increases once the interaction phase begins. In comparison, MoDem-V2 incurs far fewer violations throughout online learning. MoDem-V2 thus demonstrates that it can achieve similar or better sample efficiency than MoDem while exerting significantly lower joint torques and thereby behaving more safely.
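As a concrete illustration of this violation criterion, the sketch below counts timesteps in which any joint exceeds a torque limit or the applied contact force exceeds 100 N. The per-joint torque limits shown are placeholders, not the manufacturer's actual specifications; only the 100 N force threshold comes from the text above.

```python
import numpy as np

# Placeholder per-joint torque limits in N*m; the real limits come from the
# robot manufacturer's specifications. The 100 N force threshold is stated above.
TORQUE_LIMITS = np.array([87.0, 87.0, 87.0, 87.0, 12.0, 12.0, 12.0])
FORCE_LIMIT = 100.0

def count_safety_violations(joint_torques, contact_forces):
    """Count timesteps that exceed a torque limit or apply excessive contact force.

    joint_torques:  (T, num_joints) array of measured joint torques
    contact_forces: (T,) array of net contact-force magnitudes applied to the environment
    """
    torque_violation = (np.abs(joint_torques) > TORQUE_LIMITS).any(axis=1)
    force_violation = contact_forces > FORCE_LIMIT
    return int((torque_violation | force_violation).sum())
```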

We perform ablations of MoDem-V2 by individually adding each of the enhancements outlined in our Approach to MoDem. We found that, across all tasks, the ablations generally maintained or improved the sample efficiency of MoDem while incurring significantly fewer safety violations (see below).

In the real world, we found that MoDem-V2 enabled the robot to significantly exceed the performance of its initial BC policy with about two hours' worth of online training data or less. The figure below illustrates the success rate of the initial policy cloned from just ten demonstrations and the best performance achieved by the MoDem-V2 agent during online training, as well as example trajectories from the MoDem-V2 agents. Most notably, MoDem-V2 achieves a 40% success rate on the difficult in-hand water bottle reorientation task from just ten demonstrations and slightly over an hour of online training data.

Additional Visualizations

Our agent is able to learn complex manipulation tasks with multi-fingered hands, as shown in this example of five consecutive successful trials of the bin reorientation task.