Authors: Joohwan Seo¹, Arvind Kruthiventy¹, Soomi Lee¹, Megan Teng¹, Xiang Zhang¹, Seoyeon Choi¹, Jongeun Choi², and Roberto Horowitz¹
Affiliations: ¹ University of California, Berkeley ² Yonsei University
Code: (Simulation) https://github.com/Joohwan-Seo/EquiContact-Simulation
(Experiment) https://github.com/Joohwan-Seo/EquiContact-Experiment
This paper presents EquiContact, a framework for learning vision-based robotic policies for contact-rich manipulation that generalize across different locations and orientations. Rather than training a single policy on a large dataset with randomized object orientations, we focus on creating a hierarchical policy for a prototypical contact-rich peg-in-hole task that prioritizes data efficiency and spatial generalizability. Our method achieves a near-perfect success rate in real-world experiments, even in scenarios the robot has never seen before.
Other experimental videos are available on the "Experimental Videos" tab.
High-precision, contact-rich manipulation is a significant challenge for robots. Vision-based pick-and-place policies often lack the sub-millimeter accuracy required for tasks like inserting a peg into a low-tolerance hole.
This leads to two key problems:
Lack of Precision: High-level vision planners can identify the general area of a task but struggle with the fine-grained accuracy required for contact. Even fine-grained policies often struggle with contact-rich tasks because they lack compliance, resulting in low success rates.
Failure to Generalize: Policies that rely on world-frame poses or joint positions cannot be adapted to new spatial configurations. For example, a previously successful policy like CompACT, which uses world-frame poses, drops from a 100% success rate to 0% when the task is moved to an unseen location.
EquiContact is a hierarchical framework that separates global scene understanding from local, real-time fine-grained compliant control, enabling spatial generalization.
First, a high-level vision planner, Diffusion-EDF (Diff-EDF), uses point clouds from external cameras to identify the approximate target pose for the task. Diff-EDF is an SE(3) bi-equivariant pick-and-place module, which enables sample-efficient training (~15 demonstrations) and makes it robust to transformations of the input point cloud.
Then, a novel low-level policy, G-CompACT, takes over and uses only local information (the geometrically consistent error vector (GCEV), images from wrist-mounted cameras, and force-torque sensor readings) to execute high-precision insertion with adaptive compliance. By anchoring this localized policy to the reference poses from the global planner (Diff-EDF), the entire system becomes robust to changes in the task's position and orientation, enabling successful performance in unseen scenarios.
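As a rough illustration of why a local error signal helps, the sketch below computes one form of geometrically consistent error vector found in the geometric-control literature, e_G = [R^T (p - p_d); (R_d^T R - R^T R_d)^vee], and checks that it is unchanged when the current and desired poses are moved by the same rigid transformation. The exact GCEV definition used by G-CompACT is not stated on this page, so the formula, the helper names (`rot`, `vee`, `gcev`), and all numeric values are assumptions for illustration only.

```python
import numpy as np

def rot(axis, angle):
    """Rotation matrix from an axis and an angle (Rodrigues' formula)."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def vee(S):
    """Map a 3x3 skew-symmetric matrix to its 3-vector."""
    return np.array([S[2, 1], S[0, 2], S[1, 0]])

def gcev(p, R, p_d, R_d):
    """Assumed GCEV form: pose error expressed in the end-effector frame."""
    e_p = R.T @ (p - p_d)               # position error seen from the end-effector
    e_R = vee(R_d.T @ R - R.T @ R_d)    # rotation error (skew part of R_d^T R)
    return np.concatenate([e_p, e_R])

# Hypothetical current pose (p, R) and desired pose (p_d, R_d) in the world frame.
p, R = np.array([0.40, 0.10, 0.30]), rot([0, 0, 1], 0.2)
p_d, R_d = np.array([0.42, 0.08, 0.25]), rot([1, 1, 0], 0.1)

# Move the whole scene (robot end-effector and target) by the same transform.
R_g, p_g = rot([1, 2, 3], 0.9), np.array([0.5, -0.3, 0.2])
p2, R2 = R_g @ p + p_g, R_g @ R
p_d2, R_d2 = R_g @ p_d + p_g, R_g @ R_d

e_train = gcev(p, R, p_d, R_d)
e_moved = gcev(p2, R2, p_d2, R_d2)

# The local error vector is identical in both scenes,
# whereas the world-frame position error is rotated by R_g.
print(np.allclose(e_train, e_moved))
print(np.allclose(p - p_d, p2 - p_d2))
```

Under these assumptions, a policy fed with `gcev(...)` sees the same input in the training scene and in the moved scene, while a policy fed with world-frame errors does not.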
The desired poses and admittance gain commands are then passed to the Geometric Admittance Controller (GAC), enabling compliant motion. The GAC is also proven to be SE(3)-equivariant.
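To make the role of the admittance gains concrete, here is a deliberately simplified, translation-only admittance law, M a + D v + K (x - x_d) = f_ext, integrated with a semi-implicit Euler step. This is not the Geometric Admittance Controller described here, which operates on SE(3); the function name `simulate_admittance`, the gain values, the force, and the time step are hypothetical choices for illustration.

```python
import numpy as np

def simulate_admittance(x_d, f_ext, M, D, K, dt=1e-3, steps=5000):
    """Integrate M*a + D*v + K*(x - x_d) = f_ext per axis; return the final position."""
    x = np.array(x_d, dtype=float)   # start at the desired position (copy)
    v = np.zeros_like(x)
    for _ in range(steps):
        a = (f_ext - D * v - K * (x - x_d)) / M
        v = v + dt * a               # semi-implicit Euler: update velocity first
        x = x + dt * v               # then position with the new velocity
    return x

x_d = np.zeros(3)
f_ext = np.array([0.0, 0.0, 5.0])    # constant 5 N contact force along z (assumed)
M = np.ones(3)                       # virtual mass per axis (assumed)

K_stiff = np.full(3, 1000.0)         # stiff gains
K_soft = np.full(3, 100.0)           # compliant gains

# Critically damped choice D = 2*sqrt(K*M) for both gain sets.
x_stiff = simulate_admittance(x_d, f_ext, M, 2.0 * np.sqrt(K_stiff * M), K_stiff)
x_soft = simulate_admittance(x_d, f_ext, M, 2.0 * np.sqrt(K_soft * M), K_soft)

print(x_stiff[2], x_soft[2])         # steady-state deflection approaches f/K
```

For the same contact force, the softer gains give roughly ten times more deflection (f/K at steady state), which is the kind of trade-off a policy can exploit by modulating the gains during insertion.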
By introducing invariance assumptions on the vision modules, we obtain provable SE(3) equivariance of the full pipeline, encompassing both vision and force. These assumptions also make the model more interpretable and indicate where to look when troubleshooting errors.
We validated EquiContact in a series of real-world peg-in-hole experiments. Our method demonstrates robust generalization to unseen spatial configurations, significantly outperforming benchmark approaches that lack equivariance.
For the full pick-and-place task, EquiContact also achieved near-perfect success rates of 20/20 for the unseen flat platform and 18/20 for the unseen tilted platform.
OOD (Out-of-Distribution) scenarios are translated and/or rotated from the training setup.
We also validated EquiContact on screwing and surface-wiping tasks (please see the experimental videos as well).
As expected, EquiContact generalizes successfully to the unseen flat and tilted cases for both tasks.
EquiContact is designed to overcome these challenges by being structurally equivariant. This is achieved by combining a high-level planner with a low-level controller and is built on three core principles we identified for spatially generalizable contact-rich manipulation:
Compliance: The robot must be able to handle contact. We use Geometric Admittance Control (GAC), which allows the robot to react to forces and avoid becoming stuck or causing damage. Our policy learns to modulate these compliance gains in real time.
Localized Policy: The low-level policy, G-CompACT, uses only local information defined in the robot's end-effector frame (force-torque measurements and relative error) and likewise outputs actions (relative pose and admittance gains) defined in this frame.
Induced Equivariance: The localized policy is "anchored" to a reference frame provided by a high-level planner, Diff-EDF. This architectural design enables the spatial generalization property across the entire system, providing provable SE(3) vision-to-force equivariance.
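The "localized policy" and "induced equivariance" principles can be illustrated with homogeneous transforms: if a policy outputs a relative pose in the end-effector frame, the world-frame command is obtained by right-multiplying the current end-effector pose, so moving the whole scene by a transform g moves the command by the same g. The sketch below is only a toy check of this property; `local_policy` is a hypothetical stand-in that returns a fixed relative motion, not G-CompACT, and `make_pose` and all numeric values are assumptions.

```python
import numpy as np

def make_pose(axis, angle, p):
    """4x4 homogeneous transform from an axis-angle rotation and a translation."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    T = np.eye(4)
    T[:3, :3] = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T[:3, 3] = p
    return T

def local_policy(local_obs):
    """Hypothetical policy: local observations -> relative pose in the end-effector frame."""
    # A fixed small tilt plus a 1 cm step along the tool z-axis, for illustration.
    return make_pose([0, 1, 0], 0.05, [0.0, 0.0, 0.01])

g_ee = make_pose([0, 0, 1], 0.3, [0.4, 0.1, 0.3])    # end-effector pose in the training scene
g = make_pose([1, 2, 3], 0.9, [0.5, -0.3, 0.2])      # rigid motion applied to the whole scene

# Local observations are assumed identical in both scenes (the invariance assumption).
local_obs = {"gcev": np.zeros(6), "wrench": np.zeros(6)}

cmd_train = g_ee @ local_policy(local_obs)           # world-frame command, training scene
cmd_moved = (g @ g_ee) @ local_policy(local_obs)     # world-frame command, moved scene

# Equivariance: the command in the moved scene is the original command moved by g.
print(np.allclose(cmd_moved, g @ cmd_train))
```

The check holds by associativity of matrix multiplication; the substantive requirement is that the policy's inputs stay the same when the scene moves, which is what the local observations and the reference frame from the planner are meant to provide.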
Note that one can consider contact-rich tasks as a generalization of "usual" tasks: a policy that can handle contact-rich tasks can also handle usual tabletop manipulation, but the converse does not hold.
Therefore, if compliance is set aside, our proposed principles can be applied to all visuomotor servoing manipulation policies.