FAQ

What is sim-to-real?

The term sim-to-real can refer to the transfer of multiple quantities from simulation to the real world, including perceptual data; high-level task, motion, and grasp plans; and low-level controllers. In this work, we focus on sim-to-real transfer of control policies trained with reinforcement learning (RL). Such approaches have shown remarkable recent results in robotics. However, they also pose particular challenges in both simulation and reality: agents can exploit flaws in physics simulators through reward hacking, and naive policies can generate adverse behavior on real-world systems.

What are the causes of the sim-to-real gap?

The sources of the sim-to-real gap include models (e.g., the models used for robot joint friction, robot-object friction, and object-world friction); the parameters of these models, which are highly non-trivial to identify; numerical inaccuracies (e.g., finite solver residuals); and controllers (e.g., gravity-compensation error). Dynamics parameter randomization can improve transfer; however, it can require substantial training time and effort, and it does not fully address incorrect models or inaccurate simulators.
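For illustration, a minimal sketch of dynamics parameter randomization is shown below: physical parameters such as friction and mass are resampled at every episode reset. The parameter names, ranges, and simulator hook here are hypothetical, not our actual implementation.

```python
import numpy as np

# Hypothetical parameter ranges; in practice, ranges are tuned per robot,
# object, and simulator, and choosing them well is non-trivial.
PARAM_RANGES = {
    "joint_friction":    (0.00, 0.10),  # robot joint friction
    "object_friction":   (0.30, 1.00),  # robot-object / object-world friction
    "object_mass_scale": (0.80, 1.20),  # multiplier on nominal object mass
}

def sample_dynamics_params(rng: np.random.Generator) -> dict:
    """Draw one set of randomized dynamics parameters for an episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

def reset_with_randomization(env, rng: np.random.Generator):
    """Resample dynamics parameters before each training episode."""
    env.set_dynamics(sample_dynamics_params(rng))  # hypothetical simulator hook
    return env.reset()
```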

Why is learning challenging for robotic assembly?

Assembly is one of the most complex operations in manufacturing, requiring high precision and accuracy, as well as adaptivity to diverse parts and environments. Industry-standard methods for robotic assembly often rely on human-engineered heuristics or demonstrations that need to be fine-tuned or reconfigured for each part location, part type, assembly, or task. In addition, robots must avoid adverse behavior (e.g., damaging collisions between the robot and the environment), often requiring supervision from a human operator. Learning methods aim to achieve such precision, accuracy, and adaptivity, ideally with minimal human input.

As a quick glimpse into what learning looks like, here we show 4 video clips that demonstrate the behavior of untrained robot agents (Before Training) and trained robot agents (After Training) for gear assembly and peg insertion tasks.

[Video clips: Before Training and After Training, for each of the two tasks.]

Why use RL to solve robotic assembly problems?

Classical approaches often focus on eliminating uncertainty via position-controlled robots, highly customized adapters and fixtures, and carefully controlled environments. These approaches have achieved remarkable results in industrial settings. However, recent work has shown that RL can outperform industrial integrators in settings with higher uncertainty, such as those that may be found in small- and medium-sized enterprises. In addition, sim-to-real transfer of RL policies can solve problems that are particularly time-consuming, expensive, or dangerous to solve in the real world alone.

Why use task-space impedance controllers?

First, in robotic manipulation, it has been shown that learning in task space is more efficient than learning in joint space. Second, compelling alternatives such as operational-space control (OSC) perform inertial compensation by relying on an accurate dynamics model; system identification for such models is an active area of research, and the models do not typically capture the effects of inertial perturbations (e.g., the addition of a tool) or contact dynamics. Third, a high-performance implementation of a task-space impedance controller is provided by Franka Emika in the libfranka library, facilitating reproducibility of our results.
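For intuition, a simplified task-space impedance law is sketched below: a Cartesian spring-damper wrench is mapped to joint torques through the Jacobian transpose. This is only a schematic sketch, not the libfranka implementation; it assumes gravity compensation is handled separately and glosses over the orientation-error representation.

```python
import numpy as np

def task_space_impedance_torques(x, x_des, xdot, jacobian, kp, kd):
    """Compute joint torques from a Cartesian spring-damper (impedance) law.

    x, x_des : (6,) current and desired task-space pose
               (a full implementation handles orientation error, e.g., with quaternions)
    xdot     : (6,) current task-space velocity
    jacobian : (6, n_joints) geometric Jacobian at the current joint configuration
    kp, kd   : (6,) proportional (stiffness) and derivative (damping) gains
    """
    wrench = kp * (x_des - x) - kd * xdot  # virtual spring-damper in task space
    tau = jacobian.T @ wrench              # map wrench to joint torques
    return tau                             # gravity compensation applied elsewhere
```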

Why not use depth from the RGB-D camera?

In our initial experiments with the Intel RealSense D435, depth images of our metallic components exhibited severe noise. Thus, we used the color module exclusively.

Why not do 6-DOF pose estimation?

There are a number of high-performance 6-DOF pose estimators available today, including pose estimators that use RGB, depth, or RGB-D, as well as pose estimators that can generalize to unseen object instances and object categories. However, since our problem setup primarily involves top-down actions and relies on reasonably accurate estimates of {x, y, yaw}, we used a simple but highly effective 3-DOF pose estimation pipeline, described in Appendix C2-C4.
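As a rough, hypothetical illustration (not the pipeline from Appendix C2-C4), a 3-DOF estimate of {x, y, yaw} can be recovered from a segmented top-down image with standard OpenCV operations:

```python
import cv2
import numpy as np

def estimate_xy_yaw(mask: np.ndarray, pixels_per_meter: float):
    """Estimate a planar pose {x, y, yaw} from a binary top-down segmentation mask.

    mask: uint8 image where object pixels are nonzero.
    Returns (x, y) in meters (camera frame) and yaw in radians, or None if empty.
    A real pipeline would also apply a calibrated camera-to-robot transform and
    disambiguate the angle convention, which varies across OpenCV versions.
    """
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    (cx, cy), (w, h), angle_deg = cv2.minAreaRect(largest)  # center, size, rotation
    return cx / pixels_per_meter, cy / pixels_per_meter, np.deg2rad(angle_deg)
```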

When evaluating the SAPU, SDF reward, and SBC algorithms, what are the baselines?

We evaluated SAPU, the SDF reward, and SBC in the style of ablation studies. In other words, the SAPU evaluation (Section IV.E) used the SDF reward and SBC; the SDF reward evaluation (Section IV.F) used SAPU and SBC; and the SBC evaluation (Section IV.G) used SAPU and the SDF reward. Of course, the joint evaluation (Section IV.H) used all 3 algorithms.

What fingers are used on the robot?

Our group designed 3D-printed fingers with cast silicone-rubber contact surfaces, which we have used in multiple research efforts. The primary differences from the Franka-provided fingers are that 1) our contact surfaces are longer, 2) our contact surfaces are flat, and 3) our contact surfaces are semi-compliant. Although aspect (1) improves performance during assembly tasks, we have found that aspects (2) and (3) may occasionally hurt performance; for example, the softer surfaces and lack of V-grooves on our fingers can result in less stable grasps on cylindrical pegs than the Franka-provided fingers. Thus, our finger design is still under development; a future iteration may be released for open access.

How is fixturing determined for the plug and socket trays?

As described in the paper, the plug trays (used in the Pick experiments) are free to slide. This configuration allows us to quickly run experiments, as the trays can be arbitrarily translated and rotated and are not restricted to the bolt pattern of the optical table. Given that we train the Pick policies with fixed plug trays in simulation, and that we achieve an accuracy of ~2 mm for the Reach policies in the real world, we do not anticipate that bolting down the plug trays would have a measurable effect on performance during the Pick experiments. The socket trays (used in the Place and Insert experiments), on the other hand, are bolted to the optical table. This configuration mimics industrial-style fixturing and provides the stability needed to generate the forces required for insertion.

What are the computational requirements for training policies?

We used a single NVIDIA RTX 3090 or V100 GPU for training and a single NVIDIA RTX 2080 Ti GPU for deployment.