ActPerMoMa: Active Perceptive Motion Generation for Mobile Manipulation
Snehal Jauhri*, Sophie Lueth*, and Georgia Chalvatzaki
PEARL Lab, TU Darmstadt, Germany
* authors contributed equally
International Conference on Robotics & Automation (ICRA), 2024
TL;DR: Visually informative motion generation for mobile manipulators in cluttered scenes, effectively balancing active perception, grasp detection, and executability
Abstract
Mobile Manipulation (MoMa) systems incorporate the benefits of mobility and dexterity, thanks to the enlarged space in which they can move and interact with their environment. MoMa robots can also continuously perceive their environment when equipped with onboard sensors, e.g., an embodied camera. However, extracting task-relevant visual information in unstructured and cluttered environments such as households remains challenging. In this work, we introduce an active perception pipeline for mobile manipulators to generate motions that are informative toward manipulation tasks such as grasping in initially unknown, cluttered scenes. Our proposed approach, ActPerMoMa, generates robot trajectories in a receding-horizon fashion, sampling trajectories and computing path-wise utilities that trade off reconstructing the unknown scene (by maximizing visual information gain) against the task-oriented objective (e.g., grasp success, by maximizing grasp reachability). We demonstrate the efficacy of our method in experiments with a dual-arm TIAGo++ MoMa robot performing mobile grasping in cluttered scenes.
Real robot demonstration
The robot explores the scene, detects the target object to be grasped, and then runs our ActPerMoMa motion generator to gather more information and execute the most viable grasp. In the visualization on the top right, the red bounding box denotes the detected target object area, and the blue points denote informative rear-side voxels considered for information gain. [Video sped up 6x]
ActPerMoMa
A MoMa robot is placed in a previously unseen environment and is tasked with picking up a target object placed on a surface among clutter.
The approximate target object area is either detected or obtained from a user instruction, e.g., “Pick up the object at the right corner of the table”
During an episode, the robot builds a volumetric representation (TSDF) of the scene for active perception & grasp detection
Top figure: Example informative voxels (light blue points) considered for active perception & example detected grasp (green)
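The volumetric representation mentioned above can be pictured with a minimal sketch of TSDF fusion. The snippet below shows a 1-D projective TSDF update along a single camera ray, which is the core of how depth observations are fused into a volumetric grid; the voxel size, truncation distance, and unit weighting are illustrative choices, not the exact parameters used in ActPerMoMa.

```python
# Minimal 1-D sketch of a TSDF update along one camera ray. Each voxel stores
# a truncated signed distance to the nearest observed surface, fused over
# time as a running weighted average. Parameter values are illustrative.

TRUNC = 0.1   # truncation distance in metres (assumed value)
VOXEL = 0.02  # voxel size in metres (assumed value)

def integrate_ray(tsdf, weights, depth_measured, n_voxels):
    """Fuse one depth measurement into the voxels along the ray."""
    for i in range(n_voxels):
        voxel_depth = i * VOXEL
        sdf = depth_measured - voxel_depth        # signed distance to surface
        if sdf < -TRUNC:
            continue                              # far behind surface: occluded
        d = min(1.0, sdf / TRUNC)                 # truncate and normalise
        w = weights[i]
        tsdf[i] = (tsdf[i] * w + d) / (w + 1.0)   # running weighted average
        weights[i] = w + 1.0
    return tsdf, weights
```

Voxels that never receive an update (occluded rear-side regions) stay at their unknown initial value; those are exactly the voxels whose observation yields information gain.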
Using rough initial knowledge about the target area or target object position, we continuously plan and execute informative motions for the mobile grasping task. At every timestep, the RGB-D information from the head-mounted embodied camera is integrated into a scene TSDF used for both grasp detection and information-gain computation. Using the currently known free space for movement of the robot base, we sample candidate robot paths, including both base and camera poses, towards the target. For each candidate path, we compute the information gained from the camera views along the path and the reachability of stable detected grasps from the final base pose of the path. We trade off these objectives with a receding-horizon cost and, at every timestep, execute one step of the optimal path.
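The receding-horizon selection described above can be sketched as follows. All names and scoring functions here are simplified stand-ins (2-D poses, a visibility-radius information gain, an inverse-distance reachability score, and hand-picked weights), not the authors' actual implementation; the point is the structure: score whole candidate paths, but execute only the first step of the best one.

```python
import math

# Hedged sketch of receding-horizon path selection: a path's utility combines
# discounted information gain along its views with grasp reachability at its
# final base pose. Poses are 2-D (x, y) tuples for simplicity.

def information_gain(camera_pose, rear_voxels, radius=1.0):
    """Toy stand-in: count unknown rear-side voxels visible from this pose."""
    x, y = camera_pose
    return sum(1 for vx, vy in rear_voxels
               if math.hypot(vx - x, vy - y) < radius)

def grasp_reachability(base_pose, grasps):
    """Toy stand-in: best inverse-distance score to any detected grasp."""
    x, y = base_pose
    return max((1.0 / (1.0 + math.hypot(gx - x, gy - y)) for gx, gy in grasps),
               default=0.0)

def path_utility(path, rear_voxels, grasps, w_grasp=2.0, gamma=0.9):
    """Trade off discounted exploration along the path against execution."""
    ig = sum((gamma ** t) * information_gain(pose, rear_voxels)
             for t, pose in enumerate(path))
    return ig + w_grasp * grasp_reachability(path[-1], grasps)

def select_step(candidate_paths, rear_voxels, grasps):
    """Receding horizon: pick the best path, return only its first step."""
    best = max(candidate_paths,
               key=lambda p: path_utility(p, rear_voxels, grasps))
    return best[0]
```

Because only one step is executed before re-planning, the trade-off naturally shifts from exploration to execution as the scene is reconstructed: once few informative voxels remain, the reachability term dominates the utility.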
Ours
Ensures effective exploration of the target object region & continuous evaluation of grasp executability
Smoothly switches from exploration to execution based on the path utilities
Baseline (Next-Best View)
The robot tries to move directly to the next-best view, leading to suboptimal motions
Without evaluating executability/reachability, the baseline can get stuck and fail to execute a grasping motion