Aim My Robot: Precision Local Navigation to Any Object
Xiangyun Meng¹, Xuning Yang², Sanghun Jung¹, Fabio Ramos²
Srid Sadhan Jujjavarapu³, Sanjoy Paul³ and Dieter Fox¹²
¹University of Washington ²NVIDIA ³Accenture
Existing navigation systems mostly declare "success" when the robot reaches within a 1-meter radius of a goal. This precision is insufficient for emerging applications where the robot must be positioned precisely relative to an object for downstream tasks such as docking, inspection, and manipulation. To this end, we design and implement Aim-My-Robot (AMR), a high-precision local navigation system that achieves centimeter-level precision without a navigation map or 3D object model. Given a masked image describing the target object and an object-centric relative pose, AMR tracks the object while moving, avoids obstacles, and aligns the robot to the target. AMR shows strong sim2real transfer and can adapt to different robot kinematics and unseen objects with little to no fine-tuning.
The model includes 1) a unified approach to representing multi-modal sensory data, and 2) an action decoding scheme that generates precise and collision-free actions. The reference image and the robot's RGB-D observations are tokenized with an MAE. The current LiDAR scan is tokenized by grouping points into directional bins. The image and LiDAR tokens are fed into the multi-modal context encoder jointly with the look-at-pose tokens (distance to object, angle of approach, and object face) and the embodiment tokens. Finally, the output tokens of the context encoder are cross-attended by the base trajectory decoder and the camera tilt decoder.
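To make the token flow concrete, here is a minimal PyTorch-style sketch of the LiDAR binning and multi-modal context encoding. The module names, embedding size, bin count, and embodiment vocabulary are illustrative assumptions rather than the released implementation; MAE image tokens are assumed to be computed upstream.

```python
import torch
import torch.nn as nn

D = 256          # token embedding size (assumption)
NUM_BINS = 72    # directional LiDAR bins, 5 degrees each (assumption)

class LidarTokenizer(nn.Module):
    """Group the current LiDAR scan into directional bins and embed each bin."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(2, D)  # (closest range, hit fraction) per bin -> token

    def forward(self, ranges, angles):
        # ranges, angles: (N,) points from the current scan, angles in [-pi, pi]
        bins = ((angles + torch.pi) / (2 * torch.pi) * NUM_BINS).long().clamp(0, NUM_BINS - 1)
        feats = torch.zeros(NUM_BINS, 2)
        for b in range(NUM_BINS):
            mask = bins == b
            if mask.any():
                feats[b, 0] = ranges[mask].min()   # closest return in this bin
                feats[b, 1] = mask.float().mean()  # fraction of points in this bin
        return self.embed(feats)                   # (NUM_BINS, D) LiDAR tokens

class ContextEncoder(nn.Module):
    """Fuse image, LiDAR, look-at-pose, and embodiment tokens with self-attention."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.pose_embed = nn.Linear(3, D)            # distance, approach angle, face id
        self.embodiment_embed = nn.Embedding(4, D)   # one token per robot type (assumption)

    def forward(self, image_tokens, lidar_tokens, look_at_pose, embodiment_id):
        # image_tokens: (1, N, D) MAE tokens of reference image and RGB-D observation
        pose_tok = self.pose_embed(look_at_pose).unsqueeze(1)        # (1, 1, D)
        emb_tok = self.embodiment_embed(embodiment_id).unsqueeze(1)  # (1, 1, D)
        ctx = torch.cat([image_tokens, lidar_tokens.unsqueeze(0), pose_tok, emb_tok], dim=1)
        # Output tokens are cross-attended by the trajectory and camera-tilt decoders.
        return self.encoder(ctx)
```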
We import the Habitat Synthetic Scenes Dataset (HSSD) into Isaac Sim and show that the simulated perception data enables strong sim2real transfer. We use a sampling-based planner to generate kinematically feasible and collision-free trajectories at scale. In total, we generate 500k trajectories containing 7.5 million frames.
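For illustration, the sketch below shows the rough shape of such a data-generation loop: sample a look-at goal around an object, reject goals that are blocked, and hand the rest to the planner. The helper names, collision model, and sampling ranges are hypothetical stand-ins for the actual pipeline.

```python
import numpy as np

def lookat_goal_pose(obj_xy, distance, approach_angle):
    """Robot pose `distance` meters from the object along `approach_angle`,
    oriented to face the object (illustrative goal parameterization)."""
    gx = obj_xy[0] + distance * np.cos(approach_angle)
    gy = obj_xy[1] + distance * np.sin(approach_angle)
    gyaw = approach_angle + np.pi                 # look back toward the object
    return np.array([gx, gy, np.arctan2(np.sin(gyaw), np.cos(gyaw))])

def collision_free(p0, p1, obstacles, robot_radius=0.3):
    """Cheap straight-line check against circular obstacles (x, y, r);
    stands in for the full collision checking inside the planner."""
    for t in np.linspace(0.0, 1.0, 50):
        pt = (1 - t) * p0 + t * p1
        for ox, oy, r in obstacles:
            if np.hypot(pt[0] - ox, pt[1] - oy) < r + robot_radius:
                return False
    return True

def sample_goal(start_xy, obj_xy, obstacles, rng):
    """Rejection-sample a reachable look-at goal; the real pipeline then runs a
    sampling-based planner to produce a kinematically feasible trajectory."""
    for _ in range(100):
        goal = lookat_goal_pose(obj_xy,
                                distance=rng.uniform(0.5, 1.5),
                                approach_angle=rng.uniform(-np.pi, np.pi))
        if collision_free(start_xy, goal[:2], obstacles):
            return goal
    return None

rng = np.random.default_rng(0)
goal = sample_goal(np.array([0.0, 0.0]), np.array([3.0, 1.0]),
                   obstacles=[(1.5, 0.5, 0.4)], rng=rng)
```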
AMR achieves centimeter-level distance errors and sub-5-degree rotation errors across diverse object categories and unseen object instances.
For most objects, distance errors are on the order of 1.8–3.1 cm relative to the goal specification under sim2real transfer, with no fine-tuning required. The robot also avoids obstacles, as shown below.
All videos play in real time.
Go to top cabinet
Go to fridge
Go to spam
Go to table
Go to the "Landfill" trash bin
Closing the Fridge Drawer
Go to the back side of the cabinet.
A forklift has non-holonomic kinematics. We generated an additional 500 trajectories with Ackermann-steering kinematics to fine-tune the model trained on HSSD. The forklift has a simulated camera and LiDAR placed at the center of the axle, and we set R = 1.5 m. Since the forklift is too large to fit in any of the indoor scenes, we generated the fine-tuning data in a warehouse environment with randomly placed pallets. We use DAgger to augment the dataset and improve the coverage of the demonstrations.
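For reference, here is a minimal sketch of Ackermann (bicycle-model) kinematics of the kind used to roll out such non-holonomic trajectories; the wheelbase, time step, and control values are illustrative rather than the forklift's actual parameters.

```python
import numpy as np

def ackermann_step(state, v, steer, wheelbase=2.0, dt=0.1):
    """Integrate bicycle-model (Ackermann) kinematics for one time step.

    state = (x, y, yaw); v is forward velocity, steer the front-wheel angle.
    Wheelbase and dt are illustrative, not the forklift's real parameters.
    """
    x, y, yaw = state
    x += v * np.cos(yaw) * dt
    y += v * np.sin(yaw) * dt
    yaw += v * np.tan(steer) / wheelbase * dt          # turning rate depends on steer
    return np.array([x, y, np.arctan2(np.sin(yaw), np.cos(yaw))])

# Roll out a short turning maneuver of the kind found in the fine-tuning data.
state = np.zeros(3)
for _ in range(50):
    state = ackermann_step(state, v=0.8, steer=0.3)
```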
Multiple objects with the same appearance cause the robot to go to the wrong object.
The initial object mask is too small for the robot to correctly identify the object as it gets close.
The robot misidentifies the target object.