Aim My Robot: Precision Local Navigation to Any Object
Xiangyun Meng¹, Xuning Yang², Sanghun Jung¹, Fabio Ramos²
Srid Sadhan Jujjavarapu³, Sanjoy Paul³ and Dieter Fox¹²
¹University of Washington ²NVIDIA ³Accenture
Existing navigation systems mostly declare "success" when the robot reaches within a 1-meter radius of a goal. This precision is insufficient for emerging applications where the robot must be positioned precisely relative to an object for downstream tasks such as docking, inspection, and manipulation. To this end, we design and implement Aim-My-Robot (AMR), a high-precision local navigation system that achieves centimeter-level precision without a navigation map or 3D object model. Given a masked image describing the target object and an object-centric relative pose, AMR tracks the object while moving, avoids obstacles, and aligns the robot to the target. AMR shows strong sim2real transfer and can adapt to different robot kinematics and unseen objects with little to no fine-tuning.
The model includes 1) a unified approach to representing multi-modal sensory data, and 2) an action decoding scheme that generates precise and collision-free actions. The reference image and the robot's RGB-D observations are tokenized with an MAE. The current LiDAR scan is tokenized by grouping points into directional bins. The image and LiDAR tokens are fed into the multi-modal context encoder jointly with the look-at-pose tokens (distance to object, angle of approach, and object face) and the embodiment tokens. Finally, the output tokens of the context encoder are cross-attended by the base trajectory decoder and the camera tilt decoder.
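To make the token flow concrete, here is a minimal PyTorch-style sketch of the LiDAR binning and multi-modal context encoding. The module names, embedding size, bin count, and embodiment vocabulary are illustrative assumptions rather than the released implementation; MAE image tokens are assumed to be computed upstream.

```python
import torch
import torch.nn as nn

D = 256          # token embedding size (assumption)
NUM_BINS = 72    # directional LiDAR bins, 5 degrees each (assumption)

class LidarTokenizer(nn.Module):
    """Group the current LiDAR scan into directional bins and embed each bin."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(2, D)  # (closest range, hit fraction) per bin -> token

    def forward(self, ranges, angles):
        # ranges, angles: (N,) points from the current scan, angles in [-pi, pi]
        bins = ((angles + torch.pi) / (2 * torch.pi) * NUM_BINS).long().clamp(0, NUM_BINS - 1)
        feats = torch.zeros(NUM_BINS, 2)
        for b in range(NUM_BINS):
            mask = bins == b
            if mask.any():
                feats[b, 0] = ranges[mask].min()   # closest return in this bin
                feats[b, 1] = mask.float().mean()  # fraction of points in this bin
        return self.embed(feats)                   # (NUM_BINS, D) LiDAR tokens

class ContextEncoder(nn.Module):
    """Fuse image, LiDAR, look-at-pose, and embodiment tokens with self-attention."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.pose_embed = nn.Linear(3, D)            # distance, approach angle, face id
        self.embodiment_embed = nn.Embedding(4, D)   # one token per robot type (assumption)

    def forward(self, image_tokens, lidar_tokens, look_at_pose, embodiment_id):
        # image_tokens: (1, N, D) MAE tokens of reference image and RGB-D observation
        pose_tok = self.pose_embed(look_at_pose).unsqueeze(1)        # (1, 1, D)
        emb_tok = self.embodiment_embed(embodiment_id).unsqueeze(1)  # (1, 1, D)
        ctx = torch.cat([image_tokens, lidar_tokens.unsqueeze(0), pose_tok, emb_tok], dim=1)
        # Output tokens are cross-attended by the trajectory and camera-tilt decoders.
        return self.encoder(ctx)
```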
We import the Habitat Synthetic Scenes Dataset (HSSD) into Isaac Sim and show that the simulated perception data enables strong sim2real transfer. We use a sampling-based planner to generate kinematically feasible and collision-free trajectories at scale. In total, we generate 500k trajectories containing 7.5 million frames.
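For illustration, the sketch below shows the rough shape of such a data-generation loop: sample a look-at goal around an object, reject goals that are blocked, and hand the rest to the planner. The helper names, collision model, and sampling ranges are hypothetical stand-ins for the actual pipeline.

```python
import numpy as np

def lookat_goal_pose(obj_xy, distance, approach_angle):
    """Robot pose `distance` meters from the object along `approach_angle`,
    oriented to face the object (illustrative goal parameterization)."""
    gx = obj_xy[0] + distance * np.cos(approach_angle)
    gy = obj_xy[1] + distance * np.sin(approach_angle)
    gyaw = approach_angle + np.pi                 # look back toward the object
    return np.array([gx, gy, np.arctan2(np.sin(gyaw), np.cos(gyaw))])

def collision_free(p0, p1, obstacles, robot_radius=0.3):
    """Cheap straight-line check against circular obstacles (x, y, r);
    stands in for the full collision checking inside the planner."""
    for t in np.linspace(0.0, 1.0, 50):
        pt = (1 - t) * p0 + t * p1
        for ox, oy, r in obstacles:
            if np.hypot(pt[0] - ox, pt[1] - oy) < r + robot_radius:
                return False
    return True

def sample_goal(start_xy, obj_xy, obstacles, rng):
    """Rejection-sample a reachable look-at goal; the real pipeline then runs a
    sampling-based planner to produce a kinematically feasible trajectory."""
    for _ in range(100):
        goal = lookat_goal_pose(obj_xy,
                                distance=rng.uniform(0.5, 1.5),
                                approach_angle=rng.uniform(-np.pi, np.pi))
        if collision_free(start_xy, goal[:2], obstacles):
            return goal
    return None

rng = np.random.default_rng(0)
goal = sample_goal(np.array([0.0, 0.0]), np.array([3.0, 1.0]),
                   obstacles=[(1.5, 0.5, 0.4)], rng=rng)
```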
AMR achieves centimeter-level distance errors and sub-5-degree rotation errors across diverse object categories and unseen object instances.
For most objects, distance errors are on the order of 1.8–3.1 cm relative to the goal specification under sim2real transfer, with no fine-tuning required. The robot also avoids obstacles, as shown below.
All videos play in real time.
Go to top cabinet
Go to fridge
Go to spam
Go to table
Go to the "Landfill" trash bin
Closing the Fridge Drawer
Go to the back side of the cabinet.
A forklift has non-holonomic kinematics. We generated an additional 500 trajectories with Ackermann-steering kinematics to fine-tune the model trained on HSSD. The forklift has a simulated camera and LiDAR placed at the center of the axle, and we set R = 1.5 m. Since the forklift is too large to fit in any of the indoor scenes, we generated the fine-tuning data in a warehouse environment with randomly placed pallets. We use DAgger to augment the dataset and improve the coverage of the demonstrations.
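For reference, here is a minimal sketch of Ackermann (bicycle-model) kinematics of the kind used to roll out such non-holonomic trajectories; the wheelbase, time step, and control values are illustrative rather than the forklift's actual parameters.

```python
import numpy as np

def ackermann_step(state, v, steer, wheelbase=2.0, dt=0.1):
    """Integrate bicycle-model (Ackermann) kinematics for one time step.

    state = (x, y, yaw); v is forward velocity, steer the front-wheel angle.
    Wheelbase and dt are illustrative, not the forklift's real parameters.
    """
    x, y, yaw = state
    x += v * np.cos(yaw) * dt
    y += v * np.sin(yaw) * dt
    yaw += v * np.tan(steer) / wheelbase * dt          # turning rate depends on steer
    return np.array([x, y, np.arctan2(np.sin(yaw), np.cos(yaw))])

# Roll out a short turning maneuver of the kind found in the fine-tuning data.
state = np.zeros(3)
for _ in range(50):
    state = ackermann_step(state, v=0.8, steer=0.3)
```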
Multiple objects with the same appearance cause the robot to go to the wrong object.
The initial object mask is too small for the robot to correctly identify the object as it gets close.
The robot misidentifies the target object.