NOD-TAMP: Generalizable Long-Horizon Planning with Neural Object Descriptors


Abstract

Solving complex manipulation tasks in household and factory settings remains challenging due to long-horizon reasoning, fine-grained interactions, and broad object and scene diversity. Learning skills from demonstrations can be an effective strategy, but such methods often have limited generalizability beyond training data and struggle to solve long-horizon tasks. To overcome this, we propose to synergistically combine two paradigms: Neural Object Descriptors (NODs), which produce generalizable object-centric features, and Task and Motion Planning (TAMP) frameworks, which chain short-horizon skills to solve multi-step tasks. We introduce NOD-TAMP, a TAMP-based framework that extracts short manipulation trajectories from a handful of human demonstrations, adapts these trajectories using NOD features, and composes them to solve a broad range of long-horizon, contact-rich tasks. NOD-TAMP solves existing manipulation benchmarks with a handful of demonstrations and significantly outperforms prior NOD-based approaches on new tabletop manipulation tasks that require diverse generalization. Finally, we deploy NOD-TAMP on a number of real-world tasks, including tool use and high-precision insertion.

Real-World Results

NOD-TAMP solves long-horizon tasks using just 1 demonstration per skill.

NOD-TAMP generalizes to diverse spatial configurations.

NOD-TAMP generalizes zero-shot to different object geometries.

With just 1 demonstration, NOD-TAMP solves fine-grained tasks (slot diameter < 1 cm).

NOD-TAMP solves long-horizon tasks that require fine-grained motions (e.g., inserting a coffee pod) with just 1 demonstration of the full task.

NOD-TAMP generalizes zero-shot to novel object shapes and placements.

NOD-TAMP repurposes skills from other tasks (e.g., reusing the place-mug skill from the Make Coffee task) to solve new tasks.

NOD-TAMP performs geometric reasoning to differentiate between skill variants (e.g., grasping the mug by its handle or its rim) to reach different goals.

NOD-TAMP can handle new tasks with unseen shapes and configurations in a zero-shot setting. 

NOD-TAMP leverages geometric reasoning over grasping strategies (e.g., picking up the tool by its junction or its handle) to achieve different tool uses.

Method Overview


Given a goal specification, a task planner generates a sequence of skill types. A skill reasoner then searches for the combination of skill demonstrations that maximizes compatibility across the sequence. Using learned neural object descriptors (e.g., NDFs), each selected skill demonstration is adapted to the current scene. Finally, the adapted skills are executed in sequence.
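To make the pipeline concrete, here is a minimal illustrative sketch. Every name in it (SkillDemo, plan_skill_sequence, chain_cost, select_demos, adapt) is a hypothetical stand-in, and the feature-distance scoring and adaptation step are simplified assumptions, not the authors' actual implementation.

```python
# A minimal sketch of the pipeline above. All names are hypothetical
# stand-ins for illustration, not the authors' actual API.
from dataclasses import dataclass
from itertools import product

import numpy as np


@dataclass
class SkillDemo:
    skill_type: str             # e.g. "grasp_mug"
    trajectory: list            # short manipulation trajectory from a human demo
    pre_feature: np.ndarray     # NOD feature summarizing the skill's precondition
    post_feature: np.ndarray    # NOD feature summarizing the skill's postcondition


def plan_skill_sequence(goal: str) -> list:
    """Task planner: map a goal specification to an ordered list of skill types."""
    plans = {"mug_on_rack": ["grasp_mug", "place_mug"]}
    return plans[goal]


def chain_cost(chain) -> float:
    """Pre-/post-condition matching: summed NOD feature distance between each
    demo's post-condition and its successor's pre-condition."""
    return sum(float(np.linalg.norm(a.post_feature - b.pre_feature))
               for a, b in zip(chain, chain[1:]))


def select_demos(skill_types, library):
    """Skill reasoner: pick the demo combination with the lowest total cost."""
    return min(product(*(library[t] for t in skill_types)), key=chain_cost)


def adapt(demo: SkillDemo, scene_cloud):
    """Stand-in for NOD-based adaptation (e.g., NDF pose optimization); a real
    system would retarget the demonstrated poses to the observed objects."""
    return demo.trajectory


# Usage: a toy library with two demos per skill type and random 8-D features.
rng = np.random.default_rng(0)
library = {
    t: [SkillDemo(t, [f"{t}_waypoints_{i}"], rng.normal(size=8), rng.normal(size=8))
        for i in range(2)]
    for t in ("grasp_mug", "place_mug")
}
scene_cloud = None  # placeholder for the observed point cloud
for demo in select_demos(plan_skill_sequence("mug_on_rack"), library):
    trajectory = adapt(demo, scene_cloud)  # adapted skills run in sequence
```

Note that demonstration selection is scored purely in NOD feature space, which is what allows the same demonstrations to be re-chained and repurposed for new goals.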

Skill Reasoning Visualization

NOD feature distances for different skill combinations in real-world trials. A lower score indicates more compatible skills (better pre-/post-condition matching).
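One plausible formalization of this score (the notation and the choice of Euclidean norm are our assumptions; it mirrors the chain_cost sketch in the Method Overview): for a candidate chain of demonstrations $d_1, \dots, d_K$,

$$C(d_1, \dots, d_K) = \sum_{k=1}^{K-1} \left\lVert \phi_{\mathrm{post}}(d_k) - \phi_{\mathrm{pre}}(d_{k+1}) \right\rVert_2,$$

where $\phi_{\mathrm{pre}}$ and $\phi_{\mathrm{post}}$ are the NOD features of a demonstration's pre- and post-conditions; the combination with the lowest total cost is the most compatible chain.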

Simulation Results

By extracting EIGHT skills from FOUR demos that manipulate just ONE instance each of a mug, a frame, and a tool, NOD-TAMP solves hundreds of tasks with diverse shapes, configurations, and task goals.

Key frames of planning and task execution.

(Six task rollouts, each shown in four key frames: Stage 1 through Stage 4.)