Anonymous Author(s)
Affiliation
We present DEF-oriCORN, a framework for language-directed manipulation tasks. By leveraging a novel object-based scene representation and a diffusion-model-based state estimation algorithm, our framework enables efficient and robust manipulation planning in response to verbal commands, even in tightly packed environments, from sparse camera views, and without any demonstrations. Unlike traditional representations, ours affords efficient collision checking and language grounding. Compared to state-of-the-art baselines, our framework achieves superior estimation and motion planning performance from sparse RGB images and generalizes zero-shot to real-world scenarios with diverse materials, including transparent and reflective objects, despite being trained exclusively in simulation.
Real-time language-directed pick-and-place with DEF-oriCORN
  Language Directed Pickandplace.mp4
  Language Directed Pickandplace Highlights.mp4

Motion planning with ShaPO and DEF-oriCORN
  Motion Planning.mp4

Language-directed manipulation in mobile manipulation
  pnp_v2.mp4
  push_eng.mp4