DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment
Abstract: In this paper, we propose DoReMi, a novel language model grounding framework that enables immediate Detection and Recovery from Misalignments between plan and execution. Specifically, we leverage LLMs to play a dual role, aiding not only in high-level planning but also generating constraints that can indicate misalignment during execution. Then vision language models (VLMs) are utilized to detect constraint violations continuously. Our pipeline can monitor the low-level execution and enable timely recovery if certain plan-execution misalignment occurs. Experiments on various complex tasks including robot arms and humanoid robots demonstrate that our method can lead to higher task success rates and shorter task completion times.
Method:
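As a rough illustration of the loop described in the abstract, the sketch below has a planner propose skills together with constraints, and a detector check those constraints at every tick so that a violation interrupts the current skill and triggers a re-plan. This is a minimal sketch under stated assumptions, not the authors' implementation: `Step`, `run_with_monitoring`, `toy_plan`, and `toy_violated` are hypothetical names, the planner and detector are plain Python stand-ins for the LLM and the VLM, and the tick counts are made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    skill: str              # low-level skill chosen by the planner
    constraints: List[str]  # constraints that should hold while the skill runs


def run_with_monitoring(
    plan: Callable[[str, List[str]], List[Step]],  # stand-in for the LLM planner
    violated: Callable[[str, int], bool],          # stand-in for the VLM detector
    task: str,
    ticks_per_skill: int = 5,
    max_replans: int = 3,
) -> List[str]:
    """Execute a plan, checking constraints at every tick and re-planning on violation."""
    history: List[str] = []
    steps = plan(task, history)
    replans = 0
    tick = 0
    while steps:
        step = steps.pop(0)
        interrupted = False
        for _ in range(ticks_per_skill):
            tick += 1
            # Continuous detection: query the detector during execution,
            # not only after the skill has finished.
            if any(violated(c, tick) for c in step.constraints):
                history.append(f"violated during {step.skill}")
                interrupted = True
                break
        if not interrupted:
            history.append(f"done {step.skill}")
            continue
        if replans >= max_replans:
            history.append("gave up")
            break
        replans += 1
        steps = plan(task, history)  # immediate recovery: ask the planner again
    return history


def toy_plan(task: str, history: List[str]) -> List[Step]:
    # Hypothetical planner: always proposes the same two-step plan.
    return [
        Step("pick block", []),
        Step("place block", ["robot is holding the block"]),
    ]


def toy_violated(constraint: str, tick: int) -> bool:
    # Hypothetical detector: the block is dropped once, at tick 7.
    return constraint == "robot is holding the block" and tick == 7


if __name__ == "__main__":
    for line in run_with_monitoring(toy_plan, toy_violated, "put the block in the bowl"):
        print(line)
```

In this toy run the drop interrupts "place block" partway through, and the planner is queried again straight away instead of waiting for the skill to time out.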
Experiments - Robot Arm
Task 1: Pick and place with random drops
Baseline: Longer execution time.
Re-plans only after the previous trajectory has finished.
DoReMi (Ours): Shorter execution time.
Immediate detection and re-plan.
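The difference in execution time can be illustrated with a toy timing model. If a failure happens partway through a skill, a system that re-plans only after the skill finishes notices it at the end of the skill, while a system that checks periodically notices it at the next check. The function name and all numbers below are hypothetical and are not measurements from the experiments.

```python
import math
from typing import Optional


def detection_time(
    skill_duration: float,
    failure_time: float,
    check_period: Optional[float],
) -> float:
    """Time (from skill start) at which a failure is noticed.

    check_period=None models re-planning only after the skill has finished;
    otherwise the failure is noticed at the first periodic check at or after it.
    """
    if check_period is None:
        return skill_duration
    next_check = math.ceil(failure_time / check_period) * check_period
    return min(next_check, skill_duration)


# Hypothetical numbers: a 10 s skill, a drop 3.2 s in, checks every 0.5 s.
baseline = detection_time(10.0, 3.2, None)
monitored = detection_time(10.0, 3.2, 0.5)
print(baseline, monitored)
```

Under these assumed numbers, the monitored system notices the drop several seconds earlier, and that saved time is what shortens overall task completion.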
Task 2: Stack blocks in order with placement noise
Baseline
Repeating the previous step leads to failure.
DoReMi (Ours)
Immediate re-plan and recovery from collapse lead to success.
Experiments - Humanoid Robot
1. Task 1: Go forward with unexpected obstacles
(2x speed)
Baseline
Delayed replanning leads to failure.
DoReMi (Ours)
Immediate re-plan and recovery lead to success.
2. Task 2: Move box with random drop
(2x speed)
Baseline:
Re-plans only when the previous skill has finished.
Completes the task in 68 s.
DoReMi (Ours): More efficient.
Immediate re-plan and recovery.
Completes the task in 43 s.
3. Task 3: Prepare food (Complicated task!) with pick failure and random drop
Collect 5 demonstrations in simple scenarios with only fruit objects and plain backgrounds (as shown on the right).
Fine-tune the vision-language model on them.
Test with unseen objects (e.g., vegetables, junk food, and seafood) and unseen backgrounds (e.g., random background colors)!
Unseen objects and backgrounds!
Even benefits unseen tasks! Discovers the box drop more quickly!
Zero-shot transferred VLM
Does not discover the drop until the box has disappeared from view.
Few-shot finetuned VLM
Discovers the drop immediately!