Enabling robots to explore and act in unfamiliar environments under ambiguous human instructions by interactively identifying task-relevant objects (e.g., cups or beverages for "I'm thirsty") remains challenging for existing vision-language model (VLM)-based methods. This challenge stems from inefficient reasoning and a lack of environmental interaction, which hinder real-time task planning and decision-making.
To address this, we propose Affordance-Aware Interactive Decision-Making and Execution for Ambiguous Instructions (AIDE), a dual-stream framework that integrates interactive exploration with vision-language reasoning.
1) Multi-Stage Inference (MSI): Given the input task, the MSI stream uses a Multimodal CoT Module (MM-CoT) and an Exploration Policy to generate a keyframe-based task planning result.
2) Instruction-Tool Relationship Space: By scoring the input instruction along affordance dimensions, the planning result is projected into the Instruction-Tool Relationship Space, enabling sufficient cross-modal understanding of the instruction and robustness to hallucinations arising from GPT-5 reasoning.
3) Accelerated Decision-Making (ADM): Building on the projection, the ADM stream employs an Efficient Retrieval Scheme (ERS) and the Exploration Policy for real-time, closed-loop task execution over continuous frames (a sketch of the dual-stream flow follows this list).
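To make the dual-stream interplay concrete, here is a minimal Python sketch of the control flow: MSI runs once on a keyframe to seed a plan pool, the instruction is scored into the affordance space, and ADM then serves subsequent frames by retrieval. All names (`msi_infer`, `score_affordance`, `adm_step`) and the 4-dimensional affordance vocabulary are hypothetical placeholders for illustration, not the released AIDE API.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    tool: str                 # required tool label, e.g. "cup"
    part: str                 # critical part of the tool, e.g. "handle"
    bbox: tuple               # grounded location in the keyframe
    affordance: tuple = ()    # affordance score vector used for retrieval

def msi_infer(task: str, keyframe) -> Plan:
    """MSI stream (slow path): multimodal CoT + grounding on one keyframe."""
    # In the real system this queries the VLM and a grounding submodule;
    # here we return a canned plan so the sketch runs end to end.
    return Plan(tool="cup", part="handle", bbox=(120, 80, 60, 90))

def score_affordance(task: str) -> tuple:
    """Project an instruction into the Instruction-Tool Relationship Space."""
    vocab = ("cut", "clean", "drink", "heat")  # assumed affordance axes
    return tuple(float(v in task.lower()) for v in vocab)

def adm_step(task: str, frame, pool: list) -> Plan:
    """ADM stream (fast path): retrieve the closest cached plan per frame."""
    query = score_affordance(task)
    return max(pool, key=lambda p: sum(a * b for a, b in zip(p.affordance, query)))

# Bootstrap: MSI seeds the pool once; ADM then answers at frame rate.
task = "I need something to drink"
plan = msi_infer(task, keyframe=None)
plan.affordance = score_affordance(task)
pool = [plan]
for frame in range(3):                       # stand-in for the camera stream
    print(adm_step(task, frame, pool).tool)  # -> "cup"
```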
Keyframe-Based Task Inference
The MSI stream receives the input task and, through prompt-engineered multimodal CoT, jointly leverages GPT-5's multimodal reasoning capability and the grounding submodule's localization ability to produce the task grounding result, including labels, images, and locations for the required tool and its critical parts. Together, these outputs enable task inference focused on exploration and instruction completion within the keyframe scene.
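A plausible shape of this step is sketched below. The prompt wording and the `query_vlm`/`ground` callables are assumptions for illustration, not the paper's verbatim prompt or API.

```python
import json

# Illustrative CoT prompt skeleton; the actual prompt engineering is the paper's.
MM_COT_PROMPT = """You are a robot task planner. Given the instruction and the
keyframe image, reason step by step:
1. What does the user implicitly need?
2. Which tool in the scene satisfies that need?
3. Which part of the tool must the gripper contact?
Return JSON: {"tool": ..., "critical_part": ..., "reasoning": ...}"""

def msi_ground(instruction: str, keyframe: str, query_vlm, ground) -> dict:
    """Chain CoT reasoning with grounding: the VLM names the tool and its
    critical part; the grounding submodule localizes both in the keyframe."""
    reply = query_vlm(prompt=f"{MM_COT_PROMPT}\nInstruction: {instruction}",
                      image=keyframe)
    parsed = json.loads(reply)
    return {
        "tool": parsed["tool"],
        "part": parsed["critical_part"],
        # The grounding submodule maps each label to a location in the keyframe.
        "tool_bbox": ground(keyframe, parsed["tool"]),
        "part_bbox": ground(keyframe, parsed["critical_part"]),
    }
```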
Interactive Closed-Loop Decision and Execution
The ADM stream enables closed-loop, real-time task execution by continuously performing task planning and updating the scene through robot interaction. Given the input task, the ADM performs task planning through the ERS and the Exploration Policy, where the ERS retrieves from a pool of candidate and MSI-generated task planning results. Operating in a closed loop with real-time input updates at 10 Hz, ADM uses the exploration results from task planning to determine the robot's motion strategy; as the robot executes that strategy, the input scene is continuously updated.
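One possible shape of this loop is sketched below; the interfaces (`camera`, `retrieve`, `explore`, `robot`, `done`) are hypothetical stand-ins for the camera stream, the ERS, the Exploration Policy, the motion controller, and a termination check.

```python
import time

CONTROL_PERIOD = 0.1  # 10 Hz real-time update rate

def adm_loop(task, camera, retrieve, explore, robot, done):
    """Closed loop: observe, retrieve a plan, choose a motion, act, repeat."""
    while True:
        t0 = time.monotonic()
        frame = camera()               # real-time scene update
        plan = retrieve(task, frame)   # ERS over cached + MSI-generated plans
        if done(plan, frame):
            break
        motion = explore(plan, frame)  # Exploration Policy -> motion strategy
        robot(motion)                  # executing the motion changes the scene
        # Sleep out the remainder of the 100 ms control period.
        time.sleep(max(0.0, CONTROL_PERIOD - (time.monotonic() - t0)))
```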
Affordance vector distributions for instructions and tools.
Taking cutting-, cleaning-, drinking-, and heating-related tasks as representative examples, the figure shows that the affordance vector distributions of instructions and tools largely overlap. This indicates that affordance vectors scored from instructions and lying near cluster centers capture category-level affordances shared by instructions and their required tools, naturally inducing many-to-many instruction-tool, instruction-instruction, and tool-tool relationships across different task planning results.
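The many-to-many structure can be illustrated with a toy example; the 4-dimensional affordance axes, the similarity threshold, and the scores below are invented for clarity, not taken from the paper.

```python
import math

# Hypothetical affordance axes: (cut, clean, drink, heat).
instructions = {"I'm thirsty":      (0.0, 0.0, 1.0, 0.1),
                "warm up the soup": (0.0, 0.0, 0.1, 1.0)}
tools        = {"cup":       (0.0, 0.1, 1.0, 0.2),
                "kettle":    (0.0, 0.0, 0.8, 0.7),
                "microwave": (0.0, 0.0, 0.0, 1.0)}

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

# Vectors near the same cluster center match many-to-many: the kettle serves
# both instructions, and "warm up the soup" admits two different tools.
for name, v in instructions.items():
    matches = [t for t, w in tools.items() if cosine(v, w) > 0.7]
    print(f"{name} -> {matches}")
# I'm thirsty -> ['cup', 'kettle']
# warm up the soup -> ['kettle', 'microwave']
```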
Zero-shot performance comparison on G-Dataset and R-Dataset. We compare the zero-shot task planning performance of AIDE against eight baselines on the G-Dataset and R-Dataset. AIDE achieves task planning success rates above 80%, with 83% on the G-Dataset and 88% on the R-Dataset, exceeding all baselines by over 30%. It is also noteworthy that AIDE achieves tool selection success rates of approximately 95% (94.5% on the G-Dataset and 95.0% on the R-Dataset), outperforming all baselines. This validates the effectiveness and hallucination robustness of the MM-CoT design in the MSI stream, as well as the efficiency and well-chosen hyperparameters of the affordance-analysis-based ERS in the ADM stream.
Ablation study on different components of AIDE. Ablation studies show that incorporating the ADM stream enables AIDE to achieve efficient affordance-based retrieval, with task planning success rate gains of 2.5% and 1% on the G-Dataset and R-Dataset, respectively, and a 320× FPS increase to nearly 10 Hz. Incorporating the MSI stream allows AIDE to leverage GPT-5-based scene understanding, yielding 2%-15% improvements across success rate metrics. These results highlight both the distinct roles of MSI and ADM and their complementary synergy in improving the accuracy and efficiency of task planning and execution.
The deployment platform for the real-robot experiments is an Advantech industrial control computer without a GPU. The robotic arm is a 6-DOF FAIRINO FR-3, the gripper is the Agibot omnipicker x1, and perception uses a RealSense D435 camera.