g-MotorCortex: Grounding Robot Policies with Visuomotor Language Guidance
Anonymous authors*
Under review as a conference paper at ICLR 2025
Abstract
Recent advances in natural language processing and computer vision have shown great potential for understanding the underlying dynamics of the world from large-scale internet data. However, translating this knowledge into robotic systems remains an open challenge, given the scarcity of human-robot interactions and the lack of large-scale datasets of real-world robotic data. Previous robot learning approaches such as behavior cloning and reinforcement learning have shown strong capabilities in learning robotic skills from human demonstrations or from scratch in specific environments. However, these approaches often require task-specific demonstrations or the design of complex simulation environments, which limits the development of generalizable and robust policies for new settings. To address these limitations, we propose an agent-based framework for grounding robot policies to the current context, considering the constraints of the current robot and its environment, using visuomotor-grounded language guidance. The proposed framework is composed of a set of conversational agents designed for specific roles: a high-level advisor, a visual grounding agent, a monitoring agent, and a robotic agent. Given a base policy, the agents collectively generate guidance at run time to shift the action distribution of the base policy towards more desirable future states. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates, both in simulation and in real-world experiments, without the need for additional human demonstrations or extensive exploration.
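To make the run-time guidance loop described above concrete, the following is a minimal Python sketch of how role-specific agents could shift a base policy's actions toward advised targets. All class and function names, the gym-style environment API, and the convex blending rule are illustrative assumptions for exposition, not the released g-MotorCortex implementation.

# Minimal sketch of the agent-based guidance loop (names and APIs are assumed).
import numpy as np

class BasePolicy:
    """Pre-trained visuomotor policy (e.g., a behavior-cloned manipulation model)."""
    def act(self, observation) -> np.ndarray:
        raise NotImplementedError

class AdvisorAgent:
    """High-level advisor: decomposes the language instruction into subgoals."""
    def plan(self, instruction: str, scene_description: str) -> list[str]:
        raise NotImplementedError

class VisualGroundingAgent:
    """Grounds subgoal language to a target in the current observation."""
    def ground(self, subgoal: str, observation) -> np.ndarray:
        raise NotImplementedError

class MonitoringAgent:
    """Monitors execution and decides when the current subgoal is complete."""
    def subgoal_done(self, subgoal: str, observation) -> bool:
        raise NotImplementedError

def run_guided_episode(policy, advisor, grounder, monitor, instruction, env,
                       guidance_weight: float = 0.01):
    """Roll out the base policy while blending in run-time guidance.

    Assumes actions are absolute end-effector targets, so the grounded
    guidance target can be blended directly with the base-policy action.
    """
    obs = env.reset()
    subgoals = advisor.plan(instruction, env.describe_scene())
    for subgoal in subgoals:
        while not monitor.subgoal_done(subgoal, obs):
            base_action = policy.act(obs)               # low-level motion prior
            guidance = grounder.ground(subgoal, obs)    # advised end-effector target
            # Shift the base action toward the guidance target (assumed blend).
            action = (1.0 - guidance_weight) * base_action + guidance_weight * guidance
            obs, _, done, _ = env.step(action)
            if done:
                return obs
    return obs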
Real-World Experiments
g-MotorCortex is evaluated using a UFACTORY Lite 6 Robotic Arm and an Intel RealSense D435i RGBD camera. Here we show the system performing two tasks:
Reach for chess piece (right video): Given a cluttered scene with many similar objects, we evaluate whether the multi-granular perception framework can effectively guide the agent to identify and reach for the appropriate target. We implement this perceptual grounding and reaching task on a standard chessboard, where the agent must identify and reach for one of the chess pieces specified by a natural language instruction.
Sequenced Multi-button Press (left video): Here we demonstrate the ability of our framework to learn tasks from scratch, without additional data. The agent must use its end-effector to press multiple real buttons on the workspace in a particular order. The video shows the actions produced by g-MotorCortex's guidance code, generated in a single iteration (an illustrative sketch of such guidance code is shown below).
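For illustration, guidance code for the sequenced button-press task could look roughly like the sketch below. The helper functions (locate, move_above, press) and the button ordering are assumed for exposition; the code actually generated by the framework may differ.

# Hypothetical guidance code for pressing buttons in a specified order.
def press_buttons_in_order(robot, perception, button_names):
    for name in button_names:
        target = perception.locate(f"{name} button")  # 3D position from visual grounding
        robot.move_above(target, height=0.05)         # approach from above, gripper closed
        robot.press(target)                           # move down until contact, then retract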
Simulation Experiments
Example - Task: "Press the maroon button, then press the green button, then press the navy button"
Pretrained Act3d failing
Act3d with no guidance: the policy fails to press the last (navy) button, but correctly approaches the first two buttons, reaching them from above with the gripper closed.
100% guidance
Guidance only (overwriting the base policy): the sequence of movements is correct, but the initial guidance code does not account for the fact that the buttons should be approached from above.
Act3d + 1% guidance
Act3d with 1% guidance: the modified policy captures both the low-level motion of the pre-trained policy and the high-level guidance provided, successfully pressing the buttons in the correct sequence.
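One way to read the three conditions above is as a single blending weight applied to the base-policy action and the guidance target. The convex-combination rule and the example end-effector targets below are assumptions for illustration only; the actual mechanism operates on the policy's action distribution and may differ.

# Sketch relating the three conditions to a blending weight w (assumed rule).
import numpy as np

def blended_action(base_action: np.ndarray, guidance_action: np.ndarray, w: float) -> np.ndarray:
    """w = 0.0 -> pre-trained Act3d only; w = 1.0 -> guidance only; w = 0.01 -> Act3d + 1% guidance."""
    return (1.0 - w) * base_action + w * guidance_action

base = np.array([0.40, 0.10, 0.22])   # hypothetical Act3d target, approaching from above
guide = np.array([0.40, 0.10, 0.15])  # hypothetical guidance target on the button surface

for w in (0.0, 1.0, 0.01):
    print(w, blended_action(base, guide, w))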