CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
Abstract
While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL utilizes LLMs not as direct controllers but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (Model Predictive Path Integral control, MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a Vision-Language Model provides semantic priors for environmental dynamics (e.g., mass and friction estimates), which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL in simulation and on real-world hardware across challenging and novel tasks, such as "flipping objects against walls" by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art vision-language-action (VLA) and foundation-model-based planner baselines, improving success rates by over 50% on average in unseen contact-rich scenarios and effectively handling sim-to-real gaps through its adaptive physical understanding.
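To make the control loop concrete, below is a minimal Python sketch of the two inner mechanisms the abstract describes: an MPPI planner that optimizes an LLM-synthesized cost, and a simple online system-identification update that refines a VLM-provided friction prior. The toy dynamics, the example cost, and every name here (toy_dynamics, llm_generated_cost, refine_friction) are hypothetical illustrations under heavy simplifying assumptions, not CoRAL's actual code; in the full system the LLM emits the cost function itself and identification runs on real interaction data.

import numpy as np

# Toy state: [x, y, contact_force]. Everything below is an illustrative
# assumption, not CoRAL's implementation.

def toy_dynamics(x, u, mu_friction=0.4, dt=0.05):
    # Hypothetical stand-in for the simulator: friction scales how much
    # of the commanded planar velocity is realized.
    pos = x[:2] + dt * (1.0 - mu_friction) * u
    contact_force = max(0.0, -pos[1])  # crude proxy: penetration below y=0
    return np.array([pos[0], pos[1], contact_force])

def llm_generated_cost(x, goal=(0.5, 0.3), contact_weight=2.0):
    # Example of the kind of objective an LLM might synthesize for a
    # contact-rich task: reach the goal while keeping contact force up.
    dist = np.linalg.norm(x[:2] - np.asarray(goal))
    return dist + contact_weight * max(0.0, 1.0 - x[2])

def mppi_plan(dynamics, cost_fn, x0, horizon=20, n_samples=256,
              noise_sigma=0.5, temperature=1.0, n_ctrl=2):
    # Vanilla MPPI: sample perturbed control sequences, roll them out,
    # and return the exponentially weighted average sequence.
    u_nom = np.zeros((horizon, n_ctrl))
    du = np.random.randn(n_samples, horizon, n_ctrl) * noise_sigma
    costs = np.zeros(n_samples)
    for k in range(n_samples):
        x = x0.copy()
        for t in range(horizon):
            x = dynamics(x, u_nom[t] + du[k, t])
            costs[k] += cost_fn(x)
    w = np.exp(-(costs - costs.min()) / temperature)
    w /= w.sum()
    return u_nom + np.tensordot(w, du, axes=1)

def refine_friction(mu_estimate, observed_slip, predicted_slip, gain=0.1):
    # Online system identification in its simplest form: nudge the VLM's
    # friction prior toward the value that explains the observed motion.
    return mu_estimate + gain * (observed_slip - predicted_slip)

# One planning step starting from the VLM's semantic prior.
mu = 0.4  # e.g., prior for "wooden block on a metal table"
x0 = np.array([0.0, 0.1, 0.0])
u_seq = mppi_plan(lambda x, u: toy_dynamics(x, u, mu_friction=mu),
                  llm_generated_cost, x0)
mu = refine_friction(mu, observed_slip=0.12, predicted_slip=0.10)
print(u_seq[0], mu)  # first control to execute, updated friction estimate

In the full framework, the slow LLM/VLM calls sit outside this sampling loop; only the MPPI rollout and the lightweight parameter update run at control rate, which is what allows reactive execution despite slow foundation-model inference.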
Video Summary
Overview
Tasks
Simulation Results and Baseline Comparisons
In the zero-shot setting, CoRAL matches state-of-the-art VLAs on classical pick-and-place tasks (T2–T3) and substantially outperforms them on contact-rich, force-critical manipulation (T1, T4–T6). CoRAL also yields a clear improvement over the learned cost-generation baseline (L2R), highlighting the benefit of our approach for robust contact reasoning. Compared with the expert-designed single-stage cost, CoRAL is competitive on shorter-horizon tasks (T2–T5) and notably stronger on long-horizon coordination tasks (T1 and T6). Finally, the expert-designed FSM cost provides an empirical upper bound on achievable performance with hand-engineered structure.
Experiments and Results on Real Hardware
We test the CoRAL framework on a Franka Emika Panda robot with a parallel-jaw gripper, matching the simulated environments. Object poses are tracked using a six-camera Vicon Vero motion-capture setup.
The real-hardware results closely track CoRAL's simulation performance, demonstrating strong sim-to-real transfer.