Generative AI for Robotics

Papers, links, benchmarks...

This page collects papers, links, and benchmark datasets that may be useful for student projects in this emerging area.

Overview / background

Gemini Robotics: https://deepmind.google/models/gemini-robotics/
Physical Intelligence: https://www.physicalintelligence.company/
LeRobot (Hugging Face): https://huggingface.co/lerobot
NVIDIA Isaac GR00T: https://developer.nvidia.com/isaac/gr00t
ACL 2025 tutorial on LLM and agent benchmarking: https://llm-guardrails-security.github.io/

Collaborative robotics

CoELA: "Building Cooperative Embodied Agents Modularly with Large Language Models" https://github.com/UMass-Embodied-AGI/CoELA

Language interaction with robots

Speech Language Models survey: https://github.com/dreamtheater123/Awesome-SpeechLM-Survey (speech, not robots)
Ren, Allen Z., Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, F. Xia, Jacob Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng and Anirudha Majumdar. “Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners.” ArXiv abs/2307.01928 (2023)
Chen, Yangyi, Karan Sikka, Michael Cogswell, Heng Ji and Ajay Divakaran. “DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback.” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024): 14239-14250.
Mees, Oier, Lukás Hermann, Erick Rosete-Beas and Wolfram Burgard. “CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks.” IEEE Robotics and Automation Letters 7 (2021): 7327-7334.

Safety, red-teaming, jailbreaking for embodied AI models

ASIMOV benchmark: "Generating Robot Constitutions & Benchmarks for Semantic Safety" (Google DeepMind): https://asimov-benchmark.github.io/v1/
Lu, Xiaoya, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng and Jing Shao. “IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks.” ArXiv abs/2506.16402 (2025)
Lu, Xuancun, Zhengxian Huang, Xinfeng Li, Chi Zhang, Xiaoyu Ji and Wenyuan Xu. “POEX: Towards Policy Executable Jailbreak Attacks Against the LLM-based Robots.” (2024).
Lyu, Wenqi, Zerui Li, Yanyuan Qiao and Qi Wu. “BadNAVer: Exploring Jailbreak Attacks On Vision-and-Language Navigation.” ArXiv abs/2505.12443 (2025)
Shaikh, Omar, Hussein Mozannar, Gagan Bansal, Adam Fourney and Eric Horvitz. “Navigating Rifts in Human-LLM Grounding: Study and Benchmark.” ArXiv abs/2503.13975 (2025) (RIFTS)
Zhang, Hangtao, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo and Leo Yu Zhang. “BadRobot: Jailbreaking Embodied LLMs in the Physical World.” (2024).
Liu, Shuyuan, Jiawei Chen, Shouwei Ruan, Hang Su and Zhaoxia Yin. “Exploring the Robustness of Decision-Level Through Adversarial Attacks on LLM-Based Embodied Models.” Proceedings of the 32nd ACM International Conference on Multimedia (2024)
Xing, Wenpeng, Minghao Li, Mohan Li and Meng Han. “Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks.” ArXiv abs/2502.13175 (2025)
Wu, Xiyang, Ruiqi Xian, Tianrui Guan, Jing Liang, Souradip Chakraborty, Fuxiao Liu, Brian M. Sadler, Dinesh Manocha and A. S. Bedi. “On the Safety Concerns of Deploying LLMs/VLMs in Robotics: Highlighting the Risks and Vulnerabilities.” ArXiv abs/2402.10340 (2024)
Hafez, Ahmad, Alireza Naderi Akhormeh, Amr Hegazy and Amr Alanwar. “Safe LLM-Controlled Robots with Formal Guarantees via Reachability Analysis.” ArXiv abs/2503.03911 (2025)
Li, Jiale, Mingrui Wu, Zixiang Jin, Hao Chen, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao and Rongrong Ji. “MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models.” ACM MM (2025)
Robey, Alexander, Zachary Ravichandran, Vijay Kumar, Hamed Hassani and George J. Pappas. “Jailbreaking LLM-Controlled Robots.” ICRA (2025), ArXiv abs/2410.13691
Yin, Sheng, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao and Siheng Chen. “SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents.” ArXiv abs/2412.13178 (2024) https://safeagentbench.github.io/
Zhu, Zihao, Bingzhe Wu, Zhengyou Zhang, Lei Han, Qingshan Liu and Baoyuan Wu. “EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents.” (2024).
Zhou, Kai-Qing, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Xiaodong Song and Xin Eric Wang. “Multimodal Situational Safety.” ArXiv abs/2410.06172 (2024)
Ravichandran, Zachary, Alexander Robey, Vijay Kumar, George J. Pappas and Hamed Hassani. “Safety Guardrails for LLM-Enabled Robots.” ArXiv abs/2503.07885 (2025)
Jones, Eliot Krzysztof, Alexander Robey, Andy Zou, Zachary Ravichandran, George J. Pappas, Hamed Hassani, Matt Fredrikson and J. Zico Kolter. “Adversarial Attacks on Robotic Vision Language Action Models.” ArXiv abs/2506.03350 (2025)
Wang, Taowen, Dongfang Liu, James Liang, Wenhao Yang, Qifan Wang, Cheng Han, Jiebo Luo and Ruixiang Tang. “Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics.” ArXiv abs/2411.13587 (2024)

Planning

Valmeekam, Karthik, Matthew Marquez, Sarath Sreedharan and Subbarao Kambhampati. “On the Planning Abilities of Large Language Models - A Critical Investigation.” ArXiv abs/2305.15771 (2023)
Triple-S: A Collaborative Multi-LLM Framework for Solving Long-Horizon Implicative Tasks in Robotics: https://github.com/Ghbbbbb/Triple-S
