In recent years, autonomous agents have become increasingly common in real-world environments such as homes, offices, and public spaces. However, natural human-robot interaction remains a key challenge. In this paper, we introduce an approach that synergistically exploits the capabilities of pre-trained large language models (LLMs) and multimodal vision-language models (VLMs) to enable humans to interact naturally with autonomous agents through conversational dialogue. We leverage the LLMs to decode high-level natural language instructions from humans and abstract them into precise, robot-actionable commands or queries. Further, we utilise the VLMs to provide a visual and semantic understanding of the robot's task environment. We performed quantitative and qualitative evaluations of our framework's natural conversation understanding with participants from diverse backgrounds and occupations. The participants interacted with the robot using textual instructional commands. Based on an analysis of the logged interaction data, our framework achieved 99.13% command recognition accuracy, a 97.96% command execution success rate, and an average latency of 0.45 seconds from receiving a participant's chat command to initiating the robot's physical action. More details can be found on the project website: https://linusnep.github.io/TCC-IRoNL/
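The three reported figures (command recognition accuracy, command execution success, and average latency) are summary statistics over logged interactions. The following is a minimal Python sketch of how such statistics could be computed. The `Interaction` schema, the field names, and the choice to divide both percentages by the total number of logged commands are assumptions made for illustration, not the paper's actual logging format or evaluation code, and the demo values are made up rather than taken from the study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Interaction:
    """One logged chat command (hypothetical schema, not the paper's)."""
    recognized: bool    # command was mapped to the intended action or query
    executed: bool      # robot completed the resulting action
    received_at: float  # time the chat command arrived, in seconds
    action_at: float    # time the robot initiated its action, in seconds


def summarize(log: list[Interaction]) -> dict[str, float]:
    """Compute recognition accuracy (%), execution success (%), mean latency (s).

    Assumption: both percentages use all logged commands as the denominator,
    and latency is averaged over successfully executed commands only.
    """
    if not log:
        raise ValueError("interaction log is empty")
    total = len(log)
    executed = [r for r in log if r.executed]
    latency = (
        mean(r.action_at - r.received_at for r in executed)
        if executed
        else float("nan")
    )
    return {
        "recognition_accuracy_pct": 100.0 * sum(r.recognized for r in log) / total,
        "execution_success_pct": 100.0 * len(executed) / total,
        "mean_latency_s": latency,
    }


# Made-up demo data, only to exercise the function.
demo_log = [
    Interaction(recognized=True, executed=True, received_at=0.0, action_at=0.4),
    Interaction(recognized=True, executed=True, received_at=10.0, action_at=10.5),
    Interaction(recognized=True, executed=False, received_at=20.0, action_at=20.0),
    Interaction(recognized=False, executed=False, received_at=30.0, action_at=30.0),
]
summary = summarize(demo_log)
```

A different denominator (for example, execution success measured only over correctly recognized commands) would change the second figure, so the convention should be stated alongside any reported numbers.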
Videos: hri24lbr1140_real_world.mp4, hri24lbr1140_simulation.mp4
If you use this work in your research, please cite it using the following BibTeX entry:
@inproceedings{10.1145/3610978.3640723,
author = {Nwankwo, Linus and Rueckert, Elmar},
title = {The Conversation is the Command: Interacting with Real-World Autonomous Robots Through Natural Language},
year = {2024},
isbn = {979-8-4007-0323-2},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3610978.3640723},
doi = {10.1145/3610978.3640723},
booktitle = {Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction},
numpages = {5},
keywords = {Human-robot interaction, LLMs, VLMs, ChatGPT, ROS, autonomous robots, natural language interaction},
location = {Boulder, CO, USA},
series = {HRI '24}
}