Generative AI is revolutionizing robotics by combining vision, language, and robot actions in new systems that can learn from data, interact with humans and the world, and adapt to new situations.
Generative models such as large language models (LLMs) and vision-language models (VLMs) are now being extended for use in robotics, allowing robots to become more adaptable, more intelligent, and capable of performing a wider range of real-world tasks. This technology has the potential to transform industries such as manufacturing, healthcare, and logistics. Generative AI models can address the following problems:
Natural Language Interaction: language models enable robots to understand and respond to human commands and queries more naturally and reliably.
Image and Video Analysis: generative models can process and interpret visual information from the environment, enabling robots to perceive, understand, and interact with their surroundings accurately.
Physically Embodied Actions: generative models are being trained to reliably perform physical tasks such as navigation and manipulation, for example by learning from egocentric (first-person) video data.
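To make the first two capabilities concrete, the following is a minimal illustrative sketch, not the GenAIR team's actual pipeline, of how an off-the-shelf open-source vision-language model can be queried with a robot's camera image and a natural-language request, and how the answer might inform the robot's next high-level action. The model choice (Salesforce/blip-vqa-base), the helper function names, and the keyword-based action rule are assumptions made purely for illustration.

# Illustrative sketch only: query an open-source VLM with a camera image and a
# natural-language question, then pick a high-level action from the answer.
# Model choice and the keyword-based decision rule are illustrative assumptions.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def vlm_answer(image: Image.Image, question: str) -> str:
    """Ask the VLM a question about the current camera observation."""
    inputs = processor(image, question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    return processor.decode(output_ids[0], skip_special_tokens=True)

def choose_action(image: Image.Image, command: str) -> str:
    """Map a human command plus a visual observation to a high-level action."""
    answer = vlm_answer(image, f"Is the object in this request visible: '{command}'?")
    # Hypothetical rule for illustration: approach if visible, otherwise explore.
    return "approach_object" if "yes" in answer.lower() else "explore"

if __name__ == "__main__":
    frame = Image.open("camera_frame.jpg").convert("RGB")  # robot's current RGB frame
    print(choose_action(frame, "bring me the red mug"))

In a real system the action selection would be handled by a learned policy or planner rather than a keyword rule; the sketch only illustrates the language-plus-vision grounding step that the capabilities above rely on.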
Generative AI models are being applied across many areas of robotics, including healthcare, manufacturing, logistics, field robotics, and assistive robotics.
The GenAIR team has pioneered several solutions in this space, including multimodal generative AI models for robotics tasks such as navigation, manipulation, and interaction with humans. We have worked with companies and organisations such as Amazon, PAL Robotics, SoftBank, Broca Hospital, and BMW in a series of projects developing machine learning models for multimodal interactive robots. As well as developing new models, we have extensive experience in model evaluation, including for real-world use cases.
We offer software and tools (see below), expertise in developing new generative models, experience in fine-tuning open-source generative models, and testing and evaluation of models and AI systems.
In addition to collaborating on project proposals for bodies such as UKRI and the European Commission, we offer collaboration opportunities for companies via joint projects with PhD students, MSc students, and student internships.
EMMA: Embodied AI model for human-robot interaction, Amazon Simbot Challenge https://github.com/emma-heriot-watt
ALANA VLM: vision-language foundation model for video understanding and visual question answering https://github.com/alanaai/EVUD
SPRING project demo: LLMs for multi-user interaction with robots in healthcare https://aclanthology.org/2024.eacl-demo.8/
PIXAR: a novel language model that operates entirely on pixels instead of tokens https://arxiv.org/abs/2401.03321
Evaluating the robustness of multimodal models for robot manipulation: https://arxiv.org/abs/2407.03967
Amazon Simbot Challenge https://www.amazon.science/alexa-prize/simbot-challenge
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding https://arxiv.org/abs/2406.13807v1
EMMA: A Foundation Model for Embodied, Interactive, Multimodal Task Completion in 3D Environments https://www.amazon.science/alexa-prize/proceedings/emma-a-foundation-model-for-embodied-interactive-multimodal-task-completion-in-3d-environments
SPRING project: socially assistive robots in healthcare: https://www.hw.ac.uk/news-archive/2024/socially-assistive-robots-ease-pressure-on.htm
o.lemon@hw.ac.uk, a.suglia@hw.ac.uk, m.sridharan@ed.ac.uk, s.ramamoorthy@ed.ac.uk