Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

Linus Nwankwo and Elmar Rueckert