Vision, Language, and Multimodal Human Instructions for Interactive Intelligent Vehicles
2nd Edition Workshop at IV 2026
Date and Time TBA
The Vision, Language, and Multimodal Human Instructions for Interactive Intelligent Vehicles (VL-IIV 2026) workshop explores the intersection of computer vision, language understanding, and multimodal reasoning for human-in-the-loop autonomous driving. The workshop focuses on systems and datasets that allow vehicles to perceive, interpret, and respond to visual and linguistic instructions.

VL-IIV 2026 includes the doScenes Instructed Driving Challenge, a benchmark evaluating how well vision-language models predict trajectories conditioned on human driving instructions. The challenge dataset contains scene-level captions, driver intent labels, and natural-language instructions for upcoming maneuvers. All instructions are human-generated, and scenes are labeled by multiple annotators, creating a diverse set of descriptors that map to the same maneuver. Participants will predict the vehicle's future trajectory conditioned on any combination of (1) visual scene input (multi-camera), (2) language instruction, and (3) scene context (history + map), evaluated using displacement error, visualization, and explainability.

Interactive autonomous systems capable of interpreting multimodal human instructions are critical to the next generation of safe and trustworthy transportation. This workshop promotes human-centered autonomy, reducing the risks of fully unsupervised systems while enhancing transparency and user control.
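The exact challenge evaluation protocol is still to be announced; as a rough illustration of the displacement-error component, the sketch below computes the standard average and final displacement errors (ADE/FDE) between a predicted and a ground-truth future trajectory. The six-step horizon and the (x, y)-waypoint format are illustrative assumptions, not challenge specifications.

import numpy as np

def displacement_errors(pred, gt):
    """Average and final displacement error (ADE / FDE) between trajectories.

    pred, gt: arrays of shape (T, 2) holding future (x, y) waypoints in meters.
    Returns (ade, fde) in meters.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Per-timestep Euclidean distance between predicted and true positions.
    dists = np.linalg.norm(pred - gt, axis=-1)
    ade = dists.mean()   # mean error over the whole prediction horizon
    fde = dists[-1]      # error at the final predicted waypoint
    return ade, fde

# Hypothetical example: a six-step predicted trajectory vs. ground truth.
pred = np.array([[0.0, 1.0], [0.0, 2.1], [0.1, 3.2],
                 [0.2, 4.4], [0.3, 5.5], [0.5, 6.8]])
gt   = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0],
                 [0.0, 4.0], [0.0, 5.0], [0.0, 6.0]])
ade, fde = displacement_errors(pred, gt)
print(f"ADE: {ade:.2f} m, FDE: {fde:.2f} m")

In a language-conditioned setting, the same metrics would be reported per instruction or per maneuver type, so that errors can be attributed to how well the model grounds the instruction rather than to trajectory regression alone.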
To this end, we welcome contributions with a strong focus on, but not limited to, the following topics:
Human-in-the-loop and instructed autonomy
Representation learning and foundation models for embodied, instruction-conditioned behavior
Multimodal learning and grounding (gesture, speech, gaze)
Multi-agent interactions
Vision-language models for driving and robotics
Scene understanding for control transitions
Safety, trust, explainability, and transparency in human-interactive AV systems
Datasets, benchmarks, and evaluation metrics for interactive autonomy
Generative and contrastive modeling for multimodal control
Workshop Speakers to be announced.
Workshop Presentations to be announced.
Workshop Schedule to be announced.
Lead Organizers
Professor Ross Greer (University of California Merced)
Professor Mohan Trivedi (University of California San Diego)
Organizing Committee and Challenge Leads
Max Ronecker (TU Graz)
Walter Zimmer (University of Sydney, UCLA)
Rui Song (UCLA)
Kianna Ng (UC Merced)
Angel Martinez (UC Merced)
Maitrayee Keskar (UC San Diego)
Anas Saeed (Bonsai Robotics)
Erika Maquiling (UC Merced)
Edmund Chao (UCLA)
Giovanni Tapia Lopez (UC Merced)
Marcus Blennemann (UC San Diego)
Workshop Contact: rossgreer@ucmerced.edu