Learning from Procedural Videos and Language:  What is Next?

CVPR 2024 Workshop

Tuesday, June 18th


Overview

Videos demonstrating people performing procedural activities, i.e., sequences of steps towards a goal, have gained popularity as effective tools for skill acquisition, spanning domains such as cooking, home improvement, and crafting. In addition to being useful teaching materials for humans, procedural videos paired with language are also a promising medium for multimodal learning by machines, as they combine visual demonstrations with detailed verbal descriptions. Despite the recent introduction of multiple datasets, such as HowTo100M, HT-Step, and Ego4D Goal-Step, models in the procedural video domain still lag behind their image-based counterparts.

This workshop aims to foster discussion on the future of language-based procedural video understanding. We'll explore paths to integrate diverse language sources, harness LLMs for structured task knowledge, and combine language with other information streams (visual, audio, IMU, etc.) to enhance procedural video understanding (recognizing key steps, mistakes, hand-object interactions, etc.).


Invited Speakers

Google Research, INRIA

University of Bonn

University of Bristol

Google DeepMind

MBZUAI, INRIA

Schedule

TBD

Organizers

FAIR, Meta

VGG, University of Oxford

Max Planck Institute for Informatics

Johns Hopkins University

Northeastern University

FAIR, Meta & UT Austin