2nd OmniLabel workshop at ECCV 2024

Welcome to the 2nd OmniLabel workshop on Enabling Complex Perception Through Vision and Language Foundational Models at ECCV 2024 in Milano.


When: Sunday, September 29, 2024

Where: MiCo Milano, Italy

Announcements

Goals & motivation

The goal of this workshop is to foster research on the next generation of visual perception systems that reason over label spaces going beyond a list of simple category names. Modern applications require systems to understand a full spectrum of object labels in images, from plain category names (like "person") to complex object descriptions ("the man in the white hat walking next to the fire hydrant"). The previous edition of this workshop focused broadly on natural language to handle such complex label spaces. We introduced a challenging benchmark called OmniLabel with diverse and complex object descriptions, and also proposed a comprehensive evaluation metric that handles virtually infinite label spaces.

In this year's workshop, we take inspiration from the progress of foundation models such as LLMs and VLMs to advance unified label spaces in perception tasks. Large vision-language models (VLMs) like CLIP, trained on massive web datasets of image-text pairs, achieve strong alignment between visual content and textual descriptions. This allows them to learn object representations usable for detection beyond predefined categories, demonstrating open-vocabulary detection capabilities. Furthermore, LLMs, trained on vast text corpora, exhibit strong reasoning abilities, compositional understanding, and even a degree of world knowledge, all of which are crucial for comprehending complex, real-world object descriptions. In addition, their generative abilities can be leveraged to create diverse descriptions, potentially enhancing the training of language-based detectors. By leveraging the strengths of both VLMs and LLMs, we can not only develop next-generation perception systems that grasp complex scenes but also significantly reduce manual annotation effort, a major bottleneck in traditional approaches.
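To make the open-vocabulary idea above concrete, the sketch below scores free-form object descriptions against cropped region proposals in CLIP's shared image-text embedding space. It is only an illustrative toy under stated assumptions: it uses the Hugging Face transformers CLIP API, a hypothetical input image, and hypothetical proposal boxes; it is not the OmniLabel evaluation protocol or any particular detector.

```python
# Toy sketch: match free-form object descriptions to region proposals with CLIP.
# Assumptions (not from the workshop): boxes come from some class-agnostic proposal
# generator; we use the Hugging Face "transformers" CLIP implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")               # hypothetical input image
boxes = [(40, 60, 220, 400), (300, 80, 520, 420)]    # hypothetical (x1, y1, x2, y2) proposals

descriptions = [
    "person",
    "the man in the white hat walking next to the fire hydrant",
]

# Embed the free-form descriptions once.
text_inputs = processor(text=descriptions, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Embed each cropped proposal and score it against every description.
crops = [image.crop(b) for b in boxes]
image_inputs = processor(images=crops, return_tensors="pt")
with torch.no_grad():
    img_emb = model.get_image_features(**image_inputs)
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

similarity = img_emb @ text_emb.T     # cosine similarity, shape (num_boxes, num_descriptions)
best = similarity.argmax(dim=-1)
for box, idx, scores in zip(boxes, best, similarity):
    print(box, "->", descriptions[idx], f"(score {scores[idx]:.2f})")
```

The point of the sketch is only that a single text encoder handles both the plain category name and the long referring-style description, which is the label-space flexibility the workshop targets.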

In line with this topic, we are upgrading our OmniLabel benchmark dataset with more complex object descriptions and hosting the next edition of the OmniLabel language-based object detection challenge.

Sponsors