Tool-Augmented VIsion (TAVI) Workshop at CVPR 2024

Recent vision-language models (VLMs), such as CLIP, Flamingo, PaLI, have shown strong capabilities to memorize large amounts of world knowledge when scaled to tens of billions of parameters and trained on web-scale data. Although these models have achieved remarkable results across various benchmarks, they tend to struggle on tasks that require 1) ) seeking the answers from external sources 2) long-tail knowledge and 3) fine-grained understanding. Recently, there has been a growing interest in retrieval and tool-augmented models that rely on non-parametric, external knowledge sources to address these limitations. In this inaugural edition of the TAVI workshop, we aim to bring together a diverse group of researchers who will share their recent work on this exciting and increasingly popular topic with our computer vision community.

Topics that will be covered in the workshop

We will cover a variety of topics including but not limited to, applying tool-use and retrieval augmented models to the following problems:

Image and video classification
Dense prediction
Image and video generation
Explainability and reasoning
Data-efficient learning
Multimodal learning
Self-supervised learning
Prompt tuning and selection
Visual instruction tuning

Note: There will be no call for papers for this workshop.

Invited Posters (Arch Building Exhibit Hall):

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval (Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim - CVPR 2024 Main Conference)
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use (Imad Eddine Toubal, Aditya Avinash, Neil Gordon Alldrin, Jan Dlabal, Wenlei Zhou, Enming Luo, Otilia Stretcu, Hao Xiong, Chun-Ta Lu, Howard Zhou, Ranjay Krishna, Ariel Fuxman, Tom Duerig - CVPR 2024 Main Conference)
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA (Li Zhuowan, Jasani Bhavan, Tang Peng, Ghadar Shabnam - CVPR 2024 Main Conference)
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update (Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li - CVPR 2024 Main Conference)
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models (Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman - CVPR 2024 Main Conference)