Workshop Date: Oct 17th, 2021

Live stream Link: https://www.youtube.com/watch?v=GzIphByhXDc

Location: Virtual



Mohamed Elhoseiny, Assistant Professor, KAUST

Anna Rohrbach, Research Scientist, UC Berkeley

Andrew Brown, PhD student in the Visual Geometry Group at the University of Oxford

Xin Eric Wang, Assistant Professor, UC Santa Cruz

Marcus Rohrbach, Research Scientist, Facebook AI Research

The scope of this workshop lies at the boundary of Computer Vision and Natural Language Processing (NLP). In recent years, there has been increasing interest in the intersection of Computer Vision and NLP. Researchers have studied several interesting tasks, including generating textual descriptions from images [Donahue TPAMI17, Dai ICCV17, Lu CVPR18, Rennie CVPR17, Vinyals CVPR15] and video [Krishna ICCV17, Venugopalan ICCV15, Xiong ECCV18, Zhou CVPR18], learning language embeddings of images [Frome NIPS13], and predicting visual classifiers from unstructured text for objects [Elhoseiny ICCV13, Elhoseiny CVPR17, Zhu CVPR18] and visual relationships [Krishna ECCV16, Elhoseiny AAAI17, Zhang AAAI19]. Recent work has further extended the scope of this area to visual storytelling [Huang NAACL16], visual question answering [Anderson CVPR18b, Antol ICCV15, Goyal CVPR17, Tapaswi CVPR16], visual dialog [Das CVPR17], referring expression comprehension [Mao CVPR16, Yu CVPR17a, Yu CVPR18], vision-and-language navigation [Anderson CVPR18a, Wang ECCV18], embodied question answering [Das CVPR18], and beyond. In this workshop, we aim to provide a full day focused on these exciting research areas, helping to bolster communication and knowledge sharing across tasks and approaches, and to provide a space to discuss the future and impact of Vision and Language technology. We will also have a panel discussion on the Language and Vision spectrum, covering the limitations of existing datasets and approaches and the future of this area.

We aim to have a different special theme in each round of the CLVL workshop, helping to further advance and shed light on different aspects of Vision and Language research. This year we focus on Large Scale Movie Description. Movie Description is the task of describing the visual content of a movie to make it accessible to blind and visually impaired audiences. Movies and their corresponding descriptions (e.g. M-VAD [Torabi arXiv15], MPII-MD [Rohrbach CVPR15], LSMDC [Rohrbach IJCV17]) are a very rich multi-modal data source containing visual, audio, dialogue, and textual (description) data and metadata. Jointly exploiting and understanding these different modalities is of high scientific interest. This research direction has continued to receive significant attention in recent years [Maharaj CVPR17, Pini MTA18, Rohrbach CVPR17, Yu CVPR17b] and poses interesting challenges for representation and multi-modal learning.


[Anderson CVPR18a] Anderson et al. Vision-and-language navigation: Interpreting visually grounded navigation instructions in real environments. CVPR’18.

[Anderson CVPR18b] Anderson et al. Bottom-up and top-down attention for image captioning and visual question answering. CVPR’18.

[Antol ICCV15] Antol et al. VQA: Visual question answering. ICCV’15.

[Huang NAACL16] Huang et al. Visual storytelling. NAACL’16.

[Das CVPR17] Das et al. Visual dialog. CVPR’17.

[Das CVPR18] Das et al. Embodied question answering. CVPR’18.

[Dai ICCV17] Dai et al. Towards diverse and natural image descriptions via a conditional GAN. ICCV'17.

[Donahue TPAMI17] Donahue et al. Long-term recurrent convolutional networks for visual recognition and description. TPAMI’17.

[Elhoseiny ICCV13] Elhoseiny et al. Write a classifier: Zero-shot learning using purely textual descriptions. ICCV’13.

[Elhoseiny CVPR17] Elhoseiny et al. Link the head to the "beak": Zero shot learning from noisy text description at part precision. CVPR’17.

[Krishna ECCV16] Krishna et al. Visual relationship detection with language priors. ECCV’16.

[Elhoseiny AAAI17] Elhoseiny et al. Sherlock: Scalable fact learning in images. AAAI’17.

[Zhang AAAI19] Zhang et al. Large-scale visual relationship understanding. AAAI’19.

[Zhu CVPR18] Zhu et al. A generative adversarial approach for zero-shot learning from noisy texts. CVPR’18.

[Frome NIPS13] Frome et al. DeViSE: A deep visual-semantic embedding model. NIPS’13.

[Goyal CVPR17] Goyal et al. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. CVPR’17.

[Lu CVPR18] Lu et al. Neural baby talk. CVPR’18.

[Maharaj CVPR17] Maharaj et al. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. CVPR’17.

[Mao CVPR16] Mao et al. Generation and comprehension of unambiguous object descriptions. CVPR’16.

[Krishna ICCV17] Krishna et al. Dense-captioning events in videos. ICCV’17.

[Pini MTA18] Pini et al. M-VAD names: a dataset for video captioning with naming. Multimedia Tools and Applications, 2018.

[Rennie CVPR17] Rennie et al. Self-critical sequence training for image captioning. CVPR'17.

[Rohrbach CVPR15] Rohrbach et al. A Dataset for Movie Description. CVPR’15.

[Rohrbach CVPR17] Rohrbach et al. Generating Descriptions with Grounded and Co-Referenced People. CVPR’17.

[Rohrbach IJCV17] Rohrbach et al. Movie Description. IJCV’17.

[Tapaswi CVPR16] Tapaswi et al. MovieQA: Understanding Stories in Movies through Question-Answering. CVPR’16.

[Torabi arXiv15] Torabi et al. Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070.

[Torabi arXiv16] Torabi et al. Learning Language-Visual Embedding for Movie Understanding with Natural-Language. arXiv:1609.08124.

[Venugopalan ICCV15] Venugopalan et al. Sequence to sequence – video to text. ICCV’15.

[Vinyals CVPR15] Vinyals et al. Show and tell: A neural image caption generator. CVPR’15.

[Wang ECCV18] Wang et al. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. ECCV’18.

[Xiong ECCV18] Xiong et al. Move forward and tell: A progressive generator of video descriptions. ECCV'18.

[Yu CVPR17a] Yu et al. A joint speaker-listener-reinforcer model for referring expressions. CVPR’17.

[Yu CVPR17b] Yu et al. End-to-end concept word detection for video captioning, retrieval, and question answering. CVPR’17.

[Yu CVPR18] Yu et al. MAttNet: Modular attention network for referring expression comprehension. CVPR’18.

[Zhou CVPR18] Zhou et al. End-to-end dense video captioning with masked transformer. CVPR’18.