Location: Sala Pasinetti (Palazzo del Cinema)
Organizers
Mohamed Elhoseiny, Postdoctoral Researcher, Facebook AI Research, elhoseiny<at>fb.com
Devi Parikh, Assistant Professor at Georgia Tech, parikh<at>gatech.edu
Leonid Sigal, Senior Research Scientist at Disney Research, lsigal<at>disneyresearch.com
Manohar Paluri, Research Lead, Facebook Research, mano<at>fb.com
Margaret Mitchell, Senior Research Scientist, Google Research, margarmitchell<at>gmail.com
Ishan Misra, Carnegie Mellon University
Ahmed Elgammal, Professor at Rutgers University, elgammal<at>cs.rutgers.edu
The scope of this workshop lies at the boundary of Computer Vision and Natural Language Processing. In recent years, there has been increasing interest in the intersection between Computer Vision and NLP. Researchers have studied several interesting tasks, including generating text descriptions from images and videos, language embedding of images, and predicting visual classifiers from unstructured text. More recent work has further extended the scope of this area to combining videos and language; learning to solve non-visual tasks using visual cues; visual question answering; visual dialog; and others. In this workshop, we aim to provide a full day focused on these research areas, helping to bolster communication and shared knowledge across tasks and approaches, and providing a space to discuss the future and impact of vision-language technology. We will also have a panel discussion focused on how to develop useful datasets and benchmarks suitable for the various tasks in this area.
Workshop Theme
Besides serving as an umbrella for research at the intersection of vision and language, this year we introduce a special theme for the workshop: “Visual Storytelling”. Visual Storytelling is the task of creating a short story based on a sequence of images, and we plan to host a competition on this topic. We also plan to feature a different special theme in each future edition of the CLVL workshop, helping to advance and shed light on different aspects of vision-language research.
Organizers
The organizers of the workshop form a diverse team of researchers from academia (Georgia Tech, Rutgers, and CMU) and industry (Facebook AI Research, Disney Research, Google Research).
Sponsors
Call for papers: We solicit 2-page extended abstracts, which will be considered for presentation. Accepted abstracts will be presented in the workshop poster session, and a portion will also be presented orally. Extended abstracts will not be included in the Proceedings of ICCV 2017 and will not be published in any form. Topics of this workshop include:
learning to solve non-visual tasks using visual cues
question answering by visual verification
novel problems in vision and language
visual sense disambiguation
deep learning methods for vision and language
visual reasoning on language problems
language-based visual abstraction
text as weak labels for image or video classification
image/video annotation and natural language description generation
text-to-scene generation
transfer learning for vision and language
jointly learn to parse and perceive (text+image, text+video)
multimodal clustering and word sense disambiguation
unstructured text search for visual content
visually grounded language acquisition and understanding
language-based image and video search
linguistic descriptions of spatial relations
auto-illustration
natural language grounding & learning by watching
learning knowledge from the web
language as a mechanism to structure and reason about visual perception
language as a learning bias to aid vision in both machines and humans
dialog as means of sharing knowledge about visual perception
stories as means of abstraction
understanding the relationship between language and vision in humans
Intended audience
This workshop is intended for researchers working at the intersection of vision and language.