Location: Sala Pasinetti (Palazzo del Cinema)

Mohamed Elhoseiny, Postdoc Researcher, Facebook AI Research, elhoseiny<at>fb.com

Devi Parikh, Assistant Professor at Georgia Tech, parikh<at>gatech.edu                    

Leonid Sigal, Senior Research Scientist at Disney Research, lsigal<at>disneyresearch.com

Manohar Paluri, Research Lead, Facebook Research, mano<at>fb.com

Margaret Mitchell, Senior Research Scientist, Google Research, margarmitchell <at> gmail.com

Ishan Misra, CMU

Ahmed Elgammal, Professor at Rutgers University, elgammal<at>cs.rutgers.edu

The scope of this workshop lies at the boundary of Computer Vision and Natural Language Processing. In recent years, there has been increasing interest in the intersection between Computer Vision and NLP. Researchers have studied several interesting tasks, including generating text descriptions from images and videos, language embedding of images, and predicting visual classifiers from unstructured text. More recent work has further extended the scope of this area to combining videos and language; learning to solve non-visual tasks using visual cues; visual question answering; visual dialog; and others. In this workshop, we aim to provide a full day focused on these research areas, bolstering communication and shared knowledge across tasks and approaches, and providing a space to discuss the future and impact of vision-language technology. We will also have a panel discussion focused on how to develop useful datasets and benchmarks that are suitable to the various tasks in this area.

Workshop Theme

Besides being an umbrella for research at the intersection of vision and language, this year we introduce a special theme for the workshop: “Visual Storytelling”. Visual Storytelling is the task of creating a short story based on a sequence of images, and we plan to host a competition on this topic. We also plan to have a different special theme in each round of future CLVL workshops, helping to further advance and shine a light on different aspects of vision-language research.


The organizers of the workshop bring together a team of diverse researchers from academia (Georgia Tech and Rutgers) and industry (Facebook AI Research, Disney Research, Google Research).


Call for papers: We solicit 2-page extended abstracts, which will be considered for presentation. Accepted papers will be presented in the workshop poster session, and a portion of them will be presented orally. Extended abstracts will not be included in the Proceedings of ICCV 2017 and will not be published in any form. Topics of this workshop include:

  • learning to solve non-visual tasks using visual cues
  • question answering by visual verification
  • novel problems in vision and language
  • visual sense disambiguation
  • deep learning methods for vision and language
  • visual reasoning on language problems
  • language-based visual abstraction
  • text as weak labels for image or video classification
  • image/video annotation and natural language description generation
  • text-to-scene generation
  • transfer learning for vision and language
  • jointly learning to parse and perceive (text+image, text+video)
  • multimodal clustering and word sense disambiguation
  • unstructured text search for visual content
  • visually grounded language acquisition and understanding
  • language-based image and video search
  • linguistic descriptions of spatial relations
  • auto-illustration
  • natural language grounding & learning by watching
  • learning knowledge from the web
  • language as a mechanism to structure and reason about visual perception
  • language as a learning bias to aid vision in both machines and humans
  • dialog as a means of sharing knowledge about visual perception
  • stories as a means of abstraction
  • understanding the relationship between language and vision in humans

Intended audience
The intended audience of this workshop is researchers working at the intersection of vision and language.