Paper Submission

We invite 4-page abstracts in ICCV format (including references) of new or previously published work addressing the topics outlined below. Previously published work does not need to be reformatted. Accepted submissions will be made available on our website as non-archival reports.

Peer-reviewed novel submissions may also appear in the ICCV workshop proceedings. Accepted works will be presented in the poster session, and some will be selected for oral presentation.

CMT Website: https://cmt3.research.microsoft.com/ICCVCLVL2021

Format: The submission should follow the ICCV 2021 submission instructions.

Topics:

- Novel problems in vision and language
- Learning to solve non-visual tasks using visual cues
- Language-guided visual understanding (objects, relationships)
- Visual dialog and question answering by visual verification
- Visual question generation
- Visually grounded conversation
- Visual sense disambiguation
- Deep learning methods for vision and language
- Visual reasoning on language problems
- Text-to-image generation
- Language-based visual abstraction
- Text as weak labels for image or video classification
- Image/video annotation and natural language description generation
- Transfer learning for vision and language
- Jointly learning to parse and perceive (text+image, text+video)
- Multimodal clustering and word sense disambiguation
- Unstructured text search for visual content
- Visually grounded language acquisition and understanding
- Referring expression comprehension
- Language-based image and video search/retrieval
- Linguistic descriptions of spatial relations
- Auto-illustration
- Natural language grounding & learning by watching
- Learning knowledge from the web
- Language as a mechanism to structure and reason about visual perception
- Language as a learning bias to aid vision in both machines and humans
- Dialog as a means of sharing knowledge about visual perception
- Stories as a means of abstraction
- Understanding the relationship between language and vision in humans
- Humanistic, subjective, or expressive vision-to-language
- Visual storytelling
- Generating audio descriptions for movies
- Multi-sentence descriptions for images and videos
- Visual question answering for video
- Visual fill-in-the-blank tasks
- Language as supervision for video understanding
- Using dialogs and/or audio for video understanding
- Understanding videos and plots
- Limitations of existing vision and language datasets
- Limitations of existing vision and language approaches