Paper Submission

We invite 4-page abstracts in ICCV format (including references) presenting new or previously published work on the topics outlined below. (Previously published work does not need to be re-formatted.) Accepted submissions will be made available on our website as non-archival reports.

We will also accept peer-reviewed novel submissions for publication in the ICCV workshop proceedings. Accepted works will be presented in the poster session, and some will be selected for oral presentation.

CMT Website:

Format: Submissions should follow the ICCV 2021 submission instructions.

Topics of interest include:

    • Novel problems in vision and language

    • Learning to solve non-visual tasks using visual cues

    • Language-guided visual understanding (objects, relationships)

    • Visual dialog and question answering by visual verification

    • Visual question generation

    • Visually grounded conversation

    • Visual sense disambiguation

    • Deep learning methods for vision and language

    • Visual reasoning on language problems

    • Text-to-image generation

    • Language-based visual abstraction

    • Text as weak labels for image or video classification

    • Image/video annotation and natural language description generation

    • Transfer learning for vision and language

    • Jointly learning to parse and perceive (text+image, text+video)

    • Multimodal clustering and word sense disambiguation

    • Unstructured text search for visual content

    • Visually grounded language acquisition and understanding

    • Referring expression comprehension

    • Language-based image and video search/retrieval

    • Linguistic descriptions of spatial relations

    • Auto-illustration

    • Natural language grounding & learning by watching

    • Learning knowledge from the web

    • Language as a mechanism to structure and reason about visual perception

    • Language as a learning bias to aid vision in both machines and humans

    • Dialog as a means of sharing knowledge about visual perception

    • Stories as a means of abstraction

    • Understanding the relationship between language and vision in humans

    • Humanistic, subjective, or expressive vision-to-language

    • Visual storytelling

    • Generating audio descriptions for movies

    • Multi-sentence descriptions for images and videos

    • Visual question answering for video

    • Visual fill-in-the-blank tasks

    • Language as supervision for video understanding

    • Using dialogs and/or audio for video understanding

    • Understanding videos and plots

    • Limitations of existing vision and language datasets

    • Limitations of existing vision and language approaches