IFT 6765 - Links between Computer Vision and Language

Course Lectures

Lecture 1 (01/17/2023) : Introduction to the course

Lecturer: Aishwarya Agrawal
Slides (key), Slides (pdf)

Lecture 2 (01/20/2023, 01/24/2023) : Vision-Language landscape before Transformer + Pre-training

Lecturer: Aishwarya Agrawal
Slides (key), Slides (pdf)

Lecture 3 (01/27/2023) : Vision-Language landscape during Transformer + Pre-training

Lecturer: Aishwarya Agrawal
Slides (key), Slides (pdf)

Lecture 4 (02/03/2023) : Shortcomings of Vision-Language models and Open Challenges

Lecturer: Aishwarya Agrawal
Slides (key), Slides (pdf)

Lecture 5 (02/10/2023) : Image captioning

Review paper: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Paper presentation: Image Captioning

Lecturer: Le Zhang

Slides

Project presentation: Discriminative Stable Diffusion (DiscSD)

Project lead: Benno Krojer

Slides

Lecture 6 (02/14/2023) : Visual Question Answering: Datasets

Review paper: VQA: Visual Question Answering

Paper presentation: Visual Question Answering: Datasets

Lecturer: Arjun Vaithilingam

Slides

Project presentation: Weak language supervised finetuning of SSL vision models

Project lead: Diganta Misra

Slides

Lecture 7 (02/18/2023) : Visual Question Answering: Models

Review paper: Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Paper presentation: VQA Models

Lecturer: Diganta Misra

Slides

Project presentation: Enhancing compositional understanding for vision-language models

Project lead: Le Zhang

Slides

Lecture 8 (02/21/2023) : Visual Dialog: Datasets and Models

Review paper: Visual Dialog

Paper presentation: Visual Dialog: Datasets & Models

Lecturer: Benno Krojer

Slides

Project presentation: Interactive Learning with Grounded Language Agents Utilizing World Models

Project lead: Arjun Vaithilingam

Slides

Lecture 9 (02/24/2023) : Interpretability and Explainability

Review paper: Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

Paper presentation: Interpretability and Explainability in VL

Lecturer: Vitaly Kondulukov

Slides

Project presentation: Zero-Shot Natural Language Explanations

Project lead: Vitaly Kondulukov

Slides

Lecture 10 (03/07/2023) : Finetuning-based VLP models

Review paper: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Paper presentation: Finetuning-based VLP models

Lecturer: Vitaly Kondulukov

Slides

Project presentation: Are Diffusion Models General Image-Text Scorers?

Project lead: Benno Krojer

Slides

Lecture 11 (03/10/2023) : Zero-shot / few-shot VLP models

Review paper: Multimodal Few-Shot Learning with Frozen Language Models

Paper presentation: Zero-shot / few-shot VLP models

Lecturer: Arjun Vaithilingam

Slides

Project presentation: Weak language supervision fine-tuning of vision encoders

Project lead: Diganta Misra

Slides

Lecture 12 (03/14/2023) : VLP models for vision: classification, image generation

Review paper: VirTex: Learning Visual Representations from Textual Annotations

Paper presentation: VLP models for vision

Lecturer: Diganta Misra

Slides

Project presentation: Enhancing compositional understanding for vision-language models

Project lead: Le Zhang

Slides

Lecture 13 (03/17/2023) : VLP models for language

Review paper: Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Paper presentation: VLP models for Language

Lecturer: Benno Krojer

Slides

Project presentation: Interactive Learning with Grounded Language Agents Utilizing World Models

Project lead: Arjun Vaithilingam

Slides

Lecture 14 (03/21/2023) : Shortcomings of Vision-Language models

Review paper: Analyzing the Behavior of Visual Question Answering Models

Paper presentation: Shortcomings of Vision-Language models

Lecturer: Le Zhang

Slides

Project presentation: GQA with BLIP-2

Project lead: Vitaly Kondulukov

Slides

Lecture 15 (03/24/2023) : Beyond statistical learning in vision-language

Review paper: Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

Lecture 16 (03/31/2023) : Final project presentations (I)

Project presentation 1: Are Diffusion Models Vision-Language Reasoners?

Project lead: Benno Krojer

Slides

Project presentation 2: Weak language supervision fine-tuning of vision encoders

Project lead: Diganta Misra

Slides

Project presentation 3: Enhancing compositional understanding for vision-language models

Project lead: Le Zhang

Slides

Lecture 17 (04/04/2023) : Final project presentations (II)

Project presentation 1: Interactive Learning with Grounded Language Agents

Project lead: Arjun Vaithilingam

Slides

Project presentation 2: Visual Encoder vs. Q-Former in BLIP-2

Project lead: Vitaly Kondulukov

Slides

Lecture 18 (04/14/2023) : Project Spotlight Video

Title: Are Diffusion Models Vision-Language Reasoners?

Project lead: Benno Krojer

Video

Title: Weak language supervision fine-tuning of vision encoders

Project lead: Diganta Misra

Video

Title: Enhancing compositional understanding for vision-language models

Project lead: Le Zhang

Video

Title: Interactive Learning with Grounded Language Agents

Project lead: Arjun Vaithilingam

Video

Title: Visual Encoder vs. Q-Former in BLIP-2

Project lead: Vitaly Kondulukov

Video