IFT 6765 - Links between Computer Vision and Language
Course Lectures
Lecture 1 (01/17/2023) : Introduction to the course
Lecturer: Aishwarya Agrawal
Slides (key), Slides (pdf)
Lecture 2 (01/20/2023, 01/24/2023) : Vision-Language landscape before Transformer + Pre-training
Lecturer: Aishwarya Agrawal
Slides (key), Slides (pdf)
Lecture 3 (01/27/2023) : Vision-Language landscape during Transformer + Pre-training
Lecturer: Aishwarya Agrawal
Slides (key), Slides (pdf)
Lecture 4 (02/03/2023) : Shortcomings of Vision-Language models and Open Challenges
Lecturer: Aishwarya Agrawal
Slides (key), Slides (pdf)
Lecture 5 (02/10/2023) : Image captioning
Review paper: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Paper presentation: Image Captioning
Lecturer: Le Zhang
Project presentation: Discriminative Stable Diffusion (DiscSD)
Project lead: Benno Krojer
Lecture 6 (02/14/2023) : Visual Question Answering: Datasets
Review paper: VQA: Visual Question Answering
Paper presentation: Visual Question Answering: Datasets
Lecturer: Arjun Vaithilingam
Project presentation: Weak language supervised finetuning of SSL vision models
Project lead: Diganta Misra
Lecture 7 (02/18/2023) : Visual Question Answering: Models
Review paper: Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Paper presentation: VQA Models
Lecturer: Diganta Misra
Project presentation: Enhancing compositional understanding for vision-language models
Project lead: Le Zhang
Lecture 8 (02/21/2023) : Visual Dialog: Datasets and Models
Review paper: Visual Dialog
Paper presentation: Visual Dialog: Datasets & Models
Lecturer: Benno Krojer
Project presentation: Interactive Learning with Grounded Language Agents Utilizing World Models
Project lead: Arjun Vaithilingam
Lecture 9 (02/24/2023) : Interpretability and Explainability
Review paper: Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
Paper presentation: Interpretability and Explainability in VL
Lecturer: Vitaly Kondulukov
Project presentation: Zero-Shot Natural Language Explanations
Project lead: Vitaly Kondulukov
Lecture 10 (03/07/2023) : Finetuning based VLP models
Review paper: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Paper presentation: Finetuning based VLP models
Lecturer: Vitaly Kondulukov
Project presentation: Are Diffusion Models General Image-Text Scorers?
Project lead: Benno Krojer
Lecture 11 (03/10/2023) : Zero-shot / few-shot VLP models
Review paper: Multimodal Few-Shot Learning with Frozen Language Models
Paper presentation: Zero-shot / few-shot VLP models
Lecturer: Arjun Vaithilingam
Project presentation: Weak language supervision fine-tuning of vision encoders
Project lead: Diganta Misra
Lecture 12 (03/14/2023) : VLP models for vision: classification, image generation
Review paper: VirTex: Learning Visual Representations from Textual Annotations
Paper presentation: VLP models for vision
Lecturer: Diganta Misra
Project presentation: Enhancing compositional understanding for vision-language models
Project lead: Le Zhang
Lecture 13 (03/17/2023) : VLP models for language
Review paper: Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
Paper presentation: VLP models for Language
Lecturer: Benno Krojer
Project presentation: Interactive Learning with Grounded Language Agents Utilizing World Models
Project lead: Arjun Vaithilingam
Lecture 14 (03/21/2023) : Shortcomings of Vision-Language models
Review paper: Analyzing the Behavior of Visual Question Answering Models
Paper presentation: Shortcomings of Vision-Language models
Lecturer: Le Zhang
Project presentation: GQA with BLIP-2
Project lead: Vitaly Kondulukov
Lecture 15 (03/24/2023) : Beyond statistical learning in vision-language
Review paper: Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
Lecture 16 (03/31/2023) : Final project presentations (I)
Project presentation 1: Are Diffusion Models Vision-Language Reasoners?
Project lead: Benno Krojer
Project presentation 2: Weak language supervision fine-tuning of vision encoders
Project lead: Diganta Misra
Project presentation 3: Enhancing compositional understanding for vision-language models
Project lead: Le Zhang
Lecture 18 (04/14/2023) : Project Spotlight Video
Title: Are Diffusion Models Vision-Language Reasoners?
Project lead: Benno Krojer
Title: Weak language supervision fine-tuning of vision encoders
Project lead: Diganta Misra
Title: Enhancing compositional understanding for vision-language models
Project lead: Le Zhang
Title: Interactive Learning with Grounded Language Agents
Project lead: Arjun Vaithilingam
Title: Visual Encoder vs. Q-Former in BLIP-2
Project lead: Vitaly Kondulukov