IFT 6765 - Links between Computer Vision and Language
Winter 2024, a seminar course offered by the Université de Montréal
(Guidelines credit: Advanced Computer Vision course taught by Devi Parikh at Virginia Tech)
Course Overview:
What is this course about? This is a seminar course on recent advances in vision and language research, a sub-field of artificial intelligence (AI) concerned with developing systems that can "see" (i.e., understand the contents of an image: who, what, where, doing what?) and "talk" (i.e., communicate that understanding to humans in free-form natural language). We refer to such systems as vision-language (VL) systems. These systems require modeling of multimodal data, i.e., joint modeling of vision (in the form of images) and natural language (in the form of text). Applications of such systems include: aiding the visually impaired (Human: "What temperature is the oven set to?", AI: "450F."), teaching children through interactive demos (AI: "That is a picture of a Dall sheep. You can find those in Alaska."), online shopping using natural language queries (Human: "Find me a red dress with short sleeves and a floral pattern."), and interacting with personal robots (Human: "Did you see where I left my keys?"). Recent state-of-the-art examples of such systems include GPT-4(V) and Gemini.
In this course we will discuss various VL tasks such as image / video captioning (automatically describing images / videos in natural language), visual question answering (automatically answering natural language questions about images / videos), visual dialog (holding a conversation with a human grounded in an image), visual commonsense reasoning (automatically answering questions involving commonsense reasoning about situations depicted in images), text-conditioned image generation (generating images as described by a natural language sentence), and embodied AI tasks such as vision-language navigation, embodied question answering, and language-conditioned robotic tasks.
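To give a concrete feel for one of these tasks, below is a minimal, purely illustrative Python sketch of visual question answering using an off-the-shelf pretrained model via the Hugging Face transformers pipeline. The image path kitchen.jpg is a placeholder assumption, and the pipeline's default checkpoint is not necessarily a system we will study in class.

    # Illustrative only: VQA with a pretrained model via Hugging Face transformers.
    # Assumes `pip install transformers torch pillow` and a local image file;
    # the path "kitchen.jpg" is a placeholder, not course material.
    from transformers import pipeline

    vqa = pipeline("visual-question-answering")  # loads a default pretrained VQA model
    result = vqa(image="kitchen.jpg", question="What temperature is the oven set to?")
    print(result[0]["answer"])  # highest-scoring answer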
Why study Vision and Language: Vision and Language research has seen tremendous progress over the past decade, owing to the availability of large-scale datasets, the development of high-capacity deep learning models, and the availability of computational resources. There are various motivations for studying vision and language:
Vision and Language tasks such as visual question answering and image captioning provide a natural testbed for evaluating how good our current visual understanding systems are and how grounded our current natural language understanding systems are.
Vision and Language tasks have many potential applications, such as serving as an aid for visually impaired users (helping them navigate the visual world by talking in natural language), serving as day-to-day assistants in our homes (imagine Siri with eyes), and aiding children in learning through interactive demos.
Vision and Language research involves multiple research challenges, such as visual recognition, natural language understanding and grounding, learning joint visuo-linguistic representations, reasoning over commonsense and knowledge bases, and learning to overcome spurious correlations in training data.
Topics covered: Major Vision and Language tasks, datasets, modelling and evaluation techniques, and their shortcomings, including:
Tasks such as image-caption retrieval, referring expressions, image captioning, visual question answering, visual dialog, visual commonsense reasoning, text-conditioned image generation, and embodied AI tasks such as vision-language navigation, embodied question answering, and language-conditioned robotic tasks.
Datasets such as Flickr30k, COCO Captions, VQA, Visual Genome, GQA, CLEVR, Visual Dialog, VCR, and many more.
Modelling techniques such as attention, multimodal pooling, compositional networks, multimodal transformers, and generative models (a minimal cross-modal attention sketch follows this list).
Shortcomings of current state-of-the-art models, such as lack of robustness to new distributions, lack of compositional understanding and reasoning, and lack of robust automatic evaluation.
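As a concrete illustration of the attention-based modelling techniques listed above, here is a minimal PyTorch sketch of single-head cross-modal attention, in which text tokens attend over image region features. All dimensions, variable names, and the single-head simplification are illustrative assumptions, not the design of any specific paper we will read.

    # A minimal sketch of single-head cross-modal attention (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalAttention(nn.Module):
        def __init__(self, text_dim=512, image_dim=768, hidden_dim=512):
            super().__init__()
            self.query = nn.Linear(text_dim, hidden_dim)   # text tokens -> queries
            self.key = nn.Linear(image_dim, hidden_dim)    # image regions -> keys
            self.value = nn.Linear(image_dim, hidden_dim)  # image regions -> values

        def forward(self, text_feats, image_feats):
            # text_feats: (batch, num_tokens, text_dim)
            # image_feats: (batch, num_regions, image_dim)
            q = self.query(text_feats)
            k = self.key(image_feats)
            v = self.value(image_feats)
            # Scaled dot-product attention: each text token attends to all image regions.
            scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
            weights = F.softmax(scores, dim=-1)
            return weights @ v  # (batch, num_tokens, hidden_dim)

    # Usage with random features standing in for real encoder outputs:
    text = torch.randn(2, 10, 512)   # e.g., question token embeddings
    image = torch.randn(2, 36, 768)  # e.g., 36 detected region features
    print(CrossModalAttention()(text, image).shape)  # torch.Size([2, 10, 512])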
Course Objectives: Gain a thorough understanding of recent advances in Vision and Language (tasks, datasets, modelling techniques, shortcomings).
Develop the ability to read and critique research papers in Vision and Language.
Be able to identify interesting open research questions and challenges in Vision and Language.
Be able to execute a research project in Vision and Language.
Enhance presentation skills.
Course Structure: This is a seminar course. The vast majority of the lecture time will be devoted to (i) students presenting papers to each other, (ii) group discussion of the papers, (iii) students presenting their project ideas and updates to the class, and (iv) group discussion and brainstorming of the project presentations. A more detailed course structure is outlined below.
Introductory lectures in the first few classes providing an overview of major Vision and Language tasks, datasets, and modelling techniques.
Reading and reviewing research papers. After the first few classes, students will read and write a technical review (conference-style review) for one research paper prior to each class.
Presenting papers in class. In each class (after the introductory classes), a student will present the topic associated with that class, giving a cohesive overview that draws observations from multiple papers (different from the paper reviewed by everyone). The presenting student need not submit a review that day.
Leading paper discussions in class. In each class (after the introductory classes), two students will lead the discussion of the paper reviewed by everyone. One student will argue in favor of the paper and the other will argue against it. Students leading discussions need not submit reviews that day.
Course projects. Each student will work on a course project (in teams of 1-2 students). These projects can range from coming up with a new Vision and Language task, to developing new modelling techniques and advancing the state of the art, to applying an existing technique to a new task / dataset, to analyzing the behavior of existing models and providing new insights. For each project, there will be three deliverables spread across the term:
Proposal presentation introducing the project idea to the class.
Progress update presentation to the class.
Final poster presentation to the class.
Structure of each class:
• For / Against paper discussion: 20 mins
• 1st paper presentation: 25 mins (15 mins presentation + 10 mins QA)
• 2nd paper presentation: 25 mins (15 mins presentation + 10 mins QA)
• 1st project presentation: 15 mins (10 mins presentation + 5 mins QA)
• 2nd project presentation: 15 mins (10 mins presentation + 5 mins QA)
• Total time = 100 minutes of presentations and discussion + 10 mins break + 10 mins for switching between presentations = 120 minutes (the full 2-hour slot)
Prerequisites: Please note that this is an advanced course at the intersection of computer vision and natural language processing. As prerequisites, you should have basic knowledge of computer vision, machine learning, deep learning, and natural language processing. Also, please note that projects are a major part of this course, so you should be well versed in programming and comfortable using deep learning frameworks such as PyTorch or TensorFlow. If you have any concerns about whether you have the required prerequisites, feel free to talk to the instructor in the first class.
Class Timings:
Wednesdays and Fridays: 2:30 PM – 4:30 PM
First class: Jan 17th
Last scheduled class: April 17th (classes may end earlier if we finish the course material ahead of schedule)
No classes on the following days: March 6th (reading week), March 8th (reading week), March 29th (Easter holiday)
Class Format and Location:
In-person at Auditorium 1 at Mila (6650 Rue St. Urbain, Montreal).
Students are required to attend all classes in person. There is no online access to the class. Lectures will be recorded but shared only with students who miss a class due to exceptional circumstances (such as medical reasons).
Evaluation:
Class participation [5%]: asking questions of the presenters in class and engaging on the class forum by asking and answering questions
Paper reviews [20%]: Submit your reviews on Gradescope
No reviews required for the classes in which you are presenting papers or leading discussions (3 reviews)
The 1 lowest-scoring review will not be counted
A total of 10 reviews will be counted
2% of the total grade per review (10 reviews × 2% = 20%)
Paper presentations in class [15%] + [15%] = [30%]
Each presentation will be done in teams of 2
Team members should not change between the two presentations
Leading paper discussions in class [5%]:
“For”: Leading discussion in favour of the paper OR “Against”: Leading discussion against the paper
Course project [40%]
Each project to be done in teams of 1-2 students
Proposal presentation [10%]
Project update presentation [10%]
Poster presentation [20%]
Instructor and TA:
Instructor: Aishwarya Agrawal <aishwarya.agrawal at mila.quebec>
Teaching Assistant: Le Zhang <le.zhang@mila.quebec>
Communication Platform:
We will use Piazza (link and access code to be shared soon) for:
Course announcements
Answering public student questions
The TA and the instructor answering private student questions
Writing a question:
First, search for similar, already-asked questions to see if yours has already been answered
Write your question neatly and tag it with the correct subject, e.g., #paper_reviews
If your question is not of a sensitive nature and can be posted publicly, please post it publicly! This allows other students to answer it and can help those with the same question.
Privately emailing the TA or instructor vs. submitting a private question on Piazza:
Please do not email the instructor or the TA. Instead message them privately on Piazza.
If your question is very private and you do not want the TA to see it, then you should directly message the instructor on Piazza.
Office Hours:
Instructor: 4:30 PM to 5:00 PM on Fridays (venue: classroom, i.e., Auditorium 1 at Mila).
TA: Wednesdays, 1:30 PM to 2:30 PM (lounge in front of Auditorium 1).