IFT 6765 - Links between Computer Vision and Language

Winter 2022, a seminar course offered by the Université de Montréal

Course Overview:

What is this course about?
This will be a seminar course on recent advances in vision and language research – a sub-field of artificial intelligence that studies multimodal tasks at the intersection of computer vision and natural language processing. Examples of these tasks include image/video captioning (automatically describing images/videos in natural language), visual question answering (automatically answering natural language questions about images/videos), visual dialog (holding a conversation with a human grounded in an image), and visual commonsense reasoning (automatically answering questions that involve commonsense reasoning about situations depicted in images).


Why study Vision and Language: Vision and Language research has seen tremendous progress over the past decade, owing to the availability of large-scale datasets, the development of high-capacity deep learning models, and the growth of computational resources. There are several motivations for studying vision and language:

  • Vision and Language tasks such as visual question answering and image captioning provide a natural testbed to evaluate how good our current visual understanding systems are and how grounded our current natural language understanding systems are.

  • Vision and Language tasks have many potential applications, such as aiding visually impaired users (helping them navigate the visual world through natural language), serving as day-to-day assistants in our homes (imagine Siri with eyes), and helping children learn through interactive demos.

  • Vision and Language research involves multiple research challenges, such as visual recognition, natural language understanding and grounding, learning joint visuo-linguistic representations, reasoning over commonsense and knowledge bases, and learning to overcome spurious correlations in training data.


Topics covered: Major Vision and Language tasks, datasets, modelling techniques, and their shortcomings, including:

  • Tasks such as image-caption retrieval, referring expressions, image captioning, visual question answering, visual dialog, visual commonsense reasoning.

  • Datasets such as Flickr30k, COCO Captions, VQA, Visual Genome, GQA, CLEVR, Visual Dialog, VCR.

  • Modelling techniques such as attention, multimodal pooling, compositional networks, multimodal transformers.

  • Shortcomings of current state-of-the-art models, such as lack of robustness to new data distributions and lack of compositional understanding and reasoning.


Course Objectives:

  • Gain a thorough understanding of recent advances in Vision and Language (tasks, datasets, modelling techniques, shortcomings).

  • Develop the ability to read and critique research papers in Vision and Language.

  • Be able to identify interesting open research questions and challenges in Vision and Language.

  • Be able to execute a research project in Vision and Language.

  • Enhance presentation skills.


Course Structure: This is a seminar course. The vast majority of the lecture time will be devoted to (i) students presenting papers to each other, (ii) group discussion of the papers, (iii) students presenting their project ideas and updates to the class, and (iv) group discussion and brainstorming of the project presentations. A more detailed course structure is outlined below.

  • Introductory lectures in the first few classes providing an overview of major Vision and Language tasks, datasets, and modelling techniques. A subset of these could also be guest lectures from researchers working on these tasks.

  • Reading and reviewing research papers. After the first few classes, students will read and write a technical, conference-style review for one research paper prior to each class.

  • Presenting papers in class. In each class (after the introductory classes), two students will present and lead discussions on one paper each -- the paper reviewed by everyone that day and one additional paper on the same topic. Students presenting need not submit reviews that day.

  • Course projects. Each student will work on a course project (in teams of 1-2 students). These projects can range from proposing a new Vision and Language task, to developing new modelling techniques that advance the state of the art, to applying an existing technique to a new task/dataset, to analyzing the behavior of existing models and providing new insights. For each project, there will be four deliverables spread across the term:

    1. Initial presentation introducing the project idea to the class.

    2. Progress update presentation to the class.

    3. Final project presentation to the class.

    4. Spotlight project video (1 min) summarizing the project.


Prerequisites: Please note that this is an advanced course at the intersection of computer vision and natural language processing. As prerequisites, you should have basic knowledge of computer vision, machine learning, deep learning, and natural language processing. Also, please note that projects are a major part of this course, so you should be well versed in programming and comfortable with deep learning frameworks such as PyTorch or TensorFlow. If you have any concerns about whether you have the required prerequisites, feel free to talk to the instructor in the first class.

Class Timings:

  • Tuesdays: 9:30 AM – 11:30 AM

  • Fridays: 1:30 PM – 3:30 PM


Last class before Winter break: Feb 25th

First class after Winter break: March 8th


Class Format:

Until January 31st: Virtual via Zoom. Joining details have been shared over Piazza.
After January 31st: In-person in the Agora classroom at Mila (6650 Rue St. Urbain, Montreal). Remote access available via Zoom (same joining details as above).

Note: Following the Université de Montréal's guidelines on COVID-19 safety measures, students must wear procedural masks at all times in the classroom. Masks and disinfectant will be available at the entrance of the Agora.


Evaluation:

  • Class participation – 5%: asking the presenters questions in class and engaging on the class forum by asking and answering questions.

  • Paper reviews – 25%: Submit your reviews on Gradescope (the access code to sign up on Gradescope has been shared over Piazza).

  • Paper presentation in class – 15% + 15% = 30%

  • Course Project – 40%

    1. Initial presentation – 10%

    2. Progress update presentation – 10%

    3. Final presentation – 10%

    4. Spotlight video – 10%


Instructor and TA:


Communication Platform:

We will use Piazza (the access code to join has been shared via Studium) for:

  • Course announcements

  • Answering public student questions

  • Answering private student questions (handled by the TA and the instructor)

Writing a question:

  • First, search for similar, already-asked questions to see if yours is answered

  • Write your question clearly and tag it with the correct subject, e.g. #paper_reviews

  • If your question is not of a sensitive nature and can be posted publicly, please post it publicly! This allows other students to answer it and helps those who have the same question.

Privately emailing TA or instructor vs submitting a private question on Piazza:

  • Please do not email the instructor or the TA. Instead message them privately on Piazza.

  • If your question is very private and you do not want the TA to see it, message the instructor directly on Piazza.


Office Hours:

  • Instructor: Fridays, 3:30 PM – 4:00 PM (venue: Zoom / Agora classroom).

  • TA: To be scheduled on an as-needed basis by contacting the TA privately on Piazza.