IFT 6765 - Links between Computer Vision and Language

Winter 2024. A seminar course offered by the Université de Montréal.

(Guidelines credit: the Advanced Computer Vision course taught by Devi Parikh at Virginia Tech)

Course Overview:

What is this course about? This is a seminar course on recent advances in vision and language research - a sub-field of artificial intelligence (AI) concerned with developing systems that can `see' (i.e., understand the contents of an image: who, what, where, doing what?) and `talk' (i.e., communicate that understanding to humans in free-form natural language). We refer to such systems as vision-language (VL) systems. These systems require modeling of multimodal data, i.e., joint modeling of vision (in the form of images) and natural language (in the form of text). Applications of such systems include: aiding the visually impaired (Human: "What temperature is the oven set to?", AI: "450F."), teaching children through interactive demos (AI: "That is a picture of a Dall Sheep. You can find those in Alaska."), online shopping using natural language queries (Human: "Find me a red dress with short sleeves and a floral pattern."), and interacting with personal robots (Human: "Did you see where I left my keys?"). Recent state-of-the-art examples of such systems include GPT-4(V) and Gemini.

In this course we will discuss various VL tasks such as image/video captioning (automatically describing images/videos in natural language), visual question answering (automatically answering natural language questions about images/videos), visual dialog (holding a conversation with a human grounded in an image), visual commonsense reasoning (automatically answering questions involving commonsense reasoning about situations described in images), text-conditioned image generation (generating images as described by a natural language sentence), and embodied AI tasks such as vision-language navigation, embodied question answering, language-conditioned robotic tasks, etc.


Why study Vision and Language: Vision and Language research has seen tremendous progress over the past decade, owing to the availability of large-scale datasets, high-capacity deep learning models, and large computational resources. There are several motivations for studying vision and language:


Topics covered: Major Vision and Language tasks, datasets, modelling and evaluation techniques, and their shortcomings, such as:


Course Objectives: Gain a thorough understanding of recent advances in Vision and Language (tasks, datasets, modelling techniques, shortcomings).


Course Structure: This is a seminar course. The vast majority of the lecture time will be devoted to (i) students presenting papers to each other,  (ii) group discussion of the papers, (iii) students presenting their project ideas and updates to the class, and (iv) group discussion and brainstorming of the project presentations. A more detailed course structure is outlined below.


Structure of each class: 

• For / Against paper discussion: 20 mins

• 1st paper presentation: 25 mins (15 mins presentation + 10 mins QA)

• 2nd paper presentation: 25 mins (15 mins presentation + 10 mins QA)

• 1st project presentation: 15 mins (10 mins presentation + 5 mins QA)

• 2nd project presentation: 15 mins (10 mins presentation + 5 mins QA)

• Total time = 100 minutes of presentations and discussion + 10 mins break + 10 mins for switching between presentations (120 minutes)


Prerequisites: Please note that this is an advanced course at the intersection of computer vision and natural language processing. As prerequisites, you should have basic knowledge of computer vision, machine learning, deep learning, and natural language processing. Also, please note that projects are a major part of this course, so you should be well versed in programming and comfortable with deep learning frameworks such as PyTorch, TensorFlow, etc. If you have any concerns about whether you have the required prerequisites, feel free to talk to the instructor about it in the first class.

Class Timings:


Wednesdays and Fridays: 2:30 PM – 4:30 PM


First class: Jan 17th

Last scheduled class: April 17th (classes may end earlier if we manage to finish the course)
No classes on the following days: March 6th (reading week), March 8th (reading week), March 29th (Easter holiday)


Class Format and Location:  


In-person at Auditorium 1 at Mila (6650 Rue St. Urbain, Montreal). 

Students are required to attend all classes in person. There is no online access to the class. The lectures will be recorded but shared only with students who miss a class due to exceptional circumstances (such as medical reasons).

 

Evaluation:


Instructor and TA:


Communication Platform:

We will use Piazza for course-related communication (link and access code to be shared soon).

Writing a question:

Privately emailing the TA or instructor vs. submitting a private question on Piazza:


Office Hours: