Course overview
Grounded perception is a key feature of most visions of future AI: the ability to refer to entities in the world provides semantic cues for machines to understand and act, and can allow effective communication with human partners. The interplay of vision and language has long tantalized AI researchers and has recently begun to bear considerable fruit. As time and the interest of participants permit, this course will survey the history of vision and language models, including early generative and translation-based methods; cover key deep learning techniques that provided significant advances in previous years, including recurrent convolutional networks; review in detail contemporary large-scale transformer models for multimodal processing; and assess the state of the art these methods achieve on challenges including image captioning, VQA, Video-QA, Visual Dialog, Video Description, and text-to-image synthesis. We will also consider multimedia forensic challenges and how vision and language methods can help defend against the spread of falsified media. Finally, we will discuss the ethical issues that arise with large-scale vision and language datasets and models, and will consider methods for removing unwanted dataset bias and/or making models more explainable and transparent.
Prerequisite for the course
Permission of instructor required for all students, including auditors. Students are expected to have completed graduate computer vision and/or NLP courses and be engaged in active research on related topics. Limited to 30 participants, with preference given to those actively researching in the area who have the most prior coursework and publication experience. Please fill out this request form to summarize your background and express your interest in joining the course, either for credit or as an auditor. Permission codes will be sent to selected students to register for the course.
To ensure full consideration for participation in the course, please fill out the form before the end of December.
This course may be taken for variable units (2-4), may be audited, and may be retaken for credit in different semesters as the material will change from term to term.
Priority for registration will be given to those taking the course for credit for at least 2 units, if they are eligible to do so. Postdocs and others who are not able to register may still be considered priority participants. No priority distinction will be made between those requesting 2 units and those requesting more than 2.
Requirements
For all students (including 2 units and auditors):
Active participation in class discussions
Presentation of one or more papers during the term
Completion of a short response form before each class, summarizing the key idea of the one or two assigned key papers for that week and either asking one critical question or suggesting an extension to the work. Additional optional papers will also be covered each week, but no response form is required for them.
For students taking the course for more than two units:
[For 3 units] A course project which is one of the following types: new research results and report judged suitable for submission to a CV, NLP, or NeurIPS workshop, a solid replication or reimplementation of existing work, evaluation of existing work on a new dataset, or a literature survey. (Or other format with permission of instructor.)
[For 4 units] A course project with new research results and report judged suitable for acceptance at a top CV or NLP conference or journal venue, or a major new open source repository or dataset with high impact for the community.
Pragmatics
1 or 2 lead(s) per week (assigned by staff)
1 volunteer presenter per paper (can choose to team present if preferred)
Format:
Introduction / background [20 min] [4+ papers]
2-3 papers presented in detail [20 min presentation + 10 min discussion each]
Spotlights of other latest work [20 min] [2-4 papers]
Discussion / project ideas [10 min]
Weekly paper questions for assigned papers
All students will write:
Short paper summary
Your view of the paper
Anything confusing?
Questions for discussion
A Google Form will be released every week on Piazza