CS2950-K - Schedule

Course Schedule

Introduction

Sep. 9: Course description. An overview of the research topics we will cover during the semester (recording).
Sep. 11: How to read and present research material. Sign up for paper preferences and teams (recording).
- How to Read a CS Research Paper by Philip Fong.
- How to do research by Bill Freeman.
- How to write a good paper by Bill Freeman (video).

Natural Language Processing, Recap & Overview

Sep. 14: Quick introduction to NLP, word embeddings (recording).
- Supplementary reading: The Illustrated Word2vec
- Word Embeddings and Recurrent NNs by Eugene Charniak
Sep. 16: Sequence-to-sequence modeling, recurrent neural networks (recording).
- Supplementary reading: Chapter 10 of Deep Learning
Sep. 18: Attention in NLP (recording).
- Supplementary reading: Effective Approaches to Attention-based Neural Machine Translation
- Assignment #1 is released!

Computer Vision, Recap & Overview

Sep. 21: Attention in Machine Translation (recording).
- Supplementary reading: Effective Approaches to Attention-based Neural Machine Translation
Sep. 23: Image classification, convolutional neural networks (recording, slides).
- Supplementary reading: Chapter 9 of Deep Learning
Sep. 25: Object detection and semantic segmentation (recording, slides).
- Supplementary reading: Mask R-CNN

Joint visual-semantic embeddings

Sep. 28: DeViSE: A Deep Visual-Semantic Embedding Model
- Presented by: Ifrah Idrees and Zhizhong Chen (recording)
- Paper discussion questions
- Supplementary reading: WSABIE: Scaling Up To Large Vocabulary Image Annotation

Image captioning and its evaluation

Sep. 30: Baby Talk: Understanding and Generating Image Descriptions
- Presented by: Xiling Zhang and Peter Lyu (recording)
- Paper discussion questions
- Supplementary reading: Neural Baby Talk
- Assignment #1 due.
- Assignment #2 released.
Oct. 2: Show and Tell: A Neural Image Caption Generator
- Presented by: Sean Hastings and Suchen Zheng (recording)
- Paper discussion questions
- Supplementary reading: DenseCap: Fully Convolutional Localization Networks for Dense Captioning
Oct. 5: SPICE: Semantic Propositional Image Caption Evaluation
- Presented by: William Kuenne and Tyler DeFroscia (recording)
- Paper discussion questions
- Supplementary reading: Improved Image Captioning via Policy Gradient optimization of SPIDEr

Attention, self-attention and Transformers

Oct. 7: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Presented by: Zachary Hoffman and Deniz Bayazit (recording)
- Paper discussion questions
- Supplementary reading: Learning Deep Features for Discriminative Localization
Oct. 9: Attention Is All You Need
- Presented by: Houyu Zhang, Yihang Dong, and Daniel Kotroco (recording)
- Paper discussion questions
- Supplementary reading: Stand-Alone Self-Attention in Vision Models
- Assignment #2 due.
Oct. 12: Indigenous Peoples’ Day, no classes.
Oct. 14: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Presented by: Min Jean Cho, Lu Shao, and Tian Yun (recording)
- Paper discussion questions
- Supplementary reading: Language Models are Few-Shot Learners (GPT-3)
- Assignment #3 released: submission
- Sign up for final project teams: sheet
- Sign up for paper presentation preferences: sheet
Oct. 16: End-To-End Memory Networks
- Presented by: Nihal Nayak and Kenny Jones (recording)
- Paper discussion questions
- Supplementary reading: REALM: Retrieval-Augmented Language Model Pre-Training
Oct. 19: VideoBERT: A Joint Model for Video and Language Representation Learning
- Presented by: Michael Mao and Ziyang Long
- Paper discussion questions
- Supplementary reading: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Oct. 21: Image GPT: Generative Pretraining from Pixels
- Presented by: Ruochen Zhang and Rafael Rodriguez (recording)
- Paper discussion questions
- Supplementary reading: Representation Learning with Contrastive Predictive Coding
Oct. 23: Final project idea pitch
- Roughly 10 groups, 5 minutes each. (recording)
Oct. 26: Invited talk by Carl Vondrick on learning from unlabeled videos.
- Assignment #3 due.

Visual question answering

Oct. 28: GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
- Presented by: Ruochen Zhang and Peter Lyu
- Paper discussion questions
- Supplementary reading: CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
Oct. 30: From Recognition to Cognition: Visual Commonsense Reasoning
- Presented by: Xiling Zhang and Nihal Nayak
- Paper discussion questions
- Supplementary reading: VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
Nov. 2: Learning to Reason: End-To-End Module Networks for Visual Question Answering
- Presented by: Zachary Hoffman and Zhizhong (Isaac) Chen
- Paper discussion questions
- Supplementary reading: Measuring compositionality in representation learning

Embodied AI, visual-language navigation

Nov. 4: Invited talk by Peter Anderson on visual-language navigation
- Supplementary reading: Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Nov. 6: Speaker-Follower Models for Vision-and-Language Navigation
- Presented by: Rafael Rodriguez, William Kuenne, and Deniz Bayazit
- Paper discussion questions
- Supplementary reading: Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
Nov. 9: Habitat: A Platform for Embodied AI Research
- Presented by: Ifrah Idrees and Kenny Jones
- Paper discussion questions
- Supplementary reading: ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Multimodal learning

Nov. 11: See, Hear, Explore: Curiosity via Audio-Visual Association
- Presented by: Sean Hastings, Lu Shao, and Yihang Dong
- Paper discussion questions
- Supplementary reading: Objects that Sound
Nov. 13: Speech2Action: Cross-modal Supervision for Action Recognition
- Presented by: Suchen Zheng and Min Jean Cho
- Paper discussion questions
- Supplementary reading: MovieGraphs: Towards Understanding Human-Centric Situations from Videos
Nov. 16: End-to-End Learning of Visual Representations from Uncurated Instructional Videos
- Presented by: Tian Yun, Michael Mao, and Tyler DeFroscia
- Paper discussion questions
- Supplementary reading: Visual Grounding in Video for Unsupervised Word Translation
Nov. 18: Invited talk by Miki Rubinstein on audio-visual learning.
- Supplementary reading: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Nov. 20: Invited talk by Jiajun Wu on neural symbolic VQA.
- Supplementary reading: The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

Dataset and model biases

Nov. 23: Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
- Presented by: Houyu Zhang and Daniel Kotroco
- Paper discussion questions
- Supplementary reading: Women also Snowboard: Overcoming Bias in Captioning Models
Nov. 25 & 27: Thanksgiving, no classes.
Nov. 30: Project presentations, part I
Dec. 2: Project presentations, part II