11-775 Large-Scale Multimedia Analysis

Can a robot watch "YouTube" to learn about the world? What makes us laugh? How do you bake a cake? Why is Kim Kardashian famous?

A 12-unit class or lab covering the fundamentals of computer vision, audio and speech processing, multimedia files and streaming, multi-modal signal processing, video retrieval, semantics, and text (possibly also speech and music) generation.

Instructors will give an overview of relevant recent work and benchmarking efforts (TRECVID, MediaEval, etc.). Students will work on research projects to explore these ideas and learn to perform multi-modal retrieval, summarization, and inference on large amounts of YouTube-style data. The experimental environment for the practical part of the course will be provided to students as virtual machines.

This is a graduate course primarily for students in LTI, HCII, CSD, Robotics, and ECE; others, for example undergraduate CS students or professional master's students, may enroll with prior permission of the instructor(s). Strong implementation skills, experience working with large data sets, and familiarity with some (not all) of the above fields (e.g., 11-611, 11-711, 11-751, 11-755, 11-792, 16-720, or equivalent) will be helpful.