Welcome to the 2016 Multimodal Machine Learning tutorial!
Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. From the initial research on audio-visual speech recognition to the more recent image and video captioning projects, this field poses unique challenges for multimodal researchers, given the heterogeneity of the data and the contingencies often found between modalities.
This CVPR 2016 tutorial builds upon a recent course taught at Carnegie Mellon University by Louis-Philippe Morency and Tadas Baltrušaitis during the Spring 2016 semester (CMU course 11-777). The tutorial will review fundamental concepts of machine learning and deep neural networks before describing the five main challenges in multimodal machine learning: (1) multimodal representation learning, (2) translation and mapping, (3) modality alignment, (4) multimodal fusion, and (5) co-learning. It will also present state-of-the-art algorithms recently proposed for multimodal applications such as image captioning, video description, and visual question answering, and will close with a discussion of current and upcoming challenges.
The tutorial is intended for graduate students and researchers interested in multimodal machine learning, with a focus on deep learning approaches. It is aimed at anyone who wants to better understand how to jointly model language, speech, and vision.
===== BREAK =====
Louis-Philippe Morency is an Assistant Professor in the Language Technologies Institute at Carnegie Mellon University, where he leads the Multimodal Communication and Machine Learning Laboratory (MultiComp Lab). He was previously research faculty at the University of Southern California's Institute for Creative Technologies. He received his Ph.D. and master's degrees from the MIT Computer Science and Artificial Intelligence Laboratory. In 2008, Dr. Morency was selected as one of "AI's 10 to Watch" by IEEE Intelligent Systems. He has received seven best-paper awards at ACM- and IEEE-sponsored conferences for his work on context-based gesture recognition, multimodal probabilistic fusion, and computational models of human communication dynamics. Dr. Morency chairs the advisory committee of the ACM International Conference on Multimodal Interaction and is an Associate Editor of the IEEE Transactions on Affective Computing.
Tadas Baltrušaitis is a post-doctoral associate at the Language Technologies Institute, Carnegie Mellon University. Before this, he was a post-doctoral researcher at the University of Cambridge, where he also received his Ph.D. in 2014. His primary research interests lie in the automatic understanding of non-verbal human behaviour, computer vision, and multimodal machine learning. He has won a number of machine learning challenges, including the Facial Expression Recognition and Analysis challenge (2015) and the Audio/Visual Emotion Challenge (2011), and is a recipient of the ICMI 2014 Best Student Paper Award and the ETRA 2016 Emerging Investigator Award.
 Representation Learning: A Review and New Perspectives. Yoshua Bengio, Aaron Courville, and Pascal Vincent; TPAMI, 2013 [pdf]
 Deep Canonical Correlation Analysis. Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu; ICML, 2013 [pdf]
 Visualizing and Understanding Recurrent Networks. Andrej Karpathy, Justin Johnson, and Li Fei-Fei; 2015 [pdf]
 Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel; TACL, 2015 [pdf]
 Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio; ICML, 2015 [pdf]
 Multi-View Latent Variable Discriminative Models for Action Recognition. Yale Song, Louis-Philippe Morency, and Randall Davis; CVPR, 2012 [pdf]