Welcome to the 2016 Multimodal Machine Learning tutorial!
Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic and visual messages. From the initial research on audio-visual speech recognition to more recent image and video captioning projects, the field poses unique challenges for researchers, given the heterogeneity of the data and the contingency often found between modalities.

This CVPR 2016 tutorial builds upon a recent course taught at Carnegie Mellon University by Louis-Philippe Morency and Tadas Baltrušaitis during the Spring 2016 semester (CMU course 11-777). The tutorial will review fundamental concepts of machine learning and deep neural networks before describing the five main challenges in multimodal machine learning: (1) multimodal representation learning, (2) translation and mapping, (3) modality alignment, (4) multimodal fusion and (5) co-learning. It will also present state-of-the-art algorithms recently proposed for multimodal applications such as image captioning, video description and visual question answering. We will also discuss current and upcoming challenges.

Target audience:

The tutorial is intended for graduate students and researchers interested in multimodal machine learning, with a focus on deep learning approaches. It is aimed at anyone who wants to better understand how to jointly model language, speech and vision.



The tutorial will take place in Neopolitan I-II at Caesars Palace, Las Vegas.


  1. Introduction
    • What is Multimodal? 
      • Historical view, multimodal vs multimedia
    • Why multimodal?
      • Multimodal applications: image captioning, video description, audio-visual speech recognition (AVSR), …
    • Core technical challenges
      • Representation learning, translation, alignment, fusion and co-learning
  2. Basic concepts – Part 1
    • Linear models
      • Score and loss functions, regularization
    • Neural networks
      • Activation functions, multi-layer perceptron
    • Optimization
      • Stochastic gradient descent, backpropagation
  3. Unimodal representations
    • Language representations
      • Distributional hypothesis and word embedding
    • Visual representations 
      • Convolutional neural networks
    • Acoustic representations 
      • Spectrograms, autoencoders  
  4. Multimodal representations
    • Joint representations
      • Visual semantic spaces, multimodal autoencoder
    • Orthogonal joint representations
      • Component analysis
    • Parallel multimodal representations
      • Similarity metrics, canonical correlation analysis
===== BREAK =====
  1. Basic concepts – Part 2
    • Language models
      • Unigrams, bigrams, skip-grams, skip-thought
    • Unimodal sequence modeling
      • Recurrent neural networks, LSTMs
    • Optimization 
      • Backpropagation through time
  2. Multimodal translation and mapping
    • Encoder-decoder models
      • Machine translation, image captioning
    • Generative vs retrieval approaches
      • Viseme generation, visual puppetry
  3. Modality alignment
    • Latent alignment approaches
      • Attention models, multiple instance learning
    • Explicit alignment
      • Dynamic time warping
  4. Multimodal fusion and co-learning
    • Model-free approaches
      • Early and late fusion, hybrid models
    • Kernel-based fusion
      • Multiple kernel learning
    • Multimodal graphical models
      • Factorial HMM, Multi-view Hidden CRF
  5. Future directions and concluding remarks
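
As a rough illustration of one topic in the outline above (dynamic time warping, listed under explicit alignment), the following is a minimal sketch rather than material from the tutorial itself. It assumes two one-dimensional sequences, such as a per-frame audio feature and a per-frame visual feature, and uses the absolute difference as the local cost; the function name `dtw_distance` and the example values are hypothetical.

```python
def dtw_distance(a, b):
    """Return the dynamic time warping cost between two 1-D sequences.

    Assumption for this sketch: the local cost between two samples is
    their absolute difference. Real systems usually compare feature
    vectors with a distance of their choice.
    """
    n, m = len(a), len(b)
    inf = float("inf")

    # cost[i][j] holds the cheapest cumulative cost of aligning
    # the first i samples of a with the first j samples of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the best of the three allowed moves:
            # advance in a only, advance in b only, or advance in both.
            cost[i][j] = local + min(
                cost[i - 1][j],
                cost[i][j - 1],
                cost[i - 1][j - 1],
            )

    return cost[n][m]


if __name__ == "__main__":
    # The second sequence repeats one sample, as if it were recorded
    # at a slightly different rate; warping absorbs the repetition.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The cumulative cost table makes the alignment explicit: each cell depends only on its three neighbours, so the cost is computed in time proportional to the product of the two sequence lengths.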


Louis-Philippe Morency is an Assistant Professor in the Language Technologies Institute at Carnegie Mellon University, where he leads the Multimodal Communication and Machine Learning Laboratory (MultiComp Lab). He was previously research faculty at the University of Southern California and the Institute for Creative Technologies. He received his Ph.D. and Master's degrees from the MIT Computer Science and Artificial Intelligence Laboratory. In 2008, Dr. Morency was selected as one of "AI's 10 to Watch" by IEEE Intelligent Systems. He has received 7 best paper awards at ACM- and IEEE-sponsored conferences for his work on context-based gesture recognition, multimodal probabilistic fusion and computational models of human communication dynamics. Dr. Morency is chair of the advisory committee for the ACM International Conference on Multimodal Interaction and Associate Editor for the IEEE Transactions on Affective Computing.

Tadas Baltrušaitis is a post-doctoral associate at the Language Technologies Institute, Carnegie Mellon University. Before this, he was a post-doctoral researcher at the University of Cambridge, where he also received his PhD degree in 2014. His primary research interests lie in the automatic understanding of non-verbal human behaviour, computer vision, and multimodal machine learning. He has won several machine learning challenges, including the Facial Expression Recognition and Analysis challenge (2015) and the Audio/Visual Emotion Challenge (2011), and is a recipient of the ICMI 2014 best student paper award and the ETRA 2016 Emerging Investigator award.

Suggested reading:

[1] Representation Learning: A Review and New Perspectives. Yoshua Bengio, Aaron Courville, and Pascal Vincent.
[2] Deep Canonical Correlation Analysis. Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. JMLR, 2013.
[3] Visualizing and Understanding Recurrent Networks. Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2015.
[4] Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. TACL, 2015.
[5] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015.
[6] Multi-View Latent Variable Discriminative Models for Action Recognition. Yale Song, Louis-Philippe Morency, and Randall Davis. CVPR, 2012.