The Multimodal Machine Learning 2015 workshop will be a full-day event held on Friday, December 11, 2015, in room 512Dh.

9:00 – 9:15: Introduction by the organizers
9:15 – 10:00: Keynote: Dhruv Batra (Virginia Tech) "Visual Question Answering (VQA)"
10:00 – 10:30: Coffee Break
10:30 – 11:15: Accepted orals and spotlights
11:15 – 12:30: Poster session
12:30 – 14:30: Lunch Break
14:30 – 15:15: Keynote: Raymond Mooney (UT Austin) "Generating Natural-Language Video Descriptions using LSTM Recurrent Neural Networks"
15:15 – 16:00: Keynote: Li Deng (MSR) "Cross-Modality Distant Supervised Learning for Speech, Text, and Image Classification"
16:00 – 16:30: Coffee Break
16:30 – 17:15: Keynote: Ruslan Salakhutdinov (CMU) "Generating Images from Captions with Attention"
17:15 – 18:00: Keynote: Heng Ji (Rensselaer) "Automatic Cross-Media Event Schema Construction and Knowledge Population"
18:00 – 18:30: Concluding remarks by the organizers


Dhruv Batra: Visual Question Answering (VQA)

I will describe the task of free-form and open-ended Visual Question Answering (VQA).

Given an image and a natural language question about the image (e.g., “What kind of store is this?”, “How many people are waiting in the queue?”, “Is it safe to cross the street?”), the machine’s task is to automatically produce an accurate natural language answer (“bakery”, “5”, “Yes”).


Answering any possible question about an image is one of the ‘holy grails’ of AI, requiring the integration of vision, language, and reasoning. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and more complex reasoning than a system producing generic image captions. We have collected and recently released a dataset containing >250,000 images (from MS COCO and the Abstract Scenes Dataset), >750,000 questions, and ~10 million answers. Preliminary versions of this VQA dataset have begun enabling the next generation of AI systems based on deep learning techniques for understanding images (CNNs) and language (RNNs, LSTMs).
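
The CNN-plus-RNN pipeline the abstract alludes to can be sketched roughly as follows. This is an illustration, not the authors' implementation: the feature dimensions, the bag-of-words question encoder (standing in for an LSTM), the random weights, and the small answer vocabulary are all assumptions made so the sketch runs without a deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the VQA paper's values).
IMG_DIM, Q_VOCAB, HIDDEN, ANSWERS = 512, 1000, 256, 10

# Stand-ins for a pretrained image CNN and a question encoder:
# fixed random projections, untrained.
W_img = rng.standard_normal((IMG_DIM, HIDDEN)) * 0.01
W_q = rng.standard_normal((Q_VOCAB, HIDDEN)) * 0.01
W_out = rng.standard_normal((HIDDEN, ANSWERS)) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_distribution(img_feat, question_token_ids):
    """Fuse image and question embeddings, then classify over answers."""
    h_img = np.tanh(img_feat @ W_img)                      # image embedding
    bow = np.bincount(question_token_ids, minlength=Q_VOCAB).astype(float)
    h_q = np.tanh(bow @ W_q)                               # question embedding
    fused = h_img * h_q                                    # elementwise fusion
    return softmax(fused @ W_out)                          # P(answer | image, question)

img = rng.standard_normal(IMG_DIM)   # mock CNN feature vector for one image
question = np.array([3, 17, 250])    # mock token ids for a question
probs = answer_distribution(img, question)
print(probs.shape, float(probs.sum()))
```

The elementwise fusion of the two embeddings, followed by a softmax over a fixed answer set, mirrors the common VQA baseline design of combining a CNN image encoding with a question encoding before classification.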

Heng Ji (joint work with Shih-Fu Chang): Automatic Cross-Media Event Schema Construction and Knowledge Population

Knowledge extraction and representation have been common goals for both the text domain and the visual domain. A few significant benchmarking efforts, such as TREC and TRECVID, have also demonstrated important progress in information extraction from data of different modalities. However, no single-modality approach is complete and fully reliable. Text Knowledge Base Population (KBP) tools cover important high-level events, entities, and relations, but they often do not provide the complete details depicting the physical scenes, objects, or activities. Visual recognition systems, despite recent progress, still fall short of extracting high-level semantics comparable to their counterparts from text. In this talk, we will present our recent joint efforts at Columbia and RPI to develop a Scalable, Portable, and Adaptive Multi-media Knowledge Construction Framework that exploits cross-media knowledge, resource transfer, and bootstrapping to dramatically scale up cross-media knowledge extraction. We have developed novel cross-media methods (including a cross-media deep learning model and “Liberal” KBP) to automatically construct multimodal semantic schemas for events, improve extraction through inference and conditional detection, and enrich knowledge through cross-media, cross-lingual event co-reference and linking.

Li Deng: Cross-Modality Distant Supervised Learning for Speech, Text, and Image Classification

Standard speech recognition, image recognition, and text classification methods make use of supervision labels within each of the speech, image, and text modalities separately. This is far from how children learn to recognize speech and images and to classify text. For example, children often get a distant “supervision” signal for speech sounds from an adult pointing to an image, scene, text, or handwriting associated with the speech sounds; similarly, children learn image categories using speech sounds or text as the supervision signal.
Motivated by such natural cross-modality learning in humans, a computational model is developed that leverages multimodal data to improve engineering systems for speech, image, and text processing. The mechanism underlying this model is a similarity measure defined in a shared semantic space into which speech, image, and text are all mapped via high-capacity deep neural networks trained using maximum mutual information across different modalities. Examples of speech recognition, which maps from speech to text, and image captioning, which maps from image to text, illustrate the principle of this type of cross-modality distant supervised learning.
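
The shared-semantic-space similarity mechanism can be sketched as follows. This is a toy illustration under stated assumptions: the per-modality deep networks are replaced by single untrained linear maps, the dimensions are invented, and the maximum-mutual-information training itself is not implemented; only the scoring step is shown.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions; the real model uses high-capacity deep networks.
SPEECH_DIM, TEXT_DIM, SEM_DIM = 40, 300, 128

# Stand-ins for the per-modality networks: one linear map per modality
# into a single shared semantic space (assumption made for brevity).
W_speech = rng.standard_normal((SPEECH_DIM, SEM_DIM)) * 0.05
W_text = rng.standard_normal((TEXT_DIM, SEM_DIM)) * 0.05

def embed(x, W):
    z = np.tanh(x @ W)
    return z / np.linalg.norm(z)          # unit norm, so dot = cosine similarity

def cross_modal_scores(query_vec, W_query, candidates, W_cand):
    """Score one modality's input against candidate items from another modality."""
    q = embed(query_vec, W_query)
    sims = np.array([q @ embed(c, W_cand) for c in candidates])
    e = np.exp(sims - sims.max())
    return e / e.sum()                    # distribution over candidates

speech = rng.standard_normal(SPEECH_DIM)               # mock acoustic features
text_candidates = rng.standard_normal((5, TEXT_DIM))   # mock text label vectors
probs = cross_modal_scores(speech, W_speech, text_candidates, W_text)
print(probs)
```

During training, paired cross-modal items (e.g., a spoken utterance and its transcript) would be pushed toward high similarity in this space, which is what lets one modality act as a distant supervision signal for another.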

Raymond Mooney: Generating Natural-Language Video Descriptions using LSTM Recurrent Neural Networks

We present a novel method for automatically generating English sentences describing short videos using deep neural networks. Specifically, we apply convolutional and Long Short-Term Memory (LSTM) recurrent networks to translate videos into English descriptions using an encoder/decoder framework. A sequence of image frames (represented using deep visual features) is first mapped to a vector encoding the full video, and this encoding is then mapped to a sequence of words. Experimental evaluation on a corpus of short YouTube videos and movie clips annotated by the Descriptive Video Service demonstrates the capabilities of the technique by comparing its output to human-generated descriptions.
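
The encoder/decoder flow described above (frames → one video vector → word sequence) can be sketched as follows. All weights, dimensions, and the tiny vocabulary are illustrative assumptions; the untrained simple recurrence stands in for the trained LSTM layers, so the emitted words are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

FEAT, HID = 64, 32
VOCAB = ["<bos>", "<eos>", "a", "man", "is", "playing", "guitar"]

# Stand-ins for trained encoder/decoder weights (random here).
W_enc = rng.standard_normal((FEAT, HID)) * 0.1
W_h = rng.standard_normal((HID + len(VOCAB), HID)) * 0.1
W_out = rng.standard_normal((HID, len(VOCAB))) * 0.1

def encode(frames):
    """Map a sequence of per-frame features to one video encoding."""
    return np.tanh(frames.mean(axis=0) @ W_enc)

def decode(video_vec, max_len=8):
    """Greedy word-by-word decoding conditioned on the video encoding."""
    h = video_vec
    prev = np.zeros(len(VOCAB))
    prev[0] = 1.0                                  # start token <bos>
    words = []
    for _ in range(max_len):
        h = np.tanh(np.concatenate([h, prev]) @ W_h)   # recurrent state update
        idx = int(np.argmax(h @ W_out))                # most likely next word
        if VOCAB[idx] == "<eos>":
            break
        words.append(VOCAB[idx])
        prev = np.zeros(len(VOCAB))
        prev[idx] = 1.0
    return words

frames = rng.standard_normal((10, FEAT))   # mock per-frame CNN features
sentence = decode(encode(frames))
print(sentence)
```

Averaging frame features into a single vector is the simplest form of the encoder; the talk's sequence-to-sequence variant (S2VT, also in the poster list) instead reads the frames with an LSTM before decoding.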

Ruslan Salakhutdinov: Generating Images from Captions with Attention

I will introduce deep models that are capable of extracting a unified representation that fuses together multiple data modalities. In particular, I will first briefly discuss models that generate natural language descriptions (captions) of images based on the third-order neural language model (Kiros, Salakhutdinov, Zemel, 2015). I will then introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas while attending to the relevant words in the description. After training on the Microsoft COCO dataset, we compare our model with several baseline generative models on image generation and retrieval tasks. Our model produces higher-quality samples than many other approaches and can generalize to captions describing novel scene compositions not seen in the dataset, such as “A stop sign is flying in blue skies”.

Joint work with Elman Mansimov, Emilio Parisotto, and Jimmy Lei Ba.
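
The core loop of the abstract, attending to caption words while iteratively drawing patches on a canvas, can be sketched as follows. Everything here is an assumption for illustration: the word embeddings and per-step queries are random stand-ins for the model's learned caption encoder and recurrent decoder, and the patch generator is a single linear map rather than the paper's DRAW-style network.

```python
import numpy as np

rng = np.random.default_rng(3)

EMB, CANVAS, STEPS = 16, (8, 8), 4
caption = ["a", "stop", "sign", "is", "flying", "in", "blue", "skies"]

# Mock word embeddings and per-step attention queries (untrained).
word_vecs = rng.standard_normal((len(caption), EMB))
queries = rng.standard_normal((STEPS, EMB))
W_patch = rng.standard_normal((EMB, CANVAS[0] * CANVAS[1])) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

canvas = np.zeros(CANVAS)
for t in range(STEPS):
    attn = softmax(word_vecs @ queries[t])        # weights over caption words
    context = attn @ word_vecs                    # attended caption summary
    patch = np.tanh(context @ W_patch).reshape(CANVAS)
    canvas += patch                               # iteratively refine the canvas
image = 1 / (1 + np.exp(-canvas))                 # squash to pixel intensities
print(image.shape, float(image.min()), float(image.max()))
```

Each step reads a different mixture of caption words, which is what lets the model place different described elements on the canvas over time rather than generating the image in one shot.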

Oral presentation:
Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences - Hongyuan Mei, Mohit Bansal, Matthew Walter  [pdf]

Oral spotlights:

An Analysis-By-Synthesis Approach to Multisensory Object Shape Perception - Goker Erdogan, Ilker Yildirim, Robert Jacobs [pdf]

Active Perception based on Multimodal Hierarchical Dirichlet Processes - Tadahiro Taniguchi, Toshiaki Takano, Ryo Yoshino [pdf]

Towards Deep Alignment of Multimodal Data - George Trigeorgis, Mihalis Nicolaou, Stefanos Zafeiriou, Björn Schuller

Multimodal Transfer Deep Learning with an Application in Audio-Visual Recognition - Seungwhan Moon, Suyoun Kim, Haohan Wang [pdf]

Multimodal Convolutional Neural Networks for Matching Image and Sentence - Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li [pdf]

Group sparse factorization of multiple data views - Eemeli Leppäaho, Samuel Kaski [pdf]

Unveiling the Dreams of Word Embeddings: Towards Language-Driven Image Generation - Angeliki Lazaridou, Dat Tien Nguyen, Raffaella Bernardi, Marco Baroni [pdf]

Cross-Modal Attribute Recognition in Fashion - Susana Zoghbi, Geert Heyman, Juan Carlos Gomez Carranza, Marie-Francine Moens [pdf]

Multimodal Sparse Coding for Event Detection - Youngjune Gwon, William Campbell, Kevin Brady, Douglas Sturim, Miriam Cha, H. T. Kung  [pdf]

Multimodal Symbolic Association using Parallel Multilayer Perceptron - Federico Raue, Sebastian Palacio, Thomas Breuel, Wonmin Byeon, Andreas Dengel, Marcus Liwicki [pdf]

Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning - Janarthanan Rajendran, Mitesh Khapra, Sarath Chandar, Balaraman Ravindran [pdf]

Multimodal Learning of Object Concepts and Word Meanings by Robots - Tatsuya Aoki, Takayuki Nagai, Joe Nishihara, Tomoaki Nakamura, Muhammad Attamimi [pdf]

Multi-task, Multi-Kernel Learning for Estimating Individual Wellbeing - Natasha Jaques, Sara Taylor, Akane Sano, Rosalind Picard [pdf]

Generating Images from Captions with Attention - Elman Mansimov, Emilio Parisotto, Jimmy Ba, Ruslan Salakhutdinov [pdf]

Manifold Alignment Determination - Andreas Damianou, Neil Lawrence, Carl Henrik Ek [pdf]

Accelerating Multimodal Sequence Retrieval with Convolutional Networks - Colin Raffel, Daniel P. W. Ellis [pdf]

Audio-Visual Fusion for Noise Robust Speech Recognition - Nagasrikanth Kallakuri, Ian Lane [pdf]

Learning Multimodal Semantic Models for Image Question Answering - Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng [pdf]

Greedy Vector-valued Multi-view Learning - Hachem Kadri, Stephane Ayache, Cecile Capponi, François-Xavier Dupé [pdf]

S2VT: Sequence to Sequence -- Video to Text - Subhashini Venugopalan, Marcus Rohrbach [pdf]