Multi-Modal Learning from Videos

Room 201B, June 17, 2019

CVPR 2019, LONG BEACH

"Multisensory Integration (also known as multimodal integration) describes a process by which information from different sensory systems is combined to influence perception, decisions, and overt behavior."

Stein et al.

Introduction

Video data is growing explosively as a result of ubiquitous acquisition capabilities. The videos captured by smartphones, ground surveillance systems, and body-worn cameras can easily reach the scale of gigabytes per day. While this "big video data" is a rich source for information discovery and extraction, the computational challenges it poses are unparalleled. Intelligent algorithms for automatic video understanding, summarization, and retrieval have emerged as a pressing need in this context. Progress on these topics will enable autonomous systems to act quickly and decisively on the information in videos, which would otherwise not be possible.

Schedule

This workshop takes place on Monday, June 17, in Room 201B.


  • 9:30am-9:35am  Opening remarks
  • 9:35am-10:10am  Multimodal Knowledge Graph Construction from Visual and Text Data [Abstract]
  • 10:10am-10:30am  Coffee break
  • 10:30am-11:05am  Video Understanding from a Sentence
  • 11:05am-11:40am  Learning to Act by Watching Videos
  • 11:40am-12:30pm  Lunch
  • 12:30pm-2:00pm  Poster session
  • 2:00pm-2:40pm  Learning Visual Representation and Grounded Language Generation
  • 2:40pm-3:15pm  Learning from First-Person Video
  • 3:15pm-3:30pm  Coffee break
  • 3:30pm-4:05pm  Multimodal AI: Human Behavior Understanding [Abstract]
  • 4:05pm-4:40pm  Sound Localization and Dance to Music [Abstract]
  • 4:40pm-5:00pm  Closing remarks


Accepted Posters

  • Grounded Video Description. Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason Corso, Marcus Rohrbach.
  • The Emotionally Intelligent Robot: Improving Socially-aware Human Prediction in Crowded Environments. Aniket Bera, Tanmay Randhavane, Dinesh Manocha.
  • Continuous Hand Gesture Recognition Algorithm Based On Multimodal Feature Fusion. Hoang Nguyen, Guee-Sang Lee, Soo-Hyung Kim, Hyung-Jeong Yang.
  • Self-Supervised Segmentation and Source Separation on Videos. Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba.
  • Adversarial Inference for Multi-Sentence Video Description. Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach.
  • 2.5D Visual Sound. Ruohan Gao, Kristen Grauman.