Video understanding and analysis is a very active research area in the computer vision community. This workshop focuses on modeling, understanding, and leveraging the multimodal nature of video. Recent research has amply demonstrated that in many scenarios multimodal video analysis yields richer results than analysis based on any single modality. At the same time, multimodal analysis poses challenges not encountered when modeling a single modality (e.g., building models that fuse spatial, temporal, and auditory information). The workshop will focus on video analysis and understanding related, but not limited, to the following topics:
- deep network architectures for multimodal learning.
- multimodal unsupervised or weakly supervised learning from video.
- multimodal emotion/affect modeling in video.
- multimodal action/scene recognition in video.
- multimodal video analysis applications including, but not limited to, sports video understanding, entertainment video understanding, and healthcare.
- multimodal embodied perception for vision (e.g. modeling touch and video).
- multimodal video understanding datasets and benchmarks.
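To make the fusion challenge mentioned above concrete, the sketch below shows late (decision-level) fusion, one of the simplest baselines for combining modalities: each modality's classifier produces class scores, and the scores are averaged. The function name, weights, and toy scores are illustrative assumptions, not any particular system's method.

```python
# Minimal sketch of late (decision-level) fusion for multimodal video
# analysis. All names and toy numbers are illustrative, not a reference
# implementation of any published model.

def late_fusion(modality_scores, weights=None):
    """Combine per-modality class scores by weighted averaging.

    modality_scores: dict mapping modality name -> list of class scores.
    weights: optional dict mapping modality name -> float weight;
             defaults to a uniform average over modalities.
    """
    names = list(modality_scores)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    num_classes = len(next(iter(modality_scores.values())))
    fused = [0.0] * num_classes
    for name in names:
        for i, score in enumerate(modality_scores[name]):
            fused[i] += weights[name] * score
    return fused

# Toy example: the visual and audio classifiers disagree on the top
# class, and uniform fusion averages their scores.
scores = {
    "video": [0.7, 0.2, 0.1],
    "audio": [0.3, 0.6, 0.1],
}
fused = late_fusion(scores)  # -> [0.5, 0.4, 0.1]
```

More sophisticated approaches (early fusion of features, cross-modal attention) trade this simplicity for the ability to model interactions between modalities, which is precisely the kind of architecture question the first topic above invites.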