PerDream: PERception, Decision making and REAsoning through Multimodal foundational modeling
ICCV 2023 Workshop, Paris, France
October 3rd, 1:30 pm - 6:00 pm, Room S05
Overview
In recent years, large-scale multimodal data has greatly propelled advances across vision, language, robotics, and AI as a whole. While the majority of tasks tackled in multimodal learning have focused on perception, some approaches incorporate both perception and action, such as using pretrained vision or vision-language representations for control. However, these approaches tend to address specific sub-problems in isolation.
Recently, there has been a surge of research on foundation models, including Gato, GPT-4, ImageBind, Voyager, SMART, LLaVA, SEEM, and others, that aims to integrate perceptual understanding, decision making, and reasoning in a unified manner, tackling these aspects jointly rather than separately. Concurrently, datasets have evolved to provide a comprehensive range of information across multiple modalities, supporting both perception and control. These datasets encompass diverse spatial and temporal signals, such as RGB, depth, optical flow, and semantic maps, along with environmental signals like actions and rewards. They further incorporate auxiliary data, including human interactions (instructions, gaze, audio/sounds, activities) and external knowledge from the internet or tool use. As a result, multimodal foundation modeling is on the cusp of addressing perception, decision making, and reasoning simultaneously through unified approaches that leverage vast amounts of spatio-temporal data from diverse modalities.
The objective of this workshop is to foster extensive discussion on this emerging topic and to focus on the development of foundation models that encompass perception, decision making, and reasoning by harnessing multimodal data. It aims to encourage interdisciplinary interaction and collaboration among the natural language processing, computer vision, and reinforcement learning communities, serving as a platform for research groups from both academia and industry.
Invited Speakers
Jitendra Malik, Arthur J. Chick Professor of EECS, UC Berkeley
Program
October 3rd, 2023
1:30 - 1:40 pm Welcome
1:40 - 2:10 pm
Speaker: Louis-Philippe Morency
Talk Title: Understanding Multimodal Fusion
2:10 - 2:40 pm
Speaker: Changan Chen
Talk Title: Audio-Visual Embodied AI: From Simulating to Navigating with Sounds in Spaces
2:40 - 3:10 pm
Speaker: Abhinav Gupta
Talk Title: Sound, Camera, Action!
3:10 - 3:25 pm Coffee break / Poster session
3:25 - 3:55 pm
Speaker: Roozbeh Mottaghi
Talk Title: Harnessing Simulation Data’s Potential for Building Foundation Models
3:55 - 4:25 pm
Speaker: Yuke Zhu
Talk Title: Building Multimodal Foundation Models for Embodied Agents
4:25 - 4:55 pm
Speaker: Jitendra Malik
Talk Title: Robot Learning with Sensorimotor Pretraining
4:55 - 5:25 pm Oral session
VQA Therapy: Exploring Answer Differences by Visually Grounding Answers. Chongyan Chen, Samreen Anjum, and Danna Gurari
PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology. Yuxuan Sun et al.
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge. Wei Lin et al.
5:25 - 6:00 pm Panel session and Closing Remarks
Call for papers
Important Dates
Paper submission deadline: Aug 9th
Notification of accepted papers: Aug 18th
Camera-ready submission deadline: Sep 1st
Topics
The workshop will cover a wide range of topics, including but not limited to:
Large-scale foundation models leveraging multimodal data (vision, language, audio, speech, sensory signals, actions, etc.) and their ability to generalize across heterogeneous data sources, such as simulations, real-world scenarios, and different viewpoints.
Large-scale multimodal benchmarking to facilitate improved research in foundation models that bridge perception, decision making, and reasoning.
Data curation and generation to facilitate the scaling-up of foundation models.
Exploration of model scaling laws and emergent effects.
Efficient training, alignment and inference techniques for foundation models.
Investigation of the benefits and potential risks associated with foundation models.
Applications of foundation models in various domains utilizing multimodal sensory signals, such as autonomous driving, drones, manipulation, embodied agents, video games, and more.
Submission: CMT
Accepted Papers:
Oral:
VQA Therapy: Exploring Answer Differences by Visually Grounding Answers. Chongyan Chen (University of Texas at Austin)*; Samreen Anjum (University of Colorado Boulder); Danna Gurari (University of Colorado Boulder)
PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology. Yuxuan Sun (Westlake University)*; Chenglu Zhu (Westlake University); Sunyi Zheng (Westlake University); Kai Zhang (Ohio State University); Zhongyi Shui (Westlake University); Yunlong Zhang (Westlake University); Honglin Li (Westlake University); Xiaoxuan Yu (Westlake University); Zhao Yizhi (Westlake University); Xinheng Lyu (Westlake University); Ruojia Zhao (Westlake University); Lin Yang (Westlake University)
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge. Wei Lin (Graz University of Technology)*; Leonid Karlinsky (IBM-Research); Nina Shvetsova (Goethe University Frankfurt); Horst Possegger (Graz University of Technology); Mateusz Kozinski (ICG TUGRAZ); Rameswar Panda (MIT-IBM Watson AI Lab); Rogerio Feris (MIT-IBM Watson AI Lab, IBM Research); Hilde Kuehne (University of Bonn); Horst Bischof (Graz University of Technology)
Poster:
ChatGPT for Robotics: Design Principles and Model Abilities. Rogerio Bonatti (Microsoft)*; Sai Vemprala (Scaled Foundations); Ashish Kapoor (Scaled Foundations); Arthur Bucker (Microsoft)
Transforming Event-Based into Spike-Rate Datasets for Enhancing Neuronal Behavior Simulation to Bridging the Gap for SNNs. Sanaullah Sanaullah (Bielefeld University of Applied Sciences)*
A Hybrid Spiking-Convolutional Neural Network Approach for Advancing High-Quality Image Inpainting. Sanaullah Sanaullah (Bielefeld University of Applied Sciences)*; Amanullah Amanullah (Pug Interactive, Vancouver); Kaushik Roy (North Carolina A&T State University); Jeong-A Lee (Chosun University); Son Chul-Jun (Douzone Bizon Co); Thorsten Jungeblut (Bielefeld University of Applied Sciences and Arts)
Masked Diffusion Models Are Fast Learners. Jiachen Lei (Zhejiang University); Peng Cheng (Zhejiang University); Zhongjie Ba (Zhejiang University)*; Kui Ren (Zhejiang University)
Unlocking the Heart Using Adaptive Locked Agnostic Networks. Sylwia Majchrowska (AstraZeneca)*; Anders GF Hildeman (AstraZeneca); Philip A Teare (AstraZeneca)
Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition. Hyeongju Choi (Georgia Institute of Technology)*; Apoorva Beedu (Georgia Institute of Technology); Irfan Essa (Georgia Institute of Technology)