Robust, Trustworthy and Cost-Optimized Learning Across Multiple Modalities: Theory, Algorithms, and Applications (LAMM)
Human perception, a sophisticated cognitive process, integrates information from various modalities to construct a comprehensive understanding of the surrounding environment. By processing sensory inputs, including sight, sound, touch, taste, and smell, humans perceive and interpret the world with remarkable depth and nuance. Visual cues provide spatial context and facilitate object recognition, while sequential and auditory signals convey temporal dynamics and interactions between objects, enabling object tracking, behavior understanding, forecasting, and monitoring. Comprehending these processes therefore requires integrating information from multiple modalities. In recent years, the use of big data and multiple modalities, such as text, audio, and vision, has significantly advanced applications in Computer Vision and Machine Learning.
However, multimodal learning and its applications remain challenging, particularly in terms of (i) robustness, to deal effectively with data imperfections commonly encountered in real-world data, including but not limited to noise, misalignment, discrepancies, and occlusion; (ii) adaptability and generalization, to transfer capabilities across various modalities (e.g., text, vision, audio) from different data sources and formats; (iii) optimized learning and scalability, to process data effectively even with limited computational resources; (iv) interpretability and explainability, to provide clear insights into the decision-making processes of models and foster user trust in their outcomes; (v) trustworthiness and reliability, across a wide array of conditions and a diverse spectrum of user groups, addressing bias from multiple perspectives and establishing confidence in the technology's applications.
Research Scientist - Google
Bio: Du Tran is currently a research scientist at Google working on computer vision and machine learning. Before joining Google, he was a research lead at Samsung Research America and a research scientist at Meta. He graduated with a Ph.D. in computer science from Dartmouth College. His research interests are computer vision, machine learning, and computer graphics, with specific interests in video understanding, representation learning, and vision for robotics.
A2I2, Deakin University
Bio: Dr. Truyen Tran is a Full Professor at Deakin University, Australia, where he serves as Head of AI, Health and Science at the Applied Artificial Intelligence Institute (A2I2). In this role, he leads a world-class team developing robust, human-compatible Generalist AI (AI Future). Dr. Tran has received multiple international awards for his significant research contributions. He obtained his BSc degree from the University of Melbourne in 2001 and a PhD in Computer Science from Curtin University in 2008.
Boston University & Research Scientist - Google
Bio: Boqing Gong is a computer science faculty member at Boston University and a part-time research scientist at Google. His research focuses on AI models' generalization and efficiency and the visual analytics of objects, scenes, human activities, and their interactions.
Meta Reality Labs
Bio: Yunyang Xiong is a research scientist at Meta Reality Labs. Prior to that, he obtained a Ph.D. from the Department of Computer Sciences at the University of Wisconsin–Madison. Before joining Meta, he was a research intern at Amazon Lab126, Facebook AI, and Google Research. His research interests include foundation model optimization, efficient Transformers, and multi-modal LLMs.
Submission Instructions
All submissions will be assessed on their novelty, technical quality, potential impact, insightfulness, depth, clarity, and reproducibility. For each accepted submission, at least one author must attend the workshop and present the paper. Information about formatting and style files is available here. There are two ways to contribute submissions to the workshop:
Extended abstract submissions are single-blind peer-reviewed, and author names and affiliations should be listed. Extended abstract submissions are limited to a total of four pages (including references). Extended abstracts of already published works can also be submitted. Accepted abstracts will not be included in the printed proceedings of the workshop.
Full paper submissions are double-blind peer-reviewed. The submissions are limited to eight pages, including figures and tables, in the ACCV style. Additional pages containing only cited references are allowed. Accepted papers will be presented in an oral session. All accepted full papers will be published by the ACCV in the workshop proceedings.
Important Dates
Workshop paper submission deadline: October 11, 2024 (extended from September 11, 2024)
Notification to authors: October 20, 2024
Camera ready deadline: October 25, 2024
Workshop event: December 8 - December 12, 2024
Submission website: https://cmt3.research.microsoft.com/LAMM2024/Submission/Index
09:00 - 09:05: Opening Remarks (5 minutes).
09:05 - 09:30: Invited Talk - Boqing Gong (Boston University & Research Scientist - Google)
Title: From Domain Adaptation to VideoPrism: A Decade-Long Quest for Out-of-Domain Generalization
Abstract: This talk explores the challenges of out-of-domain (OOD) generalization in computer vision, encompassing tasks like domain adaptation, webly-supervised learning, and simulation-to-reality transfer. It examines a decade of research into OOD generalization, highlighting techniques such as kernel methods, representation learning, and curriculum domain adaptation. Finally, the talk connects these techniques to the recent development of generalist vision systems, showcasing VideoPrism – a state-of-the-art generalist video encoding model – and ongoing research into image and video generation models.
09:30 - 09:55: Invited Talk - Yunyang Xiong (Research Scientist - Meta)
Title: Efficient Vision-Language LLMs
Abstract: We are currently seeing an evolution in personal computing. With devices like smart glasses, we can take computing and interaction capabilities anywhere. These devices can see what we see and truly live alongside us, opening up a whole new wealth of opportunities for computers to assist us in our daily lives. To enable the smooth integration of AI into smart glasses, fully aware of their environment and efficiently making intelligent choices based on user intent, our team has been working on efficient vision-language LLMs. In this talk, I will mainly cover MiniGPT-v2, a unified interface for image-language multi-task learning, and LongVU, an LLM for long video-language understanding.
09:55 - 10:07: Oral Presentation - Tuan Nguyen
Title: Smart Camera Parking System With Auto Parking Spot Detection
10:07 - 10:19: Oral Presentation - Vuong Ho
Title: RSSeq: Sequence-to-Sequence Model for Simultaneous Referring Remote Sensing Segmentation and Detection
10:19 - 10:30: Oral Presentation - An Nguyen
Title: Monomial Matrix Group Equivariant Neural Functional Networks
10:30 - 10:55: Invited Talk - Du Tran (Research Scientist - Google)
Title: Can Machines Understand Long Videos?
Abstract: Video understanding, an important sub-area of computer vision, has various useful applications ranging from video retrieval and visual sensing to robot learning. While state-of-the-art methods excel at simple tasks like classification and detection on short-form videos, they often fall short when confronted with the complexity of longer-duration content. In this talk, I'll delve into our recent work on long video understanding. Our approach enables the analysis of hour-long videos, unlocking new possibilities for applications like egocentric video retrieval and video question-answering. I'll share insights into how we've overcome the challenges posed by long-form video and discuss potential research directions in this area.
10:55 - 11:20: Invited Talk - Truyen Tran (Deakin University)
Title: Compositional Visual Reasoning via Large Vision-Language Models
Abstract: In this talk, I will present our research on visual reasoning in the era of Large Vision-Language Models (LVLMs). While these models demonstrate remarkable capabilities in concept matching, they face key challenges in compositional reasoning and semantic understanding. Our work addresses fundamental challenges: bridging the visual-linguistic gap, enabling effective in-context learning, and constructing hierarchical alignments between visual and textual elements. We introduce two complementary frameworks: SADL and PromViL. SADL approaches compositional Visual QA through semantic sampling, question decomposition, and progressive pseudo-labeling, enabling effective processing of complex visual queries without expensive fine-tuning. PromViL constructs hierarchical multi-modal alignments that progressively connect visual and linguistic elements, building from simple to complex concepts. Through evaluations on multiple Visual QA datasets and a novel dataset derived from Visual Genome, we demonstrate significant improvements in visual grounding and compositional reasoning tasks. Our findings highlight the importance of structured approaches to visual reasoning and suggest promising directions for bridging modalities in visual understanding.
Poster Session & Coffee Break
Closing Remarks
Zoom Meeting URL: https://uark.zoom.us/j/84042276115?pwd=wBTFGfX71y8PTaCt1vR4OZ4IL2aS0G.1
Meeting ID: 840 4227 6115
Passcode: i29ZcRR&
Organizers
Dr. Ngan Le
University of Arkansas
thile@uark.edu
Dr. Arun Ross
Michigan State University
rossarun@cse.msu.edu
Dr. Bhiksha Ramakrishnan
Carnegie Mellon University
bhiksha@cs.cmu.edu
Dr. Utsav Prabhu
utsavprabhu@google.com
Dr. Hien Nguyen
University of Houston
hvnguy35@central.uh.edu
Dr. Anh Nguyen
University of Liverpool
Anh.Nguyen@liverpool.ac.uk
Dr. Khoa Luu
University of Arkansas
khoaluu@uark.edu
Dr. Tung Kieu
Aalborg University
tungkvt@cs.aau.dk
Dr. Chenchen Zhu
Meta
chenchenz@meta.com
Dr. Minh-Triet Tran
University of Science
tmtriet@fit.hcmus.edu.vn
Program Committee
Dr. Tan Nguyen
Program Chair
National University of Singapore (NUS)
tanmn@nus.edu.sg
Dr. Le Duy Dung
Program Chair
VinUniversity
dung.ld@vinuni.edu.vn
Dr. Le Duc Trong
Program Chair
VNU UET
trongld@vnu.edu.vn
Website & Event Organizers
Kashu Yamazaki
Carnegie Mellon University
kyamazak@andrew.cmu.edu
Khoa Vo
University of Arkansas
khoavoho@uark.edu
Esteban Duran Marti
University of Arkansas
eaduran@uark.edu