First session
9:00 - 9:05 Opening remarks
9:05 - 9:35 Keynote 1: Long Chen - The Interplay of Understanding and Generation in Multimodal AI
9:35 - 10:05 Keynote 2: Manling Li - Why is Spatial Understanding Hard for VLMs?
10:05 - 10:35 Keynote 3: Na Zhao - From Perception to Action: Foundation Models for 3D Spatial Intelligence
10:35 - 11:00 Morning tea break
Second session
11:00 - 11:20 Outstanding Submission Oral Presentation
11:20 - 11:50 Keynote 4: Liwei Wang - Learning from Videos to 3D Spatial Intelligence
11:50 - 12:20 Keynote 5: Yingwei Pan - Multimodal Content Generation: Unleashing Infinite Creative Possibilities for the Future
12:20 - 12:25 Closing remarks