Synergizing Reasoning and Decision-Making in Open-World Environments
A NeurIPS 2024 Workshop
12/15/2024 (Full day)
East Building - MTG 1-3 + S.FOY, Vancouver Convention Center
9:00 AM - 9:10 AM
Opening remarks
9:10 AM - 9:40 AM
Invited talk: What's Missing for Robot Foundation Models?
Ted Xiao (Google DeepMind)
Intelligent robotics has seen tremendous progress in recent years. In this talk, I propose that trends in robot learning have historically followed key trends in broader foundation modeling almost exactly. After covering robotics projects that showcase the power of following these foundation-modeling paradigms, I will focus on a few forward-looking research directions that suggest how robot learning systems may develop differently from LLMs and VLMs.
9:40 AM - 10:10 AM
Invited talk: Structured Representations for Human-Centered Embodied AI
Jiajun Wu (Stanford University)
For embodied AI systems to assist humans in the real world, we must consider human factors in all aspects. In this talk, I'll introduce our recent work on developing human-centered embodied AI, including assets, environments, tasks, and learning algorithms. What is shared across all these works is our choice of structured representations inspired by human sensory observations, skills, and behaviors. I will discuss how we design the representations, and how we use them to collect assets, build environments, design tasks, and finally develop algorithms to solve embodied AI problems.
10:10 AM - 10:40 AM
Panel Discussion: The Past, Present, and Future of Open-World Agents
Panelists: John Langford, Ted Xiao, Tao Yu, Natasha Jaques
This panel brings together leading experts to examine the evolution and trajectory of open-world agents. We will reflect on past breakthroughs and foundational advances in agentic reasoning, consider the state of the art in multimodal interactions, and discuss emerging frontiers, from LLM-driven action planning to lifelong and human-in-the-loop learning. By revisiting historical milestones and evaluating the present landscape, the panel aims to illuminate a roadmap for future innovations in open-world agents, including explorations into robotics and multi-agent coordination, ultimately guiding the development of generalist agents and intelligence.
10:45 AM - 11:45 AM
Poster session 1 & coffee socials (1h)
1:00 PM - 1:30 PM
Invited talk: Building AI Society with Foundation-Model Agents
Zhenfei Yin (Shanghai AI Lab & The University of Sydney)
AI agents based on LLMs or VLMs have already demonstrated an exceptional ability to solve complex problems, and these models are increasingly being extended to a wide range of downstream applications, such as workflow automation on operating systems, scientific research and discovery, and embodied AI. The integration of foundation models such as VLMs, VLAs, and generative models, combined with external scaffolds like memory mechanisms, system prompts, external knowledge bases, and toolkits, has enabled the emergence of systematic agents capable of tackling complex, long-sequence tasks. However, human society is a complex system formed by diverse organizations, in which multiple individuals collaborate and compete within a set of environmental rules to achieve unified goals or indirectly influence the environment's state. Thus, we envision that multi-agent systems built upon these foundation models will exhibit the potential to scale from individual agents to organizational entities. This talk will review the history of AI agents, briefly discuss the architectures of foundation-model-based single agents in various fields, and focus on swarm intelligence for multi-agent task completion. Finally, we will explore how, as these agents are deployed, they form collective intelligence, creating a coexistence between humans and AI agents in society.
1:30 PM - 2:00 PM
Invited talk: Generative World Modeling for Embodied Agents
Sherry Yang (New York University)
Generative models have transformed content creation, and the next frontier may be simulating realistic experiences in response to actions by humans and agents. In this talk, I will present a line of work on learning a real-world simulator (i.e., a world model) that emulates interactions through generative modeling of video content. I will then discuss applications of this world model to training embodied agents through reinforcement learning (RL) and planning, which have demonstrated zero-shot real-world transfer. Lastly, I will describe how to improve generative world models from real-world feedback.
2:00 PM - 2:30 PM
2:30 PM - 3:30 PM
Poster session 2 & coffee socials (1h)
3:30 PM - 4:00 PM
Invited talk: Scaling Multimodal Computer Agents
Tao Yu (The University of Hong Kong)
Recent advances in vision-language models (VLMs) have enabled AI agents to operate computers just as humans do. In this talk, I will present our approach to scaling these agents through three key dimensions: data, methods, and evaluation. First, I will introduce how we leverage internet-scale instructional videos and human demonstrations via our AgentNet platform to build large-scale computer interaction datasets. I will then discuss our methods for training foundation models that ground natural language into interface actions. Finally, I will present Agent Arena, our open platform for scalable real-world evaluation through crowdsourced user computer interactions, and outline key directions for improving agent robustness and safety for real-world deployment.
4:00 PM - 4:30 PM
Invited talk: Social Reinforcement Learning for Coordination, Social Reasoning, and Online Adaptation
Natasha Jaques (University of Washington)
If AGI is right around the corner, why are embodied AI agents deployed in real-world settings still so dumb? In open world settings, AI still fails to coordinate effectively with other agents, follow natural language instructions to complete embodied tasks, and generalize to circumstances not encountered during training. In contrast, humans and animals can easily adapt to new circumstances, coordinate with others, and acquire complex behaviors. I argue that Social Learning is a key facet of intelligence that gives rise to all of these impressive capabilities. By improving the social intelligence of AI agents, we can get a step closer to adaptive, flexible, generalist open world agents. This talk will overview recent work in the Social Reinforcement Learning lab, describing how to enable smooth coordination with diverse human partners, improve social reasoning for understanding natural language commands, and use social learning to enable rapid online adaptation to new environments by learning from experts.
4:30 PM - 5:00 PM
5:00 PM - 5:15 PM
Awards and closing remarks