Accepted Abstracts
We are excited to share the abstracts accepted to this workshop, and we are looking forward to the spotlight talks and poster presentations!
All abstracts will be presented as posters between 2:45pm and 3:30pm.
Abstracts that are highlighted as spotlight talks will be presented between 4:30pm and 5pm.
Please refer to the schedule for further details!
How can LLMs transform the robotic design process? (Oral Presentation 4:30pm)
Francesco Stella, Cosimo Della Santina, Josie Hughes
We show that LLMs, such as ChatGPT, can guide the robot design process, and we discuss the technical and societal implications. We propose novel AI-human collaboration approaches that could allow AI agents to design their own robotic body.
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance (Oral Presentation 4:36pm)
Jesse Zhang, Jiahui Zhang, Xiang Ren, Shao-Hua Sun, Minsuk Chang, Karl Pertsch, Joseph Lim
We propose BOSS, an approach that automatically learns to solve new long-horizon, complex, and meaningful tasks by autonomously growing a learned skill library. Prior work in reinforcement learning requires expert supervision in the form of demonstrations or rich reward functions to learn long-horizon tasks. Instead, our approach BOSS (BOotStrapping your own Skills) learns to accomplish new tasks by performing “skill bootstrapping,” where an agent with a set of primitive skills interacts with the environment to practice new skills without receiving reward feedback for tasks outside of the initial skill set. This bootstrapping phase is guided by large language models (LLMs) that inform the agent of meaningful skills to chain together. Through this process, BOSS builds a wide range of complex and useful behaviors from a basic set of primitive skills. We demonstrate through experiments in realistic household environments that agents trained with our LLM-guided bootstrapping procedure outperform those trained with naïve bootstrapping as well as prior unsupervised skill acquisition methods on zero-shot execution of unseen, long-horizon tasks in new environments.
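As a rough illustration of the kind of loop the abstract describes (not the authors' implementation), a minimal skill-bootstrapping sketch might look like the following; `query_llm` and `execute_skill` are assumed placeholder helpers:

```python
import random

def skill_bootstrapping(env, skill_library, query_llm, execute_skill, n_rounds=100):
    """Minimal sketch of LLM-guided skill bootstrapping (names are assumptions).

    skill_library: list of skill names the agent can already execute.
    query_llm:     callable(prompt) -> a suggested next skill name (LLM wrapper).
    execute_skill: callable(env, skill_name) -> bool success flag.
    """
    for _ in range(n_rounds):
        chain = [random.choice(skill_library)]
        if not execute_skill(env, chain[0]):
            continue
        # Ask the LLM which known skill would meaningfully follow the chain so far.
        suggestion = query_llm(
            f"The robot has just done: {', '.join(chain)}. "
            f"Pick one skill from {skill_library} that would be a useful next step."
        )
        if suggestion in skill_library and execute_skill(env, suggestion):
            chain.append(suggestion)
            # A successfully executed chain becomes a new, longer skill.
            skill_library.append(" then ".join(chain))
    return skill_library
```

The point of the sketch is that no external reward is needed during bootstrapping; the LLM's suggestions stand in for task supervision.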
TidyBot: Personalized Robot Assistance with Large Language Models (Oral Presentation 4:42pm)
Jimmy Wu, Rika Antonova, Adam Kan, Marion Lepert, Andy Zeng, Shuran Song, Jeannette Bohg, Szymon Rusinkiewicz, Thomas Funkhouser
For a robot to personalize physical assistance effectively, it must learn user preferences that can be generally reapplied to future scenarios. In this work, we investigate personalization of household cleanup with robots that can tidy up rooms by picking up objects and putting them away. A key challenge is determining the proper place to put each object, as people's preferences can vary greatly depending on personal taste or cultural background. For instance, one person may prefer storing shirts in the drawer, while another may prefer them on the shelf. We aim to build systems that can learn such preferences from just a handful of examples via prior interactions with a particular person. We show that robots can combine language-based planning and perception with the few-shot summarization capabilities of large language models (LLMs) to infer generalized user preferences that are broadly applicable to future interactions. This approach enables fast adaptation and achieves 91.2% accuracy on unseen objects in our benchmark dataset. We also demonstrate our approach on a real-world mobile manipulator called TidyBot, which successfully puts away 85.0% of objects in real-world test scenarios.
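As a hedged sketch of the few-shot summarization idea described above (not the TidyBot codebase), the preference-inference step could be approximated as follows; `llm_complete` is an assumed wrapper around an LLM completion API:

```python
def summarize_preferences(examples, llm_complete):
    """Sketch: turn a handful of observed placements into general rules.

    examples:     list of (object, receptacle) pairs observed for one user.
    llm_complete: callable(prompt) -> completion string (assumed LLM wrapper).
    """
    shots = "\n".join(f"{obj} -> {place}" for obj, place in examples)
    prompt = (
        "Here is where one person put away a few objects:\n"
        f"{shots}\n"
        "Summarize this person's preferences as general rules of the form "
        "'category -> receptacle'."
    )
    return llm_complete(prompt)

def place_new_object(obj, rules, llm_complete):
    # Apply the summarized rules to decide where an unseen object should go.
    prompt = f"Rules:\n{rules}\nWhere should '{obj}' go? Answer with one receptacle."
    return llm_complete(prompt).strip()
```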
Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains (Oral Presentation 4:48pm)
Divyanshu Raj, Chitta Baral, Nakul Gopalan
In this work, we present an approach to identify sub-tasks within a demonstrated robot trajectory using language instructions. We identify these sub-tasks using language provided during demonstrations as guidance to identify sub-segments of a longer robot trajectory. Given a sequence of natural language instructions and a long trajectory consisting of image frames and discrete actions, we want to map an instruction to a smaller fragment of the trajectory. Unlike previous instruction following works which directly learn the mapping from language to a policy, we propose a language-conditioned change-point detection method to identify sub-tasks in a problem. Our approach learns the relationship between constituent segments of a long language command and corresponding constituent segments of a trajectory. These constituent trajectory segments can be used to learn subtasks or sub-goals for planning or options as demonstrated by previous related work. Our insight in this work is that the language-conditioned robot change-point detection problem is similar to the existing video moment retrieval works used to identify sub-segments within online videos. Through extensive experimentation, we demonstrate a 1.78±0.82% improvement over a baseline approach in accurately identifying sub-tasks within a trajectory using our proposed method. Moreover, we present a comprehensive study investigating sample complexity requirements on learning this mapping, between language and trajectory sub-segments, to understand if the video retrieval-based methods are realistic in real robot scenarios.
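For intuition only, a moment-retrieval-style baseline for this kind of language-conditioned segment lookup might score candidate trajectory windows against an instruction embedding; the feature extractors below are assumed, and this is not the authors' method:

```python
import torch

def retrieve_segment(frame_feats, instr_feat, max_len=50):
    """Pick the trajectory window whose pooled features best match an instruction.

    frame_feats: (T, D) tensor of per-frame features from the demonstration.
    instr_feat:  (D,) tensor embedding one natural-language instruction.
    """
    T = frame_feats.shape[0]
    best, best_score = (0, 1), float("-inf")
    for start in range(T):
        for end in range(start + 1, min(start + max_len, T) + 1):
            window = frame_feats[start:end].mean(dim=0)           # pool the window
            score = torch.cosine_similarity(window, instr_feat, dim=0)
            if score > best_score:
                best, best_score = (start, end), score.item()
    return best  # predicted (start, end) change-points for this instruction
```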
LIV: Language-Image Representations and Rewards for Robotic Control (Oral Presentation 4:54pm)
Yecheng Ma, William Liang, Vaidehi Som, Vikash Kumar, Amy Zhang, Osbert Bastani, Dinesh Jayaraman
We present Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations. Exploiting a novel connection between dual reinforcement learning and mutual information contrastive learning, the LIV objective trains a multi-modal representation that implicitly encodes a universal value function for tasks specified as language or image goals. We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen.
Given only a language or image goal, the pre-trained LIV model can assign dense rewards to each frame in videos of unseen robots or humans attempting that task in unseen environments. Further, when some target domain-specific data is available, the same objective can be used to fine-tune and improve LIV and even other pre-trained representations for robotic control and reward specification in that domain. In our experiments on several simulated and real-world robot environments, LIV models consistently outperform the best prior input state representations for imitation learning, as well as reward specification methods for policy synthesis. Our results validate the advantages of joint vision-language representation and reward learning within the unified, compact LIV framework.
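As a simplified sketch of how a shared vision-language embedding can serve as a dense reward (the encoders would come from a pre-trained LIV-style model and are not shown; the function below is an assumption, not the released code):

```python
import torch

def dense_rewards(frame_embeddings, goal_embedding):
    """Reward each frame by its similarity to the goal in the shared embedding space.

    frame_embeddings: (N, D) tensor, one embedding per video frame.
    goal_embedding:   (D,) tensor for either a language goal or an image goal.
    """
    # Higher reward for frames that look closer to task completion.
    return torch.cosine_similarity(frame_embeddings, goal_embedding.unsqueeze(0), dim=-1)

# Example with stand-in embeddings (in practice these come from the model):
frames = torch.randn(100, 512)
goal = torch.randn(512)
rewards = dense_rewards(frames, goal)   # shape (100,), one reward per frame
```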
Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
Nikolaos Gkanatsios, Ayush Jain, Zhou Xian, Yunchu Zhang, Christopher G Atkeson, Katerina Fragkiadaki
Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time.
We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets are publicly available on our website: https://ebmplanner.github.io.
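To make the "gradient descent on a sum of energies" idea concrete, here is a toy sketch with hand-written energy functions over 2D object positions; the actual system grounds and learns these energies and only relocates the instructed objects, so treat this purely as an illustration:

```python
import torch

# Toy energy functions over 2D positions (illustrative stand-ins, not learned EBMs).
def left_of(a, b, margin=0.1):
    return torch.relu(a[0] - b[0] + margin)        # low when a is left of b

def near(a, b, dist=0.2):
    return (torch.norm(a - b) - dist).abs()        # low when a is ~dist away from b

def plan_goal_positions(positions, constraints, steps=200, lr=0.05):
    """Gradient descent on the summed energies, one term per language predicate."""
    pos = {k: v.clone().requires_grad_(True) for k, v in positions.items()}
    opt = torch.optim.Adam(pos.values(), lr=lr)
    for _ in range(steps):
        energy = sum(fn(pos[a], pos[b]) for fn, a, b in constraints)
        opt.zero_grad()
        energy.backward()
        opt.step()
    return {k: v.detach() for k, v in pos.items()}

# e.g. "put the mug to the left of the plate and near the bowl"
positions = {"mug": torch.tensor([0.5, 0.0]),
             "plate": torch.tensor([0.0, 0.0]),
             "bowl": torch.tensor([0.3, 0.4])}
goal = plan_goal_positions(positions, [(left_of, "mug", "plate"), (near, "mug", "bowl")])
```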
Text-conditioned human motion generation for human-robot interaction
Choong Hee Kim, "Jane" Jaejeong Shin
Recent advancements in language models have had a significant impact on robotics research, spanning various applications such as multi-modal perception and human-robot interaction (HRI). In HRI, generating realistic 3D human motion is crucial for the physical motion planning of a social robot, but it remains challenging for the following reasons. First, collecting data from a motion capture system is difficult to scale up, and it is hard to capture the diversity of human motion within the same situation. Second, synthetic human motion for virtual simulation is often not physically plausible, which makes it difficult to use in a 3D world environment. In this work, we address these problems with a text- and scene-conditioned human motion generation architecture. Building on previous works and datasets, our architecture produces more diverse and goal-oriented human motion than the baseline methods proposed in prior work. Moreover, we conducted a survey for a qualitative evaluation of the generated motion in order to complement the limitations of the metrics used for the quantitative evaluation, and we discuss possible future directions of this work.
How to Tidy Up a Table: Fusing Visual and Semantic Commonsense Reasoning for Robotic Tasks with Vague Objectives
Yiqing Xu, David Hsu
Vague objectives in many real-life scenarios pose long-standing challenges for robotics, as defining rules, rewards, or constraints for optimization is difficult.
Tasks like tidying a messy table may appear simple for humans, but articulating the criteria for tidiness is complex due to the ambiguity and flexibility in commonsense reasoning. Recent advancements in Large Language Models (LLMs) offer us an opportunity to reason over these vague objectives: learned from extensive human data, LLMs capture meaningful common sense about human behavior. However, as LLMs are trained solely on language input, they may struggle with robotic tasks due to their limited capacity to account for perception and low-level controls. In this work, we propose a simple framework to solve robotic tasks with vague objectives. Specifically, by learning a lightweight, image-based, task-specific critic, we adapt general-purpose LLMs to solve robotic tasks with vague objectives, especially those involving visual reasoning and fine-grained low-level controls.
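One plausible reading of such a framework, sketched below with assumed helper functions (`propose_with_llm`, `render_preview`, `critic_score`), is that the LLM proposes candidate arrangements and the learned image-based critic selects among them; this is an illustration, not the authors' system:

```python
def tidy_step(scene_image, objects, propose_with_llm, render_preview, critic_score):
    """Sketch: LLM proposes layouts; a task-specific visual critic picks the tidiest.

    propose_with_llm: callable(objects) -> list of {object: (x, y)} candidate layouts.
    render_preview:   callable(scene_image, layout) -> imagined post-move image.
    critic_score:     callable(image) -> scalar tidiness score (learned critic).
    """
    best, best_score = None, float("-inf")
    for layout in propose_with_llm(objects):
        score = critic_score(render_preview(scene_image, layout))
        if score > best_score:
            best, best_score = layout, score
    return best   # the arrangement to hand to the low-level controller
```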
Language-conditioned robot behavior plays a vital role in the execution of complex tasks by associating human commands or instructions with perception and actions. The ability to compose long-horizon tasks based on unconstrained language instructions necessitates the acquisition of a diverse set of general-purpose skills.
However, acquiring inherent primitive skills in a coupled and long-horizon environment without external rewards or human supervision presents significant challenges. In this paper, we evaluate the relationship between skills and language instructions from a mathematical perspective, employing two forms of mutual information within the framework of language-conditioned policy learning.
To maximize the mutual information between language and skills in an unsupervised manner, we propose an end-to-end imitation learning approach known as Language Conditioned Skill Discovery (LCSD). Specifically, we utilize vector quantization to learn discrete latent skills and leverage skill sequences of trajectories to reconstruct high-level semantic instructions.
Through extensive experiments on language-conditioned robotic navigation and manipulation tasks, encompassing BabyAI, Lorel, and Calvin, we demonstrate the superiority of our method over prior works. Our approach exhibits enhanced generalization capabilities towards unseen tasks, improved skill interpretability, and notably higher rates of task completion success.
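As a generic sketch of the vector-quantization step mentioned above (standard VQ with a straight-through estimator; sizes and names are assumptions, not the LCSD code):

```python
import torch
import torch.nn.functional as F

class SkillQuantizer(torch.nn.Module):
    """Minimal vector-quantization bottleneck for discrete skill codes."""

    def __init__(self, n_skills=16, dim=64):
        super().__init__()
        self.codebook = torch.nn.Embedding(n_skills, dim)

    def forward(self, z):                     # z: (batch, dim) continuous skill latents
        dists = torch.cdist(z, self.codebook.weight)       # distance to every code
        idx = dists.argmin(dim=-1)                         # nearest discrete skill id
        quantized = self.codebook(idx)
        commit_loss = F.mse_loss(z, quantized.detach())    # keep encoder near its code
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = z + (quantized - z).detach()
        return quantized, idx, commit_loss
```

In an LCSD-like setup, the discrete indices would presumably serve as the skill labels whose alignment with the instruction is what the mutual-information objectives formalize.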
Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning
Dhruv Shah, Michael Equi, Błażej B Osiński, Fei Xia, Brian Ichter, Sergey Levine
Navigation in unfamiliar environments presents a major challenge for robots: while mapping and planning techniques can be used to build up a representation of the world, quickly discovering a path to a desired goal in unfamiliar settings with such methods often requires lengthy mapping and exploration. Humans can rapidly navigate new environments, particularly indoor environments that are laid out logically, by leveraging semantics (e.g., a kitchen often adjoins a living room, or an exit sign indicates the way out). Language models can allow robots to acquire such semantic knowledge, but directly using language models to instruct a robot how to reach some destination can also be impractical: while language models might produce a narrative about how to reach some goal, this narrative might be arbitrarily wrong because it is not grounded in real-world observations. Therefore, in this paper we study how the "semantic guesswork" produced by language models can be utilized as a guiding heuristic for planning algorithms. Our method, Language Frontier Guide (LFG), uses the language model to bias planning and exploration, but does not rely on the language model to always correctly determine the path to the goal. We propose specific approaches to use language model proposals as heuristics, and evaluate our method on simulated indoor navigation benchmarks, showing that our approach outperforms uninformed exploration and other ways of using language models.
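A minimal sketch of using LLM "semantic guesswork" as a planning heuristic, assuming placeholder `distance_cost` and `llm_score` helpers (an illustration of the general idea, not the LFG implementation):

```python
def choose_frontier(frontiers, goal, distance_cost, llm_score, weight=1.0):
    """Pick the exploration frontier with the best combined cost.

    frontiers:     candidate frontier descriptions, e.g. labels of nearby objects.
    distance_cost: callable(frontier) -> path cost from the geometric planner.
    llm_score:     callable(frontier, goal) -> how promising the LLM judges this
                   direction for reaching the goal (higher is better).
    """
    def total_cost(f):
        # The LLM only biases the search; the planner's own cost keeps it grounded,
        # so a confidently wrong suggestion cannot override geometry entirely.
        return distance_cost(f) - weight * llm_score(f, goal)
    return min(frontiers, key=total_cost)
```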
This work-in-progress paper provides an example showing a detouring procedure realized through knowledge representation and reasoning. When a human manager requests a detour, the request should affect the related agents. Through a non-monotonic reasoning process, we verify each step to be carried out and provide all the successful connections of the reasoning. By following this progress and continuing to develop the idea, we expect that this simulated scenario can serve as a guideline for building a real-world traffic management system. After a brief introduction including related works, we provide our problem formulation, primary work, discussion, and conclusions.