Manjie Xu*, Xinyi Yang*, Wei Liang✉, Chi Zhang✉ , Yixin Zhu✉
*equal contribution ✉corresponding authors
Institute for Artificial Intelligence, Peking University
School of Computer Science & Technology, Beijing Institute of Technology
Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China
National Key Laboratory of General Artificial Intelligence, BIGAI
Effective integration of AI agents into daily life requires them to understand and adapt to individual human preferences, particularly in collaborative roles. Although recent studies on embodied intelligence have advanced significantly, they typically adopt generalized approaches that overlook personal preferences in planning. We address this limitation by developing agents that not only learn preferences from a few demonstrations but also learn to adapt their planning strategies based on these preferences. Our research leverages the observation that preferences, though expressed only implicitly through minimal demonstrations, can generalize across diverse planning scenarios. To systematically evaluate this hypothesis, we introduce the Preference-based Planning (PbP) benchmark, an embodied benchmark featuring hundreds of diverse preferences spanning from atomic actions to complex sequences. Our evaluation of SOTA methods reveals that while symbol-based approaches show promise in scalability, significant challenges remain in learning to generate and execute plans that satisfy personalized preferences.
We show an example of preference-based planning in a food preparation scenario. When the assistant receives a natural language instruction for a food preparation task, it can follow one of two approaches: (Left, traditional methods) The assistant verifies details with the user at each step through exhaustive communication; or (Right, our personalized approach) it first learns from previous user action sequences to infer explicit preference labels and then generates a personalized plan based on the learned preferences. The planning tree (middle) illustrates how preferences guide the whole decision-making process across multiple dimensions. By learning preferences as a key intermediate representation from minimal human demonstrations, our approach enables AI agents to deliver personalized and adaptable assistance without explicit step-by-step instructions.
We introduce Preference-based Planning (PbP), a comprehensive embodied benchmark built upon NVIDIA Omniverse and OmniGibson. PbP provides realistic simulation and real-time rendering for thousands of daily activities across 50 scenes, featuring a parameterized vocabulary of 290 diverse preferences. Tasks in PbP mirror real-world watch-and-help scenarios, where an agent observes a few demonstrations of a user performing tasks that reveal preferences. The agent must then complete similar tasks in different setups while adhering to the demonstrated preferences.
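To make the task setup concrete, the sketch below shows one way a watch-and-help episode could be represented in code; the class and field names are illustrative assumptions, not PbP's actual data schema.

# A minimal sketch of a PbP-style watch-and-help episode.
# Field names here are illustrative, not the benchmark's actual schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Demonstration:
    """One observed execution of a task by the user."""
    scene_id: str               # one of the simulated scenes
    task_instruction: str       # e.g., "Pick Apple from Fridge and place on Table"
    action_sequence: List[str]  # per-step symbolic actions observed


@dataclass
class PbPEpisode:
    """Learn preferences from a few demos, then plan in a new setup."""
    demonstrations: List[Demonstration] = field(default_factory=list)
    target_scene_id: str = ""       # a different setup from the demos
    target_instruction: str = ""    # a similar task in the new scene
    # Hidden ground-truth preference labels, used only for scoring.
    gt_preferences: List[str] = field(default_factory=list)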
Preference-based planning comprises two key components: few-shot learning of user preferences from demonstrations and subsequent planning guided by these learned preferences. Since humans, even infants, can naturally detect others' preferences from a handful of decisions, and since collecting extensive personal demonstrations is impractical in daily life, we formulate the problem as few-shot learning from demonstration.
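The two-stage formulation can be sketched as follows; infer_preferences and plan_with_preferences are hypothetical placeholders for whatever model (e.g., an LLM/VLM or a learned policy) fills each stage.

# A minimal sketch of the two-stage pipeline, assuming placeholder models.
from typing import List


def infer_preferences(demonstrations: List[List[str]]) -> List[str]:
    """Stage 1: map a few demonstrated action sequences to explicit
    preference labels. Placeholder logic; a real system would query an
    LLM/VLM or a learned classifier here."""
    return ["<inferred-preference>"]


def plan_with_preferences(instruction: str, preferences: List[str]) -> List[str]:
    """Stage 2: generate a personalized action sequence conditioned on the
    instruction and the learned preference labels. Placeholder logic."""
    return [f"<action satisfying {p}>" for p in preferences]


# Few-shot preference-based planning: learn once, reuse across new tasks.
demos = [["open(fridge)", "pick(apple)", "place(apple, table)"]]
prefs = infer_preferences(demos)
plan = plan_with_preferences("Prepare a snack", prefs)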
(Top left) We propose a hierarchical organization of user preferences. Our framework organizes preferences in a three-tiered structure, visualized through sunburst diagrams: (a) Action level, (b) Option level, (c) Sequence level. Each diagram’s hierarchical structure branches from general categories to specific instances, revealing detailed preference patterns upon closer inspection.
(Top right) We also show an example of a demonstration in PbP. The robot in the demonstration is executing the task “Pick Apple from Fridge and place on Table”.
Top: A third-person view video provides an overhead perspective of the entire scene.
Middle: The bird’s-eye-view map displays the robot’s relative position within the scene.
Bottom: The egocentric video captures the robot’s first-person observations during task execution.
Text: The per-frame action annotations contain Omniverse object IDs, which ensure each object reference is unique and enable the model to identify specific objects precisely.
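For illustration, a single per-frame annotation might look like the following; the key names and ID suffixes are assumptions, not the exact annotation schema.

# Illustrative shape of a per-frame action annotation (assumed format).
frame_annotation = {
    "frame": 128,
    "action": "pick",
    # Omniverse-style object IDs keep every object reference unambiguous.
    "target_object": "apple_17",
    "source_container": "fridge_02",
}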
(Left) Example of preferences and their corresponding actions in PbP. At the primitive action level, we demonstrate preferences through basic tasks: (a) cooking using the microwave, (b) washing in the sink, and (c) cutting into halves. At the option level, we showcase different approaches to object rearrangement, where users can prefer either (d) grouping objects by their categories (v1) or (e) placing them on the same layer of the fridge (v2). At the sequence level, we illustrate how preferences guide task ordering: (f) shows a user's preference to have fruits first, followed by specific cleaning tasks.
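The three preference levels can be pictured as a nested vocabulary; the sketch below uses only the examples named in this caption and is not the full 290-preference vocabulary.

# A minimal sketch of the three-tier preference hierarchy (illustrative entries only).
preference_vocabulary = {
    "action": {                     # preferences over primitive actions
        "cook": ["use_microwave"],
        "wash": ["wash_in_sink"],
        "cut": ["cut_into_halves"],
    },
    "option": {                     # preferences over how a goal is achieved
        "rearrange_fridge": ["group_by_category", "same_layer"],
    },
    "sequence": {                   # preferences over task ordering
        "daily_routine": ["fruits_first_then_cleaning"],
    },
}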
We further demonstrate that incorporating learned preferences as intermediate representations in planning significantly improves the agent's ability to construct personalized plans. These findings establish preferences as a valuable abstraction layer for adaptive planning, opening new directions for research in preference-guided plan generation and execution.
(Top left) We evaluate preference learning capabilities across two distinct settings: end-to-end and two-stage approaches. In the end-to-end setting, models directly map raw state inputs to action outputs. Leveraging models’ in-context learning abilities, we provide demonstrations alongside current state information as input and evaluate the generated action sequences against ground truth.
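A minimal sketch of the end-to-end, in-context setting is shown below; build_prompt and query_model are hypothetical stand-ins for the prompt format and the evaluated LLM/VLM API.

# Sketch of the end-to-end setting: pack demonstrations and the current
# state into one prompt, then parse the model's action sequence.
from typing import Callable, List


def build_prompt(demos: List[str], current_state: str, instruction: str) -> str:
    shots = "\n\n".join(f"Demonstration {i + 1}:\n{d}" for i, d in enumerate(demos))
    return (
        f"{shots}\n\n"
        f"Current state:\n{current_state}\n\n"
        f"Task: {instruction}\n"
        f"Output the action sequence that follows the user's preferences."
    )


def end_to_end_plan(query_model: Callable[[str], str],
                    demos: List[str], state: str, instruction: str) -> List[str]:
    response = query_model(build_prompt(demos, state, instruction))
    # One action per line; evaluated against the ground-truth sequence.
    return [line.strip() for line in response.splitlines() if line.strip()]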
(Top right) Ablation study on the number of demonstrations. Models are evaluated across both stages of the PbP task: (a) first-stage preference learning and (b) second-stage action planning. We evaluate both Option Level and Sequence Level tasks. The number of few-shot demonstrations varies over {1, 2, 3, 5}, presented left to right. For (a), higher accuracy indicates better performance; for (b), lower distance indicates better performance. Results demonstrate that increasing the number of demonstrations generally improves both preference learning capability and planning effectiveness.
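As rough stand-ins for the two reported metrics, the sketch below computes label accuracy for stage (a) and a normalized edit distance between action sequences for stage (b); PbP's exact metric definitions may differ.

# Illustrative metrics: accuracy over preference labels and a normalized
# Levenshtein distance between predicted and ground-truth plans.
from typing import List, Sequence


def preference_accuracy(pred: List[str], gold: List[str]) -> float:
    """Fraction of positions where the predicted label matches ground truth."""
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / max(len(gold), 1)


def plan_distance(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Normalized edit distance between action sequences (lower is better)."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n] / max(m, n, 1)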
We also show our analysis of test samples in the direct and generalization settings. Lines represent distinct scenes, with grid colors indicating different sample statuses. There are two key findings: (i) Preference learning performance correlates with scene characteristics, with certain scenes proving consistently challenging across both conditions. (ii) While direct cases show better performance overall, failure patterns differ between conditions, particularly for vision-based models. This suggests that models rely heavily on visual context consistency, including object arrangement and scene layout, for accurate predictions, indicating potential superficial learning rather than true preference understanding. Symbol-based reasoning maintains robust performance across varied scenes due to the general nature of predefined preferences, whereas vision-based models' strong dependence on specific visual contexts limits their generalization capability.
More materials can be found in our paper and code repo. If you find PbP useful, please cite us 🥹.
@article{xu2024preference,
  title={Learning to Plan with Personalized Preferences},
  author={Xu, Manjie and Yang, Xinyi and Liang, Wei and Zhang, Chi and Zhu, Yixin},
  year={2024}
}