Direct Post-Training Preference Alignment for Multi-Agent Motion Generation Model Using Implicit Feedback from Pre-training Demonstrations
Ran (Thomas) Tian, Kratarth Goel
ICLR 2025, Spotlight Paper
TL;DR. In this work, we consider the problem of efficient post-training alignment of a token-prediction model for multi-agent motion generation. We propose Direct Preference Alignment from Occupancy Measure Matching Feedback (DPA-OMF), a simple yet principled approach that leverages pre-training expert demonstrations to generate implicit preference feedback and significantly improves the pre-trained model’s generation quality without additional post-training human preference annotation, reward learning, or complex reinforcement learning. To the best of our knowledge, this is the first work to demonstrate the benefits of preference alignment for large-scale multi-agent motion generations using implicit feedback from pre-training demonstrations. Additionally, we provide a detailed analysis of preference data scaling laws and their impact on preference over-optimization.
The recent advances in Large Language Models (LLMs) have significantly impacted the design of motion generation models for embodied tasks such as autonomous driving. Formulating motion generation as a next-token prediction task not only provides a unified framework for modeling sequential decision-making tasks but also opens up opportunities to leverage pre-trained LLMs for more cost-effective training and improved generalizability.
Here, we consider world modeling of large-scale traffic scenarios where the model is tasked with generating eight seconds of realistic interactions among multiple heterogeneous agents.
Tokenization. Just like an LLM that tokenizes language into discrete tokens, we need to tokenize an agent’s action (lateral and longitudinal acceleration) into discrete tokens so that the model can reason over a finite vocabulary of actions, learn transitions between them, and generate coherent behavior sequences. Discretization enables us to frame action prediction as a sequence modeling task, much like language generation. Our action vocabulary consists of 169 tokens, and we follow Seff et al. (2023) to build our action tokenization.
Action vocabulary from Seff et al.
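To make this concrete, here is a minimal sketch of such a 2D action tokenizer. The 13 × 13 grid over lateral and longitudinal acceleration and the acceleration range are assumptions made for illustration; they are not the exact discretization from Seff et al. (2023).

```python
import numpy as np

# Assumed layout: a 13 x 13 grid over (lateral, longitudinal) acceleration,
# giving the 169-token action vocabulary mentioned above. The acceleration
# range below is also an assumption for illustration.
NUM_BINS = 13
ACC_MIN, ACC_MAX = -6.0, 6.0  # m/s^2

def tokenize_action(lat_acc: float, lon_acc: float) -> int:
    """Map continuous (lateral, longitudinal) acceleration to one of 169 token ids."""
    edges = np.linspace(ACC_MIN, ACC_MAX, NUM_BINS + 1)
    lat_bin = int(np.clip(np.digitize(lat_acc, edges) - 1, 0, NUM_BINS - 1))
    lon_bin = int(np.clip(np.digitize(lon_acc, edges) - 1, 0, NUM_BINS - 1))
    return lat_bin * NUM_BINS + lon_bin  # token id in [0, 168]

def detokenize_action(token: int) -> tuple[float, float]:
    """Invert a token id back to the bin-center accelerations."""
    bin_width = (ACC_MAX - ACC_MIN) / NUM_BINS
    centers = ACC_MIN + bin_width * (np.arange(NUM_BINS) + 0.5)
    return float(centers[token // NUM_BINS]), float(centers[token % NUM_BINS])
```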
Next-token prediction. The LLM-style auto-regressive motion generation model is trained to maximize the likelihood of the ground-truth joint action $a_t$ at time step $t$, conditioned on all previous joint actions $a_{<t}$ and the initial scene context $c$:

$$\max_{\theta} \; \mathbb{E}_{(c,\, a_{1:T}) \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \log \pi_{\theta}\!\left(a_{t} \mid a_{<t},\, c\right) \right],$$
where $\pi_{\theta}$ denotes the model, $\mathcal{D}$ the set of pre-training demonstrations, and each joint action $a_t$ is the collection of action tokens for all agents in the scene. After training, the model can generate agents’ action tokens given a scene context.
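As an illustration of the generation step, below is a minimal sketch of autoregressive rollout. The `model` interface, tensor shapes, and temperature sampling are hypothetical simplifications rather than the paper’s actual architecture or decoding scheme.

```python
import torch

@torch.no_grad()
def rollout(model, scene_context, horizon: int, temperature: float = 1.0):
    """Sample joint action tokens autoregressively, one time step at a time.

    `model` is a hypothetical token-prediction network that maps the scene
    context and the token history to logits of shape [num_agents, vocab_size].
    """
    history = []  # list of [num_agents] token tensors
    for _ in range(horizon):
        logits = model(scene_context, history)                     # [num_agents, 169]
        probs = torch.softmax(logits / temperature, dim=-1)
        a_t = torch.multinomial(probs, num_samples=1).squeeze(-1)  # [num_agents]
        history.append(a_t)  # condition later steps on the sampled joint action
    return torch.stack(history)  # [horizon, num_agents]
```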
Why do we need post-training preference alignment?
While LLM-type auto-regressive motion generation models benefit from training scalability, there remains a discrepancy between the training objective and the underlying reward function that incentivizes expert demonstrations. These models are trained to maximize the probability of the actions in the dataset, which encourages them to memorize the actions seen during training rather than to understand and optimize for human preferences.
As a result, models pre-trained solely with token-prediction objectives often generate behaviors that deviate from what humans would prefer. This discrepancy underscores the challenge of ensuring that motion models trained with next-token prediction are effectively aligned with expert-preferred behaviors, i.e., post-training preference alignment.
Why is post-training preference alignment hard in multi-agent robotics?
Preference-based alignment has emerged as a crucial component in the LLM post-training stage, aiming to reconcile the disparity between the next-token prediction objective and human preferences. Among various frameworks, direct alignment algorithms (e.g., direct preference optimization) are particularly appealing due to their training simplicity and computational efficiency. Specifically, these algorithms collect human preferences over pre-trained model generations and directly update the model to maximize the likelihood of preferred behaviors over unpreferred ones. However, in complex embodied settings, such as joint motion generation involving hundreds of agents, obtaining such preference data at scale can be very challenging. Human annotators must analyze intricate and nuanced motions, which is a time-consuming process, making the scalability of direct alignment methods difficult in these scenarios.
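For reference, the contrastive objective at the heart of direct preference optimization can be sketched in a few lines. The helper below assumes the per-sequence log-probabilities of each preferred and unpreferred sample have already been computed under both the policy being aligned and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_logp_pref: torch.Tensor,
                   policy_logp_unpref: torch.Tensor,
                   ref_logp_pref: torch.Tensor,
                   ref_logp_unpref: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Contrastive preference loss over a batch of (preferred, unpreferred) pairs.

    Each input is a [batch] tensor of per-sequence log-probabilities, obtained by
    summing action-token log-probabilities under the policy being aligned or the
    frozen reference model.
    """
    pref_logratio = policy_logp_pref - ref_logp_pref        # policy vs. reference on preferred sample
    unpref_logratio = policy_logp_unpref - ref_logp_unpref  # policy vs. reference on unpreferred sample
    margin = beta * (pref_logratio - unpref_logratio)
    return -F.logsigmoid(margin).mean()  # push preferred samples above unpreferred ones
```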
We conducted an experiment to measure the annotation time required by a human to rank multi-agent traffic simulations based on how realistic the simulations were compared to their personal driving experience. The results show a clear trend: as the number of traffic agents increases, the time required for human annotators to rank simulations grows significantly.
Specifically, for the preference data used in our experiments, the estimated average time required for one human annotator is approximately 633 days!
Can you rank these generations in 5 seconds?
Efficiently distilling human preference in multi-agent motion generation
While soliciting rankings from experts provides explicit preference information, we argue that expert demonstrations used in the pre-training stage inherently encode implicit human preferences, which can be reused to align a pre-trained motion generation model in a cost-effective way, beyond their role in supervised pre-training.
Previous approach. Recently, Alignment from Demonstrations (AFD) (Li et al., 2024; Sun & van der Schaar, 2024; Chen et al., 2024b) has emerged as a valuable technique for automatically generating preference data using pre-training expert demonstrations, allowing preference alignment to scale at low cost. However, previous methods typically adopt an adversarial formulation: treating all samples generated by the pre-trained model as unpreferred and relying solely on pre-training expert demonstrations to construct preferred examples (Chen et al., 2024b; Sun & van der Schaar, 2024). This adversarial approach overlooks the valuable signal provided by preference rankings among the model’s own generations, ultimately reducing alignment effectiveness and potentially leading to misaligned behaviors.
Insight. Instead of treating all generated samples as equally bad, we propose leveraging the implicit preferences encoded in pre-training demonstrations to automatically construct preference rankings among the pre-trained model’s generations, providing more nuanced guidance with zero human cost.
Our key idea. We propose Direct Preference Alignment from Occupancy Measure Matching Feedback (DPA-OMF), our approach to leveraging pre-training expert demonstrations for scalable preference feedback generation.
The key idea is to define an implicit preference distance function that measures the alignment between a generated sample and the expert demonstration given the same scene context (shown in the figure on the right).
This distance is then used to rank the reference model’s generated samples for each training scene context, constructing large-scale preference data at zero human cost.
Alignment visualization. The heat map visualizes the alignment between a generated traffic simulation and the expert demo. More peaks along the diagonal indicate better alignment between the behaviors (i.e., a smaller preference distance).
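A minimal sketch of such a preference distance is given below. The occupancy features (discretized agent states) and the use of a total-variation distance between occupancy histograms are illustrative assumptions; the paper’s exact occupancy measure matching formulation may differ.

```python
import numpy as np

def occupancy_histogram(states: np.ndarray, bin_edges: list[np.ndarray]) -> np.ndarray:
    """Normalized occupancy measure over a discretized state space.

    `states` has shape [T, num_agents, state_dim] and holds the rolled-out (or
    demonstrated) agent states; `bin_edges` gives the bin edges per dimension.
    """
    flat = states.reshape(-1, states.shape[-1])
    hist, _ = np.histogramdd(flat, bins=bin_edges)
    return hist / max(hist.sum(), 1e-8)

def preference_distance(generated_states: np.ndarray,
                        expert_states: np.ndarray,
                        bin_edges: list[np.ndarray]) -> float:
    """Smaller distance = the generation matches the expert occupancy more closely."""
    p = occupancy_histogram(generated_states, bin_edges)
    q = occupancy_histogram(expert_states, bin_edges)
    return 0.5 * float(np.abs(p - q).sum())  # total-variation distance (an assumed choice)
```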
Our efficient post-training alignment pipeline. DPA-OMF is a simple yet effective alignment-from-demonstration approach that aligns a pre-trained traffic simulation model with human preferences. It defines an implicit preference distance function that measures the alignment between a generated sample and an expert demonstration in the same scene context through occupancy measure matching. This distance is then used to rank the reference model’s generated samples for each training scene context, enabling large-scale automatic preference data generation to align the motion generation model through contrastive learning. The gray dotted lines above the motion token prediction model indicate the reference model’s motion token distributions at each prediction step, and the orange lines represent the probabilities after the alignment process. $\hat{a}_t$ denotes agents’ action tokens sampled from the predicted distribution during inference time, and $c$ denotes the scene context representation.
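Putting the pieces together, preference pairs can then be constructed for each scene context with no human in the loop. The pairing rule below (closest-to-expert sample versus farthest) is an assumed simplification of how a ranking over generations might be turned into contrastive training pairs.

```python
def build_preference_pairs(scene_contexts, generations, distances):
    """Turn per-scene generations and their preference distances into training pairs.

    `generations[i]` holds K samples drawn from the frozen reference model for
    scene i, and `distances[i][k]` is the occupancy-matching distance of sample k
    to that scene's expert demonstration (see the sketch above).
    """
    pairs = []
    for context, samples, dists in zip(scene_contexts, generations, distances):
        order = sorted(range(len(samples)), key=lambda k: dists[k])
        preferred, unpreferred = samples[order[0]], samples[order[-1]]
        pairs.append((context, preferred, unpreferred))  # fed to the contrastive objective
    return pairs
```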
Quantitative results
We validate our approach using a 1M-parameter motion generation model and measure its realism score (i.e., how realistic the generated motions are compared to human motions).
Zero post-training human cost required
Only 16% of pre-training computational cost
Significantly improves the pre-trained model's generation quality
Comparable to much larger high-capacity models
Qualitative results
Pre-trained motion generation model. Top right: a moving vehicle collides with static vehicles.
After post-training alignment.
Pre-trained motion generation model. Bottom right: a cyclist fails to yield to the vehicle.
After post-training alignment.
Pre-trained motion generation model. Unrealistic motion.
After post-training alignment.
Pre-trained motion generation model. Bottom: the vehicle brakes too late, stopping very close to the vehicle in front.
After post-training alignment.