Preference-based reinforcement learning is an effective way to handle tasks where rewards are hard to specify, but it can be exceedingly inefficient because preference learning often starts tabula rasa. To address this challenge, we propose In-Context Preference Learning (ICPL), which leverages the in-context learning capabilities of LLMs to reduce human query inefficiency. ICPL uses the task description and basic environment code to create sets of reward functions, which are iteratively refined by placing human feedback over videos of the resultant policies into the context of an LLM and then requesting better rewards. We first demonstrate ICPL's effectiveness through a synthetic preference study, providing quantitative evidence that it significantly outperforms baseline preference-based methods, achieving much higher performance with orders of magnitude fewer human queries. We observe that these improvements stem not only from the LLM's grounding in the task but also from the improvement of reward quality through preference feedback. Additionally, we perform a series of real human preference-learning trials and observe that ICPL extends beyond synthetic settings and can work effectively with humans in the loop.
Figure 1: Overview of In-Context Preference Learning (ICPL).
Our proposed method, In-Context Preference Learning (ICPL), integrates LLMs with human preferences to synthesize reward functions. The LLM receives environmental context and a task description to generate an initial set of K executable reward functions. ICPL then iteratively refines these functions. In each iteration, the LLM-generated reward functions are trained within the environment, producing a set of agents; we use these agents to generate videos of their behavior. A ranking is formed over the videos from which we retrieve the best and worst reward functions corresponding to the top and bottom videos in the ranking. These selections serve as examples of positive and negative preferences. The preferences, along with additional contextual information, such as reward traces and differences from previous good reward functions, are provided as feedback prompts to the LLM. The LLM takes in this context and is asked to generate a new set of rewards.
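To make the loop concrete, the following is a minimal sketch of ICPL's outer loop. Every callable it receives (query_llm, train_policy, render_video, collect_human_ranking, build_feedback_prompt) is a hypothetical placeholder, not part of any released interface.

```python
# Minimal sketch of the ICPL loop. All callables passed in are hypothetical
# placeholders supplied by the surrounding system.

def icpl(task_description, env_code, query_llm, train_policy, render_video,
         collect_human_ranking, build_feedback_prompt, K=6, N=5):
    """Iteratively refine K candidate reward functions over N rounds."""
    prompt = (f"Task: {task_description}\nEnvironment code:\n{env_code}\n"
              f"Write {K} candidate reward functions.")
    best_per_iteration = []
    for _ in range(N):
        reward_fns = query_llm(prompt, num_samples=K)        # K executable reward functions
        policies = [train_policy(fn) for fn in reward_fns]   # RL training per candidate
        videos = [render_video(p) for p in policies]         # roll out and record behavior
        ranking = collect_human_ranking(videos)              # indices, best first
        best_fn = reward_fns[ranking[0]]
        worst_fn = reward_fns[ranking[-1]]
        best_per_iteration.append(best_fn)
        # Positive/negative examples plus reward traces become the next prompt.
        prompt = build_feedback_prompt(task_description, best_fn, worst_fn)
    return best_per_iteration
```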
Baseline: We consider three preference-based RL methods as baselines: PrefPPO and PEBBLE from B-Pref, a benchmark designed specifically for preference-based reinforcement learning, and SURF. PrefPPO is based on the on-policy RL algorithm PPO, while PEBBLE builds upon the off-policy RL algorithm SAC. SURF enhances PEBBLE by utilizing unlabeled samples with data augmentation to improve feedback efficiency.
Metric: To assess a generated reward function or a learned reward model, we take the maximum task metric value over 10 policy checkpoints sampled at fixed intervals, denoted the task score of the reward function/model (RTS). ICPL performs 5 iterations, and the highest RTS across these iterations is recorded as the task score (TS) of each experiment. Due to the inherent randomness of LLMs, we run 5 experiments for all methods, including PrefPPO, and report the highest TS as the final task score (FTS) of each approach. For all tasks, a higher FTS indicates better performance.
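To make these definitions concrete, the following sketch computes RTS, TS, and FTS from recorded task-metric values; the nested-list inputs and names are hypothetical.

```python
# Illustrative computation of RTS, TS, and FTS as defined above. Inputs are
# plain nested lists of task-metric values; variable names are hypothetical.

def reward_task_score(checkpoint_metrics):
    """RTS: best task metric among the 10 policy checkpoints of one reward."""
    return max(checkpoint_metrics)

def task_score(iteration_checkpoints):
    """TS: best RTS across the 5 ICPL iterations of one experiment."""
    return max(reward_task_score(c) for c in iteration_checkpoints)

def final_task_score(experiment_results):
    """FTS: best TS across the 5 repeated experiments of one method."""
    return max(task_score(e) for e in experiment_results)
```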
In this experiment, human-designed rewards were used as proxies for human preferences, enabling rapid and quantitative evaluations of our approach. Importantly, the human-designed rewards were only used to automate the selection of samples and were not included in the prompts sent to the LLM; the LLM never observes the functional form of the ground truth rewards nor does it ever receive any values from them. Since proxy human preferences are free from noise, they offer a reliable comparison to evaluate our approach efficiently.
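The selection step under proxy preferences can be sketched as follows; evaluate_ground_truth is a hypothetical helper that scores a trained policy with the human-designed reward, and its output never enters the LLM prompt.

```python
# Proxy human preferences: the human-designed reward is used only to rank the
# K trained policies; neither its functional form nor its values ever appear
# in the LLM prompt.

def proxy_select(policies, evaluate_ground_truth):
    scores = [evaluate_ground_truth(p) for p in policies]       # hidden from the LLM
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    worst_idx = min(range(len(scores)), key=scores.__getitem__)
    return best_idx, worst_idx   # only these indices feed back into the ICPL loop
```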
Table 1 shows the final task score (FTS) for all methods across IsaacGym Tasks. For ICPL and baselines, we track the number of human queries Q required to measure the real human effort involved, which is crucial for methods that rely on human-in-the-loop preference feedback. Specifically, we define a single query as a human comparing two trajectories and providing a preference. In ICPL, the number of human queries Q can be calculated as Q = (K-1) * 2N - 1. In practice, with K=6 and N=5, this results in Q=49. In baselines, we set the maximum number of queries to Q=49, matching ICPL, and also test Q=15k, denoted as "Baseline-#Q".
For relatively more challenging tasks, Baseline-49 performs significantly worse than ICPL when using the same number of human queries. As the number of human queries increases, baselines' performance improves across most tasks, but it still falls noticeably short compared to ICPL. This demonstrates that ICPL, with the integration of LLMs, can reduce human effort in preference-based learning by orders of magnitude.
We further report Eureka's performance as an approximate upper bound on the expected performance ICPL could achieve. Eureka is an LLM-powered reward design method that uses sparse rewards as fitness scores. ICPL surprisingly achieves comparable performance, indicating that ICPL’s use of LLMs for preference learning is effective.
Table 1: The final task score of all methods across different tasks in IsaacGym. The top result and those within one standard deviation of it are highlighted in bold.
Table 1 also presents the performance achieved by ICPL. While it is possible for the LLM to generate an optimal reward function in a zero-shot manner, our analysis does not focus solely on absolute performance values; rather, we ask whether ICPL can enhance performance through the iterative incorporation of preferences. We calculated the average RTS improvement over the first iteration for Ant and ShadowHand, the two tasks with the largest gains relative to OpenLoop. As shown in Fig. 3, RTS improves after multiple iterations (e.g., iteration 5 vs. iteration 1), highlighting ICPL's effectiveness in refining reward functions.
Figure 3: RTS improvement over iterations in ICPL
To address the limitations of proxy human preferences, which simulate idealized human preferences and may not fully capture the irrationality and occasional errors of real human judgment, we conducted experiments with real human participants.
Table 2 presents the FTS for the human-in-the-loop preference experiments conducted across several IsaacGym tasks, labeled ``ICPL-real''. The results of the proxy human preference experiments are labeled ``ICPL-proxy''. As observed, the performance of ``ICPL-real'' is slightly lower than that of ``ICPL-proxy'' in 4 out of 5 tasks, yet it still outperforms the ``OpenLoop'' results. This indicates that while humans may have difficulty providing consistent preferences from videos alone, their feedback can still effectively improve performance when combined with LLMs.
Table 2: The final task score of human-in-the-loop preference across several IsaacGym tasks. The values in parentheses represent the standard deviation.
In our study, we introduced a new task, HumanoidJump, with the task description ``make the humanoid jump like a real human.'' Defining a precise task metric for this objective is challenging, as the criteria for human-like jumping are not easily quantifiable.
The most common behavior observed in this task, as illustrated in Fig. 2, is what we refer to as the "leg-lift jump''. This behavior involves initially lifting one leg to raise the center of mass, followed by the opposite leg pushing off the ground to achieve lift. The previously lifted leg is then lowered to extend airtime. Various adjustments of the center of mass with the lifted leg were also noted.
Figure 2: (a) leg-lift jump-v1
Figure 2: (b) leg-lift jump-v2
Figure 2: (c) leg-lift jump-v2
The following figures illustrate an example where the volunteer successfully guided the humanoid towards a more human-like jump by selecting behaviors that, while initially not optimal, displayed promising movement patterns.
Figure 3 shows the most preferred behavior produced by the OpenLoop configuration. For quantitative evaluation, 20 volunteers indicated their preferences between two videos shown in random order: one generated by ICPL and the other by OpenLoop. The results show that 17 of the 20 participants preferred the ICPL agent, indicating that ICPL produces more human-aligned behaviors.
Figure 3: OpenLoop Result
The prompts used in ICPL for synthesizing reward functions are presented in Prompts \ref{prompt: 1}, \ref{prompt: 2}, and \ref{prompt: 3}. The prompt for generating the differences between various reward functions is shown in Prompt \ref{prompt: 4}.
The full pseudocode of ICPL is listed in Algorithm 1.
We provide an example to further explain the reward components of a generated reward function. Consider the Humanoid task, where the goal is to make the humanoid run as fast as possible. Below is a typical set of reward components generated by ICPL:
velocity reward: reward for forward velocity (run fast)
upright reward: encouragement for maintaining upright posture
force penalty: penalize high force usage (energy efficiency)
unnatural pose penalty: penalize unnatural joint angles
action penalty: penalize large actions (for smoother movement)
The total reward is the sum of these individual components. Designing such a reward by hand requires specifying and balancing five different aspects of behavior, which is likely nontrivial; a minimal sketch of such a composition is shown below.
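As a minimal illustration (not an actual LLM output), the composition might look as follows; the observation fields and weights are assumptions chosen for readability.

```python
import torch

# Illustrative composition of the five reward components listed above for the
# Humanoid running task. Inputs are batched tensors; field names and weights
# are assumptions.

def humanoid_run_reward(forward_vel, up_proj, actions, dof_forces, dof_pos, dof_limits):
    velocity_reward = 2.0 * forward_vel                                 # run fast
    upright_reward = 0.5 * up_proj                                      # keep the torso upright
    force_penalty = -0.01 * torch.sum(dof_forces ** 2, dim=-1)          # energy efficiency
    pose_penalty = -0.1 * torch.sum(                                    # discourage joints near their limits
        torch.clamp(torch.abs(dof_pos) - dof_limits, min=0.0), dim=-1)
    action_penalty = -0.01 * torch.sum(actions ** 2, dim=-1)            # smoother movement
    return velocity_reward + upright_reward + force_penalty + pose_penalty + action_penalty
```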
The baseline PrefPPO adopted in our experiments comprises two primary components: agent learning and reward learning. Algorithm 2 presents the pseudocode for PrefPPO. Throughout this process, the method maintains a policy and a reward model.
Agent Learning. In the agent learning phase, the agent interacts with the environment and collects experience. The policy is then trained with reinforcement learning to maximize the cumulative reward provided by the current reward model. We use the on-policy reinforcement learning algorithm PPO as the backbone for policy training. Additionally, we apply unsupervised pre-training to match the performance of the original benchmark: during early iterations, before the reward model has collected enough trajectories to make meaningful progress, we use the state entropy of the observations as the objective for agent training. During this process, trajectories of varying lengths are collected.
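One common form of this intrinsic objective is a particle-based state-entropy estimate; the sketch below assumes a batch-based k-nearest-neighbor estimator with k = 5, which may differ in detail from the benchmark implementation.

```python
import torch

# Particle-based state-entropy intrinsic reward for unsupervised pre-training:
# each state is rewarded in proportion to the log distance to its k-th nearest
# neighbor within a batch of recently visited states.

def state_entropy_reward(states, k=5):
    # states: (B, obs_dim) tensor of visited observations
    dists = torch.cdist(states, states)                          # pairwise Euclidean distances
    knn_dist = dists.topk(k + 1, largest=False).values[:, -1]    # k-th neighbor, skipping self (distance 0)
    return torch.log(knn_dist + 1.0)                             # (B,) intrinsic rewards
```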
Reward Learning. We use the current reward model to build a preference predictor that estimates which of two trajectories is more likely to be preferred by a human. The predictor sums the predicted rewards over all state–action pairs in each trajectory and applies a softmax to compute the preference probability. In the original PrefPPO framework, trajectories have a fixed length, so fixed-length segments can be extracted for reward model training. In our setting, trajectory lengths vary, so we train directly on full trajectory pairs. We also tried zero-padding trajectories to the maximum length and then segmenting them, but this did not work well in practice.
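A minimal sketch of this predictor, assuming a small MLP reward network and variable-length trajectories, is given below together with the cross-entropy objective used to train the reward model.

```python
import torch
import torch.nn as nn

# Sketch of the preference predictor: sum the learned reward over each full
# (variable-length) trajectory and apply a softmax over the two sums to obtain
# the probability that the first trajectory is preferred. The MLP size is an
# assumption; the Bradley-Terry structure follows the description above.

class RewardModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def trajectory_return(self, obs, act):
        # obs: (T, obs_dim), act: (T, act_dim); T may differ between trajectories
        return self.net(torch.cat([obs, act], dim=-1)).sum()

    def preference_logits(self, traj_0, traj_1):
        # each traj is an (obs, act) pair; a softmax over the two returns gives
        # P(traj_0 preferred)
        return torch.stack([self.trajectory_return(*traj_0),
                            self.trajectory_return(*traj_1)])

def preference_loss(model, traj_0, traj_1, label):
    # label = 0 if traj_0 is preferred, 1 otherwise (cross-entropy objective)
    logits = model.preference_logits(traj_0, traj_1).unsqueeze(0)
    return nn.functional.cross_entropy(logits, torch.tensor([label]))
```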
To generate labels, the original PrefPPO uses dense rewards to simulate an “ideal human” preference: if the total reward of one trajectory is higher than the other, it is always preferred; otherwise, it is not. This teacher is perfectly rational and noise-free. We use the default dense rewards from the IsaacGym tasks, which differs from other approaches that use sparse rewards (e.g., task completion metrics). We also tested sparse rewards in PrefPPO and observed similar performance, but kept the dense reward setup for all experiments.
The reward model is trained by minimizing the cross-entropy loss between the predictor’s outputs and the labels, using trajectories sampled from the agent’s learning process. Since policy learning requires much more experience than reward learning, we train the reward model with trajectories from only a subset of environments.
To improve reward learning, we adopt the disagreement sampling scheme from prior work: first generate a large batch of trajectory pairs uniformly at random, then select a smaller batch with the highest variance in predictions across an ensemble of preference predictors. Only these selected pairs are used to update the reward model.
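A sketch of this selection step, reusing the hypothetical RewardModel from the earlier sketch, is shown below.

```python
import torch

# Disagreement-based query selection: from a large, uniformly sampled batch of
# trajectory pairs, keep the pairs with the highest variance in predicted
# preference probabilities across an ensemble of predictors.

def select_queries(ensemble, candidate_pairs, num_queries):
    probs = []
    with torch.no_grad():
        for model in ensemble:
            p = torch.stack([
                torch.softmax(model.preference_logits(t0, t1), dim=0)[0]
                for t0, t1 in candidate_pairs
            ])                                            # P(traj_0 preferred) per pair
            probs.append(p)
    disagreement = torch.stack(probs).var(dim=0)          # variance across ensemble members
    top = torch.topk(disagreement, k=num_queries).indices
    return [candidate_pairs[i] for i in top]
```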
For a fair comparison, we record how many times PrefPPO queries the simulated human to compare two trajectories and provide a label, and use this count as a measure of human effort. In the simulated human experiments, we cap the number of queries at 49, 150, 1,500, or 15,000. Once this limit is reached, the reward model stops updating, and only the policy is updated via PPO. Algorithm 3 presents the pseudocode for reward learning.
PEBBLE is a popular feedback-efficient preference-based RL algorithm. It improves feedback efficiency mainly through two modules: unsupervised pre-training and off-policy learning. The unsupervised pre-training module is described in the PrefPPO section, and we include it in PEBBLE with the same settings. PEBBLE uses the off-policy algorithm SAC instead of PPO as its backbone RL algorithm. SAC stores the agent's past experiences in a replay buffer and reuses them during training. PEBBLE relabels all past experiences in the replay buffer every time it updates the reward model.
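The relabeling step can be sketched as follows; the replay-buffer layout (obs/act/rew tensors) and the reuse of the reward network from the earlier sketch are assumptions.

```python
import torch

# PEBBLE-style relabeling: after each reward-model update, recompute the reward
# stored with every transition in SAC's replay buffer so that old experience
# stays consistent with the current model.

def relabel_replay_buffer(buffer, reward_model, batch_size=4096):
    with torch.no_grad():
        for start in range(0, buffer.obs.shape[0], batch_size):
            obs = buffer.obs[start:start + batch_size]
            act = buffer.act[start:start + batch_size]
            new_rew = reward_model.net(torch.cat([obs, act], dim=-1)).squeeze(-1)
            buffer.rew[start:start + batch_size] = new_rew
```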
SURF is a framework that uses unlabeled samples with data augmentation to improve the efficiency of reward training. In our experiments, trajectory lengths vary, which may affect how trajectories are evaluated. Therefore, we do not apply the data augmentation technique and only use the semi-supervised learning method in SURF.
In addition to the labeled pairs of trajectories, SURF samples another unlabeled dataset to optimize the reward model. Specifically, during each update of the reward model, SURF not only samples a set of trajectories and queries a human teacher for labels, but also samples additional trajectory pairs. These additional pairs are assigned pseudo-labels generated by the preference predictor based on the current reward model.
During reward-model training, SURF also uses these unlabeled samples if the predictor's confidence exceeds a pre-defined threshold.
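A sketch of this pseudo-labeling step, again reusing the hypothetical preference predictor and assuming a confidence threshold of 0.95, is given below.

```python
import torch

# SURF's semi-supervised step (without the data augmentation we omit):
# pseudo-label extra unlabeled pairs with the current preference predictor and
# keep only those where its confidence exceeds a threshold.

def pseudo_label_pairs(model, unlabeled_pairs, tau=0.95):
    kept = []
    with torch.no_grad():
        for traj_0, traj_1 in unlabeled_pairs:
            p = torch.softmax(model.preference_logits(traj_0, traj_1), dim=0)
            conf, label = p.max(dim=0)
            if conf.item() >= tau:                    # confident enough to trust
                kept.append((traj_0, traj_1, label.item()))
    return kept   # added to the training set for the next reward-model update
```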
In Table 3, we present the observation and action dimensions, along with the task description and task metrics for 9 tasks in IsaacGym.
In ICPL, K samples (i.e., K videos) are generated in each iteration. The human annotator compares these K videos and selects the best one, which requires K-1 queries. The annotator is then queried K-2 more times to select the worst video from the remaining K-1 videos. Thus, each round uses 2K-3 queries in total. This process is repeated for N iterations, giving N * (2K-3) = 2KN - 3N queries. At the end, the human selects the best video across the N rounds, requiring an additional N-1 queries. The total number of queries is therefore 2KN - 3N + N - 1 = 2KN - 2N - 1 = (K-1) * 2N - 1.
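For concreteness, with K = 6 and N = 5:
\begin{align*}
Q &= \underbrace{N(K-1)}_{\text{best per iteration}} + \underbrace{N(K-2)}_{\text{worst per iteration}} + \underbrace{N-1}_{\text{final selection}} \\
  &= 2KN - 2N - 1 = (K-1)\cdot 2N - 1 = 5 \cdot 10 - 1 = 49.
\end{align*}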
The results of all methods, including Q=49, 150, 1500, 15000, are shown in Table 4.
Due to the high variance in LLM performance, we additionally report the mean and standard deviation across the 5 experiments in Table 5. As shown in these two tables, ICPL consistently outperforms the PbRL baselines in both mean and highest scores, demonstrating the effectiveness of the ICPL method. At the same time, ICPL exhibits a relatively large standard deviation, owing to the high randomness inherent in LLMs. This suggests that practical applications of ICPL should run multiple experiments to mitigate the effects of LLM randomness; in this paper, we perform 5 experiments for each task. Note that all PbRL baseline methods are also run 5 times, and their final performance remains lower than ICPL's despite smaller standard deviations.
We also report the final task score of PrefPPO using sparse rewards as the preference metric for the simulated teacher in Table 6.
Participants were recruited through posters on campus. Prior to participation, all volunteers were provided with an Information Sheet that clearly outlined: the purpose of the study, the tasks they would be asked to perform, the expected duration, their right to withdraw at any time, how their data would be used and stored, and the compensation they would receive.
Only participants who gave informed consent in writing were included in the study. No personal identifiable information was collected. All data was anonymized and used exclusively for academic research purposes.
The participants in the human-in-the-loop preference experiments consisted of 7 individuals aged 19 to 30, including 2 women and 5 men. Their educational backgrounds included 2 undergraduate students and 5 graduate students. The 20 volunteers recruited to evaluate the performance of different methods were aged 23 to 28, comprising 5 women and 15 men, with 3 undergraduates and 17 graduate students.
In ICPL experiments, each volunteer was assigned an account with a pre-configured environment to ensure smooth operation. After starting the experiment, LLMs generated the first iteration of reward functions. Once the reinforcement learning training was completed, videos corresponding to the policies derived from each reward function were automatically rendered. Volunteers compared the behaviors in the videos with the task descriptions and selected both the best and the worst-performing videos. They then entered the respective identifiers of these videos into the interactive interface and pressed ``Enter'' to proceed. The human preference was processed as an LLM prompt for generating feedback, leading to the next iteration of reward function generation.
This training-rendering-selection process was repeated across several iterations. At the end of the final iteration, the volunteers were asked to select the best video from those previously marked as good, designating it as the final result of the experiment. For IsaacGym tasks, the corresponding RTS was recorded as TS. It is important to note that, unlike proxy human preference experiments where the TS is the maximum RTS across iterations, in the human-in-the-loop preference experiment, TS refers to the highest RTS chosen by the human, as human selections are not always based on the maximum RTS at each iteration. Given that ICPL required reinforcement learning training in every iteration, each experiment lasted two to three days. Each volunteer was assigned a specific task and conducted five experiments, one for each task, with the highest TS being recorded as FTS in IsaacGym tasks.
We evaluate human-in-the-loop preference experiments on tasks in IsaacGym, including Quadcopter, Humanoid, Ant, ShadowHand, and AllegroHand. In these experiments, volunteers were limited to comparing reward functions based solely on videos showcasing the final policies derived from each reward function.
In the Quadcopter task, humans evaluate performance by observing whether the quadcopter moves quickly and efficiently and whether it stabilizes at the final position. For the Humanoid and Ant tasks, where the task description is ``make the ant/humanoid run as fast as possible,'' humans estimate speed by comparing the time taken to cover the same distance and by assessing movement posture. However, due to the variability in movement postures and directions, estimating speed can introduce inaccuracies. In the ShadowHand and AllegroHand tasks, where the goal is ``to make the hand spin the object to a target orientation,'' humans find it challenging to judge the precise difference between the current and target orientations at every moment, even though the target orientation is displayed nearby. Nevertheless, humans can still estimate how long effective rotation toward the target lasts in the video, and thus evaluate the performance of a single spin. Since a new target orientation is generated whenever the current one is reached, the frequency of target changes also helps in assessing performance.
Due to the lack of precise environmental data, volunteers cannot make absolutely accurate judgments during the experiments. For instance, in the Humanoid task, robots may move in varying directions, which can introduce biases in volunteers' assessments of speed. However, volunteers are still able to filter out extremely poor results and select videos with relatively better performance. In most cases, the selected results closely align with those derived from proxy human preferences, enabling effective improvements in task performance.