Abstract
Efficient alignment techniques for large language models (LLMs) are critical to meet the growing demands of real-world applications. However, existing methods are often costly and struggle to address the challenges posed by closed-source models. In this paper, we present the Bayesian Persuasion Alignment framework, a novel and lightweight paradigm designed to overcome these limitations. Our approach leverages a smaller model (Advisor) to learn a linguistic signaling strategy that aligns the behavior of black-box large models (Receivers) with human intentions. Once trained, the Advisor is highly adaptable and can be applied to both open-source and API-based models, enabling enhanced performance in response to rapidly evolving real-world demands. We provide a theoretical analysis of the framework, deriving an upper bound on the Advisor’s regret and demonstrating convergence toward the optimal signaling strategy. Empirically, we show that using Phi-2 as the Advisor yields significant performance improvements in mathematical reasoning (50.8%) and code generation (15.4%) across 13 different LLMs. Compared to existing methods, our framework offers a scalable and practical solution for aligning LLMs, particularly in resource-constrained settings.
Q1: Can our framework find a non-trivial signaling strategy to enhance the Receiver's performance in various tasks?
To assess the effectiveness of our persuasion framework, we evaluate the Receiver’s performance under two conditions: sampling information from the prior distribution and from the posterior distribution induced by the Advisor’s signaling strategy.
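For concreteness, the evaluation loop can be sketched as follows (a minimal illustration: the information set is treated as a pool of candidate signals, `prior_probs` is a fixed prior over that pool, and `advisor_posterior`, `receiver_answer`, and `is_correct` are hypothetical callables standing in for the Advisor's signaling strategy, the Receiver's generation, and task-specific scoring):

```python
import random

def sample_signal_prior(info_set, prior_probs):
    """Draw one piece of information from a fixed prior over the information set."""
    return random.choices(info_set, weights=prior_probs, k=1)[0]

def sample_signal_posterior(query, info_set, advisor_posterior):
    """Draw one piece of information from the query-conditioned distribution
    induced by the Advisor's signaling strategy."""
    probs = advisor_posterior(query, info_set)  # hypothetical: one weight per item
    return random.choices(info_set, weights=probs, k=1)[0]

def compare_conditions(queries, info_set, prior_probs, advisor_posterior,
                       receiver_answer, is_correct):
    """Receiver accuracy under prior-based vs. Advisor-induced (posterior) signals."""
    prior_hits = post_hits = 0
    for q in queries:
        prior_hits += is_correct(q, receiver_answer(q, sample_signal_prior(info_set, prior_probs)))
        post_hits += is_correct(q, receiver_answer(q, sample_signal_posterior(q, info_set, advisor_posterior)))
    n = len(queries)
    return prior_hits / n, post_hits / n
```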
As shown in the table below, the Receiver’s performance improves when conditioned on informative signals from the Advisor, compared to both the baseline (no additional context) and prior-based information selection. This indicates that the Advisor’s learned signaling strategy effectively alters the Receiver’s belief distribution, guiding it toward generating more favorable responses. Notably, while the prior and posterior distributions are defined over the same information set, the posterior better aligns the Receiver’s behavior with the Advisor’s utility through strategic persuasion.
Q2: How efficient is the proposed framework?
We define ARPI(A|B) as the relative performance difference between information structure A and information structure B.
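Under the assumption that performance is measured by task accuracy and that the difference is averaged over the evaluated Receivers (the exact aggregation may differ), ARPI can be written as

\[
\mathrm{ARPI}(A \mid B) \;=\; \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \frac{\mathrm{Acc}_r(A) - \mathrm{Acc}_r(B)}{\mathrm{Acc}_r(B)} \times 100\%,
\]

where \(\mathcal{R}\) is the set of Receivers and \(\mathrm{Acc}_r(\cdot)\) denotes Receiver \(r\)'s accuracy under the given information structure.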
The left figure compares inference efficiency across different input strategies. When the Receiver utilizes the full information set, it achieves higher performance than with no additional context, but at the cost of a 26% increase in input token length. In contrast, our persuasion framework (using Phi-2 as the Advisor) achieves a 22.5% performance improvement with only a 6.9% increase in token length. This yields a substantially better performance-to-token ratio, highlighting the framework’s efficiency in guiding model behavior with minimal input overhead.
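Taking the performance-to-token ratio to mean ARPI divided by the relative increase in input length (an illustrative reading of the figures above, not necessarily the exact metric reported), the persuasion framework attains

\[
\frac{22.5\%}{6.9\%} \approx 3.3
\]

relative performance points per percentage point of added input, whereas the full information set pays a 26% token overhead for its gain.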
Q3: How well does the signaling strategy generalize across different Receivers, varying difficulties, and various tasks?
In our extended evaluation, we assess the generalizability of the Advisor across different Receivers. As shown in the performance table, the Advisor’s learned signaling strategy consistently improves performance across a range of models and tasks, demonstrating its robustness and transferability.
To examine this further, we follow the Easy-to-Hard Generalization setting proposed in prior work, which tests whether an Advisor trained on simple tasks can generalize to harder ones. Specifically, we train the Advisor on level 1–3 problems from the MATH training set and evaluate its effectiveness on both easy (levels 1–3) and hard (levels 4–5) problems in the MATH test set. As illustrated in the figure above, the Advisor significantly boosts the Receiver’s performance on harder problems, even though supervision is available only on the easier examples.
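The difficulty split can be sketched as follows (a minimal illustration, not the actual training pipeline: `load_math_split`, `train_advisor`, and `evaluate_receiver` are hypothetical callables for data loading, Advisor training, and persuaded-Receiver evaluation; MATH annotates difficulty as Level 1 through Level 5):

```python
def easy_to_hard_eval(load_math_split, train_advisor, evaluate_receiver):
    """Train the Advisor only on easy MATH problems (levels 1-3), then report
    Receiver accuracy on easy (1-3) and hard (4-5) test problems separately."""
    def level_of(problem):
        # MATH stores difficulty as a string such as "Level 3"
        return int(problem["level"].split()[-1])

    easy, hard = {1, 2, 3}, {4, 5}
    train_set = [p for p in load_math_split("train") if level_of(p) in easy]
    advisor = train_advisor(train_set)  # the Advisor only ever sees easy problems

    test_set = load_math_split("test")
    easy_acc = evaluate_receiver(advisor, [p for p in test_set if level_of(p) in easy])
    hard_acc = evaluate_receiver(advisor, [p for p in test_set if level_of(p) in hard])
    return easy_acc, hard_acc
```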