Abstract
Deep reinforcement learning algorithms often suffer from reward misspecification, causing them to learn suboptimal policies. In the single-agent case, Inverse Reinforcement Learning (IRL) techniques address this issue by inferring the reward function from expert data. However, in multi-agent problems, misalignment between the learned and true objectives is exacerbated by environment non-stationarity and variance that scale with the number of agents. As a result, in multi-agent general-sum games, multi-agent IRL (MAIRL) algorithms have difficulty balancing cooperative and competitive objectives. To address these issues, we propose Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL), a novel sample-efficient framework for multi-agent IRL. MAMQL uses a per-agent learned marginalization of the action-value function to approximate each agent's reward function. For each agent, MAMQL learns a critic marginalized over the other agents' policies, allowing for a well-motivated use of Boltzmann policies in the multi-agent setting. We identify a connection between optimal marginalized critics and single-agent soft-Q IRL, which lets us apply a simple, direct optimization criterion from the single-agent domain. Across our experiments on three simulated domains, MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery, often by more than 2-5x.
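To make the marginalization idea concrete, below is a minimal sketch, not the authors' implementation, of how a per-agent joint critic can be averaged over another agent's policy and turned into a Boltzmann policy. The two-agent, discrete-action setting and the names joint_q, other_policy, and temperature are illustrative assumptions.

```python
# Illustrative sketch: for agent i, average the joint critic Q_i(s, a_i, a_-i)
# over the other agent's policy, then act with a Boltzmann (softmax) policy
# over the marginalized values. All names are assumptions for illustration.
import numpy as np

def marginal_q(joint_q: np.ndarray, other_policy: np.ndarray) -> np.ndarray:
    """joint_q: (n_actions_i, n_actions_other) action values for a fixed state.
    other_policy: (n_actions_other,) action probabilities of the other agent.
    Returns the (n_actions_i,) marginalized action values for agent i."""
    return joint_q @ other_policy

def boltzmann_policy(q_values: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Softmax over marginalized Q-values, the soft-Q / max-entropy policy form."""
    logits = q_values / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: agent i has 3 actions, the other agent has 2 actions.
joint_q = np.array([[1.0, 0.0],
                    [0.5, 0.5],
                    [0.0, 2.0]])
other_policy = np.array([0.7, 0.3])
pi_i = boltzmann_policy(marginal_q(joint_q, other_policy), temperature=0.5)
print(pi_i)  # action distribution for agent i at this state
```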
Environments
Results
The table below reports the average reward over 1k evaluation episodes for each algorithm.
MAMQL jointly learns a reward function and a policy for each agent. As shown in both the figures and the table, the learned policies achieve higher returns than the SOTA baselines, and the learned reward functions more closely recover the underlying objectives.
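For reference, a minimal sketch of how such an average-reward figure could be computed for learned per-agent policies is shown below; the env and policies interfaces (a PettingZoo-style parallel API) are assumptions, not the paper's benchmarking code.

```python
# Illustrative evaluation sketch: average per-agent episodic return over many
# evaluation episodes. The multi-agent env API and `policies` mapping are
# assumptions for illustration, not the paper's actual evaluation harness.
from typing import Callable, Dict

def average_reward(env, policies: Dict[str, Callable], n_episodes: int = 1000) -> Dict[str, float]:
    totals = {agent_id: 0.0 for agent_id in policies}
    for _ in range(n_episodes):
        observations = env.reset()
        done = False
        while not done:
            # Each agent acts on its own observation.
            actions = {aid: policies[aid](obs) for aid, obs in observations.items()}
            observations, rewards, dones, _ = env.step(actions)
            for aid, r in rewards.items():
                totals[aid] += r
            done = all(dones.values())
    # Mean episodic return per agent.
    return {aid: total / n_episodes for aid, total in totals.items()}
```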