Seamlessly interacting with humans or robots is hard because these agents are non-stationary. They update their policy in response to the ego agent's behavior, and the ego agent must anticipate these changes to co-adapt. Inspired by humans, we recognize that robots do not need to explicitly model every low-level action another agent will make; instead, we can capture the latent strategy of other agents through high-level representations. We propose a reinforcement learning-based framework for learning latent representations of an agent's policy, where the ego agent identifies the relationship between its behavior and the other agent's future strategy. The ego agent then leverages these latent dynamics to influence the other agent, purposely guiding them towards policies suitable for co-adaptation. Across several simulated domains and a real-world air hockey game, our approach outperforms the alternatives and learns to influence the other agent.
Learning and Influencing Latent Intent (LILI)
Our proposed approach for learning and leveraging latent intent. Left: Across repeated interactions, the ego agent uses its previous experience to predict the other agent's current latent strategy, and then follows a policy conditioned on this prediction. Right: The ego agent learns by sampling a pair of consecutive interactions, and (a) training the encoder and decoder to correctly reconstruct interaction k given interaction k-1, while simultaneously (b) using model-free RL to maximize the ego agent's long-term reward.
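As a rough illustration of step (a), the sketch below computes a reconstruction loss for one pair of consecutive interactions. It is a minimal sketch under stated assumptions, not the implementation behind the results on this page: the linear maps stand in for neural networks, the dimensions are hypothetical, and we assume the decoder predicts the next state and reward at each timestep of interaction k from the state, action, and latent strategy. The RL update in step (b) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
STATE_DIM, ACTION_DIM, LATENT_DIM, HORIZON = 2, 2, 1, 5
STEP_DIM = STATE_DIM + ACTION_DIM + 1  # state, action, reward per timestep

# Linear encoder and decoder weights (assumed stand-ins for neural networks).
W_enc = rng.normal(scale=0.1, size=(HORIZON * STEP_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(STATE_DIM + ACTION_DIM + LATENT_DIM, STATE_DIM + 1))

def encode(prev_interaction):
    """Map interaction k-1, shape (HORIZON, STEP_DIM), to a latent strategy z_k."""
    return prev_interaction.reshape(-1) @ W_enc

def decode(states, actions, z):
    """Predict next state and reward at each timestep of interaction k, given z_k."""
    z_tiled = np.tile(z, (states.shape[0], 1))
    inputs = np.concatenate([states, actions, z_tiled], axis=1)
    return inputs @ W_dec

def reconstruction_loss(prev_interaction, states, actions, next_states, rewards):
    """Mean squared error between decoded and observed (next state, reward) pairs."""
    z = encode(prev_interaction)
    pred = decode(states, actions, z)
    target = np.concatenate([next_states, rewards[:, None]], axis=1)
    return float(np.mean((pred - target) ** 2))

# Random placeholder data standing in for a sampled pair of interactions.
prev_interaction = rng.normal(size=(HORIZON, STEP_DIM))
states = rng.normal(size=(HORIZON, STATE_DIM))
actions = rng.normal(size=(HORIZON, ACTION_DIM))
next_states = rng.normal(size=(HORIZON, STATE_DIM))
rewards = rng.normal(size=HORIZON)

z = encode(prev_interaction)
loss = reconstruction_loss(prev_interaction, states, actions, next_states, rewards)
```

Because the latent strategy for interaction k is computed only from interaction k-1, minimizing this loss pushes the encoder to capture how the other agent's strategy changes in response to the ego agent's previous behavior.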
Simulated environments where the ego agent learns alongside another non-stationary agent. Between interactions, this other agent updates its policy: e.g., moving a hidden target or switching the lane it will merge into. Our approach learns the high-level strategies guiding these policies so both agents seamlessly co-adapt.
Simulations: Quantitative Results
(Left) Reward that the ego agent receives at each interaction while learning. (Right) Heatmap of the target position during the final 500 interactions. The ego agent causes the target to move clockwise or counterclockwise by ending the interaction inside or outside the circle. Our approach, LILI, exploits these latent dynamics to trap the target close to the start location, decreasing the distance to travel.
Because each latent strategy was equally useful to the ego agent, here LILI (no influence) is the same as LILI. Shaded regions show standard error of the mean.
Simulations: Qualitative Results
Below, we plot 25 consecutive interactions from policies learned by SAC, LILAC, LILI (no influence), and LILI. The gray circle represents the position of the other agent (which is unknown to the ego agent). The teal line marks the trajectory taken by the ego agent, and the teal circle represents the position of the ego agent at the final timestep of the interaction. At each timestep, the ego agent receives a reward equal to the negative distance to the other agent.
The SAC policy, at convergence, moves to the center of the circle in every interaction. Without knowing where the other agent is, or having any mechanism to infer it, the policy finds that the center of the circle gives the highest stable average return.
LILAC models the other agent's behavior dynamics as independent of the ego agent's actions. Its predictions of the other agent's position are therefore inaccurate.
LILI (no influence) successfully models the other agent's behavior dynamics and correctly navigates to the other agent in each interaction. However, it is not trained to influence the other agent to maximize its own long-term returns.
In contrast, LILI learns to trap the other agent at the top of the circle, where the other agent is closest to the starting position of the ego agent. There, it receives the highest rewards.
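To make the trapping behavior concrete, here is a minimal sketch of the latent dynamics in this domain. The per-timestep reward (negative distance to the other agent) comes from the description above; the circle radius, the angular step size, and the mapping of "inside" to clockwise and "outside" to counterclockwise are assumptions chosen for illustration. The loop uses a hand-written alternating rule, not a learned policy.

```python
import math

RADIUS = 1.0        # hypothetical circle radius
STEP = math.pi / 4  # hypothetical angular step of the target between interactions

def target_position(angle):
    """Position of the other agent on the circle at the given angle."""
    return (RADIUS * math.cos(angle), RADIUS * math.sin(angle))

def step_reward(ego_pos, target_pos):
    """Per-timestep reward: negative Euclidean distance to the other agent."""
    return -math.dist(ego_pos, target_pos)

def next_target_angle(angle, ego_final_pos):
    """Assumed rule: ending inside the circle moves the target clockwise,
    ending outside moves it counterclockwise."""
    inside = math.hypot(*ego_final_pos) < RADIUS
    return angle - STEP if inside else angle + STEP

# Alternate between ending just inside and just outside the circle,
# which keeps the target oscillating near the top.
angle = math.pi / 2  # target starts at the top of the circle
angles = [angle]
for k in range(6):
    scale = 0.9 if k % 2 == 0 else 1.1
    x, y = target_position(angle)
    ego_final = (scale * x, scale * y)
    angle = next_target_angle(angle, ego_final)
    angles.append(angle)
```

Under these assumed dynamics, the target never drifts more than one step away from the top of the circle, which is the kind of influence that a policy optimizing long-term reward can discover.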
Below, we visualize 10 consecutive trajectories from policies learned by SAC and LILI (no influence).
The converged SAC policy lands midway between the two alternating targets, because it does not model the changing target and instead averages over its experiences.
Meanwhile, LILI (no influence) correctly anticipates the position of the target and lands on the correct side in each interaction.
Below, we visualize trajectories from policies learned by SAC and LILI (no influence).
The converged SAC policy changes to the left lane in every interaction, even when the other agent moves to the left lane as well.
Meanwhile, LILI (no influence) correctly anticipates the lane that the other agent switches to and switches to the opposite lane accordingly.
Real-World Robotic Air Hockey
In this domain, the ego agent learns to play hockey against a non-stationary robot. The other robot updates its policy between interactions to exploit the ego agent's weaknesses. Over repeated interactions, the ego agent can learn to represent each opponent policy as a high-level latent strategy and also recognize that the opponent updates its strategy to aim away from where the ego agent last blocked. The ego agent can then leverage these latent dynamics to influence the other robot, and learn a policy that guides the opponent into aiming where the ego agent can block best.
Hockey: Quantitative Results
Learning results for the air hockey experiment. (Left) Success rate across interactions. (Right) How frequently the opponent fired left, middle, or right during the final 200 interactions. Because the ego agent receives a bonus reward for blocking left, it should influence the opponent to fire left. At convergence, LILI (no influence) gets an average reward of 1.0 +/- 0.05 per interaction, while LILI gets 1.15 +/- 0.05.
Hockey: Qualitative Results
Please refer to our supplementary video (top of this page) for qualitative results.