Influencing Towards Stable Multi-Agent Interactions
Learning in multi-agent environments is difficult due to the non-stationarity introduced by an opponent's or partner's changing behaviors. Instead of reactively adapting to the other agent's (opponent or partner) behavior, we propose an algorithm that proactively influences the other agent's strategy to stabilize, which restrains the non-stationarity caused by the other agent. We learn a low-dimensional latent representation of the other agent's strategy and a model of how that latent strategy evolves with respect to our robot's behavior. With this learned dynamics model, we define an unsupervised stability reward that trains our robot to deliberately influence the other agent to converge toward a single strategy. We demonstrate that stabilizing improves the efficiency of maximizing the task reward in a variety of simulated environments, including autonomous driving, emergent communication, and robot reaching.
Link to download anonymized code: https://drive.google.com/file/d/1ZFRB-jYhcMXW3j0qdmR7fCSLxJZyc2E0/view?usp=sharing
Coordination with a partner can be critical to succeeding in a task. For example, in this game of volleyball, the red and green agents must each choose to either dig for the ball or step back for their partner. (Left) A partner that unpredictably and indecisively switches between digging and staying back can be extremely difficult to coordinate with, potentially costing the match. (Right) By deliberately stepping back, the red agent can stabilize the partner's strategy to dig for the ball, establishing a convention that makes it easier to learn how to win the overall volleyball match.
Stable Influencing with Latent Intent (SILI)
We learn a dynamics model of the opponent's strategies conditioned on the ego agent's past trajectory. (Left) The latent strategy is learned in an unsupervised manner by jointly learning a decoder that predicts the state transitions and task rewards. (Right) We then combine a stability reward with the task reward to train our policy using RL. The stability reward is unsupervised and is defined as minimizing the pairwise distance between the two most recently predicted latent strategies.
Quantitatively, we show the task and stability reward curves across all of our simulated environments. Across all environments, SILI (our method) achieves comparable performance to Oracle. Compared to the other baselines, SILI significantly outperforms them in task reward by learning to stabilize the opponent's strategy. Qualitatively, we show trajectories from interactions with our algorithm SILI compared to baselines to visualize the differences in behavior among algorithms. We show a subset of the baselines in the GIFs with an emphasis on algorithms that converge to an interpretable behavior.
Circle Point Mass
Circle Point Mass: In this environment, the ego agent tries to get as close to the opponent as possible in a 2D plane, inspired by pursuit-evasion games. The opponent moves between target locations along the circumference of a circle, and the ego agent never observes the opponent's true location (strategy). If the ego agent ends an interaction inside the circle, the opponent jumps counterclockwise to the next target location. If the ego agent ends an interaction outside the circle, the opponent stays at its current target location for the next interaction. We examine four variants of this environment: Circle (3 Goals), Circle (8 Goals), Circle (Continuous), and Circle (Unequal). In Circle (Continuous), there are infinitely many possible opponent strategies, so we use a continuous latent representation. In Circle (Unequal), two of the opponent strategies are more beneficial because the ego agent begins the interaction closer to their respective goals, but there is a farther-away goal that can be kept stable.
The GIFs below show results from the Circle (3 Goals) environment, where the ego agent begins the interaction at the center of the circle and tries to reach the opponent, who is represented by the filled-in, colored goal (red, green, or blue). Each frame displays an entire trajectory from an interaction with the opponent.
SILI learns to stabilize the opponent's strategy by safely ending each interaction outside of the circle. By doing so, SILI can more easily optimize for the task reward by going near the red target.
LILI learns a latent representation of the opponent's strategy, but does not optimize for stability, so it must learn the nuances of the opponent's complex latent strategy dynamics. As LILI greedily optimizes the task reward by moving close to the circle's boundary, it risks triggering an unexpected transition in the opponent's strategy.
Stable is the variant of SILI that optimizes purely for stability. Notice that Stable does learn to end each interaction outside the circle, but it has no incentive to move near the true goal.
Since SAC does not model the opponent's strategies, its local optimum is to move toward the centroid of the 3 goals when the true goal is unknown.
Driving: A fast ego agent attempts to pass a slow opponent driver. There are 3 lanes and an upcoming road hazard in the center lane, so both the ego agent and the opponent need to merge into a new lane. If the ego agent merges to the left lane before the red line (giving the opponent enough reaction time), then the opponent merges to the right lane during the next interaction, following the convention of faster vehicles passing on the left. Otherwise, the opponent will aggressively try to cut off the ego agent by merging into the lane that the ego agent previously passed in.
SILI learns to stabilize the opponent's strategy by passing into the left lane before the red passing line. This stabilizes the opponent to always merge to the right. By doing so, the ego agent and the other agent establish the convention of the faster vehicle passing on the left, and SILI is able to avoid all collisions.
LILI learns a poor approximation of the latent strategy dynamics; it can sometimes coordinate by alternately passing on the right or left, but it does not account for the subtle change in strategy dynamics that results from passing before the red line. This leads to unacceptable collisions.
Without modeling the opponent's strategies, SAC accepts colliding with the opponent vehicle, causing many collisions and leading to poor task reward.
Sawyer-Reach: The opponent chooses between three goals on a table, with its intent hidden from the robot. The ego agent is the Sawyer robot, which tries to move its end effector as close as possible to the opponent's chosen goal in 3D space without ever directly observing the opponent's strategy. If the end effector ends the interaction above a fixed plane on the z-axis, the opponent's strategy stays fixed. Otherwise, the opponent's strategy changes. Semantically, we consider a robot server trying to place food on a dish for a human: the robot needs to move its arm away from the dish by the end of the interaction to avoid intimidating the human retrieving the food.
The oracle tracks the targets well without needing to stabilize the opponent's strategy at all because the true goal location is known. Our robot trained with SILI is able to achieve a similar reward without needing to directly observe the opponent's strategy by first stabilizing, and then optimizing for the task reward.
SILI learns to stabilize the opponent's strategy (choice of goal location) by ending each interaction above the plane z = 0.08 (stabilizing condition). This allows SILI to more easily learn how to reach the opponent's goal location, without ever directly observing the opponent's strategy. The stabilizing condition is difficult to see in the GIF, but notice that the opponent's strategy is kept stable between interactions, unlike in the trajectories from the baselines.
LILI struggles to model the exact latent strategy dynamics, so it imperfectly tracks the true goal. LILI also does not identify that stabilizing the opponent's strategy will benefit the task reward in the long term. Thus, LILI, along with the SAC baseline, converges to moving toward the centroid of the three possible goal locations. SMiRL also converges to a similar policy in order to minimize surprise in the robot's observations.
Detour Speaker-Listener: In this environment, the agents need to learn effective communication in order to reach their goals, a popular setting for studying emergent communication. The opponent is the speaker, which does not move and observes the true goal location of the ego agent. The ego agent is the listener, which cannot speak but must navigate to the correct goal. The opponent utters a message to refer to each landmark, which the ego agent then observes. If the ego agent goes near the speaker (within some radius), the opponent follows the same communication strategy during the next interaction. Otherwise, the speaker chooses a random new strategy, i.e., a new mapping of goal landmarks to communication actions. Critically, in this environment, the opponent's strategy is the communication strategy, not the true goal location.
SILI learns to stabilize the opponent's strategy by taking the detour to go near the speaker (grey) and learning to decipher the speaker's communication to reach the true goal.
Recall that in this environment, the speaker's strategy can change every interaction if not stabilized. With frequently changing latent strategies, LILI struggles not only to predict the next latent strategy, but also to associate the strategy and communicated goal with the true underlying goal. This leads LILI to move toward the centroid of the three possible goals.
In order to minimize the surprise or entropy of states, SMiRL keeps its interactions near the initial state. This satisfies the long-term probabilistic notion of stability in state, but does not stabilize the opponent's unobserved strategy.
SAC struggles to decipher the speaker's (the opponent's) messages, so the agent does not know which goal to move toward.