Appendix

DQN Results

Preliminary Results with DQN

Visualisations of the Victim's training curves against different Adversaries in Cartpole. Error bars denote the standard error across 10 seeds of Victims, each trained against a single trained Adversary.
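For reference, a minimal sketch of how such error bars can be computed, assuming `returns` holds the episodic returns of the 10 seeds; the placeholder array and its shape are illustrative, not the paper's actual data.

```python
import numpy as np

# Placeholder data: 10 seeds x 200 evaluation points of episodic returns.
returns = np.random.default_rng(0).normal(loc=100.0, scale=10.0, size=(10, 200))

mean = returns.mean(axis=0)                                        # solid training curve
stderr = returns.std(axis=0, ddof=1) / np.sqrt(returns.shape[0])   # error bars
```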

Gradient Catastrophic Interference Plots

To perform this analysis, we collect each Victim's experience buffer before the agents have converged in training and split it into 10 bins, ordered by time-step within the environment. We then compute the gradient update the agent would perform on each bin. In the Adversarial setting (a), the gradient updates for transitions sampled early in an episode interfere with those for transitions sampled later in the episode. In the Allied setting (c), by contrast, those gradient updates are positively correlated, suggesting that they aid each other.
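To make this concrete, the following is a minimal sketch of one way to carry out the computation, assuming a small PyTorch DQN; `QNet`, `bin_gradient`, `interference_matrix`, and the toy transitions are illustrative assumptions rather than the paper's implementation. The pairwise cosine similarities between per-bin gradients are what the plots below visualise: negative entries indicate interfering updates, positive entries indicate aligned ones.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """A small Q-network standing in for the Victim (illustrative)."""
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

def bin_gradient(q, batch, gamma=0.99):
    """Flattened gradient of the DQN TD loss on one bin of transitions."""
    s, a, r, s2, done = batch
    q_sa = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q(s2).max(1).values
    loss = ((q_sa - target) ** 2).mean()
    grads = torch.autograd.grad(loss, tuple(q.parameters()))
    return torch.cat([g.flatten() for g in grads])

def interference_matrix(q, bins):
    """Pairwise cosine similarity between per-bin gradients."""
    gs = torch.stack([bin_gradient(q, b) for b in bins])
    gs = gs / gs.norm(dim=1, keepdim=True)
    return gs @ gs.T

# Toy usage: 10 bins of random transitions, ordered by time-step.
torch.manual_seed(0)
q = QNet()
def fake_bin(n=32):
    return (torch.randn(n, 4), torch.randint(0, 2, (n,)),
            torch.randn(n), torch.randn(n, 4), torch.zeros(n))
bins = [fake_bin() for _ in range(10)]
print(interference_matrix(q, bins))  # 10x10 interference matrix
```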

[Figures: gradient interference plots for Reacher (Adversary, Random, Ally) and Pendulum (Adversary, Random, Ally).]

Cartpole Test-Time Mean and Variance Plots

We train 10 different Victims alongside the learned φ (left column) and 10 different Victims alongside a randomly generated φ (right column) in the Cartpole environment. We show the mean policy probability of going left across the 10 Victims as we vary the value of the message in multiple randomly selected states. The learned φ consistently induces similar policy outputs across different states with respect to the cheap talk channel, implying that it shapes the Victim in a consistent way.
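As a rough sketch of this sweep, assume each Victim maps an observation to action probabilities and the cheap-talk message occupies the last observation dimension; the linear softmax "Victims" and all names below are illustrative stand-ins, not the trained agents.

```python
import numpy as np

def sweep_message(victims, states, messages):
    """Mean and std of P(left) across Victims for each (state, message)."""
    probs = np.empty((len(victims), len(states), len(messages)))
    for v, victim in enumerate(victims):
        for s, state in enumerate(states):
            for m, msg in enumerate(messages):
                obs = np.concatenate([state, [msg]])  # message in last slot
                probs[v, s, m] = victim(obs)[0]       # P(action = left)
    return probs.mean(axis=0), probs.std(axis=0)

# Toy usage with random linear "Victims" (softmax over Q-values).
rng = np.random.default_rng(0)

def make_victim():
    W = rng.normal(size=(2, 5))  # 4 Cartpole state dims + 1 message dim
    def policy(obs):
        z = np.exp(W @ obs - (W @ obs).max())  # numerically stable softmax
        return z / z.sum()
    return policy

victims = [make_victim() for _ in range(10)]
states = rng.normal(size=(5, 4))      # randomly selected states
messages = np.linspace(0.0, 1.0, 11)  # sweep over the message value
mean, std = sweep_message(victims, states, messages)  # each of shape (5, 11)
```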