Abstract

Adversarial attacks in reinforcement learning (RL) often assume highly-privileged access to the victim’s parameters, environment, or data. Instead, this paper proposes a novel adversarial setting called a Cheap Talk MDP in which an Adversary can merely append deterministic messages to the Victim’s observation, resulting in a minimal range of influence. The Adversary cannot occlude ground truth, influence underlying environment dynamics or reward signals, introduce non-stationarity, add stochasticity, see the Victim’s actions, or access their parameters. Additionally, we present a simple meta-learning algorithm called Adversarial Cheap Talk (ACT) to train Adversaries in this setting. We demonstrate that an Adversary trained with ACT can still significantly influence the Victim’s training and testing performance, despite the highly constrained setting. Affecting train-time performance reveals a new attack vector and provides insight into the success and failure modes of existing RL algorithms. More specifically, we show that an ACT Adversary is capable of harming performance by interfering with the learner’s function approximation, or instead helping the Victim’s performance by outputting useful features. Finally, we show that an ACT Adversary can manipulate messages during train-time to directly and arbitrarily control the Victim at test-time. Code is available at https://github.com/luchris429/Adversarial-Cheap-Talk.

Problem Setting

We introduce a minimal adversary that can only append to the observation as a stationary deterministic function of the rest of the state. Unlike the adversarial settings in past work, we prove that the adversary cannot:

- occlude the ground-truth observation,
- influence the underlying environment dynamics or reward signals,
- introduce non-stationarity,
- add stochasticity, or
- see the Victim's actions.

Furthermore, the adversary does not have access to the Victim's parameters. Note that this gives the adversary a minimal range of influence.
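To make the constraint concrete, below is a minimal sketch of how the cheap talk channel could be realised as a thin wrapper around a gym-style environment: the adversary's parameters ϕ define a fixed, deterministic function of the current observation whose output is simply concatenated onto it, leaving the dynamics and reward untouched. This is an illustrative sketch rather than the repository's implementation; the wrapper name, the linear-plus-sigmoid message function, and the classic reset/step API are assumptions.

import numpy as np

class CheapTalkWrapper:
    # Appends a deterministic adversary message to every observation (sketch).
    # phi parameterises a fixed, stationary function of the true observation;
    # the wrapper never touches the underlying state, dynamics, or reward.

    def __init__(self, env, phi):
        self.env = env
        self.phi = phi  # e.g. {"W": (k, obs_dim) array, "b": (k,) array}

    def _message(self, obs):
        # Deterministic message in [0, 1]^k, computed from the observation only.
        logits = self.phi["W"] @ obs + self.phi["b"]
        return 1.0 / (1.0 + np.exp(-logits))

    def reset(self):
        obs = self.env.reset()
        return np.concatenate([obs, self._message(obs)])

    def step(self, action):
        # The message never depends on the action; dynamics, reward, and
        # termination pass through untouched.
        obs, reward, done, info = self.env.step(action)
        return np.concatenate([obs, self._message(obs)]), reward, done, info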

We also show that an adversary with control over φ cannot influence tabular RL agents or any RL agents with optimal convergence guarantees. Thus, any impact our Adversary has on the Victim must be a result of failures in the Victim's function approximation or algorithm. This allows us to clearly demonstrate and understand failure modes in current RL algorithms.

Train-Time Influence

Objectives

When influencing train-time performance, we set J to be the agent’s mean reward throughout its entire training trajectory. We consider both “Adversarial” and “Allied” versions of ACT, in which the Adversary tries to minimise or maximise J respectively (i.e., the Adversary’s objective is ±J).
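As a rough summary of this meta-objective (an illustrative sketch, not the exact training code), the fitness assigned to a candidate ϕ is the Victim's mean reward collected over a full training run in the cheap-talk-augmented environment, with the sign flipped for the Adversarial variant. The helper names make_env and train_victim are assumptions.

import numpy as np

def adversary_fitness(phi, make_env, train_victim, sign=-1.0):
    # Build the environment with the cheap talk channel driven by phi,
    # e.g. by wrapping it with CheapTalkWrapper above.
    env = make_env(phi)
    # Train a fresh Victim from scratch and record the rewards it collects
    # throughout training (not just its final performance).
    rewards_during_training = train_victim(env)
    J = np.mean(rewards_during_training)
    # sign = -1 gives the Adversarial variant (minimise J); +1 gives the Ally.
    return sign * J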

Visualising Train-Time Influence Results

In the visualisations below, the underlying original state of the environment is shown in blue. The appended cheap talk information is shown in red. The green bar represents the agent's action. Notice how the agents paired with the Ally consistently perform well. In the Adversary cases, the agent seems to have learned a locally optimal strategy instead of a globally optimal one. In both cases, the agent's actions and the cheap talk channel outputs appear highly correlated, suggesting that the messages have a significant influence on the agent's policy.

Cartpole with Ally

The agent performs well and balances the pole for 200 steps.

Pendulum with Ally

The agent performs well and balances the pendulum vertically.

Reacher with Ally

The agent performs well and reaches the blue circle.

Cartpole with Adversary

The agent seems to have learned to balance the pole, but often hits the edges before it reaches 200 steps (max score).

Pendulum with Adversary

The agent seems to have learned just to spin instead of balancing the pendulum vertically.

Reacher with Adversary

The agent seems to have learned another spinning behaviour instead of heading directly towards the goal.

Test-Time Manipulation

Objectives

When manipulating test-time behaviour, the goal of the Adversary is to use the cheap talk features to maximise some arbitrary objective J during the Victim’s test-time; however, the Adversary may also communicate messages during the Victim’s training. Because the train-time and test-time behaviour of the Adversary differ significantly, we parameterise them separately (as ϕ and ψ respectively), but optimise them jointly.

As an example, consider the Reacher environment, where the Victim is trained to control a robot arm to reach for the blue circle, as seen above. During the Victim’s training, the train-time Adversary (parameterised by ϕ) manipulates the cheap talk features to encode spurious correlations in the Victim’s policy. At test-time, the test-time Adversary (parameterised by ψ) manipulates the cheap talk features to exploit those spurious correlations and control the Victim so that it instead reaches for the yellow circle (shown below), which is the Adversary’s goal-conditioned objective.
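A rough way to picture the joint objective (a sketch under assumed helper functions, not the actual training code) is below: ϕ shapes the channel during the Victim's training run, ψ replaces it for the test rollout, and both are scored by the Adversary's test-time objective, e.g. proximity to the yellow goal.

def manipulation_fitness(phi, psi, make_env, train_victim, test_rollout):
    # Train the Victim with the cheap talk channel driven by phi; this is
    # where the backdoor (the spurious correlation) gets planted.
    victim_params = train_victim(make_env(phi))
    # At test-time the channel is driven by psi instead. The Victim's
    # parameters are frozen and the Adversary never reads them.
    # Returns the Adversary's arbitrary test-time objective J, e.g. the
    # negative distance of the arm tip to the yellow goal in Reacher.
    return test_rollout(victim_params, make_env(psi))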

In other words, the train-time Adversary wants to create a backdoor that makes the Victim susceptible to manipulation at test-time, and the test-time Adversary wants to use this backdoor to control the Victim. The train-time and test-time Adversaries (ϕ and ψ) are co-evolved and trained end-to-end to maximise J. While such optimisation would be difficult for gradient-based methods due to the long-horizon nature of the problem, evolution strategies (ES) are agnostic to the length of the optimisation horizon.
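One way this joint, long-horizon optimisation could look in code is a simple rank-based ES update on the concatenated (ϕ, ψ) vector, sketched below; because ES only ever sees a scalar fitness per candidate, it does not matter that each evaluation wraps an entire Victim training run plus a test rollout. The population size, noise scale, learning rate, and rank-normalised update rule here are assumptions, not the paper's exact hyperparameters.

import numpy as np

def coevolve_step(params, fitness_fn, pop_size=32, sigma=0.02, lr=0.01):
    # params is the flat concatenation of phi and psi; fitness_fn evaluates a
    # candidate end-to-end, e.g. by splitting it and calling
    # manipulation_fitness above.
    noise = np.random.randn(pop_size, params.size)
    fitness = np.array([fitness_fn(params + sigma * eps) for eps in noise])
    # Rank-normalise the fitness values so the update is scale-free.
    ranks = fitness.argsort().argsort() / (pop_size - 1) - 0.5
    # Estimate the search gradient and step phi and psi jointly.
    grad = (ranks[:, None] * noise).mean(axis=0) / sigma
    return params + lr * grad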

Note that the test-time Adversary ψ only gets a single shot to maximise J at the end of the Victim’s training and does not have access to (and thus cannot train against) the Victim’s test-time parameters θ.

Visualising Test-Time Manipulation Results

The visualisations below are broken into two parts.  

The first part shows the trained Victim with the cheap talk channel still driven by ϕ, indicated by the channel being coloured red. This is the cheap talk distribution the agent was trained with, and the agent generally performs well in this setting.

The second part shows the same trained Victim (still trained with ϕ) after the cheap talk channel has been swapped out for ψ, indicated by the channel being coloured yellow. Without any extra training or fine-tuning, the agent instantly optimises a different goal determined by the Adversary. In Cartpole, ψ uses the cheap talk channels to make the agent move to an arbitrary spot on the x-axis. In Pendulum, ψ gets the agent to reach for a different goal position instead of balancing purely vertically. In Reacher, ψ has the agent reach for the yellow goal instead of the blue one.


Cartpole with ϕ

Pendulum with ϕ

Reacher with ϕ

Cartpole with ψ 

The test-time (ψ) goal is to have the Cartpole stay at the yellow box (instead of just balancing the pole).

Pendulum with ψ 

The test-time (ψ) goal is to have the Pendulum's angle be close to the fixed yellow one (instead of balancing vertically). 

Reacher with ψ 

The test-time (ψ) goal is to reach for the yellow circle (instead of the blue one).

Citation:

@article{lu2022adversarial,
  title={Adversarial Cheap Talk},
  author={Lu, Chris and Willi, Timon and Letcher, Alistair and Foerster, Jakob},
  journal={arXiv preprint arXiv:2211.11030},
  year={2022}
}