Continuous Adaptation via Meta-Learning in

Nonstationary and Competitive Environments

Abstract

The ability to continuously learn and adapt from limited experience in nonstationary environments is an important milestone on the path towards general intelligence. In this work, we cast the problem of continuous adaptation into the learning-to-learn framework. We develop a simple gradient-based meta-learning algorithm suitable for adaptation in dynamically changing and adversarial scenarios. Additionally, we design a new multi-agent competitive environment, RoboSumo, and define iterated adaptation games for testing various aspects of continuous adaptation strategies. We demonstrate that meta-learning enables significantly more efficient adaptation than reactive baselines in the few-shot regime. Our experiments with a population of agents that learn and compete suggest that meta-learners are the fittest.

Nonstationary locomotion

Setup: a 6-leg agent learns to run in a specified direction (East). To induce nonstationarity, we select a pair of the agent's legs and scale down the torques applied to the corresponding joints by a factor that linearly decreases from 1 to 0 over the course of 7 episodes. There are 15 ways to select a pair of legs of a 6-leg creature, which gives us 15 different nonstationary environments. We train our agents on 12 of these environments and test on the remaining 3.
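The 15 environments come from choosing 2 legs out of 6, i.e., C(6, 2) = 15. A minimal sketch of how the leg pairs and the per-episode torque-scaling schedule could be enumerated (names and the exact interpolation endpoints are illustrative, not the paper's code):

```python
from itertools import combinations

# Each nonstationary environment is identified by the pair of legs
# whose joint torques are scaled down.
LEGS = range(6)
ENVIRONMENTS = list(combinations(LEGS, 2))  # 15 distinct leg pairs

def torque_scale(episode, total_episodes=7):
    """Scaling factor applied to the damaged legs' torques:
    decreases linearly from 1 (first episode) to 0 (last episode)."""
    return 1.0 - episode / (total_episodes - 1)
```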

Goal: learn to adapt from episode to episode by changing the gait so that the moving speed in a given direction is maximal despite the changes in the environment.

Reward: the agent is rewarded proportionally to its moving speed in the given direction.

Observations: the absolute position and velocity of its torso, the angles and velocities of its legs.

Actions: torques applied to the joints (2 dimensions per leg).
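The gradient-based meta-learning rule adapts the policy parameters with one or a few gradient steps computed from the most recent experience. A toy numpy sketch of this inner adaptation loop on a synthetic, drifting objective (the quadratic loss, targets, and step size are illustrative stand-ins, not the paper's policy-gradient objective):

```python
import numpy as np

def adapt(theta, grad_fn, alpha=0.1):
    """One inner-loop adaptation step: phi = theta - alpha * grad L(theta).
    In the paper the gradient is a policy-gradient estimate computed from
    the previous episode; grad_fn stands in for that estimator here."""
    return theta - alpha * grad_fn(theta)

# Toy example: the "environment" shifts its optimum from episode to
# episode, and the agent tracks it by re-adapting each time.
theta = np.zeros(2)
for target in [np.array([1.0, 0.0]),
               np.array([1.0, 0.5]),
               np.array([0.5, 1.0])]:
    grad_fn = lambda th, t=target: 2.0 * (th - t)  # gradient of ||th - t||^2
    for _ in range(20):                            # a few adaptation steps
        theta = adapt(theta, grad_fn)
# after adaptation, theta tracks the most recent optimum
```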

mlp-back-7ep-front-7ep-final.mp4
lstm-back-7ep-front-7ep-final.mp4

Summary: The videos showcase the behavior of MLP and LSTM policies without (left) and with (right) meta-learned adaptation on two testing environments where a pair of front or back legs becomes dysfunctional (colored in red). Meta-learned adaptation allows the agents to change their gait and make progress in the specified direction.

RoboSumo

RoboSumo is a multi-agent competitive environment. We have designed 3 types of agents (creatures) that can compete against each other. To win, the agent has to push its opponent out of the ring (tatami) or make the opponent’s body touch the ground.

Reward: the winner gets +2000, the loser gets -2000, and in case of a draw both agents get -1000 (a few additional rewards are used during training, for details see the paper).

Observations: positions and velocities of the agent's torso and legs, forces applied to the agent's body, position of the opponent.

Actions: torques applied to the joints (2 dimensions per leg).

Evaluation: different adaptation methods are evaluated in iterated adaptation games: multi-round games where agents can adapt and change their behaviors from round to round (each round is 3 episodes). For consistency of evaluation, adapting agents play against the same opponents, which update their policies from round to round using self-play, i.e., an opponent's updates do not depend on the particular agent it competes with.
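An iterated adaptation game can be sketched as the loop below. This is a toy stand-in: episode outcomes are drawn from relative "skill" values, and the multiplicative skill updates merely mimic the agent adapting and the opponent improving via its own self-play schedule.

```python
import random

def play_episode(agent_skill, opponent_skill, rng):
    """Toy episode outcome: win probability proportional to relative skill."""
    p_agent = agent_skill / (agent_skill + opponent_skill)
    return "agent" if rng.random() < p_agent else "opponent"

def iterated_adaptation_game(rounds=10, episodes_per_round=3, seed=0):
    """Multi-round game: each round is a fixed number of episodes, and both
    sides change between rounds -- the agent by adapting to its opponent,
    the opponent by its own self-play schedule (independent of the agent)."""
    rng = random.Random(seed)
    agent_skill, opponent_skill = 1.0, 1.0
    score = {"agent": 0, "opponent": 0}
    for _ in range(rounds):
        for _ in range(episodes_per_round):
            score[play_episode(agent_skill, opponent_skill, rng)] += 1
        agent_skill *= 1.10     # stand-in for meta-learned adaptation
        opponent_skill *= 1.05  # stand-in for self-play improvement
    return score
```

The winner of the whole game is the side with more episode wins overall.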

Three types of agents: Ant (4 legs), Bug (6 legs), and Spider (8 legs).

Ant vs. different opponents

Ant uses meta-learned adaptation vs. opponents that improve from round to round via self-play. Both agents use LSTM policy architectures.

ant-vs-bug.mp4
ant-vs-spd.mp4

Summary:

  • Win-rates for meta-learner: initially 25-35% -> 40-48% by the end of the game [for baselines: initially 40-45% -> 15-20%].
  • The Ant is 30% heavier than the Bug. After adaptation, it starts exploiting this fact by switching to swift attacks.
  • The Spider, on the other hand, is 3x heavier than the Ant. The only viable strategy discovered by the Ant is to destabilize the opponent and slowly push it out.

Note: the opponents constantly improve from round to round via self-play, accumulating much more experience than a 3-episode interaction provides.

Bug vs. different opponents

Bug uses meta-learned adaptation vs. opponents that improve from round to round via self-play. Both agents use LSTM policy architectures.

bug-vs-ant.mp4
bug-vs-spd.mp4

Summary:

  • Win-rates for meta-learner: initially <40% -> 48-54% by the end of the game [for baselines: initially 45-55% -> <40%].
  • The Bug is 30% lighter than the Ant. After adaptation, the Bug often uses the strategy of going underneath the opponent, which gives it an edge.
  • The Spider is more than 3x heavier than the Bug but much less stable. Adaptation leads the Bug to more stable and careful attack strategies.

Note: the opponents constantly improve from round to round via self-play, accumulating much more experience than a 3-episode interaction provides.

Evolution of a population of competing agents

Setup: We've evolved a population of 1050 agents of different anatomies (Ant, Bug, Spider), policies (MLP, LSTM), and adaptation strategies (PPO-tracking, RL^2, meta-updates) for 10 epochs. Initially, we had an equal number of agents of each type. Every epoch, we randomly matched 1000 pairs of agents and made them compete and adapt in multi-round games against each other. The agents that lost disappeared from the population, while the winners replicated themselves.
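The selection dynamic of this setup — the loser of each match is removed and the winner replicates — can be sketched as follows. The per-type fitness values are illustrative; in the actual experiment, outcomes are decided by the multi-round games themselves, not by a fixed fitness table.

```python
import random

def evolve(population, fitness, epochs=10, matches_per_epoch=1000, seed=0):
    """Each epoch, randomly match pairs of agents; in every match the loser
    disappears from the population and the winner replicates itself,
    keeping the population size constant."""
    rng = random.Random(seed)
    population = list(population)
    for _ in range(epochs):
        for _ in range(matches_per_epoch):
            a, b = rng.sample(range(len(population)), 2)
            p_a = fitness[population[a]] / (fitness[population[a]]
                                            + fitness[population[b]])
            winner, loser = (a, b) if rng.random() < p_a else (b, a)
            population[loser] = population[winner]  # winner replicates
    return population

# Toy run: three adaptation strategies with unequal (hypothetical) fitness.
pop = ["meta"] * 350 + ["rl2"] * 350 + ["ppo-track"] * 350
result = evolve(pop, {"meta": 1.5, "rl2": 1.0, "ppo-track": 0.7})
```

Under this replicator-style dynamic, the subpopulation with the higher win probability tends to take over.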

population-evolution-via-adaptation-v2.mp4

Summary: After a few epochs of evolution, the Spiders, being the weakest, disappeared; the subpopulation of Bugs more than doubled; and the Ants stayed about the same. Importantly, the agents with meta-learned adaptation strategies ended up dominating the population.