FightLadder

A Benchmark for Competitive Multi-Agent Reinforcement Learning


Wenzhe Li    Zihan Ding    Seth Karten    Chi Jin

Princeton University

[Paper]       [Code]

Abstract

Recent advances in reinforcement learning (RL) heavily rely on a variety of well-designed benchmarks, which provide environmental platforms and consistent criteria to evaluate existing and novel algorithms. Specifically, in multi-agent RL (MARL), a plethora of benchmarks based on cooperative games have spurred the development of algorithms that improve the scalability of cooperative multi-agent systems. However, for the competitive setting, a lightweight and open-sourced benchmark with challenging gaming dynamics and visual inputs has not yet been established. In this work, we present FightLadder, a real-time fighting game platform, to empower competitive MARL research. Along with the platform, we provide implementations of state-of-the-art MARL algorithms for competitive games, as well as a set of evaluation metrics to characterize the performance and exploitability of agents. We demonstrate the feasibility of this platform by training a general agent that consistently defeats 12 built-in characters in single-player mode, and expose the difficulty of training a non-exploitable agent without human knowledge and demonstrations in two-player mode. FightLadder offers meticulously crafted environments to tackle essential challenges in competitive MARL research, heralding a new era of discovery and advancement.

Supported Games

Mortal Kombat (Genesis)

Fatal Fury 2 (Genesis)

The King of Fighters '97 (Neo Geo)

Street Fighter II (Genesis)

Street Fighter III (Arcade)
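These games are exposed as RL environments through the platform. As a rough, hypothetical sketch of what an interaction loop could look like (the environment ID and interface below are placeholders, not the actual API; see the [Code] repository for the real usage):

# Minimal interaction sketch. The environment ID "FightLadder/StreetFighterII-v0"
# and the exact interface are hypothetical placeholders; see the code repository
# for the real API.
import gymnasium as gym

env = gym.make("FightLadder/StreetFighterII-v0", render_mode="rgb_array")
obs, info = env.reset(seed=0)

episode_return = 0.0
for _ in range(1000):
    action = env.action_space.sample()  # random policy as a stand-in for a trained agent
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:
        break
env.close()
print("episode return (random policy):", episode_return)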

Single-Agent RL Experiments

In the single-player setting, we propose a learning scheme based on curriculum learning. It trains a general RL agent (Proximal Policy Optimization, PPO, in our experiments) that can consistently beat built-in computer players (CPUs) across different characters. For Street Fighter II, the training process of the agent over 25 epochs is shown below. Each epoch involves 10M training steps against opponents sampled in parallel from the curriculum scheduler.
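A minimal sketch of this opponent-sampling idea is given below. The exact scheduling rule is described in the paper; the class and the weighting used here are simplified, illustrative assumptions.

import random
from collections import defaultdict

class CurriculumScheduler:
    """Illustrative opponent scheduler: opponents (characters/levels) that the
    agent still struggles against are sampled more often. This is a simplified
    stand-in for the scheduler described in the paper."""

    def __init__(self, opponents, eps=0.1):
        self.opponents = list(opponents)      # e.g., the built-in characters/levels
        self.win_rate = defaultdict(float)    # running win rate against each opponent
        self.eps = eps                        # keep nonzero probability for every opponent

    def update(self, opponent, won):
        # Exponential moving average of the win rate against this opponent.
        self.win_rate[opponent] = 0.9 * self.win_rate[opponent] + 0.1 * float(won)

    def sample(self, k):
        # Weight opponents by (1 - win rate) + eps, so harder opponents appear more often.
        weights = [1.0 - self.win_rate[o] + self.eps for o in self.opponents]
        return random.choices(self.opponents, weights=weights, k=k)

# One epoch: draw an opponent for each parallel environment, then run PPO for 10M steps.
scheduler = CurriculumScheduler(["Ryu", "Ken", "Guile", "Chun-Li"])  # subset for brevity
opponents_for_epoch = scheduler.sample(k=8)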

The following video shows that the learned single-agent PPO (left, Ryu, in grey) is capable of passing through all 15 levels of Street Fighter II.

target_16_episode_1_done_16.mp4

Multi-Agent RL Experiments

For the two-player setting (controlling both sides), we adopt five MARL algorithms for agent training: three population-based methods (league training, PSRO, and FSP) and two decentralized methods (IPPO and two-timescale IPPO).

For the three population-based methods (league training, PSRO, and FSP), the dynamics of the payoff matrices within each learning population during training are shown below. The name of each row encodes the agent information as Character_Side_Checkpoint, where Checkpoint=h_xM denotes a previous version of the agent saved at x million steps. Each value indicates the win rate of the left (row) player against the right (column) player. Please refer to our paper for more details.

League Training

PSRO

FSP
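For reference, each entry of such a payoff matrix can be estimated by repeated matches between a row (left) agent and a column (right) agent. The sketch below assumes a hypothetical play_match helper that returns 1 for a left-player win, 0.5 for a draw, and 0 for a loss.

import numpy as np

def estimate_payoff_matrix(left_agents, right_agents, play_match, n_games=50):
    """Entry (i, j) is the estimated win rate of left (row) agent i against
    right (column) agent j, matching the matrices shown above."""
    payoff = np.zeros((len(left_agents), len(right_agents)))
    for i, left in enumerate(left_agents):
        for j, right in enumerate(right_agents):
            results = [play_match(left, right) for _ in range(n_games)]
            payoff[i, j] = float(np.mean(results))
    return payoff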

RL Exploitation

To examine the exploitability of agents learned with the above MARL algorithms, we use a single-agent RL algorithm (PPO) to train an exploiter model against the fixed best agent from each multi-agent algorithm. Videos of learned agents (left, Ryu, in grey) versus RL exploiters (right, Ryu, in white) are shown below.

In general, after sufficient training, the exploiter can eventually beat the learned agents consistently, while agents from different algorithms reveal different weaknesses.

exploit_ppo_left.mp4

PPO (Single-Agent Baseline) vs PPO Exploiter

exploit_ippo_left.mp4

IPPO vs PPO Exploiter

exploit_2timescale_left.mp4

Two-timescale IPPO vs PPO Exploiter

exploit_league_left.mp4

League Training vs PPO Exploiter

exploit_psro_left.mp4

PSRO vs PPO Exploiter

exploit_fsp_left.mp4

FSP vs PPO Exploiter
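The exploiters above are trained as ordinary single-agent PPO against a frozen opponent. A minimal sketch of this setup is shown below, assuming a Gymnasium-style two-player environment and Stable-Baselines3 PPO; the wrapper, environment ID, and action convention are hypothetical placeholders rather than the platform's actual API.

import gymnasium as gym
from stable_baselines3 import PPO

class FixedOpponentWrapper(gym.Wrapper):
    """Reduce a two-player fighting env to a single-agent env: a frozen opponent
    policy controls the left side, and the learner controls the right side.
    Assumes the wrapped env returns the right-side player's observation and
    reward (a simplifying assumption for this sketch)."""

    def __init__(self, env, opponent_policy):
        super().__init__(env)
        self.opponent = opponent_policy
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, action):
        opp_action, _ = self.opponent.predict(self._last_obs, deterministic=True)
        obs, reward, terminated, truncated, info = self.env.step((opp_action, action))
        self._last_obs = obs
        return obs, reward, terminated, truncated, info

# frozen_agent = ...  # best fixed agent from league training / PSRO / FSP / IPPO
# env = FixedOpponentWrapper(gym.make("FightLadder/StreetFighterII-2P-v0"), frozen_agent)
# exploiter = PPO("CnnPolicy", env)
# exploiter.learn(total_timesteps=10_000_000)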

Human Exploitation

In addition to exploiting the learned agents with RL algorithms, we also attempt to exploit them with human players under the same control frequency. Videos of learned agents (left, Ryu, in grey) versus humans (right, Ryu, in white) are shown below.

In general, our human players adopt a defensive counterattack strategy instead of a fully aggressive one, which proves effective in exploiting the learned agents across methods.

Nevertheless, the learned agents still show some robustness to human players (e.g., defending when a human player attacks) during human evaluations, but simple strategies (e.g., a defensive posture combined with well-timed low kicks) can still defeat them rather consistently.


ppo.mp4

PPO (Single-Agent Baseline) vs Human

ippo.mp4

IPPO vs Human

2timescale.mp4

Two-timescale IPPO vs Human

league.mp4

League Training vs Human

psro.mp4

PSRO vs Human

fsp.mp4

FSP vs Human

Head-to-Head Evaluation 

We further evaluate the trained MARL agents in head-to-head battles against each other, as well as against CPUs (built-in game players). The recorded videos are shown below, with Ryu as the character on both sides.

IPPO vs CPU

Two-timescale IPPO vs CPU

League Training vs CPU

PSRO vs CPU

FSP vs CPU

PSRO vs IPPO

League Training vs PSRO

Two-timescale IPPO vs PSRO

FSP vs League Training

Elo Rating


Elo rating is commonly used as a ranking system for multi-player games. We assess Elo ratings for the trained agents in our experiments to evaluate the learning process and agent performance across the five algorithms: league training, PSRO, FSP, IPPO, and two-timescale IPPO.
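For reference, the standard Elo update underlying these ratings is sketched below; the K-factor and initial ratings used in our experiments may differ from the illustrative values here.

def expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a, r_b, score_a, k=32):
    """Update both ratings after one game; score_a is 1 (A wins), 0.5 (draw), or 0 (A loses)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: a 1200-rated agent beats a 1400-rated agent.
print(update_elo(1200, 1400, score_a=1.0))  # -> approximately (1224.3, 1375.7)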




The overall Elo ratings for a mixed population of agents (top 5%) from the five algorithms are shown on the left. The horizontal axis is the Elo score, and the vertical axis is the number of agents at each Elo score for each algorithm. League training achieves higher Elo ratings on average, especially for left-side agents. However, we also observe that the agents on the two sides exhibit asymmetric strength in terms of Elo.


The Elo ratings within each algorithm population are shown below.

For each algorithm, the left figure shows the Elo rating versus the win rate of each policy in evaluation, the middle figure shows the learning progress, with each point indicating a policy and its corresponding Elo, and the right figure depicts the Elo distribution for each type of agent.


The Elo rating for the population of agents trained with the league training algorithm:

The Elo rating for the population of agents trained with the PSRO algorithm:

The Elo rating for the population of agents trained with the FSP algorithm:

The Elo rating for the population of agents trained with the IPPO algorithm:

The Elo rating for the population of agents trained with the two-timescale IPPO algorithm:

fsp_(left)_vs_fsp_.mp4

FSP vs FSP

An interesting observation from the Elo ratings over the entire training process is that the agents learned on the left and right sides can have asymmetric performance. This is further verified in our experiments by head-to-head tests between agents on the two sides trained with the same MARL algorithm, for example, the FSP vs FSP video above.

Citation

@misc{li2024fightladder,
      title={FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning},
      author={Wenzhe Li and Zihan Ding and Seth Karten and Chi Jin},
      year={2024},
      eprint={2406.02081},
      archivePrefix={arXiv},
      primaryClass={cs.MA}
}