Towards a Standardised Performance Evaluation Protocol for Cooperative MARL

Our call to the community:

Figure: Performance of QMIX on some SMAC maps as reported in different papers (2M timesteps).

Multi-agent reinforcement learning (MARL) has emerged as a useful approach to solving decentralised decision-making problems at scale. Research in the field has been growing steadily, with many breakthrough algorithms proposed in recent years. In this work, we take a closer look at this rapid development with a focus on the evaluation methodologies employed across a large body of research in cooperative MARL. By conducting a detailed meta-analysis of prior work, spanning 75 papers accepted for publication from 2016 to 2022, we bring to light worrying trends that call into question the true rate of progress. We further consider these trends in a wider context and take inspiration from single-agent RL literature on similar issues, whose recommendations remain applicable to MARL. Combining these recommendations with novel insights from our analysis, we propose a standardised performance evaluation protocol for cooperative MARL. We argue that such a standard protocol, if widely adopted, would greatly improve the validity and credibility of future research, make replication and reproducibility easier, and improve the field's ability to accurately gauge the rate of progress over time by enabling sound comparisons across different works.

Our contributions

Lessons, trends and recommendations

Lesson 1: Know the true source of improvement and report everything

There is considerable variance in the results reported in RL papers, and similar inconsistencies have been found in MARL. We recommend greater rigour in reporting experimental details, open-sourcing code and performing ablation studies.

Lesson 2: Use standardised statistical tooling for estimating and reporting uncertainty

There is a lack of shared standards for measuring the uncertainty of results in MARL, which makes direct comparisons between papers difficult. We make recommendations for standardising statistical tooling and producing more detailed performance reports.

Lesson 3: Guard against environment misuse and overfitting

SMAC has become the most popular benchmark for cooperative MARL. Through the lens of historical trends, we show that it is possible to cherry-pick results by subsampling scenarios to produce the illusion of algorithmic progress.

Centralised dataset for MARL experiments

We manually annotated the MARL evaluation methodologies found in research papers published between 2016 and 2022 at various conferences, including NeurIPS, ICML, AAMAS and ICLR, with a focus on deep cooperative MARL. In total, we collected data from 75 cooperative MARL papers accepted for publication. We believe this dataset is the first of its kind, and we have made it publicly available for further analysis.

We welcome contributions from the MARL research community to grow this dataset and move towards a shared consensus on MARL evaluation. Researchers can add their paper's experimental results using this template.

Analysis notebook

We release our analysis Colab notebook, which shows how to generate a number of the analyses in our paper and can serve as a starting point for further analysis.

Figure: Performance of COMA, QMIX and IA2C in three different SMAC scenarios.

Figure: Historical performance of IQL on different SMAC maps across papers.

Figure: Win rate (%) per training timestep on the 3s5z scenario.

Figure: Reporting of the evaluation interval used.

A standardised performance evaluation protocol for MARL

We provide a standardised performance evaluation protocol for cooperative MARL. We are realistic in our efforts, knowing that a single protocol is unlikely to be applicable to all MARL research. However, echoing recent work on evaluation, we stress that many of the issues highlighted previously stem from a lack of standardisation. Therefore, we believe a default "off-the-shelf" protocol that is able to capture most settings could provide great value to the community. If widely adopted, such a standardised protocol would make comparisons across different works easier and more accurate, and remove some of the noise in the signal regarding the true rate of progress in MARL research. A summarised version of our protocol is given in the blue box below. More details can be found in the paper.
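As a rough illustration of what adopting such a protocol involves in practice, the sketch below shows a hypothetical experiment configuration recording the evaluation parameters that a protocol of this kind needs to fix in advance. The field names and values here are placeholders for illustration only; the actual recommended settings are those given in the protocol summary and the paper.

```python
# Hypothetical evaluation configuration (placeholder values only); consult
# the protocol summary and the paper for the actual recommended settings.
evaluation_protocol = {
    "environment": "SMAC",
    "scenarios": ["3s5z", "2c_vs_64zg", "corridor"],  # fixed before any runs
    "independent_runs": 10,            # number of random seeds per algorithm
    "training_timesteps": 2_000_000,   # identical budget for all algorithms
    "evaluation_interval": 10_000,     # timesteps between evaluation phases
    "evaluation_episodes": 32,         # episodes per evaluation phase
    "metrics": ["episode_return", "win_rate"],
    "aggregation": "IQM with 95% stratified bootstrap CIs",  # via rliable
}
```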

Tools for evaluation

We believe that if the MARL community agrees on an exact format in which raw data should be stored and processed, it would be easier for researchers to benchmark the quality of their algorithms against other baselines.

As such, we will open-source a proposed data structure for storing raw experiment data, as well as processing tools, on our GitHub repo along with our proposed evaluation protocol. Practitioners will be able to use our tools to process raw MARL experiment data for downstream use with the tools provided by rliable. Please see the code snippets below for an example of how our tools may be utilised:
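As a first illustration, the snippet below sketches how raw results could be flattened into the per-algorithm score matrices that rliable expects. The file name, JSON layout and helper function shown here are assumptions made for this sketch rather than the repository's actual interface; the authoritative format and processing utilities are documented on our GitHub repo.

```python
import json

import numpy as np

# Assumed layout of the raw results file (illustrative only):
# {
#   "<task name>": {
#     "<algorithm name>": {
#       "run_0": {"absolute_metrics": {"win_rate": [0.81]}},
#       "run_1": {...},
#       ...
#     },
#     ...
#   }
# }
with open("raw_experiment_data.json") as f:
    raw = json.load(f)

tasks = sorted(raw.keys())
algorithms = sorted(next(iter(raw.values())).keys())


def final_score(run_data: dict, metric: str = "win_rate") -> float:
    """Return the final (absolute) value of a metric for a single run."""
    return float(np.mean(run_data["absolute_metrics"][metric]))


# rliable expects a dictionary mapping each algorithm to a score matrix
# of shape (num_runs, num_tasks).
score_dict = {
    algo: np.array(
        [
            [final_score(raw[task][algo][run]) for task in tasks]
            for run in sorted(raw[tasks[0]][algo].keys())
        ]
    )
    for algo in algorithms
}
```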




Raw experiment data can be processed and prepared for use with the tools released by rliable. A detailed example of the expected format of the raw JSON data can be found in our GitHub repo.
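Continuing the sketch above, the processed score_dict can then be passed to rliable's interval-estimation and plotting utilities. This example calls rliable's public API directly; our repository provides modified wrappers around the same tools.

```python
import numpy as np
from rliable import library as rly
from rliable import metrics, plot_utils


def aggregate_func(scores: np.ndarray) -> np.ndarray:
    """Map a (runs x tasks) score matrix to the chosen aggregate metrics."""
    return np.array(
        [
            metrics.aggregate_median(scores),
            metrics.aggregate_iqm(scores),
            metrics.aggregate_mean(scores),
            metrics.aggregate_optimality_gap(scores),
        ]
    )


# Point estimates with 95% stratified bootstrap confidence intervals.
aggregate_scores, aggregate_score_cis = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=50000
)

# Plot the point and interval estimates for every algorithm.
fig, axes = plot_utils.plot_interval_estimates(
    aggregate_scores,
    aggregate_score_cis,
    metric_names=["Median", "IQM", "Mean", "Optimality Gap"],
    algorithms=list(score_dict.keys()),
    xlabel="Normalised episode return",
)
```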




After the data has been processed and cast into the correct format, it can be passed directly to modified versions of the tools released by rliable.

Reporting template




We suggest example reporting templates that can be used to summarise the important information required for evaluating algorithms. We provide the LaTeX code for these templates in our repository.


Feedback:

Raising the standard of evaluation in MARL will require the effort of the community as a whole; therefore, we would appreciate any suggestions and ideas for improvement. Please use the button below to submit additional suggestions via the feedback form: