Safe Multi-Agent Reinforcement Learning for Multi-Robot Control

Algorithm Code | Safe MAMuJoCo Code | Safe MARobosuite Code | Safe MAIG Code

Abstract: Research on robot control has a long tradition. A challenging problem arising in this domain is how to control multiple robots safely in real-world applications. To our knowledge, no study has considered multi-robot control from the perspective of safe Multi-Agent reinforcement learning (MARL). To fill this gap, in this study, we investigate safe MARL for multi-robot control on {cooperative tasks}, in which each individual robot has to not only meet its own safety constraints while maximising their reward, but also consider those of others to guarantee safe team behaviours. First, we formulate the safe MARL problem as a constrained Markov game and employ policy Optimisation to solve it theoretically. The proposed algorithm guarantees monotonic improvement in reward and satisfaction of safety constraints at every iteration. Second, as approximations to the theoretical solution, we propose two safe multi-agent policy gradient methods: Multi-Agent Constrained Policy Optimisation (MACPO) and MAPPO-Lagrangian. Third, we develop the first three safe MARL benchmarks---Safe Multi-Agent MuJoCo (Safe MAMuJoCo), Safe Multi-Agent Robosuite (Safe MARobosuite) and Safe Multi-Agent Isaac Gym (Safe MAIG) to expand the toolkit of MARL and robot control research communities. Finally, experimental results on the three safe MARL benchmarks indicate that our methods can achieve state-of-the-art performance in the balance between improving reward and satisfying safety constraints compared with strong baselines.

Ant Task: the width of the corridor set by two walls is 10 m. The environment emits the cost of 1 for an agent, if the distance between the robot and the wall is less than 1.8 m, or when the robot topples over.

A demo denotes unsafe performance using HAPPO on Ant-2x4 task

A demo denotes safe performance using MAPPO-Lagrangian on Ant-2x4 task

HalfCheetah Task: In the task, the agents move inside a corridor (which constraints their movement, but does not induce costs). Together with them, there are bombs moving inside the corridor. If an agent finds itself too close to the bomb, the distance between an agent and the bomb is less than 9m, a cost of 1 will be emitted, at the same time, the bomb will turn blood red.

A demo denotes unsafe performance using HAPPO on HalfCheetah-2x3 task

A demo denotes safe performance using MAPPO-Lagrangian on HalfCheetah-2x3 task

ManyAgent Ant Task One: In the ManyAgent Ant task, the width of the corridor set by two walls is 9m. The environment emits the cost of 1 for an agent, if the distance between the robot and the wall is less than 1.8 m, or when the robot topples over.

A demo denotes unsafe performance using HAPPO on ManyAgent Ant-2x3 task

A demo denotes safe performance using MAPPO-Lagrangian on ManyAgent Ant-2x3 task

ManyAgent Ant Task Two: In the ManyAgent Ant task, the width of the corridor is 12 m; its walls fold at the angle of 30 degrees. The environment emits the cost of 1 for an agent, if the distance between the robot and the wall is less than 1.8 m, or when the robot topples over.

A demo denotes unsafe performance using HAPPO on ManyAgent Ant-2x3 task

A demo denotes safe performance using MAPPO-Lagrangian on ManyAgent Ant-2x3 task

A demo denotes safe performance using MACPO (S-bound=10) on ManyAgent Ant-2x3 task

TwoArmPegInHole Task: Robots learn to achieve PegInHole task, specifically, agents need to learn to cooperate fully insert the peg into the hole, when the peg touches the red areas or the distance between the peg and the red areas is less than a certain distance when the peg isn't placed in the right location or angle of the hole, it will cause a cost of 1.

A demo denotes unsafe performance using HAPPO on Safe MARobosuite-TwoArmPegInHole-2x7 task

A demo denotes safe performance using MAPPO-Lagrangian on Safe MARobosuite-TwoArmPegInHole-2x7 task

ShadowHandOver Task: The environment involves two hands at fixed positions. The first hand with an object must find a way to hand the item over to the second hand, while one finger on the first hand has safety constraints over the range of motion of one of the fingers.

A demo denotes unsafe performance using MAPPO on Safe MAIG-ShadowHandOver-2x6 task

A demo denotes safe performance using MAPPO-Lagrangian on Safe MAIG-ShadowHandOver-2x6 task