Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark

Abstract

Artificial intelligence (AI) systems possess significant potential to drive societal progress, yet their deployment often faces obstacles due to substantial safety concerns. Safe reinforcement learning (SafeRL) emerges as a solution for optimizing policies while simultaneously adhering to multiple constraints, thereby addressing the challenge of applying reinforcement learning in safety-critical scenarios. In this paper, we present an environment suite called Safety-Gymnasium, which encompasses safety-critical tasks in both single- and multi-agent scenarios and accepts both vector and vision-only inputs. Additionally, we offer a library of algorithms named Safe Policy Optimization (SafePO), comprising 16 state-of-the-art SafeRL algorithms. This comprehensive library can serve as a validation tool for the research community. By introducing this benchmark, we aim to facilitate the evaluation and comparison of safety performance, thus fostering the development of reinforcement learning for safer, more reliable, and responsible real-world applications.

Demos: Safe Navigation, Safe Velocity, Safe Vision, Safe Isaac Gym, Safe Multi-Agent, and Safe RGB/RGBD.

Supported Robots

Safety-Gymnasium inherits three pre-existing agents from Safety-Gym [1], namely Point, Car, and Doggo. Through careful adjustment of the model parameters, we have mitigated the excessive oscillation exhibited by the Point and Car agents at runtime. For a comprehensive overview of the modifications we have made to Safety-Gym, please refer to Appendix A. Building upon this foundation, we have introduced two additional robots, Racecar and Ant, to enrich the repertoire of single-agent scenarios.

As for multi-agent robots, we have leveraged configurations from Multi-Agent MuJoCo, decomposing the original single-agent body so that multiple agents control distinct body segments.
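To make the agent selection concrete, here is a minimal interaction sketch, assuming the safety_gymnasium package with its Gymnasium-style make API, in which step additionally returns a per-step cost:

```python
# Minimal sketch: the agent (Point, Car, Doggo, Racecar, Ant) is selected via
# the environment ID. Assumes the Gymnasium-style API exposed by the
# safety_gymnasium package, where step() also returns a cost signal.
import safety_gymnasium

env = safety_gymnasium.make("SafetyRacecarGoal1-v0")
obs, info = env.reset(seed=0)
for _ in range(10):
    action = env.action_space.sample()
    obs, reward, cost, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```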


Supported Tasks

(a) Velocity: the robot aims to achieve coordinated leg movement in the forward (right) direction by exerting torques on the hinges.

(b) Run: the robot starts with a random initial direction and a specified initial speed, and must reach the opposite side of the map.

(c) Circle: the reward is maximized by moving along the green circle, but the agent is not allowed to go outside the red region, so its optimal constrained path follows the line segments AD and BC.

(d) Goal: the robot navigates to a series of goal positions. After the robot successfully reaches a goal, the goal location is randomly reset while the rest of the layout is kept unchanged.

(e) Button: the objective is to activate a series of goal buttons distributed throughout the environment. The agent's goal is to navigate towards and make contact with the currently highlighted button, known as the goal button.

(f) Push: the objective is to move a box to a series of goal positions. Like the Goal task, a new random goal location is generated after each successful achievement.
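The navigation-style tasks above (Goal, Button, Push, Circle) combine with an agent name and a difficulty level to form an environment ID. The sketch below only illustrates the Safety{Agent}{Task}{Level}-v0 naming pattern used in the listings later in this document:

```python
# Sketch of the environment-ID naming pattern for navigation tasks.
agents = ["Point", "Car", "Doggo", "Racecar", "Ant"]
tasks = ["Goal", "Button", "Push", "Circle"]
levels = [1, 2]

env_ids = [f"Safety{a}{t}{lvl}-v0" for a in agents for t in tasks for lvl in levels]
print(env_ids[:2])  # ['SafetyPointGoal1-v0', 'SafetyPointGoal2-v0']
```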

Supported Constraints

Velocity-Constraint: consists of a series of safety tasks based on MuJoCo agents. In these tasks, agents such as Ant, HalfCheetah, and Humanoid are trained to move faster for higher rewards while a velocity constraint is imposed for safety (see the sketch after this list).

Pillars: are employed to represent large cylindrical obstacles within the environment. In the general setting, contact with a pillar incurs costs.

Hazards: are utilized to model areas within the environment that pose a risk, resulting in costs when an agent enters such areas.

Sigwalls: are designed specifically for Circle tasks. They serve as visual representations of two or four solid walls, which limit the circular area to a smaller region. Crossing the wall from inside the safe area to the outside incurs costs.

Vases: are specifically designed for Goal tasks. They represent static and fragile objects within the environment. Touching or displacing these objects incurs costs for the agent.

Gremlins: are specifically employed in the Button tasks. They represent moving objects within the environment that can interact with the agent; contact with a gremlin incurs costs.
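As a concrete illustration of how the velocity constraint translates into a per-step cost, here is a minimal sketch; the threshold value is hypothetical, and the exact cost definition used by each environment may differ:

```python
import numpy as np

def velocity_cost(velocity: np.ndarray, threshold: float) -> float:
    # Return a cost of 1.0 when the agent's speed exceeds the threshold, else 0.0.
    # The threshold here is a hypothetical per-task limit; concrete values are
    # defined per environment.
    speed = float(np.linalg.norm(velocity))
    return 1.0 if speed > threshold else 0.0

# Example: an agent moving at 3.2 m/s under a hypothetical 2.8 m/s limit.
print(velocity_cost(np.array([3.2, 0.0]), threshold=2.8))  # 1.0
```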


Vision-only tasks

While the initial iteration of Safety-Gym offered rudimentary visual input support, there is room for enhancing the realism of its environment. To effectively evaluate vision-based safe reinforcement learning algorithms, we have devised a more realistic visual environment built on MuJoCo. This enhanced environment supports both RGB and RGB-D inputs.
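A minimal sketch of reading visual observations is shown below; it assumes the vision variants follow a "...Vision-v0" naming convention and expose image observations through the observation space, so the exact keys and shapes should be checked against the package documentation:

```python
import safety_gymnasium

# Assumption: vision variants use the "...Vision-v0" suffix and provide RGB
# (and optionally depth) arrays in the observation.
env = safety_gymnasium.make("SafetyPointGoal1Vision-v0")
obs, info = env.reset(seed=0)
print(env.observation_space)  # inspect the image observation layout
obs, reward, cost, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```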

DexterousHands-based Learning Environments

Safety-DexterousHands is a novel collection of learning environments built upon DexterousHands and the Isaac Gym simulation engine. Leveraging GPU capabilities, Safety-DexterousHands enables large-scale parallel sample collection, significantly accelerating the training process. The environments support both single-agent and multi-agent settings. Additionally, to complement the multi-agent tasks, we provide a set of safe multi-agent reinforcement learning algorithms for quick experimental evaluation.

These environments involve two robotic hands (see (a) and (b)). In each episode, a ball randomly descends near the right hand, and positioning it correctly requires coordinated effort from both hands. Because the target is out of the right hand's reach and a direct hand-over to the left hand is infeasible, a viable solution is for the right hand to grasp and launch the ball towards the left hand, which then catches it and places it at the target location. As shown in (c), (d), and (e), these tasks also support treating the degrees of freedom of the joints and fingers as safety constraints.
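To illustrate what GPU-parallel sample collection looks like in practice, the hypothetical sketch below uses a stand-in batched environment; it is not the actual Safety-DexterousHands API and only demonstrates the (num_envs, ...) tensor layout and the per-step cost signal:

```python
import torch


class DummyVecHandEnv:
    # Stand-in for an Isaac Gym style vectorized environment: all quantities are
    # batched tensors with one row per parallel environment copy.
    def __init__(self, num_envs: int, obs_dim: int = 64, act_dim: int = 26, device: str = "cpu"):
        self.num_envs, self.obs_dim, self.act_dim, self.device = num_envs, obs_dim, act_dim, device

    def reset(self) -> torch.Tensor:
        return torch.zeros((self.num_envs, self.obs_dim), device=self.device)

    def step(self, actions: torch.Tensor):
        obs = torch.randn((self.num_envs, self.obs_dim), device=self.device)
        reward = torch.randn(self.num_envs, device=self.device)
        # A per-environment cost, e.g. for violating joint/finger DoF limits.
        cost = (torch.rand(self.num_envs, device=self.device) < 0.05).float()
        done = torch.zeros(self.num_envs, dtype=torch.bool, device=self.device)
        return obs, reward, cost, done, {}


env = DummyVecHandEnv(num_envs=2048)  # thousands of copies stepped in lock-step
obs = env.reset()
for _ in range(16):
    actions = 2.0 * torch.rand((env.num_envs, env.act_dim), device=env.device) - 1.0
    obs, reward, cost, done, info = env.step(actions)
```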

Experimental Results

Safe Navigation

Results are reported for the following environments, grouped by agent:

Ant: SafetyAntButton1-v0, SafetyAntButton2-v0, SafetyAntCircle1-v0, SafetyAntCircle2-v0, SafetyAntGoal1-v0, SafetyAntGoal2-v0, SafetyAntPush1-v0, SafetyAntPush2-v0

Car: SafetyCarButton1-v0, SafetyCarButton2-v0, SafetyCarCircle1-v0, SafetyCarCircle2-v0, SafetyCarGoal1-v0, SafetyCarGoal2-v0, SafetyCarPush1-v0, SafetyCarPush2-v0

Doggo: SafetyDoggoButton1-v0, SafetyDoggoButton2-v0, SafetyDoggoCircle1-v0, SafetyDoggoCircle2-v0, SafetyDoggoGoal1-v0, SafetyDoggoGoal2-v0, SafetyDoggoPush1-v0, SafetyDoggoPush2-v0

Point: SafetyPointButton1-v0, SafetyPointButton2-v0, SafetyPointCircle1-v0, SafetyPointCircle2-v0, SafetyPointGoal1-v0, SafetyPointGoal2-v0, SafetyPointPush1-v0, SafetyPointPush2-v0

Racecar: SafetyRacecarButton1-v0, SafetyRacecarButton2-v0, SafetyRacecarCircle1-v0, SafetyRacecarCircle2-v0, SafetyRacecarGoal1-v0, SafetyRacecarGoal2-v0, SafetyRacecarPush1-v0, SafetyRacecarPush2-v0

Safe Velocity

SafetyAntVelocity-v1, SafetyHalfCheetahVelocity-v1, SafetyHopperVelocity-v1, SafetyHumanoidVelocity-v1, SafetySwimmerVelocity-v1, SafetyWalker2dVelocity-v1

Safe Multi-agent Velocity

Safety2x4AntVelocity-v0, Safety4x2AntVelocity-v0, Safety2x3HalfCheetahVelocity-v0, Safety6x1HalfCheetahVelocity-v0, Safety3x1HopperVelocity-v0, Safety9|8HumanoidVelocity-v0, Safety2x1SwimmerVelocity-v0, Safety2x3Walker2dVelocity-v0

Constraints on the Physical Machine: DexterousHands