Deliver items without exceeding the carrying load
Collect items without walking on lava
Detonate barrels without harming neutral units
Collect health vials without obtaining decoy items
Eliminate enemies without neutral unit casualties
Descend the cave without incurring fall damage
HASARD incorporates two action spaces. The original action space includes 14 discrete and 2 continuous actions, offering fine-grained control while introducing greater learning complexity. The simplified action space varies across environments, discretizing continuous inputs and retaining only the discrete actions essential for solving the task. This streamlined version facilitates faster learning and enables tighter experimental loops.
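As a rough illustration (not the actual HASARD interface), the two action spaces could be expressed with gymnasium spaces as follows; the exact button layout and the discretization used in the simplified space are assumptions.

```python
import numpy as np
from gymnasium import spaces

# Full action space: 14 binary buttons plus 2 continuous deltas (e.g. turning/looking).
full_action_space = spaces.Dict({
    "buttons": spaces.MultiDiscrete([2] * 14),
    "deltas": spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32),
})

# Simplified action space (hypothetical example): the continuous deltas are
# discretized into a handful of turn rates and only the task-relevant buttons remain.
simplified_action_space = spaces.MultiDiscrete([2, 2, 2, 5])  # e.g. forward, strafe left/right, turn-rate bin
```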
While the original action space offers the potential for more efficient task completion, current methods such as PPOLag fail to capitalize on this advantage, even at Level 1 of HASARD. Although PPOLag reliably satisfies the safety budgets, it obtains lower rewards with the full action space than with the simplified one.
We present the evaluation results of PPO and five popular safety extensions of it on the Level 1 tasks of HASARD. PPO maximizes the reward irrespective of associated costs, and its unconstrained behavior sets an upper bound for both reward and cost in all environments. The PID Lagrangian method PPOPID and the penalty-based P3O obtain the highest rewards while adhering to the default cost budget. PPOCost treats costs as negative rewards, occasionally yielding reasonable outcomes but without any guarantee of satisfying the safety constraints. PPOLag closely meets the safety thresholds with some fluctuations, yet frequently yields lower rewards. PPOSauté performs inconsistently across tasks, often failing to satisfy the cost threshold. The human expert baseline currently surpasses all methods.
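The sketch below illustrates, under simplifying assumptions, how two of these methods incorporate cost: PPOCost folds it into the reward as a fixed penalty, while PPOLag performs dual ascent on a Lagrange multiplier against the cost budget. The function names and learning rates are placeholders, not the benchmark implementation.

```python
def ppo_cost_shaping(reward: float, cost: float, cost_scale: float = 1.0) -> float:
    """PPOCost-style shaping: fold the cost into the reward as a fixed penalty.

    This offers no guarantee of staying within the safety budget.
    """
    return reward - cost_scale * cost

def lagrangian_update(lmbda: float, episode_cost: float, cost_limit: float, lr: float = 0.01) -> float:
    """PPOLag-style dual ascent: raise the multiplier when the budget is exceeded, lower it otherwise."""
    return max(0.0, lmbda + lr * (episode_cost - cost_limit))

# PPOLag then roughly optimizes reward - lmbda * cost, updating lmbda between
# iterations; PPOPID replaces this plain gradient step with a PID controller.
```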
To assess how an agent learns to solve a task, HASARD facilitates spatial tracking that aggregates the agent’s visited locations across a window of the most recent episodes. We overlay these data as a heatmap on the 2D environment map, visually representing the agent’s movement patterns and exploration strategies. Juxtaposing these evolving patterns with the training curves reveals how movement correlates with performance.
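A minimal sketch of this aggregation, assuming episodes are logged as arrays of (x, y) coordinates; the window size and bin count are illustrative.

```python
from collections import deque

import numpy as np

# Positions from the most recent episodes (the window size of 100 is an assumption).
recent_episodes = deque(maxlen=100)

def add_episode(positions: np.ndarray) -> None:
    """Store one episode's (N, 2) array of visited (x, y) coordinates."""
    recent_episodes.append(positions)

def visitation_heatmap(map_bounds, bins: int = 64) -> np.ndarray:
    """Aggregate stored positions into a normalized 2D visitation histogram.

    `map_bounds` is [[x_min, x_max], [y_min, y_max]] of the 2D environment map.
    """
    xy = np.concatenate(list(recent_episodes), axis=0)
    heat, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins, range=map_bounds)
    return heat / max(heat.max(), 1.0)  # normalize for overlaying on the map
```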
Having obtained an initial high-reward, high-cost policy, PPOPID stops moving entirely to avoid penalties, and then gradually refines its strategy while staying within the safety budget.
The PPO agent maximizes reward by running in circles, freely walking over lava since it does not account for the incurred cost.
The PPO agent struggles early on, alternating between very little noticeable movement and wall collisions. Once it finds a good strategy, it quickly exploits it.
The PPOPID agent initially focuses on the central section of the area, before learning to utilize the pathway at the bottom.
The PPOPID agent learns to follow the winding path down the cave to mitigate fall damage.
To analyze the visual complexity of HASARD, we leverage privileged information about the agent's observations that is not accessible under the normal training regime. We create simplified representations through two separate strategies: (1) segmenting the observation and (2) including depth information.
A successful policy in Detonator's Dilemma requires accurately assessing spatial relationships between objects and entities, which can be difficult to infer from raw observations.
Each pixel in the observation is labeled according to the item, unit, wall, or surface it represents. We assign each pixel a predefined color based on its category, effectively segmenting the scene.
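A minimal sketch of this colorization, assuming a per-pixel label map; the category IDs and palette are placeholders rather than HASARD's actual mapping.

```python
import numpy as np

# Hypothetical category -> RGB palette; HASARD's actual categories and colors differ.
PALETTE = {
    0: (0, 0, 0),        # wall
    1: (128, 128, 128),  # floor / surface
    2: (0, 255, 0),      # item
    3: (255, 0, 0),      # unit
}

def segment(label_map: np.ndarray) -> np.ndarray:
    """Replace each pixel's category label with its predefined color."""
    segmented = np.zeros((*label_map.shape, 3), dtype=np.uint8)
    for category, color in PALETTE.items():
        segmented[label_map == category] = color
    return segmented
```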
The depth buffer assigns each pixel a value from 0 to 255, where 0 (black) represents the closest points and 255 (white) indicates the farthest. Intermediate values correspond to relative distances. This feature enables the agent to directly perceive spatial relationships and the proximity of walls, surfaces, objects, and entities.
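A minimal sketch of appending this depth buffer as an additional observation channel, as done in the Precipice Plunge experiment below; the array shapes are assumptions.

```python
import numpy as np

def add_depth_channel(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stack an (H, W, 3) RGB frame and an (H, W) depth buffer into an (H, W, 4) observation.

    Depth values of 0 (black) mark the closest points and 255 (white) the farthest.
    """
    return np.concatenate([rgb, depth[..., None]], axis=-1).astype(np.uint8)
```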
In Level 2 of Armament Burden, segmented observations help PPOPID better distinguish obstacles and obtainable items while eliminating noisy textures, leading to a noticeable performance boost.
Including the depth buffer as a 4th channel of the observation enables PPOPID to better discern how far away each block is in Level 2 of Precipice Plunge. This improves the agent's judgment in selecting which block to leap towards, resulting in higher rewards.
Each environment has a default cost budget selected to provide a moderate challenge while allowing some margin for error. Adjusting the safety bounds directly affects rewards, as seen in the results of PPOLag. Stricter cost limits make reward acquisition more difficult, showcasing the inherent reward-cost trade-offs.
Higher levels of HASARD environments not only adjust parameters to increase difficulty but also introduce new mechanics. Because of this added complexity, the agent often struggles to learn the hardest level directly. To investigate whether learning easier tasks first leads to a better policy, we evenly divide the training budget across difficulty levels and train the agent sequentially. We find that leveraging the implicit curriculum provided by the difficulty levels eases exploration challenges, prevents overly cautious behavior, and enables the agent to acquire skills it wouldn't otherwise learn with direct training.
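A minimal sketch of this sequential schedule; `make_env` and `train` are hypothetical placeholders for the environment constructor and the safe RL trainer, and the total budget is illustrative.

```python
# `make_env` and `train` are hypothetical placeholders; only the difficulty
# level changes between curriculum stages.
TOTAL_TIMESTEPS = 300_000_000          # illustrative overall training budget
LEVELS = (1, 2, 3)
timesteps_per_level = TOTAL_TIMESTEPS // len(LEVELS)

policy = None
for level in LEVELS:
    env = make_env("CollateralDamage", level=level)
    # The policy learned on the previous, easier level is carried over.
    policy = train(env, policy=policy, timesteps=timesteps_per_level)
```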
Training for an equal number of timesteps leads to a nearly threefold reward increase in Remedy Rush and a 33% improvement in Collateral Damage, while costs remain similar. This demonstrates the potential of transferring knowledge across increasing levels of difficulty.
Level 3 poses an exploration challenge, making it difficult for the agent to discover the night vision goggles that grant permanent visibility. As a result, the agent only moves during the brief periods of light.
The curriculum-trained agent has learned to seek out the night vision goggles on an easier level and leverages this skill to perform better in the Level 3 task of Remedy Rush.
Training PPOLag directly on Level 3 of Collateral Damage leads to overly conservative behavior, preventing the agent from taking calculated risks and reaching high rewards.
Splitting the training budget evenly across all difficulty levels and training sequentially with increasing difficulty allows the agent to gradually adapt, leading to higher rewards.