Deliver items without exceeding the carrying load
Collect items without walking on lava
Detonate barrels without harming neutral units
Collect health vials without obtaining decoy items
Eliminate enemies without neutral unit casualties
Descend the cave without incurring fall damage
HASARD incorporates two action spaces. The original action space includes 14 discrete and 2 continuous actions, offering fine-grained control while introducing greater learning complexity. The simplified action space varies across environments, discretizing continuous inputs and retaining only the discrete actions essential for solving the task. This streamlined version facilitates faster learning and enables tighter experimental loops.
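As a rough illustration (not the actual HASARD interface), the two action spaces could be expressed with gymnasium spaces as follows; the exact button layout and the discretization used in the simplified space are assumptions.

```python
import numpy as np
from gymnasium import spaces

# Full action space: 14 binary buttons plus 2 continuous deltas (e.g. turning/looking).
full_action_space = spaces.Dict({
    "buttons": spaces.MultiDiscrete([2] * 14),
    "deltas": spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32),
})

# Simplified action space (hypothetical example): the continuous deltas are
# discretized into a handful of turn rates and only the task-relevant buttons remain.
simplified_action_space = spaces.MultiDiscrete([2, 2, 2, 5])  # e.g. forward, strafe left/right, turn-rate bin
```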
While the original action space offers the potential for more efficient task completion, current methods such as PPOLag fail to capitalize on this advantage, even at Level 1 of HASARD. Although PPOLag reliably satisfies the safety budgets, it obtains lower rewards with the full action space than with the simplified one.
We present the evaluation results of PPO and five popular safety extensions of it on the Level 1 tasks of HASARD. PPO maximizes the reward irrespective of associated costs, and its unconstrained behavior sets an upper bound for both reward and cost in all environments. The PID Lagrangian method PPOPID and the penalty-based P3O obtain the highest rewards while adhering to the default cost budget. PPOCost treats costs as negative rewards, occasionally yielding reasonable outcomes but without any guarantee of satisfying the safety constraints. PPOLag closely meets the safety thresholds with some fluctuations, yet frequently yields lower rewards. PPOSauté performs inconsistently across tasks, often failing to satisfy the cost threshold. The human expert baseline currently surpasses all methods.
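The sketch below illustrates, under simplifying assumptions, how two of these methods incorporate cost: PPOCost folds it into the reward as a fixed penalty, while PPOLag performs dual ascent on a Lagrange multiplier against the cost budget. The function names and learning rates are placeholders, not the benchmark implementation.

```python
def ppo_cost_shaping(reward: float, cost: float, cost_scale: float = 1.0) -> float:
    """PPOCost-style shaping: fold the cost into the reward as a fixed penalty.

    This offers no guarantee of staying within the safety budget.
    """
    return reward - cost_scale * cost

def lagrangian_update(lmbda: float, episode_cost: float, cost_limit: float, lr: float = 0.01) -> float:
    """PPOLag-style dual ascent: raise the multiplier when the budget is exceeded, lower it otherwise."""
    return max(0.0, lmbda + lr * (episode_cost - cost_limit))

# PPOLag then roughly optimizes reward - lmbda * cost, updating lmbda between
# iterations; PPOPID replaces this plain gradient step with a PID controller.
```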
To assess how an agent learns to solve a task, HASARD facilitates spatial tracking that aggregates the agent’s visited locations across a window of the most recent episodes. We overlay these data as a heatmap on the 2D environment map, visually representing the agent’s movement patterns and exploration strategies. Juxtaposing these evolving patterns with the training curves reveals how movement correlates with performance.
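A minimal sketch of this aggregation, assuming episodes are logged as arrays of (x, y) coordinates; the window size and bin count are illustrative.

```python
from collections import deque

import numpy as np

# Positions from the most recent episodes (the window size of 100 is an assumption).
recent_episodes = deque(maxlen=100)

def add_episode(positions: np.ndarray) -> None:
    """Store one episode's (N, 2) array of visited (x, y) coordinates."""
    recent_episodes.append(positions)

def visitation_heatmap(map_bounds, bins: int = 64) -> np.ndarray:
    """Aggregate stored positions into a normalized 2D visitation histogram.

    `map_bounds` is [[x_min, x_max], [y_min, y_max]] of the 2D environment map.
    """
    xy = np.concatenate(list(recent_episodes), axis=0)
    heat, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins, range=map_bounds)
    return heat / max(heat.max(), 1.0)  # normalize for overlaying on the map
```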
Having obtained an initial high-reward, high-cost policy, PPOPID stops moving entirely to avoid penalties, and then gradually refines its strategy while staying within the safety budget.
The PPO agent maximizes reward by running in circles, freely walking over lava since it does not account for the incurred cost.
The PPO agent struggles early on, alternating between very little noticeable movement and wall collisions. Once it finds a good strategy, it quickly exploits it.
The PPOPID agent initially focuses on the central section of the area, before learning to utilize the pathway at the bottom.
The PPOPID agent learns to follow the winding path down the cave to mitigate fall damage.
To analyze the visual complexity of HASARD, we leverage privileged information about the agent's observations that is not accessible under the normal training regime. We create simplified representations through two separate strategies: (1) segmenting the observation and (2) including depth information.
A successful policy in Detonator's Dilemma requires accurately assessing spatial relationships between objects and entities, which can be difficult to infer from raw observations.
Each pixel in the observation is labeled according to the item, unit, wall, or surface it represents. We assign each pixel a predefined color based on its category, effectively segmenting the scene.
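A minimal sketch of this colorization, assuming a per-pixel label map; the category IDs and palette are placeholders rather than HASARD's actual mapping.

```python
import numpy as np

# Hypothetical category -> RGB palette; HASARD's actual categories and colors differ.
PALETTE = {
    0: (0, 0, 0),        # wall
    1: (128, 128, 128),  # floor / surface
    2: (0, 255, 0),      # item
    3: (255, 0, 0),      # unit
}

def segment(label_map: np.ndarray) -> np.ndarray:
    """Replace each pixel's category label with its predefined color."""
    segmented = np.zeros((*label_map.shape, 3), dtype=np.uint8)
    for category, color in PALETTE.items():
        segmented[label_map == category] = color
    return segmented
```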
The depth buffer assigns each pixel a value from 0 to 255, where 0 (black) represents the closest points and 255 (white) indicates the farthest. Intermediate values correspond to relative distances. This feature enables the agent to directly perceive spatial relationships and the proximity of walls, surfaces, objects, and entities.
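A minimal sketch of appending this depth buffer as an additional observation channel, as done in the Precipice Plunge experiment below; the array shapes are assumptions.

```python
import numpy as np

def add_depth_channel(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stack an (H, W, 3) RGB frame and an (H, W) depth buffer into an (H, W, 4) observation.

    Depth values of 0 (black) mark the closest points and 255 (white) the farthest.
    """
    return np.concatenate([rgb, depth[..., None]], axis=-1).astype(np.uint8)
```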
In Level 2 of Armament Burden, segmented observations help PPOPID better distinguish obstacles and obtainable items while eliminating noisy textures, leading to a noticeable performance boost.
Including the depth buffer as a 4th channel of the observation enables PPOPID to better discern how far away each block is in Level 2 of Precipice Plunge. This improves the agent's judgment in selecting which block to leap towards, resulting in higher rewards.
Each environment has a default cost budget selected to provide a moderate challenge while allowing some margin for error. Adjusting the safety bounds directly affects rewards, as seen in the results of PPOLag. Stricter cost limits make reward acquisition more difficult, showcasing the inherent reward-cost trade-offs.
Higher levels of HASARD environments not only adjust parameters to increase difficulty but also introduce new mechanics. Because of this added complexity, the agent often struggles to learn the hardest level directly. To investigate whether learning easier tasks first leads to a better policy, we evenly divide the training budget across difficulty levels and train the agent sequentially. We find that leveraging the implicit curriculum provided by the difficulty levels eases exploration challenges, prevents overly cautious behavior, and enables the agent to acquire skills it wouldn't otherwise learn with direct training.
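A minimal sketch of this sequential schedule; `make_env` and `train` are hypothetical placeholders for the environment constructor and the safe RL trainer, and the total budget is illustrative.

```python
# `make_env` and `train` are hypothetical placeholders; only the difficulty
# level changes between curriculum stages.
TOTAL_TIMESTEPS = 300_000_000          # illustrative overall training budget
LEVELS = (1, 2, 3)
timesteps_per_level = TOTAL_TIMESTEPS // len(LEVELS)

policy = None
for level in LEVELS:
    env = make_env("CollateralDamage", level=level)
    # The policy learned on the previous, easier level is carried over.
    policy = train(env, policy=policy, timesteps=timesteps_per_level)
```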
Training for an equal number of timesteps leads to a nearly threefold reward increase in Remedy Rush and a 33% improvement in Collateral Damage, while costs remain similar. This demonstrates the potential of transferring knowledge across increasing levels of difficulty.
Level 3 poses an exploration challenge, making it difficult for the agent to discover the night vision goggles that grant permanent visibility. As a result, the agent only moves during the brief periods of light.
The curriculum-trained agent has learned to seek out the night vision goggles on an easier level and leverages this skill to perform better in the Level 3 task of Remedy Rush.
Training PPOLag directly on Level 3 of Collateral Damage leads to overly conservative behavior, preventing the agent from taking calculated risks and reaching high rewards.
Splitting the training budget evenly across all difficulty levels and training sequentially with increasing difficulty allows the agent to gradually adapt, leading to higher rewards.