Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework that leverages Foundation Models (FMs) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, these skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average.
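To make the archive structure concrete, the minimal sketch below shows one way a skill node could be represented: an executable reward program plus edges to the lower-level skills it builds on. The SharpSkill class, its fields, and the state keys are illustrative assumptions for this sketch, not CODE-SHARP's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical sketch of a SHARP skill node: an executable reward program plus
# edges to the lower-level skills it builds on, forming a directed graph.
# Field names and state keys are illustrative assumptions, not the actual API.
@dataclass
class SharpSkill:
    name: str                                   # skill identifier
    description: str                            # FM-written natural-language intent
    reward_fn: Callable[[dict, dict], float]    # (prev_state, state) -> reward
    children: List["SharpSkill"] = field(default_factory=list)  # prerequisite skills

# Leaf skill: reward the agent for gathering a unit of wood.
collect_wood = SharpSkill(
    name="collect_wood",
    description="Collect one unit of wood from a tree.",
    reward_fn=lambda prev, cur: float(cur["wood"] > prev["wood"]),
)

# Higher-level skill that builds on the leaf skill in the graph.
craft_wood_pickaxe = SharpSkill(
    name="craft_wood_pickaxe",
    description="Craft a wooden pickaxe at a crafting table.",
    reward_fn=lambda prev, cur: float(cur["wood_pickaxe"] > prev["wood_pickaxe"]),
    children=[collect_wood],
)
```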
Below we present one of the skill archives discovered fully autonomously by CODE-SHARP and learned by the goal-conditioned agent. You can zoom into the image to get a better overview of how the SHARP skills hierarchically build on each other.
We rigorously evaluate whether the discovered SHARP skills faithfully encode the semantic intent of their natural-language descriptions and produce reward signals that effectively train the goal-conditioned agent to achieve the stated goal. To this end, we assess the zero-shot composability of the discovered SHARP skills into policies-in-code generated by an FM-based policy planner, following a methodology similar to MaestroMotif. The policy planner is provided with the benchmark task description, the environment source code, and the archive of discovered SHARP skills. The FM-based policy planner constructs the policies-in-code by defining sequential environment conditions and mapping them to existing SHARP skills, thereby decomposing the benchmark scenarios into a series of executable sub-goals for the agent. Our results show that the goal-conditioned agent trained exclusively on rewards generated by the discovered SHARP skills can solve extremely long-horizon sequences of goals, outperforming both task-specific and large-scale pretrained baselines. We provide example videos of the agent being instructed by the high-level policies-in-code to complete the benchmark scenarios.
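As a rough illustration of what such a policy-in-code might look like, a planner output can be thought of as an ordered list of environment conditions mapped to SHARP skills; the first condition that does not yet hold selects the skill whose goal is handed to the goal-conditioned agent. All thresholds, skill names, and state keys below are hypothetical, not taken from a discovered archive.

```python
from typing import Callable, List, Tuple

# Illustrative sketch of a policy-in-code: an ordered list of
# (environment condition, SHARP skill) pairs. The first step whose condition
# is not yet met determines the skill handed to the goal-conditioned agent.
# Conditions, skill names, and state keys are hypothetical.
PolicyStep = Tuple[Callable[[dict], bool], str]

example_policy: List[PolicyStep] = [
    (lambda s: s["wood"] >= 2, "collect_wood"),
    (lambda s: s["stone"] >= 2, "mine_stone"),
    (lambda s: s["iron"] >= 1, "mine_iron"),
    (lambda s: s["iron_pickaxe"] >= 1, "craft_iron_pickaxe"),
]

def select_skill(policy: List[PolicyStep], state: dict) -> str:
    """Return the name of the SHARP skill to pursue in the current state."""
    for condition_met, skill_name in policy:
        if not condition_met(state):
            return skill_name
    return "done"  # every sequential sub-goal has been completed
```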
In this benchmark, the agent is required to craft a set of iron tools. Each iron tool requires a large number of resources, which the agent needs to gather in the environment. The next milestone is shown in the top left; the active goal given to the goal-conditioned agent is shown in the top right. The active goal is computed at each step by the SHARP skills, starting at the high-level policy-in-code.
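One plausible way this per-step active goal could be derived, sketched below under assumed skill names, predicates, and state keys, is to descend from the skill selected by the policy-in-code to the first prerequisite in the graph whose completion check does not yet hold.

```python
from typing import Callable, Dict, List, NamedTuple

# Standalone, assumed reconstruction of per-step active-goal computation:
# descend from the skill selected by the policy-in-code to the first
# prerequisite whose completion predicate does not yet hold.
class Skill(NamedTuple):
    goal: str                          # natural-language goal shown to the agent
    achieved: Callable[[dict], bool]   # completion predicate over the state
    prerequisites: List[str]           # names of child skills in the graph

SKILLS: Dict[str, Skill] = {
    "mine_iron": Skill("mine iron ore", lambda s: s["iron"] >= 1, []),
    "collect_coal": Skill("collect coal", lambda s: s["coal"] >= 1, []),
    "craft_iron_pickaxe": Skill(
        "craft an iron pickaxe at a furnace",
        lambda s: s["iron_pickaxe"] >= 1,
        ["mine_iron", "collect_coal"],
    ),
}

def active_goal(skill_name: str, state: dict) -> str:
    """Return the lowest-level unfinished goal beneath the given skill."""
    skill = SKILLS[skill_name]
    for child in skill.prerequisites:
        if not SKILLS[child].achieved(state):
            return active_goal(child, state)
    return skill.goal

# Example: active_goal("craft_iron_pickaxe", {"iron": 1, "coal": 0, "iron_pickaxe": 0})
# returns "collect coal".
```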
In this benchmark, the agent is required to solve a range of tasks spanning the Overworld and the Dungeon level. The agent is first instructed to collect useful resources and tools before locating the ladder down and descending to the Dungeon level. Once descended, the agent must open chests until a potion is discovered. Next, the agent must eliminate two hostile mobs on the Dungeon level before ascending back to the Overworld. Finally, after ascending, the agent is tasked with replenishing its hunger. The next milestone is shown in the top left; the active goal given to the goal-conditioned agent is shown in the top right. The active goal is computed at each step by the SHARP skills, starting at the high-level policy-in-code.
In this benchmark, the agent is required to find a diamond on the Gnomish Mines level. The agent is first tasked with gathering useful resources before descending to the Dungeon and then to the Gnomish Mines level. There, the agent should drink water and eat a bat if its drink and food levels fall below a certain threshold. Finally, the agent must explore the dark Mines level to locate a diamond. The next milestone is shown in the top left; the active goal given to the goal-conditioned agent is shown in the top right. The active goal is computed at each step by the SHARP skills, starting at the high-level policy-in-code.
The navigation benchmark evaluates the agent's ability to solve extremely long-horizon goals by locating an enchantment table. This requires the agent to first descend from the Overworld to the Dungeon level, then to the Gnomish Mines, and finally to the Sewers. On each level, the agent must first eliminate 8 hostile mobs, then locate the unlocked ladder down to the next level. The next milestone is shown in the top left; the active goal given to the goal-conditioned agent is shown in the top right. The active goal is computed at each step by the SHARP skills, starting at the high-level policy-in-code.