SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models

SENSEI agent playing Pokemon

Experiment 1: SENSEI in Pokemon

Figure E1: Histogram of maximum map reached in Pokemon Red during each episode for SENSEI and Plan2Explore (each run is 500K environment steps and histogram is plotted using 5 seeds for each method). Only SENSEI manages to reach map 9, which is the first Gym in the game. The VLM-Motif used for SENSEI (referred to later as SENSEI Gen1) is distilled from 100K pairs.
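The statistic behind a histogram like Figure E1 can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function name `max_map_histogram`, the input layout (one list of visited map IDs per episode), and the example values are all assumptions, as is the idea that a larger map ID means further progress.

```python
from collections import Counter


def max_map_histogram(episodes):
    """Count how many episodes reached each maximum map ID.

    `episodes` is a hypothetical layout: one list of visited map IDs
    per episode, pooled over all seeds of one method.
    """
    # Take the furthest map per episode, skipping empty episodes.
    counts = Counter(max(maps) for maps in episodes if maps)
    # Sort by map ID so the bins appear in order on the x-axis.
    return sorted(counts.items())


# Made-up example data, for illustration only.
example_episodes = [[0, 1, 2], [0, 1], [0, 3, 9], [1, 2]]
histogram = max_map_histogram(example_episodes)
```

Each `(map_id, count)` pair corresponds to one histogram bar for a single method.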

Figure E1_1: Histogram of maximum party level (left) and party size (right) reached in Pokemon Red during each episode for SENSEI and Plan2Explore (histogram is plotted using 5 seeds for each method).
Max party level is the sum of Pokemon levels in the party.
Level is a property of a Pokemon describing its battle experience and overall strength. We omit the starting level 6 Pokemon from the plot.

Party size is the number of Pokemon collected in the party (max is 6).
Succeeding in the game requires a diverse team with high levels, which means achieving a high party level (higher counts on the right side of the first histogram).

Experiment 2: SENSEI Generation 2 in Pokemon

Figure E2: After one round of SENSEI (results of which are shown above), we sample 50K more pairs from a SENSEI run. We then annotate this new data to obtain a new, better-informed VLM-Motif that now has knowledge of previously unknown maps.
We showcase the VLM-Motif trajectories for these two generations in the following. Note that the Motif trajectories are min-max scaled for each run, with smoothing applied with window size 3. All trajectories come from a first generation SENSEI run in Pokemon.
Top-Left: SENSEI-Gen2 semantic rewards peak correctly at the Gym and the confrontation with the Gym Leader.

Top-Right: SENSEI-Gen2 can differentiate the gym from an unimportant house (museum), unlike SENSEI-Gen1.

Bottom-Left: Semantic rewards increase while defeating a wild Pokemon.
Bottom-Right: SENSEI-Gen2 semantic rewards decline in battle as the poison status progresses but peak sharply after defeating a trainer.
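The per-run min-max scaling and window-3 smoothing mentioned in the Figure E2 caption can be sketched as below. This is a minimal sketch under assumptions: the caption does not state the order of the two steps or how the edges are handled, so scaling first, a centered window, and a window that shrinks at the edges are all choices made here. All function names are hypothetical.

```python
def min_max_scale(values):
    """Rescale one run's rewards to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant trajectory has no range; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges (assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def normalize_trajectory(rewards, window=3):
    """Scale first, then smooth (the order is an assumption)."""
    return moving_average(min_max_scale(rewards), window)


# Made-up reward values for one run, for illustration only.
trajectory = normalize_trajectory([0.0, 2.0, 4.0])
```

Because each run is scaled separately, the curves show where the semantic reward peaks within a run rather than its absolute magnitude across runs.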

Experiment 3: Solving Downstream Tasks with SENSEI in Robodesk

Figures E3: Downstream Task Performances in Robodesk. We plot the mean of the episode score obtained during evaluation for the Robodesk tasks (top) open_drawer, (middle) upright_block_off_table, and (bottom) lift_ball, with world models learned from SENSEI vs. Plan2Explore (P2E) exploration. Shaded areas depict the standard error (10 seeds) and we apply smoothing over the score trajectories with window size 3.

Open drawer

Upright Block Off Table

Lift Ball
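The mean curve and the standard-error band described in the Figures E3 caption can be computed per evaluation point across seeds. The following is a minimal sketch, not the authors' evaluation code: the function name, the input layout (one score list per seed, all of equal length), and the use of the sample standard deviation with n - 1 are assumptions.

```python
import math


def mean_and_standard_error(scores_per_seed):
    """Return per-step mean and standard error across seeds.

    `scores_per_seed` is a hypothetical layout: one list of evaluation
    scores per seed, with the same number of evaluation points each.
    """
    n = len(scores_per_seed)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    means, errors = [], []
    # zip(*...) groups the scores of all seeds at the same evaluation point.
    for step_scores in zip(*scores_per_seed):
        mean = sum(step_scores) / n
        # Sample variance (n - 1 in the denominator) is an assumption.
        variance = sum((s - mean) ** 2 for s in step_scores) / (n - 1)
        means.append(mean)
        errors.append(math.sqrt(variance / n))
    return means, errors


# Made-up scores for two seeds, for illustration only.
means, errors = mean_and_standard_error([[0.0, 1.0], [2.0, 1.0]])
```

The shaded area in such a plot would then span `mean - error` to `mean + error` at each evaluation point.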

Experiment 4: Dreamer from Scratch vs. Solving Downstream Tasks with SENSEI in Robodesk

Figures E4: Downstream Task Performances in Robodesk for Dreamer with a headstart. We also show results for learning a task policy from scratch with DreamerV3 for the upright_block_off_table task. Shaded areas depict the standard error (5 seeds) and we apply smoothing over the score trajectories with window size 3.

On the right is the SENSEI vs. Plan2Explore plot for reference (with the x-axis readjusted to also reflect the 1M exploration steps).

Experiment 5: SENSEI distilled from a smaller dataset in MiniHack KeyRoom-S15

Figure E5: Interactions in MiniHack with a smaller VLM-Motif dataset. We plot the mean number of interactions with task-relevant objects and the environment reward (unknown to the agents) collected by Plan2Explore (P2X), the original SENSEI (VLM-Motif trained on 100K pairs), and an ablation of SENSEI in which the VLM-Motif is learned from only 25K pairs. Error bars show the standard error (10 seeds).

MiniHack KeyRoom-S15
Top: Game map, Bottom: Egocentric view as agent's observation
