A. Study on failure cases of GenRL
1. Failure cases of GenRL on language tasks in Franka Kitchen
The following videos display failed testing trajectories observed during the evaluation of the learned GenRL policies on Kitchen Microwave, Kitchen Light, Kitchen Slide, and Kitchen Burner. In these failure cases, the robot continuously waves its arm beside the target object of each task (the microwave door handle, the light switch, the slide cabinet, or the burner knobs) but ultimately fails to perform the desired behavior. Intuitively, this outcome can be attributed to the learning paradigm of GenRL, in which the connector and policy focus on step-by-step visual alignment. Consequently, the agent may only mechanically mimic the visual features of the connected target state sequence without fully grasping the deep-level semantics of the task, producing arm movements that are visually quite similar to those in successful trajectories while leaving the task itself incomplete.
GenRL's failure case on Kitchen Microwave
GenRL's failure case on Kitchen Light
GenRL's failure case on Kitchen Slide
GenRL's failure case on Kitchen Burner
For comparison, the following videos display successful trajectories on each task.
Successful trajectory on Kitchen Microwave
Successful trajectory on Kitchen Light
Successful trajectory on Kitchen Slide
Successful trajectory on Kitchen Burner
2. Failure cases of GenRL on cross-viewpoint video tasks in DMC
The following videos compare the testing trajectories of the learned GenRL and FOUNDER policies on the Cheetah Run task, when the task is specified by a single video (the middle video below) captured from a viewpoint different from the agent's. GenRL fails to capture the underlying semantics of 'the cheetah is running' embedded in the cross-viewpoint video: the trajectory generated by the GenRL policy exhibits a 'slanted' cheetah, visually somewhat similar to the given cross-viewpoint task video, while the cheetah is not running at all. This again confirms that GenRL operates as a style-transfer-like approach, aligning visual appearances rather than capturing the underlying physical states. In contrast, the FOUNDER policy successfully makes the cheetah start running, exhibiting significant superiority in extracting deep-level task semantics beyond mere visual appearances.
FOUNDER Eval Trajectory
Testing trajectory of the learned FOUNDER policy on Cheetah Run (episode return: 478.77)
⬅
Target Video Prompt
The target video prompt specifying the Cheetah Run task, captured from another viewpoint
➡
GenRL Eval Trajectory
Testing trajectory of the learned GenRL policy on Cheetah Run (episode return: 6.91)
3. Interpretation of GenRL's failures
In addition to the step-by-step alignment learning paradigm, the failures observed in GenRL also stem from the world model state and the distance metric used to learn the connector and policy. GenRL relies solely on aligning stochastic state sequences when learning the connector and policy, and the stochastic state in GenRL's world model only contains information from a single visual observation. This limits GenRL to visual-level alignment. Furthermore, GenRL computes the cosine similarity between stochastic states as the pseudo reward during policy learning. However, this similarity cannot be computed directly, as the stochastic states are categorical variables (similar to DreamerV2 and DreamerV3). To address this, GenRL forwards the stochastic states through the world model's image decoder and uses the dense vectors output by the decoder's first linear layer to compute the cosine similarity. This design exacerbates the visual-only alignment issue, as the output of the image decoder primarily contains visual information, especially since GenRL's image decoder is not conditioned on deterministic states. As a result, GenRL faces significant challenges on cross-domain video tasks (such as cross-embodiment or cross-viewpoint scenarios, where deep-level task semantics are not extracted) and on tasks with complex visual observations (such as Minecraft, where visual alignment of complex observations is difficult).
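For concreteness, the snippet below is a minimal sketch of this pseudo-reward computation, assuming the stochastic state is a batch of one-hot categorical variables; the dimensions and the `decoder_first_linear` / `pseudo_reward` names are illustrative stand-ins rather than GenRL's actual code.

```python
import torch
import torch.nn.functional as F

# Illustrative dimensions; GenRL's actual sizes differ.
num_vars, num_classes, embed_dim = 32, 32, 1024

# Hypothetical stand-in for the first linear layer of the world model's image decoder.
decoder_first_linear = torch.nn.Linear(num_vars * num_classes, embed_dim)

def pseudo_reward(agent_stoch, target_stoch):
    """Cosine similarity between stochastic states, computed in the decoder's dense
    embedding space because one-hot categorical states cannot be compared directly
    (a sketch of the scheme described above, not GenRL's implementation)."""
    a = decoder_first_linear(agent_stoch.flatten(start_dim=-2))   # (B, embed_dim)
    b = decoder_first_linear(target_stoch.flatten(start_dim=-2))  # (B, embed_dim)
    return F.cosine_similarity(a, b, dim=-1)                      # (B,)

# Example: batch of 8 one-hot categorical states (num_vars categoricals, num_classes classes each).
agent_stoch = F.one_hot(torch.randint(num_classes, (8, num_vars)), num_classes).float()
target_stoch = F.one_hot(torch.randint(num_classes, (8, num_vars)), num_classes).float()
print(pseudo_reward(agent_stoch, target_stoch).shape)  # torch.Size([8])
```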
C. Study on failure cases of Founder w/o TempD
Founder w/o TempD uses cosine similarity as the distance metric in the reward function during behavior learning. Since the world model state consists of both a deterministic and a stochastic component, Founder w/o TempD computes the reward as the sum of the cosine similarities of the two parts. For the stochastic states, we employ the same method as GenRL to compute the similarity.
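The following is a minimal sketch of this reward, assuming the stochastic part has already been embedded into a dense vector as in the GenRL-style snippet above; all names (`cosine_reward`, `agent_deter`, etc.) are hypothetical.

```python
import torch
import torch.nn.functional as F

def cosine_reward(agent_deter, goal_deter, agent_stoch_embed, goal_stoch_embed):
    """Pseudo reward of Founder w/o TempD as sketched here: the sum of the cosine
    similarities of the deterministic part and the (decoder-embedded) stochastic
    part of the world-model state. Variable names are illustrative."""
    r_deter = F.cosine_similarity(agent_deter, goal_deter, dim=-1)
    r_stoch = F.cosine_similarity(agent_stoch_embed, goal_stoch_embed, dim=-1)
    return r_deter + r_stoch

# Toy example with random features (batch of 8; deterministic dim 512, stochastic embed dim 1024).
r = cosine_reward(torch.randn(8, 512), torch.randn(8, 512),
                  torch.randn(8, 1024), torch.randn(8, 1024))
print(r.shape)  # torch.Size([8])
```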
1. Failure cases of Founder w/o TempD on language tasks in DMC
We present the failure cases of Founder w/o TempD on language tasks for Walker Run, Cheetah Run, and Stickman Walk. We observe that the agent already performs well at the beginning of the behavior learning stage, but its performance deteriorates as training progresses, eventually dropping to very low returns by the end.
This unexpected phenomenon puzzled us, so we also plot the episode return curves computed with the pseudo reward. We discover that the real return and pseudo return curves exhibit completely opposite trends: the real performance worsens as the agent maximizes the pseudo reward, a reward hacking problem caused by the improper design of the cosine-similarity-based reward function (this explains the poor reward consistency of FOUNDER w/o TempD presented in Section 5.3 of the paper). We also experimented with reward functions based on the KL divergence between the agent's and target distributions, and with using only the stochastic or only the deterministic part in the distance calculation, but the performance remained poor.
We then visualize the trajectories produced by the early-stage policy (Point A) and the final policy (Point B). Despite performing well at the beginning, the agent's behavior becomes largely static by the end. The trajectory at Point B shows the agent staying at its initial position, appearing as if it is running or walking, but in reality it is only 'running' or 'walking' in place without progressing forward, or moving forward at an extremely slow pace.
Combining these findings, we conclude that reward functions based on cosine similarity or other direct distance metrics may also lead the policy to mimic the visual appearance of the agent while overlooking the underlying task semantics and multi-step movement, particularly in locomotion tasks such as running or walking. Since FOUNDER-based methods map the target sequence to a single goal state in the world model, we hypothesize that using direct distance functions between distributions or world model states may result in a lack of temporal awareness and of crucial task-completion information.
Testing Returns of the Founder w/o TempD policy during behavior learning stage for Walker Run, in terms of real performance and pseudo return
Testing trajectory of the Founder w/o TempD policy on Walker Run in the early stage (Point A) of behavior learning (Episode Return: 410.0)
Testing trajectory of the Founder w/o TempD policy on Walker Run at the end (Point B) of behavior learning (Episode Return: 183.9)
Testing Returns of the Founder w/o TempD policy during behavior learning stage for Cheetah Run, in terms of real performance and pseudo return
Testing trajectory of the Founder w/o TempD policy on Cheetah Run in the early stage (Point A) of behavior learning (Episode Return: 719.2)
Testing trajectory of the Founder w/o TempD policy on Cheetah Run at the end (Point B) of behavior learning (Episode Return: 195.2)
Testing Returns of the Founder w/o TempD policy during behavior learning stage for Stickman Walk, in terms of real performance and pseudo return
Testing trajectory of the Founder w/o TempD policy on Stickman Walk in the early stage (Point A) of behavior learning (Episode Return: 799.4)
Testing trajectory of the Founder w/o TempD policy on Stickman Walk at the end (Point B) of behavior learning (Episode Return: 294.3)
To enhance temporal awareness and incorporate more task-completion information, we propose utilizing temporal distance as the reward function. This approach eliminates the reward hacking problem (as the trends of the two return curves are now generally consistent), and the resulting FOUNDER method achieves superior performance. Below are the testing return curves of the final FOUNDER method, showing both real and pseudo rewards. Here, the pseudo reward is the predicted temporal distance between the goal state and the current state, with 1 added, as described in the paper.
Testing Returns of the Founder policy during behavior learning stage for Walker Run, in terms of real performance and pseudo return
Testing Returns of the Founder policy during behavior learning stage for Cheetah Run, in terms of real performance and pseudo return
Testing Returns of the Founder policy during behavior learning stage for Stickman Walk, in terms of real performance and pseudo return
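For reference, the snippet below sketches the temporal-distance pseudo reward described above (the predicted temporal distance between the current state and the goal state, plus 1). The `TempDPredictor` architecture and all names are illustrative assumptions, not FOUNDER's actual implementation.

```python
import torch
import torch.nn as nn

class TempDPredictor(nn.Module):
    """Hypothetical temporal-distance head: maps a (state, goal) pair to a scalar
    predicted temporal distance. The real FOUNDER predictor operates on world-model
    states; this is only a stand-in for illustration."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, 256), nn.ELU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1)).squeeze(-1)

def pseudo_reward(tempd, state, goal):
    # Predicted temporal distance between the current state and the goal state,
    # shifted by +1 as described in the paper (most raw predictions cluster near -1,
    # so the shift yields a roughly zero-centered, denser reward signal).
    return tempd(state, goal) + 1.0

# Toy usage with random stand-in world-model features (batch of 8, state dim 512).
tempd = TempDPredictor(state_dim=512)
r = pseudo_reward(tempd, torch.randn(8, 512), torch.randn(8, 512))
print(r.shape)  # torch.Size([8])
```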
B. Additional Experimental Charts and Results in Rebuttal
1. Performance comparison with model-free baselines
To ensure a comprehensive evaluation, we include HILP, a multi-task model-free method that achieves strong zero-shot RL performance, and TD3, the best-performing single-task model-free baseline according to GenRL's reported results. We use the VLM-based cosine similarity between the task prompt and observations as the reward function for HILP's zero-shot RL adaptation, as well as for TD3's reward. We assess their performance in the Cheetah and Kitchen domains and also list GenRL's performance for a clear comparison. We find that FOUNDER consistently outperforms both the single-task and multi-task model-free baselines.
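As an illustration of how such a VLM-based reward can be computed, the sketch below uses CLIP (via the `transformers` library) as the vision-language model; the specific checkpoint and function names are our assumptions, not necessarily the exact setup used in these experiments.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative VLM choice; the checkpoint below is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def vlm_reward(prompt: str, observation: Image.Image) -> float:
    """Reward = cosine similarity between the task-prompt embedding and the
    current observation embedding, as described for the model-free baselines."""
    inputs = processor(text=[prompt], images=observation, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.cosine_similarity(text_emb, img_emb, dim=-1).item()

# Example: reward for one (hypothetical, blank) rendered frame under a language prompt.
frame = Image.new("RGB", (224, 224))
print(vlm_reward("the cheetah is running", frame))
```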
2. Performance clarification on Minecraft tasks
The high variance in Figure 5 of the paper stems from Minecraft's inherent stochasticity. We re-evaluate the results using 95% confidence intervals, replacing the original standard deviation plots, and provide clearer learning curves by averaging trajectory returns from the FOUNDER-based methods. The averaged FOUNDER-based method achieves performance similar to GenRL on 2 of the 5 tasks, and its mean success rate consistently exceeds GenRL's upper CI bound on the remaining 3 tasks, demonstrating a statistically significant improvement.
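For reference, a 95% confidence interval over evaluation returns can be computed as in the sketch below; we use a Student-t interval here as one reasonable choice, which may differ in detail from the exact procedure used for the updated plots.

```python
import numpy as np
from scipy import stats

def mean_ci95(returns):
    """Mean and 95% confidence interval for a set of evaluation returns,
    using a Student-t interval (one of several reasonable choices)."""
    x = np.asarray(returns, dtype=float)
    mean = x.mean()
    half = stats.sem(x) * stats.t.ppf(0.975, df=len(x) - 1)  # half-width of the interval
    return mean, mean - half, mean + half

# Example with hypothetical per-seed success rates on a Minecraft task.
print(mean_ci95([0.42, 0.55, 0.48, 0.61, 0.39]))
```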
3. Additional results of FOUNDER w/ or w/o "+1 reward"
Since most of the original predicted temporal distances cluster near -1, the "+1" operation addresses reward sparsity and improves learning efficiency, and it is common for prior works to use zero-centered rewards. The oscillation of the no-add1 curves in the figure reflects the importance of the "+1" operation in stabilizing training. However, these experimental results also confirm that the agent can eventually achieve similar performance without this shaping, given more behavior-learning steps (50K steps for FOUNDER versus 100K steps for FOUNDER_noadd1 in the figure), showing that the "+1" operation is only an optional engineering choice that improves efficiency rather than being fundamental to our method. Furthermore, we find that directly normalizing the original temporal distance (FOUNDER_noadd1_normalize in the figure) and using the resulting rewards yields performance and learning efficiency similar to the "+1" reward, indicating that the positive impact of the "+1" operation is comparable to that of reward normalization.
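The toy snippet below illustrates why the "+1" shift and simple reward normalization behave similarly: both roughly zero-center raw temporal-distance predictions that cluster near -1. The synthetic data here is purely illustrative.

```python
import numpy as np

# Synthetic stand-in for raw TempD predictions clustering near -1 (not real model outputs).
raw = np.random.normal(loc=-1.0, scale=0.1, size=1000)

shifted = raw + 1.0                                   # the "+1" variant: roughly zero-centered
normalized = (raw - raw.mean()) / (raw.std() + 1e-8)  # a simple normalization alternative

print(shifted.mean(), normalized.mean())              # both means are approximately zero
```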
4. Ablation study results on KL weight
We conduct an ablation study on the KL weight in the Cheetah and Kitchen domains to examine its sensitivity. We find that FOUNDER is not sensitive to the KL weight in general, while 1.0 is the best choice overall. Moreover, different KL weight choices appear to have more influence on the performance of out-of-distribution tasks such as Cheetah Flip and Kitchen Kettle than on in-distribution ones.
5. Temporal distance predictor learned from quasi-static data
We learn a temporal distance (TempD) predictor on the Stand dataset, where trajectories are generated by an expert policy on Walker Stand (in the submitted rebuttal text, we mistyped "Walker Stand" as "Walker Run"; we sincerely apologize for this error). We then validate the learned predictor's accuracy in predicting near-zero distance for identical world-model state pairs. We evaluate the mean output distance over all samples; a value closer to 0 indicates better prediction. The results (Stand data: -1.6e-4, Full data: -1.3e-4) show that a TempD predictor trained on quasi-static data achieves same-state-pair prediction accuracy comparable to a predictor trained on the full dataset, showcasing the method's robustness in this setting.
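The evaluation itself is straightforward: feed identical world-model state pairs into the predictor and average the outputs. A minimal sketch (with a hypothetical `tempd(state, goal)` interface and random stand-in features) is shown below.

```python
import torch

@torch.no_grad()
def same_state_mean_distance(tempd, states):
    """Mean predicted temporal distance over identical (s, s) pairs; values close
    to zero indicate the predictor correctly assigns near-zero distance to a state
    paired with itself. `tempd(state, goal)` is any trained predictor (hypothetical API)."""
    return tempd(states, states).mean().item()

# Toy usage with an untrained stand-in predictor over random world-model features.
head = torch.nn.Linear(1024, 1)
tempd = lambda s, g: head(torch.cat([s, g], dim=-1))
print(same_state_mean_distance(tempd, torch.randn(128, 512)))
```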
6. Performance on real-world video tasks
We present experimental results on real-world video tasks using the videos provided in GenRL's code repository. Here we also list the performance of GenRL and FOUNDER on the corresponding language tasks as upper bounds for comparison. FOUNDER again demonstrates solid performance compared to GenRL when generalizing to real-world video task understanding and grounding, and even matches GenRL's language-based task-solving performance. The file names of the real-world videos used to specify each task are "person_standing_up_with_hands_up_seen_from_the_side", "spider_draw", "dog_running_seen_from_the_side", "guy_walking", and "open_microwave", respectively, in GenRL's code repository.