A. Performance of CoLA-World on A Different Video Generation Backbone (Wan2.1)
To further verify the architectural universality and scalability of the CoLA-World joint training framework, we extend our experiments by replacing the underlying video generation backbone from OpenSora to the Wan2.1 model.
Crucially, to ensure a controlled comparison, we keep the experimental protocol identical to that of Section 4.2 in the main paper. We finetune Wan2.1-T2V-1.3B into an action-conditioned video generation model (world model). The mechanism for injecting latent action conditions into the Wan2.1 diffusion transformer (DiT) architecture mirrors the strategy employed in our OpenSora implementation, with action tokens integrated via AdaLN. For joint training, we warm up the IDM and the quantizer for 5K steps, until the codebook metrics stabilize at high levels (e.g., high utilization). For the 2-stage methods, we reuse the LAM trained for 30K steps in Section 4.2 of the main paper. Quantitative results are detailed in the table below. Our joint training pipeline outperforms or matches the 2-stage baselines even with a much lower training budget (45K vs. 70K). This ablation demonstrates that our joint training paradigm is model-agnostic and can readily benefit from advancements in foundation video generation models.
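As a rough illustration of AdaLN-style conditioning, the sketch below modulates a normalized hidden vector with a scale and shift predicted from a latent action embedding. It is a minimal sketch under assumptions, not our actual implementation: the function names (`layer_norm`, `linear`, `adaln_modulate`), the single linear projections, and the toy dimensions are all hypothetical, and the real Wan2.1 and OpenSora blocks may differ in detail.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, vec):
    """Matrix-vector product; weight has one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def adaln_modulate(hidden, action_emb, w_scale, w_shift):
    """AdaLN-style conditioning (assumed form): the action embedding predicts
    a per-channel scale and shift applied to the normalized hidden vector."""
    scale = linear(w_scale, action_emb)
    shift = linear(w_shift, action_emb)
    normed = layer_norm(hidden)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]

# Toy example: 4 hidden channels, 2-dimensional latent action embedding.
hidden = [1.0, 2.0, 3.0, 4.0]
action_emb = [0.5, -0.5]
zero_w = [[0.0, 0.0] for _ in hidden]    # zero-initialized projection
shift_w = [[1.0, 0.0] for _ in hidden]   # shift of 0.5 on every channel

unconditioned = adaln_modulate(hidden, action_emb, zero_w, zero_w)
shifted = adaln_modulate(hidden, action_emb, zero_w, shift_w)
```

With zero-initialized projections the action has no effect and the output is just the normalized input, so conditioning can be introduced without disturbing the pretrained backbone at the start of finetuning.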
B. Additional VP² RoboDesk Experiments
We expand our evaluation to all 7 tasks of the VP² RoboDesk benchmark, with more seeds (5 seeds), and additionally compare with AdaWorld, a recent state-of-the-art latent-action WM that can also be seen as a 2-stage approach. We compare our joint-training model (WARM8K+E2E30K) against the 2-stage baseline (LAM30K+WM30K), both finetuned into real-action-based WMs using the protocol from Section 4.4. Each planning step uses 50 action sequence samples. We also include results for checkpoints with larger pretraining budgets (WARM8K+E2E52K and LAM30K+WM52K) and for more action sequence samples (200) in planning. Following AdaWorld, we use only 3 denoising steps and disable classifier-free guidance during WM inference to accelerate the experiments, although this may reduce generation quality. The results are listed below. Our joint training method clearly outperforms both the 2-stage baseline and AdaWorld, and achieves considerable planning performance on tasks including Upright Block Off Table, Push Red, Push Green and Push Blue. A larger LAM-WM pretraining budget and more action samples in planning can further improve performance. Note that AdaWorld did not report performance on Flat Block and Push Drawer.
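To make the role of the number of action sequence samples concrete, the following is a minimal sampling-based planning sketch. It uses plain random shooting with a hypothetical 1-D toy dynamics function (`toy_world_model`) and cost; it is not the planner or the world model used in the benchmark, and all names and defaults here are illustrative assumptions.

```python
import random

def toy_world_model(state, action):
    """Hypothetical stand-in for a learned WM: 1-D position dynamics."""
    return state + action

def plan_by_sampling(state, goal, num_samples=50, horizon=5, seed=0):
    """Sample candidate action sequences, roll each one out in the world
    model, and return the sequence with the lowest final distance to goal."""
    rng = random.Random(seed)
    best_cost, best_plan = float("inf"), None
    for _ in range(num_samples):
        plan = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s = state
        for a in plan:
            s = toy_world_model(s, a)
        cost = abs(s - goal)
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan, best_cost

plan_50, cost_50 = plan_by_sampling(0.0, 2.0, num_samples=50)
plan_200, cost_200 = plan_by_sampling(0.0, 2.0, num_samples=200)
```

Because both calls share a seed, the 200-sample run evaluates the same first 50 candidates plus 150 more, so its best cost can only be equal or lower. In practice the benefit of extra samples also depends on how accurately the world model scores each candidate.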
C. Long Horizon Autoregressive Video Prediction
We evaluate long-horizon autoregressive real-action video prediction on all four Libero suites. We use the same WM checkpoints, action adapters and protocols as the Libero experiment in Sec 4.4, where we finetune the WM on the whole Libero dataset and evaluate on each of the four suites. To mitigate compounding errors, each generation step conditions on the 5 most recent states plus the initial anchor frame and generates the next 2 states. Since we use a temporal downsample rate of 3 for Libero's LAM, generating a full trajectory (about 150 frames on average) requires about 25 autoregressive steps. We use only 5 denoising steps and disable classifier-free guidance to accelerate inference. The results below show that CoLA-World outperforms the 2-stage baseline across most suites and metrics. This confirms that our co-trained, collapse-resistant LAM and WM carry their advantage in short-sequence prediction over to stable long-horizon rollouts. Admittedly, the absolute long-horizon prediction performance remains modest, especially on Libero-Long. This is because our current pretraining pipeline does not yet incorporate optimizations tailored for autoregressive generation (e.g., context noise injection or extended training context lengths). Integrating these standard techniques is a straightforward avenue for further improving long-horizon stability in future work.
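The rollout schedule described above can be sketched as follows. This is a simplified illustration under assumptions: `rollout_schedule`, `autoregressive_rollout` and `toy_predict` are hypothetical names, states are plain numbers rather than frames or latents, and the number of states is assumed to be the frame count divided by the temporal downsample rate.

```python
import math

def rollout_schedule(num_frames, temporal_downsample=3, states_per_step=2):
    """Return (number of states, number of autoregressive steps)."""
    num_states = math.ceil(num_frames / temporal_downsample)
    num_steps = math.ceil(num_states / states_per_step)
    return num_states, num_steps

def autoregressive_rollout(anchor, actions, predict_fn,
                           context_len=5, states_per_step=2):
    """Generate states chunk by chunk; each step sees the anchor frame plus
    at most `context_len` of the most recently generated states."""
    generated = []
    for i in range(0, len(actions), states_per_step):
        chunk = actions[i:i + states_per_step]
        context = [anchor] + generated[-context_len:]
        generated.extend(predict_fn(context, chunk))
    return generated

def toy_predict(context, chunk):
    """Hypothetical stand-in for the WM: integrate actions from the last state."""
    out, state = [], context[-1]
    for action in chunk:
        state = state + action
        out.append(state)
    return out

num_states, num_steps = rollout_schedule(150)   # 50 states, 25 steps
states = autoregressive_rollout(0, [1] * num_states, toy_predict)
```

For a 150-frame trajectory this gives 150 / 3 = 50 states and 50 / 2 = 25 autoregressive steps, matching the figure quoted above. Keeping the anchor frame in every context gives each step a fixed reference even after the sliding window has moved past the start of the trajectory.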
D. Latent Action Transfer Demos
Below are additional LAM transfer video demos of our CoLA-World model.
For each video pair, the left one is the source video, used to extract latent actions using the joint-trained CoLA-World LAM's IDM. (Source)
The right one is the generated video, starting from a first frame taken from a dataset different from the source, performing rollout imagination in the joint-trained CoLA-World's WM, using the latent actions extracted from the source. (Target)
The generated videos closely match the source videos in semantic meaning, demonstrating the model's ability to maintain semantic consistency across distinct embodiments (transfer across heterogeneous robots or between robot arms and human hands). This provides qualitative evidence that our method establishes a unified latent action space.
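The transfer procedure used for these demos can be summarized with the toy sketch below: latent actions are extracted from consecutive source frames by an inverse dynamics model, then replayed in the world model starting from a target first frame. The `toy_idm` and `toy_wm` functions are hypothetical 1-D stand-ins, not the CoLA-World models, and the three-entry "codebook" of -1, 0 and 1 is an assumption made purely for illustration.

```python
def extract_latent_actions(source_frames, idm):
    """Apply the IDM to each pair of consecutive source frames."""
    return [idm(prev, nxt) for prev, nxt in zip(source_frames, source_frames[1:])]

def transfer_rollout(first_frame, latent_actions, wm):
    """Roll out the WM from a target first frame using the source's latent actions."""
    frames = [first_frame]
    for z in latent_actions:
        frames.append(wm(frames[-1], z))
    return frames

def toy_idm(prev, nxt):
    """Hypothetical IDM: quantize the motion direction to -1, 0 or 1."""
    delta = nxt - prev
    return (delta > 0) - (delta < 0)

def toy_wm(frame, z):
    """Hypothetical WM for a different embodiment: same direction, smaller step."""
    return frame + 0.5 * z

source = [0, 2, 4, 3, 3]                      # source "video" (1-D positions)
latents = extract_latent_actions(source, toy_idm)
target = transfer_rollout(10.0, latents, toy_wm)
```

In this toy setting the target follows the same move-right, move-right, move-left, stay pattern as the source despite starting elsewhere and moving with a different step size, which is the kind of embodiment-independent behavior the demos are meant to show.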
E. Comparing Latent Action Transfer Results of CoLA-World and the 2-Stage Baseline
Below are comparative LAM transfer demos between our joint-trained CoLA-World and the two-stage baseline.
For each video triplet, the first one is the source video, used to extract latent actions. (Source)
The second one is the video generated by CoLA-World, starting from a first frame taken from a dataset different from the source, performing rollout imagination in the joint-trained CoLA-World's WM, using the latent actions extracted from the source by CoLA-World LAM's IDM. (Target-Joint)
The third one is the video generated by the 2-stage baseline, starting from the same first frame as Target-Joint, performing rollout imagination in the 2-stage baseline's WM, using the latent actions extracted from the source by the 2-stage baseline LAM's IDM. (Target-2Stage)
Again, both models extract latent actions from the same source video and perform rollout imagination starting from the same initial target frame (taken from a dataset different from the source).
Notably, the videos generated by the 2-stage baseline frequently fail to adhere to the semantic action of the source video, whereas CoLA-World faithfully preserves the intended behaviors. The comparison results highlight the superior semantic consistency and generation quality of our jointly trained model.