* indicates equal contribution and ✉️ indicates equal corresponding authors.
TL;DR:
EnerVerse is a framework for generating future spaces, represented as multi-view videos, for robotic manipulation tasks. It uses chunkwise autoregressive generation and a sparse memory mechanism to produce infinitely long sequences with explicit end-of-sequence (EoS) control. We further integrate it with 4D Gaussian Splatting (4DGS) to construct a data flywheel for sim2real adaptation. Combined with a naive policy head, it achieves state-of-the-art performance on robotic manipulation benchmarks, demonstrating the effectiveness of its space generation prior.
Abstract: We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, thereby ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context in conjunction with a chunkwise unidirectional generative paradigm to facilitate the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which offers flexible perspectives that enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot’s generalization and adaptability across a variety of tasks and settings. To address the prohibitive costs and labor intensity associated with acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline capitalizes on the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.
First, Initial Reconstruction uses observation images from cameras mounted on the robot to build an initial 3D point cloud, with anchor views placed to suit the environment and task-specific requirements.
Second, Free Anchor View Rendering generates images from these anchor perspectives to provide comprehensive scene representations.
Finally, Chunk-wise Autoregressive Generation employs a multi-view video diffusion model to produce image sequences in chunks, conditioned on task instructions (see the sketch below). When integrated with a policy head, this module can generate robotic actions to execute the given task.
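To make the chunk-wise rollout concrete, below is a minimal sketch of the loop just described: bidirectional attention within each chunk, causal conditioning across chunks via a sparse memory of past frames, and an explicit EoS check. All names here (`denoise_chunk`, `is_eos`, the chunk length) are hypothetical illustrations, not the actual implementation.

```python
import torch

CHUNK_LEN = 8      # frames generated per autoregressive step (assumed)
MAX_CHUNKS = 32    # safety cap on rollout length (assumed)

def rollout(model, anchor_views, instruction, memory_stride=4):
    """Generate a multi-view video chunk by chunk until an EoS frame appears.

    anchor_views: (V, C, H, W) images rendered from the Free Anchor Views.
    memory_stride: keep every `memory_stride`-th past frame as sparse context.
    """
    frames = [anchor_views]                      # history of (V, C, H, W) frames
    for _ in range(MAX_CHUNKS):
        # Sparse memory: condition on a subsampled history, not every frame.
        memory = torch.stack(frames[::memory_stride])
        chunk = model.denoise_chunk(             # bidirectional attention inside
            context=memory,                      # the chunk, causal across chunks
            text=instruction,
            num_frames=CHUNK_LEN,
        )                                        # -> (CHUNK_LEN, V, C, H, W)
        frames.extend(chunk.unbind(0))
        if model.is_eos(chunk[-1]):              # explicit end-of-sequence control
            break
    return torch.stack(frames)
```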
Visualization of FAVs generation on the LIBERO benchmark. Anchor View 1 represents the observation image captured by a mounted camera. Anchor View 2 and Anchor View 3 are generated by rendering from a point cloud reconstructed from Anchor View 1 using depth warping.
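For intuition, here is a minimal sketch of the depth-warping step: unproject the observed depth map into a point cloud, transform it into an anchor camera, and reproject. The pinhole model, shared intrinsics, and the naive splatting (no z-buffer) are simplifying assumptions for illustration.

```python
import numpy as np

def unproject(depth, K):
    """Lift an (H, W) depth map to an (N, 3) point cloud in camera coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def render_anchor_view(depth, rgb, K, T_src2anchor):
    """Warp source pixels into an anchor camera via the reconstructed point cloud."""
    pts = unproject(depth, K)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    pts_a = (T_src2anchor @ pts_h.T).T[:, :3]     # 4x4 rigid transform to anchor frame
    uv = (K @ pts_a.T).T
    uv = uv[:, :2] / uv[:, 2:3]                   # perspective divide
    H, W = depth.shape
    out = np.zeros_like(rgb)
    valid = pts_a[:, 2] > 0                       # keep points in front of the camera
    ui = np.clip(uv[:, 0].astype(int), 0, W - 1)
    vi = np.clip(uv[:, 1].astype(int), 0, H - 1)
    out[vi[valid], ui[valid]] = rgb.reshape(-1, 3)[valid]  # naive splat, no z-buffer
    return out
```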
Observation images captured from multiple cameras, along with rendered images from anchor views, are processed by the multi-view video generator to produce denoised multi-view videos. These videos, paired with their corresponding camera poses, are used in 4D Gaussian Splatting (4DGS) for 4D scene reconstruction. The reconstructed content is rendered from the anchor views to generate high-precision images, which are iteratively fed back into the pipeline to enhance motion consistency and reconstruction quality. This iterative loop combines geometric consistency with generative refinement, delivering high-fidelity outputs for tasks such as robotic manipulation.
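Schematically, the flywheel alternates between generation and reconstruction. The `generator`, `fit_4dgs`, and `scene.render` interfaces below are stand-ins for the actual components, sketched under that assumption:

```python
def data_flywheel(generator, observations, camera_poses, anchor_poses, rounds=3):
    """Alternate generative refinement with 4DGS reconstruction.

    Each round: sample multi-view videos, fit a 4D Gaussian Splatting scene to
    them, then re-render the anchor views as higher-fidelity conditioning
    images for the next round.
    """
    anchor_images = observations
    scene = None
    for _ in range(rounds):
        videos = generator.sample(anchor_images)              # denoised multi-view videos
        scene = fit_4dgs(videos, camera_poses)                # spatially consistent 4D scene
        anchor_images = [scene.render(p) for p in anchor_poses]  # geometric feedback
    return scene, anchor_images
```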
Much Improved Visual Quality! 😁
Note that all three views are generated by our model; the consistency across views highlights the geometric and spatial information it has learned.
Instructions in parentheses indicate already-completed pre-steps.
Task: pick up the cream cheese box and put it in the basket
Task: (put the white mug on the plate and) put the chocolate pudding to the right of the plate
Task: (put the yellow and white mug in the microwave and) close it
Task: (turn on the stove and) put the moka pot on it
Our model can also generate multi-view or single-view videos from real-world data on demand.
Multi-View Video Generation
Task: Take the small cup, large cup, blue saucer handle, and beige saucer handle to the small container grid in that order.
Task: Right arm picks up the two slices of toast from the white toaster and places them in the dining plate sequentially.
Single-View Video Generation
Since our model predicts the EoS frame at the 42nd frame for this task, we visualize the 8th, 16th, 24th, and 41st frames sampled from both generated sequences. The sequence generated by DynamiCrafter(FN) did not maintain the logic of the long-range task, producing many hallucinations as it grew. In contrast, the sequence generated by EnerVerse was logically coherent, continuously and completely generating the future space of the entire task and accurately predicting the EoS (end-of-sequence) frame.
In the policy prediction experiment, the action head adopts the Diffusion Policy (DP) architecture, with a total of 190M parameters. As the condition for the DP head, we use the UNet features immediately before the middle block at the first denoising step and average them over the spatial dimensions.
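As a rough sketch, this conditioning could be extracted with a forward hook; the hook location, call signature, and shapes below are assumptions for illustration, not the actual training code.

```python
import torch

def extract_condition(unet, noisy_latents, first_timestep, text_emb):
    """Pool pre-middle-block UNet features into a condition vector for the DP head.

    Assumes a diffusers-style UNet exposing `down_blocks`; the features feeding
    the middle block are captured at the first denoising step only.
    """
    features = {}

    def grab(module, args, output):
        # Down blocks may return (hidden_states, residuals); keep the hidden states.
        features["pre_mid"] = output[0] if isinstance(output, tuple) else output

    handle = unet.down_blocks[-1].register_forward_hook(grab)
    with torch.no_grad():
        unet(noisy_latents, first_timestep, encoder_hidden_states=text_emb)
    handle.remove()

    # (B, C, H, W) -> (B, C): mean over the spatial dimensions.
    return features["pre_mid"].mean(dim=(-2, -1))
```

The pooled vector then serves as the conditioning input to the 190M-parameter DP action head.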
EnerVerse achieves state-of-the-art performance across the LIBERO benchmark, significantly surpassing all baselines.
Our Policy Rollouts on LIBERO Benchmarks
Task: Put the yellow and white mug in the microwave and close it
Task: Put both the cream cheese box and the butter in the basket
Task: Turn on the stove and put the moka pot on it
Task: Put the black bowl in the bottom drawer of the cabinet and close it
Task: Pick up the book and place it in the left compartment of the caddy
Task: Pick up the black bowl on the wooden cabinet and place it on the plate
To evaluate the manipulation capabilities of EnerVerse, we conducted real-world experiments in which the robot placed magnet blocks into designated compartments of a foam worktable, a task demanding high accuracy due to the tight fit and the visual similarity between the foam and the table.
Challenges:
The robot must follow natural language instructions, such as "Row One, Column Two," to identify the required compartment.
The compartments are only slightly larger than the magnet blocks, transforming the pick-and-place task into a highly precise "insertion" operation.
The magnet blocks are relatively heavy, necessitating the robot gripper to grasp near the center of the block to ensure stability during manipulation.
Task: Take a black magnet from the table and place it on the white box, row 3 column 2
Task: Take a black magnet from the table and place it on the white box, row 2 column 1
Against changing lighting conditions
Against visual distractors
...