* indicates equal contribution and ✉️ indicates equal corresponding authors.
TL;DR:
EnerVerse is a framework for generating future spaces, represented as multi-view videos, for robotic manipulation tasks. It uses chunkwise autoregressive generation and a sparse memory mechanism to produce infinitely long sequences with explicit end-of-sequence (EoS) control. We further integrate it with 4D Gaussian Splatting (4DGS) to construct a data flywheel for sim2real adaptation. Combined with a naive policy head, it achieves state-of-the-art performance on robotic manipulation benchmarks, demonstrating the effectiveness of its space generation prior.
Abstract: We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, thereby ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context in conjunction with a chunkwise unidirectional generative paradigm to facilitate the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which offers flexible perspectives that enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot’s generalization and adaptability across a variety of tasks and settings. To address the prohibitive costs and labor intensity associated with acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline capitalizes on the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.
First, Initial Reconstruction uses observation images from cameras mounted on the robot to build an initial 3D point cloud, with anchor views placed to suit the environment and task-specific requirements.
Second, Free Anchor View Rendering generates images from these anchor perspectives to provide comprehensive scene representations.
Finally, Chunk-wise Autoregressive Generation employs a multi-view video diffusion model to produce image sequences in chunks, conditioned on task instructions (see the sketch below). When integrated with a policy head, this module can generate robotic actions to execute the given task.
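To make the chunk-wise rollout concrete, below is a minimal sketch of the loop just described: bidirectional attention within each chunk, causal conditioning across chunks via a sparse memory of past frames, and an explicit EoS check. All names here (`denoise_chunk`, `is_eos`, the chunk length) are hypothetical illustrations, not the actual implementation.

```python
import torch

CHUNK_LEN = 8      # frames generated per autoregressive step (assumed)
MAX_CHUNKS = 32    # safety cap on rollout length (assumed)

def rollout(model, anchor_views, instruction, memory_stride=4):
    """Generate a multi-view video chunk by chunk until an EoS frame appears.

    anchor_views: (V, C, H, W) images rendered from the Free Anchor Views.
    memory_stride: keep every `memory_stride`-th past frame as sparse context.
    """
    frames = [anchor_views]                      # history of (V, C, H, W) frames
    for _ in range(MAX_CHUNKS):
        # Sparse memory: condition on a subsampled history, not every frame.
        memory = torch.stack(frames[::memory_stride])
        chunk = model.denoise_chunk(             # bidirectional attention inside
            context=memory,                      # the chunk, causal across chunks
            text=instruction,
            num_frames=CHUNK_LEN,
        )                                        # -> (CHUNK_LEN, V, C, H, W)
        frames.extend(chunk.unbind(0))
        if model.is_eos(chunk[-1]):              # explicit end-of-sequence control
            break
    return torch.stack(frames)
```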
Visualization of FAVs generation on the LIBERO benchmark. Anchor View 1 represents the observation image captured by a mounted camera. Anchor View 2 and Anchor View 3 are generated by rendering from a point cloud reconstructed from Anchor View 1 using depth warping.
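For intuition, here is a minimal sketch of the depth-warping step: unproject the observed depth map into a point cloud, transform it into an anchor camera, and reproject. The pinhole model, shared intrinsics, and the naive splatting (no z-buffer) are simplifying assumptions for illustration.

```python
import numpy as np

def unproject(depth, K):
    """Lift an (H, W) depth map to an (N, 3) point cloud in camera coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def render_anchor_view(depth, rgb, K, T_src2anchor):
    """Warp source pixels into an anchor camera via the reconstructed point cloud."""
    pts = unproject(depth, K)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    pts_a = (T_src2anchor @ pts_h.T).T[:, :3]     # 4x4 rigid transform to anchor frame
    uv = (K @ pts_a.T).T
    uv = uv[:, :2] / uv[:, 2:3]                   # perspective divide
    H, W = depth.shape
    out = np.zeros_like(rgb)
    valid = pts_a[:, 2] > 0                       # keep points in front of the camera
    ui = np.clip(uv[:, 0].astype(int), 0, W - 1)
    vi = np.clip(uv[:, 1].astype(int), 0, H - 1)
    out[vi[valid], ui[valid]] = rgb.reshape(-1, 3)[valid]  # naive splat, no z-buffer
    return out
```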
Observation images captured from multiple cameras, along with rendered images from anchor views, are processed by the multi-view video generator to produce denoised multi-view videos. These videos, paired with their corresponding camera poses, are used in 4D Gaussian Splatting (4DGS) for 4D scene reconstruction. The reconstructed content is rendered from the anchor views to generate high-precision images, which are iteratively fed back into the pipeline to enhance motion consistency and reconstruction quality. This iterative loop combines geometric consistency with generative refinement, delivering high-fidelity outputs for tasks such as robotic manipulation.
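Schematically, the flywheel alternates between generation and reconstruction. The `generator`, `fit_4dgs`, and `scene.render` interfaces below are stand-ins for the actual components, sketched under that assumption:

```python
def data_flywheel(generator, observations, camera_poses, anchor_poses, rounds=3):
    """Alternate generative refinement with 4DGS reconstruction.

    Each round: sample multi-view videos, fit a 4D Gaussian Splatting scene to
    them, then re-render the anchor views as higher-fidelity conditioning
    images for the next round.
    """
    anchor_images = observations
    scene = None
    for _ in range(rounds):
        videos = generator.sample(anchor_images)              # denoised multi-view videos
        scene = fit_4dgs(videos, camera_poses)                # spatially consistent 4D scene
        anchor_images = [scene.render(p) for p in anchor_poses]  # geometric feedback
    return scene, anchor_images
```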
Much Improved Visual Quality! 😁
Note that all three views are generated by our model; the consistency across views highlights the geometric and spatial information it has learned.
Instructions in parentheses indicate already-completed pre-steps.
Task: pick up the cream cheese box and put it in the basket
Task: (put the white mug on the plate and) put the chocolate pudding to the right of the plate
Task: (put the yellow and white mug in the microwave and) close it
Task: (turn on the stove and) put the moka pot on it
Our model can also generate multi-view or single-view videos from real-world data on demand.
Multi-View Video Generation
Task: Take the small cup, large cup, blue saucer handle, and beige saucer handle to the small container grid in that order.
Task: Right arm picks up the two slices of toast from the white toaster and places them in the dining plate sequentially.
Single-View Video Generation
Since our model predicts the EoS frame at the 42nd frame for this task, we visualize the 8th, 16th, 24th, and 41st frames sampled from both generated sequences. The sequence generated by DynamiCrafter(FN) did not maintain the logic of the long-range task, producing many hallucinations as it grew. In contrast, the sequence generated by EnerVerse was logically coherent, continuously and completely generating the future space of the entire task and accurately predicting the EoS (end-of-sequence) frame.
In the policy prediction experiment, the action head adopts the Diffusion Policy (DP) architecture, with a total of 190M parameters. As the condition for the DP head, we use the UNet features immediately before the middle block at the first denoising step and average them over the spatial dimensions.
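As a rough sketch, this conditioning could be extracted with a forward hook; the hook location, call signature, and shapes below are assumptions for illustration, not the actual training code.

```python
import torch

def extract_condition(unet, noisy_latents, first_timestep, text_emb):
    """Pool pre-middle-block UNet features into a condition vector for the DP head.

    Assumes a diffusers-style UNet exposing `down_blocks`; the features feeding
    the middle block are captured at the first denoising step only.
    """
    features = {}

    def grab(module, args, output):
        # Down blocks may return (hidden_states, residuals); keep the hidden states.
        features["pre_mid"] = output[0] if isinstance(output, tuple) else output

    handle = unet.down_blocks[-1].register_forward_hook(grab)
    with torch.no_grad():
        unet(noisy_latents, first_timestep, encoder_hidden_states=text_emb)
    handle.remove()

    # (B, C, H, W) -> (B, C): mean over the spatial dimensions.
    return features["pre_mid"].mean(dim=(-2, -1))
```

The pooled vector then serves as the conditioning input to the 190M-parameter DP action head.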
EnerVerse achieves state-of-the-art performance across the LIBERO benchmark, significantly surpassing all baselines.
Our Policy Rollouts on LIBERO Benchmarks
Task: Put the yellow and white mug in the microwave and close it
Task: Put both the cream cheese box and the butter in the basket
Task: Turn on the stove and put the moka pot on it
Task: Put the black bowl in the bottom drawer of the cabinet and close it
Task: Pick up the book and place it in the left compartment of the caddy
Task: Pick up the black bowl on the wooden cabinet and place it on the plate
To evaluate the manipulation capabilities of EnerVerse, we conducted real-world experiments in which the robot placed magnet blocks into designated compartments of a foam worktable, a task demanding high accuracy due to the tight fit and the visual similarity between the foam and the table.
Challenges:
The robot must follow natural language instructions, such as "Row One, Column Two," to identify the required compartment.
The compartments are only slightly larger than the magnet blocks, transforming the pick-and-place task into a highly precise "insertion" operation.
The magnet blocks are relatively heavy, necessitating the robot gripper to grasp near the center of the block to ensure stability during manipulation.
Task: Take a black magnet from the table and place it on the white box, row 3 column 2
Task: Take a black magnet from the table and place it on the white box, row 2 column 1
Against changing lighting conditions
Against visual distractors
...