Anonymous
Robotic manipulation remains a complex challenge, and imitation learning (IL) offers an effective way for robots to learn tasks from expert demonstrations. Current IL methods typically rely on fixed camera setups: either multi-camera systems, which may introduce redundant or noisy data, or single-camera systems, which suffer from limited viewpoints that constrain task performance. Inspired by human active perception, in which the viewpoint is dynamically adjusted to capture the most relevant and least noisy information, we propose MAE-Select, a novel framework for active viewpoint selection in single-camera robotic systems. MAE-Select fully leverages pre-trained multi-view masked autoencoder representations and dynamically selects the next most informative viewpoint at each time chunk, without requiring labeled viewpoints. This plug-and-play approach enhances learning efficiency and task performance. Extensive experiments demonstrate that MAE-Select improves the capabilities of single-camera systems and, in some cases, even surpasses multi-camera setups.
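To make the selection loop concrete, below is a minimal PyTorch-style sketch of how a single-camera system might choose its next viewpoint at each time chunk using a frozen pre-trained encoder. The module names (FrozenMAEEncoder, ViewSelector, ChunkPolicy), the camera-capture stub, and all dimensions are illustrative assumptions, not the actual MAE-Select implementation.

```python
# Minimal sketch of active viewpoint selection at inference time.
# All modules here are stand-ins: the real framework uses a pre-trained
# multi-view masked autoencoder encoder and an imitation-learning policy.
import torch
import torch.nn as nn

NUM_VIEWS, EMB_DIM, CHUNK, ACT_DIM = 4, 256, 10, 7  # assumed sizes

class FrozenMAEEncoder(nn.Module):
    """Stand-in for the frozen, pre-trained multi-view MAE encoder."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, EMB_DIM))
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, img):                 # img: (B, 3, 64, 64)
        return self.backbone(img)           # (B, EMB_DIM)

class ViewSelector(nn.Module):
    """Hypothetical head that predicts which viewpoint to use for the next chunk."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Linear(EMB_DIM, NUM_VIEWS)

    def forward(self, emb):
        return self.logits(emb)             # (B, NUM_VIEWS) viewpoint scores

class ChunkPolicy(nn.Module):
    """Hypothetical IL policy predicting a chunk of future actions."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(EMB_DIM, CHUNK * ACT_DIM)

    def forward(self, emb):
        return self.head(emb).view(-1, CHUNK, ACT_DIM)

encoder, selector, policy = FrozenMAEEncoder(), ViewSelector(), ChunkPolicy()

def capture(view_idx):
    """Stand-in for moving the single camera to `view_idx` and grabbing a frame."""
    return torch.rand(view_idx.shape[0], 3, 64, 64)

def step(current_image):
    emb = encoder(current_image)                   # embed the current view
    next_view = selector(emb).argmax(dim=-1)       # most informative next viewpoint
    actions = policy(encoder(capture(next_view)))  # action chunk from the new view
    return next_view, actions

view_idx, actions = step(torch.rand(1, 3, 64, 64))
print(view_idx.item(), actions.shape)              # e.g. 2, torch.Size([1, 10, 7])
```

Because the selector and policy only consume frozen encoder features, this style of loop is what allows the approach to remain plug-and-play on top of an existing single-camera IL policy.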
Illustration of our proposed method. Left depicts the pre-training stage of the multi-view masked autoencoder with joint embeddings. Middle illustrates the training process of our framework using imitation learning. Right demonstrates how the framework operates during inference.
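The pre-training stage in the left panel can be sketched as a masked autoencoder trained on patches gathered from several camera views, with learned per-view embeddings so that tokens from all views share one joint embedding space. The sketch below is a simplified, assumption-laden version: the patch size, masking ratio, model depths, and the reading of "joint embeddings" as added view embeddings are illustrative choices rather than the paper's exact design.

```python
# Simplified sketch of multi-view MAE pre-training. The per-view embedding,
# masking ratio, and model sizes are assumptions for illustration only.
import torch
import torch.nn as nn

V, P, D, IMG, MASK_RATIO = 4, 16, 256, 64, 0.75   # views, patch size, dim, image size
N_PATCH = (IMG // P) ** 2                          # patches per view

class MultiViewMAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, D, kernel_size=P, stride=P)
        self.view_embed = nn.Parameter(torch.zeros(V, 1, D))    # joint per-view embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, N_PATCH, D))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, D))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2)
        self.head = nn.Linear(D, P * P * 3)                     # per-patch pixel decoder

    def forward(self, imgs):                        # imgs: (B, V, 3, IMG, IMG)
        B = imgs.shape[0]
        x = self.patch_embed(imgs.flatten(0, 1))    # (B*V, D, IMG/P, IMG/P)
        x = x.flatten(2).transpose(1, 2) + self.pos_embed        # (B*V, N_PATCH, D)
        x = x.reshape(B, V, N_PATCH, D) + self.view_embed        # tag tokens by view
        x = x.reshape(B, V * N_PATCH, D)

        # MAE-style random masking across the pooled multi-view tokens.
        n_keep = int(V * N_PATCH * (1 - MASK_RATIO))
        ids_keep = torch.rand(B, V * N_PATCH).argsort(dim=1)[:, :n_keep]
        batch_idx = torch.arange(B).unsqueeze(1)
        latent = self.encoder(x[batch_idx, ids_keep])            # encode visible tokens

        # Fill masked positions with a learned mask token, then decode everything.
        full = self.mask_token.expand(B, V * N_PATCH, D).clone()
        full[batch_idx, ids_keep] = latent
        recon = self.head(self.decoder(full))                    # (B, V*N_PATCH, P*P*3)

        # Reconstruction target: raw pixels per patch; loss only on masked patches.
        target = imgs.flatten(0, 1).unfold(2, P, P).unfold(3, P, P)
        target = target.permute(0, 2, 3, 1, 4, 5).reshape(B, V * N_PATCH, -1)
        mask = torch.ones(B, V * N_PATCH)
        mask[batch_idx, ids_keep] = 0.0                          # 1 marks masked tokens
        return (((recon - target) ** 2).mean(-1) * mask).sum() / mask.sum()

model = MultiViewMAE()
loss = model(torch.rand(2, V, 3, IMG, IMG))
loss.backward()
print(loss.item())
```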
Bimanual Insertion
Put Box In Bin
Put Box In Bin with Disturbance
Put Box In Cabinet
Phone On Base
Pick Up Cup
Take Umbrella Out Of Stand
Unplug Charger
Put Eggplant To Bowl
Put Eggplant To Bowl with Disturbance
Put Bitter Melon In Cabinet
Real-world experiments show that MAE-Select is robust to camera perturbations, consistently maintaining reliable task performance even in challenging scenarios. Such perturbations can stem from diffusion processes that introduce noise or distortions during action acquisition.