Imitation learning (IL) algorithms typically distill experience into parametric behavior policies that mimic expert demonstrations. However, with limited demonstrations, existing methods often struggle to generate accurate actions, particularly under partial observability. To address this problem, we introduce a few-shot IL approach, ReMoBot, which directly Retrieves information from demonstrations to solve Mobile manipulation tasks with ego-centric visual observations. Given the current observation, ReMoBot uses vision foundation models to identify relevant demonstrations, considering visual similarity with respect to both individual observations and trajectory histories. A motion generation policy then produces commands for the robot until the task is successfully completed. We evaluate ReMoBot on three mobile manipulation tasks with a Boston Dynamics Spot robot in both simulation and the real world. With only 20 demonstrations, ReMoBot outperforms baseline methods, achieving high success rates on the Table Uncover (70%) and Gap Cover (80%) tasks while showing promising performance on the more challenging Curtain Open task (45%). Moreover, ReMoBot generalizes to varying robot positions, object sizes, and material types.
In this work, we propose ReMoBot, a learning-free retrieval-based imitation method designed to efficiently solve mobile manipulation tasks with few expert demonstrations. The approach consists of three main steps: 1) retrieval dataset generation, which builds a dataset by extracting visual features from the demonstrations with a perception module based on a vision foundation model (VFM); 2) the retrieval process, in which the agent identifies similar expert observations and selects candidate trajectories based on the robot's executed trajectory; and 3) the behavior retrieval stage, in which the agent refines the retrieved behavior candidates to select an appropriate action for execution.
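To make this pipeline concrete, the following is a minimal sketch of the three steps under illustrative assumptions: it uses DINOv2 loaded via torch.hub as an example VFM, cosine similarity over L2-normalized features, and hypothetical helper names (embed, build_retrieval_dataset, trajectory_similarity, retrieve_action); it is not the actual ReMoBot implementation.

```python
import numpy as np
import torch

# Illustrative sketch only: the feature extractor, data layout, and function
# names are assumptions, not the exact ReMoBot implementation.

# 1) Retrieval dataset generation: embed every demonstration frame with a
#    vision foundation model (here DINOv2 via torch.hub, as an example VFM).
vfm = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def embed(frames: torch.Tensor) -> np.ndarray:
    """L2-normalized VFM features for a batch of RGB frames (B, 3, H, W)."""
    with torch.no_grad():
        feats = vfm(frames)  # (B, D) class-token features
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

def build_retrieval_dataset(demo_frames, demo_actions):
    """Each demonstration becomes {'features': (T, D), 'actions': (T, A)}."""
    return [{"features": embed(f), "actions": a}
            for f, a in zip(demo_frames, demo_actions)]

# 2) Retrieval process: match the robot's executed trajectory to each demo,
#    scoring both the current observation and the aligned history.
def trajectory_similarity(robot_hist: np.ndarray, demo_feats: np.ndarray):
    """robot_hist: (H, D) features of executed frames. Returns (score, t_match)."""
    frame_sims = demo_feats @ robot_hist[-1]              # cosine sims (features normalized)
    t_match = int(np.argmax(frame_sims))                  # most similar demo step
    k = min(len(robot_hist), t_match + 1)                 # length of aligned prefix
    aligned = demo_feats[t_match - k + 1 : t_match + 1]   # (k, D)
    score = float(np.mean(np.sum(aligned * robot_hist[-k:], axis=1)))
    return score, t_match

# 3) Behavior retrieval: refine the retrieved candidates into one command.
def retrieve_action(robot_hist: np.ndarray, dataset, top_k: int = 3) -> np.ndarray:
    scored = []
    for demo in dataset:
        score, t_match = trajectory_similarity(robot_hist, demo["features"])
        scored.append((score, demo["actions"][t_match]))
    scored.sort(key=lambda s: s[0], reverse=True)
    # Average the actions of the best-matching demo steps as the robot command.
    return np.mean([a for _, a in scored[:top_k]], axis=0)
```

At run time, the robot would repeatedly embed its latest camera frame, append it to its feature history, call retrieve_action, and execute the returned command until a task-completion condition is met.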
We designed three mobile manipulation tasks in real-world and simulation settings to evaluate the performance of ReMoBot. The details of each task are presented below.
Learning skills for complex tasks, such as mobile manipulation under partial observability, from a few demonstrations is a challenging problem. This work introduced ReMoBot, a few-shot imitation learning framework that leverages a retrieval strategy with visual similarity constraints to solve such tasks without additional training. ReMoBot combines a vision foundation model as a feature extractor with trajectory-aware action identification, enabling training-free imitation of expert demonstrations even under partial observability. To evaluate ReMoBot, we designed three real-world mobile manipulation tasks involving deformable fabrics with the Boston Dynamics Spot robot. Across all tasks, ReMoBot consistently outperformed both learning-based and retrieval-based baselines, effectively acquiring manipulation skills from a limited dataset. Furthermore, ReMoBot generalized to varying environmental conditions, including the robot's initial position, object size, and material. Moving forward, extending ReMoBot with explicit mechanisms for collision handling and incorporating online fine-tuning strategies could further improve its adaptability and safety during deployment while preserving data efficiency and generalizability.